Metadata: the key to navigating data lakes
How to create a metadata management layer to gain a deeper understanding of your data
In a recent blog post, we dove into the importance of using metadata to gain value from data lakes. With that established, the essential next step to using metadata effectively is implementing a metadata management layer.
A data lake without an effective metadata management layer is like a Ferrari without fuel. In order to build a successful data lake (aka a big data ecosystem), metadata should be used to drive other components such as security, ETL/ELT, and data discovery.
The question is how to fuel the Ferrari. No single technology is a silver bullet for all of an organization’s metadata management needs, especially in the big data ecosystem. The big data landscape has evolved and grown over the last decade, and market leaders such as Cloudera, Hortonworks, and MapR provide data management capabilities to their customers. But they haven’t succeeded in providing a holistic solution that can serve as the building block for a metadata management framework.
"A data lake without an effective metadata management layer is like a Ferrari without fuel."
Clients are demanding an all-in-one solution that not only understands the schema-less big data storage but also effortlessly transforms it into an on-demand schema. In other words, customers need a “Siri” that can easily answer all the hard questions about data.
Instead of relying only on a technology-driven or tool-driven initiative, organizations should embrace process- and people-driven initiatives. One such initiative is building a metadata-as-a-service model. This means building a big data ecosystem that not only serves the data needs of multiple business functions like marketing, finance, and compliance, but also makes it easy for them to navigate the complex labyrinth of the big data ecosystem.
To build a metadata-as-a-service model, follow these steps:
- Establish metadata management standards in the big data distributed storage layer.
- Design a metadata storage layer decoupled from the data storage layer in the distributed file system. The metadata storage layer will store the business glossary of all metrics and dimensional/referential data along with all technical and operational metadata and detailed data lineage.
- Design a fully automated process that scans every raw data file ingested and written to the file system. This process will be used to write to the metadata storage layer. The scans should be tightly coupled with the ingestion process and should be file-format agnostic.
- Create an open RESTful API service that allows easy access to the metadata repository. Design the API with the varying technical skill levels of end users in mind. It should allow users not only to navigate and search the metadata repository but also to write back to it.
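To make the last three steps more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reference implementation: the table `file_metadata` and the functions `open_repository`, `scan_file`, `annotate`, and `search` are hypothetical names, SQLite stands in for whatever metadata store you choose, and local files stand in for the distributed file system. The scan reads raw bytes only, so it does not depend on the file format.

```python
import hashlib
import os
import sqlite3
from datetime import datetime, timezone


def open_repository(db_path=":memory:"):
    """Open the (hypothetical) metadata store, kept separate from the data files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS file_metadata (
               path TEXT PRIMARY KEY,
               size_bytes INTEGER,
               sha256 TEXT,
               extension TEXT,
               scanned_at TEXT,
               business_term TEXT
           )"""
    )
    return conn


def scan_file(conn, path):
    """Record technical metadata for one ingested file without parsing its format."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    # Insert the row if it is new, then refresh the technical fields.
    # A business term written back by a user survives a rescan.
    conn.execute("INSERT OR IGNORE INTO file_metadata (path) VALUES (?)", (path,))
    conn.execute(
        "UPDATE file_metadata SET size_bytes = ?, sha256 = ?, extension = ?, "
        "scanned_at = ? WHERE path = ?",
        (
            os.path.getsize(path),
            digest.hexdigest(),
            os.path.splitext(path)[1].lower(),
            datetime.now(timezone.utc).isoformat(),
            path,
        ),
    )
    conn.commit()


def annotate(conn, path, business_term):
    """Write-back: let a user attach a business glossary term to a file."""
    conn.execute(
        "UPDATE file_metadata SET business_term = ? WHERE path = ?",
        (business_term, path),
    )
    conn.commit()


def search(conn, text):
    """Find files whose path or business term contains the given text."""
    pattern = f"%{text}%"
    return conn.execute(
        "SELECT path, size_bytes, extension, business_term FROM file_metadata "
        "WHERE path LIKE ? OR business_term LIKE ? ORDER BY path",
        (pattern, pattern),
    ).fetchall()
```

In this sketch the ingestion job would call `scan_file` for every file it lands, and a RESTful service would expose `search` and `annotate` as read and write endpoints. Lineage and operational metadata would need further tables that are left out here.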
"Instead of only relying on a technology-driven or tool-driven initiative, organizations should embrace processes and people-driven initiatives."
To implement this process efficiently, evaluate candidate solutions on their technical fit as well as on vendor execution and vision, service, support, and total cost of ownership. Building a metadata management strategy and framework requires iterative planning and execution, and to justify your investment you also need a mechanism to gauge progress. An agile methodology lets you build the big data storage, ETL/ELT, and processing frameworks iteratively, together with the metadata management layer. The best way to engage business users is to involve them in the metadata management user stories in every sprint. In practice, they would define semantic relationships and rules between data elements and share them within the metadata repository, where inference engines and query systems can then use them. Every scrum team within the big data program should dedicate resources to the metadata management framework.
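As a rough illustration of what "semantic relationships and rules" could look like once they are shared in the repository, the sketch below stores relationships as subject-predicate-object triples and applies one simple rule, transitivity, the way an inference engine might. The class `RelationshipStore`, its methods, and the data element names are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


class RelationshipStore:
    """Hypothetical in-memory store of relationships between data elements."""

    def __init__(self):
        # Maps (subject, predicate) to the set of related objects.
        self._edges = defaultdict(set)

    def relate(self, subject, predicate, obj):
        """Record a relationship contributed by a business user."""
        self._edges[(subject, predicate)].add(obj)

    def related(self, subject, predicate):
        """Return only the directly recorded relationships."""
        return set(self._edges.get((subject, predicate), set()))

    def infer_transitive(self, subject, predicate):
        """Follow the predicate repeatedly to find indirect relationships."""
        seen = set()
        stack = [subject]
        while stack:
            current = stack.pop()
            for nxt in self._edges.get((current, predicate), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


store = RelationshipStore()
store.relate("net_revenue", "derived_from", "gross_revenue")
store.relate("gross_revenue", "derived_from", "orders.amount")
upstream = store.infer_transitive("net_revenue", "derived_from")
```

Here `upstream` contains both `gross_revenue` and `orders.amount`, even though nobody recorded the second link for `net_revenue` directly. A query system could use the same traversal to answer questions such as which raw fields feed a given metric.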
Some possible tools to use are:
- Cloudera Navigator + Informatica Metadata Management
- Hortonworks + Revelytix’s Loom Technology
Another option is to build a custom metadata management layer using Apache Atlas and Apache Falcon. Apache Atlas is a shared metadata and governance framework that sheds light on what data exists within Hadoop and how it is used. It’s designed to exchange metadata with other tools and processes within and outside of the Hadoop stack. Apache Falcon is a separate Apache project, one that Atlas can integrate with, for managing the data lifecycle within the Hadoop framework. With a suitable connector, Apache Atlas can talk with more user-friendly, UI-rich data governance and metadata management platforms like IBM’s InfoSphere Information Governance Catalog for better control and visualization of data lineage, metadata, and business glossary.
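For a sense of what exchanging metadata with Atlas can look like, the sketch below builds (but does not send) a request that registers an HDFS directory through Atlas’s v2 REST interface. Treat the details as assumptions to verify against the documentation for your Atlas version: the function name `build_hdfs_path_request` is hypothetical, the `path@cluster` form of `qualifiedName` is just one possible naming convention, the host and cluster names are placeholders, and authentication is omitted entirely.

```python
import json
import urllib.request


def build_hdfs_path_request(atlas_url, path, cluster_name):
    """Build a POST request describing one HDFS path as an Atlas entity."""
    payload = {
        "entity": {
            "typeName": "hdfs_path",
            "attributes": {
                # Assumed naming convention; use whatever your deployment expects.
                "qualifiedName": f"{path}@{cluster_name}",
                "name": path.rstrip("/").rsplit("/", 1)[-1],
                "path": path,
                "clusterName": cluster_name,
            },
        }
    }
    return urllib.request.Request(
        url=atlas_url.rstrip("/") + "/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and cluster; nothing is sent over the network here.
request = build_hdfs_path_request("http://atlas.example.com:21000", "/data/raw/sales/", "prod")
```

Sending the request, for example with `urllib.request.urlopen(request)`, would additionally require credentials and a reachable Atlas server.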
Implementing a metadata management layer will allow you to take the vast amount of raw data in your data lake and turn it into valuable, company-wide insights.
Soumen Chakraborty is a data management consultant for Slalom New York’s Information Management and Strategy practice. His specializations include master data management, data architecture & modeling, ETL design & development, big data architecture, and designing cloud-based solutions. For over nine years, he has helped clients use their data to drive better decision-making.
Sayantan Maiti is a data architect in Slalom’s Information Management and Analytics practice, based out of New York. His areas of interest include data strategy, big data architecture, and cloud computing.