Metadata: the key to navigating data lakes
Metadata answers your hard questions about data—who, what, when, where, why, and how
During a recent conversation with a client about data lakes, the discussion veered toward governance. Our client wanted to take the leap from a traditional, structured environment to a big data ecosystem but didn’t want it to turn into a Wild West scenario where a large amount of raw data is made available to the company without effective controls and data management practices.
This is a common topic among organizations interested in data lakes. A lot of big data presentations talk about the 4 “Vs”—volume, velocity, variety, and veracity. But a big “V” that usually gets ignored is value. How can we get actual business value from data lakes?
Metadata: the modern-day card catalog
Let’s use an analogy. Remember those card catalogs in the library? Each card displayed a book’s title (what), author (who), date published (when), plot synopsis (why), and Dewey decimal number (where). The cards weren’t physically linked in any way to the book they described, but they provided browsers with information to understand and locate the book. Now imagine a library with all the books you need, but without any catalog cards. Every time you need a book, you have no information to help you, and you have to blindly comb through the shelves until you find it. What a nightmare.
You can think of data lakes as a library of data, and to derive any meaningful information, you need something similar to the card catalog: metadata. In the world of data management, metadata answers the hard questions about data—who, what, when, where, why, and how.
Data lake consumers range from data analysts to data scientists, and data lakes can support multiple use cases like dashboards, data mining, predictive analytics, and machine learning algorithms. Consistent business definitions of metrics and dimensions stored in a data lake can drive robust cross-functional analytics.
Developing a strong metadata solution
For example, if an organization decides to forecast sales, they need a 360-degree view. That means they have to gather input from various business functions that help drive sales or are affected by sales—such as marketing, finance, and supply chain. Demand, campaigns, and target demographics are critical variables that affect these predictive models.
There can also be multiple sales types within an organization, like net sales, adjusted sales, and promotional sales. On top of that, a large number of KPIs across the organization may feed different reports and analyses, each meant to reflect the correct sales status. The data can come from a variety of silos, lines of business, and external vendors, and it can mean different things to different people.
To use the data effectively and let self-service users navigate and reuse existing data, every data ecosystem needs a consistent definition of each data attribute. Companies should create a comprehensive data catalog, share it with all data producers and consumers, and ask for their continuous feedback.
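To make the idea of one consistent definition per attribute concrete, here is a minimal Python sketch of an in-memory catalog that accepts a definition once and rejects a conflicting redefinition. The class names, fields, and sample definitions are hypothetical and chosen only for illustration; a real catalog tool would look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One agreed definition for a data attribute (hypothetical fields)."""
    name: str
    definition: str
    owner: str  # business function responsible for the definition


class DataCatalog:
    """In-memory catalog that allows exactly one definition per attribute."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Attribute names are compared case-insensitively.
        key = entry.name.lower()
        existing = self._entries.get(key)
        if existing is not None and existing != entry:
            raise ValueError(f"conflicting definition for {entry.name!r}")
        self._entries[key] = entry

    def lookup(self, name):
        return self._entries[name.lower()]


# Illustrative entries only; the wording of each definition is made up.
catalog = DataCatalog()
catalog.register(CatalogEntry("net_sales", "Gross sales minus returns and discounts", "finance"))
catalog.register(CatalogEntry("promotional_sales", "Sales attributed to an active campaign", "marketing"))
```

Rejecting a conflicting registration, rather than silently overwriting it, is one way to force producers and consumers to settle on a shared definition before the data is reused.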
A good metadata solution supports data taxonomy, which can help drive information security and access controls in big data ecosystems. Taxonomy is about “semantic architecture”: naming things and making decisions about how to map different concepts and terms to a consistent structure.
The biggest challenge in a big data ecosystem is that the same term can have different meanings. It’s also very difficult to get complete agreement on which terms to use and how to define them. Introducing data taxonomies in metadata management can help map ambiguous terms together to account for these inconsistencies. Taxonomies can also represent related concepts that can be used to connect processes, business logic, or dynamic/related content to support specific tasks.
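One way to picture this mapping is a lookup keyed by who is using a term. The Python sketch below resolves a locally used term to a canonical concept based on the business function that uses it. The functions, terms, and canonical names are hypothetical examples, not a recommended taxonomy.

```python
# Hypothetical taxonomy: (business function, local term) -> canonical concept.
# The same local term ("sales") maps to different concepts per function.
TAXONOMY = {
    ("finance", "sales"): "net_sales",
    ("marketing", "sales"): "promotional_sales",
    ("finance", "net revenue"): "net_sales",
}


def resolve(function, term):
    """Return the canonical concept for a term as used by one function, or None."""
    key = (function.strip().lower(), term.strip().lower())
    return TAXONOMY.get(key)


print(resolve("Finance", "Sales"))       # finance's "sales" means net_sales
print(resolve("marketing", "sales"))     # marketing's "sales" means promotional_sales
print(resolve("supply chain", "sales"))  # unmapped, so None
```

Returning None for an unmapped pair makes gaps in the taxonomy visible instead of guessing at a meaning.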
Metadata is classified into the following two categories:
- Business metadata includes business rules, definitions of data files, and attributes in business terms. It doesn’t include any information on how, when, where, or by whom data is stored in the platform. It can answer questions like, “What kind of revenue is being reported? What is meant by revenue? What calculations went into the determination of revenue?”
- Technical metadata can be split three ways. Structural metadata describes the containers of data (where); descriptive metadata describes data content (what); administrative metadata describes data management (who, why, how, and when).
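As a rough illustration of the two categories, the Python sketch below models a metadata record with a business part and a technical part split into structural, descriptive, and administrative sections. All class names, field names, and sample values are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class BusinessMetadata:
    definition: str   # what the term means in business language
    calculation: str  # how the figure is derived


@dataclass
class TechnicalMetadata:
    structural: dict      # where: the containers of the data
    descriptive: dict     # what: the content of the data
    administrative: dict  # who, why, how, and when: data management


@dataclass
class DatasetMetadata:
    name: str
    business: BusinessMetadata
    technical: TechnicalMetadata


# Each question word points at the technical section that answers it.
QUESTION_TO_CATEGORY = {
    "where": "structural",
    "what": "descriptive",
    "who": "administrative",
    "why": "administrative",
    "how": "administrative",
    "when": "administrative",
}


def technical_answer(meta, question):
    """Return the technical metadata section that answers a question word."""
    category = QUESTION_TO_CATEGORY[question.lower()]
    return getattr(meta.technical, category)


# Placeholder values for illustration only.
revenue = DatasetMetadata(
    name="revenue",
    business=BusinessMetadata(
        definition="Income recognized from customer orders",
        calculation="sum of order amounts minus refunds",
    ),
    technical=TechnicalMetadata(
        structural={"path": "/lake/finance/revenue", "format": "parquet"},
        descriptive={"columns": ["order_id", "amount"]},
        administrative={"owner": "finance", "loaded_on": "2024-01-01"},
    ),
)
```

Keeping the business part separate from the technical part mirrors the split described above: an analyst can read the definition and calculation without needing to know where or how the data is stored.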
Now that we’ve walked through the importance of using metadata to navigate data lakes, check out our follow-up blog post on the importance of implementing an efficient metadata management layer to gain a deeper, comprehensive understanding of your data.
Soumen Chakraborty is a data architect for Slalom New York’s Information Management and Strategy practice. His specializations include master data management, data architecture & modeling, ETL design & development, big data architecture, and designing cloud-based solutions. For over nine years, he has helped clients use their data to drive better decision-making. You can connect with Soumen here.
Sayantan Maiti is a data architect in Slalom’s Information Management and Analytics practice, based out of New York. His areas of interest include data strategy, big data architecture, and cloud computing. You can connect with Sayantan here.