
The 3 Must-Haves of Metadata Management — Part 1

Metadata Management

Data Engineering

Analytics

Open Source

Maggie Hays

Oct 4, 2022


After some deep reflection, I’ve begun looking at my work with the DataHub Community as atoning for sins in my past roles in the data world. Let me explain.

In my earlier days in data analytics and BI/analytics engineering, I was guilty of my share of sins that made data management hard. I rapidly accumulated tech debt for my team by building one-off resources to answer specific questions for stakeholders, instead of taking the time to methodically understand the breadth of questions at hand and build well-designed, well-documented, reusable components.

The beginning of atonement is the sense of its necessity.
Lord Byron

This made it difficult for stakeholders and teammates alike to make sense of what I had built (while setting a sloppy example along the way), leading to a mass-proliferation of ad-hoc SQL queries that yielded conflicting results that were a far cry from being “insightful”.

It didn’t take long for me to figure out that my ways of data management and discovery weren’t quite working out.

Not working!

And I’m sure this realization isn’t unique to me — the modern tech stack and its evolution make it easier (and cheaper!) than ever to rapidly produce one-time-use data resources, so it’s incredibly easy to prioritize speed of delivery over reusability. Ultimately, this poses huge challenges to data usability and management along the way.

To illustrate this, I will use the example of Long Tail Companions, a fictional pet adoption company that matches pets to humans.

P.S.: Long Tail Companions is only partly fictional — its story is compiled from hundreds of conversations I’ve had with DataHub community members over the past year!

Long Tail Companions and its Journey

Long Tail Companions set out to match every pet with a caring human.

Here’s what their growth — and data — journey looked like:

Setting up the Adoption Team

They spun up an Adoptions team to build out an adoptions application.
This team stored their transactional data in Postgres and their application/document data in MongoDB.

Spinning off the E-Commerce (+ Data Science) Team

When Long Tails discovered that they needed to ensure every pet had a leash, a food bowl, treats, and more, they set up an E-Commerce team to supplement their adoption program.

The E-Commerce platform used Postgres for the storage of its transactional records. But the team also needed to understand how users were interacting with their website/app so they could optimize the shopping experience and recommendation engine to drive conversions.

And thus, Kafka was introduced as their event streaming platform.

Adding the Data Platform Team

To leverage their data meaningfully, they next rolled out a Data Platform Team, who used Airflow to sync all their data into S3, their data lake, and transformed it with Spark before ultimately landing it in their centralized data warehouse, Snowflake.

Setting up the Analytics Engineering Team and their tools

Lastly, they introduced an Analytics Engineering Team to adopt all modern best practices of dbt for transformations, Great Expectations for data validation, and Looker for the presentation layer.

Long Tail Companions' Fragmented Data Stack

The Problem with Long Tail Companions Data Stack

Each team made the best decisions for its own use cases. But in aggregate, Long Tail Companions ended up with a fragmented ecosystem that was hard to navigate, use, and troubleshoot.

Questions. Questions. And more Questions.

At the end of the day, Long Tail Companions still had too many visibility issues, blind spots, and as a result, unanswered questions.

Too many data solutions, but no answers.

Metadata Management: Getting to the Answers

The only approach that can help Long Tail Companions, and the many organizations like it, get the most out of their data stack is metadata management.

And while a data catalog is a great vehicle for metadata management, most typical data catalogs fall short.

Why typical data catalogs fall short on metadata management

Most data catalogs

  • focus on physical metadata, alienating sets of users
  • end up housing stale metadata
  • fail to act on changing metadata
  • rely solely on manual enrichment of metadata
  • focus disproportionately on the data warehouse,

…missing the full story and the big picture.

And that brings us to #1 of our 3 must-haves for metadata management:

The #1 Must-Have for Sustainable Metadata Management: Metadata 360

Metadata 360 is a holistic view of your data stack that tells a cohesive story.

This means that you should be able to see what’s going on across systems and the business. It’s all about understanding how data is produced, transformed, stored, and used — and the operational metrics that go with each of these stages.

What does this look like in action?

Stitching together both the technical and the logical metadata, regardless of whether the entity is a dataset, an ML model, or a task.

Another huge issue with conventional data catalogs is the disproportionate focus on the warehouse. We know that data sets don’t live and die in the data warehouse — they begin their life in your production systems, run through your streaming data, to the warehouse, and onward to your third-party systems. By focusing all your attention on the data warehouse, you simply don’t get the full picture.

Challenges that Metadata 360 solves

How DataHub Implements Metadata 360

To put it as simply as I can, DataHub is a tool for you to map out and understand the entirety of your data platform — regardless of the tools you’re using and the stage of the data journey you’re in. It offers a centralized, unified search experience to navigate your entire data stack — and slice and dice it as you need. From an ML feature table to an Airflow task to a Looker dashboard, it can bubble up all this information in one spot.

In addition, DataHub ensures that business metadata (traditionally represented using glossary terms and taxonomies) connects to the physical metadata (tables and columns and operational metadata) emitted by the tools in your data stack.
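To make the stitching idea concrete, here is a minimal sketch in plain Python. These dataclasses are simplified stand-ins of my own — not DataHub’s actual metadata model — but they show how a business glossary term can be linked to the physical column that implements it, so a single lookup surfaces both:

```python
from dataclasses import dataclass, field

# Simplified, illustrative structures -- NOT DataHub's real entity model.
@dataclass
class GlossaryTerm:
    name: str
    definition: str  # plain-language definition for business users

@dataclass
class Column:
    name: str
    data_type: str
    glossary_terms: list = field(default_factory=list)  # links to business metadata

@dataclass
class Dataset:
    platform: str  # e.g. "snowflake", "kafka", "postgres"
    name: str
    columns: list = field(default_factory=list)

# Business-side definition...
return_rate = GlossaryTerm(
    name="Return Rate",
    definition="Share of adopted pets returned within 30 days of adoption.",
)

# ...linked to the physical column that implements it.
adoptions = Dataset(
    platform="snowflake",
    name="analytics.adoption_metrics",
    columns=[Column("return_rate", "FLOAT", glossary_terms=[return_rate])],
)

def columns_for_term(datasets, term_name):
    """Find every physical column tagged with a given business term."""
    return [
        (ds.name, col.name)
        for ds in datasets
        for col in ds.columns
        if any(t.name == term_name for t in col.glossary_terms)
    ]

print(columns_for_term([adoptions], "Return Rate"))
# → [('analytics.adoption_metrics', 'return_rate')]
```

Once both sides are connected, a search for “Return Rate” can return the plain-language definition and the physical columns in one hop.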

Let’s see exactly how this Metadata 360 aspect plays out within DataHub.

Search across your entire data stack

DataHub makes it easy to search for data entities across systems and platforms, giving you a one-stop shop to find relevant resources.

Data hub demo

View operational details and understand how they relate to business definitions

DataHub seamlessly maps datasets or data entities to the business. For example, when you view a glossary term, you don’t just see the code associated with it, but also a plain-language definition of the metric.

Datahub Return Rate

This means you don’t have to be an SQL expert to interpret it correctly — and that context reaches a wider audience beyond just the data folks.

Understand the end-to-end journey of data

View the end-to-end lineage of data, so you understand how data is generated, how it is transformed, and how it passes through various touch points.

In fact, elements like Airflow pipelines and tasks show up in the end-to-end lineage view.
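The lineage view above can be thought of as a directed graph and a traversal over it. Here is a hedged sketch using Long Tail Companions’ stack — the node names and dict representation are my own illustration, not DataHub’s real storage model:

```python
from collections import deque

# End-to-end lineage as a simple directed graph: an edge means
# "flows downstream to". Node names are illustrative only.
lineage = {
    "postgres.adoptions": ["s3.raw_adoptions"],
    "kafka.web_events": ["s3.raw_events"],
    "s3.raw_adoptions": ["snowflake.adoptions"],
    "s3.raw_events": ["snowflake.web_events"],
    "snowflake.adoptions": ["looker.adoption_dashboard"],
    "snowflake.web_events": ["looker.adoption_dashboard"],
}

def downstream(graph, start):
    """BFS over the lineage graph: everything affected by a change to `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# A change in the Postgres source ripples all the way to the Looker dashboard.
print(downstream(lineage, "postgres.adoptions"))
```

This is exactly why end-to-end lineage matters: without the graph, the blast radius of an upstream change is invisible.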

Airflow pipelines

View validation outcomes

With Great Expectations and additional data validation plugins, you can see validation outcomes — what was expected, what happened, and what the actual results were. This helps bolster trust in, and the usability of, your data.
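To show the shape of a validation outcome a catalog can surface, here is a minimal stdlib sketch in the spirit of what Great Expectations emits. The function and field names are illustrative assumptions, not the actual Great Expectations API or result schema:

```python
# Illustrative only: field names are assumptions, not Great Expectations' schema.
def expect_values_between(rows, column, min_value, max_value):
    """Check that every value in `column` falls within [min_value, max_value]."""
    failures = [r for r in rows if not (min_value <= r[column] <= max_value)]
    return {
        "expectation": f"{column} between {min_value} and {max_value}",
        "success": not failures,
        "checked": len(rows),
        "unexpected": failures,  # the "actual outcomes" a catalog can display
    }

# A return rate should be a fraction between 0 and 1; 1.7 is a data quality bug.
rows = [{"return_rate": 0.02}, {"return_rate": 0.04}, {"return_rate": 1.7}]
result = expect_values_between(rows, "return_rate", 0.0, 1.0)
print(result["success"], result["unexpected"])
# → False [{'return_rate': 1.7}]
```

Surfacing not just pass/fail but the offending values is what turns a validation run into something a downstream consumer can actually act on.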

validation outcomes

What Metadata 360 means for business and technical teams

Metadata 360

Let’s go back to Long Tail Companions and see how DataHub can help different sets of users. Let’s suppose you are:

  • A Data Analytics Engineer looking for the authoritative pet dataset:
    All you need to do is enter the unified search experience and type what you’re looking for. You can filter by type (dataset, chart, or dashboard) or search by domain, drilling right down to a table’s column-level schema definition.
    Here, you can view recent queries, top queries, owners, and usage statistics over time to contextualize how this is being used within the organization.
  • A Business Manager looking to understand how Return Rate for a pet is calculated:
    Just search for Return Rate in DataHub for a plain-language definition of it, and also to understand who owns it so you can reach out for further assistance.

And just like that, Long Tail Companions has a much better grip on its tail… er, I mean its data.

Dog chasing tail

Stay tuned for more…

That’s it from me for now. But I’ll be back soon with the next post on the must-haves of metadata management!

3 data management must haves

Connect with DataHub

Join us on Slack · Sign up for our Newsletter · Follow us on Twitter

