
The 3 Must-Haves of Metadata Management — Part 1

Metadata Management

Data Engineering

Analytics

Open Source

Maggie Hays

Oct 4, 2022


After some deep reflection, I’ve begun looking at my work with the DataHub Community as atoning for sins in my past roles in the data world. Let me explain.

In my earlier days in data analytics and BI/analytics engineering, I was guilty of my share of sins that made data management hard. I rapidly accumulated tech debt for my team by building one-off resources to answer specific questions for stakeholders, instead of taking the time to methodically understand the breadth of questions at hand and build well-designed, well-documented, reusable components.

The beginning of atonement is the sense of its necessity.
Lord Byron

This made it difficult for stakeholders and teammates alike to make sense of what I had built (while setting a sloppy example along the way), leading to a mass-proliferation of ad-hoc SQL queries that yielded conflicting results that were a far cry from being “insightful”.

It didn’t take long for me to figure out that my ways of data management and discovery weren’t quite working out.

Not working!

And I’m sure this realization isn’t unique to me — the modern tech stack and its evolution make it easier (and cheaper!) than ever to rapidly produce one-time-use data resources, so it’s incredibly easy to prioritize speed of delivery over reusability. Ultimately, this poses huge challenges to data usability and management along the way.

To illustrate this, I will use the example of Long Tail Companions, a fictional pet adoption company that matches pets to humans.

P.S.: Long Tail Companions is only partly fictional — its story is compiled from hundreds of conversations I’ve had with DataHub community members over the past year!

Long Tail Companions and its Journey

Long Tail Companions set out to match every pet with a caring human.

Here’s what their growth — and data — journey looked like:

Setting up the Adoption Team

They spun up an Adoptions team to build out an adoptions application.
This team stored their transactional data in Postgres and their application/document data in MongoDB.

Spinning off the E-Commerce (+ Data Science) Team

When Long Tails discovered that they needed to ensure every pet had a leash, a food bowl, treats, and more, they set up an E-Commerce team to supplement their adoption program.

The E-Commerce platform used Postgres for the storage of its transactional records. But the team also needed to understand how users were interacting with their website/app so they could optimize the shopping experience and recommendation engine to drive conversions.

And thus, Kafka was introduced as their event streaming platform.

Adding the Data Platform Team

To leverage their data meaningfully, they next rolled out a Data Platform Team, who used Airflow to sync all their data into S3, their data lake, and transformed it with Spark before ultimately landing it in their centralized data warehouse, Snowflake.

Setting up the Analytics Engineering Team and their tools

Lastly, they introduced an Analytics Engineering Team to adopt all modern best practices of dbt for transformations, Great Expectations for data validation, and Looker for the presentation layer.

Long Tail Companions' Fragmented Data Stack

The Problem with Long Tail Companions Data Stack

Each team made the best decisions for its own use cases. But in aggregate, Long Tail Companions ended up with a fragmented ecosystem that was hard to navigate, use, and troubleshoot.

Questions. Questions. And more Questions.

At the end of the day, Long Tail Companions still had too many visibility issues, blind spots, and as a result, unanswered questions.

Too many data solutions, but no answers.

Metadata Management: Getting to the Answers

The only approach that can help Long Tail Companions, and the many organizations like it, get the most out of their data stack is metadata management.

And while a data catalog is a great vehicle for metadata management, most typical data catalogs fall short.

Why typical data catalogs fall short on metadata management

Most data catalogs

  • focus on physical metadata, alienating sets of users
  • end up housing stale metadata
  • fail to act on changing metadata
  • rely solely on manual enrichment of metadata
  • focus disproportionately on the data warehouse,

…missing the full story and the big picture.

And that brings us to #1 of our 3 must-haves for metadata management:

The #1 Must-Have for Sustainable Metadata Management: Metadata 360

Metadata 360 is a holistic view of your data stack that tells a cohesive story.

This means that you should be able to see what’s going on across systems and the business. It’s all about understanding how data is produced, transformed, stored, and used — and the operational metrics that go with each of these stages.

What does this look like in action?

Stitching together both the technical and the logical metadata, regardless of whether the entity is a dataset, an ML model, or a task.

Another huge issue with conventional data catalogs is the disproportionate focus on the warehouse. We know that data sets don’t live and die in the data warehouse — they begin their life in your production systems, run through your streaming data, to the warehouse, and onward to your third-party systems. By focusing all your attention on the data warehouse, you simply don’t get the full picture.

Challenges that Metadata 360 solves

How DataHub Implements Metadata 360

To put it as simply as I can, DataHub is a tool for you to map out and understand the entirety of your data platform — regardless of the tools you’re using and the stage of the data journey you’re in. It offers a centralized, unified search experience to navigate your entire data stack — and slice and dice it as you need. From an ML feature table to an Airflow task to a Looker dashboard, it can bubble up all this information in one spot.

In addition, DataHub ensures that business metadata (traditionally represented using glossary terms and taxonomies) connects to the physical metadata (tables and columns and operational metadata) emitted by the tools in your data stack.
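To make the stitching idea concrete, here is a minimal sketch in plain Python. These dataclasses are simplified stand-ins of my own — not DataHub’s actual metadata model — but they show how a business glossary term can be linked to the physical column that implements it, so a single lookup surfaces both:

```python
from dataclasses import dataclass, field

# Simplified, illustrative structures -- NOT DataHub's real entity model.
@dataclass
class GlossaryTerm:
    name: str
    definition: str  # plain-language definition for business users

@dataclass
class Column:
    name: str
    data_type: str
    glossary_terms: list = field(default_factory=list)  # links to business metadata

@dataclass
class Dataset:
    platform: str  # e.g. "snowflake", "kafka", "postgres"
    name: str
    columns: list = field(default_factory=list)

# Business-side definition...
return_rate = GlossaryTerm(
    name="Return Rate",
    definition="Share of adopted pets returned within 30 days of adoption.",
)

# ...linked to the physical column that implements it.
adoptions = Dataset(
    platform="snowflake",
    name="analytics.adoption_metrics",
    columns=[Column("return_rate", "FLOAT", glossary_terms=[return_rate])],
)

def columns_for_term(datasets, term_name):
    """Find every physical column tagged with a given business term."""
    return [
        (ds.name, col.name)
        for ds in datasets
        for col in ds.columns
        if any(t.name == term_name for t in col.glossary_terms)
    ]

print(columns_for_term([adoptions], "Return Rate"))
# → [('analytics.adoption_metrics', 'return_rate')]
```

Once both sides are connected, a search for “Return Rate” can return the plain-language definition and the physical columns in one hop.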

Let’s see exactly how this Metadata 360 aspect plays out within DataHub.

Search across your entire data stack

DataHub makes it easy to search for data entities across systems and platforms, giving you a one-stop shop to find relevant resources.

Data hub demo

View operational details and understand how they relate to business definitions

DataHub seamlessly maps datasets or data entities to the business. For example, when you view a glossary term, you don’t just see the code associated with it, but also a plain-language definition of the metric.

Datahub Return Rate

This means you don’t have to be an SQL expert to interpret it correctly — and that context reaches a wider audience beyond just the data folks.

Understand the end-to-end journey of data

View the end-to-end lineage of data, so you understand how data is generated, how it is transformed, and how it passes through various touch points.

In fact, elements like Airflow pipelines and tasks show up in the end-to-end lineage view.
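The lineage view above can be thought of as a directed graph and a traversal over it. Here is a hedged sketch using Long Tail Companions’ stack — the node names and dict representation are my own illustration, not DataHub’s real storage model:

```python
from collections import deque

# End-to-end lineage as a simple directed graph: an edge means
# "flows downstream to". Node names are illustrative only.
lineage = {
    "postgres.adoptions": ["s3.raw_adoptions"],
    "kafka.web_events": ["s3.raw_events"],
    "s3.raw_adoptions": ["snowflake.adoptions"],
    "s3.raw_events": ["snowflake.web_events"],
    "snowflake.adoptions": ["looker.adoption_dashboard"],
    "snowflake.web_events": ["looker.adoption_dashboard"],
}

def downstream(graph, start):
    """BFS over the lineage graph: everything affected by a change to `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# A change in the Postgres source ripples all the way to the Looker dashboard.
print(downstream(lineage, "postgres.adoptions"))
```

This is exactly why end-to-end lineage matters: without the graph, the blast radius of an upstream change is invisible.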

Airflow pipelines

View validation outcomes

With Great Expectations and additional data validation plugins, you can see validation outcomes — what was expected, what happened, and what the actual results were. This helps bolster trust in, and the usability of, your data.
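To show the shape of a validation outcome a catalog can surface, here is a minimal stdlib sketch in the spirit of what Great Expectations emits. The function and field names are illustrative assumptions, not the actual Great Expectations API or result schema:

```python
# Illustrative only: field names are assumptions, not Great Expectations' schema.
def expect_values_between(rows, column, min_value, max_value):
    """Check that every value in `column` falls within [min_value, max_value]."""
    failures = [r for r in rows if not (min_value <= r[column] <= max_value)]
    return {
        "expectation": f"{column} between {min_value} and {max_value}",
        "success": not failures,
        "checked": len(rows),
        "unexpected": failures,  # the "actual outcomes" a catalog can display
    }

# A return rate should be a fraction between 0 and 1; 1.7 is a data quality bug.
rows = [{"return_rate": 0.02}, {"return_rate": 0.04}, {"return_rate": 1.7}]
result = expect_values_between(rows, "return_rate", 0.0, 1.0)
print(result["success"], result["unexpected"])
# → False [{'return_rate': 1.7}]
```

Surfacing not just pass/fail but the offending values is what turns a validation run into something a downstream consumer can actually act on.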

validation outcomes

What Metadata 360 means for business and technical teams

Metadata 360

Let’s go back to Long Tail Companions and see how DataHub can help different sets of users. Let’s suppose you are:

  • A Data Analytics Engineer looking for the authoritative pet dataset:
    All you need to do is enter the unified search experience and type what you’re looking for. You can filter by type (dataset, chart, or dashboard) or search by domain, drilling right down to a table’s column-level schema definition.
    Here, you can view recent queries, top queries, owners, and usage statistics over time to contextualize how this is being used within the organization.
  • A Business Manager looking to understand how Return Rate for a pet is calculated:
    Just search for Return Rate in DataHub for a plain-language definition of it, and also to understand who owns it so you can reach out for further assistance.

And just like that, Long Tail Companions has a much better grip on its tail… er, I mean its data.

Dog chasing tail

Stay tuned for more…

That’s it from me for now. But I’ll be back soon with the next post on the must-haves of metadata management!

3 data management must haves

Connect with DataHub

Join us on Slack · Sign up for our Newsletter · Follow us on Twitter

