
5 Features to Look Out for in a Modern Data Catalog

Metadata Management

Data Engineering

Data Lineage

Data Governance

Open Source

Paul Logan

Jan 12, 2023


Gartner predicts that by 2025, 80% of organizations will fail to scale their digital businesses if they don’t have a modern data and analytics governance approach.

It’s not hard to imagine why.

We just have to think about the large volumes of data organizations are producing, the growing number of data tools they’re using, and the disparate sets of users involved at every step of the data journey.

Data discoverability, data sharing, and data governance are as much a challenge as they are a business priority.

And that’s where a data catalog can help.

Benefits of a Data Catalog: How exactly does a data catalog help?

By organizing metadata (the technical details around data assets) into well-defined and searchable assets, data catalogs enable data discovery and data sharing. At the very least, they help data users:

  • Centrally access the organization’s data
  • Know what data is available and where
  • Find the data they need and any information about it
  • Evaluate the quality of that data
  • Know where data is coming from and where it’s going

This translates into improved data context and analysis, higher data efficiency and data quality, and a foundation for data governance and regulatory compliance — and, ultimately, improved business efficiency.

While this sounds great in principle, most typical data catalogs fall short because they

  • rely solely on manual enrichment of metadata
  • create static silos for metadata and end up with stale metadata
  • fail to act on changing metadata
  • cater only to technical users

But the modern data catalog can — and should — do so much more.

Data Catalog Trends: 5 capabilities that forward-looking businesses need

Here’s our take on the five critical capabilities that make a data catalog the best way to align platforms, processes, and people, so companies get the most out of their data.

1. Shift Left

Shift Left simply refers to the practice of declaring and emitting metadata right where the data is generated. A big part of effective metadata management is moving beyond manual, after-the-fact enrichment, which is usually the result of treating metadata as an afterthought.

By doing so, companies meet developers and teams where they are, instead of retrofitting new processes or workflows. Even better, it helps teams understand the downstream implications of any change.

Developers can enrich their data with ownership, PII status, domains, tags, and more in code, right where it is created. By annotating, say, your schemas, the annotations live alongside the schemas themselves, ensuring that technical schemas are always aligned with the business context. If you’re thinking, “isn’t this similar to Data Contracts?”, the answer is yes: Data Contracts are an example of the shift-left principle applied to collecting the schema, semantic, and quality metadata of a dataset.

The beauty of Shift Left is that it can be tailored to teams’ tools and development patterns. All you need is a data catalog that will surface all this metadata with all the associated context.
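To make this concrete, here is a minimal sketch of shift-left emission using DataHub’s Python SDK (the acryl-datahub package). The server address, dataset, owner, and tag names below are illustrative assumptions, not prescriptions:

```python
# A minimal sketch: declare ownership and PII status in code, next to where
# the data asset is defined, using DataHub's Python emitter.
# The server URL, dataset, owner, and tag below are illustrative placeholders.
from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn, make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    GlobalTagsClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
    TagAssociationClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = make_dataset_urn(platform="snowflake", name="prod.orders", env="PROD")

# Declare ownership right where the dataset is defined in code.
ownership = OwnershipClass(
    owners=[
        OwnerClass(owner=make_user_urn("data-eng"), type=OwnershipTypeClass.DATAOWNER)
    ]
)

# Flag the dataset as containing PII via a tag.
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("pii"))])

for aspect in (ownership, tags):
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect))
```

Because the emission call lives next to the code that defines the dataset, ownership and PII annotations change in the same commit as the schema they describe.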

How DataHub surfaces metadata added at source in its UI

2. Granular Lineage and Impact Analysis

We spoke about how one of the key roles of a data catalog is to show where data comes from and where it goes. What users need is a way to trace lineage across multiple platforms, datasets, pipelines, charts, and dashboards — across production, transformation, and consumption.

Lineage in most catalogs shows users that a dependency exists, but what users really need is to know exactly how. And this is where column-level lineage comes into play, enabling

  • proactive impact analysis
  • reactive data debugging

Column-level lineage has been one of DataHub’s most requested features, and for good reason.

From using it to better manage sensitive/PII data to simplifying essential operations like schema changes and refactoring, column-level lineage has become one of DataHub’s most-loved and most-used capabilities.
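Column-level edges can also be declared programmatically. The sketch below uses DataHub’s Python SDK to record that one column is derived from another; the platform, dataset, and column names are made up for the example:

```python
# A hedged sketch of fine-grained (column-level) lineage with DataHub's
# Python SDK. Dataset and column names are illustrative.
from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    FineGrainedLineageClass,
    FineGrainedLineageDownstreamTypeClass,
    FineGrainedLineageUpstreamTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

upstream = make_dataset_urn("snowflake", "prod.orders", "PROD")
downstream = make_dataset_urn("snowflake", "prod.order_metrics", "PROD")

# A table-level edge plus a column-level edge: order_metrics.total_amount
# is derived from orders.amount.
lineage = UpstreamLineageClass(
    upstreams=[
        UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)
    ],
    fineGrainedLineages=[
        FineGrainedLineageClass(
            upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
            upstreams=[make_schema_field_urn(upstream, "amount")],
            downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
            downstreams=[make_schema_field_urn(downstream, "total_amount")],
        )
    ],
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=downstream, aspect=lineage)
)
```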

3. Active Metadata and Streaming

A data catalog’s active metadata approach ensures that all the metadata collected and indexed is

  • Live
  • Active, and
  • Injected into the operational plane

This means that changes happening across all your tools, such as Airflow, dbt, Snowflake, GitHub, etc., should reflect within the data catalog and be stored in a metadata graph.

A data catalog built with streaming capabilities enables you to act on metadata changes in real time.

This means that your data catalog should help you ensure data availability, completeness, and correctness — on an ongoing basis — and with additional capabilities like pipeline-breaking, reporting, verification, etc.

The following examples are just a few of the many scenarios that become possible once you have a real-time metadata platform (a sketch of the Slack-notification idea follows the list):

  1. Pipeline observability and SLA tracking
  2. Circuit-breaking pipelines based on data quality
  3. Instant Slack notifications on any change in metadata
  4. Syncing tags across DataHub and data warehouses like Snowflake.
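DataHub’s Actions framework is built for exactly this kind of reaction; the generic sketch below only illustrates the underlying idea of subscribing to the metadata change log with a plain Kafka consumer. The topic name, broker address, and Slack webhook are assumptions, and a production consumer would use a schema-registry-aware Avro deserializer instead of json.loads:

```python
# Generic illustration (not the DataHub Actions framework itself): consume
# DataHub's metadata change log from Kafka and notify Slack on tag changes.
# Topic, broker, and webhook are assumptions; the topic is Avro-encoded in
# practice, so json.loads stands in for a schema-registry deserializer.
import json
import urllib.request

from confluent_kafka import Consumer

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "metadata-watcher",
        "auto.offset.reset": "latest",
    }
)
consumer.subscribe(["MetadataChangeLog_Versioned_v1"])

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # illustrative placeholder

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())  # placeholder for Avro deserialization
    if event.get("aspectName") == "globalTags":
        payload = {"text": f"Tags changed on {event.get('entityUrn')}"}
        req = urllib.request.Request(
            SLACK_WEBHOOK,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```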
4. Developer Friendliness (API-first philosophy)

Developer friendliness, often expressed as an API-first philosophy, is a highly sought-after quality in modern data catalogs. However, not all APIs are created equal. To truly support developer friendliness, data catalogs must provide APIs that are not only robust and well-documented, but also offer advanced features that go beyond basic CRUD operations.

One such feature is support for strong typing, which allows for type checking and validation and reduces the likelihood of errors. Another important feature is support for change subscriptions, which allows developers to subscribe to updates in the data catalog and receive notifications in real time. Additionally, support for analytics in the API surface area enables developers to easily access and work with metadata in a meaningful way.

An SDK (Software Development Kit) that is easy to use and well-documented is also crucial for developer friendliness. SDKs allow developers to easily integrate the data catalog into their systems and workflows, and provide a streamlined experience for common tasks.

Finally, a delightful CLI (Command Line Interface) experience is also an important aspect of developer friendliness. A well-designed CLI provides a simple and intuitive way for developers to interact with the data catalog, allowing them to quickly and easily access and manipulate data without the need for a graphical user interface.

Overall, a data catalog that prioritizes developer friendliness by providing advanced and well-documented APIs, an SDK, and a delightful CLI experience will be highly valued among developers and enable greater innovation and automation in data management.
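As a small illustration of what a strongly typed, API-first SDK feels like in practice, the sketch below reads a typed aspect back from DataHub using the Python SDK; the server address and dataset URN are assumptions:

```python
# A hedged sketch of reading a strongly-typed aspect with DataHub's Python
# SDK: the result is a GlobalTagsClass instance (or None), not raw JSON.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import GlobalTagsClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

dataset_urn = make_dataset_urn("snowflake", "prod.orders", "PROD")

tags = graph.get_aspect(entity_urn=dataset_urn, aspect_type=GlobalTagsClass)
if tags:
    for association in tags.tags:
        print(association.tag)
```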

5. Business User Inclusion (Business-facing Views)

The biggest problem with traditional data catalogs is that they are built and designed to cater mostly to technical users.

Something that came up repeatedly at DataHub’s Metadata Day expert panel in 2022 was the pressing need to make business users active participants in the data ecosystem.

Even Gartner, in its top data trends for 2022, stresses the need for

  • shifting the focus from IT to business and
  • enabling “business users or business technologists to collaboratively craft business-driven data and analytics capabilities.”

This can only happen when a data catalog is designed for both business and technical users.

At DataHub, we refer to this as Metadata 360, which combines technical and logical metadata for a cohesive story based on a holistic view of your data stack. This means that business users can see not just what’s happening across systems but also understand how data is produced, transformed, stored, and used — along with the operational metrics that go with these stages.

We’ve built DataHub to ensure that business metadata (traditionally represented using glossary terms and taxonomies) connects to the physical metadata (tables and columns and operational metadata) emitted by the tools in your data stack. Every entity on DataHub can be seamlessly mapped to the business. Take, for example, our Business Glossary feature that ensures that data elements are logically classified in the business context so teams understand how different terms relate to one another.

When you view a glossary term on DataHub, you don’t just see the code associated with it but also a plain-language definition of the metric and even the users associated with that asset.

Example: the “Return Rate” glossary term in DataHub

You can compile your company-specific vocabulary, such as classifications (data sensitivity levels) or acronyms (like KPIs), so you have a single source of truth for your business language. More importantly, your glossary terms now live in the same location as your company’s data assets, allowing your users to associate terms with datasets, classify datasets, view business dependencies, and more.
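Glossary terms can also be attached to assets programmatically rather than only through the UI. Here is a minimal sketch with the Python emitter; the term name, dataset, and server address are illustrative:

```python
# A minimal sketch: associate a glossary term with a dataset via DataHub's
# Python emitter. Term, dataset, and server address are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

dataset_urn = make_dataset_urn("snowflake", "prod.orders", "PROD")

terms = GlossaryTermsClass(
    terms=[GlossaryTermAssociationClass(urn=make_term_urn("ReturnRate"))],
    auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:datahub"),
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms)
)
```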

Wrapping up

We believe that organizations need a developer-friendly and truly dynamic data catalog to tackle the scale and diversity of the modern data stack.

We’re building DataHub to provide organizations with the most reliable and trusted enterprise data graph they need to maximize the value they derive from data.

Want to learn more about Data Governance with DataHub and Acryl Cloud?

Let's chat.

