
Data Contracts in DataHub: Combining Verifiability with Holistic Data Management


Shirshanka Das

Sep 19, 2023


676 million

That’s how many results a simple Google search on Data Contracts throws up. As buzzwordy as the term might sound, we’ve maintained that Data Contracts aren’t as novel or as complicated as they’re made out to be.

And it’s this focus on simplicity that’s guided our approach to building Data Contracts in DataHub. In this blog post, we share how we’ve implemented Data Contracts within DataHub, how you can get started, and how the Data Products functionality can help you get the most out of Data Contracts.

What is a Data Contract?

Here’s what ChatGPT has to say:

A data contract refers to an agreement or specification that defines the structure, format, and semantics of data exchanged between different systems, applications, or components. It serves as a mutual understanding between parties involved in data exchange, ensuring that data is transmitted and interpreted correctly.

Not surprisingly, ChatGPT’s take misses a few key nuances, a striking one being the aspect of verifiability.

While contracts encompass agreements, specifications, and various structural aspects, their true value lies in their ability to be validated. Verifiable elements include schemas, column-level data checks, and operational service level agreements (SLAs) that can be programmatically checked and enforced.

For a detailed and nuanced understanding of Data Contracts, check out:

  • The What, Why, and How of Data Contracts, based on an AMA Maggie hosted with Chad Sanderson and me.
  • Data Contracts Wrapped 🎁 2022, which summarizes the main ideas from the most popular writings in the data contracts space.

Here’s the TL;DR version:

A Data Contract is an agreement between a producer and a consumer that clearly defines:

  • what data needs to move from a producer/source to a consumer/destination
  • the shape of that data, its schema, and semantics
  • expectations around availability and data quality
  • details about contract violation(s) and enforcement

LEARN MORE: Watch the on-demand webinar:

Data Contracts: A Practitioner's Guide

presented by Gabe Lyons, founding engineer at Acryl Data

Data Contracts in DataHub

In DataHub, Data Contracts are collections of assertions: verifiable statements that can be made about, and enforced on, individual data assets.

Assertions in Data Contracts revolve around schema-related aspects, service level agreements (SLAs), data freshness, and data quality. DataHub’s dbt and Great Expectations integrations allow you to produce:

  • AssertionInfo aspects (which define the parameters of an assertion)
  • AssertionRunEvents (which record the evaluation results of assertions; a simplified sketch of both follows)
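
The split between these two matters: the AssertionInfo definition is durable, while AssertionRunEvents accumulate as evaluations happen over time. The sketch below is purely illustrative; the field names are simplified stand-ins rather than the exact shape of DataHub’s aspect models, and the URN is hypothetical.

```yaml
# Illustrative only: simplified stand-ins, not DataHub's actual aspect models.

# AssertionInfo: the durable definition of *what* is checked.
assertionInfo:
  assertion: urn:li:assertion:purchases-amount-non-negative   # hypothetical URN
  description: "Every value in the 'amount' column must be >= 0"

# AssertionRunEvent: one timestamped evaluation of that assertion.
assertionRunEvent:
  assertion: urn:li:assertion:purchases-amount-non-negative
  timestampMillis: 1695110400000
  result: SUCCESS
```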

[Image: DataHub already supports assertions]

As part of DataHub’s Data Contracts implementation, we've added two new kinds of assertions:

  • SLA Assertion: defines when a dataset should land (for example, by 8 AM every day)
  • Schema Assertion: defines what the data should look like, including its fields and their types (both are sketched below)
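
To make these concrete, here is a minimal sketch of how an SLA assertion and a schema assertion might be expressed together in a contract file. The keys and structure are illustrative, extrapolated from the descriptions above rather than copied from DataHub’s contract spec, and the dataset URN is made up.

```yaml
# Hypothetical contract sketch; keys are illustrative, not the exact DataHub spec.
entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,purchases,PROD)

# SLA assertion: when the dataset should land.
freshness:
  type: cron
  cron: "0 8 * * *"   # data expected by 8 AM daily

# Schema assertion: what the data should look like.
schema:
  fields:
    - name: purchase_id
      type: string
    - name: amount
      type: double
```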

End-to-End Implementation of Data Contracts in DataHub

The implementation of Data Contracts in DataHub is designed so that:

  • Data producers can author data contracts as YAML files and store them in version control systems like Git.
  • These contracts can then be deployed to DataHub, which acts as a repository for contracts and their associated assertions.
  • Business users can use DataHub to access and edit/update the Data Contract.
  • Existing data quality tools can evaluate these assertions and report the results (the workflow is sketched below).
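
In practice, that loop might look like the following, assuming the contract lives in Git alongside your code and the DataHub CLI is installed. The `datacontract` subcommand shown here is an assumption based on the CLI’s naming conventions; check `datahub --help` for the actual verb in your version.

```bash
# Author the contract and commit it to version control.
git add contracts/purchases.yaml
git commit -m "Add data contract for the purchases dataset"

# Deploy the contract to DataHub (hypothetical subcommand).
datahub datacontract upsert -f contracts/purchases.yaml
```

From there, business users can view and update the contract in the DataHub UI, while tools like dbt tests or Great Expectations evaluate the assertions and push their results back as AssertionRunEvents.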

[Image: Data Contracts: End to End]

Check out these two videos to see all this in action.

Part 1: Creating and Deploying Data Contracts with DataHub

Part 2: Using DataHub’s UI to access and manage Data Contracts

Data Contracts + Data Products: How DataHub Combines Verifiability with Non-Verifiable Metadata

Going back to the verifiability aspect of Data Contracts: key metadata elements such as documentation, ownership, and tags lack verifiability, yet we know how incredibly important they are to the health of a data ecosystem.

[Image: What are verifiable things]

And it’s this focus on both verifiable and non-verifiable metadata that anchors DataHub’s approach to Data Contracts. Data Contracts in DataHub integrate with Data Products for a holistic approach to managing data assets. Here’s how.

Data Products vs Data Contracts

Data Products in DataHub represent collections of related assets, grouped into a single unit that you can manage and maintain. They have owners, tags, glossary terms, and documentation.

[Image: Data Products contain all these important things]

But they also provide a way to combine verifiable and descriptive metadata.

Data Contracts

Data Contracts are the verifiable aspects stated and enforced on individual data assets, covering schema, service level agreements (SLAs), data freshness, and data quality.

With DataHub, you can combine the verifiable (via Data Contracts) and the descriptive, non-verifiable (via Data Products) elements to create a curated metadata graph.

[Image: Data Products to Data Contract architecture]

Data Products and Data Contract Management in DataHub

In the near future, to streamline the management of Data Products and Data Contracts, you will be able to use the same YAML file to define both Data Product and Data Contract specifications, allowing them to be managed as a unified definition. This approach ensures that both documentation and schema assertions can be maintained as code, satisfying the needs of different stakeholders.
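
As a sketch of what that unified definition might look like, consider the following. The structure is speculative, extrapolated from the description above rather than taken from a published spec, and every identifier in it is made up.

```yaml
# Speculative sketch of a unified Data Product + Data Contract definition.
dataProduct:
  id: purchases
  owners:
    - urn:li:corpuser:data-platform-team   # hypothetical owner URN
  documentation: "All purchase events, cleaned and deduplicated."
  assets:
    - urn:li:dataset:(urn:li:dataPlatform:snowflake,purchases,PROD)

  # The contract: the verifiable half of the definition.
  contract:
    freshness:
      cron: "0 8 * * *"
    schema:
      fields:
        - name: purchase_id
          type: string
```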

[Image: Data Product + Contract Management]

In the advanced DataHub implementation available via Acryl Cloud, a Data Contract Operator responds to contracts and starts monitoring, enforcing, and reporting results.

Acryl’s Managed DataHub Implementation for Scalable Data Management

While DataHub serves as the foundation for Data Contracts, Acryl’s managed DataHub version provides the advanced tools and capabilities you need to manage them at scale. This includes:

  • An inference engine to generate proposals for Data Contracts
  • Approval workflows for data producers and consumers
  • Enforcement mechanisms for Data Contracts

References and Further Reading

  • Data Products in DataHub
  • Data Contracts in DataHub

