
Data Contracts Wrapped 🎁 2022

Data Engineering · Data Quality · Data Governance · Opinion

Shirshanka Das

Dec 16, 2022


Unless you took a well-deserved break from Data in 2022, you are probably well aware of the explosion of conversations around Data Contracts.

[Image: Data Contracts search trend]

In this post I summarize the main ideas from the most popular writings in this space and add my own predictions for the future of data contracts in 2023.

Let’s get right to it!

Definitions: What is a data contract?

Across all the popular writings about data contracts, a summarized definition looks like this (a rough sketch of such a specification follows the list).

A data contract is:

1️⃣ a data quality specification for a dataset that describes

  • schema of the dataset
  • simple column-oriented quality rules (is_not_null, values_between, etc.)

2️⃣ anchored around a dataset, which is typically a physical event stream produced by a team / domain

3️⃣ also a carrier for semantic (a.k.a. business) metadata

  • ownership
  • classification tags

4️⃣ can also describe operational SLOs

  • freshness goals (e.g., must be available for processing in the warehouse by 7 a.m.)

5️⃣ can also be a specifier for provisioning configuration for a dataset (e.g. provision this dataset on Kafka and BigQuery)
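
To make that concrete, here is a minimal sketch of what such a specification might contain, written as a Python dict purely for readability (in practice these tend to live in declarative formats such as YAML, as discussed later in this post). Every field name and value here is illustrative rather than drawn from any particular implementation.

```python
# Illustrative data contract for a hypothetical "user_signups" event stream.
# All keys and values are made up for this sketch; real implementations vary.
user_signups_contract = {
    "dataset": "user_signups",                      # 2: anchored on a physical event stream
    "owner": "growth-engineering",                  # 3: ownership (semantic/business metadata)
    "classification": ["pii"],                      # 3: classification tags
    "schema": {                                     # 1: schema of the dataset
        "user_id": "string",
        "signup_ts": "timestamp",
        "country_code": "string",
    },
    "quality": [                                    # 1: simple column-oriented rules
        {"column": "user_id", "check": "is_not_null"},
        {"column": "country_code", "check": "values_in", "values": ["US", "DE", "IN"]},
    ],
    "slos": {                                       # 4: operational SLOs
        "freshness": "available in warehouse by 07:00",
    },
    "provisioning": {                               # 5: where to provision the dataset
        "kafka_topic": "events.user_signups",
        "bigquery_table": "raw.user_signups",
    },
}
```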

My Thoughts

Different implementations of data contracts seem to have slightly different focus areas. Most acknowledge data quality as the primary focus, and then expand out to operational quality, documentation, and semantic tags, among other characteristics.

I predict that in 2023, the definition of a data contract will broaden to become: all metadata about a dataset that is useful to declare upfront to drive automation in data quality, data management, data governance, and so on. One pitfall to watch out for: the producer only has a partial understanding of all the aspects of the dataset and cannot fill out all of this information with high accuracy. A few simple examples come to mind: specifying retention and writing documentation for the dataset. These need input from multiple stakeholders.

Scope: Where does it apply?

The current conversation on data contracts is focused on:

  • the edge between the operational data plane (where your company’s first party data lives) and the analytical data plane (typically a lake or a data warehouse)
[Image: data contracts at the edge between the operational and analytical data planes]

My Thoughts

As a reader, you might be wondering whether there is a specific reason the community is hyper-focused on attaching data contracts only to this edge. One of the reasons is that a lot of the intention behind the conversation around data contracts is to shift responsibility and accountability for data to the application teams that produce it (see the section below on culture change). Since application engineers tend to be the least knowledgeable about what happens to the data after it leaves their application database, this specific edge has received all of the attention in the implementations published so far. I can relate. In 2015, I championed a project at LinkedIn whose whole premise revolved around giving application engineers the tools they need to declare and manage their data contracts with their downstream teams.

I predict that in 2023, while we continue to see more real-world implementations of data contracts applied to this segment of the data journey, we will also see data contract implementations broaden their scope to include all important datasets that cross team boundaries: application events in Segment, database-oriented CDC events and service events in Kafka, dbt tables in the warehouse, and external (third-party) data synced to the lake or the warehouse. We will want to cover them all.

Lifecycle: How are data contracts created and stored?

Everyone agrees:

  • Use a git-based process for creating and managing the lifecycle of a data contract
  • Deploy the contract to a contract registry for serving, and to a data catalog to power search and discovery
  • For environments where structured events are emitted to Kafka, also deploy the schemas to the Kafka Schema Registry for schema enforcement (a rough sketch of this step follows below)
[Image: contract enforcement and monitoring]
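
As a rough illustration of that last bullet, here is a hedged sketch of registering a contract's schema with the Kafka Schema Registry using the confluent-kafka Python client. The registry URL, subject name, and Avro schema are placeholders; in a real setup this step would typically run from CI once the contract's pull request is merged.

```python
import json

from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

# Avro schema derived from the contract's "schema" section (illustrative).
avro_schema = {
    "type": "record",
    "name": "UserSignup",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "signup_ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "country_code", "type": "string"},
    ],
}

# Registry URL and subject name are placeholders for this sketch.
client = SchemaRegistryClient({"url": "http://schema-registry.internal:8081"})
schema_id = client.register_schema(
    subject_name="events.user_signups-value",
    schema=Schema(json.dumps(avro_schema), schema_type="AVRO"),
)
print(f"Registered schema id {schema_id} for events.user_signups")
```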

My Thoughts

People are experimenting with different formats to describe data contracts, from yaml to jsonnet. Clearly, as of now, the exact format for a data contract matters less than the information it carries and how its lifecycle is managed. I predict that open source implementations will emerge next year that standardize this pattern of creation → serving + search and discovery. Given DataHub’s affinity for shift-left approaches to metadata, I think we’ll see it show up quite often in these reference implementations.

Details: How are data contracts enforced?

All implementations seem to suggest the same pattern:

  • A stream processor evaluates data in the stream for schema validation and column-level quality rules
  • Bad data is routed to a dead-letter queue; good data is made available for the warehouse and other downstream streaming consumers to process (see the sketch below)
  • Not much has been written about automating the non-quality-oriented parts of data contract specifications (ownership, lineage, documentation, classification); most writings just mention pushing these pieces of information to the catalog
  • GoCardless writes about provisioning topics and tables by invoking their underlying infrastructure-as-code platform.
[Image: bad vs. good data routing]
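
Here is a minimal sketch of that routing loop, assuming Kafka topics and the confluent-kafka Python client. Topic names and rules are placeholders, and a production enforcement layer would more likely run inside a stream processor such as Flink or Kafka Streams rather than a hand-rolled consumer.

```python
import json

from confluent_kafka import Consumer, Producer

# Column-level rules taken from the contract (illustrative, not a standard format).
RULES = [
    ("user_id", lambda v: v is not None),                 # is_not_null
    ("country_code", lambda v: v in {"US", "DE", "IN"}),  # values_in
]

consumer = Consumer({
    "bootstrap.servers": "kafka.internal:9092",  # placeholder broker
    "group.id": "contract-enforcer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka.internal:9092"})
consumer.subscribe(["events.user_signups"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Tuple-at-a-time evaluation: only per-record checks are possible here.
    violations = [col for col, check in RULES if not check(record.get(col))]
    target = "events.user_signups.dlq" if violations else "events.user_signups.valid"
    producer.produce(target, json.dumps(record).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
```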

My Thoughts

Data contract infrastructure is still in its infancy. We are just scratching the surface of the sorts of things we can catch in flight to prevent bad data from flowing downstream. Row-oriented, tuple-at-a-time processing naturally limits the kinds of checks you can authoritatively apply at this stage. Complex quality checks that involve aggregations, sliding windows, or cross-dataset constraints will not be easy to implement without incurring delays and non-determinism.

Similar techniques exist for batch datasets using the staging → publish table pattern, but most of the data contracts conversation isn’t talking about them, presumably because the problem is considered solved, even though everyone typically rolls their own version of it via Great Expectations, dbt tests, or Soda.
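
For reference, here is a hedged sketch of that staging → publish pattern, with sqlite standing in for the warehouse purely to keep the example self-contained; table names and the sample rows are illustrative. The aggregate check (no duplicate user_ids in the batch) is exactly the kind of constraint that tuple-at-a-time streaming enforcement struggles to express.

```python
import sqlite3

# sqlite stands in for the warehouse in this sketch; the pattern, not the engine, is the point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_signups_staging (user_id TEXT, signup_ts TEXT, country_code TEXT)")
conn.execute("CREATE TABLE user_signups (user_id TEXT, signup_ts TEXT, country_code TEXT)")
conn.executemany(
    "INSERT INTO user_signups_staging VALUES (?, ?, ?)",
    [("u1", "2022-12-16T06:30:00", "US"), ("u2", "2022-12-16T06:45:00", "DE")],
)

# An aggregate check over the whole staged batch: no duplicate user_ids allowed.
dupes = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT user_id FROM user_signups_staging GROUP BY user_id HAVING COUNT(*) > 1)"
).fetchone()[0]

if dupes == 0:
    # Publish: move the validated batch into the consumer-facing table.
    conn.execute("INSERT INTO user_signups SELECT * FROM user_signups_staging")
    conn.execute("DELETE FROM user_signups_staging")
    conn.commit()
else:
    raise RuntimeError(f"Staging batch failed contract checks: {dupes} duplicate user_ids")
```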

I predict that in 2023, we will see some of these technical challenges crop up and make us realize that while data contracts (and their associated streaming implementations) will improve quality for many important edges in our data graph, there are deeper data quality problems that will continue to require a heavier batch-oriented and monitoring-based approach, even if the specification of the contract is owned by the producer. In cases where consumers are okay with waiting longer for the data to be completely validated before it is released for processing, more complex evaluations can be moved into the enforcement layer, but I predict that consumers will want a choice: give me data quickly that has the quality characteristics I need. Don’t slow me down because you’re checking for a quality characteristic that doesn’t affect my processing.

Beyond data quality use cases, I predict that we will see more real-world implementations of automated data management and data governance driven by data contracts.

Beyond the Tech: What sort of culture change is being proposed?

In summary:

  • Data producers should own the specification of the contract
  • They should be accountable for all the characteristics of the dataset as laid out in the contract
[Image: “Great Data Ecosystem” slide from my 2016 Strata talk, before data contracts were cool :)]

My Thoughts

There is unanimous agreement that data producers (a.k.a. the application teams) should be more accountable for the data they produce. However, if you look around the room, the crowds shouting in agreement are all data consumers (a.k.a. the data team). Having had some personal experience driving a similar change at LinkedIn in 2016, the part of me that still remembers it winces: in that case, despite our best efforts on tech, tooling, and process, application developers continued to not really care about the data they were producing, because it was really, really hard to tie incentives to doing this well.

I predict that in 2023, our collective hopes will be dashed a few times, after the first wave of implementations meet this reality at their respective companies. I also predict that we will look for ways to reduce the friction associated with creating contracts (perhaps auto-generating default contracts for producers to just sign off on), while also looking for the magical credits system that will reward producers for being good data citizens (data karma points that you can trade in for a free gift from your corporate store or a free tech review pass for your next feature or some other crazy idea).

Deja Vu: Have we been through this hype before? Is this different from data mesh?

Writings about data contracts frequently reference data mesh philosophies in the following ways:

  • Data Contract specifications are almost the same as Data Product specifications.
  • Data Contracts are currently closely aligned with the Source-Aligned Data Products laid out by Zhamak.

My Thoughts

The excitement around data contracts is highly reminiscent of the excitement around data mesh when it first emerged as an architectural proposal. While practical implementations of data mesh are still emerging, data contracts in their current form seem like a narrower, and maybe more achievable, part of that dream. However, the narrow focus and the simple bag of tricks available for dealing with bad data might be the biggest challenge this approach faces when it meets real-world data quality problems and the need to keep the business going with imperfect data wherever possible.

The Reading List

Here is a non-exhaustive list of authors and reading material that formed the basis for this post. If you haven’t already, I highly recommend you check them out to familiarize yourself with this space.

What’s Next?

2022 is almost behind us, but 2023 looks extremely promising for metadata-driven data management via data contracts. Stay tuned and subscribe to the DataHub blog for a more in-depth look at how I see this space evolving (coming early next year!).

