Harnessing the Power of Data Lineage with DataHub

DataHub

Metadata Management

Data Lineage

Data

Metadata

David Anyaeche

Jun 13, 2022

DataHub is the leading metadata management platform and data discovery tool. In this article, we look at two ways DataHub uses lineage to support your data team. First, lineage helps you understand the downstream ramifications of changing your upstream datasets. Second, lineage helps you protect sensitive data.

DataHub extracts lineage from a wide range of data platforms: cloud warehouses such as BigQuery and Snowflake, transformation and orchestration tools such as dbt and Airflow, and business intelligence tools such as Looker and Tableau.

Understanding Context with Data, Proactive and Reactive Error Mitigation

End-to-End Lineage

The primary goal of lineage is to provide end-to-end visibility into how an organization’s data is produced, transformed, and consumed, regardless of which platform the data passes through. This gives data engineers two ways to limit the blast radius of a change or failure: proactive impact analysis and reactive data debugging.

Visualize lineage showing data sources, sinks, and transformations from production to consumption

Proactive Impact Analysis

The Impact Analysis tab lets you view all downstream dependencies of a dataset in one collection. You can filter this collection by tag, platform, entity type, owner, free-text search, and more. You can also filter by dependency depth, that is, how many layers downstream an entity sits from the one you are viewing. The collection can be downloaded as a CSV file for use outside the tool. For example, users can use the spreadsheet to track the progress of a migration and contact data owners.

Impact Analysis of all_entities
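To make the idea of dependency depth concrete, here is a minimal sketch of a breadth-first walk over a lineage graph that records how many layers downstream each entity sits. It is an illustration only, not DataHub's implementation: the graph, the dataset names, and the `impact_analysis` function are all hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the entities listed as its value.
DOWNSTREAMS = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.customer_ltv", "marts.daily_revenue"],
    "marts.customer_ltv": ["dashboard.ltv_overview"],
    "marts.daily_revenue": [],
    "dashboard.ltv_overview": [],
}


def impact_analysis(downstreams, start, max_degree=None):
    """Return {entity: degree} for everything downstream of `start`.

    Degree 1 is a direct dependency, degree 2 is one hop further, and so on.
    """
    degrees = {}
    queue = deque([(start, 0)])
    while queue:
        node, degree = queue.popleft()
        if max_degree is not None and degree >= max_degree:
            continue  # do not expand beyond the requested depth
        for child in downstreams.get(node, []):
            if child not in degrees and child != start:
                degrees[child] = degree + 1
                queue.append((child, degree + 1))
    return degrees


if __name__ == "__main__":
    for entity, degree in impact_analysis(DOWNSTREAMS, "raw.orders").items():
        print(f"degree {degree}: {entity}")
```

Passing `max_degree=1` restricts the result to direct dependencies, which mirrors the depth filter described above.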

Organizations can also lean on the configurability of DataHub’s platform: DataHub’s API provides an endpoint through which impact analysis can be queried programmatically, for example with the following GraphQL query.

query searchAcrossLineage($input: SearchAcrossLineageInput!) {
  searchAcrossLineage(input: $input) {
    start
    count
    total
    searchResults {
      degree
      entity {
        type
        ... on Dataset {
          name
          platform {
            name
          }
        }
      }
    }
  }
}
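As a rough sketch of how the query above might be sent from a script, the Python below builds the request body, posts it, and flattens the response. Several details are assumptions rather than anything stated in this post: the input fields (`urn`, `direction`, `query`, `start`, `count`) should be checked against the GraphQL schema of your DataHub version, and the endpoint URL and access token are placeholders you would supply yourself.

```python
import json
import urllib.request

SEARCH_ACROSS_LINEAGE = """
query searchAcrossLineage($input: SearchAcrossLineageInput!) {
  searchAcrossLineage(input: $input) {
    start
    count
    total
    searchResults {
      degree
      entity {
        type
        ... on Dataset {
          name
          platform {
            name
          }
        }
      }
    }
  }
}
"""


def build_request_body(urn, direction="DOWNSTREAM", start=0, count=100):
    """Build the JSON body for the call. Input field names are assumptions."""
    return {
        "query": SEARCH_ACROSS_LINEAGE,
        "variables": {
            "input": {
                "urn": urn,
                "direction": direction,
                "query": "*",
                "start": start,
                "count": count,
            }
        },
    }


def run_query(endpoint, token, body):
    """POST the body to a GraphQL endpoint (URL and token are placeholders)."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(result):
    """Flatten a response into (degree, type, name, platform) rows, nearest first."""
    rows = []
    for hit in result["data"]["searchAcrossLineage"]["searchResults"]:
        entity = hit["entity"]
        platform = entity.get("platform") or {}
        rows.append((hit["degree"], entity["type"], entity.get("name"), platform.get("name")))
    return sorted(rows, key=lambda row: row[0])
```

A typical call would be `summarize(run_query(endpoint, token, build_request_body(dataset_urn)))`. Entities that are not datasets come back with only `type`, so their name and platform are `None` in the summary.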

Reactive Data Debugging

When a dataset has a quality issue, DataHub supports end-to-end debugging and shows which part of the organization the data engineer should alert.

Organizations can visualize lineage to identify the root cause upstream. By combining DataHub’s schema history feature with lineage, you can see how the schemas of upstream datasets have changed over time and zero in on recent upstream changes that may have caused the issue. For transformation runs, users can also see the run history, which shows how upstream data job dependencies and success rates have tracked over time.

DataHub UI detailing information of transformation runs on a data task
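The debugging workflow described above, walking upstream and looking for recent schema changes, can be sketched in a few lines. This is a hedged illustration with made-up data, not how DataHub stores or queries schema history: the lineage map, the change timestamps, and the `suspect_upstreams` function are hypothetical.

```python
from datetime import datetime

# Hypothetical lineage: each dataset maps to the datasets it reads from.
UPSTREAMS = {
    "dashboard.ltv_overview": ["marts.customer_ltv"],
    "marts.customer_ltv": ["staging.orders", "staging.customers"],
    "staging.orders": ["raw.orders"],
    "staging.customers": ["raw.customers"],
}

# Hypothetical schema history: when each dataset's schema last changed.
LAST_SCHEMA_CHANGE = {
    "marts.customer_ltv": datetime(2022, 5, 2),
    "staging.orders": datetime(2022, 6, 10),
    "raw.orders": datetime(2022, 6, 9),
    "raw.customers": datetime(2022, 1, 15),
}


def suspect_upstreams(upstreams, last_change, broken, last_known_good):
    """List upstream datasets whose schema changed after `last_known_good`.

    Results are (name, changed_at) pairs, most recent change first.
    """
    seen, stack = set(), [broken]
    while stack:
        node = stack.pop()
        for parent in upstreams.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    suspects = [
        (name, last_change[name])
        for name in seen
        if name in last_change and last_change[name] > last_known_good
    ]
    return sorted(suspects, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, changed_at in suspect_upstreams(
        UPSTREAMS, LAST_SCHEMA_CHANGE, "dashboard.ltv_overview", datetime(2022, 6, 1)
    ):
        print(f"{name} changed on {changed_at:%Y-%m-%d}")
```

Sorting by most recent change puts the likeliest culprits first, which is the same narrowing a person does by eye when reading the lineage graph alongside the schema history.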

Data Governance: Privacy-Conscious Data Engineering

Privacy-Enabled Features of Lineage

DataHub’s lineage also supports governance: users can browse a glossary of terms and identify sensitive information across a repository of data items. The hierarchical directory of terms shows the data owners associated with each term. With lineage, an organization can also decide which of these sensitive data items are validated and view a topological catalog of related terms, entities, and properties.

Glossary of Terms containing related items under an organizational category

In the DataHub UI, you can select a “Term Group” (a directory of related glossary terms under a business category) to see its contents, owners, documentation, and other relevant information.

Within a Term Group, a “Glossary Term” can be selected. Owners can add or modify links in a glossary term and view other owners as well.

Glossary Term, “AccountBalance” detailing documentation, directory hierarchy, about section, and owners

For each Glossary Term, you can view its documentation, the entities associated with related terms, its data owners, its properties, and its place in the hierarchy.

Related Entities and Properties of Glossary Term, “AccountBalance”

Users can select a dataset that contains or inherits this term. An organization can also look at a term’s parent or child dataset to understand its sensitivity and relevance.

Dataset, “active_customer_ltv”, depicting its schema containing fields and tags
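One way to picture how a dataset can carry a term directly or through inheritance is the sketch below. It is a simplified model under stated assumptions, not DataHub's glossary implementation: the inheritance map, the term assignments (including the pairing of “active_customer_ltv” with “AccountBalance”), and both functions are hypothetical.

```python
# Hypothetical glossary: each term maps to the terms it inherits from.
INHERITS = {
    "AccountBalance": ["Sensitive"],
    "SavingsBalance": ["AccountBalance"],
    "Sensitive": [],
}

# Hypothetical term assignments on datasets.
DATASET_TERMS = {
    "active_customer_ltv": ["AccountBalance"],
    "savings_snapshot": ["SavingsBalance"],
    "web_clicks": [],
}


def expand_terms(inherits, term):
    """Return `term` plus every term it inherits from, directly or transitively."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(inherits.get(current, []))
    return found


def datasets_with_term(dataset_terms, inherits, wanted):
    """List datasets tagged with `wanted` directly or via an inherited term."""
    return sorted(
        name
        for name, terms in dataset_terms.items()
        if any(wanted in expand_terms(inherits, term) for term in terms)
    )


if __name__ == "__main__":
    print(datasets_with_term(DATASET_TERMS, INHERITS, "Sensitive"))
```

Here a search for the hypothetical “Sensitive” term surfaces both tagged datasets even though neither is tagged with it directly, which is the kind of visibility a term hierarchy is meant to provide.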

Building for the Future…

Understanding the ramifications and impact of data as it is generated, consumed, and transformed enables more sophisticated data engineering. By extending transparency around sensitive items and the peripheral consequences of data changes, lineage increases an organization’s efficacy and improves data stewardship.

DataHub’s mission is to improve how organizations understand and use their data through sophisticated metadata management. DataHub is building tools and features for governance, discovery, and observability for the modern data ecosystem. We’d love for you to be a part of the DataHub community! Come say hello in our Slack, check out our GitHub, and attend our latest Town Hall to learn about the latest in DataHub.

To learn more about the managed DataHub solution, sign up for Acryl.
