
It’s HERE! Say Hello to Column-Level Lineage in DataHub

Metadata

Open Source

Project Updates

Data Governance

Data Engineering

Maggie Hays

Oct 21, 2022

Hello, DataHub Enthusiasts!

The past month was jam-packed with big DataHub feature announcements and excellent community-led code and content contributions. Without further ado, let’s get you up to speed!

🤩 The DataHub Community is Unstoppable

September 2022 was a record-setting month for the DataHub Community across the board, where we…

  • welcomed 377 new Slack Members in a single month (!!)
  • merged 291 Pull Requests from 47 Contributors to the open-source DataHub Project
  • connected live on Zoom with 185 September Town Hall Participants

I must say... it’s invigorating to see this Community of Data Practitioners continue to come together to set the new standard of how we tackle metadata management and data governance within the modern data stack.

DataHub Community at a Glance

We continue to see more and more engagement in our Monthly Town Halls, and we are always thrilled to welcome new contributors to the project!

Join us on Slack · RSVP to our Next Town Hall · Follow us on Twitter

It’s TIME! Column-Level Lineage in DataHub is Here

During the September 2022 DataHub Town Hall, we unveiled support for column-level lineage within the DataHub UI. This has been one of the most-requested features from Community Members, and we are so excited to have you all start working with it!

Starting with DataHub v0.9.0, you can visualize column-level dependencies within the lineage view. This is an incredibly powerful resource for tracing fine-grained interdependencies across datasets and reporting resources. In this first iteration, column-level lineage is automatically extracted during Snowflake and Looker ingestion.
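If you'd like to try this against your own Snowflake metadata, ingestion is driven by a YAML recipe. The sketch below is illustrative only — the connection fields and the `include_column_lineage` flag reflect my reading of the Snowflake source docs and may differ in your DataHub version, so verify key names before use:

```yaml
# Hypothetical recipe sketch — check the Snowflake source docs for your version.
source:
  type: snowflake
  config:
    account_id: "my_account"        # assumption: your Snowflake account identifier
    warehouse: "COMPUTE_WH"
    username: "datahub_user"
    password: "${SNOWFLAKE_PASSWORD}"
    include_table_lineage: true
    include_column_lineage: true    # assumption: flag enabling column-level lineage

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

Run it with `datahub ingest -c snowflake_recipe.yml` after installing the Snowflake plugin (`pip install 'acryl-datahub[snowflake]'`).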

Screenshot of Column Level Lineage in DataHub

Exciting, right?! Take it for a spin in the DataHub Demo here, and don’t forget to watch Gabe Lyons and Chris Collins give the full run-down below!

Case Study: How Stripe uses DataHub to power observability within their Airflow-based ecosystem

During the September Town Hall, we heard from Divya Manohar (Software Engineer at Stripe) about how she and the Stripe Team have leveraged DataHub to surface critical Airflow pipeline execution metrics for thousands of hourly tasks and their associated datasets. By leveraging metadata already sent to DataHub from Airflow and customizing the DataHub UI, the Stripe Team built out custom historical reporting to monitor:

  • Historical DataJob Timeliness Tracking to understand the reliability of a given pipeline over time
  • Complex Pipeline Status Tracking, providing a high-level status and estimated land time for jobs composed of thousands of tasks
  • Critical DataJobs Historical SLA Observability

Example of Stripe’s DataJob Timeliness Tracking in DataHub

I highly recommend watching Divya’s presentation; she and the Stripe Team are building next-level resources in DataHub to navigate the complexities of their data stack — prepare to be highly impressed!

Sneak Peek: Automated PII Classification

Data Governance Practitioners are keenly aware of how important it is to accurately catalog and classify data assets with the appropriate PII category; historically, this has been a laborious, manual effort.

During the September Town Hall, we shared a sneak peek of upcoming functionality in DataHub to automatically apply PII classifications to datasets. The goal is to drastically reduce the amount of manual tagging required and to bolster the coverage of compliance categorization within your data warehouse. Check out the demo to learn more!
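The classifier itself was only previewed at Town Hall, but to make the idea concrete, here is a deliberately naive rule-based sketch (entirely hypothetical — a real classifier like DataHub's would also sample column values, not just match names):

```python
import re

# Hypothetical rules mapping column-name patterns to PII categories.
PII_RULES = {
    "EMAIL": re.compile(r"e[-_]?mail", re.I),
    "PHONE": re.compile(r"phone|mobile", re.I),
    "SSN":   re.compile(r"\bssn\b|social[-_]?security", re.I),
    "NAME":  re.compile(r"(first|last|full)[-_]?name", re.I),
}

def classify_columns(columns: list[str]) -> dict[str, list[str]]:
    """Map each column name to the PII categories its name suggests."""
    tags = {}
    for col in columns:
        matched = [cat for cat, rx in PII_RULES.items() if rx.search(col)]
        if matched:
            tags[col] = matched
    return tags
```

Even this toy version shows why automation pays off: one pass over a schema replaces hundreds of manual tagging decisions, and the remaining human effort shifts to reviewing suggestions rather than producing them.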

Metadata Ingestion Improvements, Galore!

The DataHub Community is hard at work ensuring our existing Ingestion Sources are performant and extract as much valuable metadata as possible. Here are some highlights from v0.8.44, v0.8.45, and v0.9.0:

  • Snowflake: the improved Snowflake connector is now stable and supports column-level lineage extraction (the previous version has been renamed snowflake-legacy)
  • BigQuery: bigquery-beta is improving rapidly (structs, unified usage)
  • LookML: automatically clones your Git repo, supported in UI!
  • Looker: ingestion requires much less memory
  • dbt: extracts column-level meta mappings
  • Tableau: extracts chart usage
  • Presto on Hive: supports stateful ingestion, extracts table descriptions + views
  • Core: checkpoint state compression, delete + rollback support for timeseries aspects
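As an illustration of the dbt item above: dbt `meta` properties can be translated into DataHub tags through the source's meta-mapping config. The sketch below is assumption-laden — `column_meta_mapping` is my guess at the key name for the new column-level support, so check the dbt source docs for your version before copying it:

```yaml
# Hypothetical dbt recipe fragment — verify key names against the dbt source docs.
source:
  type: dbt
  config:
    manifest_path: "./target/manifest.json"
    catalog_path: "./target/catalog.json"
    # Assumption: column-level counterpart of the existing meta_mapping config
    column_meta_mapping:
      contains_pii:
        match: "true"
        operation: "add_tag"
        config:
          tag: "pii"
```

With a mapping like this, any dbt column carrying `meta: {contains_pii: true}` would arrive in DataHub already tagged.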

285 people have contributed to DataHub to date

During September, we merged 291 pull requests from 47 contributors, 16 of whom contributed for the first time:

@aditya-radhakrishnan @aezomz @amanda-her @Ankit-Keshari-Vituity @anshbansal @atul-chegg @BogdanAntoniu78 @chriscollins3456 @codesorcery @daha @danilopeixoto @de-kwanyoung-son @divyamanohar-stripe @firasomrane @gabe-lyons @GyuhoonK @hemanthkotaprolu @hieunt-itfoss @hsheth2 @jeffmerrick @jinlintt @jjoyce0510 @justinas-marozas @ksrinath @liyuhui666 @ltxlouis @maaaikoool @maggiehays @Masterchen09 @mayurinehate @mkamalas @mohdsiddique @ms32035 @MugdhaHardikar-GSLab @ngamanda @pedro93 @pghazanfari @remisalmon @rslanka @RyanHolstien @shirshanka @skrydal @szalai1 @topleft @TonyOuyangGit @treff7es @upendrao

We are endlessly grateful for the members of this Community — we wouldn’t be here without you!

One Last Thing —

I’m thrilled to welcome Paul Logan to my team at Acryl Data as Developer Relations Lead. He brings a wealth of dev rel experience, and I am so excited to team up with him to get the DataHub Community to the next level.

I caught up with him this week:

Maggie: We’re so excited to have you on board as the first DataHub Dev Rel Lead! What has been most surprising to you in your first month on the team?

Paul: The most surprising thing to me is the real depth of knowledge everyone on the team has; it’s incredible to be working with people with such expertise!

M: I looove to hear it! What’s a song you’ve been playing on repeat recently?

P: “Hangar” by 8455


That’s it for this round; see ya on the Internet :)

Connect with DataHub

Join us on Slack · Sign up for our Newsletter · Follow us on Twitter
