DataHub Community Update

Project Updates

Metadata

Open Source

Data Engineering

DataHub

Maggie Hays

Nov 29, 2021

DataHub October 2021 Town Hall

Hello, DataHub Enthusiasts!

The DataHub Community continues to be abuzz with activity, and the month of October was no exception. Want to see what has happened in prior months? Head over to the Project Updates section to check them out.

Let’s get you up to speed on all-things-DataHub!

Community & Project Updates

New Ways to Collaborate in DataHub Slack

We rolled out some new channels dedicated to collaboration in the DataHub Slack, including:

  • #office-hours: we now host open office hours every Tuesday at 9 am US PT — join the channel for reminders & Zoom details!
  • #contribute: looking for ways to contribute back to the DataHub Community? Post your ideas here!
  • #show-and-tell: are you excited about something you/your team has done with DataHub? Tell us all about it, and consider me your personal hype woman — I love celebrating y’all’s successes, especially when they have to do with :teamwork: 😎

We also announced that we are using the Hey Taco! Slack app to show gratitude to one another. Whenever you’d like to show appreciation to someone who went above and beyond for you in the DataHub Community, give them a 🌮! Just mention their username, write your message, and add the :taco: emoji.

  • Is it silly to send virtual tacos to say thanks? Yes! We love silly. And tacos.
  • Will we be rolling out a program to redeem 🌮s for limited edition DataHub Swag? Also Yes! We love swag. And tacos.
  • Will the rewards store have a ✨limited edition DataHub fanny pack ✨? Can neither confirm nor deny 🤐😬
Another great office hours

Q4 Roadmap Updates

Here’s what the Core DataHub team is working on in Q4 2021:

  • Updates to DataHub metadata model — we are targeting schema history, column-level lineage, and data quality (specifically Great Expectations in the first pass)
  • Support for multiple data platform instances — we’ll be making it easier for you to uniquely identify datasets across multiple instances of the same platform
  • Improved support for dbt — we will be fully leveraging the catalog & manifest JSON files in dbt projects and improving how we organize these entities in the lineage graph
  • Handling stale metadata — when datasets are deleted in the source environment, we will support removing or soft-deleting them in DataHub
  • Integrations with DeltaLake & Spark — woohoo!

Call for DataHub Community Support!

We are looking for Community Members to help in the design and/or development of Tableau and ClickHouse connectors — please reach out to @maggie in DataHub Slack to learn how you can contribute!

Project Updates

We’re excited to announce that three more companies have adopted DataHub!

Peloton, DFDS, and Uphold have adopted DataHub!

(I asked — no, it doesn’t mean we all get a Peloton… sorry y’all, I tried 🤷‍♀️)

The DataHub Project saw 161 commits from 30+ people spanning ~20 companies. We’re so excited to see the volume of contributions from this growing group of DataHub Enthusiasts!

Here are the biggest highlights from our v0.8.16 release:

Product / Feature Updates

  • Unified Search & Recommendations (read more about Recommendations below!)
  • Improvements to Primary / Foreign Key Support
  • Lineage Performance Improvements

User and Group Management Screens:

  • View all users & groups
  • Remove users & groups
  • Create a new group
  • Add & remove members from groups

Metadata Ingestion

  • Redshift Usage, External Tables
  • BigQuery Dataset Lineage
  • Ingestion performance improvements via configurable parallelism and max-thread settings
  • Nested field support for Hive & Trino (available as of v0.8.16.2)
  • Adding Owners through Ingestion Transformers
  • Want to dig into the DataHub Metadata Model? You can now view it on the demo site!
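As a rough sketch of the new transformer-based owner assignment mentioned above, an ingestion recipe might look like this — the source details and owner URN are hypothetical, and transformer option names may differ across DataHub versions:

```yaml
source:
  type: mysql
  config:
    host_port: localhost:3306
    database: analytics            # hypothetical database

transformers:
  - type: simple_add_dataset_ownership
    config:
      owner_urns:
        - "urn:li:corpuser:jdoe"   # hypothetical owner

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

Running this recipe with `datahub ingest -c recipe.yml` should attach the listed owner to every dataset emitted by the source.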
Navigating the Dashboard Entity from the DataHub Metadata Model

Map of DataHub’s Metadata Model

Check out the full video below —


Landing Page Recommendations

During the October 2021 DataHub Town Hall, John Joyce and Dexter Lee from Acryl Data revealed a brand new landing page for the DataHub UI, including Recommendations to help users find the metadata they care about with fewer clicks.

Available as of v0.8.17, the new user experience provides guided navigation to high-value metadata and is personalized on a per-user basis.

John and Dexter walked us through the design and implementation of an extensible framework for generating and displaying personalized recommendations, which surfaces “Most Viewed”, “Recently Viewed”, “Top Tags”, and “Top Glossary Terms” modules on the new Landing Page.

This is only the beginning of what we plan to do with recommendations. Have feedback or ideas of how we can expand this? Tell us all about it in Slack!

Data Profiling Performance Improvements

Data Profiling is an extremely powerful tool to help Analysts, Data Scientists, and other consumers of data understand the shape & distribution of a dataset, including how it has changed over time. Not surprisingly, this can become a very costly operation to run every day against every column of every table in your data warehouse.

In the October DataHub Town Hall, Surya Lanka from Acryl Data gave us a preview of fine-grained control over data profiling, including:

  • High-level performance controls, including disabling expensive operations and setting a default number of rows to sample
  • Column-level filtering to include or exclude specific columns, or to turn off column-level profiling altogether
  • Column-level metric filtering to specify which types of metrics to capture for each column (e.g., min/max/mean/stddev/quantiles/histograms)
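In recipe terms, these profiling controls might look something like the sketch below; the option names follow the ingestion docs of this era and the table pattern is hypothetical, so check the docs for your version:

```yaml
source:
  type: snowflake
  config:
    # ... connection details ...
    profiling:
      enabled: true
      limit: 1000                              # sample at most 1,000 rows per table
      turn_off_expensive_profiling_metrics: true
      max_number_of_fields_to_profile: 50
      include_field_quantiles: false           # skip per-column quantiles
      include_field_histogram: false           # skip per-column histograms
    profile_pattern:
      deny:
        - ".*\\.raw_events\\..*"               # hypothetical tables to exclude
```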

Watch the full demo below —

Improvements to Lineage Support in DataHub

Gabe Lyons and Varun Bharill from Acryl Data gave an update on Lineage support within DataHub during the October Town Hall.

BigQuery users — this one’s for you! As of v0.8.17, you can fully leverage your Google Audit Logs to infer dataset lineage, making it easier than ever to build a lineage graph for your transformed datasets. Note: this requires access to the Google Cloud API.
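For reference, enabling audit-log-based lineage in a BigQuery recipe might look like this minimal sketch (the project ID is hypothetical, and flag names may vary by version):

```yaml
source:
  type: bigquery
  config:
    project_id: my-gcp-project    # hypothetical GCP project
    include_table_lineage: true   # infer lineage from Google Audit Logs
```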

For those of you that have faced issues loading complex lineage views, we’ve rolled out improvements to make the page much more responsive and easier to navigate.

We also introduced drag-and-drop functionality so you can move lineage nodes around as you’re navigating the graph. Try it out here!


Watch the full presentation below —

Community Case Study: DataHub at hipages

During our October Town Hall, we had the honor of hearing from Chris Coulson of hipages as he shared the company’s experience adopting DataHub. He and his team have leveraged DataHub to supercharge data-related workflows for Analysts, Data Scientists, Data Engineers, and Senior Stakeholders.

Chris gave us an overview of the workflows they are able to power using DataHub:

  • Discovery — search for concepts, share ideas, and find work other people have done previously
  • Lineage — understand the data processing chain to debug problems and notify relevant stakeholders
  • Quality — define Glossary Terms to collaborate and communicate using an agreed-upon lexicon
  • Ownership — enforce and encourage responsibility and accountability for data sources

Check out the full presentation to learn how they leveraged the DataHub lineage graph to identify influential tables for data profiling, and to infer ownership by identifying clusters of datasets that commonly operate together.

That’s it for this round! Questions? Comments? Post them below — I can’t wait to hear from you!

Join us on Slack, follow us on Twitter, subscribe to our YouTube channel, and RSVP for our monthly Town Hall!
