LinkedIn DataHub Project Updates

Data Engineering

Open Source

Metadata

Data Quality

Project Updates

Shirshanka Das

Mar 25, 2021

Introduction

This is the second post covering monthly project updates for the open-source metadata platform, LinkedIn DataHub. This post captures the developments for March 2021. To read the February 2021 edition, go here. Since everyone is observing Spring festivities, we also have an Easter egg included in this post. Read all the way to the end to discover it!

Community Update

Almost 150 PRs were submitted and merged over the last 30 days, representing a 3x growth in commit activity over the previous month! We had more than 50 people attending the monthly town hall, where the data team from Wolt presented their adoption journey in using DataHub as the metadata platform for their data mesh implementation. Our Slack community grew by 20% in the last month! Join us on Slack and subscribe to our YouTube channel.

The big updates in the last month were:

  • New homepage, improved documentation at datahubproject.io
  • Live demo environment at demo.datahubproject.io hosted by Acryl
  • Official roadmap published at datahubproject.io/docs/roadmap
  • SSO (OIDC) support
  • Tags
  • Themes
  • Search and discovery for Dashboards
  • Metadata Platform Implementations for ML Models, DataFlows (Pipelines) and Jobs
  • Deprecation of ElasticSearch 5 and migration to ElasticSearch 7
  • An official release 0.7.0 that packages all these improvements

Read on to find out more!


Support for OIDC-Based Single Sign-On (SSO)

SSO was one of the most requested features in DataHub, so we are excited to announce that it is finally here! We believe that data workers should spend their time discovering and working with data safely, not entering credentials into every data tool that has been deployed at their company.

Google SSO Setup

Okta SSO Setup

The feature has already been tested with Google, Okta and Azure AD. Read the docs here to configure it for your environment.


Case Study: Data Discovery in Data Mesh at Wolt

This month, the data team at Wolt described their journey in adopting DataHub as their data discovery solution. Their goal is to map their entire data ecosystem, from operational databases and third-party APIs to ML models and dashboards. Their data stack includes Kafka, Airflow, Snowflake and Looker among other technologies, and they are moving towards a data mesh implementation. They have built internal tools that make it easy to integrate metadata with DataHub and connect their ecosystem with it. They have also made important contributions back into the project, including driving the Tags RFC and support for DAG metadata storage (e.g. Airflow) in the DataHub backend.

Data Mesh Architecture Use Cases

Fredrik Sannholm, who leads the data engineering and core ML team at Wolt, had this to say:

A metadata platform, like Datahub, that’s able to track ownership and stakeholder relationships between entities is crucial when we at Wolt scale and move towards a data mesh architecture. Datahub allows our data consumers to not only discover datasets but also track their lineage, while data producers have a single, purpose-built place to put documentation regarding the data. At the same time the core data engineering team can build useful observability and alerting features around the Datahub APIs. Datahub is a central part of our data platform as we scale both in terms of data volume and number of data producers and consumers.

Check out the video below:

Tags

Social, global, easy

Another of the most requested features was support for a lightweight tagging mechanism for datasets, fields, dashboards and well… anything! Thanks to work done by the community spanning multiple teams, we now have support for tags. Read the RFC here and watch the video demo-ing how to use tags below!


Themes

Your DataHub, your way

As companies deploy DataHub inside their enterprise environments, one common request we’ve received is the ability to customize the look and feel of DataHub. Some companies are all business and some companies are all fun. We don’t want to let your style preferences get in the way of enjoying great metadata!

Enter Themes. Now you can customize the look and feel of DataHub to your heart’s content. Read the documentation here and check out the video of Gabe Lyons demo-ing how he customized DataHub to look like Airbnb Dataportal below.


New Connectors

After we launched the new metadata ingestion framework last month, the community has added some new connectors. Thanks, Pedro Silva and Thomas Larsson!


DataHub + Observability = Trusted Data

We’ve heard time and time again that when searching for data, people want to know which data they can trust. Finding a well-documented dataset is a good start, finding the owners of this dataset is better, and knowing which dashboards are powered by this dataset is amazing, but there’s still something missing.

The key ingredient in unlocking the next level of trust is operational metadata. This includes information like when a dataset was last updated, how often it is updated, which pipeline produced this dataset, how the shape (profile) of the dataset has changed in the last update and much more.

Armed with this level of information, a data scientist who is about to take a dependency on this dataset to build an important analysis can be secure in the knowledge that they are building on a solid foundation.
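To make the idea a little more concrete, here is a minimal sketch of how the operational metadata described above could be modeled and checked. It is purely illustrative: the `OperationalMetadata` record, its field names, and the `is_fresh` and `row_count_change` helpers are all hypothetical and are not part of DataHub's metadata model or APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class OperationalMetadata:
    """Hypothetical record; field names are illustrative, not DataHub's schema."""
    dataset: str
    last_updated: datetime                 # when the dataset was last updated
    expected_update_interval: timedelta    # how often it is expected to update
    produced_by_pipeline: str              # which pipeline produced the dataset
    row_count: int                         # one simple aspect of the profile
    previous_row_count: Optional[int] = None


def is_fresh(meta: OperationalMetadata, now: datetime) -> bool:
    """True if the last update falls within the expected update interval."""
    return now - meta.last_updated <= meta.expected_update_interval


def row_count_change(meta: OperationalMetadata) -> Optional[float]:
    """Relative change in row count since the previous update, if known."""
    if not meta.previous_row_count:
        return None  # no previous profile (or zero rows) to compare against
    return (meta.row_count - meta.previous_row_count) / meta.previous_row_count
```

A consumer could call `is_fresh(meta, datetime.now())` before depending on a dataset, and flag a large `row_count_change` as a sign that the shape of the data has shifted since the last update.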

We released some exciting mocks for what a great data observability product, built on DataHub, might look like.

Here is a sneak peek!

Dataset Operational Health Summary

Check out the full set of mocks here, and give us your feedback!


Looking Forward

The pace of innovation and development continues to accelerate. We’re working on an improved lineage visualization and a deeper integration with Apache Airflow. Meanwhile, we’re expecting more integrations with popular systems like dbt, Looker, AWS Glue and others to land in the next month. Our roadmap for Q2 is packed and we’re excited to be building with all of you. Until next time!
