DataHub Project Updates

Introduction

We’re back with our seventh post covering monthly project updates for the open-source metadata platform, LinkedIn DataHub. This post captures the developments for August 2021. To read the July 2021 edition, go here. To learn about DataHub’s latest developments, read on!

Community Update

August was another busy month for the DataHub community. Our Slack community surpassed 1,100 members, 22 contributors from 12 companies contributed to the DataHub project, and more than 70 people consistently joined our community town halls. I’m so excited about the thriving DataHub community that’s coming together to build a world-class, real-time metadata management platform.

Project Update

We had 132 commits in August, continuing our 100+ commits/month rate. We had contributions from 22 different contributors from 12 companies (7 new contributors!).

The August 2021 town hall had over 70 attendees. We shared a first look into managing Business Glossary terms via YAML, improvements to nested schema visualization and support for key-value schemas in Kafka, Fine-Grained Access Control and User/Group management, and DataHub’s refreshed UI. We also heard a case study from folks at Bizzy Group (now Warung Pintar) about how they are adopting DataHub and integrating it with Redash at their company; watch it here. Join us on Slack and subscribe to our YouTube channel to keep up to date with this fast-growing project.

The big updates in the month of August were in two tracks:

Product and Platform Improvements

Business Glossary Phase 1 allows managing glossary terms via YAML
UI support for deeply nested schemas and key-value schemas in Kafka
Improvements to User/Group Management and Fine-Grained Access Control in DataHub
A new look for the DataHub UI

Developer tooling and Operations Improvements

Performance metrics for DataHub: featuring integration with OpenTelemetry, Grafana, and Locust. Watch Dexter Lee from Acryl Data walk through the tutorial here.
Much improved documentation for ingestion sources

Read on to find out more about the August highlights!

Business Glossary Phase 1

If you have ever researched a data catalog, odds are you have come across the term Business Glossary. Business Glossaries are amazing things for curators of complex data ecosystems, a.k.a. the data governance team. Glossaries allow you to categorize your complex technical schemas, which are full of technical types like strings, numbers, and structs, and give them meaning in your business context. This means concepts like CustomerAccount, EmailAddress, or Revenue can be attached to an individual field or a dataset. They also allow you to organize your terms in a manageable way: for example, all the terms related to “Finance” are found under the Finance node, while all terms related to “HR” are located in the HR sub-section. However, individual terms can have relationships with each other that cut across these lines. For example, the Email term in the PersonalInformation domain can inherit the Confidential term from the Classification domain.

DataHub is proud to announce support for Business Glossaries in the product. We have taken a developer-first approach to managing your business glossary by allowing you to check it in and version it just like the rest of your software. In the August town hall, I demoed how you can describe a business glossary that includes multiple nodes and leaves with relationships among each other and check it in as a YAML file. With every change to this glossary, you can simply ingest it into DataHub using any CI/CD system of your choice.

Here is a quick video that shows how this works!

You can also play around with the Business Glossary feature and attach terms to datasets, columns, or any entity on our live demo here.
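To make the node/term structure and the cross-node relationships described above a little more concrete, here is a minimal Python sketch. It is only an illustration: the dictionary layout, the `inherits` key, and the function names are assumptions made for this example, not DataHub’s actual YAML schema or API. It models the Email term in PersonalInformation inheriting Confidential from Classification, and walks those relationships.

```python
# Hypothetical in-memory model of a glossary: node -> term -> definition.
# The layout and the "inherits" key are illustrative assumptions,
# not DataHub's real glossary file format.
GLOSSARY = {
    "Classification": {
        "Confidential": {"inherits": []},
        "Sensitive": {"inherits": []},
    },
    "PersonalInformation": {
        "Email": {"inherits": ["Classification.Confidential"]},
    },
}


def flatten(glossary):
    """Map each fully qualified name ("Node.Term") to its definition."""
    return {
        f"{node}.{term}": definition
        for node, terms in glossary.items()
        for term, definition in terms.items()
    }


def inherited_terms(glossary, qualified_name):
    """Return every term reachable through 'inherits', sorted by name."""
    terms = flatten(glossary)
    seen = set()
    stack = list(terms[qualified_name]["inherits"])
    while stack:
        parent = stack.pop()
        if parent not in terms:
            # A dangling reference is worth catching before ingestion.
            raise KeyError(f"{qualified_name} inherits unknown term {parent}")
        if parent not in seen:
            seen.add(parent)
            stack.extend(terms[parent]["inherits"])
    return sorted(seen)


print(inherited_terms(GLOSSARY, "PersonalInformation.Email"))
# prints ['Classification.Confidential']
```

A check like this could run in CI before the glossary file is ingested, so that a term inheriting from a misspelled or deleted term fails the build rather than landing in the catalog.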

User/Group Management and Fine-Grained Access Control in DataHub

New Improvements

During the August 2021 Community Town Hall, John Joyce from Acryl Data gave us a look into proactive & just-in-time user provisioning along with fine-grained access control.

DataHub now supports proactive batch ingestion with Okta & Azure AD, allowing you to leverage your existing user identity stores & bring them into the platform.

Just-In-Time Provisioning (w/ OIDC)

Once you have your users ingested, DataHub makes it easy for you to manage fine-grained access controls with Policies. Starting with v0.8.11, DataHub admins can create new Policies to define who can perform what action against which resource(s).

Learn more here.
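The “who can perform what action against which resource(s)” idea behind Policies can be sketched in a few lines of Python. This is a conceptual illustration only: the `Policy` class, its fields, the action name, and the resource identifiers are all hypothetical and do not reflect DataHub’s actual policy model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy shape: who, what action, which resources."""
    actors: frozenset     # users or groups the policy applies to
    actions: frozenset    # actions the policy grants
    resources: frozenset  # resources the policy covers


def is_allowed(policies, actor, action, resource):
    """Allow the request if at least one policy grants it; deny otherwise."""
    return any(
        actor in p.actors and action in p.actions and resource in p.resources
        for p in policies
    )


# Illustrative data; every name here is made up for the example.
policies = [
    Policy(
        actors=frozenset({"data-governance"}),
        actions=frozenset({"EDIT_GLOSSARY_TERMS"}),
        resources=frozenset({"dataset:customer_accounts"}),
    ),
]

print(is_allowed(policies, "data-governance", "EDIT_GLOSSARY_TERMS",
                 "dataset:customer_accounts"))  # prints True
print(is_allowed(policies, "analyst", "EDIT_GLOSSARY_TERMS",
                 "dataset:customer_accounts"))  # prints False
```

The sketch denies by default: a request succeeds only when some policy matches all three dimensions, which is the usual starting point for fine-grained access control.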

The DataHub UI is getting a new look

The team at Acryl Data has been rapidly rolling out improvements to the DataHub UI since the beginning of 2021, mainly focused on replacing the legacy Ember app with React. Now that we have reached functional parity between the two, we’re turning our focus to making improvements to user experience.

First things first: we rolled out a new look for Datasets to provide an “entity at a glance” view, an inline editor to keep your documentation fresh, and nested schemas. We’ll be rolling out similar refreshes to other entities in the coming months; read the full details here.

Entity at a glance

Case Study: DataHub + Redash at Bizzy

Taufiq Ibrahim from Bizzy (part of the Warung Pintar Group) shared the company’s journey of evaluating and adopting DataHub to address frequent questions about data discovery, ownership, lineage, and impact analysis. Taufiq gave an overview of how DataHub offered a highly customizable interface to ingest metadata from MySQL, SQL Server (SSIS), PostgreSQL, BigQuery, and Kafka, and the ability to connect it to reporting resources built in Redash. We caught a glimpse into how complex lineage graphs can get, a challenge faced by the broader metadata community and one that the DataHub team is excited to tackle in upcoming months.

We also heard about Taufiq’s positive experience contributing to the DataHub project, including his first open-source contribution, ever! We’re happy to know he had the community support he needed to contribute to our project, and we’re so grateful for his work! Watch the full presentation here.

Looking Ahead

We will continue to build out better and more intuitive ways to manage your Business Glossary and integrate it within your developer workflows. With the new UI refresh as our baseline, we are looking forward to making exploring and visualizing your complex data ecosystem even more delightful and useful. Finally, as we continue to build out our GraphQL API, we plan to release official developer docs for interacting programmatically with all this rich metadata to build new integrations. Until next time!