
DataHub Project Updates

Data Engineering

Metadata

Data Quality

Open Source

Project Updates

Shirshanka Das

Jul 17, 2021


DataHub release 0.8.5

Introduction

We’re back with our fifth post covering monthly project updates for the open-source metadata platform, LinkedIn DataHub. This post captures the developments for the month of June 2021. To read the May 2021 edition, go here. To learn about DataHub’s latest developments, read on!

Community Update

First, an update on the community! On June 23rd, Acryl Data, the company I co-founded with Swaroop and our stellar founding team to drive the DataHub project forward, launched publicly. TechCrunch wrote a nice article about us, and both Swaroop and I wrote a few posts reflecting on the journeys that led us to this point and the future we dream of creating with this amazing community. Since January of this year, our Slack community has quadrupled in strength, and I’m continually amazed at how engaged and plugged in the community is in shaping the project through constructive feedback, making meaningful contributions, and cheering us on! To all the members of our community: your love really keeps us going and makes all the long hours worth it ❤️

Project Update

We have moved to a more frequent release cadence to keep up with the pace of development and ensure timely delivery of bug fixes. We had 5 releases in June, kept up our activity rate of 100+ commits/month, and our June releases (0.8.1–0.8.5) included 15 unique contributors from 10+ different companies! In the month of June, 144 PRs were merged, and our Slack community continues to grow rapidly. Our June town-hall had over 70 attendees, where we unveiled dataset popularity and recent queries powered by Usage Stats, a massively slimmed-down DataHub deployment model, and a case study from Saxo Bank and ThoughtWorks, who showcased their recent contributions around Business Glossary and shared their experience in production. Join us on Slack and subscribe to our YouTube channel to keep up to date with this fast-growing project.

The big updates in the last month were in three tracks:

Product and Platform Improvements

  • Usage Stats to help you understand which of your datasets are most popular, how their columns are being used, what queries are being issued against them, and more.
  • Markdown descriptions and editing for both dataset and column descriptions.
  • Hardening of NoCode metadata capabilities

Metadata Integration Improvements

  • Improved integrations with dbt, AWS Glue (support for Glue jobs) and a new integration with Feast (a popular open source feature store).

Developer and Operations Improvements

  • An official guide for deploying DataHub on GCP
  • Hardening of authentication end to end, from OIDC to SSL in Elastic and Kafka
  • Lots of fixes to harden the Docker images and remove vulnerabilities
  • Dropping neo4j as a mandatory requirement and supporting Elastic as a graph backend, along with migration guides
  • A significantly slimmed-down quickstart experience with much lower CPU and memory requirements
  • An official release, 0.8.5, that packages all these improvements

We have also announced the project roadmap for the next six months, July through December 2021. It is packed with goodies: data profiling and data quality, data lake and data mesh integrations, fine-grained access control on metadata, and integrations with the ML and metrics ecosystems. Some of this work is already in flight, as the access control RFC is now open for feedback from the community.

Read on to find out more about the June highlights!


Dataset Popularity via Query Activity

DataHub Table Usage

Usage statistics powered by Query activity are the secret metadata ingredient that makes data discovery and data management come to life. They allow data platform owners to understand how data is being used within the enterprise and determine where additional resources are needed. Data engineers can understand how people are consuming data assets that they produce, and can streamline deprecation and migration processes. For data scientists, this information builds trust in datasets and helps them understand how to query them by learning from others. Meanwhile, everyone benefits from improved search relevance in search results!

The dataset popularity feature requires usage logs to be collected from the source system. We have built integrations with BigQuery and Snowflake in this release. For each dataset, we collect per-user usage statistics and per-column usage statistics, as well as recent and frequent queries against that dataset.

As we designed this feature, we set ourselves a few constraints around scale. Firstly, we must scale to 500K queries per day across 10K users, while retaining a fair amount of historical data (e.g. one year). Secondly, we must avoid refetching the same data from the source system repeatedly.
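One way to satisfy the second constraint is a simple high-watermark: remember the last time bucket already ingested, and only fetch newer, fully-closed buckets on the next run. Here is a minimal sketch of that idea; the hourly granularity and function names are illustrative, not DataHub's actual implementation:

```python
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(hours=1)  # hourly ingestion granularity

def floor_to_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hourly bucket."""
    return ts.replace(minute=0, second=0, microsecond=0)

def buckets_to_fetch(last_watermark: datetime, now: datetime):
    """Yield the start of every complete hourly bucket after the watermark.

    Only fully-closed buckets are fetched, so the same window is never
    pulled from the source system twice; the still-open current bucket
    is left for the next run.
    """
    start = floor_to_bucket(last_watermark) + BUCKET
    end = floor_to_bucket(now)  # current (incomplete) bucket is excluded
    while start < end:
        yield start
        start += BUCKET
```

After each run, the last bucket fetched is persisted as the new watermark, making the ingestion job safely re-runnable.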

We are starting with batch: our ingestion framework connects to these systems and pulls usage data hourly or daily (configurable), performing aggregation at the edge (to keep transfer costs low) before pushing to DataHub. As we add support for time-series metadata in our next release, we will move to event-based publishing support for this kind of data as well. The aggregated usage data is stored in our analytics store (currently Elasticsearch) for future queries.
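The edge aggregation step is essentially a group-by over the raw query log before anything leaves the source side. A rough sketch of that shape, with hypothetical field names rather than the actual connector's schema:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class UsageBucket:
    """Aggregated usage for one dataset within one time bucket."""
    total_queries: int = 0
    queries_by_user: Counter = field(default_factory=Counter)
    reads_by_column: Counter = field(default_factory=Counter)

def aggregate(query_log):
    """Collapse raw (dataset, user, columns) log entries into per-dataset
    usage buckets, so only the small aggregate is shipped to DataHub."""
    buckets = {}
    for entry in query_log:  # entry: {"dataset": ..., "user": ..., "columns": [...]}
        b = buckets.setdefault(entry["dataset"], UsageBucket())
        b.total_queries += 1
        b.queries_by_user[entry["user"]] += 1
        for col in entry["columns"]:
            b.reads_by_column[col] += 1
    return buckets
```

Shipping only these compact counters, rather than the raw query log, is what keeps transfer costs low at the 500K-queries-per-day scale mentioned above.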

You can see usage statistics in action on the demo DataHub instance! Here are some screenshots for dataset level usage stats, column-level usage stats as well as recent queries on some of our Snowflake datasets.

Query stats, Top Users of a Table

Column-level usage stats

Recent Queries on a Dataset

You can watch the full video of Harshal Sheth describing the technical design and demoing the feature in the town-hall video below.

We’re working on adding support for usage statistics from Redshift, Trino and other systems. Next up is integrating usage data into search rankings and lineage visualizations, making improvements in the UX, and adding simple NoCode mechanisms for supporting time-series metadata in the platform.


Simplified Deployment in DataHub

Simplified Deployment in DataHub

While everyone loves DataHub for its push-based architecture with pluggable storage, search, and graph indexing, one of the common concerns we heard from the community was how many containers the default quick-start guide required and the amount of system resources it consumed. Not anymore :) We have slimmed down the DataHub quick-start requirements tremendously. The old DataHub needed the local Docker engine to be allocated at least 2 CPUs and 8GB of RAM, with 2GB of swap space, to get started; the new one needs just 1 core, 3GB of RAM, and 1GB of swap space. We’ve also nearly halved the number of containers needed, from 12 to 7. This is still a work in progress, and we expect to reduce this number even more!

Among the dependencies that we were able to drop from the default, the most significant is neo4j. The default deployment of DataHub now no longer depends on neo4j, and uses ElasticSearch as the edge store to satisfy lineage and other graph traversal use-cases such as relationships pages.
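Using a search index as an edge store works because each lineage relationship can be stored as its own small edge document, and a graph hop then becomes an ordinary filtered lookup on the source (or destination) field. A toy in-memory sketch of the idea follows; the document fields and dataset names are illustrative, not DataHub's actual index schema:

```python
# Each lineage relationship becomes one "edge document"; a graph hop is
# then just a filtered lookup on "source", which a search index like
# Elasticsearch answers with an ordinary term query.
edges = [
    {"source": "raw.events", "dest": "staging.events", "type": "lineage"},
    {"source": "staging.events", "dest": "mart.daily_kpis", "type": "lineage"},
]

def downstream(dataset, max_hops=10):
    """Breadth-first traversal over edge documents: one 'query' per hop."""
    seen, frontier, result = {dataset}, [dataset], []
    for _ in range(max_hops):
        nxt = []
        for node in frontier:
            for e in edges:  # stands in for a term query on "source"
                if e["source"] == node and e["dest"] not in seen:
                    seen.add(e["dest"])
                    nxt.append(e["dest"])
                    result.append(e["dest"])
        if not nxt:
            break
        frontier = nxt
    return result
```

The trade-off versus a dedicated graph database like neo4j is one index query per hop rather than a native traversal, which is a fine fit for the shallow lineage and relationship queries a catalog UI typically issues.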

Another simplification: there is no longer any need to check out the DataHub source code. All you need to do is install the Python package acryl-datahub to get the command-line tool datahub, which spins up a local instance via datahub docker quickstart. The full quickstart guide is located here.

Watch John Joyce and Gabe Lyons describe how they simplified the DataHub deployment significantly and what lies ahead.


Case Study: Business Glossary at Saxo Bank

Tags and Terms Business Glossary at Saxo Bank

Sheetal Pratik (Saxo Bank) and Madhusudhana Podila (ThoughtWorks) described how they implemented and deployed Business Glossary capabilities using DataHub at Saxo Bank. Business Glossaries have a long history in data catalogs and data management. They are used to define business concepts for an organization that are independent of the technical choices of database technology or representation. They can help with improving trust in data, by enabling appropriate semantics to be attached to the physical dataset or the column in the dataset.

Typically, data catalogs offer a way for data stewards, governance teams, and others to attach these terms to datasets or fields using the catalog UI. However, this form of after-the-fact attachment to datasets or columns can often become stale and incorrect, leading to even more problems and eroding trust further.

Saxo Bank has implemented a DataOps approach to attaching Business Glossary terms to schemas, which makes it easy for the data producer to attach meaning to datasets and their columns as part of defining the data product, by checking the terms in alongside, and embedded in, their schemas (written in protobuf). As these data products are deployed to production, metadata is emitted and pushed into DataHub, ensuring there is a single global repository for these relationships that is instantly updated. I think this trend will catch on with the adoption of Data Mesh and DataOps practices, and we will see more and more organizations opting for this approach to keep business metadata in sync with technical metadata.
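The pattern can be sketched independently of protobuf: each field in the checked-in schema carries a glossary-term annotation, and the deploy step turns those annotations into term attachments pushed to the catalog. The dataset, field, and term names below are purely illustrative (in Saxo Bank's case the annotations live in protobuf schemas):

```python
# A checked-in schema with glossary terms embedded alongside each field;
# a plain dict stands in for the protobuf representation.
schema = {
    "dataset": "trades",
    "fields": [
        {"name": "isin", "glossary_term": "InstrumentIdentifier"},
        {"name": "notional", "glossary_term": "TradeNotional"},
        {"name": "ts", "glossary_term": None},  # no business meaning attached
    ],
}

def glossary_attachments(schema):
    """Extract {dataset.field: term} pairs to push to the catalog at deploy
    time, so business metadata stays in sync with the technical schema."""
    return {
        f'{schema["dataset"]}.{f["name"]}': f["glossary_term"]
        for f in schema["fields"]
        if f["glossary_term"]
    }
```

Because the terms live in the same version-controlled schema the producer already maintains, they cannot drift from the deployed data product the way UI-only annotations can.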

Watch the full presentation from the Saxo Bank and ThoughtWorks team in the YouTube video below!


Looking Ahead

With a simplified deployment model, NoCode metadata model hardening, and new abilities to contextualize metadata using query activity from data warehouses, DataHub is becoming easier to run, easier to extend, and most importantly, making it easier to improve trust in the data discovery process.

With our roadmap for the rest of 2021 released, our focus in July will be on developing support for data profiles, which requires adding time-series support to our metadata model for observability, and on building out integrations with the ML infrastructure landscape, such as Feast and AWS SageMaker. Stay tuned!
