LinkedIn DataHub Project Updates (February 2021 Edition)

Data Engineering

Open Source

DataHub

Metadata

Project Updates

Shirshanka Das

Feb 25, 2021

DataHub ❤️ React + Python

Background

LinkedIn DataHub is an open-source metadata platform that I founded and architected at LinkedIn; it is in wide use at many companies. We hold monthly town-hall meetings to share new developments, hear from the community about their adoption journeys, and plan together for the future. I’m starting a journal to catalog the evolution of the project for posterity, in a way that a video recording or a git commit log cannot. This post captures the developments for the month of February 2021.

Community Update

More than 50 PRs were submitted and merged over the last 30 days! We had almost 50 people attend the monthly town-hall. Our Slack community is growing fast. Join us to talk metadata, and subscribe to our YouTube channel.

The big contributions in the last month were:

- A new search and discovery application built in React
- A Python-based metadata ingestion framework

Read on to find out more!


New Search and Discovery App in React

Extensible UI, delightful experience

The original DataHub UI was built on Ember, a JavaScript framework used at many companies, including LinkedIn. However, Ember is less popular than frameworks like React among the broader developer community, which made it harder for new contributors to ramp up and contribute. On top of that, LinkedIn followed a “develop internally, then release to open source” model for the frontend code, making it hard for the project to accept external contributions.

To change the status quo, we decided to incubate a new React application in the project. The application has been in incubation for a month and was demoed at the town-hall by John Joyce. It features a fresh look and a redesigned search and browse experience, and it is designed for a much higher level of extensibility. The decision is already paying dividends: we received our first community UI contribution, which implemented logout functionality.

DataHub Landing Page

DataHub Search Results Page

DataHub Dataset Page with Schema


Check out the code and documentation here, and a video demo from the town-hall here.


Python-based Ingestion Framework

Build connectors, run metadata ingestion easily

DataHub features a highly flexible third-generation metadata architecture, which supports stream-oriented, push-based emission of metadata from different sources. This architecture is a natural fit for large organizations where different teams control different parts of the metadata graph and prefer clear contracts and ownership for emitting metadata. Proponents of the data mesh approach find this kind of architecture very attractive.

However, when a new central team installs DataHub at a company, it often prefers to get started quickly by ingesting metadata from as many sources as it can. Previously, the project had a few example Python scripts to help adopters get started, but they weren’t set up for ease of extensibility, robustness of development and monitoring, or integration with well-known orchestration tools, so most teams ended up building their own one-off integrations.

Architecture Tip: If you support push-based integration of metadata, it is very easy to add pull-based ingestion on top. The inverse is not true! To make a pull-based architecture support push, you first have to externalize your models and then transition from a batch-ingestion model to a row-at-a-time (stream- / REST-friendly) model. It can be done, but it takes time.
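To make the contrast concrete, here is a deliberately simplified sketch in Python. These are hypothetical interfaces for illustration, not DataHub’s actual APIs: the push style hands over one record at a time to an endpoint the owning team controls, while the pull style sweeps a source in batch.

from typing import Dict, Iterable, List

class PushEmitter:
    """Push: the team that owns the metadata emits one record at a time."""
    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint  # e.g. a REST endpoint of the metadata service

    def emit(self, record: Dict) -> None:
        # A real emitter would POST `record` to self.endpoint here.
        print(f"push -> {self.endpoint}: {record}")

def pull_batch(source_records: Iterable[Dict]) -> List[Dict]:
    """Pull: a central job batch-extracts everything from a source in one sweep."""
    return list(source_records)

Layering pull on top of push is trivial: call emit once per extracted record. Going the other way requires reworking the batch model itself, which is the point of the tip above.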

To address these pain points, we decided to build a simple, extensible Python-based ingestion framework that can pull metadata from source systems and push it into DataHub with minimal effort. We were inspired by concepts in Apache Gobblin when building this framework.

DataHub: Metadata Ingestion Framework

We’ve added a bunch of sources to get people started, from classic SQL-based systems such as MySQL, SQL Server, Postgres, Hive, Snowflake, and BigQuery to more specialized ones like Kafka and LDAP. The configuration for a metadata ingestion pipeline is dead simple: all you have to do is configure a source and choose a sink to write to.

---
source:
  type: "mysql"
  config:
    username: datahub
    password: datahub

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

Running this pipeline is as simple as:

datahub ingest -c mysql_to_datahub_rest.yml

Of course, you can now schedule these ingestion pipelines using the Pythonista’s favorite scheduler, Airflow! The project now includes a few sample Airflow DAGs to help you schedule your metadata ingestion.
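As an illustration, a minimal DAG along the following lines would run the pipeline above once a day. This is a sketch assuming Airflow 2’s BashOperator and the mysql_to_datahub_rest.yml file from the earlier example; the sample DAGs bundled with the project may differ.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="datahub_mysql_ingest",  # hypothetical DAG id
    start_date=datetime(2021, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_mysql_to_datahub",
        # Runs the same CLI command shown above on the Airflow worker.
        bash_command="datahub ingest -c mysql_to_datahub_rest.yml",
    )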

We’re hoping to see the community contribute a ton of new metadata sources in the near future as they connect DataHub to the different systems in their companies.
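If you are curious what building a source involves, the rough shape of a connector is sketched below. The class and record shapes here are hypothetical and simplified; consult the framework’s Source interface and documentation for the real method names and types.

from typing import Dict, Iterable

class MyCustomSource:
    """Hypothetical, simplified connector; not the framework's real base class."""
    def __init__(self, config: Dict) -> None:
        self.config = config  # connection details from the pipeline's `config` block

    def get_workunits(self) -> Iterable[Dict]:
        # Yield one metadata record ("work unit") at a time; the framework
        # forwards each unit to the configured sink (e.g. datahub-rest).
        for table in ("orders", "customers"):  # stand-in for real discovery
            yield {"entity": "dataset", "name": table}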

Check out the full video of the ingestion framework tour, led by Harshal Sheth from Acryl, here.



DataHub Adoption at Geotab

At our last town-hall, John Yoon presented the adoption journey of DataHub at Geotab, a telematics company.

What was interesting to me was that the data platform team’s responsibilities include not only choosing the right new technology to meet the needs of the business, but also championing its use across the company and evolving the approach based on feedback. These are non-linear journeys that take time and effort, but they are extremely satisfying when you finally unlock value!

Check out the video here:

DataHub adoption journey at Geotab


Looking Forward

We’re hoping to hear back from the community as they start moving to the new React application and using the Python ingestion framework to integrate metadata.

We are continuing to land big features as we make LinkedIn DataHub more useful for everyone. Stay tuned!
