LinkedIn DataHub Project Updates (February 2021 Edition)

Data Engineering

Open Source

DataHub

Metadata

Project Updates

Shirshanka Das

Feb 25, 2021

DataHub ❤️ React + Python

Background

LinkedIn DataHub is an open-source metadata platform that I founded and architected at LinkedIn; it is in wide use at many companies. We hold monthly town-hall meetings to share new developments, hear from the community about their adoption journeys, and plan together for the future. I’m starting a journal to catalog the evolution of the project for posterity, in a way that a video recording or a git commit log cannot. This post captures the developments for the month of February 2021.

Community Update

More than 50 PRs were submitted and merged over the last 30 days! We had almost 50 people attend the monthly town-hall. Our Slack community is growing fast. Join us to talk metadata, and subscribe to our YouTube channel.

The big contributions in the last month were:

- A new search and discovery application built in React
- A Python-based metadata ingestion framework

Read on to find out more!


New Search and Discovery App in React

Extensible UI, delightful experience

The original DataHub UI was built on Ember, a JavaScript framework used at many companies, including LinkedIn. However, Ember is less popular than frameworks like React among the broader developer community, which made it harder for new contributors to ramp up and contribute. On top of that, LinkedIn followed a “develop internally, then release to open source” model for the frontend code, making it hard for the project to accept external contributions.

To change the status quo, we decided to incubate a new React application in the project. The application has been in incubation for a month and was demoed at the town-hall by John Joyce. It features a fresh look and a redesigned search and browse experience, and it is designed for a much higher level of extensibility. The decision is already paying dividends: we received our first community UI contribution, which implemented logout functionality.

DataHub Landing Page

DataHub Search Results Page

DataHub Dataset Page with Schema


Check out the code and documentation here, and a video demo from the town-hall here.


Python-based Ingestion Framework

Build connectors, run metadata ingestion easily

DataHub features a highly flexible third-generation metadata architecture, which supports stream-oriented, push-based emission of metadata from different sources. This architecture is a natural fit for large organizations where different teams control different parts of the metadata graph and prefer clear contracts and ownership for emitting metadata. Proponents of the data mesh approach find this kind of architecture very attractive.

However, when a new central team installs DataHub at a company, it often prefers to get started quickly by ingesting metadata from as many sources as it can. Previously, the project had a few example Python scripts to help adopters get started, but they weren’t set up for ease of extensibility, robustness of development and monitoring, or integration with well-known orchestration tools, so most teams ended up building their own one-off integrations.

Architecture Tip: If you support push-based integration of metadata, it is very easy to add pull-based ingestion on top. The inverse is not true! To make a pull-based architecture support push, you first have to externalize your models and then transition from a batch-ingestion model to a row-at-a-time (stream- / REST-friendly) model. It can be done, but it takes time.
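To make the contrast concrete, here is a deliberately simplified sketch in Python. These are hypothetical interfaces for illustration, not DataHub’s actual APIs: the push style hands over one record at a time to an endpoint the owning team controls, while the pull style sweeps a source in batch.

from typing import Dict, Iterable, List

class PushEmitter:
    """Push: the team that owns the metadata emits one record at a time."""
    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint  # e.g. a REST endpoint of the metadata service

    def emit(self, record: Dict) -> None:
        # A real emitter would POST `record` to self.endpoint here.
        print(f"push -> {self.endpoint}: {record}")

def pull_batch(source_records: Iterable[Dict]) -> List[Dict]:
    """Pull: a central job batch-extracts everything from a source in one sweep."""
    return list(source_records)

Layering pull on top of push is trivial: call emit once per extracted record. Going the other way requires reworking the batch model itself, which is the point of the tip above.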

To address these pain points, we decided to build a simple, extensible Python-based ingestion framework that can pull metadata from source systems and push it into DataHub with minimal effort. We were inspired by concepts in Apache Gobblin when building this framework.

DataHub: Metadata Ingestion Framework

We’ve added a bunch of sources to get people started, from classic SQL-based systems such as MySQL, SQL Server, Postgres, Hive, Snowflake, and BigQuery to more specialized ones like Kafka and LDAP. The configuration for a metadata ingestion pipeline is dead simple: all you have to do is configure a source and choose a sink to write to.

---
source:
  type: "mysql"
  config:
    username: datahub
    password: datahub

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

Running this pipeline is as simple as:

datahub ingest -c mysql_to_datahub_rest.yml

Of course, you can now schedule these ingestion pipelines using the Pythonista’s favorite scheduler, Airflow! The project now includes a few sample Airflow DAGs to help you schedule your metadata ingestion.
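As an illustration, a minimal DAG along the following lines would run the pipeline above once a day. This is a sketch assuming Airflow 2’s BashOperator and the mysql_to_datahub_rest.yml file from the earlier example; the sample DAGs bundled with the project may differ.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="datahub_mysql_ingest",  # hypothetical DAG id
    start_date=datetime(2021, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_mysql_to_datahub",
        # Runs the same CLI command shown above on the Airflow worker.
        bash_command="datahub ingest -c mysql_to_datahub_rest.yml",
    )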

We’re hoping to see the community contribute a ton of new metadata sources in the near future as they connect DataHub to the different systems in their companies.
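If you are curious what building a source involves, the rough shape of a connector is sketched below. The class and record shapes here are hypothetical and simplified; consult the framework’s Source interface and documentation for the real method names and types.

from typing import Dict, Iterable

class MyCustomSource:
    """Hypothetical, simplified connector; not the framework's real base class."""
    def __init__(self, config: Dict) -> None:
        self.config = config  # connection details from the pipeline's `config` block

    def get_workunits(self) -> Iterable[Dict]:
        # Yield one metadata record ("work unit") at a time; the framework
        # forwards each unit to the configured sink (e.g. datahub-rest).
        for table in ("orders", "customers"):  # stand-in for real discovery
            yield {"entity": "dataset", "name": table}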

Check out the full video of the ingestion framework tour, led by Harshal Sheth from Acryl, here.



DataHub Adoption at Geotab

At our last town-hall, John Yoon presented the adoption journey of DataHub at Geotab, a telematics company.

What was interesting to me was that the data platform team’s responsibilities include not only choosing the right new technology to meet the needs of the business, but also championing its use across the company and evolving the approach based on feedback. These are non-linear journeys that take time and effort, but they are extremely satisfying when you finally unlock value!

Check out the video here:

DataHub adoption journey at Geotab


Looking Forward

We’re hoping to hear back from the community as they start moving to the new React application and using the Python ingestion framework to integrate metadata.

We are continuing to land big features as we make LinkedIn DataHub more useful for everyone. Stay tuned!
