
Shirshanka Das

Jun 21, 2021


DataHub Release 0.8.3: May 2021 Project Update

Introduction

We’re back with our fourth monthly project update for the open-source metadata platform, LinkedIn DataHub. This post covers developments from May 2021. To read the April 2021 edition, go here. To learn about DataHub’s latest developments, read on!

Also, our next town hall is coming up on June 25th, 2021; sign up here to get an invite.

Community Update

First, an update on the industry landscape!

On May 19th, we co-hosted Metadata Day 2021 (Spring Edition) with LinkedIn and Acryl Data. This time, our focus was Data Mesh, an emerging topic of high interest in the data community.

We had three big takeaways from the day-long unconference.

First, confront the data mess with a comprehensive data discovery solution. This allows you to establish the right baselines and drive culture change through metrics-driven approaches.

Second, automation is key to a sustainable metadata strategy for data mesh and allows you to put DataOps practices in motion. Instead of investing in human-intensive manual programs, enlist your data producers and invest in the right tooling to enable governance at the source, where the data is changing, so you can keep up with its scale and diversity.

Finally, if you are wondering whether you need to rethink the storage of all your data and start from scratch, the expert panel recommended focusing instead on the control plane of data, which is composed mainly of metadata. The control plane allows you to create a view of the data ecosystem that conforms to your Data Mesh specifications, and lets you make incremental progress quickly.

As a nice byproduct of the event, we also came up with a reference architecture for how a data catalog can enable data mesh and what its capabilities need to be. It is great to see that DataHub’s architecture is ideally suited to this kind of implementation, as we see repeatedly at companies like Saxo Bank, Wolt, and others.

Reference Architecture for Data Catalog for Data Mesh

Read the full slide deck and watch the full video of the event below:

https://docs.google.com/presentation/d/1WdF3KzUFboZiGcuT9mzXWVqbhTvPPbHThZ4rML_Y3Gs/edit?usp=sharing

Airflow Lineage Update

Coming back to the project, we’re now an officially supported lineage backend provider on the Astronomer Airflow registry. It was great to partner with the Astronomer team to roll this capability out to the Apache Airflow community. Check us out here.
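
For a flavor of how the integration is used, here is a minimal sketch of an Airflow DAG that declares DataHub lineage on a task. It assumes the DataHub lineage backend has been enabled in airflow.cfg per the documentation and that the datahub_provider package (installed with acryl-datahub) is available; the DAG and dataset names here are hypothetical.

```python
# A minimal sketch: declaring lineage on an Airflow task so the DataHub
# lineage backend (configured in airflow.cfg) emits it when the task runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

from datahub_provider.entities import Dataset

with DAG(
    dag_id="datahub_lineage_example",  # hypothetical DAG, for illustration
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The lineage backend reads the inlets/outlets declared on the task and
    # records the upstream/downstream datasets in DataHub.
    transform = BashOperator(
        task_id="transform_orders",
        bash_command="echo 'transforming orders...'",
        inlets=[Dataset("snowflake", "raw.public.orders")],
        outlets=[Dataset("snowflake", "analytics.public.orders_daily")],
    )
```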


Project Update

We kept up our monthly activity rate of 100+ commits/month, and our May release included 26 unique contributors from 18 different companies, with 13 new committers! In the four weeks from April 30 to May 30, 117 PRs were merged, and our Slack community continues to grow rapidly. Over 50 people attended our monthly town hall, where we unveiled the No Code Metadata and Product Analytics improvements, and community members showcased their recent contributions and talked through their adoption journeys. Join us on Slack and subscribe to our YouTube channel to keep up to date with this fast-growing project.

The big updates in the last month were in three tracks:

Product and Platform Improvements

  • Product Analytics to help you understand how your users are interacting with DataHub and to visualize how your metadata is changing.
  • No Code Metadata to enable you to evolve your metadata model without writing any code.
  • Search improvements: auto-complete now matches across entity types.
  • The Pipeline entity page now visualizes the Tasks contained within your DAG.

Metadata Integration Improvements

  • The metadata-ingestion module added a Transformations feature to support enrichment of metadata as it is being ingested into DataHub. This was a common ask from the community, to support capabilities like adding tags or owners while ingesting metadata (see the sketch after this list). Thanks to Thomas Larsson and Harshal Sheth, who drove the development of the feature.
  • Improved integrations with dbt (support for views), Hive (support for http/s endpoints), AWS Glue, MongoDB (support for schema inference); new official integrations with Looker and AWS Redshift. We also have an incubating integration with Kafka Connect! Check out all our integrations here.
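As an illustration of the Transformations feature, here is a sketch of an ingestion recipe with transformers, run through the Python Pipeline API. The source details are placeholders, and the transformer type names (simple_add_dataset_ownership, simple_add_dataset_tags) follow the metadata-ingestion docs; check the docs for the exact names and config supported by your version.

```python
# A sketch of enriching metadata during ingestion via transformers.
# Source/sink details are placeholders; transformer names follow the
# metadata-ingestion docs and should be verified for your version.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "database": "orders",  # hypothetical database
                "username": "datahub",
                "password": "datahub",
            },
        },
        # Transformers enrich each metadata event as it flows through ingestion.
        "transformers": [
            {
                "type": "simple_add_dataset_ownership",
                "config": {"owner_urns": ["urn:li:corpuser:analytics_team"]},
            },
            {
                "type": "simple_add_dataset_tags",
                "config": {"tag_urns": ["urn:li:tag:Verified"]},
            },
        ],
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```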

Operator Tooling

  • We now have official support for DataHub on Kubernetes. Check out datahub-k8s for all things related to deploying DataHub on k8s. We have added a quickstart configuration for setting up the storage layer, and helm charts are available via helm.datahubproject.io. Thanks to Dexter Lee for leading the charge here, and to our community members (Pedro, Shakti, Zack, and Ricardo) for contributing and improving them.
  • We have also published a step-by-step guide for deploying DataHub to AWS here. This is the same guide the Acryl Data team follows for our SaaS / AWS instance of DataHub, so you can be assured that it has been thoroughly vetted for DataHub deployments on AWS. We are working on releasing similar guides for GCP and Azure soon!
  • An official release (0.8.3) that packages all of these improvements.

Read on to find out more!



Product Analytics

Data platform teams often deploy DataHub at their companies with a lot of hope that it will change their data culture overnight. After the initial install, they often wonder: are people actually using this new thing I just rolled out? Which integrations should I prioritize to make sure people have a great experience? Was all the work of setting it up and deploying it to production worth it? What should I do next? If you have asked yourself these questions, you’re not alone!

We built the product analytics feature into DataHub to give platform teams the insights they need to make sure that DataHub is unlocking value for their ecosystem. The product analytics page shows you how your metadata ecosystem is evolving (how many datasets there are, how many have owners, etc.), along with how your data professionals are interacting with DataHub (top searches, actions taken by entity type, etc.). Our goal was to build this feature without increasing the complexity of the DataHub deployment, so we utilize the same set of components: Kafka for the event log and Elasticsearch for analytics. You can also configure a standard analytics provider like Google Analytics, Mixpanel, or Amplitude for your analytics needs; however, those integrations don’t provide the deeper metadata-based analytics capabilities that the in-product integration provides. For the in-product experience, we added a new Analytics page in DataHub that shows these insights using simple cards and charts. Check it out on our public demo here.

Here are a few examples of what that looks like:

The new Analytics page in DataHub

Read the user documentation to learn more about this feature and how to use it.

We built this feature over 3 days with a virtual team led by Maggie Hayes (Sr. PM at SpotHero), who ran a design sprint for the Acryl Data team following the GV design sprint methodology.

Watch the video of this feature and the methodology by which we built it here:

With product analytics, you can now understand how your users are interacting with DataHub and how your metadata is changing. We’re looking forward to hearing from the community what other insights they would like to see featured here and how they would like to use this for improving DataHub usage at their company.

No Code Metadata

DataHub’s goal is to be able to represent and capture the entire metadata graph inside your data ecosystem. As it turns out, this universe of metadata, composed of entities, aspects, and relationships, is ever-expanding. What started out as a graph containing Datasets with Schema aspects and relationships to People now includes Pipelines, Dashboards, and Charts, with more additions underway, like Features and Business Glossary Terms. In addition to global, standard definitions, each enterprise often wants to add its own extensions to this graph to capture company-specific concepts.

Previously, to add a new entity or aspect to DataHub, you would have to change the metadata model (we use a language called PDL, part of the open-source rest.li framework); but in addition, you would also have to change the service (datahub-gms) to support CRUD operations on the new entity or aspect, and change the changelog consumer (mae-processor) to apply search-index or graph-index updates accordingly. In all, we found that for typical use cases of extending an entity, or adding new information to an existing entity through an aspect, developers were having to touch upwards of 50 files! Typical pull requests (PRs) from developers would take weeks to approve and merge because of the amount of code involved.

Not any more! With No Code Metadata, all you need to do is describe the change to the model, either a new entity or a new aspect, along with annotations that describe how you want this new information indexed for search or for graph traversals. DataHub’s metadata platform takes care of the rest, with generic endpoints that allow you to CRUD these new resources, and with search and graph indexes maintained for you automatically from the metadata change log. This brings the time to create a new entity down from weeks to 15 minutes.
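To make the generic, aspect-oriented path concrete, here is a sketch of writing a single aspect to DataHub through the Python REST emitter; the platform maintains the search and graph indexes from the resulting change log. Class and parameter names follow the acryl-datahub Python SDK, and the dataset is hypothetical; verify the exact API against your installed version.

```python
# A sketch of upserting one aspect through DataHub's generic ingestion path.
# SDK class/parameter names may vary slightly by version; the dataset is
# hypothetical.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Upsert just the datasetProperties aspect; search and graph indexes are
# updated automatically from the metadata change log.
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=make_dataset_urn(platform="hive", name="logs.events", env="PROD"),
    aspectName="datasetProperties",
    aspect=DatasetPropertiesClass(description="Raw event log, partitioned by day."),
)
emitter.emit_mcp(mcp)
```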

Read the developer docs to learn more about this feature and watch John Joyce present the motivation, inner workings and a slick demo of this during our last town hall here.

With No Code Metadata, we can now accelerate the speed at which we introduce new concepts into the metadata graph, making DataHub much more comprehensive and useful for all enterprises. There are already some new contributions in the works using the No Code approach, such as an integration with Feast that will introduce the concept of Features into DataHub, and we are excited to see the new additions to the metadata model that the community will contribute.

Case Study: DataHub deployment on GCP

Sharath Chandra, one of Confluent’s first data engineers, walked us through his journey of deploying DataHub on GCP. Confluent’s data warehouse uses BigQuery for analytics queries and Airflow for orchestration. Data goes through multiple layers of transformations and standardization processes before it is available for the end users or stakeholders.

Sharath describes his use cases for DataHub as starting with a centralized tool that reflects database and schema documentation, and growing into a comprehensive repository that tracks lineage end-to-end, from operational systems to analytics systems, via push-based emitters that write metadata into Kafka from production systems.

He took an interesting approach to capturing lineage by extracting it directly from the BigQuery audit logs table. DataHub’s flexibility allowed him to emit this metadata from an Airflow job that he wrote specifically to convert BigQuery audit logs to lineage events.
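To sketch the general shape of this approach (this is not Sharath’s actual code), the snippet below converts simplified audit-log records into DataHub lineage events with the Python emitter. The record fields and table names are hypothetical; a real job would query the audit-log table and pull the referenced and destination tables out of each query job’s statistics.

```python
# A hedged sketch: derive table-level lineage from (simplified) BigQuery
# audit-log records and emit it to DataHub. Field and table names below are
# hypothetical placeholders.
from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Each record: a query job's destination table plus the tables it read.
audit_rows = [
    {
        "destination_table": "analytics.daily_revenue",
        "referenced_tables": ["raw.orders", "raw.payments"],
    },
]

for row in audit_rows:
    downstream = make_dataset_urn("bigquery", row["destination_table"])
    upstreams = [make_dataset_urn("bigquery", t) for t in row["referenced_tables"]]
    # make_lineage_mce builds an UpstreamLineage aspect for the downstream table.
    emitter.emit_mce(make_lineage_mce(upstream_urns=upstreams, downstream_urn=downstream))
```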

Watch his video to learn more about how he did it and his plans for the future.

Looking Ahead

With the release of No Code Metadata and Product Analytics, we have unlocked big improvements in product agility and enabled data teams to understand and improve DataHub usage at their companies. We expect to see lots of new additions and extensions to the model.

Our focus in June will be on surfacing more context about tables during data discovery, based on how they are being used (by mining usage logs from systems like BigQuery, Snowflake, Presto, etc.), and on releasing an initial design for role-based access control (RBAC) for DataHub metadata; these are two of the most requested items on our roadmap. We also plan to release the roadmap for the second half of the year soon. Join us for our next town hall on June 25th; sign up here to get an invite. Until next time!
