
Just Shipped: UI-Based Ingestion, Data Domains & Containers, Tableau support, and MORE!

Release Notes

Metadata

Metadata Management

Open Source

Data Engineering

Maggie Hays

Feb 8, 2022


👋 Hello, DataHub Enthusiasts!

I hope this finds you happy & healthy ❤️

I’m so excited to share this update with you. Buckle up: it’s PACKED with new features & improvements for you to explore.

UI-Based Ingestion

If you joined the December Town Hall, you likely remember John Joyce’s awesome demo of UI-Based Ingestion. As of v0.8.25, you can now create, configure, schedule, & execute batch metadata ingestion via the DataHub user interface. UI-Based Ingestion makes getting metadata into DataHub easier than ever by minimizing the overhead required to operate custom integration pipelines. Read the guide here.
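For context, a source configured through the UI corresponds to the same recipe format the `datahub` CLI consumes. A minimal sketch for a hypothetical MySQL source is shown below; the host, credentials, and sink address are all placeholder values:

```yaml
# Hedged sketch: roughly what the UI's ingestion builder produces.
# All connection values below are placeholders.
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub_reader
    password: <password>
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

The scheduling step in the UI plays the same role as running a recipe like this on a recurring cron cadence.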

Choose an Ingestion Type and Configure connection details

Easily set Ingestion Schedule and give it a name — you’re all set!

Data Domains — Fueling Data Mesh in DataHub

DataHub now supports grouping data entities into logical collections called Domains.

Domains are curated, top-level folders or categories that let you explicitly group related entities. This is useful for organizations that want to organize entities by department (e.g., Finance, Marketing), by Data Product, or by other logical groupings common in Data Mesh adoption. Read the guide here.
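If you prefer to assign Domains at ingestion time rather than through the UI, one option is an ingestion transformer. A sketch using the `simple_add_dataset_domain` transformer follows; the Domain URN is a placeholder:

```yaml
# Hedged sketch: appended to an ingestion recipe, this assigns every
# ingested dataset to one Domain. The URN below is a placeholder.
transformers:
  - type: simple_add_dataset_domain
    config:
      domains:
        - urn:li:domain:marketing
```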

John Joyce gives an overview of Data Domains in DataHub

Take DataHub Domains for a spin!

Data Containers are LIVE!

Data Containers represent the physical grouping of entities — for example, a Schema is a container of 1 or more Datasets; a Dashboard is a container of 1 or more Charts.

You can associate Ownership, Glossary Terms, Tags, & Documentation at the Container-level to enrich others’ understanding of these resources.

Watch John’s walkthrough below!

John Joyce gives an overview of Data Containers in DataHub

Q1 Roadmap Progress

Data Quality — Metadata Model Support

Data Quality test results are now supported in the DataHub backend metadata model!

This is the first milestone toward surfacing Dataset & Column-level Data Quality results in the UI (read full scope of work here).

Future releases will include a Great Expectations integration & UI support — we’re on track to complete this in Q1 as planned.

Want to learn more about what DataHub is working on in Q1? View the complete roadmap here!

Incubating Metadata Sources
Tableau — BETA

We are SO EXCITED to roll out the Beta release of our Tableau ingestion source as of v0.8.26. We are eager for Community Members to test out this integration & to provide feedback — join the conversation in Slack!
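A starting-point recipe for the Beta Tableau source might look like the sketch below. The server URL, site, credentials, and project list are placeholders, and field names may shift while the source is in Beta:

```yaml
# Hedged sketch of a Tableau ingestion recipe (Beta source).
# All connection values are placeholders.
source:
  type: tableau
  config:
    connect_uri: https://tableau.company.com
    site: acryl
    username: tableau_user
    password: <password>
    projects: ["default"]
```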

Elasticsearch

We added Elasticsearch as a supported metadata ingestion source in v0.8.23. It currently extracts metadata for indexes and column types associated with each index field.
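A minimal Elasticsearch recipe might look like the sketch below. The host and credentials are placeholders, and the `index_pattern` filter is an assumption based on the allow/deny pattern used across other sources:

```yaml
# Hedged sketch of an Elasticsearch ingestion recipe.
# Host, credentials, and the index filter are placeholders/assumptions.
source:
  type: elasticsearch
  config:
    host: http://localhost:9200
    username: elastic
    password: <password>
    index_pattern:
      allow: ["orders-*"]
```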

Data Lake Files — BETA

We added a new Data Lake Files ingestion source in v0.8.24 to support data profiling for local files and files stored in AWS S3; supported file types are CSV, TSV, Parquet, and JSON. Avro files are supported as of v0.8.25.

This is useful for organizations that wish to catalog files within AWS S3 without requiring Hive and/or Glue as data catalogs.
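As a hedged sketch only (the source type name and config fields below are assumptions and may differ by version), profiling files in an S3 bucket could look like:

```yaml
# Hedged sketch of a Data Lake Files ingestion recipe (Beta source).
# Source type, platform value, and path are assumptions/placeholders.
source:
  type: data-lake
  config:
    platform: s3          # assumption; a local-file platform may also exist
    base_path: s3://my-bucket/data/
    profiling:
      enabled: true
```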

⚠️ We are aware of some performance issues with Data Profiling on this source and are working on improvements!

Have feedback to share about these new sources? Tell us all about it in our #ingestion Slack channel!

Ongoing Improvements as of v0.8.25

Support Multiple Instances of the Same Platform Type

This has been a widespread use case within the Community — you can now differentiate multiple instances of the same platform type!

If you have pre-existing entities, use the `datahub migrate` command to move them onto platform instances; see the migration script here.
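The migration can be sketched as a single CLI invocation; the platform and instance names below are placeholders:

```shell
# Hedged sketch: re-keys existing URNs for one platform onto a named
# platform instance. Platform and instance values are placeholders.
datahub migrate dataplatform2instance --platform hive --instance prod_warehouse
```

Going forward, new ingestion runs can set a `platform_instance` value in the source config so entities land on the right instance from the start.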

Ignore Specific Users in Data Profiling

We now support the ability to ignore specific users when calculating Top Users of a Dataset/Column — this is useful when you want to exclude users designated for maintenance/automated execution.
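For a usage-based source, excluding service accounts from Top Users might look like the sketch below; the `user_email_pattern` filter and the deny expression are assumptions for illustration:

```yaml
# Hedged sketch: exclude automation accounts from Top Users stats.
# Project name and the deny regex are placeholders/assumptions.
source:
  type: bigquery-usage
  config:
    projects: ["my-project"]
    user_email_pattern:
      deny:
        - "airflow-service-account@.*"
```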

BigQuery — Data Profiling on Only the Latest Partition/Shard

Profile only the latest partition or shard in BigQuery to reduce processing time.
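In recipe form this might look like the sketch below; the `partition_profiling_enabled` flag name is an assumption for illustration:

```yaml
# Hedged sketch: restrict BigQuery profiling to the most recent
# partition/shard. Project ID and the flag name are assumptions.
source:
  type: bigquery
  config:
    project_id: my-project
    profiling:
      enabled: true
      partition_profiling_enabled: true
```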

Notable Metadata Model & Ingestion-Based Features
  • Avro files are now supported in the Data Lake File ingestion source
  • Support for nested Glue Schemas (as of v0.8.24)
  • Fix to optionally add an external Looker URL so that “View in Looker” correctly routes to the Looker Chart or Dashboard.
  • Fix to surface data profiling results for low-cardinality numeric columns. This resolves the reported issue where Min/Mean/Max/etc. values were not displayed in the UI after a successful Data Profiling run.
  • Community shoutout! @iasoon pushed a fix to correctly match the default username for the Azure OIDC & Azure ingestion sources
  • Community shoutout! @thomasplarsson added a fix to support displaying Group Name as an Entity Owner

Community Contributions

We had 33 people contribute to the DataHub Project across v0.8.23, v0.8.24, v0.8.25, and v0.8.26!

Congrats to our first-time contributors!

@danilopeixoto, @dipeshmaurya, @eburairu, @icy, @Jiwei0, @ksrinath, @lhvubtqn, @maaaikoool, @senni0418, @ShubhamThakre

Shoutout to our repeat contributors!

@aditya-radhakrishnan, @anshbansal, @arunvasudevan, @dexter-mh-lee, @gabe-lyons, @hsheth2, @iasoon, @jeffmerrick, @jjoyce0510, @kevinhu, @maggiehays, @mayurinehate, @MikeSchlosser16, @MugdhaHardikar-GSLab, @nickwu241, @pedro93, @RickardCardell, @rslanka, @RyanHolstien, @swaroopjagadish, @thomasplarsson, @treff7es, @varunbharill

One Last Thing —

I caught up with DataHub Community Member, Ryan Holstien:

Maggie: What are you most excited about from v0.8.25?

Ryan: Containers and Domains! I’m a big fan of Domain-Driven Design (DDD) and this is a big step towards us being able to manage metadata for full Data Meshes.

M: Nice! Ok, not DataHub-related — is there a song you’ve been playing on repeat recently?

R: Don’t Call My Name — Skinshape

That’s it for this round; see ya on the Internet :)

Connect with DataHub

Join us on Slack | Sign up for our Newsletter | Follow us on Twitter

