
Managing PII in DataHub: A Practitioner’s Guide

Metadata

Data

Data Governance

Data Science

Features

Mitchell

May 16, 2022


PII and its importance

Every day, the world produces 2.5 exabytes of data (source: Accenture). That’s equivalent to 2.5 quintillion bytes, 2.5 billion gigabytes, or roughly 145 million digital copies of James Cameron’s Avatar, my favorite movie.

Within this data often lies PII, or personally identifiable information. NIST defines PII as “any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.” In other words, PII is any piece of information that could be used to infer someone’s identity.

Recent data points to an unfortunate increase in PII breaches at the companies we interact with daily: for example, the breach at Microsoft less than a year ago, and an even more recent one at Okta.


It goes without saying that PII is incredibly important and valuable to a company, so an emphasis should be placed on managing this data properly in order to protect it properly. In this article, we will walk through how to annotate the datasets that contain PII, and then describe some of the powerful use cases DataHub unlocks once annotation is complete.

How to annotate data as containing “PII” in DataHub

Step 1: Build and ingest your business glossary

In order to begin annotating datasets that contain PII, we first need to create and ingest a glossary of terms covering the various types of personally identifiable information your business or organization collects. Because this varies from organization to organization, each glossary will differ. However, to get you started, we have prepared a utilitarian glossary containing many different types of PII and their “levels” of sensitivity/impact according to the NIST standard; all you have to do is upload it to your DataHub instance!

Examples of some of the PII terms we’ve included.

How does this glossary file work, you ask?

In DataHub, it is possible to create multiple glossary term “sets” which can interact with one another. In our case, this is helpful because it allows us to associate our PII term set with our Impact Levels term set, giving each PII glossary term an associated level of impact should it be accidentally disseminated. This enhances the experience for end users, who can now filter by Low, Medium, or High impact rather than by individual PII terms, of which there can be dozens. For more information, please see here.

PII term set

Furthermore, if down the line there is a reason to move a PII term from one level of impact to another (e.g. email becomes a “High” impact term when it was originally “Medium”), all datasets that contain emails will automatically switch from “Medium” to “High” impact with a small edit to the .YAML file. Don’t take my word for it; try it yourself in your own instance.
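
To make that concrete, here is a minimal sketch of what such a glossary file can look like. The node names, term names, and impact levels below are illustrative placeholders rather than the contents of our prepared glossary; check the exact field names against the DataHub business glossary documentation for your version.

```yaml
# pii_glossary.yaml -- illustrative sketch of a DataHub business glossary file
version: 1
source: DataHub
owners:
  users:
    - datahub
nodes:
  - name: ImpactLevel
    description: Impact levels for PII terms, loosely following NIST guidance
    terms:
      - name: Low
        description: Limited adverse effect if disclosed
      - name: Medium
        description: Serious adverse effect if disclosed
      - name: High
        description: Severe or catastrophic adverse effect if disclosed
  - name: PII
    description: Personally identifiable information collected by the organization
    terms:
      - name: Email
        description: An individual's email address
        inherits:
          - ImpactLevel.Medium   # edit this one line to re-classify emails everywhere
      - name: SocialSecurityNumber
        description: A US Social Security Number
        inherits:
          - ImpactLevel.High
```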

Please feel free to use the above glossary as a starting point and edit it to your liking. Our .YAML recipes are easy to use, and we recommend keeping your glossaries as checked-in artifacts in your version-control tool of choice so you can track changes over time.
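
As a rough sketch (the file path and server URL below are assumptions for illustration), the recipe that loads a glossary file like the one above can be as small as this, and you would run it with the DataHub CLI’s ingest command:

```yaml
# glossary_recipe.yaml -- sketch of a recipe that ingests the checked-in glossary
source:
  type: datahub-business-glossary
  config:
    file: ./pii_glossary.yaml        # path to your checked-in glossary (assumed)

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"  # your DataHub GMS endpoint (assumed)
```

Re-running the same recipe after editing the glossary is what propagates a change like the Medium-to-High switch described above.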

For more information on business glossaries in general, why they are important, and how to use them in DataHub, see a previous article here.

Step 2: Annotate the data, either automatically or manually

So you now have your business glossary ingested into your DataHub instance. Nice work. Now, in order for the glossary to be useful, we need to actually attach the PII term to datasets. This can be done in a few different ways:

“Shift-left” via annotations in schema languages (Recommended)

The DataHub community has pioneered shift-left practices for annotating schemas in CI/CD pipelines with the right glossary terms. Zendesk demonstrated how they have done this with Protobuf schemas and Saxo Bank has shared their automated approach for applying glossary terms. As we speak, other companies in the DataHub community are adding support for Thrift and other schema languages. We recommend this “Shift-left” approach because it minimizes the time teams have to spend on manual data enrichment.

Using “Transformers” and pattern matching during metadata ingestion

In DataHub, you can create a transformer that automatically adds glossary terms like PII to datasets before they are ingested into your instance. For more information on how this would look, please see here.

Specify Regex patterns to determine which glossary terms to add
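
For illustration, here is a sketch of what a pattern-based term transformer can look like inside an ingestion recipe. The source details, regex rules, and glossary term URNs are assumptions made for the example; check the transformer name and config keys against the DataHub transformers documentation for your version.

```yaml
# recipe sketch: attach PII glossary terms during ingestion based on URN patterns
source:
  type: postgres                      # example source; substitute your own
  config:
    host_port: "localhost:5432"
    database: analytics
    username: datahub
    password: "${POSTGRES_PASSWORD}"

transformers:
  - type: pattern_add_dataset_terms
    config:
      term_pattern:
        rules:
          # datasets whose URNs match a regex get the listed glossary terms (URNs assumed)
          ".*users.*": ["urn:li:glossaryTerm:PII.Email"]
          ".*customers.*": ["urn:li:glossaryTerm:PII.Email", "urn:li:glossaryTerm:PII.SocialSecurityNumber"]

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```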

Via the DataHub UI

The DataHub UI supports adding glossary terms to datasets as well as individual columns with a few clicks.

Acryl Data Auto-PII Detection

Acryl Data partners with vendors that provide machine-learning-based PII detection, and offers approval workflows so humans can verify proposed terms before they are applied. For more information, please visit our website and fill out a form.

Coming Soon: CSV Ingestion of Associations

In the next DataHub release, we plan to add a new .CSV-based ingestion plugin that will make it easier to programmatically annotate existing datasets and schemas with PII terms. End users will be able to list which PII terms to associate with each dataset or column in a .CSV file, and our plugin will do the rest!

Common Use Cases in DataHub

Congratulations on making it this far. Your glossary term sets are ingested, and a large portion of your data is now annotated as containing PII. What next? Here are some use cases that make the work you did worth it!

Search and Discovery with PII

End users can now quickly answer questions they could not answer before, such as:

Where is all of the PII data residing in my data stack?

Which datasets contain emails in my data stack?

Is this dataset safe to send, or does it contain PII?

…and more!

Dataset Downloads

End users should now be able to download .CSV files of their PII datasets, all with a couple of clicks in the DataHub UI.

API calls for access control to datasets

Now that a good portion of your datasets are properly annotated as containing PII, you can use DataHub’s friendly API to begin governing access control in your provider of choice.

Metadata analytics sanity check

Navigate to the “Metadata Analytics” tab to validate the number of datasets you have annotated as PII, and to glean additional insights into data ownership and more.


Conclusion

It is becoming increasingly important for data-gathering companies to keep track of PII and ensure their end users’ data is properly handled. I hope this article has served as a useful introduction to PII, demonstrated its importance, and shown you how to leverage DataHub to track it properly.

DataHub does a lot of other things, too. See here for more.

Have a success story using DataHub to manage PII? Write to me at feedback@acryl.io

Acryl Data is hiring; click here for more information.

Want to get involved? Join our Slack channel here.


