Preventing Data Freshness Problems with Acryl Observe

Data Engineering

Metadata Management

Data Governance

John Joyce

Sep 11, 2023

Imagine this...

You’re a data engineer who manages a Snowflake table that tracks all user click events from your company’s e-commerce website. This data originates from an online web application upstream of Snowflake, and lands in your data warehouse every day. At least it’s supposed to.

One day, someone introduces a bug in the upstream ETL job that copies the click event data to the data warehouse, causing the pipeline to copy zero click events for the day. Suddenly, your Snowflake table is missing crucial click event data for yesterday, and you don’t have a clue.

Next thing you know, you receive a frantic Slack message from your Head of Product. She’s puzzled by the site usage dashboard that shows zero user views or purchases on the previous day!

Oops.

As data engineers, it’s all too common for end users – decision makers looking at internal dashboards or worse, users navigating our product – to be the first ones to discover data issues when they happen.

This is because some types of issues are particularly hard to detect in advance, like missing data from upstream sources.

The Data Freshness Problem Is Real

Unfortunately, we’ve all been there.

The data you depend on (or are responsible for!) went days, weeks, or even months before anyone noticed that it wasn’t being updated as often as it should be.

Maybe it was due to an application code bug, a seemingly harmless SQL query change, or someone leaving the company. There are many reasons why a table on Snowflake, Redshift, or BigQuery may fail to get updated as frequently as stakeholders expect.

Such issues can have severe consequences: inaccurate insights, misinformed decision-making, and bad user experience to name a few.

For this reason, it’s critical that organizations try to get ahead of these types of issues, with a focus on protecting the most mission-critical data assets.

What if you could reduce the time to detect data freshness incidents?

What if you could *continuously* monitor the freshness status of your data and catch issues before they reach anyone else?

Introducing Freshness Monitoring on Acryl DataHub

While many data catalogs disregard the freshness problem altogether, we at Acryl believe the central data catalog should be the single source of truth for the technical health and the governance or compliance health of a data asset – a one-stop-shop for establishing trust in data, suitable for use by your entire organization.

Based on this belief, and our deep experience extracting metadata from systems like Snowflake, BigQuery, and Redshift, we felt well-positioned to tackle the data freshness challenge.

Through many conversations with our customers and community, we honed our initial approach and built the Freshness Assertion monitoring feature on Acryl DataHub.

With Acryl Freshness Assertions, data producers or consumers can:

  1. Define expectations about when a particular table should change
  2. Continuously monitor those expectations over time
  3. Get notified when things go wrong.

It's like having an automated watchdog that continuously ensures that your most important tables are being updated on time, and that alerts you when things go off-track.

Let’s take a closer look at how Acryl Observe Freshness Assertions work.

The Tricky Bit: Determining Whether a Table Has Changed

Monitoring the freshness of your warehouse tables might sound straightforward, but it’s a bit more complicated than initially meets the eye.

The first challenge is determining what constitutes a “change”.

Is it an INSERT operation? What about a DELETE, even if the total number of rows hasn’t changed? Is it based on rows being explicitly added or removed? Or perhaps the presence of rows with a new, higher value than previously observed in the table?

Just as important is choosing the signal we use to detect a change. Depending on the warehouse, we may rely on one of the following (several of which are sketched in code below):

  • An audit log, which contains information about the operations performed on each table
  • An information schema, which contains live database and table information
  • A last modified column, storing the last modification time for a given row
  • A high watermark column, whose value increases each time new data is introduced to the table (any continuously increasing value works)
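
To make these options concrete, here is a minimal sketch of the information schema and last-modified-column checks against Snowflake. It assumes an open snowflake-connector-python cursor, and the table and column names are hypothetical:

```python
# Sketch: two ways to detect that a warehouse table has changed.
# Assumes `cursor` is an open snowflake-connector-python cursor; the
# table and column names below are illustrative, not prescribed.
from datetime import datetime, timedelta, timezone

def changed_per_information_schema(cursor, table_name, window):
    """Live table metadata: Snowflake exposes LAST_ALTERED in INFORMATION_SCHEMA."""
    cursor.execute(
        "SELECT LAST_ALTERED FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = %s",
        (table_name,),
    )
    row = cursor.fetchone()
    last_altered = row[0] if row else None
    return (last_altered is not None
            and last_altered >= datetime.now(timezone.utc) - window)

def changed_per_last_modified_column(cursor, table_name, column, window):
    """Row-level timestamps: a last-modified column maintained by your pipeline."""
    cursor.execute(f"SELECT MAX({column}) FROM {table_name}")
    latest = cursor.fetchone()[0]
    return latest is not None and latest >= datetime.now(timezone.utc) - window

# Usage (hypothetical names):
# changed_per_last_modified_column(cursor, "CLICK_EVENTS", "UPDATED_AT",
#                                  timedelta(days=1))
```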

So what is the right way to determine whether a table has changed?

The answer: it depends.

Selecting the right approach hinges on your data consumer's expectations.

Each scenario needs a different approach. For instance, if you expect new date partitions daily, the high watermark column would be the right choice. If any change (INSERT, UPDATE, DELETE) is valid, the information schema might be suitable. If you already track row changes using a last modified timestamp column, that could be the simplest and most accurate option.
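
A high watermark check, for instance, only needs to remember the largest value observed at the previous evaluation. A brief sketch, again with hypothetical table and column names:

```python
# Sketch: high watermark change detection. We persist the highest watermark
# value observed at the previous evaluation and treat the table as changed
# if a strictly higher value has appeared since.
def changed_per_high_watermark(cursor, table_name, watermark_column,
                               previous_watermark):
    cursor.execute(f"SELECT MAX({watermark_column}) FROM {table_name}")
    current = cursor.fetchone()[0]
    changed = current is not None and (
        previous_watermark is None or current > previous_watermark
    )
    return changed, current  # store `current` for the next evaluation
```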

Our conversations with customers and partners revealed the need for configurability and customizability across these approaches.

The outcome of these conversations is the Freshness Assertion—a configurable Data Quality rule that monitors your table over time to detect deviations from the anticipated update schedule.

This ensures that you and your team are the first to know when freshness issues inevitably arise.

The Anatomy of an Acryl Freshness Assertion

Acryl DataHub supports creating and scheduling ‘Freshness Assertions’ to monitor the freshness of the most important tables in your warehouse.

What exactly is a Freshness Assertion in DataHub?

A Freshness Assertion is a configurable data quality rule used to determine if a table in the data warehouse has been updated within a given period. It is particularly useful when you have frequently changing tables.

At the most basic level, a Freshness Assertion consists of:

  • An evaluation schedule: This defines how often to check a given warehouse table for new updates. This is usually configured to match the expected change frequency of the table, although you can choose to evaluate it more frequently.
  • A change window: This defines the window of time that is used when determining whether a change has been made to a table.
  • A change source: This is the mechanism that Acryl DataHub should use to determine whether the table has changed. It can be one of the following:
    • Audit Log (Default): A metadata API or table exposed by the data warehouse that contains information about the operations performed on each table.
    • Information Schema: A system table exposed by the data warehouse that contains live information about the databases and tables stored inside the warehouse.
    • Last Modified Column: A Date or Timestamp column that represents the last time a specific row was touched or updated. Adding a Last Modified Column to each warehouse table is a pattern often used for existing change-management use cases.
    • High Watermark Column: A column that contains a continuously increasing value, like a date, a timestamp, or any other monotonically increasing number.

Using the Last Modified Column or High Watermark approach is especially useful when you want to monitor a table for specific types of changes, e.g. particular inserts or updates.
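
Put together, the pieces above can be thought of as one small configuration object per table. The sketch below is purely conceptual shorthand for the anatomy just described, not the actual DataHub SDK:

```python
# Conceptual sketch of a Freshness Assertion's anatomy. The names mirror
# the concepts above but are illustrative, not DataHub's real API.
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional

class ChangeSource(Enum):
    AUDIT_LOG = "audit_log"                        # the default
    INFORMATION_SCHEMA = "information_schema"
    LAST_MODIFIED_COLUMN = "last_modified_column"
    HIGH_WATERMARK_COLUMN = "high_watermark_column"

@dataclass
class FreshnessAssertion:
    table: str
    evaluation_schedule: str         # cron expression, e.g. "0 8 * * *"
    change_window: timedelta         # how far back to look for a change
    change_source: ChangeSource
    source_column: Optional[str] = None  # needed for the column-based sources

# A daily expectation for the click events table from the intro:
daily_clicks = FreshnessAssertion(
    table="analytics.click_events",
    evaluation_schedule="0 8 * * *",
    change_window=timedelta(days=1),
    change_source=ChangeSource.LAST_MODIFIED_COLUMN,
    source_column="updated_at",
)
```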

There’s more.

As part of the Acryl Observe module, DataHub also comes with Smart Assertions, which are AI-powered Freshness Assertions that you can use out of the box to monitor the freshness of important warehouse tables.

This means that if DataHub can detect a pattern in the change frequency of a Snowflake, Redshift, or BigQuery table, you'll find recommended Smart Assertions for frequently changing tables under the Validations tab on the asset’s profile page.
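
As rough intuition for how such a pattern might be detected (this is an illustrative guess, not Acryl's actual detection logic), one could look at the gaps between a table's historical updates and only recommend an assertion when they are regular:

```python
# Illustrative only: infer an expected update cadence from a table's
# historical change timestamps, in the spirit of Smart Assertions.
from statistics import median

def infer_expected_interval(change_times):
    """Median gap between updates, if the gaps look regular enough."""
    if len(change_times) < 3:
        return None  # too little history to call it a pattern
    ordered = sorted(change_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    typical = median(gaps)
    if max(gaps) > 2 * typical:
        return None  # gaps too irregular to recommend an assertion
    return typical
```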

Using Freshness Assertions to Monitor Tables on Your Data Warehouse

In this section, we’ll see how simple it is to set up a Freshness Monitoring Assertion for a table using the DataHub UI.

Step 1: Creating the Assertion

Navigate to the table to be monitored, and create a new DataHub Assertion for Freshness Monitoring using the Validations tab.

Step 2: Configuring the Assertion evaluation parameters


For this, you’ll need to configure the following:

  • Evaluation schedule: This is the frequency at which the table will be checked for changes (your expectation about how often the table should be updated).
  • Evaluation period: This defines the period of time considered when evaluating the check. You can either:
    • Check whether the table has changed within a specific window of time, or
    • Check whether the table has changed between subsequent evaluations of the check.

Lastly, you can customize the evaluation source to configure the mechanism you want to use to evaluate the check.
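
The difference between the two evaluation-period options comes down to where the comparison window starts. A small sketch, assuming `last_change` has already been obtained from the configured change source:

```python
# Sketch: the two interpretations of the evaluation period. `last_change`
# is the table's most recent change time, from the chosen change source.
from datetime import datetime, timezone

def fresh_within_window(last_change, window):
    """Option 1: did the table change within a specific trailing window?"""
    return last_change >= datetime.now(timezone.utc) - window

def fresh_since_previous_run(last_change, previous_evaluation_time):
    """Option 2: did the table change since the check last ran?"""
    return last_change >= previous_evaluation_time
```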

Step 3: Triggering an Incident

Once you set the parameters for monitoring, you can decide how you want the assertion to automatically trigger an incident when it fails. This allows you to broadcast a health issue to all relevant stakeholders.

You can also enable the Auto-Resolve option to automatically close the incident when (and if) the issue has passed.
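
Conceptually, the trigger and Auto-Resolve behavior is a small state machine. A hypothetical sketch, where `raise_incident` and `resolve_incident` are placeholders for your own alerting hooks rather than a real DataHub API:

```python
# Sketch: incident lifecycle driven by assertion results. raise_incident
# and resolve_incident are placeholder callbacks, not a real API.
def handle_assertion_result(passed, incident_open,
                            raise_incident, resolve_incident):
    """Return whether an incident is open after this evaluation."""
    if not passed and not incident_open:
        raise_incident()       # failure: broadcast to stakeholders
        return True
    if passed and incident_open:
        resolve_incident()     # Auto-Resolve: close once the check passes again
        return False
    return incident_open       # no change in state
```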

For a detailed guide on setting up Freshness Assertions on DataHub, check out this guide to Freshness Assertions on Managed DataHub (https://datahubproject.io/docs/managed-datahub/observe/freshness-assertions/).

Depending on the evaluation of the Assertion, DataHub displays a health indicator right next to the asset to show whether it is healthy or not.

Staying on Top of Data Freshness Issues with Subscriptions and Notifications

There are two ways for you to stay updated on the freshness of your data warehouse using the Subscriptions & Notifications feature offered by Acryl DataHub.


To be notified when things go wrong, simply subscribe to receive notifications for

  • Assertion status changes: Get notified when an Assertion fails or passes for a specific table
  • Incident status changes: Get notified when an Incident is raised or closed for a specific table

These notifications are delivered via Slack today, with support for other platforms coming soon.
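
If you also fan alerts out through your own tooling, Slack's incoming webhooks make that simple. A minimal sketch in which the webhook URL is a placeholder you would generate in Slack:

```python
# Sketch: forwarding an assertion status change to Slack via an incoming
# webhook. The URL is a placeholder; generate your own in Slack's admin UI.
import requests

def notify_slack(webhook_url, table, status):
    payload = {"text": f"Freshness assertion for `{table}` is now *{status}*."}
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()

# notify_slack("https://hooks.slack.com/services/T000/B000/XXXX",
#              "analytics.click_events", "FAILING")
```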

DataHub as a Health Indicator for Your Data

DataHub goes beyond observability – it is the central source of truth for the health of your data – encompassing both technical health (e.g. day-to-day data quality) and governance or compliance health (documented purpose, classification, accountability via ownership, and so on).

Simply using the Assertions or Incidents filter in the Search feature can help you surface assets that have freshness issues.

You could even use the Observe module on Acryl DataHub to surface the health issues of your assets.

With DataHub, you have a snapshot view of the real-time health of your data that’s accessible and useful to anyone in your company – be it a marketer, a business analyst, or even the CEO.

Experience a Fresh Approach to Data Quality and Integrity

Data integrity isn't a one-size-fits-all challenge, and that's what sets DataHub’s Freshness Assertion Monitoring apart. Unlike traditional approaches that might rely on point-in-time checks, it offers continuous and real-time monitoring in a hands-free, no-code manner.

So, whether you're crunching numbers, building dashboards, or making critical decisions, DataHub's Freshness Assertion Monitoring can help ensure that

  • You're informed of freshness issues as soon as they occur
  • You prevent downstream data users from encountering data issues first

If this sounds like something your team needs, get in touch with me for a demo of Acryl DataHub (https://www.acryldata.io/sign-up).

PS: Freshness Assertion Monitoring is just the beginning. As we continue to iterate on our observability offering, we're excited to bring more data health-focused features to the table. Watch this space for our upcoming Volume Monitoring feature, which will help you identify unexpected shifts in the row counts of your most important tables.
