Automatic Data Quality Metrics

Low data quality imposes significant costs on large and small enterprises alike. Not only can malformed data disrupt production systems, it is also expensive to correct manually, and it severely hampers the effectiveness of advanced techniques such as machine learning and other innovative uses of enterprise data.

The causes of low data quality are many and varied. Typically they result from fragmentation: a large number of different systems performing the same function in parallel, as determined by organizational silos (think of each division, department and desk having its own system to do effectively the same thing). While addressing these root causes can be complex and time consuming [1], generating the metrics that quantify a snapshot's data quality should be automated and straightforward.

Amazon Deequ

Deequ is an open source library developed internally at Amazon to verify the quality of many of its large production data sets. It is implemented on top of Apache Spark and provides a rich set of data analysis tools. A good overview can be found here: https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/
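
As a quick taste of the API, a Deequ constraint verification looks roughly like the sketch below (minimal and illustrative; the data set and the review_id column echo the review data used in that blog post):

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

// declare the constraints the data is expected to satisfy, then run them
val verificationResult = VerificationSuite()
  .onData(dataset) // any Spark DataFrame
  .addCheck(
    Check(CheckLevel.Error, "basic integrity checks")
      .isComplete("review_id") // no NULL values
      .isUnique("review_id"))  // no duplicates
  .run()

if (verificationResult.status == CheckStatus.Success) {
  println("The data passed all quality checks")
}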

Implementation Challenges

While Deequ is an excellent library, out of the box it comes with a number of 'plumbing issues' that you, as a user, need to address:

  • Environment and Infrastructure Setup - To run Deequ, you need a running Spark cluster to perform the computations, along with somewhere to host your data sets. While this may not pose an issue for seasoned engineers, for data people who 'just want to work', dealing with distributed cluster setup, performance and scalability is at best a distraction.

  • Writing Boilerplate Code - While the code required to drive Deequ is simple and straightforward to write, it is still a manual task that requires you to know the column types of your data set and translate them to the appropriate Spark invocations. For instance, taking the example from the Deequ blog post:

import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{Compliance, Correlation, Size, Completeness, Mean, ApproxCountDistinct}

val analysisResult: AnalyzerContext = { AnalysisRunner
  // data to run the analysis on
  .onData(dataset)
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("review_id"))
  .addAnalyzer(ApproxCountDistinct("review_id"))
  .addAnalyzer(Mean("star_rating"))
  ...
  // compute metrics
  .run()
}
val metrics = successMetricsAsDataFrame(spark, analysisResult)

As a user you need to 'know' the columns in your data set, along with which analyzers can be applied to which column types (e.g. you cannot apply the Mean analyzer to non-numeric columns); a sketch of how this selection could be driven off the schema instead appears after this list.

  • Storing and Visualizing the Metrics - Assuming you have successfully invoked Deequ and obtained some output, how do you associate and store it with your original data set in such a way that you can easily refer back to it at a later stage? And how do you visualize the output in an intuitive way? Again, as an experienced developer with an existing system, these may be relatively easy for you to address. However, for more data focused users, these are simply yet another distraction.
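
To give a flavour of what automating the boilerplate involves, the sketch below derives analyzers from a DataFrame's schema rather than hand-writing them. It is illustrative only, not TeraHelix's actual generation logic, and the profile helper is hypothetical:

import com.amazon.deequ.analyzers.{ApproxCountDistinct, Completeness, Maximum, Mean, Minimum, Size}
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.NumericType

// Hypothetical helper: choose analyzers per column based on its Spark data type
def profile(spark: SparkSession, df: DataFrame): DataFrame = {
  // every data set gets a row count
  var builder = AnalysisRunner.onData(df).addAnalyzer(Size())
  df.schema.fields.foreach { field =>
    // completeness applies to every column, regardless of type
    builder = builder.addAnalyzer(Completeness(field.name))
    field.dataType match {
      case _: NumericType =>
        // analyzers that only make sense for numeric columns
        builder = builder
          .addAnalyzer(Mean(field.name))
          .addAnalyzer(Minimum(field.name))
          .addAnalyzer(Maximum(field.name))
      case _ =>
        // a cheap cardinality estimate for everything else
        builder = builder.addAnalyzer(ApproxCountDistinct(field.name))
    }
  }
  successMetricsAsDataFrame(spark, builder.run())
}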

Data Sets in TeraHelix

The TeraHelix platform provides a structured data lake that manages the flow of bi-temporally addressable, immutable data snapshots through the workflow that makes up an application. That is quite a lot to unpack, but in simple terms: your application is essentially just a number of data sets that transition from one into the next, organized according to your business hierarchy:

[Screenshot: CAT application snapshot example]

Each one of the snapshots in the screenshot above is individually addressable, typed and immutable. It is these snapshots that we feed into the Deequ engine.
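
Conceptually, the address of such a snapshot carries both a business-time and a system-time coordinate. The sketch below illustrates what 'bi-temporally addressable' means; the names are hypothetical and do not reflect the TeraHelix API:

import java.time.Instant

// Hypothetical illustration of a bi-temporal snapshot address:
// businessTime is when the data is effective in the real world,
// systemTime is when this snapshot version became known to the store.
case class SnapshotAddress(
  dataSet: String,       // e.g. the data set's place in the business hierarchy
  businessTime: Instant, // effective date of the data
  systemTime: Instant    // when this version was recorded
)

// Immutability means a correction never overwrites an existing snapshot;
// it is recorded under a new address with a later systemTime.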

Automatic Data Quality Metrics Example: CAT Reportable Equity Securities Symbol Master

FINRA makes available a CAT Reportable Equity Securities Symbol Master file twice a day, which defines the identifiers that should be used when submitting reports to the regulator.

Although this is a relatively simple example with only a limited number of data attributes, it will hopefully illustrate how TeraHelix integrates Amazon's Deequ library to provide built-in, automated quality metrics generation.
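
For readers who want to experiment with the same file outside of TeraHelix, a plain Spark load might look like the sketch below. It assumes the pipe-delimited layout with a header row that FINRA uses for this file; the file name is a placeholder for wherever you have downloaded it:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cat-symbol-master")
  .master("local[*]") // run locally; no cluster needed for a file of this size
  .getOrCreate()

val dataset = spark.read
  .option("header", "true") // the first row carries the column names
  .option("sep", "|")       // the symbol master is pipe-delimited
  .csv("CATReportableEquitySecurities.txt") // placeholder path

dataset.printSchema()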

Data Set Structure

The structure of each of the rows (or objects) that make up the snapshot has previously been inferred in the TeraHelix Loading Bay [2]. It is therefore now available as part of the snapshot (along with definitions in a number of different languages, should you need to interact with it from third-party tools).

Code Generation and Invocation of Deequ

In much the same way that TeraHelix is able to automatically generate the language bindings and reporting views, it is able to generate the Deequ code required to produce the data quality metrics:

import org.apache.spark.sql.SparkSession
import com.amazon.deequ.analyzers.runners._
...
val v1 = AnalysisRunner.onData(dataset)
v1.addAnalyzer(Size())
v1.addAnalyzer(MaxLength("Symbol"))
v1.addAnalyzer(Completeness("Symbol"))
...
v1.addAnalyzer(CountDistinct("IssueName"))
v1.addAnalyzer(UniqueValueRatio("IssueName"))
v1.addAnalyzer(Uniqueness("IssueName"))
...
v1.addAnalyzer(Entropy("ListingExchange_Id"))
...
val analysisResult = v1.run()
// retrieve successfully computed metrics as a Spark data frame
val metrics = successMetricsAsDataFrame(spark, analysisResult)
metrics

For each column, the appropriate set of analyzers is generated based on its data type. Running every analyzer for every single column all the time can be expensive, so at application setup TeraHelix provides a mechanism for the user to specify how thorough these checks should be (for instance, Size is classified as "Low Complexity", while Entropy is "High Complexity").
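
One simple way to model that thoroughness setting is a cost tier per analyzer type, filtered against a user-chosen ceiling. The sketch below is hypothetical; TeraHelix's actual classification is part of the application setup:

// Hypothetical complexity tiers for analyzers (illustrative assignments)
object Complexity extends Enumeration {
  val Low, Medium, High = Value
}

val analyzerComplexity: Map[String, Complexity.Value] = Map(
  "Size"         -> Complexity.Low,
  "Completeness" -> Complexity.Low,
  "Uniqueness"   -> Complexity.Medium,
  "Entropy"      -> Complexity.High
)

// keep only the analyzers at or below the chosen thoroughness level
def selectedAnalyzers(maxTier: Complexity.Value): Set[String] =
  analyzerComplexity.collect {
    case (name, tier) if tier <= maxTier => name
  }.toSet

// e.g. a 'medium' profile skips the expensive Entropy analyzer:
val quick = selectedAnalyzers(Complexity.Medium)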

Data Storage and Visualization

The data quality metrics are attached as a Tag [3], which makes this meta information available on the snapshot, in much the same way that TeraHelix manages data lineage.
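
Since Deequ returns the metrics as a small Spark DataFrame (with entity, instance, name and value columns), serializing them for attachment is straightforward. The tag-attachment call below is a hypothetical stand-in, not the actual TeraHelix API:

// serialize each metric row to JSON, ready for attachment
val metricsJson: Array[String] = metrics.toJSON.collect()

// hypothetical stand-in for attaching the metrics as a Tag on the snapshot:
// snapshotStore.attachTag(snapshotAddress, "data-quality-metrics", metricsJson)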

Conclusion

Regulatory reporting and advanced data science use cases demand high quality data input. While achieving this at enterprise scale may be challenging and time consuming, the process of measuring and quantifying progress should not be. In this blog we touched on how TeraHelix makes data quality an integrated part of the workflow using freely available open source technologies.

Any thoughts, queries and comments are, as always, welcome. Feel free to let us know at https://www.terahelix.io/contact/index.html


1. Replatforming and addressing these root cause issues is something that TeraHelix excels at - though that is a topic for another blog.
2. TeraHelix makes the on-boarding of new data sources straightforward by auto-generating much of the boilerplate code you need to build an application - although that is a topic for another blog.
3. Tags can be attached to any immutable object in the store. Tags themselves follow a bi-temporally addressable, immutable flow and enable users to attach useful information to objects after the fact, without having to change the original object itself.