Every enterprise that reaches a certain level of maturity inevitably ends up with a vast data estate consisting of multiple database instances, numerous data feeds between systems and several reporting layers.

Over time, changes in business processes and staff turnover cause the complexity of the firm’s overall data model to keep accumulating. Ultimately, the following adverse effects are observed:

  • Duplication of the same data field value across numerous model representations, albeit with different field names (e.g., both 'trade_id' and 'trade_identifier' fields exist in two different databases, each referring to the same value; see the sketch after this list).

  • Inconsistencies in data representations (e.g., the directionality of a trade being represented by 'Buy'/'Sell', 'B'/'S' and 'Yes'/'No').

  • Wrong data types for fields (e.g., numeric values such as prices being represented by strings).

  • Violation of domain constraints (e.g., negative share price values).

  • Broken references or other referential integrity issues that cause data linkage failures (e.g., an identifier not being carried over to the next stage of processing).
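
As a concrete illustration (with entirely hypothetical field names and values), the same trade might surface like this in two unrelated systems:

```python
# Hypothetical records for the *same* trade as it might appear in two
# unrelated systems: the field names, value encodings and even the data
# types disagree, so reconciling them requires manual detective work.

settlement_system_row = {
    "trade_id": "T-20240115-0042",
    "side": "Buy",             # spelled out
    "price": 101.25,           # numeric
    "quantity": 1000,
}

risk_system_row = {
    "trade_identifier": "T-20240115-0042",  # same value, different field name
    "side": "B",                            # single-letter encoding
    "price": "101.25",                      # numeric value stored as a string
    "qty": "1000",
}

# A naive field-by-field comparison finds almost nothing in common, even
# though both rows describe the same economic event.
common_fields = sorted(settlement_system_row.keys() & risk_system_row.keys())
print(common_fields)  # ['price', 'side'] -- and even those disagree on type and encoding
```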

The enterprise’s data flows are now poorly understood, and evolving systems or reacting to changes in external market conditions becomes increasingly expensive.

An accurate and consistent view of the enterprise’s data model is essential to remaining competitive in the marketplace and to driving down costs.

Drawbacks of the 'Static Documentation' Approach

A standard response to enterprise data model sprawl is to commission a task force to systematically capture and document all the data fields and data flows in the enterprise.

This analysis is typically captured in spreadsheets, or in some cases using more advanced tools such as Erwin or Collibra. Whichever form this static documentation ultimately takes, it suffers from a number of shortcomings that call into question the validity of the approach:

  • The creation of this documentation is manual and, depending on the size of the enterprise data estate, requires several FTE weeks / months of effort to complete.

  • The documentation starts going stale as soon as it is produced. Most enterprise systems are continually evolving, and it does not take long before the static documentation is out of date.

  • As the documentation is typically created by individuals who were not involved in the original design/implementation of the system, there is a risk of items being misinterpreted.

  • A manual documentation process tends to suffer from too small a sample size. For instance, given a very large database, a human is forced to sample only a subset of the data in order to form conclusions.

  • The likelihood of mistakes (spelling errors, copy-and-paste slips) increases as the enterprise data model grows. Manually verifying the correctness of the documentation is time-consuming and expensive.

Continuously Integrated, Compiler-Driven Data Model Discovery and Surveillance

In order to keep pace with an ever-evolving data estate, the enterprise needs to adopt widespread automation and embrace the principles of Continuous Integration. It is essential that the discovery and surveillance of the ongoing state of the data model is performed frequently and repeatedly, and that it is validated by an automated, computerised process.

TeraHelix Spear provides organisations with powerful data modelling capabilities, along with the ability to automate the discovery and surveillance of changes to data model assumptions over time, through the following steps:

  • Data Model Discovery and Inference from Schemas, Data and APIs

  • Data Model Consistency with Compiler Technology

  • Data Model Attribute, Tag and Annotation Management

  • Data Model Referential Schema Linking

  • Data Model Change Impact Analysis

Data Model Discovery and Inference from Schemas, Data and APIs

Spear implements a rich set of data modelling abstractions and can accurately infer structures from a wide variety of sources, including:

  • Data Schemas - such as XSD, JSON Schema, AVDL, ProtoBuf and the DDL of relational databases.

  • APIs - REST APIs (for instance using Swagger or the OpenAPI standard) or the programming interfaces of libraries written in languages such as Java, C++ and Python.

  • Data - Raw sample data in a format such as CSV, XML, JSON or YAML.

Given some combination of the above sources, a data model definition is 'fitted' and generated.
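
Spear’s own inference engine is not reproduced here; the sketch below illustrates the general 'fitting' idea under simple assumptions, inferring field names and plausible column types from raw CSV sample data (the infer_type and infer_model_from_csv helpers are hypothetical, not a Spear API):

```python
import csv
import io

def infer_type(values):
    """Guess the narrowest type that accepts every observed sample value."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False
    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False
    if all(is_int(v) for v in values):
        return "int"
    if all(is_float(v) for v in values):
        return "decimal"
    return "string"

def infer_model_from_csv(raw_csv: str) -> dict:
    """Fit a simple {field: type} model definition from raw CSV sample data."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    return {field: infer_type([row[field] for row in rows]) for field in rows[0]}

sample = """trade_id,price,quantity,side
T-0001,101.25,1000,Buy
T-0002,99.80,250,Sell
"""
print(infer_model_from_csv(sample))
# {'trade_id': 'string', 'price': 'decimal', 'quantity': 'int', 'side': 'string'}
```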

Data Model Consistency with Compiler Technology

The Spear compiler performs a series of automated checks and validations on the model to ensure that it is consistent and coherent.

The use of compiler technology to streamline the overall process is a cornerstone of the TeraHelix approach.
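
The checks themselves belong to the Spear compiler and are not reproduced here; the following is a minimal sketch of the kind of mechanical validation such a compiler pass can apply, assuming a deliberately simplified in-memory model representation (the Field and Entity classes and the KNOWN_TYPES set are illustrative only):

```python
from dataclasses import dataclass

# Simplified in-memory representation of a model definition; a real compiler
# would operate on a parsed syntax tree rather than plain dataclasses.
@dataclass
class Field:
    name: str
    type_name: str

@dataclass
class Entity:
    name: str
    fields: list

KNOWN_TYPES = {"string", "int", "decimal", "date", "bool"}

def check_entity(entity: Entity) -> list:
    """Compiler-style pass: report inconsistencies rather than silently accepting them."""
    errors = []
    seen = set()
    for field in entity.fields:
        if field.name.lower() in seen:                 # duplicate (case-insensitive) field names
            errors.append(f"{entity.name}.{field.name}: duplicate field")
        seen.add(field.name.lower())
        if field.type_name not in KNOWN_TYPES:         # reference to an unknown/undeclared type
            errors.append(f"{entity.name}.{field.name}: unknown type '{field.type_name}'")
    return errors

trade = Entity("Trade", [Field("trade_id", "string"),
                         Field("price", "decmial"),    # typo caught by the type check
                         Field("Trade_ID", "string")]) # duplicate caught by the name check
for err in check_entity(trade):
    print(err)
```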

Data Model Attribute, Tag and Annotation Management

While automated data model discovery is able to identify all the fields that make up the data model, it may still be necessary for subject matter experts to augment it with additional context, comments and tags.

Spear offers users the ability to 'mix in' comments, annotations and tags that are maintained separately from the generated model definitions. The compiler subsequently ensures that no user content ever becomes 'unlinked' as the model changes and that consistency is always maintained.
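
Spear’s 'mix in' mechanism itself is not shown here; the sketch below illustrates the underlying idea under simple assumptions: human-supplied annotations live in an overlay keyed by field path, separate from the generated model, and a mechanical check flags any annotation left dangling by a model change (all field paths and the unlinked_annotations helper are hypothetical):

```python
# Machine-owned generated model versus human-owned annotation overlay.
# Keeping them separate means regeneration never overwrites expert commentary,
# while a mechanical check catches annotations left dangling by model changes.
generated_model = {
    "Trade.trade_id": "string",
    "Trade.price": "decimal",
    "Trade.side": "string",
}

annotations = {
    "Trade.price": {"comment": "Clean price, excludes accrued interest", "tags": ["pii:none"]},
    "Trade.notional": {"comment": "Deprecated in v2", "tags": []},  # field no longer exists
}

def unlinked_annotations(model: dict, overlay: dict) -> list:
    """Return annotation keys that no longer resolve to a field in the model."""
    return [path for path in overlay if path not in model]

for path in unlinked_annotations(generated_model, annotations):
    print(f"annotation on '{path}' no longer matches any generated field")
```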

Data Model Referential Schema Linking

Having a clear picture of the references within and between data model definitions is key to understanding the data estate. Spear establishes linkages in the data model through a hierarchy of automated discovery processes:

  • Evaluation of declared references. E.g., relational database foreign keys and XSD schema references.

  • Equivalence of field names. E.g., two seemingly unrelated fields both called 'trade_id'.

  • Fuzzy matching of field names. E.g., determining that two seemingly unrelated fields are in fact matches, using a combination of fuzzy matching algorithms such as Levenshtein distance or lemmatisation (a simplified, name-based sketch follows this list).

  • Utilisation of Markov logic networks to establish the likelihood of data fields being related based on their content.
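
Spear’s linking hierarchy is richer than a short example can show; the sketch below covers only the two name-based steps, exact equivalence and Levenshtein-style fuzzy matching, with an illustrative distance threshold and made-up field lists (content-based techniques such as Markov logic networks are out of scope here):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def candidate_links(fields_a, fields_b, max_distance=3):
    """Propose links between two field lists: exact name matches first, then fuzzy ones."""
    links = []
    for fa in fields_a:
        for fb in fields_b:
            if fa == fb:
                links.append((fa, fb, "exact"))
            elif levenshtein(fa, fb) <= max_distance:
                links.append((fa, fb, "fuzzy"))
    return links

settlement_fields = ["trade_id", "price", "quantity"]
risk_fields = ["trade_idr", "qty", "price", "trade_id"]
for fa, fb, kind in candidate_links(settlement_fields, risk_fields):
    print(f"{fa} <-> {fb} ({kind})")
# 'quantity' vs 'qty' is too far apart for raw edit distance (distance 5),
# so abbreviation rules or content-based matching would be needed for that pair.
```

Note that raw edit distance alone does not catch abbreviation pairs such as 'quantity' and 'qty', which is exactly why a hierarchy of complementary techniques is needed.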

Data Model Change Impact Analysis

Enterprises and their data models are continuously evolving and any data model cataloguing exercise has to match that pace of change in order to remain relevant and effective.

With each iteration of the continuous integration cycle, new Spear models are generated. The Spear compiler then evaluates these new models against previous versions to provide the following impact analysis (a simplified sketch of such a version diff follows this list):

  • Common Data Model Definition Mistakes - detection of mistakes such as typos, incorrect naming and other frequently recurring definition errors.

  • Version Incompatibilities - cases where a new candidate model from a testing environment would not be compatible with data in the production environment, risking data corruption and/or loss.

  • Degradation in Data Quality - cases where a new model version is likely to increase the chance of data quality issues in the future.

  • Duplicate Field Introduction - cases where a new model introduces a duplicate field definition.

  • Compatibility Warnings - the identification of constructs that cannot be accurately represented across all the technologies in the data estate.
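
The compiler’s full impact analysis is considerably richer; the sketch below shows only the basic mechanics of diffing two model versions for removed fields, type changes and likely duplicates, using hypothetical field names and a deliberately crude duplicate heuristic:

```python
def analyse_impact(previous: dict, candidate: dict) -> list:
    """Compare two {field: type} model versions and report potentially breaking changes."""
    findings = []
    for name, old_type in previous.items():
        if name not in candidate:
            findings.append(f"INCOMPATIBLE: '{name}' removed -- existing data would be orphaned")
        elif candidate[name] != old_type:
            findings.append(f"INCOMPATIBLE: '{name}' changed type {old_type} -> {candidate[name]}")
    normalised_previous = {n.replace("_", "").lower() for n in previous}
    for name in candidate:
        if name not in previous:
            # crude duplicate heuristic: a new field whose normalised name collides with an old one
            if name.replace("_", "").lower() in normalised_previous:
                findings.append(f"WARNING: '{name}' looks like a duplicate of an existing field")
            else:
                findings.append(f"INFO: new field '{name}' added")
    return findings

previous_model  = {"trade_id": "string", "price": "decimal", "side": "string"}
candidate_model = {"trade_id": "string", "price": "string",      # type narrowed to string
                   "Trade_Id": "string", "settle_date": "date"}  # duplicate + new field; 'side' dropped

for finding in analyse_impact(previous_model, candidate_model):
    print(finding)
```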

Conclusion

An accurate data model catalogue provides the enterprise with significant benefits. For the catalogue to remain relevant, it is essential that it is maintained by an active, automated process. TeraHelix Spear provides enterprises with the tools required to streamline the creation of a data model and to ensure it remains accurate as the enterprise evolves.