Why Protobuf Isn't Enough

Protocol Buffers (Protobuf) is a platform-neutral serialization library implemented and widely used by Google. However, once you start working with Protobuf you quickly realise that the trade-off for high-performance serialization is accepting severe restrictions on how you can model your data, a tendency for Protobuf constructs to 'leak out' all over your codebase, and new challenges in managing the evolution of your data model over time.

By combining Spear, a rich, implementation-agnostic data modelling infrastructure, with Protobuf, an optimised low-level serialization library, we are able to get the best of both worlds. Spear allows you to define your data model (your most important asset) at the correct abstraction level, while still letting you take advantage of whichever third-party library is most appropriate for serialization or other low-level implementation tasks.

Challenges of Using Protobuf In Isolation

While Protobuf remains a very popular choice, it is also the case that after several years of the library being pushed to its limits, a number of weaknesses in the underlying infrastructure have started to emerge [1]. Developers tend to make the mistake of overloading Protobuf by trying to make it function as both a data modelling layer and a serialization layer. At first glance this appears to make sense - it provides a cross-platform mechanism for defining data structures, with good serialization performance to boot.

However, as projects mature and data models become more complex, the limitations of Protobuf as a data modelling tool quickly become apparent:

No Required or Defaultable Fields - Everything is Optional

Strange as it may seem, Protobuf has explicitly removed support for 'required' and 'optional' fields in the definition - https://github.com/protocolbuffers/protobuf/issues/2497. The main argument as to why required fields are 'considered harmful' is that making everything optional allows you to add and remove fields from the definition while still 'being fully forward and backward compatible with newer and older binaries'.

This argument does not hold up. While this attitude superficially makes the life of the data model author easier, it causes an incredible amount of misery for anyone downstream who depends on the data to do something useful. How do you know which fields are mandatory? Do you end up duplicating 'completeness check' code in every single consumer of the data? A field changing or suddenly disappearing from your data definition will inevitably cause application logic that depends on it to fail (or at the very least produce undefined results).
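As a minimal sketch of that duplication, assume a generated proto3 message named EquityReturn with a security_id string field and a repeated returns field (all names hypothetical). Because the schema cannot declare either field as required, every consumer ends up re-implementing a check along these lines:

```java
import com.google.protobuf.InvalidProtocolBufferException;

// Hypothetical generated proto3 message 'EquityReturn' with fields
// security_id (string) and returns (repeated IndividualReturn).
public final class EquityReturnGuard {

    // The completeness check every consumer has to duplicate: an absent string
    // field simply deserializes to "" and an absent repeated field to an empty
    // list, so the schema itself cannot express "mandatory".
    public static void requireComplete(EquityReturn ret) {
        if (ret.getSecurityId().isEmpty()) {
            throw new IllegalArgumentException("security_id is mandatory but missing");
        }
        if (ret.getReturnsCount() == 0) {
            throw new IllegalArgumentException("at least one return entry is expected");
        }
    }

    public static EquityReturn parseAndCheck(byte[] bytes) throws InvalidProtocolBufferException {
        EquityReturn ret = EquityReturn.parseFrom(bytes);
        requireComplete(ret);
        return ret;
    }
}
```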

While properly managing the version evolution of your data model requires you to address a number of challenges [2], it is simply not credible to claim that making every attribute optional automatically solves this hugely important and complex issue.

Lack of Inheritance, Polymorphism and Generics

Another surprise when you first start working with Protobuf is that the definition language has no support for many polymorphic features that seasoned developers take for granted. While you may find many 'techniques' (or hacks, depending on your point of view) [3] to work around these limitations, they ultimately mean compromising on the definition of your data model.

Having these features as a core part of your model, by contrast, makes the contract with downstream consumers clear and unambiguous. For instance, the Shared Market Data application contains some models for the definition of historic and real-time stock price returns [4]:

Table 1. Stock Price Returns Model in Spear - definitions for Market Data Equity Returns, Equity Stock and Market Data Equity Individual Return

From the definitions above, it is clear that a Return type shares the same identifier (key) as an Equity Stock type, which is itself a variant of Security - as per the Conventions.Security tagging interface. The Return type consists of IndividualReturn items that are themselves keyed by a business date.
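For readers without access to the Spear listings, the following is a rough Java sketch (not Spear syntax, and with hypothetical names) of the relationships those definitions express - shared keys, a tagging interface and date-keyed return items:

```java
import java.time.LocalDate;
import java.util.Map;

// Rough Java illustration (not Spear syntax) of the relationships described
// above; all type and member names are hypothetical.
final class ReturnsModelSketch {

    // Tagging interface, playing the role of Conventions.Security above.
    interface Security {}

    // Keyed entities carry their key type as part of the contract.
    interface Keyed<K> {
        K key();
    }

    // An equity stock is a Security, keyed by its listing identifier.
    record EquityStock(String listingId) implements Security, Keyed<String> {
        @Override public String key() { return listingId; }
    }

    // A single observed return, keyed by business date.
    record IndividualReturn(LocalDate businessDate, double value) implements Keyed<LocalDate> {
        @Override public LocalDate key() { return businessDate; }
    }

    // The Return type shares the same identifier (key) as the EquityStock it describes
    // and consists of IndividualReturn items keyed by business date.
    record EquityReturns(String listingId, Map<LocalDate, IndividualReturn> returns)
            implements Keyed<String> {
        @Override public String key() { return listingId; }
    }
}
```

The point is that the key types and the Security tag are part of the model itself, rather than a convention documented elsewhere.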

Contrast the Spear definitions with the equivalent definitions in Protobuf 3 (the examples below are based on what the Spear-Proto plugin generates, albeit without state management and reified collection types):

Table 2. Stock Price Returns Model in Protobuf - definitions for Market Data Equity Returns, Equity Stock and Market Data Equity Individual Return

There is a lot of schematic information missing from this definition. For instance, the type of the keys of the IndividualReturns map is ambiguous (Protobuf does not permit user-defined types for map keys), not to mention the missing tagging interface and the identifiers of the entities. Hopefully you will also never need to add or remove fields from your definitions - this may require a 'renumbering exercise' reminiscent of the GOTO-style programming of Basic and Fortran.

Ultimately the schema is missing a lot of vital information that ends up having to be encoded 'elsewhere'. The result is that developers are led into the trap of creating a 'shadow model' (see below for more details).

Indispensable Data Modelling Constructs

The limitations mentioned above are merely the obvious items that are immediately apparent to most developers. Beyond the basics, there are several Spear features that, once you start using them in your projects, quickly become indispensable. Here is a high-level summary of some of my favourites [5]:

  • Propstructs - Like an enum, but more powerful and extensible. Useful in translating complex specifications, such as the Consolidated Audit Trail, into an actionable data model.

  • Relations - Bi-directional object-to-relational mapping of your data model, enabling functionality from tools such as SQL, Apache Spark [6] or similar data-frame technologies such as Python Pandas [7] or R [8].

  • Extensions like Converters and Validators - Spear allows you to 'plug in' extensions to your data model so that you can meet requirements such as converting inline to different structures (e.g. type versions) or performing advanced data validations [9].

Beware of the 👤 Shadow Data Model 👤.

Developers typically respond to an inadequate data model by adding 'small tweaks' at various points within the system to make up for the features that are missing or ambiguous. This technical debt accumulates quietly in the background until it eventually becomes a serious issue.

For instance, take the missing date information from the IndividualReturns map in the example above. Each of the developer teams working on the different layers of the application may be tempted to 'augment' the string key of the map to make it clear that it is a date. The most straightforward way to 'correct' this is to take the received Protobuf object and either wrap or copy it into a new (shadow) structure or 'shim' [10].

Assume you have three layers in the application - a web layer written in Angular/TypeScript, a backend written in Java and an analytics layer written in Python. Even in this trivial application you already have three versions of effectively the same thing, as the sketch below illustrates for the Java layer. Now consider what happens when you scale up to more types and more consumers - before you know it you have an entire codebase where the majority of the work is simply copying one data structure to another. That is a lot of manual work if you ever want to change your model!
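A minimal sketch of such a shim on the Java side, assuming a generated EquityReturns message whose individual_returns map uses an ISO-8601 date string as its key (all names hypothetical):

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical shadow structure: wraps the generated Protobuf message purely to
// recover the fact that the map key is a business date rather than an arbitrary string.
public final class EquityReturnsShim {

    private final Map<LocalDate, IndividualReturn> returnsByDate = new TreeMap<>();

    public EquityReturnsShim(EquityReturns proto) {
        // Re-key the map, assuming the producer used ISO-8601 strings such as "2023-06-30".
        // Nothing in the schema documents or enforces this convention.
        for (Map.Entry<String, IndividualReturn> entry
                : proto.getIndividualReturnsMap().entrySet()) {
            returnsByDate.put(LocalDate.parse(entry.getKey()), entry.getValue());
        }
    }

    public Map<LocalDate, IndividualReturn> returnsByDate() {
        return returnsByDate;
    }
}
```

The TypeScript and Python teams end up writing the same adapter in their own languages, and every copy has to be kept in sync with the schema by hand.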

Spear avoids the need for shadow data models (and associated pitfalls) by providing rich abstractions and extensions in the canonical data model itself and then allowing the physical implementation of that model to be generated in the language, platform and framework that makes the most sense given the current requirements.

Best of Both Worlds - Spear Models the Data, Protobuf Moves the Data

Spear provides extensive logical data modelling capabilities and Protobuf provides highly optimised low level serialization capabilities. The combination of these two technologies makes for a powerful solution where you are no longer required to compromise the quality of your data model to obtain performance improvements. You can have the best of both worlds!

In a sense, for Spear to use Protobuf as its serialization layer is not that dissimilar from using JSON, XML or YAML - it is simply another implementation 'plug in'. However, with its rich type system Spear is able to go beyond what is achievable if the project only uses Protobuf. In particular:

  • Reification of Collections and Interfaces [11] - Tempting as it may be to represent interface references and collections using the any construct in Protobuf, this comes with a significant performance penalty. Instead, Spear uses its type information to generate concrete definitions for all possible permutations. For instance, an attribute referencing an interface is translated to a oneof over all the implementations in the data definition - allowing the Protobuf serializer to perform additional optimisations (see the sketch after this list for how this looks from the consumer's side).

  • Cycle Resolution - The Protobuf compiler has a number of ad-hoc idiosyncrasies that are not always clearly documented. For instance, consider the situation where A.proto imports B.proto, which imports C.proto, which in turn references A.proto again. Attempting to compile this results in the error: File recursively imports itself: A.proto → B.proto → C.proto → A.proto (a situation most other compilers have handled gracefully for some time now). However, if all of these definitions are placed in the same file, compilation succeeds. The Spear Protobuf generator hides these and other eccentricities, making sure that the resultant *.proto files and generated classes remain an implementation detail that the majority of developers never need to spend any time on.

  • Augmented Types - Protobuf lacks some basic types that tend to be required by most projects, such as temporal / datetime constructs. Spear augments the Protobuf definitions it generates with a number of these types, filling in the gaps in what base Protobuf provides.
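To illustrate the reification point from the consumer's side, here is a rough Java sketch with hypothetical message, field and accessor names: with google.protobuf.Any the payload has to be probed and unpacked at runtime, whereas a reified oneof gives an exhaustive, statically known set of cases:

```java
import com.google.protobuf.Any;
import com.google.protobuf.InvalidProtocolBufferException;

// Hypothetical generated types: SecurityHolder carries either an 'Any security' field
// (generic approach) or a 'oneof security { EquityStock equity_stock; BondInstrument bond; }'
// field (reified approach). EquityStock and BondInstrument are hypothetical messages.
public final class SecurityDispatch {

    // Generic approach: runtime probing and unpacking of an Any payload.
    static String describe(Any payload) throws InvalidProtocolBufferException {
        if (payload.is(EquityStock.class)) {
            return "equity " + payload.unpack(EquityStock.class).getListingId();
        } else if (payload.is(BondInstrument.class)) {
            return "bond " + payload.unpack(BondInstrument.class).getIsin();
        }
        return "unknown payload: " + payload.getTypeUrl();
    }

    // Reified approach: the oneof case enum is exhaustive and known at compile time.
    static String describe(SecurityHolder holder) {
        switch (holder.getSecurityCase()) {
            case EQUITY_STOCK:
                return "equity " + holder.getEquityStock().getListingId();
            case BOND:
                return "bond " + holder.getBond().getIsin();
            default:
                return "no security set";
        }
    }
}
```

The oneof variant also avoids the type URL and nested payload that Any stores alongside each message, which accounts for much of the overhead.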

Performance

No blog on Protobuf would be complete without also discussing the significant performance improvements on offer when using a highly optimised binary serialization scheme instead of a text-based scheme such as JSON [12].

For this test we have taken the Market Data Equity Return objects (as per the Protobuf / Spear examples above) from the Shared Market Data application. The object contains the returns for the last 12 months on a portfolio of 1,356 stocks. A typical instance of an individual return can be seen in this IBM Example.

All benchmarks were performed using in-memory byte arrays to eliminate noise from I/O devices. In addition, a number of 'warm-up' round-trip executions are performed in the JVM before any metrics are recorded.
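For readers who want to validate the numbers against their own data, a minimal harness along the following lines captures the methodology; EquityReturns is a hypothetical generated message standing in for the portfolio, and JsonFormat is Protobuf's own JSON utility:

```java
import com.google.protobuf.util.JsonFormat;

// Minimal round-trip benchmark sketch: in-memory byte arrays only, with warm-up
// iterations before any timing. EquityReturns is a hypothetical generated message.
public final class RoundTripBenchmark {

    public static void main(String[] args) throws Exception {
        EquityReturns portfolio = loadPortfolio();  // assumed to be provided elsewhere

        // Warm-up: let the JIT compile the serialization paths before measuring.
        for (int i = 0; i < 1_000; i++) {
            EquityReturns.parseFrom(portfolio.toByteArray());
        }

        // Size comparison: binary Protobuf vs. JSON text of the same object.
        byte[] binary = portfolio.toByteArray();
        byte[] json = JsonFormat.printer().print(portfolio).getBytes();
        System.out.printf("protobuf: %,d bytes, json: %,d bytes%n", binary.length, json.length);

        // Timed round trips, entirely in memory to keep I/O noise out of the numbers.
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            EquityReturns.parseFrom(portfolio.toByteArray());
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("total protobuf round-trip time: %,d ms%n", elapsedMs);
    }

    private static EquityReturns loadPortfolio() {
        throw new UnsupportedOperationException("load the 1,356-stock test portfolio here");
    }
}
```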

Size - Spear Protobuf vs. Spear JSON

Storing the returns of the 1,356 stocks in Protobuf offers significant space gains - it is only ~6% of the size of the same portfolio represented in JSON.

Chart: Total Size (MB)

Serialization Round-trip Times - Spear Protobuf vs. Spear JSON

While the cost of JSON serialization for an individual object seems negligible (on average 0.947 milliseconds for serialization and 1.77 milliseconds for deserialization), this rapidly accumulates once you scale up the number of objects. Here the use of Protobuf again offers significant improvements.

Chart: Total Serialization Round-trip Time (milliseconds)

Conclusion

Trading off a rich data model for high performance, or vice versa, is a false dichotomy. By combining the data modelling capabilities of Spear with the low-level serialization optimisations of Protobuf you are able to get the best of both worlds.

There is significant value in your enterprise’s data model - it has all your intellectual property locked up inside it. Don’t compromise it by using deficient abstractions in its definition.

Any thoughts, queries and comments are, as always, welcome. Feel free to let us know at https://www.terahelix.io/contact/index.html


1. Perhaps one of the more robust rebuttals of Protobuf can be found here: http://reasonablypolymorphic.com/blog/protos-are-wrong/
2. Spear provides a number of tools that assist with version evolution. In summary, you can either convert your data in transit for older consumers or do a bulk migration exercise where all data models are updated across the board. In practice, you will probably need a combination of these two approaches; Spear's converters functionality enables both.
3. This question on Stack Overflow is a good example of the type of questions that come up from time to time.
4. For more information you can consult this reference.
5. This list, of course, is not exhaustive. Each of the features listed probably deserves a blog of its own.
6. Apache Spark Homepage - https://spark.apache.org/
7. Python Pandas Homepage - https://pandas.pydata.org/
8. R Project Homepage - https://www.r-project.org/
9. Integrating third-party data tools becomes easier when your baseline is Spear. Refer to the Automatic Data Quality Metrics blog for more information.
12. To many seasoned developers this may be well-trodden territory; however, it is always advisable to validate your assumptions with tests on your own data.