Data Serialisation – Avro vs Protocol Buffers

Background

Typical Data Flow
Data flows from frontend applications to the streaming layer in a row-oriented format. For analytical purposes, however, a column-oriented format is preferred, so the data must be transformed out of its row-oriented form after being pulled from the stream.

Row vs Column Oriented
A refresher on row-oriented and column-oriented formats: (Reference Link)

File Formats Evolution

Big Data File Formats Summary

Why not use CSV/XML/JSON?

  1. Meta information is either repeated on every record or missing entirely.
  2. Files are not splittable, so they cannot be used in a map-reduce environment.
  3. Missing or limited support for schema definition and evolution.
    1. For JSON, "JsonSchema" can be leveraged to maintain the schema separately (a minimal example follows this list).
    2. Data may still require transformation based on a schema, so why not consider Avro/Proto instead?
  4. No native compression of repeating values and no indexing capabilities. Binary JSON supports indexing. Link
  5. High bandwidth consumption.
  6. Poor space utilisation.
  7. Poor performance.
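For illustration, a hypothetical JsonSchema document for a simple person record might look like the following (field names chosen to match the Avro and Proto examples later in this post):

{
    "type": "object",
    "properties": {
        "userName":        {"type": "string"},
        "favouriteNumber": {"type": ["integer", "null"]},
        "interests":       {"type": "array", "items": {"type": "string"}}
    },
    "required": ["userName"]
}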

JSON Example
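A representative record might look like this; note how every record repeats its field names as plain text, which is exactly the overhead described in points 1 and 4 above:

{
    "userName": "Alice",
    "favouriteNumber": 42,
    "interests": ["hiking", "chess"]
}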


Important Terminologies

  • Serialisation → The process of converting objects such as arrays and dictionaries into byte streams that can be efficiently stored and transferred elsewhere.
  • Deserialisation → Using a byte stream to get the original objects back.
  • Backward Compatibility → A new version of the software can read data written by an old version.
  • Forward Compatibility → An old version of the software can read data written by a new version.
  • Schema Evolution → Schema evolution allows you to update the schema used to write new data while maintaining backward compatibility with the schema(s) of your old data. You can then read it all together, as if all of the data had one schema. Of course, there are precise rules governing the changes allowed in order to maintain compatibility; those rules are listed under Schema Resolution. (A small example follows this list.)
  • Registering and using schemas → See the Schema Registry links in the references.
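As a small, hypothetical example of schema evolution: suppose version 2 of the Person record (defined in the Avro section below) adds an email field with a default value. Readers using the new schema can still decode old records by filling in the default, and readers using the old schema simply skip the unknown field:

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}},
        {"name": "email",           "type": "string", "default": ""}
    ]
}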

Avro and ProtoBuf

Both are language-neutral data serialisation systems that rely on a schema-based approach.

Avro

Apache Avro was released by the Hadoop working group in 2009. It is a row-based format that is highly splittable. It is also described as a data serialisation system similar to Java Serialization. The schema is stored in JSON format, while the data is stored in binary format, which minimises file size and maximises efficiency.

  • The schema is stored along with the Avro data in a file for any further processing.
  • A separate schema registry can also be maintained:
    1. In RPC, the client and the server exchange schemas during the connection.
    2. Schema-less Avro is used heavily in HDFS network communication (the serialiser and deserialiser know the data schema in advance).
  • Avro supports both dynamic and static typing, as required:
    1. Dynamic typing works without a static compilation step and gives greater interoperability with dynamic languages.
    2. Code generation is nonetheless still available in Avro for statically typed languages, as an optional optimisation.
  • Avro supports primitive (int, boolean, string, float, etc.) and complex (enums, arrays, maps, unions, etc.) types. Link
  • Avro creates a binary structured format that is both compressible and splittable. Hence it can be used efficiently as the input to Hadoop MapReduce jobs, which is why it is built into the Hadoop ecosystem.
  • Schema Definition Example:
{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}}
    ]
}

…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}
  • Strings are just a length prefix followed by UTF-8 bytes. For example, "foo" is encoded as the single byte 0x06 (the zigzag-encoded length 3) followed by the three UTF-8 bytes for f, o and o.
[Figure: Avro data internals]
  • Avro workflow (a minimal Python sketch follows this list):
    • Serialisation
      1. Create the Avro schema → .avsc file.
      2. Read this schema in your code using a schema parser.
      3. Write the Avro data file.
    • Deserialisation
      • Without an external schema: read the file directly; the schema is embedded in the file header.
      • With an external schema:
        1. Read the schema, e.g. using the schema parser in Spark.
        2. Open the Avro file using the schema definition from step 1.
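A minimal sketch of this workflow in Python, using the fastavro library mentioned in the references; the file name and record values are illustrative:

from fastavro import parse_schema, reader, writer

# The Person schema from the example above, parsed once and reused.
schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}},
    ],
})

# Serialisation: the schema is written into the file header alongside the data.
records = [{"userName": "Alice", "favouriteNumber": 42, "interests": ["hiking"]}]
with open("people.avro", "wb") as out:
    writer(out, schema, records)

# Deserialisation without an external schema: the reader recovers the schema
# from the file header.
with open("people.avro", "rb") as inp:
    for person in reader(inp):
        print(person)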

Protobuf

Protocol Buffers, usually referred to as Protobuf, is a protocol developed by Google to allow serialisation and deserialisation of structured data. Google developed it with the goal of providing a better way than XML for systems to communicate.

  • Developed by Google and open sourced in 2008.
  • Protobuf is easy to use in microservices, especially where performance and interoperability are important.
  • The schema has to be maintained separately.
  • It only supports static types.
  • Supports more complex data types than Avro. Link
  • Schema Definition Example:
message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
}
  • Each field starts with a byte that indicates its tag number (the numbers 1, 2, 3 from the schema) and the type of the field. For example, the user_name field (tag 1, length-delimited wire type 2) is encoded with the key byte 0x0A, i.e. (1 << 3) | 2.
    • This leads to a larger file size compared to Avro (once the volume of data is large relative to the schema that Avro stores).
[Figure: Protobuf data internals]
  • Protobuf workflow (a minimal Python sketch follows this list):
    • Serialisation
      1. Create the proto schema → .proto file.
      2. Compile the proto schema using the "protoc" compiler for the target language.
      3. Read the schema definition by importing the compiled class generated in the previous step.
      4. Write the proto binary data file.
    • Deserialisation
      • Without an external schema ← not supported.
      • With an external schema:
        1. Import the compiled class generated by the compiler, to get the schema definition.
        2. Open the binary file using the schema definition from step 1.
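A minimal sketch of this workflow in Python. It assumes the Person message above is saved as person.proto and has already been compiled with "protoc --python_out=. person.proto", which generates the person_pb2 module imported below:

import person_pb2  # generated by protoc from person.proto

# Serialisation: populate the generated message class and write its bytes.
person = person_pb2.Person()
person.user_name = "Alice"
person.favourite_number = 42
person.interests.append("hiking")
with open("person.bin", "wb") as out:
    out.write(person.SerializeToString())

# Deserialisation: the schema never travels with the data, so the same
# generated class is required to parse the bytes back.
restored = person_pb2.Person()
with open("person.bin", "rb") as inp:
    restored.ParseFromString(inp.read())
print(restored.user_name, restored.favourite_number, list(restored.interests))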
[Figure: Protobuf workflow]

Avro vs ProtoBuf Comparison Summary

Feature | Question | Avro | ProtoBuf

General
Storage type | How is data stored (row/columnar format)? | Row | Row
OLAP/OLTP | Efficient for an OLTP or OLAP environment? | OLTP | OLTP
Stream | Efficient for streaming applications? | Yes | Yes
RPC interfaces | Does it support an RPC interface? | Yes | Yes; best with gRPC
License | | Apache | BSD-style
Language-neutral, platform-independent | | Yes | Yes
Ecosystems | Preferred and widely used in what kind of ecosystem? | Big Data and streaming (Kafka) | RPC and Kubernetes (ActiveMQ, Google)
Codebase / development effort | Which one requires less coding and maintenance effort? | Simpler | Comparatively complex
Performance | Which one generates more compact data encoding and faster data processing? | Slightly slower, due to its simplicity | Fastest amongst all

Schema
Schema enforcement | Can we enforce an external schema? | Yes | Yes
Schema support and definition | How is the schema definition provided? | Defined in JSON, IDL, or the SchemaBuilder fluent API (Java) | IDL
Schema evolution | Does it support backward and forward compatibility in schema evolution? | Supported (Backward, Forward, Full, None; up to the user to implement) | Supported (Backward, Forward, Full, None; up to the user to implement)
Dynamic typing / schema | Can we parse data without an external schema definition? | Supported | Not supported
Documentation generation tools | Is there a tool to generate documentation from the schema definition? | AvroDoc (Todo: look for an updated one) | protoc-gen-doc: produces HTML, PDF, DocBook
PII tagging | Can we mark PII data in the schema definition itself? | Yes (example links: 1) | No (Todo: check further)

Data File
Compilation needed? | Do we need to compile the schema first to generate a data file? | Not required | Required
Schema available? | Does the data file hold the schema in itself? | Yes, in the header section; it can be removed as well | No
Splittable | Can we split the data file for use in the MapReduce world? | Yes | No
Compression | Can we compress the data file? | Yes | No
Data file viewing options | Can we view a data file on its own using any tool? | Yes; a schema definition file is not mandatory | Yes; a schema definition file is mandatory, so the process is a bit complex
Concatenation | Can we concatenate multiple data files with the same schema? | Yes | No

Data Types
Enumerations | Does it support enums? | Yes | Yes
Constants | Can we have constants? | No | No
Optional fields | Can we mark a field as optional? | No; workaround: use a union type such as union { null, long }, or the default attribute | Yes
Unsigned types | Does it support unsigned data types? | No | Yes
Default values supported? | Can we provide default values for fields? | Yes | Yes
Default value mandatory? | Is a default value mandatory? | No, but as a best practice it should be provided to support backward compatibility | No
Timestamps | Are timestamps supported? | Yes, but stored as a 64-bit signed integer (Todo: check further) | Yes (Todo: check further)
Deprecation | Is there support for marking a field as deprecated? | No (workaround links: 1) | Yes
Private fields | Is there an option to mark a field as private? | No (workaround links: 1) | No

Developer Queries
Field matching | How are fields matched while deserialising? | Fields are matched by name | Fields are matched by tag number
Tools availability | What tools are available from a developer's perspective? | Yes (example: 1) | Yes (example: 1)
Extension capabilities | Can we extend features if required? | Easier (Java) | Core compiler: more difficult (C++)
Inheritance and polymorphism | Is it possible to build a new data type using inheritance? | No (workaround links: 1, 2) | Yes (links: 1)
Debugging | Which one is more complex to debug? | Easy: the data file has the schema built in, so it can be viewed using multiple third-party tools | Complex
Snowflake support | Can we load data directly into Snowflake? | Supported | Not supported
Language support | Which languages are supported? | C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, Rust, Scala; not supported natively: Dart, Kotlin (plugin: https://github.com/avro-kotlin/avro4k), Objective-C; others through third-party plugins | C++, C#, Dart, Go, Java, JavaScript, Kotlin, Objective-C, PHP, Python, Ruby; not supported natively: C, Elixir, Haskell, Perl, Rust, Scala; others through third-party plugins

Conclusion

Avro seems a better fit for Big Data use cases, as it is widely used across multiple frameworks. Splittability, schema stored along with the data, and native compression are its major advantages over Protocol Buffers.

Protobuf is easy to use in microservices, especially where performance and interoperability are important, and it is superior to Avro in this area.


Reference Links

Links | Helpful Details
https://avro.apache.org/docs/1.2.0/spec.html | Avro documentation
https://avro.apache.org/docs/current/ | Apache Avro page
https://thetechsolo.wordpress.com/2015/01/17/apache-avro-schema-less-serialization-how-to/ | Schema-less Avro
https://towardsdatascience.com/csv-files-for-storage-absolutely-not-use-apache-avro-instead-7b7296149326 | Avro and FastAvro libraries
https://data-flair.training/blogs/avro-schema/ | Avro schema
https://www.youtube.com/watch?v=UAg0Fo8pdi0 |
https://garystafford.medium.com/previewing-apache-avro-files-in-amazon-s3-98f41e98f656 | Step-by-step details of Avro
https://avro.apache.org/docs/current/spec.html | Avro schema rules
https://docs.confluent.io/platform/current/schema-registry/avro.html | Schema evolution and compatibility
http://radar.oreilly.com/2014/11/the-problem-of-managing-schemas.html |
https://www.tutorialspoint.com/avro/avro_overview.htm | Avro details
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html | Schema evolution in Avro, Protocol Buffers and Thrift
https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/ | Comparison of all file formats
https://www.bizety.com/2019/04/02/data-serialization-protocol-buffers-vs-thrift-vs-avro/ | Data serialisation and comparison of Avro, Thrift and Proto
https://cristian-matei-toader.medium.com/compressing-a-year-of-reddit-with-apache-avro-and-google-protobuf-c9e40cf90444 | Nice explanation of Protobuf vs Avro (2020 article)
https://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro | PB vs Thrift vs Avro
https://blog.softwaremill.com/the-best-serialization-strategy-for-event-sourcing-9321c299632b | Ser/De strategies for event sourcing
https://mnwa.medium.com/what-the-hell-is-protobuf-4aff084c5db4 | Proto vs JSON benchmarking
https://stackoverflow.com/questions/2935527/alternatives-to-protocol-buffers | Proto alternatives
https://puredanger.github.io/tech.puredanger.com/2011/05/27/serialization-comparison/ | Comparison of Protobuf, Thrift, Avro
http://blog.mirthlab.com/2009/06/01/thrift-vs-protocol-bufffers-vs-json/ | Thrift vs Protocol Buffers vs JSON
https://www.igvita.com/2011/08/01/protocol-buffers-avro-thrift-messagepack/ | Protocol Buffers, Avro, Thrift & MessagePack
https://www.codingblocks.net/programming/why-avro/ | Avro vs Proto
https://auth0.com/blog/beating-json-performance-with-protobuf/ | Proto vs JSON
https://www.alibabacloud.com/blog/an-introduction-and-comparison-of-several-common-java-serialization-frameworks_597900 | Common Java serialisation frameworks
https://www.niyuj.com/data-serialization-why-choose-protocol-buffers-over-apache-avro/ |
https://www.farfetchtechblog.com/en/blog/post/protobuf-lab-session/ | Detailed Protobuf explanation
https://www.ionos.co.uk/digitalguide/websites/web-development/protocol-buffers-explained/ | Proto explanation
https://developers.google.com/protocol-buffers/docs/overview | Proto documentation
https://betterprogramming.pub/understanding-protocol-buffers-43c5bced0d47 | Proto details
https://stackoverflow.com/questions/62487227/efficient-encoding-alternatives-for-mapstring-string-in-protocol-buffers | Encoding alternatives for map<string, string> in Protocol Buffers
https://capnproto.org/ | Another proto-style format
https://codeclimate.com/blog/choose-protocol-buffers/ |
https://cloud.google.com/apis/design/proto3 | What's new in proto3?
https://medium.com/swlh/an-introduction-to-json-schema-8eaea643fcda | JSON Schema

Schema Registry
https://www.confluent.io/blog/17-ways-to-mess-up-self-managed-schema-registry/ | Ways to mess up a self-managed schema registry
https://medium.com/slalom-technology/introduction-to-schema-registry-in-kafka-915ccf06b902 | Introduction to Schema Registry in Kafka
https://github.com/confluentinc/schema-registry | Confluent Schema Registry
