Data Serialisation – Avro vs Protocol Buffers


Typical Data Flow
Data flows from frontend applications to the streaming layer in a row-oriented format. For analytical purposes, however, a column-oriented format is preferred, so the data must be transformed out of its row-oriented form after being pulled from the stream.

Row vs Column Oriented
Refresher on row-oriented and column-oriented formats (Reference Link):

File Formats Evolution

Big Data File Formats Summary

Why not use CSV/XML/JSON? 

  1. Repeated or missing meta information.
  2. Files are not always splittable, so they are hard to use in a map-reduce environment.
  3. Missing or limited schema definition and evolution support.
    1. “JsonSchema” can be leveraged to maintain a schema separately for JSON.
    2. Even then, the data may still require transformation based on a schema, so why not consider Avro/Proto instead?
  4. No native compression of repeating values and no indexing capabilities. (Binary JSON supports indexing. Link)
  5. High bandwidth consumption.
  6. Poor space utilisation.
  7. Poor performance.
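The “repeated meta information” and bandwidth points can be made concrete with a small, stdlib-only Python sketch (the field names are illustrative):

```python
import json

# 1,000 tiny records, all sharing the same two fields
records = [{"userName": f"user{i}", "favouriteNumber": i} for i in range(1000)]
payload = json.dumps(records).encode("utf-8")

# Every record carries its own copy of the field names...
assert payload.count(b"userName") == 1000
# ...so a large share of the payload is repeated metadata, not data.
```

A schema-based binary format stores the field names once, in the schema, so only the values travel on the wire.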

JSON Example

Important Terminologies

  • Serialisation → The process of converting objects such as arrays and dictionaries into byte streams that can be efficiently stored and transferred elsewhere.
  • Deserialisation → Using the byte stream to get the original objects back.
  • Backward Compatibility → Newer code can read data written by an older version of the software.
  • Forward Compatibility → Older code can read data written by a newer version of the software.
  • Schema Evolution → Schema evolution allows you to update the schema used to write new data while maintaining backward compatibility with the schema(s) of your old data. You can then read it all together, as if all of the data had one schema. Of course, there are precise rules governing the changes allowed to maintain compatibility; those rules are listed under Schema Resolution.
  • Registering and using a schema: see Schema Registry.
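The compatibility terms above can be illustrated with a minimal, stdlib-only Python sketch, using plain JSON as a stand-in (the field names and default value are illustrative):

```python
import json

# Serialisation: an "old" writer (schema v1: only "name") produces bytes
old_bytes = json.dumps({"name": "Martin"}).encode("utf-8")

# Backward compatibility: "new" reader code (schema v2 adds "nickname")
# can still deserialise old data by falling back to a default value
record = json.loads(old_bytes)
assert record.get("nickname", "n/a") == "n/a"

# Forward compatibility: "old" reader code can consume new data
# by ignoring the fields it does not know about
new_bytes = json.dumps({"name": "Martin", "nickname": "M"}).encode("utf-8")
old_view = {k: v for k, v in json.loads(new_bytes).items() if k == "name"}
assert old_view == {"name": "Martin"}
```

Avro and Protobuf implement exactly these fallback-and-ignore rules, but with the defaults and field identities declared in the schema rather than hand-coded.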

Avro and ProtoBuf

Both are language-neutral data serialisation systems that rely on a schema.


Apache Avro was released by the Hadoop working group in 2009. It is a row-based format that is highly splittable. It is also described as a data serialisation system similar to Java Serialization. The schema is stored in JSON format, while the data is stored in binary format, which minimises file size and maximises efficiency.

  • The schema is stored along with the Avro data in a file for any further processing.
  • We can also maintain a separate schema registry:
    1. In RPC, the client and the server exchange schemas during the connection.
    2. Schema-less Avro is used heavily in HDFS network communication (the serialiser and deserialiser know the data schema in advance).
  • Avro supports both dynamic and static typing as per the requirement:
    1. Dynamic typing avoids the static compilation step and gives greater interoperability with dynamic languages.
    2. Code generation is nonetheless still available in Avro for statically typed languages as an optional optimisation.
  • Avro has support for primitive (int, boolean, string, float etc.) and complex (enums, arrays, maps, unions etc.) types. Link
  • Avro creates a binary structured format that is both compressible and splittable. Hence it can be used efficiently as the input to Hadoop MapReduce jobs, which is why it is built into the Hadoop ecosystem.
  • Schema Definition Example:
    {
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "userName",        "type": "string"},
            {"name": "favouriteNumber", "type": ["null", "long"]},
            {"name": "interests",       "type": {"type": "array", "items": "string"}}
        ]
    }

…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}
  • Strings are just a length prefix followed by UTF-8 bytes
avro data internals
  • Avro Workflow
    • Serialisation
      • Create the Avro schema → .avsc file
      • Read this schema in your code using SchemaParser
      • Write the Avro data file
    • Deserialisation
      • Without an external schema
        • Read the file directly.
      • With an external schema
        • Read the schema (e.g. using the schema parser in Spark)
        • Open the Avro file using the schema definition read above.
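The “length prefix followed by UTF-8 bytes” point above can be sketched in plain Python: Avro encodes the length as a zigzag varint (the same encoding it uses for long), then appends the raw UTF-8 bytes. This is an illustrative re-implementation, not the official library:

```python
def zigzag(n: int) -> int:
    # Zigzag-map a signed 64-bit long so small magnitudes stay small
    return (n << 1) ^ (n >> 63)

def write_varint(n: int) -> bytes:
    # Emit a non-negative integer as a base-128 varint, 7 bits per byte
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set
        else:
            out.append(b)
            return bytes(out)

def encode_avro_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return write_varint(zigzag(len(data))) + data

# "Martin" has length 6; zigzag(6) = 12 = 0x0C, then the UTF-8 bytes
assert encode_avro_string("Martin") == b"\x0cMartin"
```

Because the schema travels separately (or in the file header), no field names or tags appear in the value stream itself, which is why Avro data is so compact.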


Protocol Buffers, usually referred to as Protobuf, is a protocol developed by Google to allow serialisation and deserialisation of structured data. Google developed it with the goal of providing a better way than XML to make systems communicate.

  • Developed by Google and was open sourced in 2008.
  • Protobuf is easy to use in microservices, especially where performance and interoperability are important.
  • The schema has to be maintained separately.
  • It only supports static typing.
  • Supports more complex data types than Avro. Link
  • Schema Definition Example : 
message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
}
  • Each field starts with a byte that indicates its tag number (the numbers 1, 2, 3 from the schema) and the type of the field.
    • This leads to a larger file size than Avro (when the data size is large compared to the stored schema).
Protobuf data internals
  • Protobuf workflow:
    • Serialisation
      • Create the proto schema → .proto file
      • Compile the proto schema using the “protoc” compiler for the target language.
      • Read the schema definition by importing the compiled class generated in the step above.
      • Write the proto binary data file.
    • Deserialisation
      • Without an external schema ← Not supported
      • With an external schema
        • Import the compiled class generated by the compiler → to get the schema definition
        • Open the binary file using the schema definition from the previous step.
protobuf working
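The tag-byte layout described above can be sketched in plain Python. Each field is preceded by a varint key computed as (field_number << 3) | wire_type; strings use wire type 2 (length-delimited) and int64 uses wire type 0 (varint). The values below match the Person schema but are illustrative; this hand-rolled encoder handles only non-negative integers and is not protoc-generated code:

```python
def varint(n: int) -> bytes:
    # Base-128 varint for a non-negative integer, 7 bits per byte
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set
        else:
            out.append(b)
            return bytes(out)

def key(field_number: int, wire_type: int) -> bytes:
    return varint((field_number << 3) | wire_type)

def encode_string_field(field_number: int, value: str) -> bytes:
    data = value.encode("utf-8")
    return key(field_number, 2) + varint(len(data)) + data

def encode_int_field(field_number: int, value: int) -> bytes:
    return key(field_number, 0) + varint(value)

# user_name = "Martin" (field 1), favourite_number = 1337 (field 2)
msg = encode_string_field(1, "Martin") + encode_int_field(2, 1337)
assert msg == b"\x0a\x06Martin\x10\xb9\x0a"
```

Note how the tag bytes (0x0A, 0x10) are repeated for every record on the wire; Avro omits them because the schema fixes the field order, which is the source of the size difference mentioned above.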

Avro vs ProtoBuf Comparison Summary

General
  • Storage type — How is data stored (row/columnar format)? Avro: Row. ProtoBuf: Row.
  • OLAP/OLTP — Efficient for OLTP or OLAP environments? Avro: OLTP. ProtoBuf: OLTP.
  • Stream — Efficient for streaming applications? Avro: Yes. ProtoBuf: Yes.
  • RPC interfaces — Does it support an RPC interface? Avro: Yes. ProtoBuf: Yes, best with gRPC.
  • License — Avro: Apache. ProtoBuf: BSD-style.
  • Language-neutral, platform-independent — Avro: Yes. ProtoBuf: Yes.
  • Ecosystems — Preferred and widely used in what kind of ecosystem? Avro: Big Data and streaming (Kafka). ProtoBuf: RPC and Kubernetes (ActiveMQ, Google).
  • Codebase / development effort — Which one requires less coding and maintenance effort? Avro: Simpler. ProtoBuf: Comparatively complex.
  • Performance — Which one generates more compact data encoding and faster data processing? Avro: Slightly slower, due to its simplicity. ProtoBuf: Fastest amongst all.

Schema
  • Schema enforcement — Can we enforce an external schema? Avro: Yes. ProtoBuf: Yes.
  • Schema support and definition — How is the schema definition provided? Avro: Defined in JSON, IDL, or the SchemaBuilder fluent API (Java). ProtoBuf: IDL.
  • Schema evolution — Does it support backward and forward compatibility in schema evolution? Avro: Supported (Backward, Forward, Full, None – up to the user to implement). ProtoBuf: Supported (Backward, Forward, Full, None – up to the user to implement).
  • Dynamic typing / schema — Can we parse data without an external schema definition? Avro: Supported. ProtoBuf: Not supported.
  • Documentation generation tools — Is there a tool to generate documentation from the schema definition? Avro: AvroDoc (Todo: can look for an updated one). ProtoBuf: protoc-gen-doc (produces HTML, PDF, DocBook).
  • PII tagging — Can we mark PII data in the schema definition itself? Example links: 1. (Todo: can check further.)
  • Data file compilation needed — Do we need to compile the schema first to generate a data file? Avro: Not required. ProtoBuf: Required.
  • Schema available — Does the data file hold the schema itself? Avro: Yes, in the header section (it can be removed as well). ProtoBuf: No.
  • Splittable — Can the data file be split for use in the Map-Reduce world? Avro: Yes. ProtoBuf: No.
  • Compression — Can the data file be compressed? Avro: Yes. ProtoBuf: No.
  • Data file viewing options — Can the data file be viewed on its own using a tool? Avro: Yes; the schema definition file is not mandatory. ProtoBuf: The schema definition file is mandatory, so the process is a bit complex.
  • Concatenating multiple data files — Can multiple data files with the same schema be concatenated? Avro: Yes. ProtoBuf: No.

Data Types
  • Enumerations — Does it support enums? Avro: Yes. ProtoBuf: Yes.
  • Constants — Can we have constants? Avro: No. ProtoBuf: No.
  • Optional fields — Can a field be marked optional? Avro: No; workaround → use a union type, like union { null, long }, or use the default attribute. ProtoBuf: Yes (the optional keyword).
  • Unsigned types — Does it support unsigned data types? Avro: No. ProtoBuf: Yes.
  • Default values supported — Can fields be given default values? Avro: Yes. ProtoBuf: Yes.
  • Default value mandatory — Is a default value mandatory? Avro: No, but as a best practice it should be provided to support backward compatibility.
  • Timestamps — Supported? Avro: Yes, but it becomes a 64-bit signed integer (Todo: need to check further). ProtoBuf: (Todo: need to check further.)
  • Deprecation — Is there support for marking a field as deprecated? Workaround links: 1.
  • Private fields — Is there an option to mark a field as private? Workaround links: 1.

Developer Queries
  • Field matching — How are fields matched while deserialising? Avro: Fields are matched by name. ProtoBuf: Fields are matched by tag (position).
  • Tools availability — What tools are available from a developer perspective? Avro: Example: 1. ProtoBuf: Example: 1.
  • Extension capabilities — Can features be extended if required? Avro: Easier (Java). ProtoBuf: The core compiler is more difficult to extend (C++).
  • Inheritance and polymorphism — Is it possible to build a new data type using inheritance? Avro: No; workaround links: 1, 2. ProtoBuf: Links: 1.
  • Debugging — Which one is more complex to debug? Avro: Easy; as the data file has the schema built in, files can easily be viewed using multiple third-party tools.
  • Snowflake support — Can data be loaded directly into Snowflake? Avro: Supported. ProtoBuf: Not supported.
  • Language support — Avro: C; Dart, Kotlin and Objective-C are not supported (can check for plugins); others through third-party plugins. ProtoBuf: C, Elixir, Haskell, Perl, Rust and Scala are not supported (can check for plugins); others through third-party plugins.


Avro seems a better fit for Big Data use cases, as it is widely used across multiple frameworks. Being splittable, carrying the schema along with the data, and native compression are its major advantages over Protocol Buffers.

Protobuf is easy to use in microservices, especially where performance and interoperability are important, and it is superior to Avro in this area.

Reference Links

Links – Helpful Details
  • Avro Documentation – Apache Avro page
  • Avro Schemaless – Avro and FastAvro libraries
  • Avro Schema – Step-by-step details of Avro
  • Avro Schema rules – Schema evolution and compatibility
  • Avro Details – Schema evolution in Avro, Protocol Buffers and Thrift
  • All file formats comparison – Data serialisation and comparison of Avro, Thrift and Proto
  • Nice explanation of Protobuf vs Avro – 2020 article
  • pb-vs-thrift-vs-avro
  • Proto vs JSON – Benchmarking
  • Proto alternatives – Comparison of Protobuf, Thrift, Avro
  • Thrift vs Protocol Buffers vs JSON
  • Protocol Buffers, Avro, Thrift & MessagePack
  • Avro vs Proto
  • Proto vs JSON
  • Ser-De strategies – All Java Ser/De
  • Protobuf detailed explanation – Proto explanation
  • Proto Documentation – Proto details
  • alternatives-for-mapstring-string-in-protocol-buffers
  • Another Proto – What’s new in proto-3?
  • Json
  • Schema Registry – Different ways to mess up schema registry
