Data Serialisation – Avro vs Protocol Buffers


Typical Data flow
Data flows from frontend applications to the streaming layer in a row-oriented format. For analytical purposes, however, a column-oriented format is preferred, so data pulled from the stream has to be transformed from the row-oriented format into a column-oriented one.

Row vs Column Oriented
As a refresher: row-oriented formats store all the fields of a record together, whereas column-oriented formats store all the values of a field together.
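
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration: the click events and the `to_columns` helper are made up for this example and do not come from any particular library.

```python
# Hypothetical click events as they might arrive from a stream, one row per event.
rows = [
    {"user": "alice", "page": "/home", "ms": 120},
    {"user": "bob", "page": "/cart", "ms": 340},
    {"user": "alice", "page": "/pay", "ms": 95},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented: reading one whole event touches a single record.
first_event = rows[0]

# Column-oriented: an aggregate over one field scans a single list
# and never looks at the other fields.
average_ms = sum(columns["ms"]) / len(columns["ms"])
print(first_event, average_ms)
```

Writing and reading whole events favours the row layout, while aggregates over a few fields favour the column layout, which is why the transformation step sits between the stream and the analytical store.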

File Formats Evolution

Big Data File Formats Summary

Why not use CSV/XML/JSON? 

  1. Repeated or no meta information.
  2. Files are not reliably splittable (for example multi-line JSON, XML, or compressed files), so they are a poor fit for a map-reduce environment.
  3. Missing or limited schema definition and evolution support.
    1. “JsonSchema” can be used to maintain a schema separately for JSON.
    2. Data may still need to be transformed based on that schema, so why not consider Avro/Proto?
  4. No native compression of repeated values and no indexing capabilities (Binary JSON does support indexing).
  5. High consumption of bandwidth.
  6. Bad Space Utilisation.
  7. Poor performance.

Json Example
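
To get a feel for the "repeated meta information" point, the following sketch estimates how many bytes of a JSON-lines payload are spent on field names alone. The records are hypothetical and the numbers will differ for real data; only the standard `json` module is used.

```python
import json

# Hypothetical records; the field names mirror the Person example used later.
records = [
    {"userName": f"user{i}", "favouriteNumber": i, "interests": ["hacking"]}
    for i in range(1000)
]

# One JSON document per line, as is common in streaming pipelines.
json_lines = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Bytes spent on the quoted field names, which are repeated in every record.
names = ["userName", "favouriteNumber", "interests"]
name_bytes_per_record = sum(len(json.dumps(n)) for n in names)
repeated_name_bytes = name_bytes_per_record * len(records)

share = repeated_name_bytes / len(json_lines)
print(f"{len(json_lines)} bytes total, {share:.0%} spent on repeated field names")
```

A schema-based binary format stores the field names once (or not at all in the data), which is where much of the bandwidth and space saving comes from.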

Important Terminologies

  • Serialisation → Process of converting objects such as arrays and dictionaries into byte streams that can be efficiently stored and transferred elsewhere.
  • Deserialisation → Using byte stream to get the original objects back
  • Backward Compatibility → A newer version of the software (new schema) can read data written by an older version.
Backward Compatibility
  • Forward Compatibility → An older version of the software (old schema) can read data written by a newer version.
Forward Compatibility
  • Schema Evolution → Schema evolution allows you to update the schema used to write new data while maintaining backward compatibility with the schema(s) of your old data, so that all of it can be read together as if it had one schema. Precise rules govern which changes are allowed; in Avro they are listed under Schema Resolution.
  • Registering and using Schema :
Schema Registry
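
The compatibility terms above can be illustrated with a deliberately simplified sketch. It is not the Avro library or its full resolution rules: the schemas are plain Python lists, and `resolve`, `SCHEMA_V1` and `SCHEMA_V2` are hypothetical names. It only shows the core idea that fields are matched by name, missing fields take the reader's default, and fields the reader does not know are dropped.

```python
# Sentinel meaning "this field has no default" (None is a legal default value).
NO_DEFAULT = object()

# Version 1 of a hypothetical schema, used by the old writer: (name, default) pairs.
SCHEMA_V1 = [("userName", NO_DEFAULT)]
# Version 2 adds a field WITH a default, which keeps the change compatible.
SCHEMA_V2 = [("userName", NO_DEFAULT), ("favouriteNumber", None)]


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema, matching fields by name."""
    resolved = {}
    for name, default in reader_schema:
        if name in record:
            resolved[name] = record[name]
        elif default is not NO_DEFAULT:
            resolved[name] = default  # field is missing in the data: use the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved  # fields the reader does not know about are dropped


old_record = {"userName": "Martin"}
new_record = {"userName": "Martin", "favouriteNumber": 1337}

# Backward compatibility: new reader (v2) reads old data (v1).
backward = resolve(old_record, SCHEMA_V2)
# Forward compatibility: old reader (v1) reads new data (v2).
forward = resolve(new_record, SCHEMA_V1)
print(backward, forward)
```

Adding the new field without a default would make `resolve(old_record, ...)` raise, which is the kind of change a schema registry's compatibility check is meant to reject.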

Avro and ProtoBuf

Both are language-neutral data serialisation systems that rely on a schema.


Apache Avro was released by the Hadoop working group in 2009. It is a row-based format that is highly splittable. It is also described as a data serialization system similar to Java Serialization. The schema is stored in JSON format, while the data is stored in a binary format, which minimizes file size and maximizes efficiency.

  • The schema is stored along with the data in an Avro file, so the file can be processed later without any other input.
  • The schema can also be maintained separately, for example in a schema registry:
    1. In RPC, the client and the server exchange schemas during the connection handshake.
    2. Schema-less Avro is used heavily in HDFS network communication (the serialiser and deserialiser know the data schema in advance).
  • Avro supports both dynamic and static typing, as required:
    1. Data can be read and written without a static compilation step, which gives greater interoperability with dynamic languages.
    2. Code generation is nonetheless still available for statically typed languages as an optional optimisation.
  • Avro supports primitive types (int, boolean, string, float, etc.) and complex types (enums, arrays, maps, unions, etc.).
  • Avro produces a binary structured format that is both compressible and splittable. It can therefore be used efficiently as input to Hadoop MapReduce jobs, which is why it is so widely used in the Hadoop ecosystem.
  • Schema Definition Example:
    {
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "userName",        "type": "string"},
            {"name": "favouriteNumber", "type": ["null", "long"]},
            {"name": "interests",       "type": {"type": "array", "items": "string"}}
        ]
    }

…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}
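  • The binary encoding contains no field names or tags: values are written in the order of the schema, and strings are just a length prefix followed by UTF-8 bytes.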
avro data internals
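
As a rough illustration of these internals, the sketch below hand-encodes one Person record following the Avro binary encoding rules (zig-zag varints for longs, length-prefixed strings, a branch index for unions, counted blocks for arrays). It is a teaching sketch written for this article, not the Avro library; the function names and the sample values are assumptions.

```python
def encode_long(n):
    """Avro int/long: zig-zag encode, then emit base-128 varint bytes."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    """Avro string: length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_person(user_name, favourite_number, interests):
    """Concatenate the fields in schema order; no names or tags are written."""
    out = encode_string(user_name)
    # union ["null", "long"]: branch index first, then the value (null has no bytes)
    if favourite_number is None:
        out += encode_long(0)
    else:
        out += encode_long(1) + encode_long(favourite_number)
    # array: one block holding all items, then a zero count that ends the array
    if interests:
        out += encode_long(len(interests))
        for item in interests:
            out += encode_string(item)
    out += encode_long(0)
    return out


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded), encoded.hex(" "))  # 32 bytes for this sample record
```

Because nothing in the output identifies the fields, a reader can only make sense of these bytes if it has the writer's schema.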
  • Avro workflow
    • Serialisation
      • Create the Avro schema → .avsc file
      • Read this schema in your code using a schema parser
      • Write the Avro data file
    • Deserialisation
      • Without an external schema
        • Read the file directly (the schema is in the file header).
      • With an external schema
        • Read the schema using a schema parser (for example in Spark)
        • Open the Avro file using the schema read in the previous step.
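
The "with external schema" path can be sketched as follows. This is a minimal hand-written decoder for a single schema-less Avro datum, not the Avro library and not the container file format (which adds a header and sync markers). The payload is the Person record from the schema example, and `read_long` and `read_value` are hypothetical helper names.

```python
import io
import json

# Step 1: read the schema (here parsed from an inline string instead of an .avsc file).
SCHEMA = json.loads("""
{"type": "record", "name": "Person", "fields": [
  {"name": "userName", "type": "string"},
  {"name": "favouriteNumber", "type": ["null", "long"]},
  {"name": "interests", "type": {"type": "array", "items": "string"}}
]}
""")


def read_long(buf):
    """Read a base-128 varint and undo the zig-zag encoding."""
    shift = result = 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_value(buf, schema):
    """Walk the schema and consume exactly the bytes each type needs."""
    if schema == "null":
        return None
    if schema == "long":
        return read_long(buf)
    if schema == "string":
        return buf.read(read_long(buf)).decode("utf-8")
    if isinstance(schema, list):  # union: the branch index comes first
        return read_value(buf, schema[read_long(buf)])
    if schema["type"] == "array":
        items = []
        count = read_long(buf)
        while count != 0:
            if count < 0:  # negative count: a byte size follows, which we skip
                read_long(buf)
                count = -count
            for _ in range(count):
                items.append(read_value(buf, schema["items"]))
            count = read_long(buf)
        return items
    if schema["type"] == "record":
        return {f["name"]: read_value(buf, f["type"]) for f in schema["fields"]}
    raise NotImplementedError(schema)


# Step 2: open the binary data using the schema from step 1.
payload = b"\x0cMartin\x02\xf2\x14\x04\x16daydreaming\x0ehacking\x00"
person = read_value(io.BytesIO(payload), SCHEMA)
print(person)
```

Only the handful of types used by the Person schema are handled; a real reader would also cover the remaining Avro types and resolve the writer's schema against the reader's.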


Protocol Buffers, usually referred to as Protobuf, is a protocol developed by Google for the serialization and deserialization of structured data. Google developed it with the goal of providing a better way than XML for systems to communicate.

  • It was developed by Google and open sourced in 2008.
  • Protobuf is easy to use in microservices, especially where performance and interoperability are important.
  • The schema has to be maintained separately from the data.
  • It supports only static typing.
  • It supports more complex data types than Avro.
  • Schema Definition Example : 
message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
}
  • Each field starts with a key (a single byte for tag numbers up to 15) that indicates its tag number (the numbers 1, 2 and 3 from the schema) and the type of the field.
    • This leads to a larger file size than Avro (when the data is large compared with the stored schema).
Protobuf data internals
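
The tag-per-field layout can be illustrated by hand-encoding the same Person record according to the Protobuf wire format. This is a sketch for explanation only, not the code that `protoc` generates; the function names and sample values are assumptions.

```python
WIRE_VARINT = 0  # int32, int64, bool, enum, ...
WIRE_LEN = 2     # string, bytes, nested messages, ...


def encode_varint(n):
    """Base-128 varint for a non-negative integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def key(field_number, wire_type):
    """Every field starts with (tag number << 3) | wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number, s):
    data = s.encode("utf-8")
    return key(field_number, WIRE_LEN) + encode_varint(len(data)) + data


def encode_person(user_name, favourite_number=None, interests=()):
    out = encode_string_field(1, user_name)
    if favourite_number is not None:
        # int64 is sent as a 64-bit two's complement varint (no zig-zag)
        value = favourite_number & 0xFFFFFFFFFFFFFFFF
        out += key(2, WIRE_VARINT) + encode_varint(value)
    for item in interests:
        # a repeated field is simply the same tag written once per element
        out += encode_string_field(3, item)
    return out


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded), encoded.hex(" "))  # 33 bytes for this sample record
```

The one-byte key in front of each value is the overhead mentioned above; in exchange, an absent optional field costs no bytes at all.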
  • Protobuf workflow:
    • Serialisation
      • Create the proto schema → .proto file
      • Compile the proto schema for the target language using the “protoc” compiler.
      • Read the schema definition by importing the compiled class generated in the previous step.
      • Write the proto binary data file.
    • Deserialisation
      • Without an external schema ← Not supported
      • With an external schema
        • Import the compiled class generated by the compiler → to get the schema definition
        • Open the binary file using the schema definition from the previous step.
protobuf working
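
Why the schema is needed on the reading side can be seen by scanning a Protobuf payload without one. The sketch below is a hand-written scanner, not the Protobuf library; it handles only the two wire types used by the Person message, and `FIELD_NAMES` stands in for what the compiled class would normally provide.

```python
def read_varint(data, pos):
    """Return (value, new_position) for the varint starting at pos."""
    shift = result = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def scan_fields(data):
    """List (tag number, wire type, raw value) triples without any schema."""
    fields = []
    pos = 0
    while pos < len(data):
        field_key, pos = read_varint(data, pos)
        field_number, wire_type = field_key >> 3, field_key & 0x7
        if wire_type == 0:    # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise NotImplementedError(f"wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


payload = b"\x0a\x06Martin\x10\xb9\x0a\x1a\x0bdaydreaming\x1a\x07hacking"
raw = scan_fields(payload)
print(raw)

# Names and exact types only come from the schema (normally the compiled class).
FIELD_NAMES = {1: "user_name", 2: "favourite_number", 3: "interests"}
named = [(FIELD_NAMES[number], value) for number, _, value in raw]
print(named)
```

Without the schema the scanner cannot tell whether a length-delimited value is a string, raw bytes or a nested message, and it has no field names, which is why fields are matched by tag number rather than by name.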

Avro vs ProtoBuf Comparison Summary

| Category | Aspect | Description | Avro | ProtoBuf |
|---|---|---|---|---|
| General | Storage type | How is data stored (row or columnar format)? | Row | Row |
| | OLAP/OLTP | Efficient for an OLTP or OLAP environment? | OLTP | OLTP |
| | Stream | Efficient for streaming applications? | Yes | Yes |
| | RPC interfaces | Does it support an RPC interface? | Yes | Yes; best with gRPC |
| | License | | Apache | BSD-style |
| | Language-neutral, platform-independent | | Yes | Yes |
| | Ecosystems | Preferred and widely used in what kind of ecosystem? | Big Data and streaming (Kafka) | RPC and Kubernetes (ActiveMQ, Google) |
| | Codebase / development effort | Which one requires less coding and maintenance effort? | Simpler | Comparatively complex |
| | Performance | Which one generates a more compact data encoding and processes data faster? | Slightly slow, due to simplicity | Fastest amongst all |
| Schema | Schema enforcement | Can we enforce an external schema? | Yes | Yes |
| | Schema support and definition | How is the schema definition provided? | Defined in JSON, IDL, or the SchemaBuilder fluent API (Java) | IDL |
| | Schema evolution | Does it support backward and forward compatibility in schema evolution? | Supported (Backward, Forward, Full, None; up to the user to implement) | Supported (Backward, Forward, Full, None; up to the user to implement) |
| | Dynamic typing / schema | Can we parse data without an external schema definition? | Supported | Not supported |
| | Documentation generation tools | Is there a tool to generate documentation for the schema definition? | AvroDoc (Todo: look for an updated one) | protoc-gen-doc: produces HTML, PDF, DocBook |
| | PII tagging | Can we mark PII data in the schema definition itself? | Example links: 1 | Todo: check further |
| Data File | Compilation needed? | Do we need to compile the schema before generating a data file? | Not required | Required |
| | Schema available? | Does the data file hold the schema itself? | Yes, in the header section; it can also be removed | No |
| | Splittable | Can the data file be split for use in the MapReduce world? | Yes | No |
| | Compression | Can we compress the data file? | Yes | No |
| | Data file viewing options | Can we view a data file on its own using any tool? | Yes; a schema definition file is not mandatory | A schema definition file is mandatory, so the process is a bit complex |
| | Concatenate multiple data files | Can we concatenate multiple data files that share the same schema? | Yes | No |
| Data Types | Enumerations | Does it support enums? | Yes | Yes |
| | Constants | Can we have constants? | No | No |
| | Optional fields | Can we mark a field as optional? | No; workaround: use a union type such as union { null, long }, or use the default attribute | |
| | Unsigned type | Does it support unsigned data types? | No | Yes |
| | Default values | Can we provide default values for fields? | Yes | Yes |
| | Default value mandatory? | Is a default value mandatory? | No, but as a best practice it should be provided to support backward compatibility | |
| | Timestamp | | Yes, but it becomes a 64-bit signed integer (Todo: check further) | Todo: check further |
| | Deprecation | Is there support for marking a field as deprecated? | Workaround links: 1 | |
| | Private fields | Is there an option to mark a field as private? | Workaround links: 1 | |
| Developer Queries | Fields matching | How are fields matched while deserialising? | Fields are matched by name | Fields are matched by tag number |
| | Tools availability | What tools are available from a developer perspective? | Example: 1 | Example: 1 |
| | Extension capabilities | Can we extend features if required? | Easier (Java) | Core compiler: more difficult (C++) |
| | Inheritance support and polymorphism | Is it possible to build a new data type using inheritance? | No; workaround links: 1 2 | Links: 1 |
| | Debugging | Which one is more complex to debug? | Easy: the data file has the schema built in, so it can be viewed with multiple third-party tools | |
| | Snowflake support | Can we load data directly into Snowflake? | Supported | Not supported |
| | Language support | Which languages are supported? | C; Dart: not supported (can check for plugin); Kotlin: not supported (Plugin –); Objective-C: not supported (can check for plugin); others through third-party plugins | C, Elixir, Haskell, Perl, Rust, Scala: not supported (can check for plugin); others through third-party plugins |


Avro seems a better fit for Big Data use cases, as it is widely used across multiple frameworks. Splittable files, a schema stored along with the data, and native compression are its major advantages over Protocol Buffers.

Protobuf is easy to use in microservices, especially where performance and interoperability are important, and it is superior to Avro in this area.

