Data Serialisation – Avro vs Protocol Buffers


Typical Data Flow
Data flows from frontend applications to the streaming layer in a row-oriented format. For analytical purposes, however, a column-oriented format is preferred, so the data must be transformed out of its row-oriented form after being pulled from the stream.

Row vs Column Oriented
Refresher on row-oriented and column-oriented formats (Reference Link):

File Formats Evolution

Big Data File Formats Summary

Why not use CSV/XML/JSON? 

  1. Repeated or missing meta information.
  2. Files are not always splittable, so they are hard to use in a map-reduce environment.
  3. Missing or limited schema definition and evolution support.
    1. “JsonSchema” can be leveraged to maintain a schema separately for JSON.
    2. Even then, the data may still require transformation based on a schema, so why not consider Avro/Proto instead?
  4. No native compression of repeating values and no indexing capabilities. (Binary JSON supports indexing. Link)
  5. High bandwidth consumption.
  6. Poor space utilisation.
  7. Poor performance.
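The “repeated meta information” and bandwidth points can be made concrete with a small, stdlib-only Python sketch (the field names are illustrative):

```python
import json

# 1,000 tiny records, all sharing the same two fields
records = [{"userName": f"user{i}", "favouriteNumber": i} for i in range(1000)]
payload = json.dumps(records).encode("utf-8")

# Every record carries its own copy of the field names...
assert payload.count(b"userName") == 1000
# ...so a large share of the payload is repeated metadata, not data.
```

A schema-based binary format stores the field names once, in the schema, so only the values travel on the wire.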

JSON Example

Important Terminologies

  • Serialisation → The process of converting objects such as arrays and dictionaries into byte streams that can be efficiently stored and transferred elsewhere.
  • Deserialisation → Using the byte stream to get the original objects back.
  • Backward Compatibility → Newer code can read data written by an older version of the software.
  • Forward Compatibility → Older code can read data written by a newer version of the software.
  • Schema Evolution → Schema evolution allows you to update the schema used to write new data while maintaining backward compatibility with the schema(s) of your old data. You can then read it all together, as if all of the data had one schema. Of course, there are precise rules governing the changes allowed to maintain compatibility; those rules are listed under Schema Resolution.
  • Registering and using a schema: see Schema Registry.
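The compatibility terms above can be illustrated with a minimal, stdlib-only Python sketch, using plain JSON as a stand-in (the field names and default value are illustrative):

```python
import json

# Serialisation: an "old" writer (schema v1: only "name") produces bytes
old_bytes = json.dumps({"name": "Martin"}).encode("utf-8")

# Backward compatibility: "new" reader code (schema v2 adds "nickname")
# can still deserialise old data by falling back to a default value
record = json.loads(old_bytes)
assert record.get("nickname", "n/a") == "n/a"

# Forward compatibility: "old" reader code can consume new data
# by ignoring the fields it does not know about
new_bytes = json.dumps({"name": "Martin", "nickname": "M"}).encode("utf-8")
old_view = {k: v for k, v in json.loads(new_bytes).items() if k == "name"}
assert old_view == {"name": "Martin"}
```

Avro and Protobuf implement exactly these fallback-and-ignore rules, but with the defaults and field identities declared in the schema rather than hand-coded.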

Avro and ProtoBuf

Both are language-neutral data serialisation systems that rely on a schema.


Apache Avro was released by the Hadoop working group in 2009. It is a row-based format that is highly splittable. It is also described as a data serialisation system similar to Java Serialization. The schema is stored in JSON format, while the data is stored in binary format, which minimises file size and maximises efficiency.

  • The schema is stored along with the Avro data in a file for any further processing.
  • We can also maintain a separate schema registry:
    1. In RPC, the client and the server exchange schemas during the connection.
    2. Schema-less Avro is used heavily in HDFS network communication (the serialiser and deserialiser know the data schema in advance).
  • Avro supports both dynamic and static typing as per the requirement:
    1. Dynamic typing avoids the static compilation step and gives greater interoperability with dynamic languages.
    2. Code generation is nonetheless still available in Avro for statically typed languages as an optional optimisation.
  • Avro has support for primitive (int, boolean, string, float etc.) and complex (enums, arrays, maps, unions etc.) types. Link
  • Avro creates a binary structured format that is both compressible and splittable. Hence it can be used efficiently as the input to Hadoop MapReduce jobs, which is why it is built into the Hadoop ecosystem.
  • Schema Definition Example:
    {
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "userName",        "type": "string"},
            {"name": "favouriteNumber", "type": ["null", "long"]},
            {"name": "interests",       "type": {"type": "array", "items": "string"}}
        ]
    }

…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}
  • Strings are just a length prefix followed by UTF-8 bytes
avro data internals
  • Avro Workflow
    • Serialisation
      • Create the Avro schema → .avsc file
      • Read this schema in your code using SchemaParser
      • Write the Avro data file
    • Deserialisation
      • Without an external schema
        • Read the file directly.
      • With an external schema
        • Read the schema (e.g. using the schema parser in Spark)
        • Open the Avro file using the schema definition read above.
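The “length prefix followed by UTF-8 bytes” point above can be sketched in plain Python: Avro encodes the length as a zigzag varint (the same encoding it uses for long), then appends the raw UTF-8 bytes. This is an illustrative re-implementation, not the official library:

```python
def zigzag(n: int) -> int:
    # Zigzag-map a signed 64-bit long so small magnitudes stay small
    return (n << 1) ^ (n >> 63)

def write_varint(n: int) -> bytes:
    # Emit a non-negative integer as a base-128 varint, 7 bits per byte
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set
        else:
            out.append(b)
            return bytes(out)

def encode_avro_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return write_varint(zigzag(len(data))) + data

# "Martin" has length 6; zigzag(6) = 12 = 0x0C, then the UTF-8 bytes
assert encode_avro_string("Martin") == b"\x0cMartin"
```

Because the schema travels separately (or in the file header), no field names or tags appear in the value stream itself, which is why Avro data is so compact.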


Protocol Buffers, usually referred to as Protobuf, is a protocol developed by Google to allow serialisation and deserialisation of structured data. Google developed it with the goal of providing a better way than XML to make systems communicate.

  • Developed by Google and was open sourced in 2008.
  • Protobuf is easy to use in microservices, especially where performance and interoperability are important.
  • The schema has to be maintained separately.
  • It only supports static typing.
  • Supports more complex data types than Avro. Link
  • Schema Definition Example : 
message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
}
  • Each field starts with a byte that indicates its tag number (the numbers 1, 2, 3 from the schema) and the type of the field.
    • This leads to a larger file size than Avro (when the data size is large compared to the stored schema).
Protobuf data internals
  • Protobuf workflow:
    • Serialisation
      • Create the proto schema → .proto file
      • Compile the proto schema using the “protoc” compiler for the target language.
      • Read the schema definition by importing the compiled class generated in the step above.
      • Write the proto binary data file.
    • Deserialisation
      • Without an external schema ← Not supported
      • With an external schema
        • Import the compiled class generated by the compiler → to get the schema definition
        • Open the binary file using the schema definition from the previous step.
protobuf working
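The tag-byte layout described above can be sketched in plain Python. Each field is preceded by a varint key computed as (field_number << 3) | wire_type; strings use wire type 2 (length-delimited) and int64 uses wire type 0 (varint). The values below match the Person schema but are illustrative; this hand-rolled encoder handles only non-negative integers and is not protoc-generated code:

```python
def varint(n: int) -> bytes:
    # Base-128 varint for a non-negative integer, 7 bits per byte
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set
        else:
            out.append(b)
            return bytes(out)

def key(field_number: int, wire_type: int) -> bytes:
    return varint((field_number << 3) | wire_type)

def encode_string_field(field_number: int, value: str) -> bytes:
    data = value.encode("utf-8")
    return key(field_number, 2) + varint(len(data)) + data

def encode_int_field(field_number: int, value: int) -> bytes:
    return key(field_number, 0) + varint(value)

# user_name = "Martin" (field 1), favourite_number = 1337 (field 2)
msg = encode_string_field(1, "Martin") + encode_int_field(2, 1337)
assert msg == b"\x0a\x06Martin\x10\xb9\x0a"
```

Note how the tag bytes (0x0A, 0x10) are repeated for every record on the wire; Avro omits them because the schema fixes the field order, which is the source of the size difference mentioned above.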

Avro vs ProtoBuf Comparison Summary

General
  • Storage type — How is data stored (row/columnar format)? Avro: Row. ProtoBuf: Row.
  • OLAP/OLTP — Efficient for OLTP or OLAP environments? Avro: OLTP. ProtoBuf: OLTP.
  • Stream — Efficient for streaming applications? Avro: Yes. ProtoBuf: Yes.
  • RPC interfaces — Does it support an RPC interface? Avro: Yes. ProtoBuf: Yes, best with gRPC.
  • License — Avro: Apache. ProtoBuf: BSD-style.
  • Language-neutral, platform-independent — Avro: Yes. ProtoBuf: Yes.
  • Ecosystems — Preferred and widely used in what kind of ecosystem? Avro: Big Data and streaming (Kafka). ProtoBuf: RPC and Kubernetes (ActiveMQ, Google).
  • Codebase / development effort — Which one requires less coding and maintenance effort? Avro: Simpler. ProtoBuf: Comparatively complex.
  • Performance — Which one generates more compact data encoding and faster data processing? Avro: Slightly slower, due to its simplicity. ProtoBuf: Fastest amongst all.

Schema
  • Schema enforcement — Can we enforce an external schema? Avro: Yes. ProtoBuf: Yes.
  • Schema support and definition — How is the schema definition provided? Avro: Defined in JSON, IDL, or the SchemaBuilder fluent API (Java). ProtoBuf: IDL.
  • Schema evolution — Does it support backward and forward compatibility in schema evolution? Avro: Supported (Backward, Forward, Full, None – up to the user to implement). ProtoBuf: Supported (Backward, Forward, Full, None – up to the user to implement).
  • Dynamic typing / schema — Can we parse data without an external schema definition? Avro: Supported. ProtoBuf: Not supported.
  • Documentation generation tools — Is there a tool to generate documentation from the schema definition? Avro: AvroDoc (Todo: can look for an updated one). ProtoBuf: protoc-gen-doc (produces HTML, PDF, DocBook).
  • PII tagging — Can we mark PII data in the schema definition itself? Example links: 1. (Todo: can check further.)
  • Data file compilation needed — Do we need to compile the schema first to generate a data file? Avro: Not required. ProtoBuf: Required.
  • Schema available — Does the data file hold the schema itself? Avro: Yes, in the header section (it can be removed as well). ProtoBuf: No.
  • Splittable — Can the data file be split for use in the Map-Reduce world? Avro: Yes. ProtoBuf: No.
  • Compression — Can the data file be compressed? Avro: Yes. ProtoBuf: No.
  • Data file viewing options — Can the data file be viewed on its own using a tool? Avro: Yes; the schema definition file is not mandatory. ProtoBuf: The schema definition file is mandatory, so the process is a bit complex.
  • Concatenating multiple data files — Can multiple data files with the same schema be concatenated? Avro: Yes. ProtoBuf: No.

Data Types
  • Enumerations — Does it support enums? Avro: Yes. ProtoBuf: Yes.
  • Constants — Can we have constants? Avro: No. ProtoBuf: No.
  • Optional fields — Can a field be marked optional? Avro: No; workaround → use a union type, like union { null, long }, or use the default attribute. ProtoBuf: Yes (the optional keyword).
  • Unsigned types — Does it support unsigned data types? Avro: No. ProtoBuf: Yes.
  • Default values supported — Can fields be given default values? Avro: Yes. ProtoBuf: Yes.
  • Default value mandatory — Is a default value mandatory? Avro: No, but as a best practice it should be provided to support backward compatibility.
  • Timestamps — Supported? Avro: Yes, but it becomes a 64-bit signed integer (Todo: need to check further). ProtoBuf: (Todo: need to check further.)
  • Deprecation — Is there support for marking a field as deprecated? Workaround links: 1.
  • Private fields — Is there an option to mark a field as private? Workaround links: 1.

Developer Queries
  • Field matching — How are fields matched while deserialising? Avro: Fields are matched by name. ProtoBuf: Fields are matched by tag (position).
  • Tools availability — What tools are available from a developer perspective? Avro: Example: 1. ProtoBuf: Example: 1.
  • Extension capabilities — Can features be extended if required? Avro: Easier (Java). ProtoBuf: The core compiler is more difficult to extend (C++).
  • Inheritance and polymorphism — Is it possible to build a new data type using inheritance? Avro: No; workaround links: 1, 2. ProtoBuf: Links: 1.
  • Debugging — Which one is more complex to debug? Avro: Easy; as the data file has the schema built in, files can easily be viewed using multiple third-party tools.
  • Snowflake support — Can data be loaded directly into Snowflake? Avro: Supported. ProtoBuf: Not supported.
  • Language support — Avro: C; Dart, Kotlin and Objective-C are not supported (can check for plugins); others through third-party plugins. ProtoBuf: C, Elixir, Haskell, Perl, Rust and Scala are not supported (can check for plugins); others through third-party plugins.


Avro seems a better fit for Big Data use cases, as it is widely used across multiple frameworks. Being splittable, carrying the schema along with the data, and native compression are its major advantages over Protocol Buffers.

Protobuf is easy to use in microservices, especially where performance and interoperability are important, and it is superior to Avro in this area.

Reference Links

Links – Helpful Details
  • Avro Documentation – Apache Avro page
  • Avro Schemaless – Avro and FastAvro libraries
  • Avro Schema – Step-by-step details of Avro
  • Avro Schema rules – Schema evolution and compatibility
  • Avro Details – Schema evolution in Avro, Protocol Buffers and Thrift
  • All file formats comparison – Data serialisation and comparison of Avro, Thrift and Proto
  • Nice explanation of Protobuf vs Avro – 2020 article
  • pb-vs-thrift-vs-avro
  • Proto vs JSON – Benchmarking
  • Proto alternatives – Comparison of Protobuf, Thrift, Avro
  • Thrift vs Protocol Buffers vs JSON
  • Protocol Buffers, Avro, Thrift & MessagePack
  • Avro vs Proto
  • Proto vs JSON
  • Ser-De strategies – All Java Ser/De
  • Protobuf detailed explanation – Proto explanation
  • Proto Documentation – Proto details
  • alternatives-for-mapstring-string-in-protocol-buffers
  • Another Proto – What’s new in proto-3?
  • Json
  • Schema Registry – Different ways to mess up schema registry
