Avro vs Jackson vs Gson serialisation on Apache Flink

Feb 6, 2023

This article describes an experiment that measured how long it takes to deserialise data on Apache Flink when different serialisation engines (Avro, Jackson and Gson) are used.

TL;DR: Avro deserialised the fastest and produced the smallest messages.

Figure: time difference between deserialisation engines, and message sizes

Methodology

The test used messages made of nested data with a maximum depth of 4, for instance root.category.subcategory.property = ["item"]. A producer first wrote 1,000,000 random messages to Kafka. The Apache Flink application was then enabled to read the whole Kafka topic at once, deserialising message by message in the source operator and measuring the deserialisation time both per message and in total.
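To make the message shape concrete, here is a minimal sketch of how one such nested message could be built with plain Java maps. The class name, the method names and the "item-N" values are assumptions made for illustration; the article does not show the producer that was actually used.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not the producer used in the experiment.
final class NestedMessageSketch {

    // Builds root.category.subcategory.property = ["item-N"], i.e. four nested levels.
    static Map<String, Object> randomMessage(Random random) {
        Map<String, Object> subcategory = new LinkedHashMap<>();
        subcategory.put("property", List.of("item-" + random.nextInt(1000)));

        Map<String, Object> category = new LinkedHashMap<>();
        category.put("subcategory", subcategory);

        Map<String, Object> root = new LinkedHashMap<>();
        root.put("category", category);

        Map<String, Object> message = new LinkedHashMap<>();
        message.put("root", root);
        return message;
    }

    // Returns the number of nested map levels below (and including) this node.
    static int depth(Object node) {
        if (!(node instanceof Map)) {
            return 0;
        }
        int deepestChild = 0;
        for (Object value : ((Map<?, ?>) node).values()) {
            deepestChild = Math.max(deepestChild, depth(value));
        }
        return 1 + deepestChild;
    }

    public static void main(String[] args) {
        Map<String, Object> message = randomMessage(new Random());
        System.out.println(message);
        System.out.println("depth = " + depth(message));
    }
}
```

Each engine would then turn a structure like this into bytes in its own format before the message is written to Kafka.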

Each engine had its own implementation of AbstractDeserializationSchema where, inside the deserialize method, we recorded the time spent deserialising each item.
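The per-item timing can be sketched roughly as follows. This is not the code from the experiment: to stay runnable without Flink on the classpath, it wraps the engine in a plain java.util.function.Function instead of extending AbstractDeserializationSchema, and the class name TimedDeserializer and its methods are assumptions. In a real Flink job the same System.nanoTime() bracket would sit inside the schema's deserialize(byte[] message) method.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an engine-specific deserialisation schema.
final class TimedDeserializer<T> {
    private final Function<byte[], T> engine;
    private final List<Long> nanosPerRecord = new ArrayList<>();

    TimedDeserializer(Function<byte[], T> engine) {
        this.engine = engine;
    }

    // Deserialises one message and records how long the engine call took.
    T deserialize(byte[] message) {
        long start = System.nanoTime();
        T value = engine.apply(message);
        nanosPerRecord.add(System.nanoTime() - start);
        return value;
    }

    List<Long> nanosPerRecord() {
        return nanosPerRecord;
    }

    // Sum of all recorded per-record durations.
    long totalNanos() {
        long total = 0;
        for (long nanos : nanosPerRecord) {
            total += nanos;
        }
        return total;
    }

    public static void main(String[] args) {
        // A trivial UTF-8 decoder stands in for Avro, Jackson or Gson here.
        TimedDeserializer<String> timed =
                new TimedDeserializer<>(bytes -> new String(bytes, StandardCharsets.UTF_8));
        timed.deserialize("{\"root\":{}}".getBytes(StandardCharsets.UTF_8));
        System.out.println("records timed: " + timed.nanosPerRecord().size());
        System.out.println("total ns: " + timed.totalNanos());
    }
}
```

Keeping the timing around the engine call only, rather than around the whole operator, is what lets the per-item numbers be compared across engines.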

The overall measurement was taken over the whole stream.

Each engine was tested by running the deserialisation five consecutive times on an Intel i9 MacBook Pro with 32 GB of RAM. Each run used the same load of 1,000,000 (one million) messages pre-written to a Kafka topic by a generic producer. The records were first written to Kafka and, once the write had finished, the Flink application was started and consumed the whole topic before writing out the overall times. Logging was asynchronous and did not affect the overall measurements.
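As a rough illustration of how five run totals can be reduced to per-engine statistics, the sketch below uses the JDK's LongSummaryStatistics. The class name RunSummary is an assumption, and the numbers in main are placeholders, not measurements from this experiment.

```java
import java.util.LongSummaryStatistics;
import java.util.stream.LongStream;

// Hypothetical helper for summarising the total time of each run.
final class RunSummary {

    // Collects count, min, max and average over the given run totals.
    static LongSummaryStatistics summarise(long... totalMillisPerRun) {
        return LongStream.of(totalMillisPerRun).summaryStatistics();
    }

    public static void main(String[] args) {
        // Placeholder values only; replace with the totals logged by each run.
        LongSummaryStatistics stats = summarise(1, 2, 3, 4, 5);
        System.out.println("runs    = " + stats.getCount());
        System.out.println("min ms  = " + stats.getMin());
        System.out.println("max ms  = " + stats.getMax());
        System.out.println("mean ms = " + stats.getAverage());
    }
}
```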

Here are the statistics for the five runs of each engine:

Avro

Jackson

Gson

Conclusion

As the tables above show, Avro deserialises faster and also produces smaller messages. There is a caveat, though: Avro requires a schema. In a distributed environment where, for instance, a producer and a consumer need to use the same schema for serialising and deserialising, a schema registry is needed, and with it the overall complexity of the pipeline increases. Some schema-registry-as-a-service offerings already exist to address this problem, but they are outside the scope of this article.

If you want to learn more about this experiment, feel free to reach out to me and we can discuss it further. I'm working on a Docker setup that will make it possible to rerun this experiment in a controlled way.

Written by Eduardo Elias Saleh
