Avro vs Jackson vs Gson serialisation on Apache Flink

Eduardo Elias Saleh
3 min readFeb 6, 2023

This article describes an experiment comparing deserialisation performance on Apache Flink when different serialisation engines are used.

TL;DR: Avro was both the fastest engine to deserialise and produced the smallest messages.

[Chart: time difference between the deserialisation engines and message sizes]

Methodology

The test consisted of messages made of nested data, to a maximum depth of 4, for instance root.category.subcategory.property = ["item"]. A producer wrote 1,000,000 random messages into Kafka; the Apache Flink application was then enabled to read the whole lot from the Kafka topic at once, deserialising message by message on the source operator and measuring the deserialisation time both per item and in total.
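To make the message shape concrete, a minimal sketch of the nested structure in Java could look like the following (class and field names are illustrative, not the exact generator used in the experiment):

```java
import java.util.List;

// Hypothetical POJOs mirroring the nested test messages (maximum depth of 4:
// root -> category -> subcategory -> property). Names are illustrative only.
public class Root {
    public Category category;

    public static class Category {
        public Subcategory subcategory;
    }

    public static class Subcategory {
        public List<String> property; // e.g. ["item"]
    }
}
```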

Each engine had its own implementation of AbstractDeserializationSchema where, inside the deserialize method, we gathered the time spent on the deserialisation operation for each item, as shown beneath:

[Code: item measurement example]
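In place of the original screenshot, here is a minimal sketch of such a schema using Jackson as the example engine; the Avro and Gson variants follow the same pattern, only the parsing call changes. The field names and the way timings are accumulated are assumptions, not the exact code used in the experiment:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;

import java.io.IOException;

public class TimedJacksonSchema extends AbstractDeserializationSchema<Root> {

    private transient ObjectMapper mapper;
    private transient long totalNanos; // accumulated per-item time
    private transient long count;      // number of deserialised records

    @Override
    public Root deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper(); // lazy init: ObjectMapper is not serialisable
        }

        long start = System.nanoTime();
        Root record = mapper.readValue(message, Root.class);
        long elapsed = System.nanoTime() - start;

        // Per-item measurement: accumulate here and log asynchronously elsewhere,
        // so the logging itself does not affect the timings.
        totalNanos += elapsed;
        count++;
        return record;
    }
}
```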

The overall measurement was done over the whole stream as shown in the graph beneath:

[Graph: overall time measurement]

Each engine was tested by running the deserialisation five consecutive times on an Intel i9 MacBook Pro with 32 GB of RAM. Each run consisted of the same load of 1,000,000 (one million) messages pre-written to a Kafka topic by a generic producer. The records were first written to Kafka and, once the write finished, the Flink application was started and consumed the whole topic before writing out the overall times. Logging was done asynchronously and did not impact the measurements.

Here are the statistics for the five runs of each engine:

Avro

Jackson

GSON

Conclusion

As seen in the tables above, the results show that Avro has the fastest deserialisation engine and also produces the smallest messages. There is, however, a caveat: Avro requires a schema. If you are working in a distributed environment where, for instance, a producer and a consumer need to use the same schema for serialising and deserialising, you need a schema registry, and with it the overall complexity of the pipeline increases. There are already schema-registry-as-a-service offerings to overcome this problem, but they are outside the scope of this article.
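To illustrate the caveat, here is what a schema for the nested test message could look like, expressed inline through Avro's Schema.Parser. This is an assumed schema based on the example structure above, not the one used in the experiment:

```java
import org.apache.avro.Schema;

public class NestedMessageSchema {

    // Programmatic equivalent of an .avsc file for the nested test message.
    // Both the producer and the consumer must agree on this schema (or fetch
    // it from a schema registry) before any record can be read back.
    public static final Schema SCHEMA = new Schema.Parser().parse(
        "{"
        + "  \"type\": \"record\", \"name\": \"Root\", \"namespace\": \"com.example.benchmark\","
        + "  \"fields\": [{"
        + "    \"name\": \"category\", \"type\": {"
        + "      \"type\": \"record\", \"name\": \"Category\","
        + "      \"fields\": [{"
        + "        \"name\": \"subcategory\", \"type\": {"
        + "          \"type\": \"record\", \"name\": \"Subcategory\","
        + "          \"fields\": [{"
        + "            \"name\": \"property\","
        + "            \"type\": {\"type\": \"array\", \"items\": \"string\"}"
        + "          }]"
        + "        }"
        + "      }]"
        + "    }"
        + "  }]"
        + "}"
    );
}
```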

If you want to learn more about this experiment, please feel free to reach out to me and we can discuss it further. I'm working on a Docker setup that will make it possible to run the experiment repeatedly in a controlled way.
