I had a pleasure to work with a number of companies and quite often, I have been asked to tackle problems they faced. And some of these problems were quite repetitive. So I decided to wrote a couple of blog posts about that starting today with serialization.
Avoid Java Serialization
As you probably know, Java serialization is very slow, footprint heavy and has Scala version binary compatibility issues. And as you may know, as well, Java serialization is the default in Akka which is pretty bad for us.
There are two main reasons why we want to serialize data when using Akka. First one is sending data through the network. Obviously, Akka doesn’t serialize messages which are sent locally, why would it? So we have an actor which serializes message and sends it to another actor. The second actor deserializes message, creates a response, serializes it and sends it back. This roundtrip is called ping-pong test and is often used for evaluation of serialization libraries.
The second example is persisting data into a local storage. For example, I had a client last year and they had performance issues regarding Akka persistence. They told me, it might take up to a minute to instantiate their persistent actors. A minute, seriously? But it’s certainly possible. Just imagine that the actor has a quite complex state and it also uses infamous Java serialization. Then you add not using snapshots and here we are.
How Does It Look Like
So let’s have a quick look at this graph. It compares round trips of various serialization libraries. I’m pretty sure it’s very hard to read so I added the arrow which shows us where the Java implementation stands.
And this graph shows us the footprint size. As you can see, the last spot. By the way, if you would like to see the sources of this, they are in the references in the end .
You can see how serialized data looks like for a very simple class in the picture below. The footprint of Java serialization is far bigger even than the footprints of not very efficient formats like XML or JSON .
The good question is, why it is like that. The reason is, it doesn’t serialize just data, but it also serializes the entire class definition and all definitions of all referenced classes. It was designed to be simple to use, to serialize almost everything and also it was supposed to work with different JVMs. Apparently, performance was not the main requirement, but honestly, after 8 major Java versions, it could be a little bit better.
So that was Java serialization, but before we move let's think about what points we should have in mind when choosing between various libraries. The main reasons why we want to avoid Java serialization are its performance and its footprint. But we should also have in mind Schema evolution, which is really important for real-world systems. And by the way, this is pretty complicated in the context of Java serialization. We should also consider the implementation effort. This varies a lot according to various libraries, but in general is pretty hard to beat Java serialization. The decisive point might be the requirement for human readability. It basically means if we have to use JSON or XML if we can use one of the binary formats with all their advantages. The last point I want to mention here are the language bindings, this very depends on a particular use case, but all libraries I'm going to mention here have Scala APIs.
JSON is the most used human readable format these days. Surely there are some others, like XML or CSV or whatever, which certainly have theirs use cases, but I would go mainstream. Surely, it has disadvantages, it’s relatively slow and bloaty, but if you need to have a human readable format it’s a good way to go. And the good thing is, there are so many great libraries available. Not exhaustive list of proven libraries includes and is not limited to following libraries - json4s, circe, µPickle, spray-json, argonaut, rapture-json, play-json. JSON traits are summarized in the table below:
Schema-less Binary Formats
For binary formats, we have two options. Firstly schema-less formats, meaning data are sent together with corresponding metadata. The main advantage is we don’t have to define schema, so it’s almost like using Java serialization, but performance-wise, these formats are a very different story. As an example, we can have Kryo and its Scala library named chill. And by the way, setting up this in Akka is a matter of minutes . We can also consider binary JSON formats, like MessagePack or BSON, which basically trade readability for performance. Schema-less formats are summarized below:
Binary Formats with Schema
And now let’s take a look at binary formats with schema defined by some kind of DSL. So they offer great performance and minimal footprint and they usually provide built-in support for schema evolution. A popular choice is protobuf and a couple of related projects like Flatbuffers and Cap’n’proto. Another representative might be Apache Thrift which was originally developed by Facebook or Apache Avro which is primarily used in Hadoop ecosystem. And again, a summary is below:
For a summary, I’d like to say you should always, or if you plan to go production at least, replace java serialization with something else. This very depends on your particular use case, so I leave it up to you. But if you want to have a quick blind recommendation I’d go for one of these Json4s , Kryo  or protobuf . All of them are great and proven libraries and they should work to your satisfaction.
So that's all for today. I have a couple of more problems in my head and if there's an interest I'll publish posts about them as well.