Modern Data Serialization Method


In this blog, I'm going to discuss a modern method for data serialization: Protocol Buffers, usually referred to as Protobuf. It's a binary communication format designed by Google that allows us to serialize and deserialize structured data.


But wait, the above tasks can also be done by other formats like JSON or XML, so why did Google choose to design a new communication format? As we all know, almost all the big tech giants are focused on high performance and optimized speed. With the tremendous popularity of microservice architectures, it has become very difficult to manage communication between thousands of services using a text-based format: services generate thousands of requests to each other, which loads the network and consumes a lot of resources. That's why we need a fast way to serialize compact data for transfer between services. In this scenario Protocol Buffers can save us a lot of money and resources.

It is important to note that although JSON and Protobuf do the same job, these technologies were designed with different goals and approaches.

Protocol Buffers were designed to be faster than JSON and XML by dropping many of the responsibilities those formats carry and focusing only on serializing and deserializing data as fast as possible. Another important optimization concerns network bandwidth: the transmitted data is kept as small as possible.

How is Protobuf faster than other communication formats?

The data transmitted during communication is in binary form, which improves the speed of transmission compared to JSON's text format. Let's take an example to get a clearer understanding:

{
 "status":"success",
 "message":"found"
}

In the above JSON object there are a total of 38 characters (ignoring whitespace), including structural characters like { } , " : which don't carry any informational data. In total we have 2 curly brackets, 8 quotation marks, 2 colons and 1 comma, which add up to 13 characters; the keys of the JSON object occupy 6 + 7 = 13 characters, whereas the values occupy 7 + 5 = 12 characters.

Summing up, we get the following:

  • JSON object length: 38 bytes

  • Informational length: 12 bytes

  • Non-Informational length: 26 bytes [“WASTAGE”]


PROTOBUF to the rescue !!!

According to Google:

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.

Protocol Buffers help us define the structure of our models once and then use generated source code to easily write and read structured data to and from a variety of data streams, in a variety of languages. So how do we define the structure of a model? The structure is defined in a .proto file, which we compile with the protoc command; the compiler generates source code that can be invoked by a sender or recipient of these model structures. We'll discuss protoc later, but first let's dive deeper through an exercise. Let's say we want to create a service that takes a user_id as a request parameter and returns the user's details.

user.proto
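Here's a minimal sketch of what such a user.proto could look like (the service, message and field names below are illustrative assumptions):

syntax = "proto3";

package user;

// Tells protoc-gen-go which Go package the generated code belongs to.
option go_package = "example.com/userpb";

// The rpc service interface: protoc can generate client and server stubs for it.
service UserService {
  rpc GetUser (UserRequest) returns (UserResponse);
}

// Request carries only the user's id.
message UserRequest {
  int64 user_id = 1;
}

// Response carries a status, a message and the user's details.
message UserResponse {
  string status = 1;
  string message = 2;
  User user = 3;
}

// The actual user details.
message User {
  int64 user_id = 1;
  string name = 2;
  string email = 3;
}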

  • syntax = "proto3"; tells the compiler which version of Protocol Buffers we are using.
  • In the proto file we can also define an rpc service interface, and protoc (the compiler) will generate the service interface code and stubs in your chosen language (see the note after this list).
  • In the proto file the structure of a model is declared with the message keyword followed by a user-defined message name. As an analogy with programming languages: a service is equivalent to a class, an rpc is equivalent to its functions, and a message is equivalent to the parameters/arguments.
  • In the message body the fields are defined with their respective types, and each field is assigned a unique integer number.
  • These field numbers are used to identify your fields in the message binary format, and should not be changed once your message type is in use. Note that field numbers in the range 1 through 15 take one byte to encode, including the field number and the field’s type (you can find out more about this in Protocol Buffer Encoding). Field numbers in the range 16 through 2047 take two bytes. So you should reserve the numbers 1 through 15 for very frequently occurring message elements.
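A quick note on the stubs mentioned above: with the current Go tooling the message code and the gRPC service stubs come from two separate plugins (protoc-gen-go and protoc-gen-go-grpc), so generating both typically looks something like this (the paths shown assume both plugins are installed):

protoc --proto_path=src --go_out=. --go-grpc_out=. user.proto

The command in the next section generates only the message code, which is all we need for the serialization example.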

Now we're going to use protoc to generate source code from our user.proto file in our desired language. Here I'm using Golang.

protoc --proto_path=src --go_out=. user.proto

The first argument, --proto_path, tells the compiler where to look for .proto files (including any imports); --go_out means we want Go output, written to the specified directory. The last argument is the path of the .proto file. We can also generate code for other languages like Java, Python etc.; we just have to replace --go_out with --java_out or --python_out. The above command will generate a user.pb.go that implements all messages as Golang structs and types:

user.pb.go
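The generated file is long, but conceptually the UserResponse message becomes a plain Go struct with accessor methods, roughly like this (a heavily simplified sketch; the real output also contains descriptor registration and other internal plumbing):

// Simplified view of what protoc-gen-go emits for UserResponse.
type UserResponse struct {
	Status  string `protobuf:"bytes,1,opt,name=status,proto3" json:"status,omitempty"`
	Message string `protobuf:"bytes,2,opt,name=message,proto3" json:"message,omitempty"`
	User    *User  `protobuf:"bytes,3,opt,name=user,proto3" json:"user,omitempty"`
}

// Getters are generated so that reading from a nil message is safe.
func (x *UserResponse) GetStatus() string {
	if x != nil {
		return x.Status
	}
	return ""
}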

It would be very difficult to explain the implementation of the generated source code in the same blog; you can take a reference from here.

Serialization and deserialization are handled by the proto package, which provides Marshal and Unmarshal functions:

user.go
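Here is a minimal sketch of what such a user.go might contain, assuming the generated code is imported as userpb (the import path is a placeholder):

package main

import (
	"fmt"
	"log"

	"google.golang.org/protobuf/proto"

	userpb "example.com/userpb" // placeholder path for the generated package
)

func main() {
	// Build a response for a user that was not found.
	resp := &userpb.UserResponse{
		Status:  "failure",
		Message: "not found",
	}

	// Marshal serializes the struct into the compact binary wire format.
	data, err := proto.Marshal(resp)
	if err != nil {
		log.Fatalf("marshal failed: %v", err)
	}
	fmt.Printf("serialized %d bytes: %x\n", len(data), data)

	// Unmarshal parses the bytes back into a struct.
	var decoded userpb.UserResponse
	if err := proto.Unmarshal(data, &decoded); err != nil {
		log.Fatalf("unmarshal failed: %v", err)
	}
	fmt.Println(decoded.GetStatus(), decoded.GetMessage())
}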

Let's say our user doesn't exist; then the serialized data you get will look like this:

JSON O/P:- {"status":"failure", "message": "not found"}

SERIALIZED O/P:-
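Assuming status is field 1 and message is field 2 (as in the sketch above), the proto3 wire format for this message breaks down as follows:

0a 07 66 61 69 6c 75 72 65          field 1 (status): tag byte + length byte + "failure"
12 09 6e 6f 74 20 66 6f 75 6e 64    field 2 (message): tag byte + length byte + "not found"

Only the two tag bytes and the two length bytes are overhead; everything else is the actual string data.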

The serialized data contains only 20 bytes. Summing up, we get the following:

  • Serialized data length: 20 bytes

  • Informational length: 16 bytes

  • Non-Informational length: 4 bytes [“WASTAGE”]


I hope you found this article useful and informative. Stay tuned for next time.