Serialization

Avro, Protocol Buffers and Thrift
As part of the real-time bidding platform design, the engineering team at Komli evaluated the most popular data-serialization/RPC frameworks – Avro, Thrift and Protocol buffers.

Design goals
The real-time bidding architecture was designed to be fairly loosely coupled to make it easier to connect with multiple supply-side platforms (SSPs) like Pubmatic, Google AdEx, Rubicon, etc. without needing to change the core bidding engine.

Specific  adaptor

The adapter marshals SSP specific bid requests to an internal standard bid request format and vice versa for the bid response. The core bidding engine is responsible for the core functionality of deciding what and how much to bid for.
Since real-time bidding has strict SLAs of about 75 – 125ms response times including the network latency between SSP servers and our data centre, we needed an efficient and scalable request-response protocol between the adapter and the core engine. We also wanted the flexibility of coding adapter and the core engine with independent technology stacks for optimum performance, ease of coding and scalability.

RPC was a good way to reduce the overhead of the communication with the adapter talking directly to the core engine without needing anything in between to broker the communication.

Background

Protocol Buffers
Protocol buffers is the lingua franca for data at Google and used for data-serialization and RPC across Google. It includes an IDL for specifying schema files and needs an extra compilation step to generate language-specific boiler plate code for message serialization/de-serialization. One can also define RPC service interface which gets compiled into abstract type-safe interfaces that can be plugged into third-party RPC implementations.

Thrift
Thrift was open-sourced by Facebook in late 2007 with a greater focus on RPC in its core design. Thrift provides a usable RPC implementation with multiple transport options (network, memory or file) and has support for more official languages.

Avro
Avro is designed primarily for dynamic environments (Pig, Hive, Hadoop, etc.) by embedding the schema with the actual data. This allows systems to load and process data on the fly without needing to invoke IDL compilation and do selective projections on subset of data.

Based on our own benchmarks, there was little to separate the 3 libraries in serialization/de-serialization performance. We used Thrift for adapter-core engine communication as it is provides ready-to-use asynchronous RPC implementations for both server and client across languages.

We also use Avro for data serialization needs in a separate project where support for RPC was not needed. One caveat that we have encountered with the “data versioning” support by Avro is that when data is written using the in-memory writer, there is no schema stored with the data. This requires the application to do extra book-keeping and track all LIVE versions of schema outside of the data store which adds to a lot of overhead. Also, it is not very efficient to store the schema with single independent records that need to be stored separately.

Avro, Protocol buffers and Thrift are in use for data serialization in projects at Komli.