As far as I know, there is no other product available that has the same feature set as GoshawkDB; if there were, I probably wouldn't have written GoshawkDB. Consequently, any benchmark comparing GoshawkDB to some other data store is always going to be an apples-to-oranges comparison to some extent.
The best sorts of benchmarks are the ones that make the most sense for the application you're building. Ideally, build the same application several times, using GoshawkDB and using other data stores, and see which works best for you: if they all meet your performance requirements, then which leads to the nicest code? Which is easiest to think and reason about? Obviously, most of the time we don't have the time to build the same application several times in order to test different approaches, so inevitably there is an appeal to benchmarks comparing simplified operations.
GoshawkDB is a CP-system and therefore provably has to do more work than AP-systems, so a good AP-system (e.g. Cassandra) will always be able to perform faster than GoshawkDB. GoshawkDB is not a competitor for AP-systems: AP-systems cannot offer strong-serializability, and if they offer transactions at all, they tend to be one-object-only. AP-systems will always answer any query immediately no matter how many failures occur, but in doing so they sacrifice consistency and suffer from divergence. If you need the extreme performance of an AP-system and you are comfortable coping with its semantics then GoshawkDB is probably not for you.
Other popular CP data stores that make more sense to compare to GoshawkDB include etcd and Zookeeper, but these are not designed for large amounts of data (Zookeeper, for instance, keeps all data in memory all the time, and etcd offers only very limited transaction support). Traditional databases such as MySQL and PostgreSQL offer full transactions, and if configured correctly can offer strong-serializability, but they are not designed to be distributed databases, and the solutions that do distribute them often either sacrifice strong-serializability or are leader/follower designs, which means that overall performance can never exceed that of the leader on its own.
There are hundreds of other data stores out there and I have looked through many of them. A lot of the time it is very difficult to determine their semantics: to start with, is it an AP or CP system? My own frustration with determining what these data stores actually guarantee is a large part of the motivation behind GoshawkDB. How can you build an application on top of a data store and make claims about the properties of your application if it's not clear what guarantees the data store itself offers?
- GoshawkDB is designed for use on SSDs or faster storage. Please do not benchmark it running on traditional rotating hard drives. Please also make sure that a single drive is not being shared by multiple GoshawkDB nodes.
XFS is generally the file system of choice because its fsync performance is quite consistent. Other file systems tend to exhibit more variable fsync latency, which can affect benchmarks.
- If you're benchmarking on AWS EC2, I recommend using i2.2xlarge instances. Use the two SSDs and configure them as a software RAID 0. Then format the RAID 0 device with XFS. On EC2 it can also help to use a placement group for all the nodes in the cluster.
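A minimal sketch of that EC2 setup follows. The device names and mount point are assumptions (instance-store SSDs commonly appear as `/dev/xvdb` and `/dev/xvdc` on an i2.2xlarge, but check `lsblk` on your instance first); adjust them to match your machine.

```shell
# List block devices to confirm the two instance-store SSDs.
lsblk

# Combine the two SSDs into a single software RAID 0 device.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc

# Format the RAID 0 device with XFS and mount it for GoshawkDB's data.
sudo mkfs.xfs /dev/md0
sudo mkdir -p /data/goshawkdb
sudo mount /dev/md0 /data/goshawkdb
```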
GoshawkDB's design means that all nodes involved in a transaction vote in parallel. From the point at which a node receives a transaction from a client, only three network delays occur (and one disk fsync) before the outcome is known. However, as the size of the cluster increases, there's a greater probability that a large transaction will require more nodes to be contacted. Whilst messages are sent in parallel, the more messages you send, the greater the chance that some of them suffer random network delays, and the outcome of a transaction is not known until after all nodes have voted. Therefore, having a fast network which isn't full of other traffic is beneficial.
In general, many data stores will perform worse under high contention. Normally, contention is caused by multiple transactions modifying the same item. For example, in a traditional SQL database, inserts into a table may have to be fully serialized because of the need to safely read and increment a primary key. As a result, if you have hundreds of clients all trying to insert into the same table, performance will reach a certain level and can then start to fall off.
GoshawkDB is no different: if multiple transactions are modifying the same object at the same time then those transactions will have to be totally ordered, otherwise strong-serializability could be violated. GoshawkDB's advantage, though, is that you have greater ability to design your application carefully to minimise contention, by minimising the number of objects each transaction needs to touch. This is something that you, using the client, are in control of. I have had cases myself where a few tweaks, such as duplicating where the keys of nodes in a client-side skip list are stored, substantially reduced the number of objects each transaction needed to touch and resulted in much higher performance.