GoshawkDB: a rationale
Early in 2015 I spent some time looking at several popular data stores and came to the conclusion that I didn't want to use any of them. I wanted a data store that was distributed, fault tolerant, sharded and transactional, and that put semantics and guarantees front and centre. Looking around at the websites and documentation of many data stores, I couldn't find anything that had the features I wanted or that spelled out its semantics clearly: what bad things happen when failures occur? How do I code around those issues? I didn't want something that was going to break new land-speed records: sure, it's fun to be able to boast about big numbers, but I'd rather have something whose behaviour I understand and that I trust.
Like many people who take an interest in databases, I follow Kyle Kingsbury and his blog posts on Jepsen, which is a tool to test certain properties (often isolation) of data stores, particularly when they're in failure conditions. Almost without fail, he finds problems with every data store he looks at. Sometimes those problems get addressed and solved, sometimes they get mitigated. But these are not beta products that he's testing: they're widely used, established products. All software has bugs in it: I've no doubt that GoshawkDB has plenty of bugs despite my best efforts, and I'm sure the engineers of all other data stores have tried their best to produce bug-free software. But I also suspect there are an awful lot of data sets out there which either contain corruption or can be corrupted if certain events occur, and possibly no one knows what those events are.
GoshawkDB has been designed first and foremost to guarantee strong-serializability and secondly to withstand failures. If anyone finds a way in which strong-serializability can be violated then that is certainly a bug. The core transaction engine logic in GoshawkDB has been designed with the aid of model checkers and simulators (in some ways similar to fuzz-testers), and whilst the desired properties of GoshawkDB haven't yet been mathematically proved to be correctly implemented, I certainly have several proof sketches which I hope to flesh out further when I can (yes, I'm aware this actually means precisely nothing).
SQL is a venerable technology, widespread and popular. However, SQL and the traditional means of expressing your data model through tables and rows aren't a great fit for today's programming languages. Because SQL is a separate language, programs end up with a stage equivalent to "and then we convert everything to SQL", and programming languages don't model data in tables either: they use objects (which I mean in the broadest possible sense, including objects that don't have methods attached - structs). The NoSQL movement has shown that there is appetite for alternative (and perhaps more natural) ways of expressing data to data stores, though these tend to be of the more limited key-value design rather than arbitrary object graphs.
SQL can also lead to inefficiencies due to the type of thinking it encourages - very much "if all you have is a hammer, you tend to see every problem as a nail". For example, say you're storing customer orders in a traditional SQL database. You might decide that you want to know how many orders have been taken this week, and you write a query to do that. The database, though, is probably dumb: it probably doesn't understand which events would cause the calculated value to change. As a result, whenever you run that query, the database will probably repeat the same work. For simple queries this obviously doesn't matter: there's no need to cache the value and deal with all the complexity that would bring.

But consider what you'd write if you never had that "and then we convert everything to SQL" step. Imagine instead that your program just ran forever: it was its own database - it never went wrong, it never needed to be restarted. How would you model your data now? You may well decide on a completely different structure: you might have an object per week, and that object is mainly just a list of pointers to the orders taken. This could well work out to be a much more efficient solution: we can now guarantee that we can perform concurrent operations on different weeks without any contention or needing any locks. The SQL equivalent would be to have a different table per week, but whilst that's possible, it's not the normal thing to do: the database schema tends to be designed and fixed up front.
Another example of this is the use of object-relational mappings, which frequently produce one table per type of object. That can mean that when multiple clients modify objects of the same type, their actions contend in the database, even though if those objects weren't stored in the same table there might be no possibility of contention at all. In short, I feel there is a substantial mismatch between traditional SQL databases and modelling data in programming languages, and the translation steps between the two can often add unnecessary complexity and bugs, and hurt performance. All that said, it's certainly clear that SQL still has a huge role to play in data stores: for scenarios where you need to be able to issue arbitrary, previously unknown complex queries at run-time, SQL is not going anywhere, and rightly so.
Transactions are clearly not an essential feature: after all, we don't have transactions when we write to memory, and we cope perfectly well when doing that. Well, no: that's wrong on both counts. Intel is actually adding transactional instructions to its CPUs at the time of writing, and writing concurrent programs that mutate shared memory is notoriously difficult. Transactions are certainly not essential, but they are very much a nice-to-have. Without transactions you need some other atomic operation - often compare-and-swap (CAS). If your data store supports such an operation then there is nothing that can be done with a transactional data store that you can't do with one that only supports CAS, but your code may be an awful lot larger and a lot more complex. The more code you have, the more bugs you have.
I am currently still unaware of any other data store that offers the same set of features as GoshawkDB, in particular its guarantees of strong-serializability, fault tolerance, distribution and sharding, and the efficiency of its distributed transaction implementation. Obviously I'm biased, but GoshawkDB seems to fit the requirements I have for a data store that I want to use, understand how to use, and can trust. I hope I'm not the only one.