A distributed, transactional,
fault-tolerant object store

GoshawkDB 0.3.1 released

Thursday 24th Nov, 2016

There was a mistake in the build for GoshawkDB 0.3. This affected all server binary artifacts: deb, rpm, tarball and the docker image, and the Java client and collection artifacts. The mistake meant clients could not connect to the server. GoshawkDB 0.3.1 corrects this. No other changes have been made. The 0.3 Go client works with the 0.3.1 server. Download.

CodeMesh talk - now public!

Tuesday 22nd Nov, 2016

Very pleased to say that the talk I gave at CodeMesh earlier this month on GoshawkDB is now public. I very much enjoyed making and delivering this talk; I hope you enjoy it too. Apologies for the frequent sniffing - I had a cold! Also the slides are available to download.

GoshawkDB 0.3 released

Friday 18th Nov, 2016

Over six months after the 0.2 release, and somewhat later than I'd hoped, I'm very pleased to be able to release GoshawkDB 0.3. GoshawkDB 0.3 adds support for Object Capabilities and Multiple Roots which allows you to give different accounts access to different data and different capabilities on the same data. We also now have a collections library which initially contains a single map collection. This provides a language neutral way to use efficient maps from your application code and have the entire map stored in GoshawkDB.

Other features and improvements include:

  • A lot of optimization work for serialization of messages between servers. This has lead to very significant increases in performance.
  • Many bug fixes.
  • Creation of a test harness that can create and bring up clusters, kill them, run tests against them, mutate the cluster, cause failures etc.
  • More bug fixes.

GoshawkDB 0.3 is available to download.

On the importance of Terminology

Monday 14th November, 2016

Databases, in general, are notorious for not defining their semantics clearly, and this was one of the motivations for GoshawkDB. Just as bad is when databases are lazy with terminology and language: it only serves to confuse users. In theory, we could define some mathematical notation and then very precisely specify behaviour, but that doesn't lead to particularly welcoming documentation. Those of you who've seen any of my talks will know that I tend to spend a bit of time lamenting how meaningless terms like snapshot isolation are: they tell you nothing about the relation with other isolation levels. Such terms typically refer more to an implementation strategy rather than anything to do with semantics.

So really, this post is a plea to all of us involved in databases to try to be more precise and accurate in our usage of terms.

For example, I frequently see linearizability defined lazily as a "total order of operations". This is false. Linearizability requires a total order of operations on each single object. Not across the entire database. Peter Bailis defines it as follows:

Linearizability is a guarantee about single operations on single objects. It provides a real-time (i.e., wall-clock) guarantee on the behaviour of a set of single operations (often reads and writes) on a single object (e.g., distributed register or data item).

Even this I would quibble about (though it's unlikely that those two sentences were intended to be a full specification): it does not specify who's wall-clock we're talking about - in a distributed system there will be many (at least a client and a server node), and they may not advance at the same rate, and they may not be in the same frame of reference.

But importantly, we're talking about being able to order operations on single objects. So a linearizable system, when presented with two operations, one which modifies x and the other which modifies y, and those operations arrive concurrently from different clients, does not need to impose an ordering on those operations. It is not necessary for the system to be able to tell you which "happened first", and indeed it may well not be possible: these operations could be processed completely in parallel with disjoint sets of nodes processing each operation. Even here, the definition of "concurrently" needs refinement: it's probably sufficient to say something like "there is no causal relationship between the two operations". E.g. client A could connect, perform an operation on x. Then "five minutes later", client B connects and performs some operation on y. There is no causal relation between these two operations (client A and client B are different clients, and client B has not observed the effect of client A). If these are the only two operations going on, then the database system could legally record that the operation on y happens before the operation on x, though it is not necessary for it to record any ordering information at all.

Why would this be legal? Well, there is no way to tell the order in which these events were emitted from their clients. It's quite possible that the operation from client B was emitted first, but got held up by networking issues, or by the fact it was a very very long way away. Unless either client first observes the effect of the other operation, there is no causal relationship between the operations, and so either order is viable, and it's quite likely imposing no order gives the widest range of implementation approaches (and possibly greatest performance). If you think about it as though client B is simply a long long way away from the server then defining "now" becomes very difficult. If client A and client B both emit their operations at the same time (imagine you are equidistant from both clients and so observe them both emitting their operations at the same time), but the server is much closer to client A then client A's operation will arrive first. Should this influence ordering of operations on the same object? Our equi-distant observer could even see client A's operation emitted after client B's. But client B's still arrives at the server last. What is "now" here? What order is actually wanted? When does it matter?


Several database systems (not GoshawkDB) adopt "Last Write Wins", in which if there are two writes to the same object, the one which happens "later" wins when trying to resolve what the current value of the object is. This means that each operation has to have a timestamp of some sort associated with it, and often this uses simple wall-clock time (also called physical time). The timestamp typically is taken from the client, and this should ensure that the system achieves causality at least from the point of view of multiple operations from the same client. If instead, the timestamp is added by the first server that the operation arrives at then you could have the scenario where the client is connected to several servers of the same database system and sends different operations to different servers. Even with "perfect" non-drifting clocks in each of these servers, you could still have the situation that due to distances between the servers and the client, the timestamps assigned by different servers to different operations result in an order of operations that is different from the order in which the operations were emitted by the client.

Timestamps created by clocks on the clients are still problematic though: firstly they can drift, and if we employ NTP very badly they could even move backwards (though I believe manual intervention would be necessary to achieve that). If the rate of change of an object in the database is greater than the accuracy of your clocks then "now" may become very odd indeed. For this reason we then have things like Spanner with TrueTime and Hybrid Logical Clocks which either account for the uncertainty of wall-clocks, or introduce a logical component that essentially tracks causality of events between different nodes a la Lamport Clocks.

But it's still important to understand the limitations of these approaches: even with perfect clocks that don't drift, we're still subject to physics and so delays due to distance, in addition to switches and routers. Defining what is meant by "now" is still really tricky, if not impossible. And that's before we get to kinematic or gravitational time dilation. Gravitational first: a perfect clock at the top of a mountain will find that time travels faster than an equally perfect clock at sea level; the closer you are to the bottom of a gravity well, the slower time passes. So by using NTP or other means to try to synchronise time between two servers in these different locations, what we're really trying to do is get them both into the same inertial frame of reference. Thus we will slow down the clock of the high-up server and speed up the clock of the low-down server. Kinematic time dilation is even more fun: for clocks that are in motion relative to each other, time does not pass at the same rate. And in combination, the results are not small either: fly around the earth and you gain nearly 300ns! So if you have clients to your database system that are mobile, it gets even harder to define "now": at this stage I simply have no idea whether it even makes sense to try to adjust their clocks to bring them into the same frame of reference, let alone if it's possible.


In database systems, we use and invent a lot of terminology all the time and rather than introduce new words which we could then precisely define the meaning of, we tend to reuse existing words. This alone has problems as the existing words may give us some intuitive clue as to what's really meant and that can result in us not bothering to look up and understand the precise definition. Understanding and comparing databases is hard enough even if all the terminology is agreed upon and used correctly by everyone concerned. But this is frequently not the case.

I feel that database and distributed system engineers also frequently pretend that we are not subject to the "laws" of physics (or at least we often fail to consider their impact in anything other than the simplest of cases). Whilst issues of special and general relativity may not yet be causing most of us too many issues, it may not be long until they do. Systems that rely on some vague notion of "now" instead of overtly tracking causality may increasingly find themselves in difficulty.

Code Mesh London 2016

Saturday 1st October, 2016

I'm very pleased to announce that I've been invited to give a talk on GoshawkDB at Code Mesh. The talk is on November 4th (day 2) at 10:15am. Earlier this year I gave a technical talk at QCon London, which focussed on some of the key algorithms that make GoshawkDB work. This time, the talk will be a bit more general, looking at what GoshawkDB is good for and what the properties are that it achieves. There will be some theory and some code too!

Hope to see you there!

Object Capabilities and Authority: design challenges

Monday 29th August, 2016

Back in mid-June I did an initial design and implementation attempt of Object Capabilities in GoshawkDB. I never merged that branch at that point because I didn't like either the design or the code, and I couldn't come up with a sane API to expose on the client library. Having been distracted by optimisations over the past couple of months in other areas of the code (which has been very fruitful!), I'm now back looking at this branch again.

In traditional SQL databases, you normally have user accounts, and to those accounts you can grant various types of privileges: for example retrieving data from a table, adding data to a table, deleting data, creating new tables and so on. Typically these are fairly coarse: you can normally restrict a user account to only being able to retrieve data from a single table, but you can't normally restrict it to individual rows or columns on that table (though views can offer a solution there if the database supports it). However, the security concerns and APIs are completely disjoint from the data APIs: when you add a row to a table you don't at that point choose who can in the future retrieve or modify that row. Arguably, because of this separation, restricting user accounts to meet a principle of least authority is something often punted to the end of the project: the classic "we'll add security in later". And then of course, it often never happens because it turns out you're using connection pooling in your application and rewiring all that so that different actions use different accounts with different privileges is going to be painful and clunky and the project's already over budget and everyone's sick of it and burnt out.

Because GoshawkDB eschews the concepts of database tables, rows and columns, and instead is a store for object graphs, I've been planning on using Object Capabilities to implement per-object security. Object Capabilities have been studied for many years (several decades) but are not widely used just yet. Probably the most frequently cited resource on this is the Capability Myths Demolished which Adrian Colyer covered back in February 2016 over on The Morning Paper. The simple concept is that, if you receive a pointer to an object, that pointer contains within it the capabilities you now have to interact with that object. If you receive multiple pointers to the same object, each of those pointers can contain different capabilities and so then your ability to interact with the object would be the union of all those capabilities. Importantly, different users can receive different pointers to the same object, granting each user different capabilities to use that object. The pointers have to be immutable and unforgeable and whenever you use a capability you need to present evidence that you received the necessary capability and that has to be enforced by the system and so on and so forth. But the concept is quite simple.

This idea allows us to usefully model and constrain access to common data structures. For example one pointer to a map or dictionary could grant the receiver read-only access, whilst another pointer to the same structure would grant the receiver read-write access. Equally, we could have a queue where one pointer allows for the head of the queue to be read and advanced, whilst another pointer to the very same queue only allows for items to be appended. This achieves the principle of least authority, and in some ways mirrors the use of interfaces in many programming languages: presenting an object as being of a type which allows access to only a subset of the object's true functionality. This design also brings security right into the data and makes the programmer think about how the data is to be accessed and manipulated: it makes it harder to try to pretend that security is something that can be added later (though yes, I do know some programmers who when writing Java make every field and method public).

In programming languages we tend to see both private and public fields as well as private and public methods. Some programming languages implement more of this (e.g. Java's default package-private and protected visibility modifiers), some less: for some languages everything is actually public it's just convention that you don't access fields or methods named in some particular way. The languages that I've worked with tend to allow interfaces to only be made up of methods, and not fields, but this is a fairly arbitrary constraint.

It is also worth remembering that in most programming languages, these visibility modifiers are erased during compilation: at run-time there is no checking or enforcement at all (if the language supports reflection then these visibility modifiers can be retained to help out with the reflection APIs). So if your program does have a vulnerability and someone manages to exploit that vulnerability, then they can almost certainly write to any memory location within the program's address space: as far as I'm aware, there's no support at CPU level or language run-time level to enforce these sorts of visibility constraints (interpreted languages may have an advantage here (assuming the interpreter has been proved unexploitable) but that goes away once enough optimisations and JIT-compilation gets implemented (again, unless the compilation has been proved unexploitable). At the moment, the best you can hope for is non-executable data areas and making the code areas read-only, but even then there are ways around that (for example Return Oriented Programming). So these modifiers are used by the language compiler to construct a proof that the program, provided it is not exploited in some way, does not violate these visibility modifiers. That is certainly useful and helps with documentation of APIs, intuition as to how to use them, code design and structuring, but it's certainly not normally an actively-checked defence against malicious actors (and even if you use something like E, with capabilities cryptographically signed and checked, unless you have special signing hardware available, it seems to me that if you can read memory then you can read out the keys, and then you can still do anything you want and forge anything (if I'm wrong on this, please let me know!)).

Back in the context of a program using a traditional SQL database with different user accounts; even if multiple accounts have been set up and principle of least authority has been achieved, if the same program is using multiple different accounts then an exploit will allow the attacker to extract all those access credentials, so you really need at least different programs for each user account and then rely on security of address-space separation and process isolation, or maybe even separate VMs or machines to ensure that an exploit on your most public facing (but hopefully least privileged) service doesn't leak powerful database user credentials. In this case, a database will check and enforce that the actions requested by a user account do not violate the authority granted to that account, so that superficially feels as though it's more secure than what programming languages offer us, but of course this assumes the database itself isn't vulnerable to any exploits.

In the context of GoshawkDB with a hypothetical implementation of Object Capabilities, let's imagine that you've received a pointer (reference) to an object (let's call it object X) and that pointer only grants you authority to set the 4th reference of X. This could be the consequence of some interface where you're only allowed to call X.SetFoo(foo) and it just so happens that the way the object is laid out and serialized means that the field within X that stores the current foo stores it as a pointer in the 4th reference slot. Now, the server presumably shouldn't trust clients, as the client is likely to be more user-facing and so prone to exploitation than the server. So when the client is first sent the object X the server should completely strip out all other data about the object: if the client never receives all the other data that makes up X and the object graph descending from X then that data can't possibly be leaked from an exploited client! So now what should happen if the client legitimately calls X.SetFoo(foo)? The client thinks that X only contains a single field which is holding foo and so as far as the client is concerned, this is a very straight-forward write-only transaction. The problem though is that once this transaction gets up to the server, presumably the server will have to add back in all the other data it previously masked out. Does that mean that this transaction is now actually a read-write transaction (i.e. should it be rejected and restarted if some other client in the mean-time has changed any other part of X)?

I think modification of transactions received from clients may be the right thing to do, and as ever in computing, there is prior art. Whilst x86 CPUs do byte-addressing of memory (as opposed to word addressing), CPU caches are loaded a cache-line at a time (where a cache line is much bigger than a single byte). When you write to a particular address, that write will first go into the relevant cache line and then depending on the cache write policy it will eventually make its way back to main memory as part of a (re)write/flush of all the data in that cache line. Now in a system with multiple CPUs (easiest to think of multiple CPU sockets, as I believe with multi-core single sockets there are sometimes unified caches as well as per-core caches) there will need to be some form of cache-coherency control otherwise you can't implement atomic primitives such as compare-and-swap. So essentially, this does amount to promoting a write-only action to a read-write action: sure, it's certainly not a read in the sense of MVCC, and I believe it's normally implemented as an exclusive write lock on those addresses in the cache line in question, but it is the same concept: in the time between the load of the cache line and the store, there cannot be any concurrent modification of those addresses. In short, multiple threads may be writing to very slightly different addresses on different CPUs with different caches, but because those addresses end up in the same cache-line, those concurrent writes can invalidate each other and so they have to be fully serialized as if they were read-write actions.

So this is what I'm currently working on and thinking about. Coming up with a design which leads to useful levels of security (which starts by figuring out what you're actually trying to protect against), designing sensible APIs which don't cause the code that uses them to look verbose and complex, and a design which can be implemented efficiently, all seems to be quite hard. If you have any thoughts, ideas or comments, as ever, please let me know.

QCon London talk - now public!

Monday 30th May, 2016

Very pleased to say that the talk I gave at QCon London back in March on GoshawkDB and its particular variety of vector clocks is now public. I very much enjoyed making and delivering this talk and was very impressed with the QCon setup and the support they gave to speakers. I hope you enjoy it too.

GoshawkDB Java Client Released

Saturday 28th May, 2016

A fully featured Java client is now available under the Apache License 2.0. This client supports all the same features as the Go client, and the API is very similar too. The Java Client works with GoshawkDB server 0.2 and, as with the Go client, new releases of the server will be supported by corresponding new releases of the client.

GoshawkDB 0.2 released

Friday 6th May, 2016

Over four months after the 0.1 release, and somewhat later than I'd hoped, I'm very pleased to be able to release GoshawkDB 0.2. The headline feature is the ability to change the cluster configuration: adding nodes, replacing nodes, removing nodes, and changing various other properties of the cluster. I'd originally wanted this to be part of 0.1 but decided that it was going to be a lot of work and it was more important to get 0.1 out sooner. In hindsight, I'm very glad I made that decision: the branch adding support for configuration changes turned out to be 116 commits long and took just under four months of work. The code added to achieve this makes up nearly 15% of the code base of the server. It has been quite a challenge.

Other features and improvements include:

  • GoshawkDB 0.1 required the underlying filesystem to support sparse files because it would create a 1TB file on initial start up. This was a major problem for OS X users. This has been solved now: GoshawkDB 0.2 is able to dynamically expand its disk store automatically on demand.
  • Encryption of connections has been changed to use TLS 1.2.
  • Better packaging: debs, rpms, and tarballs are available to download. OS X users are supported with GoshawkDB's Homebrew Tap. GoshawkDB 0.2 remains Go gettable if you wish to build from source.
  • Many bug fixes across the code base generally.

GoshawkDB 0.2 is available to download.

QCon London talk

Wednesday 9th March, 2016

Yesterday I gave a talk about GoshawkDB at QCon London 2016, and in particular the novel ways in which GoshawkDB uses vector clocks (or something like them) to model dependencies between transactions and so achieve strong serializability. I was delighted to be invited to give the talk and very much enjoyed it and I would like to extend my thanks to the track chair, programme committee and all the organizers of QCon.

I think I was giving probably the most complex talk of the day and I had a lot of material to get through so I apologise to any members of the audience who I may have lost along the way - hopefully you'll have the chance to revisit the material once the video is out, and please get in touch if you'd like to learn more.

I'll update this blog as soon as the video becomes available for all, but in the meantime, here are the slides. The slides are a big PDF (41MB) because they contain animations. I'm afraid I can't guarantee the animations will work on all platforms, but hopefully they're a minor part of the slide deck and the rest of the slide deck should work just fine.

QCon London March 8th 2016

Thursday 4th February, 2016

I'm very pleased to be able to announce that I'm going to be presenting on GoshawkDB at QCon London 2016 on Tuesday March 8th 2016. I'm in the Modern CS in the real world track, and I'm going to be talking in detail about how GoshawkDB uses Vector Clocks to capture dependencies between transactions and achieve strong serializability. The session will be recorded and the recording and slides will be available.

The abstract for the talk is:

It's well understood how logical clocks can be used to capture the order in which events occur. By extension, vector clocks encode when an event occurred across a distributed system. GoshawkDB reverses this idea: by analysing the dependencies between transactions, each participant calculates a vector clock that captures the constraints necessary to achieve strong serializability. The vector clocks from the different participants in a transaction can then be combined safely using the same techniques as CRDTs. This allows GoshawkDB to achieve strong serializability without imposing a total global ordering on transactions.

In this talk I will demonstrate these algorithms: what dependencies do we care about between transactions, how can we capture these with Vector Clocks, how we can treat Vector Clocks as a CRDT, and how GoshawkDB uses all these ideas and more to achieve a scalable distributed data store with strong serializability, general transactions and fault tolerance.

The rest of the track looks pretty interesting too, so if you're interested in learning more about GoshawkDB's innards or just want to come along to meet up and have a chat, please do; it promises to be an exciting day!

Simulation and Modelling in GoshawkDB

Monday 11th January, 2016

The inappropriate use of and reliance on unit tests is something that consistently frustrates me. Projects that celebrate when they have twice as much test code as project code confuse me. All code has bugs: I once read something that suggested that even in battle-tested and trusted code that's been around for a long time, you can expect to find about one bug per thousand lines of code. I don't know if that's true or not, but it seems reasonable to me. Now I certainly recognise that there are many ways to end up with relatively bug-free code, so I'm not suggesting that EVERYONE ELSE IS WRONG. This post is about how I've approached testing and development in GoshawkDB: it's about what I've chosen to do and what's worked for me. Please don't feel that I'm claiming that if you don't do exactly this, you're going to fail.

My big fear with writing unit tests is that they tend to test only a very small fraction of the control-flow paths. You might have 100% code coverage, but only exercise a fraction of the control-flow paths, so you could still have lots of bugs. A contrived example could help here:

func maybeCreateMap(x int) (m map[int]bool) {
    if x > 7 {
        m = make(map[int]bool)
    } else if x < 3 {
        return nil
    m[x] = true
    return m

Now I'm sure this is some of the stupidest code you've ever seen, but that's only because the problem is obvious. You now write two tests, one calls maybeCreateMap(8) and the other calls maybeCreateMap(1) and you have 100% code coverage. Job done; have a beer. But there's a control flow path that's not been covered: maybeCreateMap(6) will panic.

I'm perfectly happy to write unit tests for what I would describe as leaf-functionality. By that I mean modules of code that call very little if anything else. For example, there are unit tests for my skiplist implementation. This skiplist implementation is pretty much on the boundary between something I'm happy with just unit tests on and something I'd be much happier with fuzz testing. With fuzz testing, the idea is that the test framework generates directed random values as arguments to your functions and watches which control flow paths get hit. It will then try to adjust the input values to force unexplored parts of the control flow graph to be exercised. It's on the boundary because I (certainly wrongly) believe it's simple enough that I can reason about it in my head: it's not a huge amount of code and there's no pretence that it's thread safe. The benchmark code that's in the same test file is probably actually better than the unit tests for exercising more control flow paths: this is further evidence that in this case, my unit tests are inadequate, and not an argument for relying on benchmark code for testing!

There are also unit tests for the various consistent hashing algorithms in GoshawkDB, again with benchmark code. In this case, the unit tests are also (at least in my mind) acting as further documentation because sometimes it is difficult to describe accurately in English what the relationship is between inputs and outputs: a picture is worth a thousand words, and sometimes tests paint a pretty reasonable picture. There is also lots of benchmark code in there which helps identify the algorithm's complexity class. For code that's on the critical path (as this is), knowing how performance changes as the cluster gets bigger is pretty important to me.

Moving further up, there is nothing further until you get to the full integration tests that operate a client against a cluster of servers. These range from the very simple, to integration tests that try to provoke specific anomalies which should not occur under strong serializability. However, none of these integration tests are worth anything more than smoke signals: if they start failing, then yes, something has broken. But just because they are passing does not mean the algorithms and code are correct.

For developing the core transaction engine algorithms several things were clear from the outset.

  • I don't know ahead of time how to do this: I've never worked on a database before that has the properties I want GoshawkDB to have. Thus this is not just a matter of implementing some algorithms I've read in a paper. Quite literally, I don't know what I'm doing, even after having read a lot of papers.
  • If I'm going to avoid forcing a total global ordering of transactions, the consequence is GoshawkDB must cope with transactions arriving at different nodes in different orders.
  • Therefore, I need to be able to cope with every possible order.

Consider algorithms like Paxos: I don't think anyone would claim you could demonstrate the algorithm is correct by writing unit tests. Instead, you see both proofs that important properties cannot be violated, and you see modelling and simulation of the algorithm using tools like TLA+. I have used TLA+ a bit and I like it a lot. The purpose of a model checker is to try to systematically explore the state space of a model of an algorithm, or at least to explore enough of it that you have a high degree of confidence the algorithm is correct. It is not normally equivalent to a proof: for algorithms like Paxos, you would run the model for a certain number of agents, messages, ballots and so forth. If no property violations are found then you might want to increase these parameters, but soon enough you'll find the model takes weeks to explore the state space. If no problems are found, you can gain confidence your algorithm is correct, but you won't have a proof as you cannot completely explore an infinite state space.

For various reasons, rather than using TLA+, I decided to write my own permutation generator. The advantage of writing my own is that by writing it in Go, my models would also be in Go, so I hoped that my eventual implementation would be the exact same code as that which had been model checked (this turned out not to be the case because in a model, you inevitably make some simplifications that you can't ignore in the real implementation). It also means you need to learn the semantics of only one language instead of two, and even if the model implementation and the actual implementation are different, it should be easier to translate between the two (and demonstrate equivalence in a hand-wavy sort of way) if they're in the same language.

So, I start with a scenario. This specifies several transactions and what each of them do (read and write various objects). The scenario does not specify any concept of which client submitted which transaction: instead I assume all the transactions come from different clients at the same time. The scenario is first fed through a serial engine which receives the transactions of the scenario in every possible order (permutation). Different permutations will have different outcomes: remember that in GoshawkDB, the transaction is the complete record of what was read and written, and is known only once the client transaction function finishes. So a transaction could include something like "I read object x at version 3". If by the time the transaction arrives at a node carrying object x, object x is no longer at version 3 then the transaction must be aborted (the client would be told to restart the client transaction function against an updated copy of object x). The serial engine processes each transaction completely, synchronously, in the order in which it arrives as specified by each permutation. Every unique history generated is retained: this is the record of what is acceptable for this scenario. The history for a permutation is the list of which transactions were committed in what order.

For a transaction in GoshawkDB, there are events that can occur in any order (the transaction arriving at various different nodes), and there are dependencies between certain events (the outcome cannot be known before all the voting nodes have voted). This can be represented as a graph, and from the graph I can generate every possible permutation of events, according to the constraints of the graph. Each permutation represents a unique list of events, or instructions. These instructions are things like "txn 1 arrives at node C", or "the outcome of the votes for txn 2 arrives at node B". These instructions then get interpreted by the model of the engine.

So what is the property that I'm checking for? That the history of transactions is serial. I.e. every node modelled will produce a total ordering of transactions in which they were involved. This is the transaction history of the node. If you merge all the transaction histories together from all the different nodes, you should be able to construct a DAG which if flattened through a topological ordering, should match a history generated by all the transactions passing in some order through the serial engine: it matches one of the unique histories I previously generated through the serial engine.

Now all this is just for one scenario. So the next step is to be able to systematically generate scenarios. I want to be able to say "generate every scenario using from 0 to 3 transactions and each transaction can manipulate from 0 to 3 objects drawn from the set of objects x, y and z". And then for each of those scenarios, perform the above model checking. Needless to say, once you get scenarios with several transactions and objects, the number of permutations of events becomes very large, so model checking takes some time. For example, a scenario with three transactions, each of which reads one object and writes a different object where each object has just one voting node, has 638,625,600 different permutations of events.

At least six months of 2015 was spent developing the model of the engine: it's called matthew7 for a reason (and starts with several hundred lines of comments), and actually if I'd kept adding files rather that rewriting, it would probably be called something closer to matthew37. In many ways, I would suggest this is pretty close to TDD: I knew the property I wanted the engine to have in advance, and I just had to continuously iterate on the algorithm until I was happy it wasn't going to fail the model checker if I let the model checker run for longer.

Finally, some downsides. It is very tempting to play whack-a-mole when developing against a model checker: something goes wrong and the model checker explodes at you, but at that stage you don't have enough data points to be able to really see what the problem is, so you just fix the specific, not the general. This can get pretty dangerous because it's easy to let the model checker run for a few days and then claim "...and therefore it must be correct". If the model checker hasn't gone far enough to expose the next data point then you're doomed. This happened a couple of times to me: I thought I had finished the engine around July 2015 and had been working on the main implementation for a few months. Late October 2015 I suddenly thought of a particular sequence of events that would cause a problem for the engine. I knew that the model checker hadn't got that far, and sure enough if I manually constructed that scenario then the model showed a problem. Once again I realised I was going to have to substantially alter the engine. This time though, I did do some proof sketches with paper and pencil. Whilst that doesn't mean it's correct, and my own personal history with formal reasoning about algorithms is far from perfect, I believe I have some evidence that I'm not just relying on the model checker having not got far enough.

Every day, programmers invent algorithms. A lot of the time, the algorithms are simple, and they can be reasoned about just by sitting (or standing) and thinking about them. For these cases, unit tests may suffice. At the same time, lots of algorithms are invented that even if they look simple on the screen, can have very complex behaviour because the algorithm is distributed, parallel, asynchronous and because events can occur in different orders. I find reasoning about these types of algorithm very hard and so having confidence that a change either fixes a problem or introduces a problem is extremely difficult. Simulation and model checking is a very useful tool for these sorts of challenges.

Transaction life-cycle overview

Thursday 24th December, 2015

Today there has been quite a lot of discussion online as to how GoshawkDB works so I thought I'd try and quickly write a post to help out. This is going to be quite technical.

Transaction life-cycle is roughly as follows:

  1. Client sends a transaction payload to the server. The server identifies which objects have been read (and at what version), written, and created. For all of those objects, the server knows which of the nodes that make up the cluster hold copies of the objects (not every object is held by every node: GoshawkDB is sharded).
  2. This server calculates a roughly minimum set of nodes such that within this set, for each object in the transaction, there are at least F+1 copies of each object. The server now slightly reformats the transaction, but basically just sends it on to these nodes. These are the voting nodes (who will become proposers in the Paxos sense). The first F+1 of these nodes are the acceptors of the Paxos instances.
  3. Each voting node receives the transaction and identifies which objects within the transaction it is meant to vote on. It asks those objects to vote. Now in the case of GoshawkDB, a vote is not just a commit/abort decision. It is a commit/abort decision with a Vector Clock. I read a lot of Lamport's papers, and I also spent a lot of time thinking about Vector Clocks. One of the things I realised is that they can sort of be used backwards. The way logical clocks are described tends to be "you assign a counter to an event, and then everyone who receives two events can order them based on counters". But you can also do almost the reverse: "you receive two events and decide that one has to be before the other, so therefore you ensure that the one you want first has a lower counter". Obviously, this is hardly ground-breaking stuff, but it's the key of GoshawkDB: F+1 copies of each object in a transaction votes on the transaction and include their current Vector Clock in the vote. Thus if an object copy knows it wants one transaction ordered before another, then it can express this through Vector Clocks. Thus the usual set of dependencies between transactions can be captured.
  4. All the votes for the transaction within a voting node are gathered up and then Paxos instances are started for each object copy. Thus there is one proposer by design for each Paxos instance, so you can immediately skip phase 1 of Paxos in the normal case and go straight to your 2A messages. The acceptors receive your proposals. This particular use of Paxos is described in Gray and Lamport's paper Consensus on Transaction Commit.
  5. Hopefully, all the acceptors receive all the votes. If everyone has voted to commit, then all the various different vector clocks can be combined to form the final vector clock of the transaction (combining is just merge with pick-the-biggest). This outcome is now written to disk by the F+1 acceptors. Once it's been fsync'd to disk, and all acceptors have calculated the same result, then consensus has been reached and is stable. Now the acceptors send the result (which in the Paxos sense is the 2B message) to 1) the original server who received the transaction from the client - the outcome is known so it can go straight back to the client; 2) the voters/proposers; 3) every other node who has a copy of the objects in the transaction but who were not part of the initial selection of nodes: these are the learners in the Paxos sense.
  6. The voters and learners receive the outcome and can modify their own state, writing to disk if necessary. Once they've correctly processed the outcome of the transaction, they can then send a message to the acceptors informing the acceptors that that transaction is finished and the acceptors can delete and forget everything about that transaction. These steps can safely happen after the client has learnt the outcome of the transaction.

Thus we have three network-delays before the outcome is known: 1. initial server to all voters; 2. all voters to all acceptors; 3. all acceptors to initial server and all voters and all learners. Regardless of the size of the cluster or the value of F, the number of network delays is the same. By contrast, some other distributed data stores have to pass distributed transactions through a linear chain of server nodes - in that case, the number of network delays would be proportional to the size of the cluster. Thus in this way, GoshawkDB scales horizontally.

Vector Clocks are normally defined as having a fixed number of elements. There has been some research on Vector Clocks that can expand and contract but not much. If your Vector Clock gets very wide then you will find serialisation costs and network costs go up rapidly. A transaction in GoshawkDB will end up with at least one element in the transaction's Vector Clock for each object (not object copy) in the transaction. They can be wider. But by carefully reasoning about causality, GoshawkDB can safely shrink Vector Clocks back down very quickly. As a result, they remain an efficient means to model dependencies between transactions.

The same Consensus on Transaction Commit paper discusses two other issues. Firstly how to deal with failures. When a voter fails, the acceptors can notice and can start additional Paxos rounds (this time starting in phase 1) to discover what has happened so far and what to propagate. If nothing has been decided then they will attempt to get abort accepted. When an acceptor fails, the voters notice and choose a new acceptor: for each transaction there are 2F+1 acceptors predetermined, but only the first F+1 of them will be used until failures happen. In this case voters will again start additional Paxos rounds with the intention of aborting the transaction. These additional Paxos rounds can often find that actually the original voter voted to commit before failing. In this case the additional rounds will propagate that vote sufficiently that the whole transaction can still commit, even after one of the original voters or acceptors has failed.

Secondly, that paper shows that Two-phase commit (2PC) is equivalent to this use of Paxos with F=0, i.e. 2PC does not have the ability to tolerate any failure in general. The paper describes how to use Paxos synod instead of 2PC and gain the ability to withstand failure. By using Paxos in this way, GoshawkDB is set apart from all data stores that use 2PC to manage votes from different nodes involved in the transaction.

GoshawkDB 0.1 released

Tuesday 22nd December, 2015

According to the original repository, the first commit was on March 30th 2015, but there was already a fair chunk of simulation code at that point. None of my written notes are dated (and there are hundreds of pages of them) so I'm not sure when I really started working on GoshawkDB, but it was probably around the start of February 2015. 11 months later and around 650 commits later, I'm very excited to be able to release version 0.1.

Right now, GoshawkDB is not production ready: there are too many critical features missing, and almost certainly too many bugs undiscovered to me. The server is only 10kloc, and the Go client is a mere 1200loc (which makes me believe that writing clients for other languages is not going to be a big task). Despite its size, GoshawkDB is a powerful data store which can do a lot and takes very seriously the semantics and guarantees it's been designed for.

In a future post I'll talk more about the architecture of GoshawkDB, the development methodology (an awful lot of simulation and model checking has gone into this) and the structure of the code, but that'll have to wait for now. It's been an amazing year: I'm incredibly fortunate to have been able to work full time on this project for a whole year. GoshawkDB wouldn't exist without the work of many others and several algorithms and papers that seem to have thoroughly overlooked. I've massively enjoyed learning a huge amount about not just databases, but consensus, distributed systems and other areas, and the challenge of creating GoshawkDB. I hope you enjoy it too!