A distributed, transactional,
fault-tolerant object store

Data Model

GoshawkDB presents a very simple data model, but it's quite unlike that of most other data stores. The data model consists of only two types: a byte array ([]byte) and object pointers (the terms "object pointer" and "object reference" are used interchangeably in GoshawkDB). So we can define an Object as a tuple of a byte array and an array of pointers to Objects:

Object := ([]byte, []*Object)

This is extremely low-level: the fact that even basic integers or booleans don't explicitly exist in the data model puts this some way below even assembly languages. On the other hand, the means to safely access and modify this data within transactions in a distributed setting are very high-level and rich, especially with features such as retry. Once you take Object Capabilities into account, it should be clear that an Object pointer carries some meta-data, but there is no ability to attach arbitrary meta-data to an Object pointer.
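
To make this concrete, here is a minimal sketch of the model in Go (GoshawkDB's own implementation language). The type names are purely illustrative and are not the client API:

package model

// Object mirrors the tuple above: an opaque payload plus an ordered
// array of pointers to other Objects. Anything like an int or a bool
// has to be encoded by the application into the Value byte slice.
type Object struct {
    Value      []byte
    References []*Object
}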

So what does this mean for modelling your data? Various questions may spring to mind:

  • How do I encode object state - fields etc.?

    This is entirely up to you and there are several things to take into account:

    In which languages do you need to be able to encode and decode objects? Some serialisation formats are not equally well supported across languages.

    Performance may be something to think about too: profiling of high-performance distributed applications fairly commonly reveals that serialisation (and maybe deserialisation too) can eat up a lot of CPU. Some modern serialisation schemes are arguably harder to work with than others. For example, MsgPack puts the encoder in charge of how a number gets represented in encoded form: even though you may think you sent an int, the encoder is allowed to send it as a float, so on the decoding side you have to be able to cope with receiving any type of number. This appears to be a fairly common approach for schemes that target JavaScript or other mathematically depraved languages.
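
    As an analogous illustration in Go (using the standard encoding/json package rather than MsgPack), a decoder that doesn't know the schema only sees "a number" and hands it back as a float:

      package main

      import (
          "encoding/json"
          "fmt"
      )

      func main() {
          // We "sent" an int...
          data, err := json.Marshal(map[string]int{"age": 42})
          if err != nil {
              panic(err)
          }

          // ...but without a schema the decoder can only give back "a number".
          var decoded map[string]interface{}
          if err := json.Unmarshal(data, &decoded); err != nil {
              panic(err)
          }
          fmt.Printf("%T\n", decoded["age"]) // prints: float64
      }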

    Another thing to think about is migrating state from one version of your objects to another. With one large transaction, you could walk over your entire object graph and migrate everything in one go by reading in each old object, transforming it, and writing it back out again. Such a transaction may well be very expensive, and you would probably want to temporarily disconnect other clients so that it does not get restarted several times. Alternatively, you could introduce a new object root and migrate objects one by one from the old root and old versions to the new root; that way you retain a complete copy of the old version of the data should you ever need to roll back to it. Finally, some serialisation schemes (e.g. Protobuf, CapnProto, FlatBuffers) support backwards-compatible evolution of object schemas, so provided you follow the necessary rules, you may be able to migrate your objects to new versions without rewriting them at all.
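
    As a sketch of the first strategy (one large transaction), reusing the illustrative Object type from above - decodeV1 and encodeV2 are stand-ins for your real codecs, and in a real client the whole walk would run inside a single transaction:

      // migrate walks the whole graph once, rewriting every object from
      // the old encoding to the new one. The seen map guards against
      // cycles and objects reachable via more than one path.
      func migrate(obj *Object, seen map[*Object]bool) {
          if obj == nil || seen[obj] {
              return
          }
          seen[obj] = true
          obj.Value = encodeV2(decodeV1(obj.Value)) // transform old state to new
          for _, ref := range obj.References {
              migrate(ref, seen)
          }
      }

      // Stand-in codecs: replace with your real (de)serialisation.
      func decodeV1(b []byte) string { return string(b) }
      func encodeV2(s string) []byte { return append([]byte("v2:"), s...) }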

  • What object state should be encoded directly into an object []byte, and what should be a separate object reached through a pointer?

    One fairly sensible rule of thumb is that if a field of an object has a value that may be shared (in the future if not now) between multiple objects, then it should be externalised to a separate object and reached through a pointer. However, you also need to consider the effect on performance: there is a per-object CPU cost when a transaction is submitted to the server, so it is not advisable to split out every single field into a separate object - such an approach may also be more complex and expensive on the client side, with additional serialisation costs. Inevitably, it is a balancing act.
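
    For example (again reusing the illustrative Object type), a customer's name is private to the customer and can live in its own []byte, whereas an address that several customers may share is externalised and reached through a pointer:

      // newCustomer keeps unshared state inline and shared state behind
      // a pointer, so several customers can reference the same address.
      func newCustomer(name string, sharedAddress *Object) *Object {
          return &Object{
              Value:      []byte(name),             // inline, unshared state
              References: []*Object{sharedAddress}, // shared state via pointer
          }
      }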

    There are also good reasons why you might want a single object in your programming language to be represented by multiple objects in GoshawkDB, where the different GoshawkDB objects reflect the type inheritance of the programming-language object. Such an approach allows fields of super-types to be accessed generically, though you'll need to take care with issues like field shadowing. The example in Data Security also shows how this approach can be used to limit different client accounts to different views of the same object.
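
    A hypothetical sketch of that layout: a manager in the application is stored as two GoshawkDB objects, one per level of the type hierarchy, so code that only understands employees can be handed the employee object alone:

      // The manager object holds only the manager-specific state and
      // points at an employee object holding the super-type's state.
      func newManager(managerState, employeeState []byte) *Object {
          employee := &Object{Value: employeeState}
          return &Object{
              Value:      managerState,
              References: []*Object{employee}, // slot 0: the super-type's object
          }
      }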

  • How do I know which object pointers point to what?

    You need to decide this for yourself, probably on a per-object-type basis. For example, for the fields of an object that you decide to externalise to their own objects, you may choose to sort the field names and then use the index of each field name within that sorted list as the index of the relevant object pointer.

    It's also fairly common to use the array of pointers as a general-purpose list. This is fine: you would probably start the array of pointers with various fixed-offset fields (i.e. fields that will exist in every instance of this object), and then the tail of the array can be used as a general-purpose list. In the Collections Library, the Linear Hash implementation takes this approach for buckets. A bucket has only one fixed field, the pointer to the next bucket, which is therefore the first entry in the pointers array; the rest of the array holds the values of the key-value pairs in the current bucket, and so varies in length.
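
    In terms of the illustrative Object type, that bucket layout looks like this (the slot index is a convention of the library, not of GoshawkDB itself):

      const bucketNextSlot = 0 // fixed offset: pointer to the next bucket

      // nextBucket reads the single fixed-offset field.
      func nextBucket(bucket *Object) *Object {
          return bucket.References[bucketNextSlot]
      }

      // bucketValues returns the variable-length tail: the values of the
      // key-value pairs held in this bucket.
      func bucketValues(bucket *Object) []*Object {
          return bucket.References[bucketNextSlot+1:]
      }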

  • How do I create a nil-pointer?

    This question comes up when an object has a pointer to another object that may or may not exist. GoshawkDB does not have an explicit nil or null pointer. One approach that works well instead is to use the current object itself in place of the nil object. All the clients support a method on object pointers called ReferencesSameAs (or something very similar), which can be used to test whether two pointers are pointing to the same object (i.e. are aliases).
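
    Here is the idiom sketched with the illustrative Object type, where ReferencesSameAs reduces to pointer equality (real clients expose it as a method on object references); the customer slot index is a hypothetical convention:

      const customerSlot = 2 // hypothetical convention: the 3rd pointer is the customer

      // clearCustomer marks the customer as absent by pointing the slot
      // back at the object that holds it.
      func clearCustomer(obj *Object) {
          obj.References[customerSlot] = obj
      }

      // hasCustomer is the ReferencesSameAs test: the pointer is "nil"
      // exactly when it aliases the object that holds it.
      func hasCustomer(obj *Object) bool {
          return obj.References[customerSlot] != obj
      }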

  • Should encoded objects be self-describing?

    In programming languages, when you access an object through a pointer, you probably already know the type of the object at the end of that pointer (or at least an interface that is satisfied by the object). In the same way, you will probably know the types of the GoshawkDB objects that you're accessing, and this may well be enforced by the language through which you're accessing GoshawkDB.

    For example, your application may have a type that offers access to some field customer. You have decided that the customer is always reached through the 3rd pointer of the current object, so: custObj := obj.references()[2]. You thus know that custObj can only be decoded by the customer deserialiser, so you might write something like: cust := customerFromObj(custObj)
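
    A sketch of what customerFromObj might look like, assuming (purely for illustration) that customer state is JSON-encoded into the object's []byte and that encoding/json is imported:

      // Customer is an illustrative application type.
      type Customer struct {
          Name string `json:"name"`
      }

      // customerFromObj decodes the customer state held in the object's Value.
      func customerFromObj(obj *Object) (Customer, error) {
          var c Customer
          err := json.Unmarshal(obj.Value, &c)
          return c, err
      }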

    However, there may be several different types of customer, so you may need a way to detect which one you have. Depending on the serialisation scheme you've chosen, the scheme itself may provide the means to support such a case (e.g. support for unions). Similarly, if you're using the Collections Library then the values in all the maps are just plain GoshawkDB Objects, so you are of course able to use these collections as heterogeneous rather than homogeneous collections. If you do so, you will certainly need to be able to recover enough information about a retrieved object to know how to decode it.
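
    One simple way to make objects self-describing is to prefix the encoded state with a type-tag byte and switch on it when decoding. The scheme below is a hypothetical application-level convention, reusing the illustrative types from above (and assuming fmt and encoding/json are imported):

      // Type tags for the kinds of value an object may hold.
      const (
          tagCustomer byte = iota
          tagOrder
      )

      // Order is another illustrative application type.
      type Order struct {
          Total int `json:"total"`
      }

      // decodeAny inspects the tag byte and dispatches to the right decoder.
      func decodeAny(obj *Object) (interface{}, error) {
          if len(obj.Value) == 0 {
              return nil, fmt.Errorf("empty object state")
          }
          body := obj.Value[1:]
          switch obj.Value[0] {
          case tagCustomer:
              var c Customer
              return c, json.Unmarshal(body, &c)
          case tagOrder:
              var o Order
              return o, json.Unmarshal(body, &o)
          default:
              return nil, fmt.Errorf("unknown type tag %d", obj.Value[0])
          }
      }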