
Modifying GoshawkDB Clusters

GoshawkDB clusters can be reconfigured: nodes can be added or removed, client accounts can be added or removed, and the F configuration parameter can be changed. As of GoshawkDB 0.3.1, the MaxRMCount configuration parameter cannot be changed. These parameters are documented in the configuration guide. The cluster configuration can be changed on a live, running GoshawkDB cluster without needing to stop and restart nodes.

To make a change to the cluster configuration, update the configuration file, making sure you increment the Version field, and then send SIGHUP to the node which was started with that configuration file. The node will reread the configuration file, verify that it is valid, and if so communicate the change to the other nodes in the cluster and begin reconfiguration of the cluster.
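For example, the updated configuration file might look like the following sketch. Only Version, F and MaxRMCount are named on this page; the other field names, the host names, the port and the omission of the client account settings are illustrative and should be checked against the configuration guide.

    {
        "ClusterId": "myCluster",
        "Version": 2,
        "Hosts": ["hostA:7894", "hostB:7894", "hostC:7894"],
        "F": 1,
        "MaxRMCount": 5
    }

Having saved the file, signal the node that was started with it, for instance (assuming the node was started as goshawkdb -config cluster.json):

    kill -HUP $(pgrep -f 'goshawkdb -config cluster.json')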

When changing the nodes within the cluster, there are three types of nodes:

  • new nodes - nodes that were not in the old configuration but are in the new configuration;
  • removed nodes - nodes that were in the old configuration but are not in the new configuration;
  • surviving nodes - nodes that are in both the old configuration and the new configuration.
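For example, suppose the list of nodes changes as follows (the Hosts field name and the port are illustrative, following the configuration guide). In the old configuration:

    "Hosts": ["hostA:7894", "hostB:7894", "hostC:7894"]

and in the new configuration:

    "Hosts": ["hostA:7894", "hostB:7894", "hostD:7894"]

Here hostD is a new node, hostC is a removed node, and hostA and hostB are surviving nodes.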

  • If you send SIGHUP to a node then that node must have been started with the -config parameter, otherwise the node will not know which configuration file to (re)load. Alternatively, you can stop a node and restart it, supplying the updated configuration file on the command line (there is a sketch of this after this list).
  • You only need to inform one node of the new configuration and the nodes will automatically communicate the configuration change amongst themselves. If you're adding nodes, it's fine to provide the configuration only to the new nodes. They will then contact surviving nodes and the configuration change will be able to progress.
  • Configuration changes can be made when there are failed nodes. This is in fact essential to be able to remove failed nodes from the cluster or to replace them. However, configuration changes can only occur if no more than F nodes are failed (unreachable).
  • It is your responsibility to make sure that there are never multiple different configurations with the same Version number. Behaviour of GoshawkDB is undefined if you simultaneously apply different new configurations with the same Version number to different nodes of the same cluster.
  • The configuration Version field must always increase. This includes the scenario in which a failed node is being replaced without any other change to the configuration. For example, consider a cluster which includes host foo; the cluster is formed and is working; then host foo fails in some way. It then gets rebuilt and is assigned the same host name. When joining the new foo back into the cluster, the Version field must still be increased, otherwise the rest of the cluster will not detect that foo is now empty and needs repopulating (there is an illustration of this after this list).
  • When replacing multiple failed nodes within a cluster without stopping the rest of the cluster, expect to reintroduce the failed nodes one at a time. There is no indication in the configuration file of which node(s) have failed and been rebuilt, so the surviving nodes have to determine this for themselves. As soon as the surviving nodes have identified any single node which needs to resync into the cluster, that resync will start (assuming there are currently no more than F failures), so it's unlikely you'll be able to get the cluster to recognise multiple rebuilt nodes at the same time. Alternatively, stop the entire cluster, start all the replaced nodes first, and then bring back up the surviving nodes; with this approach, the surviving nodes will learn of all the replacements at the same time and so will be able to make the configuration change in one pass.
  • You can change the configuration whilst a configuration change is taking place. So if you realise a configuration change cannot complete (for example, due to too many failed nodes), you can provide an even newer cluster configuration via the normal routes (either SIGHUP or restarting a node). Provided the Version field is increased again, the cluster will abandon any change in progress and will start changing to the new configuration.
  • During configuration changes, clients will remain connected to surviving nodes but their transactions will not be able to progress until after the configuration change has succeeded.
  • If a configuration change removes a client certificate fingerprint, then once the configuration change is complete, any client that authenticated with that fingerprint will have its connection closed.
  • If a configuration change modifies which roots a client account can access, or modifies the capabilities granted on those roots for a client account, then any connection using that client account will be disconnected as part of the reconfiguration.
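The restart alternative mentioned in the list above might look like the following sketch; the path is illustrative and any other flags you normally start the node with (such as certificate or data directory options) are deliberately omitted here:

    # Stop the node (for example by sending it SIGTERM), then restart it with the
    # updated configuration file supplied on the command line:
    goshawkdb -config /path/to/cluster.json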
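To illustrate the point about rebuilding host foo: the set of hosts is unchanged, but the configuration used to reintroduce the rebuilt foo must still carry a higher Version. The Hosts field name and the port are illustrative, as above. The configuration in use before foo failed:

    "Version": 3,
    "Hosts": ["foo:7894", "bar:7894", "baz:7894"]

The configuration used to bring the rebuilt foo back into the cluster:

    "Version": 4,
    "Hosts": ["foo:7894", "bar:7894", "baz:7894"]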