Disaster recovery

A database can become unavailable due to issues on different system levels. For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable. It is also possible for databases to become quarantined due to a critical failure in the system, which may lead to unavailability even without the loss of servers.

This section contains a step-by-step guide on how to recover unavailable databases that are incapable of serving writes, while still may be able to serve reads. However, if a database is not performing as expected for other reasons, this section cannot help. By following the steps outlined here, you can recover the unavailable databases and make them fully operational with minimal impact on the other databases in the cluster.

If all servers in a Neo4j cluster are lost in a data center failover, it is not possible to recover the current cluster. You have to create a new cluster and restore the databases. See Deploy a basic cluster and Seed a database for more information.

Faults in clusters

Databases in clusters follow an allocation strategy. This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries. The consequence of this is that all servers are different in which databases they are hosting. Losing a server in a cluster may cause some databases to lose a member while others are unaffected. Consequently, in a disaster where multiple servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.

Guide to disaster recovery

There are three main steps to recovering a cluster from a disaster. Completing each step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.

  1. Ensure the system database is available in the cluster. The system database defines the configuration for the other databases; therefore, it is vital to ensure it is available before doing anything else.

  2. After the system database’s availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster’s topology meets the requirements.

  3. After the system database is available and the cluster’s topology is satisfied, you can manage the databases.

The steps are described in detail in the following sections.

In this section, an offline server is a server that is not running but may be restartable. A lost server, however, is a server that is currently not running and cannot be restarted.

Disasters may sometimes affect the routing capabilities of the driver and may prevent the use of the neo4j scheme for routing. One way to remedy this is to connect directly to the server using bolt instead of neo4j. See Server-side routing for more information on the bolt scheme.

Restore the system database

The first step of recovery is to ensure that the system database is available. The system database is required for clusters to function properly.

  1. Start all servers that are offline. (If a server is unable to start, inspect the logs and contact support personnel. The server may have to be considered indefinitely lost.)

  2. Validate the system database’s availability.

    1. Run SHOW DATABASE system. If the response does not contain a writer, the system database is unavailable and needs to be recovered, continue to step 3.

    2. Optionally, you can create a temporary user to validate the system database’s writability by running CREATE USER 'temporaryUser' SET PASSWORD 'temporaryPassword'.

    3. Confirm that the temporary user is created as expected, by running SHOW USERS, then continue to Recover servers. If not, continue to step 3.

  3. Restore the system database.

    Only do the steps below if the system database’s availability cannot be validated by the first two steps in this section.

    Recall that the cluster remains fault tolerant, and thus able to serve both reads and writes, as long as a majority of the primaries are available. In case of a disaster affecting one or more server(s), but where the majority of servers are still available, it is possible to add a new server to the cluster and recover the system database (and any other affected user databases) on it by copying the system database (and affected user databases) from one of the available servers. This method prevents downtime for the other databases in the cluster. If this is the case, ie. if a majority of servers are still available, follow the instructions in Recover servers.

    The following steps create a new system database from a backup of the current system database. This is required since the current system database has lost too many members in the server failover.

    1. Shut down the Neo4j process on all servers. Note that this causes downtime for all databases in the cluster.

    2. On each server, run the following neo4j-admin command bin/neo4j-admin dbms unbind-system-db to reset the system database state on the servers. See neo4j-admin commands for more information.

    3. On each server, run the following neo4j-admin command bin/neo4j-admin database info system to find out which server is most up-to-date, ie. has the highest last-committed transaction id.

    4. On the most up-to-date server, take a dump of the current system database by running bin/neo4j-admin database dump system --to-path=[path-to-dump] and store the dump in an accessible location. See neo4j-admin commands for more information.

    5. Ensure there are enough system database primaries to create the new system database with fault tolerance. Either:

      1. Add completely new servers (see Add a server to the cluster) or

      2. Change the system database mode (server.cluster.system_database_mode) on the current system database’s secondary servers to allow them to be primaries for the new system database.

    6. On each server, run bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true to load the current system database dump.

    7. Ensure that dbms.cluster.discovery.endpoints are set correctly on all servers, see Cluster server discovery for more information.

    8. Return to step 1.

Recover servers

Once the system database is available, the cluster can be managed. Following the loss of one or more servers, the cluster’s view of servers must be updated, ie. the lost servers must be replaced by new servers. The steps here identify the lost servers and safely detach them from the cluster.

  1. Run SHOW SERVERS. If all servers show health AVAILABLE and status ENABLED continue to Recover databases.

  2. For each UNAVAILABLE server, run CALL dbms.cluster.cordonServer("unavailable-server-id") on one of the available servers.

  3. For each CORDONED server, run DEALLOCATE DATABASES FROM SERVER cordoned-server-id on one of the available servers.

  4. For each server that failed to deallocate with one of the following messages:

    1. Could not deallocate server(s) 'serverId'. Unable to reallocate 'DatabaseId.*'.
      Required topology for 'DatabaseId.*' is 3 primaries and 0 secondaries.
      Consider running SHOW SERVERS to determine what action is suitable to resolve this issue.

      or

      Could not deallocate server(s) `serverId. Database [database] has lost quorum of servers, only found [existing number of primaries] of [expected number of primaries]. Cannot be safely reallocated.`

      First ensure that there is a backup for the database in question (see Online backup), and then drop the database by running DROP DATABASE database-name. Return to step 3.

    2. Could not deallocate server [server]. Cannot change allocations for database [stopped-db] because it is offline.

      Try to start the offline database by running START DATABASE stopped-db WAIT. If it starts successfully, return to step 3. Otherwise, ensure that there is a backup for the database before dropping it with DROP DATABASE stopped-db. Return to step 3.

      A database can be set to READ-ONLY-mode before it is started to avoid updates on a database that is desired to be stopped with the following: ALTER DATABASE database-name SET ACCESS READ ONLY.

    3. Could not deallocate server [server]. Reallocation of [database] not possible, no new target found. All existing servers: [existing-servers]. Actual allocated server with mode [mode] is [current-hostings].

      Add new servers and enable them and then return to step 3, see Add a server to the cluster for more information.

  5. Run SHOW SERVERS YIELD * once all enabled servers host the requested databases (hosting-field contains exactly the databases in the requestedHosting field), and proceed to the next step. Note that this may take a few minutes.

  6. For each deallocated server, run DROP SERVER deallocated-server-id.

  7. Return to step 1.

Recover databases

Once the system database is verified available, and all servers are online, the databases can be managed. The steps here aim to make the unavailable databases available.

  1. If you have previously dropped databases as part of this guide, re-create each one from a backup. See the Create databases section for more information on how to create a database.

  2. Run SHOW DATABASES. If all databases are in desired states on all servers (requestedStatus=currentStatus), disaster recovery is complete.