CDC on Neo4j DBMS

Neo4j extracts CDC information from the transaction log. By default, however, the transaction log does not contain information that CDC can use directly. For CDC to work, the transaction log needs to be enriched with additional information. Enrichment is enabled per database through a configuration option. As soon as CDC is enabled, the database is ready to answer CDC queries from client applications.

CDC has three working modes:

  • OFF — CDC is disabled (default).

  • DIFF — Changes are captured as the difference between before and after states of each changed entity (i.e. they only contain removals, updates and additions).

  • FULL — Changes are recorded as a complete copy of the before and after states of each changed entity (i.e. they contain the full node/relationship, regardless of the extent to which it was altered).
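
To make the difference between the DIFF and FULL modes concrete, the following Python sketch compares what each mode would keep for a single property update on one entity. It is only an illustration of the idea: the dictionary layout and the function names full_change and diff_change are hypothetical and do not reproduce the event format that Neo4j actually returns.

```python
def full_change(before, after):
    """FULL-style record: complete copies of both states (hypothetical layout)."""
    return {"before": dict(before), "after": dict(after)}


def diff_change(before, after):
    """DIFF-style record: only additions, updates and removals (hypothetical layout)."""
    added = {k: v for k, v in after.items() if k not in before}
    removed = {k: v for k, v in before.items() if k not in after}
    updated = {
        k: {"old": before[k], "new": after[k]}
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "updated": updated, "removed": removed}


# Made-up property maps of one node before and after a transaction.
before = {"name": "Alice", "city": "Malmo", "age": 41}
after = {"name": "Alice", "city": "London", "email": "alice@example.com"}

print(full_change(before, after))  # both property maps in full
print(diff_change(before, after))  # the unchanged "name" property is absent
```

The unchanged property appears in the FULL-style record but not in the DIFF-style one, which is why FULL mode writes more data per change.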

Toggle CDC mode

Create a database with CDC enabled

To create a new database with CDC enabled, use the CREATE DATABASE Cypher command and set the option txLogEnrichment to either FULL or DIFF.

Query
CREATE DATABASE <dbName> IF NOT EXISTS OPTIONS {txLogEnrichment: "FULL"}

Modify a database’s CDC mode

To change the CDC mode on an existing database, use the ALTER DATABASE Cypher command and set the option txLogEnrichment to either FULL or DIFF.

Query
ALTER DATABASE <dbName> SET OPTION txLogEnrichment "DIFF"

Changing the CDC mode from DIFF to FULL or vice versa immediately changes the structure of captured changes. Your CDC application must be able to deal with the change of format.

Get a database’s CDC mode

To see the current CDC mode of a database, use the SHOW DATABASES Cypher command.

Query
SHOW DATABASES YIELD name, options
Table 1. Result
name      options
"neo4j"   {"txLogEnrichment": "DIFF"}
"system"  {}

Disable CDC

To disable CDC on a database, either set the option txLogEnrichment explicitly to OFF or remove it altogether.

Using the SET OPTION clause
ALTER DATABASE <dbName> SET OPTION txLogEnrichment "OFF"
Using the REMOVE OPTION clause
ALTER DATABASE <dbName> REMOVE OPTION txLogEnrichment

Disabling CDC immediately breaks the continuity of change events. Change identifiers generated before disabling can no longer be used and, even if CDC is re-enabled, the previously-generated change identifiers remain invalid. Disabling and then re-enabling CDC is equivalent to enabling it for the first time: there is no memory of previous changes.

Key considerations

Security

CDC returns all changes in the database and is not limited to the entities which a certain user is allowed to access. In order to prevent unauthorized access, the procedure db.cdc.query requires admin privileges and should be configured for least privilege access.

For a regular user to be able to run db.cdc.query, the user must have been granted execute privileges as well as boosted execute privileges.

GRANT EXECUTE PROCEDURE db.cdc.query ON DBMS TO $role;
GRANT EXECUTE BOOSTED PROCEDURE db.cdc.query ON DBMS TO $role;

Non-boosted execute privileges are usually part of the PUBLIC role, in which case they do not need to be granted a second time.

Furthermore, the user cannot run the procedure against a database unless they have been granted access to that database.

GRANT ACCESS ON DATABASE $database TO $role

Usually the PUBLIC role already has access to the default database.

The procedures db.cdc.current and db.cdc.earliest do not require admin privileges. In order to execute these, access to the database and regular execution privileges are sufficient.

For more details regarding procedure privileges in Neo4j, see Operations Manual → Manage procedure and user-defined function permissions.

Disk size

With CDC enabled, more data gets written into transaction log files for that database. As a result, transaction log files are rotated and pruned more frequently. The disk may run out of space if the disk size for transaction log storage is limited, so ensure that you have plenty of available space.

In particular, plan for a 50% increase in data written to the transaction log with DIFF CDC mode, and 75% for FULL CDC mode. Actual disk usage depends on the application, data model and transaction characteristics.
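
The planning figures above translate into simple arithmetic. The following Python sketch applies them to a measured baseline. The function name estimate_tx_log_volume and the baseline of 40 GB per day are hypothetical, and the 50% and 75% factors are the planning values quoted above, not measurements of any particular workload.

```python
# Planning overheads quoted above, as fractions of the baseline log volume.
CDC_OVERHEAD = {"OFF": 0.0, "DIFF": 0.50, "FULL": 0.75}


def estimate_tx_log_volume(baseline_gb, mode):
    """Return the planned transaction log volume (in GB) for a CDC mode."""
    if mode not in CDC_OVERHEAD:
        raise ValueError(f"unknown CDC mode: {mode!r}")
    return baseline_gb * (1 + CDC_OVERHEAD[mode])


baseline_gb_per_day = 40  # hypothetical figure, measured with CDC disabled
for mode in ("OFF", "DIFF", "FULL"):
    print(mode, estimate_tx_log_volume(baseline_gb_per_day, mode))
# prints: OFF 40.0, DIFF 60.0, FULL 70.0 (one pair per line)
```

Treat the result as a starting point for sizing and verify it against the actual log growth of your own workload.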

Transaction log retention

Since CDC information is stored in transaction log entries, the time for which the logs are retained dictates how far back in time your application can query for CDC data. As a general rule of thumb, pick a period greater than or equal to the downtime tolerance of your downstream application, so that the changes don’t get pruned before the application has had time to process them.

You can control the transaction log retention period through the configuration setting db.tx_log.rotation.retention_policy. For more details on transaction log files and how to configure them, see Transaction logging.

Transaction logs are used not only by CDC, but also by differential backups and cluster operations. When setting a retention period for your CDC needs, keep in mind that it may affect other areas of your system.
Example 1. Transaction log configuration and CDC behavior

This example shows the behavior that results from a given server configuration.

db.tx_log.rotation.retention_policy = 2G 1Day
db.tx_log.rotation.size = 256M
db.checkpoint.interval.time = 15m
db.checkpoint.interval.tx = 100000

As new transactions come in, the server writes them to a log file. When that file exceeds 256MB, it creates a new file and continues writing there (transactions are never broken across files though, so if the current log is 255MB when a new 5MB transaction comes in, the file will grow to 260MB before rotating to a new file).
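
The rotation rule described above can be modelled in a few lines of Python. This is a simplified sketch of the description, not the server's implementation: the function name write_transactions is hypothetical, and sizes are plain numbers of megabytes.

```python
ROTATION_SIZE_MB = 256  # db.tx_log.rotation.size from the example


def write_transactions(tx_sizes_mb, rotation_size_mb=ROTATION_SIZE_MB):
    """Return the final size of each log file after writing the transactions.

    A transaction is always appended whole to the current file. A new file
    is started only once the current file exceeds the rotation size.
    """
    files = [0]
    for size in tx_sizes_mb:
        if files[-1] > rotation_size_mb:
            files.append(0)  # rotate before writing the next transaction
        files[-1] += size
    return files


# The case from the text: a 255MB file receives a 5MB transaction,
# grows to 260MB, and only the following transaction goes to a new file.
print(write_transactions([255, 5, 10]))  # [260, 10]
```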

Every 15 minutes or 100000 transactions (whichever happens first), the server goes through the transaction log files from newest to oldest. When the sum of the scanned files exceeds 2GB, the most recently scanned file and all older files are deleted, so that the total size is again below 2GB. Files older than 1 day are also deleted.

In other words, the server keeps at most 2GB worth of transaction logs, none of them older than 1 day. Provided that enough log data younger than 1 day exists, the server keeps at least 2GB - 256MB worth of transaction logs (256MB are always needed for the current log file to grow into). If large transactions have produced log files larger than 256MB, the minimum retained size may be smaller than 2GB - 256MB.
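
The pruning pass described in this example can be sketched as follows. This is a simplified Python model of the description, not the server's implementation. The function name prune and the file lists are hypothetical, and it assumes that 2G means 2048MB and that 1 day means 24 hours.

```python
def prune(files, max_total_mb=2048, max_age_hours=24):
    """Return the log files kept after one pruning pass.

    `files` is ordered from newest to oldest; each entry is a
    (size_mb, age_hours) tuple.
    """
    kept, total = [], 0
    for size_mb, age_hours in files:
        total += size_mb
        if total > max_total_mb or age_hours > max_age_hours:
            break  # this file and every older one is deleted
        kept.append((size_mb, age_hours))
    return kept


# Ten 256MB files written two hours apart: the size limit applies first.
by_size = prune([(256, hours) for hours in range(0, 20, 2)])
print(len(by_size), sum(size for size, _ in by_size))  # 8 2048

# Five 256MB files, two of them older than a day: the age limit applies first.
by_age = prune([(256, 0), (256, 6), (256, 12), (256, 30), (256, 36)])
print(len(by_age))  # 3
```

A CDC application that falls further behind than the oldest retained file can no longer read the changes it missed, which is why the retention policy should cover the application's downtime tolerance.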

Unrecorded changes

CDC can only capture data changes that pass through the transaction layer, so any data written without going through this layer cannot be captured. For example, when importing data with the neo4j-admin database import tool, whether full or incremental, or when loading data with the neo4j-admin database load tool, data is written directly to the store without sending anything to the transaction log, so such changes are not captured by CDC.