CDC on Neo4j DBMS
Neo4j extracts CDC information from the transaction log. However, by default the transaction log does not contain information directly usable by CDC. For CDC to work, the transaction log needs to be enriched with further information. This enrichment is enabled through an extra configuration option on each database. As soon as CDC is enabled, the database is ready to answer CDC queries from client applications.
CDC has three working modes:
- OFF: CDC is disabled (default).
- DIFF: Changes are captured as the difference between the before and after states of each changed entity (i.e. they only contain removals, updates and additions).
- FULL: Changes are recorded as a complete copy of the before and after states of each changed entity (i.e. they contain the full node/relationship, regardless of the extent to which it was altered).
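To make the difference between DIFF and FULL concrete, the following Python sketch contrasts a diff-style record with a full-copy record for the properties of a single entity. The function names and the dictionary layout are illustrative assumptions; they do not reproduce the actual CDC event format.

```python
def diff_change(before: dict, after: dict) -> dict:
    """Diff-style record: only removals, additions and updates (hypothetical layout)."""
    removed = sorted(k for k in before if k not in after)
    added = {k: v for k, v in after.items() if k not in before}
    updated = {k: v for k, v in after.items() if k in before and before[k] != v}
    return {"removed": removed, "added": added, "updated": updated}


def full_change(before: dict, after: dict) -> dict:
    """Full-style record: complete copies of both states (hypothetical layout)."""
    return {"before": dict(before), "after": dict(after)}


before = {"name": "Ann", "age": 30, "city": "Oslo"}
after = {"name": "Ann", "age": 31, "email": "ann@example.com"}

print(diff_change(before, after))
print(full_change(before, after))
```

The diff-style record omits the unchanged name property, whereas the full-style record repeats it in both states, which is why FULL generally writes more data.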
Toggle CDC mode
Create a database with CDC enabled
To create a new database with CDC enabled, use the CREATE DATABASE Cypher command and set the option txLogEnrichment to either FULL or DIFF.
CREATE DATABASE <dbName> IF NOT EXISTS OPTIONS {txLogEnrichment: "FULL"}
Modify a database’s CDC mode
To change the CDC mode of an existing database, use the ALTER DATABASE Cypher command and set the option txLogEnrichment to either FULL or DIFF.
ALTER DATABASE <dbName> SET OPTION txLogEnrichment "DIFF"
Modifying CDC mode from |
Get a database’s CDC mode
To see the current CDC mode of a database, use the SHOW DATABASES Cypher command.
SHOW DATABASES YIELD name, options
name | options |
---|---|
Disable CDC
To disable CDC on a database, either set the option txLogEnrichment explicitly to OFF or remove it altogether.
ALTER DATABASE <dbName> SET OPTION txLogEnrichment "OFF"
ALTER DATABASE <dbName> REMOVE OPTION txLogEnrichment
Disabling CDC immediately breaks the continuity of change events. Change identifiers generated before disabling can no longer be used and, even if CDC is re-enabled, the previously-generated change identifiers remain invalid. Disabling and then re-enabling CDC is equivalent to enabling it for the first time: there is no memory of previous changes.
Key considerations
Security
CDC returns all changes in the database and is not limited to the entities which a certain user is allowed to access.
In order to prevent unauthorized access, the procedure db.cdc.query requires admin privileges and should be configured for least privilege access.
For a regular user to be able to run db.cdc.query, the user must be granted both execute privileges and boosted execute privileges.
GRANT EXECUTE PROCEDURE db.cdc.query ON DBMS TO $role;
GRANT EXECUTE BOOSTED PROCEDURE db.cdc.query ON DBMS TO $role;
Non-boosted execute privileges are usually part of the |
Furthermore, the user cannot query a database unless they have been granted access to it.
GRANT ACCESS ON DATABASE $database TO $role
Usually the |
The procedures db.cdc.current and db.cdc.earliest do not require admin privileges. In order to execute these, access to the database and regular execution privileges are sufficient.
For more details regarding procedure privileges in Neo4j, see Operations Manual → Manage procedure and user-defined function permissions.
Disk size
With CDC enabled, more data gets written into transaction log files for that database. As a result, transaction log files are rotated and pruned more frequently. The disk may run out of space if the disk size for transaction log storage is limited, so ensure that you have plenty of available space.
In particular, plan for a 50% increase in data written to the transaction log with DIFF CDC mode, and 75% for FULL CDC mode.
Actual disk usage depends on the application, data model and transaction characteristics.
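As a rough planning aid, the percentages above can be turned into a small calculation. The sketch below is only an estimate under the stated planning figures; the function name and the 40 GiB baseline are hypothetical, and real usage varies with the workload.

```python
GIB = 1024 ** 3

# Planning overheads quoted above: extra data written to the transaction log.
CDC_OVERHEAD = {"OFF": 0.0, "DIFF": 0.50, "FULL": 0.75}


def planned_tx_log_bytes(baseline_bytes: int, mode: str) -> int:
    """Return a planning estimate of transaction log volume for a CDC mode."""
    return round(baseline_bytes * (1 + CDC_OVERHEAD[mode]))


baseline = 40 * GIB  # hypothetical log volume written without CDC
for mode in ("OFF", "DIFF", "FULL"):
    print(mode, planned_tx_log_bytes(baseline, mode) / GIB, "GiB")
```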
Transaction log retention
Since CDC information is stored in transaction log entries, the time for which the logs are retained dictates how far back in time your application can query for CDC data. As a general rule of thumb, pick a period greater than or equal to the downtime tolerance of your downstream application, so that the changes don’t get pruned before the application has had time to process them.
You can control the transaction log retention period through the configuration setting db.tx_log.rotation.retention_policy. For more details on transaction log files and how to configure them, see Transaction logging.
Transaction logs are used not only by CDC, but also by differential backups and cluster operations. When setting a retention period for your CDC needs, keep in mind that it may affect other areas of your system.
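The rule of thumb about downtime tolerance can be checked with simple arithmetic. The following sketch assumes a steady write rate and a retention policy that combines a size limit with an age limit; the function names and all numbers are hypothetical, and the model ignores the granularity of individual log files.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def effective_retention_hours(size_limit_bytes: int,
                              age_limit_hours: float,
                              log_bytes_per_hour: float) -> float:
    """Approximate how far back the retained logs reach at a steady write rate."""
    if log_bytes_per_hour <= 0:
        # Nothing is written, so only the age limit applies.
        return age_limit_hours
    return min(age_limit_hours, size_limit_bytes / log_bytes_per_hour)


def covers_downtime(retention_hours: float, downtime_tolerance_hours: float) -> bool:
    """True if changes survive for at least the tolerated downtime."""
    return retention_hours >= downtime_tolerance_hours


# Hypothetical workload: 175 MiB of transaction log per hour, 2 GiB / 24 h limits.
window = effective_retention_hours(2 * GIB, 24.0, 175 * MIB)
print(round(window, 2), covers_downtime(window, 8.0), covers_downtime(window, 12.0))
```

In this hypothetical case the size limit, not the age limit, determines the window, so a downstream application that can be offline for 12 hours would need a larger size limit.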
This example shows the behavior that results from a given server configuration.
db.tx_log.rotation.retention_policy = 2G 1Day
db.tx_log.rotation.size = 256M
db.checkpoint.interval.time = 15m
db.checkpoint.interval.tx = 100000
As new transactions come in, the server writes them to a log file. When that file exceeds 256MB, it creates a new file and continues writing there (transactions are never broken across files though, so if the current log is 255MB when a new 5MB transaction comes in, the file will grow to 260MB before rotating to a new file).
Every 15 minutes or 100000 transactions (whichever happens first), the server goes through the transaction log files from newest to oldest. When the sum of the scanned files is greater than 2GB, the latest scanned file and all older files are deleted, so that the total size is again below 2GB. Files older than 1 day are also deleted.
In other words, the server keeps at most 2GB worth of transaction logs, as long as they are all more recent than 1 day.
As long as they are all more recent than 1 day, the server keeps at least 2GB - 256MB worth of transaction logs (256MB are always needed for the current log file to grow into).
In case large transactions resulted in log files larger than 256MB, the minimum retained size of logs may be smaller than 2GB - 256MB.
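The pruning pass described in this example can be sketched as a small simulation. This is a simplified model of the description above, not the server's implementation: it assumes binary units (1G = 1024M), that files are listed from newest to oldest, and that the file currently being written is never deleted. The names LogFile and prune are hypothetical.

```python
from dataclasses import dataclass

MIB = 1024 ** 2
GIB = 1024 ** 3


@dataclass
class LogFile:
    size_bytes: int
    age_hours: float


def prune(files_newest_first, max_total_bytes=2 * GIB, max_age_hours=24.0):
    """Return the log files kept after one pruning pass."""
    kept = []
    total = 0
    for f in files_newest_first:
        too_old = f.age_hours > max_age_hours
        too_big = total + f.size_bytes > max_total_bytes
        # Assumption: the newest (current) file always survives.
        if kept and (too_old or too_big):
            break  # this file and every older one are deleted
        kept.append(f)
        total += f.size_bytes
    return kept


# Ten full 256 MiB files, all younger than a day: eight fit within 2 GiB.
full = [LogFile(256 * MIB, age) for age in range(10)]
print(len(prune(full)))

# Right after a rotation the current file is small, so one fewer full file fits.
fresh = [LogFile(1 * MIB, 0)] + [LogFile(256 * MIB, age) for age in range(1, 10)]
print(sum(f.size_bytes for f in prune(fresh)) // MIB)
```

The second case shows why the retained volume can drop to roughly 2GB - 256MB: the space that the current file has yet to grow into counts against the limit as soon as the next full file would exceed it.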
Unrecorded changes
CDC can only capture data changes that pass through the transaction layer. Any data that is written without passing through this layer cannot be captured.
For example, when importing data with the neo4j-admin database import tool, whether full or incremental, or when loading data with the neo4j-admin database load tool, data is written directly to the store without sending anything to the transaction log, so such changes are not captured by CDC.