Import
neo4j-admin database import
writes CSV data into Neo4j’s native file format as fast as possible.
Starting with version 5.26, Neo4j also provides support for the Parquet file format.
You should use this tool when:
-
Import performance is important because you have a large amount of data (millions/billions of entities).
-
The database can be taken offline and you have direct access to one of the servers hosting your Neo4j DBMS.
-
The database is either empty or its content is unchanged since a previous incremental import.
-
The CSV data is clean/fault-free (nodes are not duplicated and relationships' start and end nodes exist). This tool can handle data faults but performance is not optimized. If your data has a lot of faults, it is recommended to clean it using a dedicated tool before import.
Other methods of importing data into Neo4j might be better suited to non-admin users:
-
Cypher® - CSV data can be bulk loaded via the Cypher command LOAD CSV. See Cypher Manual → LOAD CSV.
-
Graphical Tools - Neo4j AuraDB → Importing data.
Change Data Capture does not capture any data changes resulting from the use of |
Overview
The neo4j-admin database import command has two modes, both used for initial data import:
-
full — used to import data into a non-existent empty database.
-
incremental — used when import cannot be completed in a single full import, by allowing the import to be a series of smaller imports.
The user running |
This section describes the neo4j-admin database import option.
For information on |
These are some things you need to keep in mind when creating your input files:
-
Fields are comma-separated by default but a different delimiter can be specified.
-
All files must use the same delimiter.
-
Multiple data sources can be used for both nodes and relationships.
-
A data source can optionally be provided using multiple files.
-
A separate header file that provides information on the data fields must be the first file specified for each data source.
-
Fields without corresponding information in the header are not read.
-
UTF-8 encoding is used.
-
By default, the importer trims extra whitespace at the beginning and end of strings. Quote your data to preserve leading and trailing whitespace.
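The points above can be illustrated with a small Python sketch that writes a header file and a data file for one data source. The `write_source` helper, the `<name>_header.csv` naming scheme, and the `|` delimiter are assumptions made for illustration; they are not part of the import tool. A non-default delimiter like this one would also have to be passed to the importer with the --delimiter option.

```python
import csv
from pathlib import Path


def write_source(directory, name, header, rows, delimiter=","):
    """Write <name>_header.csv and <name>.csv with the same delimiter.

    Hypothetical helper: the naming scheme is an assumption, not an importer rule.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    header_path = directory / f"{name}_header.csv"
    data_path = directory / f"{name}.csv"

    # The header lives in its own file, which is listed first for the data source.
    with header_path.open("w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter=delimiter).writerow(header)

    # Quote all non-numeric fields so leading/trailing whitespace is preserved.
    with data_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter, quoting=csv.QUOTE_NONNUMERIC)
        writer.writerows(rows)

    return header_path, data_path
```

Both files are written as UTF-8, and numeric values must be passed as Python numbers so that they stay unquoted.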
Indexes and constraints
Indexes and constraints are not created during the import. Instead, you have to add these afterward (see Cypher Manual → Indexes). Starting from Neo4j 5.24, you can use the |
Full import
Syntax
The syntax for importing a set of CSV files is:
neo4j-admin database import full [-h] [--expand-commands] [--verbose] [--auto-skip-subsequent-headers[=true|false]]
[--ignore-empty-strings[=true|false]] [--ignore-extra-columns[=true|false]]
[--legacy-style-quoting[=true|false]] [--normalize-types[=true|false]]
[--overwrite-destination[=true|false]] [--skip-bad-entries-logging[=true|false]]
[--skip-bad-relationships[=true|false]] [--skip-duplicate-nodes[=true|false]] [--strict
[=true|false]] [--trim-strings[=true|false]] [--additional-config=<file>]
[--array-delimiter=<char>] [--bad-tolerance=<num>] [--delimiter=<char>]
[--format=<format>] [--high-parallel-io=on|off|auto] [--id-type=string|integer|actual]
[--input-encoding=<character-set>] [--input-type=csv|parquet]
[--max-off-heap-memory=<size>] [--quote=<char>] [--read-buffer-size=<size>]
[--report-file=<path>] [--schema=<path>] [--threads=<num>] --nodes=[<label>[:
<label>]...=]<files>... [--nodes=[<label>[:<label>]...=]<files>...]...
[--relationships=[<type>=]<files>...]... [--multiline-fields=true|false|<path>[,
<path>] [--multiline-fields-format=v1|v2]] <database>
Parameters
Parameter | Description | Default |
---|---|---|
|
Name of the database to import.
If the database into which you import does not exist prior to importing, you must create it subsequently using |
|
Some of the options below are marked as Advanced. These options should not be used for experimentation. For more information, please contact Neo4j Professional Services. |
Options
Starting from Neo4j 5.26, the importer also supports the Parquet file format.
An additional parameter --input-type=csv|parquet
has been introduced to explicitly specify whether to use CSV or Parquet for the importer.
If not defined, the default value will be CSV.
The examples for CSV can also be used with Parquet.
Option | Description | Default |
---|---|---|
|
Configuration file with additional configuration. |
|
|
Delimiter character between array elements within a value in CSV data. Also accepts
For horizontal tabulation (HT), use Unicode character ID can be used if prepended by |
|
|
Automatically skip accidental header lines in subsequent files in file groups with more than one file. |
|
|
Number of bad entries before the import is aborted. The import process is optimized for error-free data. Therefore, cleaning the data before importing it is highly recommended. If you encounter any bad entries during the import process, you can set the number of bad entries to a specific value that suits your needs. However, setting a high value may affect the performance of the tool. |
|
|
Delimiter character between values in CSV data. Also accepts
For horizontal tabulation (HT), use Unicode character ID can be used if prepended by |
|
|
Allow command expansion in config value evaluation. |
|
|
Name of database format.
The imported database will be created in the specified format or use the format set in the configuration.
Valid formats are |
|
|
Show this help message and exit. |
|
|
Ignore environment-based heuristics and indicate if the target storage subsystem can support parallel IO with high throughput or auto detect.
Typically this is |
|
|
Each node must provide a unique ID. This is used to find the correct nodes when creating relationships. Possible values are:
|
|
|
Whether or not empty string fields, i.e. "" from input source are ignored, i.e. treated as null. |
|
|
Whether or not unspecified columns should be ignored during the import. |
|
|
Character set that input data is encoded in. |
|
|
Introduced in 5.26 File type to import from. Can be csv or parquet. Defaults to csv. |
|
|
Whether or not a backslash-escaped quote e.g. \" is interpreted as an inner quote. |
|
|
Maximum memory that Values can be plain numbers, such as |
|
|
Changed in 5.26 In v1, whether or not fields from an input source can span multiple lines, i.e. contain newline characters. Setting |
|
|
Introduced in 5.26 Controls the parsing of input source that can span multiple lines, i.e. contain newline characters. When set to v1, the value for |
|
|
Node CSV header and data.
It is possible to import files from AWS S3 buckets, Google Cloud storage buckets, and Azure buckets using the appropriate URI as the path. For an example, see Import data from CSV files using regular expression. |
|
|
When |
|
|
Delete any existing database files prior to the import. |
|
|
Character to treat as quotation character for values in CSV data. Quotes can be escaped as per RFC 4180 by doubling them, for example You cannot escape using |
|
|
Size of each buffer for reading input data. It has to be at least large enough to hold the biggest single value in the input data.
The value can be a plain number or a byte units string, e.g. |
|
|
Relationship CSV header and data.
It is possible to import files from AWS S3 buckets, Google Cloud storage buckets, and Azure buckets using the appropriate URI as the path. For an example, see Import data from CSV files using regular expression. |
|
|
File in which to store the report of the csv-import. The location of the import log file can be controlled using the If you are running on a UNIX-like system and you are not interested in the output, you can get rid of it altogether by directing the report file to If you need to debug the import, it might be useful to collect the stack trace.
This is done by using the |
|
|
Introduced in 5.24 Enterprise edition Path to the file containing the Cypher commands for creating indexes and constraints during data import. |
|
|
When set to |
|
|
Whether or not to skip importing relationships that refer to missing node IDs, i.e. either start or end node ID/group referring to a node that was not specified by the node input data. Skipped relationships will be logged, containing at most the number of entities specified by |
|
|
Whether or not to skip importing nodes that have the same ID/group. In the event of multiple nodes within the same group having the same ID, the first encountered will be imported, whereas consecutive such nodes will be skipped. Skipped nodes will be logged, containing at most the number of entities specified by |
|
|
Introduced in 5.6 Whether or not the lookup of nodes referred to from relationships is checked strictly. If disabled, most but not all relationships referring to non-existent nodes are detected. If enabled, all such relationships are found, at the cost of lower performance. |
|
|
(advanced) Maximum number of worker threads used by the importer. Defaults to the number of available processors reported by the JVM. The importer always needs a certain minimum number of threads, so no lower bound is enforced for this value. For optimal performance, this value should not be greater than the number of available processors. |
|
|
Whether or not strings should be trimmed for whitespaces. |
|
|
Enable verbose output. |
|
1. See Tools → Configuration for details. 2. Ignored by Parquet import. |
Heap size for the import
Set the maximum heap size to a value appropriate for the size of the import.
This is done by defining the If doing imports in the order of magnitude of 100 billion entities, 20G will be an appropriate value. |
Record format
If your import data results in a graph that is larger than 34 billion nodes, 34 billion relationships, or 68 billion properties, you will need to configure the importer to use the
The |
Providing arguments in a file
All options can be provided in a file and passed to the command using the
For more information, see Picocli → AtFiles official documentation. |
Using both a multi-value option and a positional parameter
When using both a multi-value option, such as This is a limitation of the underlying library, Picocli, and is not specific to Neo4j Admin. For more information, see Picocli → Variable Arity Options and Positional Parameters official documentation. To resolve the problem, use one of the following solutions:
|
Importing from a cloud storage
The |
Examples
If importing to a database that has not explicitly been created before the import, it must be created subsequently in order to be used. |
Import data from CSV files
Assume that you have formatted your data as per CSV header format so that you have it in six different files:
-
movies_header.csv
-
movies.csv
-
actors_header.csv
-
actors.csv
-
roles_header.csv
-
roles.csv
The following command imports the three datasets:
bin/neo4j-admin database import full --nodes import/movies_header.csv,import/movies.csv \
--nodes import/actors_header.csv,import/actors.csv \
--relationships import/roles_header.csv,import/roles.csv
Provide indexes and constraints during import
Starting from Neo4j 5.24, you can use the --schema option to provide Cypher commands that create indexes and constraints during the initial import.
It currently works only for the block format and full import.
The Cypher script must contain only CREATE INDEX|CONSTRAINT commands, which are parsed and executed during the import.
The file uses ';' as the statement separator.
For example:
CREATE INDEX PersonNameIndex FOR (i:Person) ON (i.name);
CREATE CONSTRAINT PersonAgeConstraint FOR (c:Person) REQUIRE c.age IS :: INTEGER
List of supported indexes and constraints that can be created by the import tool:
-
RANGE
-
LOOKUP
-
POINT
-
TEXT
-
FULL-TEXT
-
VECTOR
For example:
bin/neo4j-admin database import full neo4j --nodes=import/movies.csv --nodes=import/actors.csv --relationships=import/roles.csv --schema=import/schema.cypher
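Before passing a script to --schema, it can be useful to confirm that it contains nothing but CREATE INDEX and CREATE CONSTRAINT statements. The following Python sketch is one possible pre-flight check. The `check_schema_script` function is hypothetical, and the split on ';' is deliberately naive, so this is not how the importer itself parses the file.

```python
import re

# Prefixes accepted by this sketch. The optional keyword before INDEX follows
# the index types listed above, written the way Cypher spells them.
_ALLOWED = re.compile(
    r"^CREATE\s+(?:(?:RANGE|LOOKUP|POINT|TEXT|FULLTEXT|VECTOR)\s+)?INDEX\b"
    r"|^CREATE\s+CONSTRAINT\b",
    re.IGNORECASE,
)


def check_schema_script(text):
    """Return the statements that are not CREATE INDEX / CREATE CONSTRAINT.

    Naive: splits on every ';', so a ';' inside a string literal would be
    misread. Intended as a quick sanity check only.
    """
    statements = [s.strip() for s in text.split(";") if s.strip()]
    return [s for s in statements if not _ALLOWED.match(s)]
```

An empty result means every statement in the script has an accepted prefix. It does not mean that the statements are valid Cypher.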
Import data from CSV files using regular expression
Assume that you want to include a header and then multiple files that match a pattern, e.g. file names containing numbers. In this case, a regular expression can be used. Groups of digits are guaranteed to be sorted in numerical order, as opposed to lexicographic order.
For example:
bin/neo4j-admin database import full --nodes import/node_header.csv,import/node_data_\d+\.csv
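To see why numerical ordering of digit groups matters, the following Python sketch compares it with plain lexicographic sorting. It only illustrates the ordering described above and does not reproduce the importer's implementation. The function names and file names are made up for the example.

```python
import re


def numeric_aware_key(name):
    """Split a name into text and integer parts so digit groups compare as numbers."""
    parts = re.split(r"(\d+)", name)
    return [int(p) if p.isdigit() else p for p in parts]


def matching_files(names, pattern):
    """Return names that fully match the pattern, with digit groups in numerical order."""
    regex = re.compile(pattern)
    return sorted((n for n in names if regex.fullmatch(n)), key=numeric_aware_key)
```

With the names node_data_1.csv, node_data_2.csv, and node_data_10.csv, a lexicographic sort places node_data_10.csv before node_data_2.csv, whereas the numeric-aware key keeps 1, 2, 10.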
Import data from CSV files using a more complex regular expression
For regular expression patterns containing commas, which is also the delimiter between files in a group, the pattern can be quoted to preserve the pattern.
For example:
bin/neo4j-admin database import full --nodes import/node_header.csv,'import/node_data_\d{1,5}.csv' databasename
Importing files from a cloud storage
The following examples show how to import data stored in a cloud storage bucket using the --nodes and --relationships options.
Neo4j uses the AWS SDK v2 to call the APIs on AWS using AWS URLs.
Alternatively, you can override the endpoints so that the AWS SDK can communicate with alternative storage systems, such as Ceph, Minio, or LocalStack, using the system variables |
-
Install the AWS CLI by following the instructions in the AWS official documentation — Install the AWS CLI version 2.
-
Create an S3 bucket and a directory to store the backup files using the AWS CLI:
aws s3 mb --region=us-east-1 s3://myBucket
aws s3api put-object --bucket myBucket --key myDirectory/
For more information on how to create a bucket and use the AWS CLI, see the AWS official documentation — Use Amazon S3 with the AWS CLI and Use high-level (s3) commands with the AWS CLI.
-
Verify that the ~/.aws/config file is correct by running the following command:
cat ~/.aws/config
The output should look like this:
[default]
region=us-east-1
-
Configure the access to your AWS S3 bucket by setting the aws_access_key_id and aws_secret_access_key in the ~/.aws/credentials file and, if needed, using a bucket policy. For example:
-
Use the aws configure set aws_access_key_id aws_secret_access_key command to set your IAM credentials from AWS and verify that the ~/.aws/credentials file is correct:
cat ~/.aws/credentials
The output should look like this:
[default]
aws_access_key_id=this.is.secret
aws_secret_access_key=this.is.super.secret
-
Additionally, you can use a resource-based policy to grant access permissions to your S3 bucket and the objects in it. Create a policy document with the following content and attach it to the bucket. Note that both resource entries are important to be able to download and upload files.
{
  "Version": "2012-10-17",
  "Id": "Neo4jBackupAggregatePolicy",
  "Statement": [
    {
      "Sid": "Neo4jBackupAggregateStatement",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::myBucket/*",
        "arn:aws:s3:::myBucket"
      ]
    }
  ]
}
-
-
Run the neo4j-admin database import command to import your data from your AWS S3 storage bucket. The example assumes that you have data stored in the myBucket/data folder in your bucket.
bin/neo4j-admin database import full --nodes s3://myBucket/data/nodes.csv --relationships s3://myBucket/data/relationships.csv newdb
Google Cloud Storage
-
Ensure you have a Google account and a project created in the Google Cloud Platform (GCP).
-
Install the gcloud CLI by following the instructions in the Google official documentation: Install the gcloud CLI.
-
Create a service account and a service account key using Google official documentation — Create service accounts and Creating and managing service account keys.
-
Download the JSON key file for the service account.
-
Set the GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT environment variables to the path of the JSON key file and the project ID, respectively:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
export GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_ID
-
Authenticate the gcloud CLI with the e-mail address of the service account you have created, the path to the JSON key file, and the project ID:
gcloud auth activate-service-account service-account@example.com --key-file=$GOOGLE_APPLICATION_CREDENTIALS --project=$GOOGLE_CLOUD_PROJECT
For more information, see the Google official documentation — gcloud auth activate-service-account.
-
Create a bucket in the Google Cloud Storage using Google official documentation — Create buckets.
-
Verify that the bucket is created by running the following command:
gcloud storage ls
The output should list the created bucket.
-
-
Run the neo4j-admin database import command to import your data from your Google storage bucket. The example assumes that you have data stored in the myBucket/data folder in your bucket.
bin/neo4j-admin database import full --nodes gs://myBucket/data/nodes.csv --relationships gs://myBucket/data/relationships.csv newdb
Azure Blob Storage
-
Ensure you have an Azure account, an Azure storage account, and a blob container.
-
You can create a storage account using the Azure portal.
For more information, see the Azure official documentation on Create a storage account.
-
Create a blob container in the Azure portal.
For more information, see the Azure official documentation on Quickstart: Upload, download, and list blobs with the Azure portal.
-
-
Install the Azure CLI by following the instructions in the Azure official documentation.
-
Authenticate the neo4j or neo4j-admin process against Azure using the default Azure credentials.
See the Azure official documentation on default Azure credentials for more information.
az login
Then you should be ready to use Azure URLs in either neo4j or neo4j-admin.
-
To validate that you have access to the container with your login credentials, run the following commands:
# Upload a file:
az storage blob upload --file someLocalFile --account-name accountName --container someContainer --name remoteFileName --auth-mode login
# Download the file:
az storage blob download --account-name accountName --container someContainer --name remoteFileName --file downloadedFile --auth-mode login
# List container files:
az storage blob list --account-name accountName --container someContainer --auth-mode login
-
Run the neo4j-admin database import command to import your data from your Azure blob storage container. The example assumes that you have data stored in the myStorageAccount/myContainer/data folder in your container.
bin/neo4j-admin database import full --nodes azb://myStorageAccount/myContainer/data/nodes.csv --relationships azb://myStorageAccount/myContainer/data/relationships.csv newdb
Incremental import
Incremental import supports |
Incremental import allows you to incorporate large amounts of data into the graph in batches. You can run this operation as part of the initial data load when it cannot be completed in a single full import. You can also use it to update your graph, which is more performant than inserting the same data transactionally.
Incremental import requires the use of --force and can be run on an existing database only.
If you want to perform the incremental import within one command, you must stop your database.
If you cannot afford a full downtime of your database, split the operation into several stages:
-
prepare stage (offline)
-
build stage (offline or read-only)
-
merge stage (offline)
The database must be stopped for the prepare and merge stages.
During the build stage, the database can be left online but put into read-only mode.
For a detailed example, see Incremental import in stages.
It is highly recommended to back up your database before running the incremental import, because the database may become corrupted if the merge stage fails, is aborted, or crashes.
Syntax
The syntax for importing a set of CSV files incrementally is:
neo4j-admin database import incremental [-h] [--expand-commands] --force [--verbose] [--auto-skip-subsequent-headers
[=true|false]] [--ignore-empty-strings[=true|false]] [--ignore-extra-columns
[=true|false]] [--legacy-style-quoting[=true|false]] [--normalize-types
[=true|false]] [--skip-bad-entries-logging[=true|false]]
[--skip-bad-relationships[=true|false]] [--skip-duplicate-nodes[=true|false]]
[--strict[=true|false]] [--trim-strings[=true|false]]
[--additional-config=<file>] [--array-delimiter=<char>] [--bad-tolerance=<num>]
[--delimiter=<char>] [--high-parallel-io=on|off|auto]
[--id-type=string|integer|actual] [--input-encoding=<character-set>]
[--input-type=csv|parquet] [--max-off-heap-memory=<size>] [--quote=<char>]
[--read-buffer-size=<size>] [--report-file=<path>] [--schema=<path>]
[--stage=all|prepare|build|merge] [--threads=<num>] --nodes=[<label>[:
<label>]...=]<files>... [--nodes=[<label>[:<label>]...=]<files>...]...
[--relationships=[<type>=]<files>...]... [--multiline-fields=true|false|<path>[,
<path>] [--multiline-fields-format=v1|v2]] <database>
Usage and limitations
The incremental import command can be used to add:
-
New nodes with labels and properties.
Note that node property uniqueness constraints must be in place for the property key and label combinations that form the primary key, that is, the combinations that uniquely identify the nodes. Otherwise, the command throws an error and exits. For more information, see CSV header format.
-
New relationships between existing or new nodes.
The incremental import command cannot be used to:
-
Add new properties to existing nodes or relationships.
-
Update or delete properties in nodes or relationships.
-
Update or delete labels in nodes.
-
Delete existing nodes and relationships.
The importer works well on standalone servers. In clustering environments with multiple copies of the database, the updated database must be reseeded. |
Parameters
Parameter | Description | Default |
---|---|---|
|
Name of the database to import.
If the database into which you import does not exist prior to importing, you must create it subsequently using |
|
Options
Option | Description | Default |
---|---|---|
|
Configuration file with additional configuration. |
|
|
Delimiter character between array elements within a value in CSV data. Also accepts
For horizontal tabulation (HT), use Unicode character ID can be used if prepended by |
|
|
Automatically skip accidental header lines in subsequent files in file groups with more than one file. |
|
|
Number of bad entries before the import is aborted. The import process is optimized for error-free data. Therefore, cleaning the data before importing it is highly recommended. If you encounter any bad entries during the import process, you can set the number of bad entries to a specific value that suits your needs. However, setting a high value may affect the performance of the tool. |
|
|
Delimiter character between values in CSV data. Also accepts
For horizontal tabulation (HT), use Unicode character ID can be used if prepended by |
|
|
Allow command expansion in config value evaluation. |
|
|
Confirm incremental import by setting this flag. |
|
|
Show this help message and exit. |
|
|
Ignore environment-based heuristics and indicate if the target storage subsystem can support parallel IO with high throughput or auto detect.
Typically this is |
|
|
Introduced in 5.1 Each node must provide a unique ID. This is used to find the correct nodes when creating relationships. Possible values are:
|
|
|
Whether or not empty string fields, i.e. "" from input source are ignored, i.e. treated as null. |
|
|
Whether or not unspecified columns should be ignored during the import. |
|
|
Character set that input data is encoded in. |
|
|
Introduced in 5.26 File type to import from. Can be csv or parquet. Defaults to csv. |
|
|
Whether or not a backslash-escaped quote e.g. \" is interpreted as an inner quote. |
|
|
Maximum memory that Values can be plain numbers, such as |
|
|
Changed in 5.26 In v1, whether or not fields from an input source can span multiple lines, i.e. contain newline characters. Setting |
|
|
Introduced in 5.26 Controls the parsing of input source that can span multiple lines, i.e. contain newline characters. When set to v1, the value for |
|
|
Node CSV header and data.
It is possible to import files from AWS S3 buckets, Google Cloud storage buckets, and Azure buckets using the appropriate URI as the path. For an example, see Import data from CSV files using regular expression. |
|
|
When |
|
|
Character to treat as quotation character for values in CSV data. Quotes can be escaped as per RFC 4180 by doubling them, for example You cannot escape using |
|
|
Size of each buffer for reading input data. It has to be at least large enough to hold the biggest single value in the input data.
The value can be a plain number or a byte units string, e.g. |
|
|
Relationship CSV header and data.
It is possible to import files from AWS S3 buckets, Google Cloud storage buckets, and Azure buckets using the appropriate URI as the path. For an example, see Import data from CSV files using regular expression. |
|
|
File in which to store the report of the csv-import. The location of the import log file can be controlled using the If you are running on a UNIX-like system and you are not interested in the output, you can get rid of it altogether by directing the report file to If you need to debug the import, it might be useful to collect the stack trace.
This is done by using the |
|
|
Introduced in 5.24 Path to the file containing the Cypher commands for creating indexes and constraints during data import. |
|
|
When set to |
|
|
Whether or not to skip importing relationships that refer to missing node IDs, i.e. either start or end node ID/group referring to a node that was not specified by the node input data. Skipped relationships will be logged, containing at most the number of entities specified by |
|
|
Whether or not to skip importing nodes that have the same ID/group. In the event of multiple nodes within the same group having the same ID, the first encountered will be imported, whereas consecutive such nodes will be skipped. Skipped nodes will be logged, containing at most the number of entities specified by |
|
|
Stage of incremental import. For incremental import into an existing database use For semi-online incremental import run |
|
|
Introduced in 5.6 Whether or not the lookup of nodes referred to from relationships is checked strictly. If disabled, most but not all relationships referring to non-existent nodes are detected. If enabled, all such relationships are found, at the cost of lower performance. |
|
|
(advanced) Maximum number of worker threads used by the importer. Defaults to the number of available processors reported by the JVM. The importer always needs a certain minimum number of threads, so no lower bound is enforced for this value. For optimal performance, this value should not be greater than the number of available processors. |
|
|
Whether or not strings should be trimmed for whitespaces. |
|
|
Enable verbose output. |
|
3. See Tools → Configuration for details. 4. Ignored by Parquet import.
5. The |
Using both a multi-value option and a positional parameter
When using both a multi-value option, such as This is a limitation of the underlying library, Picocli, and is not specific to Neo4j Admin. For more information, see Picocli → Variable Arity Options and Positional Parameters official documentation. To resolve the problem, use one of the following solutions:
|
Examples
There are two ways of importing data incrementally.
Incremental import in a single command
If downtime is not a concern, you can run a single command with the option --stage=all.
This option requires the database to be stopped.
neo4j@system> STOP DATABASE db1 WAIT;
...
bin/neo4j-admin database import incremental --force --stage=all --nodes=N1=../../raw-data/incremental-import/b.csv db1
Incremental import in stages
If you cannot afford a full downtime of your database, you can run the import in three stages.
-
prepare stage:
During this stage, the import tool analyzes the CSV headers and copies the relevant data over to the new increment database path. The import command is run with the option --stage=prepare and the database must be stopped.
-
Using the system database, stop the database db1 with the WAIT option to ensure a checkpoint happens before you run the incremental import command. The database must be stopped to run --stage=prepare.
STOP DATABASE db1 WAIT
-
Run the incremental import command with the --stage=prepare option:
bin/neo4j-admin database import incremental --force --stage=prepare --nodes=N1=../../raw-data/incremental-import/c.csv db1
-
-
build stage:
During this stage, the import tool imports the data, deduplicates it, and validates it in the new increment database path. This is the longest stage and you can put the database in read-only mode to allow read access. The import command is run with the option --stage=build.
-
Put the database in read-only mode:
ALTER DATABASE db1 SET ACCESS READ ONLY
-
Run the incremental import command with the --stage=build option:
bin/neo4j-admin database import incremental --force --stage=build --nodes=N1=../../raw-data/incremental-import/c.csv db1
-
-
merge stage:
During this stage, the import tool merges the new data with the existing data in the database. It also updates the affected indexes and upholds the affected property uniqueness constraints and property existence constraints. The import command is run with the option --stage=merge and the database must be stopped. It is not necessary to include the --nodes or --relationships options when using --stage=merge.
-
Using the system database, stop the database db1 with the WAIT option to ensure a checkpoint happens before you run the incremental import command.
STOP DATABASE db1 WAIT
-
Run the incremental import command with the --stage=merge option:
bin/neo4j-admin database import incremental --force --stage=merge db1
-
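The three stages can also be scripted. The following Python sketch only assembles the argument lists for each stage. The `incremental_stage_commands` function and its parameters are hypothetical, --force is included because incremental import is described above as requiring it, and stopping the database or switching it to read-only mode is left to the operator.

```python
def incremental_stage_commands(database, nodes=(), relationships=(), admin="bin/neo4j-admin"):
    """Build argument lists for the prepare, build, and merge stages."""
    base = [admin, "database", "import", "incremental", "--force"]
    # The --option=value form keeps multi-value options apart from the
    # positional database name at the end of the command.
    inputs = [f"--nodes={n}" for n in nodes]
    inputs += [f"--relationships={r}" for r in relationships]

    commands = {}
    for stage in ("prepare", "build", "merge"):
        # The merge stage does not need the --nodes/--relationships options.
        stage_inputs = [] if stage == "merge" else inputs
        commands[stage] = base + [f"--stage={stage}"] + stage_inputs + [database]
    return commands
```

Each list could then be passed to a process runner such as subprocess.run, after the matching STOP DATABASE or ALTER DATABASE step has been carried out.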
CSV header format
The header file of each data source specifies how the data fields should be interpreted. You must use the same delimiter for the header file and the data files.
The header contains information for each field, with the format <name>:<field_type>.
The <name> is used for properties and node IDs.
In all other cases, the <name> part of the field is ignored.
Incremental import
When using incremental import, node property uniqueness constraints must be in place for the property key and label combinations that form the primary key, that is, the combinations that uniquely identify the nodes.
For example, importing nodes with a This is also true when working with multiple groups.
For example, you can use
|
Node files
Files containing node data can have an ID field, a LABEL field, and properties.
- ID
-
Each node must have a unique ID if it is to be connected by any relationships created in the import. Neo4j uses the IDs to find the correct nodes when creating relationships. Note that the ID has to be unique across all nodes within the group, regardless of their labels. The unique ID is persisted in a property whose name is defined by the <name> part of the field definition <name>:ID. If no such property name is defined, the unique ID will be used for the import but not be available for reference later. If no ID is specified, the node will be imported, but it will not be connected to other nodes during the import. When a property name is provided, that property type can be configured globally via the --id-type option (as for Property data types).
From Neo4j 5.1, you can specify a different value ID type to be stored for a node property in its group using the option id-type in the header, e.g. id:ID(MyGroup){label:MyLabel, id-type: int}. This ID type overrides the global --id-type option. For example, the global id-type can be a string, but the nodes will have their IDs stored as int type in their ID properties. For more information, see Storing a different value type for IDs in a group.
From Neo4j 5.3, a node header can also contain multiple ID columns, where the relationship data references the composite value of all those columns. This also implies using string as id-type. For each ID column, you can specify to store its values as different node properties. However, the composite value cannot be stored as a node property. For more information, see Using multiple node IDs.
- LABEL
-
Read one or more labels from this field. Like array values, multiple labels are separated by
;
, or by the character specified with--array-delimiter
. Introduced in 5.25 The max length of label names for block format is 16,383 characters.
You define the headers for movies in the movies_header.csv file.
Movies have the properties movieId, year, and title.
You also specify a field for labels.
movieId:ID,title,year:int,:LABEL
You define three movies in the movies.csv file.
They contain all the properties defined in the header file.
All the movies are given the label Movie.
Two of them are also given the label Sequel.
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
Similarly, you also define three actors in the actors_header.csv and actors.csv files.
They all have the properties personId and name, and the label Actor.
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
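The way a header file pairs with its data file can be pictured with a short Python sketch. It only illustrates the format described above and is not the import tool's own parser. The function name read_nodes and the dictionary layout are made up for this example, and field types other than ID and LABEL are not converted.

```python
import csv
import io

def read_nodes(header_text, data_text, array_delimiter=";"):
    """Pair each data row with the header fields (illustrative only)."""
    header = next(csv.reader(io.StringIO(header_text)))
    nodes = []
    for row in csv.reader(io.StringIO(data_text)):
        node = {"id": None, "labels": [], "properties": {}}
        for field, value in zip(header, row):
            # A field is written as <name>:<field_type>; either part may be empty.
            name, _, kind = field.partition(":")
            if kind == "ID":
                node["id"] = value
                if name:  # a named ID is also kept as a property
                    node["properties"][name] = value
            elif kind == "LABEL":
                node["labels"] = value.split(array_delimiter)
            else:
                node["properties"][name] = value  # type conversion omitted
        nodes.append(node)
    return nodes

movies = read_nodes(
    "movieId:ID,title,year:int,:LABEL\n",
    'tt0133093,"The Matrix",1999,Movie\n'
    'tt0234215,"The Matrix Reloaded",2003,Movie;Sequel\n',
)
print(movies[1]["labels"])
```

The second movie ends up with both labels because the LABEL value is split on the array delimiter, in the same way as an array property would be.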
Relationship files
Files containing relationship data have three mandatory fields and can also have properties. The mandatory fields are:
- TYPE
-
The relationship type to use for this relationship. Introduced in 5.25: The max length of relationship type names for block format is 16,383 characters.
- START_ID
-
The ID of the start node for this relationship.
- END_ID
-
The ID of the end node for this relationship.
The START_ID and END_ID refer to the unique node ID defined in one of the node data sources, as explained in the previous section.
None of these take a name, e.g. if <name>:START_ID or <name>:END_ID is defined, the <name> part will be ignored.
Nor do they take a <field_type>, e.g. if :START_ID:int or :END_ID:int is defined, the :int part does not have any meaning in the context of type information.
In this example, you assume that the two node files from the previous example are used together with the following relationships file.
You define relationships between actors and movies in the files roles_header.csv and roles.csv.
Each row connects a start node and an end node with a relationship of relationship type ACTED_IN.
Notice how you use the unique identifiers personId and movieId from the nodes files above.
The name of the character that the actor is playing in this movie is stored as a role property on the relationship.
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
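Because every START_ID and END_ID has to match a node ID, it can be worth checking the files before an import. The following Python sketch shows one way to do that. It assumes that no ID spaces are used and that the header contains exactly the fields :START_ID and :END_ID. The helper name dangling_endpoints and the hugo row are hypothetical and only serve the illustration.

```python
import csv
import io

def dangling_endpoints(node_ids, header_text, data_text):
    """Return (row_number, field, value) for every endpoint missing from node_ids."""
    header = next(csv.reader(io.StringIO(header_text)))
    start = header.index(":START_ID")
    end = header.index(":END_ID")
    problems = []
    for number, row in enumerate(csv.reader(io.StringIO(data_text)), start=1):
        for field, column in ((":START_ID", start), (":END_ID", end)):
            if row[column] not in node_ids:
                problems.append((number, field, row[column]))
    return problems

# IDs taken from the movies and actors files above.
node_ids = {"tt0133093", "tt0234215", "tt0242653", "keanu", "laurence", "carrieanne"}
# The second row is a made-up faulty row: no node with the ID "hugo" exists.
roles = 'keanu,"Neo",tt0133093,ACTED_IN\nhugo,"Agent Smith",tt0133093,ACTED_IN\n'
problems = dangling_endpoints(node_ids, ":START_ID,role,:END_ID,:TYPE", roles)
print(problems)
```

An empty result means that every relationship row refers to known nodes on both ends.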
Property data types
For properties, the <name> part of the field designates the property key, while the <field_type> part assigns a data type.
You can have properties in both node data files and relationship data files.
Introduced in 5.25: The max length of property keys for block format is 16,383 characters.
Use one of int, long, float, double, boolean, byte, short, char, string, point, date, localtime, time, localdatetime, datetime, and duration to designate the data type for properties.
By default, types (except arrays) are converted to Cypher types.
See Cypher Manual → Property, structural, and constructed values.
This behavior can be disabled using the option --normalize-types=false.
Normalizing types can require more space on disk, but avoids Cypher converting the type during queries.
If no data type is given, this defaults to string.
To define an array type, append [] to the type.
By default, array values are separated by ;.
A different delimiter can be specified with --array-delimiter.
Arrays are not affected by the --normalize-types flag.
For example, if you want a byte array to be stored as a Cypher long array, you must explicitly declare the property as long[].
Boolean values are true if they match exactly the text true. All other values are false.
Values that contain the delimiter character need to be escaped by enclosing in double quotation marks, or by using a different delimiter character with the --delimiter option.
This example illustrates several different data types specified in the CSV header.
:ID,name,joined:date,active:boolean,points:int
user01,Joe Soap,2017-05-05,true,10
user02,Jane Doe,2017-08-21,true,15
user03,Moe Know,2018-02-17,false,7
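The conversion rules above can be approximated in a few lines of Python. This is a rough sketch covering only a subset of the types, not the tool's implementation. The function names convert and convert_array are invented for the example, and because Python integers have no fixed width, int, long, short, and byte are all treated alike here.

```python
from datetime import date

def convert(value, field_type="string"):
    """Convert one CSV value according to a small subset of header types."""
    if field_type == "boolean":
        return value == "true"  # only the exact text "true" counts as true
    if field_type in ("int", "long", "short", "byte"):
        return int(value)
    if field_type in ("float", "double"):
        return float(value)
    if field_type == "date":
        return date.fromisoformat(value)
    return value  # string is the default when no type is given

def convert_array(value, element_type, array_delimiter=";"):
    """Split an array value and convert each element."""
    return [convert(item, element_type) for item in value.split(array_delimiter)]

joined = convert("2017-05-05", "date")
active = convert("TRUE", "boolean")
points = convert_array("10;15;7", "int")
print(joined, active, points)
```

Under the exact-match rule stated above, a value such as TRUE is read as false, which is why the sketch compares against the literal text true.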
- Special considerations for the point data type
-
A point is specified using the Cypher syntax for maps. The map allows the same keys as the input to the Cypher Manual → Point function. The point data type in the header can be amended with a map of default values used for all values of that column, e.g. point{crs: 'WGS-84'}. Specifying the header this way allows you to have an incomplete map in the value position in the data file. Optionally, a value in a data file may override default values from the header.
Example 4. Property format for point data type
This example illustrates various ways of using the point data type in the import header and the data files.
You are going to import the name and location coordinates for cities. First, you define the header as:
:ID,name,location:point{crs:WGS-84}
You then define cities in the data file.
-
The first city’s location is defined using latitude and longitude, as expected when using the coordinate system defined in the header.
-
The second city uses x and y instead. This would normally lead to a point using the coordinate reference system cartesian. Since the header defines crs:WGS-84, that coordinate reference system will be used.
-
The third city overrides the coordinate reference system defined in the header and sets it explicitly to WGS-84-3D.
:ID,name,location:point{crs:WGS-84}
city01,"Malmö","{latitude:55.6121514, longitude:12.9950357}"
city02,"London","{y:51.507222, x:-0.1275}"
city03,"San Mateo","{latitude:37.554167, longitude:-122.313056, height: 100, crs:'WGS-84-3D'}"
Note that all point maps are within double quotation marks " in order to prevent the enclosed , character from being interpreted as a column separator. An alternative approach would be to use --delimiter='\t' and reformat the file with tab separators, in which case the " characters are not required.
:ID	name	location:point{crs:WGS-84}
city01	Malmö	{latitude:55.6121514, longitude:12.9950357}
city02	London	{y:51.507222, x:-0.1275}
city03	San Mateo	{latitude:37.554167, longitude:-122.313056, height: 100, crs:'WGS-84-3D'}
- Special considerations for temporal data types
-
The format for all temporal data types must be defined as described in Cypher Manual → Temporal instants syntax and Cypher Manual → Durations syntax. Two of the temporal types, Time and DateTime, take a time zone parameter that might be common between all or many of the values in the data file. It is therefore possible to specify a default time zone for Time and DateTime values in the header, for example: time{timezone:+02:00} and datetime{timezone:Europe/Stockholm}. If no default time zone is specified, the default time zone is determined by the db.temporal.timezone configuration setting. The default time zone can be explicitly overridden in the values in the data file.
Example 5. Property format for temporal data types
This example illustrates various ways of using the datetime data type in the import header and the data files.
First, you define the header with two DateTime columns. The first one defines a time zone, but the second one does not:
:ID,date1:datetime{timezone:Europe/Stockholm},date2:datetime
You then define dates in the data file.
-
The first row has two values that do not specify an explicit time zone. The value for date1 will use the Europe/Stockholm time zone that was specified for that field in the header. The value for date2 will use the configured default time zone of the database.
-
In the second row, both date1 and date2 set the time zone explicitly to be Europe/Berlin. This overrides the header definition for date1, as well as the configured default time zone of the database.
1,2018-05-10T10:30,2018-05-10T12:30
2,2018-05-10T10:30[Europe/Berlin],2018-05-10T12:30[Europe/Berlin]
Using ID spaces
By default, the import tool assumes that node identifiers are unique across node files.
In many cases, the ID is unique only across each entity file, for example, when your CSV files contain data extracted from a relational database and the ID field is pulled from the primary key column in the corresponding table.
To handle this situation you define ID spaces.
ID spaces are defined in the ID field of node files using the syntax ID(<ID space identifier>).
To reference an ID of an ID space in a relationship file, you use the syntax START_ID(<ID space identifier>) and END_ID(<ID space identifier>).
Define a Movie-ID ID space in the movies_header.csv file.
movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
Define an Actor-ID ID space in the header of the actors_header.csv file.
personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor
Now use the previously defined ID spaces when connecting the actors to movies.
:START_ID(Actor-ID),role,:END_ID(Movie-ID),:TYPE
1,"Neo",1,ACTED_IN
1,"Neo",2,ACTED_IN
1,"Neo",3,ACTED_IN
2,"Morpheus",1,ACTED_IN
2,"Morpheus",2,ACTED_IN
2,"Morpheus",3,ACTED_IN
3,"Trinity",1,ACTED_IN
3,"Trinity",2,ACTED_IN
3,"Trinity",3,ACTED_IN
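One way to picture ID spaces is as a lookup keyed by the pair of ID space and ID rather than by the ID alone. The Python sketch below models that idea with the data from the example. It is a conceptual illustration only, and build_index is a made-up helper, not part of any Neo4j API.

```python
def build_index(node_files):
    """Map (id_space, id) to a node name; duplicates within a space are errors."""
    index = {}
    for id_space, rows in node_files:
        for node_id, name in rows:
            key = (id_space, node_id)
            if key in index:
                raise ValueError(f"duplicate ID {node_id!r} in ID space {id_space!r}")
            index[key] = name
    return index

index = build_index([
    ("Movie-ID", [("1", "The Matrix"), ("2", "The Matrix Reloaded")]),
    ("Actor-ID", [("1", "Keanu Reeves"), ("2", "Laurence Fishburne")]),
])

# The relationship row 1,"Neo",1,ACTED_IN resolves each end in its own ID space.
start = index[("Actor-ID", "1")]
end = index[("Movie-ID", "1")]
print(start, "->", end)
```

The ID 1 appears in both files without a clash, because the two entries live under different ID spaces.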
Using multiple node IDs
From Neo4j 5.3, a node header can also contain multiple ID columns, where the relationship data references the composite value of all those columns.
This also implies using string as id-type.
For each ID column, you can specify to store its values as different node properties.
However, the composite value cannot be stored as a node property.
Incremental import doesn’t support the use of multiple node identifiers. This functionality is only available with a full import.
You can define multiple ID columns in the node header.
For example, you can define a node header with two ID columns.
:ID,:ID,name
aa,11,John
bb,22,Paul
Now use both IDs when defining the relationship:
:START_ID,:TYPE,:END_ID
aa11,WORKS_WITH,bb22
Define a MyGroup ID space in the nodes_header.csv file.
personId:ID(MyGroup),memberId:ID(MyGroup),name
aa,11,John
bb,22,Paul
Now use the defined ID space when connecting John with Paul, and use both IDs in the relationship.
:START_ID(MyGroup),:TYPE,:END_ID(MyGroup)
aa11,WORKS_WITH,bb22
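The composite value used in the relationship files can be reproduced with a small Python sketch. It assumes that the values of the ID columns are joined in header order without a separator, which is what the values aa11 and bb22 in the examples suggest. The names ID_FIELD and composite_id are invented for this illustration.

```python
import re

# Matches header fields such as ":ID", "personId:ID" or "memberId:ID(MyGroup)".
ID_FIELD = re.compile(r"^[^:]*:ID(\(.*\))?$")

def composite_id(header, row):
    """Join the values of all ID columns, in header order, into one string."""
    return "".join(value for field, value in zip(header, row) if ID_FIELD.match(field))

header = ["personId:ID(MyGroup)", "memberId:ID(MyGroup)", "name"]
rows = [["aa", "11", "John"], ["bb", "22", "Paul"]]
keys = {composite_id(header, row): row[2] for row in rows}
print(keys)
```

The resulting keys are the strings that the START_ID and END_ID columns refer to, which is also why string is implied as the id-type.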
Storing a different value type for IDs in a group
From Neo4j 5.1, you can control the ID type of the node property that will be stored by defining the id-type option in the header, for example, :ID{id-type:long}.
The id-type option in the header overrides the global --id-type value provided to the command.
This way, you can have property values of different types for different groups of nodes.
For example, the global id-type can be a string, but some nodes can have their IDs stored as long type in their ID properties.
id:ID(GroupOne){id-type:long},name,:LABEL
123,P1,Person
456,P2,Person
id:ID(GroupTwo),name,:LABEL
ABC,G1,Game
DEF,G2,Game
neo4j_home$ bin/neo4j-admin database import full --nodes persons.csv --nodes games.csv --id-type string
The id property of the nodes in persons.csv (group GroupOne) will be stored as long type, while the id property of the nodes in games.csv (group GroupTwo) will be stored as string type, as the global id-type is a string.
Importing data that spans multiple lines
The --multiline-fields option allows fields from an input source to span multiple lines, i.e. contain newline characters.
For example:
bin/neo4j-admin database import full --nodes import/node_header.csv,import/node_data.csv --multiline-fields=true databasename
Where import/node_data.csv contains multiline fields, such as:
id,name,birthDate,birthYear,birthLocation,description
1,John,October 1st,2000,New York,This is a multiline
description
Setting |
Starting from 5.26, you can optionally control how multiline fields in the input source are parsed by setting the --multiline-fields-format option.
Possible values are:
-
v1 - the default format, which uses the current processing method for multiline fields.
-
v2 - a more efficient processing method that requires text fields to be quoted. For v2, the --multiline-fields option must be set to a list of files (regular expressions are allowed) that contain multiline fields.
Both formats have the restriction that the entirety of every row must be able to fit into the buffer (default is 4m).
The --multiline-fields-format option is available in the full and incremental import modes.
For example:
bin/neo4j-admin database import full --nodes import/node_header.csv,import/node_data.csv --multiline-fields=true --multiline-fields-format=v1 databasename
Where import/node_data.csv contains multiline fields, such as:
id,name,birthDate,birthYear,birthLocation,description
1,John,October 1st,2000,New York,This is a multiline
description
bin/neo4j-admin database import full --nodes import/node_header.csv,import/node_data.csv --multiline-fields=import/node_data.csv --multiline-fields-format=v2 databasename
Where import/node_data.csv contains multiline fields, such as:
id,name,birthDate,birthYear,birthLocation,description
1,"John","October 1st",2000,"New York","This is a multiline
description"
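Quoting is what makes the end of a multiline record unambiguous. The Python sketch below reads the quoted sample with Python's standard csv module, which here stands in for a generic CSV parser and is not the import tool's parser. It contrasts the result with a naive split on line breaks.

```python
import csv
import io

data = (
    "id,name,birthDate,birthYear,birthLocation,description\n"
    '1,"John","October 1st",2000,"New York","This is a multiline\n'
    'description"\n'
)

# A naive split on line breaks cuts the record in two.
physical_lines = data.splitlines()

# A CSV parser keeps reading until the closing quotation mark.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(physical_lines), len(records))
print(repr(records[1][5]))
```

The file has three physical lines but only two records: the header row and one row whose last field contains a newline.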
Skipping columns
- IGNORE
-
If there are fields in the data that you wish to ignore completely, this can be done using the IGNORE keyword in the header file. IGNORE must be prepended with a :.
Example 10. Skip a column
In this example, you are not interested in the data in the third column of the nodes file and wish to skip over it. Note that the IGNORE keyword is prepended by a :.
personId:ID,name,:IGNORE,:LABEL
keanu,"Keanu Reeves","male",Actor
laurence,"Laurence Fishburne","male",Actor
carrieanne,"Carrie-Anne Moss","female",Actor
If all your superfluous data is placed in columns located to the right of all the columns that you wish to import, you can instead use the command line option --ignore-extra-columns.
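The effect of both approaches on a row can be sketched in Python. The helper drop_ignored is hypothetical and only models the described behavior: columns marked :IGNORE are dropped, and any values beyond the last header field are dropped as well.

```python
import csv
import io

def drop_ignored(header_text, data_text):
    """Return the header and rows without :IGNORE columns or extra trailing columns."""
    header = next(csv.reader(io.StringIO(header_text)))
    keep = [i for i, field in enumerate(header) if field != ":IGNORE"]
    kept_header = [header[i] for i in keep]
    kept_rows = [
        [row[i] for i in keep]  # indexes past the header length are never selected
        for row in csv.reader(io.StringIO(data_text))
    ]
    return kept_header, kept_rows

kept_header, kept_rows = drop_ignored(
    "personId:ID,name,:IGNORE,:LABEL",
    'keanu,"Keanu Reeves","male",Actor\n',
)
print(kept_header, kept_rows)
```

The third column disappears from both the header and the data, while the remaining columns keep their order.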
Importing compressed files
The import tool can handle files compressed with zip or gzip.
Each compressed file must contain a single file.
neo4j_home$ ls import
actors-header.csv actors.csv.zip movies-header.csv movies.csv.gz roles-header.csv roles.csv.gz
bin/neo4j-admin database import full --nodes import/movies-header.csv,import/movies.csv.gz --nodes import/actors-header.csv,import/actors.csv.zip --relationships import/roles-header.csv,import/roles.csv.gz
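Since each compressed file must contain a single file, a quick check of zip archives before the import can save a failed run. The following Python sketch counts the files in an archive using the standard zipfile module. The function names are invented for this example, and the archives are built in memory only so that the sketch is self-contained. In practice you would pass a path such as import/actors.csv.zip.

```python
import io
import zipfile

def count_files_in_zip(source):
    """Count regular files in a zip archive given as a path or file object."""
    with zipfile.ZipFile(source) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())

def make_zip(files):
    """Build an in-memory zip archive from a {name: text} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    buffer.seek(0)
    return buffer

single = make_zip({"actors.csv": 'keanu,"Keanu Reeves",Actor\n'})
double = make_zip({"actors.csv": "a\n", "movies.csv": "b\n"})
print(count_files_in_zip(single), count_files_in_zip(double))
```

Only the first archive satisfies the single-file requirement. A gzip file always wraps exactly one stream, so the check is needed for zip archives only.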