Load Data from Web-APIs
Supported protocols are file, http, https, s3, gs, and hdfs, with redirects allowed.
If no protocol is provided, this procedure will try to check whether the URL is actually a file.
As apoc.import.file.use_neo4j_config is enabled, the procedures check whether file system access is allowed, and possibly constrained to a specific directory, by reading the two configuration parameters dbms.security.allow_csv_import_from_file_urls and dbms.directories.import respectively.
If you want to remove these constraints, set apoc.import.file.use_neo4j_config=false.
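For example, a minimal sketch of the relevant settings (depending on the Neo4j/APOC version they belong in neo4j.conf or apoc.conf; the values are illustrative):
# constrain file access to the import directory (checked while use_neo4j_config is enabled)
dbms.security.allow_csv_import_from_file_urls=true
dbms.directories.import=import
# or, to lift these constraints entirely:
apoc.import.file.use_neo4j_config=false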
Procedure | Description
apoc.load.json | load JSON from URL
apoc.load.xml | load XML from URL
apoc.load.csv | load CSV from URL
apoc.load.xls | load XLS from URL
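As a minimal usage sketch, each procedure YIELDs rows that can be processed with regular Cypher (the URLs are hypothetical):
CALL apoc.load.json('https://example.com/data.json') YIELD value
RETURN value;
CALL apoc.load.csv('https://example.com/data.csv', {header:true}) YIELD map
RETURN map;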
Adding failOnError:false (default: true) to the config map of any of the procedures in the above table makes them return zero rows instead of failing when an error occurs. Example:
CALL apoc.load.json('http://example.com/test.json', null, {failOnError:false})
Load Single File From Compressed File (zip/tar/tar.gz/tgz)
When loading data from compressed files, the ! character must be placed before the name (or path) of the file inside the compressed file.
For example:
CALL apoc.load.csv("pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv")
CALL apoc.load.json("https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/4.4/core/src/test/resources/testload.tgz?raw=true!person.json");
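A complete statement, reusing the placeholder paths from above, might look like this:
CALL apoc.load.csv('pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv', {header:true}) YIELD map
RETURN map;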
Using S3, GCS or HDFS protocols
To use any of these protocols, extra dependency jars need to be downloaded and copied into the plugins directory <NEO4J_HOME>/plugins:
Protocol | Needed extra dependency
S3 |
GCS |
HDFS |
After copying the jars into the plugins directory, the database will need to be restarted.
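For example, a hypothetical sequence on a self-managed installation (the jar name is a placeholder):
cp needed-extra-dependency.jar $NEO4J_HOME/plugins/
$NEO4J_HOME/bin/neo4j restart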
Using S3 protocol
The S3 URL must be in one of the following formats:
- s3://accessKey:secretKey[:sessionToken]@endpoint:port/bucket/key (where the sessionToken is optional), or
- s3://endpoint:port/bucket/key?accessKey=accessKey&secretKey=secretKey[&sessionToken=sessionToken] (where the sessionToken is optional), or
- s3://endpoint:port/bucket/key, if the accessKey, secretKey, and the optional sessionToken are provided in the environment variables
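For example, a sketch using the query-parameter form (endpoint, bucket, key, and credentials are placeholders):
CALL apoc.load.csv('s3://s3.us-east-1.amazonaws.com:443/my-bucket/people.csv?accessKey=myAccessKey&secretKey=mySecretKey', {header:true}) YIELD map
RETURN map;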
Using Google Cloud Storage
Google Cloud Storage URLs have the following shape:
gs://<bucket_name>/<file_path>
The authorization type can be specified via an additional authenticationType query parameter:
- NONE: for public buckets (this is the default behavior if the parameter is not specified)
- GCP_ENVIRONMENT: for passive authentication as a service account when Neo4j is running in the Google Cloud
- PRIVATE_KEY: for using private keys generated for service accounts (requires setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to a private key json file, as described here: https://cloud.google.com/docs/authentication#strategies)
Example:
gs://bucket/test-file.csv?authenticationType=GCP_ENVIRONMENT
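Wrapped in a complete statement (bucket and file name taken from the example above):
CALL apoc.load.csv('gs://bucket/test-file.csv?authenticationType=GCP_ENVIRONMENT', {header:true}) YIELD map
RETURN map;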