Load Data from Web-APIs
Supported protocols are file, http, https, s3, gs, and hdfs, with redirects allowed.
If no protocol is provided, the procedures will try to check whether the URL is actually a file.
As apoc.import.file.use_neo4j_config is enabled, the procedures check whether file system access is allowed, and possibly constrained to a specific directory, by reading the two configuration parameters dbms.security.allow_csv_import_from_file_urls and dbms.directories.import respectively.
To remove these constraints, set apoc.import.file.use_neo4j_config=false.
| load JSON from URL |
| load XML from URL |
| load CSV from URL |
| load XLS from URL |
Adding failOnError:false (by default true) to the config map when using any of the procedures in the above table makes them return zero rows instead of failing when an error occurs. Example:
CALL apoc.load.json('http://example.com/test.json', null, {failOnError:false})
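The zero-row behaviour can be made visible by counting the returned rows. The following is a minimal sketch, assuming the same placeholder URL as above and the value column that apoc.load.json yields; a failed load should then show up as a count of 0 rather than as an error:

```cypher
// The example.com URL is a placeholder, not a real data source
CALL apoc.load.json('http://example.com/test.json', null, {failOnError: false})
YIELD value
// With failOnError:false a failed load produces no rows, so the count is 0
RETURN count(value) AS rowsLoaded
```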
Load Single File From Compressed File (zip/tar/tar.gz/tgz)
When loading data from compressed files, put the ! character before the file name or path inside the compressed file.
For example:
CALL apoc.load.csv("pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv")
CALL apoc.load.json("https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/4.4/core/src/test/resources/testload.tgz?raw=true!person.json");
Using S3, GCS or HDFS protocols
To use any of these protocols, additional dependency jars need to be downloaded and copied into the plugins directory <NEO4J_HOME>/plugins:
Protocol | Needed extra dependency
S3 |
GCS |
HDFS |
After copying the jars into the plugins directory, the database will need to be restarted.
Using S3 protocol
The S3 URL must be in one of the following formats:
- s3://accessKey:secretKey[:sessionToken]@endpoint:port/bucket/key (where the sessionToken is optional)
- s3://endpoint:port/bucket/key?accessKey=accessKey&secretKey=secretKey[&sessionToken=sessionToken] (where the sessionToken is optional)
- s3://endpoint:port/bucket/key if the accessKey, secretKey, and the optional sessionToken are provided in the environment variables
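As an illustration, the credential-free URL form could be used as follows. This is only a sketch, assuming the credentials are supplied through environment variables as described above; the endpoint, port, bucket, and key are hypothetical placeholders:

```cypher
// Hypothetical endpoint, bucket and key; credentials are expected in the environment
CALL apoc.load.csv('s3://s3.example.com:443/my-bucket/data/people.csv')
YIELD lineNo, map
// Return only the first few rows to check that the file is readable
RETURN lineNo, map
LIMIT 5
```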
Using Google Cloud Storage
Google Cloud Storage URLs have the following shape:
gs://<bucket_name>/<file_path>
The authorization type can be specified via an additional authenticationType query parameter:
- NONE: for public buckets (this is the default behavior if the parameter is not specified)
- GCP_ENVIRONMENT: for passive authentication as a service account when Neo4j is running in the Google Cloud
- PRIVATE_KEY: for using private keys generated for service accounts (requires setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to a private key JSON file, as described here: https://cloud.google.com/docs/authentication#strategies)
Example:
gs://bucket/test-file.csv?authenticationType=GCP_ENVIRONMENT
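Such a URL is passed to a load procedure like any other. The following sketch assumes the example bucket and file above exist and that Neo4j runs in Google Cloud with a service account that can read them:

```cypher
// Bucket and file name are taken from the example URL above
CALL apoc.load.csv('gs://bucket/test-file.csv?authenticationType=GCP_ENVIRONMENT')
YIELD map
// Each row is returned as a map keyed by the CSV header names
RETURN map
LIMIT 5
```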