Load Data from Web-APIs

Supported protocols are file, http, https, s3, gs, and hdfs. Redirects are allowed.

If no protocol is provided, the procedure checks whether the URL is actually a file.

Because apoc.import.file.use_neo4j_config is enabled, the procedures check whether file system access is allowed, and whether it is constrained to a specific directory, by reading the configuration parameters dbms.security.allow_csv_import_from_file_urls and dbms.directories.import, respectively. To remove these constraints, set apoc.import.file.use_neo4j_config=false.

CALL apoc.load.json('http://example.com/map.json', [path], [config]) YIELD value as person

load JSON from URL

CALL apoc.load.xml('http://example.com/test.xml', ['xPath'], [config]) YIELD value as doc

load XML from URL

CALL apoc.load.csv('url',{sep:";"}) YIELD lineNo, list, strings, map, stringMap

load CSV from URL

CALL apoc.load.xls('url','Sheet'/'Sheet!A2:B5',{config}) YIELD lineNo, list, map

load XLS from URL

Adding failOnError:false (the default is true) to the config map of any of the procedures listed above makes the procedure return zero rows instead of failing when an error occurs. Example:

CALL apoc.load.json('http://example.com/test.json', null, {failOnError:false})
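As a minimal sketch of how failOnError:false behaves inside a larger query, the example below assumes a hypothetical JSON document at the given URL whose top-level object has name and age fields. Neither the URL content nor the Person label comes from the APOC documentation.

```cypher
// Assumption: the document is a single JSON object such as {"name": "Ann", "age": 42}.
// With failOnError:false, a failed load yields zero rows instead of raising an error.
CALL apoc.load.json('http://example.com/map.json', null, {failOnError: false})
YIELD value AS person
// Hypothetical label and properties, used only for illustration.
MERGE (p:Person {name: person.name})
SET p.age = person.age
RETURN count(p) AS personsLoaded;
```

If the load fails, no rows reach MERGE and the query returns personsLoaded = 0, so a failed load cannot be told apart from an empty result without checking the logs.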

Load Single File From Compressed File (zip/tar/tar.gz/tgz)

When loading data from a compressed file, append the ! character to the path of the compressed file, followed by the file name or path of the target file inside it. For example:

Loading a compressed CSV file
CALL apoc.load.csv("pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv")
Loading a compressed JSON file
CALL apoc.load.json("https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/4.4/core/src/test/resources/testload.tgz?raw=true!person.json");
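The same ! syntax can be combined with the usual config map. The sketch below reuses the placeholder archive path from above and assumes, purely for illustration, a semicolon-separated CSV with a header row that contains a name column.

```cypher
// Placeholder path: file.zip is the archive, the part after "!" is the entry inside it.
// Assumption: the CSV has a header row with a "name" column and uses ";" as separator.
CALL apoc.load.csv('pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv',
                   {sep: ';', header: true})
YIELD lineNo, map
RETURN lineNo, map.name AS name
LIMIT 5;
```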

Using S3, GCS or HDFS protocols

To use any of these protocols, download the corresponding extra dependency jar and copy it into the plugins directory, <NEO4J_HOME>/plugins:

  • S3: apoc-aws-dependencies-4.4.0.24.jar

  • GCS: apoc-gcs-dependencies-4.4.0.24.jar

  • HDFS: apoc-hadoop-dependencies-4.4.0.24.jar

After copying the jars into the plugins directory, the database will need to be restarted.

Using S3 protocol

The S3 URL must be in one of the following formats:

  • s3://accessKey:secretKey[:sessionToken]@endpoint:port/bucket/key (where the sessionToken is optional) or

  • s3://endpoint:port/bucket/key?accessKey=accessKey&secretKey=secretKey[&sessionToken=sessionToken] (where the sessionToken is optional) or

  • s3://endpoint:port/bucket/key if the accessKey, secretKey, and the optional sessionToken are provided in the environment variables
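A minimal sketch of the third format, assuming the credentials are supplied through environment variables and the S3 dependency jar is installed. The endpoint, bucket name, and key are hypothetical values chosen only to illustrate the URL shape.

```cypher
// Hypothetical endpoint, port, bucket and key; replace with your own values.
// Assumption: accessKey and secretKey are provided via environment variables,
// so they do not appear in the URL.
CALL apoc.load.csv('s3://s3.eu-west-1.amazonaws.com:443/my-bucket/data/people.csv')
YIELD lineNo, map
RETURN lineNo, map
LIMIT 10;
```

Keeping the credentials out of the URL also keeps them out of query logs, which is one reason to prefer this format where it is available.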

Using Google Cloud Storage

Google Cloud Storage URLs have the following shape:

gs://<bucket_name>/<file_path>

The authentication type can be specified via an additional authenticationType query parameter:

  • NONE: for public buckets (this is the default behavior if the parameter is not specified)

  • GCP_ENVIRONMENT: for passive authentication as a service account when Neo4j is running on Google Cloud

  • PRIVATE_KEY: for using private keys generated for service accounts (requires setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a private key JSON file, as described here: https://cloud.google.com/docs/authentication#strategies)

Example:

gs://bucket/test-file.csv?authenticationType=GCP_ENVIRONMENT
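Such a URL is passed to the load procedures like any other. The sketch below assumes the GCS dependency jar is installed and that test-file.csv is a comma-separated file with a header row; the bucket and file names are the placeholders from the example above.

```cypher
// Placeholder bucket and file name taken from the example URL above.
// Assumption: Neo4j runs on Google Cloud with a service account that can read the bucket.
CALL apoc.load.csv('gs://bucket/test-file.csv?authenticationType=GCP_ENVIRONMENT')
YIELD lineNo, map
RETURN lineNo, map
LIMIT 10;
```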