Prerequisites
Neo4j instance
You need a running Neo4j instance into which the data can flow.
If you don’t have an instance yet, you have two options:
- sign up for a free AuraDB instance
- install and self-host Neo4j in a location that is publicly accessible (see Neo4j → Installation), with port 7687 open (Bolt protocol)
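If you self-host, it is worth verifying that the Bolt port is actually reachable from outside before moving on. Below is a minimal sketch using only the Python standard library; the host name in the usage comment is a placeholder for your instance's public address.

```python
import socket

def bolt_port_reachable(host: str, port: int = 7687, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage (replace with your instance's public host name):
# bolt_port_reachable("my-neo4j.example.com")
```

Note that this only checks TCP reachability; it does not validate credentials or the Bolt handshake itself.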
The template uses constraints, some of which are only available in Neo4j/Aura Enterprise Edition installations. Although the Dataflow jobs can run against Neo4j Community Edition instances, most constraints will not be created there, so you must ensure that the source data and the job specification are prepared accordingly.
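To make the edition caveat concrete, the sketch below filters a list of constraint statements down to those a Community Edition instance can create. The Cypher strings and constraint names are invented for this example, but the edition rules reflect Neo4j's documented behaviour: property uniqueness constraints work on all editions, while node key and property existence constraints require Enterprise.

```python
# Keywords that mark constraint types only available in Enterprise Edition.
ENTERPRISE_ONLY = ("IS NODE KEY", "IS NOT NULL")

# Hypothetical constraints for the movies model used in this tutorial.
CONSTRAINTS = [
    "CREATE CONSTRAINT person_id IF NOT EXISTS FOR (p:Person) REQUIRE p.person_id IS UNIQUE",
    "CREATE CONSTRAINT movie_id IF NOT EXISTS FOR (m:Movie) REQUIRE m.movie_id IS UNIQUE",
    "CREATE CONSTRAINT person_key IF NOT EXISTS FOR (p:Person) REQUIRE (p.person_id, p.name) IS NODE KEY",
]

def supported(constraints, enterprise: bool):
    """Keep only the constraints the target edition can actually create."""
    if enterprise:
        return list(constraints)
    return [c for c in constraints if not any(kw in c for kw in ENTERPRISE_ONLY)]

print(len(supported(CONSTRAINTS, enterprise=False)))  # → 2
```

On Community Edition, only the two uniqueness constraints survive; the node key constraint is silently absent, which is exactly why the source data must be clean enough not to rely on it.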
Dataset to import
You need a Google BigQuery dataset that you want to import into Neo4j.
This tutorial uses a subset of the movies dataset. It contains the entities Person and Movie, linked together by DIRECTED and ACTED_IN relationships. In other words, each Person may have DIRECTED and/or ACTED_IN a Movie.
Both entities and relationships have extra details attached to each of them.
The data is sourced from the following files: persons.csv, movies.csv, acted_in.csv, directed.csv.
Since you are moving data from a relational database into a graph database, the data model has to change. Check out Graph data modeling guidelines to learn how to model for graph databases.
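As a toy illustration of this relational-to-graph reshaping: rows in the entity tables become nodes, while rows in the join tables become relationships. The column names below are invented for the sketch; the real CSV headers may differ.

```python
import csv
import io

# Tiny stand-ins for persons.csv, movies.csv, directed.csv, acted_in.csv.
persons_csv = "person_id,name\n1,Lana Wachowski\n2,Keanu Reeves\n"
movies_csv = "movie_id,title\n10,The Matrix\n"
directed_csv = "person_id,movie_id\n1,10\n"
acted_in_csv = "person_id,movie_id,role\n2,10,Neo\n"

def rows(text):
    return list(csv.DictReader(io.StringIO(text)))

# Entity rows map to labelled nodes carrying their columns as properties...
person_nodes = [("Person", r) for r in rows(persons_csv)]
movie_nodes = [("Movie", r) for r in rows(movies_csv)]

# ...while join-table rows map to typed relationships, with any extra
# columns (such as `role`) becoming relationship properties.
relationships = (
    [("DIRECTED", r["person_id"], r["movie_id"], {}) for r in rows(directed_csv)]
    + [("ACTED_IN", r["person_id"], r["movie_id"], {"role": r["role"]})
       for r in rows(acted_in_csv)]
)

print(len(person_nodes), len(movie_nodes), len(relationships))  # → 2 1 2
```

The Dataflow template performs the equivalent mapping at scale, driven by the job specification rather than by hand-written code.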
Google Dataflow job
The Google Dataflow job glues all the pieces together and performs the data import. You need to craft a job specification file to provide Dataflow with all the information it needs to load the data into Neo4j.
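The exact schema of the job specification file is defined by the Google-provided template, so the sketch below only conveys the general shape: a set of data sources and the node targets they map to. Every key name here is an illustrative assumption; check the template's documentation for the authoritative field names before writing your own file.

```python
import json

# Illustrative shape of a job specification, NOT the authoritative schema.
# All key names below are assumptions made for this sketch; consult the
# Dataflow template documentation for the real field names.
job_spec = {
    "sources": [
        {
            "name": "persons",
            "type": "bigquery",
            "query": "SELECT person_id, name FROM my_dataset.persons",
        },
    ],
    "targets": [
        {
            "node": {
                "name": "Person",
                "source": "persons",
                "labels": ["Person"],
                "key": "person_id",
            },
        },
    ],
}

# The file handed to Dataflow is plain JSON:
print(json.dumps(job_spec, indent=2))
```

The important idea is the separation of concerns: sources describe where rows come from, targets describe which nodes and relationships those rows become.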
All Google-related resources (Cloud project, Cloud Storage buckets, Dataflow job) should either belong to the same account or to one that the Dataflow job has permission to access.