Prerequisites

Neo4j instance

You need a running Neo4j instance into which the data can flow.

If you don’t have an instance yet, you have two options:

  • sign up for a free AuraDB instance

  • install and self-host Neo4j in a publicly accessible location (see Neo4j → Installation), with port 7687 open (Bolt protocol); either way, you can verify connectivity as sketched below
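
Before moving on, it can be useful to confirm that the instance is reachable. The following is a minimal sketch using the official Neo4j Python driver; the URI and credentials are placeholders for your own values.

```python
from neo4j import GraphDatabase

# Placeholder URI and credentials: an AuraDB instance uses a
# neo4j+s://<dbid>.databases.neo4j.io URI, a self-hosted one typically
# neo4j://<host>:7687 or bolt://<host>:7687.
URI = "neo4j://<host>:7687"
AUTH = ("neo4j", "<password>")

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()  # raises an exception if the instance is unreachable
    print("Neo4j instance is reachable")
```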

The template uses constraints, some of which are only available in Neo4j/Aura Enterprise Edition installations. The Dataflow jobs can run against Neo4j Community Edition instances, but most constraints will not be created, so you have to ensure that the source data and the job specification are prepared accordingly.
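
To make the edition difference concrete, the sketch below contrasts the two kinds of constraints involved: property uniqueness constraints work on every edition, while node key constraints require Enterprise Edition (including Aura). The constraint, label, and property names are illustrative assumptions for this tutorial's dataset, not values taken from the template.

```python
from neo4j import GraphDatabase

with GraphDatabase.driver("neo4j://<host>:7687", auth=("neo4j", "<password>")) as driver:
    # Property uniqueness constraints are available on every edition.
    driver.execute_query(
        "CREATE CONSTRAINT person_id_unique IF NOT EXISTS "
        "FOR (p:Person) REQUIRE p.person_id IS UNIQUE"
    )
    # Node key constraints require Enterprise Edition (or Aura);
    # on Community Edition this statement fails with an error.
    driver.execute_query(
        "CREATE CONSTRAINT movie_node_key IF NOT EXISTS "
        "FOR (m:Movie) REQUIRE (m.movie_id, m.title) IS NODE KEY"
    )
```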

Dataset to import

You need a Google BigQuery dataset that you want to import into Neo4j.

This tutorial uses a subset of the movies dataset. It contains the entities Person and Movie, linked together by DIRECTED and ACTED_IN relationships. In other words, each Person may have DIRECTED and/or ACTED_IN a Movie. Both entities and relationships carry additional properties. The data is sourced from the following files: persons.csv, movies.csv, acted_in.csv, directed.csv.
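
If you still need to get the CSV files into BigQuery, a load along the following lines works. This is only a sketch using the BigQuery Python client; the project, dataset, table, and bucket names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="<your-project>")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://<your-bucket>/persons.csv",  # file staged in Cloud Storage
    "<your-project>.movies.persons",   # destination table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
print(client.get_table("<your-project>.movies.persons").num_rows, "rows loaded")
```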

[Image: movies graph data model]
Since you are moving data from a relational database into a graph database, the data model has to change; the sketch below shows what that shift looks like for a single row. Check out the Graph data modeling guidelines to learn how to model for graph databases.
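
As an illustration, the following sketch takes one hypothetical row of acted_in.csv and turns it into an ACTED_IN relationship between a Person and a Movie node. The property names are assumptions for this example, not the template's mapping.

```python
from neo4j import GraphDatabase

# One hypothetical row of acted_in.csv: a foreign-key pair plus a property.
row = {"person_id": 101, "movie_id": 42, "role": "Neo"}

query = """
MERGE (p:Person {person_id: $person_id})
MERGE (m:Movie  {movie_id: $movie_id})
MERGE (p)-[r:ACTED_IN]->(m)
SET r.role = $role
"""

with GraphDatabase.driver("neo4j://<host>:7687", auth=("neo4j", "<password>")) as driver:
    # The join-table row becomes a relationship between two nodes.
    driver.execute_query(query, row)
```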

Google Dataflow job

The Google Dataflow job glues all the pieces together and performs the data import. You need to craft a job specification file to provide Dataflow with all the information it needs to load the data into Neo4j.
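
The specification file is typically read by the template from Cloud Storage, so once you have written it you need to upload it to a bucket the job can access. A possible sketch, with placeholder project, bucket, and object names:

```python
from google.cloud import storage

client = storage.Client(project="<your-project>")
bucket = client.bucket("<your-bucket>")
blob = bucket.blob("neo4j/job-spec.json")
blob.upload_from_filename("job-spec.json")  # local copy of the specification file
print(f"gs://{bucket.name}/{blob.name}")    # URI to hand to the Dataflow job
```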

[Image: Google Dataflow pipeline]
All Google-related resources (Cloud project, Cloud Storage buckets, Dataflow job) should either belong to the same account, or to accounts that the Dataflow job has permission to access.