Graph model refactoring

Introduction

Building on the Cypher® basic concepts, this guide provides a working example of changing a graph model. Upon finishing this guide, you should be able to evolve your graph model based on changing requirements.

Airports dataset

This guide uses an airports dataset that contains connections between US airports in January, 2008. The data is presented in a CSV file. Below you can see the graph model of the database:

Before importing any data, you should create a unique constraint on the Airport label and code property to ensure that you don’t accidentally import duplicate airports. The following query creates the constraint:

CREATE CONSTRAINT airport_id
FOR (airport:Airport) REQUIRE airport.code IS UNIQUE
Table 1. Results

0 rows available after 86 ms, consumed after another 0 ms. Added 1 constraints

And the following query loads the data from a CSV file using the LOAD CSV tool:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_1k.csv" AS row
MERGE (origin:Airport {code: row.Origin})
MERGE (destination:Airport {code: row.Dest})
MERGE (origin)-[connection:CONNECTED_TO {
  airline: row.UniqueCarrier,
  flightNumber: row.FlightNum,
  date: date({year: toInteger(row.Year), month: toInteger(row.Month), day: toInteger(row.DayofMonth)}),
  cancelled: row.Cancelled,
  diverted: row.Diverted}]->(destination)
ON CREATE SET connection.departure = localtime(apoc.text.lpad(row.CRSDepTime, 4, "0")),
              connection.arrival = localtime(apoc.text.lpad(row.CRSArrTime, 4, "0"))

This query:

  • Creates a node with an Airport label with a code property that has a value from the Origin column in the CSV file.

  • Creates a node with an Airport label with a code property that has a value from the Dest column in the CSV file.

  • Creates a relationship of type CONNECTED_TO with several properties based on columns in the CSV file.

If you run this query, you will see the following output:

Table 2. Results

Added 62 labels, created 62 nodes, set 7062 properties, created 1000 relationships, completed after 376 ms.

This is a starting model, but there are some improvements that you can make.

Convert property to boolean

The diverted and cancelled properties on the CONNECTED_TO relationships contain string values of 1 and 0. Since these values are representing booleans, you can use the apoc.refactor.normalizeAsBoolean procedure to convert the values from strings to booleans.

The following query does the conversion for the diverted property:

MATCH (:Airport)-[connectedTo:CONNECTED_TO]->(:Airport)
CALL apoc.refactor.normalizeAsBoolean(connectedTo, "diverted", ["1"], ["0"])
RETURN count(*)
Table 3. Results
count(*)

1000

And the following query does the conversion for the cancelled property:

MATCH (origin:Airport)-[connectedTo:CONNECTED_TO]->(departure)
CALL apoc.refactor.normalizeAsBoolean(connectedTo, "cancelled", ["1"], ["0"])
RETURN count(*)
Table 4. Results
count(*)

1000

If you have a lot of relationships to update, you may get an OutOfMemory exception when trying to refactor them all in one transaction. You can therefore process the relationships in batches using the apoc.periodic.iterate procedure. The following query does this for the cancelled and reverted properties in the same query:

UNWIND ["cancelled", "reverted"] AS propertyToDelete
CALL apoc.periodic.iterate(
  "MATCH (:Airport)-[connectedTo:CONNECTED_TO]->(:Airport) RETURN connectedTo",
  "CALL apoc.refactor.normalizeAsBoolean(connectedTo, $propertyToDelete, ['1'], ['0'])
   RETURN count(*)",
  {params: {propertyToDelete: propertyToDelete}, batchSize: 100})
YIELD batches
RETURN propertyToDelete, batches

For more details about the UNWIND clause, see the Cypher manual → UNWIND page.

The apoc.periodic.iterate procedure in the previous query takes in three parameters:

  • An outer Cypher query that finds and returns a stream of CONNECTED_TO relationships to be processed.

  • An inner Cypher query that processes those CONNECTED_TO relationships, converting to boolean any values for the specified property on those relationships. It does this by using the apoc.refactor.normalizeAsBoolean procedure, which itself takes in several parameters:

    • the entity on which the property exists

    • the name of the property to normalize

    • a list of values that should be considered true

    • a list of values that should be considered false

  • Configuration values for the procedure, including:

    • params - parameters passed into those Cypher queries.

    • batchSize- controls the number of inner statements that are run within a single transaction.

After running the query, you will see the following output:

Table 5. Results
propertyToDelete batches

"cancelled"

10

"reverted"

10

Once you have done this, you can write the following query to return all cancelled connections:

MATCH (origin:Airport)-[connectedTo:CONNECTED_TO]->(destination)
WHERE connectedTo.cancelled
RETURN origin.code AS origin,
       destination.code AS destination,
       connectedTo.date AS date,
       connectedTo.departure AS departure,
       connectedTo.arrival AS arrival
Table 6. Results
origin destination date departure arrival

"LAS"

"OAK"

2008-01-03

07:00

08:30

"LAX"

"SFO"

2008-01-03

09:05

10:25

"LAX"

"OAK"

2008-01-03

11:00

12:15

"LAX"

"SJC"

2008-01-03

19:30

20:35

"LAX"

"SFO"

2008-01-03

16:20

17:40

"MDW"

"STL"

2008-01-03

11:10

12:15

"MDW"

"BDL"

2008-01-03

08:45

11:40

"MDW"

"DTW"

2008-01-03

06:00

08:05

"MDW"

"STL"

2008-01-03

14:45

15:50

"MDW"

"BNA"

2008-01-03

19:25

20:45

"OAK"

"BUR"

2008-01-03

13:10

14:15

"OAK"

"BUR"

2008-01-03

17:05

18:10

Create a node from a relationship

With the existing data model, writing a query that finds a specific flight can become a complex task. That is because here the flights are represented as relationships. However, you can change the model by creating a Flight node from the properties stored on the CONNECTED_TO relationship:

The following query does this refactoring:

CALL apoc.periodic.iterate(
  "MATCH (origin:Airport)-[connected:CONNECTED_TO]->(destination:Airport) RETURN origin, connected, destination",
  "CREATE (flight:Flight {
     date: connected.date,
     airline: connected.airline,
     number: connected.flightNumber,
     departure: connected.departure,
     arrival: connected.arrival,
     cancelled: connected.cancelled,
     diverted: connected.diverted
   })
   MERGE (origin)<-[:ORIGIN]-(flight)
   MERGE (flight)-[:DESTINATION]->(destination)
   DELETE connected",
  {batchSize: 100})

This query uses the apoc.periodic.iterate procedure so that you can do the refactoring in batches rather than within a single transaction. The procedure takes in three parameters:

  • An outer Cypher query that finds and returns a stream of CONNECTED_TO relationships, and origin and destination airports that need to be processed.

  • An inner Cypher query that processes those entities, creating a node with the label Flight and creating relationships from that node to the origin and destination airports.

  • batchSize configuration, which sets to 100 the number of inner statements that are run within a single transaction.

If you execute the query, you will see the following output:

Table 7. Results
batches total timeTaken committedOperations failedOperations failedBatches retries errorMessages batch operations wasTerminated

10

1000

0

1000

0

0

0

{}

{total: 10, committed: 10, failed: 0, errors: {}}

{total: 1000, committed: 1000, failed: 0, errors: {}}

FALSE

You can also do this refactoring using the apoc.refactor.extractNode procedure.

CALL apoc.periodic.iterate(
  "MATCH (origin:Airport)-[connected:CONNECTED_TO]->(destination:Airport)
   RETURN origin, connected, destination",
  "CALL apoc.refactor.extractNode([connected], ['Flight'], 'DESTINATION', 'ORIGIN')
   YIELD input, output, error
   RETURN input, output, error",
  {batchSize: 100});

This does the same as the previous query, but the outer Cypher query uses the apoc.refactor.extractNode procedure to create the Flight node and create relationships to origin and destination airports. If we run this query we’ll see the following output:

Table 8. Results
batches total timeTaken committedOperations failedOperations failedBatches retries errorMessages batch operations wasTerminated

10

1000

0

1000

0

0

0

{}

{total: 10, committed: 10, failed: 0, errors: {}}

{total: 1000, committed: 1000, failed: 0, errors: {}}

FALSE

Create a node from a property

At the moment the airline names are stored in the airline property on the Flight nodes. This means that if you want to return a stream of all airlines, you have to scan through every flight and check the airline property on each of those flights.

You can make this task simpler and more efficient by creating a node with an Airline label for each airline:

First, create a constraint on the Airline label and a name property to avoid duplicated airline nodes:

CREATE CONSTRAINT airline_id
FOR (airline:Airline) REQUIRE airline.name IS UNIQUE
Table 9. Results

0 rows available after 107 ms, consumed after another 0 ms. Added 1 constraints

Now you can run the following query to do the refactoring:

CALL apoc.periodic.iterate(
   'MATCH (flight:Flight) RETURN flight',
   'MERGE (airline:Airline {name:flight.airline})
    MERGE (flight)-[:AIRLINE]->(airline)
    REMOVE flight.airline',
   {batchSize:10000, iterateList:true, parallel:false}
)

Again you are using the apoc.periodic.iterate procedure with the following parameters:

  • An outer Cypher statement that returns a stream of Flight nodes to be processed.

  • An inner Cypher statement that processes the Flight nodes and creates Airline nodes based on the airline property. It also creates an AIRLINE relationship from the Flight to the Airline nodes. After that, you can remove the airline property from the Flight node.

If you run this query, the output will be the following:

Table 10. Results
batches total timeTaken committedOperations failedOperations failedBatches retries errorMessages batch operations wasTerminated

1

1000

0

1000

0

0

0

{}

{total: 1, committed: 1, failed: 0, errors: {}}

{total: 1000, committed: 1000, failed: 0, errors: {}}

FALSE

You can then write the following query to find the airlines and number of flights involving each airline:

MATCH (airline:Airline)<-[:AIRLINE]-(:Flight)
RETURN airline.name AS airline, count(*) AS numberOfFlights

This does the same as the previous query, but the outer Cypher query uses the apoc.refactor.extractNode procedure to create the Flight node and create relationships to origin and destination airports. If you run this query, you will get the following output:

Table 11. Results
airline numberOfFlights

"WN"

1000

Resources

This guide has shown how to refactor a graph model, with help from procedures in the APOC Library. Below are some resources for learning more about refactoring in Neo4j: