I spend a lot of my free time answering Stack Overflow questions about Cypher, Neo4j’s graph query language.
Cypher is an intuitive, ASCII-art-driven language that allows you to query your property graph by specifying patterns of nodes and relationships. While Cypher is the reason many developers choose Neo4j, it is not immune to common gotchas.
In this post I want to go over a few recurring questions and mistakes I see on Stack Overflow or at Neo4j trainings. All of my examples will use the movie graph that ships with the Neo4j Browser, so you can follow along by executing :play movies
in the browser query bar.
LIMIT x
vs. collect()[..x]
In many cases it’s clear when to use LIMIT
and when to use collect()
.
LIMIT
An obvious use of LIMIT
might be: “Find the top five actors ordered by how many movies they’ve acted in.”
MATCH (actor:Person) RETURN actor.name, size((actor)-[:ACTED_IN]->(:Movie)) AS movies ORDER BY movies DESC LIMIT 5;
actor.name | movies ---------------+------- Tom Hanks | 12 Keanu Reeves | 7 Hugo Weaving | 5 Jack Nicholson | 5 Meg Ryan | 5
collect()
An obvious use of collect()
might be: Find the Matrix movies and their directors as a collection.
MATCH (director:Person)-[:DIRECTED]->(movie:Movie) WHERE movie.title STARTS WITH "The Matrix" RETURN movie.title, collect(director.name) AS directors;
movie.title | directors -----------------------+------------------------------------- The Matrix Revolutions | ['Andy Wachowski', 'Lana Wachowski'] The Matrix | ['Lana Wachowski', 'Andy Wachowski'] The Matrix Reloaded | ['Lana Wachowski', 'Andy Wachowski']
LIMIT
and collect()
But once you need the top x
records of an entity grouped by some aggregate it gets a little trickier. Below are some examples of people struggling with this:
Let’s say we want to find the two oldest people in the graph and the three most recent movies they’ve acted in. This requires a combination of LIMIT
and collect()
along with some ORDER BY
s.
// Get the two oldest people. MATCH (actor:Person) WITH actor ORDER BY actor.born LIMIT 2 // Get their three most recent movies. MATCH (actor)-[:ACTED_IN]->(movie:Movie) WITH actor, movie ORDER BY movie.released DESC RETURN actor.name, 2016 - actor.born AS age, collect(movie.title)[..3] AS movies;
actor.name | age | movies --------------+-----+--------------------------------------------------- Max von Sydow | 87 | ['Snow Falling on Cedars', 'What Dreams May Come'] Gene Hackman | 86 | ['The Replacements', 'The Birdcage', 'Unforgiven']
The general pattern is to use ORDER BY
to order your data as needed and then use collect()[..x]
for collecting the top x
records for each row.
I know what you’re thinking: What if I want to return a Cartesian product of these actors and movies? That’s where UNWIND
comes in and it’s also useful if you want to continue with the query and write more MATCH
clauses with the actors and movies you’ve already matched on.
// Get the two oldest people. MATCH (actor:Person) WITH actor ORDER BY actor.born LIMIT 2 // Get their three most recent movies. MATCH (actor)-[:ACTED_IN]->(movie:Movie) WITH actor, movie ORDER BY movie.released DESC WITH actor, collect(movie)[..3] AS m // Unwind the collection into rows. UNWIND m AS movie RETURN actor.name, 2016 - actor.born AS age, movie.title;
actor.name | age | movie.title --------------+-----+----------------------- Gene Hackman | 86 | The Replacements Gene Hackman | 86 | The Birdcage Gene Hackman | 86 | Unforgiven Max von Sydow | 87 | Snow Falling on Cedars Max von Sydow | 87 | What Dreams May Come
MERGE
This is easily the most misunderstood Cypher keyword. It works as documented yet the actual behavior of MERGE
often differs from users’ expectations. A few examples of users struggling with this are below.
The following is an exercise I do at every Fundamentals of Neo4j training to clear up the confusion. Assume we have a uniqueness constraint on :Person
nodes by the name
property.
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;
We know there’s already a :Person
node in the graph with the property name:"Tom Hanks"
. Let’s say we want to find or create this node if it doesn’t exist while also adding a new property, oscar_winner
. At this point I show the following query and ask what will happen:
MERGE (p:Person {name:"Tom Hanks", oscar_winner: true}) RETURN p.name, p.oscar_winner;
Almost everyone says: It will find the :Person
node where name:"Tom Hanks"
and then add the oscar_winner:true
property. Wrong!
Node 23478 already exists with label Person and property "name"=[Tom Hanks]
MERGE
matches on the entire pattern you specify within a single clause. When Neo4j determines that a node with the :Person
label and properties name:"Tom Hanks"
and oscar_winner:true
doesn’t exist, Neo4j attempts to create the node specified. This is when the uniqueness constraint is violated, because a :Person
node with property name:"Tom Hanks"
already exists.
The solution is to MERGE
on the unique property and then use SET
to update additional properties.
MERGE (p:Person {name:"Tom Hanks"}) SET p.oscar_winner = true RETURN p.name, p.oscar_winner;
p.name | p.oscar_winner ----------+--------------- Tom Hanks | True
There are also ON MATCH SET
and ON CREATE SET
for additional control over what happens depending on if the node was found or created, respectively. Read more about those here.
This concept also applies to relationships in MERGE
clauses. What do you think will happen with the following? Note that a :Person
node with name:"Tom Hanks"
already exists whereas a :Person
node with name:"Nicole White"
does not.
MERGE (tom:Person {name:"Tom Hanks"})-[:KNOWS]-(nicole:Person {name:"Nicole White"}) RETURN tom.name, nicole.name;
At this point everyone’s learned from their previous mistake and correctly guesses that this will throw an error for similar reasons.
Node 23478 already exists with label Person and property "name"=[Tom Hanks]
Once Neo4j has determined that the entire pattern specified in the MERGE
clause doesn’t exist, it tries to create everything in the pattern, including:
- a
:Person
node withname:"Tom Hanks"
- a
:Person
node withname:"Nicole White"
- a
:KNOWS
relationship between them
Numbers 2 and 3 would be okay, but the query throws an error because number 1 violates the uniqueness constraint. The solution is to follow the best practice of using MERGE
on the individual parts of a pattern that may or may not exist:
MERGE (tom:Person {name:"Tom Hanks"}) MERGE (nicole:Person {name:"Nicole White"}) MERGE (tom)-[:KNOWS]-(nicole) RETURN tom.name, nicole.name;
tom.name | nicole.name ----------+------------- Tom Hanks | Nicole White
WITH
The WITH
clause trips people up because it both changes the scope of variables and, if any aggregates are present, automatically groups by the other variables in the clause. Below are misunderstandings of each of these concepts respectively.
Unbound Variables
Leaving a variable out of the WITH
clause causes it to become unbound. It’s easy to catch this error if you attempt to use such a variable in a subsequent WITH
or WHERE
clause.
For example, let’s say we want to find movies where all of the movie’s directors also wrote the movie. The following attempt will expose an obvious unbound variable problem:
MATCH (p:Person)-[:DIRECTED]->(m:Movie) WITH m, collect(p) AS directors WHERE (p)-[:WROTE]->(m) RETURN m.title, [x IN directors | x.name];
p not defined (line 3, column 8 (offset: 79)) "WHERE (p)-[:WROTE]->(m)" ^
But if you try to use that variable in a subsequent MATCH
clause, no error will be thrown because the MATCH
clause thinks you’re binding to a new variable.
MATCH (p:Person)-[:DIRECTED]->(m:Movie) WITH m, collect(p) AS directors MATCH (p)-[:WROTE]->(m) RETURN m.title, [x IN directors | x.name];
m.title | [x IN directors | x.name] -----------------------+------------------------------------- A Few Good Men | ['Rob Reiner'] Something's Gotta Give | ['Nancy Meyers'] Speed Racer | ['Andy Wachowski', 'Lana Wachowski'] Speed Racer | ['Andy Wachowski', 'Lana Wachowski'] Jerry Maguire | ['Cameron Crowe'] Top Gun | ['Tony Scott'] V for Vendetta | ['James Marshall'] V for Vendetta | ['James Marshall'] When Harry Met Sally | ['Rob Reiner']
Some might look at the above query and say that it’s matching on movies where all its directors are also among its writers, but that is not correct. This query is actually showing you the directors of all movies that have a writer. The variable p
was unbound at line 2 of this query and the MATCH
clause at line 3 is binding nodes of any label that have an outgoing :WROTE
relationship to a new variable p
. The accurate version of this query would be:
MATCH (p:Person)-[:DIRECTED]->(m:Movie) WHERE (p)-[:WROTE]->(m) WITH m, collect(p) AS directors RETURN m.title, [x IN directors | x.name];
m.title | [x IN directors | x.name] -----------------------+------------------------------------- Something's Gotta Give | ['Nancy Meyers'] Speed Racer | ['Andy Wachowski', 'Lana Wachowski'] Jerry Maguire | ['Cameron Crowe']
Automatic Grouping
In addition to carrying over variables, the WITH
clause also automatically groups by variables if any aggregates are present. A common use case for WITH
is to determine an aggregate that you want to filter on. For example, we can find all movies where the average age of its actors is greater than 70.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WITH m, avg(2016 - p.born) AS avg_age WHERE avg_age > 70 RETURN m.title, avg_age ORDER BY avg_age DESC;
m.title | avg_age --------------------------------+--------- Unforgiven | 86.00 One Flew Over the Cuckoo's Nest | 76.50 The Birdcage | 70.33
At the second line, WITH m, avg(...)
, we included the variable m
which is bound to movies and an aggregate function avg()
such that the WITH
clause automatically grouped the data by m
. If at this point you asked someone to also include the names of the actors in these movies, they might make the following mistake:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WITH m, p, avg(2016 - p.born) AS avg_age WHERE avg_age > 70 RETURN m.title, avg_age, collect(p.name) AS actors ORDER BY avg_age DESC;
m.title | avg_age | actors --------------------------------+---------+----------------------------------------------------- Snow Falling on Cedars | 87.0 | ['Max von Sydow'] What Dreams May Come | 87.0 | ['Max von Sydow'] The Birdcage | 86.0 | ['Gene Hackman'] The Replacements | 86.0 | ['Gene Hackman'] Unforgiven | 86.0 | ['Gene Hackman', 'Richard Harris', 'Clint Eastwood'] Top Gun | 83.0 | ['Tom Skerritt'] Hoffa | 79.0 | ['Jack Nicholson'] A Few Good Men | 79.0 | ['Jack Nicholson'] As Good as It Gets | 79.0 | ['Jack Nicholson'] Something's Gotta Give | 79.0 | ['Jack Nicholson'] One Flew Over the Cuckoo's Nest | 79.0 | ['Jack Nicholson'] Frost/Nixon | 78.0 | ['Frank Langella'] The Da Vinci Code | 77.0 | ['Ian McKellen'] V for Vendetta | 76.0 | ['John Hurt'] The Green Mile | 76.0 | ['James Cromwell'] The Devil's Advocate | 76.0 | ['Al Pacino'] Snow Falling on Cedars | 76.0 | ['James Cromwell'] RescueDawn | 74.0 | ['Marshall Bell'] Stand By Me | 74.0 | ['Marshall Bell'] What Dreams May Come | 74.0 | ['Werner Herzog'] Hoffa | 73.0 | ['J.T. Walsh'] A Few Good Men | 73.0 | ['J.T. Walsh'] One Flew Over the Cuckoo's Nest | 72.0 | ['Danny DeVito'] Hoffa | 72.0 | ['Danny DeVito']
By including p
in the WITH
clause, the data has also been grouped by p
and the avg_age
is grouped by both movie and person, leading to nonsensical results: we’re calculating the average age of one person. The correct way to do this is to collect the actors by movie in the WITH
clause:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WITH m, avg(2016 - p.born) AS avg_age, collect(p.name) AS actors WHERE avg_age > 70 RETURN m.title, avg_age, actors ORDER BY avg_age DESC;
m.title | avg_age | actors --------------------------------+----------+----------------------------------------------------- Unforgiven | 86.00 | ['Gene Hackman', 'Clint Eastwood', 'Richard Harris'] One Flew Over the Cuckoo's Nest | 76.50 | ['Jack Nicholson', 'Danny DeVito'] The Birdcage | 70.33 | ['Gene Hackman', 'Nathan Lane', 'Robin Williams']
Next Steps
The best way to improve your Cypher is, of course, to write more Cypher!
It’s also very helpful to read other peoples’ queries to pick up new tips and tricks; I probably learn a new Cypher trick once a week by reading Michael Hunger‘s Cypher queries. You can get more practice by signing up for the online training, reading more on our Cypher developer page, and by keeping the Cypher refcard handy at all times.
Want to sharpen your Cypher even more? Click below to get your free copy of the Learning Neo4j ebook and level up your graph database skills in no time.