Even in Neo4j with its high performance graph traversals, there are always queries that could and should be run faster – especially if your data is highly connected and if global pattern matches make even a single query account for many millions or billions of paths.
For this article, we’re using the larger movie dataset, which is also listed on the example datasets page.
The domain model that interests us here, is pretty straightforward:
(:Person {name}) -[:ACTS_IN|:DIRECTED]-> (:Movie {title}) (:Movie {title}) -[:GENRE]-> (:Genre {name})
HARDWARE
I presume you use a sensible machine, with a SSD (or enough IOPS) and decent amount of RAM. For a highly concurrent load, there should be also enough CPU cores to handle it.
Other questions to consider: Did you monitor io-waits, top, for CPU and memory usage? Any bottlenecks that turn up?
If so, you should address those issues first.
On Linux, configure your disk scheduler to noop
or deadline
and mount the database volume with noatime
. See this blog post for more information.
CONFIG
For best results, use the latest stable version of Neo4j (i.e., Neo4j Enterprise 2.2.5). There is always an Enterprise trial version available to give you a high-watermark baseline, so compare it to Neo4j Community on your machine as needed.
Set dbms.pagecache.memory=4G
in conf/neo4j.properties
or the size of the store-files (nodes, relationships, properties, string-properties) combined.
ls -lt data/graph.db/neostore.*.db 3802094 16 Jul 14:31 data/graph.db/neostore.propertystore.db 456960 16 Jul 14:31 data/graph.db/neostore.relationshipstore.db 442260 16 Jul 14:31 data/graph.db/neostore.nodestore.db 8192 16 Jul 14:31 data/graph.db/neostore.schemastore.db 8190 16 Jul 14:31 data/graph.db/neostore.labeltokenstore.db 8190 16 Jul 14:31 data/graph.db/neostore.relationshiptypestore.db 8175 16 Jul 14:31 data/graph.db/neostore.relationshipgroupstore.db
Set heap from 8 to 16G, depending on the RAM size of the machine. Also configure the young generation in conf/neo4j-wrapper.conf
.
wrapper.java.initmemory=8000 wrapper.java.maxmemory=8000 wrapper.java.additional=-Xmn2G
That’s mostly it, config-wise. If you are concurrency heavy, you could also set the webserver threads in conf/neo4j-server.properties
.
# cpu * 2 org.neo4j.server.webserver.maxthreads=24
QUERY TUNING
If these previous factors are taken care of, it’s now time to dig into query tuning. A lot of query tuning is simply prefixing your statements with EXPLAIN
to see what Cypher would do and using PROFILE
to retrieve the real execution data as well:
For example, let’s look at this query, which has the PROFILE
prefix:
PROFILE MATCH(g:Genre {name:"Action"})<-[:GENRE]-(m:Movie)<-[:ACTS_IN]-(a) WHERE a.name =~ "A.*" RETURN distinct a.name;
The result of this query is shown below in the visual query plan tool available in the Neo4j browser.
While the visual query plan is nice in the Neo4j browser, the one in Neo4j-shell is easier to compare and it also has more raw numbers.
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 72310 |
|||||
Distinct |
2048 |
860 |
2636 |
a.name |
a.name |
Filter(0) |
2155 |
1318 |
41532 |
anon[32], anon[52], a, g, m |
a.name ~= /{ AUTOSTRING1}/ |
Expand(All)(0) |
2874 |
20766 |
23224 |
anon[32], anon[52], a, g, m |
(m)←[:ACTS_IN]-(a) |
Filter(1) |
390 |
2458 |
2458 |
anon[32], g, m |
m:Movie |
Expand(All)(1) |
390 |
2458 |
2459 |
anon[32], g, m |
(g)←[:GENRE]-(m) |
NodeUniqueIndexSeek |
1 |
1 |
1 |
g |
:Genre(name) |
Query Tuning Tip #1: Use Indexes and Constraints for Nodes You Look Up by Properties
Check – with either schema
or :schema
– that there is an index in place for non-unique values and a constraint for unique values, and make sure – with EXPLAIN
– that the index is used in your query.
CREATE INDEX ON :Movie(title); CREATE INDEX ON :Person(name); CREATE CONSTRAINT ON (g:Genre) ASSERT g.name IS UNIQUE;
Even for range queries (pre-Neo4j 2.3), it might be better to turn them into an IN
query to leverage an index.
// if :Movie(released) is indexed, this query for the nineties will *not use* an index: MATCH (m:Movie) WHERE m.released >= 1990 and m.released < 2000 RETURN count(*); CREATE INDEX ON :Movie(released); // but this will MATCH (m:Movie) WHERE m.released IN range(1990,1999) RETURN count(*); // same for OR queries MATCH (m:Movie) WHERE m.released = 1990 OR m.released = 1991 OR ...
Query Tuning Tip #2: Patterns with Bound Nodes are Optimized
If you have a pattern (node)-[:REL]→(node)
where both nodes on either side are already bound, Cypher will optimize the match by taking the node-degree (number of relationships) into account when checking for the connection, starting on the smaller side and also caching internally.
So, for example, (actor)-[:ACTS_IN]->(movie)
– if both actor
and movie
are known – turns into that described Expand(Into)
operation.
If one side is not known, then it is a normal Expand(All)
operation.
Query Tuning Tip #3: Enforce Index Lookups for Both Sides of a Path
Make sure that if nodes on both sides of a longer path can be found in an index, and are only a few hits of a larger total count, to add USING INDEX
for both sides. In many cases, that makes a big difference. It doesn't help if the path explodes in the middle and a simple left-to-right traversal with property checks would touch fewer paths.
PROFILE MATCH (a:Person {name:"Tom Hanks"})-[:ACTS_IN]->()<-[:ACTS_IN]-(b:Person {name:"Meg Ryan"}) RETURN count(*);
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 765 |
|||||
EagerAggregation |
0 |
1 |
0 |
count(*) |
|
Filter |
0 |
3 |
531 |
anon[36], anon[49], anon[51], a, b |
a:Person AND a.name == { AUTOSTRING0}) AND NOT(anon[36] == anon[51] |
Expand(All)(0) |
3 |
177 |
204 |
anon[36], anon[49], anon[51], a, b |
()←[:ACTS_IN]-(a) |
Expand(All)(1) |
2 |
27 |
28 |
anon[49], anon[51], b |
(b)-[:ACTS_IN]→() |
NodeIndexSeek |
1 |
1 |
2 |
b |
:Person(name) |
If we add the second index-hint, we get 10x fewer database hits.
PROFILE MATCH (a:Person {name:"Tom Hanks"})-[:ACTS_IN]->()<-[:ACTS_IN]-(b:Person {name:"Meg Ryan"}) USING INDEX a:Person(name) USING INDEX b:Person(name) RETURN count(*);
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 68 |
|||||
EagerAggregation |
0 |
1 |
0 |
count(*) |
|
Filter |
0 |
3 |
0 |
anon[36], anon[49], anon[51], a, b |
NOT(anon[36] == anon[51]) |
NodeHashJoin |
0 |
3 |
0 |
anon[36], anon[49], anon[51], a, b |
anon[49] |
Expand(All)(0) |
2 |
27 |
28 |
anon[49], anon[51], b |
(b)-[:ACTS_IN]→() |
NodeIndexSeek(0) |
1 |
1 |
2 |
b |
:Person(name) |
Expand(All)(1) |
2 |
35 |
36 |
anon[36], anon[49], a |
(a)-[:ACTS_IN]→() |
NodeIndexSeek(1) |
1 |
1 |
2 |
a |
:Person(name) |
Query Tuning Tip #4: Defer Property Access
Make sure to access properties only as the last operation – if possible – and on the smallest set of nodes and relationships. Massive property loading is more expensive than following relationships.
For example, this query:
PROFILE MATCH (p:Person)-[:ACTS_IN]->(m:Movie) RETURN p.name, count(*) as c ORDER BY c DESC limit 10;
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 404525 |
|||||
Projection(0) |
308 |
10 |
0 |
anon[48], anon[54], c, p.name |
anon[48]; anon[54] |
Top |
308 |
10 |
0 |
anon[48], anon[54] |
{ AUTOINT0}; |
EagerAggregation |
308 |
44689 |
0 |
anon[48], anon[54] |
anon[48] |
Projection(1) |
94700 |
94700 |
189400 |
anon[48], anon[17], m, p |
p.name |
Filter |
94700 |
94700 |
94700 |
anon[17], m, p |
p:Person |
Expand(All) |
94700 |
94700 |
107562 |
anon[17], m, p |
(m)←[:ACTS_IN]-(p) |
NodeByLabelScan |
12862 |
12862 |
12863 |
m |
:Movie |
This query shown above accesses p.name
for all people, totaling 400,000 database hits. Instead, you should aggregate on the node first, then order and paginate, and only in the very end should you access and return the property.
PROFILE MATCH (p:Person)-[:ACTS_IN]->(m:Movie) WITH p, count(*) as c ORDER BY c DESC LIMIT 10 RETURN p.name, c;
This second query above only accesses p.name
for the top ten actors, and before that, it groups them directly by the nodes, saving us about 200,000 database hits.
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 215145 |
|||||
Projection |
308 |
10 |
20 |
c, p, p.name |
p.name; c |
Top |
308 |
10 |
0 |
c, p |
{ AUTOINT0}; c |
EagerAggregation |
308 |
44943 |
0 |
c, p |
p |
Filter |
94700 |
94700 |
94700 |
anon[17], m, p |
p:Person |
Expand(All) |
94700 |
94700 |
107562 |
anon[17], m, p |
(m)←[:ACTS_IN]-(p) |
NodeByLabelScan |
12862 |
12862 |
12863 |
m |
:Movie |
But that query could even be optimized more, with....
Query Tuning Tip #5: Fast Relationship Counting
There is an optimal implementation for single path-expressions, by directly reading the degree of a node. Personally, I always prefer this method over optional matches, exists or general where conditions: size((s)-[:REL]->()) ←
uses get-degree which is a constant time operation (similarly without rel-type or direction).
PROFILE MATCH (n:Person) WHERE EXISTS((n)-[:DIRECTED]->()) RETURN count(*);
Here the plan doesn’t count the nested db-hits in the expression, which it should. That’s why I included the runtime:
1 row 197 ms
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 106396 |
|||||
EagerAggregation |
194 |
1 |
56216 |
count(*) |
|
Filter |
37634 |
6037 |
0 |
n |
NestedPipeExpression(ExpandAllPipe(….)) |
NodeByLabelScan |
50179 |
50179 |
50180 |
n |
versus
PROFILE MATCH (n:Person) WHERE size((n)-[:DIRECTED]->()) <> 0 RETURN count(*);
1 row 90 ms
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 150538 |
|||||
EagerAggregation |
213 |
1 |
0 |
count(*) |
|
Filter |
45161 |
6037 |
100358 |
n |
NOT(GetDegree(n,Some(DIRECTED),OUTGOING) == { AUTOINT0}) |
NodeByLabelScan |
50179 |
50179 |
50180 |
n |
:Person |
You can also use that technique nicely for overview pages or inline aggregations:
PROFILE MATCH (m:Movie) RETURN m.title, size((m)<-[:ACTS_IN]-()) as actors, size((m)<-[:DIRECTED]-()) as directors LIMIT 10;
+-------------------------------------------------------------+ | m.title | actors | directors | +-------------------------------------------------------------+ | "Indiana Jones and the Temple of Doom" | 13 | 1 | | "King Kong" | 1 | 1 | | "Stolen Kisses" | 21 | 1 | | "One Flew Over The Cuckoo's Nest" | 24 | 1 | | "Ziemia obiecana" | 17 | 1 | | "Scoop" | 21 | 1 | | "Fire" | 0 | 1 | | "Dial M For Murder" | 5 | 1 | | "Ed Wood" | 21 | 1 | | "Requiem" | 11 | 1 | +-------------------------------------------------------------+ 10 rows 13 ms
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 71 |
|||||
Projection |
12862 |
10 |
60 |
actors, directors, |
m.title; GetDegree(m,Some(ACTS_IN),INCOMING); |
m, m.title |
GetDegree(m,Some(DIRECTED),INCOMING) |
||||
Limit |
12862 |
10 |
0 |
m |
{ AUTOINT0} |
NodeByLabelScan |
12862 |
10 |
11 |
m |
:Movie |
Our query from the previous section would look like this:
PROFILE MATCH (p:Person) WITH p, sum(size((p)-[:ACTS_IN]->())) as c ORDER BY c DESC LIMIT 10 RETURN p.name, c;
This query shaves off another 50,000 database hits. Not bad.
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 150558 |
|||||
Projection |
224 |
10 |
20 |
c, p, p.name |
p.name; c |
Top |
224 |
10 |
0 |
c, p |
{ AUTOINT0}; c |
EagerAggregation |
224 |
50179 |
100358 |
c, p |
p |
NodeByLabelScan |
50179 |
50179 |
50180 |
p |
:Person |
Note to self: Optimized Cypher looks more like Lisp.
Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress
When following longer paths, you’ll encounter duplicates. If you’re not interested in all the possible paths – but just distinct information from stages of the path – make sure that you eagerly eliminate duplicates, so that later matches don’t have to be executed many multiple times.
This reduction of the cardinality can be done using either WITH DISTINCT
or WITH
aggregation (which automatically de-duplicates).
So, for instance, for this query of "Movies that Tom Hanks' colleagues acted in":
PROFILE MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)-[:ACTS_IN]->(m2) RETURN distinct m2.title;
This query has 10,272 db-hits and touches 3,020 total paths.
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 10272 |
|||||
Distinct |
4 |
2021 |
6040 |
m2.title |
m2.title |
Filter(0) |
4 |
3020 |
0 |
anon[36], anon[53], anon[75], coActor, m1, m2, p |
(NOT(anon[53] == anon[75]) AND NOT(anon[36] == anon[75])) |
Expand(All)(0) |
4 |
3388 |
3756 |
anon[36], anon[53], anon[75], coActor, m1, m2, p |
(coActor)-[:ACTS_IN]→(m2) |
Filter(1) |
3 |
368 |
0 |
anon[36], anon[53], coActor, m1, p |
NOT(anon[36] == anon[53]) |
Expand(All)(1) |
3 |
403 |
438 |
anon[36], anon[53], coActor, m1, p |
(m1)←[:ACTS_IN]-(coActor) |
Expand(All)(2) |
2 |
35 |
36 |
anon[36], m1, p |
(p)-[:ACTS_IN]→(m1) |
NodeIndexSeek |
1 |
1 |
2 |
p |
:Person(name) |
The first-degree neighborhood is unique, since in this dataset there is only at most one :ACTS_IN
relationship between an actor and a movie. So, the first duplicated nodes appear at the second degree, which we can eliminate like this:
PROFILE MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor) WITH distinct coActor MATCH (coActor)-[:ACTS_IN]->(m2) RETURN distinct m2.title;
This query tuning technique reduces the number of paths to match for the last step to 2,906. In other use cases with more duplicates, the impact is much bigger.
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 9529 |
|||||
Distinct(0) |
4 |
2031 |
5812 |
m2.title |
m2.title |
Expand(All)(0) |
4 |
2906 |
3241 |
anon[113], coActor, m2 |
(coActor)-[:ACTS_IN]→(m2) |
Distinct(1) |
3 |
335 |
0 |
coActor |
coActor |
Filter |
3 |
368 |
0 |
anon[36], anon[53], coActor, m1, p |
NOT(anon[36] == anon[53]) |
Expand(All)(1) |
3 |
403 |
438 |
anon[36], anon[53], coActor, m1, p |
(m1)←[:ACTS_IN]-(coActor) |
Expand(All)(2) |
2 |
35 |
36 |
anon[36], m1, p |
(p)-[:ACTS_IN]→(m1) |
NodeIndexSeek |
1 |
1 |
2 |
p |
:Person(name) |
Of course we would apply our Minimize Property Access tip here too:
PROFILE MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor) WITH distinct coActor MATCH (coActor)-[:ACTS_IN]->(m2) WITH distinct m2 RETURN m2.title;
Operator | Est.Rows | Rows | DbHits | Identifiers | Other |
---|---|---|---|---|---|
Total database accesses: 7791 |
|||||
Projection |
4 |
2037 |
4074 |
m2, m2.title |
m2.title |
Distinct(0) |
4 |
2037 |
0 |
m2 |
m2 |
Expand(All)(0) |
4 |
2906 |
3241 |
anon[113], coActor, m2 |
(coActor)-[:ACTS_IN]→(m2) |
Distinct(1) |
3 |
335 |
0 |
coActor |
coActor |
Filter |
3 |
368 |
0 |
anon[36], anon[53], coActor, m1, p |
NOT(anon[36] == anon[53]) |
Expand(All)(1) |
3 |
403 |
438 |
anon[36], anon[53], coActor, m1, p |
(m1)←[:ACTS_IN]-(coActor) |
Expand(All)(2) |
2 |
35 |
36 |
anon[36], m1, p |
(p)-[:ACTS_IN]→(m1) |
NodeIndexSeek |
1 |
1 |
2 |
p |
:Person(name) |
We still need the distinct m2
at the end, as the co-actors can have played in the same movies, and we don’t want duplicate results.
This query has 7,791 db-hits and touches 2,906 paths in total.
If you are also interested in the frequency (e.g., for scoring), you can compute them along with an aggregation instead of distinctly. In the end, You just multiply the path count per co-actor with the number of occurrences per movie.
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor) WITH coActor, count(*) as freq MATCH (coActor)-[:ACTS_IN]->(m2) RETURN m2.title, freq * count(*) as occurrence;
Conclusion
The best way to start with query tuning is to take the slowest queries, PROFILE
them and optimize them using these tips.
If you need help, you can always reach out to us on Stack Overflow, our Google Group or our public Discord channel.
If you are part of a project that is adopting Neo4j or putting it into production, make sure to get some expert help to ensure you’re successful. Note: If you do ask for help, please provide enough information for others to be able to help you.
Explain your graph model, share your queries, their profile output and – best of all – a dataset to run them on.
Need more tips on how to effectively use Neo4j? Register for our online training class, Neo4j in Production, and learn how to master the world’s leading graph database.