Proudly Releasing: Efficient Graph Algorithms in Neo4j

Head of Product Innovation & Developer Strategy, Neo4j

August 3, 2017

5 min read

Learn about the new graph algorithms library for Neo4j, made more easily available using Cypher

I am very happy to announce the first public release of the Neo4j graph algorithms library.

You can use these graph algorithms on your connected data to gain new insights more easily within Neo4j. You can use these graph analytics to improve results from your graph data, for example by focusing on particular communities or favoring popular entities.

We developed this library as part of our effort to make it easier to use Neo4j for a wider variety of applications. Many users expressed interest in running graph algorithms directly on Neo4j without having to employ a secondary system. We also tuned these algorithms to be as efficient as possible in regards to resource utilization as well as streamlined for later management and debugging.

These algorithms represent user-defined procedures which you can call as part of Cypher statements running on top of Neo4j.

Kudos

Big thanks goes to Martin Knobloch and Paul Horn from our good friends at Avantgarde Labs in Dresden who did all the heavy lifting. From reading tons of papers and tuning and parallelizing implementations, to providing performance testing and implementers documentation – most of the work you see in this library is theirs.

Tomaz Bratanic also helped immensely with documenting the library, providing explanations and examples on small graphs and detailing syntax information for all graph algorithms.

Our documentation is still a work in progress, so please bear with us! However, please let us know if the existing sections are helpful or you have ideas on how to improve the documentation.

The Graph Algorithms

The graph algorithms covered by the library are:

Centrality:

PageRank
Betweenness Centrality
Closeness Centrality

Partitioning:

Label Propagation
(Weakly) Connected Components
Strongly Connected Components
Union-Find

Path-Finding:

Minimum Weight Spanning Tree<
All Pairs – and Single Source – Shortest Path
Multi-Source, Breadth-First Search

Most of the graph algorithms are available in two variants: One that writes the results (e.g., rank or partition) back to the graph, and the other, suffixed with .stream which will stream the results back for further sorting, filtering or aggregation.

To select which part of the graph to run the graph algorithm on, you can provide a label and relationship type as first parameters. Then, add configuration options depending on the algorithm. For instance, iterations and damping-factor for PageRank.

Example: PageRank on DBPedia

Here we run PageRank on DBPedia (11M Page-nodes, 125M Link-relationships):

CALL algo.pageRank('Page', 'Link', {write:true,iterations:20});


+--------------------------------------------------------------------------------------+
| nodes    | iter | loadMillis | computeMillis | writeMillis | damping | writeProperty |
+--------------------------------------------------------------------------------------+
| 11474730 | 20   | 34106      | 9712          | 1810        | 0.85    | "pagerank"    |
+--------------------------------------------------------------------------------------+
1 row
47888 ms


CALL algo.pageRank.stream('Page', 'Link', {iterations:5}) YIELD node, score
WITH * ORDER BY score DESC LIMIT 5
RETURN node.title, score;


+--------------------------------------+
| node.title                 | score   |
+--------------------------------------+
| "United States"            | 13349.2 |
| "Animal"                   | 6077.77 |
| "France"                   | 5025.61 |
| "List of sovereign states" | 4913.92 |
| "Germany"                  | 4662.32 |
+--------------------------------------+
5 rows
46247 ms

Graph Projection

One really cool feature is the ability to load a projection of a (sub-)graph of your data into the graph algorithm by passing Cypher statements to select nodes and node-pairs and choosing the “cypher” graph loader. The compiled Cypher runtime of Neo4j 3.2 (Enterprise) benefits this load strategy.

Here is an example for Union-Find:

CALL algo.unionFind('
MATCH (p:Page) WHERE p.title CONTAINS "Europe" RETURN id(p) as id
','
MATCH (p1:Page)-[:Link]->(p2:Page)
 WHERE p1.title CONTAINS "Europe" AND p2.title CONTAINS "Europe"
 RETURN id(p1) as source, id(p2) as target
',{graph:'cypher',write:true});


+-------------------------------------------------------------+
| loadMillis | computeMillis | writeMillis | nodes | setCount |
+-------------------------------------------------------------+
| 2975       | 2             | 30          | 23467 | 4322     |
+-------------------------------------------------------------+
1 row
3013 ms

Performance Testing

For several of the algorithms (PageRank, union-find, label-propagation, strongly-connected-components), I ran preliminary tests on medium and larger datasets that have also been used in other publications.

The table below contains database size as well as node and relationship counts. For each graph algorithm, you see the runtime (in seconds) of the first and second run.

Comparing them with other publications, those runtimes look quite good. But of course the real confirmation comes from you running the algorithms on your own datasets on your own hardware.

Graph	Size (GB)	nodes (M)	rels (M)	PR1 (s)	PR2 (s)	UF1 (s)	UF2 (s)	LP1 (s)	LP2 (s)	SCC1 (s)	SCC2 (s)
Pokec	7.3	2	31	10	7	24	6	12	9	12	7
DBPedia	15	11	117	46	38	91	37	51	43	65	41
Graphs500-23	7.9	5	129	19	15	29	10	18	17	25	13
cit-patents	0.2	4	17	13	10	23	5	12	10	14	5
Twitter-2010	49	42	1468	349	131	353	128	405	405	339	189
soc-LifeJournal1	6.3	5	69	30	14	34	11	25	19	23	13
Friendster	62	66	1806	611	235	619	196	296	282	483	257

And here is the obligatory chart.
Please note that this is log-scale to fit larger and smaller datasets in one chart.

Installation

We provide two releases, one for Neo4j 3.1.x and one for Neo4j 3.2.x

Installation is easy: just download the jar-file from the release link below, copy it into your $NEO4J_HOME/plugins directory and restart Neo4j.

Note: For Neo4j 3.2.x you will also have to add this line to your $NEO4J_HOME/conf/neo4j.conf config file:
dbms.security.procedures.unrestricted=algo.*

Feedback

We would love to hear your thoughts on the new graph algorithm library!

Please try out this library on your data and let us know how it worked. Raise GitHub issues if you run into any problems and don’t forget our #neo4j-graph-algorithm channel in the neo4j-users Discord if you have questions.

Implementation

The library is currently limited to handle 2 billion nodes by design, but in future versions we will remove those limits as we more tightly integrate this work.

Our general approach is to load the projected data from Neo4j into an efficient data structure, compute the algorithm and write the results back. All three steps are parallelized as much as possible to maximize CPU and I/O utilization.

We use a composed Graph-API interface to provide the algorithms access to the graph data, which is loaded into different representations by GraphFactory instances. Both the loading and writing back of results happens in parallel batches.

Feel free to have a look at the code, give us feedback or even add your own algorithm implementation based on the existing infrastructure.

We welcome any pull request with new algorithms, bug-fixes or other improvements.

Happy computing,

Michael

References

Graph Algorithms Developer Page
Releases
GitHub Repository (please star if you like it)
Documentation
GitHub issues
Neo4j Slack community

Want to take your Neo4j skills up a notch? Take our online training class, Neo4j in Production, and learn how to scale the world’s leading graph database to unprecedented levels.

Take the Class