Using Apache Spark and Neo4j for Big Data Graph Analytics

Developer Relations
5 min read

Using Apache Spark and Neo4j for Big Data Graph Analytics
Graph databases
I’ve been working with the Neo4j graph database for the last few years and I have yet to become jaded by its powerful ability to combine both data transformation and data analysis. Graph databases like Neo4j are solving analytical problems that relational databases struggle to solve in a flexible way. Graph processing at scale from a graph database like Neo4j is a tremendously valuable power. But if you wanted to run PageRank on a dump of Wikipedia articles in less than 2 hours on a laptop, you’d be hard pressed to be successful. More so, what if you wanted the power of a high-performance transactional database that seamlessly handled graph analysis at this scale?Mazerunner for Neo4j
Mazerunner is a Neo4j unmanaged extension and distributed graph processing platform that extends Neo4j to do big data graph processing jobs while persisting the results back to Neo4j. Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark’s GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS. After Neo4j exports a subgraph to HDFS, a separate Mazerunner service for Spark is notified to begin processing that data. The Mazerunner service will then start a distributed graph processing algorithm using Scala and Spark’s GraphX module. The GraphX algorithm is serialized and dispatched to Apache Spark for processing. Once the Apache Spark job completes, the results are written back to HDFS as a Key-Value list of property updates to be applied back to Neo4j. Neo4j is then notified that a property update list is available from Apache Spark on HDFS. Neo4j batch imports the results and applies the updates back to the original graph.Example
MATCH (a1:Person)-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors)
CREATE (a1)-[:KNOWS]->(coActors);
Now our model looks like this:
https://localhost:7474/service/mazerunner/pagerank
Hitting this endpoint will run the PageRank algorithm on actors who know each other and then write the results back into Neo4j as a weight property on the Person nodes.
When the job finishes then we can find the top ten most valuable actors to work with. Running a Cypher query against this dataset produces the following results:
neo4j-sh (?)$ MATCH n WHERE HAS(n.weight) RETURN n ORDER BY n.weight DESC LIMIT 10;
+-----------------------------------------------------------------------+
| n |
+-----------------------------------------------------------------------+
| Node[71]{name:"Tom Hanks",born:1956,weight:4.642800717539658} |
| Node[1]{name:"Keanu Reeves",born:1964,weight:2.605304495549113} |
| Node[22]{name:"Cuba Gooding Jr.",born:1968,weight:2.5655048212974223} |
| Node[34]{name:"Meg Ryan",born:1961,weight:2.52628473708215} |
| Node[16]{name:"Tom Cruise",born:1962,weight:2.430592498009265} |
| Node[19]{name:"Kevin Bacon",born:1958,weight:2.0886893112867035} |
| Node[17]{name:"Jack Nicholson",born:1937,weight:1.9641313625284538} |
| Node[120]{name:"Ben Miles",born:1967,weight:1.8680986516285438} |
| Node[4]{name:"Hugo Weaving",born:1960,weight:1.8515582875810466} |
| Node[20]{name:"Kiefer Sutherland",born:1966,weight:1.784065038526406} |
+-----------------------------------------------------------------------+
10 rows
PageRank on Wikipedia Dump
+-------------------------------------------------------------+
| title | weight |
+-------------------------------------------------------------+
| "United States" | 13481.465434735326 |
| "France" | 5866.673240306334 |
| "Animal" | 5665.755894233272 |
| "Country" | 5422.733465218685 |
| "County" | 5097.065153505375 |
| "World War II" | 4903.803574507514 |
| "Germany" | 4758.829204613294 |
| "United Kingdom" | 4538.534060897834 |
| "Russia" | 4328.2166400665055 |
| "English" | 4220.257198573251 |
| "India" | 3956.6896584623237 |
| "England" | 3895.7939536781337 |
| "Japan" | 3665.9152429486285 |
| "Canada" | 3657.4749257843423 |
| "Australia" | 3595.941069092141 |
| "Iran" | 3482.6418104876884 |
| "Province" | 3400.887300974972 |
| "China" | 3239.811469195496 |
| "American" | 3192.7699175209023 |
| "French" | 3182.4457685761104 |
| "The New York Times" | 3181.5839069249864 |
| "Arthropod" | 3097.8173315850213 |
| "Italy" | 3032.3644182841776 |
| "German" | 3012.815858491087 |
| "London" | 2965.4743930058657 |
| "Insect" | 2869.920827785085 |
| "District" | 2784.755324500963 |
| "Spain" | 2668.2978896378804 |
| "New York City" | 2650.752385794727 |
| "Latin" | 2584.6877541535127 |
| "Poland" | 2575.2112457023773 |
| "World War I" | 2573.8477947401866 |
| "Soviet Union" | 2477.694486425502 |
| "Catholic Church" | 2474.683318006141 |
| "New York" | 2468.8443657519097 |
| "Brazil" | 2435.2531424225617 |
| "Romania" | 2393.1873377303928 |
| "Europe" | 2353.968692909295 |
| "California" | 2266.6608579578156 |
| "Lepidoptera" | 2158.715841127394 |
| "State" | 2132.243578231637 |
| "Italian" | 2106.6553548486086 |
| "Greek" | 2028.1563330073805 |
| "Switzerland" | 2027.8234786567634 |
| "Paris" | 1970.357297581375 |
| "Chordata" | 1950.4239922246595 |
| "Spanish" | 1926.201753857372 |
| "New Zealand" | 1828.6683122383472 |
| "Mexico" | 1828.0197503654385 |
| "British" | 1811.7648105938251 |
| "Washington, D.C." | 1810.533732133272 |
| "Netherlands" | 1803.7312379217158 |
| "Scotland" | 1776.2112626996402 |
| "National Register of Historic Places" | 1765.6396180107192 |
| "Sweden" | 1750.4933182628488 |
| "USA" | 1708.9086633673999 |
| "Chordate" | 1659.5183460486119 |
| "Ireland" | 1630.2162859072075 |
| "United States Census Bureau" | 1590.6841804100668 |
| "Turkey" | 1571.7895450758779 |
| "South Africa" | 1552.1113319282647 |
| "Belgium" | 1548.2393303043832 |
| "Austria" | 1517.0098756018515 |
| "Los Angeles" | 1487.3395876804987 |
| "Region" | 1455.4554257569105 |
| "Municipality" | 1422.385733793246 |
| "Greece" | 1415.3614936831354 |
| "CET" | 1393.056892061773 |
| "Norway" | 1387.064297388938 |
| "Chicago" | 1386.6868499051984 |
| "European Union" | 1382.571030912079 |
| "President" | 1376.0926002579163 |
| "Philippines" | 1371.3808757708794 |
| "BBC" | 1362.697735274599 |
| "White" | 1329.6229862450132 |
| "Voivodeship" | 1325.3419002294163 |
| "African American" | 1316.08289317503 |
| "New York Times" | 1305.9732797320457 |
| "Israel" | 1289.4941308687928 |
| "United Nations" | 1275.3410092372967 |
| "Rome" | 1268.7924011211794 |
| "Gmina" | 1264.8386282950032 |
| "CEST" | 1263.3655294822495 |
| "Canadian" | 1257.0066905218277 |
| "Argentina" | 1256.590368894054 |
| "Dutch" | 1253.7080447443823 |
| "Angiosperms" | 1247.3964328063778 |
| "Taiwan" | 1244.5572462360428 |
| "Native Americans" | 1231.9646926767173 |
| "Egypt" | 1230.2800089879247 |
| "Chinese" | 1226.417515716627 |
| "North America" | 1226.211997093438 |
| "Indonesia" | 1221.68337046196 |
| "Oxford University Press" | 1205.4955933795306 |
| "Roman Catholic" | 1205.3842960845823 |
| "Islam" | 1200.4126147456059 |
| "Pakistan" | 1199.2589818906795 |
| "Jews" | 1187.7942956454976 |
| "Texas" | 1187.3229003531255 |
| "The Guardian" | 1185.9922865148167 |
+-------------------------------------------------------------+