Graph algorithms are easy to use, fast to execute and produce powerful results. This blog series is designed to help you better utilize graph analytics and graph algorithms so you can effectively innovate and develop intelligent solutions faster using a graph database like.
Last week we completed our looked at Community Detection algorithms, with a focus on the Triangle Count and Average Clustering Coefficient algorithm.
This week we conclude our series with an overview of graph algorithms in practice, where we will learn how to apply graph algorithms in data-intensive applications.
About Graph Algorithms in Practice
Yelp.com has been running the Yelp Dataset challenge since 2013, a competition that encourages people to explore and research Yelp’s open dataset. As of Round 10 of the challenge, the dataset contained:
- Almost 5 million reviews
- Over 1.1 million users
- Over 150,000 businesses
- 12 metropolitan areas
Graph Model
The Yelp data is represented in a graph model as shown in the diagram below.
Our graph contains
User
labeled nodes, which have a FRIENDS
relationship with other Users
. Users also WRITE Reviews
and tips about Businesses
. All of the metadata is stored as properties of nodes, except for Categories
of the Businesses
, which are represented by separate nodes.Data Import
There are many different methods for importing data into Neo4j, including the import tool, LOAD CSV command and Neo4j Drivers.
For the Yelp dataset, we need to do a one-off import of a large amount of data so the import tool is the best choice. See the yelp-graph-algorithms GitHub repository for more details.
Exploratory Data Analysis
Once we have the data loaded in Neo4j, we execute some exploratory queries to get a feel for it. We will be using the Awesome Procedures on Cypher (APOC) library in this section. Please see Installing APOC for details if you would like to follow along.
The following queries return the cardinalities of node labels and relationship types.
CALL db.labels() YIELD label CALL apoc.cypher.run("MATCH (:`"+label+"`) RETURN count(*) as count", null) YIELD value RETURN label, value.count as count ORDER BY label
CALL db.relationshipTypes() YIELD relationshipType CALL apoc.cypher.run("MATCH ()-[:" + `relationshipType` + "]->() RETURN count(*) as count", null) YIELD value RETURN relationshipType, value.count AS count ORDER BY relationshipType
These queries shouldn’t reveal anything surprising but they are useful for checking that the data has been imported correctly.
It’s always fun reading hotel reviews, so we’re going to focus on businesses in that sector. We find out how many hotels there are by running the following query.
MATCH (category:Category {name: "Hotels"}) RETURN size((category)<-[:IN_CATEGORY]-()) AS businesses
That’s a decent number of hotels to explore.
How many reviews do we have to work with?
MATCH (:Review)-[:REVIEWS]->(:Business)-[:IN_CATEGORY]->(:Category {name:"Hotels"}) RETURN count(*) AS count
Let’s zoom in on some of the individual bits of data.
Trip Planning
Imagine that we’re planning a trip to Las Vegas and want to find somewhere to stay.
We might start by asking which are the most reviewed hotels and how well they’ve been rated.
MATCH (review:Review)-[:REVIEWS]->(business:Business), (business)-[:IN_CATEGORY]->(:Category {name:"Hotels"}), (business)-[:IN_CITY]->(:City {name: "Las Vegas"}) WITH business, count(*) AS reviews, avg(review.stars) AS averageRating ORDER BY reviews DESC LIMIT 10 RETURN business.name AS business, reviews, apoc.math.round(averageRating,2) AS averageRating
These hotels have a lot of reviews, far more than anyone would be likely to read. We’d like to find the best reviews and make them more prominent on our business page.
Finding Influential Hotel Reviewers
One way we can do this is by ordering reviews based on the influence of the reviewer on Yelp.
We’ll start by finding users who have reviewed more than five hotels. After that we’ll find the social network between those users and work out which users sit at the center of that network. This should reveal the most influential people. The
FRIENDS
relationship is an example of a bidirectional relationship, meaning that if Person A is friends with Person B then Person B is also friends with Person A. Neo4j stores a directed graph, but we have the option to ignore the direction when we query the graph.We want to execute the PageRank algorithm over a projected graph of users that have reviewed hotels and then add a
hotelPageRank
property to each of those users. This is the first example where we can’t express the projected graph in terms of node labels and relationship types. Instead we will write Cypher statements to project the required graph.The following query executes the PageRank algorithm.
CALL algo.pageRank(
"MATCH (u:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]-> (:Category {name: "Hotels"}) WITH u, count(*) AS reviews WHERE reviews > 5 RETURN id(u) AS id", "MATCH (u1:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]-> (:Category {name: "Hotels"}) MATCH (u1)-[:FRIENDS]->(u2) WHERE id(u1) < id(u2) RETURN id(u1) AS source, id(u2) AS target", {graph: "cypher", write: true, direction: "both", writeProperty: "hotelPageRank"})
We then write the following query to find the top reviewers.
MATCH (u:User) WHERE u.hotelPageRank > 0 WITH u ORDER BY u.hotelPageRank DESC LIMIT 5 RETURN u.name AS name, apoc.math.round(u.hotelPageRank,2) AS pageRank, size((u)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]-> (:Category {name: "Hotels"})) AS hotelReviews, size((u)-[:WROTE]->()) AS totalReviews, size((u)-[:FRIENDS]-()) AS friends
We could use those rankings on a hotel page when determining which reviews to show first. For example, if we want to show reviews of Caesars Palace, we could execute the following query.
MATCH (b:Business {name: "Caesars Palace Las Vegas Hotel & Casino"}) MATCH (b)<-[:REVIEWS]-(review)<-[:WROTE]-(user) RETURN user.name AS name, apoc.math.round(user.hotelPageRank,2) AS pageRank, review.stars AS stars ORDER BY user.hotelPageRank DESC LIMIT 5
This information may also be useful for businesses that want to know when an influencer is staying in their hotel.
Finding Similar Categories
The Yelp dataset contains more than 1,000 categories, and it seems likely that some of those categories are similar to each other. That similarity is useful for making recommendations to users for other businesses that they may be interested in.
We will build a weighted category similarity graph based on how businesses categorize themselves. For example, if only one business categorizes itself under
Hotels
and Historical Tours
, then we would have a link between Hotels
and Historical Tours
with a weight of 1.We don’t actually have to create the similarity graph – we can run a community detection algorithm, such as Label Propagation, over a projected similarity graph.
CALL algo.labelPropagation.stream "MATCH (c:Category) RETURN id(c) AS id", "MATCH (c1:Category)<-[:IN_CATEGORY]-()-[:IN_CATEGORY]->(c2:Category) WHERE id(c1) < id(c2) RETURN id(c1) AS source, id(c2) AS target, count(*) AS weight", {graph: "cypher"}) YIELD nodeId, label MATCH (c:Category) WHERE id(c) = nodeId MERGE (sc:SuperCategory {name: "SuperCategory-" + label}) MERGE (c)-[:IN_SUPER_CATEGORY]->(sc
The diagram below shows a sample of categories and super categories after we’ve run this query.
We write the following query to find some of the similar categories to hotels.
MATCH (hotels:Category {name: "Hotels"}), (hotels)-[:IN_SUPER_CATEGORY]->()<-[:IN_SUPER_CATEGORY]-(otherCategory) RETURN otherCategory.name AS otherCategory LIMIT 5
Not all of those categories are relevant for users in Las Vegas, so we need to write a more specific query to find the most popular similar categories in this location.
MATCH (hotels:Category {name: "Hotels"}), (lasVegas:City {name: "Las Vegas"}), (hotels)-[:IN_SUPER_CATEGORY]->()<-[:IN_SUPER_CATEGORY]-(otherCategory) RETURN otherCategory.name AS otherCategory, size((otherCategory)<-[:IN_CATEGORY]-()-[:IN_CITY]->(lasVegas)) AS count ORDER BY count DESC LIMIT 10
We could then make a suggestion of one business with an above average rating in each of those categories.
MATCH (hotels:Category {name: "Hotels"}), (lasVegas:City {name: "Las Vegas"}), (hotels)-[:IN_SUPER_CATEGORY]->()<-[:IN_SUPER_CATEGORY]-(otherCategory), (otherCategory)<-[:IN_CATEGORY]-(business)-[:IN_CITY]->(lasVegas) WITH otherCategory, count(*) AS count, collect(business) AS businesses, apoc.coll.avg(collect(business.averageStars)) AS categoryAverageStars ORDER BY count DESC LIMIT 10 WITH otherCategory, [b in businesses where b.averageStars >= categoryAverageStars] AS businesses RETURN otherCategory.name AS otherCategory, [b in businesses | b.name][toInteger(rand() * size(businesses))] AS business
In this blog, we’ve shown just a couple of ways that insights from graph algorithms are used in a real-time workflow to make real-time recommendations. In our example we made category and business recommendations but graph algorithms are applicable to many other problems.
Graph algorithms can help you take your graph-powered application to the next level.
Conclusion
Graph algorithms are the powerhouse behind the analysis of real-world networks – from identifying fraud rings and optimizing the location of public services to evaluating the strength of a group and predicting the spread of disease or ideas.
In this series, you’ve learned about how graph algorithms help you make sense of connected data. We covered the types of graph algorithms and offered specifics about how to use each one. Still, we are aware that we have only scratched the surface.
If you have any questions or need any help with any of the material in this series, send us an email at devrel@neo4j.com. We look forward to hearing how you are using graph algorithms.
Learn about the power of graph algorithms in the O'Reilly book,
Graph Algorithms: Practical Examples in Apache Spark and Neo4j by the authors of this article. Click below to get your free ebook copy.
Get the O'Reilly Ebook