In this post, we explore how to get started with practical, scalable recommendation in graphs. We will walk through a fundamental example of news recommendation on a dataset containing 17.5 million click events and around 750K users. We will leverage Neo4j and the Graph Data Science (GDS) library to quickly predict similar news based on user preferences and enable sub-second, rank-ordered recommendation queries personalized to each user.
All the code to reproduce this analysis, along with resources to set up the graph from source data, is available in this GitHub repository.
Recommender Systems are a rich area of study, and we will barely scratch the surface here. If you are interested in going deeper in any area, whether it be other recommendation techniques, evaluating performance, optimization, including more inputs, scaling up further, or anything else related to graph-based recommenders, please drop me a line.
This post is structured as follows. First, we will briefly define Recommender Systems. Next, we will go over the source dataset and the graph we will be using, along with how to query basic profiling statistics to help us understand the graph and better prepare for analysis. Then we will talk about a technique called Collaborative Filtering (CF), which will be our mechanism for recommendation in this post. After that, we will get into the guts of applying CF using the Cypher query language and scaling with the Graph Data Science (GDS) library, leveraging node embeddings and an ML technique called K-Nearest Neighbor (KNN). Finally, we will talk about next steps and follow-up resources.
What Are Recommender Systems?
Put simply, Recommender Systems are a type of information filtering system that seeks to generate meaningful recommendations to users for items they may be interested in. In the context of recommender systems, “item” is a general term that can refer to anything marketed or directed towards users, including products in online retail stores, content such as written articles, videos, and/or music, or potential connections or people to follow on social media platforms.
Because they help users find relevant items quickly, Recommender Systems are essential for increasing user satisfaction and accelerating business growth in our increasingly competitive online landscape.
Today’s Dataset: Microsoft MIND
In this post, we will explore news recommendation with the MIcrosoft News Dataset (MIND) Large dataset, which is a sample of 1 million anonymized users and their click behaviors collected from the Microsoft News website [1]. It includes about 15M impression logs for about 160K English news articles. I formatted the dataset and loaded it into a graph with the schema below:
//visualize schema in Neo4j Browser
neo4j$ CALL db.schema.visualization();
We see that news articles are modeled as nodes and can be CLICKED or HISTORICALLY CLICKED by users, which are also modeled as nodes. In this context, CLICKED refers to a click action parsed from an impression record occurring over the time interval of our sample, approximately November 9, 2019 through November 15, 2019. HISTORICALLY CLICKED refers to a click action from user history, occurring at some unknown time in the past.
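One quick way to check this model against the loaded graph is to count nodes by label and relationships by type. The snippet below is only a minimal sketch and is not taken from the accompanying repository. It deliberately avoids naming any node labels or relationship types, so it makes no assumption about how they are spelled in the database. The column aliases (nodeLabels, nodeCount, relationshipType, relationshipCount) are illustrative names.

```cypher
// Count nodes grouped by their label set.
// No label names are hard-coded, so this should run against any graph.
MATCH (n)
RETURN labels(n) AS nodeLabels, count(n) AS nodeCount
ORDER BY nodeCount DESC;

// Count relationships grouped by type.
// With the schema described above, we would expect one row per click relationship type.
MATCH ()-[r]->()
RETURN type(r) AS relationshipType, count(r) AS relationshipCount
ORDER BY relationshipCount DESC;
```

Run the two statements separately if multi-statement execution is not enabled in Neo4j Browser. Both read the whole graph, so they may take a moment on a dataset of this size.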