[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]
The Problem of Discovery
Discovery, especially non-text discovery, is hard.
When looking for a cool T-shirt, for example, I might not know exactly what I want, only that I’m looking for a gift T-shirt that’s a little mathy and reflects my friend’s love of nature.
As a retailer, I might notice that geometric nature products are quite popular and want to capitalize on that. I could market the more general “math/nature” theme to potential buyers who have shown an affinity for mathy animal shirts, and also improve the browsing experience for new visitors to my site.
Many retail sites with user-generated content rely on user-generated tags to classify image-driven products. However, the quality and number of tags on each item vary widely, and it falls to the item’s creator and the site’s administrators to curate the tags and sort them into browsable categories.
On Threadless, for example, this awesome item has a rich set of tags:
lim heng swee, ilovedoodle, cats, lol, funny, humor, food, foodies, food with faces, pets, meow, ice cream, desserts, awww, puns, punny, wordplay, v-necks, vnecks, tanks, tank tops, crew sweatshirts, Cute
In contrast, this beautiful item has only a handful:
jimena salas, jimenasalas, funded, birds, animals, geometric shapes, abstract, Patterns
Furthermore, although a human might easily be able to classify an image with the tags [ants, anthill, abstract, goofy] as probably belonging to the “funny animals” category, an automated system would have to know that ants are animals and that goofy is a synonym for funny.
ConceptNet5
This article introduces the ConceptNet dataset and describes how to import the data into a Neo4j database.
To paraphrase the ConceptNet5 website, ConceptNet5 is a semantic network built from nodes representing words or short phrases of natural language (“terms” or “concepts”), and the relationships (“associations”) between them.
Armed with this information, a system can take human words as input and use them to better search for information, answer questions and understand user goals.
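To make that idea concrete, here is a minimal sketch of how such a lookup could work. It assumes a tiny hand-written edge list in place of the real ConceptNet5 data; the edges, the helper functions and the category definition are all hypothetical, and only the relation names `IsA` and `Synonym` come from ConceptNet5.

```python
from collections import defaultdict

# Hypothetical, hand-written edges standing in for ConceptNet5 data.
# Each tuple is (start concept, relation, end concept).
EDGES = [
    ("ants", "IsA", "animals"),
    ("anthill", "RelatedTo", "ants"),
    ("goofy", "Synonym", "funny"),
]


def build_index(edges):
    """Map each concept to the concepts it reaches via IsA or Synonym."""
    index = defaultdict(set)
    for start, rel, end in edges:
        if rel in ("IsA", "Synonym"):
            index[start].add(end)
            if rel == "Synonym":
                index[end].add(start)  # treat synonyms as bidirectional
    return index


def expand(tags, index):
    """Return the tags plus every concept reachable from them in the index."""
    seen = set(tags)
    frontier = list(tags)
    while frontier:
        concept = frontier.pop()
        for neighbour in index.get(concept, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen


def matches_category(tags, category_terms, index):
    """A category matches when all of its terms are implied by the tags."""
    return set(category_terms) <= expand(tags, index)


index = build_index(EDGES)
tags = ["ants", "anthill", "abstract", "goofy"]
print(matches_category(tags, ["funny", "animals"], index))  # True
```

With real data, the index would be built from the imported graph rather than from a literal list, and the choice of which relation types to follow would need tuning.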
For example, take a look at toast in the ConceptNet5 web demo:
This looks remarkably similar to a graph model. The dataset is incredibly rich, including (in the JSON) the “sense” of toast as a bread and also as a drink one has in tribute.
Let’s take a look at the JSON response for one ConceptNet edge (the association between two concepts) and import some data into a Neo4j database for exploration:
{
  edges: [
    {
      context: "/ctx/all",
      dataset: "/d/globalmind",
      end: "/c/en/bread",
      features: [
        "/c/en/toast /r/IsA -",
        "/c/en/toast - /c/en/bread",
        "- /r/IsA /c/en/bread"
      ],
      id: "/e/ff9b268e050d62255f236f35ba104300551b8a3b",
      license: "/l/CC/By-SA",
      rel: "/r/IsA",
      source_uri: "/or/[/and/[/s/activity/globalmind/assert/,/s/contributor/omcs/bugmenot/]/,/s/umbel/2013/]",
      sources: [
        "/s/activity/globalmind/assert",
        "/s/contributor/omcs/bugmenot",
        "/s/umbel/2013"
      ],
      start: "/c/en/toast",
      surfaceText: "Kinds of [[bread]] : [[toast]]",
      uri: "/a/[/r/IsA/,/c/en/toast/,/c/en/bread/]",
      weight: 3
    }
  ]
}
Modeling the Database
For the purposes of this example, let’s model the database with the following properties:
Term nodes:
- concept
- language
- partOfSpeech
- sense
ASSERTION relationships:
- type
- weight
- surfaceText
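As a rough illustration of this model, the sketch below maps one edge from the JSON response onto plain Python structures. The `Term` and `Assertion` classes and the `parse_term` and `parse_edge` helpers are hypothetical names introduced here; the field names mirror the properties listed above, and the URI layout (`/c/<language>/<concept>/<partOfSpeech>/<sense>`) is inferred from the sample edge and the import query, so treat it as an assumption.

```python
from typing import NamedTuple


class Term(NamedTuple):
    concept: str
    language: str
    partOfSpeech: str
    sense: str


class Assertion(NamedTuple):
    start: Term
    end: Term
    type: str
    weight: float
    surfaceText: str


def parse_term(uri):
    """Split a URI such as '/c/en/toast' into Term fields."""
    parts = uri.split("/")            # ['', 'c', 'en', 'toast', ...]
    parts += [""] * (6 - len(parts))  # pad missing partOfSpeech and sense
    return Term(concept=parts[3], language=parts[2],
                partOfSpeech=parts[4], sense=parts[5])


def parse_edge(edge):
    """Turn one ConceptNet edge dict into an Assertion."""
    return Assertion(
        start=parse_term(edge["start"]),
        end=parse_term(edge["end"]),
        type=edge["rel"].split("/")[2],  # '/r/IsA' -> 'IsA'
        weight=edge["weight"],
        surfaceText=edge.get("surfaceText") or "",
    )


# Fields copied from the sample edge shown earlier.
edge = {
    "start": "/c/en/toast",
    "end": "/c/en/bread",
    "rel": "/r/IsA",
    "weight": 3,
    "surfaceText": "Kinds of [[bread]] : [[toast]]",
}
assertion = parse_edge(edge)
print(assertion.start.concept, assertion.type, assertion.end.concept)
# toast IsA bread
```

Empty strings stand in for a missing part of speech or sense, which matches how the import query stores them.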
Loading the Data into the Database
Let’s use the following Python script to upload some sample data:
import requests
from py2neo import authenticate, Graph

# Written against the py2neo 2.x API (authenticate, graph.cypher.execute).
USERNAME = "neo4j"     # use your actual username
PASSWORD = "12345678"  # use your actual password
authenticate("localhost:7474", USERNAME, PASSWORD)
graph = Graph()

# Tags to look up in ConceptNet5; each one becomes a request to the API.
sample_tags = ['fruit', 'orange', 'bikes', 'cream', 'nature', 'toast', 'electronic',
               'techno', 'house', 'dubstep', 'drum_and_bass', 'space_rock',
               'psychedelic_rock', 'psytrance', 'garage', 'progressive', 'Cologne',
               'North_Rhine-Westphalia', 'gothic_rock', 'darkwave', 'goth',
               'geometric', 'nature', 'skylines', 'landscapes', 'mountains', 'trees',
               'silhouettes', 'back_in_stock', 'Patterns', 'raglans', 'giraffes',
               'animals', 'nature', 'tangled', 'funny', 'cute', 'krautrock']

# Build query. Each URI is split on "/" to extract language, concept,
# part of speech and sense; missing parts are stored as empty strings.
query = """
WITH {json} AS document
UNWIND document.edges AS edges
WITH
  SPLIT(edges.start,"/")[3] AS startConcept,
  SPLIT(edges.start,"/")[2] AS startLanguage,
  CASE WHEN SPLIT(edges.start,"/")[4] <> "" THEN SPLIT(edges.start,"/")[4] ELSE "" END AS startPartOfSpeech,
  CASE WHEN SPLIT(edges.start,"/")[5] <> "" THEN SPLIT(edges.start,"/")[5] ELSE "" END AS startSense,
  SPLIT(edges.rel,"/")[2] AS relType,
  CASE WHEN edges.surfaceText <> "" THEN edges.surfaceText ELSE "" END AS surfaceText,
  edges.weight AS weight,
  SPLIT(edges.end,"/")[3] AS endConcept,
  SPLIT(edges.end,"/")[2] AS endLanguage,
  CASE WHEN SPLIT(edges.end,"/")[4] <> "" THEN SPLIT(edges.end,"/")[4] ELSE "" END AS endPartOfSpeech,
  CASE WHEN SPLIT(edges.end,"/")[5] <> "" THEN SPLIT(edges.end,"/")[5] ELSE "" END AS endSense
MERGE (start:Term {concept:startConcept, language:startLanguage, partOfSpeech:startPartOfSpeech, sense:startSense})
MERGE (end:Term {concept:endConcept, language:endLanguage, partOfSpeech:endPartOfSpeech, sense:endSense})
MERGE (start)-[r:ASSERTION {type:relType, weight:weight, surfaceText:surfaceText}]-(end)
"""

# Using the Search endpoint to load data into the graph
for tag in sample_tags:
    searchURL = "https://conceptnet5.media.mit.edu/data/5.4/c/en/" + tag + "?limit=500"
    searchJSON = requests.get(searchURL, headers={"accept": "application/json"}).json()
    graph.cypher.execute(query, json=searchJSON)
Exploring the Data
Use the following Cypher query to explore the data:
MATCH (n:Term {language:'en'})-[r:ASSERTION]->(m:Term {language:'en'})
WHERE NOT r.type = 'dbpedia'
  AND NOT r.surfaceText = ''
  AND NOT n.partOfSpeech = ''
  AND NOT n.sense = ''
RETURN n.concept AS `Start Concept`,
       n.sense AS `in the sense of`,
       r.type,
       m.concept AS `End Concept`,
       m.sense AS `End Sense`
ORDER BY r.weight DESC, n.sense ASC
LIMIT 10
The ConceptNet dataset is incredibly rich: it provides the various “senses” in which someone might mean “orange” as well as a wide variety of “relationship types” to choose from.
    | Start Concept | in the sense of                                         | r.type     | End Concept     | End Sense
----+---------------+---------------------------------------------------------+------------+-----------------+-----------
  1 | orange        | colour                                                  | IsA        | color           |
  2 | orange        | film                                                    | InstanceOf | film            |
  3 | dynamic       | a_characteristic_or_manner_of_an_interaction_a_behavior | Synonym    | nature          |
  4 | garage        | a_petrol_filling_station                                | Synonym    | petrol_station  |
  5 | garage        | a_petrol_filling_station                                | Synonym    | fill_station    |
  6 | garage        | a_petrol_filling_station                                | Synonym    | gas_station     |
  7 | progressive   | advancing_in_severity                                   | Antonym    | non_progressive |
  8 | shop          | automobile_mechanic's_workplace                         | Synonym    | garage          |
  9 | electronic    | band                                                    | IsA        | band            |
 10 | cream         | band                                                    | IsA        | band            |
Use Cases and Future Directions
Once loaded into a graph database, the ConceptNet5 data takes the agony out of tag-based recommendations and categorizations.
Small retail and social startups can integrate a Neo4j microservice into their existing stack and use it to power recommendations and to gain insight into the most effective way to categorize products (should “funny cats” have their own first-level category, or should they go under “animals”?). That leaves more time and budget for richer innovations.
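As one possible shape for such a recommendation service, the sketch below ranks catalogue items by the concepts they share. Everything in it is hypothetical: the item names, the concept weights (stand-ins for ConceptNet edge weights) and the scoring rule are assumptions chosen for illustration, and in practice the concept sets would come from queries against the imported graph.

```python
# Hypothetical catalogue: item name -> concepts derived from its tags,
# each with an illustrative weight.
ITEM_CONCEPTS = {
    "geometric birds tee": {"birds": 1.0, "animals": 2.0,
                            "geometric": 1.0, "nature": 1.5},
    "ice cream cat tee": {"cats": 1.0, "animals": 2.0,
                          "funny": 1.0, "food": 1.0},
    "fractal fern tee": {"geometric": 1.0, "nature": 1.5, "math": 1.0},
}


def similarity(a, b):
    """Sum the smaller weight of every concept the two items share."""
    shared = a.keys() & b.keys()
    return sum(min(a[c], b[c]) for c in shared)


def recommend(item, catalogue, top_n=2):
    """Return up to top_n other items, most similar first."""
    target = catalogue[item]
    scores = [
        (similarity(target, concepts), other)
        for other, concepts in catalogue.items()
        if other != item
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scores[:top_n] if score > 0]


print(recommend("geometric birds tee", ITEM_CONCEPTS))
# ['fractal fern tee', 'ice cream cat tee']
```

The same scores could also inform categorization decisions, for instance by checking how many items cluster around a candidate category such as “funny cats”.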
References
Loading JSON into a Neo4j Database
Dealing with Empty Columns
Data
- ConceptNet5 (thanks to Marvin Minsky, Luminoso, Push Singh and the MIT Media Lab)
- WordNet