Entity Extraction with APOC NLP

Goals
In this guide we’ll learn how to do entity extraction using the AWS, GCP, and Azure NLP libraries.
Prerequisites
Please have Neo4j (version 4.2 or later) downloaded and installed.

Intermediate

Intro to Entity Extraction

Entity Extraction takes unstructured text and returns a list of named entities contained within that text.

AWS, GCP, and Azure each provide NLP APIs, which are wrapped by the apoc.nlp procedures in the APOC Library. In this guide we’ll use these procedures to build a graph of entities and then query it.

Tools Used

This guide uses the following tools and versions:

Tool | Version
Neo4j | 4.2.0
APOC + NLP Dependencies | 4.2.0.0

Prerequisites

Before we start using the APOC NLP procedures, there are some prerequisites that we need to set up.

Install Dependencies

The APOC NLP procedures have dependencies on Kotlin and client libraries that are not included in the APOC Library.

These dependencies are included in the apoc-nlp-dependencies-<version>.jar, which can be downloaded from the releases page. Once that file is downloaded, it should be placed in the plugins directory and the Neo4j Server restarted.
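After the restart, one way to check that the NLP procedures have been registered is to search the APOC help for them. This is a minimal sketch, assuming the apoc.help procedure from the APOC Library is available. A non-empty result only shows that the procedures are loaded, not that any cloud credentials are valid.

```cypher
// List the registered APOC procedures and functions that mention "nlp"
CALL apoc.help("nlp")
YIELD type, name, text
RETURN type, name, text
ORDER BY name;
```

If no rows come back, it is worth re-checking the contents of the plugins directory before continuing.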

Setting up API Key

Now we need to set up an API key for the cloud platform that we want to use.

AWS

We can generate an Access Key and Secret by following the instructions at docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html. Once we’ve done that, we can populate and execute the following commands to create parameters that contain these details.

The following defines the awsApiKey and awsApiSecret parameters:
:param awsApiKey => ("<api-key-here>");
:param awsApiSecret => ("<api-secret-here>");
GCP

We can generate an API Key that has access to the Cloud Natural Language API by going to console.cloud.google.com/apis/credentials. Once we’ve created a key, we can populate and execute the following command to create a parameter that contains these details.

The following defines the gcpApiKey parameter:

:param gcpApiKey => ("<api-key-here>")
Azure

We can generate an API key and URL by following the instructions in the Quickstart: Use the Text Analytics client library article. Once we’ve done that, we should be able to see a page listing our credentials, similar to the screenshot below:

Figure 1. Azure Text Analytics credentials

In this case our API URL is https://neo4j-entity-extraction-developer-guide.cognitiveservices.azure.com/, and we can use either of the hidden keys.

Let’s populate and execute the following commands to create parameters that contain these details.

The following defines the azureApiKey and azureApiUrl parameters:
:param azureApiKey => ("<api-key-here>");
:param azureApiUrl => ("<api-url-here>");

Import text documents

We’re going to import some text documents from the Guardian Football RSS feed. We can use the apoc.load.xml procedure for this, as shown in the following query:

CALL apoc.load.xml("https://www.theguardian.com/football/rss", "rss/channel/item")
YIELD value
WITH [child in value._children WHERE child._type = "title" | child._text][0] AS title,
     [child in value._children WHERE child._type = "link" | child._text][0] AS guid,
     apoc.text.regreplace(
       [child in value._children WHERE child._type = "description" | child._text][0],
       '<[^>]*>',
       ' '
     ) AS body,
     [child in value._children WHERE child._type = "category" | child._text] AS categories
MERGE (a:Article {id: guid})
SET a.body = body, a.title = title
WITH a, categories
CALL {
  WITH a, categories
  UNWIND categories AS category
  MERGE (c:Category {name: category})
  MERGE (a)-[:IN_CATEGORY]->(c)
  RETURN count(*) AS categoryCount
}
RETURN a.id AS articleId, a.title AS title, categoryCount;

We can see a Neo4j Browser visualization of the imported graph in the diagram below:

Figure 2. Guardian football graph
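
As a quick sanity check on the import, we could count how many articles landed in each category. This is only a sketch that relies on the Article and Category labels and the IN_CATEGORY relationship created by the import query. The actual counts depend on whatever the RSS feed contains at the time it is loaded.

```cypher
// Count imported articles per category, most common categories first
MATCH (c:Category)<-[:IN_CATEGORY]-(a:Article)
RETURN c.name AS category, count(a) AS articles
ORDER BY articles DESC
LIMIT 10;
```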

Building an entity graph

Now we’re going to build our entity graph. We’re going to use a combination of apoc.periodic.iterate and the cloud-specific streaming procedures to:

  • Extract entities for each article, filtering for specific entity types

  • Create a node for each entity

  • Create a relationship from the article to each entity

We can do this with the following query for the platform that we’re using:

AWS
CALL apoc.periodic.iterate(
  "MATCH (a:Article) WHERE not(exists(a.processed)) RETURN a",
  "CALL apoc.nlp.aws.entities.stream([item in $_batch | item.a], {
          key: $awsApiKey,
          secret: $awsApiSecret,
          nodeProperty: 'body'
   })
   YIELD node AS a, value
   SET a.processed = true
   WITH a, value
   UNWIND value.entities AS entity
   WITH a, entity
   WHERE entity.type IN ['COMMERCIAL_ITEM', 'PERSON', 'ORGANIZATION', 'LOCATION', 'EVENT']

   CALL apoc.merge.node(
     ['Entity', apoc.text.capitalize(toLower(entity.type))],
     {name: entity.text}, {}, {}
   )
   YIELD node AS e
   MERGE (a)-[entityRel:AWS_ENTITY]->(e)
   SET entityRel.score = entity.score
   RETURN count(*)",
  { batchMode: "BATCH_SINGLE",
    batchSize: 25,
    params: {awsApiKey: $awsApiKey, awsApiSecret: $awsApiSecret}}
)
YIELD batches, total, timeTaken, committedOperations, errorMessages, batch, operations
RETURN batches, total, timeTaken, committedOperations, errorMessages, batch, operations;
Table 1. Results
batches | total | timeTaken | committedOperations | errorMessages | batch | operations
3 | 59 | 2 | 59 | {} | {total: 3, committed: 3, failed: 0, errors: {}} | {total: 59, committed: 59, failed: 0, errors: {}}
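
To see the shape of the response that the batch query unwinds, we could call the streaming procedure on a single article and return the raw entities. This is a small sketch that reuses the awsApiKey and awsApiSecret parameters defined earlier. It writes nothing to the graph, so it leaves the processed flag untouched.

```cypher
// Inspect the raw AWS entities for one article without modifying the graph
MATCH (a:Article)
WITH a LIMIT 1
CALL apoc.nlp.aws.entities.stream(a, {
  key: $awsApiKey,
  secret: $awsApiSecret,
  nodeProperty: 'body'
})
YIELD value
UNWIND value.entities AS entity
RETURN entity;
```

Looking at the returned maps makes it easier to see why the AWS query reads entity.text while the GCP and Azure queries read entity.name.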

GCP
CALL apoc.periodic.iterate(
  "MATCH (a:Article) WHERE not(exists(a.processed)) RETURN a",
  "CALL apoc.nlp.gcp.entities.stream([item in $_batch | item.a], {
          key: $gcpApiKey,
          nodeProperty: 'body'
   })
   YIELD node AS a, value
   SET a.processed = true
   WITH a, value
   UNWIND value.entities AS entity
   WITH a, entity
   WHERE entity.type IN ['PERSON', 'LOCATION', 'ORGANIZATION', 'EVENT']

   CALL apoc.merge.node(
     ['Entity', apoc.text.capitalize(toLower(entity.type))],
     {name: entity.name}, {}, {}
   )
   YIELD node AS e
   MERGE (a)-[entityRel:GCP_ENTITY]->(e)
   SET entityRel.score = entity.salience
   RETURN count(*)",
  { batchMode: "BATCH_SINGLE",
    batchSize: 25,
    params: {gcpApiKey: $gcpApiKey}}
)
YIELD batches, total, timeTaken, committedOperations, errorMessages, batch, operations
RETURN batches, total, timeTaken, committedOperations, errorMessages, batch, operations;
Table 2. Results
batches | total | timeTaken | committedOperations | errorMessages | batch | operations
3 | 59 | 46 | 59 | {} | {total: 3, committed: 3, failed: 0, errors: {}} | {total: 59, committed: 59, failed: 0, errors: {}}

Azure
CALL apoc.periodic.iterate(
  "MATCH (a:Article) WHERE not(exists(a.processed)) RETURN a",
  "CALL apoc.nlp.azure.entities.stream([item in $_batch | item.a], {
          key: $azureApiKey,
          url: $azureApiUrl,
          nodeProperty: 'body'
   })
   YIELD node AS a, value
   SET a.processed = true
   WITH a, value
   UNWIND value.entities AS entity
   WITH a, entity
   WHERE entity.type IN ['Person', 'Organization', 'Location', 'Event']

   CALL apoc.merge.node(
     ['Entity', apoc.text.capitalize(toLower(entity.type))],
     {name: entity.name}, {}, {}
   )
   YIELD node AS e
   MERGE (a)-[entityRel:AZURE_ENTITY]->(e)
   SET entityRel.score = entity.score
   RETURN count(*)",
  { batchMode: "BATCH_SINGLE",
    batchSize: 25,
    params: {azureApiUrl: $azureApiUrl, azureApiKey: $azureApiKey}}
)
YIELD batches, total, timeTaken, committedOperations, errorMessages, batch, operations
RETURN batches, total, timeTaken, committedOperations, errorMessages, batch, operations;
Table 3. Results
batches | total | timeTaken | committedOperations | errorMessages | batch | operations
3 | 59 | 3 | 59 | {} | {total: 3, committed: 3, failed: 0, errors: {}} | {total: 59, committed: 59, failed: 0, errors: {}}
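
All three queries select articles that have no processed property and then set it to true. As written, running a second platform’s query against the same database would therefore find nothing left to process. If we want to extract entities with more than one platform, one option is to clear the flag between runs, as in this sketch:

```cypher
// Clear the processed flag so that another platform's query picks up every article again
MATCH (a:Article)
WHERE a.processed IS NOT NULL
REMOVE a.processed
RETURN count(a) AS articlesReset;
```

An alternative, not shown here, would be to use a separate flag per platform.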

Querying the entity graph

Now that we’ve got the entities, it’s time to query the entity graph. Let’s start by returning the entities for each article, as shown in the following query:

AWS
MATCH (a:Article)
RETURN a.title AS title, [(a)-[:AWS_ENTITY]->(entity) | entity.name] AS entities
LIMIT 5;
Table 4. Results
title entities

"Manchester United’s Edinson Cavani apologises for 'racist' Instagram post"

["Cavani", "FA", "Edinson Cavani", "Southampton", "Manchester United", "Football Association"]

"The need for concussion substitutes – Football Weekly"

["Faye Carruthers", "Soundcloud", "Mixcloud", "Arsenal", "Barry Glendenning", "Facebook", "Ewan Murray", "Raúl Jiménez", "Podcasts", "Wolves", "David Luiz", "Twitter", "Acast", "Stitcher", "Southampton", "Audioboom", "Manchester United", "Max Rushden", "Lars Sivertsen"]

"Lennon the fall guy for Celtic’s failure to build on domestic dominance"

["Rangers", "Leicester City", "Brendan Rodgers", "County", "Celtic", "Tony Mowbray", "Ross", "Gordon Strachan", "Neil Lennon", "Martin O’Neill", "Easy Street", "League Cup", "Ronny Deila"]

"Napoli find truest tribute to Maradona by mirroring his magic against Roma | Nicky Bandini"

["Stadio San Paolo", "Lorenzo Insigne", "European", "Napoli", "Roma", "Lionel Messi", "Naples", "Diego Maradona", "Curva"]

"Concussion substitutes trial could begin in Premier League early next year"

["David Luiz Trials", "David Luiz", "Raúl Jiménez", "Mexico", "Guardian", "Arsenal", "Wolves", "Premier League", "Daniela"]

GCP
MATCH (a:Article)
RETURN a.title AS title, [(a)-[:GCP_ENTITY]->(entity) | entity.name] AS entities
LIMIT 5;
Table 5. Results
title entities

"Manchester United’s Edinson Cavani apologises for 'racist' Instagram post"

["Southampton", "greeting", "club", "Striker", "incident", "Manchester United", "win", "body", "Uruguayan", "FA", "Edinson Cavani", "friend", "Football Association"]

"The need for concussion substitutes – Football Weekly"

["Wolves", "Acast", "Barry Glendenning", "Apple", "Ewan Murray", "victory", "Arsenal", "Soundcloud", "Faye Carruthers", "Max Rushden", "Raúl Jiménez", "Audioboom", "Lars Sivertsen", "David Luiz", "Manchester United"]

"Lennon the fall guy for Celtic’s failure to build on domestic dominance"

["Brendan Rodgers", "club", "season", "bar", "Easy Street", "Rangers", "Leicester", "Gordon Strachan", "Ronny Deila", "race", "League Cup", "defeat", "City manager", "fans", "Martin O’Neill’s", "Celtic", "Tony Mowbray", "pond", "Neil Lennon", "Ross County"]

"Napoli find truest tribute to Maradona by mirroring his magic against Roma | Nicky Bandini"

["fans", "Lorenzo Insigne", "Stadio San Paolo", "Naples", "Napoli", "family", "Diego Maradona", "Roma", "win", "death", "Lionel Messi", "European", "player"]

"Concussion substitutes trial could begin in Premier League early next year"

["club", "clash", "Daniela", "Raúl Jiménez", "operation", "Premier League", "David Luiz", "teams", "Arsenal", "recovery", "Guardian", "striker", "Mexico", "Wolves", "Rule change", "David Luiz Trials"]

Azure
MATCH (a:Article)
RETURN a.title AS title, [(a)-[:AZURE_ENTITY]->(entity) | entity.name] AS entities
LIMIT 5;
Table 6. Results
title entities

"Manchester United’s Edinson Cavani apologises for 'racist' Instagram post"

["Striker", "Edinson Cavani", "Manchester United F.C.", "Uruguay national football team", "The Football Association", "Southampton F.C."]

"The need for concussion substitutes – Football Weekly"

["Arsenal F.C.", "AudioBoom", "Stitcher Radio", "Max Rushden", "Lars Sivertsen", "Manchester United F.C.", "SoundCloud", "Raúl Jiménez", "Acast", "Facebook", "Mixcloud", "Barry Glendenning", "Southampton Rate", "Ewan Murray", "Twitter", "Southampton F.C.", "David Luiz", "Apple Podcasts", "Faye Carruthers"]

"Lennon the fall guy for Celtic’s failure to build on domestic dominance"

["Martin O’Neill", "Leicester City F.C.", "Tony Mowbray", "Ronny Deila", "Rangers F.C.", "Gordon Strachan", "League Cup", "Neil Lennon", "Ross County F.C.", "EFL Cup", "Celtic F.C.", "Brendan Rodgers"]

"Napoli find truest tribute to Maradona by mirroring his magic against Roma | Nicky Bandini"

["Naples", "Lorenzo Insigne", "Diego Maradona", "Rome", "Europe", "Lionel Messi", "Lionel Messi", "Stadio San Paolo", "S.S.C. Napoli", "Stadio San Paolo"]

"Concussion substitutes trial could begin in Premier League early next year"

["Premier League", "Mexico", "Daniela", "Guardian", "Raúl Jiménez", "Wolverhampton Wanderers F.C.", "David Luiz", "Arsenal F.C.", "The Guardian"]
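
The same relationships can also be read from the entity side, for example to find which entities are mentioned in the most articles. The sketch below uses the AWS_ENTITY relationship type. Swapping in GCP_ENTITY or AZURE_ENTITY should work the same way, assuming the corresponding build query has been run.

```cypher
// Entities mentioned in the most articles, with the labels assigned at merge time
MATCH (e:Entity)<-[:AWS_ENTITY]-(a:Article)
RETURN e.name AS entity, labels(e) AS labels, count(a) AS articles
ORDER BY articles DESC
LIMIT 10;
```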

We can also use the entities that pairs of articles have in common to determine article similarity. If we want to find articles similar to Gary Lineker’s video about Maradona, we could write the following query:

AWS
MATCH (a1:Article {title: "Gary Lineker: Maradona was 'like a messiah' in Argentina – video"})
MATCH (a1:Article)-[:AWS_ENTITY]-(entity)<-[:AWS_ENTITY]-(a2:Article)
RETURN a2.title AS otherArticle, collect(entity.name) AS entities
ORDER BY size(entities) DESC
LIMIT 5;
Table 7. Results
otherArticle entities

"Remembering Diego Maradona: football legend dies aged 60 – video obituary"

["World Cup", "Diego Maradona", "Buenos Aires", "Napoli", "Maradona", "Argentina", "Barcelona"]

"Classic YouTube | Diego Armando Maradona, nerveless kicking and Football Manager kids"

["England", "World Cup", "Diego Maradona", "Argentina", "Maradona", "Lineker", "Gary Lineker"]

"Fans in Argentina and Naples mourn death of Diego Maradona – video"

["World Cup", "Diego Maradona", "Buenos Aires", "Napoli", "Argentina", "Maradona"]

"Burdened by genius: Maradona reminds us how peaking young brings its problems | Vic Marks"

["World Cup", "Diego Maradona", "Mexico", "Maradona", "Argentina"]

"A tribute to Diego Maradona – Football Weekly"

["World Cup", "Diego Maradona", "Buenos Aires", "Mexico", "Twitter"]

GCP
MATCH (a1:Article {title: "Gary Lineker: Maradona was 'like a messiah' in Argentina – video"})
MATCH (a1:Article)-[:GCP_ENTITY]-(entity)<-[:GCP_ENTITY]-(a2:Article)
RETURN a2.title AS otherArticle, collect(entity.name) AS entities
ORDER BY size(entities) DESC
LIMIT 5;
Table 8. Results
otherArticle entities

"Remembering Diego Maradona: football legend dies aged 60 – video obituary"

["Barcelona", "Argentina", "Napoli", "Buenos Aires"]

"Classic YouTube | Diego Armando Maradona, nerveless kicking and Football Manager kids"

["Argentina", "England", "Gary Lineker"]

"Diego Maradona’s personal doctor denies responsibility for death"

["home", "footballer", "Buenos Aires"]

"Ageless Zlatan Ibrahimovic continues to take care of business for Milan | Nicky Bandini"

["home", "Napoli"]

"'He took us all to heaven': football fans react to Diego Maradona’s death"

["Argentina", "Buenos Aires"]

Azure
MATCH (a1:Article {title: "Gary Lineker: Maradona was 'like a messiah' in Argentina – video"})
MATCH (a1:Article)-[:AZURE_ENTITY]-(entity)<-[:AZURE_ENTITY]-(a2:Article)
RETURN a2.title AS otherArticle, collect(entity.name) AS entities
ORDER BY size(entities) DESC
LIMIT 5;
Table 9. Results
otherArticle entities

"Remembering Diego Maradona: football legend dies aged 60 – video obituary"

["FC Barcelona", "Maradona", "S.S.C. Napoli", "Napoli", "Argentina", "Buenos Aires", "Diego Maradona"]

"Classic YouTube | Diego Armando Maradona, nerveless kicking and Football Manager kids"

["Gary Lineker", "Argentina national football team", "Sheffield Wednesday F.C.", "England", "England national football team", "Argentina", "Diego Maradona"]

"A tribute to Diego Maradona – Football Weekly"

["Sheffield Wednesday F.C.", "Maradona", "Mexico", "Twitter", "Buenos Aires", "Diego Maradona"]

"Fans in Argentina and Naples mourn death of Diego Maradona – video"

["Maradona", "S.S.C. Napoli", "Argentina", "Buenos Aires", "Diego Maradona"]

"It was Maradona’s defiance that most inspired me | Kenan Malik"

["Argentina national football team", "England", "England national football team", "Diego Maradona"]

Some of the entities that these articles have in common don’t really make sense (e.g. "home" or "Twitter"), but in general the articles would be good candidates to go in a 'recommended articles' section for the Gary Lineker article.
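
Ranking by the raw number of shared entities tends to favour articles that simply mention many entities. One possible refinement, not part of the queries above, is to normalise the overlap with a Jaccard score: shared entities divided by the size of the union of both articles’ entity sets. The following is a sketch for the AWS relationships, using only plain Cypher:

```cypher
// Rank similar articles by Jaccard similarity of their AWS entity sets
MATCH (a1:Article {title: "Gary Lineker: Maradona was 'like a messiah' in Argentina – video"})
MATCH (a1)-[:AWS_ENTITY]->(entity)<-[:AWS_ENTITY]-(a2:Article)
WITH a1, a2, count(entity) AS shared
WITH a2, shared,
     size([(a1)-[:AWS_ENTITY]->(e1) | e1]) AS a1Entities,
     size([(a2)-[:AWS_ENTITY]->(e2) | e2]) AS a2Entities
RETURN a2.title AS otherArticle,
       shared,
       // union size = |A| + |B| - |A ∩ B|
       toFloat(shared) / (a1Entities + a2Entities - shared) AS jaccard
ORDER BY jaccard DESC
LIMIT 5;
```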

If we wanted to filter the extracted entities further, we could try including only those entities that have a Wikipedia entry. For more details on that approach, see Tutorial: Build a Knowledge Graph using NLP and Ontologies.
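
A lighter-weight filter is to use the score that the build query stored on each relationship. The sketch below keeps only AWS entities whose score is at least 0.9 on both sides. The threshold is an arbitrary value chosen for illustration and would need tuning against real results.

```cypher
// Similar articles, counting only shared entities with a high AWS score on both relationships
MATCH (a1:Article {title: "Gary Lineker: Maradona was 'like a messiah' in Argentina – video"})
MATCH (a1)-[r1:AWS_ENTITY]->(entity)<-[r2:AWS_ENTITY]-(a2:Article)
WHERE r1.score >= 0.9 AND r2.score >= 0.9
RETURN a2.title AS otherArticle, collect(entity.name) AS entities
ORDER BY size(entities) DESC
LIMIT 5;
```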