However, there are tools that could help you avoid manual labor and extract those insights automatically. I am, of course, talking about various NLP tools and services.
In this blog post, I will present a solution that combines the power of NLP with knowledge graphs to automatically extract valuable insights from relevant articles.
There are multiple use cases where this solution would be applicable. For example, you could create a business monitoring tool to survey what the internet says about your own company or perhaps your competitors. If you are an investor, you could identify potential investments by analyzing the news revolving around the companies or cryptocurrencies you might be interested in.
Not only that, but you could feed the extracted information to a machine learning model to help you spot great investments in either stocks or crypto.
I am sure there are more applications I haven’t thought of yet.
In my previous post, I’ve already alluded to how you could manually develop your data pipeline.
Making Sense of News, the Knowledge Graph Way
While it could take months to build a data pipeline that effectively crawls internet articles and processes them with NLP models, I’ve found that Diffbot offers a solution that could get you there in a matter of hours.
Diffbot | Knowledge Graph, AI Web Data Extraction and Crawling
Diffbot has a mission of constructing the world’s largest knowledge graph by crawling and analyzing the entire internet. It then provides API endpoints to search and retrieve relevant articles or topics. If you wanted to, you could also use their APIs for data enrichment, as their knowledge graph contains information about various organizations, people, products, and more.
On top of that, they also offer the Natural Language Processing API that extracts entities and relationships from the text. If you have read any of my previous blog posts, you already know that we will use Neo4j, a native graph database, to store and analyze the extracted information.
Agenda
- Retrieve articles that talk about cryptocurrency
- Translate foreign articles with Google Translate API
- Import articles into Neo4j
- Extract entities and facts with Diffbot’s NLP API
- Import entities and facts into Neo4j
- Graph analysis
I’ve prepared a Jupyter Notebook that contains the code to reproduce the steps in this article.
blogs/DiffbotNLP + Neo4j.ipynb at master · tomasonjo/blogs
Retrieve Articles About Cryptocurrencies
As mentioned, we will use the Diffbot APIs to retrieve articles that talk about cryptocurrencies. If you want to follow along with this post, you can create a free trial account on their page, which should be enough to complete all the steps presented here. Once you log in to their portal, you can explore their visual query builder interface and inspect what is available.
A lot of data is available through Diffbot’s Knowledge Graph API. Not only can you search for various articles, but you can also use the KG APIs to retrieve information about organizations, products, persons, jobs, and more.
This example will retrieve the latest 5000 articles with the tag label Cryptocurrency.
I have constructed the search query in their visual builder and simply copied it to my Python script. That’s all the code required to fetch any number of articles that are relevant to your use case.
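As a minimal sketch of how such a request could be assembled, consider the snippet below. The endpoint URL, the parameter names, and the DQL string are assumptions for illustration, not the query copied from the visual builder, so verify them against what the builder generates for your account.

```python
from urllib.parse import urlencode

# Assumed endpoint of Diffbot's Knowledge Graph search (DQL) API.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def build_article_search_url(token, tag="Cryptocurrency", size=5000):
    """Build a GET URL that searches for the latest articles with a tag label."""
    # Hypothetical DQL string; copy the real one from the visual query builder.
    query = f'type:Article tags.label:"{tag}" revSortBy:date'
    params = {"type": "query", "token": token, "query": query, "size": size}
    return f"{DQL_ENDPOINT}?{urlencode(params)}"


# The token is a placeholder; use the one from your Diffbot dashboard.
url = build_article_search_url("YOUR_DIFFBOT_TOKEN")
print(url)
```

The resulting URL can be fetched with any HTTP client. Inspect the JSON response before relying on specific field names.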
Translate Foreign Articles with Google Translate API
The retrieved articles are from all over the world and in many languages. In the next step, you will use Google Translate API to translate them to English. You will need to enable the Google Translate API and create an API key.
Make sure to check their pricing, as using the translation API ended up costing me a bit more than expected. I’ve checked pricing on other sites, and it is usually between $15 and $20 to translate a million characters.
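To get a feeling for the bill before you start, you can estimate the cost from the character count. The sketch below uses the $15 to $20 per million characters range mentioned above; the corpus of 5000 articles with 4000 characters each is a made-up assumption, so plug in the lengths of your own articles.

```python
def estimate_translation_cost(texts, usd_per_million_chars):
    """Return the estimated cost in USD of translating all given texts."""
    total_chars = sum(len(text) for text in texts)
    return total_chars / 1_000_000 * usd_per_million_chars


# Hypothetical corpus: 5000 articles of 4000 characters each.
articles = ["x" * 4000] * 5000
low = estimate_translation_cost(articles, 15)
high = estimate_translation_cost(articles, 20)
print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

With these illustrative numbers, the estimate comes to between $300 and $400. Skipping articles that are already in English lowers the real figure.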
Before we move on to the NLP extraction part, we will import the articles into Neo4j.
Import Articles into Neo4j
I suggest you either download Neo4j Desktop or use the free Neo4j AuraDB cloud instance, which should be enough to store information about these 5000 articles. First of all, we have to define the connection to the Neo4j instance.
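A connection along these lines could look like the following sketch. It assumes the official `neo4j` Python driver is installed; the URI and credentials are placeholders, and the `connect` and `run_query` helper names are hypothetical conveniences for this post.

```python
def connect(uri, user, password):
    """Create a Neo4j driver; requires the official `neo4j` Python package."""
    # Imported lazily so that run_query below has no hard dependency.
    from neo4j import GraphDatabase

    return GraphDatabase.driver(uri, auth=(user, password))


def run_query(driver, query, params=None):
    """Run a Cypher query and return the result rows as a list of dicts."""
    with driver.session() as session:
        result = session.run(query, params or {})
        return [record.data() for record in result]


# Example usage (placeholder credentials, adjust to your instance):
# driver = connect("bolt://localhost:7687", "neo4j", "password")
# print(run_query(driver, "RETURN 1 AS ok"))
```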
The imported graph model will have the following schema.
We have some metadata around articles. For example, we know the overall sentiment of the article and when it was published. In addition, for most of the articles, we know who wrote them and on which site they were published. Lastly, the Diffbot API also returns the categories of an article.
Before continuing, we will define unique constraints in Neo4j, which will speed up the import and subsequent queries.
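As a rough sketch, the constraint statements could be generated like this. The Article and Entity labels appear in the queries later in this post, while the other labels and all key properties are assumptions you should adapt to your own model. The `FOR ... REQUIRE` syntax is the one used by recent Neo4j versions; older versions use `ON ... ASSERT` instead.

```python
# Assumed labels and key properties; adapt them to your own graph model.
UNIQUE_KEYS = {
    "Article": "id",
    "Entity": "id",
    "Site": "name",
    "Author": "name",
    "Category": "name",
}


def constraint_statements(unique_keys):
    """Build one CREATE CONSTRAINT statement per (label, property) pair."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        for label, prop in unique_keys.items()
    ]


for statement in constraint_statements(UNIQUE_KEYS):
    print(statement)
```

Each printed statement can then be executed once against the database before the import starts.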
Now we can go ahead and import articles into Neo4j.
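A simplified sketch of such an import is shown below. It covers only the article and its site, the keys read from the Diffbot response (such as `siteName`) are hypothetical, and `session` is assumed to be a Neo4j driver session, so treat it as an illustration of the batched `UNWIND` pattern rather than the full import.

```python
IMPORT_ARTICLES = """
UNWIND $rows AS row
MERGE (a:Article {id: row.id})
SET a.title = row.title,
    a.language = row.language,
    a.sentiment = row.sentiment,
    a.date = row.date
WITH a, row
WHERE row.site IS NOT NULL
MERGE (s:Site {name: row.site})
MERGE (a)-[:ON_SITE]->(s)
"""


def to_row(article):
    """Flatten one article dict into query parameters (keys are assumptions)."""
    return {
        "id": article["id"],
        "title": article.get("title"),
        "language": article.get("language"),
        "sentiment": article.get("sentiment"),
        "date": article.get("date"),  # assumed to be an ISO 8601 string
        "site": article.get("siteName"),
    }


def import_articles(session, articles, batch_size=1000):
    """Send the articles to Neo4j in batches, one query per batch."""
    rows = [to_row(article) for article in articles]
    for start in range(0, len(rows), batch_size):
        session.run(IMPORT_ARTICLES, {"rows": rows[start:start + batch_size]})
```

Passing the rows as a parameter and unwinding them inside the query keeps the number of round trips low, and the unique constraints make the `MERGE` lookups fast.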
I won’t go into much detail explaining how the above Cypher query works. If you are interested, multiple blog posts offer an introduction to Cypher and graph imports. There is also a GraphAcademy course about importing data that covers the basics.
Take the Importing CSV Data into Neo4j course with Neo4j GraphAcademy
We can examine a single article to verify the graph schema model.
Before we move on to the analysis part of the post, we will use the NLP API to extract entities and relationships, or as Diffbot calls them, facts.
The Diffbot website offers an online NLP demo, where you can input any text and evaluate the results. I’ve input a sample content of an article we have just imported into Neo4j.
The NLP API will identify all the entities that appear in the text and the possible relationships between them, and the demo can even display the result as a graph. In this example, we can see that Jack Dorsey is the CEO of Block, which is based in San Francisco and deals with payments and mining. Jack’s coworker at Block is Thomas Templeton, who has a background in computer hardware.
To process the entities in the response and store them in Neo4j, we will use the following code:
This example will import only entities that have an allowed type, such as organization, person, product, or location, and a confidence level greater than 0.7. Diffbot’s NLP API also features entity linking, where entities are linked to Wikipedia, Crunchbase, or LinkedIn, as far as I have seen. We also add the extra entity types as additional labels to the Entity node.
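The filtering rule described above can be sketched as follows. The response field names (`name`, `confidence`, `allTypes`) are assumptions about the NLP API output, and the sample entities with their confidence values are made up for illustration.

```python
ALLOWED_TYPES = {"organization", "person", "product", "location"}
MIN_CONFIDENCE = 0.7


def select_entities(entities, allowed=ALLOWED_TYPES, min_confidence=MIN_CONFIDENCE):
    """Keep confident entities that have at least one allowed type."""
    selected = []
    for entity in entities:
        types = {t["name"] for t in entity.get("allTypes", [])}
        if entity.get("confidence", 0) > min_confidence and types & allowed:
            # All types of a kept entity become extra node labels, e.g. Person.
            labels = sorted(t.title().replace(" ", "") for t in types)
            selected.append({"name": entity["name"], "labels": labels})
    return selected


# Made-up sample in the assumed response shape.
sample = [
    {"name": "Jack Dorsey", "confidence": 0.99, "allTypes": [{"name": "person"}]},
    {"name": "payments", "confidence": 0.95, "allTypes": [{"name": "skill"}]},
    {"name": "Block", "confidence": 0.55, "allTypes": [{"name": "organization"}]},
]
print(select_entities(sample))
```

Only Jack Dorsey survives here: payments has no allowed type, and Block falls below the confidence threshold.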
Next, we have to prepare the function that will clean and import relationships into Neo4j.
I have omitted the import of the properties that are defined in the skipProperties list. To me, it makes more sense to store them as node properties rather than relationships between entities. However, in this example, we will simply ignore them during import.
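A possible shape for the cleaning step is sketched below. The entries of the skip list and the fact field names (`entity`, `property`, `value`) are assumptions, since only the existence of a skipProperties list is stated above. The sample facts are also made up.

```python
import re

# Hypothetical entries; adjust them to the properties you want to ignore.
SKIP_PROPERTIES = {"gender", "date of birth", "age"}


def to_relationship_type(property_name):
    """Turn a property name such as 'acquired by' into ACQUIRED_BY."""
    return re.sub(r"[^A-Za-z0-9]+", "_", property_name).strip("_").upper()


def select_relationships(facts, skip=SKIP_PROPERTIES):
    """Drop skipped properties and map the rest to relationship rows."""
    relationships = []
    for fact in facts:
        prop = fact["property"]["name"]
        if prop in skip:
            continue
        relationships.append({
            "source": fact["entity"]["name"],
            "type": to_relationship_type(prop),
            "target": fact["value"]["name"],
        })
    return relationships


# Made-up sample in the assumed response shape.
facts = [
    {"entity": {"name": "Jack Dorsey"}, "property": {"name": "employee or member of"},
     "value": {"name": "Block"}},
    {"entity": {"name": "Jack Dorsey"}, "property": {"name": "gender"},
     "value": {"name": "male"}},
]
print(select_relationships(facts))
```

Sanitizing the type matters because relationship types generally cannot be passed as ordinary Cypher parameters, so they tend to end up interpolated into the query string.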
Now that we have the functions for importing entities and relationships prepared, we can go ahead and process the articles. You can send multiple articles in a single request. I’ve chosen to batch the requests by 50 pieces of content.
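The batching itself is plain Python and could be sketched like this. The payload keys (`content`, `lang`, `format`) are assumptions about the request body the NLP API expects, so check them against the API reference before sending anything.

```python
def batches(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_nlp_payload(texts):
    """Build one request body for a batch of texts (assumed key names)."""
    return [{"content": text, "lang": "en", "format": "plain text"} for text in texts]


# Hypothetical corpus of 5000 translated article texts.
texts = [f"article {i}" for i in range(5000)]
payloads = [build_nlp_payload(batch) for batch in batches(texts, 50)]
print(len(payloads))
```

For 5000 articles and a batch size of 50, this yields 100 requests instead of 5000.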
By following these steps you have successfully constructed a knowledge graph in Neo4j. For example, we can visualize the neighborhood of Jack Dorsey.
The NLP extraction picked up that Jack Dorsey is the CEO of Block and has working relationships with Alex Morcos, Martin White, etc. Of course, not all extracted information is perfect.
I find it funny that the NLP identified Elon Musk as an employee of Dogecoin, which is not that far from the truth anyhow. I haven’t played around with confidence levels of facts, but you could increase the threshold to reduce the noise. However, this is a game between precision and recall.
This is just a sample subgraph. It is hard to decide what exactly to show as there is so much information available.
Graph Analytics
In the last part of this post, I will walk you through some example applications that you could use with a knowledge graph like this. First, we will evaluate the timeline of the articles.
MATCH (a:Article)
RETURN date(a.date) AS date,
count(*) AS count
ORDER BY date DESC
LIMIT 10
Results
There are between 150 and 450 articles per day about cryptocurrencies around the world, which backs my initial statement that the volume is too much to read. Next, we will evaluate which entities are most frequently mentioned in articles.
MATCH (e:Entity)
RETURN e.name AS entity,
size((e)<-[:MENTIONS]-()) AS articles
ORDER BY articles DESC
LIMIT 5
Results
As you would expect from articles revolving around cryptocurrencies, the most frequently mentioned entities are:
- cryptocurrency
- bitcoin
- Ethereum
- blockchain
The sentiment is available on the article level as well as entity level. For example, we can examine the sentiment regarding bitcoin grouped by region.
MATCH (e:Entity {name:'bitcoin'})<-[m:MENTIONS]-()-[:ON_SITE]->()-[:HAS_REGION]->(region)
WITH region.name AS region, m.sentiment AS sentiment
RETURN region, avg(sentiment) AS avgSentiment,
stdev(sentiment) AS stdSentiment,
max(sentiment) AS maxSentiment,
min(sentiment) AS minSentiment,
count(*) AS articles
ORDER BY articles DESC
LIMIT 5
Results
The sentiment is positive on average, but judging by the standard deviation values, it fluctuates heavily between articles. We could explore bitcoin sentiment further. Instead, we will examine which persons mentioned in North American articles combine the highest, and the lowest, average sentiment with the largest number of articles.
MATCH (e:Person)<-[m:MENTIONS]-()-[:ON_SITE]->()-[:HAS_REGION]->(region)
WHERE region.name = "North America"
RETURN e.name AS entity,
count(*) AS articles,
avg(m.sentiment) AS sentiment
ORDER BY sentiment * articles DESC
LIMIT 5
UNION
MATCH (e:Person)<-[m:MENTIONS]-()-[:ON_SITE]->()-[:HAS_REGION]->(region)
WHERE region.name = "North America"
RETURN e.name AS entity,
count(*) AS articles,
avg(m.sentiment) AS sentiment
ORDER BY sentiment * articles ASC
LIMIT 5
Results
Now, we can explore the titles of articles in which, for example, Mark Cuban appears.
MATCH (site)<-[:ON_SITE]-(a:Article)-[m:MENTIONS]->(e:Entity {name: 'Mark Cuban'})
RETURN a.title AS title,
a.language AS language,
m.sentiment AS sentiment,
site.name AS site
ORDER BY sentiment DESC
LIMIT 5
Results
While the titles themselves might not be the most descriptive, we can also examine which other entities frequently co-occur in articles where Mark Cuban is mentioned.
MATCH (o:Entity)<-[:MENTIONS]-(a:Article)-[m:MENTIONS]->(e:Entity {name: 'Mark Cuban'})
WITH o, count(*) AS countOfArticles
ORDER BY countOfArticles DESC
LIMIT 5
RETURN o.name AS entity, countOfArticles
Results
Not surprisingly, various crypto tokens are present. The Dallas Mavericks, the NBA club that Mark owns, also appear. Whether the Dallas Mavericks support crypto, or reporters simply like to state that Mark owns the team, I don’t know. You could proceed down that route of analysis, but here, we’ll also look at what facts we extracted during NLP processing.
MATCH p=(e:Entity {name: "Mark Cuban"})--(:Entity)
RETURN p
Results
Next, we will quickly evaluate the article titles where Floyd Mayweather appears, as the average sentiment is quite low.
MATCH (a:Article)-[m:MENTIONS]->(e:Entity {name: 'Floyd Mayweather'})
RETURN a.title AS title, a.language AS language, m.sentiment AS sentiment
ORDER BY sentiment ASC
LIMIT 5
Results
It seems that Kim Kardashian and Floyd Mayweather are being sued over an alleged crypto scam. The NLP processing also identifies various tokens and stock tickers, so we can analyze which are popular at the moment and their sentiment.
MATCH (e:Entity)<-[m:MENTIONS]-()
WHERE (e)<-[:STOCK_SYMBOL]-()
RETURN e.name AS stock,
count(*) as mentions,
avg(m.sentiment) AS averageSentiment,
min(m.sentiment) AS minSentiment,
max(m.sentiment) AS maxSentiment
ORDER BY mentions DESC
LIMIT 5
Results
I have only scratched the surface of the available insights we could extract. To finish, I’ll add two visualizations and mention some applications you could develop.
For example, you could analyze the market by looking at relationships like COMPETITORS, ACQUIRED_BY, SUPPLIERS, etc. On the other hand, you could focus your analysis more on the persons in the graph and evaluate their influence or connections.
Conclusion
I’ve only scratched the surface of the analysis that is possible with this type of data pipeline. As mentioned, you could monitor the news regarding your company, your competitors, the whole industry, or even try to predict future events like acquisitions. Not only that, but you could also use extracted data to fuel your machine learning models for your desired use case, like predicting crypto trends.
As always, the code is available on GitHub.
Monitoring the Cryptocurrency Space with NLP and Knowledge Graphs was originally published in Neo4j Developer Blog on Medium.