Using Neo4j Fabric for Scalable Fraud Detection on Graphs

In the banking world, dealing with huge amounts of data is an everyday reality. Graph databases have proven immensely valuable in mapping out and predicting fraudulent transactions. However, as databases grow to tens or even hundreds of billions of entities, building a scaled-out data infrastructure that still responds in milliseconds becomes nontrivial.

A large bank in the European Union faced exactly this issue. As its database grew, batch loading of incoming transactions became increasingly time-consuming, to the point where the bank decided to scale its single Neo4j cluster out across multiple clusters. This is where Neo4j Fabric comes in.

Neo4j allows a very large graph database to be divided into a set of smaller databases, called shards. Each shard is a separate database that can reside on any server in the cluster. Conveniently, you can create a Neo4j Fabric composite database to run federated queries across the shards, as if they were still a single database. Time-based data such as financial and accounting transactions lends itself extremely well to sharding. Partitioning by year or by financial quarter is a natural way to split a graph database of transactions, and it matches the bank’s archiving and retention policies for each database.
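
As a minimal sketch of partitioning by financial quarter, the Python snippet below maps a transaction date to the name of the shard that would hold it. The function name shard_for and the "tx_<year>_q<quarter>" naming scheme are illustrative assumptions, not names used by the bank or required by Neo4j.

```python
from datetime import date


def shard_for(tx_date: date) -> str:
    """Return the name of the quarterly shard for a transaction date.

    The "tx_<year>_q<quarter>" naming scheme is a hypothetical convention.
    """
    quarter = (tx_date.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, ...
    return f"tx_{tx_date.year}_q{quarter}"


if __name__ == "__main__":
    print(shard_for(date(2023, 2, 14)))   # tx_2023_q1
    print(shard_for(date(2023, 11, 30)))  # tx_2023_q4
```

Because the shard name depends only on the date, every new transaction lands in the current quarter's shard, and older shards can be archived as a unit.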

How exactly can sharding and Fabric help? For use cases with continual growth in transactions, splitting a 250-billion-relationship graph into five graphs of 50 billion relationships each has several advantages:

Workload distribution. For large analytical workloads, combining the power of five machines instead of one can give a significant performance boost, since queries run in parallel across all shards.
Easier batch operations. Common database techniques such as daily differential loads become easier and faster to execute because the daily transaction load can go into its own shard.
Easier maintenance. It is much easier to perform operational tasks like backup and restore in parallel across smaller shards than against a single multi-terabyte database.
Lower infrastructure cost. Sharding lets you put older, infrequently accessed data on cheaper hardware while keeping your most recent data on the best hardware available.
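
The workload-distribution idea can be sketched in a few lines of Python. This is a simulation under stated assumptions, not Fabric itself: the shards are hypothetical in-memory lists standing in for separate Neo4j databases, and the function names are invented for illustration. The same filter is sent to every shard in parallel and the partial results are merged, which is the fan-out pattern a federated query follows.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for shards: shard name -> list of
# (account_id, amount) transactions. Real shards would be Neo4j databases.
SHARDS = {
    "tx_2021": [("acc-1", 120.0), ("acc-2", 9500.0)],
    "tx_2022": [("acc-1", 15000.0), ("acc-3", 40.0)],
    "tx_2023": [("acc-2", 12000.0), ("acc-3", 75.0)],
}


def large_transactions(shard_name: str, threshold: float) -> list:
    """Run the same filter against a single shard."""
    return [
        (shard_name, account, amount)
        for account, amount in SHARDS[shard_name]
        if amount >= threshold
    ]


def federated_query(threshold: float) -> list:
    """Fan the query out to every shard in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        per_shard = pool.map(
            lambda name: large_transactions(name, threshold), SHARDS
        )
        # Flatten the per-shard results into one sorted list.
        return sorted(row for rows in per_shard for row in rows)


if __name__ == "__main__":
    for row in federated_query(10000.0):
        print(row)
```

Each shard only scans its own slice of the data, so the wall-clock time is bounded by the slowest shard rather than by the size of the whole graph.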

For the bank, sharding and Fabric have allowed the graph to keep growing as the number of transactions grows, and its knowledge graph to become steadily richer. Finding fraud is often like looking for a needle in a haystack. But as haystacks become hay mountains, specialized tools like Fabric give organizations with perennially growing data the edge they need.

While sharding a very large database provides compelling benefits, for a graph database like Neo4j it is important to think about your sharding strategy up front. Different situations require different strategies, since you do not want your graph data to be randomly distributed across the shards: data that is usually queried together should live in the same shard.

In this use case the bank chose a sharding strategy based on time window (good for analytics and event-based systems like transactions). Other choices include sharding by logical domain entity such as Customer ID (good for SaaS and multi-tenant applications), or by geographical location (good for social networks, sensor networks, and more). Once that important decision is made, you’ll be pleasantly surprised by how easy it is to distribute a graph database across a scaled-out infrastructure.
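
To make the two alternative strategies concrete, here is a hedged Python sketch of what their shard keys might look like. Everything in it is an assumption for illustration: the shard names, the country-to-region mapping, and the choice of a CRC32 hash for spreading customers evenly are not taken from the bank's setup or from Neo4j.

```python
import zlib


def shard_by_customer(customer_id: str, shard_count: int) -> str:
    """Assign a customer to a shard using a stable hash of the ID."""
    # zlib.crc32 gives the same value on every run, unlike Python's
    # built-in hash() for strings, which is randomized per process.
    bucket = zlib.crc32(customer_id.encode("utf-8")) % shard_count
    return f"customers_{bucket}"


# Hypothetical mapping of country codes to regional shards.
REGION_SHARDS = {"DE": "eu_central", "FR": "eu_west", "NL": "eu_west"}


def shard_by_region(country_code: str, default: str = "eu_other") -> str:
    """Assign a record to a shard by geographical location."""
    return REGION_SHARDS.get(country_code.upper(), default)


if __name__ == "__main__":
    print(shard_by_customer("customer-42", 5))
    print(shard_by_region("nl"))
```

In both cases the key is derived from the entity itself, so all data for one customer or one region stays in a single shard instead of being scattered.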

If you have very large graphs that need optimization or data silos that need unification, email niels.dejong@neo4j.com or learn about services offered by Neo4j. Also, learn more about Neo4j 5, which contains important enhancements that make Fabric easier to create and operate.
