Find Similar Patient Journeys With Neo4j Aura Graph Analytics

Corydon Baylor

Sr. Manager, Technical Product Marketing, Neo4j

The future of healthcare is personal. Every year, the United States spends $1.5 trillion on chronic conditions. And 72 percent of patients say they want more personalized care — and believe technology can help get us there.

Mapping a patient’s path through the healthcare system is fundamentally a graph problem. The system itself is complex and fragmented, with patients bouncing between siloed providers, labs, specialists, and facilities. Most of the time, only the insurer has a complete view of the journey.

In this blog, we’ll look at how to model a patient’s journey through the healthcare system. To keep things focused, we’ll narrow our scope to patients with kidney disease. Then we’ll use node similarity to identify patients with similar care paths, and group them into communities using the Louvain method.

We’ll be working in a Python notebook hosted using Google Colab. You can find everything you need to follow along on our GitHub.

Setting Up the Environment

First, install the packages:

!pip install graphdatascience==1.15a2
!pip install --upgrade numpy

Next, import the packages:

import pandas as pd
from google.colab import userdata

You need an Aura account with Graph Analytics enabled to follow along. Let’s load in our credentials for Aura now:

CLIENT_ID = userdata.get("CLIENT_ID")
CLIENT_SECRET = userdata.get("CLIENT_SECRET")
TENANT_ID = userdata.get("TENANT_ID")

Then we create a session:

from graphdatascience.session import GdsSessions, AuraAPICredentials, AlgorithmCategory, CloudLocation
from datetime import timedelta

sessions = GdsSessions(api_credentials=AuraAPICredentials(CLIENT_ID, CLIENT_SECRET, TENANT_ID))

name = "my-new-session-sm"
memory = sessions.estimate(
    node_count=20,
    relationship_count=50,
    algorithm_categories=[AlgorithmCategory.CENTRALITY, AlgorithmCategory.NODE_EMBEDDING],
)
cloud_location = CloudLocation(provider="gcp", region="europe-west1")

gds = sessions.get_or_create(
    session_name=name,
    memory=memory,
    ttl=timedelta(hours=5),
    cloud_location=cloud_location,
)

Taking a Look at the Data

We’ll use Synthea to generate mock healthcare data. Our goal is to model patient similarity, so we can see if there’s an ideal patient plan for similar patients.

We’ll look at patients and the procedures they undergo. One thing we need to change is the patient ID, because it contains non-numeric characters:

patients = pd.read_csv("Patients.csv")
patients.head()

procedures = pd.read_csv("Procedures.csv")
procedures.head()

Converting ID to Numeric

Neo4j Aura Graph Analytics doesn’t support IDs that contain non-numeric characters, so we’ll create numeric IDs for the patients in procedures. But first, we need to ensure that the new patient IDs don’t collide with the procedure codes. One way to do that is to make the patient IDs longer than any procedure code. Let’s see how long the IDs in the CODE column of the procedures dataframe are:

all_same_length = procedures['CODE'].astype(str).str.len().nunique() == 1
procedures['CODE'].astype(str).str.len().value_counts()

Then we’ll create IDs that are a bit longer and don’t have any leading zeros:

import pandas as pd

# Get unique patient IDs
unique_patients = procedures['PATIENT'].unique()

# Use pure Python integers to generate 19-digit IDs
start_value = 10**18  # 19 digits, so the ID never starts with 0
numeric_ids = [start_value + i for i in range(len(unique_patients))]

# Create mapping
patient_id_map = pd.Series(numeric_ids, index=unique_patients, dtype='object')

# Apply mapping
procedures['PATIENT2'] = procedures['PATIENT'].map(patient_id_map)

Next, we need to ensure that PATIENT in patients and PATIENT2 in procedures have the same ID:

patients['PATIENT'] = patients['ID'].map(patient_id_map)
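
To make the collision argument concrete, here is a minimal sketch of the same idea in plain Python, without pandas. The sample patient IDs, the procedure codes, and the helper name build_numeric_id_map are made up for illustration and are not part of the notebook. Starting at 10**18 gives every generated ID 19 digits, which still fits in a signed 64-bit integer.

```python
# Hypothetical sample data: string patient IDs (with a repeat) and numeric procedure codes.
patient_ids = ["a1b2-c3d4", "e5f6-a7b8", "a1b2-c3d4", "c9d0-e1f2"]
procedure_codes = [430193006, 265764009, 710824005]


def build_numeric_id_map(ids, start_value=10**18):
    """Assign each distinct string ID a unique integer, in first-seen order."""
    unique_ids = list(dict.fromkeys(ids))  # de-duplicate while preserving order
    return {pid: start_value + i for i, pid in enumerate(unique_ids)}


id_map = build_numeric_id_map(patient_ids)

# Every generated ID is longer than the longest procedure code in the sample,
# so a patient ID can never equal a procedure code.
longest_code = max(len(str(code)) for code in procedure_codes)
shortest_id = min(len(str(value)) for value in id_map.values())
collisions = set(id_map.values()) & set(procedure_codes)

print(id_map)
print(longest_code, shortest_id, collisions)
```

The same check can be run against the real dataframes by comparing the generated IDs with the values in the CODE column.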

Prepping for graph.construct

Now that we have valid identifiers, we’ll create our node and relationship tables. In our example, the node table will include patients and their procedures, and the relationship table will map our patient nodes to our procedure nodes.

Let’s start by creating a dataframe that only contains the IDs for patients who had kidney disease. We need to do some mild cleanup to make sure everything has the right names.

For the dataframe representing nodes:

  • The first column should be called nodeId

For the dataframe representing relationships:

  • Columns called sourceNodeId and targetNodeId
  • A column called relationshipType that names the relationship

In order to filter down to just patients who had kidney disease, we need to filter based on certain disease codes:

# Kidney-related reason codes
kidney_disease_codes = {431857002, 46177005, 161665007, 698306007}

# Filter procedures for kidney-related reasons
kidney_procedures = procedures[procedures['REASONCODE'].isin(kidney_disease_codes)]

# Extract unique patient IDs
kidney_patient_ids = kidney_procedures['PATIENT2'].unique()
kidney_patients_vw = pd.DataFrame({'nodeId': kidney_patient_ids})

We’ll do the same for procedures. This time, we’re just looking for procedures that kidney patients have undergone, regardless of whether those procedures were for kidney disease:

# Filter all procedures done by kidney patients
kidney_patient_procedures = procedures[procedures['PATIENT2'].isin(kidney_patient_ids)]

# Extract unique procedure codes
kidney_patient_procedures_vw = pd.DataFrame({
    'nodeId': kidney_patient_procedures['CODE'].unique()
})

Now, we create a relationship dataframe that represents the relationship between the kidney patients and all the procedures they’ve had. This will be the relationship used in the bipartite graph projection for Jaccard similarity:

# Create patient-to-procedure relationship pairs
kidney_patient_procedure_relationship = kidney_patient_procedures[['PATIENT2', 'CODE']].drop_duplicates()

# Rename columns for graph semantics
relationships = kidney_patient_procedure_relationship.rename(
    columns={'PATIENT2': 'sourceNodeId', 'CODE': 'targetNodeId'}
)

Finally, we combine the NodeIds for patients and procedures into one dataframe called nodes:

nodes = pd.concat([kidney_patients_vw, kidney_patient_procedures_vw], ignore_index=True)
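
Before projecting, it can help to confirm that the two tables agree with each other. The sketch below is one possible sanity check in plain Python. The helper check_graph_tables and the sample IDs are hypothetical, not part of the notebook or the graphdatascience library. It flags duplicate node IDs and relationship endpoints that are missing from the node table.

```python
def check_graph_tables(node_ids, relationships):
    """Return a list of problems found in node and relationship tables.

    node_ids: list of integer node IDs.
    relationships: iterable of (sourceNodeId, targetNodeId) pairs.
    """
    problems = []
    node_set = set(node_ids)
    if len(node_set) != len(node_ids):
        problems.append("duplicate nodeId values")
    for source, target in relationships:
        if source not in node_set:
            problems.append(f"unknown sourceNodeId {source}")
        if target not in node_set:
            problems.append(f"unknown targetNodeId {target}")
    return problems


# Hypothetical sample: two patients, two procedure codes.
sample_nodes = [10**18, 10**18 + 1, 430193006, 265764009]
sample_rels = [(10**18, 430193006), (10**18 + 1, 265764009), (10**18 + 1, 430193006)]

ok_result = check_graph_tables(sample_nodes, sample_rels)
bad_result = check_graph_tables(sample_nodes, sample_rels + [(10**18 + 2, 430193006)])
print(ok_result)
print(bad_result)
```

With the dataframes from this post, the inputs could be built with nodes['nodeId'].tolist() and relationships[['sourceNodeId', 'targetNodeId']].itertuples(index=False, name=None).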

Projecting a Graph and Running Patient Similarity

Next, we quickly create a graph using graph.construct:

graph_name = "patients"

if gds.graph.exists(graph_name)["exists"]:
    # Drop the graph if it exists
    gds.graph.drop(graph_name)
    print(f"Graph '{graph_name}' dropped.")

G = gds.graph.construct(graph_name, nodes, relationships)

Then we run node similarity and look at the results:

similarity = gds.nodeSimilarity.stream(
  G
)

similarity
node1                node2                similarity
1000000000000000021  1000000000000000781  0.900000
1000000000000000021  1000000000000000095  0.875000
1000000000000000021  1000000000000000687  0.812500
1000000000000000021  1000000000000000313  0.781250
1000000000000000021  1000000000000000748  0.777778
1000000000000001178  1000000000000000265  0.540541
1000000000000001178  1000000000000000096  0.527027
1000000000000001178  1000000000000001007  0.520000
1000000000000001178  1000000000000000914  0.513889
1000000000000001178  1000000000000000942  0.506849

This runs a pairwise comparison between patients to measure how similar they are based on the structure of the graph. In other words, patients who followed similar care paths will score higher — revealing patterns in how they move through the system.

What’s most interesting isn’t necessarily patients with a Jaccard coefficient of 1 but those who are nearly identical. These near matches can help us predict future procedures. If two patients share many of the same interventions, there’s a good chance that any procedures one has undergone, but the other hasn’t yet, may be needed down the line.
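
As a rough illustration of what the score means, here is a small hand-rolled Jaccard calculation on made-up data. The patient names and procedure codes are hypothetical, and this is a sketch of the metric itself, not of how Graph Analytics computes it internally. For the closest pair, it also lists the procedures one patient has had that the other has not.

```python
from itertools import combinations

# Hypothetical patient -> set of procedure codes (not taken from the Synthea data).
procedures_by_patient = {
    "patient_a": {101, 102, 103, 104},
    "patient_b": {101, 102, 103},
    "patient_c": {104, 105},
}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Score every unordered pair of patients.
scores = {
    (p, q): jaccard(procedures_by_patient[p], procedures_by_patient[q])
    for p, q in combinations(procedures_by_patient, 2)
}

# Most similar pair, and the procedures the first patient has had that the second has not.
(best_p, best_q), best_score = max(scores.items(), key=lambda item: item[1])
candidates = procedures_by_patient[best_p] - procedures_by_patient[best_q]
print(best_p, best_q, best_score, candidates)
```

Here patient_a and patient_b share three of four distinct procedures, so they score 0.75, and procedure 104 is the candidate for patient_b.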

Identifying Communities

Now that we have the similarity dataframe, we can use it to build a new graph projection. From there, we’ll run Louvain to see if meaningful patient communities emerge from our pairwise comparisons:

nodes_sim = pd.DataFrame(
    pd.unique(similarity[['node1', 'node2']].values.ravel()),
    columns=['nodeId']
)

# Create the relationships DataFrame
relationships_sim = similarity.rename(columns={
    'node1': 'sourceNodeId',
    'node2': 'targetNodeId',
    'similarity': 'weight'
})

Now, we create a new graph projection using the similarity scores:  

graph_name = "patients_sim"

if gds.graph.exists(graph_name)["exists"]:
    # Drop the graph if it exists
    gds.graph.drop(graph_name)
    print(f"Graph '{graph_name}' dropped.")

G = gds.graph.construct(graph_name, nodes_sim, relationships_sim)

Then we run Louvain against it. This allows us to bucket similar patients together into communities. From this, we can build out similar treatment programs for similar patients!

gds.louvain.stream(
  G
)
nodeId               communityId
1000000000000000021  0
1000000000000000781  0
1000000000000000095  0
1000000000000000687  0
1000000000000000313  0
1000000000000001147  0
1000000000000000730  16
1000000000000001037  16
1000000000000000735  0
1000000000000000896  16
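
Louvain searches for a partition with high modularity, a score that rewards communities with more internal connection than chance would predict. The sketch below computes weighted modularity for a toy graph in plain Python. The edges, weights, and the modularity helper are illustrative assumptions, not output from this notebook, and whether Louvain uses relationship weights depends on how the algorithm is configured.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Weighted modularity of a partition of an undirected graph.

    edges: list of (u, v, weight) tuples without self-loops.
    community_of: dict mapping each node to its community id.
    """
    total_weight = sum(w for _, _, w in edges)
    internal = defaultdict(float)  # edge weight fully inside each community
    degree = defaultdict(float)    # summed weighted degree of each community
    for u, v, w in edges:
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(
        internal[c] / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )


# Hypothetical similarity graph: two tight groups joined by one weak link.
toy_edges = [(1, 2, 0.9), (2, 3, 0.8), (1, 3, 0.7), (4, 5, 0.9), (3, 4, 0.1)]

split = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}   # two communities
single = {node: 0 for node in split}     # everything in one community

q_split = modularity(toy_edges, split)
q_single = modularity(toy_edges, single)
print(q_split, q_single)
```

The two-community split scores higher than lumping every patient together, which is the kind of improvement Louvain looks for as it merges nodes.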

Finally, we close the session and end our billing:

sessions.delete(session_name="my-new-session-sm")

What’s Next

Now that you’ve got a solid grasp on modeling patient journeys, head over to our GitHub repo for step-by-step instructions on how to do it yourself with Neo4j Aura Graph Analytics. You’ll find a Google Colab notebook, the full dataset, and everything you need to get started. 

Prefer working in Snowflake? You can run the same example there using Neo4j Graph Analytics for Snowflake.

Resources