Machine learning pipeline
This example is a simplified version of the Link Prediction pipeline described in the Machine learning section.
Create the graph
The following Cypher query creates the graph of a small social network in the Neo4j database.
CREATE
(alice:Person {name: 'Alice', age: 38}),
(michael:Person {name: 'Michael', age: 67}),
(karin:Person {name: 'Karin', age: 30}),
(chris:Person {name: 'Chris', age: 52}),
(will:Person {name: 'Will', age: 6}),
(mark:Person {name: 'Mark', age: 32}),
(greg:Person {name: 'Greg', age: 29}),
(veselin:Person {name: 'Veselin', age: 3}),
(alice)-[:KNOWS]->(michael),
(michael)-[:KNOWS]->(karin),
(michael)-[:KNOWS]->(chris),
(michael)-[:KNOWS]->(greg),
(will)-[:KNOWS]->(michael),
(will)-[:KNOWS]->(chris),
(mark)-[:KNOWS]->(michael),
(mark)-[:KNOWS]->(will),
(greg)-[:KNOWS]->(chris),
(veselin)-[:KNOWS]->(chris),
(karin)-[:KNOWS]->(veselin),
(chris)-[:KNOWS]->(karin)
The graph looks as follows:
The next query creates an in-memory graph called friends
from the Neo4j graph.
Since the Link Prediction model requires the graph to be undirected, the orientation of the :KNOWS
relationship is discarded.
MATCH (source:Person)-[r:KNOWS]->(target:Person)
RETURN gds.graph.project(
'friends',
source,
target,
{
sourceNodeProperties: source { .age },
targetNodeProperties: target { .age },
relationshipType: 'KNOWS'
},
{ undirectedRelationshipTypes: ['KNOWS'] }
)
Configure the pipeline
You can configure a machine learning pipeline with a sequence of Cypher queries.
-
Create the pipeline and add it to the pipeline catalog:
CALL gds.beta.pipeline.linkPrediction.create('pipe')
-
Add the link features (only
age
here) and a feature type (l2
here):CALL gds.beta.pipeline.linkPrediction.addFeature( 'pipe', 'l2', { nodeProperties: ['age'] } )
-
Configure the train-test split and the number of folds for cross-validation:
CALL gds.beta.pipeline.linkPrediction.configureSplit( 'pipe', { testFraction: 0.25, trainFraction: 0.6, validationFolds: 3 } )
-
Add a model candidate (a logistic regression with no further configuration here):
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe')
Train a model
Once configured, the pipeline is ready to train a model. The training process returns the best performing model with the specified evaluation metrics.
The pipeline configuration shown in the previous section is simplified for convenience; as such, the model performance is not expected to be the best. See the Link prediction pipelines page for a detailed walkthrough. |
CALL gds.beta.pipeline.linkPrediction.train(
'friends', (1)
{
pipeline: 'pipe', (2)
modelName: 'lp-pipeline-model', (3)
targetRelationshipType: 'KNOWS', (4)
metrics: ['AUCPR'], (5)
randomSeed: 42 (6)
}
)
YIELD modelInfo
RETURN
modelInfo.bestParameters AS winningModel, (7)
modelInfo.metrics.AUCPR.train.avg AS avgTrainScore, (8)
modelInfo.metrics.AUCPR.validation.avg AS avgValidationScore,
modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
modelInfo.metrics.AUCPR.test AS testScore
1 | Name of the projected graph to use for training. |
2 | Name of the configured pipeline. |
3 | Name of the model to train. |
4 | Name of the relationship to train the model on. |
5 | Metrics used to evaluate the models (AUCPR here). |
6 | The random seed is only needed to obtain the same results across runs. |
7 | Parameters of the best performing model returned by the training process. |
8 | Evaluated metrics (here for AUCPR ) of the best performing model returned by the training process. |
winningModel | avgTrainScore | avgValidationScore | outerTrainScore | testScore |
---|---|---|---|---|
{batchSize=100, classWeights=[], focusWeight=0.0, learningRate=0.001, maxEpochs=100, methodName="LogisticRegression", minEpochs=1, patience=1, penalty=0.0, tolerance=0.001} |
0.5740740741 |
0.3611111111 |
0.3784126984 |
0.3444444444 |
Use the model for prediction
You can use the trained model to predict the probability that a link exists between two nodes in a projected graph.
CALL gds.beta.pipeline.linkPrediction.predict.stream( (1)
'friends', (2)
{
modelName: 'lp-pipeline-model', (3)
topN: 5 (4)
}
)
YIELD node1, node2, probability
RETURN
gds.util.asNode(node1).name AS person1,
gds.util.asNode(node2).name AS person2,
probability
ORDER BY probability DESC, person1
1 | Run the prediction in stream mode (return the predicted links as query results). |
2 | Name of the projected graph to run the prediction on. |
3 | Name of the model to use for prediction. |
4 | Maximum number of predicted relationships to output. |
person1 | person2 | probability |
---|---|---|
"Karin" |
"Greg" |
0.4991379664 |
"Mark" |
"Karin" |
0.4989714183 |
"Mark" |
"Greg" |
0.4986938388 |
"Will" |
"Veselin" |
0.4986938388 |
"Mark" |
"Alice" |
0.4971949275 |
Next steps
Try to improve the performance of the training by using different model candidates, adding node properties to the features, or configuring autotuning.