Training the pipeline
The train mode, gds.beta.pipeline.linkPrediction.train, is responsible for splitting data, feature extraction, model selection, training and storing a model for future use.
Running this mode results in a prediction model of type LinkPrediction being stored in the model catalog along with metrics collected during training.
The model can be applied to a possibly different graph which produces a relationship type of predicted links, each having a predicted probability stored as a property.
More precisely, the procedure will in order:
- Apply node filtering using sourceNodeLabel and targetNodeLabel, and relationship filtering using targetRelationshipType. The resulting graph is used as input to splitting.
- Create a relationship split of the graph into test, train and feature-input graphs as described in Configuring the relationship splits. These graphs are internally managed and exist only for the duration of the training.
- Apply the node property steps, added according to Adding node properties. The graph filter on each step consists of contextNodeLabels + targetNodeLabel + sourceNodeLabel and contextRelationships + feature-input relationships.
- Apply the feature steps, added according to Adding link features, to the train graph, which yields for each train relationship an instance, that is, a feature vector and a binary label.
- Split the training instances using stratified k-fold cross-validation. The number of folds k can be configured using validationFolds in gds.beta.pipeline.linkPrediction.configureSplit.
- Train each model candidate given by the parameter space for each of the folds and evaluate the model on the respective validation set. The evaluation uses the specified metric.
- Declare as winner the model with the highest average metric across the folds.
- Re-train the winning model on the whole training set and evaluate it on both the train and test sets. In order to evaluate on the test set, the feature pipeline is first applied again as for the train set.
- Register the winning model in the Model Catalog.
The above steps describe what the procedure does logically. The actual steps as well as their ordering in the implementation may differ.
A step can only use node properties that are already present in the input graph or produced by steps that were added before it.
Parallel executions of the same pipeline on the same graph are not supported.
Syntax
CALL gds.beta.pipeline.linkPrediction.train(
graphName: String,
configuration: Map
) YIELD
trainMillis: Integer,
modelInfo: Map,
modelSelectionStats: Map,
configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
modelName | String | | no | The name of the model to train, must not exist in the Model Catalog. |
pipeline | String | | no | The name of the pipeline to execute. |
targetRelationshipType | String | | no | The name of the relationship type to train the model on. The relationship type must be undirected. |
sourceNodeLabel | String | | yes | The name of the node label relationships in the training and test sets should start from [1]. |
targetNodeLabel | String | | yes | The name of the node label relationships in the training and test sets should end at [1]. |
negativeClassWeight | Float | | yes | Weight of negative examples in model evaluation. Positive examples have weight 1. More details here. |
metrics | List of String | | no | Metrics used to evaluate the models. |
randomSeed | Integer | | yes | Seed for the random number generator used during training. |
 | Integer | | yes | The number of concurrent threads used for running the algorithm. |
 | String | | yes | An ID that can be provided to more easily track the training’s progress. |
storeModelToDisk | Boolean | | yes | Automatically store model to disk after training. |
1. This helps to train the model to predict links with a certain label combination.
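Because modelName must not already exist in the Model Catalog, it can be useful to check for a name clash before training. The following is a minimal sketch, assuming the gds.beta.model.exists procedure is available in your GDS version; the model name 'lp-pipeline-model' is simply the one used in the example further down.

```cypher
// Check whether a model with this name is already registered in the Model Catalog.
// If exists is true, training under the same modelName is expected to fail.
CALL gds.beta.model.exists('lp-pipeline-model')
YIELD modelName, modelType, exists
RETURN modelName, modelType, exists
```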
Name | Type | Description |
---|---|---|
trainMillis | Integer | Milliseconds used for training. |
modelInfo | Map | Information about the training and the winning model. |
modelSelectionStats | Map | Statistics about evaluated metrics for all model candidates. |
configuration | Map | Configuration used for the train procedure. |
The modelInfo can also be retrieved at a later time by using the Model List Procedure.
The modelInfo return field has the following algorithm-specific subfields:
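As a rough illustration of retrieving modelInfo after training, the sketch below reads it back from the Model Catalog. It assumes the list procedure is exposed as gds.beta.model.list in your GDS version and that a model named 'lp-pipeline-model' (the name used in the example further down) has already been trained.

```cypher
// Look up a previously trained model by name and return selected modelInfo subfields.
CALL gds.beta.model.list('lp-pipeline-model')
YIELD modelInfo
RETURN
  modelInfo.modelName AS modelName,
  modelInfo.modelType AS modelType,
  modelInfo.bestParameters AS bestParameters
```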
Name | Type | Description |
---|---|---|
bestParameters | Map | The model parameters which performed best on average on validation folds according to the primary metric. |
modelCandidates | List | List of maps, where each map contains information about one model candidate. This information includes the candidate's parameters, training statistics and validation statistics. |
bestTrial | Integer | The trial that produced the best model. The first trial has number 1. |
Name | Type | Description |
---|---|---|
modelName | String | The name of the trained model. |
modelType | String | The type of the trained model. |
bestParameters | Map | The model parameters which performed best on average on validation folds according to the primary metric. |
metrics | Map | Map from metric description to evaluated metrics for the winning model over the subsets of the data, see below. |
nodePropertySteps | List of Map | Algorithms that produce node properties within the pipeline. |
linkFeatures | List of Map | Feature steps that combine node properties from endpoint nodes to produce features for relationships (links) as input to the pipeline model. |
The structure of modelInfo is:
{
    bestParameters: Map,            (1)
    nodePropertySteps: List of Map,
    linkFeatures: List of Map,
    metrics: {                      (2)
        AUCPR: {
            test: Float,            (3)
            outerTrain: Float,      (4)
            train: {                (5)
                avg: Float,
                max: Float,
                min: Float
            },
            validation: {           (6)
                avg: Float,
                max: Float,
                min: Float
            }
        }
    }
}
1 | The best scoring model candidate configuration. |
2 | The metrics map contains an entry for each metric description (currently only AUCPR) and the corresponding results for that metric. |
3 | Numeric value for the evaluation of the best model on the test set. |
4 | Numeric value for the evaluation of the best model on the outer train set. |
5 | The train entry summarizes the metric results over the train set. |
6 | The validation entry summarizes the metric results over the validation set. |
In (3)-(5), if the metric is |
In addition to the data the procedure yields, there’s a fair amount of information about the training that’s being sent to the Neo4j database’s logs as the procedure progresses. For example, how well each model candidate performs is logged with Some information is only logged with |
Example
In this example we will create a small graph and use the training pipeline we have built up thus far. The graph is a small social network of people and cities, including some information about where people live, were born, and what other people they know. We will attempt to train a model to predict which additional people might know each other. The example graph looks like this:
CREATE
(alice:Person {name: 'Alice', age: 38}),
(michael:Person {name: 'Michael', age: 67}),
(karin:Person {name: 'Karin', age: 30}),
(chris:Person {name: 'Chris', age: 52}),
(will:Person {name: 'Will', age: 6}),
(mark:Person {name: 'Mark', age: 32}),
(greg:Person {name: 'Greg', age: 29}),
(veselin:Person {name: 'Veselin', age: 3}),
(london:City {name: 'London'}),
(malmo:City {name: 'Malmo'}),
(alice)-[:KNOWS]->(michael),
(michael)-[:KNOWS]->(karin),
(michael)-[:KNOWS]->(chris),
(michael)-[:KNOWS]->(greg),
(will)-[:KNOWS]->(michael),
(will)-[:KNOWS]->(chris),
(mark)-[:KNOWS]->(michael),
(mark)-[:KNOWS]->(will),
(greg)-[:KNOWS]->(chris),
(veselin)-[:KNOWS]->(chris),
(karin)-[:KNOWS]->(veselin),
(chris)-[:KNOWS]->(karin),
(alice)-[:LIVES]->(london),
(michael)-[:LIVES]->(london),
(karin)-[:LIVES]->(london),
(chris)-[:LIVES]->(malmo),
(will)-[:LIVES]->(malmo),
(alice)-[:BORN]->(london),
(michael)-[:BORN]->(london),
(karin)-[:BORN]->(malmo),
(chris)-[:BORN]->(london),
(will)-[:BORN]->(malmo),
(greg)-[:BORN]->(london),
(veselin)-[:BORN]->(malmo)
With the graph in Neo4j we can now project it into the graph catalog.
We do this using a Cypher projection targeting the Person nodes and the KNOWS relationships.
We will also project the age property, so it can be used when creating link features.
For the relationships we must use the UNDIRECTED orientation.
This is because the Link Prediction pipelines are defined only for undirected graphs.
We ignore the additional nodes and relationship types, in order for our projection to be homogeneous.
We will illustrate how to make use of the larger graph in a subsequent example.
MATCH (source:Person)-[r:KNOWS]->(target:Person)
RETURN gds.graph.project(
'myGraph',
source,
target,
{
sourceNodeProperties: source { .age },
targetNodeProperties: target { .age },
relationshipType: 'KNOWS'
},
{ undirectedRelationshipTypes: ['KNOWS'] }
)
The Link Prediction model requires the graph to be created using the UNDIRECTED orientation for relationships.
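To confirm that the projection landed in the graph catalog as intended, one option is to list it. This is a small sketch, assuming the standard gds.graph.list procedure; with an undirected projection, each KNOWS relationship is expected to be counted in both directions.

```cypher
// Inspect the projected graph: node count, relationship count and schema.
CALL gds.graph.list('myGraph')
YIELD graphName, nodeCount, relationshipCount, schema
RETURN graphName, nodeCount, relationshipCount, schema
```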
Memory Estimation
First off, we will estimate the cost of training the pipeline by using the estimate procedure.
Estimation is useful to understand the memory impact that training the pipeline on your graph will have.
When actually training the pipeline the system will perform an estimation and prohibit the execution if the estimation shows there is a very high probability of the execution running out of memory.
To read more about this, see Automatic estimation and execution blocking.
For more details on estimate in general, see Memory Estimation.
CALL gds.beta.pipeline.linkPrediction.train.estimate('myGraph', {
pipeline: 'pipe',
modelName: 'lp-pipeline-model',
targetRelationshipType: 'KNOWS'
})
YIELD requiredMemory
requiredMemory |
---|
"[24 KiB ... 522 KiB]" |
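If the human-readable range is not convenient, for example when comparing against available heap in a script, the estimate can also be read as raw byte counts. The sketch below assumes the estimate procedure exposes the bytesMin, bytesMax, nodeCount and relationshipCount result fields that GDS estimation procedures generally return.

```cypher
// Same estimation as above, returning numeric bounds alongside the formatted range.
CALL gds.beta.pipeline.linkPrediction.train.estimate('myGraph', {
  pipeline: 'pipe',
  modelName: 'lp-pipeline-model',
  targetRelationshipType: 'KNOWS'
})
YIELD requiredMemory, bytesMin, bytesMax, nodeCount, relationshipCount
RETURN requiredMemory, bytesMin, bytesMax, nodeCount, relationshipCount
```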
Training
Now we are ready to actually train a LinkPrediction model.
We must make sure to specify the targetRelationshipType to instruct the model to train only using that type.
With the graph myGraph there are actually no other relationship types projected, but that is not always the case.
CALL gds.beta.pipeline.linkPrediction.train('myGraph', {
pipeline: 'pipe',
modelName: 'lp-pipeline-model',
metrics: ['AUCPR', 'OUT_OF_BAG_ERROR'],
targetRelationshipType: 'KNOWS',
randomSeed: 18
}) YIELD modelInfo, modelSelectionStats
RETURN
modelInfo.bestParameters AS winningModel,
modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
modelInfo.metrics.AUCPR.test AS testScore,
[cand IN modelSelectionStats.modelCandidates | cand.metrics.AUCPR.validation.avg] AS validationScores
winningModel | avgTrainScore | outerTrainScore | testScore | validationScores |
---|---|---|---|---|
{batchSize=100, classWeights=[0.55, 0.45], focusWeight=0.070341817, hiddenLayerSizes=[4, 2], learningRate=0.001, maxEpochs=100, methodName="MultilayerPerceptron", minEpochs=1, patience=2, penalty=0.5, tolerance=0.001} | 0.7579365079 | 0.7 | 0.6666666667 | [0.4305555556, 0.5833333333, 0.4305555556, 0.75] |
We can see the MLP model configuration won, and has a score of 0.67 on the test set.
The score is computed as the AUCPR metric, which is in the range [0, 1].
A model that gives a higher score to all links than to all non-links will have a score of 1.0, and a model that assigns random scores will on average have a score of 0.5.
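As mentioned at the start of this section, the trained model can then be applied to a graph to produce predicted links with probabilities. The following is a hedged sketch of what that could look like, assuming the gds.beta.pipeline.linkPrediction.predict.stream procedure with its topN and threshold options; the values 5 and 0.5 are arbitrary illustration values.

```cypher
// Stream the most probable missing KNOWS links according to the trained model.
CALL gds.beta.pipeline.linkPrediction.predict.stream('myGraph', {
  modelName: 'lp-pipeline-model',
  topN: 5,          // illustrative: keep only the five most probable links
  threshold: 0.5    // illustrative: discard links with probability below 0.5
})
YIELD node1, node2, probability
RETURN
  gds.util.asNode(node1).name AS person1,
  gds.util.asNode(node2).name AS person2,
  probability
ORDER BY probability DESC, person1
```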
Training with context filters
In the above example we projected a Person-KNOWS-Person subgraph and used it for training and testing.
Much information in the original graph is not used.
We might want to utilize more node and relationship types to generate node properties (and link features) and investigate whether it improves link prediction.
We can do that by passing in contextNodeLabels and contextRelationshipTypes.
We explicitly pass in sourceNodeLabel and targetNodeLabel to specify a narrower set of nodes to be used for training and testing.
The following statement will project the full graph using a Cypher projection and store it in the graph catalog under the name 'fullGraph'.
MATCH (source:Person)-[r:KNOWS|LIVES|BORN]->(target:Person|City)
RETURN gds.graph.project(
'fullGraph',
source,
target,
{
sourceNodeLabels: labels(source),
targetNodeLabels: labels(target),
sourceNodeProperties: source { age: coalesce(source.age, 1) },
targetNodeProperties: target { age: coalesce(target.age, 1) },
relationshipType: type(r)
},
{ undirectedRelationshipTypes: ['KNOWS'] }
)
The full graph contains 2 node labels and 3 relationship types. We still train a Person-KNOWS-Person model, but use context information Person-LIVES-City, Person-BORN-City to generate node properties that the model uses in training. Note that we do not require the UNDIRECTED orientation for the context relationship types, as these are excluded from the LinkPrediction training.
First we’ll create a new pipeline.
CALL gds.beta.pipeline.linkPrediction.create('pipe-with-context')
Next we add the nodePropertyStep with context configurations.
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe-with-context', 'fastRP', {
mutateProperty: 'embedding',
embeddingDimension: 256,
randomSeed: 42,
contextNodeLabels: ['City'],
contextRelationshipTypes: ['LIVES', 'BORN']
})
Then we add the link feature.
CALL gds.beta.pipeline.linkPrediction.addFeature('pipe-with-context', 'hadamard', {
nodeProperties: ['embedding', 'age']
})
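To make the feature step concrete, the Hadamard link feature can be written as an element-wise product of the two endpoint nodes' property vectors. The formula below is a sketch under the assumption that the listed node properties are concatenated into one vector per node before the product is taken, so that the 256-dimensional embedding plus the scalar age gives 257 feature dimensions; the symbols x_u, x_v and d are introduced here for illustration only.

```latex
f(u, v) = x_u \odot x_v
        = \bigl(x_{u,1}\,x_{v,1},\; \dots,\; x_{u,d}\,x_{v,d}\bigr),
\qquad
x_u = \bigl(\mathrm{embedding}_u \,\Vert\, \mathrm{age}_u\bigr),
\quad d = 256 + 1 = 257
```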
And then similarly configure the data splits.
CALL gds.beta.pipeline.linkPrediction.configureSplit('pipe-with-context', {
testFraction: 0.25,
trainFraction: 0.6,
validationFolds: 3
})
Then we add an MLP model candidate.
CALL gds.alpha.pipeline.linkPrediction.addMLP('pipe-with-context',
{hiddenLayerSizes: [4, 2], penalty: 1, patience: 2})
Finally, we train the pipeline on the full graph, passing the source and target node labels explicitly.
CALL gds.beta.pipeline.linkPrediction.train('fullGraph', {
pipeline: 'pipe-with-context',
modelName: 'lp-pipeline-model-filtered',
metrics: ['AUCPR', 'OUT_OF_BAG_ERROR'],
sourceNodeLabel: 'Person',
targetNodeLabel: 'Person',
targetRelationshipType: 'KNOWS',
randomSeed: 12
}) YIELD modelInfo, modelSelectionStats
RETURN
modelInfo.bestParameters AS winningModel,
modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
modelInfo.metrics.AUCPR.test AS testScore,
[cand IN modelSelectionStats.modelCandidates | cand.metrics.AUCPR.validation.avg] AS validationScores
winningModel | avgTrainScore | outerTrainScore | testScore | validationScores |
---|---|---|---|---|
{batchSize=100, classWeights=[], focusWeight=0.0, hiddenLayerSizes=[4, 2], learningRate=0.001, maxEpochs=100, methodName="MultilayerPerceptron", minEpochs=1, patience=2, penalty=1.0, tolerance=0.001} | 0.832010582 | 0.6666666667 | 0.8611111111 | [0.75] |
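As we can see, the scores differ from those of the first example: the test score is about 0.86 here compared with 0.67 before, while the outer train score is slightly lower (0.67 compared with 0.7). On a toy graph this small, and with a different pipeline configuration and random seed, such differences should not be over-interpreted. It is likely that the contextual information will have a greater impact for larger datasets.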
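To compare the two trained models side by side, one option is to read their stored metrics back from the Model Catalog. This is a minimal sketch, assuming the list procedure is available as gds.beta.model.list and that both models from the examples above are still in the catalog.

```cypher
// Compare stored AUCPR metrics of the two models trained in this section.
CALL gds.beta.model.list()
YIELD modelInfo
WHERE modelInfo.modelName IN ['lp-pipeline-model', 'lp-pipeline-model-filtered']
RETURN
  modelInfo.modelName AS model,
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore
ORDER BY model
```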