Training the pipeline
This feature is in the alpha tier. For more information on feature tiers, see API Tiers.
The train mode, gds.alpha.pipeline.nodeRegression.train, is responsible for data splitting, feature extraction, model selection, training, and storing a model for future use. Running this mode results in a regression model of type NodeRegression, which is then stored in the model catalog. The regression model can be applied to a graph to predict property values for new nodes.
More precisely, the training proceeds as follows:

- Apply the node property steps, added according to Adding node properties, on the whole graph. The graph filter on each step consists of contextNodeLabels + targetNodeLabels and contextRelationshipTypes + relationshipTypes.
- Apply the targetNodeLabels filter to the graph.
- Select node properties to be used as features, as specified in Adding features.
- Split the input graph into two parts: the train graph and the test graph. This is described in Configuring the node splits. These graphs are internally managed and exist only for the duration of the training.
- Split the nodes in the train graph using stratified k-fold cross-validation. The number of folds k can be configured as described in Configuring the node splits.
- Train each model candidate defined in the parameter space on each train set and evaluate it on the respective validation set for every fold. The evaluation uses the specified primary metric.
- Choose the best performing model according to the highest average score for the primary metric.
- Retrain the winning model on the entire train graph.
- Evaluate the performance of the winning model on the whole train graph as well as the test graph.
- Retrain the winning model on the entire original graph.
- Register the winning model in the Model Catalog.
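The test fraction and the number of validation folds k used in the splitting steps above are configured on the pipeline itself before training. A minimal sketch, assuming a pipeline named 'pipe' and the configureSplit mode described in Configuring the node splits:

```cypher
// Reserve 20% of the nodes for the test graph and use 5 folds
// of cross-validation on the remaining train graph.
CALL gds.alpha.pipeline.nodeRegression.configureSplit('pipe', {
  testFraction: 0.2,
  validationFolds: 5
})
```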
The above steps describe what the procedure does logically. The actual steps, as well as their ordering, in the implementation may differ.

A step can only use node properties that are already present in the input graph or produced by steps that were added before.

Parallel execution of the same pipeline on the same graph is not supported.
Metrics
The Node Regression model in the Neo4j GDS library supports the following evaluation metrics:
- MEAN_SQUARED_ERROR
- ROOT_MEAN_SQUARED_ERROR
- MEAN_ABSOLUTE_ERROR
More than one metric can be specified during training, but only the first specified one, the primary metric, is used for model selection. The results of all specified metrics are present in the train results.
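For example, all three supported metrics can be computed in one training by listing them together, with the first one acting as the primary metric. A minimal sketch, assuming a pipeline 'pipe' and a graph 'myGraph' as in the example section below (the model name is hypothetical):

```cypher
// MEAN_SQUARED_ERROR is listed first and therefore acts as the primary
// metric; the other two are evaluated and reported as well.
CALL gds.alpha.pipeline.nodeRegression.train('myGraph', {
  pipeline: 'pipe',
  targetNodeLabels: ['House'],
  modelName: 'nr-multi-metric-model',
  targetProperty: 'price',
  metrics: ['MEAN_SQUARED_ERROR', 'ROOT_MEAN_SQUARED_ERROR', 'MEAN_ABSOLUTE_ERROR']
}) YIELD modelInfo
RETURN keys(modelInfo.metrics) AS evaluatedMetrics
```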
Syntax
CALL gds.alpha.pipeline.nodeRegression.train(
graphName: String,
configuration: Map
) YIELD
trainMillis: Integer,
modelInfo: Map,
modelSelectionStats: Map,
configuration: Map
Name | Type | Default | Optional | Description
---|---|---|---|---
graphName | String | | no | The name of a graph stored in the catalog.
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering.
Name | Type | Default | Optional | Description
---|---|---|---|---
pipeline | String | | no | The name of the pipeline to execute.
targetNodeLabels | List of String | | yes | Filter the named graph using the given node labels to obtain nodes that are subject to training and evaluation.
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types.
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm.
targetProperty | String | | no | The target property of the node. Must be of type Integer or Float.
metrics | List of String | | no | Metrics used to evaluate the models.
randomSeed | Integer | | yes | Seed for the random number generator used during training.
modelName | String | | no | The name of the model to train, must not exist in the Model Catalog.
jobId | String | | yes | An ID that can be provided to more easily track the training’s progress.
Name | Type | Description
---|---|---
trainMillis | Integer | Milliseconds used for training.
modelInfo | Map | Information about the training and the winning model.
modelSelectionStats | Map | Statistics about evaluated metrics for all model candidates.
configuration | Map | Configuration used for the train procedure.
The modelInfo can also be retrieved at a later time by using the Model List Procedure. The modelInfo return field has the following algorithm-specific subfields:
Name | Type | Description
---|---|---
bestParameters | Map | The model parameters which performed best on average on validation folds according to the primary metric.
metrics | Map | Map from metric description to evaluated metrics for the winning model over the subsets of the data, see below.
nodePropertySteps | List of Map | Algorithms that produce node properties within the pipeline.
featureProperties | List of String | Node properties selected as input features to the pipeline model.
The structure of modelInfo is:

{
    bestParameters: Map,                    (1)
    nodePropertySteps: List of Map,
    featureProperties: List of String,
    metrics: {                              (2)
        <METRIC_NAME>: {                    (3)
            test: Float,                    (4)
            outerTrain: Float,              (5)
            train: {                        (6)
                avg: Float,
                max: Float,
                min: Float
            },
            validation: {                   (7)
                avg: Float,
                max: Float,
                min: Float,
                params: Map
            }
        }
    }
}
1 | The best scoring model candidate configuration.
2 | The metrics map contains an entry for each metric description, and the corresponding results for that metric.
3 | A metric name specified in the configuration of the procedure, e.g., MEAN_SQUARED_ERROR.
4 | Numeric value for the evaluation of the winning model on the test set.
5 | Numeric value for the evaluation of the winning model on the outer train set.
6 | The train entry summarizes the metric results over the train set.
7 | The validation entry summarizes the metric results over the validation set.
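The stored model information can be inspected later without retraining. A sketch using the model list procedure from the Model Catalog (gds.beta.model.list under the tier naming used here), assuming a previously trained model named 'nr-pipeline-model':

```cypher
// Look up a trained model by name in the model catalog.
CALL gds.beta.model.list('nr-pipeline-model')
YIELD modelInfo
RETURN modelInfo.bestParameters AS bestParameters,
       modelInfo.metrics AS metrics
```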
In addition to the data the procedure yields, a fair amount of information about the training is sent to the Neo4j database’s logs as the procedure progresses. For example, how well each model candidate performs is logged, and some information is only logged at more verbose log levels.
Example
All the examples below should be run in an empty database. The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.
In this section we will show examples of running a Node Regression training pipeline on a concrete graph.
The intention is to illustrate what the results look like and to provide a guide in how to make use of the model in a real setting.
We will do this on a small graph of a handful of nodes representing houses.
In our example we want to predict the price
of a house.
The example graph looks like this:
CREATE
(gold:House {color: 'Gold', sizePerStory: [15.5, 23.6, 33.1], price: 99.99}),
(red:House {color: 'Red', sizePerStory: [15.5, 23.6, 100.0], price: 149.99}),
(blue:House {color: 'Blue', sizePerStory: [11.3, 35.1, 22.0], price: 77.77}),
(green:House {color: 'Green', sizePerStory: [23.2, 55.1, 0.0], price: 80.80}),
(gray:House {color: 'Gray', sizePerStory: [34.3, 24.0, 0.0], price: 57.57}),
(black:House {color: 'Black', sizePerStory: [71.66, 55.0, 0.0], price: 140.14}),
(white:House {color: 'White', sizePerStory: [11.1, 111.0, 0.0], price: 122.22}),
(teal:House {color: 'Teal', sizePerStory: [80.8, 0.0, 0.0], price: 80.80}),
(beige:House {color: 'Beige', sizePerStory: [106.2, 0.0, 0.0], price: 110.11}),
(magenta:House {color: 'Magenta', sizePerStory: [99.9, 0.0, 0.0], price: 100.00}),
(purple:House {color: 'Purple', sizePerStory: [56.5, 0.0, 0.0], price: 60.00}),
(pink:UnknownHouse {color: 'Pink', sizePerStory: [23.2, 55.1, 56.1]}),
(tan:UnknownHouse {color: 'Tan', sizePerStory: [22.32, 102.0, 0.0]}),
(yellow:UnknownHouse {color: 'Yellow', sizePerStory: [39.0, 0.0, 0.0]}),
// richer context
(schiele:Painter {name: 'Schiele'}),
(picasso:Painter {name: 'Picasso'}),
(kahlo:Painter {name: 'Kahlo'}),
(schiele)-[:PAINTED]->(gold),
(schiele)-[:PAINTED]->(red),
(schiele)-[:PAINTED]->(blue),
(picasso)-[:PAINTED]->(green),
(picasso)-[:PAINTED]->(gray),
(picasso)-[:PAINTED]->(black),
(picasso)-[:PAINTED]->(white),
(kahlo)-[:PAINTED]->(teal),
(kahlo)-[:PAINTED]->(beige),
(kahlo)-[:PAINTED]->(magenta),
(kahlo)-[:PAINTED]->(purple),
(schiele)-[:PAINTED]->(pink),
(schiele)-[:PAINTED]->(tan),
(kahlo)-[:PAINTED]->(yellow);
With the graph in Neo4j we can now project it into the graph catalog to prepare it for the pipeline execution.
We do this using a Cypher projection targeting the House
and UnknownHouse
labels.
We will also project the sizePerStory property to use as a model feature, and the price property to use as the target property.
MATCH (house:House|UnknownHouse)
RETURN gds.graph.project(
'myGraph',
house,
null,
{
sourceNodeLabels: labels(house),
targetNodeLabels: [],
sourceNodeProperties: house { .sizePerStory, .price },
targetNodeProperties: {}
}
)
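Before moving on to training, the projection can be sanity-checked against the graph catalog. A small sketch using gds.graph.list; the projection should contain the 14 House and UnknownHouse nodes and no relationships:

```cypher
// Inspect the projected graph 'myGraph' in the graph catalog.
CALL gds.graph.list('myGraph')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
```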
Train
In the following examples we will demonstrate running the Node Regression training pipeline on this graph.
We will train a model to predict the price of a house, based on its sizePerStory
property.
The configuration of the pipeline is the result of running the examples on the previous page:
CALL gds.alpha.pipeline.nodeRegression.train('myGraph', {
pipeline: 'pipe',
targetNodeLabels: ['House'],
modelName: 'nr-pipeline-model',
targetProperty: 'price',
randomSeed: 25,
concurrency: 1,
metrics: ['MEAN_SQUARED_ERROR']
}) YIELD modelInfo
RETURN
modelInfo.bestParameters AS winningModel,
modelInfo.metrics.MEAN_SQUARED_ERROR.train.avg AS avgTrainScore,
modelInfo.metrics.MEAN_SQUARED_ERROR.outerTrain AS outerTrainScore,
modelInfo.metrics.MEAN_SQUARED_ERROR.test AS testScore
winningModel | avgTrainScore | outerTrainScore | testScore
---|---|---|---
{maxDepth=2147483647, methodName="RandomForest", minLeafSize=1, minSplitSize=2, numberOfDecisionTrees=5, numberOfSamplesRatio=1.0} | 658.1848249523812 | 1188.6296009999999 | 1583.5897253333333
Here we can observe that the RandomForest
candidate with 5 decision trees performed the best in the training phase.
Notice that this is just a toy example on a very small graph.
In order to achieve a higher test score, we may need to use better features, a larger graph, or a different model configuration.
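The trained model can now be applied to the UnknownHouse nodes, whose price is unknown. A sketch assuming the predict mode described in Applying a trained model:

```cypher
// Stream a predicted price for each UnknownHouse node.
CALL gds.alpha.pipeline.nodeRegression.predict.stream('myGraph', {
  modelName: 'nr-pipeline-model',
  targetNodeLabels: ['UnknownHouse']
}) YIELD nodeId, predictedValue
RETURN gds.util.asNode(nodeId).color AS house, predictedValue
ORDER BY predictedValue DESC
```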
Providing richer contexts to node property steps
In the above example we projected a House subgraph without relationships and used it for training and testing. Much information in the original graph is not used. We might want to utilize more node and relationship types to generate node properties and investigate whether that improves node regression. We can do that by passing in contextNodeLabels and contextRelationshipTypes when adding a node property step.
The following statement will project a graph containing the information about houses and their painters using a Cypher projection and store it in the graph catalog under the name 'paintingGraph'.
MATCH (house:House)
OPTIONAL MATCH (painter:Painter)-[:PAINTED]->(house)
RETURN gds.graph.project(
'paintingGraph',
painter,
house,
{
sourceNodeLabels: ['Painter'],
targetNodeLabels: ['House'],
sourceNodeProperties: {},
targetNodeProperties: house { .sizePerStory, .price },
relationshipType: 'PAINTED'
},
{ undirectedRelationshipTypes: ['PAINTED'] }
)
We still train a model to predict the price of each house, but use Painter
and PAINTED
as context in addition to House
to generate features that leverage the full graph structure.
After the feature generation however, it is only the House
nodes that are considered as training and evaluation instances, so only the House
nodes need to have the target property price
.
First, we create a new pipeline.
CALL gds.alpha.pipeline.nodeRegression.create('pipe-with-context')
Second, we add a node property step (in this case, a node embedding) with Painter
as contextNodeLabels.
CALL gds.alpha.pipeline.nodeRegression.addNodeProperty('pipe-with-context', 'fastRP', {
embeddingDimension: 64,
iterationWeights: [0, 1],
mutateProperty:'embedding',
contextNodeLabels: ['Painter'],
randomSeed: 1337
})
We add our embedding as a feature for the model:
CALL gds.alpha.pipeline.nodeRegression.selectFeatures('pipe-with-context', ['embedding'])
And we complete the pipeline setup by adding a random forest model candidate:
CALL gds.alpha.pipeline.nodeRegression.addRandomForest('pipe-with-context', {numberOfDecisionTrees: 5})
We are now ready to invoke the training of the newly created pipeline.
CALL gds.alpha.pipeline.nodeRegression.train('paintingGraph', {
pipeline: 'pipe-with-context',
targetNodeLabels: ['House'],
modelName: 'nr-pipeline-model-contextual',
targetProperty: 'price',
randomSeed: 25,
concurrency: 1,
metrics: ['MEAN_SQUARED_ERROR']
}) YIELD modelInfo
RETURN
modelInfo.bestParameters AS winningModel,
modelInfo.metrics.MEAN_SQUARED_ERROR.train.avg AS avgTrainScore,
modelInfo.metrics.MEAN_SQUARED_ERROR.outerTrain AS outerTrainScore,
modelInfo.metrics.MEAN_SQUARED_ERROR.test AS testScore
winningModel | avgTrainScore | outerTrainScore | testScore
---|---|---|---
{maxDepth=2147483647, methodName="RandomForest", minLeafSize=1, minSplitSize=2, numberOfDecisionTrees=5, numberOfSamplesRatio=1.0} | 758.087008266667 | 837.5558960000001 | 1192.523748
As we can see, the results indicate a lower mean squared error for the random forest model compared to nr-pipeline-model from the earlier section.
The change is due to the embeddings taking into account more contextual information.
While this is a toy example, additional context can sometimes provide valuable information to pipeline steps, resulting in better performance.
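When the trained models and projected graphs are no longer needed, they can be dropped to free up memory. A sketch using the model and graph catalog procedures:

```cypher
// Remove the trained models from the model catalog ...
CALL gds.beta.model.drop('nr-pipeline-model');
CALL gds.beta.model.drop('nr-pipeline-model-contextual');
// ... and the projected graphs from the graph catalog.
CALL gds.graph.drop('myGraph');
CALL gds.graph.drop('paintingGraph');
```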