Split Relationships
This feature is in the alpha tier. For more information on feature tiers, see API Tiers.
Introduction
The Split relationships algorithm is a utility algorithm that is used to pre-process a graph for model training.
It splits the relationships into a holdout set and a remaining set.
The holdout set is divided into two classes: positive, i.e., existing relationships, and negative, i.e., non-existing relationships.
The class is indicated by a label
property on the relationships.
This enables the holdout set to be used for training or testing a machine learning model.
Both, the holdout and the remaining relationships are added to the projected graph.
If the configuration option relationshipWeightProperty
is specified, then the corresponding relationship property is preserved on the remaining set of relationships.
Note however that the holdout set only has the label
property; it is not possible to induce relationship weights on the holdout set as it also contains negative samples.
Syntax
This section covers the syntax used to execute the Split Relationships algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.
CALL gds.alpha.ml.splitRelationships.mutate(
graphName: String,
configuration: Map
)
YIELD
preProcessingMillis: Integer,
computeMillis: Integer,
mutateMillis: Integer,
relationshipsWritten: Integer,
configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName |
String |
|
no |
The name of a graph stored in the catalog. |
configuration |
Map |
|
yes |
Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
sourceNodeLabels |
List of String |
|
yes |
Filter the relationships where the sourceNode has at least one of the sourceNodeLabels. |
targetNodeLabels |
List of String |
|
yes |
Filter the relationships where the targetNode has at least one of the targetNodeLabels. |
List of String |
|
yes |
Filter the named graph using the given relationship types. |
|
Integer |
|
yes |
The number of concurrent threads used for running the algorithm. |
|
String |
|
yes |
An ID that can be provided to more easily track the algorithm’s progress. |
|
holdoutFraction |
Float |
|
no |
The fraction of valid relationships being used as holdout set. The remaining |
negativeSamplingRatio |
Float |
|
no |
The desired ratio of negative to positive samples in holdout set. |
holdoutRelationshipType |
String |
|
no |
Relationship type used for the holdout set. Each relationship has a property |
remainingRelationshipType |
String |
|
no |
Relationships where one node has none of the source or target labels will be omitted. All invalid relationship are added to the remaining set. |
nonNegativeRelationshipTypes |
List of String |
|
yes |
Additional relationship types that are not used for negative sampling. |
String |
|
yes |
Name of the relationship property that is inherited by the |
|
randomSeed |
Integer |
|
yes |
An optional seed value for the random selection of relationships. |
Name | Type | Description |
---|---|---|
|
Integer |
Milliseconds for preprocessing the data. |
|
Integer |
Milliseconds for running the algorithm. |
|
Integer |
Milliseconds for adding properties to the projected graph. |
|
Integer |
The number of relationships created by the algorithm. |
|
Map |
The configuration used for running the algorithm. |
Examples
All the examples below should be run in an empty database. The examples use Cypher projections as the norm. Native projections will be deprecated in a future release. |
In this section we will show examples of running the Split Relationships algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small graph of a handful nodes connected in a particular pattern. The example graph looks like this:
Consider the graph created by the following Cypher statement:
CREATE
(n0:Label),
(n1:Label),
(n2:Label),
(n3:Label),
(n4:Label),
(n5:Label),
(n0)-[:TYPE { prop: 0} ]->(n1),
(n1)-[:TYPE { prop: 1} ]->(n2),
(n2)-[:TYPE { prop: 4} ]->(n3),
(n3)-[:TYPE { prop: 9} ]->(n4),
(n4)-[:TYPE { prop: 16} ]->(n5)
Given the above graph, we want to use 20% of the relationships as holdout set.
The holdout set will be split into two same-sized classes: positive and negative.
Positive relationships will be randomly selected from the existing relationships and marked with a property label: 1
.
Negative relationships will be randomly generated, i.e., they do not exist in the input graph, and are marked with a property label: 0
.
MATCH (source:Label)-[r:TYPE]->(target:Label)
RETURN gds.graph.project(
'graph',
source,
target,
{
sourceNodeLabels: ['Label'],
targetNodeLabels: ['Label'],
relationshipType: 'TYPE'
},
{ undirectedRelationshipTypes: ['TYPE'] }
)
Now we can run the algorithm by specifying the appropriate ratio and the output relationship types. We use a random seed value in order to produce deterministic results.
CALL gds.alpha.ml.splitRelationships.mutate('graph', {
holdoutRelationshipType: 'TYPE_HOLDOUT',
remainingRelationshipType: 'TYPE_REMAINING',
holdoutFraction: 0.2,
negativeSamplingRatio: 1.0,
randomSeed: 1337
}) YIELD relationshipsWritten
relationshipsWritten |
---|
10 |
The input graph consists of 5 relationships.
We use 20% (1 relationship) of the relationships to create the 'TYPE_HOLDOUT' relationship type (holdout set).
This creates 1 relationship with positive label.
Because of the negativeSamplingRatio
, one relationship with negative label is also created.
Finally, the TYPE_REMAINING
relationship type is formed with the remaining 80% (4 relationships).
These are written as orientation UNDIRECTED
which counts as writing 8 relationships.
TEST
and TRAIN
relationship.CREATE
(n0:Label),
(n1:Label),
(n2:Label),
(n3:Label),
(n4:Label),
(n5:Label),
(n2)-[:TYPE_HOLDOUT { label: 0 } ]->(n5), // negative, non-existing
(n3)-[:TYPE_HOLDOUT { label: 1 } ]->(n2), // positive, existing
(n0)<-[:TYPE_REMAINING { prop: 0} ]-(n1),
(n1)<-[:TYPE_REMAINING { prop: 1} ]-(n2),
(n3)<-[:TYPE_REMAINING { prop: 9} ]-(n4),
(n4)<-[:TYPE_REMAINING { prop: 16} ]-(n5),
(n0)-[:TYPE_REMAINING { prop: 0} ]->(n1),
(n1)-[:TYPE_REMAINING { prop: 1} ]->(n2),
(n3)-[:TYPE_REMAINING { prop: 9} ]->(n4),
(n4)-[:TYPE_REMAINING { prop: 16} ]->(n5)