GraphSAGE
This feature is in the beta tier. For more information on feature tiers, see API Tiers.
Glossary
- Directed: The algorithm is well-defined on a directed graph.
- Directed: The algorithm ignores the direction of the graph.
- Directed: The algorithm does not run on a directed graph.
- Undirected: The algorithm is well-defined on an undirected graph.
- Undirected: The algorithm ignores the undirectedness of the graph.
- Heterogeneous nodes fully supported: The algorithm has the ability to distinguish between nodes of different types.
- Heterogeneous nodes allowed: The algorithm treats all selected nodes similarly regardless of their label.
- Heterogeneous relationships fully supported: The algorithm has the ability to distinguish between relationships of different types.
- Heterogeneous relationships allowed: The algorithm treats all selected relationships similarly regardless of their type.
- Weighted relationships: The algorithm supports a relationship property to be used as weight, specified via the relationshipWeightProperty configuration parameter.
- Weighted relationships: The algorithm treats each relationship as equally important, discarding the value of any relationship weight.
GraphSAGE is an inductive algorithm for computing node embeddings. GraphSAGE uses node feature information to generate node embeddings on unseen nodes or graphs. Instead of training individual embeddings for each node, the algorithm learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood.
The algorithm is defined for UNDIRECTED graphs.
For more information on this algorithm, see:
- W. L. Hamilton, R. Ying, and J. Leskovec. "Inductive Representation Learning on Large Graphs." 2017. https://arxiv.org/abs/1706.02216
Considerations
Isolated nodes
If you are embedding a graph that has an isolated node, the aggregation step in GraphSAGE can only draw information from the node itself.
When all the properties of that node are 0.0, and the activation function is ReLU, this leads to an all-zero vector for that node.
However, since GraphSAGE normalizes node embeddings using the L2-norm, and a zero vector cannot be normalized, we assign all-zero embeddings to such nodes under these special circumstances.
In scenarios where you generate all-zero embeddings for orphan nodes, this may have an impact on downstream tasks such as nearest neighbor or other similarity algorithms. It may be more appropriate to filter out these disconnected nodes prior to running GraphSAGE, as sketched below.
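A minimal sketch of such filtering, assuming the catalog procedure gds.graph.filter is available in your GDS version; the graph names persons and personsConnected are hypothetical:

CALL gds.degree.mutate('persons', { mutateProperty: 'degree' })
YIELD nodePropertiesWritten;

CALL gds.graph.filter(
  'personsConnected',  // hypothetical name for the filtered graph
  'persons',           // the original projection
  'n.degree > 0.0',    // keep only nodes with at least one relationship
  '*'                  // keep all relationships
)
YIELD nodeCount, relationshipCount;

GraphSAGE can then be trained on personsConnected instead of the full graph.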
Memory estimation
When doing memory estimation of the training, the feature dimension is computed as if each feature property is scalar.
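To see the estimated memory requirements before committing to a training run, the usual estimate mode can be applied; a sketch, assuming gds.beta.graphSage.train supports the standard .estimate suffix and using a hypothetical model name:

CALL gds.beta.graphSage.train.estimate(
  'persons',
  {
    modelName: 'estimateModel',  // hypothetical model name
    featureProperties: ['age', 'heightAndWeight']
  }
) YIELD requiredMemory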
Graph pre-sampling to reduce time and memory
Since training a GraphSAGE model may take a lot of time and memory on large graphs, it can be helpful to sample a smaller subgraph prior to training, and then training on that subgraph. The trained model can still be applied to predict embeddings on the full graph (or other graphs) since GraphSAGE is inductive. To sample a structurally representative subgraph, see Random walk with restarts sampling.
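A sketch of such pre-sampling, assuming the sampling procedure gds.graph.sample.rwr is available; the subgraph name personsSample and the samplingRatio value are hypothetical examples:

CALL gds.graph.sample.rwr(
  'personsSample',        // hypothetical name for the sampled graph
  'persons',              // the original projection
  { samplingRatio: 0.5 }  // keep roughly half of the nodes
) YIELD nodeCount, relationshipCount

Training on personsSample and then predicting on the original graph works because the learned aggregation function is not tied to specific nodes.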
Usage in machine learning pipelines
It may be useful to generate node embeddings with GraphSAGE as a node property step in a machine learning pipeline (like Link prediction pipelines and Node property prediction).
Training the GraphSAGE model inside the pipeline is not supported; the model must first be trained outside the pipeline.
Once the model is trained, it is possible to add GraphSAGE as a node property step to a pipeline, using gds.beta.graphSage or the shorthand beta.graphSage as the procedureName procedure parameter, and referencing the trained model in the procedure configuration map as one would with the predict mutate mode.
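A minimal sketch of this, using the link prediction pipeline procedures; the pipeline name pipe and the model name graphSage are hypothetical, and the model is assumed to have been trained beforehand with gds.beta.graphSage.train:

CALL gds.beta.pipeline.linkPrediction.create('pipe');

CALL gds.beta.pipeline.linkPrediction.addNodeProperty(
  'pipe',
  'beta.graphSage',          // the shorthand procedureName
  {
    mutateProperty: 'embedding',
    modelName: 'graphSage'   // the pre-trained model in the model catalog
  }
);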
Tuning parameters
In general, the best choice of tuning parameters depends strongly on the specific dataset.
Embedding dimension
The size of the node embedding as well as its hidden layer. A large embedding size captures more information, but increases the required memory and computation time. A small embedding size is faster, but can cause the input features and graph topology to be insufficiently encoded in the embedding.
Aggregator
An aggregator defines how to combine a node’s embedding and the sampled neighbor embeddings from the previous layer.
GDS supports the Mean and Pool aggregators.
Mean is simpler, requires less memory, and is faster to compute.
Pool is more complex and can encode a richer neighbourhood.
Activation function
The activation function is used to transform the inputs of the neurons in the neural network.
We support Sigmoid and leaky ReLU.
Sample sizes
Each sample size represents a hidden layer with an output of size equal to the embedding dimension.
The layer uses the given aggregator and activation function.
More layers result in more distant neighbors being considered for a node’s embedding.
Layer N uses the sampled neighbor embeddings of distance <= N, computed at layer N - 1.
The more layers, the higher the memory and computation time.
A sample size n means we try to sample at most n neighbors from a node.
Higher sample sizes also require more memory and computation time.
Batch size
This parameter defines how many training examples are grouped in a single batch.
For each training example, we will also sample a positive and a negative example.
The gradients are computed concurrently on the batches, using as many threads as specified by the concurrency parameter.
The batch size does not affect the model quality, but can be used to tune for training speed. A larger batch size increases the memory consumption of the computation.
Epochs
This parameter defines the maximum number of epochs for the training. Before each epoch, new neighbors are sampled for each layer as specified in Sample sizes. Independent of the model’s quality, the training will terminate after this many epochs. Note that the training can also stop earlier, if the loss converged (see Tolerance).
Setting this parameter can be useful to limit the training time for a model. Restricting the computational budget can serve the purpose of regularization and mitigate overfitting, which becomes a risk with a large number of epochs.
Because each epoch resamples neighbors, multiple epochs avoid overfitting on specific neighborhoods.
Max Iterations
This parameter defines the maximum number of iterations run for a single epoch. Each iteration uses the gradients of randomly sampled batches, which are summed and scaled before updating the weights. The number of sampled batches is defined via Batch sampling ratio. After each iteration, it is also checked whether the loss has converged (see Tolerance).
A high number of iterations can lead to overfitting for a specific sample of neighbors.
Batch sampling ratio
This parameter defines the number of batches to sample for a single iteration.
The more batches are sampled, the more accurate the gradient computation will be. However, more batches also increase the runtime of each single iteration.
In general, it is recommended to use at least as many batches as the defined concurrency.
Search depth
This parameter defines the maximum depth of the random walks which sample positive examples for each node in a batch.
How close similar nodes are depends on your dataset and use case.
Negative-sample weight
This parameter defines the weight of the negative samples compared to the positive samples in the loss computation. Higher values increase the impact of negative samples in the loss and decrease the impact of the positive samples.
Penalty L2
This parameter defines the influence of the regularization term on the loss function. The L2 penalty term is computed over all the weights from the layers defined based on the Aggregator and Sample sizes.
While the regularization can avoid overfitting, a high value can even lead to underfitting. The minimal value is zero, where the regularization term has no effect at all.
Learning rate
When updating the weights, we move in the direction dictated by the Adam optimizer based on the loss function’s gradients. The learning rate parameter dictates how much to update the weights after each iteration.
Tolerance
This parameter defines the convergence criteria of an epoch.
An epoch converges if the loss of the current iteration and the loss of the previous iteration differ by less than the tolerance.
A lower tolerance results in more sensitive training, with a higher probability of training longer. A higher tolerance means less sensitive training and hence earlier convergence.
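The sketch below gathers the tuning parameters discussed above into a single train call on the 'persons' graph projected in the Examples section; all values are hypothetical starting points, not recommendations:

CALL gds.beta.graphSage.train(
  'persons',
  {
    modelName: 'tunedModel',   // hypothetical model name
    featureProperties: ['age', 'heightAndWeight'],
    embeddingDimension: 64,
    aggregator: 'pool',
    activationFunction: 'relu',
    sampleSizes: [25, 10],     // two layers
    batchSize: 100,
    epochs: 10,
    maxIterations: 10,
    batchSamplingRatio: 1.0,
    searchDepth: 5,
    negativeSampleWeight: 20,
    penaltyL2: 0.001,
    learningRate: 0.05,
    tolerance: 0.0001
  }
) YIELD modelInfo
RETURN modelInfo.metrics.didConverge AS didConverge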
Syntax
CALL gds.beta.graphSage.train(
graphName: String,
configuration: Map
) YIELD
modelInfo: Map,
configuration: Map,
trainMillis: Integer
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
modelName | String | | no | The name of the model to train; must not already exist in the Model Catalog. |
featureProperties | List of String | | no | The names of the node properties that should be used as input features. All property names must exist in the projected graph and be of type Float or List of Float. |
nodeLabels | List of String | | yes | Filter the named graph using the given node labels. Nodes with any of the given labels will be included. |
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. Relationships with any of the given types will be included. |
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
jobId | String | | yes | An ID that can be provided to more easily track the algorithm’s progress. |
logProgress | Boolean | | yes | If disabled the progress percentage will not be logged. |
embeddingDimension | Integer | | yes | The dimension of the generated node embeddings as well as their hidden layer representations. |
aggregator | String | | yes | The aggregator to be used by the layers. Supported values are "Mean" and "Pool". |
activationFunction | String | | yes | The activation function to be used in the model architecture. Supported values are "Sigmoid" and "ReLu". |
sampleSizes | List of Integer | | yes | A list of Integer values; the size of the list determines the number of layers, and the values determine how many nodes will be sampled by the layers. |
projectedFeatureDimension | Integer | | yes | The dimension of the projected feature properties; specifying it enables the multi-label mode. |
batchSize | Integer | | yes | The number of nodes per batch. |
tolerance | Float | | yes | Tolerance used for the early convergence of an epoch, which is checked after each iteration. |
learningRate | Float | | yes | The learning rate determines the step size at each iteration while moving toward a minimum of the loss function. |
epochs | Integer | | yes | Number of times to traverse the graph. |
maxIterations | Integer | | yes | Maximum number of iterations per epoch. Each iteration the weights are updated. |
batchSamplingRatio | Float | | yes | Sampling ratio of batches to consider per weight update. By default, each thread evaluates a single batch. |
searchDepth | Integer | | yes | Maximum depth of the random walks used to sample nearby nodes for the training. |
negativeSampleWeight | Integer | | yes | The weight of the negative samples. |
relationshipWeightProperty | String | | yes | Name of the relationship property to use as weights. If unspecified, the algorithm runs unweighted. |
randomSeed | Integer | | yes | A random seed which is used to control the randomness in computing the embeddings. |
penaltyL2 | Float | | yes | The influence of the L2 penalty term on the loss function. |
storeModelToDisk | Boolean | | yes | Automatically store the model to disk after training. |
Name | Type | Description |
---|---|---|
modelInfo | Map | Details of the trained model. |
configuration | Map | The configuration used to run the procedure. |
trainMillis | Integer | Milliseconds to train the model. |
Name | Type | Description |
---|---|---|
modelName | String | The name of the trained model. |
modelType | String | The type of the trained model. Always "graphSage". |
metrics | Map | Metrics related to running the training; details in the table below. |
Name | Type | Description |
---|---|---|
ranEpochs | Integer | The number of epochs run during training. |
epochLosses | List | The average loss per node after each epoch. |
 | List of List of Float | The average loss per node after each iteration, for each epoch. |
didConverge | Boolean | Indicates if the training has converged. |
CALL gds.beta.graphSage.stream(
graphName: String,
configuration: Map
) YIELD
nodeId: Integer,
embedding: List
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
modelName | String | | no | The name of a GraphSAGE model in the model catalog. |
nodeLabels | List of String | | yes | Filter the named graph using the given node labels. Nodes with any of the given labels will be included. |
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. Relationships with any of the given types will be included. |
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
jobId | String | | yes | An ID that can be provided to more easily track the algorithm’s progress. |
logProgress | Boolean | | yes | If disabled the progress percentage will not be logged. |
batchSize | Integer | | yes | The number of nodes per batch. |
Name | Type | Description |
---|---|---|
nodeId | Integer | The Neo4j node ID. |
embedding | List of Float | The computed node embedding. |
CALL gds.beta.graphSage.mutate(
graphName: String,
configuration: Map
)
YIELD
nodeCount: Integer,
nodePropertiesWritten: Integer,
preProcessingMillis: Integer,
computeMillis: Integer,
mutateMillis: Integer,
configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
modelName | String | | no | The name of a GraphSAGE model in the model catalog. |
mutateProperty | String | | no | The node property in the GDS graph to which the embedding is written. |
nodeLabels | List of String | | yes | Filter the named graph using the given node labels. |
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. |
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
jobId | String | | yes | An ID that can be provided to more easily track the algorithm’s progress. |
batchSize | Integer | | yes | The number of nodes per batch. |
Name | Type | Description |
---|---|---|
nodeCount | Integer | The number of nodes processed. |
nodePropertiesWritten | Integer | The number of node properties written. |
preProcessingMillis | Integer | Milliseconds for preprocessing data. |
computeMillis | Integer | Milliseconds for running the algorithm. |
mutateMillis | Integer | Milliseconds for writing result data back to the projected graph. |
configuration | Map | The configuration used for running the algorithm. |
CALL gds.beta.graphSage.write(
graphName: String,
configuration: Map
)
YIELD
nodeCount: Integer,
nodePropertiesWritten: Integer,
preProcessingMillis: Integer,
computeMillis: Integer,
writeMillis: Integer,
configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName | String | | no | The name of a graph stored in the catalog. |
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
modelName | String | | no | The name of a GraphSAGE model in the model catalog. |
nodeLabels | List of String | | yes | Filter the named graph using the given node labels. Nodes with any of the given labels will be included. |
relationshipTypes | List of String | | yes | Filter the named graph using the given relationship types. Relationships with any of the given types will be included. |
concurrency | Integer | | yes | The number of concurrent threads used for running the algorithm. |
jobId | String | | yes | An ID that can be provided to more easily track the algorithm’s progress. |
logProgress | Boolean | | yes | If disabled the progress percentage will not be logged. |
writeConcurrency | Integer | | yes | The number of concurrent threads used for writing the result to Neo4j. |
writeProperty | String | | no | The node property in the Neo4j database to which the embedding is written. |
batchSize | Integer | | yes | The number of nodes per batch. |
Name | Type | Description |
---|---|---|
nodeCount | Integer | The number of nodes processed. |
nodePropertiesWritten | Integer | The number of node properties written. |
preProcessingMillis | Integer | Milliseconds for preprocessing data. |
computeMillis | Integer | Milliseconds for running the algorithm. |
writeMillis | Integer | Milliseconds for writing result data back to Neo4j. |
configuration | Map | The configuration used for running the algorithm. |
Examples
All the examples below should be run in an empty database. The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.
In this section we will show examples of running the GraphSAGE algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to make use of the algorithm in a real setting. We will do this on a small friends network graph of a handful of nodes connected in a particular pattern. The example graph looks like this:
CREATE
// Persons
( dan:Person {name: 'Dan', age: 20, heightAndWeight: [185, 75]}),
(annie:Person {name: 'Annie', age: 12, heightAndWeight: [124, 42]}),
( matt:Person {name: 'Matt', age: 67, heightAndWeight: [170, 80]}),
( jeff:Person {name: 'Jeff', age: 45, heightAndWeight: [192, 85]}),
( brie:Person {name: 'Brie', age: 27, heightAndWeight: [176, 57]}),
( elsa:Person {name: 'Elsa', age: 32, heightAndWeight: [158, 55]}),
( john:Person {name: 'John', age: 35, heightAndWeight: [172, 76]}),
(dan)-[:KNOWS {relWeight: 1.0}]->(annie),
(dan)-[:KNOWS {relWeight: 1.6}]->(matt),
(annie)-[:KNOWS {relWeight: 0.1}]->(matt),
(annie)-[:KNOWS {relWeight: 3.0}]->(jeff),
(annie)-[:KNOWS {relWeight: 1.2}]->(brie),
(matt)-[:KNOWS {relWeight: 10.0}]->(brie),
(brie)-[:KNOWS {relWeight: 1.0}]->(elsa),
(brie)-[:KNOWS {relWeight: 2.2}]->(jeff),
(john)-[:KNOWS {relWeight: 5.0}]->(jeff)
MATCH (source:Person)
OPTIONAL MATCH (source:Person)-[r:KNOWS]->(target:Person)
RETURN gds.graph.project(
'persons',
source,
target,
{
sourceNodeLabels: labels(source),
targetNodeLabels: labels(target),
sourceNodeProperties: source { .age, .heightAndWeight },
targetNodeProperties: target { .age, .heightAndWeight },
relationshipType: type(r),
relationshipProperties: r { .relWeight }
},
{ undirectedRelationshipTypes: ['KNOWS'] }
)
The algorithm is defined for UNDIRECTED graphs.
Train
Before we are able to generate node embeddings we need to train a model and store it in the model catalog. Below is an example of how to do that.
The names specified in the featureProperties configuration parameter must exist in the projected graph.
CALL gds.beta.graphSage.train(
'persons',
{
modelName: 'exampleTrainModel',
featureProperties: ['age', 'heightAndWeight'],
aggregator: 'mean',
activationFunction: 'sigmoid',
randomSeed: 1337,
sampleSizes: [25, 10]
}
) YIELD modelInfo as info
RETURN
info.modelName as modelName,
info.metrics.didConverge as didConverge,
info.metrics.ranEpochs as ranEpochs,
info.metrics.epochLosses as epochLosses
modelName | didConverge | ranEpochs | epochLosses |
---|---|---|---|
"exampleTrainModel" | true | 1 | [26.5784954435] |
Due to the random initialisation of the weight variables the results may vary between different runs.
Looking at the results, we can see that the training converged after a single epoch, since the losses of consecutive iterations were almost identical.
Tuning the algorithm parameters, such as trying out different sampleSizes, searchDepth, embeddingDimension or batchSize, can improve the loss.
For different datasets, GraphSAGE may require different training parameters to produce good models.
The trained model is automatically registered in the model catalog.
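The registered model can be inspected via the model catalog; a minimal sketch, assuming the beta catalog procedure gds.beta.model.list:

CALL gds.beta.model.list('exampleTrainModel')
YIELD modelInfo
RETURN modelInfo.modelName AS modelName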
Train with multiple node labels
In this section we describe how to train on a graph with multiple labels. The different labels may have different sets of properties. To run on such a graph, GraphSAGE is run in multi-label mode, in which the feature properties are projected into a common feature space. Therefore, all nodes have feature vectors of the same dimension after the projection.
The projection for a label is linear and given by a matrix of weights. The weights for each label are learned jointly with the other weights of the GraphSAGE model.
In the multi-label mode, the following is applied prior to the usual aggregation layers:
- A property representing the label is added to the feature properties for that label
- The feature properties for each label are projected into a feature vector of a shared dimension
The projected feature dimension is configured with projectedFeatureDimension, and specifying it enables the multi-label mode.
The feature properties used for a label are those present in the featureProperties configuration parameter which exist in the graph for that label.
In the multi-label mode, it is no longer required that all labels have all the specified properties.
Assumptions
- A requirement for multi-label mode is that each node belongs to exactly one label.
- A GraphSAGE model trained in this mode must be applied on graphs with the same schema with regards to node labels and properties.
Examples
In order to demonstrate GraphSAGE with multiple labels, we add instruments and relationships of type LIKES between persons and instruments to the example graph.
MATCH
(dan:Person {name: "Dan"}),
(annie:Person {name: "Annie"}),
(matt:Person {name: "Matt"}),
(brie:Person {name: "Brie"}),
(john:Person {name: "John"})
CREATE
(guitar:Instrument {name: 'Guitar', cost: 1337.0}),
(synth:Instrument {name: 'Synthesizer', cost: 1337.0}),
(bongos:Instrument {name: 'Bongos', cost: 42.0}),
(trumpet:Instrument {name: 'Trumpet', cost: 1337.0}),
(dan)-[:LIKES]->(guitar),
(dan)-[:LIKES]->(synth),
(dan)-[:LIKES]->(bongos),
(annie)-[:LIKES]->(guitar),
(annie)-[:LIKES]->(synth),
(matt)-[:LIKES]->(bongos),
(brie)-[:LIKES]->(guitar),
(brie)-[:LIKES]->(synth),
(brie)-[:LIKES]->(bongos),
(john)-[:LIKES]->(trumpet)
MATCH (source:Person)-[r:LIKES]->(target:Instrument)
RETURN gds.graph.project(
'persons_with_instruments',
source,
target,
{
sourceNodeLabels: labels(source),
sourceNodeProperties: source { .age, .heightAndWeight },
targetNodeLabels: labels(target),
targetNodeProperties: target { .cost },
relationshipType: type(r)
},
{ undirectedRelationshipTypes: ['LIKES'] }
)
We can now run GraphSAGE in multi-label mode on that graph by specifying the projectedFeatureDimension parameter.
Multi-label GraphSAGE removes the requirement that each node in the in-memory graph must have all featureProperties.
However, the projections are independent per label, and even if two labels have the same featureProperty, they are considered as different features before projection.
The projectedFeatureDimension should equal the maximum length of the feature array.
In our example, persons have age (length 1) and heightAndWeight (length 2), summing up to a total length of 3.
Instruments only have cost, with a length of 1.
Thus, the projectedFeatureDimension should be set to 3.
For each node, the properties of its unique label are projected, using a label-specific projection, into a vector space of dimension projectedFeatureDimension.
Note that the cost feature is only defined for the instrument nodes, while age and heightAndWeight are only defined for persons.
CALL gds.beta.graphSage.train(
'persons_with_instruments',
{
modelName: 'multiLabelModel',
featureProperties: ['age', 'heightAndWeight', 'cost'],
projectedFeatureDimension: 3
}
)
Train with relationship weights
The GraphSAGE implementation supports training using relationship weights. Greater relationship weight between nodes signifies that the nodes should have more similar embedding values.
CALL gds.beta.graphSage.train(
'persons',
{
modelName: 'weightedTrainedModel',
featureProperties: ['age', 'heightAndWeight'],
relationshipWeightProperty: 'relWeight',
nodeLabels: ['Person'],
relationshipTypes: ['KNOWS']
}
)
Train when there are no node properties present in the graph
In the case where you have a graph that does not have node properties, we recommend using an existing algorithm in mutate mode to create node properties.
Good candidates are Centrality algorithms or Community algorithms.
The following example illustrates calling Degree Centrality in mutate mode and then using the mutated property as a feature for the GraphSAGE training.
For the purpose of this example we are going to reuse the person data, but we will not load any properties into the in-memory graph.
MATCH (source:Person)-[r:KNOWS]->(target:Person)
RETURN gds.graph.project(
'noPropertiesGraph',
source,
target,
{},
{ undirectedRelationshipTypes: ['*'] }
)
CALL gds.degree.mutate(
'noPropertiesGraph',
{
mutateProperty: 'degree'
}
) YIELD nodePropertiesWritten
CALL gds.beta.graphSage.train(
'noPropertiesGraph',
{
modelName: 'myModel',
featureProperties: ['degree']
}
)
YIELD trainMillis
RETURN trainMillis
gds.degree.mutate will create a new node property degree for each of the nodes in the in-memory graph, which can then be used as a featureProperty for gds.beta.graphSage.train.
Using separate algorithms to produce featureProperties can also be very useful to capture graph topology properties.
Stream
To generate embeddings and stream them back to the client we can use the stream mode.
We must first train a model, which we do using the gds.beta.graphSage.train procedure.
CALL gds.beta.graphSage.train(
'persons',
{
modelName: 'graphSage',
featureProperties: ['age', 'heightAndWeight'],
embeddingDimension: 3,
randomSeed: 19
}
)
Once we have trained a model (named 'graphSage') we can use it to generate and stream the embeddings.
CALL gds.beta.graphSage.stream(
'persons',
{
modelName: 'graphSage'
}
)
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name AS person, embedding
ORDER BY person, embedding
person | embedding |
---|---|
"Annie" | [0.5285002573, 0.4682181872, 0.7081378445] |
"Brie" | [0.5285002574, 0.4682181872, 0.7081378445] |
"Dan" | [0.5285002573, 0.4682181872, 0.7081378445] |
"Elsa" | [0.5285002574, 0.4682181872, 0.7081378444] |
"Jeff" | [0.5285002573, 0.4682181872, 0.7081378445] |
"John" | [0.5285002573, 0.4682181872, 0.7081378445] |
"Matt" | [0.5285002573, 0.4682181872, 0.7081378445] |
Due to the random initialisation of the weight variables the results may vary slightly between the runs.
Mutate
The model trained as part of the stream example can be reused to write the results to the in-memory graph, using the mutate mode of the procedure.
Below is an example of how to achieve this.
CALL gds.beta.graphSage.mutate(
'persons',
{
mutateProperty: 'inMemoryEmbedding',
modelName: 'graphSage'
}
) YIELD
nodeCount,
nodePropertiesWritten
nodeCount | nodePropertiesWritten |
---|---|
7 | 7 |
Write
The model trained as part of the stream example can be reused to write the results to Neo4j. Below is an example of how to achieve this.
CALL gds.beta.graphSage.write(
'persons',
{
writeProperty: 'embedding',
modelName: 'graphSage'
}
) YIELD
nodeCount,
nodePropertiesWritten
nodeCount | nodePropertiesWritten |
---|---|
7 | 7 |
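Once written, the embeddings are regular node properties in Neo4j and can be inspected with plain Cypher, for example:

MATCH (p:Person)
RETURN p.name AS person, p.embedding AS embedding
ORDER BY person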