Multi Tenancy in Neo4j: A Worked Example

Goals
In this guide, we will learn how to do multi tenancy in Neo4j.
Prerequisites
Please have Neo4j (version 4.0 or later) downloaded and installed. It helps to have read the section on managing multiple databases. We’ll also need to install APOC, Neo4j’s standard library.

Beginner

In Neo4j (v4.0+), we can create and use more than one active database at the same time. This works in standalone and causal cluster scenarios and allows us to maintain multiple, separate graphs in one installation.

In this guide we’re going to learn how to use this feature with a multi tenant dataset.

System setup

If you haven’t already, download Neo4j.

We will need to have a database running and open Neo4j Browser to walk through this guide. If you are unsure how to create and start a database, step-by-step instructions for doing so in Neo4j Desktop are provided in this guide.

We’re going to use the APOC Library in this guide, and in order to process local files, we’ll need need to add the following entry to the apoc.conf file:

apoc.conf
apoc.import.file.enabled=true

We’ll also be using Neo4j Fabric, which we can enable by adding the following properties in the neo4j.conf file:

neo4j.conf
fabric.database.name=fabric

fabric.graph.0.name=mall1
fabric.graph.0.uri=neo4j://localhost:7687
fabric.graph.0.database=mall1

fabric.graph.1.name=mall2
fabric.graph.1.uri=neo4j://localhost:7687
fabric.graph.1.database=mall2

Carrefour Retail dataset

We’re going to use a dataset published by Carrefour, a French retailer, as part of their Delighting Customers Challenge Basket Data. Rik van Bruggen recently wrote a series of blog posts showing how to import and analyse this data in Neo4j 3.5.

The data is available as a JSON file (1GB) curated by Rik. Once we’ve downloaded the file, we need to put it into the import folder of Neo4j.

The dataset is accompanied by a data dictionary that describes each of the fields that we see in the file:

Table 1. Data Dictionary
Field Description

id

Number id for that individual ticket.

mall

Store where the ticket was printed. It has two values, 1 and 2.

data

Date and time the ticket was printed.

client

Some tickets will have a Customer ID. Many tickets will share a Customer ID.

items

List of items contained in the printed ticket. The list contains a dictionary with a product description (desc), the amount charged (net_am), and the number of units bought (n_unit).

From looking at these descriptions, mall seems like a good property to use to partition the data. We can store the tickets printed in mall 1 in one database, and the tickets printed in mall 2 in another database.

Exploring the data

We can now process this file using the apoc.load.json procedure.

CALL apoc.load.json("file:///DelightingCustomersBDclean.json")
YIELD value
RETURN value
LIMIT 5;
Table 2. Results
value

{date: "2016-01-14T20:07:00.000+0000", client: 77021708271, _id: 1001, items: [{n_unit: 1, net_am: 1.0, desc: "CARAMELOS S/AZUCAR"}], mall: 2}

{date: "2016-01-14T15:25:00.000+0000", client: 77021708271, _id: 1002, items: [{n_unit: 1, net_am: 3.0, desc: "TOSTA VARIADA"}, {n_unit: 1, net_am: 1.0, desc: "BAGUETTE TORTILLA"}], mall: 1}

{date: "2016-01-14T20:07:00.000+0000", client: 77021708271, _id: 1003, items: [{n_unit: 1, net_am: 2.83, desc: "QUESO TIERNO MEZCL"}, {n_unit: 1, net_am: 1.65, desc: "GUISANTES MUY FINO"}, {n_unit: 1, net_am: 1.77, desc: "BIFIDUS CON FRUTAS"}, {n_unit: 1, net_am: 1.16, desc: "MAIZ DULCE PACK3X140"}, {n_unit: 1, net_am: 2.5, desc: "SANUS FRESA L. CASEI"}, {n_unit: 1, net_am: 1.0, desc: "FANTA LIMÓN S/BURB"}, {n_unit: 1, net_am: 1.85, desc: "CEREALES ESTRELLITAS"}, {n_unit: 1, net_am: 2.15, desc: "SALVASLIP EVAX"}, {n_unit: 1, net_am: 1.09, desc: "YOGUR NATURAL DANO"}, {n_unit: 1, net_am: 1.15, desc: "ARROZ LARGO SOS 1 KI"}, {n_unit: 1, net_am: 1.5, desc: "YORK SANDWICH ELPO"}, {n_unit: 1, net_am: 0.75, desc: "BARQUILLO NATA COVY"}, {n_unit: 1, net_am: 1.39, desc: "TRINA LIMON 1,5LITRO"}, {n_unit: 1, net_am: 1.85, desc: "CHORIZO DULCE CARR"}, {n_unit: 1, net_am: 1.0, desc: "PIMIENTO PIQUILLO"}, {n_unit: 24, net_am: 6.0, desc: "CERVEZA HOLANDESA"}, {n_unit: 1, net_am: 1.34, desc: "PECHUGA PAVO LONCH"}, {n_unit: 6, net_am: 1.52, desc: "AGUA CARREFOUR 2 L"}, {n_unit: 1, net_am: 1.69, desc: "FIAMBRE JAMON PAVO"}, {n_unit: 1, net_am: 0.66, desc: "BIFIDUS COCO 0%"}, {n_unit: 1, net_am: 2.74, desc: "DUPLO COLGATE TRIP"}], mall: 2}

{date: "2016-01-14T16:25:00.000+0000", client: 77021708271, _id: 1004, items: [{n_unit: 1, net_am: 0.64, desc: "AGUA SOLAN CABRAS"}], mall: 2}

{date: "2016-01-14T14:25:00.000+0000", client: 77021708271, _id: 1005, items: [{n_unit: 1, net_am: 3.9, desc: "PAQUETE 500 HOJAS A4"}, {n_unit: 1, net_am: 4.99, desc: "LEGGING NINA 3/14"}, {n_unit: 1, net_am: 9.99, desc: "JERSEY UNISEX 3/14"}, {n_unit: 3, net_am: 4.6, desc: "HUESITOS LECHE 12"}, {n_unit: 1, net_am: 2.65, desc: "MINI FUET CAMPOFRIO"}, {n_unit: 2, net_am: 2.78, desc: "REGAÑA ACEITE OLIV"}, {n_unit: 1, net_am: 15.95, desc: "MEGA MAKI"}], mall: 1}

We can see that some items are from mall 1 and some from mall 2. Let’s have a look how many tickets each mall printed:

CALL apoc.load.json("file:///DelightingCustomersBDclean.json")
YIELD value
RETURN value.mall, count(*)
Table 3. Results
value.mall count(*)

2

293586

1

292893

It looks like we’ll have two quite evenly populated databases. Let’s get those databases created!

Creating databases

We’re going to store the data for each mall in a different Neo4j database. We’ll create these databases using the CREATE DATABASE command, which we need to run against the system database. Let’s first switch to that database:

:use system;

Once we’ve done that we can run the following statements to create the databases:

CREATE DATABASE mall1;
CREATE DATABASE mall2;

We can check that those databases have been created by running the following command:

SHOW DATABASES;
Table 4. SHOW DATABASES
name address role requestedStatus currentStatus error default

"neo4j"

"0.0.0.0:7687"

"standalone"

"online"

"online"

""

TRUE

"system"

"0.0.0.0:7687"

"standalone"

"online"

"online"

""

FALSE

"fabric"

"0.0.0.0:7687"

"standalone"

"online"

"online"

""

FALSE

"mall1"

"0.0.0.0:7687"

"standalone"

"online"

"online"

""

FALSE

"mall2"

"0.0.0.0:7687"

"standalone"

"online"

"online"

""

FALSE

Graph Model

We’re going to import the data into the following graph model:

We’ll have nodes representing products, clients, tickets, and ticket items, and relationships that indicate which client purchased a ticket, the items that comprise a ticket, and the product that each item was for.

Importing the data

We’ll use the same apoc.load.json procedure to import the data into each of our databases. We’ll filter the JSON file by the mall property so that we load the data appropriately.

Let’s switch to the mall1 database and setup parameters that we’ll use in our import query:

:use mall1;
:param params => ({ url: "file:///DelightingCustomersBDclean.json", mall: 1});

And now we’ll create that indexes that we’re going to need:

CREATE INDEX on :Ticket(id);
CREATE INDEX on :Client(id);
CREATE INDEX on :Product(description);
CREATE INDEX on :TicketItem(netAmount);
CREATE INDEX on :TicketItem(units);
CREATE INDEX on :TicketItem(product);

Now we’re ready to process the JSON file. We’ll wrap our call to apoc.load.json inside a call to apoc.periodic.iterate so that our import is batched, which will avoid out of memory exceptions:

CALL apoc.periodic.iterate(
  "CALL apoc.load.json($url)
   YIELD value
   WHERE value.mall = $mall
   RETURN value
   LIMIT 20000",
  "CREATE (t:Ticket {id: value._id, datetime: datetime(value.date)})
   MERGE (c:Client {id: value.client})
   CREATE (c)-[:PURCHASED]->(t)
   WITH value, t
   UNWIND value.items as item
   CREATE (t)-[:HAS_TICKETITEM]->(ti:TicketItem {
     product: item.desc,
     netAmount: item.net_am,
     units: item.n_unit
   })
   MERGE (p:Product {description: item.desc})
   CREATE (ti)-[:FOR_PRODUCT]->(p)",
  { batchSize: 10000,
    iterateList: true,
    parallel: false,
    params: $params }
);

And now let’s do the same thing for mall 2. We’ll switch to that database and setup our parameters:

:use mall2;
:param params => ({ url: "file:///DelightingCustomersBDclean.json", mall: 2});

And now we’ll create that indexes that we’re going to need:

CREATE INDEX on :Ticket(id);
CREATE INDEX on :Client(id);
CREATE INDEX on :Product(description);
CREATE INDEX on :TicketItem(netAmount);
CREATE INDEX on :TicketItem(units);
CREATE INDEX on :TicketItem(product);

And now let’s import the tickets sold in this store:

CALL apoc.periodic.iterate(
  "CALL apoc.load.json($url)
   YIELD value
   WHERE value.mall = $mall
   RETURN value
   LIMIT 20000",
  "CREATE (t:Ticket {id: value._id, datetime: datetime(value.date)})
   MERGE (c:Client {id: value.client})
   CREATE (c)-[:PURCHASED]->(t)
   WITH value, t
   UNWIND value.items as item
   CREATE (t)-[:HAS_TICKETITEM]->(ti:TicketItem {
     product: item.desc,
     netAmount: item.net_am,
     units: item.n_unit
   })
   MERGE (p:Product {description: item.desc})
   CREATE (ti)-[:FOR_PRODUCT]->(p)",
  { batchSize: 10000,
    iterateList: true,
    parallel: false,
    params: $params }
);

We can see a sample of the imported graph in the Neo4j Browser visualization below:

Querying individual databases

Now that we’ve imported the data, let’s explore it by writing some queries. We’ll show the results of running some queries on both databases and some just on one of them.

We can switch between the databases by running the command :use mall1 or :use mall2.

The following query returns the biggest spending clients in the mall:

MATCH (client:Client)-[:PURCHASED]->(ticket)-[:HAS_TICKETITEM]->(item:TicketItem)
WHERE client.id <> "Unknown"
WITH client, count(DISTINCT ticket) AS tickets,
     apoc.math.round(sum(item.netAmount), 2) AS totalSpend
RETURN client.id AS clientId, totalSpend, tickets,
       apoc.math.round(totalSpend / tickets, 2) AS spendPerTicket
ORDER BY totalSpend DESC
LIMIT 10;
Table 5. Mall 1 biggest spenders
clientId totalSpend tickets spendPerTicket

77038482725

4125.71

95

43.43

77091111583

3111.06

29

107.28

77042913479

2726.84

51

53.47

77035560017

2664.49

69

38.62

77081132118

2644.51

33

80.14

77060098914

2433.68

24

101.4

77071854908

2379.27

27

88.12

77079395996

2272.06

27

84.15

77042176706

2183.27

65

33.59

77059085165

2162.1

44

49.14

Table 6. Mall 2 biggest spenders
clientId totalSpend tickets spendPerTicket

77038482725

3314.19

87

38.09

77074230047

3281.82

47

69.83

77035560017

3120.69

67

46.58

77042573786

2696.9

31

87.0

77017054434

2572.42

45

57.16

77091111583

2494.08

27

92.37

77037711999

2369.63

36

65.82

77053280208

2359.28

58

40.68

77045642784

2337.03

44

53.11

77049483454

2259.89

56

40.36

Interestingly, the biggest spending client in both malls is the same person.

The spendPerTicket also looks interesting. The following query shows the products bought by the client that has the highest spendPerTicket in mall 2 :

MATCH path = (:Client {id: 77078849426})-[:PURCHASED]->()-->(:TicketItem)-->(:Product)
RETURN path

On the left hand side of the diagram we can see highlighted nodes that have skewed the average spend! My Spanish isn’t great, but I can see an iPad Air among the items purchased. ASP ROB LG HOMB S5 is a robot vacuum cleaner and I think POR APP MB MJVE2YA is an Apple Macbook. The other items look more like the type of products you’d buy in a normal grocery shop.

Let’s go back to mall 1 and see which products are purchased most often. We can write the following query to compute this:

MATCH (item:TicketItem)-->(product:Product)
WITH product, apoc.math.round(sum(item.netAmount), 2) AS totalSpend,
     count(*) AS purchases
RETURN product.description, totalSpend, purchases,
       apoc.math.round(totalSpend / purchases, 2) AS costPerItem
ORDER BY totalSpend DESC
LIMIT 10;
Table 7. Mall 1 popular products
product.description totalSpend purchases costPerItem

"ROSCON NATA MEDIANO"

4596.64

450

10.21

"ACEITE OLIVA HOJIB"

3597.33

269

13.37

"CERVEZA MAHOU CLAS"

3328.62

363

9.17

"CERVEZA MAHOU 5 *"

2870.3

383

7.49

"ROSCON MEDIANO SIN R"

2632.75

363

7.25

"ATÚN CLARO ACEITE"

2404.49

502

4.79

"ROSCON DE REYES"

2363.89

158

14.96

"ACEITE OLIVA CARBO"

2268.72

180

12.6

"CAMISA CRO M/LARGA"

2208.31

293

7.54

"PAÑAL DODOT"

2186.34

75

29.15

All the top items are priced at less than 30 euros each. We can see a couple of beers (cerveza) in positions 3 and 4, and then Roscon seems to be a range of tasty looking pastries.

Querying across databases

We can also write queries across databases using Neo4j Fabric. This functionality is useful for executing aggregate queries across multiple databases.

Let’s first change to the fabric database by running the following command:

:use fabric

Now we’re going write a query that returns the total spend on products in each of our databases.

We’ll prefix our queries with the USE clause, which tells the Fabric engine which database to execute the query against. The syntax of this clause is as follows:

USE <fabric.database.name>.<fabric.graph.*.database>

Or in our case USE fabric.mall1 and USE fabric.mall2.

The following query finds the total spend and number of purchases in each mall, combines the results using the UNION ALL clause, and then orders the results by total spend:

USE fabric.mall1
MATCH (item:TicketItem)-->(product:Product)
RETURN "Mall 1" as mall,
       apoc.math.round(sum(item.netAmount), 2) AS totalSpend,
       count(*) AS purchases

UNION ALL

USE fabric.mall2
MATCH (item:TicketItem)-->(product:Product)
RETURN "Mall 2" as mall,
       apoc.math.round(sum(item.netAmount), 2) AS totalSpend,
       count(*) AS purchases

ORDER BY totalSpend DESC;
Table 8. Total spending in malls
mall totalSpend purchases

"Mall 1"

953536.51

219561

"Mall 2"

943463.47

219966

The total spend and number of purchases are almost identical between the malls! We can also write a version of this query that reduces duplication by using the built-in <fabric-database-name>graphIds function and the CALL{} clause, as shown below:

WITH ["Mall 1", "Mall 2"] AS malls
UNWIND fabric.graphIds() AS graphId
CALL {
  USE fabric.graph(graphId)
  MATCH (item:TicketItem)-->(product:Product)
  RETURN apoc.math.round(sum(item.netAmount), 2) AS totalSpend,
         count(*) AS purchases
}
RETURN malls[graphId] AS mall, totalSpend, purchases
ORDER BY totalSpend DESC;

We can also aggregate the results returned by the sub queries executed on each of the mall databases. The following query finds the highest selling products in each mall, and then shows the top 10 products across both malls:

UNWIND fabric.graphIds() AS graphId
CALL {
  USE fabric.graph(graphId)
  MATCH (item:TicketItem)-->(product:Product)
  WITH product, apoc.math.round(sum(item.netAmount), 2) AS totalSpend,
       count(*) AS purchases
  RETURN product, totalSpend, purchases
  ORDER BY totalSpend DESC
}
RETURN product.description, totalSpend, purchases
ORDER BY totalSpend DESC
LIMIT 10
Table 9. Top selling products across malls
product.description totalSpend purchases

"ROSCON NATA MEDIANO"

4596.64

450

"ROSCON NATA MEDIANO"

4462.64

451

"ACEITE OLIVA HOJIB"

3597.33

269

"LED SAM UE48J6200"

3546.08

6

"CERVEZA MAHOU CLAS"

3406.55

392

"CERVEZA MAHOU CLAS"

3328.62

363

"PAÑAL DODOT"

3078.55

99

"CERVEZA MAHOU 5 *"

2870.3

383

"ROSCON MEDIANO SIN R"

2632.75

363

"ACEITE OLIVA HOJIB"

2478.99

222

The top selling product in both malls is ROSCON NATA MEDIANO, but it’d be good if the sales for each product was aggregated so that we have only one row per product. We can do this using the SUM aggregation function:

UNWIND fabric.graphIds() AS graphId
CALL {
  USE fabric.graph(graphId)
  MATCH (item:TicketItem)-->(product:Product)
  WITH product, apoc.math.round(sum(item.netAmount), 2) AS totalSpend,
       count(*) AS purchases
  RETURN product, totalSpend, purchases
  ORDER BY totalSpend DESC
}
RETURN product.description, sum(totalSpend) AS totalSpend, sum(purchases) AS purchases
ORDER BY totalSpend DESC
LIMIT 10
The implicit group by key for these aggregation functions is product.description and not product. The product node is completely different in each of the databases, so if we grouped by that we’d still end up with individual per product sold in each mall.

The table below shows the results of running the aggregation query:

Table 10. Top aggregated selling products across malls
product.description totalSpend purchases

"ROSCON NATA MEDIANO"

9059.28

901

"CERVEZA MAHOU CLAS"

6735.17

755

"ACEITE OLIVA HOJIB"

6076.32

491

"LED SAM UE48J6200"

5343.08

9

"PAÑAL DODOT"

5264.89

174

"CERVEZA MAHOU 5 *"

5236.620000000001

740

"ROSCON MEDIANO SIN R"

4902.26

689

"ROSCON DE REYES"

4783.57

318

"ATÚN CLARO ACEITE"

4758.15

974

"CAMISA CRO M/LARGA"

4335.5599999999995

571