Picture the Titanic, a floating palace that sank and killed two-thirds of those aboard. Whole families perished, children were left orphaned, wives watched their husbands drown. A close reading of the passenger manifest can reveal the relationships among the people who died or survived, but each case takes a great deal of deduction, and that deduction is something a graph and Cypher can facilitate.
Exploring the Manifest
The passenger manifest records details about each passenger, which means we can sort them into groups by gender, age, class, and ticket. Rather frustratingly, the manifest also records the number of parents or children traveling with each passenger as a single combined number (parch). The same conflation has occurred for siblings and spouses (sibsp). A cursory look at the manifest suggests that, for each person, each of these numbers refers to only one of the possible relations (i.e., ‘parch’ refers to either parents or children but not both, so a person who is a parent does not have their own parents aboard). The same appears to hold for ‘sibsp’: those who are married do not have any of their siblings aboard.
Understanding the specific relationships among passengers on the Titanic allows us to ask interesting questions of the graph that would be extremely difficult and time-consuming to answer with just a tabular database like the manifest. Did those who survived have siblings aboard? If so, were their siblings male or female? Did they have their mother, their father, or both aboard? What proportion of surviving women were single or married? How many survivors had their children aboard, and did those children also survive?
To parse the relationships from the data, it was necessary to think about what the numbers in the database represented and whether a unique set of clauses could be defined to determine relationships from that data.
Defining the Relationship Categories
The first step was to define the categories of relationships we were interested in.
Here are the three relationships I had to define: MARRIED_TO, SIBLING_OF, and PARENT_OF.
Both MARRIED_TO and SIBLING_OF would imply the same relationship in the other direction between the same nodes. PARENT_OF would imply a reverse relationship of CHILD_OF.
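Because a Cypher pattern can ignore a relationship's direction, or read it from either end, the implied reverse relationships do not have to be stored separately. A minimal sketch, assuming the relationships have already been created by the queries shown later in this article:

```cypher
// Spouses: leave out the arrowhead so the pattern matches either stored direction.
MATCH (p:Person)-[:MARRIED_TO]-(spouse:Person)
RETURN p, spouse;

// CHILD_OF is never stored: read PARENT_OF from the child's side instead.
MATCH (child:Person)<-[:PARENT_OF]-(parent:Person)
RETURN child, collect(parent) AS parents;
```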
I had to assume that family members would be traveling on the same ticket, and to be sure not to marry children to their parents or vice-versa, I needed to know their ages. Fortunately, the data on ticket numbers and ages is fairly complete in the passenger manifest, so I began with the following:
MATCH (person:Person)
WHERE person.age IS NOT NULL
MATCH (person:Person)-[:TRAVELED_ON]->(ticket:Ticket)<-[:TRAVELED_ON]-(other:Person)
To avoid including servants and other people on the tickets who were not family members in the search, I added the condition that the total number of family members for each person in the relationship had to be the same:
WHERE other.age IS NOT NULL AND person.family = other.family
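The queries in this article rely on Person and Ticket nodes joined by TRAVELED_ON, and on a family property holding each person's total number of family members. The article does not show how these were loaded, so the following is only a sketch of one way such a graph might be built. The file name, the CSV column names, the name and number properties, and the derivation of family as sibsp plus parch are all assumptions made for illustration.

```cypher
// Hypothetical loader: file name and column names are assumptions,
// and every row is assumed to carry a ticket value.
LOAD CSV WITH HEADERS FROM 'file:///titanic.csv' AS row
MERGE (ticket:Ticket {number: row.ticket})
CREATE (person:Person {name: row.name})
SET person.sex    = row.sex,
    person.age    = toFloat(row.age),      // left unset when the age is missing
    person.sibsp  = toInteger(row.sibsp),
    person.parch  = toInteger(row.parch),
    person.fate   = row.fate,
    // Assumption: family is the total number of relatives aboard.
    person.family = toInteger(row.sibsp) + toInteger(row.parch)
CREATE (person)-[:TRAVELED_ON]->(ticket);
```

Using MERGE for the ticket and CREATE for the person means passengers who share a ticket number end up attached to the same Ticket node, which is what the shared-ticket pattern above depends on.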
With those parameters in place, I began defining the [:MARRIED_TO] relationship. It was likely that if a MARRIED_TO relationship existed on a ticket, it would exist between the eldest people in the family, so I ordered the other people on the ticket by descending age and collected them together as a list called familyMembers:
WITH person, other ORDER BY other.age DESC
WITH person AS p1, collect(other) AS familyMembers
The following parameters defined the rest of the MARRIED_TO relationship.
That they would have one spouse:
p1.sibsp = 1
That their potential spouse would also have one spouse:
p2.sibsp = 1
They would have at least one family member on the ticket:
p2.family >= 1
They would be the opposite sex of their spouse:
p2.sex <> p1.sex
That the potential spouse was the eldest other family member on the ticket; this assumes that there is no more than one married couple on each ticket and that any family members older than the married couple would not have any siblings aboard, which would register as sibsp:
p2 = familyMembers[0]
And that if there were other family members on the ticket apart from the spouse, that the person would be older than them; this is to prevent the children being married off to their second eldest parent (this took a lot of trial and error):
(size(familyMembers) = 1 OR p1.age > familyMembers[1].age)
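To see why the final age guard matters, the rules can be tried on literal maps without touching the database. This is a sketch on an invented four-person family (every name and value is made up), evaluated from the daughter's point of view, with the other three listed eldest first as the ordering step would produce them.

```cypher
// Toy family on one ticket (all values invented for illustration).
WITH {name: 'father',   sex: 'male',   age: 40, sibsp: 1, family: 3} AS father,
     {name: 'mother',   sex: 'female', age: 38, sibsp: 1, family: 3} AS mother,
     {name: 'son',      sex: 'male',   age: 10, sibsp: 1, family: 3} AS son,
     {name: 'daughter', sex: 'female', age: 8,  sibsp: 1, family: 3} AS daughter
// p1 is the daughter; familyMembers holds everyone else, eldest first.
WITH daughter AS p1, [father, mother, son] AS familyMembers
RETURN [p2 IN familyMembers
         WHERE p1.sibsp = 1
           AND p2.sibsp = 1
           AND p2.family >= 1
           AND p2.sex <> p1.sex
           AND p2 = familyMembers[0]
           AND (size(familyMembers) = 1 OR p1.age > familyMembers[1].age)
       | p2.name] AS spouses;
```

The father satisfies every rule except the last one: the daughter has one sibling, he has one spouse, they are of opposite sex, and he is the eldest on the ticket. Only the comparison with the second-eldest member (the mother) rules him out, so the list should come back empty. Dropping that last condition would pair the daughter with her father.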
The Queries
Eventually, this was the query I drew up to create the MARRIED_TO relationships:
MATCH (person:Person)
WHERE person.age IS NOT NULL
MATCH (person:Person)-[:TRAVELED_ON]->(ticket:Ticket)<-[:TRAVELED_ON]-(other:Person)
WHERE other.age IS NOT NULL
  AND person.family = other.family
WITH person, other ORDER BY other.age DESC
WITH person AS p1, collect(other) AS familyMembers
WITH p1, familyMembers,
  [p2 IN familyMembers
    WHERE p1.sibsp = 1
      AND p2.sibsp = 1
      AND p2.family >= 1
      AND p2.sex <> p1.sex
      AND p2 = familyMembers[0]
      AND (size(familyMembers) = 1 OR p1.age > familyMembers[1].age)
  ] AS spouses
FOREACH (p IN spouses | CREATE (p1)-[:MARRIED_TO]->(p))
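Since the query runs once per person, each spouse is expected to create their own outgoing MARRIED_TO relationship. A couple of sanity checks along the following lines may help confirm that the inference behaved as intended. They are a sketch rather than part of the original workflow, and both would ideally return no rows.

```cypher
// Marriages that were inferred in one direction only.
MATCH (a:Person)-[:MARRIED_TO]->(b:Person)
WHERE NOT EXISTS { (b)-[:MARRIED_TO]->(a) }
RETURN a, b;

// People linked to more than one distinct spouse.
MATCH (p:Person)-[:MARRIED_TO]-(spouse:Person)
WITH p, count(DISTINCT spouse) AS spouses
WHERE spouses > 1
RETURN p, spouses;
```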
Next, I worked on creating the SIBLING_OF relationship as a list that could only be created from people who were not spouses, then finally on the PARENT_OF relationship, where I assumed parents could not also be siblings of people onboard:
MATCH (person:Person)
WHERE person.age IS NOT NULL
MATCH (person:Person)-[:TRAVELED_ON]->(ticket:Ticket)<-[:TRAVELED_ON]-(other:Person)
WHERE other.age IS NOT NULL
  AND person.family = other.family
WITH person, other ORDER BY other.age DESC
WITH person AS p1, collect(other) AS familyMembers
WITH p1, familyMembers,
  [p2 IN familyMembers
    WHERE NOT (p2)-[:MARRIED_TO]->()
      AND NOT (p1)-[:MARRIED_TO]->()
      AND p2.sibsp >= 1
      AND p2.sibsp = p1.sibsp
      AND p2.family >= 1
      AND (p2.parch = 1 OR p2.parch = 2)
      AND NOT p2 = familyMembers[0]
      AND NOT p1 = familyMembers[0]
  ] AS siblings
WITH p1, familyMembers, siblings,
  [p2 IN familyMembers
    WHERE NOT (p2)-[:MARRIED_TO]->()
      AND NOT p2 IN siblings
      AND NOT p1 IN siblings
      AND p2.family >= 1
      AND (p2.parch = 1 OR p2.parch = 2)
      AND p1.parch >= 1
      AND p1.age > p2.age
  ] AS children
FOREACH (p IN siblings | CREATE (p1)-[:SIBLING_OF]->(p))
FOREACH (p IN children | CREATE (p1)-[:PARENT_OF]->(p))
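One way to review the result is to compare the inferred sibling count with the sibsp value recorded in the manifest. The sketch below assumes Neo4j 5 (for the COUNT subquery) and assumes that, for a person with no inferred spouse, sibsp counts siblings only. Rows it returns are not necessarily errors, just passengers worth checking by hand.

```cypher
// Unmarried passengers whose inferred sibling count differs from sibsp.
MATCH (p:Person)
WHERE p.age IS NOT NULL
  AND NOT EXISTS { (p)-[:MARRIED_TO]-() }
WITH p, COUNT { (p)-[:SIBLING_OF]->() } AS inferredSiblings
WHERE inferredSiblings <> p.sibsp
RETURN p, p.sibsp AS recordedSiblings, inferredSiblings
ORDER BY recordedSiblings DESC;
```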
With these relationships created in the graph, I could then ask questions such as ‘of the people who died, how many of them had siblings aboard?’
MATCH (p:Person {fate: 'Died'}) RETURN COUNT(p), EXISTS {(p)-[:SIBLING_OF]-()}
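The same relationship can answer the earlier question about whether survivors' siblings were male or female. This is a sketch only. It treats anyone whose fate is not 'Died' as a survivor, which is an assumption, since the article shows only the 'Died' value.

```cypher
// Survivors with at least one sibling aboard, broken down by the sibling's sex.
MATCH (p:Person)-[:SIBLING_OF]-(s:Person)
WHERE p.fate <> 'Died'
RETURN s.sex AS siblingSex,
       count(DISTINCT p) AS survivors,
       count(DISTINCT s) AS siblings;
```

The DISTINCT counts matter here because the undirected pattern can match the same pair more than once when SIBLING_OF was created in both directions.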
Query to return the number of people without siblings that died or survived:
MATCH (p:Person)
WHERE NOT EXISTS {(p)-[:SIBLING_OF]-()}
WITH count(p) AS totalnosib,
     sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) AS nosibdied
RETURN totalnosib,
       nosibdied,
       1.0 * nosibdied / totalnosib AS pdied,
       totalnosib - nosibdied AS nosibsurvived,
       1 - (1.0 * nosibdied / totalnosib) AS psurvived
Total people without siblings = 1204
People with no siblings who died = 754
Proportion that died of those without siblings = 0.626
People with no siblings who survived = 450
Proportion that survived of those without siblings = 0.374
Query to return the number of people with spouses that died or survived:
MATCH (p:Person)
WHERE EXISTS {(p)-[:MARRIED_TO]-()}
WITH count(p) AS totalsp,
     sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) AS spdied
RETURN totalsp,
       spdied,
       1.0 * spdied / totalsp AS pdied,
       totalsp - spdied AS spsurvived,
       1 - (1.0 * spdied / totalsp) AS psurvived
Total people with a spouse = 202
People with a spouse who died = 100
Proportion that died with a spouse = 0.495
People with a spouse who survived = 102
Proportion that survived with a spouse = 0.505
And finally, a query to return the number of people with children aboard that died or survived:
MATCH (p:Person)
WHERE EXISTS {(p)-[:PARENT_OF]->()}
WITH count(p) AS totalpar,
     sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) AS pardied
RETURN totalpar,
       pardied,
       1.0 * pardied / totalpar AS pdied,
       totalpar - pardied AS parsurvived,
       1 - (1.0 * pardied / totalpar) AS psurvived
The returns from these queries suggest that of those who had children, more than half survived. This does not take into account the ages of the passengers, some of whom would not be old enough to have children. Accounting for this is simply a matter of adding another condition to the WHERE clause:
MATCH (p:Person)
WHERE p.ageClass = 'Adult'
  AND NOT EXISTS {(p)-[:PARENT_OF]->()}
WITH count(p) AS totalnopar,
     sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) AS nopardied
RETURN totalnopar,
       nopardied,
       1.0 * nopardied / totalnopar AS pdied,
       totalnopar - nopardied AS noparsurvived,
       1 - (1.0 * nopardied / totalnopar) AS psurvived
totalnopar = 1075
nopardied = 703
pdied = 0.654
noparsurvived = 372
psurvived = 0.346
As survivorship for men was significantly lower than for women, we could also add a clause to see the difference between survival chances for men vs. women, both with and without children:
MATCH (p:Person)
WHERE p.ageClass = 'Adult'
  AND p.sex = 'male'
  AND EXISTS {(p)-[:PARENT_OF]->()}
WITH count(p) AS totalpar,
     sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) AS pardied
RETURN totalpar,
       pardied,
       1.0 * pardied / totalpar AS pdied,
       totalpar - pardied AS parsurvived,
       1 - (1.0 * pardied / totalpar) AS psurvived
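Rather than running one query per combination of sex and parenthood, the comparison can also be made in a single pass by using both as grouping keys. This is a sketch built on the same properties as the queries above and assumes Neo4j 5, where an EXISTS subquery can be used as an ordinary expression.

```cypher
// Adult survival by sex and by whether the person has children aboard.
MATCH (p:Person)
WHERE p.ageClass = 'Adult'
WITH p.sex AS sex,
     EXISTS { (p)-[:PARENT_OF]->() } AS hasChildren,
     count(p) AS total,
     sum(CASE WHEN p.fate = 'Died' THEN 1 ELSE 0 END) AS died
RETURN sex, hasChildren, total, died,
       total - died AS survived,
       1.0 * died / total AS pdied
ORDER BY sex, hasChildren;
```

Each row of the result is one group, so the four combinations (men and women, with and without children) can be read side by side.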
The Findings
Within each gender, survivorship differs little according to whether the passenger had children aboard. However, this example shows that the flexibility of the graph makes it possible to draw on multiple properties of an item, both in constructing the graph and in the queries you ask of it.
The data about familial relationships existed in the passenger manifest, but was only readable in that format by careful cross-referencing of passengers on each ticket. The clausal capacity of the graph, however, allowed me to extract these relationships automatically and ask questions of the data that would have been impossible to answer from a tabular database.
Information About the Dataset
The passenger manifest for this dataset was originally downloaded from GitHub, but has been added to and changed through references to the following websites:
Two passengers have been added, and several hundred are missing ages. Some of the data on these websites is uncertain and based on speculation. Where ambiguity exists, the most likely or simplest option was chosen to best fill out the CSV as completely as possible. This version of the passenger manifest should, therefore, not be taken as an accurate or complete representation of the actual passenger manifest of the Titanic.
The final dataset is available as a GitHub gist.