# Graph Data Modeling - Asking the Right Questions

As the NoSQL Data Store Team, we work with application teams to select data stores appropriate for their use case. One area of datastore technology that we have been involved in with increasing frequency is the consulting and provisioning of graph databases.

This increase somewhat reflects current trends in database technology at large. Graph databases have seen an uptick in popularity and use over the past few years. This is largely due to companies like Neo4j and DataStax releasing newer, cutting-edge graph database technologies, as well as increasing interest in events like Graph Day and GraphConnect.

Graph databases differ from other databases in that they provide the following:

- Structures built around storing and treating both relationships and entities as equals.
- An engine built around efficient entity/relationship traversal.

The culmination of this lies in one common question, which usually falls to the NoSQL Data Store Team to answer: *“Is my use case a good fit for graph?”*

Unfortunately, the answer isn’t always easy. We do routinely ask some questions around the importance of relationships, as well as how many “levels” of relationships would need to be commonly queried. The idea is to get teams to talk about what, exactly, they’re trying to do. What is clear is that there are varying levels of comfort when it comes to knowing how and when to use a graph database.

**The Königsberg Bridge Problem**

In an effort to “bridge” the knowledge gap between GraphDB technology and its appropriate use cases, I felt that it would be helpful to discuss the first graph use case. Graph theory is a discipline of mathematics pioneered by Swiss mathematician Leonard Euler (pronounced Oy-Ler). Euler’s theories came about as he analyzed citizens traversing the bridges in the city of Königsberg, Prussia, in 1736. (See Paoletti T. 2011. **Leonard Euler’s Solution to the Königsberg Bridge Problem**. Mathematical Association of America. Retrieved on 20181204 from: (https://www.maa.org/press/periodicals/convergence/leonard-eulers-solution-to-the-konigsberg-bridge-problem).)

Königsberg (now present-day Kaliningrad, Russia) is split by the Pregel River. The city’s territory also encompasses two large islands within the river itself. Structures for living and commerce existed on the North and South Banks, as well as on the Kneiphof and Lomse islands. These lands were connected by a series of seven bridges, as depicted in Figure 1.

*Figure 1 – A representation of Königsberg in 1736, illustrating the path of the Pregel River, and the approximate locations of the seven bridges.*

Euler wondered if it was possible to traverse all the bridges in Königsberg, without crossing the same bridge twice. It is important to note, but Euler’s question itself affected how the model was built. When looking at the problem in the context of the question to be answered, it becomes clear that the bridges are just as important as the landmasses.

We can attempt to emulate this model the Neo4j graph database. With some quick Cypher code, we can quickly create the landmasses and the bridges which connect them.

```
CREATE (L1:Land {name:'North Bank'})
CREATE (L2:Land {name:'South Bank'})
CREATE (L3:Land {name:'Lomse'})
CREATE (L4:Land {name:'Kneiphof'})
CREATE (L1)-[B1:bridge {name:'bridge 1'}]->(L4)
CREATE (L1)-[B2:bridge {name:'bridge 2'}]->(L4)
CREATE (L1)-[B3:bridge {name:'bridge 3'}]->(L3)
CREATE (L4)-[B4:bridge {name:'bridge 4'}]->(L3)
CREATE (L4)-[B5:bridge {name:'bridge 5'}]->(L2)
CREATE (L4)-[B6:bridge {name:'bridge 6'}]->(L2)
CREATE (L2)-[B7:bridge {name:'bridge 7'}]->(L3);
```

Then, we can query Neo4j for all entities with a “bridge” relationship, as shown below in Figure 2:

```
MATCH l=()-[:bridge]->() RETURN l;
```

*Figure 2 – The Seven Bridges of Königsberg model shown in the Neo4j browser.*

Some quick yet important aspects to note about this model is that it:

- Is almost 300 years old.
- Considers both the bridges and landmasses of equal importance.
- Involved traversing seven bridges to see if a completely unique path could be found.

Let that last point sink in a little bit. The first use case for graph theory was not an analysis of some large-scale task or system. But rather, it was something as simple as walking a unique path over seven bridges. Even Euler admitted that he found the problem laden with both intrigue and banality, saying (Paoletti 2011): “This question is so banal, but seemed to me worthy of attention in that [neither] geometry, nor algebra, nor even the art of counting was sufficient to solve it.”

Representing the Seven Bridges problem as a graph model helped Euler to make sense of his observations. Given this new way of approaching the problem, he was able to generalize that for a unique path to be present, that (Paoletti 2011) “there can be at most two landmasses with an odd number of bridges.” In the context of the Königsberg model, Euler concluded that a unique path therefore could not be found for four landmasses connected by an odd number of bridges.

**Lessons Learned from the Seven Bridges Model**

There are certain data model aspects which are common to graph use cases. Identifying those aspects is important to figuring out how to leverage the entities and relationships to benefit your model. If your model has multiple properties from this list (below), it is likely a viable graph use case. (See Carpenter J. 2018. **Why you should use a graph database**. InfoWorld. Retrieved on 20181204 from: (https://www.infoworld.com/article/3251829/nosql/why-you-should-use-a-graph-database.html).) Let us consider the Seven Bridges of Königsberg problem with the following four properties:

- The model contains frequent many-to-many relationships.
- The relationships are just as important as the data entities.
- The application to be served requires traversal (queries) at low latency over large scale.
- Traversal of entities and relationships may occur over a varying number of levels.

There is no single landmass (entity) that is connected by a single bridge (relationship); they are all many-to-many. The key to this problem is in determining how to traverse the bridges, making the relationships the primary focus. Putting the unique path aside for a moment, there are a multitude of paths (levels) in which the entities (and thus, the bridges) could be traversed. While the problem of scale isn’t there, this model does satisfy the three remaining properties for graph.

Another point to consider is the questions which you may want your graph model to solve. Euler was concerned with finding a unique path in his relationships. As with most distributed databases, the design of the data model should take potential queries into consideration.

Fast-forward 283 years and now assume that you are building an application around managing databases that back a 300,000-user enterprise and a Fortune 50 e-commerce website. To manage resources at that scale, it would be helpful to observe which applications might be affected by changes to certain database products.

One way to model that would be to split out applications, databases and specific product versions into entities. Talking out how those entities interact with one another should produce relationship types: Applications “connect to” databases, and databases run on a “version” of a datastore product.

Given these assumptions (and data), I should be able to provide an answer to the question of *“Which applications would be affected by an upgrade of Apache Cassandra 3.11.2?”* The results can be seen in Figure 3:

*Figure 3 – Visual results showing the applications that currently use an Apache Cassandra 3.11.2 database. An enterprise-wide upgrade of that version would indeed have far-reaching effects.*

In the scope of this problem, identifying the applications, databases and product versions as entities made sense. I was able to discover the relationship types among the entities by describing how they work with one another. While the “is version” relationship may not be immediately apparent, it presented itself once I applied the context of the question that I wanted to ask it.

Do note that the product version very well could have been a property on the “Database” entity. However, as the product version is central to solving our problem, we’ll want to make it an entity to avoid “hiding” that aspect in the data model. (See Armbruster S. 2016. **Welcome to the Dark Side: Neo4j Worst Practices (& How to Avoid Them)**. Neo4j, Inc. Retrieved on 20181205 from: (https://neo4j.com/blog/dark-side-neo4j-worst-practices/).)

**Summary**

Going back and reading through material on the Seven Bridges of Königsberg problem changed the way that I approached building graph data models. There were always questions that I felt were good to ask teams about their model. While helpful, they did have a tendency to obfuscate the real questions that the model was designed to answer.

Keep this in mind for the next time you are considering a database for your application. If you are wondering if your data model is a graph problem, don’t ask me, another engineer or even another human being for that matter. Consider your use case and ask it, “Are you a good fit for graph?”

*Aaron Ploetz is the NoSQL engineering lead for Target. He is a three-time DataStax MVP for Apache Cassandra and has authored the books “Seven NoSQL Databases in a Week” and “Mastering Apache Cassandra 3.x.”*