
Graph Databases and Knowledge Bases

Minimalist Description

We have been working with graphs for decades, and graph databases since soon after they were introduced. Graphs use nodes and links (aka vertices and edges) to describe how things relate to one another: 
  • San Diego and Los Angeles are 120 miles apart,
  • Paul McCartney and Ringo Starr were in the same band,
  • Pandu is Jamie's badminton coach,
  • Money moves from one bank account to another through a bank transfer,
  • Etc.
Graphs are all around us and graph analytics are important decision aids that can be used to:
  • Decide on courses of action: "Take Waring Road to Navajo Road, then turn right on Jackson Drive",
  • Discover loose communities: "These people appear to form a money laundering network",
  • Make recommendations: "You might be interested in this restaurant".
Graph databases make it possible to store graphs that are too large to keep in memory, enabling analytics that are not otherwise feasible. 
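To make the "decide on courses of action" use concrete, here is a minimal Python sketch: a toy road network stored as an adjacency list, with a breadth-first search finding the fewest-hops route. The road names echo the example above, but the network itself is invented for illustration.

```python
from collections import deque

# A toy road network as an adjacency list; the layout is illustrative.
roads = {
    "Home": ["Waring Road"],
    "Waring Road": ["Home", "Navajo Road"],
    "Navajo Road": ["Waring Road", "Jackson Drive"],
    "Jackson Drive": ["Navajo Road", "Store"],
    "Store": ["Jackson Drive"],
}

def shortest_route(graph, start, goal):
    """Breadth-first search: returns the fewest-hops path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_route(roads, "Home", "Store"))
```

A graph database does essentially this kind of traversal, but over graphs far too large to sit in a Python dictionary.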

Knowledge bases use graph relationships to describe what we know about the world. A relationship in a knowledge base describes a piece of knowledge, like "Mars is a planet". But knowledge bases also let us define what a planet is. Concepts themselves are part of the dynamic data in a knowledge base.


Graphs are a formal mathematical way to describe relationships between things. The "things" are called vertices and the "relationships" are called edges. Graphs can be reasoned about by traversing edges from vertex to vertex. In fact, graph traversal is like diagramming sentences, except in reverse. Instead of starting with the sentence and building the diagram, we start with the diagram and traversal generates the sentence. The graph at the right sketches out a few of the character relationships in Hamlet. For each vertex-edge-vertex relationship we could write a sentence:
  • Polonius is the father of Laertes.
  • Polonius is the father of Ophelia.
  • Hamlet is the boyfriend of Ophelia.
  • Hamlet kills Polonius.
  • Laertes kills Hamlet.
  • Hamlet kills Laertes.
We could impose some rules to make other inferences about the graph. We can look for patterns in the graph and "mine" these for additional information.
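The "traversal generates the sentence" idea can be sketched in a few lines of Python. The triples below transcribe the Hamlet relationships listed above; the code is purely illustrative, not any particular database's API.

```python
# The Hamlet relationships as (subject, relation, object) triples.
edges = [
    ("Polonius", "is the father of", "Laertes"),
    ("Polonius", "is the father of", "Ophelia"),
    ("Hamlet", "is the boyfriend of", "Ophelia"),
    ("Hamlet", "kills", "Polonius"),
    ("Laertes", "kills", "Hamlet"),
    ("Hamlet", "kills", "Laertes"),
]

def sentences(triples):
    """Diagramming in reverse: each vertex-edge-vertex becomes a sentence."""
    return [f"{s} {rel} {o}." for s, rel, o in triples]

for line in sentences(edges):
    print(line)
```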

Because Polonius is father to both Laertes and Ophelia, 
  • Laertes is Ophelia's brother.
  • Ophelia is Laertes' sister.
This inference is less certain. They could be half siblings, but it's a reasonable conjecture based on time-tested patterns of family relationships. Further, we might reason that Laertes kills Hamlet because Hamlet killed Laertes' father. That's less certain still. Literary scholars have been debating that question for centuries, and the pattern we use to make that assertion is a bit shakier. We might also reason that Hamlet kills Laertes because Laertes has tried to kill him. That would be so uncertain that it's wrong! Hamlet kills Laertes by accident. But nonetheless, if you were a sleuth investigating a crime, it would be a good lead. Some leads just turn out to be wrong, and sometimes patterns of behavior match when the behavior isn't present (false positive) or fail to match when the behavior is present (false negative). One definition of a good analysis is one that identifies patterns with few false negatives or false positives. 
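The sibling inference can be sketched as a simple pattern rule: any two distinct children sharing a "father" edge are conjectured to be siblings. The code is a toy illustration; as noted above, the inference is a conjecture, not a certainty.

```python
# Father-to-children edges from the Hamlet graph.
father_of = {
    "Polonius": ["Laertes", "Ophelia"],
}

def infer_siblings(father_edges):
    """Conjecture: two distinct children of the same father are siblings
    (they could be half siblings -- the pattern is a lead, not a proof)."""
    pairs = set()
    for children in father_edges.values():
        for a in children:
            for b in children:
                if a != b:
                    pairs.add((a, b))
    return pairs

print(infer_siblings(father_of))
```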

In the 20th century, most mathematicians said that the business of mathematics was to prove theorems. But we have seen a shift, largely due to the availability of automated computation. Today, mathematicians have started to think that the business of mathematics is to understand patterns: to describe patterns, to identify those patterns in the world, and to separate patterns from one another. The job of Machine Learning Systems (sometimes called "AI" in the newspapers) is to "learn" patterns with low false positive and false negative rates.

But "learning" is much easier when the information is structured properly. As an example, we could learn the rules of chess just by watching players play on a board. We'd learn that pawns move one step forward and bishops move along diagonals. We could capture chess games just as completely--just as accurately--by writing a 64-character string for each board position. But it would be pretty hopeless to try to learn the game from a list of those strings, because the geometry that is so central to the game isn't obvious in those character strings. An actual chess board is a more helpful structure.

Graphs are an ideal structure for describing large classes of patterns, making graphs very useful inputs to machine learning systems. The pattern might be "a blue vertex connected to a red one". If that pattern produces too many false positives or false negatives, a better pattern might be "a blue vertex connected to a red one and an orange one but not a yellow one" (patterns like that arise frequently in genomics). In any case, graphs can simplify Machine Learning because they are a good way to represent patterns.
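As a toy illustration of such a pattern, here is Python that tests "a blue vertex connected to a red one and an orange one but not a yellow one" against a small colored graph. The vertex names, colors, and adjacency are all invented for illustration.

```python
# A colored graph: vertex -> color, plus undirected adjacency.
colors = {"a": "blue", "b": "red", "c": "orange",
          "d": "yellow", "e": "blue", "f": "red"}
neighbors = {
    "a": {"b", "c"},   # blue, next to red and orange, no yellow: matches
    "e": {"f", "d"},   # blue, next to red and yellow: does not match
    "b": {"a"}, "c": {"a"}, "d": {"e"}, "f": {"e"},
}

def matches(v):
    """Pattern: a blue vertex connected to a red vertex and an orange
    vertex, but not to a yellow one."""
    if colors[v] != "blue":
        return False
    adjacent = {colors[n] for n in neighbors[v]}
    return "red" in adjacent and "orange" in adjacent and "yellow" not in adjacent

print([v for v in sorted(neighbors) if matches(v)])
```

A learning system's job would be to discover a predicate like `matches` from labeled examples rather than having it hand-written.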

The reason we've gone through this example is that:
  • Graphs are everywhere in the world,
  • The graph captures things that we know,
  • Graphs can describe patterns of interest,
  • Graphs may also contain patterns of interest,
  • Graph patterns can be learned either by humans or machines,
  • Some patterns are better than others.
Where we would "query" a traditional relational database in SQL to get answers, we traverse a graph, and there are traversal languages like Cypher and Gremlin that are analogous to SQL but more powerful because graph patterns are more expressive. 

Graph Databases:

The graph database is a relatively recent invention. Graph databases allow us to save and operate on graphs that are too big to fit into computer memory. And they employ a lot of cleverness to make those operations fast. From the sound of things, we've had graph databases all along. Graphs express relationships, and we have relational databases. So problem solved, right? But the term "relational database" that we are all familiar with is a misnomer because the "relationships" in a relational database are implicit. Vertex-edge-vertex relationships are never stored explicitly. Let's say we have a table of customers and a table of customer orders, looking something like the picture below. 

There is a relationship in the real world between customers and orders, but it isn't really there in the database. The relationships in a relational database only become real through a "relational join" operation. That's how we "relate" a customer to their orders in a relational database, and the "Cust no." field in the Orders table above serves as a "foreign key" to facilitate that relational join. This is how 1-to-many relationships are done in relational databases. Relational joins are messy and expensive operations. 

Graph databases take a different approach. They make the relationships explicit. The equivalent customer/order data, represented in a graph database would look something like this, where the link arrows replace the foreign keys:

If we wanted to find all orders by Jack Frost in this graph database, rather than doing a relational join, we would traverse from the Jack Frost vertex to all Order vertices directly connected to it. 
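The contrast can be sketched in Python: the relational version relates Jack Frost to his orders through a foreign-key join, while the graph version follows explicit edges. The rows, items, and second customer are invented for illustration.

```python
# Relational style: two tables, related implicitly through a foreign key.
customers = [{"cust_no": 1, "name": "Jack Frost"},
             {"cust_no": 2, "name": "Holly Day"}]
orders = [
    {"order_no": 101, "cust_no": 1, "item": "sled"},
    {"order_no": 102, "cust_no": 2, "item": "wreath"},
    {"order_no": 103, "cust_no": 1, "item": "mittens"},
]

def orders_by_join(name):
    """Relational join: match rows on the cust_no foreign key."""
    return [o["item"] for c in customers for o in orders
            if c["name"] == name and o["cust_no"] == c["cust_no"]]

# Graph style: the relationship is stored explicitly as edges.
placed = {"Jack Frost": [101, 103], "Holly Day": [102]}
items = {101: "sled", 102: "wreath", 103: "mittens"}

def orders_by_traversal(name):
    """Traverse directly from the customer vertex to its order vertices."""
    return [items[o] for o in placed[name]]

print(orders_by_join("Jack Frost"), orders_by_traversal("Jack Frost"))
```

Both return the same answer; the difference is that the join rediscovers the relationship at query time, while the traversal just follows links that are already there.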

Why, and When We Should Prefer Graph Databases to Relational Ones:

In the case of customers and their orders from the previous example, there is a 1-to-many relationship between customers and orders. The relational database will work fine for those kinds of relationships. But relational databases are not so good where many-to-many relationships need to be modeled. Consider the case where instead of a retail business with customers and orders, we are modeling a book club with members and the books they've read. It sounds like all we'd need to do is change the names of things. It's the same thing, right? But now two club members might have read the same book. It turns out that in the relational database version, this leads to a very different data model. We have to add a new table that captures only the relationships, and then we need relational joins to make use of the relationships. Those tables will end up looking something like this:

You might say that the Readings table makes the relationships explicit. In a trivial sense it does because there is one row in the table per graph edge, but in a bigger sense it doesn't because there are still no vertex-edge-vertex relationships without a relational join. In fact, now we need two expensive relational joins. And this is a pretty simple example of a many-to-many relationship. More complex examples require more joins and performance rapidly degrades, not to mention the demands on memory. 
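Here is a Python sketch of the two-join version next to the graph version. The member names, book titles, and Readings rows are invented for illustration.

```python
# Relational style: a Readings link table, and two joins to get from
# a member to their books.
members = {1: "Alice", 2: "Bob"}
books = {10: "Hamlet", 11: "Macbeth"}
readings = [(1, 10), (1, 11), (2, 10)]   # (member_id, book_id) rows

def books_read_relational(name):
    """Join members -> readings, then readings -> books."""
    ids = [m for m, n in members.items() if n == name]
    return sorted(books[b] for m, b in readings if m in ids)

# Graph style: the same many-to-many relationship is a direct "read" edge.
read = {"Alice": ["Hamlet", "Macbeth"], "Bob": ["Hamlet"]}

def books_read_graph(name):
    return sorted(read[name])

print(books_read_relational("Alice"), books_read_graph("Alice"))
```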

When we go from the retail customer/order model to the book club model in the graph database world, we basically just change the names. 

As the relationships become more complex, the graph database becomes more advantageous. If the objective is to exploit the information contained in complex many-to-many relationships, the graph database is the go-to tool. Many-to-many relationships are common in: 
  • Social networks: I have many friends, many people count me as a friend.
  • Sensor networks: This region is surveilled by many sensors, and sensors surveil many regions.
  • Sports information: Serena Williams plays in many tournaments, and tournaments have many players.
  • Academic literature: Papers have many authors, and scholars write many papers. 
  • Etc.
One thing we've discovered in our own work is that these many-to-many relationships come up more frequently when, instead of simply looking inside your enterprise, you look outward on the world for data. This is presumably because you have less control, and the data was either intended for different purposes or for multiple purposes. All of the examples in the list above exhibit those properties.

These situations are coming up more and more. This is partly because the "edge" of the network is constantly moving out and becoming more amorphous. The edge is getting bigger but its pieces are getting smaller. In the 1980s, the edge of the network was in university labs in the form of engineering workstations. In the 1990s it was in corporate offices on desktop PCs. In the early 2000s, it was in homes on laptop computers with wireless connections. In this decade, the edge of the network is in mobile devices that connect through cellular networks. A decade from now, the edge will be "Smart Dust" located everywhere and connecting information about everything. Five hundred sensors will monitor your driverless car. Many of them will also monitor other cars. Some will monitor the road, a few more the weather. Some may monitor your destination so that if the Mini-Mart is destined to close before you arrive, you can be rerouted to the Speedy-Mart as soon as possible. And graph databases will be indispensable for keeping track of it all and minimizing the chance that you have to go without your slushie. 


Knowledge Bases:

This discussion has gone from graphs that capture knowledge about things and their relationships to graph databases that capture very large graphs. So it would seem like we already have knowledge bases. Once again, problem solved, right? Well, once again, it's not that simple. Let's look at the simplest piece of knowledge in the Hamlet graph:

Okay, that's good to know, but what's a "Hamlet", and what's a "Polonius"? The original graph doesn't give any context for this statement. If we know that Hamlet and Polonius are characters in a play, this little piece of knowledge is actionable. But if we don't, it's rather useless. Knowledge bases need to give us ways to:
  1. Identify what a thing is; what kind of entity,
  2. Declare new kinds of things -- new concepts,
  3. Refine concepts from abstract to concrete and reuse information where possible, because a big component of knowledge is understanding how things are different and how they are the same.
These system features can be implemented in many ways, most of which can be understood in terms of "is a" and "has a" relationships. As a first step, let's introduce the concept of a "character" by adding a piece of data that we'll call a "stereotype". Stereotypes are conceptual. This stereotype says that every character has a name, but the stereotype isn't a character. It merely defines the concept of a character. The stereotype is shown in yellow in the figure below, because stereotypes are special.

This is still less than satisfactory. "Character" is too abstract. There are lots of different kinds of characters. Some are in plays, some are superheroes, etc. They all have a name, but they are also all different. Further, instead of having to know what Hamlet is, now we have to know what a character is. Have we not just pushed the bubble in the linoleum around a bit? So let's add some more stereotypes and refine the abstraction of "Character" into some more concrete kinds of characters.

Now we have different types of characters, all of whom share certain traits (e.g. a name), but who differ in other kinds of properties they might have.  But the most important thing in this picture is the orange "Stereotype" oval. It's a special kind of stereotype. It defines nothing, but is known in advance, at startup. In fact, the orange stereotype oval is the only thing that's known in advance, at startup. Everything else is data. This is good because we can define whatever concepts we want by building on the orange stereotype, and since the orange stereotype doesn't describe anything (it's just the most abstract of abstractions) we aren't bound to any concepts up front. We can make any concepts we want. Concepts are just data. And examples/instances of those concepts are also just data. We can define new concepts as we need them and add them to the knowledge base, along with new examples of those concepts and their relationships to the stuff that's already there. For instance, we could define a concept/stereotype for "Actor", an instance of Actor for "Laurence Olivier", and a "Role" relationship connecting Laurence Olivier to the character of Hamlet.
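The mechanics can be sketched in Python: one built-in root stereotype, with every other concept and instance created as plain data linked by "is a" relationships. This particular representation is invented for illustration; real knowledge bases implement the idea in many ways.

```python
# The only thing known in advance: the root stereotype.
STEREOTYPE = {"name": "Stereotype", "is_a": None}

def define(name, parent):
    """Declare a new concept as data, refining a more abstract one."""
    return {"name": name, "is_a": parent}

def instance(name, concept, **properties):
    """Create an example of a concept; its properties are plain data too."""
    return {"name": name, "is_a": concept, **properties}

def kinds(thing):
    """Walk 'is a' links from a thing up to the root stereotype."""
    chain = []
    node = thing["is_a"]
    while node is not None:
        chain.append(node["name"])
        node = node["is_a"]
    return chain

# Concepts and instances are both just data added at runtime.
character = define("Character", STEREOTYPE)
play_character = define("Play Character", character)
hamlet = instance("Hamlet", play_character, play="Hamlet")

print(kinds(hamlet))
```

Asking `kinds(hamlet)` answers "what is a Hamlet?" by climbing the refinement chain, which is exactly the context the bare graph could not supply.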

We've played a bit fast and loose here with the notation in deference to simplicity and readability. We haven't defined what "Kills" and "Married to" mean. But we could. We've avoided the complexity of temporal relationships. There were various periods and various productions where Laurence Olivier played Hamlet, and other periods where he didn't. We have left out a lot of metadata, like provenance data: who says Stretch Armstrong is a superhero, and why should we believe that? And at the expense of complicating our graph, we could add all of those details. But conceptually, this is how a knowledge base works, and why it is different from a plain vanilla graph database. And as enterprises look less inward for their data and more outward, where the relevant concepts are defined not by the IT team but instead by the outside world and those definitions change at a furious pace, the knowledge base becomes more essential to answering enterprise questions.

Graph Analytics:

We begin with a caveat. Graph analytics and graph databases are not necessarily the same things. They may be, but not necessarily. 

Graph databases are optimized for certain kinds of queries, proximity queries in particular (for a fairly broad and general definition of "proximity"). These queries answer a lot of the questions people have about graph data: who is six degrees from Kevin Bacon? What is my nearest ATM? Etc. Machine Learning questions are often better answered by specialized graph analytics systems like Carnegie Mellon's GraphLab. These systems are less focused on data storage and more focused on distributed and parallel algorithms for processing that data. Whole-graph computations like PageRank and community detection, for instance, may be better addressed through graph analytics systems. Probabilistic Graphical Models (PGMs) are another case. A PGM is a strict mathematical formalism that describes relationships (edges) probabilistically. Then we can ask probabilistic questions like "what is the likelihood that the rocket will fail, and what components of the rocket are the most likely to contribute to failure, hence candidates for hardening?" 
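As a taste of this whole-graph style of computation, here is a small power-iteration PageRank in Python. Analytics systems distribute this kind of loop across many processors; this toy version runs on an invented three-vertex graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. links: vertex -> list of out-neighbors."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling vertex: spread its rank over all vertices
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Invented toy graph: "a" receives links from both "b" and "c".
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
print(max(ranks, key=ranks.get))
```

Note that every edge in the graph is touched on every iteration; that is why this workload favors a bulk, parallel analytics engine over a query-at-a-time database.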

In a graph database, we typically develop a database over time, building it up day by day. If we are capturing baseball information, we may update the database every day, adding pitch-by-pitch data as we receive it. Then we can answer queries as they arise. For graph analytics problems, we more typically know the question we want answered and the algorithms we want to use to answer it: we construct a graph to support the algorithm, run it (distributed over many processors), and tear things down after we get the answer.

These two kinds of systems are not inherently different. But there are different kinds of questions we want to answer about graph data, and these two classes of system have been focused on different classes of question. However, we see the two coming closer together. Amazon's Neptune system is an attempt to marry some of the graph database features of systems like Neo4j with some of the distributed Machine Learning capabilities of systems like GraphLab. In the near future, we expect to see omnibus graph systems that are well suited to both kinds of jobs.

Copyright S3 Data Science, 2018.