Picking the Right Data Model and Knowledge Graphs
On a sports website like ESPN, the content primarily comprises videos and articles containing relevant photos. We've already been through the video part in depth when designing the live streaming and the video-on-demand service. In this chapter, we will design the service managing the articles and the related images.
Our CMS (Content Management Service) consists of two sub-services:
- Content Storage Service
- Content Delivery Service
The content storage service consumes content from writers and persists it in the database. And the content delivery service delivers the content to the reader's device.
Let's begin with the storage service.
Content Storage Service
Before we design the content storage service, let's understand our data model.
Picking the Right Data Model
An article posted on the website will have a title, content summary, content, several images, categories, tags and one or more authors.
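To make the entity concrete, here is a minimal sketch of one article as a plain Python record. The class name, field names and sample values are assumptions chosen for illustration; they are not a schema taken from the actual system.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical in-memory shape of one article (field names are illustrative)."""
    title: str
    summary: str
    content: str
    image_urls: list[str] = field(default_factory=list)
    categories: list[str] = field(default_factory=list)  # e.g. sport name, tournament name
    tags: list[str] = field(default_factory=list)        # e.g. club name, country, game type
    authors: list[str] = field(default_factory=list)     # one or more authors


# Sample values are made up for the example.
article = Article(
    title="Derby ends in a draw",
    summary="A late equalizer settles the match.",
    content="Full article body goes here.",
    image_urls=["https://example.com/img/derby.jpg"],
    categories=["football", "league cup"],
    tags=["domestic", "club-a", "club-b"],
    authors=["author-x"],
)
```

Note that categories, tags and authors are all lists: this is what drives the database choice discussed next.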
Categories can be the sport name, tournament name, etc., while tags can be the club name, country, game type (domestic, international), etc.
Which type of database do you think would fit best to store this sort of data?
We can clearly see that an article has a many-to-many relationship with categories, tags and authors: one article carries several of each, and each category, tag or author is linked to many articles. A relational database, then?
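As a rough sketch of the relational option, the snippet below models only the article and category relationship using Python's built-in sqlite3 module. Since one article can have several categories and one category spans many articles, the sketch uses a junction table. The table names, column names and sample rows are assumptions for illustration; tags and authors would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Junction table: one row per (article, category) pair.
CREATE TABLE article_category (
    article_id  INTEGER NOT NULL REFERENCES article(id),
    category_id INTEGER NOT NULL REFERENCES category(id),
    PRIMARY KEY (article_id, category_id)
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO article VALUES (?, ?)",
                 [(1, "Finals recap"), (2, "Transfer news")])
conn.executemany("INSERT INTO category VALUES (?, ?)",
                 [(1, "basketball"), (2, "football")])
conn.executemany("INSERT INTO article_category VALUES (?, ?)",
                 [(1, 1), (2, 2), (2, 1)])


def titles_in_category(name):
    """Fetch article titles for one category by joining through the junction table."""
    rows = conn.execute(
        """
        SELECT a.title
        FROM article a
        JOIN article_category ac ON ac.article_id = a.id
        JOIN category c ON c.id = ac.category_id
        WHERE c.name = ?
        ORDER BY a.id
        """,
        (name,),
    ).fetchall()
    return [title for (title,) in rows]


basketball_titles = titles_in_category("basketball")
```

Each article is stored once, and the "fetch articles in category Y" query costs two joins.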
Alternatively, we can also consider storing our data in a denormalized fashion in a document-oriented database with respective categories, tags and authors stored with every article document. This will avert the need for joins.
But the primary queries in the system will be of the form: fetch the articles written by author X, fetch the articles belonging to category Y or tag Z, and so on. How do you think we will compute the results when we have just one collection in which articles are stored with their respective categories, tags, etc.?
To return the results quickly, we would have to create a separate collection for every category, tag, and author. For instance, basketball will have a separate collection; a basketball tournament will be a different collection; and an author specializing in basketball will have yet another collection.
A quick reminder: a collection in a document-oriented database is analogous to a table in a relational database.
Also, these collections will not hold exclusive data. There will be a lot of overlap, resulting in duplicate data across these collections. For instance, an article on baseball will have several categories and tags and will thus be duplicated in the collection of each of those categories and tags.
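The duplication described above can be seen in a small sketch that simulates the "one collection per category and tag" layout with plain Python dictionaries. No real document database is involved, and the sample articles are made up for illustration.

```python
from collections import defaultdict

# Two made-up article documents.
articles = [
    {"id": 1, "title": "Season opener",
     "categories": ["baseball", "league"], "tags": ["domestic"]},
    {"id": 2, "title": "World series preview",
     "categories": ["baseball"], "tags": ["international", "domestic"]},
]

# One simulated collection per category and per tag,
# each holding a full copy of every matching document.
collections = defaultdict(list)
for doc in articles:
    for key in doc["categories"] + doc["tags"]:
        collections[key].append(dict(doc))

# Total number of stored documents across all collections.
stored_copies = sum(len(docs) for docs in collections.values())
```

Here two articles end up stored six times in total, and every edit to an article would have to be applied to each of its copies.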
Certainly, this is not desirable. A relational data model appears to be a better choice. Alternatively, we can also consider the graph data model and store the entities (articles, categories, tags, authors) as graph nodes, with edges between them establishing the relationships. This graph-based structure for storing content is also known as a knowledge graph and is leveraged in many real-world applications.
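A minimal in-memory sketch of the graph idea: every entity is a node, and each relationship is an edge stored in both directions. The (type, name) node representation, the helper names and the sample data are assumptions for illustration; a real graph database would provide its own storage and query language.

```python
from collections import defaultdict

# Adjacency map: node -> set of neighboring nodes.
# A node is a (type, name) pair, e.g. ("category", "basketball").
edges = defaultdict(set)


def link(a, b):
    """Add an undirected edge between two nodes."""
    edges[a].add(b)
    edges[b].add(a)


# Made-up sample data.
link(("article", "Finals recap"), ("category", "basketball"))
link(("article", "Finals recap"), ("author", "author-x"))
link(("article", "Finals recap"), ("tag", "international"))
link(("article", "Draft preview"), ("category", "basketball"))
link(("article", "Draft preview"), ("author", "author-y"))


def articles_for(node):
    """Return the titles of all articles directly connected to the given node."""
    return sorted(name for kind, name in edges[node] if kind == "article")


basketball_articles = articles_for(("category", "basketball"))
author_x_articles = articles_for(("author", "author-x"))
```

Every article, category, tag and author exists exactly once, and "articles by author X" or "articles in category Y" becomes a one-hop lookup from the corresponding node.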
Knowledge Graph
One primary use case of knowledge graphs is in content platforms such as Wikipedia, Bloomberg and the like, which deal with large and complex data.
Imagine a Wiki article on a historical war. Compared to our sports data model, it would contain a considerable number of categories and tags such as continents, countries, battles, battle locations, commanders, leaders, allies, hostile forces and so on. Now imagine storing data on all the wars in the world ever and managing a relationship between all the events, including the entities involved such as leaders, countries, soldiers, weapons, etc.
These deep entity relations enable a reader to start with an article on a certain battle and continue reading through the related events, the people and generals involved, the countries, and so on. Maintaining such richly connected data and serving it to readers makes for a rich learning experience, which in turn can improve user retention.
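This kind of "keep reading through related entities" navigation amounts to a graph traversal. The sketch below uses a breadth-first search over a tiny, made-up graph to collect everything within a given number of hops of a starting article. The node names and the function name `related` are hypothetical.

```python
from collections import deque

# Made-up knowledge graph: node -> directly related nodes.
graph = {
    "Battle A": ["War X", "General P"],
    "War X": ["Battle A", "Battle B", "Country M"],
    "General P": ["Battle A", "Country M"],
    "Battle B": ["War X"],
    "Country M": ["War X", "General P"],
}


def related(start, max_hops):
    """Return nodes reachable from start within max_hops edges, mapped to their hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]  # the starting node is not "related" to itself
    return seen


reading_trail = related("Battle A", 2)
```

With a hop limit of 1 the reader sees only the war and the general; raising the limit to 2 also brings in the other battle and the country.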
Likewise, Google and social networks like Facebook maintain knowledge graphs to surface relevant data from enormous datasets to their users with minimal latency.
Figure 7.2
Storing such complex data would be a nightmare with a document-oriented database and an arduous task with a relational database. Imagine modeling the historical wars knowledge graph with these databases. In the document-oriented case, heavy denormalization of the data would exhaust our hard disks; in the relational case, the many joins would make the latency soar like a NASA rocket.
The lesson on graph databases from my web architecture 101 course is a recommended read to understand the basics of the graph data model.
In the next lesson, let's understand how we can model our data with the help of graphs.