Live Streaming Service - Part 3 - (Stream Validation in Real-time)
Elastic Stack
The ELK (Elasticsearch, Logstash, Kibana) stack provides an observability solution for distributed systems that enables us to monitor our services effectively, set up alerts and gain insights from big data. The stack is open source and a de facto industry standard for implementing search and log monitoring.
Elasticsearch is a distributed, RESTful, JSON-based, full-text search and analytics engine. Logstash aggregates logs from multiple data sources and moves them to Elasticsearch and other destinations. Kibana provides a user interface to visualize, query and analyze the data via graphs and charts.
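To get a feel for Elasticsearch's RESTful, JSON-based query model, here is a minimal search sketch over its HTTP API. The viewer-logs index and the message field are hypothetical placeholders used purely for illustration, not part of our design.

```python
# Minimal sketch of a full-text search against Elasticsearch's REST API.
# The "viewer-logs" index and the "message" field are hypothetical.
import requests

query = {
    "query": {"match": {"message": "buffering"}},  # full-text match query
    "size": 10,
}

# Elasticsearch speaks JSON over HTTP; _search accepts POST requests.
response = requests.post(
    "http://localhost:9200/viewer-logs/_search",
    json=query,
    timeout=10,
)

for hit in response.json()["hits"]["hits"]:
    print(hit["_source"])
```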
Log management is just one use case of the Elastic Stack. It is leveraged in the industry for a wide range of use cases, such as tracking application metrics and performance, endpoint security, monitoring system uptime, implementing application search and so on.
Implementing ELK
In our use case, we will leverage the ELK stack to monitor the live stream in real time with the help of logs generated by the viewers' devices.
The log data from the viewers' devices will be pushed to Logstash for aggregation. Logstash, along with Kafka, will act as a data processing pipeline, ingesting data from the devices, transforming it and moving it to Elasticsearch for storage.
All the log data stored in Elasticsearch can be viewed in a web-based dashboard via Kibana.
Figure 3.9
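To make the ingestion step concrete, below is a minimal sketch of how a viewer's device (or an edge collector sitting in front of it) could push playback logs into the pipeline via Kafka. The topic name, broker address and event fields are hypothetical assumptions for illustration.

```python
# Sketch of a device-side/edge log producer pushing playback events to Kafka.
# Assumes the kafka-python library; topic and field names are hypothetical.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {
    "device_id": "device-42",
    "stream_id": "live-match-123",
    "event": "buffering",          # e.g., buffering, bitrate_drop, playback_error
    "buffer_duration_ms": 1800,
    "timestamp": int(time.time() * 1000),
}

# Keying by device keeps a device's events ordered within a partition
# while still spreading the overall load across partitions.
producer.send("viewer-playback-logs", key=b"device-42", value=event)
producer.flush()
```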
I've discussed data ingestion and data pipelines in the Stream Processing chapter of my Web Architecture 101 course here.
Archiving Old Data in Cloud Storage
In our use case, it's a good idea to keep the latest data, up to a certain time range (one year or less), in Elasticsearch and archive older data to cloud archival storage.
Why?
Archiving data to the right cloud storage class significantly reduces storage costs compared to retaining everything in Elasticsearch. Google Cloud, AWS and Azure all offer different storage classes suiting different business storage needs. Additionally, moving old data to archival storage frees up space for new data.
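As a simplified illustration of the retention side, the sketch below deletes Elasticsearch indices older than the retention window via the REST API. It assumes hypothetical daily indices named viewer-logs-YYYY.MM.DD; in a real deployment, Elasticsearch's built-in index lifecycle management (ILM) would typically handle this.

```python
# Simplified retention sketch: delete daily indices older than ~one year via
# Elasticsearch's REST API. Assumes hypothetical indices named
# "viewer-logs-YYYY.MM.DD"; ILM is the more idiomatic production approach.
from datetime import datetime, timedelta, timezone

import requests

ES_URL = "http://localhost:9200"
cutoff = datetime.now(timezone.utc) - timedelta(days=365)

# GET /<pattern> returns a JSON object keyed by the matching index names.
indices = requests.get(f"{ES_URL}/viewer-logs-*", timeout=10).json()

for index_name in indices:
    try:
        # Parse the date suffix, e.g. "viewer-logs-2024.01.15".
        index_date = datetime.strptime(
            index_name.rsplit("-", 1)[-1], "%Y.%m.%d"
        ).replace(tzinfo=timezone.utc)
    except ValueError:
        continue  # skip indices that don't follow the naming convention

    if index_date < cutoff:
        # The data should already be snapshotted/exported to archival storage.
        requests.delete(f"{ES_URL}/{index_name}", timeout=30)
```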
I've discussed data storage infrastructure and cloud storage in detail in my cloud course here.
Cloud archival storage, when an instant-retrieval class is used, makes the archived data accessible through APIs within milliseconds when required; there is no separate retrieval process like the one that comes with data archived on tapes. Also, to fulfill further data durability requirements, the data can be made redundant across different geographical regions of the cloud. Businesses are often required to retain data to adhere to regulatory compliance requirements.
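Here is a hedged sketch of what the archival side could look like on AWS S3 (Google Cloud and Azure offer equivalent lifecycle features). The bucket name, prefix and the choice of the instant-retrieval storage class are assumptions made purely for illustration.

```python
# Sketch: transition exported/snapshotted log objects to an archival storage
# class after ~1 year using an S3 lifecycle rule. Bucket, prefix and storage
# class are hypothetical assumptions; GCP and Azure offer equivalents.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="live-stream-log-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-viewer-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "viewer-logs/"},
                "Transitions": [
                    # GLACIER_IR keeps the data retrievable in milliseconds
                    # via the regular GetObject API.
                    {"Days": 365, "StorageClass": "GLACIER_IR"}
                ],
            }
        ]
    },
)

# Objects in an instant-retrieval class can still be read directly:
obj = s3.get_object(
    Bucket="live-stream-log-archive",
    Key="viewer-logs/2023/06/01/part-000.json.gz",
)
```

Pairing such a rule with cross-region replication of the bucket covers the durability point made above.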
Though cloud archive storage can retrieve data in milliseconds, it still isn't a hundred percent replacement for tape archives. There are cases where, due to software bugs, data gets inadvertently deleted from storage, and this can cause serious issues depending on how critical the data is.
Here are real-world examples of a couple of data deletion events that happened in the past as a result of software bugs:
Data Deletion Events @Google Due to Bugs
Gmail – Restore from GTape
On Sunday, February 27, 2011, Gmail lost a significant amount of user data despite having many safeguards, internal checks and redundancies. With the help of GTape (a global tape-based backup system for Gmail), Google was able to restore 99%+ of the affected customer data. Backing up to tape provides protection against disk failures and other wide-scale infrastructure failures.
Google Music – Restore from GTape
On Tuesday, March 6, 2012, a privacy-protecting data deletion pipeline, designed to delete large numbers of audio tracks in record time (the privacy policy required that music files and the relevant metadata be purged within a reasonable time after users deleted them), removed approximately 600,000 audio references, affecting audio files for 21,000 users.
Google had the audio files backed up on tapes sitting in offsite storage locations. Around 5,000 tapes were fetched to restore the data. However, only 436,223 of the approximately 600,000 lost audio tracks were found on the tape backups, which meant that about 161,000 other audio tracks had been eaten by the bug before they could even be backed up.
In the initial recovery, 5,475 restore jobs were triggered to restore the data from the tapes, which held 1.5 petabytes of audio data. The data was read off the tapes and moved to distributed compute storage.
Within a span of 7 days, 1.5 petabytes of data had been brought back to the users via offsite tape backups.
Most of the missing 161,000 audio files were promotional tracks, and their original copies had luckily gone unaffected by the bug. In addition, a small portion of these had been uploaded by the users themselves.
Information source: Designing and operating highly available software systems at scale - Google research.
I've also written a blog article on this: Design for scale and high availability – What does 100 million users on a Google service mean? Check it out after you are done with this lesson.
These events give us an idea that tapes are indispensable when backing up critical data. Yes, there is offline management and data retrieval overhead with tapes, but the data is safe from being deleted due to software bugs and wide-scale infrastructure failures. We can use a combination of both approaches (cloud archival plus tape backups) to archive our data.
We've seen the use of the ELK stack in our stream monitoring system: Logstash ingests the logs through a data pipeline and moves them to Elasticsearch. So what is the need for Kafka here?
Why not push the logs directly to Logstash as opposed to first streaming them to Kafka and then to Logstash?
The Need for Kafka
Kafka is an open-source distributed event streaming platform widely used in the industry for high-performance data pipelines, streaming analytics, data integration and many other use cases.
During a live event, when the data being ingested into the pipeline is massive, bottlenecks can arise due to the sheer volume and velocity of the data, specifically when Logstash transforms the data into a common format using filters so that it can be processed by Elasticsearch. Elasticsearch then has to index the logs for efficient retrieval. All this processing takes time.
To deal with the massive data influx without overwhelming Logstash, Kafka is placed between the data sources and Logstash. Logstash acts as a consumer, pulling data from Kafka topics at its own pace.
In our stream validation architecture, Kafka acts as a buffer, absorbing spikes and smoothing out the log traffic before it reaches Logstash, much like a cache is leveraged in the thundering herd problem to cut down the load on the origin server.
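To illustrate this pull-based decoupling, here is a minimal consumer sketch that drains the log topic at its own pace, conceptually what Logstash's Kafka input does for us. The topic name, consumer group and batch size are hypothetical assumptions.

```python
# Sketch: a consumer pulling viewer logs from Kafka in batches, at its own
# pace -- conceptually what Logstash's Kafka input does in our pipeline.
# Topic name, group id and batch size are hypothetical assumptions.
import json

from kafka import KafkaConsumer


def transform_and_index(event: dict) -> None:
    # Placeholder for the transform + index step (Logstash filters and the
    # Elasticsearch output in the real pipeline).
    print(event)


consumer = KafkaConsumer(
    "viewer-playback-logs",
    bootstrap_servers="kafka:9092",
    group_id="log-ingestion",
    enable_auto_commit=False,        # commit only after processing succeeds
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

while True:
    # The consumer controls the pace: pull at most 500 records per poll.
    batch = consumer.poll(timeout_ms=1000, max_records=500)
    for _partition, records in batch.items():
        for record in records:
            transform_and_index(record.value)
    consumer.commit()  # checkpoint progress only after the batch is handled
```

Because the consumer pulls, a surge on the producer side simply accumulates in Kafka instead of overwhelming the downstream transformation and indexing.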
In the next lesson, let's dig a little deeper into Kafka.