Top K problem System Design (Heavy Hitters)
The Heavy Hitters system design problem
The Heavy Hitters system design problem is the challenge of identifying the most frequently occurring items, or "heavy hitters," in a large stream of data in real time.
The main challenge
The main challenge in the Heavy Hitters problem is scale: the system may see terabytes or petabytes of data, yet it must surface the most frequent items in real time. This calls for an efficient, scalable algorithm that processes the stream with high throughput and low latency.
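To see why this is hard, consider the naive exact solution, sketched below: one counter per distinct key plus a size-K heap. It is correct, but its memory grows with the number of distinct keys, which is exactly what breaks at this scale and motivates the sketch-based approaches that follow.

```python
import heapq
from collections import Counter

def exact_top_k(stream, k):
    """Exact top-K baseline: count every distinct key, then take the
    k largest. Memory is O(number of distinct keys), which does not
    fit in RAM for terabyte-scale streams."""
    counts = Counter(stream)
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

# Example: top 2 items in a small stream.
print(exact_top_k(["a", "b", "a", "c", "a", "b"], 2))
```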
The Count-Min sketch algorithm
One common approach to solving the Heavy Hitters problem is the Count-Min Sketch algorithm. It is based on hashing: a fixed-size two-dimensional array of counters estimates item frequencies in sub-linear memory. The estimate can only overcount (hash collisions inflate counters, never shrink them), and with high probability the error stays small, so heavy hitters can be identified while keeping memory and compute bounded.
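A minimal sketch of the idea, not a production implementation; the width/depth values and the MD5-with-row-prefix hash derivation are illustrative choices:

```python
import hashlib

class CountMinSketch:
    """Fixed-size 2-D counter array. Width trades memory for collision
    rate; depth (number of hash rows) reduces the chance that every
    row collides for a given item."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent-ish hash per row, derived by prefixing the row id.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over rows is
        # the tightest estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._indexes(item))
```

In a full top-K pipeline, the sketch is typically paired with a small heap: each incoming item is added to the sketch, its estimated count is looked up, and the heap keeps the K items with the largest estimates.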
Flajolet-Martin algorithm
The Flajolet-Martin algorithm is often mentioned alongside Count-Min Sketch, but it solves a related, different problem: it uses probabilistic counting to estimate the number of distinct items in a data stream (its cardinality), not per-item frequencies, so it cannot identify heavy hitters on its own. It is useful as a complement, for example to size the sketch or to detect cardinality spikes, and it is valued for its memory efficiency on large-scale streams.
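A toy version of the idea, assuming MD5-derived 32-bit hashes and simple averaging over several hash functions to reduce variance; the constant 0.77351 is the standard Flajolet-Martin correction factor:

```python
import hashlib

def trailing_zeros(n, bits=32):
    """Number of trailing zero bits in n (bits if n == 0)."""
    if n == 0:
        return bits
    tz = 0
    while n & 1 == 0:
        n >>= 1
        tz += 1
    return tz

def fm_estimate(items, num_hashes=32):
    """Estimate the number of DISTINCT items in `items`.
    For each hash function, track the maximum number of trailing
    zeros R seen; duplicates hash identically, so they do not move R.
    Averaging R across hash functions tightens the estimate."""
    total_r = 0
    for seed in range(num_hashes):
        max_tz = 0
        for item in items:
            digest = hashlib.md5(f"{seed}:{item}".encode()).digest()
            h = int.from_bytes(digest[:4], "big")
            max_tz = max(max_tz, trailing_zeros(h))
        total_r += max_tz
    return 2 ** (total_r / num_hashes) / 0.77351
```

Note the key property: feeding the same item three times or once gives the same estimate, which is what makes this a distinct-count (not frequency) estimator.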
Database schema
To support the scale of the system, we store the aggregated data in a NoSQL database. The table schema has to be designed around the specific queries we want to support. A common approach is to store the data in a time-series fashion so it supports time-range queries. We can have two tables, real-time and batch aggregates, partitioned by the same key: in our case day, hour, and bucket, plus additional dimensions like country, city, and tag. This way we can support time-series queries as well as extra aggregations by country, city, and tag. The next optimization is to store the dimensions as a JSON field and make them dynamic, so we can add more dimensions without re-aggregating the historical data. Check out this design in more detail in the deep-dive system design course on crushingtechinterview.com.
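As an illustration of that layout, one pre-aggregated row might look like the following; every field name here is hypothetical and only sketches the partitioning idea, not a prescribed schema:

```python
import json

def make_aggregate_row(day, hour, bucket, item_id, count, **dimensions):
    """Hypothetical pre-aggregated row. (day, hour, bucket) acts as the
    time-series partition key, keeping a time-range query local to a
    bounded set of partitions; extra dimensions are folded into a JSON
    blob so new ones can be added without re-aggregating history."""
    return {
        "day": day,        # e.g. "2024-05-01"
        "hour": hour,      # 0-23
        "bucket": bucket,  # sub-hour bucket, e.g. minute of the hour
        "item_id": item_id,
        "count": count,
        "dimensions": json.dumps(dimensions),
    }

row = make_aggregate_row("2024-05-01", 13, 42, "video_123", 9876,
                         country="US", city="Seattle", tag="sports")
```

The same helper would populate both the real-time and the batch table, since the text above partitions them by the same key.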
Conclusion
Overall, solving the Heavy Hitters system design problem requires combining efficient algorithms, distributed systems, and real-time stream processing, making it a challenging but essential problem in data engineering and analytics.
Here's the video of me explaining this in-depth 👇
Thank you so much for reading this edition of the newsletter!