Skip to content

Probabilistic Data Structures

Suppose you must track which of 10 billion URLs you’ve already crawled, count how often each of millions of search terms appears, or report how many unique visitors hit your site today. The exact solutions all share one enemy: memory grows with the data. A hash set of 10 billion URLs is tens of gigabytes; a unique-visitor count needs to remember every visitor.

Probabilistic data structures make a radical trade: give up exactness, and memory stops growing with the data. They answer with a small, bounded, quantifiable error in exchange for space savings of 10×, 100×, or more. The skill is recognizing the many places where “almost exactly right” is worth far more than “exactly right but too expensive to compute.”

The shared trick: hash into a small fixed structure

Section titled “The shared trick: hash into a small fixed structure”

All three structures below work the same way underneath: hash the input into a small, fixed-size array of bits or counters. Because the structure is fixed-size, collisions are inevitable — and those collisions are precisely the source of the error. The genius is in arranging the hashing so the error is bounded and one-directional, never wild.

Question it answers: “Have I seen this item before?”

A Bloom filter is a bit array plus k hash functions. To add an item, hash it k ways and set those k bits. To query, hash it k ways and check those bits: if any is 0, the item is definitely not in the set; if all are 1, it’s probably in the set.

add("cat"): set bits 2, 5, 9
add("dog"): set bits 1, 5, 7
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
bit array │0│1│1│0│0│1│0│1│0│1│
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
query("fox") → bit 4 is 0 → DEFINITELY NOT present
query("cat") → bits 2,5,9 all 1 → PROBABLY present

The asymmetry is the whole point: no false negatives, only false positives. It never forgets something it saw; it may occasionally claim to have seen something it didn’t. That makes it perfect as a cheap gatekeeper in front of an expensive lookup: “is this key maybe in the database? If the filter says no, skip the disk read entirely.” Databases use Bloom filters on SSTables for exactly this; a false positive just costs one wasted lookup, never a wrong answer.

Question it answers: “Roughly how many times have I seen this item?”

Where a Bloom filter is a bit array, a Count-Min sketch is a small grid of counters: d rows, each with its own hash function mapping items to w columns. To increment, bump one counter per row. To query, take the minimum across the rows.

col→
hash row 1: [ . . 7 . . ] ← "the" landed here
hash row 2: [ . 7 . . . ]
hash row 3: [ . . . 7 . ]
estimate("the") = min(7, 7, 7) = 7 (min cancels collisions)

Why the minimum? Collisions can only ever add to a counter, so every row overestimates. The smallest estimate is the least polluted by collisions — and the true count never exceeds it. The error is one-directional (never undercounts) and bounded by the grid size. This is how you find heavy hitters in a stream — the hot keys driving a hot partition — using kilobytes instead of a per-key map.

Question it answers: “How many distinct items have I seen?”

Counting uniques exactly means remembering every item — linear in the data. HyperLogLog counts distinct elements in a few kilobytes, error around 2%, no matter whether you feed it a thousand items or a billion.

The intuition: hash each item and look at the run of leading zeros in the hash. Long runs are rare — seeing a hash that starts with 10 zeros suggests you’ve hashed roughly 2¹⁰ distinct things. By tracking the longest run seen (across many buckets, then averaging to tame variance), you estimate cardinality from a tiny fixed-size summary. Counting unique visitors, distinct IPs, or unique search queries per day — all collapse from gigabytes to kilobytes.

StructureQuestionError shapeTypical size
Bloom filterIs X in the set?false positives onlybits per item
Count-Min sketchHow often is X?overcounts onlysmall grid
HyperLogLogHow many distinct?~2% either waya few KB total

What does this buy us, and what does it cost?

Section titled “What does this buy us, and what does it cost?”

These structures buy you answers at scales where the exact computation is simply infeasible — constant or sub-linear memory where the precise version is linear. The cost is exactness, but it’s a disciplined cost: the error is bounded, quantifiable, and usually one-directional, so you can reason about it. The trade is only wise when the error is tolerable for the decision at hand. A 2% error in a dashboard’s unique-visitor count is invisible; a 2% error in a bank balance is fraud. The engineering judgment isn’t “is approximate good enough?” in the abstract — it’s “is this error, in this direction, acceptable for this use?”

  1. Why does a Bloom filter produce false positives but never false negatives, and why does that asymmetry make it a good gatekeeper?
  2. Why can’t you safely delete an item from a standard Bloom filter?
  3. In a Count-Min sketch, why do you take the minimum across rows instead of the average or sum?
  4. How does HyperLogLog estimate billions of distinct items using only a few kilobytes?
  5. Give one use case where a 2% cardinality error is fine and one where any error is unacceptable — what distinguishes them?