Design a Web Crawler

A web crawler is a deceptively simple loop: fetch a page, extract its links, add them to a to-do list, repeat. Run that loop at the scale of the open web and every word in it becomes a hard problem. The to-do list has billions of entries. “Have I seen this URL?” must be answered billions of times. “Repeat” must not hammer a small blog into the ground. This walkthrough builds a crawler that is polite, distributed, and doesn’t drown in its own duplicates.

1. Requirements

Functional

Start from seed URLs, fetch pages, extract and follow links (BFS-ish traversal).
Deduplicate URLs so we don’t crawl the same page endlessly.
Respect robots.txt and politeness (don’t overload any single host).
Store fetched content for downstream use (search index, archive, training corpus).
Support re-crawling — pages change, so revisit by priority/freshness.

Non-functional

Scale: billions of pages; horizontally scalable workers.
Politeness is non-negotiable — a rude crawler gets IP-banned and is a bad web citizen.
Robustness: handle timeouts, redirects, traps (infinite calendars), malformed HTML.
Extensibility: pluggable parsers and filters.

2. Estimation

Target 1B pages/month → ~400 pages/sec average sustained. At ~100 KB of HTML per page that’s ~100 TB/month of raw content before dedup and compression. The frontier (URLs queued to crawl) can hold tens of billions of URLs — far too big for one machine’s memory, so it lives on disk, sharded.

The URL-seen set is the sneaky cost: tracking “have I crawled this?” for tens of billions of URLs, checked on every extracted link. Storing full URLs is prohibitive — this is where we get clever (see deep dives).

3. API sketch

A crawler is mostly internal, but the control surface looks like:

POST /v1/seeds          { "urls": [...], "priority": 5 }   // inject starting points
GET  /v1/status         → { pages_crawled, frontier_size, fetch_rate }
POST /v1/recrawl        { "host": "example.com", "max_age": "7d" }

Internal queue contract (frontier):
  enqueue(url, host, priority, earliest_fetch_time)
  dequeue() → next URL whose host is due (politeness-aware)

4. Data model

frontier        url, host, priority, depth, earliest_fetch_time, enqueued_at
url_seen        url_hash                       // membership only — Bloom filter + store
content         url, content_hash, fetched_at, status, body_ref (→ blob store)
host_state      host, last_fetch_at, crawl_delay, robots_rules
content_seen    content_hash                   // near-dup detection across URLs

Two separate “seen” concepts matter: url_seen (“have I queued/crawled this URL?”) and content_seen (“have I seen this body before, under a different URL?”). Mirror sites and URL parameters produce the same content at many addresses; catching that saves enormous storage.

5. High-level design

   seeds ──► ┌─────────────────────────────────────────────┐
             │              FRONTIER (sharded)              │
             │  priority lanes  ×  per-HOST politeness queue │
             └───────────────┬─────────────────────────────┘
                 dequeue host-due URLs
        ┌───────────────┬────┴───────┬───────────────┐
        ▼               ▼            ▼               ▼
   fetcher worker   fetcher       fetcher        fetcher     (stateless pool)
        │  robots.txt check + rate limit per host
        ▼
   DNS resolve → HTTP GET → response
        │
        ├──► content store (blob) + content_hash dedup
        │
        ▼
   parser: extract links, normalize URLs
        │
        ▼
   URL filter ──► URL-SEEN check (Bloom filter) ──new?──► enqueue back to frontier
                                                 │seen
                                                 └─ drop

The loop is a cycle: frontier → fetch → parse → filter → frontier. The two valves that keep it sane are the per-host politeness queue in the frontier (controls rate) and the URL-seen filter before re-enqueue (controls duplication).

What does this buy us, and what does it cost? Sharding the frontier and running a stateless fetcher pool buys near-linear horizontal scale — add workers, crawl faster. The cost is coordination: the politeness guarantee is global (“never hit example.com more than once a second”) but the workers are distributed, so two workers must not both grab example.com URLs at the same instant. We solve this by partitioning the frontier by host — all of a host’s URLs live in one shard, so one queue enforces that host’s rate. We buy scale by giving up the simplicity of a single global queue.

6. Scaling and deep dives

The frontier as a politeness engine. Model it as priority lanes (important pages first) crossed with per-host queues. A URL is only dequeued when its host’s earliest_fetch_time has passed (now + crawl_delay). This naturally spreads load: high-priority pages jump the queue, but no host ever gets crawled faster than its delay allows.

URL-seen dedup with a Bloom filter. Checking “have I seen this URL?” against tens of billions of entries on every link is the hottest path in the system. Storing all URLs in a hash set would cost terabytes of RAM. Instead, a Bloom filter answers membership in a few bits per URL with a tunable false-positive rate.

Bloom filter: "definitely new" or "probably seen"
  - never a false negative → we never re-crawl something it says is seen... wait:
  - false POSITIVE means: we occasionally skip a genuinely-new URL (acceptable)
  - false NEGATIVE never happens → we never crawl a true duplicate twice

A false positive means we miss a new page (tolerable at web scale); we never waste a fetch on a true duplicate. This trade — a little recall for a massive memory saving — is exactly what Probabilistic Data Structures are for.

Politeness in detail. Beyond robots.txt, respect Crawl-delay, cap concurrent connections per host, and identify yourself with a real User-Agent. Back off on 429/503. A crawler that ignores these gets blocked and poisons the well for everyone.

Crawler traps & normalization. Infinite calendars, session-id query params, and faceted filters generate unbounded near-identical URLs. Defenses: URL normalization (strip tracking params, sort query keys), depth limits, per-host page caps, and content-hash dedup to catch the same body behind many URLs.

Distributed coordination. Shard frontier and seen-set by host hash so each crawler “owns” a slice of hosts — this makes politeness a local decision (one shard = one host’s rate) instead of a global lock. DNS is cached aggressively; it’s a surprising bottleneck at 400 fetches/sec.

Freshness / re-crawl. Pages change at wildly different rates. Prioritize re-crawls by observed change frequency (news hourly, archives rarely) so crawl budget goes where content actually moves.

7. Key trade-offs

Decision                  Buys you                  Costs you
─────────────────────────────────────────────────────────────────────
Frontier sharded by host  local politeness, no      uneven load (a giant
                          global lock               host loads one shard)
Bloom filter for          tiny memory at billions   rare skipped new URLs
URL-seen                  of URLs                   (false positives)
Per-host rate limit       don't get IP-banned;      slower max throughput
                          good citizenship          per host
Content-hash dedup        skip mirror/dup bodies    extra hashing + store
Stateless fetcher pool    horizontal scaling        coordination via frontier

The throughline: a web crawler is a politeness-constrained dedup machine. The naive loop is easy; the real system is the frontier that throttles per host and the probabilistic seen-set that lets you remember billions of URLs without billions of dollars of RAM.

Check your understanding

Why is politeness — not raw speed — the constraint that most shapes the architecture?
Why is the frontier partitioned by host, and what does that buy over a single global queue?
A Bloom filter for the URL-seen set has false positives but no false negatives. What concrete behavior does each mean for the crawler, and why is the error in the “safe” direction?
Distinguish url_seen from content_seen and give an example where the second saves real storage.
Name two kinds of crawler trap and the defenses that contain them.