Skip to content

Design a Web Crawler

A web crawler is a deceptively simple loop: fetch a page, extract its links, add them to a to-do list, repeat. Run that loop at the scale of the open web and every word in it becomes a hard problem. The to-do list has billions of entries. “Have I seen this URL?” must be answered billions of times. “Repeat” must not hammer a small blog into the ground. This walkthrough builds a crawler that is polite, distributed, and doesn’t drown in its own duplicates.

Functional

  • Start from seed URLs, fetch pages, extract and follow links (BFS-ish traversal).
  • Deduplicate URLs so we don’t crawl the same page endlessly.
  • Respect robots.txt and politeness (don’t overload any single host).
  • Store fetched content for downstream use (search index, archive, training corpus).
  • Support re-crawling — pages change, so revisit by priority/freshness.

Non-functional

  • Scale: billions of pages; horizontally scalable workers.
  • Politeness is non-negotiable — a rude crawler gets IP-banned and is a bad web citizen.
  • Robustness: handle timeouts, redirects, traps (infinite calendars), malformed HTML.
  • Extensibility: pluggable parsers and filters.

Target 1B pages/month → ~400 pages/sec average sustained. At ~100 KB of HTML per page that’s ~100 TB/month of raw content before dedup and compression. The frontier (URLs queued to crawl) can hold tens of billions of URLs — far too big for one machine’s memory, so it lives on disk, sharded.

The URL-seen set is the sneaky cost: tracking “have I crawled this?” for tens of billions of URLs, checked on every extracted link. Storing full URLs is prohibitive — this is where we get clever (see deep dives).

A crawler is mostly internal, but the control surface looks like:

POST /v1/seeds { "urls": [...], "priority": 5 } // inject starting points
GET /v1/status → { pages_crawled, frontier_size, fetch_rate }
POST /v1/recrawl { "host": "example.com", "max_age": "7d" }
Internal queue contract (frontier):
enqueue(url, host, priority, earliest_fetch_time)
dequeue() → next URL whose host is due (politeness-aware)
frontier url, host, priority, depth, earliest_fetch_time, enqueued_at
url_seen url_hash // membership only — Bloom filter + store
content url, content_hash, fetched_at, status, body_ref (→ blob store)
host_state host, last_fetch_at, crawl_delay, robots_rules
content_seen content_hash // near-dup detection across URLs

Two separate “seen” concepts matter: url_seen (“have I queued/crawled this URL?”) and content_seen (“have I seen this body before, under a different URL?”). Mirror sites and URL parameters produce the same content at many addresses; catching that saves enormous storage.

seeds ──► ┌─────────────────────────────────────────────┐
│ FRONTIER (sharded) │
│ priority lanes × per-HOST politeness queue │
└───────────────┬─────────────────────────────┘
dequeue host-due URLs
┌───────────────┬────┴───────┬───────────────┐
▼ ▼ ▼ ▼
fetcher worker fetcher fetcher fetcher (stateless pool)
│ robots.txt check + rate limit per host
DNS resolve → HTTP GET → response
├──► content store (blob) + content_hash dedup
parser: extract links, normalize URLs
URL filter ──► URL-SEEN check (Bloom filter) ──new?──► enqueue back to frontier
│seen
└─ drop

The loop is a cycle: frontier → fetch → parse → filter → frontier. The two valves that keep it sane are the per-host politeness queue in the frontier (controls rate) and the URL-seen filter before re-enqueue (controls duplication).

What does this buy us, and what does it cost? Sharding the frontier and running a stateless fetcher pool buys near-linear horizontal scale — add workers, crawl faster. The cost is coordination: the politeness guarantee is global (“never hit example.com more than once a second”) but the workers are distributed, so two workers must not both grab example.com URLs at the same instant. We solve this by partitioning the frontier by host — all of a host’s URLs live in one shard, so one queue enforces that host’s rate. We buy scale by giving up the simplicity of a single global queue.

The frontier as a politeness engine. Model it as priority lanes (important pages first) crossed with per-host queues. A URL is only dequeued when its host’s earliest_fetch_time has passed (now + crawl_delay). This naturally spreads load: high-priority pages jump the queue, but no host ever gets crawled faster than its delay allows.

URL-seen dedup with a Bloom filter. Checking “have I seen this URL?” against tens of billions of entries on every link is the hottest path in the system. Storing all URLs in a hash set would cost terabytes of RAM. Instead, a Bloom filter answers membership in a few bits per URL with a tunable false-positive rate.

Bloom filter: "definitely new" or "probably seen"
- never a false negative → we never re-crawl something it says is seen... wait:
- false POSITIVE means: we occasionally skip a genuinely-new URL (acceptable)
- false NEGATIVE never happens → we never crawl a true duplicate twice

A false positive means we miss a new page (tolerable at web scale); we never waste a fetch on a true duplicate. This trade — a little recall for a massive memory saving — is exactly what Probabilistic Data Structures are for.

Politeness in detail. Beyond robots.txt, respect Crawl-delay, cap concurrent connections per host, and identify yourself with a real User-Agent. Back off on 429/503. A crawler that ignores these gets blocked and poisons the well for everyone.

Crawler traps & normalization. Infinite calendars, session-id query params, and faceted filters generate unbounded near-identical URLs. Defenses: URL normalization (strip tracking params, sort query keys), depth limits, per-host page caps, and content-hash dedup to catch the same body behind many URLs.

Distributed coordination. Shard frontier and seen-set by host hash so each crawler “owns” a slice of hosts — this makes politeness a local decision (one shard = one host’s rate) instead of a global lock. DNS is cached aggressively; it’s a surprising bottleneck at 400 fetches/sec.

Freshness / re-crawl. Pages change at wildly different rates. Prioritize re-crawls by observed change frequency (news hourly, archives rarely) so crawl budget goes where content actually moves.

Decision Buys you Costs you
─────────────────────────────────────────────────────────────────────
Frontier sharded by host local politeness, no uneven load (a giant
global lock host loads one shard)
Bloom filter for tiny memory at billions rare skipped new URLs
URL-seen of URLs (false positives)
Per-host rate limit don't get IP-banned; slower max throughput
good citizenship per host
Content-hash dedup skip mirror/dup bodies extra hashing + store
Stateless fetcher pool horizontal scaling coordination via frontier

The throughline: a web crawler is a politeness-constrained dedup machine. The naive loop is easy; the real system is the frontier that throttles per host and the probabilistic seen-set that lets you remember billions of URLs without billions of dollars of RAM.

  1. Why is politeness — not raw speed — the constraint that most shapes the architecture?
  2. Why is the frontier partitioned by host, and what does that buy over a single global queue?
  3. A Bloom filter for the URL-seen set has false positives but no false negatives. What concrete behavior does each mean for the crawler, and why is the error in the “safe” direction?
  4. Distinguish url_seen from content_seen and give an example where the second saves real storage.
  5. Name two kinds of crawler trap and the defenses that contain them.