Design a Payment System
Most system designs optimize for latency and throughput. A payment system inverts the priorities: correctness is everything, and speed is negotiable. Losing a photo upload is annoying; charging a customer twice or losing a transaction is a financial and legal incident. This case study is the clearest application of idempotency and exactly-once semantics in the whole book.
1. Requirements
Functional
- Process a payment: debit a payer, credit a payee (often via an external processor like Stripe/a card network).
- Track every transaction’s state and produce an auditable history.
- Handle refunds and reconciliation against the external processor.
Non-functional
- Correctness above all: no double charges, no lost transactions, money always balances.
- Durability: an acknowledged payment must never be lost.
- Auditability: every change traceable; regulators and disputes demand it.
- Availability and latency matter, but never at the cost of correctness.
2. Estimation
Say a mid-size processor handles 10,000 payments/sec at peak. Each payment writes one intent row plus two ledger entries (double-entry), so roughly 30k row-writes/sec, and records are kept indefinitely for audit and regulatory reasons:
    10,000 payments/sec × 86,400 s/day ≈ 0.86 B payments/day
    each payment ≈ 1 intent + 2 ledger entries ≈ ~500 bytes
    → ~0.43 TB/day of append-only data, retained for years → petabytes, no deletes

Treating the peak rate as sustained all day makes this an upper bound. These numbers shape everything: the workload is write-heavy and append-only, storage grows without deletes, and reads are “by id / by account,” not analytics. An account-partitioned store with an immutable ledger therefore fits, and caching matters far less than durability.
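The arithmetic in this estimate can be checked with a few lines of Python. This is only a sketch using the figures assumed in this section (peak rate treated as sustained, about 500 bytes per payment); the seven-year retention horizon is a hypothetical value added for illustration, and the result ignores replication and indexes.

```python
# Back-of-envelope capacity estimate using the figures assumed in the text.
PAYMENTS_PER_SEC = 10_000     # peak rate, treated as sustained (upper bound)
SECONDS_PER_DAY = 86_400
ROWS_PER_PAYMENT = 3          # 1 intent row + 2 ledger entries
BYTES_PER_PAYMENT = 500       # all three rows together, a rough guess
RETENTION_YEARS = 7           # hypothetical horizon, not from the text

payments_per_day = PAYMENTS_PER_SEC * SECONDS_PER_DAY
row_writes_per_sec = PAYMENTS_PER_SEC * ROWS_PER_PAYMENT
tb_per_day = payments_per_day * BYTES_PER_PAYMENT / 1e12
pb_retained = tb_per_day * 365 * RETENTION_YEARS / 1000

print(f"{payments_per_day / 1e9:.2f} B payments/day")
print(f"{row_writes_per_sec:,} row-writes/sec")
print(f"{tb_per_day:.2f} TB/day")
print(f"{pb_retained:.2f} PB after {RETENTION_YEARS} years")
```

Under these assumptions the raw data crosses one petabyte only after several years, so the "petabytes" figure is best read as a long-run total that replication and indexes would bring forward.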
3. API sketch
Payments are command-style and must be idempotent, so the idempotency key is a first-class input:
    POST /payments
      Idempotency-Key: 9f3c…          ← client-generated, unique per logical payment
      { from, to, amount, currency }
      → 201 { payment_id, status: PENDING }
      (a retry with the SAME key returns the SAME result)

    GET /payments/{id}
      → { status: PENDING | SUCCEEDED | FAILED, … }

    POST /payments/{id}/refund
      Idempotency-Key: …
      → 201 { refund_id, status }

Note what’s deliberately absent: there is no “set balance” endpoint. Balances are derived from the ledger, never written directly.
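From the client side, the retry contract of POST /payments might look like the sketch below. `FlakyServer` and `pay_with_retries` are hypothetical names, and the server is an in-memory stand-in that loses its first response; the point is only that the key is generated once per logical payment and reused on every retry.

```python
import uuid

class FlakyServer:
    """Hypothetical in-memory stand-in for POST /payments; loses the first response."""
    def __init__(self):
        self.results = {}   # idempotency key -> stored response
        self.charges = 0    # how many payments were actually created
        self.calls = 0      # how many requests arrived

    def post_payment(self, key, body):
        self.calls += 1
        if key not in self.results:
            self.charges += 1
            self.results[key] = {"payment_id": f"pay_{self.charges}", "status": "PENDING"}
        if self.calls == 1:
            # The work was done, but the reply never reached the client.
            raise TimeoutError("response lost")
        return self.results[key]

def pay_with_retries(server, body, max_attempts=3):
    key = str(uuid.uuid4())  # generated ONCE per logical payment, reused on every retry
    for _ in range(max_attempts):
        try:
            return server.post_payment(key, body)
        except TimeoutError:
            continue
    raise RuntimeError("payment outcome unknown after retries")

server = FlakyServer()
response = pay_with_retries(
    server, {"from": "alice", "to": "bob", "amount": 5000, "currency": "USD"}
)
print(server.calls, server.charges, response)
```

Generating the key inside the retry loop instead would defeat the mechanism, because each attempt would look like a new payment.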
4. Data model
Three tables carry the whole design:
    payments (the intent + state machine)
      payment_id PK | idempotency_key UNIQUE | from_acct | to_acct | amount | currency
      | status (PENDING|SUCCEEDED|FAILED) | processor_ref | created_at

    ledger_entries (immutable, append-only: the actual money)
      entry_id PK | payment_id FK | account | amount (+credit / −debit) | created_at
      ── invariant: for each payment, SUM(amount) = 0

    idempotency_keys (dedup)
      key PK | payment_id | response_snapshot | created_at

The UNIQUE constraint on idempotency_key is the enforcement mechanism: two concurrent retries race to insert the same key, the database lets exactly one win, and the loser reads back the original result. That is how “store the key atomically with the charge” is actually implemented: a single transaction inserts the payments row and its key, or neither.
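A minimal sketch of that insert-or-read-back pattern, using Python's built-in `sqlite3` module as a stand-in for the real database. The schema is a trimmed version of the payments table above, `create_payment` is a hypothetical helper, and the separate idempotency_keys table is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        payment_id      INTEGER PRIMARY KEY,
        idempotency_key TEXT NOT NULL UNIQUE,
        from_acct       TEXT NOT NULL,
        to_acct         TEXT NOT NULL,
        amount          INTEGER NOT NULL,   -- minor units (cents)
        currency        TEXT NOT NULL,
        status          TEXT NOT NULL DEFAULT 'PENDING'
    )
""")

def create_payment(conn, key, from_acct, to_acct, amount, currency):
    """Insert the payment once; return (payment_id, created)."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO payments "
                "(idempotency_key, from_acct, to_acct, amount, currency) "
                "VALUES (?, ?, ?, ?, ?)",
                (key, from_acct, to_acct, amount, currency),
            )
            return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a duplicate key: read back the original.
        row = conn.execute(
            "SELECT payment_id FROM payments WHERE idempotency_key = ?", (key,)
        ).fetchone()
        return row[0], False

first = create_payment(conn, "abc123", "alice", "bob", 5000, "USD")
retry = create_payment(conn, "abc123", "alice", "bob", 5000, "USD")
print(first, retry)
```

The design choice worth noting is that the code never checks "does this key exist?" before inserting. A check-then-insert would leave a window for two concurrent retries to both pass the check, whereas letting the constraint fail makes the database the single arbiter.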
5. The two hardest problems
Double charges (idempotency)
The network guarantees you will see duplicate requests: a client retries after a timeout even though the first request actually succeeded. Without protection, the customer is charged twice.

    client → "pay $50" → (response lost) → client retries "pay $50" → charged twice ✗

The fix is an idempotency key: the client generates a unique key per logical payment and sends it with every retry. The server records the key; a repeat with the same key returns the original result instead of charging again.

    "pay $50" + Idempotency-Key: abc123
      first time → process, store (abc123 → result)
      retry      → key seen → return stored result, do NOT charge again ✓

Money must balance (the ledger)
Don’t model balances as a mutable number you overwrite; that loses history and invites lost-update races. Use a double-entry ledger: every transaction is recorded as two immutable entries (a debit and an equal credit) that must sum to zero.

    Transaction T1: pay $50 from Alice to Bob
      ledger entry:  Alice  -50   (debit)
      ledger entry:  Bob    +50   (credit)
                            ───────────
                            sum = 0   ← always; if it doesn't, something is wrong

A balance is then derived by summing entries (like event sourcing). This gives a complete audit trail, makes every change reversible by a compensating entry, and turns “is the system correct?” into a checkable invariant: all entries sum to zero.
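The double-entry idea can be sketched in memory as follows. `Entry` and `Ledger` are hypothetical names, amounts are integers in minor units (an assumption, to avoid floating-point rounding), and a real system would append both legs inside one database transaction.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)      # frozen: an entry cannot be mutated after creation
class Entry:
    payment_id: str
    account: str
    amount: int              # minor units; negative = debit, positive = credit

class Ledger:
    def __init__(self):
        self._entries = []   # append-only; nothing is ever updated or deleted

    def transfer(self, payment_id, from_acct, to_acct, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        # Both legs are appended together, so each payment sums to zero.
        self._entries.extend([
            Entry(payment_id, from_acct, -amount),
            Entry(payment_id, to_acct, +amount),
        ])

    def balance(self, account):
        # Balances are derived, never stored.
        return sum(e.amount for e in self._entries if e.account == account)

    def unbalanced_payments(self):
        totals = defaultdict(int)
        for e in self._entries:
            totals[e.payment_id] += e.amount
        return [pid for pid, total in totals.items() if total != 0]

ledger = Ledger()
ledger.transfer("T1", "alice", "bob", 5000)          # pay $50 from Alice to Bob
ledger.transfer("T1-refund", "bob", "alice", 2000)   # partial refund as a compensating entry
print(ledger.balance("alice"), ledger.balance("bob"), ledger.unbalanced_payments())
```

The refund does not edit or delete the original entries; it adds a new pair in the opposite direction, which is what keeps the history complete.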
6. High-level design
    client ──(payment + idempotency key)──► Payment API
                                               │  records intent (status: PENDING) in DB
                                               ▼
                                        Payment service ──► external processor (card network)
                                               │  on result, append ledger entries (atomic)
                                               ▼
                                    [ append-only ledger DB ] ◄── reconciliation job compares
                                                                   our records vs processor's daily

State moves through explicit statuses: PENDING → SUCCEEDED / FAILED. Because the external processor is slow and may time out, treat the call as asynchronous and make the whole flow resumable. Never assume a missing response means failure (it might have succeeded).
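One way to picture the resumable flow is the sketch below. `FakeProcessor`, `attempt`, and `resume` are hypothetical, and the `lookup` call assumes the processor offers some way to query a charge by our payment id, which varies by provider.

```python
from enum import Enum

class Status(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"

# Legal transitions: terminal states never change.
ALLOWED = {
    Status.PENDING: {Status.SUCCEEDED, Status.FAILED},
    Status.SUCCEEDED: set(),
    Status.FAILED: set(),
}

class FakeProcessor:
    """Hypothetical processor: the charge goes through but the reply times out."""
    def __init__(self):
        self.charged = {}

    def charge(self, payment_id, amount):
        self.charged[payment_id] = amount
        raise TimeoutError("no response")

    def lookup(self, payment_id):
        return "succeeded" if payment_id in self.charged else "not_found"

def transition(payment, new_status):
    if new_status not in ALLOWED[payment["status"]]:
        raise ValueError(f"illegal transition {payment['status']} -> {new_status}")
    payment["status"] = new_status

def attempt(payment, processor):
    try:
        processor.charge(payment["id"], payment["amount"])
        transition(payment, Status.SUCCEEDED)
    except TimeoutError:
        pass  # outcome unknown: stay PENDING, do NOT mark FAILED

def resume(payment, processor):
    """Run later by a recovery job for payments still PENDING."""
    if payment["status"] is Status.PENDING and processor.lookup(payment["id"]) == "succeeded":
        transition(payment, Status.SUCCEEDED)

payment = {"id": "pay_1", "amount": 5000, "status": Status.PENDING}
processor = FakeProcessor()
attempt(payment, processor)
status_after_timeout = payment["status"]
resume(payment, processor)
print(status_after_timeout, payment["status"])
```

Because the intent was recorded as PENDING before the call, a crash or timeout leaves a row that the recovery job can find and finish, rather than a charge nobody knows about.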
7. Deeper concerns
- Exactly-once is impossible; effectively-once is the goal. You cannot guarantee the external charge happens exactly once at the network level. You get correctness via at-least-once + an idempotency key, so duplicates collapse to one effect. (See Exactly-Once Semantics.)
- Reconciliation. A periodic job compares your ledger against the processor’s settled report and flags any mismatch. This is the safety net that catches the inevitable edge cases — it’s expected, not a sign of failure.
- Consistency over availability. When the database that records charges is unreachable, a payment system should refuse the charge rather than risk an unrecorded one — a deliberate CP choice (see CAP & PACELC).
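The reconciliation bullet above can be sketched as a comparison of two maps keyed by processor reference. The `reconcile` function, the mismatch labels, and the `ch_…` references are all hypothetical; a real settlement report would carry more fields (currency, fees, settlement date).

```python
def reconcile(our_records, processor_report):
    """Compare {processor_ref: amount} maps and return a list of mismatches."""
    mismatches = []
    for ref in sorted(our_records.keys() | processor_report.keys()):
        ours = our_records.get(ref)
        theirs = processor_report.get(ref)
        if ours is None:
            mismatches.append((ref, "missing_in_ledger", theirs))
        elif theirs is None:
            mismatches.append((ref, "missing_at_processor", ours))
        elif ours != theirs:
            mismatches.append((ref, "amount_differs", (ours, theirs)))
    return mismatches

# Illustrative data only: amounts in minor units, refs are made up.
our_records = {"ch_1": 5000, "ch_2": 1200, "ch_3": 700}
processor_report = {"ch_1": 5000, "ch_2": 1250, "ch_4": 300}

issues = reconcile(our_records, processor_report)
for issue in issues:
    print(issue)
```

The job only flags differences; deciding whether to post a compensating ledger entry or chase the processor is left to a separate, audited step.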
The thread
What does this buy us, and what does it cost? Idempotency keys, an immutable double-entry ledger, explicit state machines, and reconciliation cost latency, storage, and engineering rigor. They buy the one property a payment system cannot live without: money is never created, destroyed, or double-counted. Here, the universal trade-off is resolved decisively toward correctness: you spend performance and simplicity to buy provable financial integrity.
Check your understanding
- Why does a payment system prioritize correctness over latency, unlike most designs?
- How does an idempotency key prevent a double charge on a client retry, and why must it be stored atomically with the charge?
- What is a double-entry ledger, and what invariant lets you check correctness at a glance?
- Why is “exactly-once” unattainable, and what combination delivers “effectively-once”?
- What is reconciliation, and why is a mismatch expected rather than alarming?