Design a Payment System

Most system designs optimize for latency and throughput. A payment system inverts the priorities: correctness is everything, and speed is negotiable. Losing a photo upload is annoying; charging a customer twice or losing a transaction is a financial and legal incident. This case study is the clearest application of idempotency and exactly-once semantics in the whole book.

Functional

  • Process a payment: debit a payer, credit a payee (often via an external processor such as Stripe or a card network).
  • Track every transaction’s state and produce an auditable history.
  • Handle refunds and reconciliation against the external processor.

Non-functional

  • Correctness above all: no double charges, no lost transactions, money always balances.
  • Durability: an acknowledged payment must never be lost.
  • Auditability: every change traceable; regulators and disputes demand it.
  • Availability and latency matter, but never at the cost of correctness.

Say a mid-size processor handles 10,000 payments/sec at peak. Each payment writes one intent row plus two ledger entries (double-entry), so roughly 30k row-writes/sec at peak. Records are retained indefinitely for audit and regulatory reasons. Treating the peak rate as sustained all day gives an upper bound on daily volume:

10,000 payments/sec × 86,400 s/day ≈ 0.86 B payments/day
each payment = 1 intent + 2 ledger entries ≈ 500 bytes total
→ ~0.43 TB/day (≈ 0.16 PB/year) of append-only data, retained for years → petabytes, no deletes
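
The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: every input is one of the figures assumed in the text, and the constant names are illustrative.

```python
# Back-of-envelope capacity estimate using the figures assumed in the text.
PEAK_PAYMENTS_PER_SEC = 10_000
ROWS_PER_PAYMENT = 3        # 1 intent row + 2 ledger entries
BYTES_PER_PAYMENT = 500     # rough combined size of those three rows
SECONDS_PER_DAY = 86_400

row_writes_per_sec = PEAK_PAYMENTS_PER_SEC * ROWS_PER_PAYMENT
# Upper bound: assumes the peak rate is sustained for the whole day.
payments_per_day = PEAK_PAYMENTS_PER_SEC * SECONDS_PER_DAY
tb_per_day = payments_per_day * BYTES_PER_PAYMENT / 1e12
pb_per_year = tb_per_day * 365 / 1000

print(f"{row_writes_per_sec:,} row-writes/sec")
print(f"{payments_per_day / 1e9:.2f} B payments/day")
print(f"{tb_per_day:.2f} TB/day, {pb_per_year:.2f} PB/year")
```

At this rate the store crosses one petabyte after roughly six years of retention, before replication and indexes are counted.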

These numbers shape everything: the workload is write-heavy and append-only, storage grows without deletes, and reads are “by id / by account,” not analytics — so an account-partitioned store with an immutable ledger fits, and caching matters far less than durability.

Payments are command-style and must be idempotent, so the idempotency key is a first-class input:

POST /payments
  Idempotency-Key: 9f3c…   ← client-generated, unique per logical payment
  { from, to, amount, currency }
  → 201 { payment_id, status: PENDING }   (a retry with the SAME key returns the SAME result)

GET /payments/{id}
  → { status: PENDING | SUCCEEDED | FAILED, … }

POST /payments/{id}/refund
  Idempotency-Key: …
  → 201 { refund_id, status }

Note what’s deliberately absent: there is no “set balance” endpoint. Balances are derived from the ledger, never written directly.

Three tables carry the whole design:

payments (the intent + state machine)
  payment_id PK | idempotency_key UNIQUE | from_acct | to_acct | amount | currency
  | status (PENDING|SUCCEEDED|FAILED) | processor_ref | created_at

ledger_entries (immutable, append-only: the actual money)
  entry_id PK | payment_id FK | account | amount (+credit / −debit) | created_at
  ── invariant: for each payment, SUM(amount) = 0

idempotency_keys (dedup)
  key PK | payment_id | response_snapshot | created_at

The UNIQUE constraint on idempotency_key is the enforcement mechanism: two concurrent retries race to insert the same key, the database lets exactly one win, and the loser reads back the original result. That is how “store the key atomically with the charge” is actually implemented — a single transaction inserts the payments row and its key, or neither.
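
A minimal sketch of that enforcement, assuming an in-memory SQLite database as a stand-in for the production store. The helper name `create_payment` is hypothetical, storing amounts as integer minor units is an assumption, and `processor_ref`, `created_at`, and the separate `idempotency_keys` table are left out for brevity.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        payment_id      TEXT PRIMARY KEY,
        idempotency_key TEXT NOT NULL UNIQUE,
        from_acct       TEXT NOT NULL,
        to_acct         TEXT NOT NULL,
        amount          INTEGER NOT NULL,  -- minor units (cents); an assumption
        currency        TEXT NOT NULL,
        status          TEXT NOT NULL
    )
""")

def create_payment(key, from_acct, to_acct, amount, currency):
    """Insert the intent once per key. Returns (payment_id, status, created)."""
    payment_id = str(uuid.uuid4())
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO payments VALUES (?, ?, ?, ?, ?, ?, 'PENDING')",
                (payment_id, key, from_acct, to_acct, amount, currency),
            )
        return payment_id, "PENDING", True
    except sqlite3.IntegrityError:
        # The key already exists (a retry, or a lost race): read back the original.
        row = conn.execute(
            "SELECT payment_id, status FROM payments WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        return row[0], row[1], False

first = create_payment("abc123", "alice", "bob", 5000, "USD")
retry = create_payment("abc123", "alice", "bob", 5000, "USD")
row_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

Both calls return the same `payment_id`, and only one row exists. The check-then-insert alternative (SELECT first, INSERT if absent) is deliberately avoided here because two concurrent retries could both pass the check.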

The network guarantees you will see duplicate requests — a client retries after a timeout even though the first request actually succeeded. Without protection, the customer is charged twice.

client → "pay $50" → (response lost) → client retries "pay $50" → charged twice ✗

The fix is an idempotency key: the client generates a unique key per logical payment and sends it with every retry. The server records the key; a repeat with the same key returns the original result instead of charging again.

"pay $50" + Idempotency-Key: abc123
first time → process, store (abc123 → result)
retry → key seen → return stored result, do NOT charge again ✓
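
The client side of this contract matters as much as the server side: the key must be generated once per logical payment and reused on every retry. The sketch below illustrates that with a hypothetical in-memory `FakePaymentServer` that loses its first response; none of these names come from a real SDK.

```python
import uuid

class FakePaymentServer:
    """Hypothetical in-memory stand-in for POST /payments (not thread-safe)."""

    def __init__(self):
        self.results = {}               # idempotency key -> stored result
        self.charges = 0                # how many times money actually moved
        self.drop_next_response = True  # simulate one lost response

    def post_payment(self, key, amount):
        if key not in self.results:
            # First time this key is seen: charge and store the result.
            self.charges += 1
            self.results[key] = {"payment_id": f"pay_{self.charges}", "amount": amount}
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost in transit")
        return self.results[key]

def pay_with_retries(server, amount, max_attempts=3):
    key = str(uuid.uuid4())  # generated ONCE per logical payment
    for _ in range(max_attempts):
        try:
            return server.post_payment(key, amount)
        except TimeoutError:
            continue  # outcome unknown: retry with the SAME key
    raise RuntimeError("payment outcome still unknown after retries")

server = FakePaymentServer()
result = pay_with_retries(server, 5000)
print(result, "charges:", server.charges)
```

The first attempt charges and then times out; the retry finds the stored result, so `server.charges` stays at 1. Generating a fresh key inside the retry loop would defeat the mechanism entirely.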

Don’t model balances as a mutable number you overwrite — that loses history and invites lost-update races. Use a double-entry ledger: every transaction is recorded as two immutable entries (a debit and an equal credit) that must sum to zero.

Transaction T1: pay $50 from Alice to Bob
ledger entry: Alice -50 (debit)
ledger entry: Bob +50 (credit)
───────────
sum = 0 ← always; if it doesn't, something is wrong

A balance is then derived by summing entries (like event sourcing). This gives a complete audit trail, makes every change reversible by a compensating entry, and turns “is the system correct?” into a checkable invariant: all entries sum to zero.
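
As an illustration, here is a minimal in-memory sketch of such a ledger. The class and method names are hypothetical, amounts are assumed to be integer minor units, and a real system would append both entries inside one database transaction.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: an entry cannot be modified after creation
class LedgerEntry:
    payment_id: str
    account: str
    amount: int  # minor units; positive = credit, negative = debit

class Ledger:
    def __init__(self):
        self._entries = []  # append-only: entries are never updated or deleted

    def post_transfer(self, payment_id, from_acct, to_acct, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        # The debit and the equal credit are appended together.
        self._entries.extend([
            LedgerEntry(payment_id, from_acct, -amount),
            LedgerEntry(payment_id, to_acct, +amount),
        ])

    def balance(self, account):
        # Balances are derived, never stored.
        return sum(e.amount for e in self._entries if e.account == account)

    def unbalanced_payments(self):
        # The invariant: entries for each payment must sum to zero.
        totals = defaultdict(int)
        for e in self._entries:
            totals[e.payment_id] += e.amount
        return [pid for pid, total in totals.items() if total != 0]

ledger = Ledger()
ledger.post_transfer("T1", "alice", "bob", 5000)         # pay $50
ledger.post_transfer("T1-refund", "bob", "alice", 2000)  # compensating entry
print(ledger.balance("alice"), ledger.balance("bob"), ledger.unbalanced_payments())
```

The partial refund is recorded as a new pair of entries rather than an edit to T1, so the original payment remains visible in the history.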

client ──(payment + idempotency key)──► Payment API
                                            │  records intent (status: PENDING) in DB
                                            ▼
                                     Payment service ──► external processor (card network)
                                            │  on result, append ledger entries (atomic)
                                            ▼
                                 [ append-only ledger DB ] ◄── reconciliation job compares
                                                                our records vs the processor's, daily
State moves through explicit statuses: PENDING → SUCCEEDED / FAILED. Because the external processor is slow and may time out, make the call asynchronous and the whole flow resumable. Never assume a missing response means failure: the charge may have succeeded.
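
One way to make those rules explicit is a small transition table. The sketch below is illustrative only: the outcome strings `"approved"` and `"declined"` and the function names are assumptions, not any processor's actual API.

```python
from enum import Enum

class Status(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"

# Terminal states have no outgoing transitions.
ALLOWED = {
    Status.PENDING: {Status.SUCCEEDED, Status.FAILED},
    Status.SUCCEEDED: set(),
    Status.FAILED: set(),
}

# Hypothetical processor outcomes mapped to our statuses.
OUTCOME_TO_STATUS = {"approved": Status.SUCCEEDED, "declined": Status.FAILED}

def transition(current, new):
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new

def resolve(current, processor_outcome):
    """processor_outcome is 'approved', 'declined', or None for timeout/unknown."""
    if processor_outcome is None:
        # Unknown is not failure: stay PENDING and let a later poll
        # or the reconciliation job settle it.
        return current
    return transition(current, OUTCOME_TO_STATUS[processor_outcome])

after_timeout = resolve(Status.PENDING, None)
after_approval = resolve(after_timeout, "approved")
print(after_timeout.name, "->", after_approval.name)
```

A timeout leaves the payment in PENDING, and a terminal status can never be overwritten, which keeps a late or duplicate processor callback from flipping a settled payment.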

  • Exactly-once is impossible; effectively-once is the goal. You cannot guarantee the external charge happens exactly once at the network level. You get correctness via at-least-once + an idempotency key, so duplicates collapse to one effect. (See Exactly-Once Semantics.)
  • Reconciliation. A periodic job compares your ledger against the processor’s settled report and flags any mismatch. This is the safety net that catches the inevitable edge cases. An occasional mismatch is expected, not a sign that the design has failed.
  • Consistency over availability. When the database that records charges is unreachable, a payment system should refuse the charge rather than risk an unrecorded one — a deliberate CP choice (see CAP & PACELC).
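
The reconciliation step described above can be sketched as a comparison of two keyed datasets. This assumes both sides can be reduced to a mapping from `processor_ref` to an amount in minor units; the function name and the mismatch labels are hypothetical.

```python
def reconcile(our_payments, processor_report):
    """Compare two dicts of processor_ref -> amount (minor units).

    Returns a list of (ref, reason, our_amount, their_amount) tuples.
    """
    mismatches = []
    for ref in sorted(our_payments.keys() | processor_report.keys()):
        ours = our_payments.get(ref)
        theirs = processor_report.get(ref)
        if ours is None:
            # The processor charged something we never recorded.
            mismatches.append((ref, "missing_in_ledger", ours, theirs))
        elif theirs is None:
            # We recorded a success the processor does not know about.
            mismatches.append((ref, "missing_at_processor", ours, theirs))
        elif ours != theirs:
            mismatches.append((ref, "amount_differs", ours, theirs))
    return mismatches

ours = {"ch_1": 5000, "ch_2": 1200, "ch_3": 700}
theirs = {"ch_1": 5000, "ch_2": 1250, "ch_4": 300}
report = reconcile(ours, theirs)
for line in report:
    print(line)
```

Each flagged row would normally go to a review queue and be corrected with a compensating ledger entry rather than an in-place edit.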

What does this buy us, and what does it cost? Idempotency keys, an immutable double-entry ledger, explicit state machines, and reconciliation cost latency, storage, and engineering rigor. They buy the one property a payment system cannot live without: money is never created, destroyed, or double-counted. Here, the universal trade-off is resolved decisively toward correctness — you spend performance and simplicity to buy provable financial integrity.

  1. Why does a payment system prioritize correctness over latency, unlike most designs?
  2. How does an idempotency key prevent a double charge on a client retry, and why must it be stored atomically with the charge?
  3. What is a double-entry ledger, and what invariant lets you check correctness at a glance?
  4. Why is “exactly-once” unattainable, and what combination delivers “effectively-once”?
  5. What is reconciliation, and why is a mismatch expected rather than alarming?