Data Modeling

Every previous page in this part gave you machinery: engines, indexes, replicas, shards, transactions. Data modeling is the craft of deciding how to shape your records in the first place so that machinery serves you instead of fighting you. It’s where storage stops being a menu of features and becomes a design with consequences — because the model you choose decides which queries are fast, which are agony, and which are simply impossible.

The whole discipline turns on one trade-off, and it’s the same one we’ve been weighing all along: what does this shape buy us, and what does it cost? Here the currency is reads versus writes, and storage versus query speed.

Normalization: store each fact once

The relational instinct is normalization: split data into many tables so that every fact lives in exactly one place, and reference it elsewhere by ID.

   users                       orders
   ┌────┬─────────────┐        ┌────┬─────────┬───────────┐
   │ id │ name        │        │ id │ user_id │ total     │
   ├────┼─────────────┤        ├────┼─────────┼───────────┤
   │ 7  │ Ada Lovelace│        │ 91 │   7     │  $40      │
   └────┴─────────────┘        │ 92 │   7     │  $15      │
                               └────┴─────────┴───────────┘
   "Ada Lovelace" is stored ONCE. Orders point to it by user_id.

Buys: no duplication, no update anomalies. Ada changes her name? One row changes and every order reflects it instantly. Data can’t contradict itself, because each fact has a single home. Writes are cheap and safe.
Costs: reads must join. Showing an order with the customer’s name means combining tables at read time. Joins are fine on one machine but get expensive — sometimes prohibitively — once data is partitioned across nodes, because the rows you’re joining may live on different machines.

Denormalization: store the answer pre-assembled

Denormalization does the opposite: deliberately duplicate data so a read doesn’t have to assemble it. Copy the customer’s name into each order; store a pre-computed feed; keep a running comment_count instead of COUNT(*)-ing every time.

   orders (denormalized)
   ┌────┬─────────┬─────────────┬───────┐
   │ id │ user_id │ user_name   │ total │
   ├────┼─────────┼─────────────┼───────┤
   │ 91 │   7     │ Ada Lovelace│  $40  │   ← name copied in; no join to read
   │ 92 │   7     │ Ada Lovelace│  $15  │
   └────┴─────────┴─────────────┴───────┘

Buys: blazing reads. The answer is pre-assembled, so one lookup returns everything — no joins, no scatter across shards. This is how you serve a feed to millions.
Costs: expensive, dangerous writes. Ada changes her name and now you must update every copy — every order, every cached feed entry. Miss one and your data is inconsistent: the same fact says two different things. You’ve traded write simplicity for read speed.

Model for your access patterns

Here is the reframe that separates good data modeling from cargo-culting: don’t model the data as it “naturally” is — model it for how your application will read it.

In a relational database you can get away with normalizing first and adding indexes for slow queries later, because the join + planner machinery is flexible. In NoSQL this is not optional. A document or wide-column store has no rich joins to bail you out — so you must know your queries before you design your schema, and shape the data so each query is a single efficient lookup.

A worked example — a social feed:

   Normalized read:  fetch follows → fetch each followee's posts → merge → sort
                     (many lookups + a join + a sort, EVERY page load)

   Denormalized:     when someone posts, WRITE a copy into each follower's feed table
                     read = one range scan of "my feed"   (fan-out-on-write)

The denormalized version does far more work on write (fan the post out to thousands of feeds) to make the read trivial — the right call when reads vastly outnumber writes, which is exactly true of a timeline. (The celebrity with millions of followers breaks even this; that edge case is the hot-partition problem.)

How to actually decide

A practical sequence that works for most systems:

Start normalized. It’s the safe default — correct, flexible, no duplication to keep in sync.
Identify your hot read paths. Which queries run constantly, must be fast, or fan across shards?
Denormalize those — and only those. Duplicate data precisely where a measured read problem justifies the write cost, and make a clear plan for keeping the copies in sync (triggers, application logic, async jobs).
In NoSQL, do this design up front, because the engine won’t let you bolt on flexibility later.

The discipline is resisting the urge to denormalize everything for imagined speed (you’ll drown in sync bugs) and the urge to normalize everything for purity (your hot reads will crawl).

The thread

Data modeling is where every choice in this part comes home to roost. Normalization keeps each fact in one place — safe, cheap writes, but reads must join, and joins fight partitioning. Denormalization pre-assembles the answer for instant reads — but every write must now maintain the duplicates, and inconsistency is one missed update away. The deciding question is never “which is correct?” but “how do I read this data, and how often versus how often do I write it?” Model for the access pattern, denormalize where the read pressure earns it, and you’ve turned all the preceding machinery into a system that does exactly what your application asks of it.

Check your understanding

Define normalization and denormalization, and give the single-sentence trade-off between them.
Why do joins, which are cheap on one machine, become a problem under partitioning?
What is an “update anomaly,” and which approach is prone to it? Why?
Why is “model for your access patterns” mandatory in NoSQL but only advisable in SQL?
In the social-feed example, what does fan-out-on-write spend more of to make reads cheap, and when is that the right trade?