Library
00/07 · ~38 min
GUIDEDECK · PART 2 · for systems that must agree across the world

Advanced
System Design —
consistency under failure.

A 38-minute deep session that picks up where the intro deck left off. It assumes you already know scaling, caching, replication, queues and CAP from System Design for scale — and goes after the hard part: keeping many machines, in many regions, agreeing while the network and the machines keep failing.

~38 MININTERMEDIATE → ADVANCEDBUILDS ON PART 1
SCROLL
01 · Consistency models 5 min

"Consistent" isn't one thing —
it's a spectrum you buy into.

Part 1 ended on CAP: when the network splits you choose C or A. This deck starts one level down. Consistency is not a switch— it is a ladder of guarantees, each weaker and cheaper than the last. Picking the right rung per piece of data is the whole craft. If you need a refresher on partitions and replication lag, that's the intro deck.

A consistency model is a contract between the store and your code about which reads can see which writes. Stronger models forbid more surprising orderings — and cost more coordination, so more latency and less availability. The art is spending strong consistency only where a wrong answer is genuinely unacceptable.
Linearizable one global order, real-time Sequential an order, not the clock Causal cause before effect Read-your-writes session guarantees Eventual converges, someday costlier cheaper

Down the ladder each model forbids fewer orderings — and needs less coordination, so it is faster and more available.

The rungs, top to bottom

  • Linearizable (strong) — every operation appears to happen instantly at a single point between its call and its return, in one real-time global order. Once a write returns, everylater read sees it. This is what people mean by "strongly consistent".
  • Sequential — there is somesingle order all nodes agree on, respecting each client's own order — but it need not match wall-clock time.
  • Causal — if A caused B (you read A, then wrote B), everyone sees A before B. Truly concurrent writes may be seen in different orders. The strongest model still available during a partition.
  • Eventual — stop writing and replicas converge. No ordering promise in between.

Linearizability, precisely

// W(x=1) returns at t1. Any read that STARTS after t1 // must observe x=1 — no exceptions, on any replica. writer: write(x, 1) ────► ok@t1 readerA: read(x) ─► 1 // ✓ started after t1 readerB: read(x) ─► 0|1 // overlaps t1 → either ok // the cost: a quorum round-trip on the write path, // so a single global write can't dodge cross-node latency.
writer reader write x=1 t1 read → 1 ✓

The read begins after t1, so linearizability forces it to return the new value 1.

Session guarantees — the cheap wins

You rarely need a global order; you need your own view to make sense. These per-session promises sit on top of an eventually consistent store and fix the symptoms users actually notice — usually by pinning a session to a replica or tracking a version token.

session

Read-your-writes

After you update your profile, yousee the new value on the next read — even if other users briefly don't. The single most-missed guarantee behind "I saved it and it vanished" bug reports.

session

Monotonic reads

Once you've seen a value, you never see an older one — no time-travel from refreshing onto a laggier replica.

session

Consistent prefix

You see writes in a prefix of the true order — never a reply before the message it answers. The promise that keeps causal chats readable.

Rule of thumb: default to eventual + session guarantees, and spend linearizability only on the few invariants that must never bend — balances, inventory counts, unique usernames, leader identity.

02 · Consensus 6 min

Getting N machines to agree
on one value — provably hard.

Strong consistency, a single leader, a committed log: under the hood they all reduce to consensus — many nodes agreeing on one value despite crashes and a lossy network. It is the load-bearing algorithm of every coordination system you depend on.

Consensus all non-faulty nodes decide the same value, and only a value actually proposed. The FLP result (Fischer–Lynch–Paterson) proves that in a fully asynchronous network, no deterministic algorithm can guarantee consensus always terminates if even one node may crash. So real algorithms keep safety unconditionally and lean on timeouts to make progress — trading guaranteed liveness for near-certain liveness.

Raft — consensus designed to be understood

Raft splits the problem into two stories: elect one leader, then let that leader replicate an append-only log. Every decision rides a majority quorum — with 2f+1 nodes you survive f failures, because any two majorities overlap.

Leader election — terms and votes

  • Time is divided into terms (monotonic numbers). Each term has at most one leader.
  • A follower that hears nothing for a randomized timeout becomes a candidate, bumps the term, and requests votes.
  • Win a majority → leader; it then sends heartbeats. The randomized timeout makes split votes rare and self-healing.
leader term 4 node node node down votes

3 of 5 votes is a majority — the leader holds even with one node down.

Log replication — commit on a quorum

// leader-only write path; followers just append function onClientWrite(cmd) { log.append({ term, cmd }) // 1 · append locally sendAppendEntries(followers, log) // 2 · replicate (RPC) if (acks >= majority()) { // 3 · quorum = N/2 + 1 commitIndex = log.lastIndex // durable once a majority has it applyToStateMachine(cmd) // 4 · apply IN LOG ORDER } }
  • An entry is committed only after a majority has it on disk — so it survives any minority failure.
  • Log Matching: if two logs agree at an index/term, they agree on everything before it — the leader forces followers to converge.
  • A candidate with a stale log can never win, so a new leader always holds every committed entry (Leader Completeness).

Paxos — the same guarantee, in two phases

Paxos came first and proves the same safety, but in terms of proposers, acceptors and learners rather than a leader and a log. One value is chosen via two rounds:

  • Phase 1 — prepare/promise: a proposer picks a unique, increasing number n and asks a majority to promiseto ignore anything lower. Acceptors reply with any value they've already accepted.
  • Phase 2 — accept/accepted: the proposer sends n with a value (the highest already-accepted one, if any). A majority accepting it chooses the value forever.
  • Multi-Paxosskips Phase 1 by electing a stable leader — at which point it's essentially Raft, which is why Raft, born for understandability, usually wins in new systems.

Like passing a motion: first check you have the floor (a majority will hear you), then call the vote — and a majority "aye" makes it binding.

The tooling landscape — consensus & coordination

You almost never implement Raft yourself. You rent it: a small, strongly-consistent store that holds leader leases, config, locks and service registries for everything else.

raft · kv

etcd

A Raft-backed key-value store; the brain of Kubernetes.

Pro: simple gRPC API, linearizable reads, leases and watches — and proven at K8s scale.

Con: small datasets only (fits in memory); not a general database.

zab · znodes

ZooKeeper

The veteran: a hierarchy of znodes over the ZAB protocol.

Pro: battle-tested for a decade-plus; rich recipes (locks, barriers, leader election).

Con: heavier JVM ops, an older API — and its flagship user, Kafka, has now moved to built-in KRaft (4.0).

raft · discovery

Consul

Raft KV plus service discovery, health checks and multi-datacenter.

Pro: discovery, DNS and a service mesh in one, with first-class multi-DC.

Con: a bigger surface to run; overkill if you only need a lock.

How to choose: on Kubernetes you already run etcd — use it. Pick Consul when you need service discovery and multi-DC, not just coordination. Reach for ZooKeeper mainly to integrate with older JVM systems that still expect it.

03 · Distributed transactions 6 min

One commit across many services
is a contradiction — so stop trying.

A single ACID transaction across services means one slow service can hold everyone's locks. The modern answer isn't a bigger transaction — it's per-service local transactions stitched together with compensation, reliable events, and idempotency.

Two-phase commit (2PC) a coordinator asks every participant to PREPARE, then COMMIT only if all voted yes. It's correct, but blocking: between prepare and commit every participant holds locks, and if the coordinator dies there, they're stuck — uncertain and unavailable until it returns.
coordinator participant A participant B 1·prepare 2·yes coordinator dies here → locks held, stuck

2PC is atomic but blocking — a coordinator crash after PREPARE freezes every participant.

Sagas — trade atomicity for availability

A saga is a sequence of local transactions, each with a compensating action that semantically undoes it. If a step fails, you run the compensations of the completed steps in reverse. No global locks, no blocking — but no isolation either, so you must design for intermediate states being visible.

// each step pairs a forward action with its undo const saga = [ { do: reserveStock, undo: releaseStock }, { do: chargeCard, undo: refundCard }, { do: bookCourier, undo: cancelCourier }, ] // fail at step i → run undo for i-1 … 0, newest first
Orchestration — central brain
// one coordinator drives the steps & compensations class OrderSaga { async run(o) { await this.step(reserveStock, releaseStock) await this.step(chargeCard, refundCard) } // + clear flow − a new coupling point }
Choreography — events only
// services react to each other's events; no central brain on("order.placed", () => reserveStock()) on("stock.reserved",() => chargeCard()) on("charge.failed", () => releaseStock()) // + loosely coupled − flow is implicit, harder to trace

The outbox pattern — kill the dual-write

Sagas and events assume you can "update the DB andpublish a message" reliably. You can't do both atomically across two systems — crash between them and they diverge. The transactional outbox writes the event into the same database transaction as the state change, then a relay ships it.

// ONE local transaction writes BOTH rows — atomic await db.tx(async (t) => { await t.insert("orders", order) // business state await t.insert("outbox", { // event, same tx topic: "order.placed", payload: order, }) }) // a relay tails the outbox (poll or WAL/CDC) → broker → marks sent
one DB tx orders outbox relay broker poll/CDC

The event commits atomically with the state; the relay delivers it at-least-once — never a lost or phantom event.

Idempotency keys at scale

At-least-once delivery and client retries mean every write may arrive twice. An idempotency key — a client-supplied unique id per logical operation — lets the server recognize a replay and return thesame result instead of charging the card again.

// client sends Idempotency-Key: <uuid> with the request async function charge(key, amount) { const ok = await store.setNX(key, { status: "pending" }) // claim it if (!ok) return await store.wait(key) // replay → saved result const result = await gateway.charge(amount) await store.set(key, { status: "done", result }, { ttl: "24h" }) return result }
04 · Geo-distribution & replication 6 min

Across regions, the enemy isn't load —
it's the speed of light.

A cross-continent round trip is ~100–200 ms no matter how much you spend — physics, not budget. Go multi-region and every synchronous global write pays that toll. So you choose: serialize writes through one region, or let regions write independently and reconcile conflicts.

Leaderless replication (Dynamo-style) — any replica takes any write, and reads/writes use quorums to overlap. With N replicas, if writes touch W and reads touch R with R + W > N, every read intersects the latest write — tunable consistency without a single leader to fail over.
N = 3 · W = 2 · R = 2 R1 R2 R3 write w+r read R2 is in both sets → read sees the write ✓ R + W > N forces an overlap

Because R + W > N, the read quorum always shares a node with the write quorum — so a fresh value is always reachable.

Keeping replicas honest

  • Sloppy quorum + hinted handoff — if a home replica is unreachable, write to a stand-in that holds a hint and forwards it later. Availability over strictness.
  • Read repair — a read that sees disagreeing replicas pushes the newest value back on the spot.
  • Anti-entropy — background sync compares replicas with Merkle trees to find and heal divergence cheaply.
  • Tune per call: W=N for safe writes, R=1 for fast reads, or balance both.

When two regions write the same key

Concurrent writes with no leader will conflict. You need a deterministic, order-independent way to merge them — or you silently lose data.

Last-write-wins

Keep the write with the highest timestamp. Dead simple — and it throws awaythe other concurrent write. Fine for caches and presence; dangerous for anything you can't lose.

Vector clocks

A per-node version vector tags each write so the store can detect concurrency (neither happened-before the other) and surface conflicts — then resolve by app logic or merge.

CRDTs

Data types whose merges are mathematically guaranteed to converge — no central arbiter, no lost updates. The strongest answer when it fits your data shape.

CRDT Conflict-free Replicated Data Type — a structure whose merge is commutative, associative and idempotent, so replicas applying the same updates in any order, even twice, reach the same state. State-based CRDTs ship and merge whole states (a join on a semilattice); operation-based ones ship commutative ops. Counters, sets (OR-Set), registers and even collaborative text (RGA) all have CRDT forms.
// G-Counter: each node owns a tally; merge = element-wise max function merge(a, b) { const out = {} for (const node of keysOf(a, b)) out[node] = Math.max(a[node] ?? 0, b[node] ?? 0) return out // commutative · associative · idempotent } const value = (c) => sum(valuesOf(c)) // total = sum of tallies
replica A {a:3, b:1} replica B {a:1, b:4} {a:3, b:4} = 7 per-node max

Taking the max per node loses nothing and converges however the merges interleave — that is the CRDT guarantee.

The tooling landscape — globally-distributed databases

truetime · SQL

Google Spanner

Globally externally-consistent SQL, ordered by TrueTime (GPS + atomic clocks).

Pro: linearizable transactions across regions, with real SQL and horizontal scale.

Con: a managed GCP product — vendor-locked and priced accordingly.

HLC · postgres-wire

CockroachDB

Spanner's idea without atomic clocks — hybrid logical clocks and serializable isolation, Postgres-compatible.

Pro: self-host or cloud, Postgres wire protocol, serializable by default.

Con: cross-region writes still pay quorum latency; core is now under a restrictive license.

leaderless · AP

Cassandra / Dynamo

Leaderless, tunable-consistency wide-column / KV stores built for write volume.

Pro: massive write throughput, always-on, multi-region by design.

Con: eventual by default, no joins — you model tables around exact query paths.

How to choose: need global strongconsistency with SQL and you're on GCP → Spanner. Want that portable and Postgres-shaped → CockroachDB. Need extreme write scale and can live with tunable/eventual consistency → Cassandra or a Dynamo-style store.

05 · Rate limiting & backpressure 5 min

Protect the system from its callers —
and from itself.

Unbounded demand will always find your weakest component. Rate limiting caps what callers may send; backpressure lets a saturated component tell upstream to slow down — instead of quietly building a queue until it falls over.

Token bucket a bucket of capacity B refills at r tokens/second; each request spends a token, and an empty bucket means reject. It permits short burstsup to B while holding the long-run average at r — the default limiter because it's cheap, fair, and burst-friendly.
-- atomic token bucket on Redis (Lua = one round trip) local tokens = tonumber(redis.call("GET", k) or cap) tokens = math.min(cap, tokens + elapsed * rate) -- refill if tokens < 1 then return 0 end -- reject redis.call("SET", k, tokens - 1, "EX", ttl) -- consume return 1 -- allow
cap = B refill r/s allow ✓ empty → 429

Tokens drip in at r; a burst drains the bucket, then callers throttle to the refill rate.

window

Fixed window

Count requests per clock minute. Trivial, but allows a 2× burst straddling the boundary — two full windows back to back.

window · log

Sliding window log

Store each request's timestamp (e.g. a Redis sorted set) and count the last 60s exactly. Precise, but memory grows with traffic.

window · counter

Sliding window counter

Weight the previous window's count by how far you are into the current one. Smooths the boundary burst at O(1) memory — the common production pick.

The tooling landscape — a Redis-backed distributed limiter

One node's in-memory counter doesn't hold across a fleet — you need a shared, atomic store. Redis is the default: every approach below runs as one atomic operation so concurrent requests can't race past the limit.

INCR + EX

Atomic counter

Pro: two commands, trivial — INCR a per-window key with a TTL.

Con:it's a fixed window, so it inherits the boundary-burst flaw.

ZSET

Sorted-set window

Pro: exact sliding window — trim old timestamps, count the rest.

Con: memory and CPU scale with request volume per key.

Lua script

Token bucket

Pro: bursts + O(1) memory, all atomic in one round trip.

Con: a script to maintain, and a central Redis hop on every request.

How to choose: default to the Lua token bucket — bursty, O(1), atomic. Drop to plain INCR when a rough cap is fine, and use the sorted-set window only when you truly need exact counts. At very high RPS, limit locally with a small per-node budget and sync to Redis periodically, trading a little precision for latency.

Backpressure — let saturation flow upstream

  • Bounded queues. An unbounded buffer just hides overload until OOM. A bounded one that rejects when full is the signal.
  • Push the limit to the edge. Reject early at the gateway, not after work is half-done deep inside.
  • Load shedding> collapse. Dropping low-priority traffic keeps the rest healthy — graceful degradation beats a full outage.
producer bounded worker full → slow down shed excess

A full bounded queue is the backpressure signal: slow the producer, or shed what won't fit.

06 · Reliability patterns 5 min

One slow dependency
shouldn't take the whole system down.

In a distributed system, failures cascade: a slow downstream ties up threads, those threads stall the caller, and the stall climbs the stack until everything is "down". These four patterns contain the blast radius.

Circuit breaker wrap a risky call so that after enough failures it "trips" and fails fast instead of waiting. Three states: closed (calls flow, failures counted), open (reject instantly for a cooldown), half-open (let one probe through — succeed and close, fail and re-open). It stops a sick dependency from draining your threads.
// closed → open → half-open → closed if (state === "open") { if (now < openedAt + cooldown) throw CircuitOpen // fail fast state = "half-open" // allow one probe } try { const r = await call(); onSuccess(); return r // close on success } catch (e) { if (++fails >= threshold) trip("open"); throw e }
closed open half- open fails ≥ N cooldown probe ✓

Trip open after N failures; after a cooldown, one half-open probe decides whether to close again.

isolate

Bulkheads

Give each dependency its ownpool of threads/connections, so one flooded dependency can't starve the others — like watertight compartments in a ship's hull. The slow service drowns its own bulkhead, not the boat.

retry · safely

Retries + jitter

Retry only idempotent calls, with exponential backoff and jitterso a thousand clients don't retry in lockstep and re-stampede the recovering service. Cap total attempts with a retry budget.

degrade

Load shedding

When saturated, drop the least important work — admission control at the door. A served checkout and a dropped analytics ping beats both failing. Pair with a timeout/deadline on every call.

Naive retry — synchronized stampede
// every client waits the SAME fixed delay for (let i = 0; i < 5; i++) { try { return await call() } catch { await sleep(1000) } // all retry together → re-flood }
Backoff with full jitter
for (let i = 0; i < 5; i++) { try { return await call() } catch { const cap = Math.min(MAX, BASE * 2 ** i) // exponential await sleep(Math.random() * cap) // full jitter } }
07 · A globally-consistent design + recap 5 min

Put it together: a
global payment ledger.

Users worldwide move money between accounts. The hard invariant: no balance ever goes negative and no money is created or lost — even across regions, retries and partitions. Every tool in this deck earns its place.

Step 1 · the invariant

Linearizable where it counts

The ledger itself is the one place that needs strong consistency. Put balances in a globally-consistent SQL store (Spanner / CockroachDB) so a debit and credit commit in one serializable transaction. Everything else can be weaker.

Step 2 · cross-service

Saga + outbox, never 2PC

A transfer spans ledger, fraud-check and notification services. Run it as a saga with compensations, and emit each event via the transactional outbox so a crash never loses or duplicates a step.

Step 3 · safe retries

Idempotency at the edge

The client sends an idempotency key per transfer. A retry after a timeout replays the stored result — the money moves exactly once no matter how many times the request lands.

Step 4 · survive load & faults

Limits, breakers, backpressure

A token-bucket limiter caps abuse at the gateway; circuit breakers and bulkheads isolate the fraud service; backoff with jitter and load shedding keep a spike from cascading.

The one trade-off to name: the ledger is CP — during a partition the minority region refuses writes rather than risk a double-spend. But the account feed and notifications are AP, served from CRDT-merged replicas. Same product, two consistency rungs, chosen per-invariant. That choice — not any single tool — is the senior skill.

Five rules to walk out with

1Consistency is a dial, set per invariant. Strong only where a wrong answer is unacceptable; session guarantees + eventual everywhere else.
2Agreement costs a quorum. Consensus (Raft/Paxos) is how machines agree under failure — rent it (etcd) rather than build it.
3Don't span a transaction across services. Use sagas + the outbox + idempotency keys instead of blocking 2PC.
4Across regions, plan to merge. Quorums tune consistency; CRDTs converge without a leader; LWW silently loses data.
5Contain failure on purpose. Rate limits, backpressure, breakers, bulkheads and jittered retries turn a cascade into a graceful degrade.
Knowledge check

Did it stick?

Five questions on consistency, consensus, distributed transactions, geo-replication and reliability — instant feedback, no sign-in.

Rate this deck
be the first

Navigate with ← → or scroll · back to library