A 38-minute deep session that picks up where the intro deck left off. It assumes you already know scaling, caching, replication, queues and CAP from System Design for scale — and goes after the hard part: keeping many machines, in many regions, agreeing while the network and the machines keep failing.
Part 1 ended on CAP: when the network splits you choose C or A. This deck starts one level down. Consistency is not a switch; it is a ladder of guarantees, each weaker and cheaper than the last. Picking the right rung per piece of data is the whole craft. If you need a refresher on partitions and replication lag, that's the intro deck.
Down the ladder each model forbids fewer orderings — and needs less coordination, so it is faster and more available.
The read begins after t1, so linearizability forces it to return the new value 1.
You rarely need a global order; you need your own view to make sense. These per-session promises sit on top of an eventually consistent store and fix the symptoms users actually notice — usually by pinning a session to a replica or tracking a version token.
After you update your profile, you see the new value on the next read, even if other users briefly don't. The single most-missed guarantee behind "I saved it and it vanished" bug reports.
Once you've seen a value, you never see an older one — no time-travel from refreshing onto a laggier replica.
You see writes in a prefix of the true order — never a reply before the message it answers. The promise that keeps causal chats readable.
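The version-token approach to these session guarantees can be sketched as follows. This is a minimal illustration, not a production client: `Replica`, `Session` and the integer version scheme are hypothetical names invented here, and the sketch assumes every write is stamped with a monotonically increasing version.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A replica that has applied writes up to `version` (hypothetical model)."""
    name: str
    version: int = 0
    data: dict = field(default_factory=dict)

    def apply(self, key: str, value: str, version: int) -> None:
        self.data[key] = value
        self.version = version


@dataclass
class Session:
    """Tracks the highest version this client has written or read."""
    token: int = 0

    def wrote(self, version: int) -> None:
        # Read-your-writes: remember the version our own write produced.
        self.token = max(self.token, version)

    def read(self, replicas: list, key: str):
        # Accept only a replica that is at least as fresh as our token.
        for replica in replicas:
            if replica.version >= self.token:
                self.token = replica.version  # monotonic reads: never go back
                return replica.data.get(key)
        raise RuntimeError("no replica is fresh enough; retry or ask the leader")


leader = Replica("leader")
lagging = Replica("lagging")
leader.apply("bio", "hello", version=1)  # the write has not reached `lagging`

session = Session()
session.wrote(1)
value = session.read([lagging, leader], "bio")  # skips the stale replica
```

A session with no token would happily read from `lagging` and miss the update; the token is what turns an eventually consistent read into one that respects the caller's own history.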
Rule of thumb: default to eventual + session guarantees, and spend linearizability only on the few invariants that must never bend — balances, inventory counts, unique usernames, leader identity.
Strong consistency, a single leader, a committed log: under the hood they all reduce to consensus — many nodes agreeing on one value despite crashes and a lossy network. It is the load-bearing algorithm of every coordination system you depend on.
Raft splits the problem into two stories: elect one leader, then let that leader replicate an append-only log. Every decision rides a majority quorum — with 2f+1 nodes you survive f failures, because any two majorities overlap.
3 of 5 votes is a majority — the leader holds even with one node down.
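The quorum arithmetic can be checked directly. The sketch below computes the majority size and the number of tolerated failures for a cluster, then brute-forces the claim that any two majorities overlap; the function names are illustrative only.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return (n - 1) // 2


def majorities_always_overlap(n: int) -> bool:
    """Exhaustively check that every pair of majority quorums shares a node."""
    quorums = [set(c) for c in combinations(range(n), majority(n))]
    return all(a & b for a in quorums for b in quorums)


five_node_majority = majority(5)              # 3 votes
five_node_tolerance = tolerated_failures(5)   # f = 2 for 2f+1 = 5
overlap_holds = majorities_always_overlap(5)
```

Note that an even-sized cluster buys nothing: four nodes need three votes and still tolerate only one failure, which is why clusters are usually sized at 3, 5 or 7.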
Paxos came first and proves the same safety, but in terms of proposers, acceptors and learners rather than a leader and a log. One value is chosen via two rounds:
Prepare: a proposer picks a proposal number n and asks a majority to promise to ignore anything lower. Acceptors reply with any value they've already accepted.
Accept: the proposer sends n with a value (the highest already-accepted one, if any). A majority accepting it chooses the value forever.
Like passing a motion: first check you have the floor (a majority will hear you), then call the vote, and a majority "aye" makes it binding.
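The two rounds can be sketched as single-decree Paxos running in one process. This is a teaching sketch under strong assumptions: no network, no crashes, no persistence, and the `Acceptor` and `propose` names are made up for illustration rather than taken from any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    promised: int = -1                     # highest proposal number promised
    accepted_n: int = -1                   # number of the accepted proposal
    accepted_value: Optional[str] = None   # value of the accepted proposal

    def prepare(self, n: int):
        """Round 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, -1, None

    def accept(self, n: int, value: str) -> bool:
        """Round 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n: int, value: str) -> Optional[str]:
    """Run both rounds; return the chosen value, or None without a majority."""
    quorum = len(acceptors) // 2 + 1

    replies = [a.prepare(n) for a in acceptors]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < quorum:
        return None

    # Safety rule: adopt the highest-numbered value already accepted, if any.
    prior_n, prior_value = max(promises, key=lambda p: p[0])
    if prior_n >= 0:
        value = prior_value

    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="A")    # chooses "A"
second = propose(acceptors, n=2, value="B")   # must re-propose "A"
stale = propose(acceptors, n=1, value="C")    # number too low: no promises
```

The second proposer wanted "B" but is forced to carry "A" forward, which is the mechanism behind "chooses the value forever".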
You almost never implement Raft yourself. You rent it: a small, strongly-consistent store that holds leader leases, config, locks and service registries for everything else.
A Raft-backed key-value store; the brain of Kubernetes.
Pro: simple gRPC API, linearizable reads, leases and watches — and proven at K8s scale.
Con: small datasets only (fits in memory); not a general database.
The veteran: a hierarchy of znodes over the ZAB protocol.
Pro: battle-tested for a decade-plus; rich recipes (locks, barriers, leader election).
Con: heavier JVM ops, an older API — and its flagship user, Kafka, has now moved to built-in KRaft (4.0).
Raft KV plus service discovery, health checks and multi-datacenter.
Pro: discovery, DNS and a service mesh in one, with first-class multi-DC.
Con: a bigger surface to run; overkill if you only need a lock.
How to choose: on Kubernetes you already run etcd — use it. Pick Consul when you need service discovery and multi-DC, not just coordination. Reach for ZooKeeper mainly to integrate with older JVM systems that still expect it.
A single ACID transaction across services means one slow service can hold everyone's locks. The modern answer isn't a bigger transaction — it's per-service local transactions stitched together with compensation, reliable events, and idempotency.
2PC is atomic but blocking — a coordinator crash after PREPARE freezes every participant.
A saga is a sequence of local transactions, each with a compensating action that semantically undoes it. If a step fails, you run the compensations of the completed steps in reverse. No global locks, no blocking — but no isolation either, so you must design for intermediate states being visible.
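A saga orchestrator can be sketched in a few lines. The example below is a hedged illustration: `run_saga`, the step names and the in-memory `state` dict are hypothetical, and a real orchestrator would also persist its progress so it can resume compensation after a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones on failure."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"did {name}")
            completed.append((name, compensate))
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            # Compensate in reverse order of completion.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undid {done_name}")
            return False, log
    return True, log


def fail_shipping():
    raise RuntimeError("no courier available")


state = {"reserved": False, "charged": False}
steps = [
    ("reserve", lambda: state.update(reserved=True),
                lambda: state.update(reserved=False)),
    ("charge", lambda: state.update(charged=True),
               lambda: state.update(charged=False)),
    ("ship", fail_shipping, lambda: None),
]
ok, log = run_saga(steps)
```

Between "did charge" and "undid charge" the charge was visible to other readers; that window is the missing isolation the text warns about.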
Sagas and events assume you can "update the DB and publish a message" reliably. You can't do both atomically across two systems: crash between them and they diverge. The transactional outbox writes the event into the same database transaction as the state change, then a relay ships it.
The event commits atomically with the state; the relay delivers it at-least-once — never a lost or phantom event.
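The outbox pattern can be sketched with SQLite from the Python standard library. The table layout, the `OrderPlaced` event and the `place_order` / `relay_once` functions are assumptions made for this example; a real relay would typically poll or tail the database log and publish to a broker.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: int) -> None:
    # One local transaction: the state change and the event commit together
    # or roll back together.
    with db:
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')",
                   (order_id,))
        event = json.dumps({"type": "OrderPlaced", "order_id": order_id})
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def relay_once(publish) -> int:
    """Ship unsent events in order; mark each sent only after publish returns."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id").fetchall()
    for event_id, payload in rows:
        publish(json.loads(payload))  # a crash after this line means a re-send
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (event_id,))
    return len(rows)


published = []
place_order(1)
try:
    place_order(1)  # duplicate key: nothing commits, so no phantom event
except sqlite3.IntegrityError:
    pass
shipped = relay_once(published.append)
```

Because the row is marked sent only after `publish` returns, a crash in between causes a duplicate delivery rather than a lost one, which is why consumers need to be idempotent.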
At-least-once delivery and client retries mean every write may arrive twice. An idempotency key (a client-supplied unique id per logical operation) lets the server recognize a replay and return the same result instead of charging the card again.
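A server-side handler for idempotency keys might look like the sketch below. `PaymentService` is a hypothetical in-memory stand-in: the lock plays the role that an atomic insert-if-absent would play in a shared store, and a real service would also expire old keys.

```python
import threading


class PaymentService:
    """Hypothetical service that charges at most once per idempotency key."""

    def __init__(self):
        self._results = {}             # key -> stored response
        self._lock = threading.Lock()  # stands in for an atomic insert-if-absent
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        with self._lock:
            stored = self._results.get(idempotency_key)
            if stored is not None:
                return stored          # replay: return the first result
            self.charges_made += 1     # the side effect happens only here
            result = {"charge_id": self.charges_made,
                      "amount_cents": amount_cents}
            self._results[idempotency_key] = result
            return result


service = PaymentService()
first = service.charge("transfer-42", 5000)
retry = service.charge("transfer-42", 5000)  # e.g. the client timed out
```

The retry gets the original response back and `charges_made` stays at 1, so the caller cannot tell a replay from the first attempt.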
An atomic insert-if-absent (setNX / a unique constraint) is the trick: it serializes concurrent retries so exactly one executes.
A cross-continent round trip is ~100–200 ms no matter how much you spend: physics, not budget. Go multi-region and every synchronous global write pays that toll. So you choose: serialize writes through one region, or let regions write independently and reconcile conflicts.
With N replicas, if writes touch W and reads touch R with R + W > N, every read intersects the latest write: tunable consistency without a single leader to fail over.
Because R + W > N, the read quorum always shares a node with the write quorum, so a fresh value is always reachable.
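The overlap condition can be verified by brute force. The sketch below enumerates every possible write set and read set for small N, and then shows a read picking the freshest version; the `(version, value)` tuples are an assumed representation chosen for this example.

```python
from itertools import combinations


def quorums_intersect(n: int, w: int, r: int) -> bool:
    """Check that every w-node write set shares a node with every r-node read set."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )


def read_latest(replicas, read_set):
    """Return the value with the highest version among the contacted replicas."""
    return max((replicas[i] for i in read_set), key=lambda vv: vv[0])[1]


# N=3, W=2: version 2 reached nodes 0 and 1; node 2 still holds version 1.
replicas = [(2, "new"), (2, "new"), (1, "old")]

strict = quorums_intersect(3, 2, 2)   # R + W = 4 > 3
sloppy = quorums_intersect(3, 1, 1)   # R + W = 2, not > 3
fresh = read_latest(replicas, (1, 2))  # one fresh node is enough
```

With R=2 any read set includes node 0 or node 1, so the stale copy on node 2 can never be the only answer.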
W=N for safe writes, R=1 for fast reads, or balance both.
Concurrent writes with no leader will conflict. You need a deterministic, order-independent way to merge them, or you silently lose data.
Keep the write with the highest timestamp. Dead simple, and it throws away the other concurrent write. Fine for caches and presence; dangerous for anything you can't lose.
A per-node version vector tags each write so the store can detect concurrency (neither happened-before the other) and surface conflicts — then resolve by app logic or merge.
Data types whose merges are mathematically guaranteed to converge — no central arbiter, no lost updates. The strongest answer when it fits your data shape.
Taking the max per node loses nothing and converges however the merges interleave — that is the CRDT guarantee.
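A grow-only counter (G-counter) is the simplest CRDT that shows the per-node max merge. The class below is a minimal sketch with hypothetical node names; it omits serialization and decrement support.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}

    def increment(self, by: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + by

    def merge(self, other: "GCounter") -> None:
        # Per-node max is commutative, associative and idempotent,
        # so merge order and repeated merges do not matter.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


eu = GCounter("eu")
us = GCounter("us")
eu.increment(2)   # concurrent increments in two regions
us.increment(3)

eu.merge(us)
us.merge(eu)
eu.merge(us)      # merging again changes nothing
```

Both replicas end at 5. Last-write-wins on a plain integer would have kept either 2 or 3 and dropped the other region's increments.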
Globally externally-consistent SQL, ordered by TrueTime (GPS + atomic clocks).
Pro: linearizable transactions across regions, with real SQL and horizontal scale.
Con: a managed GCP product — vendor-locked and priced accordingly.
Spanner's idea without atomic clocks — hybrid logical clocks and serializable isolation, Postgres-compatible.
Pro: self-host or cloud, Postgres wire protocol, serializable by default.
Con: cross-region writes still pay quorum latency; core is now under a restrictive license.
Leaderless, tunable-consistency wide-column / KV stores built for write volume.
Pro: massive write throughput, always-on, multi-region by design.
Con: eventual by default, no joins — you model tables around exact query paths.
How to choose: need global strong consistency with SQL and you're on GCP → Spanner. Want that portable and Postgres-shaped → CockroachDB. Need extreme write scale and can live with tunable/eventual consistency → Cassandra or a Dynamo-style store.
Unbounded demand will always find your weakest component. Rate limiting caps what callers may send; backpressure lets a saturated component tell upstream to slow down — instead of quietly building a queue until it falls over.
Tokens drip in at r; a burst drains the bucket, then callers throttle to the refill rate.
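A single-process token bucket can be sketched as below. The clock is passed in as `now` so the behavior is deterministic; the class name and parameters are illustrative, and a fleet-wide limiter would keep this state in a shared store instead.

```python
class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Lazily add the tokens that dripped in since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=3)
burst = [bucket.allow(now=0.0) for _ in range(4)]  # bucket holds only 3
after_one_second = bucket.allow(now=1.0)           # one token refilled
too_soon = bucket.allow(now=1.5)                   # only half a token so far
```

The burst drains the three stored tokens and the fourth request is rejected; after that, callers are admitted at the refill rate of one per second.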
Count requests per clock minute. Trivial, but allows a 2× burst straddling the boundary — two full windows back to back.
Store each request's timestamp (e.g. a Redis sorted set) and count the last 60s exactly. Precise, but memory grows with traffic.
Weight the previous window's count by how far you are into the current one. Smooths the boundary burst at O(1) memory — the common production pick.
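The weighted estimate behind the sliding window counter can be sketched as follows. It is an approximation by design: it assumes requests in the previous window were evenly spread. The class and its injected `now` argument are hypothetical choices made for this example.

```python
class SlidingWindowCounter:
    """Approximate sliding window using only two counters per key."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.index = 0      # which fixed window `current` belongs to
        self.current = 0
        self.previous = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.index:
            # Roll over; the old count only matters if the windows are adjacent.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index
        into_window = (now % self.window) / self.window
        # The previous window's weight shrinks as the current one progresses.
        estimate = self.previous * (1.0 - into_window) + self.current
        if estimate + 1 <= self.limit:
            self.current += 1
            return True
        return False


limiter = SlidingWindowCounter(limit=10, window=60.0)
late_burst = sum(limiter.allow(59.0) for _ in range(12))  # end of window 0
at_boundary = limiter.allow(60.0)                         # start of window 1
mid_window = sum(limiter.allow(90.0) for _ in range(12))  # halfway into window 1
```

A fixed window would admit a fresh batch of 10 at t=60; here the boundary request is rejected, and halfway through the next window only 5 more get in because the earlier burst still counts at half weight.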
One node's in-memory counter doesn't hold across a fleet — you need a shared, atomic store. Redis is the default: every approach below runs as one atomic operation so concurrent requests can't race past the limit.
Pro: two commands, trivial — INCR a per-window key with a TTL.
Con: it's a fixed window, so it inherits the boundary-burst flaw.
Pro: exact sliding window — trim old timestamps, count the rest.
Con: memory and CPU scale with request volume per key.
Pro: bursts + O(1) memory, all atomic in one round trip.
Con: a script to maintain, and a central Redis hop on every request.
How to choose: default to the Lua token bucket — bursty, O(1), atomic. Drop to plain INCR when a rough cap is fine, and use the sorted-set window only when you truly need exact counts. At very high RPS, limit locally with a small per-node budget and sync to Redis periodically, trading a little precision for latency.
A full bounded queue is the backpressure signal: slow the producer, or shed what won't fit.
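The bounded-queue signal can be shown with `queue.Queue` from the Python standard library. The `submit` helper is a hypothetical wrapper that sheds work when the queue is full; a producer that prefers to slow down could instead call `put` with a timeout and block.

```python
import queue


def submit(q: queue.Queue, item) -> bool:
    """Try to enqueue without blocking; False tells the caller to back off."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # backpressure: shed the item or retry later


work = queue.Queue(maxsize=2)
accepted = [submit(work, job) for job in ("a", "b", "c")]  # third does not fit

drained = work.get_nowait()        # a consumer frees one slot
accepted_after_drain = submit(work, "d")
```

The important property is the bound itself: an unbounded queue would have accepted all three jobs and hidden the overload until memory ran out.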
In a distributed system, failures cascade: a slow downstream ties up threads, those threads stall the caller, and the stall climbs the stack until everything is "down". These four patterns contain the blast radius.
Trip open after N failures; after a cooldown, one half-open probe decides whether to close again.
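The closed, open and half-open states can be sketched as a small state machine. This is a single-threaded illustration with an injected clock; `CircuitBreaker`, `CircuitOpenError` and the thresholds are hypothetical, and a concurrent version would need a lock so that only one probe runs while half-open.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, now: float):
        if self.state == "open":
            if now - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"  # cooldown over: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures, (re)opens the circuit.
            if (self.state == "half-open"
                    or self.failures >= self.failure_threshold):
                self.state = "open"
                self.opened_at = now
            raise
        self.failures = 0
        self.state = "closed"  # success closes the circuit again
        return result


def broken():
    raise ConnectionError("downstream timed out")


breaker = CircuitBreaker(failure_threshold=2, cooldown=10.0)
for t in (0.0, 1.0):
    try:
        breaker.call(broken, now=t)
    except ConnectionError:
        pass
state_after_failures = breaker.state                 # "open"
probe_result = breaker.call(lambda: "ok", now=12.0)  # half-open probe succeeds
```

While open, calls raise `CircuitOpenError` immediately instead of tying up a thread on a dependency that is already known to be failing.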
Give each dependency its own pool of threads/connections, so one flooded dependency can't starve the others, like watertight compartments in a ship's hull. The slow service drowns its own bulkhead, not the boat.
Retry only idempotent calls, with exponential backoff and jitter so a thousand clients don't retry in lockstep and re-stampede the recovering service. Cap total attempts with a retry budget.
When saturated, drop the least important work: admission control at the door. A served checkout and a dropped analytics ping beat both failing. Pair with a timeout/deadline on every call.
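Retrying with exponential backoff and jitter can be sketched as below. The sketch uses the "full jitter" variant (a uniform delay between zero and the exponential cap), which is one common choice rather than the only one; the function names, the defaults and the choice of `ConnectionError` as the retryable error are assumptions for this example.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng=random) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, max_attempts: int = 4, sleep=time.sleep, rng=random):
    """Retry an idempotent call on ConnectionError, up to max_attempts total."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the error
            sleep(backoff_delay(attempt, rng=rng))


calls = {"count": 0}


def flaky():
    """Fails twice, then succeeds (stands in for a recovering service)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("try again")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append, rng=random.Random(0))
```

Because each client draws its delay at random, retries from many clients spread out over the backoff interval instead of arriving together.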
Users worldwide move money between accounts. The hard invariant: no balance ever goes negative and no money is created or lost — even across regions, retries and partitions. Every tool in this deck earns its place.
The ledger itself is the one place that needs strong consistency. Put balances in a globally-consistent SQL store (Spanner / CockroachDB) so a debit and credit commit in one serializable transaction. Everything else can be weaker.
A transfer spans ledger, fraud-check and notification services. Run it as a saga with compensations, and emit each event via the transactional outbox so a crash never loses a step. Because the relay delivers at-least-once, consumers deduplicate any event that arrives twice.
The client sends an idempotency key per transfer. A retry after a timeout replays the stored result — the money moves exactly once no matter how many times the request lands.
A token-bucket limiter caps abuse at the gateway; circuit breakers and bulkheads isolate the fraud service; backoff with jitter and load shedding keep a spike from cascading.
Five questions on consistency, consensus, distributed transactions, geo-replication and reliability — instant feedback, no sign-in.