Library
00/08 · ~43 min
GUIDEDECK · for systems that grow without falling over

System Design
for scale
and failure.

A 43-minute working session on the moving parts of large systems — how to scale them out, where to cache, how databases survive growth, why we lean on queues, the trade-offs the CAP theorem forces, and how to see the whole thing once it's running in production.

~43 MINMIXED TEAMLANGUAGE-AGNOSTIC
SCROLL
01 · The framing 4 min

You don't design for today's load.
You design for growth and failure.

One box serving a thousand users is easy. The hard part starts when traffic, data and the number of failing machines all climb at once. Every decision ahead trades one resource for another — there is no free win, only the right trade for your numbers.

System design the architecture of a system at scale — is deciding how requests, data and failures flow through many machines so the whole thing stays fast, available and affordable as load grows. It is engineering under three pressures at once: more traffic, more data, and more things breaking.
x100

traffic can grow two orders of magnitude in a year — design must bend, not snap.

p99

p99= the slowest 1 request in 100. The tail, not the average, is what users actually feel — so that's what you optimize.

100%?

no machine has 100% uptime. At scale, something is always failing — plan for it.

$$$

capacity costs money. The cheapest design that meets the SLO wins, not the most powerful.

The three numbers that drive every decision

  • Latency — how long one request takes. Watch the tail (p95/p99), not the mean; a few slow requests poison the experience.
  • Throughput — how many requests per second the system can absorb before it queues, slows, or sheds load.
  • Availability— the fraction of time the system answers correctly. "Three nines" is ~8.7 hours of downtime a year; each extra nine is roughly 10× harder.

Almost every technique ahead is, underneath, the same move: spend memory, money or consistency to buy back latency, throughput or availability. A cache, for instance, spends extra memory (and risks slightly out-of-date data) to make reads far faster.

02 · Scaling & load balancing 6 min

Scale up until it hurts,
then scale out.

There are only two ways to handle more load: make the machine bigger, or add more machines. The first is simple and has a ceiling; the second is unbounded but forces you to confront distribution.

Vertical scaling (scale up) — buy a bigger machine: more CPU, RAM, faster disks. Horizontal scaling (scale out) — add more machines and spread the work across them. Up is easy but caps out and is a single point of failure; out is harder but has no ceiling and survives a dead node.
Scale up — one big box
// one server does everything Server { cpu: 128, ram: "2 TB" } // + dead simple, no distribution // − hard ceiling (biggest box exists) // − single point of failure // − price climbs faster than power
Scale out — a fleet
// many small servers behind a balancer [ Server, Server, Server, ... ] // + no ceiling — keep adding nodes // + a dead node ≠ an outage // − needs a load balancer + statelessness // − distribution adds complexity
clients requests load balancer app · node 1 app · node 2 app · node 3 node 4 · down skipped ✕

The balancer health-checks the fleet and routes around the dead node — clients never notice.

The load balancer, and what it needs

  • Spreads requests by round-robin, least-connections, or a hash of the client — and health-checks each node so a dead one is dropped.
  • Horizontal scaling only works if app servers are stateless — any node can serve any request. Push session and uploaded state out to a shared store, not local memory.
  • Need stickiness? Prefer a shared session store (a cache or DB) over sticky routing, so losing a node never logs users out.
  • The balancer itself must not be the single point of failure — run it redundantly (e.g. a pair behind a floating address).

The tooling landscape — load balancers

A load balancer is the traffic cop in front of the fleet, deciding which node each request goes to. You can run your own, or let your cloud run one for you.

software · web

nginx

A web server that doubles as a reverse proxy and balancer.

Pro: fast and battle-tested, and the same box can also serve static files and handle HTTPS.

Con:the richest routing and live-reload features sit behind the paid "Plus" tier.

software · proxy

HAProxy

A proxy built for one job: balancing traffic, nothing else.

Pro: the deepest health-check and load-algorithm controls, and rock-solid at very high request rates.

Con: proxying only — no static files — and the config has a learning curve.

managed · cloud

Cloud LB

Your cloud's own balancer (AWS ELB/ALB, GCP, Azure).

Pro:fully managed — it auto-scales, heals itself, and there's nothing for you to patch.

Con: less fine-grained control, a per-hour plus per-GB bill, and it ties you to that one cloud.

How to choose: on a cloud, start with the managed load balancer — it scales itself and you never patch it. Drop down to nginx or HAProxy when you need custom routing rules or want to stay portable across providers.

Whose machines? The big cloud providers

Scaling out means renting machines somewhere. The three large clouds all rent compute, managed databases, queues and load balancers — they mostly differ in maturity, price, and which extras they do best.

cloud · widest

AWS

Amazon Web Services — the oldest and largest.

Pro: the widest catalog of services and the biggest hiring pool — most tutorials assume it.

Con:so many overlapping options that it's easy to get lost, and the bill is hard to predict.

cloud · data/ml

GCP

Google Cloud — strong on data and Kubernetes (which Google created).

Pro: excellent analytics, machine-learning and container tooling, with clean developer ergonomics.

Con: a smaller catalog and ecosystem, and a few products have been retired over the years.

cloud · enterprise

Azure

Microsoft's cloud — the default if you already live in the Microsoft world.

Pro: deep ties to Windows, Office and Active Directory make it the easy pick for many enterprises.

Con:the console and naming can feel inconsistent, and some services trail AWS's.

How to choose:on the basics they're close to a tie, so pick the one your team already knows or where your other tools live. Lean GCP for heavy data/ML work, Azure if you're a Microsoft shop, and AWS when you want the broadest menu and the most hiring options.

03 · Caching 6 min

The fastest query is
the one you never run.

A cache keeps a copy of expensive results close to where they're needed, so most reads skip the slow path entirely. It is the single highest-leverage tool for latency — and the easiest to get subtly wrong.

A cache is a fast, temporary store of expensive results kept close to the reader. A hit serves from cache (fast); a miss falls through to the source of truth (slow) and usually populates the cache on the way back. Aim for a high hit ratio — that ratio is the whole value.
where · client

Browser / CDN

Static assets and public responses cached at the edge, near the user. Cheapest possible hit — the request never reaches your servers.

where · app

In-memory / Redis

Hot objects and query results in a shared cache (Redis, Memcached) between app and DB. The workhorse layer most people mean by "a cache".

where · db

Database & query cache

The DB's own buffer pool and materialized views. Useful, but you don't control it — and it doesn't cut network hops.

app cache Redis database truth 1 · get hit ✓ 2 · miss 3 · backfill (with TTL)

Cache-aside: ask the cache first; on a miss, read the DB and write the answer back with a TTL.

Cache-aside, the default pattern

async function getUser(id) { let u = await cache.get(`user:${id}`) if (u) return u // hit — done u = await db.findUser(id) // miss — slow path await cache.set(`user:${id}`, u, { ttl: 300 }) return u // backfill for next time }

Like keeping the files you use daily on your desk, not walking to the archive room each time.

The two hard parts: invalidation & TTLs

staleness

Invalidation

A cache holds a copy; when the source changes, the copy is a lie. On write, delete (or update) the keyso the next read re-fetches. "There are only two hard things… one is cache invalidation."

expiry

TTL — time to live

A safety net: every entry expires after N seconds even if you forget to invalidate. Short TTL = fresher but more misses; long TTL = faster but staler. Tune per data.

failure

Stampede

When a popular key expires, a thousand requests all miss at the same instant and stampede the database. Defend with a short lock, by letting just one request refetch while the others wait ("coalescing"), or by adding a little randomness to the TTL ("jitter") so keys don't all expire together.

The tooling landscape — caches

in-memory · rich

Redis

An in-memory store that keeps data in RAM for microsecond reads.

Pro: far more than a cache — lists, sets, counters, pub/sub, and an option to save to disk so it survives a restart.

Con: a single-threaded core and more moving parts to run; one giant instance can itself become the bottleneck.

in-memory · simple

Memcached

The bare-bones option: a fast key → value string cache.

Pro:dead simple and multi-threaded — very fast for plain "remember this string" caching.

Con: strings only, no persistence, no replication — lose the node and you lose the whole cache.

managed · edge

CDN & managed

Edge caches (Cloudflare, Fastly, CloudFront) and hosted Redis (ElastiCache, Upstash).

Pro: someone else runs it — caching right next to the user (CDN) or no servers to patch (managed Redis).

Con: a monthly bill and less control; a CDN only helps for public, cacheable responses.

How to choose: reach for Redis by default — its data types and optional persistence pay off the moment you need more than plain lookups. Pick Memcached only when you want the simplest possible string cache, and push static, public content out to a CDN so it never touches your servers at all.

04 · Databases at scale 7 min

Reads scale with copies.
Writes & size scale with splits.

The database is usually the first thing to buckle, because it holds state and state is hard to spread. Two moves carry you a long way: replication for read load and availability, and partitioning for write load and raw size.

Replication keep full copies of the data on many nodes (a primary takes writes; replicas serve reads). Partitioning / sharding split the data into pieces by a key, each piece living on a different node. Replication answers "too many reads"; sharding answers "too many writes / too much data".

Replication — one primary, many read replicas

Writes go to the primary; they stream to replicas that serve reads. You multiply read capacity and gain a hot standby — if the primary dies, a replica is promoted (failover).

primary writes replica reads replica reads async replication
  • Read scaling — point reporting and read-heavy traffic at replicas.
  • Availability — a replica becomes the new primary if the primary fails.
  • The catch — replication lag. Async replicas are a moment behind, so a read right after a write may not see it (read-your-writes problem).

Sharding — split rows across nodes by a key

Choose a shard key (e.g. user_id); a hash or range of it decides which node owns each row. Each shard holds a slice of the data and takes its own writes — so write throughput and total size scale with the number of shards.

// route by a hash of the shard key function shardFor(userId, n) { return hash(userId) % n // → which node owns this user } // all of one user's rows live together on one shard // → single-user queries stay on a single node
  • Pick the key carefully— it's expensive to change. A bad key creates hot shards (one node gets most traffic).
  • Cross-shard queries hurt— joins and "count all" now fan out to every node. Design so the common query stays on one shard.
  • Consistent hashing lets you add/remove nodes while moving only a fraction of the keys.

Pick the store for the access pattern

Relational (SQL)

Strong consistency, transactions, flexible ad-hoc queries and joins. The right default — most apps never outgrow a well-tuned, replicated SQL database.

NoSQL (KV / document / wide-column)

Built to shard horizontally for huge write volume and simple, known access patterns. You trade joins and (often) strict consistency for raw scale.

Rule of thumb: start relational. Reach for NoSQL when a specific access pattern at a specific scale demands it — not by default.

  • Replication solves reads & availability — but introduces lag and a failover story.
  • Sharding solves writes & size — but introduces a shard key you must choose well and cross-shard pain.
  • Real systems combine them: each shard is itself replicated. Add the complexity only when one machine genuinely can't cope — YAGNI applies to databases too.
  • And don't run analyticson your app database. Heavy "sum / group over millions of rows" queries belong on a separate OLAP / columnar store — a warehouse or ClickHouse, fed from the app DB — so transactional (OLTP) and analytical workloads never fight for the same machine.
05 · Async & message queues 5 min

Don't make the user wait
for work they don't need to see.

When a request triggers slow work — sending email, encoding video, charging a card — doing it inline ties up the user and couples two services' uptime. A queue lets you accept the work now and do it soon.

A message queue is a durable buffer between a producer and a consumer. The producer enqueues a message and returns immediately; a worker picks it up and processes it later, at its own pace. This decouples the two sides — they no longer have to be fast, available, or even running at the same time.
Synchronous — coupled & slow
async function signup(req) { const u = await db.createUser(req) await email.sendWelcome(u) // 800ms, may fail await billing.provision(u) // 2s, may be down return ok(u) // user waits for ALL of it } // email outage → signup outage
Asynchronous — decoupled & fast
async function signup(req) { const u = await db.createUser(req) await queue.publish("user.created", u) // instant return ok(u) // user is done now } // workers handle email + billing independently // email outage → a delay, not an outage
producer web/API queue enqueue worker 1 worker 2 worker 3 consume at own pace

Add workers to drain the queue faster; the queue absorbs spikes so producers never block.

What a queue buys — and what to watch

  • Buffering & spikes — a traffic surge fills the queue instead of overwhelming the workers; they catch up after.
  • Backpressure — if the queue grows faster than workers drain it, that depth is the signal to add workers, slow producers, or shed load. Watch queue depth as a first-class metric.
  • Retries & dead-letters — failed messages retry, then land in a dead-letter queue for inspection instead of vanishing.
  • At-least-once delivery — a message may arrive twice, so consumers must be idempotent (processing it twice == processing it once).

The tooling landscape — message queues

log · firehose

Apache Kafka

A durable, append-only log of events that many readers can replay.

Pro: enormous throughput and a replayable history — new consumers can re-read old events from the start.

Con: heavier to run and reason about; overkill for a handful of simple background jobs.

broker · routing

RabbitMQ

A classic message broker with flexible routing rules.

Pro: rich routing and per-message acknowledgements with mature tooling — great for task/job queues.

Con: throughput tops out well below Kafka, and you run and scale the broker yourself.

managed · simple

AWS SQS

A fully managed queue — there are no servers to run.

Pro: zero operations, scales automatically, and cheap at low volume.

Con: AWS-only and no replay; the ordered (FIFO) variant trades away some throughput.

How to choose: for plain background jobs, a managed queue (SQS) or RabbitMQ is the least work. Choose Kafka when you have a firehose of events, several independent consumers, or you need to replay history — and you can afford to operate it.

06 · CAP & the core trade-off 5 min

When the network splits,
you pick consistency or availability.

Every technique so far added machines, and machines talk over a network that willdrop messages. The CAP theorem says what you must give up when that happens — and it's not optional.

CAP is an acronym for the three properties a distributed store juggles: Consistency, Availability, Partition-tolerance. The theorem: during a network partition (P), you can keep C or A — not both. Since partitions are a fact of life, the real choice is CP (refuse to answer rather than answer wrongly) vs AP (answer, possibly with stale data).
C
Consistency
every read sees the latest write
A
Availability
every request gets an answer
P
Partition-tolerance
survives dropped messages
Node A x = 5 got the write Node B x = 4 stale PARTITION messages dropped CP: refuse · AP: answer 5 CP: refuse · AP: answer 4

B can't hear A's write. CP refuses to answer; AP answers with the stale 4.

The trade-off table

  • CP — choose consistency. On a partition, the minority side refuses requests rather than serve stale data. Use for money, inventory, bookings — a wrong answer is worse than no answer.
  • AP — choose availability. Always answer, reconcile later. Use for feeds, likes, presence, carts — a slightly stale answer is fine.
  • Eventual consistency — the AP promise: stop writing and all replicas converge to the same value. Not wrong, just not instant.

"CA" only exists when there are no partitions — i.e. a single node. The moment you distribute, P is mandatory, so every real distributed system is CP or AP. Modern stores even let you pick per-operation (strong reads here, fast reads there).

07 · Observability & monitoring 5 min

You designed for failure.
Now you have to see it.

A distributed system fails in pieces, quietly, at 3am. If you can't tell what broke and where from the outside, every incident is a guessing game. Observability is the instrument panel for everything we just built.

Monitoring answers "is a known thing broken?" — dashboards and alerts you set up in advance. Observability is being able to ask new questions of a running system without shipping new code, and it rests on three streams of telemetry: metrics, logs, and traces.
pillar · metrics

Metrics

Cheap time-series numbers — request rate, error %, p99 latency, queue depth, CPU. You aggregate, trend and alert on them. Prometheus + Grafana is the canonical stack.

pillar · logs

Logs

Structured event records — one line per thing that happened, with enough context to search after the fact. Loki, the ELK stack, or a managed log store.

pillar · traces

Traces

One request's full path across services, timed per hop — the only way to find which service ate the latency. OpenTelemetry → Jaeger, or an APM like Dynatrace.

services your fleet collector OTel agent metrics logs traces dashboards alert → page uptime probe synthetic check

Services emit three streams to a collector that feeds dashboards and alerts; an external probe checks uptime from outside the fleet.

Alert on symptoms — the four golden signals

  • Latency — how long requests take (watch p99, and split success vs error latency).
  • Traffic — demand on the system, e.g. requests per second.
  • Errors — the rate of failed requests.
  • Saturation — how full the system is (CPU, memory, queue depth).
// page on a user-facing symptom, tied to an SLO alert HighErrorRate { when errors / requests > 1% // golden signal for 5m // don't flap on a blip then page("on-call") // burning the error budget }

Like a car dashboard: speed, fuel, engine light — symptoms you act on, not the thousand sensors behind them. Page on symptoms, not causes like "CPU 80%".

The tooling landscape

dashboards · metrics

Grafana

The open dashboard layer. Point it at Prometheus (metrics), Loki (logs) or traces and build the one screen on-call stares at — the golden signals, SLO burn, queue depth — and wire alert rules off it.

APM · traces

Dynatrace

Full-stack, mostly-automatic observability: drop an agent in, it auto-discovers services and stitches distributed traces with AI-assisted root-cause. The "buy, don't build" end — Datadog and New Relic are peers.

uptime · synthetic

Uptime monitoring

Checks from outsideyour network — hit a health endpoint every minute, alert when it's down or slow, and drive a public status page. Pingdom, UptimeRobot, Better Stack. Answers the one thing users care about: is it up?

08 · Worked example & recap 5 min

Put it together:
design a URL shortener.

A deceptively small problem that touches every layer we covered — POST a long URL, get a short code back; GET /code redirects. Read-heavy, latency-sensitive, must never lose a mapping.

Step 1 · the core

Generate a code, store the mapping

Take a unique 64-bit id (from a counter or id service) and base62-encode it into a short string. Store code → longUrl in the database — the source of truth.

function shorten(longUrl) { const id = ids.next() // unique 64-bit const code = base62(id) // "3Bk9" — short & unique db.put(code, longUrl) return `short.ly/${code}` }
Step 2 · the read path

Redirects are 100:1 reads — cache them

Lookups vastly outnumber creates. Front the DB with a cache-aside layer on code; mappings are immutable, so cached entries never go stale. Serve the redirect from memory.

async function resolve(code) { return await cache.get(code) ?? await db.get(code) // miss → backfill } // immutable mapping → no invalidation problem

How each part maps to the talk

  • Scaling — stateless resolver nodes behind a load balancer; add nodes as traffic grows.
  • Caching — a cache on code turns most redirects into memory hits; immutability makes it trivial.
  • Databases — replicas absorb the read flood; shard by codeonce one node can't hold the keyspace.
  • Async — push click counting onto a queue; analytics must never slow the redirect.
  • CAP — resolves are happily AP (a momentarily missing brand-new link is fine); the id generator stays CP so codes are never reused.

Read the whole flow

client GET balancer resolver cache DB replica queue · clicks

Hit the cache (green) and return; miss falls to a replica; clicks go async to a queue.

Six rules to walk out with

1Design for growth and failure. Optimize the tail latency and assume something is always down.
2Scale out, stay stateless.Many small boxes behind a balancer beat one big box you can't replace.
3Cache the reads, mind invalidation. The fastest query is the one you never run — but a stale cache is a bug.
4Replicate for reads, shard for writes, and decouple slow work behind a queue.
5Know your CAP choice. CP for money, AP for feeds. There is no free lunch — only the right trade.
6You can't fix what you can't see. Instrument the golden signals, trace across services, and alert on user-facing symptoms — Grafana, an APM, an uptime check.

Keep going

  • Designing Data-Intensive Applications — Martin Kleppmann (the canonical reference)
  • The System Design Primer — open-source on GitHub, free
  • Site Reliability Engineering — Google (SLOs, error budgets)
  • High Scalability — real architecture case studies

One sentence to remember

"There is no scaling, only trade-offs — name the one you're making."

— the whole talk, compressed

Knowledge check

Did the trade-offs land?

Five quick questions on scaling, caching, databases, and the CAP choice — instant feedback, no sign-in.

Rate this deck
be the first

Navigate with ← → or scroll · Part 2: Advanced System Design → · back to library