A 43-minute working session on the moving parts of large systems — how to scale them out, where to cache, how databases survive growth, why we lean on queues, the trade-offs the CAP theorem forces, and how to see the whole thing once it's running in production.
One box serving a thousand users is easy. The hard part starts when traffic, data and the number of failing machines all climb at once. Every decision ahead trades one resource for another — there is no free win, only the right trade for your numbers.
traffic can grow two orders of magnitude in a year — design must bend, not snap.
p99= the slowest 1 request in 100. The tail, not the average, is what users actually feel — so that's what you optimize.
no machine has 100% uptime. At scale, something is always failing — plan for it.
capacity costs money. The cheapest design that meets the SLO wins, not the most powerful.
Almost every technique ahead is, underneath, the same move: spend memory, money or consistency to buy back latency, throughput or availability. A cache, for instance, spends extra memory (and risks slightly out-of-date data) to make reads far faster.
There are only two ways to handle more load: make the machine bigger, or add more machines. The first is simple and has a ceiling; the second is unbounded but forces you to confront distribution.
The balancer health-checks the fleet and routes around the dead node — clients never notice.
A load balancer is the traffic cop in front of the fleet, deciding which node each request goes to. You can run your own, or let your cloud run one for you.
A web server that doubles as a reverse proxy and balancer.
Pro: fast and battle-tested, and the same box can also serve static files and handle HTTPS.
Con:the richest routing and live-reload features sit behind the paid "Plus" tier.
A proxy built for one job: balancing traffic, nothing else.
Pro: the deepest health-check and load-algorithm controls, and rock-solid at very high request rates.
Con: proxying only — no static files — and the config has a learning curve.
Your cloud's own balancer (AWS ELB/ALB, GCP, Azure).
Pro:fully managed — it auto-scales, heals itself, and there's nothing for you to patch.
Con: less fine-grained control, a per-hour plus per-GB bill, and it ties you to that one cloud.
How to choose: on a cloud, start with the managed load balancer — it scales itself and you never patch it. Drop down to nginx or HAProxy when you need custom routing rules or want to stay portable across providers.
Scaling out means renting machines somewhere. The three large clouds all rent compute, managed databases, queues and load balancers — they mostly differ in maturity, price, and which extras they do best.
Amazon Web Services — the oldest and largest.
Pro: the widest catalog of services and the biggest hiring pool — most tutorials assume it.
Con:so many overlapping options that it's easy to get lost, and the bill is hard to predict.
Google Cloud — strong on data and Kubernetes (which Google created).
Pro: excellent analytics, machine-learning and container tooling, with clean developer ergonomics.
Con: a smaller catalog and ecosystem, and a few products have been retired over the years.
Microsoft's cloud — the default if you already live in the Microsoft world.
Pro: deep ties to Windows, Office and Active Directory make it the easy pick for many enterprises.
Con:the console and naming can feel inconsistent, and some services trail AWS's.
How to choose:on the basics they're close to a tie, so pick the one your team already knows or where your other tools live. Lean GCP for heavy data/ML work, Azure if you're a Microsoft shop, and AWS when you want the broadest menu and the most hiring options.
A cache keeps a copy of expensive results close to where they're needed, so most reads skip the slow path entirely. It is the single highest-leverage tool for latency — and the easiest to get subtly wrong.
Static assets and public responses cached at the edge, near the user. Cheapest possible hit — the request never reaches your servers.
Hot objects and query results in a shared cache (Redis, Memcached) between app and DB. The workhorse layer most people mean by "a cache".
The DB's own buffer pool and materialized views. Useful, but you don't control it — and it doesn't cut network hops.
Cache-aside: ask the cache first; on a miss, read the DB and write the answer back with a TTL.
Like keeping the files you use daily on your desk, not walking to the archive room each time.
A cache holds a copy; when the source changes, the copy is a lie. On write, delete (or update) the keyso the next read re-fetches. "There are only two hard things… one is cache invalidation."
A safety net: every entry expires after N seconds even if you forget to invalidate. Short TTL = fresher but more misses; long TTL = faster but staler. Tune per data.
When a popular key expires, a thousand requests all miss at the same instant and stampede the database. Defend with a short lock, by letting just one request refetch while the others wait ("coalescing"), or by adding a little randomness to the TTL ("jitter") so keys don't all expire together.
An in-memory store that keeps data in RAM for microsecond reads.
Pro: far more than a cache — lists, sets, counters, pub/sub, and an option to save to disk so it survives a restart.
Con: a single-threaded core and more moving parts to run; one giant instance can itself become the bottleneck.
The bare-bones option: a fast key → value string cache.
Pro:dead simple and multi-threaded — very fast for plain "remember this string" caching.
Con: strings only, no persistence, no replication — lose the node and you lose the whole cache.
Edge caches (Cloudflare, Fastly, CloudFront) and hosted Redis (ElastiCache, Upstash).
Pro: someone else runs it — caching right next to the user (CDN) or no servers to patch (managed Redis).
Con: a monthly bill and less control; a CDN only helps for public, cacheable responses.
How to choose: reach for Redis by default — its data types and optional persistence pay off the moment you need more than plain lookups. Pick Memcached only when you want the simplest possible string cache, and push static, public content out to a CDN so it never touches your servers at all.
The database is usually the first thing to buckle, because it holds state and state is hard to spread. Two moves carry you a long way: replication for read load and availability, and partitioning for write load and raw size.
Writes go to the primary; they stream to replicas that serve reads. You multiply read capacity and gain a hot standby — if the primary dies, a replica is promoted (failover).
Choose a shard key (e.g. user_id); a hash or range of it decides which node owns each row. Each shard holds a slice of the data and takes its own writes — so write throughput and total size scale with the number of shards.
Strong consistency, transactions, flexible ad-hoc queries and joins. The right default — most apps never outgrow a well-tuned, replicated SQL database.
Built to shard horizontally for huge write volume and simple, known access patterns. You trade joins and (often) strict consistency for raw scale.
Rule of thumb: start relational. Reach for NoSQL when a specific access pattern at a specific scale demands it — not by default.
When a request triggers slow work — sending email, encoding video, charging a card — doing it inline ties up the user and couples two services' uptime. A queue lets you accept the work now and do it soon.
Add workers to drain the queue faster; the queue absorbs spikes so producers never block.
A durable, append-only log of events that many readers can replay.
Pro: enormous throughput and a replayable history — new consumers can re-read old events from the start.
Con: heavier to run and reason about; overkill for a handful of simple background jobs.
A classic message broker with flexible routing rules.
Pro: rich routing and per-message acknowledgements with mature tooling — great for task/job queues.
Con: throughput tops out well below Kafka, and you run and scale the broker yourself.
A fully managed queue — there are no servers to run.
Pro: zero operations, scales automatically, and cheap at low volume.
Con: AWS-only and no replay; the ordered (FIFO) variant trades away some throughput.
How to choose: for plain background jobs, a managed queue (SQS) or RabbitMQ is the least work. Choose Kafka when you have a firehose of events, several independent consumers, or you need to replay history — and you can afford to operate it.
Every technique so far added machines, and machines talk over a network that willdrop messages. The CAP theorem says what you must give up when that happens — and it's not optional.
B can't hear A's write. CP refuses to answer; AP answers with the stale 4.
"CA" only exists when there are no partitions — i.e. a single node. The moment you distribute, P is mandatory, so every real distributed system is CP or AP. Modern stores even let you pick per-operation (strong reads here, fast reads there).
A distributed system fails in pieces, quietly, at 3am. If you can't tell what broke and where from the outside, every incident is a guessing game. Observability is the instrument panel for everything we just built.
Cheap time-series numbers — request rate, error %, p99 latency, queue depth, CPU. You aggregate, trend and alert on them. Prometheus + Grafana is the canonical stack.
Structured event records — one line per thing that happened, with enough context to search after the fact. Loki, the ELK stack, or a managed log store.
One request's full path across services, timed per hop — the only way to find which service ate the latency. OpenTelemetry → Jaeger, or an APM like Dynatrace.
Services emit three streams to a collector that feeds dashboards and alerts; an external probe checks uptime from outside the fleet.
Like a car dashboard: speed, fuel, engine light — symptoms you act on, not the thousand sensors behind them. Page on symptoms, not causes like "CPU 80%".
The open dashboard layer. Point it at Prometheus (metrics), Loki (logs) or traces and build the one screen on-call stares at — the golden signals, SLO burn, queue depth — and wire alert rules off it.
Full-stack, mostly-automatic observability: drop an agent in, it auto-discovers services and stitches distributed traces with AI-assisted root-cause. The "buy, don't build" end — Datadog and New Relic are peers.
Checks from outsideyour network — hit a health endpoint every minute, alert when it's down or slow, and drive a public status page. Pingdom, UptimeRobot, Better Stack. Answers the one thing users care about: is it up?
A deceptively small problem that touches every layer we covered — POST a long URL, get a short code back; GET /code redirects. Read-heavy, latency-sensitive, must never lose a mapping.
Take a unique 64-bit id (from a counter or id service) and base62-encode it into a short string. Store code → longUrl in the database — the source of truth.
Lookups vastly outnumber creates. Front the DB with a cache-aside layer on code; mappings are immutable, so cached entries never go stale. Serve the redirect from memory.
code turns most redirects into memory hits; immutability makes it trivial.codeonce one node can't hold the keyspace.Hit the cache (green) and return; miss falls to a replica; clicks go async to a queue.
"There is no scaling, only trade-offs — name the one you're making."
— the whole talk, compressed
Five quick questions on scaling, caching, databases, and the CAP choice — instant feedback, no sign-in.
Navigate with ← → or scroll · Part 2: Advanced System Design → · back to library