A 36-minute working session on knowing what your software is doing right now — the three pillars (metrics, logs, traces), the golden signals, distributed tracing, alerting on SLOs, and the tools that tie it together.
Once your code leaves your laptop and runs for real users, it becomes a black box. A request is slow, but where? A checkout fails, but why, and for whom? Observability is how you turn that black box into something you can ask questions of, ideally before a customer is the one telling you it's broken.
Monitoring covers the questions you wrote down; observability lets you ask the ones you didn't.
Almost every observability tool is built on three kinds of telemetry. The rest of this session is one pillar at a time.
Metrics: cheap, aggregated measurements such as requests per second, error rate, and latency. Great for dashboards and alerts; they tell you that something changed.
Logs: timestamped records of individual things that happened. Rich detail you query after the fact; they tell you what happened in one case.
Traces: the full path of a single request as it hops between services. They tell you where the time went and which hop failed.
Like a hospital chart — metrics are the vital-sign monitors, logs are the nurse's notes, and a trace is following one patient through every department.
A metric is the workhorse of monitoring: a single number, sampled again and again, stored as a time series. Because each data point is tiny and pre-aggregated, you can keep millions of them and chart weeks of history without breaking the bank.
service or region). Think http_requests_total ticking up, or memory_used_bytes sampled every 15 seconds.
Counter: a running total, such as requests served or errors seen. You chart its rate of change.
Gauge: a snapshot value, such as temperature, queue depth, or memory in use right now.
Histogram: counts values into ranges; the basis for percentile latency like p95 and p99.
Label: a key/value tag (route, status) you can filter and group by. Keep label values low-cardinality.
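To make the metric types above concrete, here is a minimal sketch in plain Python. It is not the API of any real metrics library; the class names, the bucket bounds, and the sample values are all assumptions chosen for illustration.

```python
import bisect
from collections import defaultdict


class Counter:
    """Running total that only goes up; one series per label combination."""

    def __init__(self):
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters never decrease")
        # Sorting the labels makes the series key independent of argument order.
        self.values[tuple(sorted(labels.items()))] += amount


class Gauge:
    """Snapshot value that can move up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into ranges defined by upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per bound, plus a final overflow slot.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1


# Hypothetical usage: label values stay low-cardinality (route, status).
http_requests_total = Counter()
http_requests_total.inc(route="/checkout", status="200")
http_requests_total.inc(route="/checkout", status="200")
http_requests_total.inc(route="/checkout", status="500")

queue_depth = Gauge()
queue_depth.set(7)

latency_seconds = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency_seconds.observe(seconds)
```

Each distinct label combination becomes its own series in `values`, which is why a high-cardinality label such as a user id would multiply the number of series you store.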
From Google's SRE practice: if you can only watch four things on a user-facing service, watch these. They catch almost every user-visible problem.
Latency: how long requests take. Track percentiles (p95/p99), not the average, and separate slow successes from slow errors.
Traffic: how much demand, in requests per second or transactions per minute. It gives every other signal its context.
Errors: the rate of failing requests, both explicit (HTTP 500s) and implicit (wrong answer, 200 with a broken body).
Saturation: how full the system is, measured by CPU, memory, or queue depth. The signal that warns you before the others blow up.
The average looks calm; the p95 tells the truth. Always chart the tail.
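A small worked example shows why the tail matters. The latencies below are invented, and the nearest-rank percentile is only one of several common definitions, so treat this as a sketch of the idea.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 18 fast requests and 2 very slow ones.
latencies_ms = [100] * 18 + [3000] * 2

mean_ms = statistics.mean(latencies_ms)  # 390: looks tolerable
p50_ms = percentile(latencies_ms, 50)    # 100: the typical request is fast
p95_ms = percentile(latencies_ms, 95)    # 3000: one user in ten waits three seconds
```

The mean blends the two populations into a number that no request actually experienced, while the p95 reports what the unlucky users saw.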
Tip: for request services, RED (Rate, Errors, Duration) restates the golden signals minus saturation; for resources, there is USE (Utilization, Saturation, Errors).
When a metric tells you errors spiked, logs tell you the story of each one: which user, which input, which line threw. A log is a timestamped record of a single thing that happened — and the single biggest upgrade you can make is to log it as structured data, not prose.
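Structured logging can be sketched with Python's standard `logging` module. The field names (`user_id`, `status`, `route`), the logger name, and the `fields` key passed through `extra` are assumptions made for this example, not a fixed convention.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line instead of prose."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Structured fields arrive via logging's `extra` argument (hypothetical key name).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


# An in-memory stream stands in for the real log shipper here.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "payment failed",
    extra={"fields": {"user_id": "u_123", "status": 500, "route": "/checkout"}},
)

# Because the line is data, it can be filtered by field rather than grepped.
parsed = json.loads(stream.getvalue())
```

With every line in this shape, a log store can filter on `user_id` or group by `status` directly.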
user_id, group by status, and chart counts, instead of grepping for a phrase and hoping.
Each service ships its lines to one searchable store, so you query across the whole system at once.
error for act-now, warn for suspicious, info for milestones, debug for the firehose.
In a microservice system one click can touch a dozen services. A metric says "checkout is slow"; logs are scattered across those dozen boxes. A trace stitches them back into a single timeline, so you can see exactly which hop ate the time or threw the error.
trace_id.span_id) to the next service in a request header; this is context propagation.
trace_id.
The waterfall makes the slow, failing payment span obvious at a glance.
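A trace can be modelled as a list of spans with start and end times. The sketch below uses invented span names and timings, identifies spans by name rather than by a real span id, and assumes sibling spans do not overlap. It computes each span's self time and draws a rough text waterfall.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str                     # stands in for a real span_id in this sketch
    start_ms: int
    end_ms: int
    parent: Optional[str] = None  # name of the parent span, None for the root
    error: bool = False


def self_time_ms(span, spans):
    """Duration of a span minus the time covered by its direct children."""
    children = [s for s in spans if s.parent == span.name]
    return (span.end_ms - span.start_ms) - sum(c.end_ms - c.start_ms for c in children)


def waterfall(spans, width=40):
    """Render spans as offset bars; '!' marks a span that recorded an error."""
    t0 = min(s.start_ms for s in spans)
    total = max(s.end_ms for s in spans) - t0
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = (s.start_ms - t0) * width // total
        length = max(1, (s.end_ms - s.start_ms) * width // total)
        mark = "!" if s.error else "="
        lines.append(f"{s.name:<10}|{' ' * offset}{mark * length}")
    return "\n".join(lines)


# Hypothetical checkout trace: the payment hop is both slow and failing.
trace = [
    Span("gateway", 0, 1000),
    Span("cart", 20, 80, parent="gateway"),
    Span("payment", 100, 980, parent="gateway", error=True),
    Span("email", 985, 995, parent="gateway"),
]

slowest = max(trace, key=lambda s: self_time_ms(s, trace))
print(waterfall(trace))
```

Ranking by self time rather than total duration keeps the root span from always winning, since the root's duration includes every hop beneath it.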
The traceparent header carries the id downstream, so every service's spans land in the same trace.
Like a parcel's tracking number — one id follows the package through every depot, so you can see exactly where it got stuck.
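The tracking-number idea can be sketched with the W3C Trace Context `traceparent` format: a version, a 32-hex-digit trace id, a 16-hex-digit parent span id, and 2 hex digits of flags, joined by dashes. The helper names below are hypothetical, and a real service would normally let its tracing SDK do this rather than hand-rolling it.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: random trace id, random span id, version 00."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Header for the next hop: same trace id, fresh span id for this service."""
    ctx = parse_traceparent(incoming)
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"


# Example header value taken from the W3C Trace Context specification.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The trace id is the part that never changes from hop to hop, which is what lets a backend collect every service's spans into one trace.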
Telemetry is only useful if the right alert reaches the right person at the right time. The trap is alerting on everything — a wall of pages nobody reads. The fix is to alert on what users actually feel, and to define "good enough" with an SLO.
100% − SLO: the amount you're allowed to fail before you stop shipping features and fix reliability.
The budget reframes reliability as a spendable resource, and the burn rate tells you how soon it runs out.
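The budget arithmetic is short enough to work through. The 99.9% SLO, the 30-day window, and the 1% observed error rate below are assumed numbers for illustration, and the burn-rate definition used (observed error rate divided by the budget) is the common one rather than something this session spells out.

```python
def error_budget(slo):
    """Fraction of requests (or time) allowed to fail: 100% minus the SLO."""
    return 1.0 - slo


def allowed_bad_minutes(slo, window_days):
    """Error budget expressed as minutes of full outage over the window."""
    return error_budget(slo) * window_days * 24 * 60


def burn_rate(observed_error_rate, slo):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return observed_error_rate / error_budget(slo)


def days_until_exhausted(observed_error_rate, slo, window_days):
    """Days a full budget lasts if the current error rate continues."""
    return window_days / burn_rate(observed_error_rate, slo)


# Assumed scenario: 99.9% SLO over 30 days, currently failing 1% of requests.
budget_minutes = allowed_bad_minutes(0.999, 30)      # about 43.2 minutes
current_burn = burn_rate(0.01, 0.999)                # about 10x
days_left = days_until_exhausted(0.01, 0.999, 30)    # about 3 days
```

A burn rate of 1 spends the budget exactly over the window; anything well above 1 is a candidate for a page, because the budget will be gone long before the window ends.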
Severity decides the channel: page for urgent, chat for warnings, ticket for the rest.
You rarely build this from scratch. The market splits into the vendor-neutral way to collect telemetry, best-of-breed tools per pillar, and all-in-one platforms. Each tab below names the leading real options with a one-line pro and con.
Pro: one vendor-neutral API + SDKs + Collector for all three pillars — no lock-in.
Con: a moving target; logs are the least mature pillar and setup has real surface area.
Pro: receive, batch, filter, and re-route telemetry to any backend from one place.
Con: another component to run, size, and keep alive — it can become a bottleneck.
Pro: the de-facto open standard; powerful PromQL, pull-based, huge ecosystem.
Con: a single node offers neither long-term storage nor high availability on its own; you add Thanos/Mimir for scale.
Pro: gorgeous, source-agnostic dashboards over Prometheus, Loki, Tempo, and more.
Con: visualization only — it stores nothing; dashboard sprawl needs discipline.
Pro: cheap — indexes labels, not full text; pairs naturally with Prometheus/Grafana.
Con: full-text search is weaker; you must label well up front.
Pro: Elasticsearch + Kibana — powerful full-text search and mature analytics.
Con: storage-hungry and operationally heavy to run and tune at scale.
Pro: CNCF project, battle-tested trace search and waterfall UI; OTel-native.
Con: you operate its own storage backend; less integrated with metrics/logs.
Pro: object-storage cheap; stitches traces ↔ logs ↔ metrics in one Grafana pane.
Con: querying leans on having good metrics/logs to find a trace_id first.
Pro: broadest integrations and the slickest unified UX across all pillars.
Con: cost can spiral fast and unpredictably as data volume grows.
Pro: strong auto-instrumentation and automated root-cause for big estates.
Con: heavyweight and priciest; overkill for small teams.
Pro: managed Prometheus/Loki/Tempo — open standards, no lock-in, generous free tier.
Con: more assembly than a single polished pane; you wire the pieces.
Pro: all-in-one APM with simple per-user + per-GB pricing.
Con: ingest-based billing still surprises at high volume; UI is busy.
Pro: modern uptime monitoring, status pages, and incident paging in one tidy product.
Con: focused on synthetic checks — not a full metrics/traces backend.
Pro: simple, long-established global uptime and page-speed checks.
Con: narrow scope; you still need a real observability stack behind it.
Here's how a real debugging session flows — each pillar handing off to the next.
/checkout, only in the EU region, starting right after a deploy.
payment span is red and slow. Now you know where.
trace_id to its logs: payment_timeout against a new gateway endpoint. Roll back. Done.
"If you have to ship code to answer the question, it wasn't observable."
— the working definition