GUIDEDECK· for running systems you can't see inside

Observability
& Monitoring
for systems in production.

A 36-minute working session on knowing what your software is doing right now — the three pillars (metrics, logs, traces), the golden signals, distributed tracing, alerting on SLOs, and the tools that tie it together.

~36 MIN · MIXED TEAM · VENDOR-AGNOSTIC
01 · Why observability 4 min

You can only fix
what you can see.

Once your code leaves your laptop and runs for real users, it becomes a black box. A request is slow, but where? A checkout fails, but why, and for whom? Observability is how you turn that black box into something you can ask questions of, ideally before a customer is the one telling you it's broken.

Observability: how well you can understand what's happening inside a system from the data it emits. A system is observable when you can answer new questions about it, including ones you never thought to ask in advance, without shipping new code to go look.

Monitoring vs. observability

  • Monitoring watches the things you already knew to watch — CPU, error rate, disk space — and tells you when a known thing breaks. It answers questions you wrote down in advance.
  • Observability is the broader capability of asking new questions after the fact. Monitoring is one use of it: the dashboards and alerts are the questions you chose to pre-bake.
  • Rough rule: monitoring tells you that something is wrong; observability helps you find why.
[Figure: Monitoring covers known questions (is CPU > 90%? is error rate high? is the disk full?). Observability covers unknown questions (why are EU users on v2.3 slow only at checkout, after 6pm?). Explore, don't pre-bake.]

Monitoring covers the questions you wrote down; observability lets you ask the ones you didn't.

The three pillars

Almost every observability tool is built on three kinds of telemetry. The rest of this session is one pillar at a time.

Metrics

Numbers over time

Cheap, aggregated measurements — requests per second, error rate, latency. Great for dashboards and alerts; they tell you that something changed.

Logs

Events you search

Timestamped records of individual things that happened. Rich detail you query after the fact; they tell you what happened in one case.

Traces

One request's journey

The full path of a single request as it hops between services. They tell you where the time went and which hop failed.

Like a hospital chart — metrics are the vital-sign monitors, logs are the nurse's notes, and a trace is following one patient through every department.

02 · Metrics 6 min

Cheap numbers,
measured over time.

A metric is the workhorse of monitoring: a single number, sampled again and again, stored as a time series. Because each data point is tiny and pre-aggregated, you can keep millions of them and chart weeks of history without breaking the bank.

Metric: a numeric measurement recorded at intervals, forming a time series (a value plus a timestamp, usually tagged with labels like service or region). Think http_requests_total ticking up, or memory_used_bytes sampled every 15 seconds.
counter

Only goes up

A running total — requests served, errors seen. You chart its rate of change.

gauge

Goes up and down

A snapshot value — temperature, queue depth, memory in use right now.

histogram

Buckets a spread

Counts values into ranges — the basis for percentile latency like p95 and p99.

label

Splits a series

A key/value tag (route, status) you can filter and group by. Keep label values low-cardinality.
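
Why low cardinality matters: each distinct combination of label values becomes its own time series, so the worst-case series count for a metric is the product of the distinct values per label. The sketch below is only an illustration; the label names and value lists are hypothetical.

```javascript
// Worst-case number of time series for one metric name:
// the product of the number of distinct values each label can take.
function worstCaseSeries(labelValues) {
  return Object.values(labelValues).reduce(
    (product, values) => product * new Set(values).size,
    1
  );
}

// Hypothetical label sets, for illustration only.
const bounded = {
  route: ["/checkout", "/cart", "/search"],
  status: ["200", "404", "500"],
};

// Adding a per-user label multiplies every existing series by the user count.
const userIds = Array.from({ length: 10000 }, (_, i) => String(i));
const unbounded = { ...bounded, user_id: userIds };

console.log(worstCaseSeries(bounded));   // 9
console.log(worstCaseSeries(unbounded)); // 90000
```

Identifiers such as user ids or request ids usually belong in logs or span attributes rather than metric labels.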

The four golden signals

From Google's SRE practice: if you can only watch four things on a user-facing service, watch these. They catch almost every user-visible problem.

Latency

How long requests take. Track percentiles (p95/p99), not the average — and separate slow successes from slow errors.

Traffic

How much demand — requests per second, transactions per minute. It gives every other signal its context.

Errors

The rate of failing requests — explicit (HTTP 500s) and implicit (wrong answer, 200 with a broken body).

Saturation

How full the system is — CPU, memory, queue depth. The signal that warns you before the others blow up.

    # A counter, split by route and status label
    http_requests_total{route="/checkout", status="500"} 42

    # Errors as a fraction of all traffic, last 5 min
    # (sum both sides so the division is service-wide, not series-by-series)
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))

    # p95 latency from a histogram, aggregated across series by bucket
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
[Chart: latency (ms) over time, with p95 and average lines and the deploy marked.]

The average looks calm; the p95 tells the truth. Always chart the tail.
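
To make the gap between average and tail concrete, here is a small sketch using the nearest-rank percentile method. The latency samples are made up for illustration, and real metric backends usually estimate percentiles from histogram buckets rather than raw samples.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Hypothetical latencies in ms: 90 fast requests and 10 slow ones.
const latencies = [
  ...Array(90).fill(100),
  ...Array(10).fill(2000),
];

console.log(mean(latencies));           // 290
console.log(percentile(latencies, 50)); // 100
console.log(percentile(latencies, 95)); // 2000
```

The mean of 290 ms describes no actual request here: most took 100 ms, and one in ten took 2 seconds, which only the p95 reveals.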

Tip: for request-serving services, RED (Rate, Errors, Duration) restates the golden signals minus saturation; for resources, reach for USE (Utilization, Saturation, Errors).

03 · Logs 5 min

Events you can
search after the fact.

When a metric tells you errors spiked, logs tell you the story of each one: which user, which input, which line threw. A log is a timestamped record of a single thing that happened — and the single biggest upgrade you can make is to log it as structured data, not prose.

Structured log: a log line written as machine-readable key/value fields (usually JSON) rather than a free-text sentence. Structure is what lets you filter by user_id, group by status, and chart counts, instead of grepping for a phrase and hoping.

Unstructured: hard to query

    # a sentence: fine for a human, painful at scale
    User 8842 checkout failed after 3 retries: timeout
    User 8843 checkout ok in 240ms
    # to count failures you must parse English

Structured: queryable

    {
      "ts": "2026-06-27T18:04:11Z",
      "level": "error",
      "event": "checkout_failed",
      "user_id": 8842,
      "retries": 3,
      "reason": "payment_timeout",
      "trace_id": "a1b2c3"
    }

[Diagram: api, worker, and db each ship their log lines to one log store, which you query.]

Each service ships its lines to one searchable store — so you query across the whole system at once.
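
As a sketch of what "queryable" buys you, the snippet below filters and groups JSON log lines by field. The log lines are hypothetical and follow the structured shape used in this deck; a real log store would run the equivalent query server-side.

```javascript
// Hypothetical JSON log lines, one event per line.
const lines = [
  '{"ts":"2026-06-27T18:04:11Z","level":"error","event":"checkout_failed","user_id":8842,"reason":"payment_timeout","trace_id":"a1b2c3"}',
  '{"ts":"2026-06-27T18:04:12Z","level":"info","event":"checkout_ok","user_id":8843,"duration_ms":240,"trace_id":"d4e5f6"}',
  '{"ts":"2026-06-27T18:04:15Z","level":"error","event":"checkout_failed","user_id":8850,"reason":"payment_timeout","trace_id":"0a1b2c"}',
  '{"ts":"2026-06-27T18:04:19Z","level":"error","event":"checkout_failed","user_id":8861,"reason":"card_declined","trace_id":"3d4e5f"}',
];

// Count events per distinct value of one field.
function countBy(events, field) {
  const counts = {};
  for (const e of events) {
    counts[e[field]] = (counts[e[field]] ?? 0) + 1;
  }
  return counts;
}

// Filter on one field, then group and count on another: no text parsing needed.
const events = lines.map((line) => JSON.parse(line));
const failures = events.filter((e) => e.event === "checkout_failed");
const failuresByReason = countBy(failures, "reason");

console.log(failuresByReason); // { payment_timeout: 2, card_declined: 1 }
```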

Logging that pays off

  • Use levels honestly: error for act-now, warn for suspicious, info for milestones, debug for the firehose.
  • Add a request/trace id to every line so you can stitch a request's logs together and jump to its trace (Part 4).
  • Never log secrets or PII — passwords, tokens, full card numbers. Logs are widely readable and long-lived.
  • Mind the cost — logs are the priciest pillar. Sample the noisy paths; keep the errors.
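
The habits above can be sketched as a tiny log formatter: it enforces a known level, attaches the trace id, and redacts sensitive fields before anything is written. The function name and the list of sensitive keys are assumptions for illustration, not part of any particular logging library.

```javascript
// Field names treated as sensitive here are an assumption; adjust to your own schema.
const SENSITIVE_KEYS = new Set(["password", "token", "card_number"]);
const LEVELS = ["debug", "info", "warn", "error"];

// Build one structured log line: level, event name, trace id, then caller fields.
function formatLog(level, event, traceId, fields = {}) {
  if (!LEVELS.includes(level)) {
    throw new Error(`unknown level: ${level}`);
  }
  const safeFields = {};
  for (const [key, value] of Object.entries(fields)) {
    safeFields[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : value;
  }
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    event,
    trace_id: traceId,
    ...safeFields,
  });
}

const line = formatLog("error", "checkout_failed", "a1b2c3", {
  user_id: 8842,
  reason: "payment_timeout",
  card_number: "4111111111111111", // a test card number; redacted in the output
});
console.log(line);
```

Redacting by key name is a coarse safety net; it does not catch secrets embedded in free-text values, so the first defence is still not passing them to the logger.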
04 · Traces 6 min

Follow one request
across every service.

In a microservice system one click can touch a dozen services. A metric says "checkout is slow"; logs are scattered across those dozen boxes. A trace stitches them back into a single timeline, so you can see exactly which hop ate the time or threw the error.

Trace: the end-to-end record of one request as it moves through a system, made of spans. A span is one unit of work (a single service call or db query) with a start time, a duration, and a parent, so the spans nest into a tree that shows the whole journey. Distributed tracing is doing this across separate services.

How the spans connect

  • The first service starts a root span and mints a trace_id.
  • It passes that id (plus the current span_id) to the next service in a request header — this is context propagation.
  • Each downstream call opens a child span under that parent, so all spans share one trace_id.
  • The backend reassembles them by id into the waterfall shown below.
[Figure: trace waterfall for trace_id a1b2c3, time running left to right. Spans: gateway 320ms, auth 60ms, checkout 240ms, payment 150ms (failed ✕), db 40ms. Each bar is a span, its width is the duration, and nesting shows the parent.]

The waterfall makes the slow, failing payment span obvious at a glance.
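
How a backend might turn a pile of spans into that waterfall can be sketched in a few lines: group spans by parent id, then walk the tree from the root in start-time order. The span durations match the figure, but the parent links, span ids, and start offsets are assumptions made for this illustration.

```javascript
// Hypothetical spans for one trace, as a backend might receive them (out of order).
// Parent links and start offsets are assumptions made for this illustration.
const spans = [
  { spanId: "s4", parentId: "s3", name: "payment", startMs: 90, durationMs: 150, error: true },
  { spanId: "s1", parentId: null, name: "gateway", startMs: 0, durationMs: 320, error: false },
  { spanId: "s3", parentId: "s1", name: "checkout", startMs: 70, durationMs: 240, error: false },
  { spanId: "s2", parentId: "s1", name: "auth", startMs: 5, durationMs: 60, error: false },
  { spanId: "s5", parentId: "s3", name: "db", startMs: 250, durationMs: 40, error: false },
];

// Group spans by parent id, then walk from the root depth-first,
// visiting children in start-time order to get the waterfall rows.
function waterfall(allSpans) {
  const childrenOf = new Map();
  for (const span of allSpans) {
    const siblings = childrenOf.get(span.parentId) ?? [];
    siblings.push(span);
    childrenOf.set(span.parentId, siblings);
  }
  const rows = [];
  const visit = (span, depth) => {
    const mark = span.error ? " ✕" : "";
    rows.push(`${"  ".repeat(depth)}${span.name} ${span.durationMs}ms${mark}`);
    const children = (childrenOf.get(span.spanId) ?? []).sort(
      (a, b) => a.startMs - b.startMs
    );
    children.forEach((child) => visit(child, depth + 1));
  };
  (childrenOf.get(null) ?? []).forEach((root) => visit(root, 0));
  return rows;
}

const rows = waterfall(spans);
console.log(rows.join("\n"));
```

Indentation in the printed rows stands in for the nesting of bars; a real UI would also offset each bar horizontally by its start time.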

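    // OpenTelemetry: wrap work in a span
    // (order and gateway come from the surrounding application code)
    import { trace, SpanStatusCode } from "@opentelemetry/api";

    // The tracer name is an arbitrary label for this instrumentation scope
    const tracer = trace.getTracer("checkout");

    await tracer.startActiveSpan("charge_card", async (span) => {
      span.setAttribute("amount", order.total);
      try {
        // With HTTP auto-instrumentation enabled, the outgoing
        // request's header carries the trace_id
        await gateway.charge(order);
      } catch (e) {
        span.recordException(e);
        span.setStatus({ code: SpanStatusCode.ERROR }); // span turns red ✕
        throw e;
      } finally {
        span.end();
      }
    });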
[Diagram: gateway → checkout → payment. The same trace_id (a1b2c3) flows in the traceparent header on each hop.]

The traceparent header carries the id downstream, so every service's spans land in the same trace.
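
The short id a1b2c3 used in this deck is a stand-in. Under the W3C Trace Context format, a traceparent value has four dash-separated hex fields: version, a 32-character trace id, a 16-character parent span id, and flags. The sketch below formats and parses that header by hand to show what propagation does; the helper names are hypothetical, and in practice an OpenTelemetry SDK handles this for you. It skips some validation the spec requires, such as rejecting all-zero ids.

```javascript
// W3C Trace Context header: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header);
  if (!match) return null; // malformed: caller should start a fresh trace
  const [, traceId, spanId, flags] = match;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Downstream service: keep the incoming trace id, mint a new span id,
// and remember the caller's span id as the parent.
function startChildSpan(incomingHeader, newSpanId) {
  const parent = parseTraceparent(incomingHeader);
  if (!parent) return null;
  return {
    traceId: parent.traceId,
    spanId: newSpanId,
    parentId: parent.spanId,
    sampled: parent.sampled,
  };
}

// Example ids are arbitrary hex strings of the right length.
const upstream = {
  traceId: "a1b2c3".padEnd(32, "0"),
  spanId: "1111111111111111",
  sampled: true,
};
const header = formatTraceparent(upstream);
const child = startChildSpan(header, "2222222222222222");

console.log(header);
console.log(child);
```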

Like a parcel's tracking number — one id follows the package through every depot, so you can see exactly where it got stuck.

05 · Alerting & SLOs 6 min

Page on symptoms,
not on causes.

Telemetry is only useful if the right alert reaches the right person at the right time. The trap is alerting on everything — a wall of pages nobody reads. The fix is to alert on what users actually feel, and to define "good enough" with an SLO.

SLI / SLO / error budget — an SLI is a measured indicator of service health (e.g. % of requests under 300ms). An SLO is the target you promise for that SLI (e.g. 99.9% over 30 days). The error budget is the leftover — 100% − SLO — the amount you're allowed to fail before you stop shipping features and fix reliability.
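
The budget arithmetic is simple enough to work through. A minimal sketch, assuming the SLO is measured either in time or in request counts; the 10 million requests figure is a hypothetical traffic volume.

```javascript
// Error budget = 100% - SLO, expressed as allowed bad time or allowed bad requests.
function errorBudget(sloPercent, windowDays, totalRequests) {
  const budgetFraction = (100 - sloPercent) / 100;
  const windowMinutes = windowDays * 24 * 60;
  return {
    budgetFraction,
    allowedBadMinutes: windowMinutes * budgetFraction,
    // Rounded to the nearest request to absorb floating-point error.
    allowedBadRequests: Math.round(totalRequests * budgetFraction),
  };
}

// 99.9% over 30 days, with a hypothetical 10 million requests in the window.
const budget = errorBudget(99.9, 30, 10000000);
console.log(budget.allowedBadMinutes.toFixed(1)); // "43.2" minutes
console.log(budget.allowedBadRequests);           // 10000
```

So a 99.9% target over 30 days leaves roughly 43 minutes of full outage, or about one failed request in every thousand.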

Symptom vs. cause

  • Symptom alert (page): "checkout error rate > 5% for 5 min". A user is hurting now. Wake someone.
  • Cause alert (ticket): "one replica's CPU is high". Maybe fine if users are unaffected. Don't page at 3am for it.
  • Good paging is actionable, urgent, and user-visible. Everything else is a dashboard or a ticket.
  • Burn the error budget too fast → alert and slow down. Budget intact → ship freely.
[Chart: 30-day error budget for a 99.9% SLO, with 61% spent and 39% left; the burn-rate line runs to the point where the budget is exhausted.]

The budget reframes reliability as a spendable resource — and the burn rate tells you how soon it runs out.
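
Burn rate can be sketched as the observed error ratio divided by the budget fraction: a burn rate of 1 spends the whole budget over exactly the SLO window, and anything higher runs out early. The numbers below are hypothetical, apart from the 39% remaining and the 99.9% SLO taken from the chart.

```javascript
// Burn rate: how many times faster than "exactly on budget" errors are arriving.
// A burn rate of 1 uses up the whole budget over exactly the SLO window.
function burnRate(observedErrorRatio, sloPercent) {
  const budgetFraction = (100 - sloPercent) / 100;
  return observedErrorRatio / budgetFraction;
}

// Days until the budget runs out if the current burn rate holds.
function daysUntilExhausted(remainingBudgetFraction, windowDays, rate) {
  return (remainingBudgetFraction * windowDays) / rate;
}

// Hypothetical reading: 0.5% of requests failing against a 99.9% SLO,
// with 39% of a 30-day budget left.
const rate = burnRate(0.005, 99.9);
const daysLeft = daysUntilExhausted(0.39, 30, rate);

console.log(rate.toFixed(1));     // "5.0"
console.log(daysLeft.toFixed(2)); // "2.34"
```

At five times the sustainable rate, the remaining 39% of the budget would last a little over two days, which is the kind of signal worth alerting on.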

    # Page only on a user-visible symptom
    - alert: HighCheckoutErrorRate
      expr: |
        sum(rate(http_requests_total{route="/checkout",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{route="/checkout"}[5m])) > 0.05
      for: 5m   # avoid flapping on a blip
      labels: { severity: page }
      annotations: { summary: "Checkout failing for users" }
[Diagram: a rule fires and is routed by severity: page goes to on-call, warn goes to chat, info becomes a ticket.]

Severity decides the channel: page for urgent, chat for warnings, ticket for the rest.

06 · The tooling 5 min

The observability
stack.

You rarely build this from scratch. The market splits into the vendor-neutral way to collect telemetry, best-of-breed tools per pillar, and all-in-one platforms. Each group below names leading real options with a one-line pro and con.

OpenTelemetry (OTel): the vendor-neutral, open standard for generating and shipping metrics, logs, and traces. You instrument your code once against OTel, then point it at whatever backend you like. It is the safest default because it keeps you from being locked to one vendor's agent.

Instrument & ship telemetry

OpenTelemetry

The standard

Pro: one vendor-neutral API + SDKs + Collector for all three pillars — no lock-in.

Con: a moving target; logs are the least mature pillar and setup has real surface area.

OTel Collector

The pipe in the middle

Pro: receive, batch, filter, and re-route telemetry to any backend from one place.

Con: another component to run, size, and keep alive — it can become a bottleneck.

Time-series & dashboards

Prometheus

Metrics store

Pro: the de-facto open standard; powerful PromQL, pull-based, huge ecosystem.

Con: a single node isn't long-term or HA on its own; you add Thanos/Mimir for scale.

Grafana

Dashboards

Pro: gorgeous, source-agnostic dashboards over Prometheus, Loki, Tempo, and more.

Con: visualization only — it stores nothing; dashboard sprawl needs discipline.

Store & search events

Grafana Loki

Label-indexed logs

Pro: cheap — indexes labels, not full text; pairs naturally with Prometheus/Grafana.

Con: full-text search is weaker; you must label well up front.

ELK / OpenSearch

Full-text logs

Pro: Elasticsearch + Kibana — powerful full-text search and mature analytics.

Con: storage-hungry and operationally heavy to run and tune at scale.

Distributed tracing backends

Jaeger

Mature tracing

Pro: CNCF project, battle-tested trace search and waterfall UI; OTel-native.

Con: you operate its own storage backend; less integrated with metrics/logs.

Grafana Tempo

Cheap trace store

Pro: object-storage cheap; stitches traces ↔ logs ↔ metrics in one Grafana pane.

Con: querying leans on having good metrics/logs to find a trace_id first.

Managed, all-three-pillars platforms

Datadog

The market leader

Pro: broadest integrations and the slickest unified UX across all pillars.

Con: cost can spiral fast and unpredictably as data volume grows.

Dynatrace

Enterprise / AI

Pro: strong auto-instrumentation and automated root-cause for big estates.

Con: heavyweight and typically among the most expensive; overkill for small teams.

Grafana Cloud

Open-stack, hosted

Pro: managed Prometheus/Loki/Tempo — open standards, no lock-in, generous free tier.

Con: more assembly than a single polished pane; you wire the pieces.

New Relic

Usage-priced APM

Pro: all-in-one APM with simple per-user + per-GB pricing.

Con: ingest-based billing still surprises at high volume; UI is busy.

Outside-in checks & on-call

Better Stack

Uptime + on-call

Pro: modern uptime monitoring, status pages, and incident paging in one tidy product.

Con: focused on synthetic checks — not a full metrics/traces backend.

Pingdom

Classic uptime

Pro: simple, long-established global uptime and page-speed checks.

Con: narrow scope; you still need a real observability stack behind it.

How to choose

  • Instrument with OpenTelemetry no matter what. It decouples your code from the backend, so switching vendors later is a config change, not a rewrite.
  • Small team / cost-sensitive: the Grafana stack (Prometheus + Loki + Tempo), self-hosted or Grafana Cloud — open standards, low cost, no lock-in.
  • Want one polished pane and will pay: Datadog or New Relic; Dynatrace for large enterprise estates needing automated root-cause.
  • Always add outside-in uptime (Better Stack / Pingdom). It catches total outages your own stack can't report when it is the thing that's down.
07 · A worked incident & recap 4 min

The three pillars, working together.

Here's how a real debugging session flows — each pillar handing off to the next.

1. Metric alerts. Checkout error rate crosses 5% and a symptom page fires. You know that users are hurting.
2. Dashboards localize it. The golden-signal board shows errors only on /checkout, only in the EU region, starting right after a deploy.
3. A trace finds the hop. Open a failed request's waterfall: the payment span is red and slow. Now you know where.
4. Logs give the why. Jump from the span's trace_id to its logs: payment_timeout against a new gateway endpoint. Roll back. Done.

Five things to walk out with

  • Three pillars: metrics tell you that, traces tell you where, logs tell you why.
  • Watch the golden signals — latency, traffic, errors, saturation — and chart percentiles, not averages.
  • Log structured, with a trace id on every line, and never log secrets.
  • Page on symptoms users feel; define "good enough" with an SLO and spend the error budget.
  • Instrument with OpenTelemetry so the backend stays your choice, not your cage.

One sentence to remember

"If you have to ship code to answer the question, it wasn't observable."

— the working definition
