A 34-minute working session on measuring non-deterministic systems: building a golden dataset, choosing graders, gating regressions in CI, watching production, and closing the loop with guardrails.
A normal function returns the same output for the same input, so a green test means something. An LLM doesn't. Same prompt, same settings — and you can still get three different answers, two great and one subtly wrong. Evals are how you turn "it felt good in the demo" into a number you can defend, track, and gate on.
Non-determinism: one input, several plausible outputs. Evaluate whether the distribution is good, not a single lucky run.
Run each case several times and judge the pass rate, not one sample.
Evals turn vibes into a score between 0 and 1 that you can trend on a chart.
The goal: block a regression in review, not discover it in production.
Models, prompts and data keep changing, so quality needs continuous measurement.
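As a minimal sketch of judging the pass rate rather than one sample: the helper below runs a single case several times and reports the fraction of passing outputs. The names `pass_rate`, `fake_model` and `contains_paris` are hypothetical, and the fake model is only a stand-in that cycles through two good answers and one wrong one.

```python
from itertools import cycle
from typing import Callable


def pass_rate(generate: Callable[[str], str],
              grade: Callable[[str], bool],
              prompt: str,
              n: int = 5) -> float:
    """Sample the same prompt n times and return the fraction of passing outputs."""
    if n <= 0:
        raise ValueError("n must be positive")
    passes = sum(1 for _ in range(n) if grade(generate(prompt)))
    return passes / n


# Hypothetical stand-in for a non-deterministic model: two good answers, one wrong.
_answers = cycle(["Paris", "Paris, France", "Lyon"])


def fake_model(prompt: str) -> str:
    return next(_answers)


def contains_paris(output: str) -> bool:
    return "Paris" in output


rate = pass_rate(fake_model, contains_paris, "What is the capital of France?", n=6)
print(f"pass rate: {rate:.2f}")
```

A single run of this case would report either a clean pass or a clean fail; the rate over six runs exposes the one-in-three failure that a lucky sample hides.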
An eval is only as honest as its cases. A dataset of toy questions you wrote at your desk will pass forever and tell you nothing. The cases that matter are the awkward ones real users actually send — plus every bug you have already fixed, frozen so it can never silently come back.
Each case is input + expectation + metadata. The grader turns the model's output into a score.
Start small and trusted. 100 cases you have actually read beat 10,000 you haven't.
Like a regression test suite — every production bug earns a permanent test so it can never quietly return.
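One possible shape for a case, assuming the dataset lives as JSON Lines in the repository. The field names (`case_id`, `expected`, `metadata`) and the sample cases are illustrative assumptions, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class GoldenCase:
    case_id: str                 # stable id so results can be tracked over time
    input: str                   # what the user sent
    expected: str                # the expectation the grader checks against
    metadata: dict = field(default_factory=dict)  # e.g. origin, tags


def to_jsonl(cases: List[GoldenCase]) -> str:
    """Serialize cases one JSON object per line, which diffs cleanly in review."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def from_jsonl(text: str) -> List[GoldenCase]:
    return [GoldenCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


cases = [
    GoldenCase("refund-001", "Can I get a refund after 45 days?", "no",
               {"source": "production_bug", "tags": ["policy"]}),
    GoldenCase("refund-002", "refund pls order never arrived", "yes",
               {"source": "real_user", "tags": ["policy", "informal"]}),
]
roundtrip = from_jsonl(to_jsonl(cases))
```

Tagging each case with where it came from makes it easy to see how much of the set is frozen production bugs versus hand-written examples.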
There is no single right grader — there is a trade-off between cost, speed, and how much nuance you can capture. Cheap deterministic checks catch the obvious; human review catches what nothing else can. Most good harnesses use several at once.
Push as much grading as you can to the left; reserve the costly methods for what genuinely needs judgment.
Pro — free, instant, perfectly repeatable. Con — brittle on anything free-form. When it wins: always run these first; they catch the cheapest bugs for zero cost.
Pro — cheap, deterministic, captures "roughly right". Con — proxies correlate loosely with real quality. When it wins: semantic similarity is a great guardrail before you spend on a judge.
Pro — handles nuance no regex can. Con — costs tokens, can drift, needs its own calibration. When it wins: pairwise A/B ("which answer is better?") is more reliable than asking for an absolute score.
Pro — the only true ground truth. Con — does not scale to every PR. When it wins: spend it on a sample to anchor the automated graders that run on everything else.
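The trade-offs above suggest running graders in tiers, cheapest first. The sketch below is one way to do that under stated assumptions: the deterministic tier is a required regex, the proxy tier uses character-level similarity from `difflib` as a crude substitute for real semantic similarity, and `judge` is a hypothetical callable standing in for an LLM judge. The thresholds are placeholders to tune against human labels.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple


def lexical_similarity(output: str, reference: str) -> float:
    """Crude 0..1 proxy for 'roughly right' (character-level, not semantic)."""
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()


def grade(output: str,
          required_pattern: str,
          reference: str,
          judge: Optional[Callable[[str, str], float]] = None,
          low: float = 0.3,
          high: float = 0.9) -> Tuple[float, str]:
    """Return (score, tier), where tier names the grader that decided."""
    # Tier 1: free deterministic check. Failing it ends grading immediately.
    if re.search(required_pattern, output, flags=re.IGNORECASE) is None:
        return 0.0, "deterministic"
    # Tier 2: cheap proxy. Only clear-cut scores are decided here.
    sim = lexical_similarity(output, reference)
    if sim >= high:
        return 1.0, "similarity"
    if sim <= low:
        return 0.0, "similarity"
    # Tier 3: the ambiguous middle band goes to the costly judge, if one is wired in.
    if judge is not None:
        return judge(output, reference), "judge"
    return sim, "similarity"


reference = "Refunds are available within 30 days."
print(grade("I don't know.", r"\b30 days\b", reference))
print(grade("Refunds are available within 30 days.", r"\b30 days\b", reference))
```

The returned tier is worth logging: if most cases end up at the judge, the cheaper tiers are not pulling their weight.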
Offline evals run on your golden dataset before you ship — a gate. Online metrics watch real traffic after you ship — a signal. You need both: offline tells you what should happen; online tells you what is happening with users you never imagined.
Offline gates the deploy; online observes reality and feeds fresh cases back into the golden set.
An eval you run by hand once a month is a science project. An eval that runs on every PR and fails the build when quality drops is engineering. Wire your golden dataset into CI, set thresholds against a baseline, and let the pipeline catch regressions before merge.
Every prompt or model change runs the suite; a drop below the baseline blocks the merge.
Gate on a delta vs main, not an absolute number. "No worse than today" is easier to defend than a magic 0.92.
Sampling makes scores wobble. Pin a low temperature for eval runs, run each case N times, and gate on the rate with a tolerance band — not a single pass.
Cheap deterministic checks on every PR; the full judge-based suite nightly or on release. Don't pay for a judge on every typo fix.
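A minimal sketch of gating on a delta against main with a tolerance band. It assumes per-case pass rates have already been computed for both branches; `gate_on_delta`, the case ids and the 0.02 tolerance are illustrative choices, not recommended values.

```python
from statistics import mean
from typing import Dict, Tuple


def gate_on_delta(baseline: Dict[str, float],
                  candidate: Dict[str, float],
                  tolerance: float = 0.02) -> Tuple[bool, float]:
    """Compare mean pass rates; pass if the candidate is no worse than
    the baseline by more than `tolerance`. Returns (ok, delta)."""
    missing = set(baseline) - set(candidate)
    if missing:
        # A case that silently disappears should fail loudly, not raise the score.
        raise ValueError(f"candidate run is missing cases: {sorted(missing)}")
    delta = mean(candidate[k] for k in baseline) - mean(baseline.values())
    return delta >= -tolerance, delta


# Pass rates per case, e.g. from running each case N times on each branch.
main_scores = {"refund-001": 1.0, "refund-002": 0.8, "tone-001": 0.9}
pr_wobble = {"refund-001": 1.0, "refund-002": 0.8, "tone-001": 0.87}
pr_regressed = {"refund-001": 1.0, "refund-002": 0.8, "tone-001": 0.6}

ok_wobble, delta_wobble = gate_on_delta(main_scores, pr_wobble)
ok_regressed, delta_regressed = gate_on_delta(main_scores, pr_regressed)
print(ok_wobble, round(delta_wobble, 3))
print(ok_regressed, round(delta_regressed, 3))
```

In a CI job the boolean would typically become the process exit status, for example `sys.exit(0 if ok else 1)`, so that a regression fails the build.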
Once it ships, the eval question becomes operational: what did the model actually do, on which input, costing how much — and is today different from last week? That means tracing every call, logging the full context, and watching for drift.
A trace is a span tree. Each span carries latency, cost, tokens and the exact payload — replayable and gradable.
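To make the span tree concrete, here is a small sketch of the data structure. The field names and the example trace are assumptions for illustration; real tracing tools define their own schemas. The sketch also assumes cost and tokens are recorded per span and summed up the tree, while latency on a parent already includes its children.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    name: str
    latency_ms: float            # wall-clock time, including children
    cost_usd: float = 0.0        # cost incurred by this span alone
    tokens: int = 0              # tokens used by this span alone
    payload: dict = field(default_factory=dict)   # exact input/output, for replay
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and all descendants in pre-order."""
        yield self
        for child in self.children:
            yield from child.walk()

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.walk())

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.walk())


trace = Span("answer_question", latency_ms=1200, children=[
    Span("retrieve", latency_ms=200, children=[
        Span("rerank", latency_ms=120, cost_usd=0.001, tokens=150),
    ]),
    Span("generate", latency_ms=950, cost_usd=0.004, tokens=820,
         payload={"prompt": "...", "response": "..."}),
])

leaves = [s for s in trace.walk() if not s.children]
slowest_leaf = max(leaves, key=lambda s: s.latency_ms)
print(trace.total_tokens(), slowest_leaf.name)
```

Because each span keeps its exact payload, any leaf can be replayed through the same graders used offline.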
Choose based on where you work: open-source CLI for control, a managed platform for collaboration and dashboards.
Pro — open-source, config-as-code, runs local and in CI, red-teaming built in.
Con — you host storage and dashboards yourself.
Choose when you want evals in git and a free CI gate.
Pro — managed eval + experiment tracking, strong dataset and diff UI.
Con — commercial SaaS; another vendor and bill.
Choose when a team needs shared dashboards over many experiments.
Pro — tracing + evals together; tight fit if you use LangChain / LangGraph.
Con — most natural inside that ecosystem; commercial.
Choose when you're already on the LangChain stack.
Pro — open-source framework, registry of ready-made evals.
Con — lower-level; you build the harness around it.
Choose for custom, code-first evals you fully control.
Choose based on your tolerance for lock-in: a vendor-neutral standard, or a purpose-built LLM platform.
Pro — vendor-neutral semantic conventions for LLM spans; export anywhere, no lock-in.
Con — the GenAI conventions are still stabilizing; you assemble the backend and dashboards yourself.
Choose when you already run OTel and want LLM traces in the same pipeline as the rest of your services.
Pro — open-source, LLM-native tracing, prompt management, datasets and online evals; self-host or cloud.
Con — another system to run; an LLM-specific tool alongside your general observability.
Choose when you want batteries-included LLM observability without building it.
For a first feature, skip the platform: log full prompt/response pairs to the database you already run and eyeball them. Adopt a tracing tool the moment you have multi-step chains, real traffic, or more than one person debugging — not before. The best tool is the one your team will actually open.
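A sketch of the "log to the database you already run" approach, using SQLite from the Python standard library purely as a stand-in for whatever database is already in place. The table name `llm_calls`, its columns, and the helper names are assumptions.

```python
import json
import sqlite3
import time


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log table. ':memory:' keeps this demo self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS llm_calls (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               ts REAL NOT NULL,
               model TEXT NOT NULL,
               prompt TEXT NOT NULL,
               response TEXT NOT NULL,
               meta TEXT NOT NULL
           )"""
    )
    return conn


def log_call(conn: sqlite3.Connection, model: str, prompt: str,
             response: str, **meta) -> None:
    """Store the full prompt/response pair plus free-form metadata as JSON."""
    conn.execute(
        "INSERT INTO llm_calls (ts, model, prompt, response, meta) VALUES (?, ?, ?, ?, ?)",
        (time.time(), model, prompt, response, json.dumps(meta)),
    )
    conn.commit()


def recent(conn: sqlite3.Connection, limit: int = 20) -> list:
    """Return the newest calls first, ready to eyeball."""
    rows = conn.execute(
        "SELECT model, prompt, response, meta FROM llm_calls ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [{"model": m, "prompt": p, "response": r, "meta": json.loads(x)}
            for m, p, r, x in rows]


conn = open_log()
log_call(conn, "model-a", "Can I get a refund?", "Yes, within 30 days.", user="u1")
log_call(conn, "model-a", "refund pls", "Please share your order id.", user="u2")
latest = recent(conn, limit=1)
print(latest[0]["prompt"])
```

A table like this is also a ready source of candidate golden cases once something in it looks wrong.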
Evals tell you about quality on average; guardrails defend the single live request. Together with online feedback they form a loop: observe, find failures, add them to the golden set, fix, re-eval, ship. That loop is LLMOps.
The loop: production failures become golden cases, the fix is re-evaluated, and only a non-regressing build ships.
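A minimal sketch of how a guardrail and the loop can connect: the guardrail checks one live response, substitutes a fallback when a rule trips, and records the failure as a candidate golden case. The email rule, the fallback text, and the in-memory `failure_queue` are hypothetical; a real system would use its own rules and a durable store.

```python
import re
from typing import Dict, List

# Hypothetical rule: never let something that looks like an email address through.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

FALLBACK = "Sorry, I can't share that. Please contact support."


def guard(user_input: str,
          model_output: str,
          failure_queue: List[Dict[str, str]]) -> str:
    """Defend a single live request; queue any failure for the golden set."""
    if EMAIL.search(model_output):
        failure_queue.append({
            "input": user_input,
            "output": model_output,
            "reason": "email_in_output",
        })
        return FALLBACK
    return model_output


failure_queue: List[Dict[str, str]] = []
safe = guard("What are your hours?", "We are open 9 to 5.", failure_queue)
blocked = guard("Who handles billing?", "Email jane.doe@example.com directly.", failure_queue)
print(safe)
print(blocked)
print(len(failure_queue))
```

Each queued failure still needs a human to read it and write the expectation before it joins the golden set.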