A 36-minute working session on programming a model you can't fully predict — prompts as your API, typed JSON out, letting the model call your code, streaming for a fast feel, and how you prove it works.
Every line of code you've written so far is deterministic: same input, same output, every single time. An LLM breaks that promise. The same prompt can return different text on each call, and no one, not even the people who built it, can tell you exactly why. Building well means designing around that uncertainty instead of pretending it isn't there.
Same call, a spread of plausible answers — most good, some wrong. Design for the spread, not a single value.
Never assert exact string equality. Validate shape and properties ("is it valid JSON with these fields?"), not the precise words.
The model will sometimes invent facts, names, or APIs that sound right. It is not lying — it is completing a pattern. Ground it and check it (Part 05).
You pay per token (a chunk of text, ~¾ of a word) in and out, and bigger models answer slower. Picking a model is a real engineering trade-off, not an afterthought.
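To make the cost side of that trade-off concrete, here is a back-of-the-envelope estimator built on the ¾-word rule of thumb above. The prices are made-up placeholders, not any vendor's real rates, and the function names are hypothetical.

```typescript
// Rough per-call cost estimate.
// ASSUMPTION: the prices used below are placeholders for illustration only.
interface Pricing {
  inputPerMillionTokens: number;  // USD per 1,000,000 input tokens (hypothetical)
  outputPerMillionTokens: number; // USD per 1,000,000 output tokens (hypothetical)
}

// Rule of thumb from the deck: one token is about 3/4 of a word.
const WORDS_PER_TOKEN = 0.75;

function estimateTokens(wordCount: number): number {
  return Math.ceil(wordCount / WORDS_PER_TOKEN);
}

function estimateCostUsd(inputWords: number, outputWords: number, pricing: Pricing): number {
  const inputTokens = estimateTokens(inputWords);
  const outputTokens = estimateTokens(outputWords);
  const total =
    inputTokens * pricing.inputPerMillionTokens +
    outputTokens * pricing.outputPerMillionTokens;
  return total / 1e6;
}

// A 750-word prompt and a 300-word answer at the placeholder prices.
const placeholderPricing: Pricing = { inputPerMillionTokens: 3, outputPerMillionTokens: 15 };
const costPerCall = estimateCostUsd(750, 300, placeholderPricing);
```

Multiply `costPerCall` by your expected daily call volume before you pick a model tier; small per-call numbers add up quickly.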
With a normal library you read the function signature. With an LLM, the prompt is the interface you design: it sets the role, the rules, the examples, and the shape you want back. Vague prompt, vague product. A precise, structured prompt is the single highest-leverage thing you control.
System = the standing contract; user = the changing request. Keep rules in the system message so every turn obeys them.
UNKNOWN").
Like onboarding a new hire: a clear role, a couple of worked examples, and the do-not-do list get you a useful first day.
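One way to treat the prompt as a designed interface is to build it from typed parts. This is a minimal sketch: `PromptSpec`, `buildMessages`, and the example rules are hypothetical names and values, and the message shape follows the common system/user/assistant convention rather than any specific SDK.

```typescript
type Role = "system" | "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

// Hypothetical spec type: the four things the prompt sets.
interface PromptSpec {
  role: string;                                   // who the model should act as
  rules: string[];                                // the standing do / do-not list
  examples: { input: string; output: string }[];  // worked examples
  outputShape: string;                            // the shape you want back
}

function buildMessages(spec: PromptSpec, userRequest: string): Message[] {
  // Standing contract: role, rules, and output shape live in the system message.
  const systemLines = ["Role: " + spec.role, "Rules:"];
  for (const rule of spec.rules) {
    systemLines.push("- " + rule);
  }
  systemLines.push("Output format: " + spec.outputShape);

  const messages: Message[] = [{ role: "system", content: systemLines.join("\n") }];

  // Worked examples are replayed as prior user/assistant turns.
  for (const example of spec.examples) {
    messages.push({ role: "user", content: example.input });
    messages.push({ role: "assistant", content: example.output });
  }

  // The changing request always goes last, as the user turn.
  messages.push({ role: "user", content: userRequest });
  return messages;
}

// Illustrative values only.
const supportSpec: PromptSpec = {
  role: "Support ticket classifier",
  rules: ["Reply with JSON only.", "Do not invent order numbers."],
  examples: [{ input: "I was charged twice", output: '{"category":"billing"}' }],
  outputShape: '{"category": string}',
};
const promptMessages = buildMessages(supportSpec, "The app crashes on login");
```

Because the rules sit in the system message, every new user request is sent under the same contract without anyone retyping it.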
Prose is great for humans and useless for the rest of your program. Two features turn a chat toy into a building block: structured output (the model returns data that matches a schema you define) and tool calling (the model asks your functions to run and uses the results).
zod or JSON Schema), the SDK constrains the model to it, and you get back a typed object you can use directly: no fragile string-parsing, no "please respond in JSON" and hope.
The schema is the contract. You get a typed object back instead of a paragraph you have to parse and pray over.
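To show what "the schema is the contract" buys you, here is a dependency-free sketch of turning raw model text into a typed object. In a real app a schema library such as zod and your SDK's structured-output mode would likely replace the hand-written checks; the category values and the `parseTicket` name are assumptions for illustration.

```typescript
// Hypothetical schema: the category values are assumptions.
const CATEGORIES = ["billing", "bug", "other"] as const;
type Category = (typeof CATEGORIES)[number];

interface Ticket {
  category: Category;
  urgency: number; // integer from 1 to 5
}

type ParseResult =
  | { ok: true; value: Ticket }
  | { ok: false; error: string };

function parseTicket(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "expected a JSON object" };
  }
  const record = data as Record<string, unknown>;
  const category = record["category"];
  const urgency = record["urgency"];

  // Keep only a category that is on the allowed list.
  const knownCategory = CATEGORIES.filter((c) => c === category)[0];
  if (knownCategory === undefined) {
    return { ok: false, error: "unknown category" };
  }
  if (
    typeof urgency !== "number" ||
    Math.floor(urgency) !== urgency ||
    urgency < 1 ||
    urgency > 5
  ) {
    return { ok: false, error: "urgency must be an integer from 1 to 5" };
  }
  return { ok: true, value: { category: knownCategory, urgency } };
}
```

Callers branch on `ok` once and then work with a real `Ticket`; nothing downstream ever touches the raw string.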
getWeather({city:"Berlin"}). Your code runs the real function and hands the result back, then the model finishes the answer with real data.
The model never touches your database. It requests a tool; you run it and return the result; it answers with real data.
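The request, run, return loop described above can be sketched without any SDK. Everything here is a stand-in: `fakeModel` is a scripted function rather than a real LLM, the weather data is canned, and the turn types are hypothetical. Real model calls are asynchronous; this version is synchronous to keep the control flow visible.

```typescript
type ModelTurn =
  | { type: "tool_call"; name: string; args: Record<string, unknown> }
  | { type: "final"; text: string };

type Tool = (args: Record<string, unknown>) => string;

// Your code owns the tools. The model can only ask for them by name.
const tools: Record<string, Tool> = {
  getWeather: (args) => JSON.stringify({ city: String(args["city"]), tempC: 18 }), // canned data
};

// Scripted stand-in for a model: asks for the tool first, answers once it sees a result.
function fakeModel(history: string[]): ModelTurn {
  const last = history[history.length - 1] || "";
  if (last.indexOf("tool: ") === 0) {
    const data = JSON.parse(last.slice("tool: ".length)) as { city: string; tempC: number };
    return { type: "final", text: data.city + ": " + data.tempC + " degrees C" };
  }
  return { type: "tool_call", name: "getWeather", args: { city: "Berlin" } };
}

function runWithTools(
  question: string,
  model: (history: string[]) => ModelTurn,
  maxSteps: number = 3
): string {
  const history: string[] = ["user: " + question];
  for (let step = 0; step < maxSteps; step++) {
    const turn = model(history);
    if (turn.type === "final") {
      return turn.text;
    }
    const tool = tools[turn.name];
    if (tool === undefined) {
      // Unknown tool names are reported back instead of executed.
      history.push("tool: " + JSON.stringify({ error: "unknown tool " + turn.name }));
      continue;
    }
    history.push("tool: " + tool(turn.args));
  }
  return "Sorry, I could not finish that request.";
}

const weatherAnswer = runWithTools("What's the weather in Berlin?", fakeModel);
```

The step cap matters: without `maxSteps`, a model that keeps requesting tools would loop forever on your bill.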
Wiring tools by hand into every app gets repetitive. MCP is an open standard (introduced by Anthropic) that lets any host — Claude Desktop, Claude Code, an IDE extension — connect to a server that exposes tools, resources, and prompts over a shared protocol. Transports are stdio (local) and streamable HTTP (remote). Build the server once; every MCP host can use it. We go deep in the MCP deck.
A good answer can take many seconds to finish. If you wait for the whole thing before showing anything, the app feels broken. Streaming sends the reply token-by-token so the user sees words appear immediately — the same total time, a completely different feel.
Both finish at the same moment. Streaming just stops the user from staring at a dead spinner the whole way there.
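The "same total time, different feel" claim is easy to check with a simulated clock. This sketch does not call a model or an SDK; it assumes a made-up rate of one token every 50 ms and compares when the user first sees text in each mode. Real streaming APIs are asynchronous; the arrays here only model their timing.

```typescript
interface TimedChunk {
  atMs: number; // simulated time at which this text becomes visible
  text: string;
}

// Simulated generation: one token arrives every msPerToken milliseconds.
function simulateGeneration(tokens: string[], msPerToken: number): TimedChunk[] {
  return tokens.map((text, i) => ({ atMs: (i + 1) * msPerToken, text }));
}

// Buffered UI: a single update, only when the last token has arrived.
function renderBuffered(chunks: TimedChunk[]): TimedChunk[] {
  if (chunks.length === 0) {
    return [];
  }
  const fullText = chunks.map((c) => c.text).join("");
  return [{ atMs: chunks[chunks.length - 1].atMs, text: fullText }];
}

// Streaming UI: one update per token, each showing the text so far.
function renderStreamed(chunks: TimedChunk[]): TimedChunk[] {
  let soFar = "";
  return chunks.map((c) => {
    soFar += c.text;
    return { atMs: c.atMs, text: soFar };
  });
}

const generated = simulateGeneration(["Stream", "ing ", "feels ", "fast."], 50);
const bufferedUpdates = renderBuffered(generated);
const streamedUpdates = renderStreamed(generated);
```

In this run the buffered view first shows text at 200 ms and the streamed view at 50 ms, while both show the complete sentence at 200 ms.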
You can't write assert(output === expected) against a model that never repeats itself. So you test differently: build a small set of graded examples (evals), score every change against them, and wrap the live system in guardrails that catch bad output before a user ever sees it.
Run the dataset, grade the answers, watch the pass rate. Change a prompt or model only if the number goes up.
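A tiny eval harness along those lines might look like this. The dataset, the keyword-based "classifiers" standing in for two prompt versions, and all names are hypothetical; a real run would call the model where `Classifier` is invoked.

```typescript
interface EvalCase {
  input: string;
  expectedCategory: string;
}

// Stand-in for "prompt + model call": takes ticket text, returns raw model output.
type Classifier = (input: string) => string;

// Grade on a property (does the category match the label?), not on exact text.
function grade(rawOutput: string, expectedCategory: string): boolean {
  try {
    const parsed: unknown = JSON.parse(rawOutput);
    if (typeof parsed !== "object" || parsed === null) {
      return false;
    }
    return (parsed as Record<string, unknown>)["category"] === expectedCategory;
  } catch {
    return false;
  }
}

function passRate(cases: EvalCase[], classify: Classifier): number {
  let passed = 0;
  for (const c of cases) {
    if (grade(classify(c.input), c.expectedCategory)) {
      passed++;
    }
  }
  return cases.length === 0 ? 0 : passed / cases.length;
}

// Fake "prompt versions": keyword rules instead of a real model.
function makeKeywordClassifier(billingWords: string[]): Classifier {
  return (input) => {
    const text = input.toLowerCase();
    for (const word of billingWords) {
      if (text.indexOf(word) !== -1) {
        return JSON.stringify({ category: "billing" });
      }
    }
    if (text.indexOf("crash") !== -1) {
      return JSON.stringify({ category: "bug" });
    }
    return JSON.stringify({ category: "other" });
  };
}

const evalCases: EvalCase[] = [
  { input: "I was charged twice this month", expectedCategory: "billing" },
  { input: "The app crashes on login", expectedCategory: "bug" },
  { input: "Please resend my invoice", expectedCategory: "billing" },
  { input: "How do I export my data?", expectedCategory: "other" },
];

const baselineRate = passRate(evalCases, makeKeywordClassifier(["charged"]));
const candidateRate = passRate(evalCases, makeKeywordClassifier(["charged", "invoice"]));
const shouldShip = candidateRate > baselineRate; // ship only if the number goes up
```

Four cases is only enough to show the mechanics; the habit that matters is rerunning the same dataset on every prompt or model change.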
category match the label? Cheap, objective, your first line of defense.
Strip or reject prompt-injection attempts and obviously bad input. Cap length so a giant paste can't blow your token budget.
Re-validate structured output against the schema. If a tool was "called", confirm the arguments are sane before you run it. Never eval() model text.
Put the real data (the actual ticket, the actual policy) in the prompt and tell the model to answer only from it, and to say so when the answer isn't there. This is what RAG automates.
Like a kitchen: evals are tasting against the recipe before service; guardrails are the health inspector who stops a bad plate reaching the table.
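Two of the guardrails above, input checks and grounding, fit in a few lines. This is a deliberately naive sketch: the character cap, the phrase list, and the refusal wording are assumptions, and a phrase blocklist alone will not stop a determined prompt injection.

```typescript
const MAX_INPUT_CHARS = 4000; // hypothetical budget; tune to your model and prices

// Naive, illustrative blocklist. Treat it as one signal, not a complete defense.
const SUSPICIOUS_PHRASES = ["ignore previous instructions", "disregard the system prompt"];

type InputCheck =
  | { ok: true; text: string }
  | { ok: false; reason: string };

// Input guardrail: reject obviously bad input, cap the length of the rest.
function checkInput(raw: string): InputCheck {
  const text = raw.trim();
  if (text.length === 0) {
    return { ok: false, reason: "empty input" };
  }
  const lower = text.toLowerCase();
  for (const phrase of SUSPICIOUS_PHRASES) {
    if (lower.indexOf(phrase) !== -1) {
      return { ok: false, reason: "possible prompt injection" };
    }
  }
  return { ok: true, text: text.slice(0, MAX_INPUT_CHARS) };
}

// Grounding: put the real source in the prompt and tell the model to stay inside it.
function groundedPrompt(sourceText: string, question: string): string {
  return [
    "Answer ONLY from the source below.",
    'If the answer is not in the source, reply exactly: "Not in the provided source."',
    "--- SOURCE ---",
    sourceText,
    "--- QUESTION ---",
    question,
  ].join("\n");
}
```

Run `checkInput` before the model call and the schema re-validation after it, so bad data is stopped on both sides of the model.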
Two choices shape an LLM app: which model answers the prompt, and which SDK you build with. Neither is permanent — a good design lets you swap models behind one interface — but knowing the landscape keeps you from cargo-culting whatever the last blog post used.
Pro: strong reasoning, coding, and tool use; a tier for each need — Opus (deepest), Sonnet (balanced), Haiku (fast/cheap).
Con: top tier costs more per token than the small open models.
Pro: huge ecosystem, mature tooling, broad familiarity across teams.
Con: single vendor; capability and pricing tiers shift, so pin versions.
Pro: very large context windows and tight fit if you already live on Google Cloud.
Con: tooling and behavior differ from the others — budget porting time.
Pro: run them yourself — privacy, no per-token bill, full control; great for high volume.
Con: you own the GPUs, scaling, and ops; top quality still trails the best closed models.
How to choose: prototype on a strong hosted model (Sonnet / GPT / Gemini), then drop to a cheaper or open model per task once your evals prove the quality holds. Route easy calls to small models, hard ones to big.
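Routing is straightforward once every model sits behind one interface. In this sketch `ModelClient`, the difficulty heuristic, and the fake clients are all hypothetical; in practice the routing rule should come from what your evals show, not from a keyword guess. Real clients would be asynchronous.

```typescript
// One interface for every model, so swapping or routing is a local change.
interface ModelClient {
  name: string;
  complete(prompt: string): string;
}

type Difficulty = "easy" | "hard";

// Placeholder heuristic: long prompts or requests for reasoning count as hard.
function guessDifficulty(prompt: string): Difficulty {
  const lower = prompt.toLowerCase();
  const needsReasoning =
    lower.indexOf("explain") !== -1 || lower.indexOf("step by step") !== -1;
  return prompt.length > 400 || needsReasoning ? "hard" : "easy";
}

// The router is itself a ModelClient, so callers never know which model answered.
function makeRouter(
  small: ModelClient,
  large: ModelClient,
  classify: (prompt: string) => Difficulty = guessDifficulty
): ModelClient {
  return {
    name: "router(" + small.name + "," + large.name + ")",
    complete(prompt: string): string {
      const target = classify(prompt) === "hard" ? large : small;
      return target.complete(prompt);
    },
  };
}

// Fake clients that tag their replies so the routing decision is visible.
function fakeClient(name: string): ModelClient {
  return {
    name,
    complete: (prompt: string) => "[" + name + "] reply to: " + prompt,
  };
}

const router = makeRouter(fakeClient("small"), fakeClient("large"));
```

Because the router implements the same `ModelClient` interface, the rest of the app does not change when you adjust the routing rule or replace a model.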
Start with the Vercel AI SDK talking to a strong hosted model. Reach for LangGraph only when orchestration genuinely gets multi-step, and LlamaIndex when retrieval over your own data is the real problem. Most first features need none of the heavy frameworks — an SDK and a good prompt go a long way.
Let's assemble the pieces into a real, small feature: auto-triage incoming support tickets. Every idea from this deck shows up exactly once.
Prompt → structured call → validate → tool. Evals sit beside it as the safety net for every change.
{ category, urgency }, typed and parseable.
assign() tool routes the ticket to the right queue with real data.
urgency to 1–5 before acting.
"Treat the model as a fast, fallible colleague: give it clear instructions, then check its work."
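Put together, the triage flow (prompt, structured call, validate, tool) could be sketched like this. The model call and `assign` are passed in as plain functions so the example runs without an SDK; the queue names, the `needs-human` fallback, and the prompt wording are assumptions.

```typescript
interface Triage {
  category: string;
  urgency: number;
}

const QUEUES = ["billing", "bug", "other"];                        // hypothetical queues
const FALLBACK: Triage = { category: "needs-human", urgency: 3 };  // hypothetical safe default

// Validate step: clamp urgency to 1-5 before acting on it.
function clampUrgency(value: number): number {
  if (!isFinite(value)) {
    return FALLBACK.urgency;
  }
  return Math.min(5, Math.max(1, Math.round(value)));
}

// Validate step: anything that does not match the expected shape goes to a human.
function validateTriage(raw: string): Triage {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) {
      return FALLBACK;
    }
    const record = data as Record<string, unknown>;
    const category = record["category"];
    const urgency = record["urgency"];
    if (typeof category !== "string" || QUEUES.indexOf(category) === -1) {
      return FALLBACK;
    }
    if (typeof urgency !== "number") {
      return FALLBACK;
    }
    return { category, urgency: clampUrgency(urgency) };
  } catch {
    return FALLBACK;
  }
}

function triageTicket(
  ticketText: string,
  callModel: (prompt: string) => string,            // stand-in for the structured model call
  assign: (queue: string, urgency: number) => string // stand-in for the assign() tool
): string {
  const prompt =
    'Classify this support ticket. Reply with JSON {"category": string, "urgency": 1-5}.\n\n' +
    ticketText;
  const triage = validateTriage(callModel(prompt));
  return assign(triage.category, triage.urgency);
}
```

A model reply of `{"category":"billing","urgency":9}` is routed to the billing queue at urgency 5, and an unparseable reply lands in the fallback queue instead of crashing the pipeline.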
Five quick questions on non-determinism, prompting, structured output, streaming, and evals — instant feedback, no sign-in.