GUIDEDECK · for engineers shipping their first AI feature

Building
apps with LLMs
that actually ship.

A 36-minute working session on programming a model you can't fully predict — prompts as your API, typed JSON out, letting the model call your code, streaming for a fast feel, and how you prove it works.

~36 MIN · BEGINNER → INTERMEDIATE · VENDOR-AGNOSTIC
01 · Why building with LLMs is different · 4 min

You're calling a function
that's a fuzzy black box.

Most code you've written so far is deterministic: same input, same output, every single time. An LLM breaks that promise. The same prompt can return different text on each call, because the model samples its output, and no one, not even the people who built it, can fully explain why it produced one particular answer. Building well means designing around that uncertainty instead of pretending it isn't there.

LLM (Large Language Model): a program trained on enormous amounts of text to do one thing, predicting a likely next chunk of text (a token) from the text so far, over and over. It has no database of facts and no logic engine inside it; it's a very good pattern-completer. That single idea explains both its magic and its failure modes.

The mental model that changes

  • A normal function is a vending machine: press B4, get the same snack forever.
  • An LLM is more like asking a sharp, fast colleague — brilliant, occasionally confidently wrong, never word-for-word the same twice.
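  • Temperature is a dial for how much randomness the model adds (the allowed range varies by provider; 0 to 1 and 0 to 2 are common). Low = steadier and more repetitive, though rarely identical on every call; high = more varied and creative.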
  • So you stop asking "is the output correct?" and start asking "is it acceptable, often enough?"
[Diagram: a deterministic function() returns 42 every time; a probabilistic LLM() returns a spread of replies ("Sure, here's…", "Of course! …", and one made-up fact ✕).]

Same call, a spread of plausible answers — most good, some wrong. Design for the spread, not a single value.
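To make the temperature dial concrete, here is a toy sampler. It is a sketch under invented assumptions: a three-word vocabulary with made-up scores (logits). Dividing the scores by the temperature before a softmax sharpens the odds when the temperature is low and flattens them when it is high. Real models apply the same idea over many thousands of tokens, and hosted APIs expose only the temperature knob, not this code.

```typescript
// Toy next-token sampler: shows how temperature reshapes the odds.
// The vocabulary and logits are invented for illustration.

// Softmax with temperature: p_i = exp(z_i / T) / sum_j exp(z_j / T)
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const t = Math.max(temperature, 1e-6) // avoid dividing by zero at T = 0
  const scaled = logits.map((z) => z / t)
  const max = Math.max(...scaled) // subtract the max for numerical stability
  const exps = scaled.map((z) => Math.exp(z - max))
  const sum = exps.reduce((a, b) => a + b, 0)
  return exps.map((e) => e / sum)
}

// Pick a token index from the probabilities, given a random number r in [0, 1)
function sample(probs: number[], r: number): number {
  let cumulative = 0
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i]
    if (r < cumulative) return i
  }
  return probs.length - 1 // guard against floating-point rounding
}

const vocab = ["Sure", "Of course", "Certainly"]
const logits = [2.0, 1.0, 0.5] // hypothetical model scores

const cold = softmaxWithTemperature(logits, 0.2) // almost all weight on "Sure"
const hot = softmaxWithTemperature(logits, 1.5) // odds spread across all three

console.log(vocab[sample(cold, Math.random())], vocab[sample(hot, Math.random())])
```

At 0.2 nearly all the probability lands on "Sure"; at 1.5 the other openings get a real share, which is why the same prompt can read differently from call to call.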

Non-determinism

Plan for variety

Never assert exact string equality. Validate shape and properties ("is it valid JSON with these fields?"), not the precise words.
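A minimal sketch of checking properties instead of exact words, assuming the { category, urgency } triage reply used later in this deck. `checkTriageReply` is a hypothetical helper: it parses the reply and checks what any acceptable answer must have, so two differently formatted replies can both pass while prose or out-of-range values fail.

```typescript
type Category = "billing" | "bug" | "other"
interface Triage { category: Category; urgency: number }
type CheckResult = { ok: true; value: Triage } | { ok: false; errors: string[] }

const CATEGORIES: readonly string[] = ["billing", "bug", "other"]

// Parse a reply and check the properties every acceptable answer must have.
function checkTriageReply(raw: string): CheckResult {
  let data: unknown
  try {
    data = JSON.parse(raw)
  } catch {
    return { ok: false, errors: ["not valid JSON"] }
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, errors: ["not a JSON object"] }
  }
  const obj = data as Record<string, unknown>
  const errors: string[] = []
  if (typeof obj.category !== "string" || !CATEGORIES.includes(obj.category)) {
    errors.push("category must be billing, bug, or other")
  }
  const u = obj.urgency
  if (typeof u !== "number" || !Number.isInteger(u) || u < 1 || u > 5) {
    errors.push("urgency must be an integer from 1 to 5")
  }
  if (errors.length > 0) return { ok: false, errors }
  return { ok: true, value: { category: obj.category as Category, urgency: u as number } }
}
```

For example, '{"category":"billing","urgency":3}' and '{ "urgency": 3, "category": "billing" }' both pass; exact string equality would reject one of them.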

Hallucination

Confidently wrong

The model will sometimes invent facts, names, or APIs that sound right. It is not lying — it is completing a pattern. Ground it and check it (Part 05).

Cost & latency

Tokens add up

You pay per token (a chunk of text, roughly ¾ of an English word) for both what you send and what comes back, and bigger models usually answer slower. Picking a model is a real engineering trade-off, not an afterthought.
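The arithmetic behind "tokens add up" is simple enough to sketch. The prices below are hypothetical placeholders, not any provider's real rates (providers publish their own per-million-token prices, usually higher for output than for input), and the word-to-token conversion uses the rough ¾-of-a-word rule above.

```typescript
interface Pricing {
  inputPerMillion: number  // USD per 1,000,000 input tokens (hypothetical)
  outputPerMillion: number // USD per 1,000,000 output tokens (hypothetical)
}

// Cost of one call: input and output tokens are billed separately.
function callCostUsd(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens * p.inputPerMillion + outputTokens * p.outputPerMillion) / 1_000_000
}

// Rough rule: 1 token is about 3/4 of an English word, so tokens ≈ words / 0.75
function estimateTokens(words: number): number {
  return Math.ceil(words / 0.75)
}

const bigModel: Pricing = { inputPerMillion: 3, outputPerMillion: 15 }        // placeholder prices
const smallModel: Pricing = { inputPerMillion: 0.25, outputPerMillion: 1.25 } // placeholder prices

// A 600-word ticket in, a 150-word reply out, 10,000 times a day
const inTok = estimateTokens(600)  // 800
const outTok = estimateTokens(150) // 200
const dailyBig = callCostUsd(inTok, outTok, bigModel) * 10_000
const dailySmall = callCostUsd(inTok, outTok, smallModel) * 10_000
```

With these placeholder numbers the same daily workload costs about $54 on the bigger model and $4.50 on the smaller one; a gap like that is what makes model choice an engineering decision.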

02 · Prompting that works · 6 min

The prompt is your API —
write it like one.

With a normal library you read the function signature. With an LLM, the prompt is the interface you design: it sets the role, the rules, the examples, and the shape you want back. Vague prompt, vague product. A precise, structured prompt is the single highest-leverage thing you control.

Prompt: everything you send the model for one call. It usually splits into a system message (the standing instructions: who the model is and the rules it must follow) and a user message (the actual request for this turn). Think job description vs. today's ticket.
[Diagram: system message (role, rules, tone; stable) + user message (this request; changes) → LLM → reply. The system sets the frame; the user fills it.]

System = the standing contract; user = the changing request. Keep rules in the system message so every turn obeys them.

Four habits of a good prompt

  • Give it a role. "You are a support-triage assistant" often steers tone and judgement more than a page of rules.
  • Be specific about the output. Say the format, the length, and what to do when unsure ("reply UNKNOWN").
  • Show, don't just tell. One or two worked examples (few-shot) beat paragraphs of description.
  • State the guardrails. What it must never do, and how to behave when the input is missing or hostile.
Few-shot prompting: putting a few solved examples in the prompt so the model copies the pattern. "Zero-shot" is just asking; "few-shot" is asking and showing two or three input→output pairs. It is often the cheapest accuracy upgrade you have.
Vague — you'll get vague back
```typescript
// no role, no format, no examples
const prompt = "Sort out this support email and tell me what it's about."
// → rambling paragraph, different shape every call,
//   impossible to parse or trust downstream
```
Specific — role, rules, shape
```typescript
declare const ticket: { body: string } // the incoming support email

// The few-shot example lives inside the prompt, so the model actually sees it.
const system = `You are a support-triage assistant. Classify each email.
category ∈ {billing, bug, other}. If unsure, use "other". Reply with JSON only.

Example:
Email: I was charged twice
Answer: {"category":"billing"}`

const user = `Email: ${ticket.body}`
```

Like onboarding a new hire: a clear role, a couple of worked examples, and the do-not-do list get you a useful first day.
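Many chat APIs also let you carry few-shot examples as earlier turns in the conversation rather than as text inside the system prompt: each solved example becomes a user message followed by the ideal assistant reply, and the real request goes last. A minimal sketch; the `Message` shape, `buildTriageMessages`, and the second example email are hypothetical, so check your SDK's message type (some APIs, such as Anthropic's, take the system prompt as a separate field rather than as a message).

```typescript
type Role = "system" | "user" | "assistant"
interface Message { role: Role; content: string }
interface Example { input: string; output: string }

// Standing contract: role, rules, output shape, fallback for uncertainty.
const SYSTEM = `You are a support-triage assistant. Classify each email.
category ∈ {billing, bug, other}. If unsure, use "other". Reply with JSON only.`

// Solved examples the model should imitate (few-shot).
const EXAMPLES: Example[] = [
  { input: "I was charged twice", output: '{"category":"billing"}' },
  { input: "The export button crashes the app", output: '{"category":"bug"}' },
]

function buildTriageMessages(emailBody: string): Message[] {
  // Each example becomes a user turn plus the ideal assistant reply.
  const shots = EXAMPLES.flatMap((ex): Message[] => [
    { role: "user", content: `Email: ${ex.input}` },
    { role: "assistant", content: ex.output },
  ])
  return [
    { role: "system", content: SYSTEM },
    ...shots,
    { role: "user", content: `Email: ${emailBody}` }, // the real request goes last
  ]
}

const messages = buildTriageMessages("Where do I change my password?")
```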

03 · Structured output & tool calling · 6 min

Get typed JSON back —
and let the model call your code.

Prose is great for humans and useless for the rest of your program. Two features turn a chat toy into a building block: structured output (the model returns data that matches a schema you define) and tool calling (the model asks your functions to run and uses the results).

Structured output: making the model answer as JSON that fits a schema you supply. You hand it a shape (often a Zod schema or JSON Schema); the SDK passes that shape to the model, using the provider's structured-output mode where one exists, and validates the reply. You get back a typed object you can use directly: no fragile string-parsing, no "please respond in JSON" and hope.
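```typescript
import { generateObject, type LanguageModel } from "ai"
import { z } from "zod"

declare const model: LanguageModel     // any provider's model
declare const ticket: { body: string } // the incoming support email

const { object } = await generateObject({
  model,
  schema: z.object({
    category: z.enum(["billing", "bug", "other"]),
    urgency: z.number().int().min(1).max(5),
  }),
  prompt: ticket.body,
})
// object.category is typed "billing" | "bug" | "other"; use it directly
```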
[Diagram: schema you define → LLM (constrained) → typed { category, urgency }. Schema in, typed object out; no string-parsing, no guessing.]

The schema is the contract. You get a typed object back instead of a paragraph you have to parse and pray over.

Tool calling (a.k.a. function calling) — you describe functions the model is allowed to invoke (name, description, argument schema). When the model decides it needs one, it doesn't run anything — it returns a request like getWeather({city:"Berlin"}). Your code runs the real function and hands the result back, then the model finishes the answer with real data.
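```typescript
import { generateText, stepCountIs, tool, type LanguageModel } from "ai"
import { z } from "zod"

declare const model: LanguageModel
declare const db: { count(status: string): Promise<number> } // your data layer (hypothetical)

// AI SDK v5 names (inputSchema, stopWhen); v4 used parameters and maxSteps.
const { text } = await generateText({
  model,
  prompt: "How many open billing tickets today?",
  tools: {
    countTickets: tool({
      description: "Count tickets by status",
      inputSchema: z.object({ status: z.string() }),
      execute: async ({ status }) => db.count(status),
    }),
  },
  stopWhen: stepCountIs(2), // step 1: tool call; step 2: answer using the result
})
// model asks → your code runs → model answers with the number
```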
[Diagram: model ⇄ your code ⇄ db / api. (1) The model sends call(args); (2) your code returns the result; the loop repeats until the model answers.]

The model never touches your database. It requests a tool; you run it and return the result; it answers with real data.
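To see what the SDK automates, here is the same request, run, return loop written by hand against a fake model. A sketch under stated assumptions: `fakeModel`, `fakeDb`, and the `ToolCall` and `ModelTurn` shapes are hypothetical (every provider has its own wire format), but the control flow carries over: the model only asks, your code decides whether to run the function, and a step limit stops a runaway loop.

```typescript
interface ToolCall { name: string; args: Record<string, unknown> }
type ModelTurn = { type: "tool_call"; call: ToolCall } | { type: "answer"; text: string }
interface Turn { role: "user" | "tool"; content: string }

// Your real functions. The model only ever sees their names and results.
const tools: Record<string, (args: Record<string, unknown>) => string> = {
  countTickets: (args) => {
    const fakeDb: Record<string, number> = { open: 7, closed: 42 } // stand-in data
    return String(fakeDb[String(args.status)] ?? 0)
  },
}

// Stand-in model: asks for a tool first, then answers from the tool result.
function fakeModel(history: Turn[]): ModelTurn {
  const toolResult = history.find((t) => t.role === "tool")
  if (!toolResult) return { type: "tool_call", call: { name: "countTickets", args: { status: "open" } } }
  return { type: "answer", text: `There are ${toolResult.content} open tickets.` }
}

function runWithTools(question: string, maxSteps = 5): string {
  const history: Turn[] = [{ role: "user", content: question }]
  for (let step = 0; step < maxSteps; step++) {
    const turn = fakeModel(history)
    if (turn.type === "answer") return turn.text // the model is done
    // The model only *requested* this call; your code decides whether to run it.
    const name = turn.call.name
    const fn = Object.prototype.hasOwnProperty.call(tools, name) ? tools[name] : undefined
    const result = fn ? fn(turn.call.args) : `error: unknown tool ${name}`
    history.push({ role: "tool", content: result }) // hand the result back
  }
  throw new Error("too many tool steps") // guard against an endless loop
}

const answer = runWithTools("How many open billing tickets today?")
```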

MCP · Model Context Protocol

One open standard for plugging tools in

Wiring tools by hand into every app gets repetitive. MCP is an open standard (introduced by Anthropic) that lets any host — Claude Desktop, Claude Code, an IDE extension — connect to a server that exposes tools, resources, and prompts over a shared protocol. Transports are stdio (local) and streamable HTTP (remote). Build the server once; every MCP host can use it. We go deep in the MCP deck.
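Under the hood, MCP messages are JSON-RPC 2.0. A sketch of the two requests a host sends to discover and call a tool once the initialize handshake is done: `tools/list` and `tools/call` are method names from the MCP specification, while the `countTickets` tool and its arguments are hypothetical. In practice an MCP SDK builds and parses these messages for you.

```typescript
// JSON-RPC 2.0 request envelope, as used by MCP
interface JsonRpcRequest {
  jsonrpc: "2.0"
  id: number
  method: string
  params?: Record<string, unknown>
}

let nextId = 1
function request(method: string, params?: Record<string, unknown>): JsonRpcRequest {
  return params === undefined
    ? { jsonrpc: "2.0", id: nextId++, method }
    : { jsonrpc: "2.0", id: nextId++, method, params }
}

// 1. Ask the server which tools it exposes.
const listTools = request("tools/list")

// 2. Call one by name, with arguments matching its input schema.
const callTool = request("tools/call", {
  name: "countTickets", // hypothetical tool on your server
  arguments: { status: "open" },
})

// Over the stdio transport, each message is one line of JSON with no embedded newlines.
const wireLine = JSON.stringify(callTool)

// A tool result comes back as content blocks, for example:
// {"jsonrpc":"2.0","id":2,"result":{"content":[{"type":"text","text":"7"}]}}
```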

04 · Streaming & UX · 4 min

Show tokens as they arrive,
not after a long wait.

A good answer can take many seconds to finish. If you wait for the whole thing before showing anything, the app feels broken. Streaming sends the reply token-by-token so the user sees words appear immediately — the same total time, a completely different feel.

Streaming: delivering the model's reply in small pieces as it is generated rather than in one final blob. The key metric becomes time to first token (how fast something shows up), not just total time. It is usually the biggest perceived-speed win in an LLM UI.
[Diagram: blocking shows a spinner with nothing visible until all the text appears at the end; streaming shows "The bill was…" from the first token onward. Both end at the same moment.]

Both finish at the same moment. Streaming just stops the user from staring at a dead spinner the whole way there.

Streaming in practice

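```typescript
import { streamText, type LanguageModel } from "ai"

declare const model: LanguageModel // any provider's model

// e.g. a Next.js-style route handler (hypothetical route)
export async function POST(req: Request) {
  const { body } = await req.json() // the ticket text sent by the client
  const result = streamText({
    model,
    prompt: body,
    abortSignal: req.signal, // stop generating when the request is aborted
  })
  // pipe straight to the browser; the UI renders tokens as they arrive
  return result.toUIMessageStreamResponse()
}
```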
  • Render a cursor while tokens flow; it reads as "thinking out loud".
  • Let users stop. A cancel button that aborts the request saves tokens and frustration.
  • Heads-up: you can't validate a half-finished answer, so do final checks once the stream completes.
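The three habits above fit in one loop. A self-contained sketch, assuming a fake chunk source (`fakeTokenStream`, hypothetical) in place of a real SDK stream: it renders partial text with a cursor, stops when an `AbortController` fires, and validates only once the stream has finished.

```typescript
// Stand-in for a model stream: yields the reply a few characters at a time.
async function* fakeTokenStream(reply: string, signal: AbortSignal): AsyncGenerator<string> {
  for (let i = 0; i < reply.length; i += 4) {
    if (signal.aborted) return // stop producing once cancelled
    await new Promise((resolve) => setTimeout(resolve, 1)) // simulate network delay
    yield reply.slice(i, i + 4)
  }
}

interface StreamOutcome { text: string; complete: boolean; valid: boolean }

async function streamAndValidate(
  stream: AsyncIterable<string>,
  render: (partial: string) => void, // e.g. update the UI
  signal: AbortSignal,
  isValid: (full: string) => boolean,
): Promise<StreamOutcome> {
  let text = ""
  for await (const chunk of stream) {
    if (signal.aborted) break
    text += chunk
    render(text + "▍") // show partial text immediately, with a cursor; no validation yet
  }
  const complete = !signal.aborted
  render(text) // drop the cursor
  // Only a finished answer is checked; a cancelled one is never trusted.
  return { text, complete, valid: complete && isValid(text) }
}
```

In a browser, `controller.abort()` would be wired to the Stop button, and the same signal would be passed to the real request so the server stops generating too.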
05 · Evals & guardrails · 6 min

How do you know it works
when the output keeps changing?

You can't write assert(output === expected) against a model that never repeats itself. So you test differently: build a small set of graded examples (evals), score every change against them, and wrap the live system in guardrails that catch bad output before a user ever sees it.

Eval: a repeatable test that scores model output against known-good cases. It's a unit test for behavior you can't pin to an exact string: instead of "equals X", it checks "did it pick the right category?" or "does it contain the order number?" across a fixed dataset, and reports a pass rate.
[Diagram: dataset (input + ideal answer) → model → grader → score → pass rate ✓, which gates the next change.]

Run the dataset, grade the answers, watch the pass rate. Change a prompt or model only if the number goes up.

Three ways to grade an answer

  • Exact / rule-based — for structured output: did category match the label? Cheap, objective, your first line of defense.
  • Contains / regex — did the reply include the order number, and avoid a banned phrase?
  • LLM-as-judge: use a second model call to rate fuzzy qualities ("is this helpful and on-topic, 1–5?"). Powerful, but it's a model too, so spot-check it.
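A minimal eval harness covering the first two grading styles, as a sketch: the dataset, the `classify` stand-in, and the grader names are hypothetical. An LLM-as-judge grader would be one more function with the same `Grader` signature that calls a second model.

```typescript
interface EvalCase { input: string; label: string; mustContain?: string; mustNotMatch?: RegExp }
type Grader = (output: string, c: EvalCase) => boolean

// Exact / rule-based: parse the structured reply and compare one field.
const categoryMatches: Grader = (output, c) => {
  try {
    return JSON.parse(output).category === c.label
  } catch {
    return false // unparseable output fails the case instead of crashing the run
  }
}

// Contains / regex: required text present, banned phrase absent.
const textRules: Grader = (output, c) =>
  (c.mustContain === undefined || output.includes(c.mustContain)) &&
  (c.mustNotMatch === undefined || !c.mustNotMatch.test(output))

// A case passes only if every grader agrees; returns the pass rate in [0, 1].
function runEval(dataset: EvalCase[], system: (input: string) => string, graders: Grader[]): number {
  let passed = 0
  for (const c of dataset) {
    const output = system(c.input)
    if (graders.every((g) => g(output, c))) passed++
  }
  return passed / dataset.length
}

// Hypothetical stand-in for the model call under test.
const classify = (input: string): string =>
  /charged|refund|invoice/i.test(input) ? '{"category":"billing"}' : '{"category":"other"}'

const dataset: EvalCase[] = [
  { input: "I was charged twice", label: "billing" },
  { input: "Refund for order 1234?", label: "billing", mustNotMatch: /as an AI/i },
  { input: "App crashes on login", label: "bug" },
  { input: "Do you have a dark mode?", label: "other" },
]

const passRate = runEval(dataset, classify, [categoryMatches, textRules])
```

Here the keyword stand-in misses the crash report, so the pass rate is 0.75. Re-run the same dataset after every prompt or model change and keep the change only if that number rises.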
Hallucination: output that is fluent and confident but factually wrong or invented. The model isn't malfunctioning; it's filling a gap with the most plausible-sounding text. The cure is rarely "a better prompt" alone; it's grounding the model in real data and verifying what comes back.

Guardrails — the seatbelts

Input guard

Check before you send

Reject obviously bad input and flag likely prompt-injection attempts (no filter catches them all, so keep the output guard too). Cap length so a giant paste can't blow your token budget.
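A minimal input-guard sketch. The length cap and the phrase list are hypothetical placeholders, and keyword matching is only a naive first filter: it catches the crudest injection attempts and misses any rewording, so a pass means "not obviously bad", not "safe".

```typescript
const MAX_INPUT_CHARS = 8_000 // hypothetical cap; size it from your token budget

// Naive red flags for prompt injection. Easy to evade; a first filter only.
const SUSPICIOUS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /you are now/i,
  /system prompt/i,
]

type InputVerdict = { ok: true; text: string } | { ok: false; reason: string }

function guardInput(raw: string): InputVerdict {
  const text = raw.trim()
  if (text.length === 0) return { ok: false, reason: "empty input" }
  if (text.length > MAX_INPUT_CHARS) return { ok: false, reason: "input too long" }
  if (SUSPICIOUS.some((re) => re.test(text))) return { ok: false, reason: "possible prompt injection" }
  return { ok: true, text }
}
```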

Output guard

Validate before you trust

Re-validate structured output against the schema. If a tool was "called", confirm the arguments are sane before you run it. Never eval() model text.
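A sketch of checking tool arguments before running them, using the `assign()` tool from the worked example in Part 07. The queue names, the `AssignArgs` shape, and `guardAssignArgs` are hypothetical; the idea is that the model's proposed arguments must match things you already know (a real ticket ID, a queue from a fixed allowlist) before any side effect happens.

```typescript
const QUEUES = ["billing-team", "bug-triage", "general"] as const // hypothetical queue names
type Queue = (typeof QUEUES)[number]

interface AssignArgs { ticketId: string; queue: Queue }

// Check the arguments the model proposed for assign() before anything runs.
// Returns null to mean "refuse the call" (log it, fall back, or ask a human).
function guardAssignArgs(proposed: unknown, knownTicketIds: ReadonlySet<string>): AssignArgs | null {
  if (typeof proposed !== "object" || proposed === null) return null
  const p = proposed as Record<string, unknown>
  // The ticket must be one we actually have, not an ID the model made up.
  const ticketId = p.ticketId
  if (typeof ticketId !== "string" || !knownTicketIds.has(ticketId)) return null
  // The queue must come from the fixed allowlist, never free text.
  const queue = p.queue
  if (typeof queue !== "string" || !(QUEUES as readonly string[]).includes(queue)) return null
  return { ticketId, queue: queue as Queue }
}
```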

Grounding

Give it the facts

Put the real data (the actual ticket, the actual policy) in the prompt and tell the model to answer only from it, and to say so when the answer isn't there. This is what RAG (retrieval-augmented generation) automates: it fetches the relevant passages for each question.
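A minimal sketch of grounding by hand, before any retrieval machinery: paste the real source text into the prompt, fence it off with delimiters, and give the model an explicit way to say the answer isn't there. The `<policy>` tags, the wording, and the `NOT_FOUND` sentinel are hypothetical conventions, not a standard.

```typescript
interface GroundedPrompt { system: string; user: string }

const NOT_FOUND = "NOT_FOUND"

function buildGroundedPrompt(policyText: string, question: string): GroundedPrompt {
  const system = [
    "You answer questions about our refund policy.",
    "Use ONLY the text inside <policy> tags. Do not use outside knowledge.",
    `If the policy does not contain the answer, reply exactly: ${NOT_FOUND}`,
  ].join("\n")
  const user = `<policy>\n${policyText}\n</policy>\n\nQuestion: ${question}`
  return { system, user }
}

// After the call, treat the sentinel as a real outcome, not an error.
function interpretAnswer(reply: string): { found: boolean; answer?: string } {
  const trimmed = reply.trim()
  return trimmed === NOT_FOUND ? { found: false } : { found: true, answer: trimmed }
}
```

RAG keeps this shape and replaces the hand-pasted `policyText` with passages retrieved for each question.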

Like a kitchen: evals are tasting against the recipe before service; guardrails are the health inspector who stops a bad plate reaching the table.

06 · The tooling: models & SDKs · 6 min

Pick a model and an SDK
with the trade-offs in view.

Two choices shape an LLM app: which model answers the prompt, and which SDK you build with. Neither is permanent — a good design lets you swap models behind one interface — but knowing the landscape keeps you from cargo-culting whatever the last blog post used.

The split that matters — a model is the brain you call over an API (or run yourself); an SDK is the library in your app that formats the call, streams the reply, and wires up tools. You usually pick one SDK and keep two or three models a config flag apart.

Models — the brains

Anthropic Claude

Opus · Sonnet · Haiku

Pro: strong reasoning, coding, and tool use; a tier for each need — Opus (deepest), Sonnet (balanced), Haiku (fast/cheap).

Con: top tier costs more per token than the small open models.

OpenAI GPT

The default many reach for

Pro: huge ecosystem, mature tooling, broad familiarity across teams.

Con: single vendor; capability and pricing tiers shift, so pin versions.

Google Gemini

Long context & Google stack

Pro: very large context windows and tight fit if you already live on Google Cloud.

Con: tooling and behavior differ from the others — budget porting time.

Open-weight

Llama · Mistral · DeepSeek

Pro: run them yourself — privacy, no per-token bill, full control; great for high volume.

Con: you own the GPUs, scaling, and ops; top quality still trails the best closed models.

How to choose: prototype on a strong hosted model (Sonnet / GPT / Gemini), then drop to a cheaper or open model per task once your evals prove the quality holds. Route easy calls to small models, hard ones to big.
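Routing can start as a plain lookup table, which is also what keeps models "a config flag apart". A sketch: the tier names, task names, and model IDs are hypothetical placeholders (use the IDs your provider documents), and in a real app the tables would live in config so an eval-backed switch needs no code change.

```typescript
type Tier = "small" | "large"
type Task = "classify" | "extract" | "summarize" | "reason"

// Placeholder model IDs; substitute real ones from your provider's docs.
const MODELS: Record<Tier, string> = {
  small: "small-fast-model",
  large: "large-capable-model",
}

// Which tier each task needs. Move a task to "small" only after evals show quality holds.
const ROUTES: Record<Task, Tier> = {
  classify: "small",
  extract: "small",
  summarize: "large",
  reason: "large",
}

// Per-call overrides let you trial a cheaper tier for one task without editing the table.
function pickModel(task: Task, overrides: Partial<Record<Task, Tier>> = {}): string {
  const tier = overrides[task] ?? ROUTES[task]
  return MODELS[tier]
}
```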

SDKs & frameworks — the wiring

Vercel AI SDK — app integration, TypeScript-first

```typescript
import { generateText } from "ai"
import { anthropic } from "@ai-sdk/anthropic"

declare const prompt: string

const { text } = await generateText({
  model: anthropic("claude-..."),
  prompt,
})
// same code; swap the provider import (e.g. openai from "@ai-sdk/openai") to change model
```
Pro
One typed API across providers; first-class streaming and React hooks; swap models by changing one import.
Con
TypeScript/JS world; lighter on heavy data-pipeline plumbing than the Python frameworks.
Pick when
You're shipping a web or Node app and want streaming UI fast.

LangChain / LangGraph — orchestration, Python-leaning

Pro
Batteries-included building blocks for chains, agents, memory, and many integrations; LangGraph adds explicit, stateful multi-step graphs.
Con
Big surface area and abstractions can hide what's actually sent to the model — easy to over-engineer a job a few API calls would do.
Pick when
You're building complex, multi-step agent workflows and want pre-built orchestration.

LlamaIndex — data & retrieval, Python-leaning

Pro
Purpose-built for connecting your data to an LLM — loaders, indexing, chunking, and retrieval for RAG done well.
Con
Narrower than a general framework; for a plain chat or tool-calling feature it's more than you need.
Pick when
The hard part is retrieving the right context from large or messy data, not the chat loop.

The honest default

Start with the Vercel AI SDK talking to a strong hosted model. Reach for LangGraph only when orchestration genuinely gets multi-step, and LlamaIndex when retrieval over your own data is the real problem. Most first features need none of the heavy frameworks — an SDK and a good prompt go a long way.

07 · A worked LLM feature + recap · 4 min

One feature, end to end —
then five things to walk out with.

Let's assemble the pieces into a real, small feature: auto-triage incoming support tickets. Each idea from Parts 02 to 05 shows up once.

[Diagram: ticket in → prompt + schema → model → JSON → validate ✓ → tool: assign(), with evals running on a fixed dataset alongside.]

Prompt → structured call → validate → tool. Evals sit beside it as the safety net for every change.

The five moving parts

  • Prompt (02): a system message gives the model the triage role and rules, with one few-shot example.
  • Structured output (03): a schema returns { category, urgency }, typed and parseable.
  • Tool calling (03): an assign() tool routes the ticket to the right queue with real data.
  • Guardrails (05): re-validate the JSON; clamp urgency to 1–5 before acting.
  • Evals (05): 30 labelled tickets gate every prompt or model change.
The human-facing summary reply is streamed (04) so it appears as it is written.
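The non-streaming path of that feature fits in one function. A self-contained sketch: `callModel` and `assign` are hypothetical stand-ins for the structured model call and the routing tool, injected as parameters so the same code runs against the real SDK in production and against fakes in an eval. For brevity, your code calls `assign()` directly after classification; in a full tool-calling setup the model would request it and its arguments would be guarded the same way. The fallback triage for unusable replies is also a hypothetical choice.

```typescript
type Category = "billing" | "bug" | "other"
interface Triage { category: Category; urgency: number }

interface Deps {
  callModel: (system: string, user: string) => Promise<string> // structured call, returns JSON text
  assign: (ticketId: string, t: Triage) => Promise<void>      // the assign() tool
}

// Prompt (02): role, rules, and one few-shot example
const SYSTEM = `You are a support-triage assistant.
Return JSON {"category": "billing"|"bug"|"other", "urgency": 1-5}. If unsure, use "other".
Example: "I was charged twice" -> {"category":"billing","urgency":3}`

// Guardrails (05): re-validate the JSON, then clamp urgency to 1-5
function parseTriage(raw: string): Triage | null {
  try {
    const obj = JSON.parse(raw)
    if (!["billing", "bug", "other"].includes(obj?.category)) return null
    if (typeof obj.urgency !== "number" || !Number.isFinite(obj.urgency)) return null
    return { category: obj.category, urgency: Math.min(5, Math.max(1, Math.round(obj.urgency))) }
  } catch {
    return null // not JSON at all
  }
}

async function triageTicket(ticketId: string, body: string, deps: Deps): Promise<Triage> {
  const raw = await deps.callModel(SYSTEM, `Email: ${body}`) // structured output (03)
  const triage: Triage = parseTriage(raw) ?? { category: "other", urgency: 3 } // hypothetical fallback
  await deps.assign(ticketId, triage) // route the ticket (03)
  return triage
}
```

Because the model sits behind `callModel`, the 30-ticket eval set can drive `triageTicket` with recorded or live replies and score the returned `Triage` objects directly.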

Five rules to walk out with

1. Design for non-determinism. Validate shape and properties, never exact strings. Acceptable-often-enough is the bar.
2. The prompt is your API. Role, rules, format, and a few-shot example are the highest-leverage things you control.
3. Ask for structure; let it call your code. Typed JSON and tool calling turn a chat box into a real building block.
4. Stream for feel; guard for trust. Tokens early, validation late; never run model output unchecked.
5. Evals before opinions. A small graded dataset turns "feels better" into a number you can ship on.


One sentence to remember

"Treat the model as a fast, fallible colleague — give it clear instructions, then check its work."
