Building LLM Apps · GuideDeck

01 · Why building with LLMs is different 4 min

You're calling a function
that's a fuzzy black box.

Most of the code you've written so far is deterministic: same input, same output, every single time. An LLM breaks that promise. The same prompt can return different text on each call, and no one, not even the people who built it, can tell you exactly why. Building well means designing around that uncertainty instead of pretending it isn't there.

An LLM (Large Language Model) is a program trained on enormous amounts of text to do one thing: given some text, predict the most likely next chunk of text, over and over. It has no database of facts and no logic engine inside it; it's a very good pattern-completer. That single idea explains both its magic and its failure modes.

The mental model that changes

A normal function is a vending machine: press B4, get the same snack forever.
An LLM is more like asking a sharp, fast colleague — brilliant, occasionally confidently wrong, never word-for-word the same twice.
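Temperature is a dial (commonly 0 to 1; some APIs allow up to 2) for how much randomness the model adds when it picks each token. Low = steadier and more repetitive; high = more varied and creative. Even near 0, identical output across calls is not guaranteed.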
So you stop asking "is the output correct?" and start asking "is it acceptable, often enough?"

Same call, a spread of plausible answers — most good, some wrong. Design for the spread, not a single value.
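To make the temperature dial concrete, here is a minimal sketch of how dividing scores by a temperature reshapes a next-token probability distribution. The scores and the `softmaxWithTemperature` helper are illustrative assumptions, not any provider's actual implementation.

```typescript
// Hypothetical raw scores (logits) for three candidate next tokens.
const exampleLogits = [2.0, 1.0, 0.1];

// Convert scores to probabilities. Dividing by the temperature first
// sharpens the distribution (low T) or flattens it (high T).
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    throw new Error("temperature must be > 0 in this sketch");
  }
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Low temperature: the top token takes almost all of the probability mass.
const steady = softmaxWithTemperature(exampleLogits, 0.2);
// Higher temperature: the alternatives get a real chance of being sampled.
const varied = softmaxWithTemperature(exampleLogits, 1.0);
```

Sampling from `varied` instead of `steady` is what produces the spread of answers described above.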

Non-determinism

Plan for variety

Never assert exact string equality. Validate shape and properties ("is it valid JSON with these fields?"), not the precise words.
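One way to check shape rather than wording is a small parser that accepts any reply with the right fields. This is a sketch under assumptions: the `TicketSummary` shape, the 200-character limit, and the sample replies are all hypothetical.

```typescript
// Hypothetical output shape for a ticket classifier (an assumption for this sketch).
interface TicketSummary {
  category: "billing" | "bug" | "other";
  summary: string;
}

const ALLOWED_CATEGORIES = ["billing", "bug", "other"];

// Returns the parsed object if the text is valid JSON with the expected
// fields and properties, otherwise null. It never compares exact wording.
function parseTicketSummary(raw: string): TicketSummary | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const { category, summary } = data as Record<string, unknown>;
  if (typeof category !== "string" || !ALLOWED_CATEGORIES.includes(category)) return null;
  if (typeof summary !== "string" || summary.length === 0 || summary.length > 200) return null;
  return { category: category as TicketSummary["category"], summary };
}

// Two differently worded replies both pass; a prose reply does not.
const replyA = parseTicketSummary('{"category":"billing","summary":"Customer was charged twice."}');
const replyB = parseTicketSummary('{"category":"billing","summary":"Duplicate charge on the account."}');
const replyC = parseTicketSummary("Sure! This looks like a billing issue.");
```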

Hallucination

Confidently wrong

The model will sometimes invent facts, names, or APIs that sound right. It is not lying — it is completing a pattern. Ground it and check it (Part 05).

Cost & latency

Tokens add up

You pay per token (a chunk of text, roughly ¾ of an English word) for both input and output, and bigger models generally answer more slowly. Picking a model is a real engineering trade-off, not an afterthought.
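A back-of-the-envelope estimate shows how quickly tokens add up. The sketch below uses the ¾-word rule of thumb from the text; the prices, call sizes, and daily volume are made-up numbers, so substitute your provider's real pricing.

```typescript
// Rough rule from the text: one token is about 3/4 of an English word.
function estimateTokens(text: string): number {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0).length;
  return Math.ceil(words / 0.75);
}

// Prices are in USD per million tokens. Input and output are priced separately.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

function estimateCostUsd(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1000000;
}

// Hypothetical pricing and traffic, for illustration only.
const hypotheticalPricing: Pricing = { inputPerMTok: 3, outputPerMTok: 15 };
const costPerCall = estimateCostUsd(1500, 300, hypotheticalPricing); // about $0.009
const costPerDay = costPerCall * 10000; // about $90 at 10,000 calls a day
```

A fraction of a cent per call looks harmless until it is multiplied by real traffic, which is why model choice belongs in the design discussion.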

02 · Prompting that works 6 min

The prompt is your API —
write it like one.

With a normal library you read the function signature. With an LLM, the prompt is the interface you design: it sets the role, the rules, the examples, and the shape you want back. Vague prompt, vague product. A precise, structured prompt is the single highest-leverage thing you control.

Prompt: everything you send the model for one call. It usually splits into a system message (the standing instructions: who the model is and the rules it must follow) and a user message (the actual request for this turn). Think job description vs. today's ticket.

System = the standing contract; user = the changing request. Keep rules in the system message so every turn obeys them.

Four habits of a good prompt

Give it a role. "You are a support-triage assistant" steers tone and judgement more than a page of rules.
Be specific about the output. Say the format, the length, and what to do when unsure ("reply UNKNOWN").
Show, don't just tell. One or two worked examples (few-shot) beat paragraphs of description.
State the guardrails. What it must never do, and how to behave when the input is missing or hostile.

Few-shot prompting — putting a few solved examples in the prompt so the model copies the pattern. "Zero-shot" is just asking; "few-shot" is asking and showing two or three input→output pairs. It is the cheapest accuracy upgrade you have.

Vague — you'll get vague back

// no role, no format, no examples
const prompt = "Sort out this support email and tell me what it's about."
// → rambling paragraph, different shape every call,
//   impossible to parse or trust downstream

Specific — role, rules, shape

const system = `You are a support-triage assistant.
Classify each email. category ∈ {billing, bug, other}.
If unsure, use "other". Reply with JSON only.`

// few-shot: show one solved example
// in: "I was charged twice"   out: {"category":"billing"}

const user = `Email: ${ticket.body}`

Like onboarding a new hire: a clear role, a couple of worked examples, and the do-not-do list get you a useful first day.
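The few-shot habit can be expressed as a message list: system rules first, then each solved example as a user/assistant pair, then the real request. This is a sketch assuming a generic chat-style message format; `buildFewShotMessages` and the second example are hypothetical.

```typescript
type ChatRole = "system" | "user" | "assistant";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

interface SolvedExample {
  input: string;
  output: string;
}

// Lays out: system rules, then each solved example as a user/assistant pair,
// then the real request last so the model continues the pattern.
function buildFewShotMessages(
  system: string,
  examples: SolvedExample[],
  email: string,
): ChatMessage[] {
  const messages: ChatMessage[] = [{ role: "system", content: system }];
  for (const ex of examples) {
    messages.push({ role: "user", content: `Email: ${ex.input}` });
    messages.push({ role: "assistant", content: ex.output });
  }
  messages.push({ role: "user", content: `Email: ${email}` });
  return messages;
}

const fewShotMessages = buildFewShotMessages(
  "Classify each email. category ∈ {billing, bug, other}. Reply with JSON only.",
  [
    { input: "I was charged twice", output: '{"category":"billing"}' },
    { input: "The export button does nothing", output: '{"category":"bug"}' },
  ],
  "Where can I download my invoice?",
);
```

Keeping the examples in data rather than in a hand-written string makes it easy to add or swap them when an eval shows a weak spot.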

03 · Structured output & tool calling 6 min

Get typed JSON back —
and let the model call your code.

Prose is great for humans and useless for the rest of your program. Two features turn a chat toy into a building block: structured output (the model returns data that matches a schema you define) and tool calling (the model asks your functions to run and uses the results).

Structured output: forcing the model to answer as JSON that fits a schema you supply. You hand it a shape (often a zod schema or a JSON Schema), the SDK constrains the model to it, and you get back a typed object you can use directly, with no fragile string-parsing and no "please respond in JSON" and hope.

import { generateObject } from "ai"
import { z } from "zod"

const { object } = await generateObject({
  model,
  schema: z.object({
    category: z.enum(["billing", "bug", "other"]),
    urgency: z.number().min(1).max(5),
  }),
  prompt: ticket.body,
})
// object.category is typed as "billing" | "bug" | "other"; use it directly

The schema is the contract. You get a typed object back instead of a paragraph you have to parse and pray over.

Tool calling (a.k.a. function calling) — you describe functions the model is allowed to invoke (name, description, argument schema). When the model decides it needs one, it doesn't run anything — it returns a request like getWeather({city:"Berlin"}). Your code runs the real function and hands the result back, then the model finishes the answer with real data.

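import { generateText } from "ai"
import { z } from "zod"

const { text } = await generateText({
  model,
  prompt: "How many open billing tickets today?",
  tools: {
    countTickets: {
      description: "Count tickets by status",
      parameters: z.object({ status: z.string() }),
      execute: async ({ status }) => db.count(status),
    },
  },
})
// model asks → your code runs → model answers with the number
// Note: key names depend on the SDK version. Newer AI SDK releases name the
// schema `inputSchema`, and getting a final answer after the tool result
// needs a multi-step setting (`maxSteps` or `stopWhen`). Check your version.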

The model never touches your database. It requests a tool; you run it and return the result; it answers with real data.
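The request-run-return cycle can be sketched without any SDK. Everything here is an assumption for illustration: `fakeModel` is a scripted stand-in for a real model, and `runWithTools` is a hypothetical helper, not how any particular library implements the loop.

```typescript
// A model turn is either a request to run a tool or a final text answer.
type ModelTurn =
  | { type: "tool_call"; name: string; args: Record<string, unknown> }
  | { type: "text"; text: string };

type Tool = (args: Record<string, unknown>) => string;
type Model = (transcript: string[]) => ModelTurn;

function runWithTools(
  model: Model,
  tools: Map<string, Tool>,
  prompt: string,
  maxSteps = 3,
): string {
  const transcript: string[] = [`user: ${prompt}`];
  for (let step = 0; step < maxSteps; step++) {
    const turn = model(transcript);
    if (turn.type === "text") return turn.text; // the model is done
    const tool = tools.get(turn.name);
    if (tool === undefined) {
      throw new Error(`Model requested unknown tool: ${turn.name}`);
    }
    // Our code runs the function; the model only asked for it.
    const result = tool(turn.args);
    transcript.push(`tool ${turn.name} -> ${result}`);
  }
  throw new Error("Model did not finish within maxSteps");
}

// Scripted stand-in for a real model: asks for the tool, then answers.
const TOOL_PREFIX = "tool countTickets -> ";
const fakeModel: Model = (transcript) => {
  const last = transcript[transcript.length - 1];
  if (last.startsWith(TOOL_PREFIX)) {
    const count = last.slice(TOOL_PREFIX.length);
    return { type: "text", text: `There are ${count} open billing tickets.` };
  }
  return { type: "tool_call", name: "countTickets", args: { status: "open" } };
};

const ticketTools = new Map<string, Tool>([["countTickets", () => "7"]]);
const toolAnswer = runWithTools(fakeModel, ticketTools, "How many open billing tickets today?");
```

The step cap matters in practice: without it, a model that keeps requesting tools would loop forever and keep billing tokens.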

MCP · Model Context Protocol

One open standard for plugging tools in

Wiring tools by hand into every app gets repetitive. MCP is an open standard (introduced by Anthropic) that lets any host — Claude Desktop, Claude Code, an IDE extension — connect to a server that exposes tools, resources, and prompts over a shared protocol. Transports are stdio (local) and streamable HTTP (remote). Build the server once; every MCP host can use it. We go deep in the MCP deck.
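As a rough illustration of what "a shared protocol" means on the wire: MCP messages are JSON-RPC 2.0, and a host discovers and invokes tools with the `tools/list` and `tools/call` methods. The sketch below only builds those request objects. The `countTickets` tool and its arguments are hypothetical, and a real host would normally use an MCP client library rather than hand-built messages.

```typescript
// Minimal JSON-RPC 2.0 request shape, as used by MCP.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Ask the server which tools it exposes.
const listToolsRequest: JsonRpcRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/list",
};

// Ask the server to run one tool with the given arguments.
function toolsCallRequest(
  id: number,
  name: string,
  args: Record<string, unknown>,
): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Hypothetical tool name and arguments, for illustration only.
const callToolRequest = toolsCallRequest(2, "countTickets", { status: "open" });
const wireMessage = JSON.stringify(callToolRequest);
```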

04 · Streaming & UX 4 min

Show tokens as they arrive,
not after a long wait.

A good answer can take many seconds to finish. If you wait for the whole thing before showing anything, the app feels broken. Streaming sends the reply token-by-token so the user sees words appear immediately — the same total time, a completely different feel.

Streaming — delivering the model's reply in small pieces as it is generated rather than in one final blob. The key metric becomes time to first token (how fast something shows up), not just total time. It is the single biggest perceived-speed win in an LLM UI.

Both finish at the same moment. Streaming just stops the user from staring at a dead spinner the whole way there.

Streaming in practice

import { streamText } from "ai"

const result = streamText({
  model,
  prompt: ticket.body,
})
// pipe straight to the browser; UI renders as it flows
return result.toUIMessageStreamResponse()

Render a cursor while tokens flow; it reads as "thinking out loud".
Let users stop. A cancel button that aborts the request saves tokens and frustration.
Heads-up: you can't validate a half-finished answer, so do final checks once the stream completes.
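The points above can be sketched as a small consumer that renders each chunk on arrival, records time to first token, and only hands back the full text at the end. This is an SDK-independent sketch: `StreamRenderer` is a hypothetical helper, and the clock is injected so the timing is reproducible.

```typescript
class StreamRenderer {
  private text = "";
  private readonly startedAt: number;
  private firstChunkAt: number | null = null;
  private readonly now: () => number;
  private readonly render: (textSoFar: string) => void;

  constructor(now: () => number, render: (textSoFar: string) => void) {
    this.now = now;
    this.render = render;
    this.startedAt = now();
  }

  // Call for every chunk as it arrives; the UI updates immediately.
  push(chunk: string): void {
    if (this.firstChunkAt === null) this.firstChunkAt = this.now();
    this.text += chunk;
    this.render(this.text);
  }

  // Call once when the stream ends; only now is the full text safe to validate.
  finish(): { text: string; timeToFirstTokenMs: number | null } {
    const ttft = this.firstChunkAt === null ? null : this.firstChunkAt - this.startedAt;
    return { text: this.text, timeToFirstTokenMs: ttft };
  }
}

// Simulated stream with a fake clock (milliseconds).
let fakeClock = 0;
const frames: string[] = [];
const renderer = new StreamRenderer(() => fakeClock, (t) => { frames.push(t); });
fakeClock = 120;
renderer.push("Refund ");
fakeClock = 300;
renderer.push("issued.");
const finished = renderer.finish();
```

Here the user sees "Refund " after 120 ms instead of waiting 300 ms for the whole reply, and any validation runs against `finished.text` rather than a partial frame.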

05 · Evals & guardrails 6 min

How do you know it works
when the output keeps changing?

You can't write assert(output === expected) against a model that never repeats itself. So you test differently: build a small set of graded examples (evals), score every change against them, and wrap the live system in guardrails that catch bad output before a user ever sees it.

Eval — a repeatable test that scores model output against known-good cases. It's a unit test for behavior you can't pin to an exact string: instead of "equals X", it checks "did it pick the right category?" or "does it contain the order number?" across a fixed dataset, and reports a pass rate.

Run the dataset, grade the answers, watch the pass rate. Change a prompt or model only if the number goes up.

Three ways to grade an answer

Exact / rule-based: for structured output, did category match the label? Cheap, objective, your first line of defense.
Contains / regex: did the reply include the order number, and avoid a banned phrase?
LLM-as-judge: use a second model call to rate fuzzy qualities ("is this helpful and on-topic, 1–5?"). Powerful, but it's a model too, so spot-check it.
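The first two grading styles fit in a few lines. In this sketch the dataset is invented and `fakeClassifier` is a keyword-based stand-in for a real model call, so the pass rate only illustrates the mechanics.

```typescript
interface EvalCase {
  input: string;
  expected: string;
}

type Grader = (output: string, expected: string) => boolean;

// Exact / rule-based: the output must equal the label.
const exactMatch: Grader = (output, expected) => output.trim() === expected;

// Contains: the output must mention the expected text somewhere.
const containsExpected: Grader = (output, expected) =>
  output.toLowerCase().includes(expected.toLowerCase());

// Runs every case through the system under test and reports the pass rate.
function passRate(cases: EvalCase[], run: (input: string) => string, grade: Grader): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => grade(run(c.input), c.expected)).length;
  return passed / cases.length;
}

// Stand-in for a model call (an assumption so the sketch runs offline).
const fakeClassifier = (email: string): string =>
  email.includes("charged") ? "billing" : email.includes("crash") ? "bug" : "other";

// Hypothetical labelled dataset.
const evalDataset: EvalCase[] = [
  { input: "I was charged twice", expected: "billing" },
  { input: "The app crashes on login", expected: "bug" },
  { input: "How do I export my data?", expected: "other" },
  { input: "My invoice total looks wrong", expected: "billing" },
];

const classifierPassRate = passRate(evalDataset, fakeClassifier, exactMatch);
```

The stand-in misses the invoice ticket, so the pass rate is 3 out of 4. A prompt or model change would be judged by whether that number moves.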

Hallucination — output that is fluent and confident but factually wrong or invented. The model isn't malfunctioning; it's filling a gap with the most plausible-sounding text. The cure is rarely "a better prompt" alone — it's grounding the model in real data and verifying what comes back.

Guardrails — the seatbelts

Input guard

Check before you send

Strip or reject prompt-injection attempts and obviously bad input. Cap length so a giant paste can't blow your token budget.
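A minimal input guard might look like the sketch below. The character cap and the two patterns are hypothetical and deliberately simplistic; real injection attempts vary widely, so treat pattern matching as one layer rather than a complete defense.

```typescript
const MAX_INPUT_CHARS = 4000; // hypothetical budget

// Hypothetical, deliberately small pattern list.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /reveal (your )?system prompt/i,
];

type GuardResult =
  | { ok: true; text: string }
  | { ok: false; reason: string };

function guardInput(raw: string): GuardResult {
  const text = raw.trim();
  if (text.length === 0) {
    return { ok: false, reason: "empty input" };
  }
  if (SUSPICIOUS_PATTERNS.some((p) => p.test(text))) {
    return { ok: false, reason: "possible prompt injection" };
  }
  // Cap length so a giant paste cannot blow the token budget.
  return { ok: true, text: text.slice(0, MAX_INPUT_CHARS) };
}
```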

Output guard

Validate before you trust

Re-validate structured output against the schema. If a tool was "called", confirm the arguments are sane before you run it. Never eval() model text.

Grounding

Give it the facts

Put the real data (the actual ticket, the actual policy) in the prompt and tell the model to answer only from it, and to say so when the answer isn't there. This is what RAG (retrieval-augmented generation) automates.
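Grounding is mostly prompt assembly. The sketch below builds a prompt from supplied sources and tells the model what to reply when the answer is missing; `buildGroundedPrompt`, the source ids, and the policy text are hypothetical.

```typescript
interface SourceDoc {
  id: string;
  text: string;
}

// Puts the real data in the prompt and restricts the model to it.
function buildGroundedPrompt(question: string, sources: SourceDoc[]): string {
  const context = sources.map((s) => `[${s.id}] ${s.text}`).join("\n");
  return [
    "Answer using only the sources below.",
    'If the answer is not in the sources, reply exactly "UNKNOWN".',
    "",
    "Sources:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Hypothetical policy snippet, for illustration only.
const groundedPrompt = buildGroundedPrompt("Can I get a refund after 40 days?", [
  { id: "policy-1", text: "Refunds are available within 30 days of purchase." },
]);
```

The explicit "UNKNOWN" escape hatch gives downstream code a value it can detect, instead of a plausible-sounding guess.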

Like a kitchen: evals are tasting against the recipe before service; guardrails are the health inspector who stops a bad plate reaching the table.

06 · The tooling — models & SDKs 6 min

Pick a model and an SDK —
with the trade-offs in view.

Two choices shape an LLM app: which model answers the prompt, and which SDK you build with. Neither is permanent — a good design lets you swap models behind one interface — but knowing the landscape keeps you from cargo-culting whatever the last blog post used.

The split that matters — a model is the brain you call over an API (or run yourself); an SDK is the library in your app that formats the call, streams the reply, and wires up tools. You usually pick one SDK and keep two or three models a config flag apart.

Models — the brains

Anthropic Claude

Opus · Sonnet · Haiku

Pro: strong reasoning, coding, and tool use; a tier for each need — Opus (deepest), Sonnet (balanced), Haiku (fast/cheap).

Con: top tier costs more per token than the small open models.

OpenAI GPT

The default many reach for

Pro: huge ecosystem, mature tooling, broad familiarity across teams.

Con: single vendor; capability and pricing tiers shift, so pin versions.

Google Gemini

Long context & Google stack

Pro: very large context windows and tight fit if you already live on Google Cloud.

Con: tooling and behavior differ from the others — budget porting time.

Open-weight

Llama · Mistral · DeepSeek

Pro: run them yourself — privacy, no per-token bill, full control; great for high volume.

Con: you own the GPUs, scaling, and ops; top quality still trails the best closed models.

How to choose: prototype on a strong hosted model (Sonnet / GPT / Gemini), then drop to a cheaper or open model per task once your evals prove the quality holds. Route easy calls to small models, hard ones to big.
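Routing can start as a plain function. In this sketch the task names, the 2,000-token threshold, and the model identifiers are all assumptions; in practice your evals decide which tasks a small model handles well enough.

```typescript
type ModelTier = "small" | "large";

interface RouteRequest {
  task: "classify" | "summarize" | "draft_reply";
  inputTokens: number;
}

// Hypothetical model identifiers; substitute whatever your provider offers.
const MODEL_BY_TIER: Record<ModelTier, string> = {
  small: "small-fast-model",
  large: "large-capable-model",
};

// Easy, short calls go to the small model; everything else goes to the large one.
function pickTier(req: RouteRequest): ModelTier {
  if (req.task === "classify" && req.inputTokens < 2000) return "small";
  return "large";
}

function pickModel(req: RouteRequest): string {
  return MODEL_BY_TIER[pickTier(req)];
}
```

Keeping the mapping in one table means a model swap is a one-line config change rather than a code hunt.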

SDKs & frameworks — the wiring

Vercel AI SDK — app integration, TypeScript-first

import { generateText } from "ai"
import { anthropic } from "@ai-sdk/anthropic"

const { text } = await generateText({
  model: anthropic("claude-..."),
  prompt,
})
// same code; swap provider import to change model

Pro

One typed API across providers; first-class streaming and React hooks; swap models by changing one import.

Con

TypeScript/JS world; lighter on heavy data-pipeline plumbing than the Python frameworks.

Pick when

You're shipping a web or Node app and want streaming UI fast.

LangChain / LangGraph — orchestration, Python-leaning

Pro

Batteries-included building blocks for chains, agents, memory, and many integrations; LangGraph adds explicit, stateful multi-step graphs.

Con

Big surface area and abstractions can hide what's actually sent to the model — easy to over-engineer a job a few API calls would do.

Pick when

You're building complex, multi-step agent workflows and want pre-built orchestration.

LlamaIndex — data & retrieval, Python-leaning

Pro

Purpose-built for connecting your data to an LLM — loaders, indexing, chunking, and retrieval for RAG done well.

Con

Narrower than a general framework; for a plain chat or tool-calling feature it's more than you need.

Pick when

The hard part is retrieving the right context from large or messy data, not the chat loop.

The honest default

Start with the Vercel AI SDK talking to a strong hosted model. Reach for LangGraph only when orchestration genuinely gets multi-step, and LlamaIndex when retrieval over your own data is the real problem. Most first features need none of the heavy frameworks — an SDK and a good prompt go a long way.

07 · A worked LLM feature + recap 4 min

One feature, end to end —
then five things to walk out with.

Let's assemble the pieces into a real, small feature: auto-triage incoming support tickets. Every idea from this deck shows up exactly once.

Prompt → structured call → validate → tool. Evals sit beside it as the safety net for every change.

The five moving parts

Prompt (02) — a system message gives the model the triage role and rules, with one few-shot example.
Structured output (03) — a schema returns { category, urgency }, typed and parseable.
Tool calling (03) — an assign() tool routes the ticket to the right queue with real data.
Guardrails (05) — re-validate the JSON; clamp urgency to 1–5 before acting.
Evals (05) — 30 labelled tickets gate every prompt or model change. Stream the agent reply (04) for the human-facing summary.
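The moving parts above can be wired together in one function. This is a sketch under stated assumptions: the model call and the assign step are injected as plain functions so it runs offline, the queue names are hypothetical, and falling back to "other" with urgency 3 on invalid output is one possible policy, not the only one.

```typescript
type TicketCategory = "billing" | "bug" | "other";

interface Triage {
  category: TicketCategory;
  urgency: number;
}

// Hypothetical queue names, for illustration only.
const QUEUE_BY_CATEGORY: Record<TicketCategory, string> = {
  billing: "finance-queue",
  bug: "engineering-queue",
  other: "general-queue",
};

function isCategory(value: unknown): value is TicketCategory {
  return value === "billing" || value === "bug" || value === "other";
}

// Guardrail: re-validate the model's JSON and clamp urgency to 1..5.
function validateTriage(raw: string): Triage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { category, urgency } = data as Record<string, unknown>;
  if (!isCategory(category) || typeof urgency !== "number") return null;
  return { category, urgency: Math.min(5, Math.max(1, Math.round(urgency))) };
}

// callModel and assign are injected so the flow can run without a real API.
function triageTicket(
  body: string,
  callModel: (body: string) => string,
  assign: (queue: string, triage: Triage) => void,
): Triage {
  // Assumed fallback: unparseable output goes to "other" at middling urgency.
  const triage: Triage = validateTriage(callModel(body)) ?? { category: "other", urgency: 3 };
  assign(QUEUE_BY_CATEGORY[triage.category], triage);
  return triage;
}

const assignedQueues: string[] = [];
const recordQueue = (queue: string): void => { assignedQueues.push(queue); };
const goodTriage = triageTicket("I was charged twice", () => '{"category":"billing","urgency":9}', recordQueue);
const badTriage = triageTicket("???", () => "Sure! It's billing.", recordQueue);
```

The out-of-range urgency of 9 is clamped to 5 before routing, and the prose reply never reaches the assign step as if it were data.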

Five rules to walk out with

1. Design for non-determinism. Validate shape and properties, never exact strings. Acceptable-often-enough is the bar.

2. The prompt is your API. Role, rules, format, and a few-shot example are the highest-leverage things you control.

3. Ask for structure; let it call your code. Typed JSON and tool calling turn a chat box into a real building block.

4. Stream for feel; guard for trust. Tokens early, validation late; never run model output unchecked.

5. Evals before opinions. A small graded dataset turns "feels better" into a number you can ship on.

Where to go next

RAG & Vector Search — ground answers in your own data.
AI Agents & Tool Use — the loop that chains many tool calls.
MCP — the open standard for plugging tools and data into any host.

One sentence to remember

"Treat the model as a fast, fallible colleague — give it clear instructions, then check its work."

Knowledge check

Did it stick?

Five quick questions on non-determinism, prompting, structured output, streaming, and evals — instant feedback, no sign-in.

