Library
00/07 · ~30 min
GUIDEDECK · for getting reliable work out of a model

Prompt
Engineering
that holds up.

A 30-minute working session on writing prompts that behave the same way twice — from how a model actually reads your text, through clear instructions, examples, structured output and tool calling, reasoning techniques, and guarding against prompt injection.

~30 MINBEGINNER → INTERMEDIATEMODEL-AGNOSTIC
SCROLL
01 · How a model reads a prompt 4 min

A model doesn't read your words —
it predicts the next token.

Before you can steer a model, it helps to know what it's actually doing: chopping your text into tokens, fitting everything into a fixed context window, and spreading its attention unevenly across your instructions. Good prompting is mostly working with these three facts instead of against them.

Prompteverything you send the model in one request: the system instructions, any examples, the running conversation, retrieved documents, and the user's actual question. The model sees it all as one long stream of text and continues it. There is no hidden memory — if it isn't in the prompt, the model doesn't know it.

1 · Text becomes tokens

  • A token is a chunk of text — roughly ¾ of a word in English. Summarize might split into Sum + mar + ize.
  • The model only ever sees token IDs, never letters — which is why it can fumble spelling, character counts, and rare words.
  • Tokens are also the unit you pay for and the unit the context window is measured in.
"Summarize this email" RAW TEXT tokenize Sum mar ize this MODEL predicts next token

Your text is split into tokens; the model reads those, then generates the most likely next token, one at a time.

CONTEXT WINDOW system prompt examples conversation history user question room for the answer

Everything competes for one fixed budget. Fill it with history and there's no room left for the model to answer.

2 · Everything shares one budget

  • The context window is the maximum number of tokens a model can consider at once — today commonly hundreds of thousands of tokens, but always finite.
  • System prompt, examples, chat history, retrieved docs and the reply all share it. Overflow and the oldest tokens fall off.
  • Attention is uneven: models weight the start and end of a prompt more than the middle. Put the most important instruction where it will be seen — often last.

The model is one piece of a larger system — retries, streaming, and memory live in the app around it. That layer is its own topic: Building LLM Apps.

02 · Clear instructions & roles 4 min

Vague prompts get vague answers.
Be specific about the job.

The single biggest lever in prompting is also the most boring: say exactly what you want. State the role, the audience, the format, the length, and what to do when the model is unsure. A model fills ambiguity with its average guess — your job is to leave less to guess.

System promptstanding instructions that frame every turn: who the model is, the rules it must follow, and the output shape you expect. The user prompt carries the specific request. Keep durable rules in the system prompt; keep the changing question in the user prompt.
Vague — the model has to guess
# one line, no context Write something about our new pricing.
Specific — the model has a brief
# role · audience · format · length · guardrail You are a support engineer writing to existing customers. Explain the new pricing in 3 short bullet points. Friendly, plain English, no jargon. If a detail is missing, say so — do not invent numbers.
SYSTEM role · rules · tone · format DEVELOPER task framing · constraints USER the actual question

Outer layers set the rules; inner layers fill in the request. Trust decreases as you move inward — important for Part 6.

Habits that pay off every time

  • Give it a role. "You are a meticulous code reviewer" narrows the model's style and standards.
  • Say the format out loud. "Reply as a Markdown table" beats hoping. For machine output, go further (Part 4).
  • Prefer positive instructions. "Use British spelling" lands better than "don't use American spelling."
  • Use delimiters. Wrap pasted text in clear markers (<doc>…</doc>) so the model can tell your instructions from the data.
03 · Few-shot examples 4 min

When telling isn't enough,
show the model.

Some patterns are easier to demonstrate than to describe — a tricky output format, a labelling convention, a particular tone. Drop a few worked examples into the prompt and the model copies the pattern. That's few-shot prompting.

Zero-shot vs few-shotzero-shot gives the model only instructions and trusts it to comply. Few-shot adds a handful of input → output examples so the model infers the pattern by analogy. "Shot" just means example: one example is one-shot, a few is few-shot.
# classify sentiment — show the exact label format Review: "Shipping was painfully slow." Label: negative Review: "Works exactly as described." Label: positive Review: "It's fine, nothing special." Label: ← model continues the pattern
ZERO-SHOT instruction guessed format FEW-SHOT example 1 example 2 matched format

Examples pin down the exact shape of the answer — far more reliable than describing the format in prose.

When it earns its tokens
  • The output format is fiddly or unusual and hard to describe in words.
  • You need a consistent style or labelling scheme across many calls.
  • The task is niche enough that instructions alone drift — examples anchor it.
Honest trade-offs
  • Every example costs tokens and latency on every call — often the simpler win is one clear instruction (zero-shot).
  • Bad or contradictory examples teach bad patterns. Curate them like test cases.
  • Capable models need fewer shots than they used to — start at zero, add examples only when output drifts.
04 · Structured output & tool calling 5 min

Stop parsing prose. Ask for
JSON — or a tool call.

The moment a model's output feeds another program, free text is a liability. Two features fix that: structured output constrains the reply to a schema you define, and tool calling lets the model ask your code to do something and use the result.

Structured outputforcing the model to emit valid JSON matching a schema you supply, so your code can rely on the shape. Tool calling (a.k.a. function calling) — giving the model a menu of functions; instead of answering in prose, it emits a structured request to call one, which your app runs and feeds back.
// you describe the tool; the model decides when to call it { "name": "get_weather", "description": "Current weather for a city", "parameters": { "type": "object", "properties": { "city": { "type": "string" } }, "required": ["city"] } }
MODEL YOUR APP get_weather() real function call(city) result answer

The model never runs anything itself — it requests a call, your app executes it, and the result goes back into the prompt.

Why schemas beat prose
  • No brittle regex parsing — you get fields, not a paragraph to scrape.
  • Many APIs guarantee valid JSON against your schema, so malformed replies stop being a class of bug.
  • Keep schemas small and well-named — field names are instructions the model reads too.
Tool calling unlocks
  • Live data and actions: search, database lookups, sending an email, fetching today's price.
  • It's the foundation of agents — chaining tool calls toward a goal lives in AI Agents & Tool Use.
  • When the model needs knowledge it wasn't trained on, retrieve it: RAG & Vector Search.
05 · Reasoning techniques 5 min

Give the model room
to think before it answers.

For multi-step problems — math, logic, planning, careful extraction — a model that blurts the first token often stumbles. Techniques like chain-of-thought, decomposition, and self-check trade a few extra tokens for noticeably better answers by letting the model reason out loud.

Chain-of-thought (CoT)prompting the model to work through the steps before committing to a final answer ("think step by step"). Because each token it generates becomes context for the next, writing out the reasoning gives the model more to build the answer on.
DIRECT question guess ✕ STEP BY STEP step 1 step 2 step 3 answer ✓

Reasoning out the steps gives later tokens something correct to build on — instead of one impulsive guess.

Three moves worth knowing

  • Chain-of-thought. Ask for the working, not just the answer. Great for arithmetic, logic, and careful reading.
  • Decomposition. Break a big task into named sub-steps — or separate calls — so each stays simple and checkable.
  • Self-check. Have the model draft, then critique its own answer against the requirements before finalising.
Honest trade-offs
  • Reasoning costs tokens and latency — don't pay for it on trivial tasks where a direct answer is already right.
  • Many newer reasoning models already think internally, so "think step by step" can be redundant — or even fight their built-in process. Test, don't assume.
  • Reasoning text can leak sensitive logic or wrong-but-confident steps. Don't show raw chains to end users by default.
Make it measurable

"It feels smarter" is not evidence. Whether CoT, more shots, or a bigger model actually helps is an empirical question — run it against a fixed test set and compare. That discipline is its own topic: LLM Evals & LLMOps.

06 · Iterate with evals · guard against injection 5 min

Treat prompts like code:
test them, and don't trust input.

A prompt that works in the playground can quietly regress when you tweak a word, swap a model, or hit a new input. Two practices keep you honest: measure changes with evals, and assume any text from outside is potentially hostile.

Evala repeatable test set for prompts: a list of inputs plus a way to score each output (exact match, a rule, or another model acting as judge). Run it on every change so you compare versions with numbers, not vibes.

Prompt injection — the core risk

  • The model can't fully tell your instructions from data it's given. Hostile text inside a document or web page can hijack it: "ignore previous instructions and…".
  • It gets dangerous when the model also holds tools or secrets — injected text can trick it into sending data or taking actions.
  • There is no single perfect fix — defend in layers.
untrusted document MODEL tools + secrets "ignore rules" exfiltrate ✕

Hostile text hidden in data can override your instructions and abuse whatever tools the model holds.

Defenses that stack
  • Separate trusted instructions from untrusted data with clear delimiters; tell the model the delimited block is data, not commands.
  • Least privilege. Don't hand the model tools or secrets it doesn't need for the task.
  • Validate output before acting on it — confirm tool arguments, require human approval for risky actions.
  • Test adversarially. Keep a suite of injection attempts in your evals.
Tighten the loop
  • Version prompts in source control — a prompt is part of your app.
  • Change one thing at a time and re-run the eval; otherwise you can't attribute the win.
  • Capture real failures and fold them back in as new test cases.

Tooling landscape

You don't have to build the harness yourself. A one-line read on where each tool fits — and its trade-off:

promptfoo

Prompt regression tests

Declarative test cases that run a prompt across inputs and models, with assertions and side-by-side diffs.

  • Pro — local, fast, great in CI; compares models head-to-head.
  • Con — only as good as the assertions you write; not a production monitor.
  • Choose it for catching regressions before you ship.
DSPy

Programmatic optimization

Treats prompts as parameters and tunes them — including example selection — against a metric you define.

  • Pro — automates prompt and few-shot tuning so you stop hand-tweaking wording.
  • Con — steeper learning curve; you give up some direct control of phrasing.
  • Choose it when you have a clear metric and many examples.
Guardrails-style validators

Validate & repair output

Check structured output against a schema or rules and retry or fix when it doesn't conform.

  • Pro — turns "mostly valid" output into a hard guarantee at the boundary.
  • Con — adds latency and retries; no substitute for a clear schema (Part 4).
  • Choose it when malformed output is expensive or unsafe.
Tracing platforms

Observe production

Capture real traffic, build datasets, and score with rules or LLM-as-judge over time (Langfuse, LangSmith, Braintrust).

  • Pro — sees real-world failures offline tests miss; tracks quality trends.
  • Con — another service; judge models need their own validation.
  • Choose it once you're live and need ongoing signal.

Picking a provider, SDK, or model — streaming, retries, cost — is a separate decision covered in Building LLM Apps; the measurement discipline goes deeper in LLM Evals & LLMOps.

07 · Patterns, anti-patterns & recap 3 min

Patterns to keep, anti-patterns to drop.

Almost every reliable prompt comes back to a few habits — and almost every flaky one repeats a few mistakes. Pick the lightest technique that does the job; reach for the heavier ones only when the simpler option provably falls short.

Patterns that hold up
  • Be specific. Role, audience, format, length, and a fallback for uncertainty.
  • Show, don't tell — a couple of examples when a format is hard to describe.
  • Structure machine output as JSON or a tool call instead of parsing prose.
  • Let it reason on hard, multi-step tasks; keep the steps out of the user's view.
  • Iterate with evals — change one thing, measure, keep what wins.
Anti-patterns to drop
  • The vague mega-prompt — a wall of wishes with no clear task or format.
  • Stuffing the context with everything "just in case" — it buries the signal and costs tokens.
  • Prompting around a wall. Missing knowledge needs retrieval; live data needs tools — not more adjectives.
  • Trusting unverified output — acting on raw model text, or pasting untrusted data straight into instructions.

Which technique, when?

Reach for the simplest row that works; move down only when it doesn't.

Zero-shot — instructions only

Use when
The task is common and you can describe it clearly. Always your first try — it's cheapest.
Move on when
Output drifts in format or quality no matter how you reword the instruction.

Few-shot — show a few examples

Use when
A format or style is easier to demonstrate than describe, or you need consistency across calls.
Cost
Tokens and latency on every call; bad examples teach bad habits — curate them.

Chain-of-thought — reason step by step

Use when
Multi-step logic, math, planning, or careful extraction where a direct answer often slips.
Watch for
Latency, and redundancy on reasoning models that already think internally. Measure it.

Structured output — JSON or tool calls

Use when
Another program consumes the result, or the model needs live data and actions.
It's really
Removing a whole class of parsing bugs — pair it with schema validation.

Three rules to walk out with

1Specific beats clever. Say the role, the format, and what to do when unsure — then escalate to examples or reasoning only if you must.
2Structure the boundary. When output feeds code, demand JSON or a tool call and validate it — never scrape prose.
3Measure and distrust. Test prompt changes against a fixed set, and treat every outside input as potentially hostile.

Where this fits

"Leave the model less to guess — then prove it worked."

Knowledge check

Did it stick?

Five quick questions on tokens, instructions, few-shot, tool calling, and prompt injection — instant feedback, no sign-in.

Rate this deck
be the first

Navigate with ← → or scroll · back to library