GUIDEDECK · for teams shipping with LLMs

Fine-tuning &
Model Adaptation
without the regret.

A 32-minute working session on changing a model's behavior — when to reach for the prompt, when retrieval is enough, and when actually training the weights pays off. We'll cover SFT, LoRA, preference tuning, distillation, and the honest trade-offs in between.

~32 MIN · BEGINNER → INTERMEDIATE · VENDOR-AGNOSTIC
01 · The adaptation ladder 4 min

Climb the cheapest rung that
solves the problem — and stop there.

"The model isn't good enough yet" almost never means "we must fine-tune." There's a ladder of techniques, ordered by cost and commitment. Each rung is dramatically cheaper to try, faster to change, and easier to undo than the one above it. Most teams find their answer on the bottom two.

Model adaptation, making a general model behave the way your task needs, is a spectrum, not a single button. At one end you change only the words you send (the prompt); at the other you change the model's weights (fine-tuning). Knowing where your problem actually lives saves weeks.
[Figure: the adaptation ladder. Rungs run from CHEAP · FAST · REVERSIBLE to COSTLY · SLOW · STICKY: Prompt → Few-shot → RAG → Fine-tune. Try the prompt first; fine-tuning is the last resort.]

Each rung costs more to build and is harder to change. Exhaust the lower ones before you climb.

What each rung fixes

  • Prompt engineering: clearer instructions, format, role. Fixes most "it won't do what I asked" problems. See the deck →
  • Few-shot examples: show 2–5 worked examples in the prompt so the model copies the pattern. Still just text.
  • RAG: retrieve facts at query time and paste them in. The right rung when the gap is knowledge, not behavior. See the deck →
  • Fine-tuning: change the weights. Worth it for consistent style, format, or a narrow skill the prompt can't pin down.
The one-line test — if a smart new colleague could do the task correctly given your instructions and the right documents, you need a better prompt or RAG, not fine-tuning. Fine-tuning is for teaching a reflex, not for delivering facts.
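To make the few-shot rung concrete, here is a minimal sketch of it as plain message assembly. The function name `build_few_shot_messages` and the ticket examples are hypothetical; the role/content message shape is the common chat format, and nothing here calls a model.

```python
def build_few_shot_messages(instructions, examples, query):
    """Assemble a chat prompt: instructions, worked examples, then the real query."""
    messages = [{"role": "system", "content": instructions}]
    for user_text, ideal_answer in examples:
        # Each worked example is a user turn followed by the answer to imitate.
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": ideal_answer})
    messages.append({"role": "user", "content": query})
    return messages


# Hypothetical examples for a ticket-labelling task.
examples = [
    ("My card was charged twice.", "billing"),
    ("The app crashes on login.", "bug"),
]
messages = build_few_shot_messages(
    "Label each support ticket with one word: billing, bug, or other.",
    examples,
    "I can't reset my password.",
)
```

Changing the behavior means editing the `examples` list and resending, which is why this rung stays cheap and reversible.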
02 · What fine-tuning is 5 min

Keep training a finished model
on your own examples.

A base model already learned language from a huge, general corpus. Fine-tuning continues that training on a small, focused set of your input → output pairs, nudging the weights so the model's default behavior shifts toward your examples. The most common form is supervised fine-tuning.

SFT (Supervised Fine-Tuning) shows the model many (prompt, ideal answer) pairs and adjusts its weights so its own answers move closer to the ideal ones. "Supervised" just means every example is labeled with the answer you want. The model learns the shape of a good response: tone, format, structure, and how to handle your edge cases.
// one line = one labeled example (chat format); wrapped here for readability
{"messages": [
  {"role":"system", "content":"You are our support agent."},
  {"role":"user", "content":"Where is order #4021?"},
  {"role":"assistant", "content":"Order #4021 shipped Apr 2 via UPS…"}
]}
// thousands of these → the weights learn your house style
[Figure: examples (prompt + answer) → base model weights → tuned new weights. Loop: loss → gradient → small weight update, repeated for every example, many times.]

Each example produces a small correction to the weights; repeated over the set, the model's defaults shift.
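The loss → gradient → small weight update loop can be illustrated with a deliberately tiny stand-in: a model with a single weight, trained by per-example gradient descent. This is a toy under made-up assumptions (the data, the learning rate, and the function name `fine_tune` are all invented for illustration), not an LLM training loop.

```python
def fine_tune(weight, examples, learning_rate=0.05, epochs=50):
    """Nudge one weight toward the examples with per-example gradient steps."""
    for _ in range(epochs):                      # repeat over the set many times
        for x, target in examples:               # one labeled example at a time
            prediction = weight * x
            error = prediction - target          # loss is 0.5 * error ** 2
            gradient = error * x                 # d(loss) / d(weight)
            weight -= learning_rate * gradient   # small correction
    return weight


# A "pretrained" weight of 1.0; every example follows target = 3 * x.
examples = [(1.0, 3.0), (2.0, 6.0), (0.5, 1.5)]
tuned = fine_tune(1.0, examples)
```

No single step moves the weight far, but after enough passes `tuned` sits very close to 3.0, the value the examples agree on. A real fine-tune does the same thing across billions of weights.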

Good fit

Style & format

A fixed tone, a strict JSON shape, a domain's phrasing — things easier to show in 500 examples than to describe in a prompt.

Good fit

A narrow skill

Classify support tickets, extract fields from invoices, rewrite to house guidelines — a repeated, well-defined task with clear right answers.

Poor fit

Fresh facts

Weights are frozen knowledge as of training day. For changing data (prices, docs, today's tickets), reach for RAG — fine-tuning memorizes unreliably.

03 · Data preparation 6 min

The dataset is the model.
Garbage in, garbage out.

This is the part teams underestimate and the part that actually decides success. A fine-tuned model is a mirror of its training examples — every inconsistency, typo, and lazy answer gets learned and amplified. You will spend most of your time here, and you should.

Quality beats quantity. A few hundred clean, consistent, on-target examples routinely beat tens of thousands of noisy ones. The model can't tell "a good answer" from "an answer that happened to be in the file" — so every example has to be one you'd be proud to ship.
Noisy data — teaches the wrong reflex
{"prompt":"order status","completion":"it shipped"}
{"prompt":"Where's my stuff??","completion":"idk check email"}
// inconsistent shape, terse, off-tone,
// mixed formats → the model learns the mess
Clean data — one consistent target
{"messages":[
  {"role":"user","content":"What's the status of order #4021?"},
  {"role":"assistant","content":"Order #4021 shipped Apr 2 via UPS, tracking 1Z…"}
]}
// same shape every time, full answer, your voice
[Figure: raw examples → clean · dedupe → train (~90%) and held-out eval (never trained on).]

Always carve out a held-out set before training. It's the only honest way to know the tune helped.

The data checklist

  • Consistent format: one schema, one chat template, the same one you'll use at inference time.
  • Cover the edges: include the tricky and the "say no politely" cases, not just the happy path.
  • Deduplicate: near-identical rows skew the model and inflate your scores.
  • Hold out 10–20% for evaluation, and keep it out of training. No leakage.
  • Balance the classes: if 95% of examples say "approved," the model just learns to say "approved."
Where examples come from — hand-written by experts (gold but slow), mined from real historical logs and edited, or synthetic: a stronger model drafts candidates that humans curate. Synthetic data scales fast but inherits the teacher's blind spots, so a human still has to sign off.
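Several items on the checklist above can be automated. The sketch below validates one schema, drops near-duplicates, carves out a held-out split, and reports class balance. It is a starting point under stated assumptions: the helper names are hypothetical, "near-identical" is approximated by ignoring case and whitespace only, and the final assistant message is treated as the class label.

```python
import random
from collections import Counter

ROLES = {"system", "user", "assistant"}


def normalize(text):
    """Collapse case and whitespace so near-identical rows compare equal."""
    return " ".join(text.lower().split())


def is_valid(row):
    """One schema: a non-empty 'messages' list that ends with an assistant turn."""
    messages = row.get("messages") if isinstance(row, dict) else None
    if not isinstance(messages, list) or not messages:
        return False
    for m in messages:
        if not isinstance(m, dict) or m.get("role") not in ROLES:
            return False
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return False
    return messages[-1]["role"] == "assistant"


def prepare(rows, eval_frac=0.15, seed=0):
    """Validate, deduplicate, shuffle, and split into (train, holdout)."""
    seen, clean = set(), []
    for row in rows:
        if not is_valid(row):
            continue
        key = tuple((m["role"], normalize(m["content"])) for m in row["messages"])
        if key in seen:          # a near-duplicate of a row we already kept
            continue
        seen.add(key)
        clean.append(row)
    random.Random(seed).shuffle(clean)   # fixed seed keeps the split reproducible
    n_eval = max(1, round(len(clean) * eval_frac))
    return clean[n_eval:], clean[:n_eval]


def label_balance(rows):
    """Share of each final assistant answer; exposes a dominant class."""
    counts = Counter(normalize(r["messages"][-1]["content"]) for r in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Deduplicating before the split matters: a duplicate that lands on both sides leaks training data into the evaluation set.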
04 · Parameter-efficient tuning 6 min

Train a tiny add-on,
freeze the giant model behind it.

Full fine-tuning updates every weight (billions of them), which is slow, expensive, and needs serious hardware. Parameter-efficient fine-tuning (PEFT) freezes the original model and trains a small set of new parameters instead. LoRA is the technique you'll meet first and most.

LoRA (Low-Rank Adaptation) leaves the base weights untouched and learns two small matrices whose product acts as a correction to a layer's frozen weight matrix, so its contribution is added to that layer's output. Those matrices are tiny (often megabytes, not gigabytes), so training is cheap and the result is a swappable adapter you snap on or off the base model.
from peft import LoraConfig

config = LoraConfig(
    r=8,                                  # adapter rank: keep it small
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
# base model frozen; only these matrices train
[Figure: W (frozen, big, untouched) + A·B (tiny, trained) = output.]

Output = frozen W + a small low-rank A·B. Only A and B learn.
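The arithmetic behind "tiny" is easy to check. This sketch counts trainable parameters for one weight matrix and applies a low-rank update in plain Python. It assumes the usual LoRA formulation, where the update is scaled by alpha / rank and written as B times A (the letter order differs from the figure, but the idea is the same); the 4096 × 4096 layer size and the helper names are illustrative.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in              # every entry of W
    lora = rank * (d_out + d_in)     # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(x, y):
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, b, a, alpha, rank):
    """W_eff = W + (alpha / rank) * (B @ A); W itself is never modified."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


full, lora = lora_param_counts(4096, 4096, rank=8)
# full is 16,777,216 and lora is 65,536: about 0.4% of the matrix is trained.

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
b = [[1.0], [2.0]]             # 2x1, trained
a = [[0.5, 0.5]]               # 1x2, trained
w_eff = lora_effective_weight(w, b, a, alpha=2, rank=1)
```

Because `w` is left untouched, removing the adapter is just a matter of dropping `a` and `b`, which is what makes adapters swappable.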

QLoRA

LoRA on a quantized base

Quantization stores the frozen base weights at lower precision (e.g. 4-bit instead of 16-bit), cutting memory ~4×. QLoRA quantizes the base, then trains a LoRA adapter on top. That's how people fine-tune large models on a single GPU.
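The ~4× figure is straightforward arithmetic on bits per weight. The sketch below works it through for a hypothetical 7-billion-parameter model. It counts the weights only and ignores quantization metadata, activations, optimizer state, and the adapter itself, so real memory use will be higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7_000_000_000                    # hypothetical 7B-parameter base model
fp16_gb = weight_memory_gb(n_params, 16)    # 16-bit weights
int4_gb = weight_memory_gb(n_params, 4)     # 4-bit quantized weights
ratio = fp16_gb / int4_gb
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB, which is the difference between needing several accelerators and fitting on one.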

Adapters

Swappable, stackable, cheap to store

One base model in memory, many small adapters on disk — a support-tone adapter, a legal-tone adapter — loaded per request. Far cheaper than hosting a full copy of the model per use case.

The tooling landscape — where you run it

Hosted fine-tuning from model vendors

Upload a JSONL file, click train, get an endpoint. OpenAI's fine-tuning API, Google Vertex AI tuning for Gemini, and Claude Haiku fine-tuning via Amazon Bedrock all follow this shape — the provider hides the GPUs and the LoRA details.

Pro
Zero infrastructure; a working tuned endpoint in an afternoon, scaling handled for you.
Con
Closed weights, per-token pricing, limited control, and you can't leave with the model.
Choose when
You want results fast, lack ML-ops staff, and are comfortable staying on that vendor.

Hugging Face — PEFT + TRL on open models

The de-facto open-source stack: the transformers, peft, and trl libraries give you LoRA, QLoRA, SFT and preference training over open-weight models you host yourself.

Pro
Full control, open weights you own, runs anywhere, huge community and model hub.
Con
You write the training code and own the GPUs, debugging, and serving.
Choose when
You need ownership, on-prem/private data, or to customize the training loop.

Axolotl — config-file fine-tuning

A wrapper over the Hugging Face stack that turns a training run into a single YAML file — dataset, base model, LoRA settings — so you don't hand-write the loop.

Pro
Reproducible, readable configs; sane defaults for common recipes.
Con
Still your hardware; the YAML abstraction hides knobs you may eventually need.
Choose when
You want open-source control without boilerplate and value repeatable runs.

Unsloth — fast, low-memory LoRA/QLoRA

Optimized kernels that make LoRA and QLoRA training notably faster and lighter on memory, letting modest single-GPU setups fine-tune larger models.

Pro
Big speed/memory wins on one GPU; friendly notebooks to get started.
Con
Focused on LoRA-style tuning; not a full MLOps platform.
Choose when
You're GPU-poor, experimenting, or want the cheapest path to a LoRA adapter.
05 · Preference tuning & distillation 5 min

Beyond imitation:
teach taste and shrink the model.

SFT teaches the model to copy good answers. But often you can't write one perfect answer — you can only say "this reply is better than that one." Preference tuning learns from those comparisons. Distillation does something different: it compresses a big model's skill into a small, cheap one.

RLHF & DPO: preference (alignment) tuning trains the model on pairs of a preferred and a rejected answer so it learns which kind of response humans actually like. RLHF does this with a separate reward model and reinforcement learning; DPO (Direct Preference Optimization) skips the reward model and optimizes on the pairs directly, which is simpler and now the common starting point.
// each row: a prompt + a better and worse answer (wrapped here for readability)
{"prompt": "Explain a deadlock in one sentence.",
 "chosen": "Two tasks each wait on a lock the other holds, so neither proceeds.",
 "rejected": "A deadlock is when the computer is bad."}
// DPO nudges the model toward 'chosen', away from 'rejected'
[Figure: prompt → chosen (✓ preferred) and rejected (✕ worse) → tune.]

The model learns from the comparison, not from a single gold answer, which is useful when "better" is easier than "perfect."
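The DPO objective for a single pair fits in a few lines. This sketch assumes you already have the summed log-probability that the model being tuned (the policy) and a frozen copy of the starting model (the reference) assign to each answer. The function name and the value `beta=0.1` are illustrative choices, not settings from this deck.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from summed answer log-probabilities."""
    # How much more the tuned model favors each answer than the reference does.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Before any training the policy equals the reference, so the margin is zero.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After the policy shifts toward 'chosen' and away from 'rejected', the loss drops.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss starts at log 2 and falls as the policy widens the gap between the chosen and rejected answers relative to the reference. No reward model appears anywhere in the computation.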

Distillation — a small student copies a big teacher

  • Run a large, capable teacher model over many prompts and record its outputs.
  • Fine-tune a small student model (via SFT) to reproduce those outputs on your task.
  • You trade a little quality for a model that's far cheaper and faster to run in production.

The payoff is cost and latency: a focused small model can match a giant one on one narrow task for a fraction of the price.

[Figure: teacher (large · slow) → outputs → train → student (small · fast). The student mimics the teacher on the task.]

The student learns to imitate the teacher's answers — same task, a fraction of the cost.
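The first two steps of distillation amount to building an SFT dataset from the teacher's outputs. In this sketch the teacher is any callable that maps a prompt to text; `fake_teacher` is a stand-in so the example runs without a model, and the `keep` filter is a hypothetical hook for the human or automated curation you would want in practice.

```python
import json


def build_distillation_rows(prompts, teacher, keep=lambda prompt, answer: bool(answer.strip())):
    """Run the teacher over prompts and emit chat-format SFT rows for the student."""
    rows = []
    for prompt in prompts:
        answer = teacher(prompt)
        if not keep(prompt, answer):     # drop empty or otherwise rejected outputs
            continue
        rows.append({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]})
    return rows


def to_jsonl(rows):
    """One JSON object per line, the usual fine-tuning file layout."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def fake_teacher(prompt):
    """Stand-in for a call to a large model; replace with a real client."""
    return "" if "skip" in prompt else f"Summary: {prompt.lower()}"


rows = build_distillation_rows(["Refund policy", "skip me", "Shipping times"], fake_teacher)
jsonl = to_jsonl(rows)
```

The resulting file has the same shape as any other SFT dataset, so the student is trained exactly as in section 02. Only the source of the answers changes.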

Picking an approach — cost vs benefit

SFT

Cheapest training

Best for a clear right answer. Needs labeled pairs. Start here.

LoRA / PEFT

SFT, but light

Same goal as SFT at a fraction of the compute. The default way to do it for open models.

DPO / RLHF

Teaches taste

For subjective quality and tone. Needs preference pairs; DPO is the simpler path.

Distillation

Shrink for prod

Cut cost/latency once quality is proven. You need a strong teacher and many prompts.

06 · Evaluating & deploying 4 min

Prove it's better before
you ship it.

A tuned model that feels better in a few hand tests is not a result. You need numbers, on data the model never saw, against the simpler baseline you were trying to beat. If it doesn't clearly win, ship the baseline: it's cheaper to run and maintain.

The honest comparison: always score the tuned model against the baseline (the well-prompted base model, or base + RAG) on a held-out evaluation set. The question is never "is the tune good?" but "is it enough better to justify the cost?" Evaluation is a whole discipline of its own. See the LLM Evals deck →
# the cardinal rule: never score on training data
# split() and evaluate() stand in for your own helpers
train, holdout = split(dataset, eval_frac=0.15)
base = evaluate(base_model, holdout)
tuned = evaluate(tuned_model, holdout)
assert tuned.score > base.score   # else ship base
assert tuned.regressions == 0     # no skills lost
[Figure: base + prompt and tuned model → held-out eval set → compare → ship winner. Same questions · same scoring · no leakage.]

Same held-out questions through both models; ship only the clear winner.
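A runnable version of that comparison might look like the sketch below. It makes several assumptions that are not in the deck: models are plain callables from prompt to text, scoring is exact match, a regression is a held-out item the baseline got right and the tuned model gets wrong, and `min_gain` is a hypothetical threshold for "clearly wins".

```python
def evaluate(model, holdout):
    """Exact-match accuracy plus the indices of examples answered correctly."""
    correct = {i for i, (prompt, expected) in enumerate(holdout)
               if model(prompt).strip() == expected}
    return {"score": len(correct) / len(holdout), "correct": correct}


def compare(base_model, tuned_model, holdout, min_gain=0.05):
    """Ship the tuned model only if it clearly wins and breaks nothing."""
    base = evaluate(base_model, holdout)
    tuned = evaluate(tuned_model, holdout)
    regressions = base["correct"] - tuned["correct"]   # right before, wrong now
    wins = tuned["score"] - base["score"] >= min_gain and not regressions
    return {
        "base": base["score"],
        "tuned": tuned["score"],
        "regressions": sorted(regressions),
        "ship": "tuned" if wins else "base",
    }


# Stub models backed by lookup tables, so the example runs without an LLM.
holdout = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9"), ("color of sky", "blue")]
base_answers = {"2+2": "4", "capital of France": "Paris"}
tuned_answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
result = compare(lambda p: base_answers.get(p, ""), lambda p: tuned_answers.get(p, ""), holdout)
```

Both models see the same questions and the same scoring function, and the default outcome is the baseline. The tuned model has to earn its place.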

Watch for

Catastrophic forgetting

Over-tuning on a narrow set can erase general skills the model used to have. Keep a few broad checks in your eval set to catch it.

Watch for

Overfitting

If it aces training examples but flops on the held-out set, it memorized rather than learned. Use fewer epochs or more varied data.
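One common way to choose "fewer epochs" is to track held-out loss per epoch and stop once it stops improving. The sketch below assumes you log one held-out loss value per epoch. The function name, the `patience` setting, and the loss numbers are all made up for illustration.

```python
def best_epoch(holdout_loss, patience=2):
    """Return the 0-based epoch with the lowest held-out loss, stopping the scan
    once the loss has failed to improve for `patience` epochs in a row."""
    best, best_idx, stale = float("inf"), -1, 0
    for epoch, loss in enumerate(holdout_loss):
        if loss < best:
            best, best_idx, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx


# Training loss keeps falling while held-out loss turns upward: memorization.
train_loss = [2.0, 1.2, 0.7, 0.3, 0.1]
holdout_loss = [2.1, 1.5, 1.3, 1.4, 1.6]
keep = best_epoch(holdout_loss)
final_gap = holdout_loss[-1] - train_loss[-1]
```

Here the checkpoint worth keeping is the one from epoch index 2, and the wide gap between training and held-out loss at the end is the overfitting signature described above.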

Deploy

Version & monitor

Tag the model + dataset + config, deploy the adapter behind your endpoint, and keep watching live quality — distributions drift.
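Tagging the model, dataset, and config together can be as simple as writing a small manifest next to the adapter. This is a sketch with hypothetical field names and a placeholder base-model name. It hashes the exact training rows and settings so a deployed adapter can be traced back to what produced it.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable short hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def build_manifest(base_model, dataset_rows, config):
    """Tie a tuned model to the exact data and settings that produced it."""
    return {
        "base_model": base_model,
        "dataset_hash": fingerprint(dataset_rows),
        "dataset_size": len(dataset_rows),
        "config_hash": fingerprint(config),
        "config": config,
    }


rows = [{"messages": [{"role": "user", "content": "Where is order #4021?"},
                      {"role": "assistant", "content": "It shipped Apr 2."}]}]
config = {"r": 8, "lora_alpha": 16, "epochs": 3}
manifest = build_manifest("example-base-model", rows, config)   # placeholder name
```

If live quality drifts, the manifest tells you which dataset and settings to revisit, and any change to either produces a different hash.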

07 · When NOT to fine-tune & recap 2 min

The best fine-tune is often
the one you don't do.

Fine-tuning is a real commitment: a data pipeline to maintain, a model to host, and a re-train every time the base model improves underneath you. Reach for it last, and only when the cheaper rungs genuinely fall short.

Don't

…to add fresh knowledge

Changing facts belong in RAG, not in frozen weights. Fine-tuning memorizes unreliably and goes stale instantly.

Don't

…before trying the prompt

Most "bad output" is a prompt problem. A clearer prompt or a few examples is free and reversible — fine-tuning is neither.

Don't

…with thin or messy data

Too few examples, or noisy ones, make the model worse. No dataset, no fine-tune — fix the data first.

1. Climb the cheapest rung first. Prompt → few-shot → RAG → fine-tune. Stop the moment the problem is solved.
2. Fine-tune for behavior, RAG for knowledge. Weights teach style and skill; retrieval delivers fresh facts.
3. The dataset is the model. A few hundred clean, consistent examples beat a pile of noisy ones.
4. Use LoRA/PEFT by default. Tiny swappable adapters get most of the benefit at a fraction of the cost.
5. Prove it on held-out data. Beat the baseline measurably, or ship the baseline.
  • Wrong facts or stale info? → RAG, not fine-tuning.
  • Wrong format, tone, or a narrow repeated skill? → SFT / LoRA.
  • "Better" is a matter of taste you can only compare? → DPO.
  • Quality is fine but it's too slow/expensive? → distillation.
  • Not sure? → improve the prompt and measure first. It's free.
