A 32-minute working session on changing a model's behavior — when to reach for the prompt, when retrieval is enough, and when actually training the weights pays off. We'll cover SFT, LoRA, preference tuning, distillation, and the honest trade-offs in between.
"The model isn't good enough yet" almost never means "we must fine-tune." There's a ladder of techniques, ordered by cost and commitment. Each rung is dramatically cheaper to try, faster to change, and easier to undo than the one above it. Most teams find their answer on the bottom two.
Each higher rung costs more to build and is harder to change. Exhaust the lower ones before you climb.
A base model has already learned language from a huge, general corpus. Fine-tuning continues that training on a small, focused set of your input → output pairs, nudging the weights so the model's default behavior shifts toward your examples. The most common form is supervised fine-tuning (SFT).
Each example produces a small correction to the weights; repeated over the set, the model's defaults shift.
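To make "a small correction per example" concrete, here is a toy sketch, not a real training loop. It assumes a one-weight "model" y = w * x, a squared-error loss, and plain per-example gradient descent. The names `fine_tune`, `pretrained_w`, and the learning rate are illustrative choices.

```python
def fine_tune(w, examples, lr=0.05, epochs=20):
    """Toy one-weight model y = w * x, trained with per-example SGD on squared error."""
    for _ in range(epochs):
        for x, y in examples:
            error = w * x - y        # prediction minus target
            grad = 2 * error * x     # derivative of (w*x - y)^2 with respect to w
            w -= lr * grad           # one small correction per example
    return w


pretrained_w = 1.0                                  # stand-in for the base model's default
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]     # all consistent with y = 2x
tuned_w = fine_tune(pretrained_w, examples)         # drifts from 1.0 toward 2.0
```

A single pass over one example moves the weight only slightly; the shift toward the examples' behavior comes from repeating that step across the whole set.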
A fixed tone, a strict JSON shape, a domain's phrasing — things easier to show in 500 examples than to describe in a prompt.
Classify support tickets, extract fields from invoices, rewrite to house guidelines — a repeated, well-defined task with clear right answers.
Weights are frozen knowledge as of training day. For changing data (prices, docs, today's tickets), reach for retrieval-augmented generation (RAG): fine-tuning memorizes facts unreliably.
This is the part teams underestimate and the part that actually decides success. A fine-tuned model is a mirror of its training examples — every inconsistency, typo, and lazy answer gets learned and amplified. You will spend most of your time here, and you should.
Always carve out a held-out set before training; it's the only honest way to know the tune helped.
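One way to carve out that held-out set is to hash each example's input, so the same example always lands on the same side even as the dataset grows. This is a minimal sketch: the `input` field name, the 10% fraction, and the helper names are assumptions, not part of the session.

```python
import hashlib


def is_held_out(example_id: str, holdout_fraction: float = 0.1) -> bool:
    """Map an example to a stable number in [0, 1) and compare it to the fraction."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction


def split(examples, key=lambda ex: ex["input"], holdout_fraction=0.1):
    """Return (train, held_out); the assignment depends only on each example's key."""
    train, held_out = [], []
    for ex in examples:
        target = held_out if is_held_out(key(ex), holdout_fraction) else train
        target.append(ex)
    return train, held_out
```

Because the split is a pure function of the key, re-running it after adding new examples never moves an old held-out example into training.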
Full fine-tuning updates every weight, billions of them, which is slow, expensive, and needs serious hardware. Parameter-efficient fine-tuning (PEFT) freezes the original model and trains a small set of new parameters instead. LoRA (low-rank adaptation) is the technique you'll meet first and most.
Output = frozen W + a small low-rank A·B. Only A and B learn.
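The caption above can be sketched in a few lines of plain Python. The matrices are tiny and made up, and the 4096 × 4096 layer with rank 8 is only an illustrative size. The point is that the effective weight is W plus the product A·B, and that A and B together hold far fewer numbers than W.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning of W versus the LoRA factors A and B."""
    full = d_out * d_in
    lora = d_out * rank + rank * d_in
    return full, lora


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # frozen 3x3 weight
A = [[0.1], [0.2], [0.3]]                                 # trainable, 3x1 (rank 1)
B = [[1.0, 0.0, -1.0]]                                    # trainable, 1x3
W_eff = add(W, matmul(A, B))                              # what the layer actually applies

full, lora = lora_param_counts(4096, 4096, 8)             # 16,777,216 vs 65,536
```

At these illustrative sizes the adapter trains 256 times fewer parameters than the full matrix, which is why the adapter file stays small.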
Quantization stores the frozen base weights at lower precision (e.g. 4-bit instead of 16-bit), cutting weight memory roughly 4×. QLoRA quantizes the base, then trains a LoRA adapter on top; that's how people fine-tune large models on a single GPU.
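The 4× figure is simple arithmetic on bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts only the stored weights. It ignores activations, optimizer state, and the small overhead of quantization constants.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9                                   # hypothetical 7B-parameter base model
fp16_gb = weight_memory_gb(n_params, 16)         # 16-bit weights
int4_gb = weight_memory_gb(n_params, 4)          # 4-bit quantized weights
savings = fp16_gb / int4_gb                      # ratio between the two
```

Under these assumptions the weights drop from about 14 GB to about 3.5 GB, which is the difference between needing several GPUs and fitting on one.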
One base model in memory, many small adapters on disk — a support-tone adapter, a legal-tone adapter — loaded per request. Far cheaper than hosting a full copy of the model per use case.
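A rough sketch of that serving pattern is a cache of adapters keyed by use case. `load_adapter` here is a hypothetical stand-in for reading adapter weights from disk, and no real model is involved.

```python
from functools import lru_cache

load_calls = []   # records which adapters were actually read from "disk"


@lru_cache(maxsize=8)
def load_adapter(name: str) -> dict:
    """Hypothetical loader: stands in for reading a small adapter file."""
    load_calls.append(name)
    return {"name": name}


def handle_request(use_case: str, prompt: str) -> str:
    """Pick the adapter for this request; the base model is shared by all of them."""
    adapter = load_adapter(use_case)
    # A real server would apply the adapter to the shared base model here.
    return f"[{adapter['name']}] {prompt}"
```

Repeated requests for the same use case reuse the cached adapter, so only the first request for each use case pays the load cost.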
Upload a JSONL file, click train, get an endpoint. OpenAI's fine-tuning API, Google Vertex AI tuning for Gemini, and Claude Haiku fine-tuning via Amazon Bedrock all follow this shape — the provider hides the GPUs and the LoRA details.
The de-facto open-source stack: the transformers, peft, and trl libraries give you LoRA, QLoRA, SFT and preference training over open-weight models you host yourself.
A wrapper over the Hugging Face stack that turns a training run into a single YAML file — dataset, base model, LoRA settings — so you don't hand-write the loop.
Optimized kernels that make LoRA and QLoRA training notably faster and lighter on memory, letting modest single-GPU setups fine-tune larger models.
SFT teaches the model to copy good answers. But often you can't write one perfect answer — you can only say "this reply is better than that one." Preference tuning learns from those comparisons. Distillation does something different: it compresses a big model's skill into a small, cheap one.
The model learns from the comparison, not from a single gold answer, which is useful when "better" is easier to judge than "perfect."
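One common objective for learning from a comparison is DPO (direct preference optimization). The sketch below computes its loss for a single preference pair. It assumes you already have the summed log-probability of each reply under the model being tuned and under a frozen reference copy. The argument names and `beta=0.1` are illustrative.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a reply under the tuned
    model (policy) or the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    if z < -30:                 # avoid overflow; -log(sigmoid(z)) is about -z here
        return -z
    return math.log1p(math.exp(-z))
```

The loss falls as the tuned model raises the preferred reply's probability relative to the reference and lowers the rejected one's, so no single gold answer is needed.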
The payoff is cost and latency: a focused small model can match a giant one on one narrow task for a fraction of the price.
The student learns to imitate the teacher's answers — same task, a fraction of the cost.
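In its simplest form, distillation is SFT on answers the teacher wrote. The sketch below builds that dataset. `teacher` is any callable from prompt to answer, and `toy_teacher` is a made-up stand-in for a large model, not a real API.

```python
def build_distillation_set(prompts, teacher):
    """Label each prompt with the teacher's answer to form pairs for the student."""
    pairs = []
    for prompt in prompts:
        answer = teacher(prompt)
        if answer and answer.strip():              # drop empty teacher outputs
            pairs.append({"input": prompt, "output": answer.strip()})
    return pairs


def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a large, expensive model."""
    return "refund" if "money back" in prompt else "other"


pairs = build_distillation_set(
    ["I want my money back", "Where is your office?"], toy_teacher
)
```

The resulting pairs have the same input → output shape as any other SFT set, so the student trains on them exactly as it would on human-written examples.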
Best for a clear right answer. Needs labeled pairs. Start here.
Same goal as SFT at a fraction of the compute. The default method for open models.
For subjective quality and tone. Needs preference pairs; DPO is the simpler path.
Cut cost/latency once quality is proven. You need a strong teacher and many prompts.
A tuned model that feels better in a few hand tests is not a result. You need numbers, on data the model never saw, against the simpler baseline you were trying to beat. If it doesn't clearly win, ship the baseline: it's cheaper to run and maintain.
Same held-out questions through both models; ship only the clear winner.
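The comparison can be as small as the sketch below, which assumes a task with exact-match answers and treats each model as a callable from input to output. The `min_gain` threshold of 5 points is an arbitrary illustration of "clear winner", not a recommended value.

```python
def accuracy(model, held_out):
    """Fraction of held-out examples the model answers exactly right."""
    correct = sum(1 for ex in held_out if model(ex["input"]) == ex["output"])
    return correct / len(held_out)


def pick_winner(baseline, tuned, held_out, min_gain=0.05):
    """Ship the tuned model only if it beats the baseline by at least min_gain."""
    base_acc = accuracy(baseline, held_out)
    tuned_acc = accuracy(tuned, held_out)
    winner = "tuned" if tuned_acc - base_acc >= min_gain else "baseline"
    return winner, base_acc, tuned_acc
```

A tie or a marginal gain returns "baseline", matching the rule that the cheaper option ships unless the tune clearly wins.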
Over-tuning on a narrow set can erase general skills the model used to have. Keep a few broad checks in your eval set to catch it.
If it aces training examples but flops on the held-out set, it memorized rather than learned. Try fewer epochs or more varied data.
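Both failure modes show up as gaps between scores you already have, so a small check can flag them. This is a sketch with made-up thresholds: `max_gap` and `max_broad_drop` are assumptions you would tune for your own task.

```python
def diagnose(train_score, held_out_score, broad_before, broad_after,
             max_gap=0.10, max_broad_drop=0.05):
    """Flag overfitting (train far above held-out) and forgetting (broad checks drop)."""
    issues = []
    if train_score - held_out_score > max_gap:
        issues.append("overfitting")
    if broad_before - broad_after > max_broad_drop:
        issues.append("forgetting")
    return issues
```

Here `broad_before` and `broad_after` are the scores on the broad checks for the base model and the tuned model respectively.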
Tag the model + dataset + config, deploy the adapter behind your endpoint, and keep watching live quality — distributions drift.
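Tagging can be as simple as a manifest that pins the base model name plus hashes of the dataset and config. The field names below are illustrative, and the base model name passed in is whatever identifier you use internally.

```python
import hashlib
import json


def release_manifest(base_model: str, dataset_bytes: bytes, config: dict) -> dict:
    """Record exactly which model, data, and settings produced an adapter."""
    config_json = json.dumps(config, sort_keys=True)   # key order must not change the hash
    return {
        "base_model": base_model,
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "config_sha256": hashlib.sha256(config_json.encode("utf-8")).hexdigest(),
        "config": config,
    }
```

Storing this next to the deployed adapter lets you trace a live quality change back to the exact dataset and settings that produced it.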
Fine-tuning is a real commitment: a data pipeline to maintain, a model to host, and a re-train every time the base model improves underneath you. Reach for it last, and only when the cheaper rungs genuinely fall short.
Changing facts belong in RAG, not in frozen weights. Fine-tuning memorizes unreliably and goes stale instantly.
Most "bad output" is a prompt problem. A clearer prompt or a few examples is free and reversible — fine-tuning is neither.
Too few examples, or noisy ones, make the model worse. No dataset, no fine-tune — fix the data first.