A 34-minute working session on retrieval-augmented generation — how to let a language model answer from your documents without retraining it. Embeddings, chunking, similarity search, reranking, and the vector databases that hold it all together.
An LLM learned the public internet up to a fixed training cutoff. It has never seen your company wiki, yesterday's support tickets, or the PDF a customer just uploaded. Ask about those and it either says "I don't know" or, worse, makes something up. RAG fixes that by handing the model the right facts at question time.
The question pulls snippets from your knowledge base; those snippets ride along into the prompt so the model can answer from them.
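As a rough sketch of how retrieved snippets "ride along" into the prompt, the function below joins them into a numbered context block ahead of the question. The function name, the instruction wording, and the sample snippets are illustrative assumptions, not a prescribed template.

```python
def build_prompt(question: str, snippets: list[str]) -> str:
    """Assemble a prompt that asks the model to answer from the snippets only."""
    # Number each snippet so an answer can point back to its source.
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(snippets, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up snippets standing in for whatever retrieval returned.
prompt = build_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 5 business days.", "Shipping is free over $50."],
)
print(prompt)
```

The resulting string is what would be sent to the model; the model itself is unchanged.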
Like the difference between memorizing a textbook for an exam (fine-tuning) and being allowed to bring the book and look things up (RAG). Fine-tuning teaches style and skills; RAG supplies facts. Most teams reach for RAG first.
Computers can't compare sentences the way we do. The trick is to convert each piece of text into a long list of numbers — a vector — positioned so that texts with similar meaning land close together. That conversion is called an embedding, and it's the engine under all of vector search.
Each text becomes a point. "Refund" and "money back" sit together; "weather" sits far away.
Embeddings match the idea, so a question phrased completely differently still finds the right doc.
text-embedding-3: strong default, simple API, pay per token.
embed v3/v4: great multilingual quality; the same provider also makes the reranker we'll meet in Part 5.
BGE, E5, nomic: run them yourself, no per-call cost, full data control.
One rule: embed your documents and your queries with the same model. Vectors from two different models don't live in the same space and can't be compared.
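To make "close together" concrete, here is a minimal sketch of cosine similarity, one common way to measure how close two embedding vectors are. The three-number vectors are hand-made for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    if len(a) != len(b):
        # Different models often produce different lengths; even when lengths
        # match, vectors from different models are not comparable.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption): not the output of any real model.
refund = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(refund, money_back))  # close to 1: similar meaning
print(cosine_similarity(refund, weather))     # close to 0: unrelated
```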
You don't embed a whole 50-page handbook as one vector — that single point would be a blurry average of everything in it, and you'd have to feed the entire thing to the model. Instead you cut documents into small, focused pieces called chunks, and embed each one on its own.
The document is cut into chunks that overlap slightly — so a sentence split across a seam still appears whole in one of them.
A few hundred words is the usual sweet spot — big enough to hold one complete thought, small enough that its vector stays sharp.
Repeat ~10–20% of text between neighbors so an idea that straddles a boundary isn't lost — the answer might sit right on the cut.
Split on headings, paragraphs, or sentences — not mid-word. Respecting structure keeps each chunk readable and on-topic.
Like indexing a cookbook by recipe, not by chapter — you want the page with the exact dish, not the whole "Desserts" section.
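The size and overlap rules above can be sketched as a small word-based chunker. This is a simplified assumption: it splits on whitespace only, so it never cuts mid-word but does not look at headings or paragraphs. The name chunk_words and its defaults (200 words, 30 words of overlap, i.e. 15%) are illustrative.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 30) -> list[str]:
    """Split text into chunks of chunk_size words; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Tiny example so the overlap is visible: 10 words, 4 per chunk, 1 shared.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

A structure-aware splitter would first cut on headings or paragraphs and only fall back to a fixed window like this one for oversized sections.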
Every chunk is now a point. To answer a question, you turn the question into a point too, then find the handful of stored points sitting closest to it. Those are your most-relevant chunks. This is vector search, and it's the "retrieve" in RAG.
k of 3 to 10.
The query (amber) lands among the chunks; the three nearest (mint) are the top-k results. The far-off points are ignored.
Each result comes back with a similarity score; you keep the top-k and can drop anything below a threshold.
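A minimal sketch of exact top-k retrieval with a score threshold, assuming cosine similarity as the score. The chunk ids, the toy vectors, and the 0.5 cutoff are made up for illustration; a real system would get its vectors from an embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query, chunks, k=3, min_score=0.0):
    """Return up to k (score, chunk_id) pairs, best first, dropping weak matches."""
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in chunks.items()]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Hypothetical stored chunks with hand-made vectors.
chunks = {
    "refund-policy": [0.9, 0.1, 0.0],
    "money-back": [0.8, 0.2, 0.1],
    "shipping": [0.4, 0.8, 0.1],
    "weather": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]

results = top_k(query, chunks, k=3, min_score=0.5)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```

This compares the query against every stored vector, which is fine for a few thousand chunks and is the baseline that approximate indexes speed up.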
Comparing the query against every stored vector one by one is exact but slow once you have millions of chunks. So vector databases build an index that finds the nearest neighbors approximately: almost as accurate, but dramatically faster. A popular one is HNSW (hierarchical navigable small world), a layered graph that the search walks toward the query instead of scanning every point. This family of methods is called ANN, approximate nearest neighbor search. You trade a little recall (occasionally missing a true nearest neighbor) for a large speed-up, which is almost always the right call.
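One way to check how much recall an approximate index gives up is to compare its results against exact search on a sample of queries. The sketch below assumes you already have both result lists; the chunk ids are hypothetical.

```python
def recall_at_k(exact_ids: list[str], approx_ids: list[str]) -> float:
    """Fraction of the true nearest neighbors that the approximate search also found."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


# Hypothetical results for one query: the ANN index missed c4 and returned c9.
exact = ["c1", "c2", "c3", "c4", "c5"]
approx = ["c1", "c2", "c3", "c5", "c9"]
print(recall_at_k(exact, approx))  # 0.8
```

Averaging this over many queries gives a single number to watch while tuning index settings.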
Pure similarity search has blind spots. It can miss an exact term like an error code or a product SKU, and its top result isn't always the best one. Two cheap upgrades fix most of this: hybrid search and reranking.
Keyword search catches exact terms like ERR-503; semantic search catches paraphrases. Together they cover each other's gaps.
Keyword search is precise about words; vector search is smart about meaning. Run both, then fuse the two ranked lists into one. A common, robust way to merge is RRF (reciprocal rank fusion): each chunk's score is the sum of 1/(k + rank) over the lists it appears in, so chunks that rank highly in either list rise to the top, with no score-tuning required.
Two independent searches, one merged result list.
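A minimal sketch of reciprocal rank fusion over two ranked lists. The constant k=60 is a conventional default rather than something this deck specifies, and the document ids are invented for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each appearance adds 1 / (k + rank) to a doc's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two searches for the same query.
keyword_hits = ["err-503-guide", "status-codes", "release-notes"]
vector_hits = ["timeouts-faq", "err-503-guide", "status-codes"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
# ['err-503-guide', 'status-codes', 'timeouts-faq', 'release-notes']
```

Only ranks are used, so the keyword and vector scores never need to be put on the same scale.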
Fast vector search gives you a rough shortlist — say the top 50. A reranker is a heavier, more accurate model that reads the query and each candidate together and scores true relevance, then reorders them. You keep only the top 5 after reranking. It's slower per item, which is exactly why you run it on 50, not 5 million.
Retrieve wide and cheap, rerank narrow and sharp.
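The retrieve-then-rerank pattern can be sketched as a function that takes a shortlist and any scoring function that reads the query and a candidate together. The word_overlap scorer below is only a stand-in so the example runs; a real reranker would be a trained model, and all names here are illustrative.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every shortlisted candidate against the query and keep the best few."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def word_overlap(query: str, text: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


# Pretend this shortlist came back from the fast vector search.
shortlist = [
    "our offices are closed on public holidays",
    "refunds for annual plans are prorated",
    "how to request a refund for an annual plan",
]
top = retrieve_then_rerank("refund for annual plan", shortlist, word_overlap, keep=2)
print(top)
```

Because the scorer runs once per candidate, its cost grows with the shortlist size, which is why the shortlist stays small.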
A vector database stores your chunk vectors, builds the ANN index, and answers similarity queries fast — most also do the metadata filtering and hybrid search you just saw. Here are the leading options, what each is best at, and how to choose.
An extension that adds a vector column + ANN index to plain Postgres.
Pro — reuse the database, backups, and SQL you already have; vectors live next to your relational data.
Con — at very large scale and very high query volume a purpose-built engine can outpace it.
A hosted, serverless vector service — no infrastructure to run.
Pro — fast to ship, scales for you, low ops burden.
Con — proprietary and usage-priced; your vectors live in someone else's cloud.
A dedicated vector engine with strong filtering; self-host or use their cloud.
Pro — excellent performance and rich metadata-filtered search; open-source.
Con — another service to operate if you self-host.
Open-source DB with built-in hybrid search and pluggable embedding modules.
Pro — hybrid search and schema features out of the box.
Con — more concepts to learn; heavier than a minimal setup.
A lightweight, open-source store that runs locally with almost no setup.
Pro — the fastest way to prototype RAG on your laptop.
Con — you'll graduate to something sturdier for big production loads.
An open-source, distributed vector DB aimed at very large deployments.
Pro — scales to billions of vectors with cluster-grade throughput.
Con — the most operational weight; overkill for small apps.
Two phases. Ingest happens once (and whenever data changes); query happens on every question. Everything in this deck slots into one of them.
Ingest fills the store; query reads from it. The vector DB is the shared hinge between the two phases.
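The two phases can be sketched end to end with an in-memory stand-in for the vector database. Everything here is a simplifying assumption: toy_embed is a bag-of-words counter rather than a real embedding model (so it matches words, not meaning), and TinyStore does exact search with no ANN index.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # list of (vector, chunk text)

    def ingest(self, chunks):
        # Ingest phase: embed each chunk once and store it.
        for chunk in chunks:
            self.rows.append((toy_embed(chunk), chunk))

    def query(self, question, k=2):
        # Query phase: embed the question the same way, then rank by similarity.
        q = toy_embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = TinyStore()
store.ingest([
    "refunds are issued within five business days",
    "shipping is free on orders over fifty dollars",
    "support is available by email on weekdays",
])
hits = store.query("when are refunds issued", k=1)
print(hits)  # ['refunds are issued within five business days']
```

Swapping toy_embed for a real model and TinyStore for a real vector database keeps the same two-phase shape.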