GUIDEDECK · giving an LLM your own knowledge

RAG & Vector Search.

A 34-minute working session on retrieval-augmented generation — how to let a language model answer from your documents without retraining it. Embeddings, chunking, similarity search, reranking, and the vector databases that hold it all together.

~34 MIN · BEGINNER → INTERMEDIATE · AI ENGINEERING
01 · Why RAG 4 min

A model can only answer from what it was trained on.

An LLM learns from a large snapshot of text, much of it from the public internet, up to a fixed training cutoff. It has never seen your company wiki, yesterday's support tickets, or the PDF a customer just uploaded. Ask about those and it either says "I don't know" or, worse, makes something up. RAG fixes that by handing the model the right facts at question time.

RAG (Retrieval-Augmented Generation) is a simple pattern: before the model answers, you retrieve the most relevant pieces of your own data and paste them into the prompt. The model then generates its answer grounded in those pieces. Think of it as an open-book exam: the model still does the reasoning, but you slide the right page in front of it first.

The whole idea in one line

  • Retrieve — search your data for the few snippets that actually relate to the question.
  • Augment — stuff those snippets into the prompt as context.
  • Generate — let the model answer using that context, and cite where it came from.
[Diagram: question from user → retrieve from your data (knowledge base: docs · wiki · PDFs) → LLM + context → answer]

The question pulls snippets from your knowledge base; those snippets ride along into the prompt so the model can answer from them.
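
Here is a minimal sketch of the augment step, assuming retrieval has already handed back a few snippets. The `Snippet` shape, the `buildPrompt` name, the instruction wording, and the sample snippets are all made up for illustration; none of them come from a library or a real policy.

```typescript
// A retrieved snippet plus where it came from (this shape is an assumption).
interface Snippet {
  source: string;
  text: string;
}

// Augment: number each snippet so the model can cite it as [1], [2], ...
function buildPrompt(question: string, snippets: Snippet[]): string {
  const context = snippets
    .map((s, i) => `[${i + 1}] (${s.source}) ${s.text}`)
    .join("\n");
  return [
    "Answer using only the context below. Cite snippets by number.",
    "If the context does not contain the answer, say you don't know.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Invented sample data, only to show the resulting prompt.
const prompt = buildPrompt("How long do refunds take?", [
  { source: "refund-policy.md", text: "Refunds are issued within 5 business days." },
  { source: "faq.md", text: "Contact support to start a refund." },
]);
console.log(prompt);
```

Numbering the snippets is one simple way to get the citations mentioned in the Generate step: the model can answer "within 5 business days [1]" and you can map [1] back to its source.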

Why not just retrain the model?

Fine-tuning — bake facts into weights
  • Expensive and slow — you re-run training every time a fact changes.
  • Hard to update — fixing one wrong line means another training run.
  • No easy citations — the model can't point at where an answer came from.
  • Still hallucinates — memorized facts blur together.
RAG — look facts up at question time
  • Cheap and fast — change a document, the next answer reflects it.
  • Always current — add today's ticket and it's searchable instantly.
  • Built-in citations — you know exactly which snippet was used.
  • Fewer made-up answers — the facts sit right in the prompt.

Like the difference between memorizing a textbook for an exam (fine-tuning) and being allowed to bring the book and look things up (RAG). Fine-tuning teaches style and skills; RAG supplies facts. Most teams reach for RAG first.

02 · Embeddings 5 min

Turning text into numbers that capture meaning.

Computers can't compare sentences the way we do. The trick is to convert each piece of text into a long list of numbers — a vector — positioned so that texts with similar meaning land close together. That conversion is called an embedding, and it's the engine under all of vector search.

Embedding — a list of numbers (a vector) that represents a piece of text's meaning. A model reads the text and outputs, say, 1,536 numbers. Texts about the same idea get nearby vectors; unrelated texts get distant ones — even when they share no words. "How do I get my money back?" and "refund policy" end up neighbors.
import { embed } from "ai"
import { openai } from "@ai-sdk/openai"

const { embedding } = await embed({
  model: openai.embedding("text-embedding-3-small"),
  value: "How do I get my money back?",
})
// embedding → [0.021, -0.044, 0.087, … ] (1536 numbers)

[Diagram: "refund policy" → embedding model (text → vector) → meaning-space, with "refund" and "money back" close together and "weather" far away]

Each text becomes a point. "Refund" and "money back" sit together; "weather" sits far away.

Why this is powerful

  • Meaning, not keywords. Old search matched the exact word refund. Embeddings match the idea, so a question phrased completely differently still finds the right doc.
  • Distance = relatedness. Once everything is a point, "is this relevant?" becomes "how close are these two points?" — a fast math operation.
  • Do it once, reuse many times. You embed every document up front and store the vectors; at question time you only embed the short query. Re-embed a document only when its text changes or you switch embedding models.
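
One common way to turn "how close are these two points?" into a number is cosine similarity, which compares the direction of two vectors. The sketch below is a plain implementation, not a library call, and the three-number vectors are invented toy values; real embeddings have hundreds or thousands of dimensions.

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors, made up for illustration only.
const refund = [0.9, 0.1, 0.0];
const moneyBack = [0.8, 0.2, 0.1];
const weather = [0.0, 0.1, 0.9];

console.log(cosineSimilarity(refund, moneyBack)); // close to 1: related
console.log(cosineSimilarity(refund, weather));   // close to 0: unrelated
```

Other measures (dot product, Euclidean distance) are also used; which one fits best usually depends on how the embedding model was trained.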

Picking an embedding model

  • OpenAI (text-embedding-3) — strong default, simple API, pay per token.
  • Cohere (embed v3/v4) — great multilingual quality; also makes the reranker we'll meet in Part 5.
  • Open-source (e.g. BGE, E5, nomic) — run them yourself, no per-call cost, full data control.

One rule: embed your documents and your queries with the same model. Vectors from two different models don't live in the same space and can't be compared.

03 · Chunking 5 min

Split documents so retrieval can actually find things.

You don't embed a whole 50-page handbook as one vector — that single point would be a blurry average of everything in it, and you'd have to feed the entire thing to the model. Instead you cut documents into small, focused pieces called chunks, and embed each one on its own.

Chunk — a small, self-contained slice of a document (a few paragraphs, say 300–800 words). Each chunk is embedded and stored separately, so search can return just the relevant slice instead of a whole file. Chunk size is a balance: too big and the vector is unfocused; too small and a single idea gets cut in half.
// split a long document into overlapping windows
// (splitText, embed, and store are placeholder helpers, not library calls)
const chunks = splitText(doc, {
  size: 800,     // ~chars (or tokens) per chunk
  overlap: 100,  // repeat across the seam
})
// → each chunk gets embedded + stored on its own
for (const c of chunks) await store(await embed(c), c)
[Diagram: a doc cut into chunk 1, chunk 2, chunk 3, with overlap between neighbors]

The document is cut into chunks that overlap slightly — so a sentence split across a seam still appears whole in one of them.
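
A helper like `splitText` could look roughly like this. It is a sketch of the simplest possible version, a fixed-size sliding window measured in characters; the name and the options object mirror the snippet above but are not from any particular library.

```typescript
// Fixed-size sliding window: each chunk starts (size - overlap) chars after the last.
function splitText(
  text: string,
  opts: { size: number; overlap: number },
): string[] {
  const { size, overlap } = opts;
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

console.log(splitText("abcdefghij", { size: 4, overlap: 1 }));
// → ["abcd", "defg", "ghij"]  ("d" and "g" are the overlapping seams)
```

Note that this version cuts wherever the character count lands, so it can slice a word or sentence in half. It shows the overlap mechanics only; real splitters usually combine a size limit with natural boundaries.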

size

Right-size the slice

A few hundred words is the usual sweet spot — big enough to hold one complete thought, small enough that its vector stays sharp.

overlap

Overlap the seams

Repeat ~10–20% of text between neighbors so an idea that straddles a boundary isn't lost — the answer might sit right on the cut.

structure

Cut on natural breaks

Split on headings, paragraphs, or sentences — not mid-word. Respecting structure keeps each chunk readable and on-topic.
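
Cutting on natural breaks can be sketched as "split on blank lines, then pack whole paragraphs into a chunk until it is full". The function name `chunkByParagraph` and the character budget are assumptions for illustration; a production splitter would likely also look at headings and sentence boundaries.

```typescript
// Pack whole paragraphs into chunks of at most maxChars, never cutting mid-paragraph.
function chunkByParagraph(doc: string, maxChars: number): string[] {
  const paragraphs = doc
    .split(/\n\s*\n/) // a blank line marks a paragraph break
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (candidate.length <= maxChars || current === "") {
      // Either it fits, or it is a single oversized paragraph that we keep whole.
      current = candidate;
    } else {
      chunks.push(current);
      current = p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const sample = "Aaaa.\n\nBbbb.\n\nCccc.";
console.log(chunkByParagraph(sample, 12));
// → ["Aaaa.\n\nBbbb.", "Cccc."]
```

The design choice here is to let an oversized paragraph exceed the budget instead of splitting it, on the assumption that a readable chunk matters more than a strict size limit.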

Bad chunking
  • Too large — one chunk covers five topics; its vector is a muddy average and matches nothing well.
  • Hard cuts — a table or sentence is sliced in half, so the retrieved snippet is gibberish.
Good chunking
  • One idea per chunk — focused vectors that match precise questions.
  • Overlap + structure — answers survive the seams and read cleanly when pasted into the prompt.

Like indexing a cookbook by recipe, not by chapter: you want the page with the exact dish, not the whole "Desserts" section.

05 · Reranking & hybrid search 5 min

Vector search is a great first pass, not the final word.

Pure similarity search has blind spots. It can miss an exact term like an error code or a product SKU, and its top result isn't always the best one. Two cheap upgrades fix most of this: hybrid search and reranking.

Hybrid search — run two searches and combine them: classic keyword search (also called lexical or BM25, which matches exact words) plus semantic vector search (which matches meaning). Keyword search nails exact tokens like ERR-503; semantic search catches paraphrases. Together they cover each other's gaps.

Lexical + semantic, then merge

Keyword search is precise about words; vector search is smart about meaning. Run both, then fuse the two ranked lists into one. A common, robust way to merge is RRF (reciprocal rank fusion) — it just rewards chunks that rank highly in either list, no score-tuning required.

  • Keyword wins on names, IDs, codes, rare terms.
  • Semantic wins on paraphrases and synonyms.
  • Fused — fewer "it didn't find the obvious doc" misses.
[Diagram: keyword (BM25 · exact) + semantic (vector · meaning) → fuse (RRF)]

Two independent searches, one merged result list.
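
RRF fits in a few lines: each chunk earns 1 / (k + rank) from every list it appears in, and the summed scores decide the final order. In the sketch below, k = 60 is the constant commonly used with RRF, and the function name and document IDs are made up for illustration.

```typescript
// Reciprocal rank fusion over any number of ranked ID lists (best first).
function reciprocalRankFusion(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

// Hypothetical results from the two searches.
const keywordHits = ["doc-err-503", "doc-a", "doc-b"];
const vectorHits = ["doc-a", "doc-c", "doc-err-503"];

const fused = reciprocalRankFusion([keywordHits, vectorHits]);
console.log(fused);
// → ["doc-a", "doc-err-503", "doc-c", "doc-b"]
```

Only ranks are used, never the raw BM25 or similarity scores, which is why no score-tuning is needed: the two searches can use completely different scales.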

A smarter second pass over the shortlist

Fast vector search gives you a rough shortlist — say the top 50. A reranker is a heavier, more accurate model that reads the query and each candidate together and scores true relevance, then reorders them. You keep only the top 5 after reranking. It's slower per item, which is exactly why you run it on 50, not 5 million.

  • Retriever — fast, casts a wide net (recall).
  • Reranker — slow, picks the true best (precision).
  • Best results land at the top, where the model pays most attention.
// Cohere Rerank: reorder by true relevance
// (assumes `cohere` is an initialized Cohere SDK client)
const ranked = await cohere.rerank({
  model: "rerank-v3.5",
  query,
  documents: candidates, // top-50 from search
  topN: 5,               // keep the best 5
})
[Diagram: 50 rough candidates → rerank (query + doc) → best 5]

Retrieve wide and cheap, rerank narrow and sharp.

  • Add hybrid search when users search by exact identifiers — product codes, names, version numbers, legal citations — and plain vector search keeps missing them.
  • Add a reranker when retrieval returns roughly right chunks but the genuinely best one isn't in the top few. It's the single highest-leverage quality upgrade for most RAG apps.
  • Start simple. Plain top-k vector search is a fine v1. Add these when your evals show the misses — not before.
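
An eval for retrieval can start very small: a handful of questions, each paired with the chunk that should come back, and a hit rate at k. The sketch below assumes a synchronous `retrieve` function that returns chunk IDs best-first; the `EvalCase` shape, the names, and the fake retriever are all hypothetical.

```typescript
// One test question and the ID of the chunk that should be retrieved for it.
interface EvalCase {
  question: string;
  expectedId: string;
}

// Fraction of cases whose expected chunk appears in the top k results.
function hitRateAtK(
  cases: EvalCase[],
  retrieve: (question: string) => string[],
  k: number,
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const topK = retrieve(c.question).slice(0, k);
    if (topK.indexOf(c.expectedId) !== -1) hits++;
  }
  return hits / cases.length;
}

// A fake retriever with canned answers, standing in for your real search.
const canned: Record<string, string[]> = {
  "refund window?": ["chunk-refund", "chunk-x"],
  "what is ERR-503?": ["chunk-x", "chunk-y", "chunk-err-503"],
};
const fakeRetrieve = (q: string): string[] => canned[q] ?? [];

const evalCases: EvalCase[] = [
  { question: "refund window?", expectedId: "chunk-refund" },
  { question: "what is ERR-503?", expectedId: "chunk-err-503" },
];

console.log(hitRateAtK(evalCases, fakeRetrieve, 1)); // → 0.5
console.log(hitRateAtK(evalCases, fakeRetrieve, 3)); // → 1
```

A pattern like the second case (right chunk found, but ranked third) is the kind of miss a reranker targets, while an expected chunk that never shows up at any k points more toward hybrid search or chunking.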
06 · The tooling 5 min

Where the vectors live: the vector database.

A vector database stores your chunk vectors, builds the ANN (approximate nearest neighbor) index, and answers similarity queries fast. Most also do the metadata filtering and hybrid search you just saw. Here are the leading options, what each is best at, and how to choose.

Vector database: a data store built to hold embeddings and answer "find the nearest vectors" in milliseconds, even across millions of points. The big practical choice is whether to add vectors to a database you already run (pgvector) or adopt a dedicated system built only for vectors.
pgvector

Postgres, extended

An extension that adds a vector column + ANN index to plain Postgres.

Pro — reuse the database, backups, and SQL you already have; vectors live next to your relational data.

Con — at very large scale and very high query volume a purpose-built engine can outpace it.
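
To make "a vector column + ANN index" concrete, here is a sketch of what the SQL might look like, held in TypeScript strings so it can sit next to the rest of the pipeline. It assumes Postgres with the pgvector extension installed; the `chunks` table and its column names are hypothetical, and 1536 matches the embedding size used earlier in this deck.

```typescript
// One-time setup: enable pgvector, add a vector column, build an HNSW index.
const SETUP_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
  id bigserial PRIMARY KEY,
  content text NOT NULL,
  embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
  ON chunks USING hnsw (embedding vector_cosine_ops);
`;

// Query: <=> is pgvector's cosine distance operator (smaller = closer).
const SEARCH_SQL = `
SELECT id, content
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT $2;
`;

// pgvector accepts vectors as text literals like '[0.1,0.2,0.3]'.
function toVectorLiteral(v: number[]): string {
  if (v.some((x) => !Number.isFinite(x))) {
    throw new Error("vector contains a non-finite number");
  }
  return `[${v.join(",")}]`;
}

// With a Postgres client you would pass these as query parameters, e.g.
// [toVectorLiteral(queryEmbedding), 5] for the top 5 chunks.
console.log(toVectorLiteral([0.1, -2, 3])); // → [0.1,-2,3]
```

Because the vectors live in an ordinary table, the same query can also join or filter on your relational columns, which is the main draw of the pgvector route.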

Pinecone

Fully managed

A hosted, serverless vector service — no infrastructure to run.

Pro — fast to ship, scales for you, low ops burden.

Con — proprietary and usage-priced; your vectors live in someone else's cloud.

Qdrant

Open-source, Rust-fast

A dedicated vector engine with strong filtering; self-host or use their cloud.

Pro — excellent performance and rich metadata-filtered search; open-source.

Con — another service to operate if you self-host.

Weaviate

Batteries-included

Open-source DB with built-in hybrid search and pluggable embedding modules.

Pro — hybrid search and schema features out of the box.

Con — more concepts to learn; heavier than a minimal setup.

Chroma

Dev-friendly

A lightweight, open-source store that runs locally with almost no setup.

Pro — the fastest way to prototype RAG on your laptop.

Con — you'll graduate to something sturdier for big production loads.

Milvus

Built for billions

An open-source, distributed vector DB aimed at very large deployments.

Pro — scales to billions of vectors with cluster-grade throughput.

Con — the most operational weight; overkill for small apps.

How to choose

  • Already on Postgres? Start with pgvector. One fewer system to run, and it carries most apps comfortably into production.
  • Want zero ops? A managed service like Pinecone (or hosted Qdrant/Weaviate) trades cost for not babysitting infrastructure.
  • Need top performance or rich filtering, self-hosted? Qdrant or Weaviate. Just prototyping? Chroma. Billions of vectors? Milvus.
  • Rule of thumb: pick the simplest option that meets today's scale — migrating embeddings later is far easier than running a cluster you didn't need.
07 · A worked pipeline & recap 4 min

The whole RAG pipeline, end to end.

Two phases. Ingest happens once (and whenever data changes); query happens on every question. Everything in this deck slots into one of them.

[Diagram: INGEST (once): load docs → chunk → embed → store in vector DB. QUERY (every question): question → embed → search top-k → rerank → build prompt + context → LLM → answer. Both phases use the same store.]

Ingest fills the store; query reads from it. The vector DB is the shared hinge between the two phases.

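// the query phase, in four steps
// (embed, db.search, and rerank stand in for your own helpers)
import { generateText } from "ai"
import { anthropic } from "@ai-sdk/anthropic"

const qv = await embed(question)                // 1 · embed query
const hits = await db.search(qv, { topK: 50 })  // 2 · vector search
const top = await rerank(question, hits, 5)     // 3 · rerank
const { text } = await generateText({           // 4 · augment + generate
  model: anthropic("claude-opus-4-8"),
  // join the chunk texts; assumes rerank returns an array of strings
  prompt: `Context:\n${top.join("\n\n")}\n\nQuestion: ${question}`,
})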
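
To see both phases run without any services, here is a toy in-memory version. Everything in it is an assumption for illustration: `toyEmbed` is a stand-in that hashes words into a small count vector (it only captures shared words, not meaning), `MemoryStore` is a brute-force substitute for a vector database, and the sample chunks are invented.

```typescript
type Entry = { text: string; vector: number[] };

// Stand-in for a real embedding model: counts words in hashed buckets.
// It matches shared words only; swap in a real model for actual meaning.
function toyEmbed(text: string, dims = 64): number[] {
  const v = new Array<number>(dims).fill(0);
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    let h = 0;
    for (let i = 0; i < word.length; i++) {
      h = (h * 31 + word.charCodeAt(i)) % dims;
    }
    v[h] += 1;
  }
  return v;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

class MemoryStore {
  private entries: Entry[] = [];

  // INGEST: embed a chunk once and keep the vector next to its text.
  add(text: string): void {
    this.entries.push({ text, vector: toyEmbed(text) });
  }

  // QUERY: embed the question, score every chunk, return the top k texts.
  search(query: string, topK: number): string[] {
    const qv = toyEmbed(query);
    return this.entries
      .map((e) => ({ text: e.text, score: cosine(qv, e.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map((e) => e.text);
  }
}

const store = new MemoryStore();
const sampleChunks = [
  "Refunds are issued within 5 business days.",
  "Our office is closed on public holidays.",
  "Shipping takes 3 to 7 days.",
];
for (const chunk of sampleChunks) store.add(chunk);

const results = store.search("when are refunds issued", 1);
console.log(results); // → ["Refunds are issued within 5 business days."]
```

Scoring every entry is fine for a few thousand chunks; the ANN index in a real vector database exists to avoid exactly this full scan at larger scale.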

Five takeaways

  • RAG = retrieve + generate. Look facts up at question time instead of baking them in.
  • Embeddings turn meaning into distance. Close vectors = related text.
  • Chunk well. Right-sized, overlapping slices make or break retrieval.
  • Hybrid + rerank rescue the misses pure vector search leaves behind.
  • Start with pgvector and the simplest pipeline; add parts when evals demand them.