RAG & Vector Search · GuideDeck

01 · Why RAG 4 min

A model can only answer from
what it was trained on.

An LLM learned from public text up to a fixed training cutoff. It has never seen your company wiki, yesterday's support tickets, or the PDF a customer just uploaded. Ask about those and it either says "I don't know" or, worse, makes something up. RAG fixes that by handing the model the right facts at question time.

RAG — Retrieval-Augmented Generation — is a simple pattern: before the model answers, you retrieve the most relevant pieces of your own data and paste them into the prompt. The model then generates its answer grounded in those pieces. Think of it as an open-book exam — the model still does the reasoning, but you slide the right page in front of it first.

The whole idea in one line

Retrieve — search your data for the few snippets that actually relate to the question.
Augment — stuff those snippets into the prompt as context.
Generate — let the model answer using that context, and cite where it came from.

The question pulls snippets from your knowledge base; those snippets ride along into the prompt so the model can answer from them.

Why not just retrain the model?

Fine-tuning — bake facts into weights

Expensive and slow — you re-run training every time a fact changes.
Hard to update — fixing one wrong line means another training run.
No easy citations — the model can't point at where an answer came from.
Still hallucinates — memorized facts blur together.

RAG — look facts up at question time

Cheap and fast — change a document, the next answer reflects it.
Always current — add today's ticket and it's searchable instantly.
Built-in citations — you know exactly which snippet was used.
Fewer made-up answers — the facts sit right in the prompt.

Like the difference between memorizing a textbook for an exam (fine-tuning) and being allowed to bring the book and look things up (RAG). Fine-tuning teaches style and skills; RAG supplies facts. Most teams reach for RAG first.

02 · Embeddings 5 min

Turning text into numbers
that capture meaning.

Computers can't compare sentences the way we do. The trick is to convert each piece of text into a long list of numbers — a vector — positioned so that texts with similar meaning land close together. That conversion is called an embedding, and it's the engine under all of vector search.

Embedding — a list of numbers (a vector) that represents a piece of text's meaning. A model reads the text and outputs, say, 1,536 numbers. Texts about the same idea get nearby vectors; unrelated texts get distant ones — even when they share no words. "How do I get my money back?" and "refund policy" end up neighbors.

import { embed } from "ai"
import { openai } from "@ai-sdk/openai"

const { embedding } = await embed({
  model: openai.embedding("text-embedding-3-small"),
  value: "How do I get my money back?",
})
// embedding → [0.021, -0.044, 0.087, … ] (1536 numbers)

Each text becomes a point. "Refund" and "money back" sit together; "weather" sits far away.

Why this is powerful

Meaning, not keywords. Old search matched the exact word refund. Embeddings match the idea, so a question phrased completely differently still finds the right doc.
Distance = relatedness. Once everything is a point, "is this relevant?" becomes "how close are these two points?" — a fast math operation.
Do it once, reuse it. You embed every document up front and store the vectors, re-embedding only when a document or the embedding model changes; at question time you only embed the short query.

Picking an embedding model

OpenAI (text-embedding-3) — strong default, simple API, pay per token.
Cohere (embed v3/v4) — great multilingual quality; also makes the reranker we'll meet in Part 5.
Open-source (e.g. BGE, E5, nomic) — run them yourself, no per-call cost, full data control.

One rule: embed your documents and your queries with the same model. Vectors from two different models don't live in the same space and can't be compared.

03 · Chunking 5 min

Split documents so retrieval
can actually find things.

You don't embed a whole 50-page handbook as one vector — that single point would be a blurry average of everything in it, and you'd have to feed the entire thing to the model. Instead you cut documents into small, focused pieces called chunks, and embed each one on its own.

Chunk — a small, self-contained slice of a document (a few paragraphs, say 300–800 words). Each chunk is embedded and stored separately, so search can return just the relevant slice instead of a whole file. Chunk size is a balance: too big and the vector is unfocused; too small and a single idea gets cut in half.

// split a long document into overlapping windows
// (splitText, embed, and store stand in for your own helpers)
const chunks = splitText(doc, {
  size: 800,    // ~chars (or tokens) per chunk
  overlap: 100, // repeat across the seam
})

// → each chunk gets embedded + stored on its own
for (const c of chunks) await store(await embed(c), c)

The document is cut into chunks that overlap slightly — so a sentence split across a seam still appears whole in one of them.
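One way to picture the splitText call above is a sliding character window. The deck does not define splitText, so the version below is a hypothetical sketch, assuming size and overlap are counted in characters. It cuts purely by length, so it can still split mid-word.

```typescript
// Hypothetical helper: the deck calls splitText() but does not define it.
interface SplitOptions {
  size: number;    // maximum characters per chunk
  overlap: number; // characters repeated from the end of the previous chunk
}

function splitText(doc: string, { size, overlap }: SplitOptions): string[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) {
    throw new Error("overlap must be >= 0 and smaller than size");
  }
  const step = size - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < doc.length; start += step) {
    chunks.push(doc.slice(start, start + size));
    if (start + size >= doc.length) break; // this window already reached the end
  }
  return chunks;
}

const demoChunks = splitText("abcdefghijklmnopqrstuvwxyz", { size: 10, overlap: 3 });
// demoChunks → ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

Each chunk starts with the last three characters of the one before it, which is the overlap carrying text across the seam.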

size

Right-size the slice

A few hundred words is the usual sweet spot — big enough to hold one complete thought, small enough that its vector stays sharp.

overlap

Overlap the seams

Repeat ~10–20% of text between neighbors so an idea that straddles a boundary isn't lost — the answer might sit right on the cut.

structure

Cut on natural breaks

Split on headings, paragraphs, or sentences — not mid-word. Respecting structure keeps each chunk readable and on-topic.
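A minimal sketch of cutting on natural breaks, assuming paragraphs are separated by blank lines. The names chunkByParagraph and maxChars are illustrative, not from any library, and overlap is left out to keep the idea visible.

```typescript
// Hypothetical helper: pack whole paragraphs into chunks of at most maxChars.
function chunkByParagraph(doc: string, maxChars: number): string[] {
  const paragraphs = doc
    .split(/\n\s*\n/) // a blank line marks a paragraph break
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (candidate.length <= maxChars || current === "") {
      // It fits, or the chunk is still empty: an oversized paragraph
      // is kept whole rather than cut mid-sentence.
      current = candidate;
    } else {
      chunks.push(current); // close the full chunk
      current = p;          // start a new one with this paragraph
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A paragraph longer than maxChars becomes its own chunk here. A fuller version might fall back to sentence boundaries for those cases.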

Bad chunking

Too large — one chunk covers five topics; its vector is a muddy average and matches nothing well.
Hard cuts — a table or sentence is sliced in half, so the retrieved snippet is gibberish.

Good chunking

One idea per chunk — focused vectors that match precise questions.
Overlap + structure — answers survive the seams and read cleanly when pasted into the prompt.

Like indexing a cookbook by recipe, not by chapter — you want the page with the exact dish, not the whole "Desserts" section.

04 · Vector search & retrieval 6 min

Find the nearest chunks,
then hand them to the model.

Every chunk is now a point. To answer a question, you turn the question into a point too, then find the handful of stored points sitting closest to it. Those are your most-relevant chunks. This is vector search, and it's the "retrieve" in RAG.

Vector search (a.k.a. similarity search) — given a query vector, return the stored vectors closest to it. "Closest" is measured by cosine similarity: are these two vectors pointing in the same direction? Same direction → same meaning → high score. The top-k closest chunks are what you keep — usually k of 3 to 10.
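To make "closest" concrete, here is a small sketch of cosine similarity plus an exact (brute-force) top-k search over an in-memory list. The type names and the optional minScore cutoff are assumptions for illustration; a real vector store does this behind an index.

```typescript
type Vector = number[];

// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite.
function cosineSimilarity(a: Vector, b: Vector): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface StoredChunk {
  id: string;
  vector: Vector;
}

interface Hit {
  id: string;
  score: number;
}

// Score every chunk against the query, drop weak matches, keep the k best.
function topK(query: Vector, chunks: StoredChunk[], k: number, minScore = -1): Hit[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .filter((h) => h.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

With toy 2-D vectors, a query of [1, 0] ranks a "refund" chunk at [1, 0.1] above a "weather" chunk at [0, 1], which is the geometry the diagram shows.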

The query (amber) lands among the chunks; the three nearest (mint) are the top-k results. The far-off points are ignored.

The query flow, step by step

1 · Embed the question with the same model used for the chunks.
2 · Search the vector store for the top-k nearest chunk vectors.
3 · Assemble those chunks into the prompt as context.
4 · Generate — the model answers from the context and you surface the sources.

-- pgvector: find the 5 nearest chunks to the query
SELECT content, source
FROM chunks
ORDER BY embedding <=> :query  -- <=> = cosine distance
LIMIT 5;                       -- top-k

Each result can carry a similarity score (with pgvector's cosine distance, that is 1 - (embedding <=> :query)); you keep the top-k and can drop anything below a threshold.
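Steps 3 and 4 of the flow depend on how the retrieved chunks are pasted into the prompt. The sketch below is one possible layout, not a prescribed format: buildPrompt and the instruction wording are assumptions, while the content and source fields mirror the columns in the query above. Numbering the snippets gives the model something to cite.

```typescript
interface RetrievedChunk {
  content: string; // the chunk text
  source: string;  // where it came from, e.g. a file name or URL
}

// Assemble a grounded prompt: numbered snippets first, then the question.
function buildPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (source: ${c.source})\n${c.content}`)
    .join("\n\n");

  return [
    "Answer the question using only the context below.",
    "Cite the bracketed numbers of the snippets you used.",
    "If the context does not contain the answer, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Because each snippet keeps its source next to its number, a citation like [2] in the answer can be mapped back to a document when you surface sources to the user.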

One thing to know: it's usually approximate

Comparing the query against every stored vector one by one is exact but slow once you have millions of chunks. So vector databases build a smart index (the popular one is called HNSW) that finds the nearest neighbors approximately — almost as accurate, but dramatically faster. This family of methods is called ANN, approximate nearest neighbor search. You trade a tiny bit of recall for a huge speed-up — almost always the right call.
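The "tiny bit of recall" can be measured rather than guessed. A minimal sketch, assuming you can run both an exact search and the ANN index for the same query and compare the ids they return; recallAtK is an illustrative name.

```typescript
// Fraction of the true nearest neighbors that the approximate search also returned.
function recallAtK(exactIds: string[], approxIds: string[]): number {
  if (exactIds.length === 0) return 1; // nothing to find, nothing missed
  const found = new Set(approxIds);
  const hits = exactIds.filter((id) => found.has(id)).length;
  return hits / exactIds.length;
}

const recall = recallAtK(["a", "b", "c", "d", "e"], ["a", "b", "c", "d", "x"]);
// recall → 0.8 (the index missed one of the five true neighbors)
```

Averaging this over a sample of real queries gives a number to watch when tuning index settings.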

05 · Reranking & hybrid search 5 min

Vector search is a great
first pass — not the final word.

Pure similarity search has blind spots. It can miss an exact term like an error code or a product SKU, and its top result isn't always the best one. Two cheap upgrades fix most of this: hybrid search and reranking.

Hybrid search — run two searches and combine them: classic keyword search (also called lexical or BM25, which matches exact words) plus semantic vector search (which matches meaning). Keyword search nails exact tokens like ERR-503; semantic search catches paraphrases. Together they cover each other's gaps.

Lexical + semantic, then merge

Keyword search is precise about words; vector search is smart about meaning. Run both, then fuse the two ranked lists into one. A common, robust way to merge is RRF (reciprocal rank fusion) — it just rewards chunks that rank highly in either list, no score-tuning required.

Keyword wins on names, IDs, codes, rare terms.
Semantic wins on paraphrases and synonyms.
Fused — fewer "it didn't find the obvious doc" misses.

Two independent searches, one merged result list.
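Reciprocal rank fusion is short enough to sketch in full. Each chunk earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The function name is illustrative, and k = 60 is a commonly used default rather than a value from this deck.

```typescript
// Fuse several ranked lists of chunk ids (best first) into one ranking.
function reciprocalRankFusion(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

const keywordHits = ["err503", "refund", "shipping"];
const semanticHits = ["refund", "returns", "err503"];
const fused = reciprocalRankFusion([keywordHits, semanticHits]);
// fused → ["refund", "err503", "returns", "shipping"]
```

Only ranks are used, never raw scores, which is why BM25 and cosine values do not need to be put on a common scale first.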

A smarter second pass over the shortlist

Fast vector search gives you a rough shortlist — say the top 50. A reranker is a heavier, more accurate model that reads the query and each candidate together and scores true relevance, then reorders them. You keep only the top 5 after reranking. It's slower per item, which is exactly why you run it on 50, not 5 million.

Retriever — fast, casts a wide net (recall).
Reranker — slow, picks the true best (precision).
Best results land at the top, where the model pays most attention.

// Cohere Rerank — reorder by true relevance
const ranked = await cohere.rerank({
  model: "rerank-v3.5",
  query,
  documents: candidates, // top-50 from search
  topN: 5,               // keep the best 5
})

Retrieve wide and cheap, rerank narrow and sharp.
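The shape of the second pass is the same whichever reranker you plug in: score each (query, candidate) pair, sort, keep the top few. The sketch below shows only that shape. The Scorer type and the wordOverlap scorer are toy stand-ins invented for illustration; a real reranker is a trained model, not word counting.

```typescript
// Any function that reads the query and one candidate together and returns a relevance score.
type Scorer = (query: string, document: string) => number;

// Second pass: score the shortlist pair by pair, then keep the topN.
function rerank(query: string, candidates: string[], score: Scorer, topN: number): string[] {
  return candidates
    .map((doc) => ({ doc, relevance: score(query, doc) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topN)
    .map((r) => r.doc);
}

// Toy scorer (assumption): the fraction of query words that appear in the document.
const wordOverlap: Scorer = (query, document) => {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;
  const docWords = new Set(document.toLowerCase().split(/\W+/));
  return words.filter((w) => docWords.has(w)).length / words.length;
};

const best = rerank(
  "reset password",
  ["Shipping times by region", "How to reset your password", "Password rules"],
  wordOverlap,
  2,
);
// best → ["How to reset your password", "Password rules"]
```

Swapping wordOverlap for a call to a hosted reranker changes the quality of the scores but not the retrieve-wide, rerank-narrow structure.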

Add hybrid search when users search by exact identifiers — product codes, names, version numbers, legal citations — and plain vector search keeps missing them.
Add a reranker when retrieval returns roughly right chunks but the genuinely best one isn't in the top few. It's often the highest-leverage quality upgrade for a RAG app.
Start simple. Plain top-k vector search is a fine v1. Add these when your evals show the misses — not before.

06 · The tooling 5 min

Where the vectors live:
the vector database.

A vector database stores your chunk vectors, builds the ANN index, and answers similarity queries fast — most also do the metadata filtering and hybrid search you just saw. Here are the leading options, what each is best at, and how to choose.

Vector database — a data store built to hold embeddings and answer "find the nearest vectors" in milliseconds, even across millions of points. The big practical choice is whether to add vector search to a database you already run (pgvector) or adopt a dedicated system built only for vectors.

pgvector

Postgres, extended

An extension that adds a vector column + ANN index to plain Postgres.

Pro — reuse the database, backups, and SQL you already have; vectors live next to your relational data.

Con — at very large scale and very high query volume a purpose-built engine can outpace it.

Pinecone

Fully managed

A hosted, serverless vector service — no infrastructure to run.

Pro — fast to ship, scales for you, low ops burden.

Con — proprietary and usage-priced; your vectors live in someone else's cloud.

Qdrant

Open-source, Rust-fast

A dedicated vector engine with strong filtering; self-host or use their cloud.

Pro — excellent performance and rich metadata-filtered search; open-source.

Con — another service to operate if you self-host.

Weaviate

Batteries-included

Open-source DB with built-in hybrid search and pluggable embedding modules.

Pro — hybrid search and schema features out of the box.

Con — more concepts to learn; heavier than a minimal setup.

Chroma

Dev-friendly

A lightweight, open-source store that runs locally with almost no setup.

Pro — the fastest way to prototype RAG on your laptop.

Con — you'll graduate to something sturdier for big production loads.

Milvus

Built for billions

An open-source, distributed vector DB aimed at very large deployments.

Pro — scales to billions of vectors with cluster-grade throughput.

Con — the most operational weight; overkill for small apps.

How to choose

Already on Postgres? Start with pgvector. One fewer system to run, and it carries most apps comfortably into production.
Want zero ops? A managed service like Pinecone (or hosted Qdrant/Weaviate) trades cost for not babysitting infrastructure.
Need top performance or rich filtering, self-hosted? Qdrant or Weaviate. Just prototyping? Chroma. Billions of vectors? Milvus.
Rule of thumb: pick the simplest option that meets today's scale — migrating embeddings later is far easier than running a cluster you didn't need.

07 · A worked pipeline & recap 4 min

The whole RAG pipeline, end to end.

Two phases. Ingest happens once (and whenever data changes); query happens on every question. Everything in this deck slots into one of them.

Ingest fills the store; query reads from it. The vector DB is the shared hinge between the two phases.
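The ingest phase can be sketched the same way as the query phase. Everything named here (Embedder, VectorStore, buildRecords, ingest, and the source#index id scheme) is an assumption for illustration, with the embedding call and the store passed in so the sketch stays independent of any provider.

```typescript
type Embedder = (text: string) => Promise<number[]>;

interface VectorRecord {
  id: string;
  source: string;
  content: string;
  vector: number[];
}

interface VectorStore {
  upsert(records: VectorRecord[]): Promise<void>;
}

// Pure step: pair each chunk with its vector and a deterministic id.
function buildRecords(source: string, chunks: string[], vectors: number[][]): VectorRecord[] {
  if (chunks.length !== vectors.length) {
    throw new Error("need exactly one vector per chunk");
  }
  return chunks.map((content, i) => ({
    id: `${source}#${i}`, // same document + position gives the same id on re-ingest
    source,
    content,
    vector: vectors[i],
  }));
}

// Ingest one document: embed every chunk, then write the records to the store.
async function ingest(
  source: string,
  chunks: string[],
  embed: Embedder,
  store: VectorStore,
): Promise<number> {
  const vectors = await Promise.all(chunks.map((c) => embed(c)));
  const records = buildRecords(source, chunks, vectors);
  await store.upsert(records);
  return records.length;
}
```

Deterministic ids let a changed document overwrite its old rows if the store upserts by id. If a document shrinks, leftover higher-numbered chunks would still need deleting.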

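import { generateText } from "ai"
import { anthropic } from "@ai-sdk/anthropic"

// the query phase, in four steps
// (embed, db.search, and rerank stand in for your own helpers;
//  rerank is assumed to return the chunk texts as strings)
const qv = await embed(question)                // 1 · embed query
const hits = await db.search(qv, { topK: 50 })  // 2 · vector search
const top = await rerank(question, hits, 5)     // 3 · rerank
const { text } = await generateText({           // 4 · augment + generate
  model: anthropic("claude-opus-4-8"),
  prompt: `Context:\n${top.join("\n\n")}\n\nQuestion: ${question}`,
})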

Five takeaways

RAG = retrieve + augment + generate. Look facts up at question time instead of baking them in.
Embeddings turn meaning into distance. Close vectors = related text.
Chunk well. Right-sized, overlapping slices make or break retrieval.
Hybrid + rerank rescue the misses pure vector search leaves behind.
Start with pgvector and the simplest pipeline; add parts when evals demand them.
