AIStackWatch

Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is the pattern of fetching relevant documents at query time and placing them in the prompt alongside the user's question. The LLM then answers from that fresh context rather than relying only on memorized training data.

In production apps, RAG often sits inside an agent framework: the agent decides when to retrieve, a vector database stores the candidate chunks, and evals check whether the retrieved context actually improves the answer.

The basic flow

  1. Index — chunk your documents (300-1000 tokens each), embed each chunk, store vectors in a database.
  2. Retrieve — embed the user query, run a nearest-neighbor search against the index, pull the top-k chunks.
  3. Augment — prepend the chunks to the prompt with a directive like "answer only using the context below."
  4. Generate — the LLM produces an answer, ideally citing which chunks it used.
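
The four steps above can be sketched end to end with nothing but the standard library. This is a minimal illustration, not a production design: embed() is a toy word-count stand-in for a real embedding model, the "index" is an in-memory list rather than a vector database, and the Generate step is left out because it depends on whichever LLM client you use. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Step 1 (Index): embed every chunk and keep text and vector together."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Step 2 (Retrieve): brute-force nearest-neighbor search, top-k by cosine."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def augment(chunks: list[str], question: str) -> str:
    """Step 3 (Augment): prepend numbered chunks plus a grounding directive."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer only using the context below. Cite chunk numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping to Canada takes 5 to 7 business days.",
    "Support is available by email on weekdays.",
]
index = build_index(chunks)
question = "How long do refunds take?"
top = retrieve(index, question, k=1)
prompt = augment(top, question)
# Step 4 (Generate) would send `prompt` to the LLM of your choice.
```

Numbering the chunks in the prompt is what makes the citation in step 4 possible: the model can refer to "[1]" and you can map that back to a source document.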

Why RAG over fine-tuning

  • Freshness. Today's docs appear in today's answers.
  • Attribution. You can cite the chunks that grounded the answer.
  • Cost. No training job; update by re-embedding changed documents.
  • Access control. Filter chunks by user permissions at retrieval time, not inside weights.
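
The access-control point can be illustrated with a small filter that runs before ranking. This is a sketch under assumptions: the Chunk class, the allowed_groups field, and visible_chunks() are hypothetical names, and a real vector store would usually apply the same idea as a metadata filter inside the query rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset  # groups permitted to read this chunk


def visible_chunks(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


corpus = [
    Chunk("Public pricing page.", frozenset({"everyone"})),
    Chunk("Q3 salary bands.", frozenset({"hr"})),
    Chunk("Incident runbook.", frozenset({"engineering", "sre"})),
]

# An engineer never sees the HR chunk, so it can never reach the prompt.
engineer_view = visible_chunks(corpus, {"everyone", "engineering"})
```

Because the filter runs at retrieval time, revoking a permission takes effect on the next query; nothing has to be retrained.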

Common failure modes

  • Bad chunking. Splitting mid-sentence or mid-table destroys meaning. Chunk on semantic boundaries (headings, paragraphs).
  • Single-vector search misses keywords. Product codes and exact phrases fail on pure semantic search. Hybrid search (BM25 + vectors) is the standard fix.
  • Top-k too small. If the answer is split across five chunks and you retrieve three, the model confabulates the rest.
  • Stale index. A document is updated at the source, but its embedding still reflects last month's version. Set up an incremental re-indexing job.
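
For the hybrid-search fix, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The page does not say which fusion method to use, so treat this as one option rather than the prescribed one. The document IDs are made up, and k=60 is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by ID for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for a query containing a product code.
keyword_hits = ["sku-4471", "faq-12", "guide-3"]  # e.g. from BM25
vector_hits = ["guide-3", "faq-12", "blog-9"]     # e.g. from embeddings

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF only looks at ranks, so it avoids having to normalize BM25 scores and cosine similarities onto a common scale. Documents found by both retrievers rise to the top, while the exact-match hit that only BM25 found still survives into the fused list.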

Vector store choices

  • Pinecone — managed, large-scale, predictable pricing.
  • Weaviate — open-source, hybrid search built in.
  • Turbopuffer — serverless, very cheap cold storage, good for archive-class RAG.
  • pgvector — run inside Postgres you already have; see its own wiki page.

When NOT to use RAG

If the corpus fits in the model's context window AND fits in your cache budget, skip retrieval and pass the whole corpus in the prompt; this is simpler and often more accurate, since the retriever cannot miss anything. That threshold is a moving target as context windows grow, so re-evaluate quarterly.
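
A rough version of that check can be written as a few lines of arithmetic. Everything here is an assumption for illustration: the function name, the 4-characters-per-token heuristic (a loose rule of thumb for English text; use your model's tokenizer for real numbers), the reserved headroom for the question and answer, and the idea of expressing the cache budget in tokens.

```python
def fits_without_retrieval(
    docs: list[str],
    context_window_tokens: int,
    cache_budget_tokens: int,
    reserved_tokens: int = 2_000,
    chars_per_token: float = 4.0,
) -> bool:
    """Return True if the whole corpus can plausibly be sent in one prompt."""
    estimated_tokens = sum(len(d) for d in docs) / chars_per_token
    fits_window = estimated_tokens + reserved_tokens <= context_window_tokens
    fits_cache = estimated_tokens <= cache_budget_tokens
    return fits_window and fits_cache


# Ten documents of 4,000 characters each: roughly 10,000 tokens in total.
docs = ["a" * 4_000] * 10

small_corpus_ok = fits_without_retrieval(docs, 128_000, 50_000)
tight_window_ok = fits_without_retrieval(docs, 8_000, 50_000)
```

Rerunning a check like this when the corpus grows or when you switch models is a cheap way to do the quarterly re-evaluation.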