Learn

What is Retrieval-Augmented Generation (RAG)?

How LLMs answer questions about your data without being retrained on it.

Free to start · No credit card required · Updated Apr 2026
Short answer

Retrieval-Augmented Generation (RAG) is a pattern where an AI system retrieves relevant documents from a vector database and includes them in the prompt to an LLM, allowing it to answer questions using information outside its training data. Introduced by Meta AI researchers in 2020, RAG is the standard way to give LLMs access to private data, current data, or data too large to fit in a single prompt.

In depth

Every LLM has a training cutoff date and a fixed context window. RAG solves both problems at once by adding a retrieval step before generation. When a user asks a question, the system first converts the question into a vector embedding, searches a vector database for the most semantically similar chunks of your documents, and then prepends those chunks to the prompt sent to the LLM. The LLM then answers based on your data plus its pre-trained knowledge.

The RAG pipeline has four stages:

  1. Indexing: chunk your documents (typically 200-800 tokens per chunk), embed each chunk with a model like OpenAI text-embedding-3 or Cohere Embed v3, and store the vectors in a database like Pinecone, Weaviate, Qdrant, pgvector, or Chroma.
  2. Retrieval: at query time, embed the question the same way and run a nearest-neighbor search, typically returning the top-5 to top-20 chunks.
  3. Augmentation: stuff those chunks into the LLM prompt along with instructions.
  4. Generation: the LLM reads the retrieved context and produces an answer, ideally citing which chunks it used.

RAG matters because it solves the three hardest problems of applying LLMs to real business data. First, freshness: your knowledge base changes daily, and retraining a model is prohibitively expensive; with RAG, you just re-index. Second, privacy: sending proprietary data to train a public model is unacceptable; with RAG your data stays in your vector store. Third, citability: when the LLM answers from retrieved chunks, you can show the user exactly which document the answer came from, making the system auditable in a way pure generation is not.

RAG comes in several flavors. Naive RAG does simple similarity search and is fine for small, clean corpora. Advanced RAG adds reranking (e.g. Cohere's rerank-english-v3), query rewriting, and hybrid search (semantic plus BM25 keyword). Agentic RAG lets the LLM itself decide when and how to retrieve, issuing multiple queries, synthesizing results, and refining its search. In 2025, most production systems moved from naive to hybrid with reranking, and 2026 is seeing rapid adoption of agentic RAG for complex research tasks.

Tycoon uses RAG heavily to give the AI CEO and AI employees persistent memory of the business. Every project doc, past conversation, task history, and uploaded file is indexed. When Astra answers 'what did we decide about pricing last month?', she's not searching her training data; she's retrieving from the business's own vector store. This is why an AI employee feels continuous across weeks while a raw ChatGPT session forgets everything between chats.

RAG is not a silver bullet. It fails when retrieval brings back irrelevant chunks (garbage in, garbage out), when your documents are poorly chunked (answers split across chunk boundaries), or when the question requires reasoning over the entire corpus rather than a few passages (RAG can't 'summarize your whole codebase'; it can only summarize the top-k chunks it retrieved). The frontier of RAG research in 2026 is hybrid systems that combine retrieval with long-context reasoning and tool use.
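The four stages can be sketched in a few dozen lines. This is a toy illustration, not a production implementation: a hashing bag-of-words function stands in for a real embedding model, and an in-memory list stands in for a vector database; the sample chunks and question are invented.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy hashing embedding: a deterministic stand-in for a real
    embedding model such as text-embedding-3 (simplification)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Stage 1: indexing -- chunk, embed, store.
chunks = [
    "Refund policy: customers may request a refund within 30 days.",
    "Shipping: orders ship within 2 business days from our warehouse.",
    "Pricing: the Pro plan costs $29 per month, billed annually.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stage 2: retrieval -- embed the question, nearest-neighbor search.
question = "How much does the Pro plan cost per month?"
q_vec = embed(question)
ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
top_k = [chunk for chunk, _ in ranked[:2]]

# Stage 3: augmentation -- stuff retrieved chunks into the prompt.
prompt = ("Answer using only this context:\n" + "\n".join(top_k) +
          f"\n\nQuestion: {question}")

# Stage 4: generation -- the prompt would now be sent to an LLM (omitted).
print(top_k[0])
```

A real system differs mainly in scale: the vector store uses approximate nearest-neighbor search instead of a full sort, and the embedding model captures meaning rather than word overlap.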

Examples

  • ChatGPT with custom GPTs uploading knowledge files — under the hood, RAG over your uploaded docs
  • Anthropic Claude projects — upload documents and Claude uses RAG to ground answers in them
  • Perplexity — real-time web search is effectively RAG over the open web, feeding retrieved pages into GPT-4/Claude/Sonar
  • Cursor and Windsurf — RAG over your codebase so the AI coder can answer questions about files it hasn't seen in the current session
  • Tycoon — every project generates a RAG-indexed knowledge base; Astra (AI CEO) retrieves from it on every chat turn to maintain continuity
  • Enterprise chat-your-docs products (Glean, Guru, Mendable) — RAG over internal wikis, Slack, and Google Docs
  • Customer support AI that retrieves from your help center and past tickets before generating a reply

Frequently asked questions

How is RAG different from fine-tuning?

Fine-tuning bakes knowledge into the model weights by training on your data — it changes the model itself. RAG leaves the model unchanged and gives it access to your data at inference time. Fine-tuning is better for teaching style, tone, or proprietary reasoning patterns; RAG is better for factual knowledge that changes frequently. Most production systems use base models plus RAG because fine-tuning costs more, takes longer to update, and hides where an answer came from. A common pattern is fine-tuning for tone and RAG for facts.

Do I still need RAG if the model has a 1 million-token context window?

Usually yes, for two reasons. First, cost and latency: sending 1M tokens every query is expensive ($5-30 per query depending on the model) and slow (5-30 seconds). RAG lets you send only the relevant 5K tokens. Second, quality degrades with large contexts — models have a well-documented 'lost in the middle' problem where information in the middle of a long context gets ignored. RAG keeps the retrieved content small and highly relevant, which actually improves answer quality versus dumping everything in.
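The cost side of this argument is simple arithmetic. The sketch below assumes a hypothetical price of $3 per million input tokens; real prices vary by model and change often.

```python
# Back-of-envelope per-query cost: full 1M-token context vs. RAG.
# Assumed price: $3 per million input tokens (hypothetical figure).
price_per_token = 3.0 / 1_000_000

full_context_tokens = 1_000_000   # dump the whole corpus every query
rag_context_tokens = 5_000        # send only the retrieved chunks

full_cost = full_context_tokens * price_per_token
rag_cost = rag_context_tokens * price_per_token

print(f"full context: ${full_cost:.2f}/query, RAG: ${rag_cost:.3f}/query")
```

At these assumed prices, retrieval cuts the per-query input bill by a factor of 200, before even counting the latency and 'lost in the middle' effects.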

Which vector database should I use?

For prototypes: Chroma or FAISS (local, free). For production: Pinecone (managed, fast), Qdrant (open-source, self-hostable), Weaviate (open-source + managed), or pgvector (PostgreSQL extension, good if you're already on Postgres). The differences are mostly operational — scaling, managed vs. self-hosted, and integration with your existing stack. Algorithmically they all use similar approximate nearest neighbor methods (HNSW, IVF). Tycoon uses pgvector because the rest of the platform is on PostgreSQL.

What are the biggest RAG failure modes?

Four common ones. (1) Poor chunking — if you chunk by fixed token count and split a table or code block, retrieval brings back unusable fragments. Fix: chunk by semantic boundaries. (2) Embedding-query mismatch — the retrieval model finds documents similar to the question phrasing but not to the answer phrasing. Fix: query rewriting or HyDE (hypothetical document embeddings). (3) Top-k too small or too large — too small misses the answer, too large adds noise. Fix: rerank and take top-3 to top-5 after a wider initial retrieval. (4) Stale index — your docs changed but the vectors didn't. Fix: incremental re-indexing triggered by doc updates.
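Fix (1) can be sketched directly: pack whole paragraphs into chunks under a token budget instead of cutting at a fixed token count. Word count stands in for a real tokenizer here, and the sample document is invented.

```python
def chunk_by_paragraphs(text, max_tokens=200):
    """Chunk on paragraph boundaries rather than fixed token counts,
    so tables, code blocks, and sentences are not split mid-way.
    Approximates tokens by word count (a simplification)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            # Budget exceeded: close the current chunk, start a new one.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("First paragraph about refunds.\n\n"
       "Second paragraph about shipping.\n\n"
       + ("word " * 198).strip())
chunks = chunk_by_paragraphs(doc, max_tokens=200)
```

Note that a single paragraph larger than the budget still becomes its own oversized chunk; production chunkers add a fallback split for that case.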

How much does running RAG cost?

Three cost lines. Embedding cost: $0.01-0.10 per million tokens (one-time for indexing, cheap). Vector DB: free (pgvector, Chroma) to $70-500+/month (managed Pinecone at scale). Inference cost: dominated by LLM calls, typically $0.001-0.10 per query depending on the model and context size. For a small business with 10K documents and 1K queries per day, total cost is typically $10-100/month — the LLM calls dominate, and RAG makes them dramatically cheaper than stuffing everything into context.
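The three cost lines can be tallied for the scenario above (10K documents, 1K queries/day). Every unit price below is an assumption chosen from within the ranges just quoted, not a vendor quote.

```python
# Rough monthly estimate for a 10K-document, 1K-queries/day business.
docs, tokens_per_doc = 10_000, 800
embed_price_per_million = 0.02          # assumed embedding price
embedding_cost = docs * tokens_per_doc / 1_000_000 * embed_price_per_million

vector_db_monthly = 0.0                 # self-hosted pgvector or Chroma
cost_per_query = 0.002                  # assumed, within the $0.001-0.10 range
llm_monthly = cost_per_query * 1_000 * 30

total_first_month = embedding_cost + vector_db_monthly + llm_monthly
print(f"indexing (one-time): ${embedding_cost:.2f}")
print(f"LLM calls/month:     ${llm_monthly:.2f}")
```

As the article notes, the LLM calls dominate: indexing is pennies, and the total lands inside the typical $10-100/month range.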

Run your one-person company.

Hire your AI team in 30 seconds. Start for free.

Free to start · No credit card required · Set up in 30 seconds