What is Retrieval-Augmented Generation (RAG)?
How LLMs answer questions about your data without being retrained on it.
Retrieval-Augmented Generation (RAG) is a pattern where an AI system retrieves relevant documents from a vector database and includes them in the prompt to an LLM, allowing it to answer questions using information outside its training data. Introduced by Meta AI researchers in 2020, RAG is the standard way to give LLMs access to private data, current data, or data too large to fit in a single prompt.
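The retrieve-then-generate loop can be sketched in a few lines. This toy version uses keyword-overlap scoring in place of a real embedding model and vector database, and stops short of the final LLM call — every document and score here is made up for illustration:

```python
import re

# Minimal RAG loop: score documents against the query, take the best
# matches, and stuff them into the prompt. A real system would use an
# embedding model + vector DB instead of keyword overlap.

def tokens(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words found in the doc."""
    q = tokens(query)
    return len(q & tokens(doc)) / len(q)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the top-k documents by relevance score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Put the retrieved context in the prompt, then the question."""
    context = "\n\n".join(retrieved)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Office hours: Monday to Friday, nine to five.",
    "Refunds go back to the original payment method.",
]
query = "What is the refund policy?"
retrieved = retrieve(query, docs)
prompt = build_prompt(query, retrieved)
# `prompt` now contains the two refund-related documents plus the
# question; the last step is sending it to an LLM of your choice.
```

The key property is visible even in the toy: the model only ever sees the handful of documents the retriever selected, not the whole corpus.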
In depth
Examples
- ChatGPT with custom GPTs uploading knowledge files — under the hood, RAG over your uploaded docs
- Anthropic Claude projects — upload documents and Claude uses RAG to ground answers in them
- Perplexity — real-time web search is effectively RAG over the open web, feeding retrieved pages into GPT-4/Claude/Sonar
- Cursor and Windsurf — RAG over your codebase so the AI coder can answer questions about files it hasn't seen in the current session
- Tycoon — every project generates a RAG-indexed knowledge base; Astra (AI CEO) retrieves from it on every chat turn to maintain continuity
- Enterprise chat-your-docs products (Glean, Guru, Mendable) — RAG over internal wikis, Slack, and Google Docs
- →Customer support AI that retrieves from your help center and past tickets before generating a reply
Frequently asked questions
How is RAG different from fine-tuning?
Fine-tuning bakes knowledge into the model weights by training on your data — it changes the model itself. RAG leaves the model unchanged and gives it access to your data at inference time. Fine-tuning is better for teaching style, tone, or proprietary reasoning patterns; RAG is better for factual knowledge that changes frequently. Most production systems use base models plus RAG because fine-tuning costs more, takes longer to update, and hides where an answer came from. A common pattern is fine-tuning for tone and RAG for facts.
Do I still need RAG if the model has a 1 million-token context window?
Usually yes, for two reasons. First, cost and latency: sending 1M tokens every query is expensive ($5-30 per query depending on the model) and slow (5-30 seconds). RAG lets you send only the relevant 5K tokens. Second, quality degrades with large contexts — models have a well-documented 'lost in the middle' problem where information in the middle of a long context gets ignored. RAG keeps the retrieved content small and highly relevant, which actually improves answer quality versus dumping everything in.
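The cost side of this argument is easy to check with back-of-envelope arithmetic. The price below is an assumed mid-range figure of $3 per million input tokens, consistent with the range cited above, not a quote for any specific model:

```python
# Back-of-envelope comparison: full-context stuffing vs. RAG.
# $3 per million input tokens is an assumed mid-range price.
PRICE_PER_TOKEN = 3.00 / 1_000_000

full_context = 1_000_000 * PRICE_PER_TOKEN  # send the whole corpus every query
rag_context = 5_000 * PRICE_PER_TOKEN       # send only the retrieved chunks

print(f"Full context: ${full_context:.2f}/query")         # $3.00/query
print(f"RAG:          ${rag_context:.4f}/query")          # $0.0150/query
print(f"Savings:      {full_context / rag_context:.0f}x") # 200x
```

At a thousand queries a day, that 200x gap is the difference between a rounding error and a real line item.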
Which vector database should I use?
For prototypes: Chroma or FAISS (local, free). For production: Pinecone (managed, fast), Qdrant (open-source, self-hostable), Weaviate (open-source + managed), or pgvector (PostgreSQL extension, good if you're already on Postgres). The differences are mostly operational — scaling, managed vs. self-hosted, and integration with your existing stack. Algorithmically they all use similar approximate nearest neighbor methods (HNSW, IVF). Tycoon uses pgvector because the rest of the platform is on PostgreSQL.
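Whichever database you pick, the core operation is the same: find the stored vectors closest to the query vector, usually by cosine similarity. An exact brute-force version in plain Python — the computation that HNSW and IVF indexes approximate at scale — looks like this; the 3-dimensional vectors are made up for illustration (real embedding models produce hundreds to thousands of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Exact k-nearest-neighbor search. ANN indexes (HNSW, IVF) trade a
    little recall for sublinear time on millions of vectors."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]), reverse=True)
    return ranked[:k]

# Tiny made-up 3-d "embeddings" keyed by document id.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "office-hours":  [0.0, 0.8, 0.6],
    "pricing":       [0.7, 0.0, 0.7],
}
print(nearest([1.0, 0.1, 0.1], index, k=2))  # ['refund-policy', 'pricing']
```

This is also why the operational differences matter more than the algorithmic ones: every option computes roughly this, and they differ mainly in how they shard, index, and host it.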
What are the biggest RAG failure modes?
Four common ones. (1) Poor chunking — if you chunk by fixed token count and split a table or code block, retrieval brings back unusable fragments. Fix: chunk by semantic boundaries. (2) Embedding-query mismatch — the retrieval model finds documents similar to the question phrasing but not to the answer phrasing. Fix: query rewriting or HyDE (hypothetical document embeddings). (3) Top-k too small or too large — too small misses the answer, too large adds noise. Fix: rerank and take top-3 to top-5 after a wider initial retrieval. (4) Stale index — your docs changed but the vectors didn't. Fix: incremental re-indexing triggered by doc updates.
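Failure mode (1) is the easiest to demonstrate. A minimal semantic chunker splits on blank lines (paragraph boundaries) and greedily packs paragraphs up to a size budget, rather than cutting at a fixed token count; real chunkers also respect headings, tables, and code fences. This is a sketch, not a production implementation:

```python
def chunk_by_paragraphs(text: str, max_chars: int = 200) -> list[str]:
    """Split on blank lines, then greedily pack paragraphs into chunks
    no larger than max_chars. A paragraph longer than the budget becomes
    its own chunk rather than being split mid-sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = (
    "Refund policy.\n\n"
    "Returns accepted within 30 days.\n\n"
    + "Shipping details. " * 20  # one long paragraph, kept whole
)
chunks = chunk_by_paragraphs(doc)
# The two short refund paragraphs are packed together; the long
# shipping paragraph lands in its own chunk, never cut mid-sentence.
```

A fixed-count chunker fed the same text could easily slice the refund policy in half, leaving neither fragment retrievable on its own.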
How much does running RAG cost?
Three cost lines. Embedding cost: $0.01-0.10 per million tokens (one-time for indexing, cheap). Vector DB: free (pgvector, Chroma) to $70-500+/month (managed Pinecone at scale). Inference cost: dominated by LLM calls, typically $0.001-0.10 per query depending on the model and context size. For a small business with 10K documents and 1K queries per day, total cost is typically $10-100/month — the LLM calls dominate, and RAG makes them dramatically cheaper than stuffing everything into context.
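Those three lines can be combined into a rough estimate for the small-business scenario. The per-token prices below are assumptions for a cheap small model, chosen to fall inside the ranges above, not quotes for any specific provider:

```python
# Rough monthly estimate: 10K docs, 1K queries/day, pgvector (free tier).
# Assumed prices (not quotes): embeddings $0.02/M tokens, LLM input $0.15/M.
docs_tokens = 10_000 * 500            # 10K docs at ~500 tokens each
embed_once = docs_tokens / 1e6 * 0.02 # one-time indexing cost

queries_per_month = 1_000 * 30
tokens_per_query = 5_000              # retrieved context + question
llm_monthly = queries_per_month * tokens_per_query / 1e6 * 0.15

print(f"One-time embedding: ${embed_once:.2f}")   # $0.10
print(f"LLM calls/month:    ${llm_monthly:.2f}")  # $22.50
```

Even with output tokens and a managed vector DB added in, the total sits comfortably in the tens of dollars, with the LLM calls dominating — consistent with the $10-100/month range above.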