What is Semantic Search?
Finding documents by meaning, not by keywords.
Semantic search is a retrieval technique that ranks documents by meaning similarity rather than keyword overlap. It converts both query and documents into vector embeddings and returns the closest matches by cosine similarity or dot product. Unlike traditional lexical search (BM25, tf-idf), it finds relevant results that share no literal words with the query, which is why it powers modern RAG, AI search, and agent memory systems.
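The mechanics above can be sketched in a few lines. This is a toy illustration: the hand-written 4-dimensional vectors stand in for real model embeddings (which typically have hundreds to thousands of dimensions), and the document texts are invented for the example.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim "embeddings" standing in for real model output.
query = [0.9, 0.1, 0.0, 0.2]
docs = {
    "refund policy for returns": [0.8, 0.2, 0.1, 0.3],
    "quarterly revenue report":  [0.1, 0.9, 0.4, 0.0],
}

# Rank documents by similarity to the query vector, best first.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
```

Note that neither query nor top document needs to share a word for this to work; the similarity lives entirely in the vectors.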
Examples
- ChatGPT memory retrieval — when you ask a new question, it semantically matches against stored memories
- Perplexity's search — embeds your query, retrieves relevant web pages, feeds them to the LLM
- Anthropic Claude projects — semantic search over your uploaded documents to ground answers
- Tycoon — every project chat and doc is indexed, so Astra retrieves only the 5-10 most relevant chunks per turn
- Cursor and Windsurf — semantic search over your codebase so the AI coder finds relevant files instantly
- Spotify Discover Weekly — semantic similarity between your liked songs and the entire catalog
- Google's AI Overviews — semantic retrieval of relevant web passages before generating the summary
Frequently asked questions
How is semantic search different from keyword search?
Keyword search (BM25, tf-idf, Elasticsearch defaults) matches based on term overlap — documents score high if they contain the same words as the query. Semantic search matches based on meaning — documents score high if their vector embedding is close to the query's embedding, even if they share no literal words. Keyword search wins on exact identifiers and phrase matching; semantic search wins on conceptual queries and paraphrases. The production answer in 2026 is hybrid: run both, fuse results, optionally rerank. Pure semantic or pure keyword is rarely the right choice for real applications.
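One common way to fuse the two result lists is reciprocal rank fusion (RRF), which needs only the ranks, not the raw scores. A minimal sketch with invented document IDs, using the widely used k = 60 constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each appearance at rank r contributes 1/(k + r + 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits  = ["doc_A", "doc_C", "doc_B"]   # e.g. a BM25 ranking
semantic_hits = ["doc_B", "doc_A", "doc_D"]   # e.g. a vector ranking
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Documents that appear high in both lists float to the top, while documents found by only one retriever still survive into the fused list.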
Which embedding model should I use?
For general-purpose English, OpenAI text-embedding-3-large is a strong default ($0.13 per 1M tokens, 3072 dims, top-tier quality). Cohere Embed v3 is competitive, with better multilingual performance. For cost-sensitive workloads, text-embedding-3-small ($0.02 per 1M) is nearly as good at roughly a sixth of the cost. Open-source bge-m3 and nomic-embed-v2 are strong self-hostable options. For code, use a code-specialized model like voyage-code-3. Domain-specific embeddings (legal, biomedical) outperform general models in-domain by 10-20% in recall. Test on your actual queries: MTEB leaderboards are useful but not predictive of your workload.
How fast is semantic search?
Typical production latency is 10-50ms for millions of vectors, using HNSW or IVF approximate nearest neighbor indexes. Exact search (no index) is O(n) and gets slow past ~100K vectors. The query embedding itself is often the slowest step — a call to OpenAI text-embedding-3 takes 50-200ms over the network. Self-hosted embeddings run in single-digit ms on GPU. For user-facing search you usually want sub-100ms end to end, which means self-hosted embeddings or aggressive caching of query embeddings.
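Caching query embeddings can be as simple as memoizing the embed call. A minimal sketch, where `_call_embedding_api` is a hypothetical stand-in for your provider's client (the real call is the 50-200ms network step described above):

```python
import functools

API_CALLS = 0  # counts simulated network round-trips, for illustration

def _call_embedding_api(text: str) -> tuple[float, ...]:
    # Hypothetical stand-in for the provider round-trip; returns a fake
    # deterministic vector instead of a real embedding.
    global API_CALLS
    API_CALLS += 1
    return tuple(float(ord(c)) for c in text)

@functools.lru_cache(maxsize=10_000)
def _embed_cached(normalized: str) -> tuple[float, ...]:
    return _call_embedding_api(normalized)

def embed_query(text: str) -> tuple[float, ...]:
    # Normalize before the cache lookup so trivially different phrasings
    # of the same query share one cache entry.
    return _embed_cached(text.strip().lower())

embed_query("Refund Policy")
embed_query("refund policy")  # cache hit: no second round-trip
```

The key design choice is normalizing before the cache, not inside the cached function, so case and whitespace variants collapse into one entry.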
Why do I still need keyword search if semantic search is smarter?
Three reasons. (1) Exact matching: semantic search won't reliably find 'error code E-1042' because embeddings compress away exact tokens. BM25 finds it instantly. (2) Novel jargon: embedding models only know words they saw in training. For new product names, internal acronyms, or domain jargon that post-dates the embedding model, semantic search fails while keyword search still works. (3) Cost: keyword indexes are cheap to build and update. Semantic indexes require re-embedding when models change. The hybrid pattern covers both failure modes — use keywords for exact/rare tokens, semantic for conceptual matching, fuse the results.
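One way to exploit that split is to route queries containing exact identifiers to the keyword index as well. A sketch with a hypothetical regex heuristic — the pattern and example queries are invented, and you would tune the pattern to your own identifier formats:

```python
import re

# Hypothetical heuristic: identifier-like tokens (error codes, version
# strings, uppercase SKUs) suggest the keyword index should be queried too.
EXACT_TOKEN = re.compile(r"\b(?:[A-Z]-?\d{2,}|v\d+\.\d+|[A-Z]{2,}\d+)\b")

def needs_keyword_search(query: str) -> bool:
    """True if the query contains a token that embeddings may compress away."""
    return bool(EXACT_TOKEN.search(query))
```

In a hybrid setup this check gates whether the BM25 leg runs at all, or simply boosts its weight during fusion.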
What is reranking and do I need it?
Reranking is a second pass that takes the top 50-200 results from initial retrieval and re-scores them using a more expensive but more accurate model — usually a cross-encoder that takes query+document together, or a small LLM prompted to judge relevance. Cohere rerank-english-v3 is a popular commercial option; cross-encoder models like bge-reranker-large are self-hostable. Reranking adds 100-500ms of latency but typically improves precision@10 by 10-25 percentage points over embedding-only retrieval. You need it for user-facing RAG where answer quality matters more than latency. You don't need it for coarse retrieval like 'find roughly relevant docs to feed into context.'
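The two-stage shape can be sketched as follows. Both `retrieve` and `cross_encoder_score` are stubs standing in for a real first-pass retriever and a real cross-encoder forward pass (the stub just counts shared words); only the pipeline structure is the point:

```python
def retrieve(query: str, corpus: list[str], top_k: int = 50) -> list[str]:
    # Stage 1 stub: cheap first-pass retrieval (vector or BM25 in production).
    return corpus[:top_k]

def cross_encoder_score(query: str, doc: str) -> float:
    # Stub for a cross-encoder, which would read query and document together
    # and output a relevance score; here, just count shared words.
    return float(len(set(query.split()) & set(doc.split())))

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    # Stage 2: re-score the small candidate pool with the expensive model.
    scored = sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)
    return scored[:top_n]

corpus = ["shipping times for orders", "refund policy details", "refund request steps"]
top = rerank("refund policy", retrieve("refund policy", corpus))
```

The cost structure is what makes this work: the expensive scorer only ever sees the 50-200 candidates, never the full corpus.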