
What is Semantic Search?

Finding documents by meaning, not by keywords.

Updated Apr 2026
Short answer

Semantic search is a retrieval technique that ranks documents by meaning similarity rather than keyword overlap. It converts both query and documents into vector embeddings and returns the closest matches by cosine similarity or dot product. Unlike traditional lexical search (BM25, tf-idf), it finds relevant results that share no literal words with the query, which is why it powers modern RAG, AI search, and agent memory systems.

In depth

Classical search engines match keywords. A query for 'how to fire an employee' retrieves documents containing those exact tokens, missing pages about 'letting a team member go' or 'termination procedures.' Semantic search fixes this by encoding the query and documents into a shared vector space where semantically similar texts sit close together geometrically. The query 'how to fire an employee' might sit at 0.12 cosine distance from a page titled 'termination best practices' even though they share only one common word.

The pipeline has two phases. In indexing, you chunk documents into passages (typically 200-800 tokens each), run them through an embedding model like OpenAI text-embedding-3-large, Cohere Embed v3, or open-source alternatives like bge-large, and store the resulting vectors in a database. In querying, you embed the user's query with the same model and run an approximate nearest neighbor (ANN) search — HNSW or IVF algorithms — against the stored vectors. The top-k closest passages come back, typically in 10-50 milliseconds even across millions of vectors.

Semantic search by itself isn't always better than keyword search. Pure lexical search (BM25) still wins on exact-phrase queries, product SKUs, legal citations, and code identifiers — anywhere the literal string matters. The production pattern in 2026 is hybrid search: run both BM25 and semantic retrieval in parallel, fuse the results with reciprocal rank fusion, and optionally rerank the top 50-100 with a cross-encoder like Cohere rerank-english-v3 or a smaller LLM. This beats either approach alone by 10-20% on most retrieval benchmarks.

The quality of semantic search is bounded by the embedding model. OpenAI text-embedding-3-large (3072 dims), Cohere Embed v3 (1024 dims), and bge-m3 are the current leaders by 2026 MTEB scores, with rapidly improving open-source contenders. Domain-specific embeddings (legal, code, biomedical) outperform general-purpose ones within their domain. Matryoshka embeddings, which let you truncate vectors to smaller sizes at query time, are increasingly popular for cost control.

For AI agents, semantic search is the foundation of both RAG (retrieving knowledge to answer questions) and memory (retrieving relevant past interactions). Tycoon indexes every project doc, chat message, and task with embeddings so Astra can answer 'what did we decide about pricing last quarter' without the founder having to dig through Notion. That retrieval is what makes the agent feel continuous across time. Without semantic search, an AI employee either has to load the entire business history on every call (expensive and noisy) or forget everything (useless).
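The core of the query phase — rank stored vectors by cosine similarity to the query vector — fits in a few lines. A minimal sketch with made-up 4-dimensional vectors standing in for real model output; in production the embeddings would come from a model like text-embedding-3-large, and the linear scan would be replaced by an ANN index:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Rank stored passages by cosine similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Toy 4-dim "embeddings" standing in for real model output.
index = {
    "termination-best-practices": [0.9, 0.1, 0.0, 0.3],
    "quarterly-revenue-report":   [0.0, 0.8, 0.5, 0.1],
    "onboarding-checklist":       [0.2, 0.3, 0.9, 0.0],
}
query = [0.8, 0.2, 0.1, 0.25]  # pretend embedding of 'how to fire an employee'

for score, doc_id in top_k(query, index):
    print(f"{score:.3f}  {doc_id}")
```

The exact scan above is O(n) per query; HNSW and IVF indexes exist precisely to avoid that scan at scale.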

Examples

  • ChatGPT memory retrieval — when you ask a new question, it semantically matches against stored memories
  • Perplexity's search — embeds your query, retrieves relevant web pages, feeds them to the LLM
  • Anthropic Claude projects — semantic search over your uploaded documents to ground answers
  • Tycoon — every project chat and doc is indexed, so Astra retrieves only the 5-10 most relevant chunks per turn
  • Cursor and Windsurf — semantic search over your codebase so the AI coder finds relevant files instantly
  • Spotify Discover Weekly — semantic similarity between your liked songs and the entire catalog
  • Google's AI Overviews — semantic retrieval of relevant web passages before generating the summary

Frequently asked questions

How is semantic search different from keyword search?

Keyword search (BM25, tf-idf, Elasticsearch defaults) matches based on term overlap — documents score high if they contain the same words as the query. Semantic search matches based on meaning — documents score high if their vector embedding is close to the query's embedding, even if they share no literal words. Keyword search wins on exact identifiers and phrase matching; semantic search wins on conceptual queries and paraphrases. The production answer in 2026 is hybrid: run both, fuse results, optionally rerank. Pure semantic or pure keyword is rarely the right choice for real applications.
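The "fuse results" step is commonly reciprocal rank fusion, which needs only the rank positions from each retriever. A minimal sketch — the doc ids and the two input rankings are made up, and k=60 is the conventional damping constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists with reciprocal rank fusion.

    rankings: list of ranked doc-id lists (best first).
    Each appearance of a doc at rank r contributes 1 / (k + r).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits     = ["doc-7", "doc-2", "doc-9"]  # keyword ranking
semantic_hits = ["doc-2", "doc-4", "doc-7"]  # embedding ranking

fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
print(fused)  # doc-2 wins: it ranks high in both lists
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.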

Which embedding model should I use?

For general-purpose English, OpenAI text-embedding-3-large is a strong default ($0.13 per 1M tokens, 3072 dims, top-tier quality). Cohere Embed v3 is competitive with better multilingual performance. For cost-sensitive workloads, text-embedding-3-small ($0.02 per 1M) is nearly as good with a third the cost. Open-source bge-m3 and nomic-embed-v2 are strong self-hostable options. For code, use a code-specialized model like voyage-code-3. Domain-specific (legal, biomedical) embeddings outperform general models within domain by 10-20% recall. Test on your actual queries — MTEB leaderboards are useful but not predictive of your workload.

How fast is semantic search?

Typical production latency is 10-50ms for millions of vectors, using HNSW or IVF approximate nearest neighbor indexes. Exact search (no index) is O(n) and gets slow past ~100K vectors. The query embedding itself is the slowest step — a call to OpenAI text-embedding-3 takes 50-200ms over the network. Self-hosted embeddings run in single-digit ms on GPU. For user-facing search you usually want sub-100ms end to end, which means self-hosted embeddings or aggressive caching of query embeddings.
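The query-embedding cache mentioned above can be as simple as an in-process LRU wrapped around the embed call. A sketch, where `embed_query` is a hypothetical stand-in for the real 50-200ms network call:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_query(text: str) -> tuple:
    """Stand-in for a network call to an embedding API.

    A real implementation would call the provider here; with the cache,
    repeated queries skip the network round trip entirely.
    """
    # Hypothetical deterministic toy embedding, for illustration only.
    return tuple((hash((text, i)) % 1000) / 1000 for i in range(4))

embed_query("pricing decisions")  # first call: would hit the API
embed_query("pricing decisions")  # second call: served from cache
print(embed_query.cache_info())   # hits=1, misses=1
```

In production you would typically use a shared cache (e.g. Redis) rather than per-process memory, so the hit rate survives restarts and spreads across replicas.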

Why do I still need keyword search if semantic search is smarter?

Three reasons. (1) Exact matching: semantic search won't reliably find 'error code E-1042' because embeddings compress away exact tokens. BM25 finds it instantly. (2) Novel jargon: embedding models only know words they saw in training. For new product names, internal acronyms, or domain jargon that post-dates the embedding model, semantic search fails while keyword search still works. (3) Cost: keyword indexes are cheap to build and update. Semantic indexes require re-embedding when models change. The hybrid pattern covers both failure modes — use keywords for exact/rare tokens, semantic for conceptual matching, fuse the results.
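The exact-matching case is where a plain inverted index shines. A toy sketch (document ids and text are made up) showing that a token like 'E-1042' is a trivial dictionary lookup for keyword search, with no embedding involved:

```python
def build_inverted_index(docs):
    """Map each token to the set of doc ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index

docs = {
    "kb-1": "Troubleshooting error code E-1042 on startup",
    "kb-2": "General performance tuning guide",
}
index = build_inverted_index(docs)
print(index["e-1042"])  # exact-token lookup finds kb-1 directly
```

Real BM25 engines add scoring on top of this structure, but the exact-token guarantee comes from the index itself, which is why it never misses an identifier it has seen.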

What is reranking and do I need it?

Reranking is a second pass that takes the top 50-200 results from initial retrieval and re-scores them using a more expensive but more accurate model — usually a cross-encoder that takes query+document together, or a small LLM prompted to judge relevance. Cohere rerank-english-v3 is a popular commercial option; cross-encoder models like bge-reranker-large are self-hostable. Reranking adds 100-500ms of latency but typically improves precision@10 by 10-25 percentage points over embedding-only retrieval. You need it for user-facing RAG where answer quality matters more than latency. You don't need it for coarse retrieval like 'find roughly relevant docs to feed into context.'
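The two-stage shape is independent of which model does the scoring. A minimal sketch — `overlap_score` is a deliberately crude stand-in for a cross-encoder, and a real system would call a model such as bge-reranker-large at that point:

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Second pass: re-score initial candidates with a costlier scorer."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc),
                    reverse=True)
    return ranked[:top_n]

# Toy stand-in for a cross-encoder: fraction of query tokens in the doc.
def overlap_score(query, doc):
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

candidates = [
    "pricing strategy notes from Q3",
    "office snack preferences survey",
    "decisions about pricing last quarter",
]
best = rerank("what did we decide about pricing last quarter",
              candidates, overlap_score, top_n=2)
print(best[0])
```

The key design point is that the expensive scorer only ever sees the 50-200 candidates the cheap retriever surfaced, which is what keeps the added latency bounded.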
