Classical search engines match keywords. A query for 'how to fire an employee' retrieves documents containing those exact tokens and misses pages about 'letting a team member go' or 'termination procedures.' Semantic search fixes this by encoding the query and the documents into a shared vector space where texts with similar meanings sit close together. The query 'how to fire an employee' might sit at a cosine distance of 0.12 from a page titled 'termination best practices' even though the two share no words.
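To make 'close together' concrete, here is a minimal sketch of cosine distance, one common way to measure it. The three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions and come from a model, not from hand-picked numbers.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy embeddings, made up for this example.
query_vec = [0.9, 0.1, 0.3]        # 'how to fire an employee'
termination_vec = [0.8, 0.2, 0.4]  # 'termination best practices'
recipe_vec = [0.1, 0.9, 0.2]       # an unrelated page

near = cosine_distance(query_vec, termination_vec)
far = cosine_distance(query_vec, recipe_vec)
```

With these made-up numbers, `near` comes out much smaller than `far`, which is the property a semantic index relies on: related texts score a small distance regardless of shared vocabulary.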
The pipeline has two phases. In indexing, you chunk documents into passages (typically 200-800 tokens each), run them through an embedding model like OpenAI text-embedding-3-large, Cohere Embed v3, or an open-source alternative like bge-large, and store the resulting vectors in a database. In querying, you embed the user's query with the same model and run an approximate nearest neighbor (ANN) search, usually over an HNSW or IVF index, against the stored vectors. The top-k closest passages come back, typically in 10-50 milliseconds even across millions of vectors.
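The two phases can be sketched end to end in plain Python. Everything here is a simplification under stated assumptions: `toy_embed` is a hypothetical stand-in (a hashed bag-of-words that captures word overlap, not meaning), chunking counts whitespace-separated words instead of model tokens, and the search is an exact brute-force scan where a production system would use an ANN index.

```python
import hashlib
import math


def chunk(text, size=200, overlap=20):
    """Split text into overlapping windows of whitespace-separated words.

    Words stand in for tokens; a real pipeline would use the embedding
    model's own tokenizer. Requires size > overlap.
    """
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def toy_embed(text, dims=64):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words.

    Returns a unit-length vector. Replace with a real model call in practice.
    """
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(documents, size=200, overlap=20):
    """Indexing phase: chunk each document and store (passage, vector) pairs."""
    return [(p, toy_embed(p)) for doc in documents for p in chunk(doc, size, overlap)]


def search(index, query, k=3):
    """Query phase: embed the query the same way and score every passage.

    Vectors are unit length, so the dot product equals cosine similarity.
    """
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), passage) for passage, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


docs = [
    "termination procedures require written notice to the employee",
    "our pizza recipe uses fresh basil and ripe tomatoes",
]
index = build_index(docs)
top_hits = search(index, "employee termination notice", k=1)
```

The important structural point is that `build_index` and `search` call the same embedding function. Mixing models between indexing and querying puts the vectors in different spaces and makes the distances meaningless.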
Semantic search by itself isn't always better than keyword search. Pure lexical search (BM25) still wins on exact-phrase queries, product SKUs, legal citations, and code identifiers, anywhere the literal string matters. The production pattern in 2026 is hybrid search: run BM25 and semantic search in parallel, fuse the two result lists with reciprocal rank fusion, and optionally rerank the top 50-100 with a cross-encoder like Cohere rerank-english-v3 or a small LLM. This beats either alone by 10-20% on most retrieval benchmarks.
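Reciprocal rank fusion is simple enough to sketch in a few lines. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the conventional default; the document IDs and the two input rankings below are invented for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers, best hit first.
bm25_hits = ["sku-123", "doc-a", "doc-b"]
semantic_hits = ["doc-a", "doc-c", "sku-123"]

fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
# fused == ["doc-a", "sku-123", "doc-c", "doc-b"]
```

Because the method uses only ranks, it needs no calibration between BM25 scores and cosine similarities, which live on unrelated scales. Here 'doc-a' wins by placing near the top of both lists, while 'sku-123' still ranks second on the strength of its exact lexical match.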
The quality of semantic search is bounded by the embedding model. OpenAI text-embedding-3-large (3072 dims), Cohere Embed v3 (1024 dims), and bge-m3 are the current leaders by 2026 MTEB scores, with other open-source contenders improving rapidly. Domain-specific embeddings (legal, code, biomedical) outperform general-purpose ones within their domain. Matryoshka embeddings, which are trained so that a vector can be truncated to its leading dimensions and still work, are increasingly popular for cost control; the stored vectors and the query vector must be cut to the same size.
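Truncating a Matryoshka-style embedding amounts to keeping a prefix of the vector and rescaling it to unit length. The sketch below assumes the vector came from a model trained for this; applying it to an ordinary embedding usually degrades quality badly. The function name and the sample vector are illustrative only.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale them to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale in an all-zero prefix
    return [x / norm for x in head]


# Hypothetical 3-dimensional vector cut down to 2 dimensions.
full = [3.0, 4.0, 12.0]
short = truncate_embedding(full, 2)
```

Renormalizing matters because cosine and dot-product scores assume comparable vector lengths. Apply the same `dims` to every stored vector and to each incoming query vector.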
For AI agents, semantic search is the foundation of both RAG (retrieving knowledge to answer questions) and memory (retrieving relevant past interactions). Tycoon indexes every project doc, chat message, and task with embeddings so Astra can answer 'what did we decide about pricing last quarter' without the founder having to dig through Notion. The retrieval is what makes the agent feel continuous across time. Without semantic search, an AI employee either has to load the entire business history every call (expensive and noisy) or forget everything (useless).