What are Vector Embeddings?
How neural networks turn meaning into math.
A vector embedding is a dense numerical representation of a piece of content — typically a 384- to 3072-dimensional float vector — produced by a neural network trained so that meaning-similar inputs yield geometrically close vectors. Embeddings turn text, code, images, or audio into a shared numerical space where cosine similarity approximates semantic similarity, enabling retrieval, clustering, classification, and recommendation without rule-based features.
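The geometric idea — close vectors mean similar content — fits in a few lines of plain Python. A minimal sketch with hand-picked toy vectors (real embeddings are 384-3072 dims and come from a model):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings" (hand-made for illustration, not model output).
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
invoice = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))   # near 1: similar meaning
print(cosine_similarity(cat, invoice))  # much lower: unrelated
```

Retrieval systems run exactly this comparison (via an ANN index rather than a loop) between a query vector and millions of document vectors.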
In depth
Examples
- word2vec (Mikolov, 2013) — the embedding that started the modern era: 300-dim word vectors learned with skip-gram
- OpenAI text-embedding-3-large — current commercial workhorse, 3072 dims, supports Matryoshka truncation
- Cohere Embed v3 — strong multilingual performance, optimized for RAG retrieval quality
- bge-m3 (BAAI) — leading open-source model; supports dense, sparse, and multi-vector retrieval in one model
- CLIP (OpenAI) — joint image-text embeddings powering 'search photos with words' and Stable Diffusion's text conditioning
- voyage-code-3 — code-specialized; outperforms general models on code search by 10-15 points
- Tycoon embeds every project doc and conversation with text-embedding-3-small for Astra's memory retrieval
- Spotify song embeddings — acoustic features in a learned space where similar-sounding tracks cluster
Frequently asked questions
What's the difference between an embedding and a token?
A token is a unit of text (roughly a word or subword) that an LLM processes — 'unbelievable' might be 3 tokens. Each token has a static embedding from the model's vocabulary, typically 1000-4000 dimensions depending on the model. But when people say 'embedding' in the context of search or RAG, they usually mean a single vector summarizing an entire passage, not per-token vectors. That passage-level embedding is produced by a dedicated embedding model (not the LLM itself) trained specifically for similarity search. An LLM's internal per-token embeddings are not directly useful for retrieval — they're optimized for next-token prediction, not similarity.
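The distinction can be made concrete with a toy sketch: per-token vectors collapsed into a single passage-level vector via mean pooling, one common strategy embedding models use (hand-made numbers, not real model output):

```python
def mean_pool(token_vectors):
    """Collapse per-token vectors into one passage-level vector."""
    dims = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dims)]

# Three toy token vectors, one per token of a short passage.
tokens = [
    [0.2, 0.4, 0.6],
    [0.4, 0.6, 0.8],
    [0.6, 0.8, 1.0],
]

passage_vec = mean_pool(tokens)  # one vector representing the whole passage
```

A trained embedding model does far more than averaging — the pooling is learned jointly with the encoder — but the shape of the operation is the same: many token vectors in, one searchable vector out.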
How do I choose an embedding model?
Four-step decision: (1) Modality — text-only, multilingual, code, or multimodal. Pick accordingly. (2) Managed vs self-hosted — OpenAI or Cohere if you don't want ops; bge-m3 or nomic-embed-v2 if you do. (3) Dimensions — 1024 is a good default balancing quality and cost; Matryoshka-compatible models let you adjust later. (4) Evaluate on your data — build a small test set of 50 queries with expected results, and measure recall@10 for 2-3 candidate models. The gap between models on your specific workload is often 5-20%, much larger than aggregate MTEB scores suggest. Don't skip this step.
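Step (4) needs only a few lines once you have the labeled test set. A minimal sketch — the `ranked` list here is a hypothetical stand-in for whatever a candidate model returns for one query:

```python
def recall_at_k(expected_ids, ranked_ids, k=10):
    """Fraction of expected documents that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    hits = sum(1 for doc_id in expected_ids if doc_id in top_k)
    return hits / len(expected_ids)

# One query from the test set: the docs that should come back, and what
# a candidate model actually ranked (hypothetical IDs).
expected = ["doc_42", "doc_7"]
ranked = ["doc_42", "doc_3", "doc_99", "doc_7", "doc_1"]

score = recall_at_k(expected, ranked, k=10)  # both hits in top 10 → 1.0
```

Average this over your ~50 test queries for each candidate model and pick the winner on your data, not on a leaderboard.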
What happens if I change embedding models?
Full re-indexing. Embeddings from different models live in different vector spaces and are not comparable. If you have 10M documents indexed with text-embedding-3-small and switch to Cohere Embed v3, you must embed all 10M again with the new model and rebuild the ANN index. This is why embedding model choice is sticky — migrations are expensive. Plan for it: version your embeddings by model, keep the embedding model identifier in metadata, and budget for periodic migrations every 12-24 months as models improve substantially.
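The versioning advice can be as simple as storing the model identifier next to each vector, so a migration job can find stale rows. A sketch with hypothetical field names:

```python
CURRENT_MODEL = "text-embedding-3-small"  # hypothetical current choice

def make_record(doc_id, vector, model_id=CURRENT_MODEL):
    """Store the embedding model identifier alongside the vector itself."""
    return {"doc_id": doc_id, "embedding": vector, "embedding_model": model_id}

def needs_reindex(record, target_model):
    """A migration job re-embeds any record tagged with a different model."""
    return record["embedding_model"] != target_model

old = make_record("doc_1", [0.1, 0.2], model_id="text-embedding-ada-002")
new = make_record("doc_2", [0.3, 0.4])

stale = [r for r in (old, new) if needs_reindex(r, CURRENT_MODEL)]
```

With the model name in metadata, a switch becomes a filtered batch job rather than an archaeology project.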
Can I fine-tune embeddings for my domain?
Yes, and it helps when your domain has jargon or structure that general models miss. Sentence-Transformers lets you fine-tune open-source embedding models on contrastive pairs from your domain in a few GPU-hours. Typical gains: 5-20% recall@10 improvement over general embeddings on legal, medical, or technical domains. For most startups the ROI isn't there — pick a good general model and move on. Consider fine-tuning when (a) you have 10K+ labeled query-document pairs, (b) your retrieval quality is the bottleneck limiting product quality, or (c) you're in a regulated domain where off-the-shelf models demonstrably underperform.
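The contrastive objective behind that fine-tuning can be sketched in plain Python: pull an anchor toward its positive example and away from a negative until they differ by a margin. A toy triplet loss on hand-made 2-dim vectors (Sentence-Transformers provides this and several variants as ready-made loss classes):

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Zero once the positive is at least `margin` more similar than the negative."""
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))

anchor   = [1.0, 0.0]
positive = [0.9, 0.1]   # same meaning: high similarity to anchor
negative = [0.0, 1.0]   # unrelated: low similarity to anchor

loss = triplet_loss(anchor, positive, negative)  # already satisfied → 0.0
```

During fine-tuning, gradients from this loss reshape the encoder so that your domain's query-document pairs land close together, which is exactly where the recall@10 gains come from.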
Do embeddings leak private information?
Partially, yes — this is an active research area. Inversion attacks can reconstruct approximate text from embeddings given access to the embedding model and many samples. For most use cases this is acceptable (the original text is stored alongside the vector anyway), but for regulated workloads treat embeddings as PII-equivalent: encrypt at rest, control access, and don't ship them to third parties without a DPA. If you're using a managed embedding API, you're already trusting that vendor with the plaintext; using their embeddings adds no incremental risk. For fully private workloads, self-host an open-source model like bge-m3.