Definition
A word embedding is a dense numerical representation of text (a word, phrase, or document) as a vector in a high-dimensional space (typically 256-3072 dimensions), where semantically similar texts are positioned close to each other.
Unlike traditional symbolic representations such as one-hot encoding, embeddings encode semantic relationships in the geometry of the vector space: distance and direction between vectors carry meaning. This enables mathematical operations on linguistic concepts, such as computing similarity, finding analogies, and clustering content by meaning.
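As a minimal sketch of these geometric operations, cosine similarity measures how closely two vectors point in the same direction; the 4-dimensional vectors below are made-up stand-ins for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors standing in for real embeddings
cat = np.array([0.8, 0.1, 0.6, 0.2])
dog = np.array([0.7, 0.2, 0.5, 0.3])
car = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(cat, dog))  # ~0.98: the toy "related" pair
print(cosine_similarity(cat, car))  # ~0.26: the toy "unrelated" pair
```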
How It Works
Embeddings are learned by neural networks trained on large text corpora. Their evolution falls into two main stages:
Static Embeddings (2013-2017)
- Word2Vec (Mikolov et al., 2013): two architectures, Skip-gram (predicts context given a word) and CBOW (predicts word given context). Vectors of 100-300 dimensions.
- GloVe (Pennington et al., 2014): based on global co-occurrence statistics. Performance comparable to Word2Vec.
- FastText (Bojanowski et al., 2017): considers sub-words (character n-grams), useful for morphologically rich languages and out-of-vocabulary words.
Limitation: one fixed vector per word. “Bank” has the same embedding in “river bank” and “piggy bank”.
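A minimal training sketch using the gensim library's Word2Vec implementation (the toy corpus is illustrative; real models are trained on billions of tokens):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: one tokenized sentence per list
sentences = [
    ["the", "river", "bank", "was", "flooded"],
    ["she", "opened", "a", "bank", "account"],
    ["the", "piggy", "bank", "was", "full", "of", "coins"],
]

# sg=1 selects Skip-gram (predict context from word); sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["bank"]             # one fixed 100-dimensional vector per word
print(model.wv.most_similar("bank"))  # nearest neighbours by cosine similarity
```

Note that model.wv["bank"] returns the same vector regardless of which sentence the word appeared in, which is exactly the limitation described above.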
Contextual Embeddings (2018+)
BERT, GPT, and subsequent models generate representations that depend on the entire sentence: the same word produces different vectors in different contexts.
Modern embedding models (OpenAI text-embedding-3, Cohere embed-v3, BGE, E5) are specifically optimized for retrieval and semantic similarity, with measurable performance on benchmarks like MTEB.
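A sketch with the Hugging Face transformers library showing that a contextual encoder assigns the same word different vectors in different sentences (the bert-base-uncased checkpoint is just a common choice, not a requirement):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word`'s first occurrence in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_river = word_vector("he sat on the river bank", "bank")
v_money = word_vector("she deposited cash at the bank", "bank")

# Same surface word, different contexts, noticeably different vectors (similarity < 1.0)
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```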
Use Cases
RAG (Retrieval-Augmented Generation): documents indexed as embeddings in a vector database, query converted to embedding, retrieval by cosine similarity, context passed to LLM.
Semantic search: going beyond keyword matching. A query like “liquidity problems” finds documents about “negative cash flow” without exact match.
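The retrieval step behind both RAG and semantic search reduces to the same operation: embed the corpus once, embed the query, and rank by cosine similarity. A sketch with the sentence-transformers library; the model name, documents, and top-k value are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open model, 384 dimensions

documents = [
    "Q3 report shows negative cash flow and rising short-term debt.",
    "The marketing team launched a new social media campaign.",
    "Supplier payments were delayed due to insufficient working capital.",
]
doc_vecs = model.encode(documents, normalize_embeddings=True)            # (3, 384)
query_vec = model.encode(["liquidity problems"], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is a plain dot product
scores = doc_vecs @ query_vec
top_k = np.argsort(-scores)[:2]
for i in top_k:
    print(f"{scores[i]:.2f}  {documents[i]}")
# In a RAG pipeline, the top-ranked passages would be prepended to the LLM prompt.
```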
Classification and clustering: embeddings as features for traditional ML models or unsupervised clustering.
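A clustering sketch with scikit-learn's KMeans on top of embeddings; the texts and the number of clusters are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
    "invoice overdue by 30 days", "payment reminder sent to client",
    "new logo design approved",   "brand colour palette updated",
]
X = model.encode(texts)  # (4, 384) feature matrix

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
for text, label in zip(texts, labels):
    print(label, text)  # finance and design texts typically land in separate clusters
```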
Deduplication: identifying nearly identical content in large corpora.
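Deduplication follows the same mechanics: compute pairwise similarities and flag pairs above a threshold. The 0.9 cutoff below is an assumption that would need tuning per corpus:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
    "The meeting is scheduled for Monday at 10am.",
    "The meeting is planned for Monday at 10 am.",
    "Quarterly revenue grew by 12 percent.",
]
vecs = model.encode(texts, normalize_embeddings=True)

sims = vecs @ vecs.T  # pairwise cosine similarities
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if sims[i, j] > 0.9:  # threshold: illustrative, tune per corpus
            print(f"near-duplicates ({sims[i, j]:.2f}): {texts[i]!r} / {texts[j]!r}")
```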
Practical Considerations
Dimensionality: 768-1024 dimensions are sufficient for most use cases. A float32 vector with 1536 dimensions occupies about 6 KB. At scale (millions of vectors), storage and query latency become significant.
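A back-of-the-envelope sketch of raw storage (float32, ignoring index and metadata overhead):

```python
def embedding_storage_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 storage, ignoring index and metadata overhead."""
    return n_vectors * dims * bytes_per_value / 1e9

print(embedding_storage_gb(1, 1536))           # ~6.1e-6 GB, i.e. ~6 KB per vector
print(embedding_storage_gb(10_000_000, 1536))  # ~61 GB for 10 million documents
print(embedding_storage_gb(10_000_000, 768))   # ~31 GB: halving dimensions halves storage
```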
Model choice: depends on task, language, and domain. The MTEB benchmark aggregates 56 datasets across eight task types, but performance varies by domain. Open-source models (BGE, E5) allow fine-tuning and on-premise deployment.
API costs: OpenAI text-embedding-3-small costs ~$0.02 per million tokens. For high volumes, self-hosting can significantly reduce costs.
Common Misconceptions
“Similar embeddings = synonyms”
Nearby embeddings indicate frequent contextual co-occurrence, not synonymy. “Hospital” and “patient” have nearby embeddings because they often appear together, not because they mean the same thing.
“Each word has one fixed embedding”
Only in static models (Word2Vec, GloVe). In modern models like BERT, “bank” has different embeddings in “river bank” vs “piggy bank”.
“More dimensions = always better”
Beyond 768-1024 dimensions, accuracy gains are marginal compared to additional storage and compute costs.
Related Terms
- Vector Database: databases optimized for storage and queries on embeddings
- RAG: architectural pattern that uses embeddings for retrieval
- Tokenization: pre-processing that precedes embedding generation
- Transformer: architecture underlying modern embedding models
- NLP: broader field that uses embeddings as a building block
Sources
- Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781
- Pennington, J. et al. (2014). GloVe: Global Vectors for Word Representation. EMNLP
- Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
- MTEB Leaderboard: benchmark for comparing embedding models