Definition
A word embedding is a dense numerical representation of text (a word, phrase, or document) as a vector in a high-dimensional space (typically 256-3072 dimensions), where semantically similar texts are positioned close to each other.
Unlike traditional symbolic representations such as one-hot encoding, embeddings encode semantic relationships in the geometry of the vector space: distance and direction between vectors carry meaning. This enables mathematical operations on linguistic concepts, such as computing similarity, finding analogies, and clustering content by meaning.
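As a minimal sketch of these geometric operations, cosine similarity measures how closely two vectors point in the same direction; the 4-dimensional vectors below are made-up stand-ins for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors standing in for real embeddings
cat = np.array([0.8, 0.1, 0.6, 0.2])
dog = np.array([0.7, 0.2, 0.5, 0.3])
car = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(cat, dog))  # ~0.98: the toy "related" pair
print(cosine_similarity(cat, car))  # ~0.26: the toy "unrelated" pair
```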
How It Works
Embeddings are learned by neural networks trained on large text corpora. Their evolution falls into two main stages:
Static Embeddings (2013-2017)
- Word2Vec (Mikolov et al., 2013): two architectures, Skip-gram (predicts context given a word) and CBOW (predicts word given context). Vectors of 100-300 dimensions.
- GloVe (Pennington et al., 2014): based on global co-occurrence statistics. Performance comparable to Word2Vec.
- FastText (Bojanowski et al., 2017): considers sub-words (character n-grams), useful for morphologically rich languages and out-of-vocabulary words.
Limitation: one fixed vector per word. “Bank” has the same embedding in “river bank” and “piggy bank”.
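A minimal training sketch using the gensim library's Word2Vec implementation (the toy corpus is illustrative; real models are trained on billions of tokens):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: one tokenized sentence per list
sentences = [
    ["the", "river", "bank", "was", "flooded"],
    ["she", "opened", "a", "bank", "account"],
    ["the", "piggy", "bank", "was", "full", "of", "coins"],
]

# sg=1 selects Skip-gram (predict context from word); sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["bank"]             # one fixed 100-dimensional vector per word
print(model.wv.most_similar("bank"))  # nearest neighbours by cosine similarity
```

Note that model.wv["bank"] returns the same vector regardless of which sentence the word appeared in, which is exactly the limitation described above.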
Contextual Embeddings (2018+)
BERT, GPT, and subsequent models generate representations that depend on the entire sentence: the same word produces different vectors in different contexts.
Modern embedding models (OpenAI text-embedding-3, Cohere embed-v3, BGE, E5) are specifically optimized for retrieval and semantic similarity, with measurable performance on benchmarks like MTEB.
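A sketch with the Hugging Face transformers library showing that a contextual encoder assigns the same word different vectors in different sentences (the bert-base-uncased checkpoint is just a common choice, not a requirement):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word`'s first occurrence in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_river = word_vector("he sat on the river bank", "bank")
v_money = word_vector("she deposited cash at the bank", "bank")

# Same surface word, different contexts, noticeably different vectors (similarity < 1.0)
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```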
Use Cases
RAG (Retrieval-Augmented Generation): documents indexed as embeddings in a vector database, query converted to embedding, retrieval by cosine similarity, context passed to LLM.
Semantic search: going beyond keyword matching. A query like “liquidity problems” finds documents about “negative cash flow” without exact match.
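The retrieval step behind both RAG and semantic search reduces to the same operation: embed the corpus once, embed the query, and rank by cosine similarity. A sketch with the sentence-transformers library; the model name, documents, and top-k value are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open model, 384 dimensions

documents = [
    "Q3 report shows negative cash flow and rising short-term debt.",
    "The marketing team launched a new social media campaign.",
    "Supplier payments were delayed due to insufficient working capital.",
]
doc_vecs = model.encode(documents, normalize_embeddings=True)            # (3, 384)
query_vec = model.encode(["liquidity problems"], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is a plain dot product
scores = doc_vecs @ query_vec
top_k = np.argsort(-scores)[:2]
for i in top_k:
    print(f"{scores[i]:.2f}  {documents[i]}")
# In a RAG pipeline, the top-ranked passages would be prepended to the LLM prompt.
```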
Classification and clustering: embeddings as features for traditional ML models or unsupervised clustering.
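A clustering sketch with scikit-learn's KMeans on top of embeddings; the texts and the number of clusters are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
    "invoice overdue by 30 days", "payment reminder sent to client",
    "new logo design approved",   "brand colour palette updated",
]
X = model.encode(texts)  # (4, 384) feature matrix

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
for text, label in zip(texts, labels):
    print(label, text)  # finance and design texts typically land in separate clusters
```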
Deduplication: identifying nearly identical content in large corpora.
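Deduplication follows the same mechanics: compute pairwise similarities and flag pairs above a threshold. The 0.9 cutoff below is an assumption that would need tuning per corpus:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
    "The meeting is scheduled for Monday at 10am.",
    "The meeting is planned for Monday at 10 am.",
    "Quarterly revenue grew by 12 percent.",
]
vecs = model.encode(texts, normalize_embeddings=True)

sims = vecs @ vecs.T  # pairwise cosine similarities
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if sims[i, j] > 0.9:  # threshold: illustrative, tune per corpus
            print(f"near-duplicates ({sims[i, j]:.2f}): {texts[i]!r} / {texts[j]!r}")
```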
Practical Considerations
Dimensionality: 768-1024 dimensions are sufficient for most use cases. A float32 vector with 1536 dimensions occupies about 6 KB. At scale (millions of vectors), storage and query latency become significant.
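A back-of-the-envelope sketch of raw storage (float32, ignoring index and metadata overhead):

```python
def embedding_storage_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 storage, ignoring index and metadata overhead."""
    return n_vectors * dims * bytes_per_value / 1e9

print(embedding_storage_gb(1, 1536))           # ~6.1e-6 GB, i.e. ~6 KB per vector
print(embedding_storage_gb(10_000_000, 1536))  # ~61 GB for 10 million documents
print(embedding_storage_gb(10_000_000, 768))   # ~31 GB: halving dimensions halves storage
```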
Model choice: depends on task, language, and domain. The MTEB benchmark aggregates 56 datasets across eight task types, but performance varies by domain. Open-source models (BGE, E5) allow fine-tuning and on-premise deployment.
API costs: OpenAI text-embedding-3-small costs ~$0.02 per million tokens. For high volumes, self-hosting can significantly reduce costs.
Common Misconceptions
“Similar embeddings = synonyms”
Nearby embeddings indicate frequent contextual co-occurrence, not synonymy. “Hospital” and “patient” have nearby embeddings because they often appear together, not because they mean the same thing.
“Each word has one fixed embedding”
Only in static models (Word2Vec, GloVe). In modern models like BERT, “bank” has different embeddings in “river bank” vs “piggy bank”.
“More dimensions = always better”
Beyond 768-1024 dimensions, accuracy gains are marginal compared to additional storage and compute costs.
Related Terms
- Vector Database: databases optimized for storage and queries on embeddings
- RAG: architectural pattern that uses embeddings for retrieval
- Tokenization: pre-processing that precedes embedding generation
- Transformer: architecture underlying modern embedding models
- NLP: broader field that uses embeddings as a building block
Sources
- Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781
- Pennington, J. et al. (2014). GloVe: Global Vectors for Word Representation. EMNLP
- Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
- MTEB Leaderboard: benchmark for comparing embedding models