Definition
Tokenization is the process of converting raw text into a sequence of tokens, the discrete units that an LLM can process. Each token corresponds to a numeric ID in the model’s vocabulary.
Tokens don’t necessarily correspond to words: they can be whole words (“the”), parts of words (“un” + “believable”), single characters, or special sequences.
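A minimal sketch of that text-to-IDs mapping, assuming the tiktoken library and its cl100k_base encoding (used by GPT-3.5/4); the exact IDs and splits depend on the vocabulary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5/4

text = "Tokenization is unbelievable."
ids = enc.encode(text)                        # list of integer vocabulary IDs
pieces = [enc.decode([i]) for i in ids]       # the text fragment behind each ID

print(ids)      # a handful of integers; exact values depend on the vocabulary
print(pieces)   # a mix of whole words and subword fragments, not one piece per word
assert enc.decode(ids) == text                # decoding round-trips to the original text
```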
Main Algorithms
Byte-Pair Encoding (BPE): iteratively merges the most frequent byte/character pairs in the training corpus. Used by GPT and Llama.
WordPiece: similar to BPE, but selects merges that maximize the likelihood of the training data rather than raw pair frequency. Used by BERT.
SentencePiece: a library implementing both BPE and Unigram; it operates on raw text directly (treating whitespace as an ordinary symbol), so no language-specific pre-tokenization is needed. Used by T5 and Llama.
Tiktoken: OpenAI’s BPE tokenizer library, optimized for speed. Used by GPT-3.5/4; the different splits these tokenizers produce are compared in the sketch after this list.
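A rough comparison sketch: the same sentence through GPT-4's BPE (via tiktoken) and BERT's WordPiece (via Hugging Face transformers, which downloads the bert-base-uncased tokenizer on first use). The exact pieces depend on the installed versions; the point is that they differ.

```python
import tiktoken
from transformers import AutoTokenizer

text = "Tokenization is unbelievably useful."

bpe = tiktoken.encoding_for_model("gpt-4")                      # cl100k_base BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece vocabulary

bpe_pieces = [bpe.decode([i]) for i in bpe.encode(text)]
wp_pieces = wordpiece.tokenize(text)          # '##' marks word-internal pieces

print(len(bpe_pieces), bpe_pieces)
print(len(wp_pieces), wp_pieces)
# Different vocabularies and merge rules produce different pieces and
# different sequence lengths for the same input.
```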
Why It Matters
Costs: LLM APIs charge per token. Inefficiently tokenized text costs more.
Context window: models have token limits (4K-128K+). Tokenization determines how much text fits in context.
Multilingual performance: tokenizers trained primarily on English can produce more tokens for the same content in other languages, increasing costs and reducing effective context.
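A quick way to observe this effect, assuming tiktoken and cl100k_base: count the same sentence in English and in Italian. The exact numbers depend on the encoding; the point is that equivalent content can cost a different number of tokens.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The weather is beautiful today and the meeting starts at nine."
italian = "Il tempo è bellissimo oggi e la riunione inizia alle nove."

print(len(enc.encode(english)), "tokens for the English sentence")
print(len(enc.encode(italian)), "tokens for the Italian sentence")
# The counts usually differ for equivalent content, which affects both cost
# and how much fits in the context window.
```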
Practical Considerations
Token-to-word ratio: in English, roughly 0.75 words per token (one word ≈ 1.33 tokens). In Italian and other languages the ratio is often worse (1.5-2 tokens per word).
Verification tools: the OpenAI Tokenizer (web UI) and tiktoken (Python) let you count tokens before making API calls; see the sketch after this list.
Special tokens: <|endoftext|>, <|im_start|>, etc. Reserved for control signals (document boundaries, chat-turn markers) rather than ordinary text; libraries such as tiktoken reject them in plain input unless explicitly allowed.
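A minimal pre-flight sketch, assuming tiktoken: count prompt tokens before an API call and confirm that special tokens embedded in ordinary text are rejected unless explicitly allowed. The prompt string is just a placeholder.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")   # picks the encoding matching the model name

prompt = "Summarize the following report in three bullet points: ..."
print(len(enc.encode(prompt)), "prompt tokens before the API call")

# Special tokens inside ordinary text are rejected by default, so user input
# cannot silently inject control tokens such as <|endoftext|>.
try:
    enc.encode("user wrote: <|endoftext|>")
except ValueError as err:
    print("rejected:", err)

# Encoding a special token must be opted into explicitly.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```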
Common Misconceptions
“1 token = 1 word”
No. English averages roughly 1.33 tokens per word (about 0.75 words per token), and the ratio varies with the text. Rare or long words are split into multiple tokens; “Tokenization” itself may be 2-3 tokens.
“All tokenizers are the same”
No. GPT-4 and Llama have different tokenizers with different vocabularies. Same text produces different token sequences and different lengths.
“Character count approximates tokens”
Only roughly. The empirical rule of thumb is ~4 characters per token in English, but it varies significantly by language and content; the sketch below measures both ratios.
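A small sketch for sanity-checking these rules of thumb on your own text, assuming tiktoken and cl100k_base; the two sample sentences are illustrative only.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def ratios(text: str) -> tuple[float, float]:
    """Return (tokens per word, characters per token) for a given text."""
    n_tokens = len(enc.encode(text))
    return n_tokens / len(text.split()), len(text) / n_tokens

for sample in [
    "Tokenization determines how much text fits in the context window.",
    "La tokenizzazione determina quanto testo entra nella finestra di contesto.",
]:
    tpw, cpt = ratios(sample)
    print(f"{tpw:.2f} tokens/word, {cpt:.2f} chars/token")
```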
Related Terms
- LLM: models that require tokenization of input
- Embeddings: vector representations of tokens
- Transformer: architecture that processes token sequences
Sources
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL
- OpenAI Tokenizer
- Hugging Face Tokenizers Library