Definition
Tokenization is the process of converting raw text into a sequence of tokens, the discrete units that an LLM can process. Each token corresponds to a numeric ID in the model’s vocabulary.
Tokens don’t necessarily correspond to words: they can be whole words (“the”), parts of words (“un” + “believable”), single characters, or special sequences.
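A minimal sketch of that text-to-IDs mapping, assuming the tiktoken library and its cl100k_base encoding (used by GPT-3.5/4); the exact IDs and splits depend on the vocabulary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5/4

text = "Tokenization is unbelievable."
ids = enc.encode(text)                        # list of integer vocabulary IDs
pieces = [enc.decode([i]) for i in ids]       # the text fragment behind each ID

print(ids)      # a handful of integers; exact values depend on the vocabulary
print(pieces)   # a mix of whole words and subword fragments, not one piece per word
assert enc.decode(ids) == text                # decoding round-trips to the original text
```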
Main Algorithms
Byte-Pair Encoding (BPE): iteratively merges the most frequent byte/character pairs in the training corpus. Used by GPT and Llama.
WordPiece: similar to BPE, but selects merges that maximize the likelihood of the training data rather than raw pair frequency. Used by BERT.
SentencePiece: a library implementing both BPE and Unigram; it operates on raw text directly (treating whitespace as an ordinary symbol), so no language-specific pre-tokenization is needed. Used by T5 and Llama.
Tiktoken: OpenAI’s BPE tokenizer library, optimized for speed. Used by GPT-3.5/4; the different splits these tokenizers produce are compared in the sketch after this list.
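A rough comparison sketch: the same sentence through GPT-4's BPE (via tiktoken) and BERT's WordPiece (via Hugging Face transformers, which downloads the bert-base-uncased tokenizer on first use). The exact pieces depend on the installed versions; the point is that they differ.

```python
import tiktoken
from transformers import AutoTokenizer

text = "Tokenization is unbelievably useful."

bpe = tiktoken.encoding_for_model("gpt-4")                      # cl100k_base BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece vocabulary

bpe_pieces = [bpe.decode([i]) for i in bpe.encode(text)]
wp_pieces = wordpiece.tokenize(text)          # '##' marks word-internal pieces

print(len(bpe_pieces), bpe_pieces)
print(len(wp_pieces), wp_pieces)
# Different vocabularies and merge rules produce different pieces and
# different sequence lengths for the same input.
```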
Why It Matters
Costs: LLM APIs charge per token. Inefficiently tokenized text costs more.
Context window: models have token limits (4K-128K+). Tokenization determines how much text fits in context.
Multilingual performance: tokenizers trained primarily on English can produce more tokens for the same content in other languages, increasing costs and reducing effective context.
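A quick way to observe this effect, assuming tiktoken and cl100k_base: count the same sentence in English and in Italian. The exact numbers depend on the encoding; the point is that equivalent content can cost a different number of tokens.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The weather is beautiful today and the meeting starts at nine."
italian = "Il tempo è bellissimo oggi e la riunione inizia alle nove."

print(len(enc.encode(english)), "tokens for the English sentence")
print(len(enc.encode(italian)), "tokens for the Italian sentence")
# The counts usually differ for equivalent content, which affects both cost
# and how much fits in the context window.
```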
Practical Considerations
Token-to-word ratio: in English, roughly 0.75 words per token (one word ≈ 1.33 tokens). In Italian and other languages the ratio is often worse (1.5-2 tokens per word).
Verification tools: the OpenAI Tokenizer (web UI) and tiktoken (Python) let you count tokens before making API calls; see the sketch after this list.
Special tokens: <|endoftext|>, <|im_start|>, etc. Reserved for control signals (document boundaries, chat-turn markers) rather than ordinary text; libraries such as tiktoken reject them in plain input unless explicitly allowed.
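A minimal pre-flight sketch, assuming tiktoken: count prompt tokens before an API call and confirm that special tokens embedded in ordinary text are rejected unless explicitly allowed. The prompt string is just a placeholder.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")   # picks the encoding matching the model name

prompt = "Summarize the following report in three bullet points: ..."
print(len(enc.encode(prompt)), "prompt tokens before the API call")

# Special tokens inside ordinary text are rejected by default, so user input
# cannot silently inject control tokens such as <|endoftext|>.
try:
    enc.encode("user wrote: <|endoftext|>")
except ValueError as err:
    print("rejected:", err)

# Encoding a special token must be opted into explicitly.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```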
Common Misconceptions
“1 token = 1 word”
No. English averages roughly 1.33 tokens per word (about 0.75 words per token), and the ratio varies with the text. Rare or long words are split into multiple tokens; “Tokenization” itself may be 2-3 tokens.
“All tokenizers are the same”
No. GPT-4 and Llama have different tokenizers with different vocabularies. Same text produces different token sequences and different lengths.
“Character count approximates tokens”
Only roughly. The empirical rule of thumb is ~4 characters per token in English, but it varies significantly by language and content; the sketch below measures both ratios.
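A small sketch for sanity-checking these rules of thumb on your own text, assuming tiktoken and cl100k_base; the two sample sentences are illustrative only.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def ratios(text: str) -> tuple[float, float]:
    """Return (tokens per word, characters per token) for a given text."""
    n_tokens = len(enc.encode(text))
    return n_tokens / len(text.split()), len(text) / n_tokens

for sample in [
    "Tokenization determines how much text fits in the context window.",
    "La tokenizzazione determina quanto testo entra nella finestra di contesto.",
]:
    tpw, cpt = ratios(sample)
    print(f"{tpw:.2f} tokens/word, {cpt:.2f} chars/token")
```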
Related Terms
- LLM: models that require tokenization of input
- Embeddings: vector representations of tokens
- Transformer: architecture that processes token sequences
Sources
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL
- OpenAI Tokenizer
- Hugging Face Tokenizers Library