Natural Language Processing • Tokenizer
- Overview
- Tokenization: The Theoretical Specifics
- Sub-word Tokenization: The “Goldilocks” Zone
- WordPiece Tokenization
- Byte Pair Encoding (BPE)
- Unigram Language Model Tokenization
- Subword Regularization & SentencePiece
- Comparative Analysis
- Specialized Domains
- Modern Frontiers in Tokenization (2024+)
- References
Overview
Tokenization is the critical discretization step that bridges continuous human language and discrete statistical models. In modern Large Language Models (LLMs), it is not merely preprocessing but an integral architectural component that defines the model’s sample space.
Formally, let \(\Sigma\) be the set of valid characters (e.g., the Unicode standard) and \(\Sigma^{\ast}\) be the set of all possible strings. Tokenization is a function \(T: \Sigma^{\ast} \to V^{\ast}\) that maps a raw string \(s \in \Sigma^{\ast}\) to a sequence of discrete tokens \(\mathbf{t} = (t_1, t_2, \dots, t_N)\), where each \(t_i\) belongs to a fixed vocabulary \(V\).
This process introduces a fundamental bottleneck: the model can only perceive and generate text expressible as sequences from \(V\). Consequently, the quality of \(V\)—its granularity, coverage, and efficiency—upper-bounds the model’s capabilities in generalization, multilingualism, and arithmetic reasoning.
Tokenization: The Theoretical Specifics
The design of a tokenizer revolves around optimizing the trade-off between Vocabulary Size (\(\lvert V \rvert\)) and Sequence Length (\(N\)).
- Compression Rate: A good tokenizer maximizes the “information per token.” If \(N\) is minimized for a fixed text, the \(O(N^2)\) cost of self-attention shrinks and more text fits into a fixed context window.
- Vocabulary Sparsity:
- Large \(\lvert V \rvert\) (e.g., Word-level): Minimizes \(N\) but suffers from extreme sparsity. Most words appear rarely (Zipf’s Law), leading to undertrained embeddings for the “long tail.”
- Small \(\lvert V \rvert\) (e.g., Character-level): Maximizes \(N\). While \(\lvert V \rvert \approx 100\) (printable ASCII) ensures no out-of-vocabulary (OOV) tokens, the sequence length explodes, diluting semantic density and making long-range dependencies harder to learn.
- Subword Tokenization: Modern tokenizers (BPE, WordPiece, Unigram) operate in the “Goldilocks” zone (\(\lvert V \rvert \approx 32k - 256k\)), treating frequent words as atomic and decomposing rare words into subword units (morphemes); the sketch below illustrates the length trade-off concretely.
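To make the \(\lvert V \rvert\) versus \(N\) trade-off concrete, here is a minimal comparison of sequence lengths at character, word, and subword granularity; the subword split is hand-picked for illustration, not produced by a real tokenizer:

```python
# Toy comparison of sequence length N at different granularities.
text = "unhappiness is unavoidable"

char_tokens = list(text)                 # character-level: tiny |V|, long sequences
word_tokens = text.split()               # word-level: huge |V|, short sequences
subword_tokens = ["un", "happi", "ness", "is", "un", "avoid", "able"]  # illustrative split

print(len(char_tokens), len(word_tokens), len(subword_tokens))   # 26 3 7
```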
The Mathematical Objective
Ideally, we seek a tokenizer that maximizes the marginal likelihood of the training corpus \(\mathcal{D}\): \(\arg\max_{\phi} \sum_{s \in \mathcal{D}} \log P(s \mid \phi)\) where \(\phi\) represents the tokenizer’s rules or vocabulary. In practice, because jointly optimizing tokenizer \(\phi\) and model parameters \(\theta\) is intractable, we solve this in three stages:
- Pre-tokenization: Heuristic splitting (e.g., on whitespace/punctuation).
- Vocabulary Induction: Learn \(V\) from a corpus using an algorithm (BPE, Unigram) to effectively compress the data.
- Model Training: Train the LLM using fixed \(V\).
Note: The fragmentation of words (e.g., “unhappiness” \(\to\) “un”, “happi”, “ness”) is not a bug but a feature of likelihood optimization, allowing the model to construct compositional meaning from shared sub-components.
Sub-word Tokenization: The “Goldilocks” Zone
Sub-word tokenization bridges the gap between character-level (too long, no semantics) and word-level (too sparse) models.
The Optimization Problem
Fundamentally, subword tokenization aims to learn a vocabulary \(V\) that allows for the “optimal” segmentation of a corpus \(\mathcal{D}\). If we define a segmentation \(S\) as a sequence of tokens \(S = (s_1, \dots, s_k)\) that reconstructs \(\mathcal{D}\), we want to maximize the likelihood: \(\arg\max_{V, S} P(\mathcal{D} \mid S, V) \quad \text{s.t.} \quad \lvert V \rvert \le K\) Since finding the globally optimal \(V\) is NP-hard, different algorithms (BPE, WordPiece, Unigram) use different greedy heuristics to approximate this objective.
Key Advantages
- Open Vocabulary: By keeping individual characters in \(V\), the model can tokenize any string, falling back to character-level representation for unknown words (e.g., “Covid-19” \(\to\) “Co”, “vid”, “-“, “19”).
- Morphological Generalization: The model learns sub-components like “ing”, “ed”, or “anti”, allowing it to understand “anti-gravity” even if it has only seen “gravity”.
WordPiece Tokenization
WordPiece (Schuster & Nakajima, 2012) is the algorithm powering BERT. Unlike BPE, which merges based on raw frequency, WordPiece merges based on likelihood.
The Training Algorithm
- Initialization: Start with a vocabulary \(V\) containing all unique characters (plus special tokens like [CLS], [SEP]).
- Scoring Step: For every possible pair of adjacent tokens \((u, v)\), calculate the score:
\(\text{Score}(u, v) = \frac{\text{Count}(uv)}{\text{Count}(u) \times \text{Count}(v)}\)
- Note: This is effectively the (exponentiated) Pointwise Mutual Information (PMI): we merge pairs that occur together more often than chance would predict. An extremely frequent pair like “a” + “n” might score lower than “Z” + “e” if “Z” is almost always followed by “e” (see the sketch after this list).
- Merge: Select the pair with the highest score, add \(uv\) to \(V\), and update counts.
- Repeat: Continue until \(\lvert V \rvert\) reaches the desired limit (e.g., 30k for BERT).
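A minimal sketch of the scoring step, assuming pair and unit counts have already been gathered from the pre-tokenized corpus (the counts below are toy values chosen to mirror the “Z” + “e” example):

```python
from collections import Counter

# Toy counts: how often each unit and each adjacent pair occurs in the corpus.
unit_counts = Counter({"a": 500, "n": 400, "Z": 5, "e": 450})
pair_counts = Counter({("a", "n"): 120, ("Z", "e"): 5})

def wordpiece_score(u, v):
    # Score(u, v) = count(uv) / (count(u) * count(v))  -- the PMI-style criterion.
    return pair_counts[(u, v)] / (unit_counts[u] * unit_counts[v])

# ("Z", "e") wins despite being rare, because "Z" is almost always followed by "e".
best = max(pair_counts, key=lambda p: wordpiece_score(*p))
print(best, wordpiece_score(*best))
```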
Inference (Tokenization)
WordPiece inference is deterministic. It uses a MaxMatch (Longest-Match-First) strategy:
- Given a word like “unaffable”:
- Check if “unaffable” \(\in V\). No.
- Check “unaffabl”. No.
- Iteratively shorten the string from the right (“unaffab”, “unaffa”, …) until a match is found in the vocabulary.
- Found “un” \(\in V\).
- Remaining: “affable”. Repeat process for the suffix.
- Result: ["un", "##aff", "##able"]. (Note: BERT uses ## to denote non-initial subwords.) A minimal sketch of this MaxMatch loop follows.
Byte Pair Encoding (BPE)
BPE (Sennrich et al., 2016) is a frequency-based algorithm, distinct from the likelihood-based WordPiece. It is the dominant tokenizer for GPT-2, GPT-3, and RoBERTa.
The Algorithm
- Pre-tokenization: Split text into words (often preserving whitespace).
- Base Vocabulary: Initialize with all characters.
- Count Step: Count the frequency of all adjacent symbol pairs.
- Merge Step: Select the most frequent pair \((A, B)\), create a new symbol \(AB\), and replace all occurrences.
- Termination: Repeat until the target vocabulary size \(\lvert V \rvert\) is reached (a minimal training-loop sketch follows).
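A minimal sketch of the merge loop over a pre-tokenized word-frequency table, in the spirit of the original Sennrich et al. reference implementation; the corpus and target size are toy values:

```python
from collections import Counter

# Words are represented as tuples of symbols; values are corpus frequencies.
corpus = {("l","o","w"): 5, ("l","o","w","e","r"): 2,
          ("n","e","w","e","s","t"): 6, ("w","i","d","e","s","t"): 3}
target_vocab_size = 15
vocab = {c for word in corpus for c in word}          # base vocabulary: characters
merges = []

while len(vocab) < target_vocab_size:
    # Count all adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        break
    best = max(pairs, key=pairs.get)                  # most frequent pair
    merges.append(best)
    vocab.add("".join(best))
    # Replace every occurrence of the pair with the merged symbol.
    new_corpus = {}
    for word, freq in corpus.items():
        w, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                w.append(word[i] + word[i + 1]); i += 2
            else:
                w.append(word[i]); i += 1
        new_corpus[tuple(w)] = freq
    corpus = new_corpus

print(merges[:2])   # first merges on this toy corpus: ('e', 's'), ('es', 't')
```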
Byte-Level BPE
Classic BPE struggles with large Unicode vocabularies (>140k characters). Byte-Level BPE (introduced in GPT-2) addresses this by processing raw bytes (UTF-8) instead of characters.
- The Trick: To prevent control characters (e.g., null bytes) from creating degenerate tokens, GPT-2 maps the 256 byte values to a set of 256 printable Unicode characters.
- Benefit: This ensures the base vocabulary is always exactly 256 symbols, guaranteeing that any string can be tokenized without an <UNK> token; a sketch of the byte-to-character mapping follows.
Unigram Language Model Tokenization
Unlike BPE and WordPiece, which grow a vocabulary upward from characters, Unigram (Kudo, 2018) starts with a massive candidate vocabulary and prunes it. It is a probabilistic model.
The Probabilistic Formulation
The Unigram model assumes each subword occurs independently. The probability of a sequence \(\mathbf{x} = (x_1, \dots, x_M)\) is: \(P(\mathbf{x}) = \prod_{i=1}^M P(x_i)\) where \(P(x_i)\) is the learned probability of subword \(x_i\).
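Given learned subword probabilities \(P(x_i)\), the highest-probability segmentation of a string can be recovered with a simple dynamic program (Viterbi). A minimal sketch, assuming a toy probability table rather than learned values:

```python
import math

# Toy unigram probabilities for a handful of subwords (assumed, not learned here).
probs = {"un": 0.08, "happi": 0.02, "ness": 0.05, "u": 0.01, "n": 0.03,
         "h": 0.01, "a": 0.02, "p": 0.01, "i": 0.01, "e": 0.02, "s": 0.03}

def viterbi_segment(text, probs):
    """best[i] = (log-prob, segmentation) of the best split of text[:i]."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 10), i):          # cap candidate subword length at 10
            piece = text[j:i]
            if piece in probs and best[j][1] is not None:
                score = best[j][0] + math.log(probs[piece])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[-1]

print(viterbi_segment("unhappiness", probs))
# -> (log-prob, ['un', 'happi', 'ness'])
```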
Training via EM
- Likelihood: For a corpus, we want to maximize \(\mathcal{L} = \sum_{s \in \mathcal{D}} \log P(s)\). Since segmentation is latent, we sum over all valid segmentations \(S(s)\) for each sentence: \(P(s) = \sum_{\mathbf{x} \in S(s)} \prod_{i} P(x_i)\)
- EM Step: With the vocabulary fixed, the subword probabilities \(P(x_i)\) are re-estimated with the EM algorithm, treating the segmentation as a latent variable.
- Pruning: For every token \(w \in V\), compute how much the global likelihood \(\mathcal{L}\) would drop if \(w\) were removed. Remove the bottom 10-20% of tokens with the lowest loss impact, then repeat EM and pruning until the target vocabulary size is reached.
Subword Regularization & SentencePiece
SentencePiece is a library that implements both BPE and Unigram; crucially, it treats the input as a raw character stream (encoding whitespace explicitly), making it robust for languages like Japanese or Chinese that do not delimit words with spaces.
- Subword Regularization: Because Unigram is probabilistic, we can sample segmentations from \(P(\mathbf{x} \mid \text{text})\) during training rather than always taking the single best (Viterbi) segmentation (see the sketch after this list). This acts as data augmentation, making the model more robust to noise and segmentation errors.
- BPE-Dropout: A similar regularization technique for BPE, where merges are randomly dropped with probability \(p\), forcing the model to see finer-grained subwords.
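A sketch of sampling multiple segmentations with the sentencepiece Python package; the model file path is a placeholder, and a Unigram model must already have been trained:

```python
import sentencepiece as spm

# Load a previously trained Unigram model (path is a placeholder).
sp = spm.SentencePieceProcessor(model_file="unigram_32k.model")

text = "subword regularization makes models robust"
for _ in range(3):
    # enable_sampling draws from P(x | text) instead of returning the Viterbi split;
    # alpha controls the sharpness, nbest_size=-1 samples over all hypotheses.
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```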
Comparative Analysis
| Feature | Byte Pair Encoding (BPE) | WordPiece | Unigram LM |
|---|---|---|---|
| Optimization Goal | Maximize Frequency of Pairs | Maximize Likelihood of Training Data | Maximize Corpus Likelihood (minimize loss when pruning) |
| Criterion | \(\arg\max \text{Count}(uv)\) | \(\arg\max \frac{P(uv)}{P(u)P(v)}\) | \(\arg\min \Delta\mathcal{L}\) (Pruning) |
| Decision Type | Bottom-up (Merge) | Bottom-up (Merge) | Top-down (Prune from large vocab) |
| Deterministic? | Yes | Yes | No (Probabilistic sampling) |
| Typical Usage | GPT-2, GPT-3, RoBERTa | BERT, Electra | T5, ALBERT, mBART |
| Pros | Simple, efficient, no OOV (with bytes) | Linguistically consistent merging | Enables subword regularization |
- BPE is the standard for generative models due to its simplicity and byte-level robustness.
- Unigram is preferred for multilingual models (like T5/mBART) because its probabilistic nature handles noisy segmentation boundaries better than greedy merging.
Specialized Domains
Tokenization needs vary significantly outside of natural language.
Bioinformatics (DNA/Proteins)
Biological sequences (e.g., ATCG... or Amino Acids) lack natural delimiters like spaces.
- k-mers: The traditional approach uses overlapping sliding windows of length \(k\) (see the sketch after this list). While simple, it creates massive vocabularies (\(4^k\) for DNA) and loses global context.
- BPE/Unigram: Modern Protein Language Models (like ESM-2) use BPE to discover statistically significant biological motifs. For example, BPE tends to merge frequent functional domains (like “helices” or “binding sites”) into single tokens, aligning computational units with biological structure (source).
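A minimal sketch of overlapping k-mer tokenization for a DNA string (stride 1; some pipelines use non-overlapping windows instead):

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Slide a window of length k over the sequence (overlapping when stride < k)."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

dna = "ATCGGCTAAC"
print(kmer_tokenize(dna, k=3))
# ['ATC', 'TCG', 'CGG', 'GGC', 'GCT', 'CTA', 'TAA', 'AAC']
# The vocabulary grows as 4**k, which is why small k values are typical.
```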
Programming Code
- Token-based: Most LLMs treat code as text. However, this ignores the strict syntax of programming languages.
- AST-based: Tools like Tree-sitter parse code into an Abstract Syntax Tree (AST). Some experimental models tokenize the nodes of the AST (e.g., FunctionDefinition, IfStatement) rather than raw characters, ensuring that the model always generates syntactically valid trees, though this is computationally expensive. A node-sequence sketch follows this list.
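An illustrative sketch of node-type tokenization using Python's built-in ast module rather than Tree-sitter; the idea is the same: emit a sequence of syntax-node labels instead of characters:

```python
import ast

source = """
def add(a, b):
    if a > b:
        return a + b
    return b
"""

# Walk the syntax tree and emit node-type names as "tokens".
tree = ast.parse(source)
node_tokens = [type(node).__name__ for node in ast.walk(tree)]
print(node_tokens)
# e.g. ['Module', 'FunctionDef', 'arguments', 'If', ...]
```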
Computer Vision
- Patch Embeddings (ViT): Deterministically splits an image into a fixed grid (e.g., 16x16 patches) and linearly projects each patch, as sketched below. This is “continuous” tokenization.
- Discrete Codebooks (VQ-VAE): Learns a vocabulary of visual “symbols.” An encoder maps image patches to the nearest vector in a learnable codebook, effectively “tokenizing” an image into a sequence of discrete integers (indices), enabling autoregressive image generation (like in DALL-E 1).
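A minimal NumPy sketch of ViT-style patchification followed by a linear projection; the image size, patch size, and embedding dimension are toy values, and a real ViT learns the projection and adds position embeddings:

```python
import numpy as np

H, W, C, P, D = 224, 224, 3, 16, 768        # image size, channels, patch size, embed dim
image = np.random.rand(H, W, C).astype(np.float32)

# Split into non-overlapping P x P patches and flatten each one.
patches = (image.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, P * P * C))     # (196, 768): one row per patch

# "Tokenize" by linearly projecting each flattened patch into the embedding space.
projection = np.random.rand(P * P * C, D).astype(np.float32)
patch_embeddings = patches @ projection      # (196, 768) sequence of visual tokens
print(patch_embeddings.shape)
```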
Modern Frontiers in Tokenization (2024+)
The paradigm of “static tokenizer + fixed embedding table” is rapidly becoming a bottleneck. As model sizes plateau, researchers are attacking the inefficiency of the tokenization layer itself. The goal is to move from static lookup tables to adaptive systems that can handle infinite vocabularies, cross-lingual shifts, and multi-modal inputs without retraining the entire model foundation.
Dynamic and Adaptive Architectures
- Zero-Shot Tokenizer Transfer (ZeTT): Addresses the rigidity of pretrained LMs by training a hypernetwork \(H_\phi\) to predict embedding parameters for any tokenizer \(T'\). \(H_\phi\) takes token surface forms and metadata as input and generates the input/output embedding matrices \(E_{in}, E_{out}\) for the new vocabulary. This enables “hot-swapping” tokenizers (e.g., switching to a code-optimized tokenizer) at runtime with less than 3% performance degradation, effectively treating tokenization as a pluggable interface rather than a baked-in constraint (source).
- Inference-Time Compression: Treats tokenization as an online compression problem. Models dynamically identify recurring patterns in the byte stream during generation and fuse them into “hypertokens” (similar to LZW compression updates), assigning them on-the-fly embeddings. This reduces effective sequence length without changing the base model weights (source).
- Post-Hoc Vocabulary Expansion: Solves over-fragmentation in specialized domains (e.g., biomedical terms split into 10+ subwords) by adding new tokens to a frozen model. The embeddings for these new tokens are initialized via weighted averaging or PCA of their constituent subword embeddings from the base model, ensuring immediate alignment with the latent space (source). A mean-initialization sketch follows this list.
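A sketch of mean-initialization with Hugging Face transformers; the base model and the new domain token are placeholders, and real recipes typically follow this with light continued pretraining:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                    # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

new_token = "immunohistochemistry"                     # hypothetical over-fragmented domain term
# Record the constituent subword ids *before* the token is added.
constituent_ids = tokenizer(new_token, add_special_tokens=False)["input_ids"]

tokenizer.add_tokens([new_token])
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    emb = model.get_input_embeddings().weight          # (vocab_size, hidden_dim)
    new_id = tokenizer.convert_tokens_to_ids(new_token)
    # Initialize the new row as the mean of its old subword embeddings.
    emb[new_id] = emb[constituent_ids].mean(dim=0)
```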
Tokenizer-Free Models: The Byte Latent Transformer (BLT)
While byte-level models (like MEGABYTE) eliminate vocabularies, they struggle with the quadratic memory cost of long byte sequences.
- Entropy-Based Patching: BLT (Meta, 2024) introduces a dynamic patching mechanism. A small local model reads the byte stream and predicts the “entropy” (unpredictability) of the next byte. High entropy signals a patch boundary.
- Result: Complex data is patched finely, while redundant data is compressed into large patches.
- Architecture: Local Encoder (bytes \(\to\) patches) \(\to\) Latent Transformer (processes patches) \(\to\) Local Decoder (patches \(\to\) bytes). This matches the FLOPs of subword models without any fixed vocabulary, improving robustness to noise and spelling variations (source). A toy entropy-patching sketch follows this list.
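A toy sketch of entropy-based patching: a byte-bigram model stands in for BLT's small local model, and a new patch opens whenever the predicted next-byte entropy crosses a threshold (both the bigram proxy and the threshold are assumptions for illustration):

```python
import math
from collections import Counter, defaultdict

def next_byte_entropy(counts, prev):
    """Shannon entropy (bits) of the next-byte distribution observed after byte `prev`."""
    dist = counts[prev]
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())

def entropy_patches(data: bytes, threshold: float = 1.5):
    # Stand-in for BLT's local model: bigram next-byte statistics from the stream itself.
    counts = defaultdict(Counter)
    for a, b in zip(data, data[1:]):
        counts[a][b] += 1

    patches, current = [], bytearray([data[0]])
    for prev, b in zip(data, data[1:]):
        if next_byte_entropy(counts, prev) > threshold:   # unpredictable next byte -> new patch
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    patches.append(bytes(current))
    return patches

stream = b"id:7 id:3 id:9 id:2 id:5"
print(entropy_patches(stream))
# [b'id:', b'7 id:', b'3 id:', b'9 id:', b'2 id:', b'5']
# Predictable prefixes are absorbed into patches; the unpredictable value byte opens a new one.
```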
Tokenization in Generative Media
Text-to-image conditioning highlights the divergence between “generative” and “discriminative” tokenization needs:
- CLIP (SD 1.5, SDXL): Uses a legacy BPE tokenizer with a strict 77-token limit. This acts as a bottleneck, as complex prompts are truncated. The text encoder is treated as a frozen feature extractor, so the tokenizer cannot be changed.
- T5-XXL (SD3, Flux): Adopts SentencePiece with a large vocabulary and long context (up to 512+ tokens). T5’s bidirectional attention allows for detailed prompt comprehension, making tokenizer efficiency less critical than in autoregressive LLMs (source).