Text Tokenizers (with PyTorch)
Advanced Text Tokenization Techniques
Text tokenization is a fundamental step in natural language processing (NLP) that involves breaking down text into smaller units. This article explores four popular subword tokenization algorithms: Byte Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece.
Byte Pair Encoding (BPE)
BPE is an algorithm that iteratively merges the most frequent pair of bytes or characters in a corpus.
Algorithm:
1. Initialize the vocabulary with the individual characters in the corpus.
2. Count the frequency of each adjacent pair of symbols in the corpus.
3. Merge the most frequent pair and add the merged symbol to the vocabulary.
4. Repeat steps 2-3 until reaching the desired vocabulary size or iteration limit.
Pros:
- Simple and efficient
- Handles out-of-vocabulary words well
Cons:
- May create suboptimal merges based solely on frequency
Used in: GPT (Generative Pre-trained Transformer)
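To make the merge loop concrete, here is a minimal from-scratch sketch of BPE training on a toy corpus. The word frequencies and the number of merge steps are made up purely for illustration; production tokenizers such as GPT-2's operate on bytes and add many optimizations.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_corpus[" ".join(merged)] = freq
    return new_corpus

# Toy corpus: words pre-split into characters, mapped to their frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(10):  # in practice the loop runs until a target vocabulary size
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(f"step {step}: merged {best}")
```

Each learned merge becomes a rule that is later applied, in order, when tokenizing new text.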
WordPiece
WordPiece, developed by Google, is similar to BPE but uses a likelihood-based criterion for merging tokens.
Algorithm:
1. Start with a vocabulary of individual characters.
2. For each possible merge, calculate the resulting increase in the likelihood of the corpus.
3. Choose the merge that maximizes the likelihood of the training data.
4. Repeat steps 2-3 until reaching the desired vocabulary size.
Pros:
- Produces more linguistically sound subwords compared to BPE
- Effective for languages with rich morphology
Cons:
- More computationally expensive than BPE
Used in: BERT (Bidirectional Encoder Representations from Transformers)
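The likelihood criterion is commonly summarized by the score freq(pair) / (freq(first) × freq(second)): a pair is favored when its parts occur together far more often than they occur apart. Below is a small sketch of that scoring on a hypothetical character-split corpus; the corpus and frequencies are invented for illustration.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Score each adjacent pair by freq(pair) / (freq(first) * freq(second)).

    This is the commonly cited approximation of WordPiece's likelihood gain:
    a pair scores high when its parts rarely occur independently.
    """
    pair_freq, symbol_freq = Counter(), Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Toy corpus: words pre-split into characters, mapped to their frequencies.
corpus = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4, "h u g s": 5}
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print("best merge:", best, "score:", round(scores[best], 4))
```

Note that a very frequent pair can still lose to a rarer pair whose individual symbols almost never appear on their own, which is exactly how WordPiece differs from plain frequency-based BPE.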
Unigram
Unigram is a probabilistic subword segmentation algorithm that uses a top-down approach.
Algorithm:
1. Start with a large seed vocabulary (e.g., all characters plus common substrings).
2. Calculate the likelihood of the corpus under the current vocabulary.
3. For each subword, compute how much the likelihood drops if that subword is removed.
4. Remove a fixed percentage of the subwords whose removal hurts the likelihood the least.
5. Repeat steps 2-4 until reaching the desired vocabulary size.
Pros:
- Produces a probabilistically motivated vocabulary
- Can handle multiple segmentations of a word
Cons:
- More complex implementation
- Can be slower than BPE or WordPiece
Used in: SentencePiece
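Once a Unigram vocabulary and its subword probabilities have been trained, segmenting a word means finding the split with the highest total log-probability, typically via Viterbi dynamic programming. Below is a minimal sketch; the vocabulary and its log-probabilities are invented purely for illustration.

```python
import math

def viterbi_segment(word, vocab_logprobs):
    """Return the highest log-probability segmentation of `word` under a unigram model.

    best[i] holds the best score for the prefix word[:i]; back[i] remembers where
    that prefix's last subword starts, so the chosen pieces can be recovered.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab_logprobs and best[start] + vocab_logprobs[piece] > best[end]:
                best[end] = best[start] + vocab_logprobs[piece]
                back[end] = start
    # Walk back from the end of the word to recover the chosen subwords.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[n]

# Hypothetical vocabulary with made-up log-probabilities, just for illustration.
vocab = {"un": -2.0, "happi": -4.0, "ness": -3.0, "u": -6.0, "n": -5.5,
         "h": -6.0, "a": -5.0, "p": -5.5, "i": -5.0, "e": -5.0, "s": -5.0}
print(viterbi_segment("unhappiness", vocab))  # -> (['un', 'happi', 'ness'], -9.0)
```

Because the model assigns a probability to every candidate segmentation, it can also sample alternative splits instead of always returning the best one, which is the basis of subword regularization.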
SentencePiece
SentencePiece is a tokenizer and detokenizer designed mainly for neural network-based text generation systems. It implements both the BPE and Unigram algorithms.
Key Features:
- Language-agnostic tokenization
- Direct training from raw sentences
- Subword regularization with multiple segmentations
- Deterministic, reproducible tokenization (the same model file always produces the same output)
Algorithm Options:
- Unigram (default)
- BPE
Pros:
- Works well with any language (no pre-tokenization required)
- Treats the input as a raw stream of Unicode characters
- Supports subword regularization
Cons:
- May not capture language-specific nuances as well as specialized tokenizers
Used in: Many multilingual models (e.g., T5, ALBERT, XLM-R) and for languages without explicit word boundaries, such as Japanese and Chinese
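Subword regularization is exposed through SentencePiece's sampling options: instead of always returning the single best segmentation, the encoder can sample one of the probable segmentations. A short sketch follows; the corpus file name, model prefix, and vocabulary size are placeholders, and this kind of sampling is defined for the Unigram model, which is SentencePiece's default.

```python
import sentencepiece as spm

# "corpus.txt" is a placeholder for a plain-text training file; the model prefix
# and vocab_size are likewise arbitrary choices for this sketch.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="unigram_demo", model_type="unigram", vocab_size=1000
)

sp = spm.SentencePieceProcessor()
sp.load("unigram_demo.model")

text = "Subword regularization samples a different segmentation each time."

# Deterministic (best) segmentation.
print(sp.encode(text, out_type=str))

# Sampled segmentations: nbest_size=-1 samples over the full lattice of candidates,
# and alpha controls how sharply sampling concentrates on the most likely splits.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```

Training with sampled segmentations acts as a form of data augmentation and tends to make downstream models more robust to tokenization noise.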
Comparison Table
| Feature | BPE | WordPiece | Unigram | SentencePiece |
|---|---|---|---|---|
| Approach | Bottom-up (merging) | Bottom-up (merging) | Top-down (pruning) | Configurable (BPE or Unigram) |
| Selection criterion | Pair frequency | Likelihood gain of a merge | Likelihood loss when pruning | Depends on the chosen algorithm |
| Language agnostic | Partially | Partially | Yes | Yes |
| Pre-tokenization required | Yes | Yes | No | No |
| Subword regularization | No | No | Yes | Yes |
| Complexity | Low | Medium | High | Medium-High |
| Used in | GPT | BERT | Various (e.g., via SentencePiece) | Multilingual models |
Comparison of Example Results
| Algorithm | Tokenization Result |
|---|---|
| BPE | ['The', 'qu', 'ick', 'bro', 'wn', 'fox', 'jump', 's', 'over', 'the', 'la', 'zy', 'dog'] |
| WordPiece | ['The', 'quick', 'brown', 'fox', 'jump', '##s', 'over', 'the', 'la', '##zy', 'dog'] |
| Unigram | ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] |
| SentencePiece | ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over', '▁the', '▁la', 'zy', '▁dog'] |
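The splits above are only illustrative, since the exact result depends on the trained vocabulary. One way to produce a similar side-by-side comparison is to run the same sentence through pretrained tokenizers that happen to use each algorithm: GPT-2 uses byte-level BPE, BERT uses WordPiece, and ALBERT uses a SentencePiece Unigram model. The checkpoints below are just example choices downloaded from the Hugging Face Hub.

```python
from transformers import AutoTokenizer

text = "The quick brown fox jumps over the lazy dog"

# Example checkpoints; each uses a different tokenization algorithm.
checkpoints = {
    "BPE (GPT-2)": "gpt2",
    "WordPiece (BERT)": "bert-base-uncased",
    "SentencePiece Unigram (ALBERT)": "albert-base-v2",
}

for label, name in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(label, tokenizer.tokenize(text))
```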
Implementation with PyTorch and Hugging Face
Here's an example of how to create and train these tokenizers using Hugging Face's tokenizers library and SentencePiece:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, UnigramTrainer
import sentencepiece as spm

special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
files = ["path/to/files"]  # training corpus: one or more plain-text files

# BPE tokenizer: merges the most frequent adjacent pairs.
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation before training
bpe_trainer = BpeTrainer(special_tokens=special_tokens)
bpe_tokenizer.train(files=files, trainer=bpe_trainer)

# WordPiece tokenizer: merges pairs by likelihood gain, marks continuations with "##".
wp_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = Whitespace()
wp_trainer = WordPieceTrainer(special_tokens=special_tokens)
wp_tokenizer.train(files=files, trainer=wp_trainer)

# Unigram tokenizer: starts from a large vocabulary and prunes low-contribution subwords.
unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(unk_token="[UNK]", special_tokens=special_tokens)
unigram_tokenizer.train(files=files, trainer=unigram_trainer)

# SentencePiece tokenizer: trained directly on raw text, no pre-tokenization needed.
spm.SentencePieceTrainer.train(
    input="path/to/files", model_prefix="spm_model", model_type="bpe", vocab_size=32000
)
sp = spm.SentencePieceProcessor()
sp.load("spm_model.model")

# Example usage
text = "Hello, how are you today?"
print("BPE:", bpe_tokenizer.encode(text).tokens)
print("WordPiece:", wp_tokenizer.encode(text).tokens)
print("Unigram:", unigram_tokenizer.encode(text).tokens)
print("SentencePiece:", sp.encode(text, out_type=str))
```
This code demonstrates how to create, train, and use each type of tokenizer, including SentencePiece.
Conclusion
Each tokenization algorithm has its strengths and is suited for different types of NLP tasks. BPE and WordPiece are widely used in many popular language models, while Unigram offers a probabilistic approach to tokenization. SentencePiece stands out for its language-agnostic nature and ability to handle raw text input, making it particularly useful for multilingual models and languages without clear word boundaries.
The choice between these algorithms often depends on the specific requirements of the project, the language being processed, computational resources, and the need for language-agnostic processing. As NLP continues to evolve, these tokenization methods play a crucial role in improving the performance and versatility of language models.