Text Tokenizers (with PyTorch)
Advanced Text Tokenization Techniques
Text tokenization is a fundamental step in natural language processing (NLP) that involves breaking down text into smaller units. This article explores four popular subword tokenization algorithms: Byte Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece.
Byte Pair Encoding (BPE)
BPE is an algorithm that iteratively merges the most frequent pair of bytes or characters in a corpus.
Algorithm:
1. Initialize the vocabulary with the individual characters in the corpus.
2. Count the frequency of each adjacent pair of symbols in the corpus.
3. Merge the most frequent pair and add the merged symbol to the vocabulary.
4. Repeat steps 2-3 until reaching the desired vocabulary size or iteration limit.
Pros:
- Simple and efficient
- Handles out-of-vocabulary words well
Cons:
- May create suboptimal merges based solely on frequency
Used in: GPT (Generative Pre-trained Transformer)
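To make the merge loop concrete, here is a minimal from-scratch sketch of BPE training on a toy corpus. The word frequencies and the number of merge steps are made up purely for illustration; production tokenizers such as GPT-2's operate on bytes and add many optimizations.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_corpus[" ".join(merged)] = freq
    return new_corpus

# Toy corpus: words pre-split into characters, mapped to their frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(10):  # in practice the loop runs until a target vocabulary size
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(f"step {step}: merged {best}")
```

Each learned merge becomes a rule that is later applied, in order, when tokenizing new text.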
WordPiece
WordPiece, developed by Google, is similar to BPE but uses a likelihood-based criterion for merging tokens.
Algorithm:
1. Start with a vocabulary of individual characters.
2. For each possible merge, calculate the resulting increase in the likelihood of the corpus.
3. Choose the merge that maximizes the likelihood of the training data.
4. Repeat steps 2-3 until reaching the desired vocabulary size.
Pros:
- Produces more linguistically sound subwords compared to BPE
- Effective for languages with rich morphology
Cons:
- More computationally expensive than BPE
Used in: BERT (Bidirectional Encoder Representations from Transformers)
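The likelihood criterion is commonly summarized by the score freq(pair) / (freq(first) × freq(second)): a pair is favored when its parts occur together far more often than they occur apart. Below is a small sketch of that scoring on a hypothetical character-split corpus; the corpus and frequencies are invented for illustration.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Score each adjacent pair by freq(pair) / (freq(first) * freq(second)).

    This is the commonly cited approximation of WordPiece's likelihood gain:
    a pair scores high when its parts rarely occur independently.
    """
    pair_freq, symbol_freq = Counter(), Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Toy corpus: words pre-split into characters, mapped to their frequencies.
corpus = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4, "h u g s": 5}
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print("best merge:", best, "score:", round(scores[best], 4))
```

Note that a very frequent pair can still lose to a rarer pair whose individual symbols almost never appear on their own, which is exactly how WordPiece differs from plain frequency-based BPE.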
Unigram
Unigram is a probabilistic subword segmentation algorithm that uses a top-down approach.
Algorithm:
1. Start with a large seed vocabulary (e.g., all characters plus common substrings).
2. Calculate the likelihood of the corpus under the current vocabulary.
3. For each subword, compute how much the likelihood drops if that subword is removed.
4. Remove a fixed percentage of the subwords whose removal hurts the likelihood the least.
5. Repeat steps 2-4 until reaching the desired vocabulary size.
Pros:
- Produces a probabilistically motivated vocabulary
- Can handle multiple segmentations of a word
Cons:
- More complex implementation
- Can be slower than BPE or WordPiece
Used in: SentencePiece
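Once a Unigram vocabulary and its subword probabilities have been trained, segmenting a word means finding the split with the highest total log-probability, typically via Viterbi dynamic programming. Below is a minimal sketch; the vocabulary and its log-probabilities are invented purely for illustration.

```python
import math

def viterbi_segment(word, vocab_logprobs):
    """Return the highest log-probability segmentation of `word` under a unigram model.

    best[i] holds the best score for the prefix word[:i]; back[i] remembers where
    that prefix's last subword starts, so the chosen pieces can be recovered.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab_logprobs and best[start] + vocab_logprobs[piece] > best[end]:
                best[end] = best[start] + vocab_logprobs[piece]
                back[end] = start
    # Walk back from the end of the word to recover the chosen subwords.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[n]

# Hypothetical vocabulary with made-up log-probabilities, just for illustration.
vocab = {"un": -2.0, "happi": -4.0, "ness": -3.0, "u": -6.0, "n": -5.5,
         "h": -6.0, "a": -5.0, "p": -5.5, "i": -5.0, "e": -5.0, "s": -5.0}
print(viterbi_segment("unhappiness", vocab))  # -> (['un', 'happi', 'ness'], -9.0)
```

Because the model assigns a probability to every candidate segmentation, it can also sample alternative splits instead of always returning the best one, which is the basis of subword regularization.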
SentencePiece
SentencePiece is a tokenizer and detokenizer designed mainly for neural network-based text generation systems. It implements both the BPE and Unigram algorithms.
Key Features:
- Language-agnostic tokenization
- Direct training from raw sentences
- Subword regularization with multiple segmentations
- Deterministic, reproducible tokenization (the same model file always produces the same output)
Algorithm Options:
- Unigram (default)
- BPE
Pros:
- Works well with any language (no pre-tokenization required)
- Treats the input as a raw stream of Unicode characters
- Supports subword regularization
Cons:
- May not capture language-specific nuances as well as specialized tokenizers
Used in: Many multilingual models (e.g., T5, ALBERT, XLM-R) and for languages without explicit word boundaries, such as Japanese and Chinese
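Subword regularization is exposed through SentencePiece's sampling options: instead of always returning the single best segmentation, the encoder can sample one of the probable segmentations. A short sketch follows; the corpus file name, model prefix, and vocabulary size are placeholders, and this kind of sampling is defined for the Unigram model, which is SentencePiece's default.

```python
import sentencepiece as spm

# "corpus.txt" is a placeholder for a plain-text training file; the model prefix
# and vocab_size are likewise arbitrary choices for this sketch.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="unigram_demo", model_type="unigram", vocab_size=1000
)

sp = spm.SentencePieceProcessor()
sp.load("unigram_demo.model")

text = "Subword regularization samples a different segmentation each time."

# Deterministic (best) segmentation.
print(sp.encode(text, out_type=str))

# Sampled segmentations: nbest_size=-1 samples over the full lattice of candidates,
# and alpha controls how sharply sampling concentrates on the most likely splits.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```

Training with sampled segmentations acts as a form of data augmentation and tends to make downstream models more robust to tokenization noise.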
Comparison Table
| Feature | BPE | WordPiece | Unigram | SentencePiece |
|---|---|---|---|---|
| Approach | Bottom-up (merging) | Bottom-up (merging) | Top-down (pruning) | Configurable (BPE or Unigram) |
| Selection criterion | Pair frequency | Likelihood gain of a merge | Likelihood loss when pruning | Depends on the chosen algorithm |
| Language agnostic | Partially | Partially | Yes | Yes |
| Pre-tokenization required | Yes | Yes | No | No |
| Subword regularization | No | No | Yes | Yes |
| Complexity | Low | Medium | High | Medium-High |
| Used in | GPT | BERT | Various (e.g., via SentencePiece) | Multilingual models |
Comparison of Example Results
| Algorithm | Tokenization Result |
|---|---|
| BPE | ['The', 'qu', 'ick', 'bro', 'wn', 'fox', 'jump', 's', 'over', 'the', 'la', 'zy', 'dog'] |
| WordPiece | ['The', 'quick', 'brown', 'fox', 'jump', '##s', 'over', 'the', 'la', '##zy', 'dog'] |
| Unigram | ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] |
| SentencePiece | ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over', '▁the', '▁la', 'zy', '▁dog'] |
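The splits above are only illustrative, since the exact result depends on the trained vocabulary. One way to produce a similar side-by-side comparison is to run the same sentence through pretrained tokenizers that happen to use each algorithm: GPT-2 uses byte-level BPE, BERT uses WordPiece, and ALBERT uses a SentencePiece Unigram model. The checkpoints below are just example choices downloaded from the Hugging Face Hub.

```python
from transformers import AutoTokenizer

text = "The quick brown fox jumps over the lazy dog"

# Example checkpoints; each uses a different tokenization algorithm.
checkpoints = {
    "BPE (GPT-2)": "gpt2",
    "WordPiece (BERT)": "bert-base-uncased",
    "SentencePiece Unigram (ALBERT)": "albert-base-v2",
}

for label, name in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(label, tokenizer.tokenize(text))
```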
Implementation with PyTorch and Hugging Face
Here's an example of how to create and train these tokenizers using Hugging Face's tokenizers library and SentencePiece:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, UnigramTrainer
import sentencepiece as spm

special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
files = ["path/to/files"]  # training corpus: one or more plain-text files

# BPE tokenizer: merges the most frequent adjacent pairs.
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation before training
bpe_trainer = BpeTrainer(special_tokens=special_tokens)
bpe_tokenizer.train(files=files, trainer=bpe_trainer)

# WordPiece tokenizer: merges pairs by likelihood gain, marks continuations with "##".
wp_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = Whitespace()
wp_trainer = WordPieceTrainer(special_tokens=special_tokens)
wp_tokenizer.train(files=files, trainer=wp_trainer)

# Unigram tokenizer: starts from a large vocabulary and prunes low-contribution subwords.
unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(unk_token="[UNK]", special_tokens=special_tokens)
unigram_tokenizer.train(files=files, trainer=unigram_trainer)

# SentencePiece tokenizer: trained directly on raw text, no pre-tokenization needed.
spm.SentencePieceTrainer.train(
    input="path/to/files", model_prefix="spm_model", model_type="bpe", vocab_size=32000
)
sp = spm.SentencePieceProcessor()
sp.load("spm_model.model")

# Example usage
text = "Hello, how are you today?"
print("BPE:", bpe_tokenizer.encode(text).tokens)
print("WordPiece:", wp_tokenizer.encode(text).tokens)
print("Unigram:", unigram_tokenizer.encode(text).tokens)
print("SentencePiece:", sp.encode(text, out_type=str))
```
This code demonstrates how to create, train, and use each type of tokenizer, including SentencePiece.
Conclusion
Each tokenization algorithm has its strengths and is suited for different types of NLP tasks. BPE and WordPiece are widely used in many popular language models, while Unigram offers a probabilistic approach to tokenization. SentencePiece stands out for its language-agnostic nature and ability to handle raw text input, making it particularly useful for multilingual models and languages without clear word boundaries.
The choice between these algorithms often depends on the specific requirements of the project, the language being processed, computational resources, and the need for language-agnostic processing. As NLP continues to evolve, these tokenization methods play a crucial role in improving the performance and versatility of language models.