Text Tokenizers (w/ PyTorch)

Category: Deep Learning
Donghyuk Kim

Advanced Text Tokenization Techniques

Text tokenization is a fundamental step in natural language processing (NLP) that involves breaking down text into smaller units. This article explores four popular subword tokenization algorithms: Byte Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece.

Byte Pair Encoding (BPE)

BPE is an algorithm that iteratively merges the most frequent adjacent pair of symbols (initially individual bytes or characters) in a corpus, adding each merged pair to the vocabulary as a new token.

Algorithm:

  1. Initialize the vocabulary with individual characters.
  2. Count frequency of adjacent pairs in the corpus.
  3. Merge the most frequent pair and add to the vocabulary.
  4. Repeat steps 2-3 until reaching a desired vocabulary size or iteration limit.
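
To make steps 2-3 concrete, here is a minimal, self-contained sketch of the BPE training loop on a hand-made toy corpus; the word frequencies and the number of merges are made up for illustration, and pre-tokenization and special-token handling are omitted:

from collections import Counter

# Toy corpus: word -> frequency, with each word pre-split into characters.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def count_pairs(corpus):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

num_merges = 5
for _ in range(num_merges):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print("merged", best)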

Pros:

  • Simple and efficient
  • Handles out-of-vocabulary words well

Cons:

  • May create suboptimal merges based solely on frequency

Used in: GPT (Generative Pre-trained Transformer)

WordPiece

WordPiece, developed by Google, is similar to BPE but uses a likelihood-based criterion for merging tokens.

Algorithm:

  1. Start with a vocabulary of individual characters.
  2. For each possible merge, calculate the likelihood increase of the corpus.
  3. Choose the merge that maximizes the likelihood of the training data.
  4. Repeat steps 2-3 until reaching the desired vocabulary size.
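
The likelihood criterion is commonly described as selecting the pair with the highest score freq(ab) / (freq(a) × freq(b)), which favours pairs whose parts rarely occur apart. Below is a minimal sketch of just that scoring step on a made-up toy corpus (the merge loop itself proceeds as in BPE); it is an illustration, not Google's original implementation:

from collections import Counter

# Toy pre-tokenized corpus: each word split into WordPiece-style symbols,
# where non-initial symbols carry the '##' continuation prefix.
corpus = {("h", "##u", "##g"): 10, ("p", "##u", "##g"): 5,
          ("p", "##u", "##n"): 12, ("b", "##u", "##n"): 4,
          ("h", "##u", "##g", "##s"): 5}

pair_counts, symbol_counts = Counter(), Counter()
for word, freq in corpus.items():
    for symbol in word:
        symbol_counts[symbol] += freq
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += freq

def score(pair):
    # score = freq(pair) / (freq(first) * freq(second))
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

# Note: plain frequency would pick ('##u', '##g') here;
# the likelihood score instead prefers ('##g', '##s').
best = max(pair_counts, key=score)
print("best merge:", best, "score:", round(score(best), 4))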

Pros:

  • Produces more linguistically sound subwords compared to BPE
  • Effective for languages with rich morphology

Cons:

  • More computationally expensive than BPE

Used in: BERT (Bidirectional Encoder Representations from Transformers)
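
To see WordPiece output from a real model rather than a from-scratch training run, a pretrained vocabulary can be loaded with the transformers library. A small example, assuming transformers is installed and the bert-base-uncased files can be downloaded:

from transformers import AutoTokenizer

# Load BERT's pretrained WordPiece vocabulary (downloads on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Words outside the vocabulary are split into pieces; non-initial pieces
# carry the '##' continuation prefix characteristic of WordPiece.
print(tokenizer.tokenize("Tokenization is unbelievably useful"))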

Unigram

Unigram is a probabilistic subword segmentation algorithm that uses a top-down approach.

Algorithm:

  1. Start with a large vocabulary (e.g., all possible substrings).
  2. Calculate the likelihood of the corpus using the current vocabulary.
  3. For each subword, compute the loss in likelihood if removed.
  4. Remove a fixed percentage of subwords with the lowest loss.
  5. Repeat steps 2-4 until reaching the desired vocabulary size.
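
The core operation behind steps 2-3 is scoring each possible segmentation of a word as the product of its subword probabilities and keeping the best one (a Viterbi search). The sketch below illustrates only that step, with a hand-picked toy vocabulary rather than probabilities estimated by EM, and omits the pruning loop:

import math

# Toy unigram vocabulary: subword -> probability (hand-picked, not trained).
vocab = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.10, "ug": 0.15, "hug": 0.20, "gs": 0.05, "hugs": 0.10}

def best_segmentation(word, vocab):
    """Viterbi search: best[i] = (log-prob, split point) for word[:i]."""
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end of the word to recover the winning segmentation.
    pieces, end = [], len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[len(word)][0]

print(best_segmentation("hugs", vocab))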

Pros:

  • Produces a probabilistically motivated vocabulary
  • Can handle multiple segmentations of a word

Cons:

  • More complex implementation
  • Can be slower than BPE or WordPiece

Used in: SentencePiece (and, through it, models such as ALBERT, T5, and XLNet)

SentencePiece

SentencePiece is a tokenizer and detokenizer designed mainly for neural network-based text generation systems. It implements both the BPE and Unigram algorithms.

Key Features:

  1. Language-agnostic tokenization
  2. Direct training from raw sentences
  3. Subword regularization with multiple segmentations
  4. Deterministic vocabulary generation

Algorithm Options:

  • Unigram (default)
  • BPE

Pros:

  • Works well with any language (no pre-tokenization required)
  • Treats the input as a raw stream of Unicode characters
  • Supports subword regularization

Cons:

  • May not capture language-specific nuances as well as specialized tokenizers

Used in: Many multilingual models, and for languages without clear word boundaries (e.g., Japanese, Chinese)
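
The subword regularization feature listed above is exposed through sampled encoding. A minimal sketch, assuming a SentencePiece model has already been trained to spm_model.model (the implementation section below shows such a training call); sampling is most meaningful with the unigram model type:

import sentencepiece as spm

# Assumes a model file produced by spm.SentencePieceTrainer.train(...).
sp = spm.SentencePieceProcessor(model_file="spm_model.model")

# With sampling enabled, the same sentence can receive a different
# segmentation on every call, which acts as a regularizer during training.
for _ in range(3):
    print(sp.encode("Hello, how are you today?", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))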

Comparison Table

Feature                      BPE          WordPiece    Unigram           SentencePiece
Approach                     Bottom-up    Bottom-up    Top-down          Configurable (BPE or Unigram)
Merge/Pruning Criterion      Frequency    Likelihood   Likelihood loss   Depends on algorithm
Language Agnostic            Partially    Partially    Yes               Yes
Pre-tokenization Required    Yes          Yes          No                No
Subword Regularization       No           No           Yes               Yes
Complexity                   Low          Medium       High              Medium-High
Used in                      GPT          BERT         Various           Multilingual models

Comparison of Example Results

For the sentence "The quick brown fox jumps over the lazy dog", illustrative segmentations look like this (the exact splits depend on the training corpus and vocabulary size):

Algorithm        Tokenization Result
BPE              ['The', 'qu', 'ick', 'bro', 'wn', 'fox', 'jump', 's', 'over', 'the', 'la', 'zy', 'dog']
WordPiece        ['The', 'quick', 'brown', 'fox', 'jump', '##s', 'over', 'the', 'la', '##zy', 'dog']
Unigram          ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
SentencePiece    ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over', '▁the', '▁la', 'zy', '▁dog']

Implementation with PyTorch and Hugging Face

Here's an example of how to implement these tokenizers using Hugging Face's tokenizers library and SentencePiece:

from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, UnigramTrainer
import sentencepiece as spm

special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# BPE Tokenizer
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()  # split into words/punctuation before learning merges
bpe_trainer = BpeTrainer(special_tokens=special_tokens)
bpe_tokenizer.train(files=["path/to/files"], trainer=bpe_trainer)

# WordPiece Tokenizer
wp_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = Whitespace()
wp_trainer = WordPieceTrainer(special_tokens=special_tokens)
wp_tokenizer.train(files=["path/to/files"], trainer=wp_trainer)

# Unigram Tokenizer
unigram_tokenizer = Tokenizer(Unigram())
unigram_tokenizer.pre_tokenizer = Whitespace()
unigram_trainer = UnigramTrainer(special_tokens=special_tokens, unk_token="[UNK]")
unigram_tokenizer.train(files=["path/to/files"], trainer=unigram_trainer)

# SentencePiece Tokenizer (trains directly on raw text, no pre-tokenizer needed)
spm.SentencePieceTrainer.train(input="path/to/files", model_prefix="spm_model",
                               model_type="bpe", vocab_size=32000)
sp = spm.SentencePieceProcessor()
sp.load("spm_model.model")

# Example usage
text = "Hello, how are you today?"
print("BPE:", bpe_tokenizer.encode(text).tokens)
print("WordPiece:", wp_tokenizer.encode(text).tokens)
print("Unigram:", unigram_tokenizer.encode(text).tokens)
print("SentencePiece:", sp.encode(text, out_type=str))

This code demonstrates how to create, train, and use each type of tokenizer, including SentencePiece.
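
To connect the tokenizers to PyTorch, the integer IDs they produce can be padded and stacked into tensors. A minimal sketch, assuming the bpe_tokenizer trained above (the same pattern applies to the other tokenizers):

import torch

# Pad every encoding in a batch to a common length using the trained [PAD] token.
pad_id = bpe_tokenizer.token_to_id("[PAD]")
bpe_tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")

batch = bpe_tokenizer.encode_batch(["Hello, how are you today?",
                                    "Tokenizers turn text into integer IDs."])
input_ids = torch.tensor([enc.ids for enc in batch])                  # (batch, seq_len)
attention_mask = torch.tensor([enc.attention_mask for enc in batch])  # 1 = real token, 0 = padding

# These tensors can be fed to torch.nn.Embedding or any PyTorch model.
print(input_ids.shape, attention_mask.shape)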

Conclusion

Each tokenization algorithm has its strengths and is suited for different types of NLP tasks. BPE and WordPiece are widely used in many popular language models, while Unigram offers a probabilistic approach to tokenization. SentencePiece stands out for its language-agnostic nature and ability to handle raw text input, making it particularly useful for multilingual models and languages without clear word boundaries.

The choice between these algorithms often depends on the specific requirements of the project, the language being processed, computational resources, and the need for language-agnostic processing. As NLP continues to evolve, these tokenization methods play a crucial role in improving the performance and versatility of language models.