Word/Sentence Embedding Models

Category: Generative AI
Donghyuk Kim

Introduction to Word Embedding Models

| Model | File Size | Dimensions | Performance Metrics | Notes |
|---|---|---|---|---|
| Word2Vec | 532MB | 100 | Baseline performance | Widely used, simple structure |
| Word2Vec-pt | 571MB | 100 | Similar to Word2Vec | POS tagged version |
| FastText-ch | 1.8GB | 100 | Better for OOV words | Character n-grams (3-6) |
| FastText-jm | 1.8GB | 100 | Similar to FastText-ch | Jamo n-grams (3-6) |
| FastText-ch/jm | 4.8GB | 100 | Largest model | Combined character and jamo n-grams |
| GloVe | Not specified | Varies | Good performance on EvCR | - |
| ELMo | Largest | Not specified | Best overall on EvCR and EnCR | Contextual embeddings |
| Text-embedding-ada-002 | Not specified | Not specified | Outperformed MiniLM on groundedness (0.72 vs 0.60) and answer relevance (0.82 vs 0.62) | OpenAI model |
| Multilingual MiniLM L12 v2 | Not specified | Not specified | Lower performance compared to ada-002 | - |

Word embeddings are a crucial technology in natural language processing (NLP) for representing text data in vector space. Various word embedding models have been developed, each with its own characteristics and trade-offs. Let's explore the major models:

Word2Vec

Word2Vec, developed by Google in 2013, is one of the most famous word embedding models. It features two main architectures:

  • CBOW (Continuous Bag of Words): Predicts the center word using surrounding words.
  • Skip-gram: Predicts surrounding words using the center word.

Word2Vec excels at capturing semantic relationships between words, allowing for vector operations like "king - man + woman ≈ queen".
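The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The following is a minimal sketch in plain Python, not the original Word2Vec implementation; the function names and the `window` parameter are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """Return (context_words, center) pairs: the neighbors predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces one training example per (center, neighbor) pair, while CBOW produces one example per position with all neighbors grouped together.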

GloVe (Global Vectors)

GloVe, developed at Stanford University, combines global matrix factorization and local context window methods. It generates embeddings using word co-occurrence statistics from the corpus.

GloVe's advantage lies in its ability to achieve similar performance to Word2Vec with less computational cost. It also performs relatively well on rare words.

Hugging Face link: https://huggingface.co/sentence-transformers/average_word_embeddings_glove.840B.300d
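The co-occurrence statistics that GloVe starts from can be illustrated with a small counting routine. This is a rough sketch under simplifying assumptions (whitespace tokenization, unweighted counts, a hypothetical `window` parameter); it only builds the counts and does not perform the factorization step.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context_word) pair occurs within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = ["ice is cold", "steam is hot", "ice is solid"]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("ice", "is")], counts[("steam", "is")], counts[("ice", "hot")])
```

Because the counts are gathered once over the whole corpus, the embedding step can then work from this global table rather than from individual context windows.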

FastText

FastText, developed by Facebook, can be seen as an extension of Word2Vec. Its key feature is representing words as a collection of character n-grams.

FastText's advantages:

  • Robust to rare words and Out-of-Vocabulary (OOV) issues.
  • Captures relationships between morphologically similar words well.
  • Effective for multilingual processing.
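To make the character n-gram idea concrete, here is a small sketch that splits a word into n-grams of length 3 to 6, the range listed in the table above. The `<` and `>` boundary markers follow the convention described in the FastText paper; the function name and defaults are illustrative, and the real library also hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))

# An unseen word still shares n-grams with known words, which is why
# FastText can build a vector for it from its pieces.
shared = set(char_ngrams("where")) & set(char_ngrams("wherever"))
print(sorted(shared))
```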

ELMo (Embeddings from Language Models)

Introduced in 2018, ELMo generates contextualized word representations. It uses bidirectional LSTMs to capture the contextual meaning of words.

Key features of ELMo:

  • Addresses the problem of polysemy.
  • Captures rich linguistic features using deep bidirectional language models.
  • Effective for transfer learning.

BERT (Bidirectional Encoder Representations from Transformers)

BERT, released by Google in 2018, is based on the Transformer architecture. It is pre-trained as a masked language model on large text corpora and produces contextualized word representations.

BERT's advantages:

  • Enables deep representations considering bidirectional context.
  • Achieved state-of-the-art performance on various NLP tasks at the time of its release.
  • Easily adaptable to specific tasks through fine-tuning.

Hugging Face link: https://huggingface.co/google-bert/bert-base-uncased

Major Sentence Embedding Models

| Model | Size | Dimensions | Performance | Languages | Notes |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | Large | 3072 | State-of-the-art on MTEB | English | Latest OpenAI model, highest performance |
| OpenAI text-embedding-3-small | Small | 1536 | Comparable to ada-002 | English | Efficient, good performance/cost ratio |
| OpenAI text-embedding-ada-002 | Medium | 1536 | 91.1% accuracy on generic classification | English | Previous best OpenAI model |
| bge-large-en-v1.5 | Large | 1024 | Top performer on MTEB leaderboard | English | Strong overall performance |
| all-MiniLM-L6-v2 | Small | 384 | Good performance for size | Primarily English | Efficient, versatile model |
| all-mpnet-base-v2 | Medium | 768 | 89% accuracy on generic classification | English | Strong general-purpose model |
| Sentence-BERT | Varies | Varies | Good for semantic similarity tasks | Multiple | BERT-based architecture |
| Universal Sentence Encoder | Medium | 512 | Strong on semantic similarity, paraphrase detection | Multiple | Versatile model |
| LASER | Large | 1024 | Specialized in language-agnostic sentence representations | 93 languages | Language-agnostic embeddings |
| FinBERT | Medium | 768 | High performance for Finnish | Finnish | Language-specific model |
| Jina AI embeddings-v2-base-en | Medium | 768 | Strong performance on MTEB | English | Recent model with good overall performance |

OpenAI's Embedding Models

OpenAI provides several powerful embedding models, each designed for different use cases and performance requirements. As of the latest update, OpenAI offers two main categories of embedding models: the newer third-generation models and the legacy second-generation models.

a. Third-Generation Models

  1. text-embedding-3-small

    • Dimensions: 1536 (can be reduced to 512)
    • Max input tokens: 8191
    • Performance on MTEB benchmark: 62.3%
    • Pricing: Approximately 62,500 pages per dollar
  2. text-embedding-3-large

    • Dimensions: 3072 (can be reduced to 256)
    • Max input tokens: 8191
    • Performance on MTEB benchmark: 64.6%
    • Pricing: Approximately 9,615 pages per dollar

Key features of the third-generation models:

  • Improved performance, especially for multilingual tasks
  • Flexible dimensionality reduction options
  • Better handling of longer sequences

b. Second-Generation Model (Legacy)

  1. text-embedding-ada-002
    • Dimensions: 1536
    • Max input tokens: 8191
    • Performance on MTEB benchmark: 61.0%
    • Pricing: Approximately 12,500 pages per dollar
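The pages-per-dollar figures above translate directly into a cost estimate for a given corpus. The sketch below uses only the numbers quoted in this section; the function name is illustrative, and actual billing is per token, so treat the result as a rough estimate.

```python
# Pages-per-dollar figures quoted in this section.
PAGES_PER_DOLLAR = {
    "text-embedding-3-small": 62_500,
    "text-embedding-3-large": 9_615,
    "text-embedding-ada-002": 12_500,
}


def estimated_cost_usd(num_pages, model):
    """Estimate the embedding cost in US dollars for `num_pages` pages."""
    return num_pages / PAGES_PER_DOLLAR[model]


for model in PAGES_PER_DOLLAR:
    print(f"{model}: ${estimated_cost_usd(1_000_000, model):.2f} per 1M pages")
```

For one million pages this gives about $16 for text-embedding-3-small, $80 for text-embedding-ada-002, and $104 for text-embedding-3-large.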

c. Comparison and Use Cases

  1. text-embedding-3-small:

    • Best for: Applications requiring a balance between performance and cost
    • Advantages: Faster processing, lower cost, good performance
    • Use cases: General-purpose embeddings, semantic search, clustering
  2. text-embedding-3-large:

    • Best for: High-performance requirements, especially in multilingual contexts
    • Advantages: Highest accuracy, excellent multilingual performance
    • Use cases: Advanced NLP tasks, cross-lingual applications, when accuracy is critical
  3. text-embedding-ada-002:

    • Best for: Legacy applications or when consistent results with previous implementations are needed
    • Advantages: Well-established, good performance for English-language tasks
    • Use cases: Maintaining compatibility with existing systems, general-purpose embeddings

d. Key Improvements in Third-Generation Models

  1. Multilingual Performance: Significant improvement in handling multiple languages. On the MIRACL benchmark, the average score rose from 31.4% with text-embedding-ada-002 to 54.9% with text-embedding-3-large.
  2. Flexible Dimensionality: The ability to reduce dimensions while maintaining performance, offering more efficient storage and processing options.
  3. Improved Accuracy: Even the smaller model (text-embedding-3-small) outperforms the previous generation in most tasks.
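Flexible dimensionality can be requested from the API through the `dimensions` parameter of the embeddings endpoint, or approximated on an embedding you already have by keeping its leading values and rescaling to unit length. The following is a minimal sketch of the second approach in plain Python; the helper name and the toy vector are assumptions for illustration, not real model output.

```python
import math


def shorten_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    truncated = vector[:dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated  # nothing to rescale for an all-zero vector
    return [x / norm for x in truncated]


full = [0.5, 0.5, 0.5, 0.5]  # toy 4-dimensional unit vector
short = shorten_embedding(full, 2)
print(len(short), short)
```

Rescaling matters because cosine and dot-product comparisons usually assume unit-length vectors, and a truncated vector is no longer unit length.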

e. How to Use

Using OpenAI's embedding models is straightforward through their API. Here's a basic example using Python:

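from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces so the input is a single line of text.
    text = text.replace("\n", " ")
    # One input item is sent, so the first entry of `data` holds its embedding.
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Example usage
text = "Your text here"
embedding = get_embedding(text, model="text-embedding-3-small")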
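Once embeddings are available, a typical next step is semantic search: rank documents by cosine similarity to a query embedding. The sketch below uses toy 3-dimensional vectors in place of real API output so that it runs without an API key; the function names and sample data are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
docs = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "stocks": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
for doc_id, score in rank_documents(query, docs):
    print(f"{doc_id}: {score:.3f}")
```

In a real application the vectors would come from the embedding model, and a vector database or approximate nearest-neighbor index would replace the linear scan for large collections.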

f. Considerations

  • API Key: You need an OpenAI API key to access these models.
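  • Cost: text-embedding-3-large is the most expensive option (about 9,615 pages per dollar, versus 12,500 for text-embedding-ada-002), while text-embedding-3-small is the cheapest (about 62,500 pages per dollar).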
  • Dimensionality Trade-offs: Reducing dimensions can save storage and processing time but may slightly impact performance.
  • Knowledge Cutoff: The models' knowledge is not updated, so they don't have information about recent events.
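The storage side of the dimensionality trade-off is simple arithmetic. As a rough sketch, assuming each value is stored as a 4-byte float32 and ignoring any index or metadata overhead (both are assumptions, not figures from OpenAI), the raw size of an embedding collection is:

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def index_size_mb(num_vectors, dims):
    """Raw size of `num_vectors` float32 embeddings, in megabytes (10**6 bytes)."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1_000_000


for dims in (3072, 1536, 512, 256):
    print(f"{dims} dims: {index_size_mb(1_000_000, dims):,.0f} MB per 1M vectors")
```

Under these assumptions, one million vectors take about 12,288 MB at 3072 dimensions and about 1,024 MB at 256 dimensions.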

Meta (Facebook) Models

Meta has developed several robust sentence embedding models:

  • InferSent: A supervised learning model trained on natural language inference tasks. It performs well in various NLP tasks.

  • LASER (Language-Agnostic SEntence Representations): Supports multilingual sentence embeddings for 93 languages.

Meta's models are open-source, offering the advantage of free usage.

Google Models

Google provides significant sentence embedding models:

  • Universal Sentence Encoder (USE): A versatile sentence embedding model for various NLP tasks. It is published in Transformer-based and Deep Averaging Network (DAN) variants, and multilingual versions are available.

Google's USE is easily accessible through TensorFlow Hub and is particularly useful in transfer learning scenarios.

Hugging Face Models

Hugging Face hosts several popular sentence embedding models:

  • all-MiniLM-L6-v2: A lightweight model that maps sentences to 384-dimensional dense vectors.

Hugging Face link: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

all-MiniLM-L6-v2 is an efficient sentence embedding model from the sentence-transformers team. It maps sentences and paragraphs to a 384-dimensional dense vector space, which makes it useful for tasks such as clustering, semantic search, and information retrieval.

a. Key Features

  1. Architecture: Based on the MiniLM architecture, which is a compressed version of BERT. The "L6" in the name indicates it has 6 layers, making it more lightweight than larger models.
  2. Output Dimension: Produces 384-dimensional embeddings, striking a balance between model size and representation power.
  3. Input Handling: Can process input text up to 256 word pieces, after which it truncates the input.
  4. Training Data: Fine-tuned on a massive dataset of over 1 billion sentence pairs, sourced from various datasets including Reddit comments, Wikipedia citations, and Quora question pairs.
  5. Training Objective: Uses a contrastive learning objective, which helps the model learn to distinguish between similar and dissimilar sentence pairs.
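One common form of contrastive objective treats every other sentence in the batch as a negative: for each anchor, the model should score its own paired sentence higher than all the others. The sketch below implements that idea in plain Python on toy vectors. It is an illustration of the technique rather than the exact training code for this model, and the `scale` value is a hypothetical choice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]     # each positive is close to its anchor
misaligned = [[0.1, 0.9], [0.9, 0.1]]  # positives swapped
print(in_batch_contrastive_loss(anchors, aligned))
print(in_batch_contrastive_loss(anchors, misaligned))
```

The loss is near zero when each anchor is closest to its own pair and grows when the pairs are mismatched, which is the signal that pulls similar sentences together during training.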

b. Usage

The model can be easily used with the sentence-transformers library:

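from sentence_transformers import SentenceTransformer

# Downloads the model from the Hugging Face Hub on first use.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sentences = ["This is an example sentence", "Each sentence is converted"]
# encode() returns one 384-dimensional vector per input sentence.
embeddings = model.encode(sentences)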

c. Performance and Efficiency

  1. Speed: Due to its compact size, it's faster than larger models while maintaining competitive performance.
  2. Memory Efficiency: The smaller architecture makes it suitable for deployment in resource-constrained environments.
  3. Multilingual Capability: While primarily trained on English, it shows decent performance on other languages as well.

d. Applications

  1. Semantic Search: Ideal for building efficient search systems that understand the meaning behind queries.
  2. Text Clustering: Useful for grouping similar documents or sentences.
  3. Sentence Similarity: Can be used to find paraphrases or similar sentences in large datasets.
  4. Information Retrieval: Effective for matching queries with relevant documents.
  5. Text Classification: The embeddings can be used as features for downstream classification tasks.

e. Limitations

  1. Context Length: Limited to 256 tokens, which may not be sufficient for very long documents.
  2. Language Specificity: While it shows some multilingual capabilities, it's primarily optimized for English.
  3. Fine-grained Understanding: As a general-purpose model, it may not capture very domain-specific nuances without further fine-tuning.
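A common workaround for the context length limit is to split long documents into overlapping chunks and embed each chunk separately. The sketch below works on any token list; the function name, the `overlap` default, and the use of integer stand-ins for tokens are assumptions for illustration. In practice the tokens should come from the model's own tokenizer, since the 256 limit is counted in word pieces, not whitespace-separated words.

```python
def chunk_tokens(tokens, max_len=256, overlap=32):
    """Split `tokens` into windows of at most `max_len`, overlapping by `overlap`."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end
    return chunks


tokens = list(range(600))  # stand-in for 600 word pieces
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be embedded on its own, and the chunk embeddings can be searched individually or averaged into a single document vector.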

The all-MiniLM-L6-v2 model represents an excellent balance between efficiency and performance, making it a popular choice for many NLP applications, especially when computational resources are a consideration.

  • paraphrase-multilingual-MiniLM-L12-v2: A multilingual model supporting over 50 languages.

Hugging Face link: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2