Word/Sentence Embedding Models

Introduction to Word Embedding Models

Model	File Size	Dimensions	Performance Metrics	Notes
Word2Vec	532MB	100	Baseline performance	Widely used, simple structure
Word2Vec-pt	571MB	100	Similar to Word2Vec	POS tagged version
FastText-ch	1.8GB	100	Better for OOV words	Character n-grams (3-6)
FastText-jm	1.8GB	100	Similar to FastText-ch	Jamo n-grams (3-6)
FastText-ch/jm	4.8GB	100	Largest model	Combined character and jamo n-grams
GloVe	Not specified	Varies	Good performance on EvCR	-
ELMo	Largest	Not specified	Best overall on EvCR and EnCR	Contextual embeddings
Text-embedding-ada-002	Not specified	Not specified	Outperformed MiniLM on groundedness (0.72 vs 0.60) and answer relevance (0.82 vs 0.62)	OpenAI model
Multilingual MiniLM L12 v2	Not specified	Not specified	Lower performance compared to ada-002	-

Word embeddings are a crucial technology in natural language processing (NLP) for representing text data in vector space. Various word embedding models have been developed, each with its own characteristics and trade-offs. Let's explore the major models:

Word2Vec

Word2Vec, developed by Google in 2013, is one of the most famous word embedding models. It features two main architectures:

CBOW (Continuous Bag of Words): Predicts the center word using surrounding words.
Skip-gram: Predicts surrounding words using the center word.
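
The difference between the two architectures is easiest to see in the training examples each one consumes. The following is a minimal sketch, not Word2Vec's actual implementation: the function names, the whitespace tokenization, and the window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: the neighbors predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Hypothetical toy sentence, split on whitespace.
tokens = "the quick brown fox jumps".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one training pair per (center, neighbor) combination, while CBOW produces one example per position with all neighbors grouped together.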

Word2Vec excels at capturing semantic relationships between words, allowing for vector operations like "king - man + woman ≈ queen".
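
The analogy arithmetic can be sketched with a nearest-neighbor lookup by cosine similarity. The three-dimensional vectors below are hand-made toy values chosen so the analogy works; they are not real Word2Vec embeddings, and the `analogy` helper is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumption): axes loosely mean royalty, male, female.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", vectors))  # prints "queen" for these toy vectors
```

The input words are excluded from the candidates, as is common practice, because the result vector usually stays close to one of them.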

GloVe (Global Vectors)

GloVe, developed at Stanford University, combines global matrix factorization and local context window methods. It generates embeddings using word co-occurrence statistics from the corpus.
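
The co-occurrence statistics that GloVe starts from can be gathered with a sliding context window. This sketch only builds the counts and does not fit any vectors. The function name, the whitespace tokenization, and the window size are assumptions; the 1/distance weighting mirrors the scheme described in the GloVe paper.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count symmetric word co-occurrences, weighting each pair by 1/distance."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding words.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)


# Hypothetical two-sentence corpus.
corpus = ["ice is cold", "steam is hot"]
for pair, value in sorted(cooccurrence_counts(corpus).items()):
    print(pair, value)
```

GloVe then learns word vectors whose dot products approximate the logarithm of these counts.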

GloVe's advantage lies in its ability to achieve similar performance to Word2Vec with less computational cost. It also performs relatively well on rare words.

Hugging Face link: https://huggingface.co/sentence-transformers/average_word_embeddings_glove.840B.300d

FastText

FastText, developed by Facebook, can be seen as an extension of Word2Vec. Its key feature is representing words as a collection of character n-grams.
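
A short sketch of the n-gram decomposition follows. It assumes the `<` and `>` word-boundary markers used in the FastText paper and the 3 to 6 n-gram range listed in the table above; `char_ngrams` is a hypothetical helper, not part of the FastText library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the boundary-marked word."""
    wrapped = f"<{word}>"  # mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's vector is built from the vectors of its n-grams, which is why an unseen word can still receive an embedding as long as its n-grams were seen during training.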

FastText's advantages:

Robust to rare words and Out-of-Vocabulary (OOV) issues.
Captures relationships between morphologically similar words well.
Effective for multilingual processing.

ELMo (Embeddings from Language Models)

Introduced in 2018, ELMo generates contextualized word representations. It uses bidirectional LSTMs to capture the contextual meaning of words.

Key features of ELMo:

Addresses the problem of polysemy.
Captures rich linguistic features using deep bidirectional language models.
Effective for transfer learning.

BERT (Bidirectional Encoder Representations from Transformers)

BERT, released by Google in 2018, is based on the Transformer architecture. It is pre-trained with masked language modeling and produces contextualized word representations.

BERT's advantages:

Enables deep representations considering bidirectional context.
Achieves state-of-the-art performance on various NLP tasks.
Easily adaptable to specific tasks through fine-tuning.

Hugging Face link: https://huggingface.co/google-bert/bert-base-uncased

Major Sentence Embedding Models

Model	Size	Dimensions	Performance	Languages	Notes
OpenAI text-embedding-3-large	Large	3072	64.6% on MTEB	Multilingual	Latest OpenAI model, highest performance
OpenAI text-embedding-3-small	Small	1536	62.3% on MTEB (ada-002: 61.0%)	Multilingual	Efficient, good performance/cost ratio
OpenAI text-embedding-ada-002	Medium	1536	91.1% accuracy on generic classification	English	Previous best OpenAI model
bge-large-en-v1.5	Large	1024	Top performer on MTEB leaderboard	English	Strong overall performance
all-MiniLM-L6-v2	Small	384	Good performance for size	Primarily English	Efficient, versatile model
all-mpnet-base-v2	Medium	768	89% accuracy on generic classification	English	Strong general-purpose model
Sentence-BERT	Varies	Varies	Good for semantic similarity tasks	Multiple	BERT-based architecture
Universal Sentence Encoder	Medium	512	Strong on semantic similarity, paraphrase detection	Multiple	Versatile model
LASER	Large	1024	Specialized in language-agnostic sentence representations	93 languages	Language-agnostic embeddings
FinBERT	Medium	768	High performance for Finnish	Finnish	Language-specific model
Jina AI embeddings-v2-base-en	Medium	768	Strong performance on MTEB	English	Recent model with good overall performance

OpenAI's Embedding Models

OpenAI provides several powerful embedding models, each designed for different use cases and performance requirements. They fall into two main categories: the newer third-generation models and the legacy second-generation model.

a. Third-Generation Models

text-embedding-3-small
- Dimensions: 1536 (can be reduced to 512)
- Max input tokens: 8191
- Performance on MTEB benchmark: 62.3%
- Pricing: Approximately 62,500 pages per dollar
text-embedding-3-large
- Dimensions: 3072 (can be reduced to 256)
- Max input tokens: 8191
- Performance on MTEB benchmark: 64.6%
- Pricing: Approximately 9,615 pages per dollar

Key features of the third-generation models:

Improved performance, especially for multilingual tasks
Flexible dimensionality reduction options
Better handling of longer sequences

b. Second-Generation Model (Legacy)

text-embedding-ada-002
- Dimensions: 1536
- Max input tokens: 8191
- Performance on MTEB benchmark: 61.0%
- Pricing: Approximately 12,500 pages per dollar
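
The pages-per-dollar figures listed for the three models convert directly into an approximate cost for a given corpus size. This is a rough arithmetic sketch: the dictionary and function names are hypothetical, and the figures are the approximate ones quoted in this section, not live pricing.

```python
# Approximate pages per dollar, as quoted in this section.
PAGES_PER_DOLLAR = {
    "text-embedding-3-small": 62_500,
    "text-embedding-3-large": 9_615,
    "text-embedding-ada-002": 12_500,
}


def embedding_cost_usd(num_pages, model):
    """Estimate the cost in dollars of embedding `num_pages` pages."""
    return num_pages / PAGES_PER_DOLLAR[model]


for model in PAGES_PER_DOLLAR:
    print(f"{model}: ${embedding_cost_usd(1_000_000, model):.2f} per 1M pages")
```

At these rates, embedding one million pages costs about $16 with text-embedding-3-small, about $104 with text-embedding-3-large, and about $80 with text-embedding-ada-002.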

c. Comparison and Use Cases

text-embedding-3-small:
- Best for: Applications requiring a balance between performance and cost
- Advantages: Faster processing, lower cost, good performance
- Use cases: General-purpose embeddings, semantic search, clustering
text-embedding-3-large:
- Best for: High-performance requirements, especially in multilingual contexts
- Advantages: Highest accuracy, excellent multilingual performance
- Use cases: Advanced NLP tasks, cross-lingual applications, when accuracy is critical
text-embedding-ada-002:
- Best for: Legacy applications or when consistent results with previous implementations are needed
- Advantages: Well-established, good performance for English-language tasks
- Use cases: Maintaining compatibility with existing systems, general-purpose embeddings

d. Key Improvements in Third-Generation Models

Multilingual Performance: Significant improvement in handling multiple languages, with the MIRACL benchmark score rising from 31.4% (text-embedding-ada-002) to 54.9% (text-embedding-3-large).
Flexible Dimensionality: The ability to reduce dimensions while maintaining performance, offering more efficient storage and processing options.
Improved Accuracy: Even the smaller model (text-embedding-3-small) outperforms the previous generation in most tasks.
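
One way to picture the flexible dimensionality is as truncation followed by re-normalization: keep the leading values of the vector and rescale the result to unit length. The sketch below does this by hand on a made-up vector. `shorten_embedding` is a hypothetical helper; the embeddings endpoint also accepts a `dimensions` parameter for the third-generation models, which is usually the simpler route.

```python
import math


def shorten_embedding(embedding, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    truncated = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in truncated]


# Made-up 4-dimensional vector standing in for a real embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = shorten_embedding(full, dims=2)
print(len(short), sum(x * x for x in short))
```

Re-normalizing matters because cosine and dot-product comparisons assume unit-length vectors, and a truncated vector is generally no longer unit length.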

e. How to Use

Using OpenAI's embedding models is straightforward through their API. Here's a basic example using Python:

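from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces before sending the text to the API.
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding  # a list of floats

# Example usage
text = "Your text here"
embedding = get_embedding(text, model="text-embedding-3-small")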

f. Considerations

API Key: You need an OpenAI API key to access these models.
Cost: text-embedding-3-large is more expensive than text-embedding-ada-002 (about 9,615 vs. 12,500 pages per dollar), while text-embedding-3-small is considerably cheaper (about 62,500 pages per dollar).
Dimensionality Trade-offs: Reducing dimensions can save storage and processing time but may slightly impact performance.
Knowledge Cutoff: The models are trained on data up to a fixed cutoff date, so text about more recent events, names, or terminology may be represented less accurately.

Meta (Facebook) Models

Meta has developed several robust sentence embedding models:

InferSent: A supervised learning model trained on natural language inference tasks. It performs well in various NLP tasks.
LASER (Language-Agnostic SEntence Representations): Supports multilingual sentence embeddings for 93 languages.

Meta's models are openly released, so they can be run locally without per-request API fees, subject to each model's license.

Google Models

Google provides significant sentence embedding models:

Universal Sentence Encoder (USE): A versatile sentence embedding model for various NLP tasks. It is available in Transformer and Deep Averaging Network (DAN) variants and offers multilingual versions.

Google's USE is easily accessible through TensorFlow Hub and is particularly useful in transfer learning scenarios.

Hugging Face Models

Hugging Face hosts several popular sentence embedding models:

all-MiniLM-L6-v2: A lightweight model that maps sentences to 384-dimensional dense vectors.

Hugging Face link: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

The all-MiniLM-L6-v2 is a powerful and efficient sentence embedding model developed by the sentence-transformers team. It's designed to map sentences and paragraphs to a 384-dimensional dense vector space, making it particularly useful for tasks such as clustering, semantic search, and information retrieval.

a. Key Features

Architecture: Based on the MiniLM architecture, which is a compressed version of BERT. The "L6" in the name indicates it has 6 layers, making it more lightweight than larger models.
Output Dimension: Produces 384-dimensional embeddings, striking a balance between model size and representation power.
Input Handling: Can process input text up to 256 word pieces, after which it truncates the input.
Training Data: Fine-tuned on a massive dataset of over 1 billion sentence pairs, sourced from various datasets including Reddit comments, S2ORC citation pairs, WikiAnswers duplicate questions, and Quora question pairs.
Training Objective: Uses a contrastive learning objective, which helps the model learn to distinguish between similar and dissimilar sentence pairs.
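
A contrastive objective of this kind can be sketched as a cross-entropy loss over in-batch similarities: each anchor sentence should score highest against its own paired sentence and lower against every other sentence in the batch. The code below is an illustrative pure-Python sketch, not the model's training code. The function names, the toy 2-dimensional vectors, and the `scale` value of 20 are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)


# Toy batch: pairs that match give a low loss, mismatched pairs a high loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
print(in_batch_contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]]))
print(in_batch_contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]]))
```

Minimizing this loss pulls paired sentences together and pushes unrelated sentences apart in the embedding space.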

b. Usage

The model can be easily used with the sentence-transformers library:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)  # NumPy array of shape (2, 384)

c. Performance and Efficiency

Speed: Due to its compact size, it's faster than larger models while maintaining competitive performance.
Memory Efficiency: The smaller architecture makes it suitable for deployment in resource-constrained environments.
Multilingual Capability: While primarily trained on English, it shows decent performance on other languages as well.

d. Applications

Semantic Search: Ideal for building efficient search systems that understand the meaning behind queries.
Text Clustering: Useful for grouping similar documents or sentences.
Sentence Similarity: Can be used to find paraphrases or similar sentences in large datasets.
Information Retrieval: Effective for matching queries with relevant documents.
Text Classification: The embeddings can be used as features for downstream classification tasks.
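
Semantic search, similarity, and retrieval all reduce to the same operation: rank candidate vectors by cosine similarity to a query vector. The sketch below uses made-up 3-dimensional vectors in place of real 384-dimensional embeddings; `rank_documents` and the document ids are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for model.encode(...) output.
doc_vecs = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.7, 0.6, 0.1],
    "doc_taxes": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

for doc_id, score in rank_documents(query_vec, doc_vecs):
    print(f"{doc_id}: {score:.3f}")
```

For large collections this exhaustive comparison is typically replaced by an approximate nearest-neighbor index, but the ranking criterion stays the same.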

e. Limitations

Context Length: Limited to 256 tokens, which may not be sufficient for very long documents.
Language Specificity: While it shows some multilingual capabilities, it's primarily optimized for English.
Fine-grained Understanding: As a general-purpose model, it may not capture very domain-specific nuances without further fine-tuning.
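
A common workaround for the context-length limit is to split long text into overlapping chunks and embed each chunk separately. The sketch below works on any token list. `chunk_tokens` and the overlap of 32 are assumptions, and whitespace-separated words are only a rough stand-in for word pieces (a real tokenizer usually produces more pieces than words).

```python
def chunk_tokens(tokens, max_len=256, overlap=32):
    """Split tokens into chunks of at most `max_len`, sharing `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


# Hypothetical long document: 600 placeholder words.
words = [f"w{i}" for i in range(600)]
chunks = chunk_tokens(words, max_len=256, overlap=32)
print([len(c) for c in chunks])
```

How the per-chunk embeddings are then combined (for example, averaged, or indexed individually) is a design choice that depends on the application.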

The all-MiniLM-L6-v2 model represents an excellent balance between efficiency and performance, making it a popular choice for many NLP applications, especially when computational resources are a consideration.

paraphrase-multilingual-MiniLM-L12-v2: A multilingual model supporting over 50 languages.

Hugging Face link: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2