Document Representations for RAG
Bag of Words (BoW)
The Bag of Words model is a fundamental and widely used technique in natural language processing (NLP) for text representation. It simplifies the representation of text data by treating each document as a collection of words, disregarding grammar and word order but keeping track of the frequency of each word.
Given a document with a vocabulary , the Bag of Words representation can be expressed as:
Where:
- is the frequency of word in document .
- is the number of unique words in the vocabulary.
Advantages
- Simplicity: The BoW model is easy to understand and implement.
- Efficiency: It allows for quick computation and works well with many machine learning algorithms.
- Interpretability: The resulting vectors are straightforward to interpret, making it easier to analyze feature importance.
Disadvantages
- Loss of Context: BoW ignores word order and grammatical structure, which can lead to loss of contextual information.
- High Dimensionality: The vocabulary can become very large, leading to high-dimensional feature vectors that may be sparse.
- Inability to Capture Semantics: It does not account for synonyms or semantic relationships between words, which can limit its effectiveness in understanding meaning.
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a statistical measure used to evaluate the importance of a word in a document within a collection or corpus. It's widely used in information retrieval and text mining.
Components of TF-IDF
- Term Frequency (TF)
- Inverse Document Frequency (IDF)
Term Frequency (TF)
TF measures how frequently a term appears in a document. It's calculated as:
Inverse Document Frequency (IDF)
IDF measures how important a term is across the entire corpus. It's calculated as:
TF-IDF Calculation
The TF-IDF score is the product of TF and IDF:
Characteristics of TF-IDF
-
Importance Weighting: TF-IDF assigns higher weights to terms that are frequent in a particular document but rare across the corpus.
-
Normalization: It normalizes term frequency by document length, preventing bias towards longer documents.
-
Inverse Scaling: The IDF component scales down the weight of common terms across documents.
Here's a basic implementation using Python and the scikit-learn library:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
"This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?",
]
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform(documents)
# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()
# Print TF-IDF scores
for i, doc in enumerate(documents):
print(f"Document {i+1}:")
for j, word in enumerate(feature_names):
score = tfidf_matrix[i, j]
if score > 0:
print(f" {word}: {score:.4f}")
print()
This code will output the TF-IDF scores for each word in each document, providing a clear view of which terms are considered most important in each document relative to the entire corpus.
TF-IDF remains a fundamental technique in text analysis and serves as a baseline for many more advanced NLP methods.
Dense Passage Retrieval (DPR)
Dense Passage Retrieval (DPR) is an efficient document retrieval method designed for open-domain question answering systems. Here are the key features and workings of DPR:
-
Dense Vector Representation: DPR encodes questions and passages (documents) into fixed-size dense vectors. This approach contrasts with traditional sparse vector representations like TF-IDF or BM25.
-
Dual Encoder Framework: DPR utilizes two BERT-based encoders:
- Question encoder: Encodes questions into vectors
- Passage encoder: Encodes passages into vectors
-
Similarity Calculation: The similarity between encoded question and passage vectors is computed using dot product.
-
Training Method: The model is trained to maximize similarity for relevant question-passage pairs while minimizing it for irrelevant pairs.
-
Efficient Retrieval: DPR uses tools like Facebook AI Similarity Search (FAISS) for fast retrieval of similar passages from large document collections.
-
Performance: DPR has shown 9%-19% higher accuracy compared to traditional BM25 systems and has achieved state-of-the-art performance on several open-domain QA benchmarks.
-
Advantages:
- Better capture of semantic similarity (e.g., synonyms, paraphrases)
- Ability to learn complex relationships between questions and passages
- Fast retrieval speed and good scalability
-
Mathematical Formulation: The relevance score between a question q and a passage p is calculated as:
Where E_Q and E_P are the question and passage encoders respectively.
-
Training Objective: DPR is trained using a contrastive loss function:
Where p+ is a positive (relevant) passage and p_i- are negative (irrelevant) passages.
-
Implementation: DPR can be implemented using deep learning frameworks like PyTorch or TensorFlow, often utilizing pre-trained language models as a starting point for the encoders.
DPR represents a significant advancement in information retrieval for question answering systems, enabling more accurate and efficient retrieval from large document collections. Its ability to capture semantic relationships between questions and documents has made it a crucial technology in modern NLP applications.
Sentence-BERT
Sentence-BERT generates embeddings for sentences or short texts. It fine-tunes BERT using a Siamese network structure.
Embedding generation process:
- Pass input sentence through BERT
- Use mean pooling or [CLS] token on output token embeddings
- Pass through an additional dense layer for final embedding
Similarity calculation:
Advantages:
- Captures sentence-level semantics well
Disadvantages:
- May miss document-wide context
BM25
BM25 is an improved version of TF-IDF that considers document length. The basic formula is:
Where:
- : Frequency of query term in document D
- : Length of document D
- avgdl: Average document length
- and b: Tunable parameters
Advantages:
- Better performance than TF-IDF
Disadvantages:
- Still limited in capturing semantic similarity
ColBERT
ColBERT uses "late interaction" to generate separate embeddings for each token in a document and calculates similarity through interaction with the query.
Mathematical representation:
Where:
- and are query and document encoders respectively
- and are tokens of the query and document
Key features:
- Enables fine-grained matching through token-level interaction
- Uses MaxSim operation to find optimal token matching
Advantages:
- High accuracy due to token-level matching
Disadvantages:
- High storage and computational costs
SPLADE (Sparse-Dense Retrieval)
SPLADE combines sparse and dense representations. It uses BERT's output to calculate importance scores for each word.
Main components:
- BERT-based encoder
- Linear layer for word importance prediction
- ReLU activation for sparsity
Formula:
Where W is a learnable weight matrix.
Similarity calculation during retrieval:
Advantages:
- Leverages benefits of both sparse and dense representations
Disadvantages:
- Complex implementation and tuning
Multi-Representation Indexing
This method uses multiple representations of a document (e.g., summary, full text, metadata).
Structure:
Document
├── Full Text Representation
├── Summary Representation
├── Metadata Representation
└── Additional Representations
Retrieval process:
- Perform separate searches for each representation
- Combine scores from each search result
Score combination example:
Where , , are weights for each representation.
Advantages:
- Represents documents from various perspectives, improving retrieval accuracy
- Can handle diverse types of queries
Disadvantages:
- Increased index size and search complexity
- May require weight adjustment for each representation
These various Document Representation methods each have their own strengths and weaknesses. In practical applications, the choice of method or combination of methods depends on the specific task characteristics, data properties, and system requirements.
When implementing these methods, consider the following comparison table:
Method | Semantic Understanding | Computational Cost | Storage Requirement | Implementation Complexity |
---|---|---|---|---|
DPR | High | Medium | Medium | Medium |
Sentence-BERT | High | Medium | Medium | Medium |
TF-IDF | Low | Low | Low | Low |
BM25 | Low | Low | Low | Low |
ColBERT | Very High | High | High | High |
SPLADE | High | Medium | Medium | High |
Multi-Representation | High | High | High | High |
This table provides a quick overview of the trade-offs between different methods, which can be helpful when deciding which approach to use for a specific RAG implementation.