Document Representations for RAG

Category: Natural Language Processing
Donghyuk Kim

Bag of Words (BoW)

The Bag of Words model is a fundamental and widely used technique in natural language processing (NLP) for text representation. It simplifies the representation of text data by treating each document as a collection of words, disregarding grammar and word order but keeping track of the frequency of each word.

Given a document D with a vocabulary V, the Bag of Words representation can be expressed as:

\text{BoW}(D) = [f(w_1), f(w_2), ..., f(w_n)]

Where:

  • f(w_i) is the frequency of word w_i in document D.
  • n is the number of unique words in the vocabulary V.
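
As a quick illustration, here is a minimal sketch of building BoW vectors with scikit-learn's CountVectorizer (the sample sentences are made up for this example):

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (illustrative only)
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Learn the vocabulary and count word frequencies per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
# Each row is the BoW vector [f(w_1), ..., f(w_n)] of one document
print(bow_matrix.toarray())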

Advantages

  • Simplicity: The BoW model is easy to understand and implement.
  • Efficiency: It allows for quick computation and works well with many machine learning algorithms.
  • Interpretability: The resulting vectors are straightforward to interpret, making it easier to analyze feature importance.

Disadvantages

  • Loss of Context: BoW ignores word order and grammatical structure, which can lead to loss of contextual information.
  • High Dimensionality: The vocabulary can become very large, leading to high-dimensional feature vectors that may be sparse.
  • Inability to Capture Semantics: It does not account for synonyms or semantic relationships between words, which can limit its effectiveness in understanding meaning.

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document within a collection or corpus. It's widely used in information retrieval and text mining.

Components of TF-IDF

  1. Term Frequency (TF)
  2. Inverse Document Frequency (IDF)

Term Frequency (TF)

TF measures how frequently a term appears in a document. It's calculated as:

TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}

Inverse Document Frequency (IDF)

IDF measures how important a term is across the entire corpus. It's calculated as:

IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)

TF-IDF Calculation

The TF-IDF score is the product of TF and IDF:

\text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)

Characteristics of TF-IDF

  1. Importance Weighting: TF-IDF assigns higher weights to terms that are frequent in a particular document but rare across the corpus.

  2. Normalization: It normalizes term frequency by document length, preventing bias towards longer documents.

  3. Inverse Scaling: The IDF component scales down the weight of common terms across documents.

Here's a basic implementation using Python and the scikit-learn library:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print TF-IDF scores
for i, doc in enumerate(documents):
    print(f"Document {i+1}:")
    for j, word in enumerate(feature_names):
        score = tfidf_matrix[i, j]
        if score > 0:
            print(f"  {word}: {score:.4f}")
    print()

This code outputs the TF-IDF score of each word in each document, giving a clear view of which terms are considered most important in each document relative to the entire corpus. Note that scikit-learn's TfidfVectorizer applies IDF smoothing and L2 normalization by default, so the scores differ slightly from the raw formulas above.

TF-IDF remains a fundamental technique in text analysis and serves as a baseline for many more advanced NLP methods.

Dense Passage Retrieval (DPR)

Dense Passage Retrieval (DPR) is an efficient document retrieval method designed for open-domain question answering systems. Here are the key features and workings of DPR:

  1. Dense Vector Representation: DPR encodes questions and passages (documents) into fixed-size dense vectors. This approach contrasts with traditional sparse vector representations like TF-IDF or BM25.

  2. Dual Encoder Framework: DPR utilizes two BERT-based encoders:

    • Question encoder: Encodes questions into vectors
    • Passage encoder: Encodes passages into vectors
  3. Similarity Calculation: The similarity between encoded question and passage vectors is computed using the dot product.

  4. Training Method: The model is trained to maximize similarity for relevant question-passage pairs while minimizing it for irrelevant pairs.

  5. Efficient Retrieval: DPR uses tools like Facebook AI Similarity Search (FAISS) for fast retrieval of similar passages from large document collections.

  6. Performance: DPR has shown 9%-19% (absolute) higher top-20 retrieval accuracy than traditional BM25 systems and has achieved state-of-the-art performance on several open-domain QA benchmarks.

  7. Advantages:

    • Better capture of semantic similarity (e.g., synonyms, paraphrases)
    • Ability to learn complex relationships between questions and passages
    • Fast retrieval speed and good scalability
  8. Mathematical Formulation: The relevance score between a question q and a passage p is calculated as:

    sim(q, p) = E_Q(q)^T \cdot E_P(p)

    Where E_Q and E_P are the question and passage encoders respectively.

  9. Training Objective: DPR is trained using a contrastive loss function:

    L(q, p^+, p_1^-, \ldots, p_n^-) = -\log \frac{e^{sim(q, p^+)}}{e^{sim(q, p^+)} + \sum_{i=1}^{n} e^{sim(q, p_i^-)}}

    Where p^+ is a positive (relevant) passage and p_i^- are negative (irrelevant) passages.

  10. Implementation: DPR can be implemented using deep learning frameworks like PyTorch or TensorFlow, typically starting from pre-trained language models for the two encoders; pre-trained DPR checkpoints are also available, as in the sketch below.
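
For example, a minimal sketch using the pre-trained DPR encoders published on the Hugging Face Hub (the checkpoint names and sample texts here are assumptions for illustration; transformers and torch must be installed):

import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Load the dual encoders: one for questions, one for passages
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "What is dense passage retrieval?"
passages = [
    "DPR encodes questions and passages into dense vectors for retrieval.",
    "BM25 is a sparse ranking function based on term frequencies.",
]

with torch.no_grad():
    q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
    p_emb = ctx_encoder(**ctx_tokenizer(passages, return_tensors="pt", padding=True, truncation=True)).pooler_output

# Relevance score: dot product between question and passage embeddings
scores = q_emb @ p_emb.T
print(scores)  # higher score = more relevant passage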

DPR represents a significant advancement in information retrieval for question answering systems, enabling more accurate and efficient retrieval from large document collections. Its ability to capture semantic relationships between questions and documents has made it a crucial technology in modern NLP applications.

Sentence-BERT

Sentence-BERT (SBERT) generates fixed-size embeddings for sentences and short texts. It fine-tunes BERT with a Siamese network structure so that semantically similar sentences are mapped to nearby vectors.

Embedding generation process:

  1. Pass input sentence through BERT
  2. Apply mean pooling (or use the [CLS] token) over the output token embeddings
  3. Optionally pass the pooled vector through an additional dense layer to obtain the final embedding

Similarity calculation:

similarity = cosine(embedding_1, embedding_2)
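
A minimal sketch with the sentence-transformers library (the model name and sentences are illustrative assumptions):

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained Sentence-BERT model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Retrieval-augmented generation looks up documents before answering.",
    "RAG retrieves relevant passages and then generates a response.",
]

# Encode both sentences into dense embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
print(float(util.cos_sim(embeddings[0], embeddings[1])))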

Advantages:

  • Captures sentence-level semantics well

Disadvantages:

  • May miss document-wide context

BM25

BM25 is a ranking function that improves on TF-IDF by saturating term frequency and normalizing for document length. The basic formula is:

Score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}

Where:

  • f(q_i, D): Frequency of query term q_i in document D
  • |D|: Length of document D (in words)
  • avgdl: Average document length in the corpus
  • k_1 and b: Tunable parameters (commonly k_1 ≈ 1.2–2.0 and b ≈ 0.75)
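
A minimal sketch using the rank_bm25 package (the corpus and query are made up, and whitespace tokenization is a simplification):

from rank_bm25 import BM25Okapi

# Toy corpus, tokenized by whitespace for simplicity
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
]
tokenized_corpus = [doc.split() for doc in corpus]

# Build the BM25 index (library defaults for k1 and b)
bm25 = BM25Okapi(tokenized_corpus)

# Score every document against the query
query = "first document".split()
print(bm25.get_scores(query))  # one BM25 score per document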

Advantages:

  • Better performance than TF-IDF

Disadvantages:

  • Still limited in capturing semantic similarity

ColBERT

ColBERT uses "late interaction": it generates a separate embedding for each token of the query and the document, encodes both independently, and computes similarity only at scoring time through token-level interactions.

Mathematical representation:

S(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} E_q(q_i)^T \cdot E_d(d_j)

Where:

  • E_q and E_d are the query and document encoders respectively
  • q_i and d_j are tokens of the query and document
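
A minimal sketch of the MaxSim scoring in PyTorch, with random token embeddings standing in for the encoder outputs (dimensions and token counts are arbitrary):

import torch
import torch.nn.functional as F

# Stand-ins for E_q(q_i) and E_d(d_j): one embedding per token
query_emb = torch.randn(5, 128)   # 5 query tokens, 128-dim embeddings
doc_emb = torch.randn(40, 128)    # 40 document tokens

# ColBERT normalizes token embeddings so dot products act as cosine similarities
query_emb = F.normalize(query_emb, dim=-1)
doc_emb = F.normalize(doc_emb, dim=-1)

# Late interaction: |q| x |d| matrix of token-level similarities
token_sims = query_emb @ doc_emb.T

# MaxSim: each query token keeps its best-matching document token, then sum
score = token_sims.max(dim=1).values.sum()
print(float(score))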

Key features:

  1. Enables fine-grained matching through token-level interaction
  2. Uses MaxSim operation to find optimal token matching

Advantages:

  • High accuracy due to token-level matching

Disadvantages:

  • High storage and computational costs

SPLADE (Sparse Lexical and Expansion Model)

SPLADE learns sparse lexical representations with a BERT-based model, combining the efficiency of sparse retrieval with learned, contextual term weighting. It uses BERT's output to calculate an importance score for each vocabulary term.

Main components:

  1. BERT-based encoder
  2. Linear layer for word importance prediction
  3. ReLU activation for sparsity

Formula:

f(d) = ReLU(W \cdot BERT(d))

Where W is a learnable weight matrix.

Similarity calculation during retrieval:

sim(q, d) = f(q)^T \cdot f(d)
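
A minimal sketch of this idea in PyTorch, with random hidden states standing in for BERT outputs; the max pooling over token positions is an added assumption, since the formula above leaves the pooling step implicit:

import torch
import torch.nn as nn

hidden_dim, vocab_size = 768, 30522  # BERT-base sizes

# Stand-ins for BERT(d) and BERT(q): one contextual vector per input token
doc_hidden = torch.randn(12, hidden_dim)    # 12 document tokens
query_hidden = torch.randn(4, hidden_dim)   # 4 query tokens

# Linear layer predicting an importance score for every vocabulary term
W = nn.Linear(hidden_dim, vocab_size, bias=False)

def splade_repr(hidden_states):
    scores = torch.relu(W(hidden_states))   # ReLU zeroes out negative scores (training adds sparsity regularization)
    return scores.max(dim=0).values         # pool token scores into one vocabulary-sized vector

f_d = splade_repr(doc_hidden)
f_q = splade_repr(query_hidden)

# sim(q, d) = f(q)^T f(d)
print(float(torch.dot(f_q, f_d)))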

Advantages:

  • Leverages benefits of both sparse and dense representations

Disadvantages:

  • Complex implementation and tuning

Multi-Representation Indexing

This method uses multiple representations of a document (e.g., summary, full text, metadata).

Structure:

Document
├── Full Text Representation
├── Summary Representation
├── Metadata Representation
└── Additional Representations

Retrieval process:

  1. Perform separate searches for each representation
  2. Combine scores from each search result

Score combination example:

Score_{final} = w_1 \cdot Score_{full} + w_2 \cdot Score_{summary} + w_3 \cdot Score_{metadata}

Where w_1, w_2, and w_3 are the weights for each representation.
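
A minimal sketch of this weighted score fusion (the weights and per-representation scores are illustrative values, not recommendations):

# Weighted combination of retrieval scores from different representations
def combine_scores(scores: dict, weights: dict) -> float:
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)

weights = {"full_text": 0.5, "summary": 0.3, "metadata": 0.2}
doc_scores = {"full_text": 0.62, "summary": 0.80, "metadata": 0.10}

# 0.5*0.62 + 0.3*0.80 + 0.2*0.10 = 0.57
print(combine_scores(doc_scores, weights))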

Advantages:

  • Represents documents from various perspectives, improving retrieval accuracy
  • Can handle diverse types of queries

Disadvantages:

  • Increased index size and search complexity
  • May require weight adjustment for each representation

These document representation methods each have their own strengths and weaknesses. In practice, the choice of a method, or a combination of methods, depends on the task characteristics, data properties, and system requirements.

When implementing these methods, consider the following comparison table:

Method                 Semantic Understanding  Computational Cost  Storage Requirement  Implementation Complexity
DPR                    High                    Medium              Medium               Medium
Sentence-BERT          High                    Medium              Medium               Medium
TF-IDF                 Low                     Low                 Low                  Low
BM25                   Low                     Low                 Low                  Low
ColBERT                Very High               High                High                 High
SPLADE                 High                    Medium              Medium               High
Multi-Representation   High                    High                High                 High

This table provides a quick overview of the trade-offs between different methods, which can be helpful when deciding which approach to use for a specific RAG implementation.

Tags


TF/IDF ColBERT Sentence-BERT BoW DPR BM25 SPLADE