Document Representations for RAG

Category: Natural Language Processing
Donghyuk Kim

Bag of Words (BoW)

The Bag of Words model is a fundamental and widely used technique in natural language processing (NLP) for text representation. It simplifies the representation of text data by treating each document as a collection of words, disregarding grammar and word order but keeping track of the frequency of each word.

Given a document D with a vocabulary V, the Bag of Words representation can be expressed as:

\text{BoW}(D) = [f(w_1), f(w_2), ..., f(w_n)]

Where:

  • f(w_i) is the frequency of word w_i in document D.
  • n is the number of unique words in the vocabulary V.
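
As a quick illustration, here is a minimal sketch of building BoW vectors with scikit-learn's CountVectorizer (the sample sentences are made up for this example):

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (illustrative only)
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Learn the vocabulary and count word frequencies per document
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
# Each row is the BoW vector [f(w_1), ..., f(w_n)] of one document
print(bow_matrix.toarray())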

Advantages

  • Simplicity: The BoW model is easy to understand and implement.
  • Efficiency: It allows for quick computation and works well with many machine learning algorithms.
  • Interpretability: The resulting vectors are straightforward to interpret, making it easier to analyze feature importance.

Disadvantages

  • Loss of Context: BoW ignores word order and grammatical structure, which can lead to loss of contextual information.
  • High Dimensionality: The vocabulary can become very large, leading to high-dimensional feature vectors that may be sparse.
  • Inability to Capture Semantics: It does not account for synonyms or semantic relationships between words, which can limit its effectiveness in understanding meaning.

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document within a collection or corpus. It's widely used in information retrieval and text mining.

Components of TF-IDF

  1. Term Frequency (TF)
  2. Inverse Document Frequency (IDF)

Term Frequency (TF)

TF measures how frequently a term appears in a document. It's calculated as:

TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}

Inverse Document Frequency (IDF)

IDF measures how important a term is across the entire corpus. It's calculated as:

IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)

TF-IDF Calculation

The TF-IDF score is the product of TF and IDF:

\text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)

Characteristics of TF-IDF

  1. Importance Weighting: TF-IDF assigns higher weights to terms that are frequent in a particular document but rare across the corpus.

  2. Normalization: It normalizes term frequency by document length, preventing bias towards longer documents.

  3. Inverse Scaling: The IDF component scales down the weight of common terms across documents.

Here's a basic implementation using Python and the scikit-learn library:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print TF-IDF scores
for i, doc in enumerate(documents):
    print(f"Document {i+1}:")
    for j, word in enumerate(feature_names):
        score = tfidf_matrix[i, j]
        if score > 0:
            print(f"  {word}: {score:.4f}")
    print()

This code outputs the TF-IDF score of each word in each document, giving a clear view of which terms are considered most important in each document relative to the entire corpus. Note that scikit-learn's TfidfVectorizer applies IDF smoothing and L2 normalization by default, so the scores differ slightly from the raw formulas above.

TF-IDF remains a fundamental technique in text analysis and serves as a baseline for many more advanced NLP methods.

Dense Passage Retrieval (DPR)

Dense Passage Retrieval (DPR) is an efficient document retrieval method designed for open-domain question answering systems. Here are the key features and workings of DPR:

  1. Dense Vector Representation: DPR encodes questions and passages (documents) into fixed-size dense vectors. This approach contrasts with traditional sparse vector representations like TF-IDF or BM25.

  2. Dual Encoder Framework: DPR utilizes two BERT-based encoders:

    • Question encoder: Encodes questions into vectors
    • Passage encoder: Encodes passages into vectors
  3. Similarity Calculation: The similarity between encoded question and passage vectors is computed using the dot product.

  4. Training Method: The model is trained to maximize similarity for relevant question-passage pairs while minimizing it for irrelevant pairs.

  5. Efficient Retrieval: DPR uses tools like Facebook AI Similarity Search (FAISS) for fast retrieval of similar passages from large document collections.

  6. Performance: DPR has shown 9%-19% (absolute) higher top-20 retrieval accuracy than traditional BM25 systems and has achieved state-of-the-art performance on several open-domain QA benchmarks.

  7. Advantages:

    • Better capture of semantic similarity (e.g., synonyms, paraphrases)
    • Ability to learn complex relationships between questions and passages
    • Fast retrieval speed and good scalability
  8. Mathematical Formulation: The relevance score between a question q and a passage p is calculated as:

    sim(q, p) = E_Q(q)^T \cdot E_P(p)

    Where E_Q and E_P are the question and passage encoders respectively.

  9. Training Objective: DPR is trained using a contrastive loss function:

    L(q, p^+, p_1^-, \ldots, p_n^-) = -\log \frac{e^{sim(q, p^+)}}{e^{sim(q, p^+)} + \sum_{i=1}^{n} e^{sim(q, p_i^-)}}

    Where p^+ is a positive (relevant) passage and p_i^- are negative (irrelevant) passages.

  10. Implementation: DPR can be implemented using deep learning frameworks like PyTorch or TensorFlow, typically starting from pre-trained language models for the two encoders; pre-trained DPR checkpoints are also available, as in the sketch below.
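
For example, a minimal sketch using the pre-trained DPR encoders published on the Hugging Face Hub (the checkpoint names and sample texts here are assumptions for illustration; transformers and torch must be installed):

import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Load the dual encoders: one for questions, one for passages
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "What is dense passage retrieval?"
passages = [
    "DPR encodes questions and passages into dense vectors for retrieval.",
    "BM25 is a sparse ranking function based on term frequencies.",
]

with torch.no_grad():
    q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
    p_emb = ctx_encoder(**ctx_tokenizer(passages, return_tensors="pt", padding=True, truncation=True)).pooler_output

# Relevance score: dot product between question and passage embeddings
scores = q_emb @ p_emb.T
print(scores)  # higher score = more relevant passage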

DPR represents a significant advancement in information retrieval for question answering systems, enabling more accurate and efficient retrieval from large document collections. Its ability to capture semantic relationships between questions and documents has made it a crucial technology in modern NLP applications.

Sentence-BERT

Sentence-BERT (SBERT) generates fixed-size embeddings for sentences and short texts. It fine-tunes BERT with a Siamese network structure so that semantically similar sentences are mapped to nearby vectors.

Embedding generation process:

  1. Pass input sentence through BERT
  2. Apply mean pooling (or use the [CLS] token) over the output token embeddings
  3. Optionally pass the pooled vector through an additional dense layer to obtain the final embedding

Similarity calculation:

similarity = cosine(embedding_1, embedding_2)
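
A minimal sketch with the sentence-transformers library (the model name and sentences are illustrative assumptions):

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained Sentence-BERT model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Retrieval-augmented generation looks up documents before answering.",
    "RAG retrieves relevant passages and then generates a response.",
]

# Encode both sentences into dense embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
print(float(util.cos_sim(embeddings[0], embeddings[1])))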

Advantages:

  • Captures sentence-level semantics well

Disadvantages:

  • May miss document-wide context

BM25

BM25 is a ranking function that improves on TF-IDF by saturating term frequency and normalizing for document length. The basic formula is:

Score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}

Where:

  • f(q_i, D): Frequency of query term q_i in document D
  • |D|: Length of document D (in words)
  • avgdl: Average document length in the corpus
  • k_1 and b: Tunable parameters (commonly k_1 ≈ 1.2–2.0 and b ≈ 0.75)
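
A minimal sketch using the rank_bm25 package (the corpus and query are made up, and whitespace tokenization is a simplification):

from rank_bm25 import BM25Okapi

# Toy corpus, tokenized by whitespace for simplicity
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
]
tokenized_corpus = [doc.split() for doc in corpus]

# Build the BM25 index (library defaults for k1 and b)
bm25 = BM25Okapi(tokenized_corpus)

# Score every document against the query
query = "first document".split()
print(bm25.get_scores(query))  # one BM25 score per document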

Advantages:

  • Better performance than TF-IDF

Disadvantages:

  • Still limited in capturing semantic similarity

ColBERT

ColBERT uses "late interaction": it generates a separate embedding for each token of the query and the document, encodes both independently, and computes similarity only at scoring time through token-level interactions.

Mathematical representation:

S(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} E_q(q_i)^T \cdot E_d(d_j)

Where:

  • E_q and E_d are the query and document encoders respectively
  • q_i and d_j are tokens of the query and document
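
A minimal sketch of the MaxSim scoring in PyTorch, with random token embeddings standing in for the encoder outputs (dimensions and token counts are arbitrary):

import torch
import torch.nn.functional as F

# Stand-ins for E_q(q_i) and E_d(d_j): one embedding per token
query_emb = torch.randn(5, 128)   # 5 query tokens, 128-dim embeddings
doc_emb = torch.randn(40, 128)    # 40 document tokens

# ColBERT normalizes token embeddings so dot products act as cosine similarities
query_emb = F.normalize(query_emb, dim=-1)
doc_emb = F.normalize(doc_emb, dim=-1)

# Late interaction: |q| x |d| matrix of token-level similarities
token_sims = query_emb @ doc_emb.T

# MaxSim: each query token keeps its best-matching document token, then sum
score = token_sims.max(dim=1).values.sum()
print(float(score))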

Key features:

  1. Enables fine-grained matching through token-level interaction
  2. Uses MaxSim operation to find optimal token matching

Advantages:

  • High accuracy due to token-level matching

Disadvantages:

  • High storage and computational costs

SPLADE (Sparse Lexical and Expansion Model)

SPLADE learns sparse lexical representations with a BERT-based model, combining the efficiency of sparse retrieval with learned, contextual term weighting. It uses BERT's output to calculate an importance score for each vocabulary term.

Main components:

  1. BERT-based encoder
  2. Linear layer for word importance prediction
  3. ReLU activation for sparsity

Formula:

f(d) = ReLU(W \cdot BERT(d))

Where W is a learnable weight matrix.

Similarity calculation during retrieval:

sim(q, d) = f(q)^T \cdot f(d)
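
A minimal sketch of this idea in PyTorch, with random hidden states standing in for BERT outputs; the max pooling over token positions is an added assumption, since the formula above leaves the pooling step implicit:

import torch
import torch.nn as nn

hidden_dim, vocab_size = 768, 30522  # BERT-base sizes

# Stand-ins for BERT(d) and BERT(q): one contextual vector per input token
doc_hidden = torch.randn(12, hidden_dim)    # 12 document tokens
query_hidden = torch.randn(4, hidden_dim)   # 4 query tokens

# Linear layer predicting an importance score for every vocabulary term
W = nn.Linear(hidden_dim, vocab_size, bias=False)

def splade_repr(hidden_states):
    scores = torch.relu(W(hidden_states))   # ReLU zeroes out negative scores (training adds sparsity regularization)
    return scores.max(dim=0).values         # pool token scores into one vocabulary-sized vector

f_d = splade_repr(doc_hidden)
f_q = splade_repr(query_hidden)

# sim(q, d) = f(q)^T f(d)
print(float(torch.dot(f_q, f_d)))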

Advantages:

  • Leverages benefits of both sparse and dense representations

Disadvantages:

  • Complex implementation and tuning

Multi-Representation Indexing

This method uses multiple representations of a document (e.g., summary, full text, metadata).

Structure:

Document
├── Full Text Representation
├── Summary Representation
├── Metadata Representation
└── Additional Representations

Retrieval process:

  1. Perform separate searches for each representation
  2. Combine scores from each search result

Score combination example:

Score_{final} = w_1 \cdot Score_{full} + w_2 \cdot Score_{summary} + w_3 \cdot Score_{metadata}

Where w_1, w_2, and w_3 are the weights for each representation.
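
A minimal sketch of this weighted score fusion (the weights and per-representation scores are illustrative values, not recommendations):

# Weighted combination of retrieval scores from different representations
def combine_scores(scores: dict, weights: dict) -> float:
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)

weights = {"full_text": 0.5, "summary": 0.3, "metadata": 0.2}
doc_scores = {"full_text": 0.62, "summary": 0.80, "metadata": 0.10}

# 0.5*0.62 + 0.3*0.80 + 0.2*0.10 = 0.57
print(combine_scores(doc_scores, weights))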

Advantages:

  • Represents documents from various perspectives, improving retrieval accuracy
  • Can handle diverse types of queries

Disadvantages:

  • Increased index size and search complexity
  • May require weight adjustment for each representation

These document representation methods each have their own strengths and weaknesses. In practice, the choice of a method, or a combination of methods, depends on the task characteristics, data properties, and system requirements.

When implementing these methods, consider the following comparison table:

Method                 Semantic Understanding  Computational Cost  Storage Requirement  Implementation Complexity
DPR                    High                    Medium              Medium               Medium
Sentence-BERT          High                    Medium              Medium               Medium
TF-IDF                 Low                     Low                 Low                  Low
BM25                   Low                     Low                 Low                  Low
ColBERT                Very High               High                High                 High
SPLADE                 High                    Medium              Medium               High
Multi-Representation   High                    High                High                 High

This table provides a quick overview of the trade-offs between different methods, which can be helpful when deciding which approach to use for a specific RAG implementation.

Tags


TF/IDF ColBERT Sentence-BERT BoW DPR BM25 SPLADE