
RAG vs MLLM Performance Comparison: A Continuation of Our Document Understanding Series

In our previous post, we introduced LayoutLM-Byne-v0.1, our model for page retrieval from visually rich documents. Today, we’re excited to continue this series by diving deeper into the comparison between Retrieval-Augmented Generation (RAG) and Multimodal Large Language Models (MLLMs) for parsing complex PDFs.

The Challenge Revisited

As we mentioned in our LayoutLM-Byne post, applying multimodal LLMs directly to documents is rapidly gaining popularity. However, retrieval has been the critical missing piece in the “transform a document into images -> find relevant pages -> feed into an MLLM” approach.

Building on this foundation, we wanted to explore how different approaches perform when tasked with extracting meaningful information from highly complex documents. We selected the ADXL345 accelerometer datasheet from Analog Devices for this experiment - a 36-page PDF packed with tables, graphs, and technical specifications.

Approach 1: Multimodal LLM

We implemented a hybrid visual RAG system using open-source tools.

Here’s a breakdown of the process:

  1. Convert the PDF to images using pdf2image
  2. Use BM25 retrieval to rank and select the most relevant page for each query
  3. Apply a multimodal LLM to answer the question based on the selected image

Let’s look at some key parts of the implementation:

import torch
from PIL import Image
from pdf2image import convert_from_path
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.retrievers import BM25Retriever
from transformers import AutoModel, AutoTokenizer

# Convert each PDF page to a JPEG image
pages = convert_from_path('adxl345.pdf', dpi=100)
for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')

# Set up a BM25 retriever over the text extracted by PyMuPDF
loader = PyMuPDFLoader("adxl345.pdf")
data = loader.load()  # one Document per page, with the page number in metadata
retriever = BM25Retriever.from_documents(data)

# Load multimodal LLM
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, torch_dtype=torch.float16)
model = model.to(device='cuda')
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)

# Question-answering function
def qa(uri, q):
    image = Image.open(uri).convert('RGB')
    msgs = [{'role': 'user', 'content': q}]
    res = model.chat(
        image=image,
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        temperature=0.7
    )
    return res

For this experiment, we used the open-source openbmb/MiniCPM-Llama3-V-2_5 model, which is capable of processing text and images.
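
The glue between steps 2 and 3 is not shown above. Here is a minimal sketch of how we wired them together, assuming PyMuPDFLoader's 0-based page metadata lines up with the out{N}.jpg filenames produced by the conversion loop (retrieve_and_answer is our own helper name, not a library function):

# Pick the top BM25-ranked page for a question, map it to the rendered
# image, and ask the multimodal LLM.
def retrieve_and_answer(question):
    docs = retriever.invoke(question)      # pages ranked by BM25 relevance
    page_idx = docs[0].metadata["page"]    # PyMuPDFLoader stores 0-based page numbers
    return qa(f'out{page_idx}.jpg', question)

print(retrieve_and_answer("What is the measurement range of the ADXL345?"))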

Results

The initial results were somewhat disappointing:

  • Accuracy: 14/50 (28%)

However, when we manually selected the correct pages for each question (simulating perfect retrieval):

  • Accuracy: 45/50 (90%)

This significant improvement aligns with our findings from the LayoutLM-Byne experiments. Retrieval is indeed the key to understanding documents with MLLMs.

Approach 2: RAG with Text-Based LLMs

For our second experiment, we implemented a more traditional RAG system using the Byne platform.

Key components:

  • PDF parsing with an experimental “pdf_deepdive” flag
  • Hybrid search combining dense and sparse retrievals
  • Text-based LLMs for question-answering

We tested two models:

  1. Llama3-8B-Instruct
  2. Llama3-70B-Instruct

Configuration details:

# RAG Configuration
embeddings = "text-embedding-3-small"  # OpenAI embedding model
chunk_size = 600
chunk_overlap = 200
hybrid_search = True  # 50/50 mix of dense and sparse retrieval

# Model parameters
max_tokens = 8000  # Context window limited to 8k tokens for fair comparison
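
The Byne platform handles chunking and hybrid search internally, so the pipeline isn't fully reproducible from the configuration alone. As a rough open-source approximation of the same settings (and without the experimental pdf_deepdive parsing), a LangChain sketch might look like this:

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

docs = PyMuPDFLoader("adxl345.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=200).split_documents(docs)

# Dense retrieval over OpenAI embeddings, sparse retrieval via BM25
dense = FAISS.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small")).as_retriever()
sparse = BM25Retriever.from_documents(chunks)

# 50/50 mix of dense and sparse retrieval
hybrid = EnsembleRetriever(retrievers=[dense, sparse], weights=[0.5, 0.5])
context = hybrid.invoke("What is the measurement range of the ADXL345?")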

Results


The RAG approach significantly outperformed the end-to-end multimodal method, although, when compared on a model-size basis, it still trailed the MLLM given simulated perfect retrieval:

  • Llama3-8B-Instruct: 44/50 (88% accuracy)
  • Llama3-70B-Instruct: 48/50 (96% accuracy)

To ensure this wasn’t just due to the strength of the Llama models, we also tested GPT-4o with the same RAG setup:

  • GPT-4o: 46/50 (92% accuracy)

Analysis and Insights

  1. RAG and MLLMs are roughly on par once retrieval is taken out of the equation: Properly configured RAG systems consistently scored 88-96%, while the multimodal approach scored 28% with automated retrieval but 90% with perfect retrieval. Compared at similar model sizes and with retrieval factored out, the MLLM arguably shows the greater potential.

  2. Retrieval is critical: The dramatic improvement in MLLM performance with manual page selection (from 28% to 90%) validates our focus on improving retrieval with LayoutLM-Byne.

  3. Model size matters, but less than you might expect: Moving from Llama3-8B to Llama3-70B only raised accuracy from 88% (44/50) to 96% (48/50).

  4. Open-source solutions are competitive: The open-source Llama3 models performed on par with (and even slightly better than) GPT-4o in this task.

Looking Ahead

As we continue to refine our LayoutLM-Byne model and explore its applications, we see great potential for combining the strengths of advanced retrieval systems with multimodal LLMs. Here are some directions we’re excited about:

  1. Integrating LayoutLM-Byne into MLLM pipelines: By using our state-of-the-art retrieval model to select relevant pages, we could potentially boost the performance of multimodal LLMs to match or exceed that of traditional RAG systems.

  2. Expanding context awareness: As mentioned in our LayoutLM-Byne post, we plan to add support for adjacent page awareness. This could be particularly beneficial for complex technical documents like the ADXL345 datasheet, where information often spans multiple pages.

  3. Scaling up: While our current LayoutLM-Byne model has a context size of 512 tokens, we’re working on increasing this to handle longer documents more effectively.

Appendix: Evaluation Questions

To provide more context on how the models were evaluated, here is the complete list of 50 questions used in our experiments. These questions cover a wide range of technical specifications and operational details from the ADXL345 datasheet:

  1. What is the measurement range of the ADXL345?
  2. What is the nonlinearity percentage of the ADXL345?
  3. What is the inter-axis alignment error of the ADXL345?
  4. What is the cross-axis sensitivity of the ADXL345?
  5. What is the maximum output resolution of the ADXL345 in full resolution mode?
  6. What is the typical sensitivity at X, Y, Z outputs for all g-ranges in full resolution mode?
  7. What is the maximum sensitivity deviation from ideal for all g-ranges?
  8. What is the typical scale factor at X, Y, Z outputs for all g-ranges in full resolution mode?
  9. What is the typical 0g output for X and Y axes?
  10. What is the typical 0g output for the Z axis?
  11. What is the maximum 0g output deviation from ideal for X and Y axes?
  12. What is the maximum 0g output deviation from ideal for the Z axis?
  13. What is the typical 0g offset vs. temperature for X and Y axes?
  14. What is the typical 0g offset vs. temperature for the Z axis?
  15. What is the typical noise for X and Y axes at 100 Hz output data rate?
  16. What is the typical noise for Z axis at 100 Hz output data rate?
  17. What is the maximum output data rate of the ADXL345?
  18. What is the minimum output data rate of the ADXL345?
  19. What is the typical self-test output change for the X axis?
  20. What is the typical self-test output change for the Y axis?
  21. What is the typical self-test output change for the Z axis?
  22. What is the operating voltage range of the ADXL345?
  23. What is the interface voltage range of the ADXL345?
  24. What is the typical supply current at output data rate ≥ 100 Hz?
  25. What is the typical supply current at output data rate < 10 Hz?
  26. What is the typical standby mode leakage current?
  27. What is the turn-on and wake-up time at 3200 Hz output data rate?
  28. What is the operating temperature range of the ADXL345?
  29. What is the device weight of the ADXL345?
  30. What is the maximum SPI clock speed?
  31. What is the minimum SPI CS deassertion time between communications?
  32. What is the maximum I2C clock frequency?
  33. What is the minimum I2C SCL cycle time?
  34. What is the minimum I2C data setup time?
  35. What is the maximum capacitive load for each I2C bus line?
  36. What is the typical rise time of the interrupt pins?
  37. What is the typical fall time of the interrupt pins?
  38. What is the device ID of the ADXL345?
  39. What is the scale factor of the THRESH_TAP register?
  40. What is the scale factor of the offset registers (OFSX, OFSY, OFSZ)?
  41. What is the scale factor of the DUR register?
  42. What is the scale factor of the latent register?
  43. What is the scale factor of the window register?
  44. What is the scale factor of the THRESH_ACT and THRESH_INACT registers?
  45. What is the scale factor of the TIME_INACT register?
  46. What is the scale factor of the THRESH_FF register?
  47. What is the scale factor of the TIME_FF register?
  48. How many bits are in the FIFO buffer?
  49. What is the maximum shock survival rating?
  50. What is the size of the LGA package?

These questions were designed to test the models’ ability to extract precise technical information from various parts of the datasheet, including tables, specifications, and detailed descriptions. The diversity and specificity of these questions highlight the challenges faced by both image-based and text-based approaches in processing complex technical documents.
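
The scoring procedure is a simple count of correct answers out of 50. A minimal harness for collecting answers for grading against the datasheet, reusing the retrieve_and_answer helper sketched in Approach 1 and assuming a questions list holding the 50 strings above, might look like this:

# Hypothetical evaluation loop: run every question through the pipeline and
# print the answers for grading against the datasheet.
questions = [
    "What is the measurement range of the ADXL345?",
    "What is the nonlinearity percentage of the ADXL345?",
    # ... the remaining 48 questions from the list above ...
]

for q in questions:
    print(f"Q: {q}\nA: {retrieve_and_answer(q)}\n")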