In our previous post, we introduced LayoutLM-Byne-v0.1, our model for page retrieval from visually rich documents. Today, we’re excited to continue this series by diving deeper into the comparison between Retrieval-Augmented Generation (RAG) and Multimodal Large Language Models (MLLMs) for parsing complex PDFs.
As we mentioned in our LayoutLM-Byne post, applying multimodal LLMs directly to documents is rapidly gaining popularity. However, retrieval has been the critical missing piece in the “transform a document into images -> find relevant pages -> feed into an MLLM” approach.
Building on this foundation, we wanted to explore how different approaches perform when tasked with extracting meaningful information from highly complex documents. We selected the ADXL345 accelerometer datasheet from Analog Devices for this experiment - a 36-page PDF packed with tables, graphs, and technical specifications.
We implemented a hybrid visual RAG system using open-source tools.
Here’s a breakdown of the process: convert the PDF page by page into images, build a BM25 retriever over the text extracted from the same file, and, for each question, pass the retrieved page’s image to a multimodal LLM for answering.
Let’s look at the key parts of the implementation:
import torch
from pdf2image import convert_from_path
from PIL import Image
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.retrievers import BM25Retriever
from transformers import AutoModel, AutoTokenizer

# Convert PDF pages to JPEG images (at 100 dpi)
pages = convert_from_path('adxl345.pdf', 100)
for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')

# Set up a BM25 retriever over the text extracted from the same PDF
loader = PyMuPDFLoader("adxl345.pdf")
data = loader.load()
retriever = BM25Retriever.from_documents(data)

# Load the multimodal LLM
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, torch_dtype=torch.float16)
model = model.to(device='cuda')
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)

# Answer a question from a single page image
def qa(uri, q):
    image = Image.open(uri).convert('RGB')
    msgs = [{'role': 'user', 'content': q}]
    res = model.chat(
        image=image,
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        temperature=0.7
    )
    return res
For this experiment, we used the open-source openbmb/MiniCPM-Llama3-V-2_5 model, which is capable of processing text and images.
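To tie these pieces together into the automated pipeline described above, a question is routed through BM25 first and then answered from the matching page image. The sketch below shows one way to do that wiring; looking the page index up in the loader’s metadata is our assumption about how retrieved text maps back to the saved images, and the sample question is illustrative.

# Wire retrieval and answering together: BM25 picks a page, the MLLM reads its image.
# Assumes the retriever, qa() and out{N}.jpg files created above.
def answer(question):
    docs = retriever.invoke(question)            # pages ranked by BM25
    page_num = docs[0].metadata.get("page", 0)   # PyMuPDFLoader keeps a 0-based page index
    return qa(f"out{page_num}.jpg", question)

print(answer("What is the measurement range of the ADXL345?"))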
The initial results were somewhat disappointing: with automated BM25 retrieval, the model answered only 28% of the questions correctly.
However, when we manually selected the correct pages for each question (simulating perfect retrieval), accuracy jumped to 90%.
This significant improvement aligns with our findings from the LayoutLM-Byne experiments. Retrieval is indeed the key to understanding documents with MLLMs.
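Simulating perfect retrieval only requires bypassing BM25 and handing the model the page that is known to contain the answer. A minimal sketch, with placeholder questions and page numbers rather than the real locations in the datasheet:

# Perfect-retrieval simulation: hand-picked page per question
# (questions and page numbers here are placeholders, not the actual datasheet locations).
gold_pages = {
    "What is the measurement range of the ADXL345?": 3,
    "What is the supply voltage range?": 3,
}
for question, page_num in gold_pages.items():
    print(question, "->", qa(f"out{page_num}.jpg", question))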
For our second experiment, we implemented a more traditional RAG system using the Byne platform.
For the generator, we tested two open-source models from the Llama 3 family, differing mainly in size. The key components and configuration of the retrieval pipeline were as follows:
# RAG Configuration
embeddings = "text-embedding-3-small" # OpenAI embedding model
chunk_size = 600
chunk_overlap = 200
hybrid_search = True # 50/50 mix of dense and sparse retrieval
# Model parameters
max_tokens = 8000 # Context window limited to 8k tokens for fair comparison
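The Byne platform wires this configuration up internally, but a comparable setup can be sketched with open-source components. The snippet below is a rough approximation, assuming LangChain’s EnsembleRetriever for the 50/50 dense/sparse mix, FAISS over text-embedding-3-small for the dense side, and BM25 for the sparse side; it is not the platform’s actual implementation.

# Approximate the RAG configuration above with open-source components.
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import EnsembleRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = PyMuPDFLoader("adxl345.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=600, chunk_overlap=200
).split_documents(docs)

dense = FAISS.from_documents(
    chunks, OpenAIEmbeddings(model="text-embedding-3-small")
).as_retriever()
sparse = BM25Retriever.from_documents(chunks)

# 50/50 mix of dense and sparse retrieval
hybrid_retriever = EnsembleRetriever(retrievers=[dense, sparse], weights=[0.5, 0.5])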
The RAG approach, at 88-96% accuracy, significantly outperformed the basic multimodal method, although, relative to model size, it still underperformed the MLLM paired with simulated perfect retrieval.
To ensure this wasn’t just down to the strength of the Llama models, we also tested GPT-4o with the same RAG setup; it landed in the same range, marginally behind the open-source models.
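For illustration, here is roughly how the retrieved chunks can be fed to GPT-4o through the OpenAI chat API; the prompt wording and the number of chunks are our choices, not the exact configuration used on the platform.

# Answer a question with GPT-4o over the retrieved chunks (illustrative prompt).
from openai import OpenAI

client = OpenAI()

def rag_answer(question, k=4):
    chunks = hybrid_retriever.invoke(question)[:k]
    context = "\n\n".join(c.page_content for c in chunks)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content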
RAG and MLLMs are roughly on par once retrieval is taken out of the equation: properly configured RAG systems consistently achieved higher accuracy (88-96%) than the basic multimodal approach (28% with automated retrieval, 90% with perfect retrieval), yet relative to its size, the MLLM shows more potential when perfect retrieval is assumed.
Retrieval is critical: The dramatic improvement in MLLM performance with manual page selection (from 28% to 90%) validates our focus on improving retrieval with LayoutLM-Byne.
Model size matters, but less than you might think: The improvement from using larger models was unexpectedly marginal.
Open-source solutions are competitive: The open-source Llama3 models performed on par with (and even slightly better than) GPT-4o in this task.
As we continue to refine our LayoutLM-Byne model and explore its applications, we see great potential for combining the strengths of advanced retrieval systems with multimodal LLMs. Here are some directions we’re excited about:
Integrating LayoutLM-Byne into MLLM pipelines: By using our state-of-the-art retrieval model to select relevant pages, we could potentially boost the performance of multimodal LLMs to match or exceed that of traditional RAG systems (a rough sketch follows this list).
Expanding context awareness: As mentioned in our LayoutLM-Byne post, we plan to add support for adjacent page awareness. This could be particularly beneficial for complex technical documents like the ADXL345 datasheet, where information often spans multiple pages.
Scaling up: While our current LayoutLM-Byne model has a context size of 512 tokens, we’re working on increasing this to handle longer documents more effectively.
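As a very rough sketch of the first direction, imagine a page encoder in the spirit of LayoutLM-Byne exposing embed_query and embed_pages functions (hypothetical names; we are not pinning down the model’s API here): pages are ranked by cosine similarity to the question, and the best match is handed to the MLLM exactly as before.

# Hypothetical sketch: embedding-based page retrieval feeding the MLLM.
# embed_query / embed_pages are placeholders for a LayoutLM-Byne-style encoder.
import numpy as np

def retrieve_page(question, page_embeddings):
    q = embed_query(question)                                  # query embedding, shape (d,)
    sims = page_embeddings @ q / (
        np.linalg.norm(page_embeddings, axis=1) * np.linalg.norm(q)
    )
    return int(np.argmax(sims))                                # index of the best-matching page

# page_embeddings = embed_pages([f"out{i}.jpg" for i in range(len(pages))])
# best = retrieve_page("What is the output data rate range?", page_embeddings)
# print(qa(f"out{best}.jpg", "What is the output data rate range?"))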
To provide more context on how the models were evaluated, here is the complete list of 50 questions used in our experiments. These questions cover a wide range of technical specifications and operational details from the ADXL345 datasheet:
These questions were designed to test the models’ ability to extract precise technical information from various parts of the datasheet, including tables, specifications, and detailed descriptions. The diversity and specificity of these questions highlight the challenges faced by both image-based and text-based approaches in processing complex technical documents.