TL;DR: (1) We present LayoutLM-Byne-v0.1, a new approach for retrieving pages from visually rich documents that sets a new SOTA on page retrieval. (2) Since the initial release, we've benchmarked it against ColPali and shipped a simple Python library. (3) In an end-to-end RAG comparison, our ~8B-parameter pipeline beats a classic ~70B text-only pipeline on both retrieval accuracy and answer correctness.
LayoutLM-Byne-v0.1 – beta launch!
Following our initial release of LayoutLM-Byne v0.1, we're excited to share our latest progress in document page retrieval and visual RAG! Our model sets the new SOTA on page retrieval for visually rich documents, enabling you to build intelligent systems for understanding pitch decks, company reports or scientific papers.
Since then, we've run additional benchmarks and packaged the model into a simple Python library ready to be used by engineers!
You can find the library repo here: GitHub.
Industry-Leading Performance Metrics
Our previous blog post evaluated the model using Hit Rate. We've now also computed NDCG (Normalized Discounted Cumulative Gain) to compare the model against ColPali:
- NDCG@3: 0.8126
- NDCG@5: 0.7394
- NDCG@10: 0.6577
These results significantly outperform previously reported approaches, including the recent ColPali (Faysse et al., 2024); our NDCG@5 score of 0.7394 sets a new state of the art on the task.
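For reference, NDCG@k can be computed per query as below, assuming binary relevance (1 for the gold page, 0 for everything else) and averaging over queries; this is a generic sketch, not our exact evaluation script.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    # `relevances` lists the relevance of retrieved pages in ranked order,
    # e.g. [0, 1, 0, 0, 0] if the gold page was retrieved at rank 2.
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    # Ideal DCG: the same relevances sorted into the best possible order.
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[:ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([0, 1, 0, 0, 0], k=5))  # ~0.63: gold page found at rank 2
```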
Advanced RAG Pipeline Comparison
We've also compared two end-to-end Retrieval-Augmented Generation (RAG) pipelines to evaluate our model in a realistic deployment scenario; a sketch of the baseline retrieval step is shown after the component lists below.
Classic RAG Pipeline
- Embedding: Snowflake Arctic (Snowflake/snowflake-arctic-embed-m-v1.5)
- QA Model: LLaMA 3.1 70B
- Total Size: ~70B parameters
LayoutLM-Byne Pipeline
- Embedding: LayoutLM-Byne v0.1
- QA Model: MiniCPM-V-2_6 (openbmb/MiniCPM-V-2_6)
- Total Size: ~8B parameters
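As a reference point, the retrieval step of the classic baseline can be sketched as follows, assuming each page has already been converted to plain text; the helper below is illustrative, not the exact benchmark code.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Text-only baseline: embed page texts and the query, rank pages by cosine similarity.
embedder = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")

def retrieve_top_k(query, page_texts, k=5):
    # Arctic-embed models expect a dedicated query prompt on the query side.
    query_emb = embedder.encode([query], prompt_name="query", normalize_embeddings=True)
    page_embs = embedder.encode(page_texts, normalize_embeddings=True)
    scores = (page_embs @ query_emb.T).squeeze(-1)  # cosine similarity (normalized vectors)
    return np.argsort(-scores)[:k]
```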
We've created a dataset of c. 200 questions based on highly complex, technical documents. The dataset is available here: https://huggingface.co/datasets/Boriscii/LayoutLM-Byne-v0.1. Llama 3.1 405B was used as an assessor to compare QA model responses against the dataset ground truth.
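The assessor call can be sketched as follows, assuming an OpenAI-compatible endpoint serving Llama 3.1 405B; the endpoint, model identifier, prompt wording, and YES/NO parsing below are our own illustration rather than the exact benchmark setup.

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint hosting Llama 3.1 405B.
judge = OpenAI(base_url="https://your-inference-endpoint/v1", api_key="...")

def is_answer_correct(question, ground_truth, candidate):
    prompt = (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate answer convey the same facts as the reference? Reply YES or NO."
    )
    resp = judge.chat.completions.create(
        model="meta-llama/Llama-3.1-405B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```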
Results
- Classic RAG:
  - Hit Rate @5: 0.37
  - Total Dataset Correct Answer Rate: 0.18
- LayoutLM-Byne Pipeline:
  - Hit Rate @5: 0.45
  - Total Dataset Correct Answer Rate: 0.24
Despite being significantly smaller (8B vs 70B parameters), our pipeline outperforms the larger, text-only approach in both retrieval accuracy and answer correctness.
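For completeness, the two reported metrics boil down to simple per-question averages; the function names below are ours.

```python
def hit_rate_at_k(retrieved, gold, k=5):
    # Fraction of questions whose gold page appears among the top-k retrieved pages.
    return sum(g in r[:k] for r, g in zip(retrieved, gold)) / len(gold)

def correct_answer_rate(judgements):
    # Fraction of questions the assessor marked as answered correctly.
    return sum(judgements) / len(judgements)
```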
Key Points
MiniCPM-V-2_6 Integration
Our QA component utilizes the MiniCPM-V-2_6 model:
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load MiniCPM-V-2_6 as the vision-language QA model.
llm = AutoModel.from_pretrained(
    'openbmb/MiniCPM-V-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
)
llm = llm.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)


def get_cpm_answer(page, query):
    # `page` is a PIL image of the retrieved document page.
    image = page.convert('RGB')
    prompt = f"Question: {query}\n\nAnswer:"
    # MiniCPM-V takes the image directly inside the message content.
    msgs = [{'role': 'user', 'content': [image, prompt]}]
    res = llm.chat(
        image=None,
        msgs=msgs,
        tokenizer=tokenizer,
    )
    return res
```
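A quick usage sketch, assuming the retrieved page has been rendered to a PIL image via pdf2image (the file path, page index, and question are placeholders):

```python
from pdf2image import convert_from_path

# Hypothetical example: render a report and ask a question about page 3.
pages = convert_from_path("report.pdf", dpi=200)
print(get_cpm_answer(pages[2], "What was the year-over-year revenue growth?"))
```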
As before, we plan to publish a "tiny paper", so feedback is very welcome!
Reference
Faysse, M., et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449v2.