LayoutLM-Byne-v0.1 – beta launch!

Following our initial release of LayoutLM-Byne v0.1, we're excited to share our latest progress in document page retrieval and visual RAG! Our model sets a new state of the art (SOTA) on page retrieval for visually rich documents, enabling you to build intelligent systems for understanding pitch decks, company reports, and scientific papers.

Since that release, we've run multiple benchmarks and built a simple Python library that's ready for engineers to use!

You can find the library repo on GitHub.

Industry-Leading Performance Metrics

Our previous blog post evaluated the model using Hit Rate. We've now also computed NDCG (normalized discounted cumulative gain) to compare the model against ColPali:

  • NDCG@3: 0.8126
  • NDCG@5: 0.7394
  • NDCG@10: 0.6577

These results significantly outperform previous benchmarks, including recent approaches such as ColPali (Faysse et al., 2024). Our NDCG@5 score of 0.7394 confirms that we have set a new state of the art on the task.
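
For context, NDCG@k discounts each relevant page by the log of its rank and normalizes by the best possible ranking. Here is a minimal sketch, assuming binary page relevance; ndcg_at_k and its arguments are illustrative, not part of the released library:

import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    # DCG: log-discounted gain of each relevant page in the top k
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_ids[:k], start=1)
        if pid in relevant_ids
    )
    # Ideal DCG: all relevant pages packed at the top of the ranking
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

print(ndcg_at_k(['p7', 'p3', 'p9'], {'p3'}, k=3))  # gold page at rank 2 -> ~0.63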

Advanced RAG Pipeline Comparison

We've also compared end-to-end Retrieval-Augmented Generation (RAG) pipelines to evaluate our model in a realistic deployment scenario.

Classic RAG Pipeline

  • Embedding: Snowflake Arctic (Snowflake/snowflake-arctic-embed-m-v1.5)
  • QA Model: LLaMA 3.1 70B
  • Total Size: ~70B parameters

LayoutLM-Byne Pipeline

  • Embedding: LayoutLM-Byne v0.1
  • QA Model: MiniCPM-V-2_6 (openbmb/MiniCPM-V-2_6)
  • Total Size: ~8B parameters
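
In both pipelines, retrieval reduces to nearest-neighbour search over page embeddings. A minimal sketch of the scoring step, assuming you already have a query vector and per-page vectors from the embedding model (top_k_pages is an illustrative helper, not the library's API):

import numpy as np

def top_k_pages(query_emb, page_embs, k=5):
    # Cosine similarity between the query vector and every page vector
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    scores = p @ q
    # Indices of the k best-scoring pages, highest first
    return np.argsort(-scores)[:k]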

We've created a dataset of roughly 200 questions based on highly complex technical documents. The dataset is available here: https://huggingface.co/datasets/Boriscii/LayoutLM-Byne-v0.1. Llama 3.1 405B was used as an assessor to compare the QA models' responses against the dataset's ground truth.
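
A hedged sketch of that assessor step; the split, column names, prompt, and judge_llm wrapper are illustrative assumptions, not our exact evaluation harness:

from datasets import load_dataset

# The dataset linked above; split and field names are assumptions
ds = load_dataset('Boriscii/LayoutLM-Byne-v0.1', split='train')

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Ground-truth answer: {gold}\n"
    "Model answer: {pred}\n\n"
    "Does the model answer convey the same information as the ground truth? "
    "Reply with one word: CORRECT or INCORRECT."
)

def is_correct(question, gold, pred, judge_llm):
    # judge_llm: any text-completion callable wrapping Llama 3.1 405B
    verdict = judge_llm(JUDGE_PROMPT.format(question=question, gold=gold, pred=pred))
    return verdict.strip().upper().startswith('CORRECT')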

Results

  • Classic RAG:
    • Hit Rate@5: 0.37
    • Total Dataset Correct Answer Rate: 0.18
  • LayoutLM-Byne Pipeline:
    • Hit Rate@5: 0.45
    • Total Dataset Correct Answer Rate: 0.24

Despite being significantly smaller (8B vs 70B parameters), our pipeline outperforms the larger, text-only approach in both retrieval accuracy and answer correctness.

Key Points

MiniCPM-V-2_6 Integration

Our QA component uses the MiniCPM-V-2_6 model:

import torch
from transformers import AutoModel, AutoTokenizer

# Load MiniCPM-V-2_6 in bfloat16 with SDPA attention
llm = AutoModel.from_pretrained(
    'openbmb/MiniCPM-V-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
)
llm = llm.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

def get_cpm_answer(page, query):
    # page: a PIL image of the retrieved document page
    image = page.convert('RGB')
    prompt = f"Question: {query}\n\nAnswer:"
    # MiniCPM-V-2_6 takes the image inline in the message content
    msgs = [{'role': 'user', 'content': [image, prompt]}]

    res = llm.chat(
        image=None,
        msgs=msgs,
        tokenizer=tokenizer,
    )
    return res
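
After retrieval, you can feed the top-ranked page straight into this function. A minimal usage sketch, assuming pages are rendered with pdf2image; the file path and question are placeholders:

from pdf2image import convert_from_path

pages = convert_from_path('report.pdf', dpi=200)  # one PIL image per page
print(get_cpm_answer(pages[0], 'What was the year-over-year revenue growth?'))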

As before, we plan to publish a "tiny paper", so feedback is very welcome!

Reference

Faysse, M., et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449v2.