(1) We present a new approach, LayoutLM-Byne, for retrieving pages from visually rich documents. The demonstrator outperforms SOTA by 10-20% across a range of top-K hit rate metrics, and we believe the ceiling for this method is considerably higher.
(2) The model is based on LayoutLM and fine-tuned on DocVQA using a three-stage procedure.
(3) Model card. Colab. A pre-print will be available on arXiv soon. We also plan to submit to ICLR'25 as part of the Tiny Papers series, so feedback and ideas are very welcome!
(4) The model pipeline is not yet available via Byne API. Please get in touch with us directly if you'd like help setting up a fully managed RAG pipeline based on the model.
(5) We've done our best to review the literature and identify if this approach has been used before. Please get in touch with us at borys.nadykto@bynesoft.com if we've missed any papers or releases implementing the same method.
Applying multimodal LLMs directly to documents is rapidly growing in popularity. As MLLMs (particularly open-source ones) progress, they demonstrate a higher performance ceiling for understanding documents than the older plain RAG approach: "parse -> chunk -> retrieve -> feed into an LLM".
It's easy to see how the older approach has a relatively low performance ceiling – we lose information about element positioning and many visual elements. Using advanced parsing into Markdown helps somewhat. However, not every component and chart from a PDF can be transformed into Markdown effectively, disproportionately penalizing visually rich materials like presentations or hardware documentation datasheets.
This approach – "transform a document into images -> find relevant pages -> feed into an MLLM" – still has one important missing piece: retrieval.
Most engineers we've encountered use regular hybrid search to retrieve relevant pages. It's not hard to see why this is suboptimal – here's how a regular PDF form looks after being parsed with Tesseract OCR: "GREAT WESTERN UNITED CORPORATION INFORMATION SHEET Name,.... .Geoge Es, Wilber, eee... Birthdate. 2/6/33 rrr Company Name & Title. President, The Great Western 1 Sugar Compan eececceveee Coes ecerereeeeee Office Address.. 1530". This "word mess" occurs because many elements on the page critically rely on their positioning for context. If we treat the page as a continuous string, we make the job unnecessarily hard for a regular embedding model.
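For illustration, this is roughly how pages end up as flat strings in a typical pipeline before being embedded (a minimal sketch; the file name is hypothetical):

```python
# Minimal sketch: OCR a scanned page into one flat string, discarding layout.
# "form_page.png" is a hypothetical input file.
from PIL import Image
import pytesseract

page = Image.open("form_page.png")
flat_text = pytesseract.image_to_string(page)  # word positions are lost here
print(flat_text[:300])
```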
Here are the retrieval stats we obtained by applying some of the best-performing mid-sized models to a 10% subset of the DocVQA dataset [1]; HR@K denotes hit-rate with top-K retrieval:
| Model | HR@3 | HR@5 | HR@10 |
|---|---|---|---|
| all-mpnet-base-v2 | 0.2500 | 0.2900 | 0.3600 |
| gte-base-en-v1.5 | 0.3454 | 0.3899 | 0.4554 |
| snowflake-arctic-embed-m-v1.5 | 0.3548 | 0.4042 | 0.4573 |
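The numbers above come from a straightforward dense-retrieval evaluation. Here's a minimal sketch of how such an HR@K baseline can be computed with sentence-transformers; the tiny corpus below is purely illustrative, and the real evaluation runs over the DocVQA subset:

```python
# Sketch of the HR@K baseline: embed questions and OCR'd page texts,
# rank pages by cosine similarity, and check whether the gold page is in the top K.
from sentence_transformers import SentenceTransformer, util

questions = ["What is the invoice total?"]                       # illustrative
pages = ["INVOICE ... TOTAL 1,024.00", "Company overview ..."]   # illustrative
gold = [0]                                                       # index of the relevant page

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
q_emb = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
p_emb = model.encode(pages, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(q_emb, p_emb)                              # (num_questions, num_pages)

def hit_rate_at_k(scores, gold, k):
    topk = scores.topk(min(k, scores.shape[1]), dim=1).indices
    return sum(gold[i] in topk[i] for i in range(len(gold))) / len(gold)

for k in (3, 5, 10):
    print(f"HR@{k}: {hit_rate_at_k(scores, gold, k):.4f}")
```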
This is already reasonably good, but do we have a shot at improving upon SOTA?
Our idea is to ingest and embed bounding boxes alongside each token. We embed all four coordinates, width, and height of a bounding box in the same manner as we'd embed a regular token and sum them with the token embeddings, just as you'd add positional encoding to the token embeddings. The rest of the model is a regular BERT with a CLS pooler on top. Does this sound familiar? It's effectively LayoutLM, the first-gen version before it had all the fancy CNN encoders! [2]
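Here is a rough sketch of that embedding layer in PyTorch. It is not our exact implementation; the layer sizes, the 0-1000 coordinate scale, and the width/height handling are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LayoutEmbeddings(nn.Module):
    """Token embeddings summed with LayoutLM-style bounding-box embeddings (sketch)."""
    def __init__(self, vocab_size=30522, hidden=768, max_coord=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        # one lookup table per bounding-box feature: x-coords, y-coords, width, height
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)
        self.w_emb = nn.Embedding(max_coord, hidden)
        self.h_emb = nn.Embedding(max_coord, hidden)

    def forward(self, input_ids, bboxes):
        # bboxes: (batch, seq_len, 4) integer coords (x0, y0, x1, y1) normalised to [0, 1000]
        x0, y0, x1, y1 = bboxes.unbind(dim=-1)
        box = (self.x_emb(x0) + self.y_emb(y0) + self.x_emb(x1) + self.y_emb(y1)
               + self.w_emb((x1 - x0).clamp(min=0)) + self.h_emb((y1 - y0).clamp(min=0)))
        # summed with the token embeddings, just like a positional encoding would be
        return self.word_emb(input_ids) + box
```

In practice, one can start from the pretrained microsoft/layoutlm-base-uncased checkpoint, which already contains these embedding tables and a CLS pooler.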
We fine-tune the model on the DocVQA dataset, using the question as the anchor and the corresponding page as the positive example.
DocVQA is particularly suitable for this task because (1) many questions are nearly impossible to map to a specific page (e.g., "What is the document title?" could refer to almost any document), which emulates the "chaotic" user behaviour seen in RAG-based pipelines and makes the model more resilient to it, and (2) most documents are fairly visually rich – forms, graphs, and ad posters are frequent occurrences in DocVQA.
We face two challenges when tuning the model: (1) forgetfulness, a common problem when fine-tuning Transformers, and (2) the base model is not well-suited to embedding queries and fails to converge if trained to embed both the query and the page from scratch. We implement a three-stage fine-tuning procedure to address both issues.
In the first stage, the question is embedded with all-mpnet-base-v2 and the page with LayoutLM-Byne. Using an older embedding model here may seem a questionable choice; we will come back to it later. Both the embedding layer and the encoder are frozen. An aggressive AdamW learning rate of 1e-4 ensures faster convergence and a more ambitious exploration of the problem surface. We tune for 10 epochs with a batch size of 32, using this [3] implementation of the InfoNCE loss with a temperature of 0.05; a sketch of this stage follows the results below. At this stage, we achieve the following result on a 10% validation set:
HR@3: 0.2704
HR@5: 0.3406
HR@10: 0.4298
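A condensed sketch of this first stage is below. The data loader and checkpoint names are assumptions; the InfoNCE loss is the implementation from [3] (the info-nce-pytorch package):

```python
import torch
from info_nce import InfoNCE                      # implementation from [3]
from sentence_transformers import SentenceTransformer
from transformers import LayoutLMModel

query_encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
page_encoder = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

# Stage 1: freeze the embedding layer and the encoder of the page model;
# only the remaining (pooler) parameters are updated.
for module in (page_encoder.embeddings, page_encoder.encoder):
    for p in module.parameters():
        p.requires_grad = False

loss_fn = InfoNCE(temperature=0.05)
optimizer = torch.optim.AdamW(
    (p for p in page_encoder.parameters() if p.requires_grad), lr=1e-4
)

# train_loader is assumed to yield batches of 32 (question strings, token ids,
# bounding boxes, attention masks) built from DocVQA.
for epoch in range(10):
    for questions, input_ids, bboxes, attention_mask in train_loader:
        with torch.no_grad():                      # the query encoder stays frozen
            q = query_encoder.encode(questions, convert_to_tensor=True)
        out = page_encoder(input_ids=input_ids, bbox=bboxes, attention_mask=attention_mask)
        loss = loss_fn(q, out.pooler_output)       # in-batch negatives
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```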
In the second stage, we unfreeze all layers and continue tuning for a further 2 epochs with a smaller LR of 1e-5 (sketched below), allowing small further adjustments of the embedding and encoder layers. We are cautious here to make sure we don't ruin the model, so we don't train to convergence, achieving the following validation results:
HR@3: 0.3311
HR@5: 0.4061
HR@10: 0.4896
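In code, this second stage is little more than unfreezing everything and lowering the learning rate (continuing the sketch above):

```python
# Stage 2 (sketch): unfreeze all layers and fine-tune gently,
# without training to convergence.
for p in page_encoder.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(page_encoder.parameters(), lr=1e-5)
# ... run the same InfoNCE training loop as above for 2 more epochs ...
```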
Finally, in the third stage, we let the model go wild: all layers are unfrozen, and both the question and the pages are embedded with LayoutLM-Byne. We use dummy bounding boxes to embed the query.
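A small sketch of what that looks like. The exact placeholder box is an implementation detail, and the all-zeros box below is an assumption; the snippet uses the base checkpoint, but the fine-tuned weights load the same way:

```python
import torch
from transformers import LayoutLMModel, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

def embed_query(text: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # the query has no layout, so every token gets the same dummy bounding box
    dummy_bbox = torch.zeros(1, enc["input_ids"].shape[1], 4, dtype=torch.long)
    out = model(input_ids=enc["input_ids"], bbox=dummy_bbox,
                attention_mask=enc["attention_mask"])
    return out.pooler_output.squeeze(0)            # CLS-pooled query embedding

query_vec = embed_query("How much commission does AirBnB take?")
```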
We tune for 5 epochs with an LR of 1e-5, achieving the following results:
HR@3: 0.3491
HR@5: 0.4269
HR@10: 0.5436
The model still appears somewhat under-fitted. We will address this later, once we have produced an independent validation set. At this point, we are slightly worried about training it too much and degrading out-of-sample generalisation: DocVQA, while diverse, is still a fairly limited selection of examples.
Now, let's finally compare the results against three popular off-the-shelf embedding models on our validation set:
| Model | HR@3 | HR@5 | HR@10 |
|---|---|---|---|
| all-mpnet-base-v2 | 0.2500 | 0.2900 | 0.3600 |
| gte-base-en-v1.5 | 0.3454 | 0.3899 | 0.4554 |
| snowflake-arctic-embed-m-v1.5 | 0.3548 | 0.4042 | 0.4573 |
| LayoutLM-Byne-v0.1 | 0.3491 | 0.4269 | 0.5436 |
| Improvement over best competitor | -1.61% | +5.62% | +18.87% |
Nice! Our model shows much higher marginal growth in recall (i.e., how much the hit rate improves as you retrieve more items). This supports our hypothesis: text-only embedders miss important layout features, and they remain competitive on this problem only because of far more refined training (as suggested by their more concentrated retrieval of relevant items at low K).
An added benefit is that our three-stage fine-tuning routine keeps the model compatible with the original embedder used for anchoring. On our evaluation set, the model scored as follows when all-mpnet-base-v2 was used to embed queries and our model was used to embed pages:
HR@3: 0.2761
HR@5: 0.3558
HR@10: 0.4554
Backward compatibility is a nice bonus, and one more reason to come back and upgrade the anchor model to something more popular for the v0.2 release.
Let's also evaluate an out-of-set example: the old AirBnB pitch deck [4]. We wrote four questions that an investor is likely to ask about a startup when doing due diligence:
1) "How big is the market?",
2) "How much commission does AirBnB take?"
3) "What are the advantages of AirBnB?"
4) "What is the founders' background?".
The model retrieved the correct page as the top hit in 3 out of 4 cases; in the remaining case, the relevant page was ranked second.
While the model sets a new SOTA on the problem, it still needs improvement. We plan to:
(1) Add adjacent-page awareness – i.e., handle cases where critical context sits on a neighbouring page, such as multi-page tables whose rows are meaningless when the header is on a different page.
(2) Produce an alternative validation set that won't overlap with DocVQA.
(3) Experiment with training hyperparameters: add a scheduler, tune the temperature, run a hyperparameter search, etc. We are only scratching the surface here: while the model seems to have converged, the training hyperparameters are almost certainly suboptimal.
(4) Experiment with the latest LayoutLM models, like V2 and V3.
(5) Increase the context size. It is currently limited to 512 tokens, which should be sufficient for most visually rich documents like decks but is still suboptimal.
(6) Update the anchor model. While it is unlikely to improve performance (although likely to decrease training time), it will make backward compatibility more useful.
Have fun building RAG pipelines!
[1] https://arxiv.org/abs/2007.00398
[2] https://arxiv.org/abs/1912.13318
[3] https://github.com/RElbers/info-nce-pytorch
[4] https://marekstraka.com/wp-content/uploads/2020/08/AirBnB_Pitch_Deck.pdf