Improving GPT Vision for multi-column document analysis

Hi everyone, I’m currently developing a contract analysis tool that uses GPT-4 Vision (GPT-4o) to extract structured information from PDF or DOCX contract documents. The system works by processing each page visually using GPT Vision, then chunking the content and embedding it for downstream Q&A. So far, the results have been very good, especially when documents are formatted in a single-column, line-by-line layout.

However, I’ve noticed a significant drop in accuracy when working with two-column documents (those that split content into left and right blocks). Based on recent feedback, I’m assuming that this layout format is the main factor causing inconsistencies.

I’d really appreciate any advice on how to improve GPT Vision’s performance in multi-column layouts. Has anyone had success prompting the model to follow column order properly? Or would it be better to pre-process the PDF with a tool like pdfplumber or PyMuPDF to extract each column separately before feeding it into the model? Any insights, best practices, or workarounds would be super helpful. Thanks in advance!
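
For reference, here is a rough sketch of the pre-processing option I have in mind. It is not what the tool does today. It assumes a simple two-column page split at the horizontal midpoint, with headings or tables that span most of the page width treated as full-width blocks. The function names and the 0.6 width threshold are my own placeholders. The block tuples follow the shape returned by PyMuPDF's page.get_text("blocks"), i.e. (x0, y0, x1, y1, text, block_no, block_type).

```python
from typing import Iterable, List, Sequence


def two_column_reading_order(blocks: Iterable[Sequence], page_width: float,
                             full_width_ratio: float = 0.6) -> List[str]:
    """Return block texts in reading order for an assumed two-column page.

    Each block is (x0, y0, x1, y1, text, ...). Blocks wider than
    full_width_ratio * page_width are treated as spanning both columns.
    """
    mid = page_width / 2
    ordered: List[str] = []
    left: List[Sequence] = []
    right: List[Sequence] = []

    def flush() -> None:
        # Emit the pending left column top to bottom, then the right column.
        for column in (left, right):
            ordered.extend(b[4].strip() for b in sorted(column, key=lambda b: b[1]))
            column.clear()

    for block in sorted(blocks, key=lambda b: b[1]):
        x0, _, x1, _, text = block[:5]
        # Skip non-text blocks (block_type != 0) and empty text.
        if (len(block) > 6 and block[6] != 0) or not text.strip():
            continue
        if (x1 - x0) > full_width_ratio * page_width:
            # A full-width block ends the current two-column band.
            flush()
            ordered.append(text.strip())
        elif (x0 + x1) / 2 < mid:
            left.append(block)
        else:
            right.append(block)
    flush()
    return ordered


def pdf_pages_in_reading_order(path: str) -> List[str]:
    """Extract one reordered text string per page using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependencies

    with fitz.open(path) as doc:
        return [
            "\n".join(two_column_reading_order(page.get_text("blocks"), page.rect.width))
            for page in doc
        ]
```

The idea would be to pass the reordered text to the model alongside (or instead of) the page image, so the column order is already fixed before GPT sees it. I expect this heuristic to break on three-column pages, sidebars, or uneven column widths, which is partly why I'm asking whether prompting alone can work.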