I’m struggling with a specific use-case and wondering if anyone has any insight.
I have been using GPT-4-vision to extract data from medical test documents, which come in a range of different formats.
GPT-4-vision does very well at identifying the correct table and extracting all of the data, with one key exception: branched rows.
Columns: Biomarker, Method, Analyte, Result
Unbranched row: 1 Biomarker, 1 Method, 1 Analyte, 1 Result (all values aligned horizontally).
Branched row: 1 Biomarker, 1 Method, 2 Analytes, 2 Results (the two Analyte/Result pairs are stacked vertically, one above and one below the shared Biomarker and Method values).
(Which rows are branched varies from document to document.)
In these cases, the model usually extracts one of the branches (sometimes upper, sometimes lower) and ignores the other.
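For concreteness, here is the shape of the output I’m after. The biomarker names and values are made-up placeholders; the point is that a branched row should expand into two records that share the Biomarker and Method:

```python
# Illustrative only: field names and values are my own placeholders.

# An unbranched row maps to a single record.
unbranched = {
    "biomarker": "ALK",
    "method": "IHC",
    "analyte": "Protein",
    "result": "Negative",
}

# A branched row should expand into two records that repeat the shared
# Biomarker and Method but carry their own Analyte and Result.
branched = [
    {"biomarker": "EGFR", "method": "NGS", "analyte": "DNA",
     "result": "Exon 19 deletion"},
    {"biomarker": "EGFR", "method": "NGS", "analyte": "RNA",
     "result": "Not detected"},
]
```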
I’m going crazy trying to word a prompt that gets it to extract both branches. Some of the things I have tried (one of the request variants I’ve been sending is sketched after this list):

1. Describing branched rows in every way I can think of (branched, shared, vertically-merged cells/values, etc.).
2. Telling it exactly which rows are branched (not feasible in production).
3. Telling it to extract only the 2 Analyte and 2 Result values for a specific biomarker.
4. Telling it to index the table by Result.
5. Telling it to output the results in different formats (CSV, HTML).

And, in frustration:

6. Telling it to extract only the Result column, specifying exactly how many values there should be (it still skipped the branched values and padded the count with values from a later table).
7. Telling it to extract ALL WORDS in the image, with no mention of a table (it extracted every word except the branched values).
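For reference, here is a minimal sketch of one request variant I’ve been sending. The prompt wording, file name, and model name are just one of the combinations I tried, not a working fix:

```python
import base64
from openai import OpenAI

client = OpenAI()

# Placeholder file name; each page is sent as a base64 data URL.
with open("report_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    "Extract the biomarker table as JSON records with keys "
    "biomarker, method, analyte, result. Some rows are branched: "
    "one Biomarker/Method pair shares two Analyte/Result pairs, "
    "stacked above and below it. Emit one record per Analyte/Result "
    "pair, repeating the shared Biomarker and Method."
)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```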
I’ve found examples online of extracting tables, but none with this sort of format. Has anyone found an approach for this?
The issue, as you’ve found, is that you’re confined to prompting alone. You could try manually adjusting the image itself, or find some consistent structures, automatically cut the tables out, and query them individually (roughly as sketched below), but this kind of processing is exactly what dedicated table-OCR models already do.
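A rough sketch of the cut-out idea, assuming you already have a table bounding box from somewhere (a table-detection model, or hand-tuned coordinates if the layouts are consistent); the numbers and file names below are placeholders:

```python
from PIL import Image

page = Image.open("report_page.png")

# (left, upper, right, lower) in pixels -- placeholder coordinates that
# would come from a table-detection step in practice.
table_box = (40, 320, 1180, 760)

table_crop = page.crop(table_box)
table_crop.save("table_crop.png")  # query just this crop on its own
```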
In your tests, what has made you lean towards wrestling with GPT-4-vision prompts?
Yeah, I see what you mean. I don’t want to divert too much from the original topic because I am hoping there is a fix for this.
I wouldn’t say I’m leaning towards GPT yet, but the appeal of it over the OCR approaches I have used is its flexibility in finding the right data and avoiding the wrong data when I don’t know what the document looks like in advance.
I wouldn’t be able to say which is best for your use-case without seeing a decent number of these documents, but the general idea is straightforward:

First, I’d try running a pre-built table-OCR tool and see if it can manage these documents without adding complexity.

If it can’t, and the table structure is consistent, you can implement a multi-stage OCR pipeline:

Classify → Segment → OCR
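A skeleton of that pipeline is below. The two helpers are hypothetical stand-ins: classify_layout might be a small image classifier or heuristic, and segment_table a table-structure model that emits one crop per cell, including one per Analyte/Result branch. The OCR stage here uses pytesseract, but any OCR engine would do:

```python
from PIL import Image
import pytesseract

def classify_layout(page: Image.Image) -> str:
    """Stage 1: decide which known document format this page matches."""
    raise NotImplementedError  # e.g. a small classifier or heuristics

def segment_table(page: Image.Image, layout: str) -> list[Image.Image]:
    """Stage 2: return one cropped image per table cell, using the
    layout class to pick the right segmentation rules. Branched rows
    become two cells in the Analyte and Result columns."""
    raise NotImplementedError  # e.g. a table-structure detection model

def extract(page_path: str) -> list[str]:
    """Stage 3: plain OCR over each segmented cell."""
    page = Image.open(page_path)
    layout = classify_layout(page)
    cells = segment_table(page, layout)
    return [pytesseract.image_to_string(cell).strip() for cell in cells]
```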
There was also a recent paper on OCRing documents; it has a nice demo you can try out that exercises all of the OCR stages you’d need: