I’m struggling with a specific use-case and wondering if anyone has any insight.
I have been using GPT-4-vision to extract data from medical test documents, which come in a range of different formats.
GPT-4-vision does very well at identifying the correct table and extracting all of the data with the one key exception of branched rows.
Columns: Biomarker, Method, Analyte, Result
Unbranched row: 1 Biomarker, 1 Method, 1 Analyte, 1 Result (all values aligned horizontally)
Branched row: 1 Biomarker, 1 Method, 2 Analytes, 2 Results (the two Analyte/Result pairs sit on separate lines, one above and one below the line holding the Biomarker and Method values).
(which rows are branched varies from document to document)
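To make the target concrete, here is a minimal Python sketch of the output I'm after. The class names, the `flatten` helper and the placeholder values are all made up for illustration; none of them come from a real document or library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Measurement:
    analyte: str
    result: str


@dataclass
class TableRow:
    biomarker: str
    method: str
    # One entry for an unbranched row, two for a branched row.
    measurements: List[Measurement]


def flatten(rows: List[TableRow]) -> List[Tuple[str, str, str, str]]:
    """Repeat the shared Biomarker and Method for every Analyte/Result pair."""
    return [
        (row.biomarker, row.method, m.analyte, m.result)
        for row in rows
        for m in row.measurements
    ]


# Placeholder values only (hypothetical, not real test data).
rows = [
    TableRow("Biomarker A", "Method 1", [Measurement("Analyte A1", "Result A1")]),
    TableRow(
        "Biomarker B",
        "Method 2",
        [
            Measurement("Analyte B1", "Result B1"),  # upper branch
            Measurement("Analyte B2", "Result B2"),  # lower branch
        ],
    ),
]

flat = flatten(rows)
for record in flat:
    print(record)
```

So a branched row should come out as two records that share the same Biomarker and Method. What I actually get is only one of those two records.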
In these cases, the model usually extracts one of the branches (sometimes upper, sometimes lower) and ignores the other.
I’m going crazy trying to word a prompt to get it to extract both branches. Some of the things I have tried include:
- Describing branched rows in every way I can think of (branched, shared, vertically-merged cells/values, etc.).
- Telling it exactly which rows are branched (not feasible in production).
- Telling it to extract only the 2 Analyte and 2 Result values for a specific biomarker.
- Telling it to index the table by Result.
- Telling it to output the results in different formats (csv, html).
In frustration:
- Telling it to extract only the Result column, and exactly how many values there should be (it skipped the branched values and used values from a later table to make up the count).
- Telling it to extract ALL WORDS in the image with no mention of a table (it extracted every word except the branched values).
I’ve found examples online of extracting tables, but none with this sort of format. Has anyone found an approach for this?