GPT-4-vision extraction of tables with branched rows/vertically-merged cells

I’m struggling with a specific use-case and wondering if anyone has any insight.

I have been using GPT-4-vision to extract data from medical test documents. There are a range of different formats.

GPT-4-vision does very well at identifying the correct table and extracting all of the data with the one key exception of branched rows.

Columns: Biomarker, Method, Analyte, Result

Unbranched row: 1 Biomarker, 1 Method, 1 Analyte, 1 Result (all values aligned horizontally)
Branched row: 1 Biomarker, 1 Method, 2 Analytes, 2 Results (the two Analyte and Result values are stacked vertically, one above and one below the line that holds the Biomarker and Method).

(which rows are branched varies from document to document)
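To make the target concrete, here is the flattened output I'm after for one unbranched and one branched row (the biomarker, method, and analyte names below are made up):

```python
# Desired extraction: every branch becomes its own record, repeating
# the Biomarker and Method values that the branches share.
rows = [
    # unbranched row -> one record
    {"biomarker": "ALK", "method": "IHC", "analyte": "Protein", "result": "Negative"},
    # branched row -> two records sharing the same Biomarker and Method
    {"biomarker": "EGFR", "method": "NGS", "analyte": "DNA", "result": "Not detected"},
    {"biomarker": "EGFR", "method": "NGS", "analyte": "RNA", "result": "Not detected"},
]
```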

In these cases, the model usually extracts one of the branches (sometimes upper, sometimes lower) and ignores the other.

I’m going crazy trying to word a prompt to get it to extract both branches. Some of the things I have tried include:

  1. Describing branched rows in every way I can think of (branched, shared, vertically-merged cells/values, etc.).
  2. Telling it exactly which rows are branched (not feasible in production).
  3. Telling it to extract only the 2 Analyte and 2 Result values for a specific biomarker.
  4. Telling it to index the table by Result.
  5. Telling it to output the results in different formats (csv, html).

In frustration:

  6. Telling it to extract only the Result column, and exactly how many values there should be (it skipped the branched values and pulled values from a later table to make up the count).
  7. Telling it to extract ALL WORDS in the image with no mention of a table (it extracted all words except the branched values).
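For reference, every one of these attempts is just a different text block in the same kind of vision call; a minimal sketch of what I'm running looks roughly like this (the model name, file name, and exact prompt wording are placeholders):

```python
import base64
from openai import OpenAI

client = OpenAI()

def extract_table(image_path: str, instructions: str) -> str:
    # Encode the page image so it can be passed inline to the vision model.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder; whichever vision model is in use
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instructions},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=1500,
    )
    return response.choices[0].message.content

# Roughly one of the "describe branched rows" attempts:
prompt = (
    "Extract the biomarker table as CSV with columns "
    "Biomarker, Method, Analyte, Result. Some rows are branched: "
    "one Biomarker/Method pair shares two Analyte/Result pairs stacked "
    "vertically. Output a separate CSV line for each Analyte/Result pair, "
    "repeating the shared Biomarker and Method."
)
print(extract_table("report_page_3.png", prompt))
```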

I’ve found examples online of extracting tables, but none with this sort of format. Has anyone found an approach for this?

Out of curiosity, why are you using GPT-4V and not a typical OCR model that's built for tables?

Actually, I currently do use one. This is just exploring alternatives.

The issue, as you've found, is that you are confined to only prompting. You could try to manually adjust the image itself, or even find some consistent structures and automatically cut the tables out and then query them individually, but this process has already been accomplished by these table-OCR models.
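(By "cut the tables out" I just mean cropping the table region before querying it; a rough sketch, with made-up coordinates standing in for whatever a layout-detection step would give you:)

```python
from PIL import Image

# Crop the table region out of the page image before querying it.
# The pixel box here is made up; in practice it would come from a
# layout/table-detection step or a known, consistent template.
page = Image.open("report_page_3.png")
table = page.crop((60, 420, 1150, 980))  # (left, upper, right, lower)
table.save("table_only.png")
```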

In your tests, what has made you lean more towards frustratingly prompting GPT-4V?

Yeah, I see what you mean. I don’t want to divert too much from the original topic because I am hoping there is a fix for this.

I wouldn’t say I’m leaning towards GPT yet, but the appeal of it over the OCR approaches I have used is its flexibility in finding the right data and avoiding the wrong data when I don’t know what the document looks like in advance.