OpenAI is bad at understanding tables with sub-row and sub-column headers

|--------------------------|---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          |               |          Type I         |          Type II          |          Type III           |                        Type IV                         |          Type V         |
| Occupancy Classification | See Footnotes |------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          |               |     A      |     B      |      A      |      B      |      A       |      B       |      A      |      B      |      C      |      HT      |     A      |     B      |
|--------------------------|---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
| A-1                      | NS            |            |     52     |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | S1            |     51     |     53     |     55      |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | SM            |            |     54     |             |             |              |              |             |             |             |              |            |            |
|--------------------------|---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
| A-2                      | NS            |            |            |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | S1            |            |            |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | SM            |            |            |             |             |              |              |             |             |             |              |            |            |
|--------------------------|---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
| A-3                      | NS            |            |            |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | S1            |            |            |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | SM            |            |            |             |             |              |              |             |             |             |              |            |            |
|--------------------------|---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
| A-4                      | NS            |            |            |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | S1            |            |            |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | SM            |            |            |             |             |              |              |             |             |             |              |            |            |
|--------------------------|---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
| A-5                      | NS            |            |            |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | S1            |            |            |             |             |              |              |             |             |             |              |            |            |
|                          |---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|
|                          | SM            |            |            |             |             |              |              |             |             |             |              |            |            |
|--------------------------|---------------|------------|------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|--------------|------------|------------|

The above table is pretty easy for a human to read. For example, the number 53 is in row A-1, sub-row S1, and column Type I, sub-column B. However, when I sent this table to ChatGPT and asked it to tell me which rows and columns individual numbers were in, it was unable to do so. Any insights into this?

To be fair, reliably parsing these types of tables is non-trivial with other methods too.

Perhaps try a multi-step approach: first have it convert such a table into a flat table, then work with the flat version?
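
The flattening step could also be done in code before anything reaches the model. Here is a minimal sketch, assuming the table arrives as pipe-delimited text like the one above. The `COLUMNS` list is written out by hand from the two header rows, and the `flatten` function and `SAMPLE` text are hypothetical names made up for this illustration.

```python
# Flattened column labels, copied by hand from the two header rows
# (an assumption: the header is not parsed automatically here).
COLUMNS = [
    ("Type I", "A"), ("Type I", "B"),
    ("Type II", "A"), ("Type II", "B"),
    ("Type III", "A"), ("Type III", "B"),
    ("Type IV", "A"), ("Type IV", "B"), ("Type IV", "C"), ("Type IV", "HT"),
    ("Type V", "A"), ("Type V", "B"),
]

# A shortened copy of the A-1 block of the table, with narrower columns.
SAMPLE = """
|     |    | A  | B  | A  | B | A | B | A | B | C | HT | A | B |
|-----|----|----|----|----|---|---|---|---|---|---|----|---|---|
| A-1 | NS |    | 52 |    |   |   |   |   |   |   |    |   |   |
|     |----|----|----|----|---|---|---|---|---|---|----|---|---|
|     | S1 | 51 | 53 | 55 |   |   |   |   |   |   |    |   |   |
|     |----|----|----|----|---|---|---|---|---|---|----|---|---|
|     | SM |    | 54 |    |   |   |   |   |   |   |    |   |   |
"""


def flatten(table_text):
    """Return (row, sub_row, column, sub_column, value) for each filled cell."""
    records = []
    current_row = None
    for line in table_text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [c.strip() for c in line[1:-1].split("|")]
        # Skip merged header rows, which have fewer cells than a data row.
        if len(cells) != 2 + len(COLUMNS):
            continue
        # Skip horizontal rules (any cell made only of dashes).
        if any(c and set(c) == {"-"} for c in cells):
            continue
        # Skip the sub-column header row, which has no sub-row label.
        if not cells[1]:
            continue
        # A blank first cell means "same row as above", so carry it forward.
        if cells[0]:
            current_row = cells[0]
        for (column, sub_column), value in zip(COLUMNS, cells[2:]):
            if value:
                records.append((current_row, cells[1], column, sub_column, value))
    return records


for record in flatten(SAMPLE):
    print(" | ".join(record))
```

Each output line carries its full set of headers, so the cell holding 53 becomes a single record with A-1, S1, Type I, and B spelled out next to it, and no positional reasoning is left for the model to do.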

The issue is that these tables are embedded in hundreds of pages of text that I’m processing. While it would be easy to get it working in isolation, I would love to find a PDF parser that can understand these tables and convert them into single-row, single-column headers, or some other solution, in addition to converting the rest of the PDF into a string that can be embedded.

I’ve tried using tabulate, which is supposed to be good at pulling out tables, and got a fairly one-to-one conversion from the PDF table to text. The issue came when I tried to get ChatGPT to understand the table and find individual cells based on the rows/columns and sub-headers.

Ahhh… That’s different then. Your issue seems to be with PDF parsing rather than with ChatGPT.

The vast majority of PDF parsers are very basic and fail spectacularly at anything that isn’t plain text. I would suggest looking at Mathpix to see if their solution fits your needs.

Alternately, if you can share how such a table is represented after parsing with your current parser, perhaps I can give you some more suggestions.

Are you familiar with pdfminer.six? It’s a Python library that lets you parse the individual PDF elements. Unfortunately, since each cell in the table is a separate PDF element, it parses them inline: none of the table structure is preserved, and the output just follows the order of the cells on the page. I tried getting ChatGPT to read the coordinates on each TextBox, but didn’t have much success there. I’ll look into Mathpix. Thank you!
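
For anyone trying the coordinate route in code rather than in a prompt, here is a rough sketch of grouping text boxes into table rows by their vertical position. It assumes each cell comes out of pdfminer.six as an `LTTextContainer` with a `bbox` of `(x0, y0, x1, y1)`. The function names `text_boxes` and `group_into_rows`, the `y_tolerance` value, and the coordinates in the demo are all made up for illustration and would need tuning against a real page.

```python
def text_boxes(pdf_path, page_index=0):
    """Collect (x0, y0, x1, y1, text) for every text box on one page."""
    # Imported here so the grouping logic below runs without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for index, page in enumerate(extract_pages(pdf_path)):
        if index == page_index:
            return [
                (*element.bbox, element.get_text().strip())
                for element in page
                if isinstance(element, LTTextContainer)
            ]
    return []


def group_into_rows(boxes, y_tolerance=3.0):
    """Group boxes whose vertical centres are within y_tolerance points.

    Returns rows from top to bottom, each a list of (x0, text) pairs
    sorted left to right.
    """
    def y_mid(box):
        return (box[1] + box[3]) / 2

    rows = []
    # PDF coordinates start at the bottom-left, so larger y is higher up.
    for box in sorted(boxes, key=y_mid, reverse=True):
        if rows and abs(rows[-1]["y"] - y_mid(box)) <= y_tolerance:
            rows[-1]["boxes"].append(box)
        else:
            rows.append({"y": y_mid(box), "boxes": [box]})
    return [
        [(b[0], b[4]) for b in sorted(row["boxes"], key=lambda b: b[0])]
        for row in rows
    ]


# Made-up coordinates standing in for a few cells of the table above.
demo = [
    (100, 700, 120, 712, "52"),
    (60, 690, 80, 702, "51"),
    (100, 690.5, 120, 702.5, "53"),
    (10, 700, 40, 712, "A-1"),
]
for row in group_into_rows(demo):
    print(row)
```

Empty cells produce no text box at all, so the position of a value within its row says nothing about its column. The `x0` kept next to each value is there so it can be compared against the x-positions of the header cells to decide which column it belongs to.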

No, I’ve not used pdfminer.six. My biggest issues with processing PDFs stem from equations and algorithms. I just know Mathpix is generally well-regarded for its extraction quality, though it can be pretty expensive.
