Obtaining correct PDF page number in the response using GPTs

I created my own GPT that provides a response based on an input document (e.g. PDF). As part of the reasoning response I would like to have the page number associated with the generated answer, i.e. based on the most relevant page that was used to generate the answer. Right now I get correct responses, but hallucinated page numbers.

Is this at all possible? Or is it that the chunking and indexing behind the scenes is agnostic to page numbers?

I have also tried loading a small test PDF (3 pages) into ChatGPT, and just asked it “what page number is X on?”, where X is some piece of information that is unique to a specific page, and the answer (page number) is completely wrong.

It makes me think that there is no preservation of document hierarchy under the hood; it's just some straightforward chunking of text.

Is there any update on this, @platypus? I have the same question about pagination metadata.

The way GPT interprets a PDF is much the same as the way any converter does.

While GPT can technically read a PDF, the extracted text may not be good.

Do a PDF-to-Markdown conversion and open the result as plain text in Notepad or nano.

Does it look readable?

Well, that's how GPT reads it.

Really, the best way to load documents into a GPT is one of two:

  1. Copy-paste it as text (or convert it to text reliably).

  2. Convert it to JSON.

The rule of thumb is: if it makes sense to YOU when you read it,
then it makes sense to GPT.
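To make option 1 concrete, here is a minimal sketch of joining per-page text into one Markdown string while keeping an explicit page marker per page. It assumes you have already extracted per-page text with some tool (e.g. pypdf's `Page.extract_text()`); the `pages_to_markdown` helper and the `## Page N` marker style are my own illustration, not a fixed convention.

```python
# Sketch: turn a list of per-page text strings (e.g. from pypdf's
# Page.extract_text()) into one Markdown document that keeps a
# visible page marker in front of each page's text.
def pages_to_markdown(pages: list[str]) -> str:
    sections = []
    for number, text in enumerate(pages, start=1):
        sections.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    demo = ["Intro text on the first page.", "Details on the second page."]
    print(pages_to_markdown(demo))
```

Because the page markers survive as plain text, a model reading the converted document at least has a chance of citing the right page.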

No update yet. I will run some more experiments with different ways of formatting PDFs and report back. But it would be great if OpenAI could describe the document-ingestion process. In my case I work with lots of visually rich documents: a mixture of text, images, graphs, etc. Understanding how the ingestion, chunking, and metadata handling work is essential. Otherwise I have to fall back on LangChain, where I have full control.


How would you convert it to JSON? That would also open up the possibility of adding metadata.
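One way this could look, sketched with the standard library only: wrap each page's text in a JSON object that carries page-number metadata. The field names here (`source`, `page_number`, `text`) are purely illustrative, not a required schema, and the per-page text is assumed to have been extracted already by whatever PDF tool you use.

```python
import json


# Sketch: wrap per-page text in JSON records so each record carries
# its page number as metadata. Field names are illustrative only.
def pages_to_json(pages: list[str], source: str) -> str:
    records = [
        {"source": source, "page_number": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(pages_to_json(["First page text.", "Second page text."], "report.pdf"))
```

A GPT reading this sees the page number sitting right next to the text it belongs to, instead of having to infer pagination from layout.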

Some updates:

  • I created two variations of a simple 3-page PDF document where there was a simple b&w image on page 1, some text on page 2, and another simple b&w image on page 3. Each page was marked with a page number.
  • In one version of the document, it was a mixture of images and selectable text, i.e. the text was represented as text in the PDF's content; in the other version, everything, including the plain text, was an image, i.e. the text was non-selectable.
  • I prompted for the page number in various ways. For example: “What page number is X on?”, “By visually inspecting each page of the document, from start to finish, provide a page number for X”, “By visually inspecting each page of the document, from start to finish, count the number of pages from the start of the document, where there is X”. In these prompts “X” is a description of text or an image.
  • During its “Analyzing” stage, GPT-4 actually displays each page in sequence correctly (it activates a “visual inspection” function), but the resulting page numbers are always incorrect.

Conclusion: while the temporary storage of the document preserves its proper form, once it goes beyond that (to the vector DB, the GPT model, and whatever else is behind the wall), the system has no idea what a page is.
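If you do control ingestion yourself (with LangChain, or with plain Python as sketched below), one fix is page-aware chunking: split each page independently, so every chunk inherits the page number it came from and retrieved chunks can be cited back to a page. This is only a sketch under assumed inputs; the `chunk_pages` helper, the character-based splitting, and the size/overlap values are all my own illustration.

```python
# Sketch: page-aware chunking. Splitting each page independently means
# every chunk keeps the page number it came from, so an answer built
# from a retrieved chunk can cite its page. Sizes are illustrative.
def chunk_pages(
    pages: list[str], chunk_size: int = 500, overlap: int = 50
) -> list[dict]:
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        start = 0
        while start < len(text):
            piece = text[start : start + chunk_size]
            chunks.append({"page_number": page_number, "text": piece})
            start += chunk_size - overlap  # slide window with overlap
    return chunks
```

Contrast this with chunking the concatenated full text, where page boundaries vanish and the index has nothing page-shaped left to return.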