Obtaining correct PDF page number in the response using GPTs

I created my own GPT that provides a response based on an input document (e.g. PDF). As part of the reasoning response I would like to have the page number associated with the generated answer, i.e. based on the most relevant page that was used to generate the answer. Right now I get correct responses, but hallucinated page numbers.

Is this at all possible? Or is it that the chunking and indexing behind the scenes is agnostic to page numbers?

I have also tried loading a small test PDF (3 pages) into ChatGPT, and just asked it “what page number is X on?”, where X is some piece of information that is unique to a specific page, and the answer (page number) is completely wrong.

It makes me think that there is no preservation of document hierarchy under the hood; it’s just some straightforward chunking of text.

Is there any update on this, @platypus? I have the same question about pagination metadata.

GPT interprets a PDF the same way any converter does when it turns a PDF into text.

While GPT can technically read a PDF, the PDF itself may not convert cleanly.

Do a PDF-to-Markdown conversion and open the result as plain text in Notepad or nano.

Does it look readable?

Well, that’s how GPT reads it.

Really, there are two good ways to load documents into GPT:

  1. Copy and paste it as text (or convert it to text, if you are confident in the conversion)

  2. JSON.

The rule of thumb: if it makes sense to YOU when you read it, then it makes sense to GPT.

No update yet. I will run some more experiments with different ways of formatting PDFs, and report back. But it would be great if OpenAI could describe the process behind document ingestion. In my case I work with lots of visually rich documents: a mixture of text, images, graphs, etc. Knowing how the ingestion, chunking, and metadata work is essentially a must. Otherwise I have to fall back on LangChain, where I have full control.

How would you convert it to JSON? That would also make it possible to add metadata.
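One way this could look, as a minimal sketch rather than a confirmed recipe: keep one JSON record per page and store the page number as metadata next to the text. The function name `pages_to_json`, the field names, and the sample page texts are all hypothetical. The sketch assumes the per-page text has already been extracted by some PDF tool of your choice.

```python
import json

def pages_to_json(page_texts, source_name):
    """Build a JSON string with one record per page, keeping the page number."""
    records = [
        {"source": source_name, "page": number, "text": text.strip()}
        for number, text in enumerate(page_texts, start=1)
    ]
    return json.dumps(records, ensure_ascii=False, indent=2)

# Hypothetical page texts; in practice these would come from a PDF text extractor.
sample_pages = [
    "Cover image caption",
    "Body text on the second page.",
    "Chart caption",
]
document_json = pages_to_json(sample_pages, "example.pdf")
print(document_json)
```

Because the page number sits right beside the text in each record, it is part of what the model reads, instead of something it has to guess.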

Some updates:

  • I created two variations of a simple 3-page PDF document where there was a simple b&w image on page 1, some text on page 2, and another simple b&w image on page 3. Each page was marked with a page number.
  • In one version of the document, it was a mixture of images and “selectable” text, i.e. text was stored as actual text in the PDF; in the other version, everything, including plain text, was an “image”, i.e. text was non-selectable in the document.
  • I prompted for the page number in various ways. For example: “What page number is X on?”, “By visually inspecting each page of the document, from start to finish, provide a page number for X”, “By visually inspecting each page of the document, from start to finish, count the number of pages from the start of the document, where there is X”. In these prompts “X” is a description of text or an image.
  • During its “Analyzing” stage, GPT-4 actually displays each page correctly and in sequence (it activates a “visual inspection” function), but the resulting page numbers are always incorrect.

Conclusion: while the temporary storage of the document preserves proper form, once it goes beyond that (to vector db and GPT model and whatever else is behind the wall), it has no idea what a page is.

Are there any updates or solutions to this problem?
My ChatGPT build relies solely on correctly quoting page numbers from PDFs, and it’s not doing it despite my best attempts at configuring it.

Same question. Has anyone figured out how to solve this with prompts?

@MMCor basically there is no way of solving this with prompts, as of today at least. The reason is that the PDF ingestion and vectorization “behind the scenes” simply doesn’t preserve page numbers.

The closest I got with prompts alone was to instruct ChatGPT, when loading the PDF, to “treat each page as an image”. Later, during Q/A or when fetching information, if you also ask for a “correct page number reference”, it actually gives you the correct page number as printed on that page. I.e. if the page number is visibly displayed in the header or footer, it will work.

But note that in relation to the overall PDF document, this number will typically be wrong, since PDFs will have a cover page, table of contents, or other pages that are not explicitly marked with page numbers “on paper”. So it doesn’t really help you anyway.
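To illustrate the mismatch, here is a small sketch that converts a printed page number into its position in the PDF file. It assumes the simplest case only: all unnumbered pages come at the front and the printed numbering runs continuously from 1 afterwards. The function name `physical_page` and the example values are hypothetical.

```python
def physical_page(printed_number, unnumbered_front_pages):
    """Convert a printed page number to its 1-based position in the PDF file."""
    if printed_number < 1:
        raise ValueError("printed page numbers start at 1")
    return printed_number + unnumbered_front_pages

# Hypothetical document: a cover page and a table of contents precede printed page 1.
print(physical_page(5, unnumbered_front_pages=2))
```

Documents with unnumbered pages in the middle, or with Roman-numeral front matter, would need a per-page lookup table instead of a single offset.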

So as of right now, you need to implement some intermediate solution that preserves page numbers.
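As a rough sketch of what such an intermediate step could look like: split the text page by page, tag every chunk with its page number, and return that number together with the best-matching chunk. All names here (`build_page_chunks`, `best_page`) and the sample pages are hypothetical, and plain word overlap stands in for the embedding similarity a real vector store would use.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_page_chunks(page_texts, chunk_size=50):
    """Split each page into word chunks, tagging every chunk with its page number."""
    chunks = []
    for page_number, text in enumerate(page_texts, start=1):
        words = text.split()
        for start in range(0, len(words), chunk_size):
            chunk_text = " ".join(words[start:start + chunk_size])
            chunks.append({"page": page_number, "text": chunk_text})
    return chunks

def best_page(question, chunks):
    """Return the page number of the chunk sharing the most words with the question."""
    question_words = tokenize(question)
    best = max(chunks, key=lambda chunk: len(question_words & tokenize(chunk["text"])))
    return best["page"]

# Hypothetical per-page text, assumed to be extracted from the PDF beforehand.
pages = [
    "Annual report cover",
    "Revenue grew by twelve percent in the third quarter",
    "Appendix with a chart of employee headcount",
]
chunks = build_page_chunks(pages)
print(best_page("How much did revenue grow in the third quarter?", chunks))
```

The point is that the page number travels with the chunk as metadata, so the answer can cite it directly instead of relying on the model to infer it.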

@platypus thank you. I’ll give this a shot.