I created my own GPT that answers questions based on an input document (e.g. a PDF). As part of the response, I would like to have the page number associated with the generated answer, i.e. the most relevant page that was used to generate it. Right now I get correct responses, but hallucinated page numbers.
Is this at all possible? Or are the chunking and indexing behind the scenes simply agnostic to page numbers?
I have also tried loading a small test PDF (3 pages) into ChatGPT and asking it "what page number is X on?", where X is some piece of information unique to a specific page, and the answer (page number) is completely wrong.
It makes me think that there is no preservation of document hierarchy under the hood; it's just some straightforward chunking of text.
No update yet. I will run some more experiments with different ways of formatting PDFs and report back. But it would be great if OpenAI could describe the process behind document ingestion. In my case I work with lots of visually rich documents: a mixture of text, images, graphs, etc. Knowing how the ingestion, chunking, and metadata handling work is essentially a must. Otherwise I have to fall back on LangChain, where I have full control.
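For reference, this is roughly what the LangChain fallback looks like (a minimal sketch; the file path is a placeholder). `PyPDFLoader` emits one `Document` per page with the page index in its metadata, and the splitter copies that metadata onto every chunk, so retrieved chunks keep their page numbers:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pages = PyPDFLoader("example.pdf").load()  # one Document per page, "example.pdf" is a placeholder

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)   # page metadata survives splitting

for chunk in chunks[:3]:
    # "page" is 0-based, so add 1 for a human-readable page number
    print(f"page {chunk.metadata['page'] + 1}: {chunk.page_content[:60]!r}")
```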
I created two variations of a simple 3-page PDF document, with a simple b&w image on page 1, some text on page 2, and another simple b&w image on page 3. Each page was marked with a page number.
In one version, the document was a mixture of images and "selectable" text, i.e. the text was stored in the PDF's actual text layer; in the other version, everything, including the plain text, was rasterized as images, i.e. no text in the document was selectable.
I prompted for the page number in various ways, for example:

- "What page number is X on?"
- "By visually inspecting each page of the document, from start to finish, provide a page number for X."
- "By visually inspecting each page of the document, from start to finish, count the number of pages from the start of the document, where there is X."

In these prompts "X" is a description of text or an image.
During its "Analyzing" stage, GPT-4 actually displays each page in sequence correctly (it activates the "visual inspection" function), but the resulting page numbers are always incorrect.
Conclusion: while the temporary storage of the document preserves its proper form, once the document goes beyond that (to the vector DB, the GPT model, and whatever else is behind the wall), it has no idea what a page is.
Are there any updates or solutions to this problem?
My ChatGPT build relies entirely on correctly quoting page numbers from PDFs, and it's not doing so despite my best attempts at configuring it.
@MMCor basically there is no way of solving this with prompts, as of today at least. The reason is that the PDF ingestion and vectorization "behind the scenes" simply doesn't preserve page numbers.
The closest I managed with prompts alone is to instruct ChatGPT, when loading the PDF, to "treat each page as an image". Later on, when performing Q&A or fetching information, if you also instruct it to give a "correct page number reference", it actually returns the page number as printed/shown on that page, i.e. if the page number is visibly displayed in the header or footer, it will work.
But note that relative to the overall PDF document, this number will typically be wrong, since PDFs often have a cover page, table of contents, or other pages that are not explicitly marked with page numbers "on paper". So it doesn't really help you anyway.
So as of right now, you need to implement some intermediate solution that preserves page numbers.
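One such workaround (just a sketch, and the file names are placeholders): extract the text page by page with pypdf and inject explicit page markers, then upload the marked-up text file instead of the raw PDF. The markers travel with the text through chunking, so the model has a chance of quoting them back correctly. This obviously only works for text-heavy PDFs; images and layout are lost, which is a real limitation for visually rich documents.

```python
from pypdf import PdfReader

def pdf_to_marked_text(pdf_path: str, out_path: str) -> None:
    """Write the PDF's text to a file, prefixing each page with a [Page N] marker."""
    reader = PdfReader(pdf_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for i, page in enumerate(reader.pages, start=1):
            out.write(f"[Page {i}]\n")
            out.write((page.extract_text() or "") + "\n\n")

pdf_to_marked_text("input.pdf", "input_marked.txt")  # placeholder file names
```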