I created my own GPT that answers questions based on an input document (e.g. a PDF). As part of the response, I would like to have the page number associated with the generated answer, i.e. the most relevant page that was used to generate it. Right now I get correct responses, but hallucinated page numbers.
Is this at all possible? Or are the chunking and indexing behind the scenes simply agnostic to page numbers?
I have also tried loading a small test PDF (3 pages) into ChatGPT and asking it "what page number is X on?", where X is some piece of information unique to a specific page, and the answer (page number) is completely wrong.
It makes me think that there is no preservation of document hierarchy under the hood; it's just some straightforward chunking of text.
No update yet. I will run some more experiments with different ways of formatting PDFs and report back. But it would be great if OpenAI could describe the process behind document ingestion. In my case I work with lots of visually rich documents: a mixture of text, images, graphs, etc. Knowing how the ingestion, chunking, and metadata handling work is essentially a must. Otherwise I have to fall back on LangChain, where I have full control.
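For reference, this is roughly what the LangChain fallback looks like (a minimal sketch; the file path is a placeholder). `PyPDFLoader` emits one `Document` per page with the page index in its metadata, and the splitter copies that metadata onto every chunk, so retrieved chunks keep their page numbers:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pages = PyPDFLoader("example.pdf").load()  # one Document per page, "example.pdf" is a placeholder

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)   # page metadata survives splitting

for chunk in chunks[:3]:
    # "page" is 0-based, so add 1 for a human-readable page number
    print(f"page {chunk.metadata['page'] + 1}: {chunk.page_content[:60]!r}")
```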
I created two variations of a simple 3-page PDF document, with a simple b&w image on page 1, some text on page 2, and another simple b&w image on page 3. Each page was marked with a page number.
In one version, the document was a mixture of images and "selectable" text, i.e. the text was stored in the PDF's actual text layer; in the other version, everything, including the plain text, was rasterized as images, i.e. no text in the document was selectable.
I prompted for the page number in various ways, for example:

- "What page number is X on?"
- "By visually inspecting each page of the document, from start to finish, provide a page number for X."
- "By visually inspecting each page of the document, from start to finish, count the number of pages from the start of the document, where there is X."

In these prompts "X" is a description of text or an image.
During its "Analyzing" stage, GPT-4 actually displays each page in sequence correctly (it activates the "visual inspection" function), but the resulting page numbers are always incorrect.
Conclusion: while the temporary storage of the document preserves its proper form, once the document goes beyond that (to the vector DB, the GPT model, and whatever else is behind the wall), it has no idea what a page is.
Are there any updates or solutions to this problem?
My ChatGPT build relies entirely on correctly quoting page numbers from PDFs, and it's not doing so despite my best attempts at configuring it.
@MMCor basically there is no way of solving this with prompts, as of today at least. The reason is that the PDF ingestion and vectorization "behind the scenes" simply doesn't preserve page numbers.
The closest I managed with prompts alone is to instruct ChatGPT, when loading the PDF, to "treat each page as an image". Later on, when performing Q&A or fetching information, if you also instruct it to give a "correct page number reference", it actually returns the page number as printed/shown on that page, i.e. if the page number is visibly displayed in the header or footer, it will work.
But note that relative to the overall PDF document, this number will typically be wrong, since PDFs often have a cover page, table of contents, or other pages that are not explicitly marked with page numbers "on paper". So it doesn't really help you anyway.
So as of right now, you need to implement some intermediate solution that preserves page numbers.
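One such workaround (just a sketch, and the file names are placeholders): extract the text page by page with pypdf and inject explicit page markers, then upload the marked-up text file instead of the raw PDF. The markers travel with the text through chunking, so the model has a chance of quoting them back correctly. This obviously only works for text-heavy PDFs; images and layout are lost, which is a real limitation for visually rich documents.

```python
from pypdf import PdfReader

def pdf_to_marked_text(pdf_path: str, out_path: str) -> None:
    """Write the PDF's text to a file, prefixing each page with a [Page N] marker."""
    reader = PdfReader(pdf_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for i, page in enumerate(reader.pages, start=1):
            out.write(f"[Page {i}]\n")
            out.write((page.extract_text() or "") + "\n\n")

pdf_to_marked_text("input.pdf", "input_marked.txt")  # placeholder file names
```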