PDF page identification errors with File Search on the Assistants v2 API. Paging problem: pages not in chunk metadata?

Assistants with the file search tool and gpt-4o as the model (I did not check others) cannot relate uploaded PDF content to its pages. They consistently answer simple questions like these incorrectly:

  • Extract the first text line of each page
  • What table is on page 4?
  • On what page does XXXX appear?

This is normal and expected behavior. The model has no knowledge or understanding of what a “page” is with respect to the text of the PDF.

“File search” isn’t actually searching your files; it’s searching the text that has been extracted from them.

The extracted text has no page information so the models cannot know what is on each page.

It is not expected behavior for me. That simplicity would only work in a question-answering scenario where the information sits within a few paragraphs. As a user, I would like an off-the-shelf RAG system (which I believe OpenAI wants to offer) that can extract or summarize a section of a PDF document. I could be wrong and certainly appreciate your experienced help.

What I’ve noticed about this vector store’s fixed chunking strategy (800-token chunks with 400 overlap and a maximum of 20 chunks retrieved) is that, without a page reference in the chunk metadata, it is impossible to do basic stuff. One example is a section of the document that runs from page 4 to page 9. The only way the model knows the extent of that section is from the table of contents, and if it has no way to know that a specific chunk belongs to pages 4 to 9, how will it be able to extract or summarize that section?

I’ll say this, you’re not wrong.

This is partly why many devs ultimately decide to use their own, customizable, RAG solution.

One thing I have almost universally recommended to people is to not upload PDFs as Knowledge or to a vector store for File Search.

Extract the text and convert it to markdown yourself. That way you at least know exactly what data is going in.

If you do this, it’s easy enough to break up the document by page or even by section.

Or you could pepper the document with reference markers every so often—before and after each paragraph of text throw in some HTML-style comments with reference information, e.g.,

<!-- §2 ¶3 p. 5 -->
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed tempus diam sed tempus fermentum. Suspendisse vehicula urna in est luctus iaculis. Mauris semper pretium quam eu cursus. Aenean pretium consectetur mi, ac congue urna semper sit amet. Nulla facilisi. Donec convallis eros ut pharetra venenatis. Nam eget dictum arcu, at blandit tortor. Donec eget varius felis. Proin nisi dui, molestie rutrum ex non, ultrices posuere justo. Mauris varius augue nec efficitur lacinia. Praesent posuere nibh sed purus rutrum vulputate. Ut quis enim gravida, finibus orci ac, molestie turpis. Cras dignissim augue a felis accumsan fermentum. Sed lobortis lobortis vulputate. Mauris eget fermentum lacus, id venenatis mi. Cras ut mattis lectus.
<!-- /§2 ¶3 p. 5 -->

<!-- §2 ¶4 pp. 5–6 -->
Integer aliquet maximus rhoncus. Donec in lobortis libero, nec mattis orci. Nunc accumsan elit sit amet lorem hendrerit molestie. Maecenas sodales nibh magna, sed volutpat purus blandit volutpat. Nam cursus eros libero, et rhoncus turpis laoreet et. Aliquam sit amet urna egestas, viverra quam ac, pellentesque erat. Pellentesque in lacus nisl. Donec ac commodo nulla.
<!-- /§2 ¶4 pp. 5–6 -->

In this example the first paragraph is section 2, paragraph 3, and it starts and ends on page 5. The next paragraph is still section 2, but we’re now in paragraph 4, and this paragraph starts on page 5 and ends on page 6.

It definitely requires some preprocessing on your end, but the end result should let you reliably identify where a reference comes from.
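
If it helps, here’s a minimal sketch of that kind of preprocessing using the pypdf library. It only adds page-level markers (the section and paragraph numbering shown above would need extra logic), and the file names are placeholders:

from pypdf import PdfReader

def pdf_to_marked_markdown(pdf_path: str) -> str:
    # Extract text page by page and wrap each page in HTML-style
    # comment markers so page numbers survive chunking.
    reader = PdfReader(pdf_path)
    parts = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        parts.append(f"<!-- p. {page_number} -->")
        parts.append(text.strip())
        parts.append(f"<!-- /p. {page_number} -->")
        parts.append("")  # blank line between pages
    return "\n".join(parts)

markdown = pdf_to_marked_markdown("my_document.pdf")  # placeholder path
with open("my_document.md", "w", encoding="utf-8") as f:
    f.write(markdown)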

There’s some risk the model might catch only the opening marker for one paragraph and the closing marker for the next, and depending on how they’re arranged and pieced together it could confuse the model as to where exactly something comes from.

But, I think absent constructing your own RAG implementation or paying for someone else’s, this is the best bet you’ve got to be able to (mostly) reliably cite locations in a PDF document.

Here are some other benefits of extracting the text from the PDF yourself and converting it to markdown:

  1. If your PDF has tables in it, you can extract those tables as markdown to include in your vector store, but you can also convert each markdown table to a CSV file. Then, if the model ever needs to compute something from your table data (or even just extract the table data), it isn’t stressing the attention mechanism; it can just read the CSV file into an object in a Code Interpreter instance. One CSV file for each table (there’s a small sketch of this after the list).
  2. You can distill your knowledge. Documents are wordy; this post is way too wordy. If I were a better writer, I could distill this post down to just the actual facts contained within. So, you can have your big, full document, but also have a “Joe Friday” version—just the facts, ma’am.
  3. You can synthesize new pseudo-documents. I don’t know if you’ve ever been working through a dense textbook, let’s say math, where the author keeps referencing various definitions, theorems, and corollaries and you need to keep flipping around in the book to see what they are. Well, the model needs to do that too, and it’s just as confusing for it as it is for us. No more! If you’ve extracted the text you can identify all those different references, collect their text, then everywhere the references are found you can substitute in their actual definitions, like a really big and complicated text expander. Then, taking a section you’ve augmented, you can ask a model to create a nice, streamlined, linear, standalone document incorporating everything in one place, being sure to keep references for where everything was sourced from. Then, save those as their own documents (hopefully under 800 tokens each), and you’ve built up your own high-quality reference library. You can have up to 10,000 files in a vector store, so use them!
  4. If your PDF has graphs or images you can, as a separate (but concurrent) step in your text extraction work, extract those as images and do all sorts of stuff with them.
    a. Send them to a vision model and get high-quality descriptions of them to insert into your big markdown document, so the model can be aware of them during retrieval (a sketch of this also follows the list).
    b. Host them on your own domain in an img folder, then name them something simple like docname_page_img-id.png, then in your markdown document you can insert a markdown image link with the generated caption as the alt text. Then, when your assistant wants to reference an image in the document it can dump out the appropriate markdown to render the image in a user’s browser.
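
For point 1, here is a minimal sketch of the markdown-table-to-CSV step, assuming you’ve already pulled the table text out of your markdown document (the table contents and file name are made up for illustration):

import csv

def markdown_table_to_csv(markdown_table: str, csv_path: str) -> None:
    # Convert a simple markdown table into a CSV file Code Interpreter can load directly.
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the |---|---| separator row.
        if all(set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

table = """
| Quarter | Revenue |
| ------- | ------- |
| Q1      | 1,200   |
| Q2      | 1,350   |
"""
markdown_table_to_csv(table, "table_page_4.csv")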
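
And for points 4a and 4b, a hedged sketch of captioning an extracted figure with a vision-capable model and turning the result into a markdown image link. The model choice, prompt, file name, and hosting URL are all my own assumptions, not anything File Search requires:

import base64
from openai import OpenAI

client = OpenAI()

def caption_image(image_path: str) -> str:
    # Ask a vision-capable model to describe an extracted figure.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model should work
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this figure in two or three sentences."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

caption = caption_image("docname_p5_img-01.png")        # hypothetical extracted image
url = "https://example.com/img/docname_p5_img-01.png"   # hypothetical hosting location
print(f"![{caption}]({url})")                           # markdown to drop into your document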

Oops! We’re kind of getting into the thick of the woods now, aren’t we?

At this point we’re still using OpenAI’s RAG solution, but by doing all this preprocessing we’re going to be able to get much better results out of it.

Plus, when you inevitably decide you really need to build your own RAG solution you’ll have two huge advantages:

  1. Much of the work you would need to do to get started will already be done, and
  2. You’ll have developed a much deeper understanding of your data (and data in general) which will aid you enormously as you try to squeeze every last bit of value out of it.

Anyway, yes, it would be amazing if OpenAI had a SOTA RAG setup that was dead-simple, turnkey, and had all the bells and whistles. But they don’t, and they’re unlikely to any time soon (if ever); it’s just (I think) not really a central focus of their research. There are lots of other companies out there doing great work in this space, many of which are very price-competitive with File Search, especially when you consider the control you gain over the number of tokens you send to the expensive models.

I’m sorry I can’t simply bippity-boppity-boop File Search and make it perfect, but hopefully I’ve given you (and anyone else reading this) some ideas which might help to get the most out of it.

(I’m also sorry this grew so long, I really did think I’d be able to whip up a quick response…)

Thank you for this comprehensive response! It will certainly help us out here and the community! 🙂

Just adding the information that there is some flexibility in the chunking strategy. The default is 800-token chunks, an overlap of 400 tokens, and a maximum of 20 chunks returned to context. You can raise that to a chunk size of up to 4096 tokens, an overlap of up to 2048 (at most half the chunk size), and up to 50 chunks in context. [https://platform.openai.com/docs/assistants/tools/file-search/customizing-file-search-settings]
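
For anyone who wants to see where those settings plug in, here’s a rough sketch with the OpenAI Python SDK (the values and names are just examples, and depending on your SDK version the vector store methods may live under client.beta instead):

from openai import OpenAI

client = OpenAI()

# Create a vector store and attach a file with a custom static chunking strategy.
vector_store = client.vector_stores.create(name="pdf-knowledge")
with open("my_document.md", "rb") as f:
    uploaded = client.files.create(file=f, purpose="assistants")

client.vector_stores.files.create(
    vector_store_id=vector_store.id,
    file_id=uploaded.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 1600,  # default is 800
            "chunk_overlap_tokens": 400,    # at most half the chunk size
        },
    },
)

# The number of chunks pulled into context is set on the file_search tool itself.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    tools=[{"type": "file_search", "file_search": {"max_num_results": 50}}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)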