I have created an assistant which has the File Search feature enabled and has been given access to a PDF containing information. Say that this PDF contains text which is repeated in many places, e.g. boilerplate footer text that is 100 words long repeated on many pages, or whole passages that are identical to each other. Does this affect the effectiveness of the Assistant? Will it ignore the repetitions, or will they interfere with retrieval?
I suspect that a file uploaded to a GPT is either converted into embeddings or placed directly into the prompt. If it’s embeddings, then the file is searched for a limited number of chunks that are relevant to the prompt, and these chunks are included in the prompt. If there’s a lot of repetition, it could reduce the number of useful, distinct chunks that make it into the final prompt.
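If that suspicion is right, the mechanics would look roughly like the sketch below: embed every chunk, embed the query, and paste only the top-k most similar chunks into the prompt. The embedding model, the value of k, and the chunk list are placeholders, not anything confirmed about how GPTs work internally.

```python
# Minimal sketch of the suspected retrieval flow (model name and k are assumptions).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_chunks(chunks: list[str], query: str, k: int = 20) -> list[str]:
    chunk_vecs = embed(chunks)
    query_vec = embed([query])[0]
    # Cosine similarity between the query and every chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]
```

If many chunks carry the same 100-word footer, several of those k slots can end up holding near-identical text, which is exactly the crowding-out effect described above.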
How a prompt is then translated into a response is still somewhat of a mystery to me! Attention is apparently all you need… if you can understand that paper!
The methods currently documented by OpenAI for document extraction, splitting, and ingestion are less than adequately explained.
If one were to write a technical report, here’s a possible current chunking strategy one might infer; the true internals cannot be ascertained from what has been put forth, only approximated with an understanding of BPE tokenization and of UTF-8 multibyte characters that can themselves span multiple tokens (a rough code sketch follows the list):
- A document has been uploaded.
- Binary document files are passed through a text extractor.
- Runs of extracted text are encoded with a tokenizer.
- Split points are identified every 800 tokens.
- Approximately 200 (or 400?) additional tokens are read from the preceding or following section, giving roughly 1,200-token chunks.
- (No content-aware choice of split points, and no metadata supplementation, is mentioned.)
- Chunks are stored in a vector database, along with the 256-dimension embeddings obtained for each.
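Here is a minimal sketch of that inferred pipeline, assuming a cl100k_base tokenizer, an 800-token window with a 400-token overlap, and text-embedding-3-large truncated to 256 dimensions; every one of those numbers and names is an inference from the list above, not a confirmed internal.

```python
# Sketch of the inferred chunking + embedding step; all parameters are guesses.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer choice is an assumption

def chunk_tokens(text: str, chunk_size: int = 800, overlap: int = 400) -> list[str]:
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        # Decoding a raw token slice can split a multibyte UTF-8 character,
        # the very split-point ambiguity raised above.
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(
        model="text-embedding-3-large",  # assumed model
        input=chunks,
        dimensions=256,                  # the 256-dimension figure from above
    )
    return [d.embedding for d in resp.data]
```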
Those chunks are then ready for semantic retrieval, which runs against a search-query sentence that the AI writes.
- “Query rewriting”: is it merely that the AI asks a question, or is there a further AI step that transforms the query into something aligned with declarative, uninquisitive documentation text? (A speculative sketch follows below.)
Then the top results are returned for a search query, in addition to arbitrary documentation that is automatically placed in context before any searching, without any apparent threshold.
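If such a rewriting step exists, it might look something like this purely speculative sketch: a second model call turns the user’s question into declarative text that reads more like the documentation it is meant to match. The model name and instruction are placeholders; nothing confirms File Search actually does this.

```python
# Speculative query-rewriting step; not a documented part of File Search.
from openai import OpenAI

client = OpenAI()

def rewrite_query(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Rewrite the user's question as a short declarative "
                        "statement of the information being sought, suitable "
                        "as a semantic search query."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# e.g. "What does the warranty cover?" -> "The warranty covers ..."
```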
So: if the AI searches for text that matches a footer, there will likely be many results, especially if the chunk size is near parity with the page size.
We’d expect embeddings not to weight the repeated text as heavily, but it can still bias results for other search terms toward whether a chunk does or doesn’t contain the repeated phrase. How skewed the top results are depends on how closely the repeated text aligns with the query.
Then, besides the possibility of retrieving multiple overlapping chunks (reassembly of adjacent chunks into contiguous document text is also not mentioned), the AI would have to read and understand new knowledge interleaved with up to 20 repetitions unrelated to the results. This might not be a huge problem, just a large waste of context that could have carried useful knowledge tokens instead.
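In your own retrieval pipeline you could at least filter near-duplicate chunks before they consume those slots; a rough sketch is below. The 0.95 threshold is arbitrary, and nothing suggests the Assistants File Search tool exposes such a control.

```python
# Drop retrieved chunks that are near-duplicates of ones already kept.
import numpy as np

def drop_near_duplicates(chunks: list[str], vecs: np.ndarray,
                         threshold: float = 0.95) -> list[str]:
    kept: list[int] = []
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors
    for i in range(len(chunks)):
        # Keep the chunk only if it is not too similar to any chunk kept so far.
        if all(float(normed[i] @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return [chunks[i] for i in kept]
```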
In PDFs, headers and footers can often be treated as metadata and excluded from the searchable text, depending on how the document was produced.
So: providing clean text will always be best in the face of mystery extraction, and since there is no practical file-count limit, you can also explore producing your own non-chunked, non-overlapped .txt knowledge snippets that stay under the thresholds mentioned above.
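A sketch of that last idea, using a hypothetical footer string and the 800-token figure inferred earlier: strip the repeated boilerplate, then write each section to its own .txt file only if it stays under the threshold, so the service never needs to split or overlap it.

```python
# Produce your own clean, pre-sized .txt knowledge snippets (thresholds inferred).
from pathlib import Path
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # tokenizer choice is an assumption
FOOTER = "Example Corp - Confidential - ..."     # hypothetical 100-word boilerplate

def write_snippets(sections: list[str], out_dir: str = "snippets",
                   max_tokens: int = 800) -> None:
    Path(out_dir).mkdir(exist_ok=True)
    for i, section in enumerate(sections):
        clean = section.replace(FOOTER, "").strip()
        if len(enc.encode(clean)) > max_tokens:
            print(f"section {i} exceeds {max_tokens} tokens; split it further")
            continue
        Path(out_dir, f"snippet_{i:03}.txt").write_text(clean, encoding="utf-8")
```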