How do you find the cited bit of text with file search?

Hey folks, I’ve been trying to use the new file_search tool, but the way annotations work, I can’t actually get the text that led to the annotation. An annotation looks like this:

    file_id: "..."
    filename: "...."
    index: 1535
    type: "file_citation" 

Note the index. What is that index supposed to be? How am I supposed to use it?

I uploaded a docx file, and an index into the docx file itself wouldn’t make sense (it would just point at a random byte of zipped data) – presumably the file gets converted into markdown server-side and the index refers to that.

So I tried to retrieve the file so I could work out what the index was, but I got a 400 error telling me that “user_data” files can’t be directly downloaded. How am I then supposed to use the index?

I even tried including the full results of the file search (requested via the include option, so they show up on the file_search call item in the output) to see if I could match the annotations against them and at least surface the snippet from the search. But there is no actual field linking the two together. I have no way of telling which search result was actually used – the only common key is the file_id, and there can be multiple results from the same file.
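To make the dead end concrete, here is a minimal sketch with plain dicts standing in for the API objects (field names are assumptions, not authoritative): the only join available is on file_id, and it is one-to-many.

```python
# Annotation as it comes back: no field pointing at a specific search result.
annotation = {"type": "file_citation", "file_id": "file-abc",
              "filename": "report.docx", "index": 1535}

# File search results (shape assumed): several chunks can share one file_id.
results = [
    {"file_id": "file-abc", "score": 0.91, "text": "Revenue grew 12% in Q3..."},
    {"file_id": "file-abc", "score": 0.88, "text": "Headcount was flat..."},
    {"file_id": "file-xyz", "score": 0.74, "text": "Unrelated document text..."},
]

# The only possible join is on file_id -- and it is one-to-many.
candidates = [r for r in results if r["file_id"] == annotation["file_id"]]
print(len(candidates))  # 2 -- no way to tell which chunk the citation used
```

With two candidate chunks from the same file, nothing in the annotation disambiguates which one the model actually drew on.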

The index can’t be the index of the search result, either, because it’s usually a big number like 1535, and I typically limit file_search to 20 results.

We use file_search as a core part of our solution. I’m not an OpenAI insider so you should take what I say with a grain of salt. But here’s my understanding…

When file_search is enabled, the AI will generate a search query based on the message (question) and pass it to the file_search tool. That tool does a vector search to find matching chunks from your vector store. (Each uploaded file is broken into chunks that are, roughly, 600 words each.)

A vector search isn’t like an old-fashioned keyword search. It doesn’t find a specific phrase match. It does some vector math to compare the embedding for the query with the embeddings for each of the chunks. Essentially it is looking for the chunks that are semantically closest in meaning.

The file search tool then returns the best matching chunks, and these are added into the context along with which document those chunks came from. The AI then continues generation, and part of what it generates are indicators (citations) about how it decided to use certain chunks in its answer. But there is no traceability back to a specific line of text in your original file – only to which file it was in.

That “index” that you see in the annotation is useless. It’s just the index of the file among all of the uploaded files in the vector store. It is the file_id or filename that you’ll use to tie this back to the original document, but there’s no information about where in that file the chunk (and certainly not the line) appeared.

Here’s my understanding of the index, without going back to the source (an OpenAI staff member) – it still leaves us hanging a bit.

The index has recently been described as follows: the AI is instructed to produce citation markers formatted with Japanese-style brackets in its output. Rather than those markers appearing in the plaintext you receive, the endpoint backend strips them and records each one’s position as an index. That character position – in the text before any further processing you may do to render it – is where you can place your own notation, footnote, pop-up, or link. It refers you back to the annotation list that you receive.
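If that reading is right, using the index is mechanical: it’s a character offset into the returned text, so you can splice your own markers in at those positions. A minimal sketch (annotation values are made up; the splice runs right-to-left so earlier offsets stay valid):

```python
text = "Revenue grew 12% in Q3. Headcount stayed flat."

# Annotations as received: each index is a character position in `text`
# where the backend stripped the model's citation marker.
annotations = [
    {"type": "file_citation", "index": 23, "file_id": "file-abc"},
    {"type": "file_citation", "index": 46, "file_id": "file-xyz"},
]

# Insert footnote markers from the end backwards, so splicing one marker
# does not shift the offsets of the markers before it.
for n, ann in sorted(enumerate(annotations, 1),
                     key=lambda p: p[1]["index"], reverse=True):
    i = ann["index"]
    text = text[:i] + f"[{n}]" + text[i:]

print(text)  # Revenue grew 12% in Q3.[1] Headcount stayed flat.[2]
```

Each [n] marker can then be rendered as a link or pop-up pointing at the nth entry in the annotation list.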

Large chunks are awkward as a reference, though; still, somewhere within that roughly-600-word vector store chunk is the fact that was reproduced or needed for the answer.

Run steps in the Assistants API give you the full ranked chunks, deep in a nest of objects. These are what the in-document citations are citing.
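A sketch of digging those ranked chunks out of that nest, using a hand-built dict that approximates the run-step shape (the attribute names here are my assumption – check them against what your SDK actually returns):

```python
# Stand-in for a retrieved run step; the real object comes from the
# Assistants API and the exact field names may differ.
run_step = {
    "step_details": {
        "type": "tool_calls",
        "tool_calls": [
            {
                "type": "file_search",
                "file_search": {
                    "results": [
                        {"file_id": "file-abc", "score": 0.91,
                         "content": [{"type": "text",
                                      "text": "Revenue grew 12% in Q3..."}]},
                        {"file_id": "file-abc", "score": 0.88,
                         "content": [{"type": "text",
                                      "text": "Headcount was flat..."}]},
                    ]
                },
            }
        ],
    }
}

# Walk the nest and pull out every (file_id, score, text) triple.
chunks = []
for call in run_step["step_details"]["tool_calls"]:
    if call["type"] == "file_search":
        for r in call["file_search"]["results"]:
            for part in r["content"]:
                if part["type"] == "text":
                    chunks.append((r["file_id"], r["score"], part["text"]))

print(len(chunks))  # 2 ranked chunks recovered from the run step
```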

The Chat Completions version of web search, by contrast, gives you a character range in the text, so the citation can even be highlighted:

start_index: The index of the first character of the URL citation in the message.
end_index: The index of the last character of the URL citation in the message.

…etc. This could be documented and answered with code example to perform the task, but would be very application specific.
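As a sketch of what such a code example might look like: given a url_citation-style annotation with start_index and end_index (field names taken from the quoted docs; the annotation values here are made up), slice the message text and wrap the cited range, markdown-style:

```python
message = "See the Q3 report for details on revenue growth."

# A url_citation-style annotation: start/end are character offsets
# into `message`.
ann = {"type": "url_citation", "start_index": 8, "end_index": 17,
       "url": "https://example.com/q3", "title": "Q3 report"}

# Depending on the API, end_index may be inclusive; here it is treated
# as an exclusive Python slice bound.
cited = message[ann["start_index"]:ann["end_index"]]

# Wrap the cited range as a markdown link so it renders highlighted.
highlighted = (message[:ann["start_index"]]
               + f"[{cited}]({ann['url']})"
               + message[ann["end_index"]:])

print(cited)  # Q3 report
print(highlighted)
```

The same splice works for any renderer – swap the markdown link for a `<mark>` tag or a pop-up anchor as your application requires.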

Now, the file search documentation:

    {
      "type": "object",
      "title": "File citation",
      "description": "A citation to a file.",
      "properties": {
        "type": {
          "type": "string",
          "description": "The type of the file citation. Always file_citation.",
          "enum": ["file_citation"]
        },
        "index": {
          "type": "integer",
          "description": "The index of the file in the list of files."
        },
        "file_id": {
          "type": "string",
          "description": "The ID of the file."
        }
      },
      "required": ["type", "index", "file_id"]
    }

Indeed, what could that mean? Do you have a list with at least 1,535 files somewhere? Probably not. (A mental bookmark to write some code and characterize the behavior.)