How do you find the cited bit of text with file search?

Hey folks, I’ve been trying to use the new file_search tool, but the way annotations work, I can’t actually get the text that led to the annotation. An annotation looks like this:

    file_id: "..."
    filename: "...."
    index: 1535
    type: "file_citation" 

Note the index. What is that index supposed to be? How am I supposed to use it?

I uploaded a docx file, and an index into the docx file itself wouldn’t make sense (it would just point at a random byte of zipped data) – presumably the file gets converted into markdown server-side and the index refers to that.

So I tried to retrieve the file so I could work out what the index was, but I got a 400 error telling me that “user_data” files can’t be directly downloaded. How am I then supposed to use the index?

I even tried including the full results of the file search (requested via the include option, so they show up on the file_search call item in the output) to see if I could match the annotations against them and at least surface the snippet from the search. But there is no actual field linking the two together. I have no way of telling which search result was actually used – the only common key is the file_id, and there can be multiple results from the same file.
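To make the dead end concrete, here is a minimal sketch with plain dicts standing in for the API objects (field names are assumptions, not authoritative): the only join available is on file_id, and it is one-to-many.

```python
# Annotation as it comes back: no field pointing at a specific search result.
annotation = {"type": "file_citation", "file_id": "file-abc",
              "filename": "report.docx", "index": 1535}

# File search results (shape assumed): several chunks can share one file_id.
results = [
    {"file_id": "file-abc", "score": 0.91, "text": "Revenue grew 12% in Q3..."},
    {"file_id": "file-abc", "score": 0.88, "text": "Headcount was flat..."},
    {"file_id": "file-xyz", "score": 0.74, "text": "Unrelated document text..."},
]

# The only possible join is on file_id -- and it is one-to-many.
candidates = [r for r in results if r["file_id"] == annotation["file_id"]]
print(len(candidates))  # 2 -- no way to tell which chunk the citation used
```

With two candidate chunks from the same file, nothing in the annotation disambiguates which one the model actually drew on.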

The index can’t be the index of the search result, either, because it’s usually a big number like 1535, and I typically limit file_search to 20 results.

We use file_search as a core part of our solution. I’m not an OpenAI insider so you should take what I say with a grain of salt. But here’s my understanding…

When file_search is enabled, the AI will generate a search query based on the message (question) and pass it to the file_search tool. That tool does a vector search to find matching chunks from your vector store. (Each uploaded file is broken into chunks that are, roughly, 600 words each.)

A vector search isn’t like an old-fashioned keyword search. It doesn’t find a specific phrase match. It does some vector math to compare the embedding for the query with the embeddings for each of the chunks. Essentially it is looking for the chunks that are semantically closest in meaning.

The file search tool then returns the best matching chunks, and these are added into the context along with which document those chunks came from. The AI then continues generation, and part of what it generates are indicators (citations) about how it decided to use certain chunks in its answer. But there is no traceability back to a specific line of text in your original file – only to which file it was in.

That “index” that you see in the annotation is useless. It’s just the index of the file among all of the uploaded files in the vector store. It is the file_id or filename that you’ll use to tie this back to the original document, but there’s no information about where in that file the chunk (and certainly not the line) appeared.

Here’s my understanding of the index, without going back to the source (an OpenAI staff member) – it still leaves us hanging a bit.

The index has recently been described as follows: the AI is instructed to produce citation markers formatted with Japanese-style brackets in its output. Rather than those markers appearing in the plaintext you receive, the endpoint backend strips them and records each one’s position as an index. That character position – in the text before any further processing you may do to render it – is where you can place your own notation, footnote, pop-up, or link. It refers you back to the annotation list that you receive.
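If that reading is right, using the index is mechanical: it’s a character offset into the returned text, so you can splice your own markers in at those positions. A minimal sketch (annotation values are made up; the splice runs right-to-left so earlier offsets stay valid):

```python
text = "Revenue grew 12% in Q3. Headcount stayed flat."

# Annotations as received: each index is a character position in `text`
# where the backend stripped the model's citation marker.
annotations = [
    {"type": "file_citation", "index": 23, "file_id": "file-abc"},
    {"type": "file_citation", "index": 46, "file_id": "file-xyz"},
]

# Insert footnote markers from the end backwards, so splicing one marker
# does not shift the offsets of the markers before it.
for n, ann in sorted(enumerate(annotations, 1),
                     key=lambda p: p[1]["index"], reverse=True):
    i = ann["index"]
    text = text[:i] + f"[{n}]" + text[i:]

print(text)  # Revenue grew 12% in Q3.[1] Headcount stayed flat.[2]
```

Each [n] marker can then be rendered as a link or pop-up pointing at the nth entry in the annotation list.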

Large chunks are awkward as a reference, though; still, somewhere within that roughly-600-word vector store chunk is the fact that was reproduced or needed for the answer.

Run steps in the Assistants API give you the full ranked chunks, deep in a nest of objects. These are what the in-document citations are citing.
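A sketch of digging those ranked chunks out of that nest, using a hand-built dict that approximates the run-step shape (the attribute names here are my assumption – check them against what your SDK actually returns):

```python
# Stand-in for a retrieved run step; the real object comes from the
# Assistants API and the exact field names may differ.
run_step = {
    "step_details": {
        "type": "tool_calls",
        "tool_calls": [
            {
                "type": "file_search",
                "file_search": {
                    "results": [
                        {"file_id": "file-abc", "score": 0.91,
                         "content": [{"type": "text",
                                      "text": "Revenue grew 12% in Q3..."}]},
                        {"file_id": "file-abc", "score": 0.88,
                         "content": [{"type": "text",
                                      "text": "Headcount was flat..."}]},
                    ]
                },
            }
        ],
    }
}

# Walk the nest and pull out every (file_id, score, text) triple.
chunks = []
for call in run_step["step_details"]["tool_calls"]:
    if call["type"] == "file_search":
        for r in call["file_search"]["results"]:
            for part in r["content"]:
                if part["type"] == "text":
                    chunks.append((r["file_id"], r["score"], part["text"]))

print(len(chunks))  # 2 ranked chunks recovered from the run step
```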

The Chat Completions version of web search, by contrast, gives you a character range in the text, so the citation can even be highlighted:

start_index: The index of the first character of the URL citation in the message.
end_index: The index of the last character of the URL citation in the message.

…etc. This could be documented and answered with code example to perform the task, but would be very application specific.
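As a sketch of what such a code example might look like: given a url_citation-style annotation with start_index and end_index (field names taken from the quoted docs; the annotation values here are made up), slice the message text and wrap the cited range, markdown-style:

```python
message = "See the Q3 report for details on revenue growth."

# A url_citation-style annotation: start/end are character offsets
# into `message`.
ann = {"type": "url_citation", "start_index": 8, "end_index": 17,
       "url": "https://example.com/q3", "title": "Q3 report"}

# Depending on the API, end_index may be inclusive; here it is treated
# as an exclusive Python slice bound.
cited = message[ann["start_index"]:ann["end_index"]]

# Wrap the cited range as a markdown link so it renders highlighted.
highlighted = (message[:ann["start_index"]]
               + f"[{cited}]({ann['url']})"
               + message[ann["end_index"]:])

print(cited)  # Q3 report
print(highlighted)
```

The same splice works for any renderer – swap the markdown link for a `<mark>` tag or a pop-up anchor as your application requires.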

Now, the file search documentation:

    {
      "type": "object",
      "title": "File citation",
      "description": "A citation to a file.",
      "properties": {
        "type": {
          "type": "string",
          "description": "The type of the file citation. Always file_citation.",
          "enum": ["file_citation"]
        },
        "index": {
          "type": "integer",
          "description": "The index of the file in the list of files."
        },
        "file_id": {
          "type": "string",
          "description": "The ID of the file."
        }
      },
      "required": ["type", "index", "file_id"]
    }

Indeed, what could that mean? Do you have a list with at least 1,535 files somewhere? Probably not. (A mental bookmark to write some code and characterize the behavior.)