How can I access the specific text of the file that the annotation is referencing?

I’m trying to create a feature where an annotation in a response from an assistant (with file-search enabled) can be clicked, and the specific text that it references from a PDF will be highlighted. However, although the documentation says that there is a “quote” field in the annotation, this field never shows up:

Any sort of indication of where the annotation is referencing in the file (e.g. a text offset, quote, etc.) would be helpful, but there doesn’t seem to be any way to find any meaningful subsection of a file to which the assistant is referring in its annotations.

6 Likes

The AI of assistants v2 no longer has a method to “mark” text for an annotation. It can only refer to a file ID.

1 Like

Is there any workaround to get annotations with direct quotes?

You could write your own vector database API tool, with instructions that replicate the earlier selection and “marking” behavior, in which the file browser received its results back with line numbers. Perhaps a tool named “line_number_range_of_document_ID_to_offer_as_documentation_download_in_user_interface_before_response_to_user”?

The v2 AI can no longer write the kind of output that the API backend used to parse into annotations giving ranges of text.
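To make the idea of a self-written "marking" tool more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the tool name `cite_line_range`, its parameters, and the in-memory `documents` store are hypothetical, and only the outer `{"type": "function", "function": {...}}` shape follows the function-tool format. The model would be shown line-numbered text and asked to call the tool with the range it relied on.

```python
# Hypothetical function tool; the name and parameters are illustrative only.
CITE_TOOL = {
    "type": "function",
    "function": {
        "name": "cite_line_range",
        "description": "Mark the lines of a document that support the answer.",
        "parameters": {
            "type": "object",
            "properties": {
                "document_id": {"type": "string"},
                "start_line": {"type": "integer", "minimum": 1},
                "end_line": {"type": "integer", "minimum": 1},
            },
            "required": ["document_id", "start_line", "end_line"],
        },
    },
}


def number_lines(text: str) -> str:
    """Prefix each line with its 1-based number, as it would be shown to the model."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1)
    )


def resolve_citation(
    documents: dict[str, str], document_id: str, start_line: int, end_line: int
) -> str:
    """Return the cited lines (1-based, inclusive) from your own document store."""
    lines = documents[document_id].splitlines()
    if not (1 <= start_line <= end_line <= len(lines)):
        raise ValueError("line range outside document")
    return "\n".join(lines[start_line - 1 : end_line])
```

With this approach the quote comes from your own copy of the document, so it does not depend on what the annotation object contains.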

That’s right – we don’t have support for quotes from the file at the moment. We’ll work on adding support for this!

4 Likes

Is there an open issue for this? I found this in the OpenAPI repo: quote is non-nullable, so all responses will fail client response validation.

Please, this is very important. We have no other way, because we have no access to the vector store!

Hello,

Are there any updates on the progress for adding support for this?

Thanks!

1 Like

Is there any update on this?

As of now, the ability to see the quoted text is a critical feature. Currently GPT often returns multiple citations to the same file. Without the referenced text they are redundant, and our users are asking why there are multiple redundant citations to the same file. More importantly, most of the value of a search tool is in finding relevant information, not just relevant documents.

My team is now actively developing non-OpenAI alternatives, which we will switch over to as soon as we have a working solution.

2 Likes

I second this.

The removal of the quoted text feature from OpenAI’s API has significantly hindered the technology’s effectiveness. This functionality previously provided verifiable sources, making it easier for users to identify and eliminate inaccuracies. Without it, end users must weed through large amounts of information, which makes hallucinations harder to detect. This is especially important when detecting hallucinations is critical to the scientific process. Please restore it.

5 Likes

Come on guys… Still no update? It’s been almost 4 months since this post, stating that you will work on it. Really, any update on the progress would be very appreciated.
Cheers.

3 Likes

Any updates on this matter? The ‘quote’ information is a critical part of the file_search functionality, and its absence makes the feature unreliable. This should be a top priority.

1 Like

They did recently add the ability to view the results of the assistant’s search i.e. the full chunks returned from the vector store. See here.

This is still far from satisfactory, though: the chunks are often far too big to use as citations, and the top-ranked chunk is not necessarily the one the assistant ultimately draws its answer from.

A direct quote or quotes is pretty essential for any RAG application. Hopefully they add this soon.
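For anyone who wants to read those chunks programmatically, here is a rough sketch. It assumes the run step is available as a plain dict and that, when the step is retrieved with the `include` value below, it carries `step_details.tool_calls[].file_search.results[]` entries with `file_id`, `file_name`, `score` and a `content` list of text parts. That shape is my reading of the documentation, not something confirmed in this thread, so check it against a real response. `extract_chunks` is a hypothetical helper name.

```python
from typing import Any

# Value to pass in the `include` list when retrieving a run step
# (assumed from the documentation; verify against your SDK version).
INCLUDE_RESULTS = "step_details.tool_calls[*].file_search.results[*].content"


def extract_chunks(run_step: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten the file_search results of one run step into a list of chunks."""
    chunks = []
    details = run_step.get("step_details", {})
    for call in details.get("tool_calls", []):
        if call.get("type") != "file_search":
            continue  # skip code_interpreter / function calls
        for result in call.get("file_search", {}).get("results", []):
            # Each result may hold several text parts; join them into one string.
            text = "\n".join(
                part.get("text", "")
                for part in result.get("content", [])
                if part.get("type") == "text"
            )
            chunks.append(
                {
                    "file_id": result.get("file_id"),
                    "file_name": result.get("file_name"),
                    "score": result.get("score"),
                    "text": text,
                }
            )
    return chunks
```

The output is a flat list of chunk dicts, which is easier to post-process than the nested step object. It still only gives whole chunks, not the exact passage the model used.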

1 Like

Yes, I saw this as well, and it is a very welcome improvement! It is still somewhat ambiguous, though, since the search results we can now inspect in the logs do not seem to match exactly what gets passed to the model; for example, the model does not know about the file names. Let’s hope they prioritize File Search and the API a little more. I’m looking forward to dev day and hope they showcase some new stuff there.

1 Like

I’m facing this issue too. I’m going to solve it with the following approach:

  1. Gather the cited chunks and answer. Cited chunks have the file_id and the file_name in their response.
  2. Feed these into GPT and ask “extract the sub-strings and the file_name this answer has been conditioned on”. Now you should get a searchable string and the file_name to search the document with.

Unfortunately, this has the drawback of basically doubling the compute cost, since you have to feed in the chunks as the prompt twice :sweat_smile:
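The two steps above could be sketched roughly as follows. This is only an illustration: the prompt wording, the JSON reply format, and the helper names `build_extraction_prompt` and `keep_verbatim_quotes` are my assumptions, and the actual model call is left out. Each chunk is assumed to be a dict with `file_name` and `text` keys. The second helper drops any returned quote that does not literally occur in a chunk from the named file, which guards against the extraction step inventing text.

```python
import json


def build_extraction_prompt(answer: str, chunks: list[dict]) -> str:
    """Build the second-pass prompt from the answer and the cited chunks."""
    sources = "\n\n".join(f"[{c['file_name']}]\n{c['text']}" for c in chunks)
    return (
        "Extract the exact sub-strings of the sources that the answer "
        "has been conditioned on.\n"
        'Reply with JSON only: [{"file_name": "...", "quote": "..."}]\n\n'
        f"Answer:\n{answer}\n\nSources:\n{sources}"
    )


def keep_verbatim_quotes(model_reply: str, chunks: list[dict]) -> list[dict]:
    """Parse the reply and keep only quotes found verbatim in the named file."""
    try:
        items = json.loads(model_reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        name, quote = item.get("file_name"), item.get("quote")
        if not isinstance(quote, str) or not quote:
            continue
        # Accept the quote only if it is a literal substring of a matching chunk.
        if any(c["file_name"] == name and quote in c["text"] for c in chunks):
            kept.append({"file_name": name, "quote": quote})
    return kept
```

Send the prompt to whichever model you prefer, then pass its raw reply to `keep_verbatim_quotes`. The surviving quotes are searchable strings paired with a file name.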

1 Like

Let us know how you get on. Maybe 4o-mini could do that. You could use a structured response too.

Yeah, I tried something similar, although that was before they added the viewable search results. We use the cited File IDs, retrieve the file names, use Google Drive to find the file with that name (yes, we also uploaded all our RAG docs to Google Drive), and then append that to the citations so that users can view the cited documents in the chat interface via an iframe. Now, with the search results, you could do something similar: get the cited files, take a reference string from the cited chunk, implement a file system for displaying the files, and perform an automatic search to “jump” to that part of the file. However, I am still holding out some hope that they will improve this, and I am looking forward to dev day. If there’s no update, then I guess this is what we’ll have to revert to.
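The "jump to that part of the file" step could look something like the sketch below, assuming you already have the document's extracted text and a reference string taken from the cited chunk. `locate_quote` is a hypothetical helper. It tolerates differences in whitespace and letter case, which are common in text extracted from PDFs, but it is not a fuzzy matcher: if the extracted words themselves differ, it returns `None`.

```python
import re
from typing import Optional


def locate_quote(document_text: str, quote: str) -> Optional[tuple[int, int]]:
    """Return (start, end) offsets of the quote in the document, or None.

    Any run of whitespace in the quote matches any run of whitespace in the
    document, so line breaks introduced by PDF extraction do not matter.
    """
    words = quote.split()
    if not words:
        return None
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, document_text, flags=re.IGNORECASE)
    return match.span() if match else None
```

The returned offsets can drive a highlight or a scroll position in whatever viewer displays the file.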

Check my response in another thread: Assistant file search text retrieval - #13 by mambozzo

1 Like