Assistant file search text retrieval

Hi,

Does anyone know if it is possible to retrieve the actual text that the assistant is using from the vector store? Logging the annotations only gives:
"annotations": [
  {
    "type": "file_citation",
    "text": "【18:0†source】",
    "start_index": 434,
    "end_index": 447,
    "file_citation": {
      "file_id": "file-nRl3w3civlx7o897DieUXGaO"
    }
  },
  {
    "type": "file_citation",
    "text": "【18:2†source】",
    "start_index": 447,
    "end_index": 460,
    "file_citation": {
      "file_id": "file-nRl3w3civlx7o897DieUXGaO"
    }
  }
]

I want to see the relevance of the data being retrieved from the vector store.

Thanks

1 Like

Looking for a similar solution, but have not found anything promising.

I managed to get some more info on how file search actually works, but unfortunately this is not documented…

Here’s a quick rundown.

The AI model you specified (GPT-3.5, GPT-4, GPT-4o, etc.) outputs a search query to the search tool. It looks like this:

msearch(["Search Query generated by the Assistant"])

Then File Search performs a semantic and keyword search to find the most relevant results. It seemed to me that, before the results are passed to the assistant, they get re-ranked or filtered, and only the top, most relevant results get passed.

The result(s) look like this:

[
  {
    "message_idx": 12,
    "search_idx": 0,
    "text": "Text from the file, i.e. the search result. This text is exactly as it is in your source document.",
    "source": "sourcefile.txt"
  }
]

Unfortunately, this is not visible in the logs of the run steps or anywhere similar, at least I could not find it. But I think the results above may be what you are looking for. I had to do some multi-step prompting to finally get the model to spit out the search results like this. It would be really helpful if OpenAI offered some more documentation on this.

I also posted a thread touching on this topic.

1 Like

Hey Aaron,
thanks for the insight. Another approach I was trying is to use the file_id, start_index, and end_index locally to fetch the text from the file. I had to rename the files to their corresponding file_id in the vector store. Still working on it. Maybe that will work for you.

On the same topic, do you know if this is the correct way to limit results:
const tools = [{
  "type": "file_search",
  "file_search": { "max_num_results": 3 } // Set the maximum number of results
}];

It does not seem to work, though.

Just realised this will not work, since the start_index and end_index in OpenAI's Assistants API annotations refer to the positions within the response text where the annotation is applied. Looking for a better way to retrieve the exact text.
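To illustrate what the indices actually point at (a sketch with a hypothetical message_text holding the assistant's reply and ann being one of its annotations):

# The indices point into the assistant's RESPONSE text, not into the source
# file: slicing the reply with them just returns the citation marker itself.
marker = message_text[ann.start_index:ann.end_index]
print(marker)  # e.g. 【18:0†source】, not the cited file content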

I'd probably do it like this. Although, I honestly just set it in the assistant settings in the UI on the OpenAI Platform, which worked fine.

tools=[{
    "type": "file_search",
    "file_search": {
        "max_num_results": 10
    }
}]

Yes, unfortunately this does not work. There are a couple of other threads and a post on an OpenAI repo about exact citations. They disabled this for v2, so there is no way to do it properly right now, but they are working on adding it back in.

Oh I see, I guess I’ll just have to wait. Thanks again!

So I have set it in the assistant settings as well, but that does not seem to work. What I am doing is creating a run using the assistant ID and tools, something like:

const createRun = async (threadId, tools = [{
  "type": "file_search",
  "file_search": { "max_num_results": 3 } // Set the maximum number of results
}]) => {
  try {
    const assistantConfig = await getAssistantConfig();
    const response = await axios.post(`https://api.openai.com/v1/threads/${threadId}/runs`, {
      assistant_id: assistantConfig.id,
      tools: tools
    }, {
      headers: {
        "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "assistants=v2"
      }
    });
    // console.log('Create Run Response:', response.data); // Debugging information
    return response.data;
  } catch (error) {
    console.error('Error creating run:', error.response?.data || error.message);
    throw new Error('Failed to create run');
  }
};

Am I doing something wrong?

Did you properly add the tool resources to specify the vector store ID? I haven't used the API with JS much, but I imagine it should work the same.

I am pretty sure my vector store is being used, since I get the annotations. The create-run function does not accept a vector store ID as a param. Do you suggest I update the assistant with the vector store ID:
await openai.beta.assistants.update(assistant.id, {
tool_resources: { file_search: { vector_store_ids: [vectorStore.id] } },
});
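For reference, it also seems possible to attach the vector store when creating the thread instead of updating the assistant. A Python sketch, assuming an existing client and vector store object (the same pattern appears in the script further down this thread):

# Attach the vector store to the thread rather than the assistant;
# `store` is assumed to be an existing vector store object.
thread = client.beta.threads.create(
    tool_resources={"file_search": {"vector_store_ids": [store.id]}}
)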

Hello,

Are there any admins who could give us an update about retrieving the content of each chunk during the ThreadRunStepCompleted event? I would rather get the content at the beginning of the thread run than at the end, since I could then do a regex match to replace the source links within the streaming text as they are created.
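Something like this rough sketch is what I have in mind (assuming the citation markers keep the default 【message_idx:search_idx†source】 format):

import re

# Replace citation markers like 【18:0†source】 in the streamed text
# with numbered footnote markers.
CITATION_RE = re.compile(r"【\d+:\d+†[^】]*】")

def replace_citations(text: str) -> str:
    counter = 0
    def _sub(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[{counter}]"
    return CITATION_RE.sub(_sub, text)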

Also, it would be helpful to get metadata about the page where the chunk was found; that way I could go straight to the page of the PDF. If we could attach metadata as a dict to each file, it would be awesome :grin:. I hope we get this by DevDay.

2024-09-30 20:02:57.077 | INFO | utils.llms.openai_interface:fetch_async_completion_ui_assistant_cognition:2095 -
ThreadRunStepCompleted(
  data=RunStep(
    id='step_eIaChGmuYbR0sOcQwyq72o6c',
    assistant_id='asst_8yeN71uoMRlgfJCwWU5DqxhO',
    cancelled_at=None, completed_at=1727726576, created_at=1727726574,
    expired_at=None, failed_at=None, last_error=None, metadata=None,
    object='thread.run.step',
    run_id='run_1CbVa6LDLuzwQFir4At8M99P',
    status='completed',
    step_details=ToolCallsStepDetails(
      tool_calls=[
        FileSearchToolCall(
          id='call_0ba5Sgahn5CFHSKlCekf2NqZ',
          file_search=FileSearch(
            ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21', score_threshold=0.0),
            results=[
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.7672119866780243, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.7528088810427128, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.7070330827631971, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.6966345728271346, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.6847471420043809, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.6629569015698628, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.6502266373159689, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.5404488313553234, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.49091998719388624, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.4793617847415654, content=None),
              FileSearchResult(file_id='file-IAkpJoqkzB4lnIrawSb9CIdx', file_name='IBD_MODULE1_UNIT2_SPECIALTY MALT AND ADJUNTS.pdf', score=0.4671717504211161, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.448181745750499, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.4390744436618422, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.4352800081354829, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.42263067872356186, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.4163856419064138, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.38283591224989805, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.34261438776851033, content=None),
              FileSearchResult(file_id='file-DQSCVlYRDexNK4zwxEpc7LJh', file_name='IBD_MODULE1_UNIT6_MASHING.pdf', score=0.3397802682377875, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.33443454299859265, content=None)
            ]
          ),
          type='file_search'
        )
      ],
      type='tool_calls'
    ),
    thread_id='thread_YETamtwz8vTjivRY0FaqqNdM',
    type='tool_calls',
    usage=Usage(completion_tokens=18, prompt_tokens=1856, total_tokens=1874),
    expires_at=1727727172
  ),
  event='thread.run.step.completed'
)
2024-09-30 20:02:57.078 | INFO | utils.llms.openai_interface:fetch_async_completion_ui_assistant_cognition:2394 - Usage: {'completion_tokens': 18, 'prompt_tokens': 1856, 'total_tokens': 1874}
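For context, here is roughly how I watch for that event on a streamed run (a sketch; client, thread, and assistant_id are assumed to already exist). Note that content comes back as None here unless it is explicitly included:

# Stream the run and inspect the completed file_search tool-call step.
stream = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant_id,
    stream=True,
)
for event in stream:
    if event.event == "thread.run.step.completed" and event.data.type == "tool_calls":
        for tool_call in event.data.step_details.tool_calls:
            if tool_call.type == "file_search":
                for result in tool_call.file_search.results:
                    print(result.file_name, result.score, result.content)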
1 Like

Hi,

We are hoping for the same thing. It seems the Assistants API and File Search didn't get any updates on DevDay :frowning:. We would also love to be able to add metadata to the files or chunks, especially for search filtering, and I really held out hope for DevDay, since in the Assistants API update from April they mentioned they would add support for metadata "in the coming months". I don't think they will roll out more new stuff at the remaining Dev Days, but if they do, I sure hope updates for File Search are included…

Cheers

3 Likes

Hi @groverkartik25 ,
the API gives you the ability to retrieve the list of chunks that the vector store passed to the LLM before it generated the response. You need to identify the correct run step to analyze and then call it this way:

curl -g https://api.openai.com/v1/threads/thread_abc123/runs/run_abc123/steps/step_abc123?include[]=step_details.tool_calls[*].file_search.results[*].content \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v2"

It will return a useful list of the retrieved chunks.

If you don't know how to identify the run step to request from the API, use the /threads/runs/steps call. Provide it the thread_id and run_id in the following way:

https://api.openai.com/v1/threads/${thread_id}/runs/${run_id}/steps

Now the response should return a list of step IDs with multiple attributes inside.
Search for the step where step_details.tool_calls[*].type = "file_search"; that is the step that contains the chunks coming from the vector DB.
Use the step ID you just identified to perform the first API call shown above.
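In Python, the same flow might look roughly like this (a sketch, assuming an existing OpenAI client and known thread_id/run_id):

from openai import OpenAI

client = OpenAI()

# List the run steps and pick the tool_calls step containing a file_search call
steps = client.beta.threads.runs.steps.list(thread_id=thread_id, run_id=run_id)
file_search_step = next(
    step for step in steps.data
    if step.type == "tool_calls"
    and any(tc.type == "file_search" for tc in step.step_details.tool_calls)
)

# Retrieve that step again, asking for the chunk contents to be included
step = client.beta.threads.runs.steps.retrieve(
    step_id=file_search_step.id,
    thread_id=thread_id,
    run_id=run_id,
    extra_query={"include": ["step_details.tool_calls[*].file_search.results[*].content"]},
)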

There's a little bit of information about this in the File Search documentation as well: https://platform.openai.com/docs/assistants/tools/file-search

Hope this helps!

4 Likes

Hi @mambozzo ,

I have been trying to figure this part out, thank you for the detailed explanation! This is really helpful.

1 Like

Tagging @aaron.lutz and @LuisM_Barrera, since you were facing the same issue.

Thanks! Yes, I was already aware of this and use it myself for debugging. Being able to see the File Search results is a pretty recent and very welcome addition to the API. However, it is still kind of hard to gauge whether the search results we see in the run steps are exactly what gets passed to the LLM. For example, the LLM does not seem to get the filename of the file in the results like we do.

1 Like

@aaron.lutz I'm looking for a way to get more details on how exactly OpenAI passes the file retrieval results to the LLM when the Assistants API is in use.
I think this could help answer your question as well, which is another very interesting point.

Do you or anybody else know more details about this?

1 Like

Unfortunately, I don't know any more about this… File Search is still mostly a black box for all of us; apart from the search results in the run steps, we do not know anything else. I hope OpenAI will shift their focus back to the Assistants API and File Search in the future and release new features, as well as give us more insight into and control over these features, but I'm not too certain this will happen with everything else they have going on currently.

1 Like

Building on the topic, here is a Python script to print out the Assistant's response and all the items retrieved from the vector store. Personally, I need metadata badly.

from openai import OpenAI

client = OpenAI()  # `store` (a vector store) and `assistant_id` are assumed to already exist

thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Get some data from your vector store about people."}],
    tool_resources={
        "file_search": {
            "vector_store_ids": [store.id]
        }
    }
)

# Use the create and poll SDK helper to create a run and poll the status of
# the run until it's in a terminal state.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, 
    assistant_id=assistant_id
)

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))


message_content = messages[0].content[0].text
annotations = message_content.annotations
citations = []
for index, annotation in enumerate(annotations):
    message_content.value = message_content.value.replace(annotation.text, f"[{index}]")
    if file_citation := getattr(annotation, "file_citation", None):
        cited_file = client.files.retrieve(file_citation.file_id)
        citations.append(f"[{index}] {cited_file.filename}")

print(message_content.value)
print("\n".join(citations))

run_steps = client.beta.threads.runs.steps.list(
    thread_id=thread.id,
    run_id=run.id,
    extra_query={
        "include":["step_details.tool_calls[*].file_search.results[*].content"],
    }
)

# Extract and print results
from pprint import pprint

def extract_results(run_steps):
    results = []
    # Access the list of steps from the SyncCursorPage object
    for step in run_steps.data:
        if hasattr(step, 'step_details') and hasattr(step.step_details, 'tool_calls'):
            for tool_call in step.step_details.tool_calls:
                # Only file_search tool calls carry search results
                if tool_call.type == 'file_search':
                    results.append(tool_call.file_search['results'])
    return results

results = extract_results(run_steps)

# Pretty print the results
for result in results:
    for res in result:  # Since result itself is a list, iterate over it
        # Extract the content text if it exists
        content_texts = [content['text'] for content in res['content'] if content.get('type') == 'text']

        pprint({
            "File ID": res['file_id'],
            "File Name": res['file_name'],
            "Score": res['score'],
            "Content": " ".join(content_texts)  # Join all text parts to form a single string
        })