Assistant file search text retrieval

Hi,

Does anyone know if it is possible to retrieve the actual text that the assistant is using from the vector store? Logging the annotations only gives:
"annotations": [
  {
    "type": "file_citation",
    "text": "【18:0†source】",
    "start_index": 434,
    "end_index": 447,
    "file_citation": {
      "file_id": "file-nRl3w3civlx7o897DieUXGaO"
    }
  },
  {
    "type": "file_citation",
    "text": "【18:2†source】",
    "start_index": 447,
    "end_index": 460,
    "file_citation": {
      "file_id": "file-nRl3w3civlx7o897DieUXGaO"
    }
  }
]

I want to see the relevance of the data being retrieved from the vector store.

Thanks

1 Like

Looking for a similar solution, but have not found anything promising.

I managed to get some more info on how file search actually works, but unfortunately this is not documented…

Here’s a quick rundown.

The AI model you specified (GPT-3.5, GPT-4, GPT-4o, etc.) outputs a search query to the search tool. It looks like this:

msearch(["Search Query generated by the Assistant"])

Then File Search performs a semantic and keyword search to find the most relevant results. It seemed to me that, before the results are passed to the assistant, they get re-ranked or filtered, and only the top, most relevant results get passed.

The result(s) look like this:

[
  {
    "message_idx": 12,
    "search_idx": 0,
    "text": "Text from the file, i.e. the search result. This text is exactly as it is in your source document.",
    "source": "sourcefile.txt"
  }
]

Unfortunately, this is not visible in the logs of the run steps or anywhere similar, at least I could not find it. But I think the results above may be what you are looking for. I had to do some multi-step prompting to finally get the model to spit out the search results like this. It would be really helpful if OpenAI offered some more documentation on this.

I also posted a thread touching on this topic.

1 Like

Hey Aaron,
thanks for the insight. Another approach I was trying is to use the file_id, start_index, and end_index locally to fetch the text from the file. I had to rename the files to their corresponding file_id in the vector store. Still working on it. Maybe that will work for you.

On the same topic, do you know if this is the correct way to limit results:
const tools = [{
  "type": "file_search",
  "file_search": { "max_num_results": 3 } // Set the maximum number of results
}];

It does not seem to work, though.

Just realised this will not work, since the start_index and end_index in OpenAI's Assistants API annotations refer to the positions within the response text where the annotation is applied. Looking for a better way to retrieve the exact text.
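To illustrate what the indices actually point at (a sketch with a hypothetical message_text holding the assistant's reply and ann being one of its annotations):

# The indices point into the assistant's RESPONSE text, not into the source
# file: slicing the reply with them just returns the citation marker itself.
marker = message_text[ann.start_index:ann.end_index]
print(marker)  # e.g. 【18:0†source】, not the cited file content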

I'd probably do it like this. Although, I honestly just set it in the assistant settings in the UI on the OpenAI Platform, which worked fine.

tools=[{
    "type": "file_search",
    "file_search": {
        "max_num_results": 10
    }
}]

Yes, unfortunately this does not work. There are a couple of other threads and a post on an OpenAI repo about exact citations. They disabled this for v2, so there is no way to do it properly right now, but they are working on adding it back in.

Oh I see, I guess I’ll just have to wait. Thanks again!

So I have set it in the assistant settings as well, but that does not seem to work. What I am doing is creating a run using the assistant ID and tools, something like:

const createRun = async (threadId, tools = [{
  "type": "file_search",
  "file_search": { "max_num_results": 3 } // Set the maximum number of results
}]) => {
  try {
    const assistantConfig = await getAssistantConfig();
    const response = await axios.post(`https://api.openai.com/v1/threads/${threadId}/runs`, {
      assistant_id: assistantConfig.id,
      tools: tools
    }, {
      headers: {
        "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "assistants=v2"
      }
    });
    // console.log('Create Run Response:', response.data); // Debugging information
    return response.data;
  } catch (error) {
    console.error('Error creating run:', error.response?.data || error.message);
    throw new Error('Failed to create run');
  }
};

Am I doing something wrong?

Did you properly add the tool resources to specify the vector store ID? I haven't used the API with JS much, but I imagine it should work the same.

I am pretty sure my vector store is being used, since I get the annotations. The create-run function does not accept a vector store ID as a param. Do you suggest I update the assistant with the vector store ID:
await openai.beta.assistants.update(assistant.id, {
tool_resources: { file_search: { vector_store_ids: [vectorStore.id] } },
});
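For reference, it also seems possible to attach the vector store when creating the thread instead of updating the assistant. A Python sketch, assuming an existing client and vector store object (the same pattern appears in the script further down this thread):

# Attach the vector store to the thread rather than the assistant;
# `store` is assumed to be an existing vector store object.
thread = client.beta.threads.create(
    tool_resources={"file_search": {"vector_store_ids": [store.id]}}
)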

Hello,

Are there any admins who could give us an update about retrieving the content of each chunk during the ThreadRunStepCompleted event? I would rather get the content at the beginning of the thread run than at the end, since I could then do a regex match to replace the source links within the streaming text as they are created.
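Something like this rough sketch is what I have in mind (assuming the citation markers keep the default 【message_idx:search_idx†source】 format):

import re

# Replace citation markers like 【18:0†source】 in the streamed text
# with numbered footnote markers.
CITATION_RE = re.compile(r"【\d+:\d+†[^】]*】")

def replace_citations(text: str) -> str:
    counter = 0
    def _sub(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[{counter}]"
    return CITATION_RE.sub(_sub, text)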

Also, it would be helpful to get metadata about the page where the chunk was found; that way I could go straight to the page of the PDF. If we could attach metadata as a dict to each file, it would be awesome :grin:. I hope we get this by DevDay.

2024-09-30 20:02:57.077 | INFO | utils.llms.openai_interface:fetch_async_completion_ui_assistant_cognition:2095 -
ThreadRunStepCompleted(
  data=RunStep(
    id='step_eIaChGmuYbR0sOcQwyq72o6c',
    assistant_id='asst_8yeN71uoMRlgfJCwWU5DqxhO',
    cancelled_at=None, completed_at=1727726576, created_at=1727726574,
    expired_at=None, failed_at=None, last_error=None, metadata=None,
    object='thread.run.step',
    run_id='run_1CbVa6LDLuzwQFir4At8M99P',
    status='completed',
    step_details=ToolCallsStepDetails(
      tool_calls=[
        FileSearchToolCall(
          id='call_0ba5Sgahn5CFHSKlCekf2NqZ',
          file_search=FileSearch(
            ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21', score_threshold=0.0),
            results=[
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.7672119866780243, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.7528088810427128, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.7070330827631971, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.6966345728271346, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.6847471420043809, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.6629569015698628, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.6502266373159689, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.5404488313553234, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.49091998719388624, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.4793617847415654, content=None),
              FileSearchResult(file_id='file-IAkpJoqkzB4lnIrawSb9CIdx', file_name='IBD_MODULE1_UNIT2_SPECIALTY MALT AND ADJUNTS.pdf', score=0.4671717504211161, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.448181745750499, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.4390744436618422, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.4352800081354829, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.42263067872356186, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.4163856419064138, content=None),
              FileSearchResult(file_id='file-qizfNbj0tjVHWVH9mBlJU1yu', file_name='IBD_MODULE1_UNIT5_MILLING.pdf', score=0.38283591224989805, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.34261438776851033, content=None),
              FileSearchResult(file_id='file-DQSCVlYRDexNK4zwxEpc7LJh', file_name='IBD_MODULE1_UNIT6_MASHING.pdf', score=0.3397802682377875, content=None),
              FileSearchResult(file_id='file-BYW7yXomGlfUIWEMPAOE5sqe', file_name='6.Topics in Brewing - Malting by JDH.pdf', score=0.33443454299859265, content=None)
            ]
          ),
          type='file_search'
        )
      ],
      type='tool_calls'
    ),
    thread_id='thread_YETamtwz8vTjivRY0FaqqNdM',
    type='tool_calls',
    usage=Usage(completion_tokens=18, prompt_tokens=1856, total_tokens=1874),
    expires_at=1727727172
  ),
  event='thread.run.step.completed'
)
2024-09-30 20:02:57.078 | INFO | utils.llms.openai_interface:fetch_async_completion_ui_assistant_cognition:2394 - Usage: {'completion_tokens': 18, 'prompt_tokens': 1856, 'total_tokens': 1874}
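For context, here is roughly how I watch for that event on a streamed run (a sketch; client, thread, and assistant_id are assumed to already exist). Note that content comes back as None here unless it is explicitly included:

# Stream the run and inspect the completed file_search tool-call step.
stream = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant_id,
    stream=True,
)
for event in stream:
    if event.event == "thread.run.step.completed" and event.data.type == "tool_calls":
        for tool_call in event.data.step_details.tool_calls:
            if tool_call.type == "file_search":
                for result in tool_call.file_search.results:
                    print(result.file_name, result.score, result.content)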
1 Like

Hi,

We are hoping for the same thing. It seems the Assistants API and File Search didn't get any updates on DevDay :frowning:. We would also love to be able to add metadata to the files or chunks, especially for search filtering, and I really held out hope for DevDay, since in the Assistants API update from April they mentioned they would add support for metadata "in the coming months". I don't think they will roll out more new stuff at the remaining Dev Days, but if they do, I sure hope updates for File Search are included…

Cheers

3 Likes

Hi @groverkartik25 ,
the API gives you the ability to retrieve the list of chunks that the vector store passed to the LLM before it generated the response. You need to identify the correct run step to analyze and then call it this way:

curl -g https://api.openai.com/v1/threads/thread_abc123/runs/run_abc123/steps/step_abc123?include[]=step_details.tool_calls[*].file_search.results[*].content \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v2"

It will return a useful list of the retrieved chunks.

If you don't know how to identify the run step to request from the API, use the /threads/runs/steps call. Provide it the thread_id and run_id in the following way:

https://api.openai.com/v1/threads/${thread_id}/runs/${run_id}/steps

Now the response should return a list of step IDs with multiple attributes inside.
Search for the step where step_details.tool_calls[*].type = "file_search"; that is the step that contains the chunks coming from the vector DB.
Use the step ID you just identified to perform the first API call shown above.
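In Python, the same flow might look roughly like this (a sketch, assuming an existing OpenAI client and known thread_id/run_id):

from openai import OpenAI

client = OpenAI()

# List the run steps and pick the tool_calls step containing a file_search call
steps = client.beta.threads.runs.steps.list(thread_id=thread_id, run_id=run_id)
file_search_step = next(
    step for step in steps.data
    if step.type == "tool_calls"
    and any(tc.type == "file_search" for tc in step.step_details.tool_calls)
)

# Retrieve that step again, asking for the chunk contents to be included
step = client.beta.threads.runs.steps.retrieve(
    step_id=file_search_step.id,
    thread_id=thread_id,
    run_id=run_id,
    extra_query={"include": ["step_details.tool_calls[*].file_search.results[*].content"]},
)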

There's a little bit of information about this in the File Search documentation as well: https://platform.openai.com/docs/assistants/tools/file-search

Hope this helps!

4 Likes

Hi @mambozzo ,

I have been trying to figure this part out, thank you for the detailed explanation! This is really helpful.

1 Like

Tagging @aaron.lutz and @LuisM_Barrera, since you were facing the same issue.

Thanks! Yes, I was already aware of this and use it myself for debugging. Being able to see the File Search results is a pretty recent and very welcome addition to the API. However, it is still kind of hard to gauge whether the search results we see in the run steps are exactly what gets passed to the LLM. For example, the LLM does not seem to get the filename of the file in the results like we do.

1 Like

@aaron.lutz I'm looking for a way to get more details on how exactly OpenAI passes the file retrieval results to the LLM when the Assistants API is in use.
I think this could help answer your question as well, which is another very interesting point.

Do you or anybody else know more details about this?

1 Like

Unfortunately, I don't know any more about this… File Search is still mostly a black box for all of us; apart from the search results in the run steps, we do not know anything else. I hope OpenAI will shift their focus back to the Assistants API and File Search in the future and release new features, as well as give us more insight into and control over these features, but I'm not too certain this will happen with everything else they have going on currently.

1 Like

Building on the topic, here is a Python script to print out the Assistant's response and all the items retrieved from the vector store. Personally, I need metadata badly.

from openai import OpenAI

client = OpenAI()  # `store` (a vector store) and `assistant_id` are assumed to already exist

thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Get some data from your vector store about people."}],
    tool_resources={
        "file_search": {
            "vector_store_ids": [store.id]
        }
    }
)

# Use the create and poll SDK helper to create a run and poll the status of
# the run until it's in a terminal state.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, 
    assistant_id=assistant_id
)

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))


message_content = messages[0].content[0].text
annotations = message_content.annotations
citations = []
for index, annotation in enumerate(annotations):
    message_content.value = message_content.value.replace(annotation.text, f"[{index}]")
    if file_citation := getattr(annotation, "file_citation", None):
        cited_file = client.files.retrieve(file_citation.file_id)
        citations.append(f"[{index}] {cited_file.filename}")

print(message_content.value)
print("\n".join(citations))

run_steps = client.beta.threads.runs.steps.list(
    thread_id=thread.id,
    run_id=run.id,
    extra_query={
        "include":["step_details.tool_calls[*].file_search.results[*].content"],
    }
)

# Extract and print results
from pprint import pprint

def extract_results(run_steps):
    results = []
    # Access the list of steps from the SyncCursorPage object
    for step in run_steps.data:
        if hasattr(step, 'step_details') and hasattr(step.step_details, 'tool_calls'):
            for tool_call in step.step_details.tool_calls:
                # Only file_search tool calls carry search results
                if tool_call.type == 'file_search':
                    results.append(tool_call.file_search['results'])
    return results

results = extract_results(run_steps)

# Pretty print the results
for result in results:
    for res in result:  # Since result itself is a list, iterate over it
        # Extract the content text if it exists
        content_texts = [content['text'] for content in res['content'] if content.get('type') == 'text']

        pprint({
            "File ID": res['file_id'],
            "File Name": res['file_name'],
            "Score": res['score'],
            "Content": " ".join(content_texts)  # Join all text parts to form a single string
        })