Prompt returns answers from only one file in a vector store

I am using the API via cURL to ask questions about some files in a vector store.

  • I create the vector store, and add two files using the file_batches endpoint.
  • I create a new assistant, and then attach a new thread. The vector store is attached to the thread.
  • I am using the default chunking strategy (800 tokens max, with 400 token overlap). The cURL calls for these steps are sketched below.
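For reference, the calls look roughly like this (a sketch; the vector store and file IDs are placeholders, and $OPENAI_API_KEY is assumed to be set):

# Create the vector store
curl https://api.openai.com/v1/vector_stores \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v2" \
  -d '{"name": "my_store"}'

# Add the two previously uploaded files in one batch
curl https://api.openai.com/v1/vector_stores/vs_PLACEHOLDER/file_batches \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v2" \
  -d '{"file_ids": ["file-PLACEHOLDER1", "file-PLACEHOLDER2"]}'

# Create a thread with the vector store attached for file_search
curl https://api.openai.com/v1/threads \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v2" \
  -d '{"tool_resources": {"file_search": {"vector_store_ids": ["vs_PLACEHOLDER"]}}}'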

When I ask a question, it is clear that the assistant is answering using the information from only one file. If I ask a question that requires information from the other file, the response is “I don’t know”, or similar.

If I retrieve information on the vector store contents, I can see that both files have (apparently) been added successfully.

The files are simple UTF-8 text files, each containing a few thousand words in English.

If I repeat the process, but add only one or the other of the files, the assistant (set up with the same arguments as before) can provide cogent answers to questions about that one file, showing that the file is perfectly readable and the vectorising process is OK.

Can anyone suggest what I’m doing wrong?
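For reference, the JSON below is the combined output of retrieving the store and listing its files, i.e. roughly:

curl https://api.openai.com/v1/vector_stores/vs_FQYDspit4lQaKuJDetPd5HGt \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "OpenAI-Beta: assistants=v2"

curl https://api.openai.com/v1/vector_stores/vs_FQYDspit4lQaKuJDetPd5HGt/files \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "OpenAI-Beta: assistants=v2"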

{
  "files": {
    "data": [
      {
        "chunking_strategy": {
          "static": {
            "chunk_overlap_tokens": 400,
            "max_chunk_size_tokens": 800
          },
          "type": "static"
        },
        "created_at": 1726003029,
        "id": "file-F6bVnGrVlB8CgEfkIUOrQGIt",
        "last_error": null,
        "object": "vector_store.file",
        "status": "completed",
        "usage_bytes": 1073,
        "vector_store_id": "vs_FQYDspit4lQaKuJDetPd5HGt"
      },
      {
        "chunking_strategy": {
          "static": {
            "chunk_overlap_tokens": 400,
            "max_chunk_size_tokens": 800
          },
          "type": "static"
        },
        "created_at": 1726003029,
        "id": "file-GLPcOLKGf1XMmt5fAntzuIdC",
        "last_error": null,
        "object": "vector_store.file",
        "status": "completed",
        "usage_bytes": 1076,
        "vector_store_id": "vs_FQYDspit4lQaKuJDetPd5HGt"
      }
    ],
    "first_id": "file-F6bVnGrVlB8CgEfkIUOrQGIt",
    "has_more": false,
    "last_id": "file-GLPcOLKGf1XMmt5fAntzuIdC",
    "object": "list"
  },
  "store": {
    "created_at": 1726003028,
    "expires_after": {
      "anchor": "last_active_at",
      "days": 60
    },
    "expires_at": 1731188861,
    "file_counts": {
      "cancelled": 0,
      "completed": 2,
      "failed": 0,
      "in_progress": 0,
      "total": 2
    },
    "id": "vs_FQYDspit4lQaKuJDetPd5HGt",
    "last_active_at": 1726004861,
    "metadata": {},
    "name": "B0013B2D724E0EBAB67AE58BC34A939F",
    "object": "vector_store",
    "status": "completed",
    "usage_bytes": 2149
  }
}

It could be related to your Assistant’s instructions.

It’s typically a good idea to include some information in your instructions about the files attached to your vector store, such as the type of information they cover, their structure, etc. You can then also spell out how and when to use the files in the conversation; a sketch follows below.
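For example, something along these lines when creating the assistant (a sketch; the model choice and the instruction wording are illustrative, not your actual setup):

curl https://api.openai.com/v1/assistants \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v2" \
  -d '{
    "model": "gpt-4o",
    "instructions": "You have two attached text files: one covering topic A, the other covering topic B. Answer only from these files, and search both before answering.",
    "tools": [{"type": "file_search"}]
  }'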

The issue could also be related to the query/prompt that is submitted. Given that a similarity search is performed, it could simply be the case that the chunks from the other file are deemed of low relevance. OpenAI has recently introduced the ability to inspect file search chunks, which you can read up on here. You might want to investigate that, too, as sketched below.
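A sketch of that inspection call (the thread and run IDs are placeholders; the include[] parameter is the one documented for returning file search results on run steps):

curl "https://api.openai.com/v1/threads/thread_PLACEHOLDER/runs/run_PLACEHOLDER/steps?include[]=step_details.tool_calls[*].file_search.results[*].content" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "OpenAI-Beta: assistants=v2"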

Thanks.
Yes, my instruction explicitly refers to the files, and asks for the answers to be restricted to only information from the files.
One test prompt was “what are the topics of the two files?” It returned a good summary of the topic of the first file.


Thanks for clarifying. Currently, file search under the Assistants API is optimized for specific queries, not for summarization. Regardless of the prompt you provide, the process is still that the prompt is used as the basis for a similarity search, which then returns the top-k chunks ranked by relevance. So the same limitations apply as for RAG in general.

This overview is a good reference for how the process works and what the current limitations are:


Source: https://platform.openai.com/docs/assistants/tools/file-search/how-it-works

If you are looking for summaries of individual documents, then you might want to consider adding them in full directly as part of a thread message and then using conventional prompting to ask the model to generate summaries, as in the sketch below.
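A minimal sketch of that approach (the thread ID is a placeholder, and the full document text is pasted into the message content):

curl https://api.openai.com/v1/threads/thread_PLACEHOLDER/messages \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v2" \
  -d '{
    "role": "user",
    "content": "Summarize the following document:\n\n<full text of the file pasted here>"
  }'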

I hope this helps.