Best file type for Q and A assistant

Any suggestions (or what others may have had success in) with a file type in retrieval for questions and answers?

My use case:
I’ve transitioned from an NLP intent-classifier model to the OpenAI Assistant via the Assistants API. The data I had from the intent classifier consisted of:

  1. Intents
  2. Intent examples (the queries that trigger the intent)
  3. The response mapped to each intent

I’ve tried many ways to convert this data into different file types, including JSON, where each JSON object holds patterns (intent examples) and a response value. I’m getting the best results here, though I’m not even sure the JSON file is indexed or embedded the way I’d hope when it is uploaded to the assistant.
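
Roughly, the structure looks like this (a simplified sketch with made-up values; the field names are just what I chose):

    [
        {
            "patterns": ["How do I reset my password?", "I forgot my password"],
            "response": "You can reset your password from the account settings page."
        }
    ]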

I’ve also tried writing just the responses to PDF pages and feeding those to the Assistant, in the hope that it picks the best answer based on the user’s query, but I’m still not where I’d like to be on accuracy.

Any suggestions on different file types and how to structure them, with/without the intent examples, etc.?

Thanks!


Do you use the new file_search tool? I have noticed a pretty good improvement in the answers when retrieving information from files. I do prefer JSON, but PDF files work just fine.

Kristiyan


Thanks for the reply, @kmilev6!

On the platform.openai.com site, under the assistant my backend is attached to, I do have file_search turned ON and have attached files for the assistant to use.

Are you suggesting actually attaching a file_id along with the tool in the message creation? Like this?

    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=msg_data.user_input,
        attachments=[
            {"file_id": "xxxxxxxx", "tools": [{"type": "file_search"}]}
        ],
    )

> On the platform.openai.com site, under the assistant my backend is attached to, I do have file_search turned ON and have attached files for the assistant to use.

Are you forcing the use of the tool? When you enable file_search:

    assistant = client.beta.assistants.create(
        name=name,
        description=description,
        instructions=instructions,
        model="gpt-4-turbo",
        tools=[{"type": "file_search"}],
    )

You are letting the assistant decide when to use the tool. Sadly, I haven’t found any way to check whether the assistant actually called the tool when using the new run helper create_and_poll(), as the run object only returns a "tool_choice": "auto". If you find a way, let me know!
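
One untested possibility might be to list the run steps after the run finishes and look for a file_search tool call. A hedged sketch (the thread and run IDs are placeholders):

    # sketch: inspect the steps of a finished run for a file_search tool call
    steps = client.beta.threads.runs.steps.list(
        thread_id="thread_abc123",
        run_id="run_abc123",
    )
    for step in steps.data:
        if step.step_details.type == "tool_calls":
            for call in step.step_details.tool_calls:
                print(call.type)  # "file_search" if the tool actually ran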

Anyway, back to the issue, you can force the use of the tool with the run:

    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id,
        assistant_id=assistant.id,
        tool_choice={"type": "file_search"},
    )

In my experience, letting the assistant decide when to call the tool is not optimal. If you are using only one function and/or only one file, I bet you will be better off forcing the tool. If you use many tools and/or files, I’m guessing you need to improve your instructions.

Let me know how it goes.

> Are you suggesting actually attaching a file_id along with the tool in the message creation? Like this?
>
>     client.beta.threads.messages.create(
>         thread_id=thread_id,
>         role="user",
>         content=msg_data.user_input,
>         attachments=[
>             {"file_id": "xxxxxxxx", "tools": [{"type": "file_search"}]}
>         ],
>     )

It depends… I prefer attaching the file directly to the assistant if it is going to be the base of its knowledge. I would attach the file to the message only if it’s a user’s input and you want the assistant to be able to analyze that specific file.
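
For the assistant-level case, a minimal sketch using the v2 vector store endpoints (the store name and intents.json file are placeholders, and assistant.id comes from the create call above):

    # create a vector store and upload the knowledge file(s) into it
    vector_store = client.beta.vector_stores.create(name="qa-knowledge")
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id,
        files=[open("intents.json", "rb")],
    )
    # point the assistant's file_search tool at the new store
    client.beta.assistants.update(
        assistant_id=assistant.id,
        tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
    )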


Really appreciate the feedback, @kmilev6!!

So, I did incorporate the tool_choice in the stream():

    with client.beta.threads.runs.stream(
        thread_id=thread_id,
        assistant_id=assistant_id,
        event_handler=event_handler,
        tool_choice={"type": "file_search"},
    ) as stream:
        ...
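
For context, my event_handler is roughly something like this minimal sketch, built on the SDK’s AssistantEventHandler:

    from typing_extensions import override
    from openai import AssistantEventHandler

    class EventHandler(AssistantEventHandler):
        # print each streamed text token as it arrives
        @override
        def on_text_delta(self, delta, snapshot):
            print(delta.value, end="", flush=True)

    event_handler = EventHandler()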

I’m assuming this works similarly to create_and_poll()?

Also, testing this out, I’m noticing the citation annotations that appear at the end of the responses. For my use case we can’t have those annotations there, so I’ll have to figure out a way to remove them.

If that’s the case and you are getting the citations, then the assistant is properly searching the vector store. My guess is that your assistant needs better instructions in order to understand what it is looking for. (I am not an expert by any means, so take this with a grain of salt.)

Back to your original question. When using file_search, from the documentation:

> OpenAI automatically parses and chunks your documents, creates and stores the embeddings, and uses both vector and keyword search to retrieve relevant content to answer user queries.

Once the vector store is created, the original format doesn’t matter, but I can’t tell how well the current embedding model text-embedding-3-small performs across different file types. Personally, I get good results with all the supported formats.

> Also, testing this out, I’m noticing the citation annotations that appear at the end of the responses. For my use case we can’t have those annotations there, so I’ll have to figure out a way to remove them.

I have noticed the same behaviour since the v2 release. Checking the message object, there shouldn’t be any annotations inside the value field, so it’s probably another prompt-engineering task:

    {
      "id": "msg_abc123",
      "object": "thread.message",
      "created_at": 1698983503,
      "thread_id": "thread_abc123",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": {
            "value": "Hi! How can I help you today?",
            "annotations": []
          }
        }
      ],
      "assistant_id": "asst_abc123",
      "run_id": "run_abc123",
      "attachments": [],
      "metadata": {}
    }
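
If prompting alone doesn’t get rid of them, the markers could be stripped in post-processing. A hedged sketch, assuming message is the retrieved message object and that stray markers follow the 【…†source】 format:

    import re

    text = message.content[0].text
    value = text.value
    # when annotations are populated, each one carries the literal marker in .text
    for annotation in text.annotations:
        value = value.replace(annotation.text, "")
    # fallback for markers that show up even with an empty annotations list
    value = re.sub(r"【.*?†.*?】", "", value)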

Anyway, I don’t think I can be of any further help. I hope someone else comes along to shed a bit more light on your issue. Feel free to send me a DM about other topics; I’m guessing we are working on similar tasks.

See you around!