Struggling with Slow Responses and AI Hallucinations in My OpenAI Assistant – Need Expert Advice to Optimize Performance!

Hello, I’m currently working on a project where I use an OpenAI assistant that utilizes file_search with information from a list of scientific and student-related informational websites. An XML file is loaded containing the ID, website name, description, author information, important dates, etc.

The idea is that when a user asks a question about any of the topics from the list of websites, the assistant can respond accurately and attach the correct URL for the user to visit.

Initially, I had an assistant set up with instructions to provide information in a specific structure (name, date, topic description, benefits), and if the user didn’t know the exact website, the model could offer a list of sites based on a general description provided by the user.

At first, it worked well, but after some extended conversation, I noticed several issues:

  1. It started providing incorrect URLs.
  2. It listed website names that weren’t in the provided files and didn’t exist.
  3. When asked directly about a non-existent topic, it gave complete responses with entirely made-up data (hallucinations).
  4. It stopped providing information: it would give the title but respond with “Description not available” or “URL should go here” without actually answering.
  5. It started responding to unrelated questions, like giving a recipe for cooking (which is not related to the content in the websites).

So far, I’ve tried the following approaches:

  • Adding double checks, where the assistant reviews its own results twice before answering. This was slow and didn’t always work.
  • Creating a state diagram, changing the assistant’s instructions to assign different tasks: one set of instructions to detect what the user is looking for, specific instructions to provide information if the website name is known, and specific instructions to generate a list of possible matches. This was slow, and the hallucinations still occurred.
  • To manage context better, I used multiple assistants, each with a specific task, communicating via function calls. This has given the best results in terms of reducing hallucinations, but it’s still slow.

I’ve been implementing the state logic and function calls in a Google Cloud Function written in Python.

My priority now is to reduce the response time. Do you have any suggestions that could help?
I need to deliver a list of potential solutions.

Nice. This modular approach will really help in debugging as well.

Have you benchmarked each task? Both in the response times of the Assistants and the pipeline checkpoints? What part of it is taking the longest?

To be honest, I haven’t done that yet. I’m not really sure how to do it properly—should I measure the time before and after requesting information from each assistant?
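Something like this is what I had in mind: just wrapping each stage in a timer (the call_* functions below are placeholders for whatever I actually run per assistant):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took, so each stage can be compared."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

with timed("get_list assistant"):
    options = call_list_assistant(user_query)     # placeholder for my actual call

with timed("get_data assistant"):
    details = call_data_assistant(options)        # placeholder for my actual call

with timed("main assistant (final answer)"):
    answer = call_main_assistant(details)         # placeholder for my actual call
```

Would that be enough, or is there a more proper way to profile the runs themselves?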

In general, I’ve noticed that:

  1. The calls between assistants can be excessive. Instead of gathering all the information in one call, they make individual requests separately. I think this hurts performance because I currently handle them sequentially. (I haven’t worked much with asynchronous functions; see the sketch after this list for the kind of thing I mean.)
  2. The assistant generating the information can provide unnecessary details, and I have to wait for it to finish creating a long text before responding. That text is then sent as a function_call response to the main assistant, which processes it again, causing further delay.
  3. The current process is multiplying the execution time. When a task is requested, the assistant responsible for it responds, and then I have to wait for the main assistant to gather all the requested responses, analyze them, and create a final answer for the user.
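For point 1, from what I’ve read, a thread pool might let me run the lookups in parallel without touching asyncio at all. This is only a sketch of what I have in mind, not something I’ve tried, and it assumes my get_data / get_url handlers are plain synchronous Python functions:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page_id):
    """Look up one page with the existing handlers (get_data / get_url)."""
    return {
        "id": page_id,
        "data": get_data(page=page_id, info="name, description, lastupdate"),
        "url": get_url(id=page_id),
    }

page_ids = ["ID 1", "ID 2", "ID 3", "ID 4"]

# Run all per-page lookups at the same time instead of one after another.
with ThreadPoolExecutor(max_workers=len(page_ids)) as pool:
    results = list(pool.map(fetch_page, page_ids))
```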

The total execution time is currently between 50 seconds and 1 minute 30 seconds, sometimes even more, which causes any frontend application I want to use for the chat to time out, since it assumes the request was not processed and will never receive a response.

Yes. IMO it’s always a good decision to keep things modular, but as you have found out this does add a lot of latency compared to just shoving a bundle of instructions to a single assistant and hoping for the best.

How many calls to an assistant do you make in a single response? How much time does each assistant spend?

Do you need to use assistants for some of the calls? For example, I use ChatCompletions for the majority of my requests. An Assistant is just the “front-end” of a conversation and the manager of its state.
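For the pure lookup-and-answer steps, a plain Chat Completion with the relevant XML pasted into the prompt is usually noticeably faster than starting a run, since there is no thread, run, or polling involved. A minimal sketch, assuming you can pre-filter the XML yourself (the model name and the user_question variable are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Only the XML entries relevant to this query, filtered by your own code
# rather than by file_search.
xml_snippet = "<sites>...</sites>"  # placeholder

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever model you prefer
    messages=[
        {
            "role": "system",
            "content": (
                "Answer ONLY from the XML below. If the requested page is not "
                "in it, say the information is not available.\n\n" + xml_snippet
            ),
        },
        {"role": "user", "content": user_question},  # placeholder
    ],
)
print(response.choices[0].message.content)
```

The trade-off is that you do the filtering yourself, but for a bounded list of sites that’s usually trivial.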

Lastly, sometimes it’s OK to wait for some time. Loading screens are a great example of this. If it’s worth it, it’s worth it. More important than the timing is the interactivity and responsiveness of the loading screen.

For example, if you have to make 3 internal calls on average: 2 of them are Functions, 1 is RAG, it’s not that bad to indicate this process to the end-user. This is similar to loading screens. If it just says “loading”, it sucks. But if it takes you along the journey it’s not so bad. “Wow, look at all that’s happening behind the scene to give me the best experience!”

Currently, it is running:

  • About 2 calls to display data, usually asking for information:
    get_data(page: "ID", info: "name, description, lastupdate")
    get_url(id: "ID")
  • Around 8 calls just to list data:
    get_list(description) → responds with a list of 4-5 options

After having the list of options, it retrieves the information and URL to display:
get_data(page: "ID 1", info: "name, description, lastupdate"), get_url(id: "ID 1")
get_data(page: "ID 2", info: "name, description, lastupdate"), get_url(id: "ID 2")
get_data(page: "ID 3", info: "name, description, lastupdate"), get_url(id: "ID 3")
get_data(page: "ID 4", info: "name, description, lastupdate"), get_url(id: "ID 4")

I am currently adjusting the function’s instructions to avoid making so many calls and instead generate a single request for all the data:
get(id: "ID1, ID2, ID3, ID4", info: "name, description, lastupdate")
But it is not working yet, and I don’t know why.
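This is roughly the tool definition I’m moving toward, with the IDs as a proper array instead of one comma-separated string (the names are simplified), in case someone can spot what I’m doing wrong:

```python
# Sketch of the batched tool definition (names simplified):
batched_get_tool = {
    "type": "function",
    "function": {
        "name": "get",
        "description": "Fetch name, description, last update and URL "
                       "for one or more page IDs in a single call.",
        "parameters": {
            "type": "object",
            "properties": {
                "ids": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Page IDs to look up, e.g. ['ID1', 'ID2']",
                },
                "info": {
                    "type": "string",
                    "description": "Comma-separated fields to return",
                },
            },
            "required": ["ids"],
        },
    },
}
```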

I use the assistants because their responses are based on the XML document I provide. Do ChatCompletions allow for attaching files? Are they faster at responding? Currently, the assistants I use have a separate context, so they don’t keep track of the conversation—they are just there to respond directly.

Consider using the same THREAD, but instead of adding the second Assistant via function calling, let the second assistant go after the first assistant finishes. On the SAME Thread. A thread is not tied to a specific Assistant, only during a run, and for each new run you can pick whatever Assistant you want.
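In the Python SDK that looks roughly like this; the assistant IDs and the user_question variable are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI()

def run_and_wait(thread_id, assistant_id):
    """Start a run with the given assistant on the given thread and poll until it finishes."""
    run = client.beta.threads.runs.create(thread_id=thread_id, assistant_id=assistant_id)
    while run.status in ("queued", "in_progress"):
        time.sleep(0.5)
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run.id)
    return run

thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user", content=user_question)

# First run: the assistant that works out which pages are relevant.
run_and_wait(thread.id, ROUTER_ASSISTANT_ID)   # placeholder ID

# Second run, SAME thread, different assistant: the one that writes the final answer.
run_and_wait(thread.id, ANSWER_ASSISTANT_ID)   # placeholder ID

messages = client.beta.threads.messages.list(thread_id=thread.id)
```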

Well, the bottleneck is pretty obvious :rofl:

I think the ideal next step would be to batch the URL getting

Interestingly, that was my first idea. I tried doing it by simply switching the assistant within the same THREAD, but occasionally the same hallucinations occurred when querying nonexistent pages. For example, if someone asked for the ‘anthropology’ page, it would respond with a title, description, last update date, and say the URL was not available at the moment—all made up since that page doesn’t exist.

I THINK, based on my understanding, that having the context from previous responses made it rely on that to complete the new request where no information was available, as it followed the same format as past responses. (If the same query was made at the start, it would correctly say the topic wasn’t available.)

Separating the context by using different threads ensures that if the assistant searching for the data doesn’t have it, it has nothing to guide it, so it responds correctly that the information is not available.

Yes, yes, I know, that’s the most obvious one :rofl:.
Right now, I’m trying to have it all done in a single call (though I guess the response will still be slow), but at least it won’t go through the process of calling one by one.

But I’m wondering, is there any other way to do it? That’s the best idea I’ve come up with so far.

Definitely need to run these operations in parallel & figure out why it’s not being successful. 6 calls is just too much in networking latency alone.

I don’t know why you wouldn’t want to batch it all. It seems like the best solution and will certainly take less time than sequential operations.

So if you had 6 sources to pull from you’d call 6 instances to run in parallel, synthesize the results and pass the rich, concise data back to the assistant to be used for context.
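A rough shape of that, assuming the existing get_data / get_url handlers are ordinary synchronous Python functions and that thread_id, run_id, and tool_call_id come from the run’s required_action step:

```python
import asyncio
from openai import OpenAI

client = OpenAI()

async def fetch_page(page_id):
    """Wrap the existing synchronous lookups so they can run concurrently."""
    data = await asyncio.to_thread(get_data, page=page_id, info="name, description, lastupdate")
    url = await asyncio.to_thread(get_url, id=page_id)
    return f"{page_id}: {data} | {url}"

async def gather_pages(page_ids):
    # All lookups fire at once; total wait is roughly the slowest single lookup.
    return await asyncio.gather(*(fetch_page(p) for p in page_ids))

summaries = asyncio.run(gather_pages(["ID 1", "ID 2", "ID 3", "ID 4", "ID 5", "ID 6"]))

# Synthesize into one concise block and hand it back as a single tool output.
client.beta.threads.runs.submit_tool_outputs(
    run_id=run_id,        # placeholder: from the run that requested the tool call
    thread_id=thread_id,  # placeholder
    tool_outputs=[{"tool_call_id": tool_call_id, "output": "\n".join(summaries)}],
)
```

Either way you get a second win: the main assistant only has to read one compact block of context instead of six separate tool results.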