Assistant/Thread Model Stress Test: Concerning Results [See inside]

[Results graphs at bottom, see below]

I’m designing an app where I’m thinking of having one assistant and one thread per user. The assistant represents a chat bot that users of the app chat with; each user has their own thread. One thing that’s important to me is low latency of the chat bot’s responses, so I ran a test measuring the bot’s response time (the duration of the run, i.e. the invocation of the assistant on a user’s thread that returns a text response) over time as the thread increases in size. Another important requirement is that the assistant’s text response must not exceed 200 words. This is made clear in the prompt I gave it; I add this line to the prompt: “ENSURE YOUR RESPONSE IS NO LONGER THAN 200 WORDS.”

The test is:

  • A new thread is created
  • I feed the assistant’s message into the generic ChatGPT Completions API to generate a reply (based on a prompt I gave that essentially says “ask a question every time and keep the conversation going”), and this is written to the thread
  • The assistant responds (a run is made on the thread)
  • The assistant’s response is given to the generic ChatGPT Completions API again to produce a new message for the assistant to then respond to
  • Repeat this 200 times.

So this test involves a fresh thread, 200 runs (meaning 400 messages total, 200 of those from the assistant).
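
Roughly, each iteration of the loop looks like the sketch below. This is a simplified illustration, not my exact script; the model name, prompts, and helper names are placeholders.

  from openai import OpenAI

  client = OpenAI()

  def next_user_question(last_assistant_text: str) -> str:
    # Generic ChatGPT Completions call that keeps the conversation going
    resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "Ask a question every time and keep the conversation going."},
        {"role": "user", "content": last_assistant_text},
      ],
    )
    return resp.choices[0].message.content

  def run_assistant_once(thread_id: str, assistant_id: str, user_text: str) -> str:
    # Write the generated question to the thread, then run the assistant on it
    client.beta.threads.messages.create(thread_id=thread_id, role="user", content=user_text)
    run = client.beta.threads.runs.create_and_poll(thread_id=thread_id, assistant_id=assistant_id)
    if run.status == "completed":
      latest = client.beta.threads.messages.list(thread_id=thread_id, limit=1)
      return list(latest)[0].content[0].text.value
    return ""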

Concerning results 1: Clear uptrend in response time. Well over 10 seconds from the 12th run onwards, up to 30 seconds!!!

Concerning results 2: The response-length rule is not being respected. My prompt states the response must not be more than 200 words, yet after a while in the thread this isn’t being respected. Looks like from the 10th run onwards.

I thought rate limiting might be affecting response time; HOWEVER, the test took 2 hours, so I don’t think it’s a rate limiting issue. The OpenAI API key I’m using is Tier 5 anyway.

Does anyone have any insight as to what’s going on?

Note: the x-axis label says 300 runs when in reality there are 200.

Note 2: Here’s the code for measuring the run response time:

  import time
  from openai import OpenAI

  client = OpenAI()  # thread_id and assistant are created earlier in the script

  # Time the whole run: create it on the thread and poll until it finishes
  start_time = time.time()
  run = client.beta.threads.runs.create_and_poll(
    thread_id=thread_id,
    assistant_id=assistant.id,
  )

  if run.status == 'completed':
    run_length = time.time() - start_time

Note 3: I’ve conducted this test multiple times on various days and the trend is the same.

Could you try multiple keys and post the results? Or have you already tried that too? Also, could hardware be an issue?

Response time of 40 seconds for 200 words? Wth!

How are you managing these? Are you running each test in sequence? Multi threading? Are you running all of this off your PC or using remote servers?

Graphs and numbers don’t mean much if the full test isn’t shown. A snippet won’t cut it.

Although this kind of works, it should never be considered a trustworthy rule - more of a guideline.

So, this is not a concern.

It is quite hard to decipher what “the test” actually is here.

The logic breaks down within the first step: you create a thread (on the Assistants API endpoint?). Then you send some “assistants message” to Chat Completions instead?

(I couldn’t get an AI taught about Assistants to understand what you are doing either)


However, it seems your concern is about lower token generation speed when a larger context is input to the AI model, including when that context comes from the many messages of an Assistants thread used as chat history.

That is natural behavior. The attention mechanism that produces AI output must consider every input token in the context window in order to calculate the next token to be generated.

The Assistants API endpoint will send all the messages from a thread that can fit in the model’s context length. You can set an upper limit on the number of chat turns that are input with the API call parameter truncation_strategy.
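
For example, a minimal sketch (assuming the openai Python SDK’s beta Assistants interface) that caps a run to the most recent messages of the thread:

  from openai import OpenAI

  client = OpenAI()

  # Only the last 10 thread messages are sent to the model for this run,
  # keeping the input context (and therefore latency) bounded.
  run = client.beta.threads.runs.create_and_poll(
    thread_id=thread_id,        # an existing thread
    assistant_id=assistant.id,  # an existing assistant
    truncation_strategy={"type": "last_messages", "last_messages": 10},
  )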


The underlying concerns in this scenario seem to stem from two distinct factors:

1. Response Time Increases Over Time

Likely Cause: Larger Input Context Window and Increased Computation by the AI Model

  • How the API Handles Context:
    OpenAI’s Assistant API uses a “thread” to maintain conversation history. With each new message, the entire thread’s context (all messages so far) is included in the input to the model. This means that as the thread grows, the amount of text that the model needs to process increases significantly.

    The model’s response time grows with the size of the input context:

    • The model must process every token in the context window, which increases computational complexity.
    • OpenAI’s models are designed to handle large context windows, but they respond more slowly as the context approaches the model’s limit.
  • Symptoms Observed:

    • The “clear uptrend in response time” correlates with increasing thread size (i.e., more tokens to process).
    • If the thread reaches or exceeds the model’s context window size, OpenAI’s API will either truncate older messages or process the entire thread, leading to additional overhead for token processing.
  • Mitigation:

    • Implement thread truncation or summary mechanisms: Periodically summarize earlier parts of the conversation and replace them with shorter summaries to reduce input size (a rough sketch follows this list).
    • Consider splitting long conversations into multiple shorter threads to limit the context size per thread.
    • Profile and limit the context window length based on performance needs.
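
A rough sketch of the summarization idea, assuming the openai Python SDK; compact_thread is a hypothetical helper name, not part of any API:

  from openai import OpenAI

  client = OpenAI()

  def compact_thread(old_thread_id: str) -> str:
    # Pull the existing messages (the API returns newest first by default).
    # Assumes text-only message content.
    msgs = client.beta.threads.messages.list(thread_id=old_thread_id, limit=100)
    transcript = "\n".join(
      f"{m.role}: {m.content[0].text.value}" for m in reversed(list(msgs))
    )
    # Ask Chat Completions for a short summary of the history so far.
    summary = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[{"role": "user",
                 "content": "Summarize this conversation in under 150 words:\n" + transcript}],
    ).choices[0].message.content
    # Seed a fresh thread with the summary so future runs carry a small context.
    new_thread = client.beta.threads.create(
      messages=[{"role": "user",
                 "content": "Summary of the conversation so far: " + summary}]
    )
    return new_thread.id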

2. Rule-Breaking in Response Length

Likely Cause: Model Drift in Adhering to Instructions over Long Contexts

Underlying limitation: AI models can’t count words, nor reliably plan their phrasing to land on an exact word count

  • How Models Process Prompts:
    OpenAI’s models process prompts and thread history as a single input. When the input is long, the specific instructions (e.g., “Ensure your response is no longer than 200 words”) can become less salient to the model because:

    • Instructions might be “buried” in the larger context, making it harder for the model to prioritize them.
    • The model may “pick up” other patterns in the growing conversation history (e.g., responding with longer messages based on earlier interactions) that override or dilute adherence to the explicit instruction.
  • Symptoms Observed:

    • The model starts generating responses longer than 200 words despite the instruction.
    • This deviation becomes more pronounced as the thread grows.
  • Mitigation:

    • Reiterate Instructions: Include the “200-word limit” instruction explicitly at the end of the thread, alongside the user’s most recent message. This ensures it is prioritized during inference. However, the Assistants API doesn’t have such a mechanism except by adding it to the user’s input message.
    • Systematic Prompt Engineering:
      • Modify the system message to improve attention focus: e.g., “You must strictly adhere to a 200-word limit in every response. Pay attention to the length when producing every response in this session.”
      • Reframe the assistant’s role to emphasize the rule (e.g., “Your primary task is to provide concise responses under 200 words.”). Or occasionally add assistant messages like “I’ll now answer this question, adhering to a maximum length of 200 words”, or better, talk about the length in terms of paragraphs.
    • Intermediate Summaries: Use summaries in the thread history to compress older parts of the conversation. This reduces the chance of the model drifting from instructions or paying them less attention. To do this within the limitations of Assistants, you would have to recreate a whole new thread of messages, and you can’t place past tool messages into it.
    • Response Validation: Programmatically validate the length of the assistant’s responses, and if they exceed 200 words, give the AI iterative guidance to refine the output length (a sketch follows this list).
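
A sketch of that last idea, assuming the openai Python SDK; enforce_word_limit is an illustrative helper, not an API call:

  from openai import OpenAI

  client = OpenAI()
  MAX_WORDS = 200

  def enforce_word_limit(thread_id: str, assistant_id: str, reply: str, retries: int = 2) -> str:
    for _ in range(retries):
      if len(reply.split()) <= MAX_WORDS:
        return reply
      # Feed corrective guidance back into the thread and re-run the assistant.
      client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=f"That response was over {MAX_WORDS} words. Rewrite it in {MAX_WORDS} words or fewer.",
      )
      run = client.beta.threads.runs.create_and_poll(thread_id=thread_id, assistant_id=assistant_id)
      if run.status == "completed":
        latest = client.beta.threads.messages.list(thread_id=thread_id, limit=1)
        reply = list(latest)[0].content[0].text.value
    return reply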

Is the Concern Primarily About Parallel Requests or Context Size?

The concerns appear to be primarily about the impact of increasing context size on response times and instruction adherence, rather than an issue with handling multiple requests in parallel:

  1. Response Time Issues:

    • Rooted in the increasing size of the thread being passed to the model.
    • Longer input leads to higher computational costs for token processing.
  2. Instruction Adherence Issues:

    • Rooted in “instruction drift” as the context window grows.
    • The model’s focus shifts from adhering to explicit instructions to following patterns in the conversation.

    The rate-limiting hypothesis is unlikely because the test spanned two hours with a Tier 5 key (a high request quota).

Ultimately:

You will want more control of the input context than Assistants API offers.

Use Chat Completions - and manage everything sent to the AI model yourself.
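
A minimal sketch of that approach (assuming gpt-4o and the openai Python SDK); you decide exactly what is sent on every turn:

  from openai import OpenAI

  client = OpenAI()
  SYSTEM = "You are a helpful chat bot. ENSURE YOUR RESPONSE IS NO LONGER THAN 200 WORDS."
  history = []      # list of {"role": ..., "content": ...} dicts you fully control
  MAX_TURNS = 20    # keep only the most recent turns to cap input size and latency

  def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    trimmed = history[-MAX_TURNS:]    # manual truncation of the chat history
    resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "system", "content": SYSTEM}] + trimmed,
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply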

I’ve got a Tier 5 API key that I’m using. Are you suggesting using my Tier 1 key is going to provide better results?

It’s a single, short Python script. As I said, there’s no multithreading; it’s two bots talking to each other, one at a time. I’m running the script off my machine.

That’s correct, this is mainly a test of the assistant bot’s response time over time. I’m sorry, but anything over 5 seconds (3 if we’re being real) is a poor user experience for a chat app, let alone the well over 10 seconds we’re seeing here.

The logic breaks down within the first step: you create a thread (on the Assistants API endpoint?). Then you send some “assistants message” to Chat Completions instead?

Which part doesn’t make sense? It’s simple. I’m using the completions endpoint to generate a question for me, I’m timing the assistant bot’s response time to that – that’s the important part.

I understand that the increasing context size as the thread grows longer is, in your words, ‘natural behaviour’, but these response times are way too high for a production app, or honestly for any app with a user base.

The ChatGPT mobile app with its own threads seems to respond way quicker – what gives? How can I achieve sub-5-second response times across the board?

Use streaming.

You will have a better interactive experience for users, despite a lower token production rate with large input. Output can be read as it is being produced.

You will still have no output for users while internal tools like file_search are being used, but you can provide a “thinking” interaction in the UI.
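
For example, a sketch using the SDK’s streaming helper for Assistants runs (assuming the openai Python SDK’s beta interface; text_deltas yields text fragments as they are generated):

  from openai import OpenAI

  client = OpenAI()

  # Stream the run instead of polling; print text fragments as they arrive.
  with client.beta.threads.runs.stream(
    thread_id=thread_id,          # an existing thread
    assistant_id=assistant.id,    # an existing assistant
  ) as stream:
    for text in stream.text_deltas:
      print(text, end="", flush=True)
    print()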

Use streaming.

Streaming will not help here. Yes, tokens will appear as they come, but the entire message will still take 15+ seconds to appear. That’s the problem.

Your own tools, your own Chat Completions backend - you will have high visibility into, and understanding of, what is happening during any tool use. With Chat Completions there is no need for multiple calls to create threads, place messages, run, and poll. You send context, and the tokens start flowing in under a second.

A year of concerns about the “beta” nature of Assistants has not been fully addressed.

I can see the same problem now after testing some requests…

Generating 200 words (I asked it to write a poem of roughly 200 words max, and the outputs mostly looked about that size) takes between 7.5 and 18 seconds when testing manually ~10 times.

Meanwhile, on my Azure deployment it barely touches 3 seconds, more like ~2 seconds, for the same task.

I am only Tier 4 on the OpenAI platform btw.

My local Llama 2 running in a CUDA-supported Docker container takes much less than a second. It feels instant.

@jochenschultz Thank you for the validation. This response time is absolutely unacceptable for an app with users. What options are there?

You might use something like Mercure or a WebSocket server and build a loading bar. I come from a generation where that was standard for high-quality software, because it meant the machine was doing something “heavy”, and users accept that when you give them feedback about what is happening.

So something like streaming the result, counting the words, and every ~20 words sending a mercure.notify_user(10) # 20, 30, 40, etc. to move the loading bar might make it a little less painful.
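
A rough sketch of that idea with Chat Completions streaming; notify_user() is a stand-in for a Mercure publish or a WebSocket push, not a real library call:

  from openai import OpenAI

  client = OpenAI()

  def notify_user(percent: int) -> None:
    print(f"[progress] {percent}%")    # placeholder for a Mercure/WebSocket push

  def stream_with_progress(messages, expected_words: int = 200) -> str:
    stream = client.chat.completions.create(model="gpt-4o", messages=messages, stream=True)
    text, last_bucket = "", 0
    for chunk in stream:
      delta = chunk.choices[0].delta.content or ""
      text += delta
      words = len(text.split())
      bucket = words // 20      # bump the loading bar every ~20 words
      if bucket > last_bucket:
        last_bucket = bucket
        notify_user(min(100, int(100 * words / expected_words)))
    return text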

Yeah, and like I said: a deployment of a gpt-4o model on Azure is a lot faster than the OpenAI API.

Just to be clear, when you say this deployment on Azure, is this still using the Assistants API?

No it doesn’t. Why would you want to use that exactly?

There is data to analyze? Are you perhaps a frontend developer?

:wink:

Are you suggesting using my Tier 1 key is going to provide better results?

I am not.

You can revisit and see if anything has changed in overall performance.

Assistants came to a head of near failure today, with very long queue wait times or errors. It is possible that there were performance issues leading up to today’s widely reported problems, which have now been rectified.

Still poor results: