[Results graphs at bottom, see below]
I’m designing an app with one assistant and one thread per user. The assistant is a chat bot that users of the app talk to, and each user gets their own thread. Low latency of the chat bot’s responses matters a lot to me, so I ran a test measuring the bot’s response time (the duration of the run, i.e. the invocation of the assistant on a user’s thread that returns a text response) as the thread grows in size. It’s also important that the assistant’s text response never exceeds 200 words. This is made explicit in the prompt I gave it; I add this line: “ENSURE YOUR RESPONSE IS NO LONGER THAN 200 WORDS.”
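For reference, here’s a minimal sketch of that setup using the openai Python SDK. The model name, the assistant name, and everything in the instructions other than the 200-word line are placeholders, not necessarily what my app uses:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # One assistant shared by the whole app; the 200-word rule lives in its
    # instructions. The model and the surrounding prompt text are placeholders.
    assistant = client.beta.assistants.create(
        model="gpt-4o",
        name="app-chat-bot",
        instructions=(
            "You are the app's chat bot. Answer the user's messages helpfully. "
            "ENSURE YOUR RESPONSE IS NO LONGER THAN 200 WORDS."
        ),
    )

    # One thread per user, created when that user first opens the chat.
    thread = client.beta.threads.create()
    thread_id = thread.id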
The test is:
- A new thread is created.
- The assistant’s latest message is fed into the plain Chat Completions API, which simulates a user (its prompt essentially says: ask a question every time and keep the conversation going). Its reply is written to the thread as a user message.
- The assistant responds (a run is made on the thread).
- The assistant’s response is fed back into the Chat Completions API, which generates a new user message for the assistant to respond to.
- Repeat this 200 times.
So each test involves a fresh thread and 200 runs, i.e. 400 messages in total, 200 of them from the assistant.
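Roughly, the loop looks like this (a sketch continuing from the setup above; the simulated-user prompt is a paraphrase of the one I actually used, and the model name is a placeholder):

    import time

    # The plain Chat Completions API plays the part of the user.
    # This system prompt is a paraphrase, not the exact one from the test.
    sim_messages = [{
        "role": "system",
        "content": "Ask a question every time and keep the conversation going.",
    }]

    for i in range(200):
        # 1. Generate the next "user" message (on the first pass the simulator
        #    just opens the conversation from its system prompt alone).
        completion = client.chat.completions.create(
            model="gpt-4o", messages=sim_messages
        )
        user_text = completion.choices[0].message.content
        sim_messages.append({"role": "assistant", "content": user_text})

        # 2. Write it to the user's thread.
        client.beta.threads.messages.create(
            thread_id=thread_id, role="user", content=user_text
        )

        # 3. Run the assistant on the thread and time it (same timing code as Note 2).
        start_time = time.time()
        run = client.beta.threads.runs.create_and_poll(
            thread_id=thread_id, assistant_id=assistant.id
        )
        run_length = time.time() - start_time

        # 4. Fetch the assistant's newest reply (messages.list returns newest
        #    first by default) and hand it back to the simulated user.
        reply = client.beta.threads.messages.list(thread_id=thread_id, limit=1).data[0]
        assistant_text = reply.content[0].text.value
        sim_messages.append({"role": "user", "content": assistant_text})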
Concerning result 1: a clear uptrend in response time. Well over 10 seconds from the 12th run onwards, climbing to 30 seconds!
Concerning result 2: the response length rule is not being respected. The prompt states not to make the response longer than 200 words, yet after a while in the thread this stops being respected, it looks like from the 10th run onwards.
I thought rate limiting might be affecting the response times, HOWEVER the test took 2 hours, so I don’t think this is a rate-limiting issue. The OpenAI API key I’m using is Tier 5 anyway.
Does anyone have any insight as to what’s going on?
Note: the x-axis label says 300 runs when in reality there are 200.
Note 2: Here’s the code for measuring the run response time:

    import time

    start_time = time.time()
    # create_and_poll blocks until the run reaches a terminal status.
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread_id,
        assistant_id=assistant.id,
    )
    if run.status == 'completed':
        # Wall-clock duration of the run, in seconds.
        run_length = time.time() - start_time
Note 3: I’ve conducted this test multiple times on various days and the trend is the same.