Hey, I’m using GPT-3.5 Turbo with the Assistants API for my application.
I know it takes some time to process a request with a longer system prompt and message history, but when the user message is only 2 or 3 sentences it replies quickly, while a message of 5 to 7 sentences takes far too long, 6 to 7 minutes or more.
The issue here is latency. Should I shorten my system prompt? I don’t think that will help, because the response is already fast when the user messages are short.
The model you should use is gpt-3.5-turbo-1106; it is the only 3.5 variant that supports parallel tool calls and has the longer context length needed to track function calls.
Why the slowdown? I suspect you’ve also uploaded some documents, and the assistant has gone into a loop, trying to retrieve everything with mismanaged function calls. You can check the number of run steps for a run.
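A minimal sketch of checking the step count, assuming the openai Python SDK v1.x and placeholder thread/run IDs (yours will differ):

```python
def count_run_steps(client, thread_id: str, run_id: str) -> int:
    """Return how many steps (tool calls, message creations) a run took."""
    steps = client.beta.threads.runs.steps.list(thread_id=thread_id, run_id=run_id)
    return len(steps.data)

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # "thread_abc123" / "run_abc123" are placeholder IDs
    n = count_run_steps(client, "thread_abc123", "run_abc123")
    print(f"run executed {n} steps")
```

If you see dozens of steps for a single question, that is the retrieval/tool loop eating your time, not the model itself.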
Also, log into your account and check your rate tier under Limits. Tier 1 accounts can get slower models (really just slower token output), which makes any internal writing take much longer.
max_tokens is not a parameter you can set on Assistants, and you have no control over how long the internal writing of tool commands gets.
It sounds like you would be better off using the Chat Completions endpoint to talk to the AI.
The code is simpler, you can stream words to the user immediately, and you control how much of the old conversation is sent each time.
Here is a small example for a single user in Python. You will be able to see how fast the AI can produce language, and the conversation history is capped by a number of turns.
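A sketch of that Chat Completions approach, assuming the openai Python SDK v1.x; the model name and the `max_turns` value are illustrative choices, not requirements:

```python
SYSTEM = {"role": "system", "content": "You are a helpful assistant."}

def trim_history(history: list, max_turns: int = 5) -> list:
    """Keep only the last max_turns user/assistant pairs of the conversation."""
    return history[-(max_turns * 2):]

def chat(client, history: list, user_text: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one user turn, streaming the reply token-by-token as it is written."""
    history.append({"role": "user", "content": user_text})
    # The system prompt is re-attached on every send; only recent turns follow it.
    messages = [SYSTEM] + trim_history(history)
    parts = []
    # stream=True yields chunks as the model writes, so words appear immediately
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    reply = "".join(parts)
    history.append({"role": "assistant", "content": reply})
    return reply

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    history: list = []
    chat(client, history, "Hello!")
```

Because `trim_history` bounds what is sent, the prompt stops growing after `max_turns` exchanges, so latency stays flat no matter how long the user keeps chatting.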