In the API, we want each message to be shorter, but we can only limit the total run in terms of tokens. In real-world use, API requests can take 15-20 seconds, and users will be unhappy with that.
If a per-message token limit isn't going to be added, then I think the speed necessarily has to increase. Will there be a significant speed improvement after the beta? If so, when is that development planned?
Assistants takes multiple API calls to set into motion, and you are inherently waiting on an AI that can perform multiple internal steps before responding.
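For context, a single Assistants exchange looks roughly like the sketch below (using the openai Python SDK; the assistant ID and message text are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI()

# 1. Create a thread and add the user's message (two separate calls).
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Summarize my document."
)

# 2. Start a run against an existing assistant (placeholder ID).
run = client.beta.threads.runs.create(
    thread_id=thread.id, assistant_id="asst_abc123"
)

# 3. Poll until the run finishes -- this is where the long wait happens,
#    since the assistant may take several internal steps (tools, retrieval).
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

# 4. Only now can the reply be fetched (newest message first).
messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```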
You can use streaming output to receive the text as it is being produced once the AI finally responds.
If you want quick responses, use the chat completions endpoint and do quick, automatic RAG to inject knowledge, instead of relying on an AI that can search documents itself. Streaming there can start writing in under a second.
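As a minimal sketch of that approach (openai Python SDK; the model name and injected excerpt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# A single call; tokens arrive as a stream as soon as generation
# starts, so the user sees text almost immediately.
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[
        # Knowledge injected directly, i.e. "quick and automatic RAG":
        {"role": "system", "content": "Answer using this excerpt: ..."},
        {"role": "user", "content": "What does the excerpt say about pricing?"},
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry role/finish info rather than text
        print(delta, end="", flush=True)
```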
I have to use the Assistants API because I rely on the function calling feature; I know that streaming is available with chat completions. However, I'm afraid streaming is not supported in the Assistants API. Has such a feature been introduced in a recent update that I might have missed?