How to use the Assistants API at conversational speeds

Hi everyone,

I have worked through this awesome blog post from Twilio:

It allows you to call a phone number and talk to ChatGPT in real time. The speeds are good enough for a natural conversation.
The magic part is this:

```js
const chatCompletion = await openai.chat.completions.create({
  messages: messages,
  model: 'gpt-4',
  // (further options truncated in the original post)
});
```

The chat completions call is quick enough to allow a smooth response time.
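
As an aside, the completion can also be streamed, so that text-to-speech can start before the full reply has finished generating. A minimal sketch, assuming the same `openai` client and `messages` array as above:

```js
// Stream the completion so the first words arrive as early as possible.
// Assumes the same `openai` client and `messages` array as above.
const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: messages,
  stream: true,
});

let reply = '';
for await (const chunk of stream) {
  // Each chunk carries a small delta of the assistant's reply.
  reply += chunk.choices[0]?.delta?.content ?? '';
}
```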

This has one big caveat, though: it works with the non-customized ChatGPT model. What I then did was update this code to use the Assistants API: reference my own assistant ID, create a thread, add a message, create a run, and wait for the response, polling in 100 millisecond intervals so as not to lose too much time.
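
Roughly, the flow looks like this (a simplified sketch; `ASSISTANT_ID` and `userUtterance` are placeholders, and exact method shapes may vary between SDK versions):

```js
// Simplified sketch of the Assistants API round trip described above.
const thread = await openai.beta.threads.create();

await openai.beta.threads.messages.create(thread.id, {
  role: 'user',
  content: userUtterance, // the transcribed caller speech (placeholder)
});

let run = await openai.beta.threads.runs.create(thread.id, {
  assistant_id: ASSISTANT_ID, // my custom assistant (placeholder)
});

// Poll every 100 ms until the run finishes; each of these extra steps
// adds latency on top of the model's own generation time.
while (run.status === 'queued' || run.status === 'in_progress') {
  await new Promise((resolve) => setTimeout(resolve, 100));
  run = await openai.beta.threads.runs.retrieve(thread.id, run.id);
}

const threadMessages = await openai.beta.threads.messages.list(thread.id);
const reply = threadMessages.data[0]; // newest message comes first by default
```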

The problem is, the response times are in the 3000 to 4000 millisecond range. While I got it to work, and I am using my custom GPT (i.e. my assistant) via the phone, which is pretty amazing on its own, the response times are nowhere near good enough for a smooth conversation.

How can I combine the speed of chat completions with the custom knowledge and instructions offered by the Assistants API?

Thanks and best regards

Welcome @nils.lamb,

IMO, gpt-4o-mini is currently the fastest model available on the OpenAI API.

However, it’s important to note that calls to Assistants generally take longer than direct calls to the chat completion endpoint. This difference in response time is due to the multiple steps involved in processing Assistant calls.

Keep in mind that model speed is just one factor affecting the overall response time for Assistants.


Here’s a plot showing my observations from an experiment to measure the time to the first token for this model on the chat completions endpoint:
[plot: time to first token, gpt-4o-mini, chat completions endpoint]
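
The measurement itself is simple: time a streaming request until the first content token arrives. A sketch of how such a measurement can be done (not my exact script):

```js
// Sketch: measure time to first token with a streaming request.
const start = Date.now();

const stream = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Say hello.' }],
  stream: true,
});

for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    console.log(`Time to first token: ${Date.now() - start} ms`);
    break; // stop after the first content token arrives
  }
}
```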


True. I am using this very model in the hope that it would be the fastest, but the overhead of using Assistants generally outweighs the performance benefits of even the fastest model, to the degree where a real-time conversation no longer feels like a conversation.

I am hoping that someone has some ideas or maybe even an outlook on how to fix that.

Or is it possible to pass instructions / a knowledge base into the chat completion? Probably not functions, but I could live without those for the time being.
I would imagine that if I just passed the whole knowledge base as message history into the chat completion, it would blow up the token consumption exorbitantly, and while it might, strictly speaking, work, it would no longer be cost-efficient.
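
What I have in mind is something like the sketch below: keep the standing instructions in a system message and inject only the few knowledge-base snippets relevant to the caller's last utterance, instead of the whole knowledge base. Here `retrieveRelevantSnippets` is a placeholder for whatever lookup (keyword search, embeddings, etc.) ends up being used:

```js
// Sketch: instructions + only the relevant knowledge snippets go into
// the chat completions call, instead of the whole knowledge base.
// `retrieveRelevantSnippets` is a placeholder for any lookup mechanism.
const snippets = await retrieveRelevantSnippets(userUtterance);

const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    {
      role: 'system',
      // standing instructions plus just the retrieved knowledge
      content: 'You are my phone assistant.\n' +
        'Use only the following knowledge:\n' + snippets.join('\n'),
    },
    ...conversationHistory, // turns from the current call only
    { role: 'user', content: userUtterance },
  ],
});
```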

If the context is limited to the current instance of the voice call, then yes, gpt-4o-mini on the chat completions endpoint is all you need.

It’s the cheapest model on the API with a 128k context length, costing $0.150 / 1M input tokens and $0.600 / 1M output tokens.
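
For a rough sense of scale at those prices: even a generous 5,000 input tokens plus 300 output tokens per conversational turn works out to 5,000 × $0.150/1M + 300 × $0.600/1M ≈ $0.0009, i.e. less than a tenth of a cent per turn.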
