Realtime API Pricing: VAD and Token Accumulation - A KILLER

First, I think the other voices are of better quality. Second, I am not convinced that removing the previous chat history should affect voice quality in any way. Here's why I think this cost-cutting approach should not affect the voice.

  1. The voice messages we delete do not carry metadata about emotion, tone, etc.
  2. As far as the context of the conversation is concerned, we are still providing essentially the whole context, just in a more concise form.
  3. Each voice generation by the model should, in terms of tone continuity, depend largely on the last 1 or 2 turns it took, and we already keep those as a buffer at all times (a rough sketch of what I mean follows this list).
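
To make point 3 concrete, here is a minimal sketch of what keeping the last couple of turns as a buffer could look like. It assumes an already-open Realtime API websocket (`ws`, e.g. from the `websockets` library) and that item IDs are tracked from `conversation.item.created` server events; the function name and the turn-counting logic are purely illustrative, and only the `conversation.item.delete` event shape comes from the API itself.

```python
# Minimal sketch: keep only the last few turns alive server-side.
import json

KEEP_LAST_TURNS = 2        # roughly the last 1-2 turns kept as a tone buffer
item_ids: list[str] = []   # oldest -> newest, appended from server events


async def on_item_created(event: dict, ws) -> None:
    """Record each new conversation item, then prune anything older than the buffer."""
    item_ids.append(event["item"]["id"])

    # Treat a "turn" as one user item plus one assistant item.
    keep_items = KEEP_LAST_TURNS * 2
    while len(item_ids) > keep_items:
        stale_id = item_ids.pop(0)
        # Deleting the item removes it from the server-side conversation state,
        # so its tokens are no longer re-billed as input on later responses.
        await ws.send(json.dumps({
            "type": "conversation.item.delete",
            "item_id": stale_id,
        }))
```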

Here’s what I think could be the problem.

  1. We are feeding the summarized context via a system message. Ideally, this would have been done via conversation items, but as I noted earlier, the API currently does not support that and stops producing audio. (See the sketch below for the two routes I am contrasting.)
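
For concreteness, here is a rough sketch of those two routes, assuming the "system message" route means folding the summary into the session instructions and the "conversation items" route means creating an assistant-role text item. The function names and the `ws`/`summary` variables are purely illustrative; only the event shapes come from the Realtime API.

```python
import json


async def inject_summary_as_system_message(ws, summary: str) -> None:
    """The 'system message' route, read as updating the session instructions."""
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": f"Summary of the conversation so far: {summary}",
        },
    }))


async def inject_summary_as_conversation_item(ws, summary: str) -> None:
    """The 'conversation items' route: an assistant-role item carrying the
    summary text. This is the variant where, in our testing, the model then
    stops producing audio on subsequent responses."""
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": summary}],
        },
    }))
```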

Another way of feeding the summarized context, if it must go through conversation items, is to always keep at least the last 3 audio responses from the model. This has two problems: first, obviously, a higher cost; and second, it frankly sounds like a hack and is based only on experimental observation. There is no guarantee it will work deterministically in production.
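
Sketched out, that workaround would look something like the following; `items` (a local mirror of the server-side conversation holding each item's `id` and `role`) and the `ws` connection are assumptions, as in the earlier sketch.

```python
import json

KEEP_ASSISTANT_AUDIO = 3   # always retain the model's last 3 audio responses


async def prune_all_but_recent_audio(ws, items: list[dict]) -> None:
    """Delete every conversation item except the newest assistant responses."""
    assistant_ids = [it["id"] for it in items if it["role"] == "assistant"]
    keep = set(assistant_ids[-KEEP_ASSISTANT_AUDIO:])
    for it in items:
        if it["id"] not in keep:
            await ws.send(json.dumps({
                "type": "conversation.item.delete",
                "item_id": it["id"],
            }))
```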

While testing the strategy/hypothesis presented by @zia.khan, I am reminded of this comment from @stevenic:

@zia.khan how did you arrive at the conclusion that

What @stevenic says lines up with the rationale for carrying forward audio tokens. What @zia.khan seems to have observed (yet to be verified) makes one question the purpose of carrying forward the very tokens that are ultimately responsible for the cost inflation.

Also revisiting @jeffsharris's note from earlier:

@jeffsharris, any chance you could clarify this for us?
