Yes, that’s indeed what happens. Assuming the API generates both text and audio in each turn and you send only text, the context sent on each turn typically looks like this:
---- Turn 1 ----
User: input_text1
Assistant: output_text1 + output_audio1
---- Turn 2 ----
User: input_text2 + (input_text1 + output_text1 + output_audio1)
Assistant: output_text2 + output_audio2
And so on.
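To make the cumulative cost concrete, here's a rough back-of-the-envelope sketch in Python. The per-turn token figures are made-up assumptions purely for illustration (they assume audio tokens dominate, as they tend to in practice), not measured numbers:

```python
# Hypothetical per-turn token counts, purely for illustration:
USER_TEXT = 30         # input_text_N
ASSISTANT_TEXT = 60    # output_text_N (the audio transcript)
ASSISTANT_AUDIO = 900  # output_audio_N (assumed to dominate the cost)

def input_tokens_for_turn(n: int) -> int:
    """Tokens the model reads on turn n: the new user text plus the
    accumulated text + audio of every previous turn."""
    history = (n - 1) * (USER_TEXT + ASSISTANT_TEXT + ASSISTANT_AUDIO)
    return USER_TEXT + history

for n in range(1, 6):
    print(f"Turn {n}: ~{input_tokens_for_turn(n)} input tokens")
```

Under these assumptions the input grows by roughly a thousand tokens every turn, which is why the cost compounds so quickly in longer sessions.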
They released prompt caching a couple of days ago to reduce this cumulative cost, which is a welcome update, but it’s still too expensive for my use case.
The only thing that looks a bit odd to me is that output_text1 (which corresponds to the response audio transcript) also seems to be fed back to the model each turn, although this is hard to confirm just by looking at the usage returned in each response.done event. My question is: if the output audio is already being passed in each turn to maintain context, is it necessary to pass the transcript in as well? I hope an OpenAI dev can provide an answer to this.
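For reference, this is roughly how I've been pulling the breakdown out of each response.done event to compare text vs. audio input tokens across turns. The usage field names below are the ones I see in my payloads and may not match yours exactly, so treat this as a sketch rather than a definitive schema:

```python
import json

def log_usage(event_json: str) -> None:
    """Print the token breakdown from a response.done event so the
    text vs. audio split of the input can be tracked turn by turn."""
    event = json.loads(event_json)
    if event.get("type") != "response.done":
        return
    usage = event.get("response", {}).get("usage", {})
    in_details = usage.get("input_token_details", {})  # field names as observed; may differ
    print(
        f"input={usage.get('input_tokens')} "
        f"(text={in_details.get('text_tokens')}, "
        f"audio={in_details.get('audio_tokens')}, "
        f"cached={in_details.get('cached_tokens')}) "
        f"output={usage.get('output_tokens')}"
    )
```

Even with this, the input text tokens lump together system prompt, user text, and any transcripts, so it still doesn't cleanly answer whether the transcript itself is being resent.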