There are multiple factors at play here.
- The effective price per minute depends on how many turns have already happened in the conversation. As the number of turns grows, so does the price, because the conversation history keeps getting bigger and the model re-consumes all of it with each new turn (see the cost sketch after this list).
- Prompt caching should be part of the equation as well, because all or most tokens from previous turns will hit the cache (both audio and text, both input and output).
- The model's context window is 128k tokens, but the maximum output per response is only 4,096 tokens. The average number of input tokens, the average number of output tokens (which are priced differently), and the overall ratio between the two are another major factor.
- Function calls should also be considered: function definitions can be rather large depending on the use case, and if the model calls them multiple times, or calls multiple functions per turn (which may not be supported at this time, don't quote me on that), token usage will increase significantly (see the tool-schema example below).
- Tokenization of text (and maybe audio, not known at this time) varies in efficiency depending on the language of the input. For the tokenizers used in OpenAI models, English is the most efficient (i.e. the fewest tokens per amount of text). This means that if text (and maybe audio) tokenizes very inefficiently, the token count, and with it the price per turn/session, will increase (see the comparison below).
- The model's own behavior can produce extra turns (verifying user input, confirming that it heard the user correctly, etc.), which also increases usage.
- The model can "glitch": hallucinate, call functions erroneously, or produce an enormous amount of output; users can attempt prompt injection, push the model out of bounds, etc. That is also a major cost factor, and a client-side budget guard (sketched at the end) is about the only defense you fully control.
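
To make the first three points concrete, here is a rough back-of-the-envelope estimator. All prices and per-turn token counts below are placeholder assumptions, not actual OpenAI rates; the point is the shape of the math: cumulative input grows roughly quadratically with the number of turns, and caching softens, but does not remove, that growth.

```python
# Rough per-session cost sketch. ALL numbers are placeholder assumptions,
# not real OpenAI prices -- plug in the current rates from the pricing page.
PRICE_IN = 100.0 / 1_000_000         # $ per uncached input token (assumed)
PRICE_IN_CACHED = 20.0 / 1_000_000   # $ per cached input token (assumed)
PRICE_OUT = 200.0 / 1_000_000        # $ per output token (assumed)

def estimate_session_cost(turns: int,
                          in_per_turn: int = 300,    # fresh input per turn (assumed)
                          out_per_turn: int = 150):  # model output per turn (assumed)
    """Estimate the cost of a multi-turn realtime session.

    Each turn the model re-reads the whole history, so cumulative input
    grows roughly quadratically with the turn count.
    """
    cost = 0.0
    history = 0  # tokens accumulated from previous turns
    for _ in range(turns):
        # previous turns mostly hit the prompt cache...
        cost += history * PRICE_IN_CACHED
        # ...while this turn's fresh input and the output are full price
        cost += in_per_turn * PRICE_IN + out_per_turn * PRICE_OUT
        history += in_per_turn + out_per_turn
    return cost

for n in (5, 20, 50):
    print(f"{n:>3} turns: ~${estimate_session_cost(n):.2f}")
```

Note how doubling the number of turns far more than doubles the cost; that is the practical takeaway from the first bullet.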
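For the function-calling point, you can get a feel for the fixed per-turn overhead by tokenizing your tool definitions. How the realtime model serializes tools internally isn't public, so this is only an approximation, and the tool schema here is hypothetical:

```python
import json
import tiktoken

# Hypothetical tool definition -- substitute your own schema.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# o200k_base is the GPT-4o text tokenizer; assuming the realtime model's
# text side is comparable, this approximates the per-tool overhead that
# gets re-sent (or at best cache-discounted) on every single turn.
enc = tiktoken.get_encoding("o200k_base")
print(len(enc.encode(json.dumps(weather_tool))), "tokens (approx.) per tool")
```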
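The language-efficiency point is easy to demonstrate with tiktoken, again assuming the realtime model's text tokenizer behaves like GPT-4o's o200k_base (the sample sentences are just illustrations):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
samples = {
    "English": "Please check the weather for tomorrow morning.",
    "German": "Bitte pruefe das Wetter fuer morgen frueh.",
    "Japanese": "明日の朝の天気を確認してください。",
}
for lang, text in samples.items():
    tokens = enc.encode(text)
    print(f"{lang:>8}: {len(tokens):>3} tokens for {len(text)} characters")
```

The tokens-per-character ratio differs noticeably across languages, and that ratio feeds directly into the per-turn price.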
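Finally, on the runaway-behavior point: a minimal sketch of a client-side budget guard that kills the session once estimated spend crosses a cap. The dollar conversion and the `estimate_turn_cost` helper are hypothetical; feed it whatever per-response token usage your client library surfaces from the server events.

```python
class SessionBudget:
    """Abort a realtime session once estimated spend crosses a cap."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent = 0.0

    def record(self, usd_for_turn: float) -> None:
        self.spent += usd_for_turn
        if self.spent > self.max_usd:
            raise RuntimeError(
                f"Budget exceeded: ${self.spent:.2f} > ${self.max_usd:.2f}; "
                "closing session before a runaway turn gets expensive."
            )

# Usage sketch: after each completed model response, convert the reported
# token usage to dollars and record it.
budget = SessionBudget(max_usd=1.00)
# budget.record(estimate_turn_cost(usage))  # hypothetical helper
```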