Let's break down the input/output token details together!

I am listening for the response.done server event and logging it like this:

// Log token usage once the model finishes a response
if (response.type === 'response.done') {
  console.log('response.done output token details', response.response?.usage?.output_token_details);
  console.log('response.done input token details', response.response?.usage?.input_token_details);
}

// Transcript of the user's audio input (via whisper-1)
if (response.type === 'conversation.item.input_audio_transcription.completed') {
  console.log('User says:', response.transcript);
}

// Transcript of the assistant's audio output
if (response.type === 'response.audio_transcript.done') {
  console.log('Assistant says: ', response.transcript);
}

Here is an example output for a brief conversation I had:

Sending session update:

{
  "type": "session.update",
  "session": {
    "turn_detection": { "type": "server_vad" },
    "input_audio_format": "g711_ulaw",
    "output_audio_format": "g711_ulaw",
    "voice": "alloy",
    "instructions": "Be concise and professional",
    "modalities": ["text", "audio"],
    "temperature": 0.8,
    "input_audio_transcription": { "model": "whisper-1" }
  }
}

User says: Hello.

Assistant says:  Hello! How can I assist you today?

response.done output token details { text_tokens: 19, audio_tokens: 48 }
response.done input token details { cached_tokens: 0, text_tokens: 13, audio_tokens: 7 }

User says: Yeah, I was hoping to learn a little bit about monkeys.

Assistant says:  Monkey is a high-level, imperative, and dynamically typed programming language. It's known for its simplicity and readability, making it a good choice for beginners. Monkey features a clean and simple syntax, allowing developers to focus on learning core programming concepts without getting bogged down by complex syntax. It's also often used for educational purposes, such as teaching programming or language implementation. Is there anything specific you'd like to know about Monkey?

response.done output token details { text_tokens: 123, audio_tokens: 666 }
response.done input token details { cached_tokens: 0, text_tokens: 42, audio_tokens: 97 }

User says: No, no, no, no, no, I'm talking about monkey the animal

Assistant says:  Ah, I see! Monkeys are a group of primates that include various species, ranging from small marmosets to large mandrills. They are known for their intelligence, dexterous hands, and complex social behaviors. Monkeys are mostly arboreal, meaning they live in trees, and are found in various regions around the world, particularly in Central and South America, Africa, and Asia. Monkeys can be divided into two main groups: New World monkeys and Old World monkeys. Is there a particular type of monkey you're interested in?

response.done output token details { text_tokens: 152, audio_tokens: 706 }
response.done input token details { cached_tokens: 0, text_tokens: 175, audio_tokens: 796 }

User says: Gotcha, that's really cool.

Assistant says:  That's great to hear!

response.done output token details { text_tokens: 9, audio_tokens: 12 }
response.done input token details { cached_tokens: 0, text_tokens: 345, audio_tokens: 1534 }

I’ve been using the tokenizer to try to reproduce some of these token counts, but I haven’t been able to identify a consistent pattern.

This matters to me because I want to inject external context, such as data from a database, into the context window. When I tried this, I noticed an unusually large number of tokens being consumed, particularly text_tokens, and I hit the 20k tokens-per-minute limit after only about a minute of chatting.

Does anyone know how these token values are calculated? And if I inject a system message of around 1,500 tokens into the conversation using conversation.item.create, why does that significantly increase the likelihood of hitting the 20k tokens-per-minute limit?
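For reference, here is roughly how I'm injecting the context. This is only a sketch: ws stands for my open Realtime API WebSocket, and the text payload is a placeholder for the actual database snippet.

// Sketch: inject external context as a system message.
// Assumes ws is the open Realtime API WebSocket connection;
// the text is a stand-in for my ~1,500-token database snippet.
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'system',
    content: [
      { type: 'input_text', text: '...roughly 1,500 tokens of database context...' }
    ]
  }
}));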

Update: I shared this thread with o1, which confirmed that every new message sent to the model includes the full conversation history.

My plan now is to try the conversation.item.delete client event, probably tomorrow, to see if it helps manage token usage.
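If I have the event shape right, it should look something like this, where oldItemId is an ID I recorded from an earlier conversation.item.created server event:

// Sketch: remove one conversation item by ID.
// Assumes ws is the open WebSocket and oldItemId was captured
// client-side from a conversation.item.created event.
ws.send(JSON.stringify({
  type: 'conversation.item.delete',
  item_id: oldItemId
}));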

The answer is as simple as this:

Handling long conversations
If a conversation goes on for a sufficiently long time, the input tokens the conversation represents may exceed the model’s input context limit (e.g. 128k tokens for GPT-4o). At this point, the Realtime API automatically truncates the conversation based on a heuristic-based algorithm that preserves the most important parts of the context (system instructions, most recent messages, and so on). This allows the conversation to continue uninterrupted.

Saying “hi again” in a long voice session? Or even a blip of background noise? Either one re-sends the entire accumulated history as input:
$10.00 / 100k input tokens

conversation.item.truncate lets you trim the audio portion of a previous assistant response: you refer to the correct content part for the audio modality and keep only part of it. One particular use: you want the AI’s context to reflect that it was cut off mid-sentence. A sketch follows.
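Here is a sketch of that event, assuming you tracked the assistant item’s ID and know how many milliseconds of its audio actually played before the cutoff:

// Sketch: keep only the first 1500 ms of the assistant's audio.
// content_index 0 assumes the audio is the item's first content part;
// assistantItemId was recorded client-side when the item was created.
ws.send(JSON.stringify({
  type: 'conversation.item.truncate',
  item_id: assistantItemId,
  content_index: 0,
  audio_end_ms: 1500
}));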

conversation.item.delete lets you remove a single turn, referenced by the ID you recorded client-side when it was created. You are maintaining and synchronizing your own copy of the chat history even though the API is stateful, right? (There is no list method to recover item IDs if, say, a server error left you out of sync.)

Not permitted: setting a maximum context-token threshold, or converting a voice input or audio turn into the text-only transcript you also paid for while retaining any audio whatsoever in the session.

Back in the old days (last year) we had only 4,000 tokens of context, and we were doing well to fit four or five turns of conversation history in the input window.

Given the cost, it might be wise to go back to that strategy: track your message IDs and delete everything but the last five turns (a sketch follows). You rarely need to look back more than three turns anyway.
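A minimal sketch of that rolling-window strategy, assuming you collect item IDs from conversation.item.created server events as they arrive (onItemCreated is a hypothetical handler name):

// Sketch: prune the conversation to the last MAX_TURNS turns.
// Assumes ws is the open WebSocket and each turn produces two items
// (one user, one assistant), recorded in arrival order.
const MAX_TURNS = 5;
const itemIds = [];

function onItemCreated(event) {
  itemIds.push(event.item.id);
  while (itemIds.length > MAX_TURNS * 2) {
    const oldest = itemIds.shift();
    ws.send(JSON.stringify({ type: 'conversation.item.delete', item_id: oldest }));
  }
}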
