I am listening for the response.done server event and logging its usage details as follows:
if (response.type === 'response.done') {
  console.log('response.done output token details', response.response?.usage?.output_token_details);
  console.log('response.done input token details', response.response?.usage?.input_token_details);
}
if (response.type === 'conversation.item.input_audio_transcription.completed') {
  console.log('User says:', response.transcript);
}
if (response.type === 'response.audio_transcript.done') {
  console.log('Assistant says: ', response.transcript);
}
Here is an example output for a brief conversation I had:
Sending session update: {"type":"session.update","session":{"turn_detection":{"type":"server_vad"},"input_audio_format":"g711_ulaw","output_audio_format":"g711_ulaw","voice":"alloy","instructions":"Be concise and professional","modalities":["text","audio"],"temperature":0.8,"input_audio_transcription":{"model":"whisper-1"}}}
User says: Hello.
Assistant says: Hello! How can I assist you today?
response.done output token details { text_tokens: 19, audio_tokens: 48 }
response.done input token details { cached_tokens: 0, text_tokens: 13, audio_tokens: 7 }
User says: Yeah, I was hoping to learn a little bit about monkeys.
Assistant says: Monkey is a high-level, imperative, and dynamically typed programming language. It's known for its simplicity and readability, making it a good choice for beginners. Monkey features a clean and simple syntax, allowing developers to focus on learning core programming concepts without getting bogged down by complex syntax. It's also often used for educational purposes, such as teaching programming or language implementation. Is there anything specific you'd like to know about Monkey?
response.done output token details { text_tokens: 123, audio_tokens: 666 }
response.done input token details { cached_tokens: 0, text_tokens: 42, audio_tokens: 97 }
User says: No, no, no, no, no, I'm talking about monkey the animal
Assistant says: Ah, I see! Monkeys are a group of primates that include various species, ranging from small marmosets to large mandrills. They are known for their intelligence, dexterous hands, and complex social behaviors. Monkeys are mostly arboreal, meaning they live in trees, and are found in various regions around the world, particularly in Central and South America, Africa, and Asia. Monkeys can be divided into two main groups: New World monkeys and Old World monkeys. Is there a particular type of monkey you're interested in?
response.done output token details { text_tokens: 152, audio_tokens: 706 }
response.done input token details { cached_tokens: 0, text_tokens: 175, audio_tokens: 796 }
User says: Gotcha, that's really cool.
Assistant says: That's great to hear!
response.done output token details { text_tokens: 9, audio_tokens: 12 }
response.done input token details { cached_tokens: 0, text_tokens: 345, audio_tokens: 1534 }
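To make the numbers in the log above easier to compare, here is a minimal sketch that tallies them turn by turn. The values are copied from the log; the `turns` array and the `tallyUsage` helper are hypothetical names used only for this illustration. The `delta` column is the part of each turn's input that is not explained by the previous turn's input plus output. It stays small in this log, which suggests (but does not prove) that each response counts the whole conversation so far as input again.

```javascript
// Usage values copied from the response.done log above (one entry per turn).
const turns = [
  { input: { text: 13, audio: 7 }, output: { text: 19, audio: 48 } },
  { input: { text: 42, audio: 97 }, output: { text: 123, audio: 666 } },
  { input: { text: 175, audio: 796 }, output: { text: 152, audio: 706 } },
  { input: { text: 345, audio: 1534 }, output: { text: 9, audio: 12 } },
];

// Hypothetical helper: per-turn totals, a running sum across turns, and the
// input not accounted for by the previous turn's input + output ("delta").
function tallyUsage(turns) {
  let running = 0;
  let previous = null;
  return turns.map((t, i) => {
    const input = t.input.text + t.input.audio;
    const output = t.output.text + t.output.audio;
    running += input + output;
    const delta = previous === null ? input : input - (previous.input + previous.output);
    previous = { input, output };
    return { turn: i + 1, input, output, delta, running };
  });
}

const tally = tallyUsage(turns);
console.table(tally);
```

For this conversation the input totals per turn come out as 20, 139, 971 and 1879, and the running total of input plus output tokens after four turns is 4744.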
I’ve been using the tokenizer to attempt to reproduce certain token values, but I haven’t been able to identify a consistent pattern.
The reason this is important to me is that I want to inject external context, such as data from a database, into the context window. However, when I did this, I noticed an unusually large number of tokens being consumed, particularly in text_tokens, which led to hitting the 20k tokens-per-minute threshold after only about a minute of chatting.
Does anyone know how these token values are being calculated? Additionally, if I inject a system message containing around 1,500 tokens into the conversation using conversation.item.create, could someone explain why this significantly increases the likelihood of hitting the 20k tokens-per-minute threshold?
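As a rough way to frame the second question, here is a back-of-the-envelope sketch. It assumes, without confirmation, that an item added with conversation.item.create is counted as input again on every later response, as the growing input counts in the log above seem to suggest. Both function names are hypothetical, and the estimate ignores every other token in the conversation.

```javascript
// Hypothetical estimate: input tokens attributable to one injected item.
// ASSUMPTION (unconfirmed): the item is re-counted as input on every response.
function injectedTokenCost(itemTokens, responses) {
  return itemTokens * responses;
}

// Hypothetical: how many responses fit under a per-minute token budget if the
// injected item were the only cost (all other conversation tokens ignored).
function responsesUntilLimit(itemTokens, tokensPerMinuteLimit) {
  return Math.floor(tokensPerMinuteLimit / itemTokens);
}

console.log(injectedTokenCost(1500, 10));
console.log(responsesUntilLimit(1500, 20000));
```

Under that assumption, a 1,500-token item would account for 15,000 input tokens over ten responses, and 13 responses within one minute would be enough to reach a 20k budget from the injected item alone.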