I want to ask a couple of questions about the pricing and usage of the live models (aka speech-to-speech models):
I am using pipecat.
1 - How are we billed for the different modalities (audio to audio / audio to text)? Are text and audio tokens billed together, or separately depending on the modality?
2 - I want to better understand the usage reported by the model. For example, this is the usage for audio to text:
total_tokens=3021 input_tokens=2971 output_tokens=50 input_token_details=TokenDetails(cached_tokens=2880, text_tokens=2937, audio_tokens=34, cached_tokens_details=CachedTokensDetails(text_tokens=2880, audio_tokens=0), image_tokens=0) output_token_details=TokenDetails(cached_tokens=0, text_tokens=50, audio_tokens=0, cached_tokens_details=None, image_tokens=0)
and this is the usage for audio to audio:
total_tokens=3618 input_tokens=3182 output_tokens=436 input_token_details=TokenDetails(cached_tokens=2688, text_tokens=2848, audio_tokens=334, cached_tokens_details=CachedTokensDetails(text_tokens=2688, audio_tokens=0), image_tokens=0) output_token_details=TokenDetails(cached_tokens=0, text_tokens=101, audio_tokens=335, cached_tokens_details=None, image_tokens=0)
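To make the question concrete, this is how I currently read the two records above. It is only a minimal sketch: it assumes `cached_tokens` is a subset of the input tokens and that each bucket could be priced at its own rate, which is exactly what I would like confirmed. `split_usage` is my own hypothetical helper, not part of pipecat or the model API.

```python
def split_usage(input_text, input_audio, cached_text, cached_audio,
                output_text, output_audio):
    """Split one usage record into buckets that might be billed separately.

    Assumption (unconfirmed): cached tokens are counted inside the
    input text/audio totals, so they are subtracted to get the uncached part.
    """
    return {
        "input_text_uncached": input_text - cached_text,
        "input_text_cached": cached_text,
        "input_audio_uncached": input_audio - cached_audio,
        "input_audio_cached": cached_audio,
        "output_text": output_text,
        "output_audio": output_audio,
    }


# Values copied from the two usage records in this post.
audio_to_text = split_usage(2937, 34, 2880, 0, 50, 0)
audio_to_audio = split_usage(2848, 334, 2688, 0, 101, 335)

# Under the assumption above, the buckets add up to total_tokens.
assert sum(audio_to_text.values()) == 3021
assert sum(audio_to_audio.values()) == 3618

print(audio_to_text)
print(audio_to_audio)
```

If this reading is right, the audio-to-text turn has 57 uncached text input tokens and 34 audio input tokens, and the audio-to-audio turn has 160 uncached text input tokens and 334 audio input tokens. Is that the correct way to split them for billing?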
Thanks