Objectively tracking price/token usage in v1/audio/speech and v1/audio/transcriptions?

I need to benchmark the cost comparison between OpenAI’s gpt-4o-realtime-preview and a chained solution (e.g., gpt-4o transcription → gpt-4o-mini TTS) for a specific application. However, I’m struggling to get objective, request-level usage data from OpenAI’s audio endpoints.

What works well

  • Chat completions: The usage field in v1/chat/completions responses provides exact token counts
  • Realtime API: Usage data is available in the response.done event
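For comparison, this is the shape of the `usage` object that `v1/chat/completions` returns (the response dict is truncated to the relevant field, and the parsing helper is mine, not part of the SDK):

```python
# Truncated v1/chat/completions response showing the `usage` field.
response = {
    "id": "chatcmpl-...",
    "usage": {
        "prompt_tokens": 42,
        "completion_tokens": 128,
        "total_tokens": 170,
    },
}

def token_usage(resp: dict) -> tuple[int, int]:
    """Return (input, output) token counts from a chat completions response."""
    u = resp["usage"]
    return u["prompt_tokens"], u["completion_tokens"]

print(token_usage(response))  # → (42, 128)
```

Nothing comparable appears in the audio endpoint responses, which is the gap described below.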

The problem

The audio endpoints don’t provide any granular usage reporting:

  • v1/audio/transcriptions: No usage field indicating how many input tokens gpt-4o-transcribe and gpt-4o-mini-transcribe received.
  • v1/audio/speech: No usage field showing how many audio tokens were generated.

This makes it impossible to track costs at the individual request level, which I need for accurate benchmarking.

Constraints

  • I don’t have access to my organization’s usage dashboard
  • Even if I did, dashboard data doesn’t provide request-level granularity needed for this comparison

Is there something I’m missing in the API responses, or is this a known limitation that will be corrected?

There is no direct usage info in the responses, but you can roughly estimate costs from the audio input length for STT and the generated audio length for TTS.

Instructions and text input tokens can be counted with a tokenizer (e.g., tiktoken).
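A minimal sketch of that duration-based approach (the per-minute rates and helper names here are illustrative placeholders, not current OpenAI pricing — substitute the rates from the pricing page):

```python
# Rough per-request cost estimate from audio duration.
# PRICE_PER_MINUTE values are assumed placeholders, NOT actual
# OpenAI pricing -- replace them with the published rates.
PRICE_PER_MINUTE = {
    "stt": 0.006,   # assumed $/minute of input audio (transcription)
    "tts": 0.015,   # assumed $/minute of generated audio (speech)
}

def estimate_audio_cost(kind: str, duration_seconds: float) -> float:
    """Estimate cost in USD for one STT or TTS request from its audio length."""
    return PRICE_PER_MINUTE[kind] * duration_seconds / 60.0

# e.g. a 90-second clip transcribed, then a 30-second reply synthesized:
total = estimate_audio_cost("stt", 90) + estimate_audio_cost("tts", 30)
print(round(total, 4))  # → 0.0165
```

The input duration is known from your own audio file (e.g. via the `wave` module for WAV files), and the output duration can be measured from the file the TTS endpoint returns, so both sides of the estimate are available locally per request.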

There are a few more details in this topic: