Objectively tracking price/token usage with v1/audio/speech and v1/audio/transcriptions?
I need to benchmark the cost comparison between OpenAI’s gpt-4o-realtime-preview and a chained solution (e.g., gpt-4o transcription → gpt-4o → gpt-4o-mini TTS) for a specific application. However, I’m struggling to get objective, request-level usage data from OpenAI’s audio endpoints.
What works well
- Chat completions: The usage field in v1/chat/completions responses provides exact token counts
- Realtime API: Usage data is available in the response.done event
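
A minimal sketch of reading those two usage sources, working on already-parsed JSON so it runs without network access. The helper names (TokenUsage, usage_from_chat_completion, usage_from_response_done, cost_usd) are made up for illustration, and the prices passed to cost_usd are placeholders, not real rates. It assumes the chat usage object carries prompt_tokens and completion_tokens and that the response.done event nests its counts under response.usage as input_tokens and output_tokens; verify both against the current API reference.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Token counts for a single request (hypothetical helper type)."""
    input_tokens: int
    output_tokens: int


def usage_from_chat_completion(body: dict) -> TokenUsage:
    """Read token counts from a parsed v1/chat/completions response body."""
    usage = body["usage"]
    return TokenUsage(usage["prompt_tokens"], usage["completion_tokens"])


def usage_from_response_done(event: dict) -> TokenUsage:
    """Read token counts from a parsed Realtime response.done event.

    Assumes the counts live under event["response"]["usage"].
    """
    usage = event["response"]["usage"]
    return TokenUsage(usage["input_tokens"], usage["output_tokens"])


def cost_usd(usage: TokenUsage,
             input_price_per_1m: float,
             output_price_per_1m: float) -> float:
    """Convert token counts to dollars using caller-supplied placeholder prices."""
    total = (usage.input_tokens * input_price_per_1m
             + usage.output_tokens * output_price_per_1m)
    return total / 1_000_000


if __name__ == "__main__":
    # Hand-written sample payloads, not real API output.
    chat_body = {"usage": {"prompt_tokens": 120, "completion_tokens": 30}}
    done_event = {
        "type": "response.done",
        "response": {"usage": {"input_tokens": 200, "output_tokens": 50}},
    }
    print(usage_from_chat_completion(chat_body))
    print(usage_from_response_done(done_event))
```

For the Realtime side, a single input/output price is a simplification: audio and text tokens are priced differently, so a real benchmark would need to split the counts by modality before applying prices.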
The problem
The audio endpoints don’t provide any granular usage reporting:
- v1/audio/transcriptions: No usage field indicating how many input tokens gpt-4o-transcribe and gpt-4o-mini-transcribe received.
- v1/audio/speech: No usage field showing how many audio tokens were generated.
This makes it impossible to track costs at the individual request level, which I need for accurate benchmarking.
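
To make the gap visible in a benchmark, one option is a per-request ledger that records usage where a response carries it and flags the step as untracked where it does not. The sketch below is illustrative only: extract_usage and summarize_pipeline are hypothetical helpers, the step names are arbitrary labels, and the sample bodies are hand-written rather than real API output. Since v1/audio/speech returns audio bytes rather than JSON, that step is represented by an empty dict.

```python
from typing import Optional


def extract_usage(body: dict) -> Optional[dict]:
    """Return the usage object from a parsed response body, or None if absent."""
    usage = body.get("usage")
    return usage if isinstance(usage, dict) else None


def summarize_pipeline(steps: list) -> tuple:
    """Split pipeline steps into those with usage data and those without.

    steps is a list of (step_name, parsed_response_body) pairs for one
    end-to-end request through the chained pipeline.
    """
    tracked = {}
    untracked = []
    for name, body in steps:
        usage = extract_usage(body)
        if usage is None:
            untracked.append(name)   # cost for this step cannot be measured
        else:
            tracked[name] = usage
    return tracked, untracked


if __name__ == "__main__":
    # Hand-written sample bodies for one chained request.
    steps = [
        ("transcription", {"text": "hello"}),
        ("chat", {"usage": {"prompt_tokens": 10, "completion_tokens": 5}}),
        ("speech", {}),  # binary audio response, so no JSON body to inspect
    ]
    tracked, untracked = summarize_pipeline(steps)
    print("tracked:", tracked)
    print("untracked:", untracked)
```

Reporting the untracked steps alongside the measured ones at least shows which part of the chained pipeline's cost is an estimate rather than a measurement.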
Constraints
- I don’t have access to my organization’s usage dashboard
- Even if I did, dashboard data doesn’t provide the request-level granularity needed for this comparison
Is there something I’m missing in the API responses, or is this a known limitation that will be corrected?