Objectively tracking price/token usage in v1/audio/speech and v1/audio/transcriptions?

Responses do not include direct usage info, but you can roughly estimate the cost from the length of the input audio for STT and the length of the output audio for TTS.
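As a rough illustration of the duration-based estimate, here is a minimal sketch that reads the length of a WAV file with Python's standard `wave` module and multiplies it by a per-minute rate. The helper names and the `PRICE_PER_MINUTE_USD` value are assumptions for illustration only. Substitute the current rate for your model from the pricing page, and note that other audio formats (mp3, opus, etc.) would need a different way of reading the duration.

```python
import io
import wave

# Placeholder rate, NOT an official price: look up the real per-minute
# price for the model you are using.
PRICE_PER_MINUTE_USD = 0.006


def wav_duration_seconds(source) -> float:
    """Return the duration in seconds of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def estimate_cost(duration_seconds: float, price_per_minute: float) -> float:
    """Rough cost estimate: audio minutes multiplied by the per-minute rate."""
    return duration_seconds / 60.0 * price_per_minute


# Demo: build 90 seconds of silent 16-bit mono audio in memory,
# standing in for a real STT input file or a saved TTS output.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wav.setframerate(8000)    # 8 kHz
    wav.writeframes(b"\x00\x00" * 8000 * 90)
buf.seek(0)

duration = wav_duration_seconds(buf)
cost = estimate_cost(duration, PRICE_PER_MINUTE_USD)
print(f"duration: {duration} s, estimated cost: ${cost:.4f}")
```

For tracking over time, you could log the duration of every request and sum the estimates, then compare the total against the figures on your usage dashboard to check how close the approximation is.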

The tokens for the instructions and the text input can be counted with a tokenizer.

In this topic there are a few more details: