What does “1M output tokens” mean for audio in the Realtime API pricing?

The TTS models and some others are priced per minute of audio, but I don’t understand what “1M output tokens” means for audio in the Realtime API pricing.

For text it’s obvious, but for audio, does it mean that each generated word (that is, the text that gets converted to audio) counts as a token?