Question about reading speed in the text-to-speech model

Hello everyone,
I am using the TTS API for my project with model "tts-1" and voice "nova". I noticed that the generated audio is read at different speeds; the outputs do not seem to share the same pace.
My guess is that the model adjusts how it reads based on the context of the input text, but I am not sure whether that is correct.
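
For reference, here is a minimal sketch of the kind of request I mean. It assumes the OpenAI speech endpoint (`/v1/audio/speech`) and an `OPENAI_API_KEY` environment variable; the helper names `build_speech_request` and `synthesize` are hypothetical and only for illustration. It sets the `speed` parameter explicitly on every request, so any remaining difference in pace would presumably come from the model rather than from the request settings.

```python
import json
import os
import urllib.request

# Assumed endpoint for the speech API; adjust if your setup differs.
API_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, speed=1.0):
    """Return the JSON payload for one text-to-speech request.

    The same model, voice, and speed are used for every call so that
    the request settings stay identical across inputs.
    """
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "model": "tts-1",
        "voice": "nova",
        "input": text,
        "speed": speed,
    }


def synthesize(text, out_path, speed=1.0):
    """Send one request and write the returned audio bytes to out_path."""
    payload = json.dumps(build_speech_request(text, speed)).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"],
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

Calling `synthesize("First sentence.", "a.mp3")` and `synthesize("Second sentence.", "b.mp3")` would then send two requests with identical settings, which makes it easier to compare the pace of the two files.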
Can anyone explain it to me? Thank you very much.