Why is audio_end_ms smaller than the actual elapsed time?

I’m using the Realtime API and have run into some strange behavior.
I split the audio stream into chunks of 512 ms each.
When I stop speaking, I receive a speech_stopped event with an audio_end_ms value, but in my tests audio_end_ms is always smaller than the actual elapsed time.
For example, the actual elapsed time is 130000 ms, but audio_end_ms is 90000 ms. That is a gap of 40000 ms, which should not be possible. The gap also keeps growing for as long as the session stays alive.
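
A minimal sketch of one way to check where the gap comes from. It assumes audio_end_ms counts the audio appended to the input buffer rather than wall-clock time, and that the input is pcm16 at 24 kHz mono. AudioClock and its methods are hypothetical helper names used for illustration, not part of the API.

```python
import time

# Assumptions (not confirmed by the post): pcm16 input, 24 kHz, mono.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2


class AudioClock:
    """Tracks how much audio was sent versus how much wall-clock time passed."""

    def __init__(self, now=time.monotonic):
        self._now = now          # injectable clock, handy for testing
        self._start = now()      # wall-clock reference at session start
        self.bytes_sent = 0

    def on_chunk_sent(self, chunk: bytes) -> None:
        # Call this each time a chunk is appended to the input buffer.
        self.bytes_sent += len(chunk)

    def audio_ms(self) -> float:
        # Duration of all audio sent so far, derived from the byte count.
        return self.bytes_sent / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE) * 1000

    def wall_ms(self) -> float:
        # Wall-clock time elapsed since the clock was created.
        return (self._now() - self._start) * 1000

    def drift_ms(self) -> float:
        # Positive drift means less audio was sent than time has passed.
        return self.wall_ms() - self.audio_ms()
```

If audio_end_ms stays close to audio_ms() while drift_ms() grows by the same amount as the gap, the chunks are probably being sent slower than real time (or some audio is being dropped before sending), and audio_end_ms should be compared against the audio timeline instead of the wall clock.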

How can I resolve this? If it can’t be solved, how can I trust the results generated by the VAD?