I may be wrong, but this can be achieved with just async TTS and STT.
If you want to do realtime though, the most sensible way is to use two separate realtime sessions.
The first one ends when the hold starts, and you save the context of that conversation somewhere in your system.
The second one starts when the hold ends, and you initialize it with the context from the first one.
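
Here is a rough sketch of the context handoff between the two sessions. It only covers the "save the context, then build the prompt for the second session" part and makes no realtime API calls. `Turn`, `CallContext` and `to_instructions` are names I made up for illustration, and how you actually pass the resulting text into the new session depends on your provider.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str  # e.g. "user" or "assistant"
    text: str


@dataclass
class CallContext:
    """Transcript kept in your own system while no session is open (hypothetical helper)."""

    turns: list = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        # Called for every finished utterance during the first session.
        self.turns.append(Turn(role, text))

    def to_instructions(self, base_prompt: str, max_turns: int = 20) -> str:
        # Keep only the most recent turns so the prompt stays small.
        recent = self.turns[-max_turns:] if max_turns > 0 else []
        history = "\n".join(f"{t.role}: {t.text}" for t in recent)
        return (
            f"{base_prompt}\n\n"
            "The call was on hold and has just resumed. Conversation so far:\n"
            f"{history}"
        )


# First session: record turns as they happen, then close the session on hold.
ctx = CallContext()
ctx.add("assistant", "Hi, I am calling about my order.")
ctx.add("user", "Please hold while I check.")

# Second session: use this text as the initial instructions.
resume_prompt = ctx.to_instructions("You are a phone assistant.")
```

Storing plain text turns rather than audio keeps the saved context small, at the cost of losing tone and timing from the first session.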
However, you would also have to introduce a smaller AI model or a VAD (voice activity detection) system to detect when human speech starts, so that you know when to initialize the second session.
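
To give an idea of what the detection side could look like, here is a very naive energy-based detector. This is only a sketch under some assumptions: audio arrives as 16-bit mono PCM frames in native byte order, and the threshold and frame count are arbitrary values you would have to tune. `frame_rms` and `SpeechStartDetector` are made-up names. Note that a pure energy threshold will also fire on hold music, which is why a real VAD or a small classifier is probably needed in practice.

```python
import math
from array import array


def frame_rms(pcm16: bytes) -> float:
    """Root-mean-square level of one frame of 16-bit PCM (even byte length assumed)."""
    samples = array("h")
    samples.frombytes(pcm16)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class SpeechStartDetector:
    """Reports True once enough consecutive frames are above the threshold."""

    def __init__(self, threshold: float = 500.0, min_frames: int = 5):
        self.threshold = threshold    # assumed value, tune for your audio
        self.min_frames = min_frames  # consecutive loud frames required
        self._run = 0

    def feed(self, pcm16: bytes) -> bool:
        if frame_rms(pcm16) >= self.threshold:
            self._run += 1
        else:
            self._run = 0  # a quiet frame resets the count
        return self._run >= self.min_frames


# Usage: feed frames while on hold, open the second session on the first True.
detector = SpeechStartDetector(threshold=500.0, min_frames=3)
quiet = bytes(320)                           # 160 samples of silence
loud = array("h", [2000] * 160).tobytes()    # 160 samples at a constant level
results = [detector.feed(f) for f in (quiet, loud, loud, loud)]
```

Requiring several loud frames in a row avoids triggering on a single click or pop on the line.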
There aren’t many alternatives because of the 15-minute idle limit. Maybe you could emulate activity by sending arbitrary events, but that’s questionable.