How to implement a real-time flow (voice/text) + vector store (file_search) for fluid audio+text responses?

Hello community,

I am developing a chatbot that requires the following functionality: accept voice or text input, use a file search tool (vector store/uploaded documents) for context retrieval, and generate a fluid response in text and audio format (ideally with streaming).

I have successfully implemented this using the text API (Responses API) and the vector store for text-to-text flow.
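For reference, this is roughly what my working text-to-text setup looks like — a minimal sketch only, where the model name and the vector store ID (`vs_example123`) are placeholders for my own values:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: ID of the vector store created from my uploaded documents
VECTOR_STORE_ID = "vs_example123"

response = client.responses.create(
    model="gpt-4o-mini",  # placeholder model
    input="What does the onboarding guide say about remote work?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [VECTOR_STORE_ID],
    }],
)

# The model answers using passages retrieved from the vector store
print(response.output_text)
```

This part works well; the problem starts when I try to add audio on top of it.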

However, I have not been able to use the Realtime audio model (which supports voice/text input and output) in combination with the vector store. When I instead take the text stream generated by a text model and pipe it into a TTS endpoint, I notice dropped audio segments, high latency, and a less fluid (unnatural) experience.
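To illustrate, this is a simplified sketch of the text-stream → TTS approach I am describing (the model names, voice, sentence-level chunking, and file output are assumptions for the example, not how I think it *should* be done):

```python
from openai import OpenAI

client = OpenAI()

VECTOR_STORE_ID = "vs_example123"  # placeholder

def stream_answer(question: str):
    """Stream text deltas from the Responses API, grounded via file_search."""
    with client.responses.stream(
        model="gpt-4o-mini",  # placeholder model
        input=question,
        tools=[{"type": "file_search", "vector_store_ids": [VECTOR_STORE_ID]}],
    ) as stream:
        for event in stream:
            if event.type == "response.output_text.delta":
                yield event.delta

def speak(text: str, path: str):
    """Synthesize one text chunk with the TTS endpoint and write it to a file."""
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",  # placeholder TTS model
        voice="alloy",
        input=text,
        response_format="pcm",  # raw PCM chunks are easy to concatenate/play
    ) as tts:
        tts.stream_to_file(path)

# Naive chunking: buffer deltas until a sentence boundary, then synthesize.
buffer, n = "", 0
for delta in stream_answer("Summarize the refund policy"):
    buffer += delta
    if buffer.rstrip().endswith((".", "?", "!")):
        n += 1
        speak(buffer, f"chunk_{n}.pcm")
        buffer = ""
if buffer.strip():
    speak(buffer, f"chunk_{n + 1}.pcm")
```

With this kind of per-sentence synthesis and playback, the pauses between chunks and the occasional dropped chunk are exactly where the experience breaks down for me.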

I would like to know if anyone has been able to implement this flow: “Voice/Text input → Vector Store → Audio+Text output.”

Since this combination is not available natively, what would be the recommended architecture to achieve the smoothest possible audio output while still using OpenAI’s file search? I would also like to know, in your experience, which parameters, models, or audio formats have worked best for natural, low-latency voice playback.

I would appreciate any code examples and/or recommendations from those who have worked on similar implementations.

Thank you very much for your help!