Can I use audio transcriptions in the prompt as LLM context for calling the appropriate function with the Realtime API's function calling?

Hello.

I am making a conversational bot with a dedicated call flow which relies only on the ‘text’ modality, i.e. I use OpenAI’s Realtime API only for function calling, in order to minimize cost.

Each step in the call flow has a dedicated function which plays prerecorded, interruptible audio clips to the user.
The user’s audio flows through my Asterisk ARI application and then goes to OpenAI’s WebSocket. The audio is transcribed and triggers the appropriate function, which plays the relevant recording.
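
To make the setup concrete, the session configuration is along these lines (a simplified Python sketch; the `play_hospital_list` function, the instructions wording, and the exact parameters are placeholders rather than my actual flow):

```python
import json

async def configure_session(ws):
    """Configure the Realtime session for text-only output with one tool.

    `ws` is assumed to be an already-open WebSocket to the Realtime API
    (e.g. from the `websockets` package). Names below are placeholders.
    """
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "modalities": ["text"],  # text only: no audio generated by the model
            "instructions": (
                "Route the caller through the insurance call flow "
                "by calling exactly one function per turn."
            ),
            "tools": [{
                "type": "function",
                "name": "play_hospital_list",  # placeholder step function
                "description": "Play the prerecorded list of covered hospitals.",
                "parameters": {"type": "object", "properties": {}, "required": []},
            }],
            "tool_choice": "auto",
        },
    }))
```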

The problem I’m currently having is that the LLM has no context of the conversation, so how can it call the appropriate function when interrupted?
It does work, but due to the nature of the setup, the request needs to carry its own context to trigger the function: e.g. “Which hospitals are applicable in this insurance?” works, but “Which hospitals?” doesn’t. I understand why, since the LLM has no context of the conversation apart from the user’s audio.

Is there any way of inserting context into the LLM about the audio functions being used, so the LLM has something to go on?

Maybe via the prompt? Just a thought.

I’m not sure I fully understand your use case. When you say “only text mode” together with the OpenAI Realtime API, it seems contradictory, since my understanding is that the Realtime API is primarily a speech API.

Having said that, to control the context seen by the Realtime API, have you looked at conversation.item.create?

https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/create

That should be able to give you fine-grained control over the context seen by the model.
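
For example, each time you play one of your recordings, you could create an assistant message item summarizing what the caller just heard, so a follow-up like “Which hospitals?” has something to resolve against. A minimal sketch, assuming a raw WebSocket connection and a helper name of my own invention:

```python
import json

async def record_played_audio(ws, summary: str):
    """Add an assistant message item describing the recording just played.

    Assistant message items created by the client use content type "text".
    """
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": summary}],
        },
    }))
```

You would call it right after Asterisk starts playing a recording, passing a one-line summary of what that recording says.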

Another approach that might work would be to send session.update messages and modify the instructions, but I think the conversation.item.create approach would be more standard.
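
A rough sketch of that alternative, folding the current call-flow state into the instructions before each model turn (the field wording here is only illustrative):

```python
import json

async def refresh_instructions(ws, current_step: str, last_recording: str):
    """Rewrite the session instructions with the current call-flow state."""
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "You are routing an insurance call.\n"
                f"Current step: {current_step}\n"
                f"Last recording played to the caller: {last_recording}\n"
                "Call the function that matches the caller's next request."
            ),
        },
    }))
```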

BTW I’m assuming you’re using WebRTC? It would help if you could provide the broad strokes of how you are interacting with the OpenAI Realtime API, and what other OpenAI APIs you’re interacting with as well.

I’m sorry, I should’ve been more transparent about my use case.

Text-only mode means using only the text modality; the Realtime API offers both text and audio, and normally you would provide both.

But in my particular case, I’m only using text. The user’s audio flows through my Asterisk ARI application and then goes to OpenAI’s WebSocket. The audio is transcribed and triggers the appropriate function, which plays the relevant recording.

Hopefully, I was clear. I apologize for any confusion, English is not my first language.

OK, my mistake, sorry; I have only used the Realtime API in voice-to-voice mode. I see that it supports text as well.

No worries, I wasn’t descriptive enough. I have updated my explanation.

In your setup the model effectively has no memory of the conversation: the prerecorded answers you play back never pass through the API, so it can’t know what “Which hospitals?” refers to unless you give it that context on every turn. You need to store the conversation state yourself (current step, last question, expected intents, etc.) and send it as part of the system message or as a small state object with each request.
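
One possible shape for that state, kept in your ARI application and injected before each response.create (the field names are illustrative, not required by the API):

```python
import json
from dataclasses import dataclass, field

@dataclass
class CallState:
    """Per-call state kept in the ARI application."""
    current_step: str = "greeting"
    last_question: str = ""
    last_recording: str = ""
    expected_intents: list[str] = field(default_factory=list)

    def as_context(self) -> str:
        return (
            f"Call-flow step: {self.current_step}. "
            f"Caller's last question: {self.last_question or 'none'}. "
            f"Last recording played: {self.last_recording or 'none'}. "
            f"Likely next intents: {', '.join(self.expected_intents) or 'any'}."
        )

async def send_turn(ws, state: CallState):
    # Inject the state as a system message item, then ask the model to respond.
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "system",
            "content": [{"type": "input_text", "text": state.as_context()}],
        },
    }))
    await ws.send(json.dumps({"type": "response.create"}))
```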

Thank you for your response.

By system message, you are referring to the system prompt, right? The one in the instructions parameter we use?

Can you kindly explain any method I can use to store the conversation state?