Hi,
I’ve implemented the Realtime API with WebRTC in Next.js and now want to enhance it with vector search capabilities.
My goal:
- User speaks a question via Realtime API
- System generates transcript
- The query's keywords are matched against my vector-embedded database (activities with category fields)
- Return database matches as the response
Question: Is this workflow possible with the current Realtime API? How can I integrate the vector search step between transcript generation and response?
Any guidance on how to approach such a thing? Thanks!
Hello. Yes, this is possible via tool calls, i.e. standard OpenAI tool calling: getting the data from your tool and feeding it back. For example (Python, over the WebSocket transport):
# Feed the tool result back into the conversation as a function_call_output item...
additional_context_data = {
    "type": "conversation.item.create",
    "item": {
        "type": "function_call_output",
        "call_id": call_id,
        "output": json.dumps(output),
    },
}
await self.third_party_websocket.send(  # type: ignore
    json.dumps(additional_context_data)
)

# ...then ask the model to generate a new response using that context.
create_response_data = {
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],
    },
}
await self.third_party_websocket.send(  # type: ignore
    json.dumps(create_response_data)
)
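For context, the snippet above only covers feeding the result back; the model will only emit the function call in the first place if the tool has been declared, and one place to do that is the session config. A minimal sketch in TypeScript, assuming `dc` is the already-open Realtime WebRTC data channel (the tool name and schema here are placeholders, not from the snippet above):

// Sketch: declare a vector-search tool on the session so the model can call it.
// `dc` is assumed to be the open RTCDataChannel of the Realtime session.
function declareVectorSearchTool(dc: RTCDataChannel) {
  dc.send(
    JSON.stringify({
      type: "session.update",
      session: {
        tools: [
          {
            type: "function",
            name: "vector_search", // placeholder name
            description: "Search the activities database for entries matching the user's request",
            parameters: {
              type: "object",
              properties: {
                query: { type: "string", description: "The user's question, verbatim" },
              },
              required: ["query"],
            },
          },
        ],
        tool_choice: "auto",
      },
    })
  );
}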
Hi, thanks so much for your helpful answer!
You’re absolutely right, tool calling is the key to integrating custom functionality like vector search with the Realtime API. I actually came to the same conclusion after some experimentation, and it’s great to have your confirmation.
I’m currently working on optimizing the latency, which is a bit of a challenge. Right now, my responses are taking around 15 seconds to generate. I’ve found that breaking the generated response into smaller sentence chunks helps a lot, but I’m running into some issues with the API losing context and failing to retrieve new database content. I suspect it’s a caching or context management problem on my end.
I’m continuing to experiment with streaming and chunking the responses to bring the latency down to a more acceptable 1-2 seconds so the TTS starts faster. My current setup involves:
- Realtime API with WebRTC: For capturing user speech and generating transcripts.
- Vector Search (approx. 800ms): To query my database of activities based on the transcript.
- Tool Calling: To trigger my backend API for vector search and response generation (see the sketch after this list).
- Chunking and Streaming: To deliver responses in smaller, faster segments.
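To make the tool-calling step concrete, here is a rough TypeScript sketch of my client side; `dc` is assumed to be the Realtime WebRTC data channel, `/api/vector-search` is just a placeholder route name, and the event handled is the Realtime server event for completed function-call arguments:

// Sketch: handle the model's function call on the WebRTC data channel,
// call a (placeholder) backend route for the vector search, and feed the
// result back so the model can answer with the retrieved data.
dc.addEventListener("message", async (e: MessageEvent) => {
  const event = JSON.parse(e.data);

  // Fired when the model has finished emitting a function call's arguments.
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments); // e.g. { query: "..." }

    // Placeholder backend route that performs the vector search.
    const res = await fetch("/api/vector-search", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(args),
    });
    const matches = await res.json();

    // Return the tool output to the conversation...
    dc.send(
      JSON.stringify({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: event.call_id,
          output: JSON.stringify(matches),
        },
      })
    );

    // ...and ask for a spoken response based on it.
    dc.send(
      JSON.stringify({
        type: "response.create",
        response: { modalities: ["text", "audio"] },
      })
    );
  }
});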
For anyone else looking to implement a similar solution, here’s a quick summary of the approach:
- Use tool calling: This allows you to integrate custom functions into the Realtime API workflow.
- Create a backend API: This API should handle the vector search and response generation based on the transcript.
- Break down responses: Split the generated responses into smaller chunks to improve perceived latency.
- Stream the chunks: Send the chunks to the voice assistant as they are generated (rough sketch after this list).
- Focus on context management: Pay close attention to how you manage the conversation context to avoid issues with the API losing track of the conversation.
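And a rough sketch of the chunking/streaming idea on the backend; this is simplified, and `generateAnswer` is a placeholder for however you produce the answer text from the vector-search results:

// Rough sketch of a Next.js route handler that streams sentence-sized chunks.
// `generateAnswer` is a placeholder async iterable of text deltas.
import { generateAnswer } from "@/lib/answer"; // hypothetical helper

export async function POST(req: Request) {
  const { transcription } = await req.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      let buffer = "";

      for await (const delta of generateAnswer(transcription)) {
        buffer += delta;

        // Flush whenever we have at least one complete sentence.
        const parts = buffer.split(/(?<=[.!?])\s+/);
        buffer = parts.pop() ?? "";
        for (const sentence of parts) {
          controller.enqueue(encoder.encode(sentence + "\n"));
        }
      }

      if (buffer.trim()) controller.enqueue(encoder.encode(buffer));
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}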
I’m still working on perfecting this process, but I hope this information is helpful for others who are exploring similar integrations. I really appreciate your answer, it was very helpful and is the correct starting point!
How are you structuring your tool to connect to the vector db? How is that response getting back to your model? @thismightbemak
For additional context, are you able to provide the tool format you're using? e.g.:
"tools": [
  {
    "type": "file_search",
    "vector_store_ids": ["<vector_store_id>"]
  },
  {
    "type": "function",
    "name": "vector_search",
    "description": "Search a vector database for topic",
    "parameters": {
      "type": "object",
      "strict": true,
      "properties": {
        "ThingIWasLookingFor": {
          "type": "string",
          "description": "The thing in the vector database I was looking for."
        },
        "TheOtherThingIwasLookingFor": {
          "type": "string",
          "description": "Other thing"
        }
      },
      "required": ["ThingIWasLookingFor", "TheOtherThingIwasLookingFor"]
    }
  }
],
include=["file_search_call.results"]
The reason I ask is that I saw a post from ~2 weeks ago that gave me the impression this wasn't currently supported.
Hey!
Yes, this is possible using a combination of function calling and a custom /api/voice route that performs a vector DB search based on the transcription from the realtime voice session.
The workaround I used involves defining a tool like this in the realtime session config:
{
  type: "function",
  name: "send_transcription",
  description: "Send the transcribed user input to your backend for vector search",
  parameters: {
    type: "object",
    properties: {
      transcription: { type: "string" }
    }
  }
}
When this function is triggered from the voice session, the backend uses the transcription to query a vector DB (e.g., using Pinecone, Weaviate, or a local embedding store), then generates a response based on the result.
A minimal pseudo flow:
- Model transcribes audio → "What's the refund policy?"
- Function call is made:
{
  "function_call": {
    "name": "send_transcription",
    "arguments": "{\"transcription\": \"What's the refund policy?\"}"
  }
}
- The backend receives this and performs a vector search (a fuller sketch is at the end of this post):
const embedding = await createEmbedding(transcription);
const result = await vectorStore.query(embedding);
- Then it returns the search result chunk(s) back through the function_call_output, streaming them back as text/audio:
{
  type: "conversation.item.create",
  item: {
    type: "function_call_output",
    call_id: "abc123",
    output: JSON.stringify({ content: "Our refund policy allows returns within 30 days..." })
  }
}
Once complete, you can follow up with a response.create event to trigger additional voice generation if needed.
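If it helps, here is a slightly fuller sketch of that backend step. The embeddings call uses the standard openai Node SDK, while `vectorStore.query` and the route path are placeholders for whatever client and layout you use (Pinecone, Weaviate, a local store, etc.):

// app/api/voice/route.ts (sketch of the vector-search step)
import OpenAI from "openai";
import { vectorStore } from "@/lib/vector-store"; // hypothetical client

const openai = new OpenAI();

export async function POST(req: Request) {
  const { transcription } = await req.json();

  // Embed the transcribed question...
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: transcription,
  });

  // ...and find the closest chunks in the vector DB (placeholder API).
  const result = await vectorStore.query(emb.data[0].embedding, { topK: 3 });

  return Response.json({ content: result });
}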
Hope that clears it up! I did my best to explain based on how I got it working, though I'm still learning a lot myself; there might be cleaner or more optimized ways to handle it that others have found.