Getting realtime to use my dataset for responses

Hi,

I’ve implemented the Realtime API with WebRTC in NextJS and now want to enhance it with vector search capabilities.

My goal:

  • User speaks a question via Realtime API
  • System generates transcript
  • The query’s keywords are matched against my vector-embedded database (activities with category fields)
  • Return database matches as the response

Question: Is this workflow possible with the current Realtime API? How can I integrate the vector search step between transcript generation and response?

Any guidance on how to approach such a thing? Thanks!

Hello. Yes, this is possible via tool calls, i.e. standard OpenAI tool calling: the model calls your tool, you fetch the data, and you feed the result back. For example:

# Feed the tool's result back into the Realtime session as a function_call_output item
additional_context_data = {
    "type": "conversation.item.create",
    "item": {
        "type": "function_call_output",
        "call_id": call_id,
        "output": json.dumps(output),
    },
}

await self.third_party_websocket.send(  # type: ignore
    json.dumps(additional_context_data)
)

# Then ask the model to generate a new (text + audio) response that uses the tool output
create_response_data = {
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],
    },
}

await self.third_party_websocket.send(  # type: ignore
    json.dumps(create_response_data)
)
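
Since the original question uses WebRTC in NextJS rather than a server-side websocket, the same flow on the browser side starts with catching the model's tool call on the data channel. A rough sketch, assuming the Realtime API's response.function_call_arguments.done event; dc and handleFunctionCall are just placeholder names for illustration, not part of the answer above:

// Hypothetical sketch: listen on the WebRTC data channel for the model's tool call
dc.addEventListener("message", async (e: MessageEvent) => {
  const event = JSON.parse(e.data);

  // A completed tool call arrives with its call_id and JSON-encoded arguments
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    await handleFunctionCall(event.call_id, args); // your own handler, e.g. the vector search call
  }
});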

Hi, thanks so much for your helpful answer!

You’re absolutely right, tool calling is the key to integrating custom functionality like vector search with the Realtime API. I actually came to the same conclusion after some experimentation, and it’s great to have your confirmation.

I’m currently working on optimizing the latency, which is a bit of a challenge. Right now, my responses are taking around 15 seconds to generate. I’ve found that breaking the generated response into smaller sentence chunks helps a lot, but I’m running into some issues with the API losing context and failing to retrieve new database content. I suspect it’s a caching or context management problem on my end.

I’m continuing to experiment with streaming and chunking the responses to bring the latency down to a more acceptable 1–2 seconds so the TTS can start sooner. My current setup involves:

  • Realtime API with WebRTC: For capturing user speech and generating transcripts.
  • Vector Search (approx. 800ms): To query my database of activities based on the transcript.
  • Tool Calling: To trigger my backend API for vector search and response generation.
  • Chunking and Streaming: To deliver responses in smaller, faster segments.

For anyone else looking to implement a similar solution, here’s a quick summary of the approach:

  1. Use tool calling: This allows you to integrate custom functions into the Realtime API workflow.
  2. Create a backend API: This API should handle the vector search and response generation based on the transcript (a minimal sketch follows this list).
  3. Break down responses: Split the generated responses into smaller chunks to improve perceived latency.
  4. Stream the chunks: Send the chunks to the voice assistant as they are generated.
  5. Focus on context management: Pay close attention to how you manage the conversation context to avoid issues with the API losing track of the conversation.
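
To make step 2 concrete, here is a minimal sketch of what such a backend route could look like in a Next.js app. It assumes the openai Node SDK for embeddings and a placeholder vectorStore client standing in for whatever vector DB you use (Pinecone, Weaviate, pgvector, etc.); the route path and helper names are illustrative, not taken from the posts above:

// app/api/voice/route.ts: hypothetical Next.js route handler for the vector search step
import OpenAI from "openai";
import { vectorStore } from "@/lib/vector-store"; // placeholder for your own vector DB client

const openai = new OpenAI();

export async function POST(req: Request) {
  const { transcription } = await req.json();

  // Embed the transcribed question
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: transcription,
  });
  const embedding = embeddingResponse.data[0].embedding;

  // Query the vector DB for the closest activity records (placeholder query API)
  const matches = await vectorStore.query({ vector: embedding, topK: 5 });

  // Return the matched content so the client can feed it back as function_call_output
  return Response.json({ content: matches.map((m: { text: string }) => m.text).join("\n") });
}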

I’m still working on perfecting this process, but I hope this information is helpful for others who are exploring similar integrations. I really appreciate your answer, it was very helpful and is the correct starting point!

How are you structuring your tool to connect to the vector db? How is that response getting back to your model? @thismightbemak

For additional context, are you able to provide the tool format you used? e.g.:

"tools": [
                    ["type": "file_search",
                    "vector_store_ids": ["<vector_store_id>"]
                        ],
                        include=["file_search_call.results"]
                
                    [
                        "type": "function",
                        "name": "vector_search",
                        "description": "Search a vector database for topic",
                        "parameters": [
                            "type": "object",
                            "strict": true,
                            "properties": [
                                "ThingIWasLookingFor": [
                                    "type": "string",
                                    "description": "The thing in the vector database I was looking for."
                                ],
                                "TheOtherThingIwasLookingFor": [
                                    "type": "string",
                                    "description": "Other thing"
                                ]
                            ],
                            "required": ["ThingIWasLookingFor", "TheOtherThingIwasLookingFor"]
                        ]
                    ]
                ]

The reason I ask is that I saw a post from ~2 weeks ago that gave me the impression this wasn’t currently supported.

Hey!

Yes, this is possible using a combination of function calling and a custom /api/voice route that performs a vector DB search based on the transcription from the realtime voice session.

The workaround I used involves defining a tool like this in the realtime session config:

{
  type: "function",
  name: "send_transcription",
  description: "Send the transcribed user input to your backend for vector search",
  parameters: {
    type: "object",
    properties: {
      transcription: { type: "string" }
    }
  }
}
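
For what it's worth, registering a tool like this from the client can be done with a session.update event. A rough sketch over the WebRTC data channel, assuming dc is the data channel from your session setup:

// Hypothetical sketch: register the tool with the Realtime session over the data channel
dc.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [
      {
        type: "function",
        name: "send_transcription",
        description: "Send the transcribed user input to your backend for vector search",
        parameters: {
          type: "object",
          properties: {
            transcription: { type: "string" },
          },
          required: ["transcription"],
        },
      },
    ],
    tool_choice: "auto", // let the model decide when to call it
  },
}));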

When this function is triggered from the voice session, the backend uses the transcription to query a vector DB (e.g., using pinecone, weaviate, or a local embedding store), then generates a response based on the result.

A minimal pseudo flow:

  1. Model transcribes the audio: "What's the refund policy?"
  2. Function call is made:
{
  "function_call": {
    "name": "send_transcription",
    "arguments": "{\"transcription\": \"What's the refund policy?\"}"
  }
}

  3. Backend receives this, performs a vector search:
const embedding = await createEmbedding(transcription); // embed the transcribed question
const result = await vectorStore.query(embedding);      // nearest-neighbour lookup in the vector DB

  4. Then it returns the search result chunk(s) through a function_call_output item, streaming them back as text/audio:
{
  type: "conversation.item.create",
  item: {
    type: "function_call_output",
    call_id: "abc123",
    output: JSON.stringify({ content: "Our refund policy allows returns within 30 days..." })
  }
}

Once complete, you can follow up with a response.create event to trigger additional voice generation if needed.
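
Putting those last two steps together on the client, the round trip might look roughly like this; it's a sketch under the same assumptions as above (dc is the data channel, /api/voice is the backend route mentioned earlier), not a definitive implementation:

// Send the backend's search result back to the session, then ask the model to respond with it
async function returnToolResult(dc: RTCDataChannel, callId: string, transcription: string) {
  // Ask the backend route for the vector search result
  const res = await fetch("/api/voice", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ transcription }),
  });
  const { content } = await res.json();

  // Feed the result back as the output of the pending function call
  dc.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId,
      output: JSON.stringify({ content }),
    },
  }));

  // Follow up with response.create so the model generates a spoken answer from the tool output
  dc.send(JSON.stringify({ type: "response.create" }));
}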

Hope that clears it up! I did my best to explain based on how I got it working, though I’m still learning a lot myself, and there might be cleaner or more optimized ways to handle it that others have found 🙂

This was really helpful, thank you!