Getting realtime to use my dataset for responses

Hi,

I’ve implemented the Realtime API with WebRTC in NextJS and now want to enhance it with vector search capabilities.

My goal:

  • User speaks a question via Realtime API
  • System generates transcript
  • Query matches keywords against my vector-embedded database (activities with category fields)
  • Return database matches as the response

Question: Is this workflow possible with the current Realtime API? How can I integrate the vector search step between transcript generation and response?

Any guidance on how to approach such a thing? Thanks!


Hello. Yes, this is possible via tool calls, i.e. standard OpenAI tool calling: getting the data from your tool and feeding it back. For example:

# Feed the tool result back into the conversation as a function_call_output item.
additional_context_data = {
    "type": "conversation.item.create",
    "item": {
        "type": "function_call_output",
        "call_id": call_id,            # call_id from the model's function-call event
        "output": json.dumps(output),  # your tool's result, serialized to a string
    },
}

await self.third_party_websocket.send(  # type: ignore
    json.dumps(additional_context_data)
)

# Then ask the model to generate a new (text + audio) response that uses the tool output.
create_response_data = {
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],
    },
}

await self.third_party_websocket.send(  # type: ignore
    json.dumps(create_response_data)
)
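
Since the question mentions WebRTC in NextJS, here is a rough browser-side sketch of the same flow over the data channel. It is only a sketch: the event and field names follow the Realtime API docs as I understand them, and the /api/vector-search route and the query parameter name are placeholders you would swap for your own tool definition and backend.

// Listen on the Realtime data channel for a completed function call,
// run the vector search via a (hypothetical) backend route, then feed the
// result back and request a spoken response.
function attachToolHandler(dataChannel: RTCDataChannel) {
  dataChannel.addEventListener("message", async (e: MessageEvent) => {
    const event = JSON.parse(e.data);

    // Sent when the model has finished emitting arguments for a function call.
    if (event.type !== "response.function_call_arguments.done") return;

    const args = JSON.parse(event.arguments);

    // Placeholder backend route that performs the vector search.
    const res = await fetch("/api/vector-search", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query: args.query }), // parameter name depends on your tool definition
    });
    const matches = await res.json();

    // Feed the tool result back into the conversation...
    dataChannel.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(matches),
      },
    }));

    // ...and ask the model to respond (text + audio) using that output.
    dataChannel.send(JSON.stringify({
      type: "response.create",
      response: { modalities: ["text", "audio"] },
    }));
  });
}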

Hi, thanks so much for your helpful answer!

You’re absolutely right: tool calling is the key to integrating custom functionality like vector search with the Realtime API. I actually came to the same conclusion after some experimentation, and it’s great to have your confirmation.

I’m currently working on optimizing the latency, which is a bit of a challenge. Right now, my responses are taking around 15 seconds to generate. I’ve found that breaking the generated response into smaller sentence chunks helps a lot, but I’m running into some issues with the API losing context and failing to retrieve new database content. I suspect it’s a caching or context management problem on my end.

I’m continuing to experiment with streaming and chunking the responses to bring the latency down to a more acceptable 1-2 seconds so the TTS starts faster. My current setup involves:

  • Realtime API with WebRTC: For capturing user speech and generating transcripts.
  • Vector Search (approx. 800ms): To query my database of activities based on the transcript.
  • Tool Calling: To trigger my backend API for vector search and response generation.
  • Chunking and Streaming: To deliver responses in smaller, faster segments.
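
In case it’s useful, this is roughly what the sentence-chunking step looks like. It’s just a plain text helper (no API calls), splitting the generated answer on sentence boundaries so the first chunk can go to TTS while the rest is still on its way:

// Split a generated answer into sentence-aligned chunks of roughly maxChars
// characters, so they can be handed off to TTS one at a time.
function splitIntoSentenceChunks(text: string, maxChars = 200): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current.length > 0 && (current + sentence).length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}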

For anyone else looking to implement a similar solution, here’s a quick summary of the approach:

  1. Use tool calling: This allows you to integrate custom functions into the Realtime API workflow.
  2. Create a backend API: This API should handle the vector search and response generation based on the transcript (see the sketch just after this list).
  3. Break down responses: Split the generated responses into smaller chunks to improve perceived latency.
  4. Stream the chunks: Send the chunks to the voice assistant as they are generated.
  5. Focus on context management: Pay close attention to how you manage the conversation context to avoid issues with the API losing track of the conversation.
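
Here is a minimal sketch of what that backend route could look like in a NextJS app-router project. The route path, the embedding model, the vectorStore client, and the metadata fields are all placeholders; wire in whichever vector DB you actually use (Pinecone, Weaviate, pgvector, ...).

// app/api/vector-search/route.ts (hypothetical path)
import OpenAI from "openai";
import { NextResponse } from "next/server";

// Placeholder for your vector store client (Pinecone, Weaviate, pgvector, ...).
declare const vectorStore: {
  query(args: { vector: number[]; topK: number }): Promise<
    Array<{ metadata?: { name?: string; category?: string } }>
  >;
};

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const { query } = await req.json();

  // Embed the transcript (embedding model name is just an example).
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // Nearest-neighbour search against the activity embeddings.
  const matches = await vectorStore.query({
    vector: embedding.data[0].embedding,
    topK: 5,
  });

  // Return compact snippets for the model to answer from.
  return NextResponse.json({
    results: matches.map((m) => ({
      activity: m.metadata?.name,
      category: m.metadata?.category,
    })),
  });
}

Keeping the returned payload small (names, categories, maybe a short description) should also help with the latency issue, since there is less tool output for the model to process before it starts responding.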

I’m still working on perfecting this process, but I hope this information is helpful for others who are exploring similar integrations. I really appreciate your answer; it was very helpful and is the correct starting point!


How are you structuring your tool to connect to the vector db? How is that response getting back to your model? @thismightbemak

For additional context, are you able to provide the tool format you’re using? e.g.

"tools": [
                    ["type": "file_search",
                    "vector_store_ids": ["<vector_store_id>"]
                        ],
                        include=["file_search_call.results"]
                
                    [
                        "type": "function",
                        "name": "vector_search",
                        "description": "Search a vector database for topic",
                        "parameters": [
                            "type": "object",
                            "strict": true,
                            "properties": [
                                "ThingIWasLookingFor": [
                                    "type": "string",
                                    "description": "The thing in the vector database I was looking for."
                                ],
                                "TheOtherThingIwasLookingFor": [
                                    "type": "string",
                                    "description": "Other thing"
                                ]
                            ],
                            "required": ["ThingIWasLookingFor", "TheOtherThingIwasLookingFor"]
                        ]
                    ]
                ]

The reason I ask is that a post from ~2 weeks ago gave me the impression this wasn’t currently supported.

Hey!

Yes, this is possible using a combination of function calling and a custom /api/voice route that performs a vector DB search based on the transcription from the realtime voice session.

The workaround I used involves defining a tool like this in the realtime session config:

{
  type: "function",
  name: "send_transcription",
  description: "Send the transcribed user input to your backend for vector search",
  parameters: {
    type: "object",
    properties: {
      transcription: { type: "string" }
    }
  }
}
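
In case it helps, this is roughly how that tool definition gets attached to the session: a sketch assuming a session.update client event sent over your Realtime connection (shown here with a WebRTC data channel, but the event shape should be the same over WebSocket).

// Register a tool definition (e.g. send_transcription above) on the live session.
function registerTool(channel: RTCDataChannel, tool: object) {
  channel.send(JSON.stringify({
    type: "session.update",
    session: {
      tools: [tool],        // the send_transcription definition from above
      tool_choice: "auto",  // let the model decide when to call it
    },
  }));
}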

When this function is triggered from the voice session, the backend uses the transcription to query a vector DB (e.g., using Pinecone, Weaviate, or a local embedding store), then generates a response based on the result.

A minimal pseudo flow:

  1. Model transcribes audio: "What’s the refund policy?"
  2. Function call is made:
{
  "function_call": {
    "name": "send_transcription",
    "arguments": "{\"transcription\": \"What's the refund policy?\"}"
  }
}

  3. Backend receives this and performs a vector search (createEmbedding and vectorStore here are placeholders for your own embedding call and vector DB client):
const embedding = await createEmbedding(transcription);
const result = await vectorStore.query(embedding);

  4. Then it returns the search result chunk(s) back through a function_call_output item, streaming them back as text/audio:
{
  type: "conversation.item.create",
  item: {
    type: "function_call_output",
    call_id: "abc123",
    output: JSON.stringify({ content: "Our refund policy allows returns within 30 days..." })
  }
}

Once complete, you can follow up with a response.create event to trigger additional voice generation if needed.

Hope that clears it up! I did my best to explain based on how I got it working, though I’m still learning a lot myself; there might be cleaner or more optimized ways to handle it that others have found 🙂