Getting realtime to use my dataset for responses

Hi,

I’ve implemented the Realtime API with WebRTC in NextJS and now want to enhance it with vector search capabilities.

My goal:

  • User speaks a question via Realtime API
  • System generates transcript
  • Query matches keywords against my vector-embedded database (activities with category fields)
  • Return database matches as the response

Question: Is this workflow possible with the current Realtime API? How can I integrate the vector search step between transcript generation and response?

Any guidance on how to approach such a thing? Thanks!


Hello. Yes, this is possible via tool calls, i.e. standard OpenAI tool calling: getting the data from your tool and feeding it back. For example:

# Feed the tool result back into the conversation as a function_call_output item.
additional_context_data = {
    "type": "conversation.item.create",
    "item": {
        "type": "function_call_output",
        "call_id": call_id,            # call_id from the model's function-call event
        "output": json.dumps(output),  # your tool's result, serialized to a string
    },
}

await self.third_party_websocket.send(  # type: ignore
    json.dumps(additional_context_data)
)

# Then ask the model to generate a new (text + audio) response that uses the tool output.
create_response_data = {
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],
    },
}

await self.third_party_websocket.send(  # type: ignore
    json.dumps(create_response_data)
)
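
Since the question mentions WebRTC in NextJS, here is a rough browser-side sketch of the same flow over the data channel. It is only a sketch: the event and field names follow the Realtime API docs as I understand them, and the /api/vector-search route and the query parameter name are placeholders you would swap for your own tool definition and backend.

// Listen on the Realtime data channel for a completed function call,
// run the vector search via a (hypothetical) backend route, then feed the
// result back and request a spoken response.
function attachToolHandler(dataChannel: RTCDataChannel) {
  dataChannel.addEventListener("message", async (e: MessageEvent) => {
    const event = JSON.parse(e.data);

    // Sent when the model has finished emitting arguments for a function call.
    if (event.type !== "response.function_call_arguments.done") return;

    const args = JSON.parse(event.arguments);

    // Placeholder backend route that performs the vector search.
    const res = await fetch("/api/vector-search", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query: args.query }), // parameter name depends on your tool definition
    });
    const matches = await res.json();

    // Feed the tool result back into the conversation...
    dataChannel.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(matches),
      },
    }));

    // ...and ask the model to respond (text + audio) using that output.
    dataChannel.send(JSON.stringify({
      type: "response.create",
      response: { modalities: ["text", "audio"] },
    }));
  });
}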

Hi, thanks so much for your helpful answer!

You’re absolutely right: tool calling is the key to integrating custom functionality like vector search with the Realtime API. I actually came to the same conclusion after some experimentation, and it’s great to have your confirmation.

I’m currently working on optimizing the latency, which is a bit of a challenge. Right now, my responses are taking around 15 seconds to generate. I’ve found that breaking the generated response into smaller sentence chunks helps a lot, but I’m running into some issues with the API losing context and failing to retrieve new database content. I suspect it’s a caching or context management problem on my end.

I’m continuing to experiment with streaming and chunking the responses to bring the latency down to a more acceptable 1-2 seconds so the TTS starts faster. My current setup involves:

  • Realtime API with WebRTC: For capturing user speech and generating transcripts.
  • Vector Search (approx. 800ms): To query my database of activities based on the transcript.
  • Tool Calling: To trigger my backend API for vector search and response generation.
  • Chunking and Streaming: To deliver responses in smaller, faster segments.
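
In case it’s useful, this is roughly what the sentence-chunking step looks like. It’s just a plain text helper (no API calls), splitting the generated answer on sentence boundaries so the first chunk can go to TTS while the rest is still on its way:

// Split a generated answer into sentence-aligned chunks of roughly maxChars
// characters, so they can be handed off to TTS one at a time.
function splitIntoSentenceChunks(text: string, maxChars = 200): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current.length > 0 && (current + sentence).length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}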

For anyone else looking to implement a similar solution, here’s a quick summary of the approach:

  1. Use tool calling: This allows you to integrate custom functions into the Realtime API workflow.
  2. Create a backend API: This API should handle the vector search and response generation based on the transcript (see the sketch just after this list).
  3. Break down responses: Split the generated responses into smaller chunks to improve perceived latency.
  4. Stream the chunks: Send the chunks to the voice assistant as they are generated.
  5. Focus on context management: Pay close attention to how you manage the conversation context to avoid issues with the API losing track of the conversation.
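
Here is a minimal sketch of what that backend route could look like in a NextJS app-router project. The route path, the embedding model, the vectorStore client, and the metadata fields are all placeholders; wire in whichever vector DB you actually use (Pinecone, Weaviate, pgvector, ...).

// app/api/vector-search/route.ts (hypothetical path)
import OpenAI from "openai";
import { NextResponse } from "next/server";

// Placeholder for your vector store client (Pinecone, Weaviate, pgvector, ...).
declare const vectorStore: {
  query(args: { vector: number[]; topK: number }): Promise<
    Array<{ metadata?: { name?: string; category?: string } }>
  >;
};

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const { query } = await req.json();

  // Embed the transcript (embedding model name is just an example).
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // Nearest-neighbour search against the activity embeddings.
  const matches = await vectorStore.query({
    vector: embedding.data[0].embedding,
    topK: 5,
  });

  // Return compact snippets for the model to answer from.
  return NextResponse.json({
    results: matches.map((m) => ({
      activity: m.metadata?.name,
      category: m.metadata?.category,
    })),
  });
}

Keeping the returned payload small (names, categories, maybe a short description) should also help with the latency issue, since there is less tool output for the model to process before it starts responding.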

I’m still working on perfecting this process, but I hope this information is helpful for others who are exploring similar integrations. I really appreciate your answer; it was very helpful and is the correct starting point!


How are you structuring your tool to connect to the vector db? How is that response getting back to your model? @thismightbemak

For additional context, are you able to provide the tool format you’re using? e.g.

"tools": [
                    ["type": "file_search",
                    "vector_store_ids": ["<vector_store_id>"]
                        ],
                        include=["file_search_call.results"]
                
                    [
                        "type": "function",
                        "name": "vector_search",
                        "description": "Search a vector database for topic",
                        "parameters": [
                            "type": "object",
                            "strict": true,
                            "properties": [
                                "ThingIWasLookingFor": [
                                    "type": "string",
                                    "description": "The thing in the vector database I was looking for."
                                ],
                                "TheOtherThingIwasLookingFor": [
                                    "type": "string",
                                    "description": "Other thing"
                                ]
                            ],
                            "required": ["ThingIWasLookingFor", "TheOtherThingIwasLookingFor"]
                        ]
                    ]
                ]

The reason I ask is that a post from ~2 weeks ago gave me the impression this wasn’t currently supported.

Hey!

Yes, this is possible using a combination of function calling and a custom /api/voice route that performs a vector DB search based on the transcription from the realtime voice session.

The workaround I used involves defining a tool like this in the realtime session config:

{
  type: "function",
  name: "send_transcription",
  description: "Send the transcribed user input to your backend for vector search",
  parameters: {
    type: "object",
    properties: {
      transcription: { type: "string" }
    }
  }
}
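
In case it helps, this is roughly how that tool definition gets attached to the session: a sketch assuming a session.update client event sent over your Realtime connection (shown here with a WebRTC data channel, but the event shape should be the same over WebSocket).

// Register a tool definition (e.g. send_transcription above) on the live session.
function registerTool(channel: RTCDataChannel, tool: object) {
  channel.send(JSON.stringify({
    type: "session.update",
    session: {
      tools: [tool],        // the send_transcription definition from above
      tool_choice: "auto",  // let the model decide when to call it
    },
  }));
}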

When this function is triggered from the voice session, the backend uses the transcription to query a vector DB (e.g., using Pinecone, Weaviate, or a local embedding store), then generates a response based on the result.

A minimal pseudo flow:

  1. Model transcribes audio: "What’s the refund policy?"
  2. Function call is made:
{
  "function_call": {
    "name": "send_transcription",
    "arguments": "{\"transcription\": \"What's the refund policy?\"}"
  }
}

  3. Backend receives this and performs a vector search (createEmbedding and vectorStore here are placeholders for your own embedding call and vector DB client):
const embedding = await createEmbedding(transcription);
const result = await vectorStore.query(embedding);

  4. Then it returns the search result chunk(s) back through a function_call_output item, streaming them back as text/audio:
{
  type: "conversation.item.create",
  item: {
    type: "function_call_output",
    call_id: "abc123",
    output: JSON.stringify({ content: "Our refund policy allows returns within 30 days..." })
  }
}

Once complete, you can follow up with a response.create event to trigger additional voice generation if needed.

Hope that clears it up! I did my best to explain based on how I got it working, though I’m still learning a lot myself; there might be cleaner or more optimized ways to handle it that others have found 🙂