Trouble mapping realtime speech to function call text

Hey there,

Right now I’m using the Realtime API to call simple functions intelligently. I used the Twilio + OpenAI app as a template.

I’ve noticed that for functions with multiple inputs, the Realtime API often misses a digit or two of an input. The problem is that when I speak the input again, the Realtime API often stubbornly sticks to its original value.

I log conversation.item.input_audio_transcription.completed events, and it’s clear the model does hear what I’m saying, but for some reason that text doesn’t make it directly into the function call arguments.
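
For reference, the logging amounts to something like this (simplified from the Twilio + OpenAI template’s websocket loop; the event and field names are the ones from the Realtime API docs, the handler shape is only illustrative):

import json

async def handle_openai_event(message: str) -> None:
    event = json.loads(message)
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        # The transcript here is accurate, even when the function call args are not.
        print("Heard:", event.get("transcript"))
    elif event.get("type") == "response.function_call_arguments.done":
        # This is where a digit or two of the EIN ends up mangled.
        print("Function call args:", event.get("arguments"))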

Has anyone else had this same issue?

For example, here’s one of my functions registered with OpenAI:

def opt_out_sponsor(
    ein: str,
    reason: ValidReason,
    first_name: str,
    last_name: str,
    plan_name: str,
    plan_phone: str,
) -> str:
    """
    Call this if the user asks to opt out. Confirm the EIN by reading it back to the user before calling.

    Args:
        ein (str): Employer Identification Number
        reason (ValidReason): The existing plan type that the user has and therefore their reason for opting out.
        first_name (str): The first name of the main contact for the sponsor.
        last_name (str): The last name of the main contact for the sponsor.
        plan_name (str): The name of the existing plan.
        plan_phone (str): The phone number of the existing plan.

    Returns:
        str: The response to be read to the user.
    """
    # Makes some API calls and, if there's an error, returns
    # "Say this verbatim: I'm sorry but I had trouble finalizing the exemption. Please try again later."
    ...

It’s worth noting that the OpenAI Realtime API seems to excel with simple names like “Michael” and “John”, but an EIN is trickier. An EIN is usually in the form 00-0000000. To be clear, the dash isn’t the issue, since I can always filter it out with Python. The issue is that the EIN is often wrong in the function call even though the text transcription is great.
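
(Filtering the dash is trivial, something like the snippet below; normalize_ein is just an illustrative helper name. The point is that the cleanup is easy, while the digits themselves being wrong is the real problem.)

import re

def normalize_ein(raw: str) -> str | None:
    """Keep only digits and sanity-check that the EIN has exactly nine."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 9 else None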

There are a few options I would consider here.

First, though, what exactly is your function-call schema/definition? That is, the actual JSON schema you’re giving the OpenAI model?

https://platform.openai.com/docs/guides/function-calling#defining-functions
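
For context, a Realtime tool entry for a function like yours would be shaped roughly like this (trimmed to just the EIN parameter; the description text and pattern are examples of mine, not your actual schema):

{
  "type": "function",
  "name": "opt_out_sponsor",
  "description": "Opt a plan sponsor out. Always read the EIN back to the caller and confirm it before calling.",
  "parameters": {
    "type": "object",
    "properties": {
      "ein": {
        "type": "string",
        "description": "Employer Identification Number: exactly nine digits, formatted NN-NNNNNNN.",
        "pattern": "^\\d{2}-\\d{7}$"
      }
    },
    "required": ["ein"]
  }
}

A tighter description on the ein property is the first tweak I’d try; the pattern constraint is standard JSON Schema, though the model may or may not honor it strictly.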

I would first split the EIN request into its own function call, separate from everything else you’re trying to do, so we can isolate it and figure out how to tweak the tool to produce the intended output. If the text transcription is good, it may also be worth parsing that transcription yourself and feeding the result into your function directly; a rough sketch of that idea is below.
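
(The names ein_from_transcript, handle_function_call, and last_transcript here are mine; last_transcript would be whatever you saved from the most recent input_audio_transcription.completed event, and any reason/ValidReason conversion is glossed over.)

import json
import re

def ein_from_transcript(transcript: str) -> str | None:
    """Pull a nine-digit EIN out of the transcript, ignoring dashes and spaces.
    Assumes the transcription renders the EIN as digits rather than words."""
    match = re.search(r"\b(\d{2})[-\s]?(\d{7})\b", transcript)
    return f"{match.group(1)}-{match.group(2)}" if match else None

def handle_function_call(arguments_json: str, last_transcript: str) -> str:
    args = json.loads(arguments_json)
    # Prefer the EIN heard in the transcription over the model-generated argument.
    corrected = ein_from_transcript(last_transcript)
    if corrected is not None:
        args["ein"] = corrected
    return opt_out_sponsor(**args)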