500 error in request to gpt-4o-audio-* model

I am getting this same error in our chat completions based agent.

The server had an error while processing your request. Sorry about that!

Few things about our case:

  • We have a fairly large system prompt
  • stream is set to true
  • The only other message is a user role, ‘hello how are you’, sent as format wav
  • parallel_tool_calls is set to false
  • audio is set to alloy/pcm16
  • modalities is text, audio
  • We have 6 functions defined in our tools array

What I’ve tried while debugging:

  • Our non-multimodal agent works fine, with the same tool definitions, so it isn’t a parsing problem with the tools. Non-multimodal, meaning, prior to the recent updates to chat completions, I hooked up a pipeline that sent the audio to get transcribed, then to the old chat completions, then took the response to TTS, and played back the audio. That pipeline works fine, with all the tool definitions.
  • If I remove three of the tools, the call completes correctly
  • Seeing this, I thought it might be a total content length issue, but if I remove our system prompt, leaving the tool definitions, it still fails (at a smaller content length than when I remove three of the tools). This leads me to believe it isn’t a total content length issue, but rather something specific with the tools.
  • I tried removing various combinations of tools, and it doesn’t seem related to a specific tool, but rather the total size of tool definitions.

Perhaps when used in a multimodal fashion, the total amount of space you can use for tool definitions is smaller? We did run into a few issues during development of the non-multimodal agent when we made the description field in a tool too long, for example. Maybe this error is some variation of that?