Fine-Tuning OpenAI's Real-Time API for Native Speech-to-Speech Audio Generation

Hi there! I’m developing an application that uses OpenAI’s models for speech-to-speech consultations. I have a question about the Realtime API: is it possible to fine-tune the model behind it on our own data so that it generates native audio in a speech-to-speech format, similar to OpenAI’s latest audio features? Is that kind of customization with our own data available? Thanks for any insights!
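
For reference, this is how I’m opening a Realtime session today; the model is pinned in the connection URL, and I don’t see any documented way to pass a fine-tuned model ID there. (A minimal sketch, assuming the `websockets` Python package, v13+, where the header keyword is `additional_headers`.)

```python
import asyncio
import json
import os

import websockets


async def main() -> None:
    # The model is selected via the query string when the socket is opened;
    # there is no documented parameter for a fine-tuned model ID here.
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, additional_headers=headers) as ws:
        # The server opens every session with a session.created event.
        event = json.loads(await ws.recv())
        print(event["type"])  # expected: "session.created"


asyncio.run(main())
```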


Same question. Is there some way to use fine-tuned models with the Realtime API?


+1 – Is fine-tuning the Realtime model currently possible, or on the near-term product roadmap?

Would appreciate a reply on this, because my app relies on fine-tuning (text to text/audio) and I can’t migrate to the Realtime API without it.

I mainly want to train the LLM to respond in a certain way (logic, function calling); I’m not looking for speech-to-speech tuning. Something like the sketch below is all I need.
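
To make that concrete, here is a minimal sketch of the kind of text-only fine-tune I mean, using the standard OpenAI Python SDK and a model that is on the supported list. The file name and its JSONL contents are placeholders for your own chat-format examples.

```python
# A sketch of a text-only fine-tune via the standard OpenAI Python SDK.
# "function_calling_examples.jsonl" is a placeholder for your own data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training examples (one chat-format JSON object per line).
training_file = client.files.create(
    file=open("function_calling_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off the job on a fine-tunable text model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```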

Thanks!


Hey! 🙂

I’m assuming it’s not possible, since the supported models for fine-tuning, according to this OpenAI page, are:

  • gpt-4o-2024-08-06
  • gpt-4o-mini-2024-07-18
  • gpt-4-0613
  • gpt-3.5-turbo-0125
  • gpt-3.5-turbo-1106
  • gpt-3.5-turbo-0613

Since neither gpt-4o-realtime-preview nor gpt-4o-mini-realtime-preview, the models supported by the Realtime API, appears on that list, I assume they can’t be fine-tuned. You can also confirm that against the API itself; see the sketch below.
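
A quick empirical check (a sketch, assuming the standard OpenAI Python SDK): asking the fine-tuning endpoint for a realtime model should simply be rejected with an invalid-model error. `file-abc123` below is a placeholder for an already-uploaded training file ID.

```python
# Probe the fine-tuning endpoint with a realtime model; expect a rejection.
from openai import BadRequestError, OpenAI

client = OpenAI()

try:
    client.fine_tuning.jobs.create(
        training_file="file-abc123",  # placeholder file ID
        model="gpt-4o-realtime-preview",
    )
except BadRequestError as err:
    # Expected: the endpoint rejects models that aren't fine-tunable.
    print(err)
```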