GPTs with Custom Actions by Whisper API and TTS

I want to create a GPT with speech-to-text via Whisper and text-to-speech via OpenAI TTS, using custom actions.

But I totally don’t understand how to do it. Can you help me?


If I understood correctly, you want to create a custom GPT that works with Whisper V3 for STT, plus TTS and custom functions, right?

If you want to use it as a custom GPT on an endpoint and use it in other applications, you could check the Assistants documentation page. It provides a step-by-step guide on how to implement some of the things you mentioned, including the custom functions that you can build.

For TTS and STT, check here for the just-released update to the Python and Node packages, which includes both features!

Text-to-speech doc: really straightforward and easy to implement

Speech-to-text (not Whisper v3 yet)
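
To make the text-to-speech side concrete, here is a minimal sketch that builds the HTTP request by hand with only the Python standard library, rather than through the packages mentioned above. The helper names (`build_speech_request`, `synthesize`), the default voice, and the output file name are my own choices for illustration, not something from the docs.

```python
import json
import urllib.request

SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, api_key, voice="alloy", model="tts-1"):
    """Build (but do not send) a POST request for the text-to-speech endpoint."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, api_key, out_path="speech.mp3"):
    """Send the request and save the returned audio bytes to out_path."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
    return out_path
```

Calling `synthesize("Hello there", api_key)` needs a valid API key and network access; `build_speech_request` on its own is handy for checking what would be sent.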


In case this helps anyone get any closer to figuring it out, this OpenAPI spec for Whisper speech-to-text works when I use SwaggerEditor to test it, but when I try it on my transcriptionGPT Custom Action I get an error that the model parameter hasn’t been provided:

openapi: 3.0.1
info:
  title: OpenAI Audio Transcription API
  description: API for transcribing audio into text using OpenAI's models.
  version: "v1"
servers:
  - url:
paths:
  /audio/transcriptions:
    post:
      operationId: createTranscription
      summary: Create a transcription for an audio file
      requestBody:
        required: true
        content:
          multipart/form-data:
            schema:
              $ref: "#/components/schemas/createTranscriptionRequest"
      responses:
        "200":
          description: OK
          content:
            application/json:
              schema:
                type: object
                properties:
                  text:
                    type: string
                    description: The transcribed text.

components:
  schemas:
    createTranscriptionRequest:
      type: object
      required:
        - file
        - model
      properties:
        file:
          type: string
          format: binary
          description: The audio file object to transcribe.
        model:
          type: string
          enum: ["whisper-1"]
          default: "whisper-1"
          description: ID of the model to use. Defaults to 'whisper-1'.
        response_format:
          type: string
          enum: ["json", "text", "srt", "verbose_json", "vtt"]
          default: "json"
          description: The format of the transcript output.
        language:
          type: string
          description: The language of the input audio (optional).
        prompt:
          type: string
          description: An optional text to guide the model's style (optional).
        temperature:
          type: number
          description: The sampling temperature (optional).

  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT

security:
  - BearerAuth: []

Any help would be greatly appreciated :slight_smile:
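
For comparison, here is a rough sketch of what a direct call to the transcription endpoint looks like outside of a GPT: `model` travels as an ordinary multipart form field next to the binary `file` part. It uses only the Python standard library; the helper names (`encode_multipart`, `build_transcription_request`) are hypothetical, and the request is only built, not sent.

```python
import uuid
import urllib.request

TRANSCRIPTION_URL = "https://api.openai.com/v1/audio/transcriptions"


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one binary 'file' part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(audio_bytes, filename, api_key, model="whisper-1"):
    """Build (but do not send) a transcription request with 'model' as a form field."""
    body, content_type = encode_multipart({"model": model}, filename, audio_bytes)
    return urllib.request.Request(
        TRANSCRIPTION_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
```

If the Action sends a JSON body or drops the form fields, the endpoint would never see `model`, which would be consistent with the error described above, though that is a guess on my part.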


I had this error too when I tried to figure out how to do it on my own!


I am working on a similar problem and would love to know more about this. If I am correct, the model is not able to send the correct request; the error I see is an UnrecognisedKwargs error.


I see similar questions all over the forum.
Why don't we start an open-source project to work out an easy-to-use solution?

I'm implementing it in a React Native app, and will share the repo soon.


Same issue.

So instead of using JSON Schema, I went to Zapier and used Zap AI Actions to get this sorted out. Didn’t work.

Then I figured out the issue.

When you upload a file to a GPT (assuming you mean GPTs in ChatGPT), the file is uploaded to a sandbox environment, and the GPT sends only the sandbox URL instead of the file itself when it calls the API, so you get all sorts of errors (parameter errors, UnrecognisedKwargs errors and more). I think using the Assistants API and your own dev environment is the way forward: the user uploads a file, and the assistant uses function calling to have your dev environment send the file to the Whisper API and return the response.

This is my theory. I might be wrong.
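
A rough sketch of that flow, assuming the upload already sits on your own server: you declare a function tool, and when the assistant asks for it, your code looks up the file and runs the transcription itself. Everything named here (`transcribe_audio`, `file_id`, the `uploads` mapping, the `transcribe` callable) is hypothetical, and the tool call is modelled as a plain dict with `id`, `function.name`, and a JSON string in `function.arguments`, which is how I understand function tool calls to be shaped.

```python
import json

# Function tool definition to register with the assistant (names are hypothetical).
TRANSCRIBE_TOOL = {
    "type": "function",
    "function": {
        "name": "transcribe_audio",
        "description": "Transcribe an audio file the user uploaded to our server.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_id": {
                    "type": "string",
                    "description": "ID of the uploaded file on our server.",
                },
            },
            "required": ["file_id"],
        },
    },
}


def handle_tool_call(tool_call, uploads, transcribe):
    """Run a requested transcription locally and return a tool output dict.

    uploads maps file IDs to local paths; transcribe is any callable that
    takes a path and returns text (for example, a Whisper API call).
    """
    name = tool_call["function"]["name"]
    if name != "transcribe_audio":
        raise ValueError(f"unsupported tool: {name}")
    arguments = json.loads(tool_call["function"]["arguments"])
    path = uploads[arguments["file_id"]]
    return {"tool_call_id": tool_call["id"], "output": transcribe(path)}
```

The point of the design is that the audio bytes never pass through the model: it only sees a file ID going in and text coming back.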


I’ve created some demos on GitHub: davideuler/awesome-assistant-api
One of them covers speech to text, and text to speech with OpenAI TTS.
You can try them on Colab or in your Jupyter notebooks. Hope it helps.


Did you get this to work yet?


I have the same bug with a bearer token. An alternative solution to bypass the problem is to create your own server and bridge the gap between GPT actions and the target API that requires authentication, and send the token from your server. This way, we can configure it as no auth from the GPT action. But I would prefer that everything works normally …
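
A minimal sketch of such a bridge, using only the Python standard library: the GPT action calls this server with no auth, and the server attaches the bearer token before forwarding. The upstream host, the `OPENAI_API_KEY` environment variable, the port, and the names `build_upstream_request`, `BridgeHandler`, and `run` are assumptions for illustration.

```python
import os
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://api.openai.com"  # assumed target API


def build_upstream_request(path, body, content_type, api_key):
    """Copy the incoming POST and attach the bearer token on the server side."""
    return urllib.request.Request(
        UPSTREAM + path,
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )


class BridgeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        upstream = build_upstream_request(
            self.path,
            body,
            self.headers.get("Content-Type", "application/json"),
            os.environ["OPENAI_API_KEY"],
        )
        try:
            with urllib.request.urlopen(upstream) as response:
                status, payload = response.status, response.read()
        except urllib.error.HTTPError as err:
            # Pass upstream errors back to the caller unchanged.
            status, payload = err.code, err.read()
        self.send_response(status)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def run(port=8080):
    """Start the bridge on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), BridgeHandler).serve_forever()
```

Keep in mind that a bridge configured as "no auth" lets anyone who finds its URL spend your API quota, so in practice you would want some check of your own in front of it.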


I think it's worth keeping in mind that to use the whisper-1 model you need to chunk the audio. People usually split the audio file into 30-second chunks to stream to the API; if your audio file is longer than this, it may end up causing problems.
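
If you do want to chunk, here is a minimal sketch of the splitting step for uncompressed WAV input, using Python's standard `wave` module. The function name `split_wav` is hypothetical, and other formats such as MP3 would need a different tool.

```python
import io
import wave


def split_wav(wav_bytes, chunk_seconds=30):
    """Split WAV data into consecutive chunks of at most chunk_seconds each.

    Returns a list of complete WAV files (as bytes), in original order.
    """
    chunks = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as source:
        params = source.getparams()
        frames_per_chunk = source.getframerate() * chunk_seconds
        while True:
            frames = source.readframes(frames_per_chunk)
            if not frames:
                break
            buffer = io.BytesIO()
            with wave.open(buffer, "wb") as target:
                # Reuse channel count, sample width and rate; the frame count
                # in the header is corrected when the writer is closed.
                target.setparams(params)
                target.writeframes(frames)
            chunks.append(buffer.getvalue())
    return chunks
```

Each chunk can then be sent as its own transcription request and the resulting texts joined in order. Cutting at fixed offsets can split a word in half, so overlapping or silence-aware splitting may give cleaner results.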


I’m not a developer, but I’m trying to develop a custom GPT for STT. I get an error, and when I asked it why, it said:

"The issue is that the current environment I operate in doesn’t support the Whisper module, which is typically used for transcribing audio files. This limitation is specific to the tools and capabilities provided in my environment and is not necessarily a reflection of the broader capabilities of custom GPTs.

"In a more general context, integrating Whisper with a custom GPT model is technically feasible. Whisper, developed by OpenAI, is a powerful automatic speech recognition system that can be used to transcribe speech from audio files. A custom GPT model, like the one you are interacting with, can then process the transcribed text for various tasks such as summarization, translation, or contextual analysis.

“However, in this particular setup, since Whisper isn’t available, I suggest providing a manual transcription of the audio file. I can then assist you with editing and summarizing the text as needed.”


Text to speech is pretty easy to figure out but, as mentioned earlier in this thread, speech to text is not so easy. I’ve experimented with sending uploaded audio files to this Whisper API. The issue is you cannot seem to pass uploaded files through to the API.

I experimented with using code interpreter to make the file available at a set location within the context of a chat, but the API simply doesn’t get the audio. I believe files are sandboxed, as others have mentioned. The way around this would be to have the user input a URL where they’ve uploaded the file online, which is not ideal. If anyone figures out how to get transcriptions working for GPTs, I’d be interested in hearing more.
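
For the URL workaround, the server behind the action would have to download the audio itself before passing it to Whisper. A minimal sketch of that download step, with a couple of sanity checks, follows. The name `fetch_audio` and the injectable `opener` parameter are hypothetical, and the 25 MB cap is an assumption based on the upload limit I remember for the transcription endpoint, so check the current docs.

```python
import urllib.request
from urllib.parse import urlparse

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # assumed upload limit, verify before relying on it


def fetch_audio(url, opener=urllib.request.urlopen, max_bytes=MAX_AUDIO_BYTES):
    """Download audio from a user-supplied https URL, refusing oversized files."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("expected an https:// URL")
    with opener(url) as response:
        # Read one byte past the limit so oversized files can be detected.
        data = response.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError("audio file exceeds the size limit")
    return data
```

The returned bytes can then go into a normal multipart transcription request from your own server, which sidesteps the sandbox entirely.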

I’ve had much more success with text to speech and have a couple of GPTs like Voice Over Generator - which uses my own voice clone :wink:

I also had a play with the OpenAI text to speech API for my AI Voice Generator and the voices are really very realistic and get all the emotions and inflections spot on. Whoever was on the team that made Shimmer, Onyx, Fable, Alloy, Nova and Echo really deserves a pay raise! :money_with_wings:


Yes, these examples work on Colab.


Hello, Mr. Mike
I am Joe.
I have tested your “Voice Over Generator”. It’s great!

One question:
Could we integrate with your API to generate various voices?
I have 12 customized AI advisors now, and they should speak English, Spanish, French, and Arabic. They also have a gender attribute.
Is it possible to integrate with your Music Radio Creative?

Thank you, Joe.

Thanks for the lovely feedback @jaaliagas, and yes, it’s certainly something we can discuss. Would you be able to reach out directly and send details of your AI advisors project? Mention that we connected on the OpenAI community so the team forwards it on to me.


Thanks for your reply.

We are customizing mental health advisors for Gen Z.
BTW, we need a speech-to-text model for each advisor.
Here are the advisors:

Serene (mental health advocate, female)
Alex (best friend, male, female)
Sam (fitness coach, male, female)
Leo (travel advisor, male, female)

So, we need to integrate with a TTS model.
The voices should suit Gen Z and should also have male and female options.
Languages: En, Fr, Es, Ar


This is interesting. If you’d like feedback from a senior psychiatry resident (me), let me know!

Any thoughts on how this issue could be addressed? :slight_smile:
