I want to create a GPT with speech-to-text by Whisper and text-to-speech by OpenAI TTS, using custom actions.
But I totally don’t understand how to do it. Can you help me?
If I understood correctly, you want to create a custom GPT that works with Whisper V3 for STT, TTS, and custom functions, right?
If you want to use it as a custom GPT on an endpoint and use it in other applications, you could check the Assistants documentation page. It provides a step-by-step guide on how to implement some of the things you talked about, including the custom functions that you can build.
As for TTS and STT, check out the just-released update to the Python and Node packages, which includes both features!
The text-to-speech doc is really straightforward and easy to implement.
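For anyone who wants to see both calls in one place, here is a minimal Python sketch using the official openai package (v1.x); the file names are just placeholders, so adjust them to your setup.

# Minimal sketch using the openai Python package (v1.x).
# File names ("note.wav", "speech.mp3") are placeholders; adjust to your setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Speech-to-text: send an audio file to the Whisper endpoint.
with open("note.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)

# Text-to-speech: generate audio from text and write it to disk.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=transcript.text,
)
with open("speech.mp3", "wb") as out:
    out.write(speech.content)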
In case this helps anyone get any closer to figuring it out, this OpenAPI spec for Whisper speech-to-text works when I use SwaggerEditor to test it, but when I try it on my transcriptionGPT Custom Action I get an error that the model parameter hasn’t been provided:
openapi: 3.0.1
info:
  title: OpenAI Audio Transcription API
  description: API for transcribing audio into text using OpenAI's models.
  version: "v1"
servers:
  - url: https://api.openai.com/v1
paths:
  /audio/transcriptions:
    post:
      operationId: createTranscription
      summary: Create a transcription for an audio file
      requestBody:
        required: true
        content:
          multipart/form-data:
            schema:
              $ref: "#/components/schemas/createTranscriptionRequest"
      responses:
        "200":
          description: OK
          content:
            application/json:
              schema:
                type: object
                properties:
                  transcript:
                    type: string
                    description: The transcribed text.
components:
  schemas:
    createTranscriptionRequest:
      type: object
      required:
        - file
        - model
      properties:
        file:
          type: string
          format: binary
          description: The audio file object to transcribe.
        model:
          type: string
          enum: ["whisper-1"]
          default: "whisper-1"
          description: ID of the model to use. Defaults to 'whisper-1'.
        response_format:
          type: string
          enum: ["json", "text", "srt", "verbose_json", "vtt"]
          default: "json"
          description: The format of the transcript output.
        language:
          type: string
          description: The language of the input audio (optional).
        prompt:
          type: string
          description: An optional text to guide the model's style (optional).
        temperature:
          type: number
          description: The sampling temperature (optional).
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
security:
  - BearerAuth: []
Any help would be greatly appreciated
I had this error too when I tried to figure out how to do it on my own!
I am working on a similar problem and would love to know more about this. If I am correct, the model is not able to send the correct request; the error it reports is an UnrecognisedKwargs error.
I see similar questions all over the forum.
Why don't we make an open-source project to work out an easy-to-use solution?
I'm implementing it in a React Native app and will share the repo soon.
Same issue.
So instead of using JSON Schema, I went to Zapier and used Zap AI Actions to get this sorted out. Didn’t work.
Then I figured out the issue.
When you upload a file to GPTs (assuming you are referring to GPTs in ChatGPT), the file gets uploaded to a sandbox environment, and the GPT just sends the sandbox URL instead of the actual file to the API, so you get all sorts of errors (parameter error, unrecognisedKwargs error, and more). I think using the Assistants API and your own dev environment is the way forward: the user uploads a file, and the Assistants API GPT uses function calling to send the file from your dev environment to the Whisper API and gets the response.
This is my theory. I might be wrong.
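For what it's worth, here is a rough Python sketch of that pattern, assuming the audio file already lives on your own server; the path, assistant name, tool name, and model are made up for illustration.

# Rough sketch of the pattern described above: the Assistants API calls a
# transcribe_audio function that runs on your own server, where the real file lives.
# Paths and names here are hypothetical; adapt them to your setup.
import json, time
from openai import OpenAI

client = OpenAI()

def transcribe_audio(path: str) -> str:
    # The file is on our own machine, so Whisper gets real bytes, not a sandbox URL.
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

assistant = client.beta.assistants.create(
    name="Transcriber",
    model="gpt-4o",
    instructions="When the user mentions an audio file path, call transcribe_audio.",
    tools=[{
        "type": "function",
        "function": {
            "name": "transcribe_audio",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user",
    content="Please transcribe uploads/note.wav",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run asks us to execute the function, then hand back the transcript.
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    if run.status == "requires_action":
        outputs = []
        for call in run.required_action.submit_tool_outputs.tool_calls:
            args = json.loads(call.function.arguments)
            outputs.append({"tool_call_id": call.id,
                            "output": transcribe_audio(args["path"])})
        run = client.beta.threads.runs.submit_tool_outputs(
            thread_id=thread.id, run_id=run.id, tool_outputs=outputs)

# Newest message first: print the assistant's reply.
print(client.beta.threads.messages.list(thread_id=thread.id).data[0].content[0].text.value)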
I’ve created some demos on GitHub: davideuler/awesome-assistant-api
One of them is for speech to text, and text to speech by OpenAI TTS.
You can try them on Colab or on your jupyter notebooks. Hope it helps.
Did you get this to work yet?
I have the same bug with a bearer token. An alternative solution to bypass the problem is to create your own server and bridge the gap between GPT actions and the target API that requires authentication, and send the token from your server. This way, we can configure it as no auth from the GPT action. But I would prefer that everything works normally …
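In case it helps, here is a minimal sketch of that bridging server in Python with Flask; the route name and port are just examples, and the real API key stays on the server.

# Minimal sketch of the bridging idea: the GPT action calls this server with no
# auth, and the server adds the real bearer token before forwarding to OpenAI.
# The route name and port are made up for illustration.
import os
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
OPENAI_KEY = os.environ["OPENAI_API_KEY"]  # kept server-side, never exposed to the GPT

@app.post("/transcribe")
def transcribe():
    audio = request.files["file"]  # file uploaded by the action
    resp = requests.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        files={"file": (audio.filename, audio.stream, audio.mimetype)},
        data={"model": "whisper-1"},
    )
    return jsonify(resp.json()), resp.status_code

if __name__ == "__main__":
    app.run(port=8000)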
I think it's worth keeping in mind that to use the whisper-1 model you need to chunk the audio. Usually people chunk the audio file into 30-second chunks to stream to the API; if your audio file is bigger than this, it may end up causing problems.
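If anyone wants a starting point, here is one way to do that chunking in Python with pydub (ffmpeg required); the 30-second chunk length just follows the suggestion above, and the file names are placeholders.

# One way to do the chunking mentioned above, using pydub (requires ffmpeg).
# The 30-second chunk length follows the suggestion in this thread; file names
# are placeholders.
from pydub import AudioSegment
from openai import OpenAI

client = OpenAI()
CHUNK_MS = 30 * 1000  # 30-second chunks

audio = AudioSegment.from_file("long_recording.mp3")
transcript_parts = []

for start in range(0, len(audio), CHUNK_MS):
    chunk = audio[start:start + CHUNK_MS]
    chunk_path = f"chunk_{start // CHUNK_MS}.mp3"
    chunk.export(chunk_path, format="mp3")
    with open(chunk_path, "rb") as f:
        part = client.audio.transcriptions.create(model="whisper-1", file=f)
    transcript_parts.append(part.text)

print(" ".join(transcript_parts))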
I'm not a developer, but I'm trying to develop a custom GPT for STT. I get an error, and when I asked it why:
"The issue is that the current environment I operate in doesn’t support the Whisper module, which is typically used for transcribing audio files. This limitation is specific to the tools and capabilities provided in my environment and is not necessarily a reflection of the broader capabilities of custom GPTs.
"In a more general context, integrating Whisper with a custom GPT model is technically feasible. Whisper, developed by OpenAI, is a powerful automatic speech recognition system that can be used to transcribe speech from audio files. A custom GPT model, like the one you are interacting with, can then process the transcribed text for various tasks such as summarization, translation, or contextual analysis.
“However, in this particular setup, since Whisper isn’t available, I suggest providing a manual transcription of the audio file. I can then assist you with editing and summarizing the text as needed.”
Text to speech is pretty easy to figure out but, as mentioned earlier in this thread, speech to text is not so easy. I've experimented with sending uploaded audio files to the Whisper API. The issue is you cannot seem to pass uploaded files through to the API.
I experimented with using Code Interpreter to make the file available at a set location within the context of a chat, but the API simply doesn't get the audio. I believe files are sandboxed, as others have mentioned. The way around this would be to have the user input a URL where they've uploaded the file online, which is not ideal. If anyone figures out how to get transcriptions working for GPTs, I'd be interested in hearing more.
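For reference, here is a rough sketch of that URL workaround in Python: the action backend receives a URL, downloads the audio itself, and passes the real bytes to Whisper. The names and URL are illustrative only.

# Sketch of the URL workaround described above: instead of receiving the file
# from the GPT, the backend receives a URL, downloads the audio itself, and
# sends the real bytes to Whisper. Names here are illustrative only.
import requests
from openai import OpenAI

client = OpenAI()

def transcribe_from_url(audio_url: str) -> str:
    # Download the audio the user uploaded somewhere reachable by this server.
    resp = requests.get(audio_url, timeout=60)
    resp.raise_for_status()
    filename = audio_url.split("/")[-1] or "audio.mp3"
    # The openai client accepts a (filename, bytes) tuple for the file argument.
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=(filename, resp.content),
    )
    return transcript.text

print(transcribe_from_url("https://example.com/uploads/interview.mp3"))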
I've had much more success with text to speech and have a couple of GPTs like Voice Over Generator, which uses my own voice clone.
I also had a play with the OpenAI text to speech API for my AI Voice Generator and the voices are really very realistic and get all the emotions and inflections spot on. Whoever was on the team that made Shimmer, Onyx, Fable, Alloy, Nova and Echo really deserves a pay raise!
Yes, these examples work on Colab.
Hello, Mr. Mike
I am Joe.
I have tested your “Voice Over Generator”. It's great!
One question!
Could we integrate with your API for generating various voices?
I have 12 customized AI advisors now; they should speak English, Spanish, French, and Arabic. They also have a gender attribute.
Is it possible to integrate with your Music Radio Creative?
Thank you, Joe.
Thanks for the lovely feedback @jaaliagas, and yes, it's certainly something we can discuss. Would you be able to reach out directly and send details of your AI advisors project? Mention we connected on the OpenAI community so the team forwards it on to me.
Thanks!
Thanks for your reply.
We are customizing mental health advisors for Gen Z.
BTW, we need a speech-to-text model for each advisor.
Here are the advisors:
Serene (mental health advocate, female)
Alex (best friend, male, female)
Sam (fitness coach, male, female)
Leo (travel advisor, male, female)
…
So, we need to integrate with a TTS model.
The voices should fit Gen Z, and there should also be male and female attributes.
Languages: En, Fr, Es, Ar
Thanks.
This is interesting. If you’d like feedback from a senior psychiatry resident (me), let me know!
Any thoughts on how this issue could be addressed?