I know that there’s probably something being developed internally by OpenAI, but I just wanted to share some features I think would be really useful and that don’t necessarily require anything beyond what is already available in existing models like Advanced Voice Mode.
While Realtime is interesting, it seems to be very demanding (in server resources) and expensive. Perhaps it would be more cost-effective to provide better control over TTS and STT and leave the combination up to us developers.
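By “the combination”, I mean chaining the existing endpoints ourselves. A minimal sketch of what that pipeline already looks like today with the Python SDK (file names, the model names and the prompt are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the user's audio turn with Whisper
with open("user_turn.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Generate a reply with a chat model
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a friendly voice assistant."},
        {"role": "user", "content": transcript.text},
    ],
)

# 3. Speak the reply with the TTS endpoint
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("assistant_turn.mp3")
```

It has higher latency than Realtime, but every step is individually controllable, which is the point of the suggestions below.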
Who knows, perhaps someone on the developer team may see this and think about some of them? So here are my suggestions (feel free to add your own to this thread):
TTS API
- New voices - since there is already training material available, why not add them to the current tts-1 as well?
- Improve the slow-speed quality - on tts-1, slow mode sounds very metallic (the current behavior is sketched after this list). I believe this is somewhat of a limitation of the model, so perhaps in a tts-2 speed could be treated not as post-processing, but as a generation parameter.
- For example, in AVM when we ask for something to be pronounced very slowly, it really sounds like a real person speaking slowly for better diction and clarity, like when we are teaching a new word to another person.
- This would be great for a variety of purposes, like learning apps and accessibility (e.g. seniors, the hearing impaired and foreigners). It would be ok if it cost a bit more. Also, it doesn’t need numerical fine-tuning; a simple “slow mode” would be enough.
- I believe a fast mode is not required (it could just be post-processed), because it is the slow mode that suffers the most quality loss, not the other way around.
- System prompt for tone and language settings - since the models don’t let us directly specify the desired language or emotion, perhaps allow a system prompt to achieve this without mixing instructions into the text to be spoken. For comparison:
- instead of saying “speak this sentence in Spanish with a happy mood: ¡Hola! ¡Buenos días!”, which would sound weird to the end user;
- we could have a system prompt, “speak the sentence in Spanish with a happy mood”, and in the text parameter only “¡Hola! ¡Buenos días!” (a rough sketch of this is below the list).
- SSML and IPA - since I think some might suggest this, I’ll put it on the list, but I suspect they would be somewhat incompatible with how these ML models work. But who knows? It would be a good feature anyway, similar to how markup made its way into how GPT models structure their responses.
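To make the slow-mode and system-prompt points concrete: here is roughly what slow speech looks like today, followed by a purely hypothetical shape for the system-prompt idea. Neither a tts-2 model nor an `instructions` parameter exists; the commented-out call is only to illustrate separating the instructions from the spoken text.

```python
from openai import OpenAI

client = OpenAI()

# Today: speed is a numeric parameter on tts-1, and anything about
# tone or language has to be mixed into the input text itself.
slow = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="¡Hola! ¡Buenos días!",
    speed=0.5,  # this is where the metallic artifacts show up
)
slow.write_to_file("slow_today.mp3")

# Hypothetical: the shape I have in mind for a future "tts-2".
# The model name, the slow preset and the instructions parameter
# are made up; they do not exist in the current API.
# proposed = client.audio.speech.create(
#     model="tts-2",
#     voice="alloy",
#     instructions="Speak the sentence in Spanish with a happy mood, very slowly.",
#     input="¡Hola! ¡Buenos días!",
# )
```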
WHISPER API
Actually I think the Whisper API is quite alright, so the following would be a bit more challenging than the previous TTS suggestions. But here are some:
- Enforce input language - currently the Whisper API will accept any language and return a transcription as if it were pronounced correctly, acting as a translator instead of a transcriber. Sometimes we need to know the audio was wrong.
- Example: even with the input language set to Spanish, if I say “The table” it will transcribe it as “La mesa”, when it should return either nothing or some gibberish, since there is no equivalent transcription (not translation) in Spanish for that. Something like “De téibol”, as when you hear an unknown word. Sometimes we want to know it was mispronounced, for learning purposes.
- Transcribed language - a possibly easier fix would be to at least return in the response that the input was detected as English, not Spanish (a partial workaround is sketched after this list).
- Comparison analysis - input the text of the intended speech, and return an analysis of how well it was pronounced. This would be great for learning languages and improving the user’s pronunciation (a crude client-side approximation is sketched after this list).
- Like: the input is “The book is on the table”, and the analysis could have variables like accuracy (%) and a text with the words missed or mispronounced: “book, table”. Or perhaps something more advanced like: “Book is pronounced with a short ‘u’ sound, like ‘buuk’. The ‘oo’ is similar to the sound in ‘foot’. The ‘k’ at the end is soft but audible.”
- Sentiment analysis - return a text parameter describing the tone used in the audio input, like “happy, enthusiastic, clear words” or “difficulty in pronunciation, stuttering, neutral tone”. Similar to how GPT vision describes an image instead of just doing OCR. It could be of great value for people with hearing impairment.
- IPA transcriptions - return a text transcription of the audio in IPA. It might have some interesting uses, but is probably too hard to implement.
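Regarding the transcribed language: if I recall correctly, the verbose_json response format already returns a detected language field when you let Whisper auto-detect, so a partial workaround today is to compare it with the language you expected (the file name is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()

expected_language = "spanish"

# Let Whisper auto-detect the language (no "language" hint), then check it.
with open("student_attempt.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # includes a detected "language" field
    )

if result.language.lower() != expected_language:
    print(f"Warning: detected {result.language}, expected {expected_language}")
print(result.text)
```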
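And for the comparison analysis, the crude client-side approximation I can do today is to transcribe the attempt and diff it against the target sentence. It only works to the extent that Whisper doesn’t silently “correct” the attempt, which is exactly the limitation described above:

```python
import difflib

def rough_pronunciation_check(expected: str, transcribed: str):
    """Very rough word-level similarity plus a list of missed words."""
    expected_words = expected.lower().split()
    transcribed_words = transcribed.lower().split()
    matcher = difflib.SequenceMatcher(None, expected_words, transcribed_words)
    accuracy = matcher.ratio() * 100
    missed = [w for w in expected_words if w not in transcribed_words]
    return accuracy, missed

# Example with the sentence from the suggestion above; the second string
# is what Whisper might return for a poor attempt (if it doesn't autocorrect it).
accuracy, missed = rough_pronunciation_check(
    "The book is on the table",
    "The buk is on the teibol",
)
print(f"Accuracy: {accuracy:.0f}%  Missed/mispronounced words: {missed}")
```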