GPT-4o or Whisper for kids' speech

Does anyone have an idea of how GPT-4o works behind the scenes for its audio capabilities? Will the results be similar to Whisper's? Better?

A coworker wants to build a proof of concept for a client that analyzes Spanish elementary school kids' speech while they read a text or book, and gives feedback and suggestions on their reading, pauses, punctuation, and so on.

Does anyone think this will be doable in the future, or is it doable now? Is it doable with Whisper? I think all the voice models have issues recognizing child speech; am I right? Is there any service available? Should I try Whisper? I saw it is very good with Spanish, by the way. Any opinions are welcome.
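For the pauses part specifically, you can get surprisingly far with just word-level timestamps (openai-whisper can return these with `word_timestamps=True`). A minimal sketch of pause detection over that kind of output; the data shape, the 0.7 s threshold, and the example timings are my own assumptions, not real Whisper output:

```python
# Detect long pauses from word-level timestamps.
# Each word is a dict with "word", "start", "end" in seconds,
# the shape openai-whisper produces with word_timestamps=True.

def find_pauses(words, min_gap=0.7):
    """Return (gap_seconds, previous_word, next_word) for every
    silence longer than min_gap between consecutive words."""
    pauses = []
    for prev, cur in zip(words, words[1:]):
        gap = cur["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((round(gap, 2), prev["word"], cur["word"]))
    return pauses

# Hand-made timings for a Spanish sentence (illustrative only):
words = [
    {"word": "Había", "start": 0.0, "end": 0.4},
    {"word": "una",   "start": 0.5, "end": 0.7},
    {"word": "vez",   "start": 0.8, "end": 1.1},
    {"word": "un",    "start": 2.3, "end": 2.4},  # long pause before this word
    {"word": "gato",  "start": 2.5, "end": 2.9},
]
print(find_pauses(words))  # → [(1.2, 'vez', 'un')]
```

You could then cross-reference those gaps against the punctuation in the text the kid is reading to judge whether a pause was appropriate or not.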

I haven't tried this critically, but at least anecdotally I feel like this is already possible in some ways.

  • I’ve tried talking with the AI via the app and the conversation is smooth.
  • I built a separate prototype which provided me some general feedback on spoken responses to behavioral interview questions.

I could imagine (although again I have not tried) using the tool for feedback as you mentioned. Many details will vary though based on your exact needs, so I wouldn’t count on it until you try it.

Finally, I've seen some non-OpenAI approaches to this for some time now. I think Speeko was one (although I'm not sure about kid voices, other languages, etc.): Speeko - Public Speaking Coach on the App Store

How are you going to translate the accent & pronunciation to the text model?

So GPT-4o works with emotional cues, but strangely enough they did not showcase it catching a bad accent or assisting with pronunciation.

Currently available models work by first transcribing the audio to text and then running the transcript through a semantics model, so you lose a lot of important information in that transformation.

I briefly experimented with running every size of Whisper in parallel (small, medium, large), then passing that information to a semantics model to try to discover what went wrong.

The idea was that since the larger models have more training data, they would (ideally) be better at transcribing through accents and mispronunciations, while the smaller, less-trained models would fail. The semantics model could then take a crack at working out where the failure occurred.

It did work, to a degree. I wasn't very motivated to continue testing, but it did show some results; I just didn't trust it, as it's a hacky solution. Fingers crossed that GPT-4o can do it.
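For what it's worth, the divergence-finding half of that idea can be sketched without the models themselves: align the transcripts from different Whisper sizes and flag where the small model disagrees with the large one. Here with the stdlib `difflib` on invented transcripts (the example strings are made up, not real Whisper output):

```python
import difflib

def transcript_divergences(small_text, large_text):
    """Align two transcripts word-by-word and return the spans where
    they disagree, as (small_words, large_words) pairs."""
    small = small_text.split()
    large = large_text.split()
    matcher = difflib.SequenceMatcher(a=small, b=large)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete", or "insert"
            diffs.append((small[i1:i2], large[j1:j2]))
    return diffs

# Invented example: the small model garbles a mispronounced word
# that the large model recovers.
small_out = "había una vez un grato con botas"
large_out = "había una vez un gato con botas"
print(transcript_divergences(small_out, large_out))
# → [(['grato'], ['gato'])]
```

The divergent spans (plus their timestamps, if you keep them) are what you'd hand to the semantics model as candidate mispronunciations.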


+1, especially to the key point that so much context is lost by that "transcribe then semantics" pipeline. Building on that, and adding to my post above: I have not prototyped or seen open solutions to that.

Speeko does analyze text style, but again not sure how that works with kids, other languages, etc. Overall that makes me think that solutions are out there, but I’m not sure of OpenAI’s roadmap related to those.

@RonaldGRuckus, could you clarify what you mean by "to try and discover what went wrong"? What were you looking for?


You can make a similar comparison with image recognition models, which have been open to the public for a long time.
There is no doubt that GPT-4 is much smarter at image recognition than Google Lens, for example.

In general, once a model is combined with a language model, it is quite clear that it will be much smarter than a simple model designed for a single task such as speech recognition or image recognition.