Even the paid version is useless for dictation purposes

I decided to try it for dictating and transcribing a social media post.

Boy, oh boy, how wrong I was to think that ChatGPT could handle such an easy task…

I provided clear instructions on which parts I wanted transcribed and on how I would mark the beginning and end of each dictation session. I also told it to keep all transcribed sessions and provide them as one piece.

On top of that, even after repeating five times that I didn’t want to be interrupted and that I wanted to hear it only twice (once at the beginning of the dictation and once at the end), ChatGPT gave me back only the first paragraph!!! And the amount of text dictated throughout the session was way less than its context window!

I am sorry, OpenAI, but if it can’t even serve as a transcription machine in voice mode and follow instructions, why would I pay for it and expect it to handle tasks more complex than simple dictation?!

Hi there!

As I wake from what seems to be my seasonal forum hibernation, it’s posts like these that remind me why I’m on here and how badly I need to catch up every time I drift off for a couple months.

Have you considered trying the API for this?
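For pure dictation, the speech-to-text endpoint sidesteps the chat model entirely, so there are no turns for it to interrupt you in. Here’s a minimal sketch, assuming the official openai Python SDK with `OPENAI_API_KEY` set; the filename is a hypothetical recording of one dictation chunk:

```python
# Transcribe a recorded dictation chunk directly, with no chat model
# in the loop (assumes the official openai Python SDK is installed and
# OPENAI_API_KEY is set; "paragraph1.m4a" is a hypothetical file).
from openai import OpenAI

client = OpenAI()

with open("paragraph1.m4a", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # dedicated speech-to-text model
        file=audio,
    )

print(transcript.text)  # plain verbatim text of the recording
```

You could record each paragraph as its own file, transcribe them one by one, and concatenate the results yourself, which takes the “keep everything and return it as one piece” instruction out of the model’s hands entirely.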

The last time I used ChatGPT’s advanced voice mode (and yes, it was advanced mode) on the actual ChatGPT interface, it transcribed automatically. As in, every utterance that occurred was turned back into text by default. Text is the lifeblood of GPT, and advanced voice mode is merely a facade to stylize sounds that come from text.

First, are you using o1 here, or a 4-series variant? Counting things properly is better suited to the more advanced models, which sounds odd, but remember it was o1 that counted the number of r’s in “strawberry” correctly.

Second, ChatGPT’s sense of time isn’t like ours, and the boundaries you’ve described, as well as the conditions that must be met when ChatGPT reaches them, may not be easy to detect or as clear-cut for the language model. Vagueness continues to be everyone’s bottleneck with these models. It’s not clear to me what you mean by the beginning and end of a dictation just from reading this, hence why I bring it up. The language model sees the turns exchanged in a conversation and the context provided, which is the sum of those turns. Even in voice mode, it’s more bound to the same turn-taking format we’ve been used to than meets the eye.
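To make that concrete, here’s a rough sketch of what explicit dictation boundaries look like when expressed as turns, assuming the openai Python SDK on the text side; the START/END keywords and the model name are illustrative assumptions, not anything ChatGPT’s voice mode guarantees:

```python
# A sketch of dictation boundaries expressed as explicit turns
# (assumes the official openai Python SDK; the START/END keywords
# and gpt-4o are illustrative choices, not voice-mode guarantees).
from openai import OpenAI

client = OpenAI()

messages = [
    {
        "role": "system",
        "content": (
            "You are a silent dictation assistant. After the user says "
            "'START DICTATION', acknowledge once, then say nothing. When "
            "the user says 'END DICTATION', return the full dictated "
            "text so far, verbatim, as one piece."
        ),
    },
    {"role": "user", "content": "START DICTATION"},
    {"role": "user", "content": "First paragraph of the article..."},
    {"role": "user", "content": "Second paragraph of the article..."},
    {"role": "user", "content": "END DICTATION"},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)  # ideally the verbatim transcript
```

When the boundaries are literal strings inside turns like this, there’s far less room for the model to guess where a dictation session starts or ends.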

To clarify my very unclear message :).

Yes, I know that it transcribes automatically. My idea was to work with it like an assistant so it would record my article paragraphs and not include any mumbling or fact-checking in between.

I don’t need the API for that.

In advanced voice mode, I can choose only the voice, not the model, JFYI (at least on mobile).

Even if we ignore the transcribing part, it didn’t follow my instructions to remain silent during the dictation chunks, 5!!! times in a row (each time, I phrased the instructions slightly differently, trying to make it work). I asked it only to confirm the beginning and end of the dictation (I provided keywords so it would know which was which).

I have no idea how people report having whole therapy sessions with it :rofl:.
In my view, it’s dumb as a log :wood:

Honestly, across all of my experience with different models, I have yet to have a completely flawless run. Most of the time, the model is lazy or does not follow instructions, even when the list of instructions was written by the model itself (for complex tasks, I ask it to write a detailed prompt so I won’t forget any required details).

Maybe o3 will be at the required level (if OpenAI lets us choose a model for Advanced Voice mode), but despite working with LLMs for the last two years, I have yet to find them flawless enough not to frustrate me.