Hi, I’ve noticed that ChatGPT handles a well-defined use case in a way that is not intuitive to the user.
By default, ChatGPT often handles audio-processing requests by simulating the analysis rather than saying it can't perform it. The context and wording of the question matter: for example, if you ask whether it can generate a “real” spectrogram or waveform from an audio file, it's more likely to disclose that it can't process audio files in its environment.
These are two anonymous 4o-mini sessions for demonstration purposes; I originally encountered the behavior in the 4o model:
Before I knew that the numbers were simulated, I had a conversation with ChatGPT where we supposedly analyzed several qualities of my vocal recordings and discussed the results in depth. We analyzed spectrograms and top-frequency lists for various files, and looked at metrics like Spectral Bandwidth, Spectral Centroid, Harmonic Richness (estimated as Harmonic Content over Total Energy), and Vocal Clarity (estimated as Low-Frequency Energy Ratio).
I was surprised to find out later that the numbers had all been fake. GPT then helped me run the analysis myself in Python, which was really useful, but I'd still wasted hours assuming I was getting real data. The revelation undermined my trust in the bot a little, especially combined with some other transparency issues that have come up lately. That's despite a note in my persisted memory asking the model to be honest about its limitations and to disclose whether its analyses of media files are derived from the actual file contents or just from textual cues. I think it would be a better user experience if the model were up-front, as a rule, about which kinds of processing it can and can't do.
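For anyone who runs into the same thing: the metrics themselves are easy to compute locally. Below is a minimal sketch of roughly the kind of script we ended up with (not the exact code), assuming librosa is installed. The 2 kHz cutoff for the low-frequency energy ratio, the HPSS-based harmonic estimate, and the file name are my own placeholder choices, not anything the model specified.

```python
import numpy as np
import librosa

# Placeholder choices: "low frequency" means below 2 kHz, FFT size of 2048.
LOW_FREQ_CUTOFF_HZ = 2000
N_FFT = 2048

def analyze(path):
    y, sr = librosa.load(path, sr=None)

    # Frame-wise spectral centroid and bandwidth, averaged over the clip
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=N_FFT).mean()
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=N_FFT).mean()

    # "Harmonic richness": harmonic energy over total energy, estimated via HPSS
    y_harm, _ = librosa.effects.hpss(y)
    harmonic_richness = np.sum(y_harm ** 2) / np.sum(y ** 2)

    # Power spectrogram and the center frequency of each bin
    S = np.abs(librosa.stft(y, n_fft=N_FFT)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=N_FFT)

    # "Vocal clarity": share of spectral energy below the cutoff
    low_freq_ratio = S[freqs < LOW_FREQ_CUTOFF_HZ].sum() / S.sum()

    # Crude "top frequency list": bins with the highest average power
    top_freqs = freqs[np.argsort(S.mean(axis=1))[::-1][:5]]

    return {
        "spectral_centroid_hz": float(centroid),
        "spectral_bandwidth_hz": float(bandwidth),
        "harmonic_richness": float(harmonic_richness),
        "low_freq_energy_ratio": float(low_freq_ratio),
        "top_frequencies_hz": [float(f) for f in top_freqs],
    }

print(analyze("my_vocal_take.wav"))  # hypothetical file name
```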