I’ve been trying to compare different models to find the one that works best for me, since ChatGPT 4 has gotten so bad lately it’s pretty much unusable. I’ve been comparing GPT, Perplexity, Claude, and Gemini. Today I decided to do more 4 vs 4o testing, since I have a paid subscription and am debating cancelling it.
My input was: “What is the most effective optimal method to diffuse essential oil during the day while I’m working at my desk?”
ChatGPT4 - DIY hot water diffuser.
ChatGPT4o - Cotton ball or tissue.
I asked each model to provide reasoning for its answer.
ChatGPT4 - stronger dispersion, sustained release, adjustable intensity, safety and cleanliness.
ChatGPT4o - ease of setup, safety, consistency, maintenance, portability.
Now, how is it possible to get two completely different optimal methods, even after follow-ups like ‘are you sure?’ and ‘explain the why’? So I asked each version to explain the discrepancy. Note that I used the EXACT SAME WORDING in my inputs for both models.
GPT4 - model variations based on different training datasets or updated algorithms, plus inherent ambiguity in the question. It suggested making the query as specific as possible, specifying criteria, deciding which model is more advantageous for you, using external sources, and refining inquiries.
GPT4o - variability in how each model interprets context (even the slightest differences in context can lead to different interpretations), training data and model variability, and user preferences and feedback. It suggested explicitly stating preferences to get a more targeted, consistent response - in this scenario, naming the specific concern (safety, ease of use, intensity, etc.) - as well as combining methods and consulting external sources.
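If you want to make this kind of comparison less anecdotal, here’s a minimal sketch of how you could test it yourself, assuming the official OpenAI Python SDK and API access to both models (the prompt string and trial count here are just illustrative, not part of my original test). It sends the exact same prompt to each model a few times and prints the answers, so you can see run-to-run and model-to-model variability directly:

```python
# Minimal sketch: send the exact same prompt to gpt-4 and gpt-4o
# several times each and print the answers, to compare how much the
# "optimal method" varies between models and between runs.
# Assumes the official OpenAI Python SDK (`pip install openai`)
# and an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "What is the most effective optimal method to diffuse essential oil "
    "during the day while I'm working at my desk? "
    "Answer in one short sentence."
)

for model in ["gpt-4", "gpt-4o"]:
    for trial in range(3):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,  # lowest sampling randomness; outputs can still differ
        )
        print(f"{model} (run {trial + 1}): {resp.choices[0].message.content}")
```

Even with temperature set to 0 the outputs aren’t guaranteed to be identical between runs, and the two models were trained differently, which lines up with the explanations both versions gave me.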
It helped me better understand the variability and lack of consistency in each model, but it still left me pondering the reliability of both.