Spend time on OpenAI Evals

I strongly encourage folks to spend time on openai/evals on GitHub (a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks) to understand how limited GPT-4’s reasoning is.

Look especially for evals around logic and mathematical reasoning. You will start to appreciate that folks en masse have largely been conflating GPT-4’s stochastic-parrot behavior with its reasoning/inference capabilities.

After you’ve read through some evals and tried them out for yourself you will be better able to disentangle what is just smart pattern matching / translation and what is the actual chain of thought GPT4 is performing.
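If you want a feel for what an eval boils down to before digging into the repo, here is a minimal sketch of an exact-match check. To be clear, this is not the openai/evals API: `run_exact_match`, the sample format, and the `fake_model` stand-in are all made up for illustration, and you would swap in a real model call.

```python
from typing import Callable, Dict, List


def run_exact_match(samples: List[Dict[str, str]], complete: Callable[[str], str]) -> float:
    """Return the fraction of samples whose completion exactly matches the ideal answer."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        answer = complete(sample["input"]).strip()
        if answer == sample["ideal"].strip():
            correct += 1
    return correct / len(samples)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it only 'knows' one answer."""
    canned = {"What is 2 + 2?": "4"}
    return canned.get(prompt, "I don't know")


samples = [
    {"input": "What is 2 + 2?", "ideal": "4"},
    {
        "input": "If all bloops are razzies and all razzies are lazzies, "
                 "are all bloops lazzies? Answer yes or no.",
        "ideal": "yes",
    },
]

accuracy = run_exact_match(samples, fake_model)
print(f"accuracy: {accuracy:.2f}")
```

The stand-in gets the memorized arithmetic question right and fails the small deduction, which is roughly the split between recall and reasoning that the logic evals try to expose.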

Don’t get me wrong - GPT-4 can reason - but in a limited, childlike way that is not competitive with most people of average IQ.


I think you vastly overestimate the reasoning ability of the median person.


I think it’s well known that GPT-4 is limited in logic and math reasoning unless you coach it with chain-of-thought techniques.

But as a stochastic parrot that is parroting the entire wisdom of the internet, it’s amazingly useful. Don’t disrespect the parrots!

Just using GPT-4 last night to rewrite things in different styles was useful and something that would take most folks a long time to complete.

I find the current incarnation useful. Granted I’m not trying to prove lemmas or theorems in topology or anything. Just people stuff.


I was just telling someone this the other day: a human can generally outperform GPT on most if not all tasks. Even on the knowledge front, a domain expert will likely generate a better answer than GPT. You could ask GPT to come up with a list of the best things to do in Denver, and someone who lives in Denver would likely say “that’s an ok list but not great.” The thing is, I don’t live in Denver, so GPT is definitely going to come up with a better list than I could, and that’s where it gets its strength. It’s a generalist, and it’s the best generalist we’ve ever seen. It has an “ok” level of knowledge around most domains, where the average human is a domain expert in maybe 20 - 30 domains (guessing).

Pattern matching is by far its greatest strength, and the primary advice I give to people is to try to present your problem to the model as a pattern matching task, because you’ll have more success. Its general world knowledge, though, and its ability to leverage that world knowledge as it works through a problem, is unprecedented.
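To make that concrete, here is a rough sketch of one way to frame a task as pattern matching: show the model a few solved examples in a fixed layout and leave the last slot open. The `build_few_shot_prompt` helper and the `Input:`/`Output:` labels are just illustrative choices, not anything official.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Lay out solved examples in a fixed Input/Output pattern, ending with the open query."""
    blocks = [f"Input: {source}\nOutput: {target}" for source, target in examples]
    # The final block has no answer, so the model's job is to continue the pattern.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical examples for a "rewrite in a formal style" task.
examples = [
    ("meeting moved to 3pm tmrw", "The meeting has been rescheduled to 3:00 PM tomorrow."),
    ("cant make it, sick", "I am unable to attend because I am unwell."),
]

prompt = build_few_shot_prompt(examples, "running late, start w/o me")
print(prompt)
```

The resulting string is what you would send as the prompt. The model never has to be told the rule; it only has to complete the pattern.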



I would guess the average human is a domain expert in zero or one domains.

I’m finishing up a PhD in Statistics and I would feel uncomfortable calling myself a domain expert in any but the most narrowly defined domain.


An expert could outperform GPT on a task in their particular domain, yes - given enough time.

I won’t speak for anyone else, but GPT can outperform me on pretty much everything except the tasks that I’m an expert on, assuming the stochastic parrot aspect is involved in the task.

It can also outperform me on many tasks I’m an expert on, if time to completion is a factor.

As @curt.kennedy mentioned above, don’t underestimate the power of a stochastic parrot.


There’s a theory here that human reasoning is actually a bug: Natural creativity cycle - supermemo.guru

The idea is that the default survival behavior of the brain is to solve problems. e.g. How do you reach the apples on the tree? How do you survive the winter? When do infections happen? Creativity is a stochastic search within knowledge for plausible solutions.

Focus and reasoning are the result of tunnel vision when the brain is exhausted. Without the exhaustion, you get scatterbrained. It’s probably why many people work best at night, or under a constant caffeine fog.

It’s why I’ve been so excited about LLMs - it’s just a tireless dose of creativity. And it happens under a very large window of knowledge. The LLM has read literal tons of books. Trying to use it for reasoning might be the wrong tool for the job. Chess AI can reason far better than humans. Maybe in the future, there will be a combination of the two different modes.

I believe gpt3-instruct, ChatGPT and GPT-4 have sort of been tuned to be more focused. The davinci model still seems the most creative for me, though GPT-4 has its advantages.


I think the main takeaways I have from this thread are:

  1. Don’t underestimate the power of the parrot.
  2. Phrase things in a way the parrot can solve.

These two points, what the LLM really is (a parrot) and what interface it needs to be useful (the pattern matching one), are important to realize.

This applies, of course, to the recent generation of LLMs. I have read that Google is starting to inject knowledge graphs into their LLM offering, so their parrot is starting to make logical connections about the world and facts, rather than just statistically parroting tokens back.


I think you two mean different things by “domain expert.”

I think the original intent was more something like “something a human is reasonably competent at, such that they can outperform GPT.”

Also, when it comes to actual physical agency (making a bed, filling a dishwasher, running some power lines, putting a baby to sleep), the models have nothing. The internet is not the real world, even though it’s easy to fool yourself into believing it is, sometimes :)


Yeah, exactly this. Being familiar with its training data helps a lot. Prompting a lot around a particular domain will give you a good idea of what that is. It’s unfortunate that OpenAI doesn’t just publish the list to shortcut this process.

As an example of this, I was asking it some questions around the Fed voting process and how that might play out in practical reality, i.e., what sort of influence the Fed chairman has, the order of vote / voting patterns (the vice chair always votes the same as the chairman), whether the chairman knows everyone’s votes beforehand, etc.

In particular, it couldn’t figure out why the Fed chairman was always in the majority, which is actually fairly obvious to everyone (if he weren’t, economic havoc would follow), though it’s not really talked about much except in a roundabout way.

I tried all sorts of approaches and even hinting, but GPT-4 was totally lost and could only parrot stuff from the official Fed website. Clearly it hasn’t been trained on the various research papers and publications around this topic, and its limited reasoning capabilities couldn’t put two and two together.

It’s actually sort of hilarious how all it can do is parrot official speak.


Back in the day when I did political communications as a hobby, this is pretty much what was thought about communicating with people.


Yeah, I was including things like knowledge about your friends and family, your commute to work, etc. Those are all domains of knowledge that you’re an expert in and that GPT obviously can’t know about. For the broader set of domains that GPT can know about, there are probably fewer than 5 where a human will have more knowledge than GPT. I’m just throwing out numbers here, but I’m sure there’s a study brewing out there somewhere that will dive deeper into this.

My broader point is that GPT is an amazing generalist and that’s what makes it so damn compelling.