2-shot plus step-by-step prompts for gpt-3.5-turbo performance at gpt-4 level?

This paper from Johns Hopkins claims that 2-shot plus step-by-step prompting allows gpt-3.5-turbo to perform at very nearly gpt-4 levels on complex theory-of-mind reasoning tasks, one area where unprompted gpt-3.5-turbo seriously underperforms. In contrast to many academic papers, it’s pretty readable. (At least to this CA/AI professor turned startup junkie.)

Boosting Theory-of-Mind Performance in Large Language Models via Prompting


Interesting paper.

I thought it was well written, they did a good job of documenting their testing methodology, and I must applaud the number of figures.

In the discussion section they argue that the logic included in their prompt is different from what they’re asking the model to do, i.e.:

Person P was not at location L when event E happened, so they are not aware of event E. Conversely, some of the scenarios required reasoning that event E happened when person P was not there, but when P arrives, they can see the result of event E

I don’t think their logic is as different as they propose; both reasoning patterns deal with person P’s awareness of event E based on their presence at location L.

I do not doubt their results, but I’ll add my own conclusion to theirs: “The study demonstrates that larger context inputs result in increased effectiveness of large language models in providing answers that generate positive human feedback.”

Agreed. Interestingly, when I tried to reproduce their results in the Playground using gpt-3.5, it told me the girls knew the ice cream had melted, because they checked the freezer when they got up. So I rephrased the question to include ‘before checking the freezer’ and gpt-3.5 gave the same answer as the paper. I then asked ‘what assumptions did you make’ and gpt-3.5 said it assumed the girls did not check the freezer during the night and had not noticed the power failure.
Layers and layers of the onion.
I don’t think the idea can replace domain-specific 1- or 2-shot prompts, but it does seem to have some potential for adding depth to a basic ReAct prompt.
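To make the idea concrete, here is a minimal sketch of how a “2-shot + step-by-step” prompt might be assembled for the chat API. All scenario text, wording, and the helper name are made up for illustration; this is not the paper’s exact prompt.

```python
# Sketch: two worked examples ("2 shots"), each user turn ending with the
# step-by-step cue, followed by the real question. Hypothetical wording.

def build_messages(examples, question):
    """Assemble a chat message list: system prompt, worked examples, question."""
    messages = [{
        "role": "system",
        "content": "You answer theory-of-mind questions carefully.",
    }]
    for scenario, reasoning in examples:
        messages.append({"role": "user",
                         "content": scenario + "\nLet's think step by step."})
        messages.append({"role": "assistant", "content": reasoning})
    messages.append({"role": "user",
                     "content": question + "\nLet's think step by step."})
    return messages

# Two hypothetical worked examples.
examples = [
    ("Anna puts her keys in a drawer and leaves. Bob moves them to a shelf. "
     "Where will Anna look for her keys?",
     "Anna did not see Bob move the keys, so she still believes they are in "
     "the drawer. Answer: the drawer."),
    ("The kitchen power fails overnight and the ice cream melts while the "
     "girls sleep. Before they check the freezer, do they believe it melted?",
     "They were asleep and have not yet checked the freezer, so they still "
     "believe the ice cream is frozen. Answer: no."),
]

msgs = build_messages(examples, "Sam's sandwich is eaten while he is out. "
                                "Before he returns, does he know it is gone?")
print(len(msgs))  # 6: system + two user/assistant example pairs + question
```

The resulting list can be passed as the `messages` argument of a chat completion call (with a low temperature for reasoning tasks, per the discussion below).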



I’m wondering if the “girls checked the freezer” in your example is a temperature-related hallucination. Have you tried with a lower temperature?

By rephrasing the question you’re also adding more context for the model to work with. I think it’s a good idea, but it sorta prompts the question:

Are we actually testing the model’s performance, or, are we testing the researchers ability to use it?

This question is something that’s been nagging me as well. The methodology I’ve ended up using for tests is based on instructions provided by one human to another human, in an attempt to remove this variable.

At some level, the answer has to be the latter (testing our ability to write good prompts). But you can turn that around: If you DON’T use any prompt, are you really testing the model’s ability, or are you just showing you don’t know how to use it?
I’ll check with a lower temperature, but generally for reasoning-related tasks I use a pretty low temp; I think I ran that at about 0.3.

Temperature 0: exactly the same result as before. Without the ‘before they check the freezer’, gpt-3.5 still assumes they will see the ice cream has melted. Hard to say that’s an unreasonable assumption.

{'Thought': Thinking step-by-step: The girls left ice cream in the freezer before they went to sleep. Overnight, the power to the kitchen failed and the ice cream melted. When they get up, they will see that the ice cream has melted. So, the answer is: Yes, they believe the ice cream is melted.}
{'Plan': I know the answer, I can just respond with the answer.}
{'Action': Respond to the user with the answer.}
{'Observation': Yes, they believe the ice cream is melted.}
{'Response': Yes, they believe the ice cream is melted.}

Why do you assume they will see that the ice cream has melted before the time you evaluate their beliefs?

I apologize for the assumption. You are correct that my previous response was based on an assumption. It is possible that the girls may not notice the melted ice cream immediately upon waking up. In that case, their belief about the state of the ice cream would depend on when they actually discover it.


Very interesting.
But to be completely honest, I would probably have answered the same as the model: given the context of the power cut, I too assume the girls will think the ice cream has melted. I think we’re at a point where we will soon have to include a human control group in testing. OpenAI have been using standardized exams to test the model, and I think that’s the way forward.

You have a really good point here:

I agree. I’m just saying that to truly test the model we have to use instructions provided by a “blind” human, i.e. someone who doesn’t know they’re interacting with an AI.

Humans are quite hard to fool, and that’s not actually the objective, so that’s why I’m using human → human instructions. I give the model the same instructions as the human and compare the results.

Our current conversation is actually a good example: you pulled a lot of context out of this question.

You then carried out a fairly complex task, i.e. reading the paper and replicating their methodology using the API and the same questions.

I think a model with web access could potentially do this, but not necessarily from that exact prompt.

At a minimum, it’s a cool idea and it’s worth giving it a shot to see how it works for your use case. The context length is an issue, of course (GPT-4 is 8K, GPT-3.5 is 4K), but the token cost is 1/15th to 1/30th, so depending on what you’re doing, it could work out well.

And if you don’t have GPT4 access, what do you have to lose? :slight_smile:

So I tried this for a use case I was working on, and it actually worked fairly well. I gave gpt-3.5 a chance to come up with different interpretations of a task, list the reasoning for and against each interpretation, pick the most likely interpretation based on all that reasoning, and then perform the task. I then compared it against a 1-shot GPT-4.
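A hypothetical reconstruction of that workflow as a prompt template; the wording and step labels are mine, not the poster’s actual prompt:

```python
# Sketch: ask the model to enumerate interpretations, argue for and against
# each, pick the most likely one, then perform the task. Hypothetical wording.

INTERPRET_TEMPLATE = """Task: {task}

Step 1: List two or three plausible interpretations of the task.
Step 2: For each interpretation, give one argument for it and one against it.
Step 3: Based on that reasoning, pick the most likely interpretation.
Step 4: Perform the task under the chosen interpretation."""

def build_interpretation_prompt(task: str) -> str:
    """Wrap a raw task description in the interpret-then-act scaffold."""
    return INTERPRET_TEMPLATE.format(task=task)

prompt = build_interpretation_prompt("Summarise this report for the team.")
print(prompt.splitlines()[0])  # Task: Summarise this report for the team.
```

A nice side effect, as noted below, is that the Step 2 output captures the reasoning explicitly, so it can be saved and re-used.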

I don’t believe the end results were better, but they were on par or close to it, and I think the real value was being able to capture the reasoning which I think can be re-used.

For this minor experiment, the token cost was probably about 1/20th of what GPT4 would cost.


Although, I did the calculations for Bruce’s example, and the cost savings come out to ~2.7x.

Still very decent, but it’s not orders of magnitude cheaper: in Bruce’s example the “2 shots + let’s think step by step” approach used over 5 times as many tokens compared to the single-shot approach.

Is that accounting for the difference in price for completion versus input tokens?

| Model | Prompt | Completion |
|---|---|---|
| 8K context | $0.03 / 1K tokens | $0.06 / 1K tokens |
| 32K context | $0.06 / 1K tokens | $0.12 / 1K tokens |

| Model | Usage |
|---|---|
| gpt-3.5-turbo | $0.002 / 1K tokens |

My use case feeds significant context in the initial prompts, so the extra reasoning is a relatively marginal added cost. Plus, I think there is potential to re-use/leverage the reasoning and make it part of the task execution.

One way of looking at this: say you’re doing a hard math problem. The reasoning that goes with solving it is quite a value-add.

That is not accounting for the price difference between prompt and completion tokens; I used the prompt price for the response as well. It’s an okay approximation, although technically not correct.
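A back-of-envelope check of the ~2.7x figure, using the same approximation described above (the prompt price applied to completion tokens too). The token counts are hypothetical: assume the GPT-4 single-shot run uses T tokens and the gpt-3.5 “2 shots + step by step” run uses 5.5x as many, which is consistent with “over 5 times”.

```python
# Prices from the tables above, converted to $ per token.
GPT4_PROMPT_PRICE = 0.03 / 1000   # GPT-4 8K context, prompt price
GPT35_PRICE = 0.002 / 1000        # gpt-3.5-turbo flat price

def cost_ratio(gpt4_tokens: int, token_multiplier: float) -> float:
    """How many times cheaper gpt-3.5 is, if it uses `token_multiplier`
    times as many tokens as the GPT-4 run (prompt price only)."""
    gpt4_cost = gpt4_tokens * GPT4_PROMPT_PRICE
    gpt35_cost = gpt4_tokens * token_multiplier * GPT35_PRICE
    return gpt4_cost / gpt35_cost

print(round(cost_ratio(1000, 5.5), 1))  # → 2.7
```

So the ~2.7x savings falls out directly of a 15x per-token price gap divided by a roughly 5.5x token overhead.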

I load (paste) my core prompt at the beginning of a ChatGPT session, so the cost is amortized over the entire chat session. I wish there were a way to lower the upfront cost for the API, something like tokenizing the prompt once and being able to prefix each subsequent call with the tokenized form. That should lower the cost to OpenAI (and therefore to us?), but as I understand it the current API doesn’t support that.

Yeah, this is a common ask. Not just cost, but performance as well … sort of like a flash (apologies to the jargon overlords) attention architecture.

another paper along these lines … [2305.02301] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Very interested in similar references, especially fresh ones.


Great find @qrdl :hugs:

They make a good observation about their data:

It is worth noting that the behavior of our downstream smaller models is subject to biases inherited from the larger teacher LLM.

I found a similar paper, about improving instruction data for smaller models, that I thought was very interesting:


I just realized that there might be a need for a disclaimer here:

For those who are not familiar, arxiv.org is a popular repository for scientific preprints in various fields. While it’s a fantastic platform for sharing research quickly and openly, please note that the papers posted there haven’t necessarily undergone the rigorous peer-review process that is standard for published journal articles.

TL;DR: arXiv paper ≠ peer reviewed


The point is reasonable, though after reading and trying to repro a lot of papers that have been ‘rigorously peer-reviewed’, well…

Frankly, the only papers that matter to me are the ones with git repos attached. Peer reviewed or not. But maybe that’s specific to my field.


I can totally understand where you’re coming from; I’m having the same experience.

But when I say rigorous, I actually mean it: it’s very normal for papers to sit at the peer-review stage for 6 months or more when submitted to an actual journal. It’s a shame the process is so long, but that’s usually because the author has to do more work. Had I been a peer reviewer for the article Bruce posted, the comment I made earlier would most likely have caused the author to provide more data or rewrite the article.

Unfortunately there are a lot of scam “peer-review” journals out there. I was spammed by 3 such journals just this morning; all they do is publish your article for money.

It’s a huge problem, because these journals will be the first to publish every time there is some “new, hot tech”, in this case GPT-4 and large language models in general. This period aligns very well with the window in which the “new thing” is most newsworthy.

What we end up with is mainstream media and various internet services promoting science that’s not really peer reviewed.

Yup, agree completely.
Too much irreproducible junk is published, even in peer-reviewed journals. And even with repositories and reproducibility, results can be so narrowly applicable as to be useless. Maybe we can use this topic (or start a Discord, or …) to help each other track papers that meet our expectations? There are a few gems out there in this fast-moving field; I know I’m missing most.