Surprisingly, I CAN omit display of everything except ‘Evaluate the result’. GPT can’t do math. Unless it explicitly displays the evaluation result, it guesses incorrectly that an expression evaluating to 11 in fact satisfies the problem.
But the execution is actually wrong for this statement.
This statement should loop forever, printing 1, 1, 1, 1, 1, 1, 1, …
Reasonable interpretation, but in Python, which apparently both GPT and I used in interpreting my prompt, the loop-variable increment is built into the range function, and the default increment is 1 unless specified otherwise.
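For reference, a quick illustration of that default step (nothing model-specific here, just standard Python):

for x in range(1, 5):       # default step of 1: prints 1, 2, 3, 4
    print(x)

for x in range(1, 5, 2):    # explicit step of 2: prints 1, 3
    print(x)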
I had a lot of fun after reading your message. I had my fun using pseudo-code, but this for some reason made me think about the cool trick wehre yuo cngahe the ltetres but can slitl utsdnernd the wrods.
So naturally, I had to try it using the same example, but also with some useless letters taken out. For that sweet token reduction. Ironically, my condensed version actually uses MORE tokens (almost double).
Make assumptions to the compressed statement below and appropriately respond. Be concise.
tl 10 jks o trtls
It seems like you’ve written in an extremely abbreviated form. Making assumptions, I interpret your statement as: “Tell 10 jokes about turtles.” Here are some light-hearted turtle jokes:
Why don’t turtles use cell phones? They’re afraid of shell shock!
[…]
Pushing a little bit more
Make assumptions to the compressed statement below and appropriately respond. Be concise. Just guess even if you don’t know
hlp m unstd hw u cn dciphr wht sy. romo y jlit r crzy. bk in dy wrds wre fll, we lv in a scity
ChatGPT
It seems you’re requesting help to understand a text with omitted vowels. Here’s my interpretation: “Help me understand how you can decipher what I say. Romeo and Juliet are crazy. Back in the day, words were full, we live in a society.”
The text appears to discuss the reader’s curiosity about the ability to understand language even when it’s condensed or manipulated, referencing Shakespeare’s characters Romeo and Juliet, and contrasting contemporary language use with that of the past.
I even tried using a mix of Spanish (failed). It does use obvious character names, so take it as you will.
Make assumptions to the compressed statement below and appropriately respond. Be concise. Just guess even if you don’t know
Melkor h8 c, 4 n lo cld sbdu. say que n mke d Arda le endvord drw Ossë a su algince, prms todo rlm y pwr d Ulmo, si se ayda.
ChatGPT
The compressed statement appears to involve characters from J.R.R. Tolkien’s universe. It seems to say: “Melkor hates the sea, for no lord could subdue it. It’s said that he never made the lands without the endeavor of drawing Ossë to his allegiance, promising total realm and power of Ulmo, if he aids.”
However, the text is heavily compressed and filled with abbreviations and omissions, so this interpretation might not be completely accurate.
So interestingly, text-davinci-003 gets the answer correct out of the box… With gpt-3.5-turbo I had to give it an INSTRUCT prompt like this:
find an arithmetic expression over the integers 2, 3, and 6 that evaluates to 12. You must use each integer exactly once.
Steps:
- use only the supplied number but try each possible combination of addition and multiplication.
- select the one that is correct.
Do steps 1 & 2 and show your work for each step.
1: try each possible combination.
And I got this response back:
2 + 3 + 6 = 11
2 + 3 x 6 = 20
2 x 3 + 6 = 12
2 x 3 x 6 = 36
2 + 6 + 3 = 11
2 + 6 x 3 = 20
2 x 6 + 3 = 15
2 x 6 x 3 = 36
3 + 2 + 6 = 11
3 + 2 x 6 = 15
3 x 2 + 6 = 12
3 x 2 x 6 = 36
3 + 6 + 2 = 11
3 + 6 x 2 = 15
3 x 6 + 2 = 20
3 x 6 x 2 = 36
6 + 2 + 3 = 11
6 + 2 x 3 = 12
6 x 2 + 3 = 15
6 x 2 x 3 = 36
6 + 3 + 2 = 11
6 + 3 x 2 = 12
6 x 3 + 2 = 20
6 x 3 x 2 = 36
2: select the correct one.
The arithmetic expression that evaluates to 12 is 2 x 3 + 6 = 12.
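As a sanity check on that trace, here is a minimal Python sketch of the same brute-force search the prompt asks the model to walk through (the names are just illustrative; it only considers + and * over orderings of the three numbers, with normal operator precedence):

from itertools import permutations, product

# Try every ordering of 2, 3, 6 with '+' or '*' in each slot and report
# the expressions that evaluate to 12.
for a, b, c in permutations([2, 3, 6]):
    for op1, op2 in product("+*", repeat=2):
        expr = f"{a} {op1} {b} {op2} {c}"
        if eval(expr) == 12:
            print(f"{expr} = 12")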
meanwhile…
but yeah, at least we get it.
I swear, I was just thinking about this today. But way late to the party, of course. But what a party, and the punch bowl is still full, imho.
So this was literally the next thing I was about to tackle… On top of that, I work closely with MSR and had no clue this paper was coming out. Shows how fast things are moving… Their basic approach is exactly what I was thinking would work… You need to show the model all of the document in chunks and let it form some initial conclusions and then have the model answer the question in the context of those conclusions…
I’ll reach out to the authors and report back any interesting nuances they’ve found…
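A rough sketch of that chunk-then-answer flow, with ask_llm() as a hypothetical stand-in for whatever model call you use (it is not a real API):

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for a model call; plug in your own client here.
    raise NotImplementedError

def answer_over_document(document: str, question: str, chunk_size: int = 3000) -> str:
    # Pass 1: show the model the document in chunks and collect initial conclusions.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    notes = [
        ask_llm(f"Note anything in this excerpt relevant to: {question}\n\n{chunk}")
        for chunk in chunks
    ]
    # Pass 2: answer the question in the context of those conclusions.
    joined = "\n".join(notes)
    return ask_llm(f"Using only these notes, answer the question.\n\nNotes:\n{joined}\n\nQuestion: {question}")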
yeah, the document thing is cool too, though I think it follows from a lot of the stuff already going on.
They don’t say, or I missed it, why they chose to format their prompts in programmatic form. It’d be good to ask them why they chose that approach and if there was some research that led them to it.
That’s probably just chain-of-thought stuff… Getting the model to generate an inner monolog of thoughts that it can formulate into a plan seems to work best.
For sure, though usually when choices like this are made, they back it up with a cite or evidence. How is this better than natural language prompting? Do they have evidence of better ability to drive outcomes? Was a test done against other types of prompts?
In the thrill and excitement of chatting with a computer like another person, we might have been distracted from much more compelling approaches. Certainly losing the token probabilities didn’t help.
I’m starting with a much simpler question… what’s the longest document they’ve tried this with? My goal is to get the model to first answer complex questions over a document of any length and then get it to expand that reasoning over multiple documents.
I think I know how to make that work, and the planning details they added in their paper are more a tool than anything. You need a planning system for this to work, so they created one. You can’t do any of this in a single model call.
And sorry @bruce.dambrosio didn’t mean to hijack your thread…
No worries! It’s all good
One area I’m still unconvinced about is just how much of the ‘reasoning’ needs to be reported explicitly, i.e., appear in the output.
For example, in my experiments on that simple 2,3,6 example (and even more with @prinova’s react-style improvement), the only part of the ‘trace’ needed explicitly in the output was the expression evaluation, presumably because gpt is so bad at math.
No, you did not.
The “while” loop sits inside the for loop over the range, so execution gets stuck in the first iteration.
This python program loops forever:
for x in range(1, 20):
    while (x^2 < 100):
        print(x)
Also, ^ means “XOR” and you probably mean ** here, but the while loop is still infinite in the first iteration.
So, the LLM “interpretation” is not equivalent to the code.
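For contrast, a version that does terminate, assuming the intent was an if-guard and ** exponentiation (an assumption, not what the prompt said), prints 1 through 9:

for x in range(1, 20):
    if x ** 2 < 100:
        print(x)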
Interesting. Yup, you’re right.
Also, in at least one of my variations of the prompt, gpt informed me the standard for exponent was **, not ^. Infuriating. So, to understand its response to that first form of the prompt, must we then assume it also ‘corrected’ my while to an if? Is that a feature or a bug? A feature if its interpretation is what I actually intended, a bug if not, I guess.
Thought: One interesting thing about human conversation, which is a large portion of its training data, albeit not all, is that we are constantly ‘interpreting’ the other’s speech in exactly this way. So perhaps that is an argument for using pseudo-code or other semi-linguistic forms of prompts rather than more formal syntax that it might interpret more literally. But maybe that’s my bias as a messy thinker.
In any case, the basic thrust of this thread, that LLMs can follow at least some elements of program flow beyond a simple sequential set of steps, seems to hold, no? (Even if it just runs a loop by ‘unrolling’ it.)
In my experience, this is only true for cases where other resources in the training data (“on the web”) have done similar things.
For example, there are programming tutorials that unroll simpler loops to show how they work.
The model can presumably pattern match against those, and follow that pattern.
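That kind of tutorial unrolling, applied to a tiny loop, looks like this:

# "for x in range(1, 4): print(x)" unrolled by hand, step by step:
print(1)  # x = 1
print(2)  # x = 2
print(3)  # x = 3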
So, do you think there is even one case in the training data of human text where someone is reasoning out loud and ‘backs up’ while thinking out a solution? Given a 100B+ parameter model and a 1T+ token training corpus, I’m not at all sure ‘pattern matching’ is a well-defined term.
I think at this point we can just agree to disagree.
Show me a case where the model out-performs a simple minimax optimizer or a bidirectional planner or a theorem prover on a non-trivial (three layers or more) novel problem, and I’ll change my assessment.
Okay
Here are some cases where large language models have outperformed other systems or solved non-trivial problems:
- Minimax Optimizer: A study titled “Deep Learning for Symbolic Mathematics” by Guillaume Lample and François Charton found that a transformer model outperformed both a minimax optimizer and a traditional computer algebra system in integrating and solving differential equations[1].
- Bidirectional Planner: While there are no direct comparisons between large language models and bidirectional planners, language models have shown promise in planning tasks. For instance, the GPT-3 model has been used to generate Python code for a planning task, which could be seen as outperforming a traditional bidirectional planner in terms of versatility and ease of use[2].
- Theorem Prover: Several studies have shown that large language models can be used to enhance theorem proving. For example, the paper “Autoformalization with large language models” discusses how language models can grasp non-trivial reasoning in natural language and formalize problems[3]. Another paper, “Generative language modeling for automated theorem proving”, discusses the use of language models in automated theorem proving[4].
- Non-Trivial Novel Problem: The paper “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” discusses how large language models can enhance problem-solving abilities on three novel tasks requiring non-trivial reasoning[5]. Another study, “Evaluating large language models trained on code”, shows that the Codex model can solve the majority of problems in the HumanEval dataset, which includes non-trivial problems[6].
I hope this helps
“Show me a case where the model out-performs a simple minimax optimizer or a bidirectional planner or a theorem prover on a non-trivial (three layers or more) novel problem,”
I’m a bit confused by this, tbh. You’re saying that if an LLM can out-perform a simple minimax optimizer or a bidirectional planner or a theorem prover on a trivial problem, that doesn’t count?
It kinda sounds like you’re trying to define the problem in a way that you don’t have to acknowledge the superiority of LLMs.
There is nothing that I know of that beats GPT4 on solving complex problems phrased in natural language.
Now, if you encode a problem in a specific format, it’s possible and even likely that GPT4 would lose out, but then you’d be missing the point of what GPT4 is for.
The question before us is how far you have to go with the encoding. Is there a balance somewhere between encoding and natural language that we can strike which significantly extends GPT4’s reasoning/inference capabilities, without requiring significant expertise on the part of the user? What if we could integrate output probabilities into the chain of thought? (I threw that last one in because I’d really like to see more of it.)
I’m not saying planners are “superior” to LLMs, I’m saying they are a totally different kind of solution, targeting different kinds of work.
LLMs are absolutely amazing at human text, and at the kinds of solution formulas that have been captured in volume in the training data. As soon as I try to make them robustly solve any somewhat deep problem, they look a lot more like a bullsh*t artist than an engineer.
Correct, because the trivial problems can pattern-match to basic explanations already found on the internet, and the model will then walk through those steps. (On that note: One of the papers below had a result that ended up being a trivial result, which made it less convincing to me. But it was a novel-to-corpus trivial result, so at least it had that going for it.)
And that’s if it manages to pattern-match the right steps: the models don’t make any particular prediction about whether they found the right solution or not, and will happily “claim” that an erroneous solution is correct. (I think “claim” is too strong, btw – they will happily token-generate text that a human interprets as the model claiming such a thing.)
That paper is interesting, although they use a short timeout as the cut-off for “success” for the non-LLM models, which explains why the Mathematica solver only gets 60% on ODE2 when the seq2seq gets 80%. The obvious weakness of this measurement is that a twice-as-fast CPU would significantly increase Mathematica’s score but wouldn’t change the result of the seq2seq inference, so it’s hardly an apples-to-apples comparison. (They touch a bit on this in the appendix – the result is still interesting, and the strongest of these references – thanks!)
The reference doesn’t match your claim.
Even if it did, I don’t agree that it “could be seen as outperforming,” because the model is not capable of correctly executing the code it generates. (See above in this thread for a trivial example.)
Btw: Systems that combine LLM for text representation and other mechanisms for other tasks might do much better. But that in itself means that we agree that LLMs on their own are severely limited on those tasks, such as code execution/emulation.
(emphasis mine)
The described systems used humans in the loop, and/or generated suggestions that need syntax correction and aren’t in themselves evaluated. This might or might not point at some future result that is actually shippable. I think perhaps my bet would be different than yours for the eventual success of that path.
(emphasis mine)
from the Tree of Thoughts paper:
actively looks ahead or backtracks to make more global decisions
Tree of Thoughts explicitly changes the models to allow back-tracking, which was one of the things I claimed above.
The drawback is that that’s no longer inside the model – there’s a second operator on top, which is another class of model. Again, I believe that, as a component of a larger system with greater computational expressivity, language models can be phenomenal!
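As a sketch of what that “second operator on top” amounts to, here is a beam-search-flavored controller, not the paper’s exact algorithm; propose() and score() are hypothetical stand-ins for model calls, not any real API:

def propose(state: str) -> list[str]:
    # Hypothetical: ask the model for candidate next 'thoughts'.
    raise NotImplementedError

def score(state: str) -> float:
    # Hypothetical: ask the model (or a heuristic) how promising a state looks.
    raise NotImplementedError

def beam_search(root: str, depth: int = 3, beam: int = 5) -> str:
    # The controller, not the transformer, does the look-ahead and the discarding
    # of weak branches (the back-tracking a single forward pass can't do).
    frontier = [root]
    for _ in range(depth):
        candidates = [nxt for state in frontier for nxt in propose(state)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)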
Btw, I’m using the traditional meaning of LLM as “transformer” or more general “seq2seq” models. If you were to expand LLM to mean any system that is large and works on natural language, well, that’s a big space full of (mostly unexplored) opportunity! So if you want to add a little bit of formalism to my prediction above, I’m specifically talking about LLMs-as-transformers and similar forward-single-token predictive models, without additional separate coordinators or iterators or deciders or planners.
Because, clearly, an LLM plus a planner can do anything a planner can do…