Tree of Thoughts: Deliberate Problem Solving with Large Language Models
… we return to the origins of artificial intelligence (and cognitive science), drawing inspiration from the planning processes explored by Newell, Shaw, and Simon … characterized problem solving as search through a combinatorial problem space, represented as a tree. We thus propose the Tree of Thoughts (ToT) framework for general problem solving with language models.
I found Karpathy’s shade throwing around this sort of thing very unfortunate.
I mean I could just say:
Overheard: “People who know nothing about pure mathematics are now paradoxically advantaged in machine learning because they don’t immediately reach for overly sophisticated math and spend a lot more time hacking ML algs” When hacking ML algs feels below your dignity but it works :’|
The OAI folks post a lot of … stuff … on twitter. Maybe they could dial it back a bit. We have enough noise going around already.
Thank goodness we live in a world where there are those who feel free to ‘try’, ignoring the expert’s “you can’t do that” filter.
Aside - the more I think about the ‘Tree of Thoughts’ paper, the less I think of its contribution to advancing our ability to use the tool (the LLM).
The hard problems of search are alternative generation and evaluation, and the paper doesn’t provide any general methods for either. The key in alternative generation is coming up with an appropriate step size, and LLMs don’t seem (imho) very good at that, or at reasoning about their own capabilities or limitations in general.
LLMs don’t reason at all in the sense of trying a thing, working it out, and then seeing if it works and back-tracking if it doesn’t.
LLMs pattern match.
They pattern match against a very large corpus of things humans have put into words in the past, which is quite likely to contain some patterns that are applicable to what you want to do, but they don’t ever try-and-repeat. They don’t have the internal representation to run a “thought experiment” on their own inference output, although you could maybe feed the result back and infer something else about it …
(Also, “inference” here is used in the Markov-like model-generation sense, not the logical/mathematical sense.)
I respectfully disagree.
I would propose that perhaps a more interesting question to ask than ‘what is it doing’ might be ‘how can we humans model and think about what it is doing in ways that enable us to manipulate it usefully?’
to paraphrase Sutskever (sp?)
‘people think predicting the next token can’t possibly be enough. But to predict the next token well enough you have to build an internal model’ (my emphasis, not his).
another example: what is ‘reason step by step’ style prompting? Even if you disagree with my paraphrase above, it can at least be usefully thought of as surfacing subsymbolic reasoning, even if from a mechanistic view that might make no sense.
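For concreteness, here is a minimal sketch of what I mean by ‘reason step by step’ style prompting. `complete` is a hypothetical stand-in for whatever completion call you happen to use, and the prompts are only illustrative; nothing here is a specific vendor’s API.

```python
# Minimal sketch of "reason step by step" prompting.
# `complete` is a hypothetical placeholder, not a real library call.

def complete(prompt: str) -> str:
    """Hypothetical LLM call; wire this up to your model of choice."""
    return "<model output would appear here>"

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

direct_prompt = f"{question}\nAnswer:"
stepwise_prompt = f"{question}\nLet's reason step by step, then give the final answer."

# The second prompt tends to surface intermediate steps as explicit tokens before
# the answer, which is the "surfacing subsymbolic reasoning" framing above.
print(complete(stepwise_prompt))
```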
One last thought: if it is Markov-model-like, it is a quite complex, multi-level, probably non-hierarchical Markov model, so I’d suggest there are multiple abstraction scales at which one can describe the ‘model’ being captured. ‘Just’ token sequences? Perhaps.
@jwatte, just saw something I didn’t notice before in your post: “but they don’t ever try-and-repeat”
well, yes. as you point out, you’d have to put some wrapper code around them to do that. but then you say:
They don’t have the internal representation to run a “thought experiment”
That is intriguing. How would I test this hypothesis? Is it fair to try to write a prompt that suggests it do so before providing an answer? If so, would failure of attempts to write such prompts be proof of lack of capability? Would ‘success’, such as text output indicating it had done so be proof of ‘thought experiment’ capability, or just more pattern matching?
Popper’s theories of scientific hypotheses as necessarily falsifiable have been discredited, I’m told, but they still seem useful in practice.
Regardless of all the above, I’m genuinely intrigued by the idea of running thought experiments in a single turn. I’ll try to find some time to play with it. Thanks!
For what it’s worth, and to ground the ideas proposed in the paper and discussed here:
One area of application for tree of thought is exploring different scenarios for business growth in the enterprise domain. This is an ill-structured problem with many many moving pieces but hugely valuable for a large company. Most of what companies do in this area is based on guesswork. Some can afford to hire McKinsey to do the guesswork for them!
A key problem here is to understand the appropriate level of human-machine interaction, yes?
Seems like there are four basic actions in search (ignoring, for the moment, the all-crucial problem formulation step):
move set generation
move selection
move application
state evaluation
Seems like a talented business analyst could implement ToT manually, doing the move selection and evaluation either by hand or with AI support. The key processes of move-idea generation and move application could then be interactive, with the AI leading move-set generation and hopefully doing most of the heavy lifting in move application.
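To make that concrete, here is a rough sketch of how those four actions could hang together in a depth-limited loop. Every function below is a hypothetical placeholder to be filled in by the analyst or by an LLM call; none of it is an API from the ToT paper.

```python
# Rough sketch of the four search actions wired into a simple depth-limited loop.
# All functions are hypothetical placeholders for analyst judgment or LLM calls.

from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                                  # textual description of the scenario so far
    score: float = 0.0
    children: list["Node"] = field(default_factory=list)

def generate_moves(state: str) -> list[str]:
    """Move-set generation: ask the LLM (or the analyst) for candidate next steps."""
    ...

def apply_move(state: str, move: str) -> str:
    """Move application: have the LLM elaborate the scenario under this move."""
    ...

def evaluate_state(state: str) -> float:
    """State evaluation: score by rubric, LLM self-critique, or human judgment."""
    ...

def select_child(children: list[Node]) -> Node:
    """Move selection: the analyst picks a branch, or default to the best score."""
    return max(children, key=lambda c: c.score)

def expand(node: Node, depth: int) -> None:
    if depth == 0:
        return
    for move in generate_moves(node.state):                 # move-set generation
        child_state = apply_move(node.state, move)          # move application
        node.children.append(Node(child_state, evaluate_state(child_state)))  # state evaluation
    if node.children:
        expand(select_child(node.children), depth - 1)      # move selection
```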
Or were you thinking of a completely automated process?
Yes, the right level of human-machine interaction is the key because the decisions are very expensive. Therefore, the enterprise people signing the checks need to be convinced about the solutions formulated with the help of an AI. They get convinced only by being in the loop (or having the people they trust in the loop).
A completely automated process won’t sell even if it is technologically feasible.
There’s no testing needed, because this follows from first principles: the construction of the model simply does not allow it to back up, nor to predict further than one token ahead. Similarly, the inference model is only forward through the layers; it doesn’t “run a loop.”
You could make arguments about the output tokens being part of the input, and this causes a loop of some sort, but it’s still a forward-only loop, which is computationally less powerful than a full loop with conditionals. I would argue that this doesn’t rise to the bar of “reasoning,” but I’m not sure there exists a well-defined test for what “reasoning” really means, so if you were to attack that argument, going after the definition would be the way to do it.
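To spell out what I mean by a forward-only loop, here is a minimal, library-free sketch of autoregressive decoding; `forward_pass` is a hypothetical placeholder for the actual model, and the toy probabilities are just there so the sketch runs.

```python
# Minimal sketch of the "forward-only loop": each step is one forward pass over the
# current context, one token is picked, and it is appended to the context. Nothing
# here ever removes a token or revisits an earlier choice.

def forward_pass(context: list[int]) -> list[float]:
    """Placeholder for one forward pass; a real model would return
    next-token probabilities over its vocabulary."""
    return [0.2, 0.5, 0.3]  # toy 3-token vocabulary, fixed output for the sketch

def pick(probs: list[float]) -> int:
    """Greedy choice; sampling would not change the forward-only structure."""
    return max(range(len(probs)), key=lambda i: probs[i])

def generate(prompt_tokens: list[int], max_new: int) -> list[int]:
    context = list(prompt_tokens)
    for _ in range(max_new):
        probs = forward_pass(context)   # layers run strictly forward
        context.append(pick(probs))     # output becomes part of the next input
        # Any backtracking would have to be added by wrapper code outside this loop.
    return context

print(generate([0, 1], max_new=5))
```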
I’ve also heard an argument that each “layer” is its own step in a reasoning chain, but that’s still just one “thread” and the model keeps no other hypotheses; it just arrives at token probabilities and picks one, so I’m wholly unconvinced by that argument.
Not really true, because of chain-of-thought and self-refinement, but even if it were true I don’t think ‘backup’ is part of the dictionary definition of reasoning.
We’re probably all saying the same thing, tbh, just getting caught up in semantics.
I think I’m going to start using the term ‘infers’ instead of ‘reasons’ so I can avoid these discussions.
First principles? You mean I could give your argument to a theorem-prover and it would confirm it? Again, I respectfully disagree. Your argument is much too abstract for simple prima-facie confirmation.
Inference is strictly bottom-up, left-to-right, across every unit in every layer? Are you sure? No right-to-left connections between any units in any layer anywhere? I’ve never seen a multi-layer model built that way. Even a simple convolutional layer in static image recognition involves bi-directional aggregation over the layer below it, right?
A classic convnet is strictly forward inferred. The pool layers are still a forward operation.
There is an argument that these models can speculate, at a depth no deeper than the number of layers, and use the outcome of some limited number of speculation branches to select one of a limited number of separate inferences, and I’m somewhat receptive to this argument. I think the stronger part of my argument is that the models cannot “back up” – they still speculate one token at a time.
When it comes to feedback, there’s clearly read-only feedback from what it inferred, as any previous output is available as input. (This is one of the things that are different in transformers and other recurrent models compared to forward convnets.) As far as I understand it, this form of read-only feedback is equivalent to some amount of unrolling into a “fixed function” forward-only model, which ends up equivalent to the multiple-branches-selection multi-layer implementation I suggested above. Maybe the Google 540-billion-parameter model has enough of this to functionally qualify? I don’t have a good intuition for whether the depth needed is additive or exponential…
That being said, the actual exhibited behavior of the current crop of models is not that of “reasoning” as far as I can tell; e.g., inference workloads that require multi-step hypothesis testing consistently fail.
OK, I’m starting to understand your argument. But still, why, do you think, does multi-step hypothesis testing fail? If we make the hypothesis and the test result explicit, they will recycle into input, no? Is the context window too small? Yes, they ‘speculate’ one token at a time, but that ‘speculation’ is based on the entire context window, which as I understand it is a sliding window over the initial input and all previously generated tokens. What am I missing?
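To make the question concrete, the kind of wrapper I have in mind looks roughly like the sketch below. `complete` and `run_test` are hypothetical placeholders (an LLM call and an external check, respectively), and the “PASS” convention is just assumed for illustration.

```python
# Sketch of a try-and-repeat wrapper: the hypothesis and the test result are made
# explicit as text, so they recycle into the input on the next round.
# `complete` and `run_test` are hypothetical placeholders, not a real API.

def complete(prompt: str) -> str:
    """Hypothetical LLM call; returns a candidate solution as text."""
    return "<candidate solution>"

def run_test(hypothesis: str) -> str:
    """Hypothetical external check: unit test, calculator, or human review."""
    return "FAIL: placeholder result"

def try_and_repeat(problem: str, max_rounds: int = 3) -> str:
    transcript = problem
    hypothesis = ""
    for _ in range(max_rounds):
        hypothesis = complete(transcript + "\nPropose a solution and how to test it.")
        result = run_test(hypothesis)
        # Hypothesis and test result become explicit tokens in the next prompt --
        # assuming the growing transcript still fits in the model's context window.
        transcript += f"\nHypothesis: {hypothesis}\nTest result: {result}"
        if "PASS" in result:  # assumed convention for the test harness
            return hypothesis
        transcript += "\nThat failed; revise the approach and try again."
    return hypothesis
```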
There wasn’t anything new in the Karpathy talk that I saw, but some of the things he decided to focus on were interesting.
I don’t think we can extrapolate too much from it, but it makes me wonder if this is what OpenAI is thinking about in terms of intelligence expansion. One advantage they have is access to internals like activations and probabilities. We need more research around that with respect to CoT stuff.
The models don’t go back to “recycle” anything. They move forward and output a token.
I’ve never seen the model output “and then we do X, and then we do Y … no wait that doesn’t work, let’s do W instead of X, and then we do Z …”
And a reasoning model would do that before it even outputs X (like a typical reasoning agent composing output instructions would.)
Respectfully disagree. My understanding is that this is exactly how LLMs work. Yes, they generate one token at a time, but each token is generated from a context that includes the original input plus all previously generated tokens.
Assuming context is large enough, of course.