[PAPERS] Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

Quiet-STaR is a method that helps language models (LMs) improve their predictions by teaching them to generate rationales, or internal thoughts, at each token of the text they process. It builds on an earlier method called STaR, which helped LMs learn from rationales in a question-answering context. Quiet-STaR addresses three main challenges: the high computational cost of generating continuations, the fact that the LM does not initially know how to produce or use internal thoughts, and the need to predict beyond just the next token. The solution includes a tokenwise-parallel sampling algorithm, learnable tokens that mark the start and end of a thought, and an extended teacher-forcing technique. As a result, the model better predicts difficult parts of the text and improves its performance on complex reasoning tasks without needing task-specific training. This suggests Quiet-STaR is a step toward more general and scalable reasoning in language models. The quote from Kierkegaard at the end underlines the idea that understanding comes from reflection, just as Quiet-STaR allows an LM to “understand” text by reflecting on its internal rationale.


Is this the same as the Q* that came up around Sam’s firing last year?

It is not “Q-star” or anything of the sort, whatever the nonsense video headline would have you believe.

This is from Stanford.

STaR = Self-Taught Reasoner (multi-step agent)

“Quiet” because only the final result is presented, which is something you’ve always been able to do, even with an OpenAI cookbook.


Hi, author here to clarify a few details. You might be confusing vanilla chain-of-thought prompting with Quiet-STaR. No worries at all - the main differences are:

  1. We train the model to generate more useful thoughts using RL, like in the original STaR paper from a few years ago.
  2. Unlike STaR, we reward the model for generating inner monologues that help predict web text instead of answers to specific questions - this helps it generate thoughts that are less domain-specific.
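
To make point 2 concrete, here is a minimal sketch of an LM-objective-based reward combined with a REINFORCE-style update. It is an illustration under stated assumptions, not the paper's implementation: the function names `thought_rewards` and `reinforce_loss` are hypothetical, the log-probabilities are plain floats standing in for model outputs, and using the mean over sibling thoughts as a baseline is an assumption of this sketch.

```python
def thought_rewards(logp_future_given_thought):
    """Score each sampled thought by how much it helps predict the true future text.

    logp_future_given_thought: one float per thought sampled at the same
    position, each the summed log-probability of the real upcoming tokens
    when the model conditions on that thought (hypothetical input format).
    The mean over sibling thoughts is used as a baseline (an assumption).
    """
    baseline = sum(logp_future_given_thought) / len(logp_future_given_thought)
    return [lp - baseline for lp in logp_future_given_thought]


def reinforce_loss(rewards, thought_logprobs):
    """REINFORCE-style surrogate loss: sum of -reward * log p(thought).

    Minimizing it raises the probability of thoughts with positive reward
    and lowers the probability of thoughts with negative reward.
    """
    return sum(-r * lp for r, lp in zip(rewards, thought_logprobs))


# Three thoughts sampled at one position; the second helps most.
rewards = thought_rewards([-4.0, -2.0, -3.0])
loss = reinforce_loss(rewards, [-5.0, -6.0, -7.0])
```

In a real training loop the floats would be differentiable tensors from the model, and the reward would be treated as a constant when backpropagating through the thought log-probabilities.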

One of the coolest results is that this internal monologue also improves the model’s external CoT: by “thinking” before each external CoT token, the model makes fewer mistakes in its steps and scores better on reasoning tasks.

But there are a lot of details needed to make this work. If the OAI cookbook happens to mention how to do tokenwise-parallel RL fine-tuning with learned meta-tokens and an LM-objective-based reward, please share that page and we’d definitely cite it as related work.


I would like to test this approach and design. How can this be done?