Hi everyone, I’ve been exploring the reasoning capabilities of large language models and noticed something interesting: DeepSeek-R1 (671B parameters) performs really well on a specific reasoning task, while its distilled version, DeepSeek-R1-Distill-Llama-8B (8B parameters), lags behind significantly. I’ve tried improving the smaller model with domain-specific distillation and fine-tuning (rough sketch below), but the gains seem limited. I’d love to get your input on a few questions:
- Is model size (parameter count) the primary factor determining the upper limit of complex reasoning abilities?
- For a smaller model (e.g., 8B parameters), can further training or optimization bring its performance close to that of a larger model on complex reasoning tasks, or is parameter count a hard ceiling?
- Are there any papers or practical experiences you could share on this topic?
Thanks for any insights or discussion!
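
For context, the fine-tuning I mention above looks roughly like the sketch below (Hugging Face Transformers + PEFT + TRL); the dataset name, LoRA settings, and training arguments are placeholders rather than my exact setup:

```python
# Minimal LoRA fine-tuning sketch for the distilled 8B model.
# Dataset name and hyperparameters are placeholders, not the exact setup used.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16")

# Placeholder: any domain-specific reasoning dataset with a plain "text" column.
dataset = load_dataset("my-org/domain-reasoning-sft", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM", target_modules="all-linear"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="r1-distill-8b-domain",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
    ),
)
trainer.train()
```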
Thanks for the R1 advertising. Was that a real question, or just the ad?
Perhaps I shouldn’t have specified the exact names of the models I’m using, but I’m really struggling with this issue. I’m a freshman and new to LLMs, so I hope you can help answer my question.
Imagine a box with 100 strings leading into it, and one of them is connected to a piece of useful information.
Now imagine another box that holds 50 pieces of useful information but has 10,000 strings.
There is more information in the bigger box.
But here’s the catch: the bigger box comes with a much better search system. It doesn’t just pull a random string; it uses smart retrieval methods to follow the most promising paths first. So even though there are 10,000 strings, the model learns over time which ones are likely to lead to useful info. In practice, that means you’re far more likely to get accurate, relevant answers from the bigger box, simply because it has both more knowledge and better tools for finding it.
So depending on your use case (i.e., if the small model already has the information you are looking for), it might be better to take a small model that uses equally smart methods, since it needs less energy and is therefore cheaper.
In many cases, though, the bigger model can look up information and compare it against your query; having the bigger context and knowing the big picture can help.
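
To put rough numbers on the analogy (all of them made up): with purely random pulls, the big box is actually worse per pull (50/10,000 vs. 1/100); it only wins once the search is guided toward the promising strings. A toy simulation:

```python
# Toy illustration of the box analogy: random pulls vs. a "guided" search that
# prioritizes promising strings. All numbers are made-up assumptions.
import random

def random_search(total, useful, pulls):
    """Pull `pulls` distinct strings uniformly at random; success if any is useful."""
    return any(s < useful for s in random.sample(range(total), pulls))

def guided_search(total, useful, pulls, precision=0.6):
    """Each pull lands on a promising string with probability `precision`, and a
    promising string turns out useful half the time (made-up assumption)."""
    for _ in range(pulls):
        if random.random() < precision and random.random() < 0.5:
            return True
    return False

trials = 20_000
small_random = sum(random_search(100, 1, 5) for _ in range(trials)) / trials
big_random = sum(random_search(10_000, 50, 5) for _ in range(trials)) / trials
big_guided = sum(guided_search(10_000, 50, 5) for _ in range(trials)) / trials
print(f"small box, random search: ~{small_random:.1%}")  # around 5%
print(f"big box,   random search: ~{big_random:.1%}")    # around 2.5%
print(f"big box,   guided search: ~{big_guided:.1%}")    # far higher
```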
Forget about “reasoning” for a second.
I think when you introduce the word “reasoning” we start to think about Chain-of-Thought loop wrappers, and that’s a different dimension of the problem (i.e., self-reflection and forced-planning “hacks”).
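
By “loop wrapper” I mean something like the sketch below, where `generate(prompt)` is a hypothetical stand-in for whatever model call you use, not any particular library’s API:

```python
# Sketch of a self-reflection / forced-planning "loop wrapper" around a model call.
# `generate(prompt)` is a hypothetical stand-in for your actual LLM API call.
def reason_with_reflection(generate, question, max_rounds=3):
    # Forced planning: ask for a step-by-step plan before solving.
    plan = generate(f"Break this problem into steps before solving it:\n{question}")
    answer = generate(f"Question: {question}\nPlan:\n{plan}\nNow solve it step by step.")
    # Self-reflection loop: critique the answer and revise until it passes (or we give up).
    for _ in range(max_rounds):
        critique = generate(
            f"Question: {question}\nProposed answer:\n{answer}\n"
            "List any mistakes. Reply 'OK' if the answer is correct."
        )
        if critique.strip().upper().startswith("OK"):
            break
        answer = generate(
            f"Question: {question}\nPrevious answer:\n{answer}\n"
            f"Critique:\n{critique}\nWrite a corrected answer."
        )
    return answer
```

The point is that this scaffolding sits on top of the model and is orthogonal to parameter count.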
I think it is clear that larger LLMs perform better in general, independent of Chain of Thought; just look at the Chatbot Arena leaderboard.
But it looks like there may be increasing limits to scaling …
… and then there’s the price!
Why did you introduce the word reasoning?
Ah, saw it… had to scroll up.
I didn’t introduce it; the OP did. It’s in the title.