Grok-4 applied to a real-world RAG application - Overthinking is not always the solution

I have a RAG application that returns answers based upon a dataset of legal agreements. For the past 6 months or so, after trying all of the major LLMs, we have found o3-mini to be the best at returning reasonably good answers.

So, today I tried using the new grok-4. I don’t know how to explain this, but I want to try:

I send the prompt to my vector store, Weaviate, which returns matching embedded chunks from the dataset. In my testing, I use 30 as the limit. I then send the matching chunks to the LLM to analyze and render a response.
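
In code, the flow is roughly the sketch below (the collection name, the `text` property, and the prompt wording are illustrative stand-ins, not my actual schema):

```python
# Minimal sketch of the retrieve-then-answer flow described above.
import weaviate
from openai import OpenAI

def answer(question: str) -> str:
    wv = weaviate.connect_to_local()  # or weaviate.connect_to_weaviate_cloud(...)
    try:
        # Retrieve the top 30 matching embedded chunks from the vector store
        result = wv.collections.get("LegalChunk").query.near_text(
            query=question,
            limit=30,
        )
        context = "\n\n".join(str(o.properties["text"]) for o in result.objects)
    finally:
        wv.close()

    # Hand the matched chunks to the LLM to analyze and render a response
    llm = OpenAI()
    resp = llm.chat.completions.create(
        model="o3-mini",
        messages=[{
            "role": "user",
            "content": f"Using only the excerpts below, answer the question.\n\n"
                       f"Excerpts:\n{context}\n\nQuestion: {question}",
        }],
    )
    return resp.choices[0].message.content
```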

o3-mini and o4-mini responses tend to be accurate (they focus on the correct chunks) and concise (to the point).

grok-4, on the other hand, goes on and on and on. In the example I tried, it gave me this response:

  • It begins with a long-winded explanation of the original question
  • Then it lists and summarizes the relevant documents, broken down by sections from the base agreements as well as amendments to those agreements
  • Then it finally renders the same two-sentence answer that the OpenAI models return
  • THEN it goes on to list “Other Potentially Related Provisions”
  • And THEN it includes a “Recommendations” section

OMG! This has to be the textbook definition of overthinking the question.

I’m wondering if anyone else is seeing the same?

Don’t get me wrong: Grok is amazing. However, are we approaching the point where this kind of super-intelligence is overkill for the vast majority of business applications out there?

What do you all think?

4 Likes

Grok loves to show off by repeating itself and demonstrating that it paid attention to your documents through regurgitation instead of analysis. And it’s hit or miss: sometimes it’s been incredibly helpful, technically accurate, and a real contributor to my projects; other times it just wants to sound friendly and “cool”. For me, it’s useless apart from novelty and humor. I feel your pain. I’ve found that a multi-model approach works best. I also work with about 6 human AI devs, which helps ensure accuracy. These systems aren’t quite at the level where I have full trust.

1 Like

I noticed a similar issue with Grok-4: when all you want is the “stuff”, you get the “stuff” plus waffle about stuff you don’t care about.

Agreed, it’s impressive, but it has a tendency to go off on tangents.

I have been having immense fun with gpt-4.1 instead of the mini models recently. A super underused/overlooked model, IMHO.

3 Likes

Yes, I concur.

My expectation is that you will run into similar issues with o3-pro. These models are trained to produce long-form outputs and can’t properly handle requests that call for concise, efficient answers.

I haven’t tried prompting techniques to mitigate this issue because, well, what’s the point if a smaller, faster, cheaper model can perform at the same level without additional effort?
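
(For anyone who does want to try, the usual shape of such a mitigation is a hard brevity constraint ahead of the retrieved chunks; this wording is purely illustrative and untested against Grok-4:)

```python
# Illustrative brevity constraint; just the standard shape of such an
# instruction, not a tested recipe.
BREVITY_RULES = (
    "Answer in at most three sentences. Do not restate the question, "
    "summarize the source documents, or add recommendations or "
    "'related provisions' sections."
)

def constrained_prompt(context: str, question: str) -> str:
    # Prepend the rules so they sit ahead of the retrieved chunks.
    return f"{BREVITY_RULES}\n\nExcerpts:\n{context}\n\nQuestion: {question}"
```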

2 Likes

Curious as to why you’re using a reasoning model to validate and crunch the returned chunks? Feels like an agentic system would be best here.

I find reasoning models best for questions that require some “branches of thought”.

But “reasoning” and “agentic” get conflated with models like o3, which use tooling inside of the reasoning (and that is basically agentic at that point).

Why wouldn’t I want to use an LLM rated as the best in the world to analyze complex legal agreements?

Why would I necessarily want to use an agentic model to analyze a closed set of documents? I want the model to look at the documents it has been given and render responses based solely upon those documents – I don’t want it going out and getting creative with opinions from Reddit posts.

I mean, I agree that Grok-4 is obviously not the most efficient model for my use case, but “reasoning” does not necessarily have to mean “over-thinking” and “over-explaining”.

Which model are you saying is “rated best in the world”? Lately it’s been more about architecture (agents) than single-model capabilities. Spellbook AI (one of the top contenders for legal document parsing & manipulation), for example, claims to use “GPT-4” as a base model.

I can see o3 being the best on its own for legal document understanding, but I would argue that it’s kind of an agentic system, not a reasoning model.

You can build a modular system that breaks each task into isolated sub-tasks: confirming that the returned information is sufficient, checking for contradictions, crunching it all together. Easier to control, steer, test, and improve. It changes you from being an LLM wrapper with a RAG database to operating a system with robust evals that challenge each task independently.
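
A sketch of what that breakdown could look like (the names and stub heuristics are placeholders; in practice each sub-task would be its own small, cheap LLM call with its own test set):

```python
# Each sub-task is isolated and can be eval'd on its own.
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    notes: str = ""

def is_sufficient(question: str, chunks: list[str]) -> Verdict:
    # Sub-task 1: do the retrieved chunks actually cover the question?
    return Verdict(ok=bool(chunks), notes="" if chunks else "no chunks retrieved")

def find_contradictions(chunks: list[str]) -> Verdict:
    # Sub-task 2: do any chunks conflict (e.g. base agreement vs. amendment)?
    return Verdict(ok=True)

def synthesize(question: str, chunks: list[str]) -> str:
    # Sub-task 3: crunch the vetted chunks into a final answer (LLM call here).
    return "…"

def run(question: str, chunks: list[str]) -> str:
    sufficiency = is_sufficient(question, chunks)
    if not sufficiency.ok:
        return f"Insufficient context: {sufficiency.notes}"
    answer = synthesize(question, chunks)
    conflicts = find_contradictions(chunks)
    if not conflicts.ok:
        answer += f"\n\nPossible contradictions: {conflicts.notes}"
    return answer
```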

Do you have evals built? I think the best suggestion here would be to “eval it up” and see if reasoning models are bringing anything to the table.
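
Even a crude harness is enough to start with, something like this (the test case and substring check are made-up placeholders; real setups often use an LLM judge or key-fact matching):

```python
# Tiny eval-loop sketch: run each question through the pipeline and grade it.
CASES = [
    {"q": "What is the termination notice period?", "expected": "90 days"},
]

def grade(got: str, expected: str) -> bool:
    return expected.lower() in got.lower()  # crude, but a starting point

def run_evals(answer_fn) -> float:
    passed = sum(grade(answer_fn(c["q"]), c["expected"]) for c in CASES)
    return passed / len(CASES)

# Swap in one answer function per model and compare pass rates.
```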

Yes. My human evaluation has determined that Grok-4 isn’t bringing anything to the table that o3-mini doesn’t provide more economically and efficiently.

If it works, great. If it doesn’t, that’s why there is a 90% workflow failure rate today: https://www.youtube.com/live/9ELXACQ6aMo?si=hsacIvSmHnPkTsci