I have a RAG application used to return answers based upon a dataset of legal agreements. For the past 6 months or so, in trying all of the major LLMs, we have found o3-mini to be the best at returning reasonably good answers.
So, today I tried using the new grok-4. I don’t know how to explain this, but I want to try:
I send the prompt to my vector store, Weaviate, which returns matching embedded chunks from the dataset. In my testing, I use 30 as the limit. I then send the matching chunks to the LLM to analyze and render a response.
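For concreteness, here is a minimal sketch of that retrieve-then-answer flow in Python. Everything beyond what I described above is an assumption for illustration only: the Weaviate v4 client with a vectorizer configured, a hypothetical "LegalChunk" collection with a "text" property, and the OpenAI SDK with o3-mini as the answering model.

```python
# Minimal sketch of the retrieve-then-answer flow described above.
# Assumptions (not from the original post): Weaviate Python client v4,
# a collection named "LegalChunk" with a "text" property, and the OpenAI SDK.
import weaviate
from openai import OpenAI

question = "Which party bears indemnification costs under the 2021 amendment?"  # example prompt

wv = weaviate.connect_to_local()           # or connect_to_weaviate_cloud(...)
chunks = wv.collections.get("LegalChunk")  # hypothetical collection name
hits = chunks.query.near_text(query=question, limit=30)  # limit=30, as in my testing

# Concatenate the matching chunks and hand them to the LLM to analyze.
context = "\n\n".join(o.properties["text"] for o in hits.objects)

llm = OpenAI()
resp = llm.chat.completions.create(
    model="o3-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided contract excerpts."},
        {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ],
)
print(resp.choices[0].message.content)
wv.close()
```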
o3-mini and o4-mini responses tend to be accurate (they focus on the correct chunks) and concise (to the point).
grok-4, on the other hand, goes on and on and on. In the example I tried, it gave me this response:
Begins with a long-winded explanation of the original question
Then it lists and summarizes the relevant documents, broken down by sections from the base agreements as well as amendments to the agreements
Then it finally renders the same two-sentence answer that the OpenAI models return
THEN it goes on to list “Other Potentially Related Provisions”
And THEN it includes a “Recommendations” section
OMG! This has to be the textbook definition of overthinking the question.
I’m wondering if anyone else is seeing the same?
Don’t get me wrong: Grok is amazing. However, we might be approaching the point where some of this super-intelligence is overkill for the vast majority of business applications out there?
Grok loves to show off by repeating itself and demonstrating that it paid attention to your documents through regurgitation instead of analysis. And it’s hit or miss: sometimes it’s been incredibly helpful, technically accurate, and a real contributor to my projects; other times it just wants to sound friendly and “cool”. For me, it’s useless apart from novelty and humor. I feel your pain. I’ve found that a multi-model approach works best. I also work with about 6 human AI devs, which helps ensure accuracy. These systems aren’t quite at the level where I have full trust.
My expectation is that you will get similar issues with o3-pro. These models are trained to produce long-form outputs and don’t handle requests that call for concise answers well.
I haven’t tried prompting techniques to mitigate this issue because, what’s the point if a smaller, faster, cheaper model can perform at the same level without the extra effort?
Why wouldn’t I want to use an LLM rated as the best in the world to analyze complex legal agreements?
Why would I necessarily want to use an agentic model to analyze a closed set of documents? I want the model to look at the documents it has been given and render responses based solely upon those documents – I don’t want it going out and getting creative with opinions from Reddit posts.
I mean, I agree that Grok-4 is obviously not the most efficient model for my use case, but “reasoning” does not necessarily have to mean “over-think” and “over-explain”.
Which model are you referring to as “rated best in the world”? Lately it’s been more about architecture (agents) than single-model capabilities. Spellbook AI (one of the top contenders for legal document parsing & manipulation), for example, claims to use “GPT-4” as a base model.
I can see o3 being the best on its own for legal document understanding, but I would argue that it’s kind of an agentic system, not a reasoning model.
You can build a modular system that breaks each task into isolated sub-tasks: confirming that the returned information is sufficient, checking for contradictions, crunching it all together. It’s easier to control, steer, test, and improve. It changes you from being an LLM wrapper with a RAG database to operating a system with robust evals that challenge each task independently.
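Something like the sketch below, purely as an illustration: the function names, prompts, and the OpenAI SDK calls are my assumptions, not anything from your actual setup. The point is that each sub-task becomes its own small, separately testable call.

```python
# Illustrative sketch of the modular decomposition described above.
# Function names and prompts are hypothetical; each step is an isolated,
# separately testable LLM call (shown with the OpenAI SDK as an assumption).
from openai import OpenAI

llm = OpenAI()

def ask(system: str, user: str, model: str = "o3-mini") -> str:
    resp = llm.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def is_sufficient(question: str, chunks: list[str]) -> bool:
    # Sub-task 1: confirm the retrieved information is enough to answer.
    verdict = ask("Reply YES or NO: do these excerpts contain enough information to answer the question?",
                  f"Question: {question}\n\nExcerpts:\n" + "\n\n".join(chunks))
    return verdict.strip().upper().startswith("YES")

def find_contradictions(chunks: list[str]) -> str:
    # Sub-task 2: check for contradictions across the excerpts.
    return ask("List any provisions in these excerpts that contradict each other. Say 'NONE' if none.",
               "\n\n".join(chunks))

def synthesize(question: str, chunks: list[str], contradictions: str) -> str:
    # Sub-task 3: crunch it all together into a concise, grounded answer.
    return ask("Answer concisely, using only the excerpts. Flag the noted contradictions if relevant.",
               f"Question: {question}\n\nExcerpts:\n" + "\n\n".join(chunks)
               + f"\n\nKnown contradictions: {contradictions}")

def answer(question: str, chunks: list[str]) -> str:
    if not is_sufficient(question, chunks):
        return "Insufficient source material to answer."
    return synthesize(question, chunks, find_contradictions(chunks))
```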
Do you have evals built? I think the best suggestion here would be “eval it up” and see if reasoning models are bringing anything to the table.
Yes. My human evaluation has determined that Grok-4 isn’t bringing anything to the table that o3-mini doesn’t provide more economically and efficiently.
The models are doing exactly what they were trained to do. If they’re failing, it’s because the architect failed to design the system.
Let’s be honest: the “AI community” is packed with overnight “AI & agent experts” who’ve never built systems, never dealt with production constraints, and never touched anything resembling software architecture. Researchers, marketers, and hobbyists are all LARPing as engineers, sprinkling in LLM calls and calling it “AI Agents”.
Not saying that’s you. But it’s what’s behind this flood of content claiming “agents r bad” while, in practice, serious implementers are already deploying useful, revenue-generating agentic systems.
We’re in the middle of a gold rush. Everyone’s trying to wedge LLMs into workflows, expecting intelligent behavior to emerge from poorly scaffolded glue code. They:
Never used evals
Don’t understand RAG or embeddings
Don’t implement any safety guardrails
Can’t audit or explain their own system
Hardly spent any time understanding the data they’re churning through
If you slap a non-deterministic model into a fragile shell and expect it to perform multi-step tasks, it will fail. That’s systemic engineering incompetence. No matter how fancy the research paper may look, that failure just reflects poorly on the writers.
Meanwhile, the people who actually know how to build these systems are too busy implementing them and pushing their boundaries.
I would strongly suggest building actual evals. Then, just as with writing any typical system in any language, you’ll start to want to eval isolated, modular components.
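Even a bare-bones harness gets you most of the way there. A sketch is below; the gold-set format, the `answer()` entry point, and the must-contain scoring rule are all made up for illustration, so swap in whatever your pipeline actually exposes.

```python
# Bare-bones eval harness sketch: run each gold question through the pipeline
# and score whether required phrases appear in the answer. The gold-file format,
# the pipeline entry point, and the scoring rule are assumptions for illustration.
import json

def evaluate(pipeline, gold_path: str = "gold_set.json") -> float:
    with open(gold_path) as f:
        gold = json.load(f)  # [{"question": ..., "chunks": [...], "must_contain": [...]}, ...]

    passed = 0
    for case in gold:
        response = pipeline(case["question"], case["chunks"]).lower()
        if all(phrase.lower() in response for phrase in case["must_contain"]):
            passed += 1
        else:
            print(f"FAIL: {case['question']}")
    score = passed / len(gold)
    print(f"{passed}/{len(gold)} passed ({score:.0%})")
    return score

# e.g. evaluate(answer)  # where `answer` is your pipeline's entry point
```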
I do have evaluation systems where they are needed. But even they need some human oversight and/or supervision to be effective. Yes, for example, models can be used to rank responses. But no matter how many agents you deploy, models aren’t human, they have no sense of the real world, they don’t think, they have no human intuition. If they aren’t fed the correct answer, they don’t know and can’t know what the correct answer is. The documents I am working with are complex and nuanced and depend as much on what is said as what is not said.
The models do an excellent job of finding needles in the haystack – but the ultimate decision of what is a good or bad answer is one that needs to be made by a human.
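To make that concrete, here is roughly what a model-assisted ranking step with a human gate could look like; the judge model, rating scale, and threshold are placeholders I’m making up for illustration, not anything from my actual system.

```python
# Sketch of "a model can rank responses, but a human makes the final call":
# an LLM judge scores candidate answers, and anything below a threshold is
# flagged for human review. Model name, scale, and threshold are hypothetical.
from openai import OpenAI

llm = OpenAI()

def judge_score(question: str, answer: str) -> int:
    resp = llm.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "system",
             "content": "Rate how well the answer addresses the question on a 1-5 scale. Reply with the digit only."},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip()[0])

def triage(question: str, candidates: list[str], threshold: int = 4):
    # Model ranking narrows the field; a human still signs off on anything uncertain.
    scored = sorted(((judge_score(question, a), a) for a in candidates), reverse=True)
    best_score, best_answer = scored[0]
    needs_human_review = best_score < threshold
    return best_answer, needs_human_review
```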
Agreed. There’s a lot, actually, WAY too much hype pushing what we used to call “vaporware”.
As someone who uses agents daily, I’d be the last person to make that generalization.
True that.
I’m torn on this one. On the one hand, you’re right. If we were discussing traditionally coded systems, I’d agree 100%. But we are discussing systems built upon a technology which is inherently unreliable due to the nature of its architecture – i.e. “next word prediction”.
I guess, long story short, I do not disagree with most of what you are saying. I’m just not prepared, at this point, to trust, recommend, or rely on LLM systems that do not have the necessary degree of human oversight/interaction.
The hope is to capture and encode the intuition of a human in little modular pockets, then repeat it indefinitely using a model that charges only a fraction of what someone supporting a family would require.
Sorry for the derailment. I think a lot of people have seen Grok under-perform. I would really suggest an agentic system for your use case, though. But… “if it works, it works”
Agree with @mat.eo here: it is still the architect who built a system that uses an unpredictable component where a more predictable one should have been used.