I have a RAG application used to return answers based upon a dataset of legal agreements. For the past 6 months or so, in trying all of the major LLMs, we have found o3-mini to be the best at returning reasonably good answers.
So, today I tried using the new grok-4. I don’t know how to explain this, but I want to try:
I send the prompt to my vector store, Weaviate, which returns matching embedded chunks from the dataset. In my testing, I use 30 as the limit. I then send the matching chunks to the LLM to analyze and render a response.
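For concreteness, here is a minimal sketch of that retrieve-then-answer flow in Python. Everything beyond what I described above is an assumption for illustration only: the Weaviate v4 client with a vectorizer configured, a hypothetical "LegalChunk" collection with a "text" property, and the OpenAI SDK with o3-mini as the answering model.

```python
# Minimal sketch of the retrieve-then-answer flow described above.
# Assumptions (not from the original post): Weaviate Python client v4,
# a collection named "LegalChunk" with a "text" property, and the OpenAI SDK.
import weaviate
from openai import OpenAI

question = "Which party bears indemnification costs under the 2021 amendment?"  # example prompt

wv = weaviate.connect_to_local()           # or connect_to_weaviate_cloud(...)
chunks = wv.collections.get("LegalChunk")  # hypothetical collection name
hits = chunks.query.near_text(query=question, limit=30)  # limit=30, as in my testing

# Concatenate the matching chunks and hand them to the LLM to analyze.
context = "\n\n".join(o.properties["text"] for o in hits.objects)

llm = OpenAI()
resp = llm.chat.completions.create(
    model="o3-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided contract excerpts."},
        {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ],
)
print(resp.choices[0].message.content)
wv.close()
```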
o3-mini and o4-mini responses tend to be accurate (they focus on the correct chunks) and concise (to the point).
grok-4, on the other hand, goes on and on and on. In the example I tried, it gave me this response:
Begins with a long-winded explanation of the original question
Then it lists and summarizes the relevant documents, broken down by sections from the base agreements as well as amendments to the agreements
Then it finally renders the same two-sentence answer that the OpenAI models return
THEN it goes on to list “Other Potentially Related Provisions”
And THEN it includes a “Recommendations” section
OMG! This has to be the textbook definition of overthinking the question.
I’m wondering if anyone else is seeing the same?
Don’t get me wrong: Grok is amazing. However, we might be approaching the point where some of this super-intelligence is overkill for the vast majority of business applications out there?
Grok loves to show off by repeating itself and demonstrating that it paid attention to your documents through regurgitation instead of analysis. And it’s hit or miss: sometimes it’s been incredibly helpful, technically accurate, and a real contributor to my projects; other times it just wants to sound friendly and “cool”. For me, it’s useless apart from novelty and humor. I feel your pain. I’ve found that a multi-model approach works best. I also work with about 6 human AI devs, which helps ensure accuracy. These systems aren’t quite at the level where I have full trust.
My expectation is that you will get similar issues with o3-pro. These models are trained to produce long-form outputs and don’t handle requests that call for concise answers well.
I haven’t tried prompting techniques to mitigate this issue because, what’s the point if a smaller, faster, cheaper model can perform at the same level without the extra effort?
Why wouldn’t I want to use an LLM rated as the best in the world to analyze complex legal agreements?
Why would I necessarily want to use an agentic model to analyze a closed set of documents? I want the model to look at the documents it has been given and render responses based solely upon those documents – I don’t want it going out and getting creative with opinions from Reddit posts.
I mean, I agree that Grok-4 is obviously not the most efficient model for my use case, but “reasoning” does not necessarily have to mean “over-think” and “over-explain”.
Which model are you referring to as “rated best in the world”? Lately it’s been more about architecture (agents) than single-model capabilities. Spellbook AI (one of the top contenders for legal document parsing & manipulation), for example, claims to use “GPT-4” as a base model.
I can see o3 being the best on its own for legal document understanding, but I would argue that it’s kind of an agentic system, not a reasoning model.
You can build a modular system that breaks each task into isolated sub-tasks: confirming that the returned information is sufficient, checking for contradictions, crunching it all together. It’s easier to control, steer, test, and improve. It changes you from being an LLM wrapper with a RAG database to operating a system with robust evals that challenge each task independently.
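Something like the sketch below, purely as an illustration: the function names, prompts, and the OpenAI SDK calls are my assumptions, not anything from your actual setup. The point is that each sub-task becomes its own small, separately testable call.

```python
# Illustrative sketch of the modular decomposition described above.
# Function names and prompts are hypothetical; each step is an isolated,
# separately testable LLM call (shown with the OpenAI SDK as an assumption).
from openai import OpenAI

llm = OpenAI()

def ask(system: str, user: str, model: str = "o3-mini") -> str:
    resp = llm.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def is_sufficient(question: str, chunks: list[str]) -> bool:
    # Sub-task 1: confirm the retrieved information is enough to answer.
    verdict = ask("Reply YES or NO: do these excerpts contain enough information to answer the question?",
                  f"Question: {question}\n\nExcerpts:\n" + "\n\n".join(chunks))
    return verdict.strip().upper().startswith("YES")

def find_contradictions(chunks: list[str]) -> str:
    # Sub-task 2: check for contradictions across the excerpts.
    return ask("List any provisions in these excerpts that contradict each other. Say 'NONE' if none.",
               "\n\n".join(chunks))

def synthesize(question: str, chunks: list[str], contradictions: str) -> str:
    # Sub-task 3: crunch it all together into a concise, grounded answer.
    return ask("Answer concisely, using only the excerpts. Flag the noted contradictions if relevant.",
               f"Question: {question}\n\nExcerpts:\n" + "\n\n".join(chunks)
               + f"\n\nKnown contradictions: {contradictions}")

def answer(question: str, chunks: list[str]) -> str:
    if not is_sufficient(question, chunks):
        return "Insufficient source material to answer."
    return synthesize(question, chunks, find_contradictions(chunks))
```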
Do you have evals built? I think the best suggestion here would be “eval it up” and see if reasoning models are bringing anything to the table.
Yes. My human evaluation has determined that Grok-4 isn’t bringing anything to the table that o3-mini doesn’t provide more economically and efficiently.
The models are doing exactly what they were trained to do. If they’re failing, it’s because the architect failed to design the system.
Let’s be honest: the “AI community” is packed with overnight “AI & agent experts” who’ve never built systems, never dealt with production constraints, and never touched anything resembling software architecture. Researchers, marketers, and hobbyists are all LARPing as engineers, sprinkling in LLM calls and calling it “AI Agents”.
Not saying that’s you. But it’s what’s behind this flood of content claiming “agents r bad” while, in practice, serious implementers are already deploying useful, revenue-generating agentic systems.
We’re in the middle of a gold rush. Everyone’s trying to wedge LLMs into workflows, expecting intelligent behavior to emerge from poorly scaffolded glue code. They:
Never used evals
Don’t understand RAG or embeddings
Don’t implement any safety guardrails
Can’t audit or explain their own system
Hardly spent any time understanding the data they’re churning through
If you slap a non-deterministic model into a fragile shell and expect it to perform multi-step tasks, it will fail. That’s systemic engineering incompetence. No matter how fancy the research paper may look, that failure just reflects poorly on the writers.
Meanwhile, the people who actually know how to build these systems are too busy implementing them and pushing their boundaries.
I would strongly suggest building actual evals. Then, just as with writing any typical system in any language, you’ll start to want to eval isolated, modular components.
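Even a bare-bones harness gets you most of the way there. A sketch is below; the gold-set format, the `answer()` entry point, and the must-contain scoring rule are all made up for illustration, so swap in whatever your pipeline actually exposes.

```python
# Bare-bones eval harness sketch: run each gold question through the pipeline
# and score whether required phrases appear in the answer. The gold-file format,
# the pipeline entry point, and the scoring rule are assumptions for illustration.
import json

def evaluate(pipeline, gold_path: str = "gold_set.json") -> float:
    with open(gold_path) as f:
        gold = json.load(f)  # [{"question": ..., "chunks": [...], "must_contain": [...]}, ...]

    passed = 0
    for case in gold:
        response = pipeline(case["question"], case["chunks"]).lower()
        if all(phrase.lower() in response for phrase in case["must_contain"]):
            passed += 1
        else:
            print(f"FAIL: {case['question']}")
    score = passed / len(gold)
    print(f"{passed}/{len(gold)} passed ({score:.0%})")
    return score

# e.g. evaluate(answer)  # where `answer` is your pipeline's entry point
```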
I do have evaluation systems where they are needed. But even they need some human oversight and/or supervision to be effective. Yes, for example, models can be used to rank responses. But no matter how many agents you deploy, models aren’t human, they have no sense of the real world, they don’t think, they have no human intuition. If they aren’t fed the correct answer, they don’t know and can’t know what the correct answer is. The documents I am working with are complex and nuanced and depend as much on what is said as what is not said.
The models do an excellent job of finding needles in the haystack – but the ultimate decision of what is a good or bad answer is one that needs to be made by a human.
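To make that concrete, here is roughly what a model-assisted ranking step with a human gate could look like; the judge model, rating scale, and threshold are placeholders I’m making up for illustration, not anything from my actual system.

```python
# Sketch of "a model can rank responses, but a human makes the final call":
# an LLM judge scores candidate answers, and anything below a threshold is
# flagged for human review. Model name, scale, and threshold are hypothetical.
from openai import OpenAI

llm = OpenAI()

def judge_score(question: str, answer: str) -> int:
    resp = llm.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "system",
             "content": "Rate how well the answer addresses the question on a 1-5 scale. Reply with the digit only."},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip()[0])

def triage(question: str, candidates: list[str], threshold: int = 4):
    # Model ranking narrows the field; a human still signs off on anything uncertain.
    scored = sorted(((judge_score(question, a), a) for a in candidates), reverse=True)
    best_score, best_answer = scored[0]
    needs_human_review = best_score < threshold
    return best_answer, needs_human_review
```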
Agreed. There’s a lot, actually, WAY too much hype pushing what we used to call “vaporware”.
As someone who uses agents daily, I’d be the last person to make that generalization.
True that.
I’m torn on this one. On the one hand, you’re right. If we were discussing traditionally coded systems, I’d agree 100%. But we are discussing systems built upon a technology which is inherently unreliable due to the nature of its architecture – i.e. “next word prediction”.
I guess, long story short, I do not disagree with most of what you are saying. I’m just not prepared, at this point, to trust, recommend, or rely on LLM systems that do not have the necessary degree of human oversight/interaction.
The hope is to capture and encode the intuition of a human in little modular pockets, then repeat it indefinitely using a model that charges only a fraction of what someone supporting a family would require.
Sorry for the derailment. I think a lot of people have seen Grok under-perform. I would really suggest an agentic system for your use case, though. But… “if it works, it works”
Agree with @mat.eo here: it is still the architect who built a system that uses an unpredictable component where a more predictable one should have been used.