Training Custom Agents To Reduce Margin of Error

When building custom agents, what recommendations do you have to continue to reduce margin of error?

At this stage I am writing the brief in code, providing examples of correct and incorrect output, and running a verification process. I’m hoping to reduce the margin of error as much as possible, as the output from the Custom GPT goes into tailored business reports.

Are there other tools that you would use?

Hi,

Here is an example of a report on legal document risks: GPT-4o - Hallucinating at temp:0 - Unusable in production - #26 by sergeliatko

Built in the following steps (a rough sketch in code follows the list):

  1. Prepare the doc elements for RAG
  2. Run the analysis questions from the policy file and store the results in a DB, grouped by theme
  3. Pull the analysis results by theme (all analysis points) from the DB, sorted by importance, and generate a summary for each theme
  4. Summarize the results from step 3 and prepend that summary as the introduction to the report
  5. Stitch it all together: intro + per-theme reports, in the order defined by the user in the app options.
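
For illustration, here is a minimal sketch of that pipeline in Python, assuming the official OpenAI Python SDK. The chunking, question format and prompts are simplified placeholders of my own, not the actual implementation behind the linked report:

```python
# Rough sketch of the five-step pipeline above (not the original implementation).
# Assumes the official OpenAI Python SDK; chunking and retrieval are deliberately naive.
from collections import defaultdict
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def build_report(doc_text: str, policy_questions: list[dict], theme_order: list[str]) -> str:
    # 1. Prepare the doc elements for RAG (here: naive paragraph chunks)
    chunks = [p.strip() for p in doc_text.split("\n\n") if p.strip()]

    # 2. Run the analysis questions and group the results by theme
    results = defaultdict(list)
    for q in policy_questions:  # each item: {"theme": ..., "question": ..., "importance": ...}
        context = "\n".join(chunks)  # a real system retrieves only the relevant chunks
        answer = ask(f"Context:\n{context}\n\nAnswer strictly from the context:\n{q['question']}")
        results[q["theme"]].append((q["importance"], answer))

    # 3. Per-theme summaries, analysis points sorted by importance
    theme_summaries = {}
    for theme, points in results.items():
        points.sort(key=lambda p: p[0], reverse=True)
        theme_summaries[theme] = ask(
            f"Summarize these analysis points about '{theme}':\n" + "\n".join(a for _, a in points)
        )

    # 4. Introduction = summary of the theme summaries
    intro = ask("Write a short introduction summarizing:\n" + "\n".join(theme_summaries.values()))

    # 5. Stitch it all together in the user-defined theme order
    body = [theme_summaries[t] for t in theme_order if t in theme_summaries]
    return "\n\n".join([intro, *body])
```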

Also, this answer might point you in a good direction: Efficient way for Chunking CSV Files or Structured Data - #7 by sergeliatko

You can also find some tips here: Using gpt-4 API to Semantically Chunk Documents

Providing the “incorrect output” in examples is kind of tricky and may introduce errors (especially when they are at the end of the prompt). Have you tried replacing them with examples of how to correct the wrong output to make it great? And try to put those somewhere in the middle of your examples, so that the list ends with several great items to set the rhythm for the model.

1 Like

Interesting. At the moment I attach two CSV files to the agent: one correct, one incorrect.

I never considered that incorrect examples could negatively impact the output. Have you seen this in your own experience? Or is this documented anywhere?

From my experience: it’s OK to have a couple of “bad” examples as long as they come with instructions on how to transform them into “good” ones. But having them simply listed as “bad” doesn’t actually help; on rather complex tasks it often makes things worse.
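
As a purely illustrative sketch (the wording and structure are my own, not a documented best practice), that example layout could look something like this, with the flawed item in the middle and paired with a fix instruction:

```python
# Hypothetical few-shot layout: the flawed example sits in the middle, always paired
# with an instruction on how to turn it into a good one, and the list ends strong.
few_shot_examples = [
    {"label": "good", "example": "Revenue grew 12% YoY (source: filing, p. 4)."},
    {"label": "needs_fixing",
     "example": "Revenue grew a lot last year.",
     "how_to_fix": "Replace vague wording with the exact figure and cite the source page."},
    {"label": "good", "example": "Net margin was 8.3% (source: FY2023 income statement)."},
    {"label": "good", "example": "Headcount: 142 employees as of 2024-01-01 (HR export)."},
]

def render_examples(examples: list[dict]) -> str:
    lines = []
    for ex in examples:
        if ex["label"] == "good":
            lines.append(f"GOOD OUTPUT:\n{ex['example']}")
        else:
            lines.append(f"DRAFT OUTPUT:\n{ex['example']}\nHOW TO MAKE IT GREAT:\n{ex['how_to_fix']}")
    return "\n\n".join(lines)

print(render_examples(few_shot_examples))
```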

1 Like

There’s a concept out there I’ve been thinking about after reading about the new experimental CritiqueGPT models they’re working on.

The study they published showed that a human monitoring data for hallucinations with GPT-4o’s help significantly reduced errors, and furthermore, that a CritiqueGPT specifically created to identify errors reduced them by a further significant margin.

So in addition to what @sergeliatko recommends about RAG, it seems there is benefit in creating your own “CritiqueGPT”: one trained on the common mistakes you see in your data, a twin to the model you use to create the data in the first place, that acts at certain steps in the RAG flow.

Also, I think it’s important to have good/bad examples in chains; you can use multi-turn fine-tuning for that.
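
For example, a single multi-turn training sample in the OpenAI chat fine-tuning format might look like this (the content and file name are hypothetical):

```python
# Hypothetical multi-turn fine-tuning sample in the OpenAI chat format:
# the flawed draft is an assistant turn, followed by a correction request
# and the corrected answer as the final assistant turn.
import json

sample = {
    "messages": [
        {"role": "system", "content": "You extract the business category from website text."},
        {"role": "user", "content": "Site text: 'We sell handmade ceramic mugs online.'"},
        {"role": "assistant", "content": "Category: Manufacturing"},  # flawed draft
        {"role": "user", "content": "Incorrect: they sell finished goods online. Correct it."},
        {"role": "assistant", "content": "Category: E-commerce / Home & Kitchen"},  # desired output
    ]
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")
```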

1 Like

I basically started doing this back in 2021… Do the task, then check for:

  • either errors
  • potential improvements
  • additional items

I got the idea from other domains (even outside development) and needed something like this to improve the quality of the synthetic data I used for fine-tuning. I later dropped the synthetic data but kept the critique concept, or rather the “manager/filter” model concept as I call it, to handle tricky tasks without going crazy over chained workflows.

This works especially well when you need to improve the quality of the model’s “free speech” in writing tasks. You simply run the original output through the manager once or twice before returning the result.
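
A minimal sketch of such a manager pass, assuming the OpenAI Python SDK; the prompts and the number of rounds are illustrative only, adapt them to your own brief:

```python
# Minimal sketch of a "manager/filter" pass: run the draft through a reviewer model
# once or twice before returning it. Prompts and round count are illustrative.
from openai import OpenAI

client = OpenAI()

def manager_pass(draft: str, brief: str, rounds: int = 2) -> str:
    text = draft
    for _ in range(rounds):
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "You are an editor. Check the text against the brief for errors, "
                            "possible improvements and missing items, then return only the improved text."},
                {"role": "user", "content": f"Brief:\n{brief}\n\nText:\n{text}"},
            ],
        )
        text = resp.choices[0].message.content
    return text
```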

In a RAG context, I often use “filters” like the following (a sketch of these as prompt templates follows the lists):

Before query:

  • does the complete answer to this question require additional information not directly targeted by the question? (split one complex query into several simple queries)
  • based on the question, what would the matching data look like? (move the query vector from the question toward the expected context for better precision)

After the query:

  • does this retrieved item contain a direct answer to the question, or useful context that improves the quality of the answer? (filter out noisy results to keep the prompt clean)
  • are these items complete, or do we need to get more data from the DB?
  • (once I have the answer from the answering model) which of these items was used as the primary context for this answer? (get the reference)
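
To make that concrete, here is a rough sketch of those filter questions as reusable prompt templates (the exact wording is paraphrased and should be adapted to your domain):

```python
# Sketch of the pre-/post-query "filters" as reusable prompt templates.
# The wording paraphrases the lists above; adapt it to your own domain.
PRE_QUERY_FILTERS = {
    "needs_split": "Does a complete answer to this question require information not directly "
                   "targeted by the question? If yes, split it into several simple queries.",
    "expected_shape": "Based on the question, describe what the matching data would look like. "
                      "Use that description, not the raw question, as the retrieval query.",
}

POST_QUERY_FILTERS = {
    "relevance": "Does this retrieved item contain a direct answer or useful context? "
                 "If neither, discard it to keep the prompt clean.",
    "completeness": "Are the retrieved items sufficient, or should we pull more data from the DB?",
    "attribution": "Given the final answer, which retrieved item was its primary context? "
                   "Return it as the reference.",
}
```
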
2 Likes

Thanks for your reply, Thinktank. That’s a great idea and very thought-provoking.

Is there a risk though that the CritiqueGPT hallucinates and revises an output that was initially correct?

Or, in the study, was it discovered that, even if this were to happen, the margin of error with the CritiqueGPT was still significantly less than just relying on the GPT itself?

1 Like

That’s a great prompt, mate. All of the prompts you’re using to explore your data seem well-rounded and well-worded to me. I am definitely going to remember that language.

I’ve been trying to work on simple error correction within a ChatGPT-4o conversation. Models seem to have trouble “forgetting” until the data in question is out of the context window. So, when correcting an image for example, I might critique a flaw several times before it’s actually corrected or changed.

Just as a casual observation from this behavior, I can see the value of feeding the information that needs critiquing into an independent conversation. So, in my case, a separate Custom GPT specifically trained on correcting my images according to my standards, though the analogy holds for Assistants too.

Yes. Great observation.

In reading the study, one of the biggest challenges is when a CritiqueGPT-type model hallucinates a problem that wasn’t there in the first place.

But you’re also correct: the difference in error catching and correcting is so pronounced that it seems worth the risk.

From the study linked above: they had humans introduce a bunch of intentional errors into code and then provide the correct answers (which suggests how to fine-tune a critique model). Then they gave the uncorrected information to a different group of humans and bots to see who found what.

ChatGPT caught significantly more errors than the humans alone, and the CritiqueGPT-type model further outperformed both ChatGPT and the ChatGPT/human teams.

Basically, the AI increases human productivity—and humans reduce hallucinations and create connections.

My biggest takeaway from this is that you want a RAG flow like Serge recommends: one that includes a generative model, a critique model correcting answers, and human supervision at key points. That supervision needs to come from experts in their field, as they are the only folks likely to catch most hallucinations.

So, I wouldn’t rely on a programmer to give a final review of information on real estate, and vice versa.

Yep, that’s the issue of over-tweaking the instructions. It can usually be fixed easily with a classical programming workflow combined with good RAG (I made a tool for it: https://www.simantiks.com ).

As soon as fine-tuning became available in the early days of the OpenAI API, I did it in a much simpler way (first tried on OCR and formatting correction); a sketch of the data prep follows the list:

  • get the perfect input
  • use AI to randomly introduce a bunch of the errors you typically see in your business application
  • format for fine-tuning in reverse order, where your step 1 (the perfect input) is the desired result and the synthetic errors are the input
  • fine-tune on a medium-sized sample
  • check the results of the fine-tuned model to spot anomalies
  • explain the anomalies to the error generator
  • get the final set of error/correct pairs
  • fine-tune and test
  • go to production for real-life confrontation
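
A hedged sketch of that data prep in Python, assuming the OpenAI SDK and the chat fine-tuning format; the error types, prompts and file name are placeholders for whatever your business application actually produces:

```python
# Sketch of the reverse-order data prep: start from a perfect output, have a model
# inject typical errors, and use the corrupted text as the fine-tuning INPUT with
# the original as the TARGET. Error types, prompts and file name are placeholders.
import json
from openai import OpenAI

client = OpenAI()
ERROR_TYPES = "OCR confusions (0/O, 1/l), dropped line breaks, broken date formats"

def corrupt(perfect_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=1.0,
        messages=[{"role": "user",
                   "content": "Introduce a few realistic errors of these types into the text "
                              f"({ERROR_TYPES}) and return only the corrupted text:\n{perfect_text}"}],
    )
    return resp.choices[0].message.content

def training_pair(perfect_text: str) -> dict:
    return {"messages": [
        {"role": "system", "content": "Fix OCR and formatting errors in the user's text."},
        {"role": "user", "content": corrupt(perfect_text)},   # synthetic errors = input
        {"role": "assistant", "content": perfect_text},       # perfect text = desired result
    ]}

with open("ocr_fix.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(training_pair("Invoice #1042, issued 2024-03-01, total: 1,250.00 EUR")) + "\n")
```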

Great reply, thank you. Are there common behaviours that occur when GPT hallucinates that could then be flagged?

For example, in my manual process of reviewing data and asking GPT why it had hallucinated, the recurring theme was that rather than following the exact text, it had decided to make assumptions. I wonder if I could reduce the margin of error by getting GPT to flag itself when it does this.

Also, my AI agency mentioned that they don’t recommend RAG for my specific problem because the outputs are not black and white. E.g., one of them is essentially: for website X, share the “Business Category” it operates in.

They mentioned that a better approach is to use the Phidata framework for tasks like this that require some interpretation. What are your thoughts on this? (Thanks for all of your help so far!)

Thanks for sharing! That’s a super dope library.

But, dude, that IS RAG. There’s even a RAG example further down the page on GitHub. Lol. Your AI agency is blowing sunshine up your skirt.

RAG = Retrieval-Augmented Generation. As I understand it, “RAG” is more of a philosophy than a literal framework. Every knowledge base is different and will require different flows at different points to be effective. One size does not fit all.

I really like the way that library is built, it looks like they’re saving the world a lot of work. (Again, thanks for sharing.)

But the whole point is that, no, there are no common behaviors a GPT shows when hallucinating, other than 100% confidence in the answer. That’s the problem. (Although, I should note, I notice subtle linguistic cues when the model starts to guess. But you can’t rely on that observation in all instances.)

So that fancy framework will pass a certain percentage of hallucinations to the end user if there’s only a single pass through it.

Only a true [human] expert in the area could identify a hallucination with 100% accuracy. And even then, you have to be on your toes. A plumber wouldn’t be able to identify a financial hallucination and vice versa.

@sergeliatko has already described some RAG approaches that reduce errors.

So the Phidata framework looks like an awesome way to store data and do the initial retrieval, but without a Critique phase you’re going to have hallucinations.

Serge really gets this.

My understanding is that there is an initial retrieval, using Phidata for example. Then you have the model self-correct and ask probing questions about the retrieval, for which it performs another retrieval and compares (Serge’s insight). Then, my suggestion is that an independent Critique bot, trained on common mistakes, does yet another retrieval and comparison before finally sending the answer to the user with a firm warning that hallucinations are still possible.
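
To make that flow concrete, here is a rough sketch under my own assumptions (the retrieve callable, the prompts and the model choice are placeholders; this is not how Phidata or any specific framework does it):

```python
# Rough sketch of the flow: initial retrieval, a probing self-check with a second
# retrieval, then an independent critique pass. `retrieve` and all prompts are
# placeholders; plug in your own store (Phidata, pgvector, ...).
from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", temperature=0,
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def answer_with_checks(question: str, retrieve) -> str:
    context = retrieve(question)                               # 1. initial retrieval
    draft = ask("Answer strictly from the context.", f"Context:\n{context}\n\nQ: {question}")

    probe = ask("List the facts in the draft that should be re-verified against the source.",
                f"Q: {question}\n\nDraft:\n{draft}")           # 2. probing questions
    context2 = retrieve(probe)                                 #    ...second retrieval
    revised = ask("Revise the draft so it matches the context; do not add new claims.",
                  f"Context:\n{context2}\n\nDraft:\n{draft}")

    notes = ask("You only flag possible errors; never rewrite.",  # 3. independent critique
                f"Context:\n{context}\n{context2}\n\nAnswer:\n{revised}")
    return f"{revised}\n\n[Reviewer notes: {notes}]\n\nNote: hallucinations are still possible."
```

The point is that each pass only sees one narrow job: answer, probe, revise, flag.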

1 Like

Basically yes,

I haven’t looked through the Phidata framework yet (sorry, will do this week).

Some insights, however (can’t do without my 5 cents):

  1. Raw data needs to be pre-processed correctly in the first place (check out an internal tool we are about to release, https://www.simantiks.com ; on the examples page you can see how raw text ends up as ready-to-store items)
  2. Design your retrieval workflow based on the app’s needs (this is where cool libraries like Phidata may save you), with a focus on retrieval quality
  3. For each operation in your workflow (have as many as makes sense so that each stays as simple as possible), make sure to properly log the data for tuning (see the logging sketch after this list)
  4. For each operation, if filters are needed, create a separate set of filters (not one general critique model) so that it acts as a sanity test and stays specific to that operation for maximum quality
  5. Check whether a high-level orchestrator makes sense in your app (Assistants or custom)
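
For point 3, a small sketch of what per-operation logging for later tuning could look like; the decorator and log format are illustrative, not part of any framework:

```python
# Illustrative per-operation logging so every step's inputs/outputs can be reviewed
# and reused for tuning later. The decorator and log format are not from any framework.
import functools, json, time

def log_operation(name: str, path: str = "operation_log.jsonl"):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            record = {
                "operation": name,
                "ts": time.time(),
                "inputs": {"args": [str(a) for a in args],
                           "kwargs": {k: str(v) for k, v in kwargs.items()}},
                "output": str(result),
            }
            with open(path, "a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return inner
    return wrap

@log_operation("classify_business_category")
def classify_business_category(site_text: str) -> str:
    return "placeholder category"  # replace with the real model call for this operation
```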

The goal is to store as much data as you can, keep models and critique agents focused on simple tasks, make sure your RAG finds all the data the model needs, make sure you store all the data your RAG needs, and try to keep steps isolated and replaceable (this is where frameworks actually help).

2 Likes

Yeah, I agree.

As I’ve been thinking this through, I think a high-level CritiqueGPT would cause as many problems as it would solve.

I speculate that Critique Models should be hyper-specialized to a single task or area of knowledge, with decreased creativity. Micro fact-checkers with extra training depth on known and theoretical errors. A model that looks for errors in Python. Another that looks for technical errors in logic or semantic reasoning. Each Critique model being supervised by a relevant expert.

Perhaps, instead of correcting the response, the Critique Model adds commentary highlighting possible errors rather than rewriting the retrieval. If you’re asking for complex reasoning and analysis, you may not want a hyper-specialized, less-creative model writing your final answer.
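
A minimal sketch of such a commentary-only critique pass, assuming the OpenAI Python SDK; the prompt and the choice of a smaller model are my own assumptions:

```python
# Sketch of a commentary-only critique pass: the specialized checker annotates the
# answer with possible issues instead of rewriting it. Prompt and model are assumptions.
from openai import OpenAI

client = OpenAI()

def annotate(answer: str, source: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You are a narrow fact-checker. List any claims in the answer that are "
                        "not supported by the source. Do NOT rewrite the answer."},
            {"role": "user", "content": f"Source:\n{source}\n\nAnswer:\n{answer}"},
        ],
    )
    notes = resp.choices[0].message.content
    return f"{answer}\n\n[Possible issues flagged by the checker]\n{notes}"
```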

1 Like

I confirm…

Yes; on the other hand, nothing prevents you from coming up with more general models one or two levels up (check whether the code actually works, or fix the language/style of the output) that can be reused in several places. You can even come up with a general fact-checking mechanism that confirms the output does not contradict the data, or that checks for simple logic errors in the conclusions.

If you want quality in these tasks, you have to design your own workflows (we did one for legal docs where you supply the data and the workflow configuration, and the engine repeats your analysis reasoning to solve the complex task). But those are more complicated to develop, surprisingly, because the humans who really qualify for these tasks do them mostly subconsciously and are incapable of putting the workflow on paper for a machine to recreate. You need to spend a lot of time with your “analysis” expert on hand to transcribe the workflows from the people who are really good at them.

2 Likes

Wow, what a great discussion from you both. Thanks so much for sharing your experience. I have learned so much already from this thread and for that I am grateful. Serge, do you have a visual flowchart that represents the entire ideal set-up?

I’m unsure how Phidata and RAG come together. I have seen the basic RAG structure from OpenAI, but I’m unsure how Phidata fits into it.

I have a broad understanding of how specific filters are used (I love the recommendation, Thinktank) to have them identify potential issues without actually making changes. I also understand the broad role of the overseeing expert GPT, but I’m unsure what specific problems it would solve.

Serge, would you be free to catch up sometime? I’d love to show you what I’m working on and maybe get some help :slight_smile:

1 Like

Hi @luke.harrison3461 ,

Never needed, never wanted as:

  1. Actually, it’s me who would be transforming such a chart into a list of instructions on how to build these things, so it’s faster to start directly with the resulting list.
  2. I don’t believe in silver bullets and (maybe I’m biased) prefer studying things case by case.

Sure, reach out to me on LinkedIn: https://www.linkedin.com/in/sergeliatko

Oh, that’s brilliant. Basically middle-management GPTs, if you will.

I think it’s important to think in the economic terms of generalization vs. specialization when it comes to models working with data during storage and retrieval. I admit I was so taken with the idea of hyper-specialized models “following a single data set around like a mother hen” that I hadn’t thought of middle managers, or GPTs that could float to other projects. That’s smart.

Now, first, I agree that all RAG systems need to be customized to the task because of how the human mind works with complex reasoning.

And second, THIS is the part of developing the new AI databases (RAG) systems that I am most excited about. Getting to pick the brains of those experts to try to map out their decision making process for duplication with AI will be a unique story every time.

@luke.harrison3461 , Serge is right. I’d be suspicious of anyone selling an “ideal” setup. The best illustrations of RAG that I’ve seen are like six boxes with lines between them that look more like a play being mapped by a football coach than an advanced AI database.

RAG is more like data acting like water and flowing through a purification process before being ready to drink.

One of the first steps is gathering all of the data together for a more advanced model like GPT-4 Turbo or Claude 3.5 to answer the user query from. Microsoft calls these places where data is gathered “Data Lakes,” which is an image I like.

What data gets gathered in the Lake is entirely dependent on the situation, which is why I am so into that library you shared.

That library has everything needed to gather data into a “Lake,” from social media to search results. It has tools for storing and categorizing long conversations. And it is designed to draw from both vector and traditional databases. It’s going to save a lot of folks a lot of time.

But gathering the information, and in what proportion, is (a) unique for every business, and (b) only the first step. Any time AI touches something, there’s an opportunity for hallucinations on top of whatever machine or human errors are already likely in the data.

So you have this high-level AI, the one that’s answering the question, ask itself high-level questions about the data and perform the retrieval again. The data “washes” through this process until there is enough certainty in the answer to pass it back to the user.

But there are lots of opportunities during this process for everything from less-creative models like GPT-4o mini helping in a hyper-specialized way, like being responsible for critiquing a single spreadsheet, to more generalized GPTs that can help with checking for logical fallacies (brilliant, Serge) or providing a subtle rewrite of the response with SEO-researched keywords.

1 Like

You got it. That’s exactly the message I’m trying to spread.

1 Like