Complex Prompt Getting Continuously Worse Results

Scenario: We have a detailed prompt which has ChatGPT (4o) fetch PDF files, scan them, review the content, and compare it to a bank’s home loan lending criteria (which are provided in a separate document).

Initially: the results were pretty good but needed tweaking. It would originally pull the borrower's correct name and other information, and it would compare it to the lender's criteria correctly. There were just a few things that needed updating.

Details: Model GPT-4o, File Search: On, Temperature: 1, Top P: 1.

Now: Complete garbage outputs. Most of the time it either hallucinates or returns the template response with placeholders where the information should go. All instructions are completely ignored.

We are at a loss, frustrated and need to get this up and running. We have been working on this for days. Experimenting, researching, you name it.

Question: What information can I provide (prompt I assume) that would help someone here steer us in the right direction?


Something I failed to mention: We actually get better results if we run the process directly in the ChatGPT chatbot.


Welcome to the community!

Have you tried using gpt-4-turbo-2024-04-09 instead?

https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4


A couple of things off the bat.

Temperature at 1 is a really bad idea. It massively increases the stochasticity of your output.

You would set temperature to 1 when you're batch-generating and looking to cherry-pick results from the tail of the distribution. Otherwise, you're just increasing the likelihood of hallucination by a large margin.

That alone may solve your problem.
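For illustration, here's a minimal sketch of what that call might look like with the OpenAI Python SDK. The model name, prompt, and variable names are placeholders, not your actual setup:

    # Minimal sketch: pin temperature down for extraction-style tasks.
    # Assumes the official OpenAI Python SDK; prompt and model are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    document_text = "(extracted PDF text goes here)"  # placeholder

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,   # near-greedy decoding: far less run-to-run variance
        top_p=1,         # adjust temperature or top_p, not both
        messages=[
            {"role": "system", "content": "Extract the borrower's details from the document text provided."},
            {"role": "user", "content": document_text},
        ],
    )
    print(response.choices[0].message.content)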

The next step is to do what we call flow engineering.

Large monolithic prompts are not ideal. It’s much better to decompose your task into ‘flows’ of model calls that are each responsible for a single function.

So for your task, I’d want to break the fetching, scanning, review and comparisons all up into their own prompts and functions. I’d also want to have data validation at each step alongside structured output (use instructor for this).
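As a rough sketch, one flow step with instructor-validated structured output could look something like the following. The schema fields and prompt are my assumptions for illustration, not your actual requirements:

    # One flow step, one responsibility: extract borrower details into a validated schema.
    # Assumes the `instructor` and `pydantic` libraries; field names are illustrative only.
    import instructor
    from openai import OpenAI
    from pydantic import BaseModel, Field

    class BorrowerDetails(BaseModel):
        full_name: str = Field(description="Borrower's full legal name as stated in the application")
        annual_income: float = Field(description="Gross annual income in the application's currency")
        loan_amount_requested: float

    # Wrap the OpenAI client so responses are parsed and validated against the Pydantic model.
    client = instructor.from_openai(OpenAI())

    def extract_borrower_details(document_text: str) -> BorrowerDetails:
        # Validation errors surface here, instead of silently corrupting the later criteria comparison.
        return client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            response_model=BorrowerDetails,
            messages=[
                {"role": "system", "content": "Extract the borrower details from the document text."},
                {"role": "user", "content": document_text},
            ],
        )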

Once you’ve done this, set up evals so you can get visibility into what parts of the system to improve next.
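An eval for a single step can start as small as a handful of hand-labelled documents and an accuracy count. Sketch below, reusing the hypothetical extract_borrower_details step from the previous snippet; the sample data is made up:

    # Tiny eval harness: run one flow step against labelled samples and report accuracy.
    golden_set = [
        {"text": "(full text of a labelled sample application)", "expected_name": "Jane Example"},
        {"text": "(full text of another labelled sample)", "expected_name": "John Placeholder"},
    ]

    def eval_name_extraction() -> float:
        correct = 0
        for case in golden_set:
            result = extract_borrower_details(case["text"])
            if result.full_name.strip().lower() == case["expected_name"].lower():
                correct += 1
        return correct / len(golden_set)

    print(f"Name extraction accuracy: {eval_name_extraction():.0%}")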

If you want to talk about this in more depth, I run an AI consultancy. You can email me at ben@raava.io

I'd be happy to have a 30-minute call to give you some more direction. No strings attached and no charge. We do free sessions like these because people often get a lot out of them and want to bring us on for further engagements.

P.S. The blog post below has the best insights on building with LLMs on the internet (not mine, I just think it's high value).

Google “What We Learned from a Year of Building with LLMs”

Hi @chris74 -

As you are working on something highly sensitive - a process involving lending decisions, which is obviously an area that is highly regulated and supervised - there are a few important considerations here in my mind.

  1. As suggested before, breaking down the process into smaller parts is absolutely critical in my view. The sensitive part of the assessment vis-a-vis the home lending criteria should be separated from the other preparatory tasks.

  2. For the lending criteria assessment, I would personally build a very detailed assessment grid / matrix that provides the model with as much context as possible to make a systematic assessment that is transparent, verifiable, and auditable (see the sketch after this list). In my view that should include both the assessment result and the underlying reasoning for arriving at a conclusion. As has been suggested before, I would opt for a much lower temperature (zero or close to zero) to limit the risk of variability and inconsistency in assessments.

  3. To support consistency, you could also create a database of past assessment results (provided these are free from any bias etc.) and further leverage this to validate assessment results.
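For illustration, a rough sketch of what such an assessment grid could look like as a structured-output schema. The criterion names and fields are assumptions for the example, not actual lending policy:

    # Sketch of a per-criterion assessment grid with an auditable result and reasoning.
    from typing import Literal
    from pydantic import BaseModel

    class CriterionAssessment(BaseModel):
        criterion_id: str                                # e.g. "max_loan_to_value_ratio" (hypothetical)
        criterion_text: str                              # the lender's wording, quoted verbatim
        result: Literal["pass", "fail", "needs_review"]
        reasoning: str                                   # the evidence that supports the result
        source_reference: str                            # page/section of the PDF the evidence came from

    class LendingAssessment(BaseModel):
        borrower_name: str
        assessments: list[CriterionAssessment]

    # Passed as the response_model of a structured-output call (as in the earlier instructor sketch),
    # this forces one auditable row per criterion instead of free-form prose, and the rows can be
    # stored and compared against past assessment results for consistency checks.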

Finally, I agree that gpt-4-turbo may be more suited for this task than gpt-4o.

P.S.: I don’t know in which jurisdiction you are operating, but you might also want to check if your national regulator has published principles for the use of AI including for credit decision processes. There may be some more specific criteria that you should bear in mind when designing this.


Hey, I am in the same boat. Did anything work for you? If yes, please share.


Unfortunately nothing worked: lowering the temperature, chunking the instructions down to single tasks, etc. Hours of repeated testing and we are still not getting accurate results.
