Complex Prompt Getting Continuously Worse Results

Scenario: We have a detailed prompt which has ChatGPT (4o) fetch PDF files, scan them, review the content, and compare it to a bank’s home loan lending criteria (which are provided in a separate document).

Initially: the results were pretty good but needed tweaking. It would originally pull the borrower's correct name and other information, and it would compare it to the lender's criteria correctly. There were just a few things that needed updating.

Details: Model GPT-4o, File Search: On, Temperature: 1, Top P: 1.

Now: Complete garbage outputs. Most of the time it either hallucinates or returns the template response with placeholders where the information should go. All instructions are completely ignored.

We are at a loss, frustrated and need to get this up and running. We have been working on this for days. Experimenting, researching, you name it.

Question: What information can I provide (prompt I assume) that would help someone here steer us in the right direction?


Something I failed to mention: We actually get better results if we run the process directly in the ChatGPT chatbot.


Welcome to the community!

Have you tried using gpt-4-turbo-2024-04-09 instead?

https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4


A couple of things off the bat.

Temperature at 1 is a really bad idea. It massively increases the stochasticity of your output.

You would set temperature to 1 when you're batch-generating and looking to cherry-pick results from the tail of the distribution. Otherwise, you're just increasing the likelihood of hallucination by a large margin.

That alone may solve your problem.
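For illustration, here's a minimal sketch of what that call might look like with the OpenAI Python SDK. The model name, prompt, and variable names are placeholders, not your actual setup:

    # Minimal sketch: pin temperature down for extraction-style tasks.
    # Assumes the official OpenAI Python SDK; prompt and model are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    document_text = "(extracted PDF text goes here)"  # placeholder

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,   # near-greedy decoding: far less run-to-run variance
        top_p=1,         # adjust temperature or top_p, not both
        messages=[
            {"role": "system", "content": "Extract the borrower's details from the document text provided."},
            {"role": "user", "content": document_text},
        ],
    )
    print(response.choices[0].message.content)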

The next step is to do what we call flow engineering.

Large monolithic prompts are not ideal. It’s much better to decompose your task into ‘flows’ of model calls that are each responsible for a single function.

So for your task, I’d want to break the fetching, scanning, review and comparisons all up into their own prompts and functions. I’d also want to have data validation at each step alongside structured output (use instructor for this).
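As a rough sketch, one flow step with instructor-validated structured output could look something like the following. The schema fields and prompt are my assumptions for illustration, not your actual requirements:

    # One flow step, one responsibility: extract borrower details into a validated schema.
    # Assumes the `instructor` and `pydantic` libraries; field names are illustrative only.
    import instructor
    from openai import OpenAI
    from pydantic import BaseModel, Field

    class BorrowerDetails(BaseModel):
        full_name: str = Field(description="Borrower's full legal name as stated in the application")
        annual_income: float = Field(description="Gross annual income in the application's currency")
        loan_amount_requested: float

    # Wrap the OpenAI client so responses are parsed and validated against the Pydantic model.
    client = instructor.from_openai(OpenAI())

    def extract_borrower_details(document_text: str) -> BorrowerDetails:
        # Validation errors surface here, instead of silently corrupting the later criteria comparison.
        return client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            response_model=BorrowerDetails,
            messages=[
                {"role": "system", "content": "Extract the borrower details from the document text."},
                {"role": "user", "content": document_text},
            ],
        )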

Once you’ve done this, set up evals so you can get visibility into what parts of the system to improve next.
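An eval for a single step can start as small as a handful of hand-labelled documents and an accuracy count. Sketch below, reusing the hypothetical extract_borrower_details step from the previous snippet; the sample data is made up:

    # Tiny eval harness: run one flow step against labelled samples and report accuracy.
    golden_set = [
        {"text": "(full text of a labelled sample application)", "expected_name": "Jane Example"},
        {"text": "(full text of another labelled sample)", "expected_name": "John Placeholder"},
    ]

    def eval_name_extraction() -> float:
        correct = 0
        for case in golden_set:
            result = extract_borrower_details(case["text"])
            if result.full_name.strip().lower() == case["expected_name"].lower():
                correct += 1
        return correct / len(golden_set)

    print(f"Name extraction accuracy: {eval_name_extraction():.0%}")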

If you want to talk about this in more depth, I run an AI consultancy. You can email me at ben@raava.io

I'd be happy to have a 30-minute call to give you some more direction. No strings attached and no charge. We do free sessions like these because people often get a lot out of them and want to bring us on for further engagements.

P.S. The blog post below has the best insights on building with LLMs on the internet (not mine, I just think it's high value).

Google “What We Learned from a Year of Building with LLMs”

Hi @chris74 -

As you are working on something highly sensitive - a process involving lending decisions, which is obviously an area that is highly regulated and supervised - there are a few important considerations here in my mind.

  1. As suggested before, breaking down the process into smaller parts is absolutely critical in my view. The sensitive part of the assessment vis-a-vis the home lending criteria should be separated from the other preparatory tasks.

  2. For the lending criteria assessment, I would personally build a very detailed assessment grid / matrix that provides the model with as much context as possible to make a systematic assessment that is transparent, verifiable, and auditable (see the sketch after this list). In my view that should include both the assessment result and the underlying reasoning for arriving at a conclusion. As has been suggested before, I would opt for a much lower temperature (zero or close to zero) to limit the risk of variability and inconsistency in assessments.

  3. To support consistency, you could also create a database of past assessment results (provided these are free from any bias etc.) and further leverage this to validate assessment results.
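For illustration, a rough sketch of what such an assessment grid could look like as a structured-output schema. The criterion names and fields are assumptions for the example, not actual lending policy:

    # Sketch of a per-criterion assessment grid with an auditable result and reasoning.
    from typing import Literal
    from pydantic import BaseModel

    class CriterionAssessment(BaseModel):
        criterion_id: str                                # e.g. "max_loan_to_value_ratio" (hypothetical)
        criterion_text: str                              # the lender's wording, quoted verbatim
        result: Literal["pass", "fail", "needs_review"]
        reasoning: str                                   # the evidence that supports the result
        source_reference: str                            # page/section of the PDF the evidence came from

    class LendingAssessment(BaseModel):
        borrower_name: str
        assessments: list[CriterionAssessment]

    # Passed as the response_model of a structured-output call (as in the earlier instructor sketch),
    # this forces one auditable row per criterion instead of free-form prose, and the rows can be
    # stored and compared against past assessment results for consistency checks.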

Finally, I agree that gpt-4-turbo may be more suited for this task than gpt-4o.

P.S.: I don’t know in which jurisdiction you are operating, but you might also want to check if your national regulator has published principles for the use of AI including for credit decision processes. There may be some more specific criteria that you should bear in mind when designing this.


Hey, I am in the same boat. Did anything work for you? If yes, please share.


Unfortunately nothing worked: lowering the temperature, chunking the instructions down to single tasks, etc. Hours of repeated testing and we are still not getting accurate results.
