How Do You Debug a Prompt That “Almost” Works?

I often run into prompts that give correct answers 70–80% of the time but fail in edge cases.
How do you debug or refine these prompts without making them overly complex?
Any structured approach would be helpful.


Did you see:

which leads to:

which notes:

“Evaluation, tuning, and shipping safely”

Since the question did not say whether this is just for ChatGPT and/or the API, I am including info for both.
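
Since evaluation came up: here is a minimal sketch of an edge-case eval loop against the API, assuming the OpenAI Python SDK (`pip install openai`) and an `OPENAI_API_KEY` in the environment. The system prompt, model name, cases, and pass/fail checks are all placeholder examples; swap in the prompt you are debugging and the inputs that fail today.

```python
# Minimal edge-case eval loop -- a sketch, not a framework.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY; everything
# below (prompt, model, cases, checks) is a placeholder example.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a strict date normalizer. Reply with YYYY-MM-DD only."

# Pair each known-troublesome input with a pass/fail check.
EDGE_CASES = [
    ("March 5th, 2024", lambda out: out.strip() == "2024-03-05"),
    ("5/3/24, UK format", lambda out: out.strip() == "2024-03-05"),
    ("the fifth of March, 2024", lambda out: out.strip() == "2024-03-05"),
]

def run_case(user_input: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model: use the one you are debugging
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
        temperature=0,  # keep failures reproducible
    )
    return resp.choices[0].message.content or ""

failures = []
for text, check in EDGE_CASES:
    out = run_case(text)
    if not check(out):
        failures.append((text, out))

print(f"passed {len(EDGE_CASES) - len(failures)}/{len(EDGE_CASES)}")
for text, out in failures:
    print(f"FAIL: {text!r} -> {out!r}")
```

Re-run it after every prompt tweak; the pass rate tells you whether a change fixed the edge cases without breaking the cases that already worked.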

Also check out

The tools noted there are not public, AFAIK, but the ideas are valid.
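
The core idea behind those tools can be reduced to a short sketch: run two prompt variants over the same cases and compare pass counts side by side. Same assumptions as above (OpenAI Python SDK); the variants, model, and cases here are invented for illustration.

```python
# Sketch of an A/B prompt comparison: same cases, two prompt variants,
# side-by-side pass counts. The variants, model, and cases are examples.
from openai import OpenAI

client = OpenAI()

VARIANT_A = "Extract the total price. Reply with a number only."
VARIANT_B = (
    "Extract the total price. Reply with a number only. "
    "If several amounts appear, report the final total after any discounts."
)

CASES = [
    ("Subtotal $40, discount $5, total $35", 35.0),
    ("Price: $12.50 each, qty 2, total $25.00", 25.0),
]

def ok(out: str, expected: float) -> bool:
    # Lenient numeric check; exact string matching is too brittle here.
    try:
        return float(out.strip().lstrip("$")) == expected
    except ValueError:
        return False

def score(system_prompt: str) -> int:
    passed = 0
    for text, expected in CASES:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text},
            ],
            temperature=0,
        )
        passed += ok(resp.choices[0].message.content or "", expected)
    return passed

print(f"variant A: {score(VARIANT_A)}/{len(CASES)}")
print(f"variant B: {score(VARIANT_B)}/{len(CASES)}")
```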

“How do you debug or refine these prompts without making them overly complex?”

Practice

Edit:

There are different kinds of prompts:

  • Developer (role) Prompts
  • Instruction Prompts
  • User Prompts

And there are all kinds of use cases. It is an art that requires practice > testing > experience.
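
One way to read that taxonomy in API terms is sketched below, again assuming the OpenAI Python SDK; the prompts and model are made-up examples. The developer (role) and instruction prompts go in system messages, and the user prompt is the actual request.

```python
# Where each kind of prompt lives in an API call -- a sketch; the
# prompts and model are examples, not a prescribed structure.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[
        # Developer (role) prompt: who the model is.
        # (Newer models use a "developer" role for this; "system" still works.)
        {"role": "system", "content": "You are a terse SQL tutor."},
        # Instruction prompt: how to behave for this task.
        {"role": "system", "content": "Answer in at most two sentences."},
        # User prompt: the actual request.
        {"role": "user", "content": "When should I use a LEFT JOIN?"},
    ],
)
print(resp.choices[0].message.content)
```

When a prompt “almost” works, it often helps to check which of these layers the failing instruction lives in before adding more text to any of them.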

Maybe share one of your edge cases?