I often run into prompts that give correct answers 70–80% of the time but fail in edge cases.
How do you debug or refine these prompts without making them overly complex?
Any structured approach would be helpful.
Did you see:
which leads to
which notes
Evaluation, tuning, and shipping safely
- Evals API for eval-driven development.
- Reinforcement fine-tuning (RFT) using programmable graders.
- Supervised fine-tuning / distillation for pushing quality down into smaller, cheaper models once you’ve validated a task with a larger one.
- Graders and the Prompt optimizer helped teams run a tighter “eval → improve → re-eval” loop.
Since the question didn't note whether this is just for ChatGPT and/or the API, I'm including all of the info.
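For the "eval → improve → re-eval" loop, you can also start smaller than the hosted Evals API: keep a list of the edge cases that fail, score a prompt against them, and only keep prompt changes that raise the pass rate. A minimal local sketch (the edge cases, grader, and model name here are placeholder assumptions, not anything from the linked docs):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical edge cases collected from past failures: (input, expected substring).
EDGE_CASES = [
    ("Convert 'Mar 3, 2024' to ISO 8601.", "2024-03-03"),
    ("Convert '3rd of March 2024' to ISO 8601.", "2024-03-03"),
]

SYSTEM_PROMPT = "You convert informal dates to ISO 8601. Ask for missing info instead of guessing."


def grade(output: str, expected: str) -> bool:
    # Simplest possible grader: substring match. Swap in a model-based grader later.
    return expected.lower() in output.lower()


def run_eval(system_prompt: str) -> float:
    passed = 0
    for user_input, expected in EDGE_CASES:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: use whatever model you actually ship with
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_input},
            ],
        )
        output = response.choices[0].message.content or ""
        if grade(output, expected):
            passed += 1
    return passed / len(EDGE_CASES)


print(f"pass rate: {run_eval(SYSTEM_PROMPT):.0%}")
# Change one thing in SYSTEM_PROMPT, re-run, and only keep edits that raise the pass rate.
```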
Also check out
The noted tools AFAIK are not public, but the ideas are valid.
How do you debug or refine these prompts without making them overly complex?
Practice
Edit:
There are different kinds of prompts:
- Developer (role) Prompts
- Instruction Prompts
- User Prompts
And there are all kinds of use cases. It is an art that requires practice > testing > experience.
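For what it's worth, here is a rough sketch of how those layers can map onto Chat Completions message roles (the prompt text and variable names are illustrative assumptions):

```python
# Developer (role) prompt: durable identity and hard constraints.
developer_prompt = "You are a billing-support assistant. Never invent refund amounts."

# Instruction prompt: task-specific rules for this feature.
instruction_prompt = "Summarize the ticket in two sentences, then list concrete next steps."

# User prompt: the raw per-request input.
user_input = "Customer says they were charged twice for the March invoice."

messages = [
    {"role": "system", "content": developer_prompt},
    {"role": "user", "content": f"{instruction_prompt}\n\nTicket:\n{user_input}"},
]
```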
Maybe share one of your edge cases?