How do you measure prompt performance?

How do you measure prompt performance, with different contexts injected in the prompt, different model, different prompt template…?

  • create unit tests from a list of tuples (context, prompt template, expected output) and compare the expected output vs the real output with an LLM

  • use human evaluation (thumbs up/down) somehow, somewhere

  • use human evaluation (score 0–5) somehow, somewhere

anything else?

or any tools? I looked a bit at Argilla, Langfuse, the wandb LLM tools… anything else?
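The unit-test idea from the first bullet could be sketched roughly like this (just a sketch: `call_llm` and `judge` are placeholders for whatever client you use, and `judge` could itself be an LLM call that compares the two answers):

```python
# Sketch: unit tests as (context, prompt template, expected output) tuples,
# graded by a pluggable judge. `call_llm` and `judge` are placeholders.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PromptCase:
    context: str
    template: str   # e.g. "Summarize the following: {context}"
    expected: str

def run_cases(cases: List[PromptCase],
              call_llm: Callable[[str], str],
              judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of cases where the judge accepts the output."""
    passed = 0
    for case in cases:
        output = call_llm(case.template.format(context=case.context))
        if judge(output, case.expected):  # judge may itself prompt an LLM
            passed += 1
    return passed / len(cases)
```

For the LLM-as-judge variant, `judge` would prompt a model with something like "Do these two answers agree? Answer yes or no." and parse the reply.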


I feel like (relative) prompt performance isn’t actually that much of an issue. There are certain building blocks you can use to achieve the results you want, and there’s rarely any doubt which will perform better.

At the start I had planned to do like a response surface mapping with prompts, but eventually I canned that because it doesn’t seem necessary.

It feels like in most cases, you can get the results you want with retries, and better prompts need fewer retries.

If you’re dealing with absolutely uncontrolled inputs, it’s tough, but it feels like you can cover most cases with a funnel/sieve approach in your flow.

so I guess the tldr is, you can use attempts needed :thinking:
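That "attempts needed" metric could look something like this (a sketch, assuming `generate` is your model call and `is_valid` is whatever check your flow already does on outputs):

```python
# Sketch of "attempts needed" as a prompt metric: retry until an output
# passes validation and record how many tries a prompt variant takes.
from statistics import mean

def attempts_needed(generate, is_valid, max_attempts=5):
    """Count model calls until `is_valid` accepts an output."""
    for attempt in range(1, max_attempts + 1):
        if is_valid(generate()):
            return attempt
    return max_attempts  # treat exhaustion as the worst score

def mean_attempts(generate, is_valid, n_runs=20, max_attempts=5):
    """Average attempts over several runs; lower means a better prompt."""
    return mean(attempts_needed(generate, is_valid, max_attempts)
                for _ in range(n_runs))
```

Comparing `mean_attempts` across prompt variants on the same inputs then gives a cheap relative ranking without any human scoring.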

I find the question a bit hard to generalize as it depends so much on the use case.

For some of my easier stuff, it’s a binary yes/no type of review. For my advanced use cases, it involves me sitting down for 30–60 min to compare the output in detail against the context/prompt and then using some form of scoring scale for evaluation. Performance on edge cases is particularly important to look at imo.

I find that sometimes it is less about how the core prompt is designed but how you feed in the context and how this contextual information is formulated and/or structured.

There are cases where models (still) have inherent limitations and you are always at risk of inaccuracies, independent of the prompt. In that case, you may have to build additional validation steps into the workflow to accommodate that.
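Such a validation step can be as simple as parsing the output and retrying before falling back (a sketch, assuming a JSON extraction task; `call_llm` is a placeholder for your client):

```python
# Sketch of an extra validation step in the workflow: parse and check the
# model output, retry on failure, and fall back rather than trust a bad
# answer. `call_llm` is a placeholder for the actual model call.
import json

def checked_extraction(call_llm, prompt, required_keys, retries=2):
    """Return parsed JSON containing all required keys, or None on failure."""
    for _ in range(retries + 1):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        if all(k in data for k in required_keys):
            return data
    return None  # caller decides the fallback: human review, default, etc.
```

The point is that the workflow, not the prompt alone, absorbs the model’s inherent inaccuracies.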

For a two-sample comparison test, my checklist is as follows.

  1. Prompt: the same prompt is used for both samples.
  2. Model: the test must not run into the inherent limitations of the AI being tested.
  3. Timing: prompts should be run at the same pace/rhythm in both samples.

There are also various other details that should go into your standards for controlling test variables.

But if it is a single test run at different times, conditions should be kept as standard as possible. A more stable comparison sample can be used to anchor each round; for example, using GPT-4 as the fixed comparison sample every round. From my background in behavioral science: benchmarks like this are difficult to quantify, so just use the methods that are available and judge for yourself. For example, if a prompt had a 90% success rate at one point, come back later and try the same prompt again with every controllable factor held constant; if the success rate is now 60%, that difference is your answer. Don’t forget that everything is random: you can only control the variables that can be controlled, and external variables are beyond our control.
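The fixed-comparison-sample idea could be sketched like this (an assumption-laden sketch: `run_candidate`/`run_baseline` stand in for running each prompt/model, and `is_success` for whatever pass/fail check applies):

```python
# Sketch: each evaluation round, run both the candidate prompt and a stable
# baseline prompt on the same inputs, and report the candidate's success
# rate relative to the baseline to absorb drift you can't control.
def success_rate(run_prompt, inputs, is_success):
    """Fraction of inputs whose output passes the success check."""
    hits = sum(1 for x in inputs if is_success(run_prompt(x)))
    return hits / len(inputs)

def relative_score(run_candidate, run_baseline, inputs, is_success):
    """Candidate rate divided by baseline rate; the baseline anchors each round."""
    base = success_rate(run_baseline, inputs, is_success)
    cand = success_rate(run_candidate, inputs, is_success)
    return cand / base if base else float("inf")
```

If external conditions shift between rounds, both rates move together and the ratio stays more comparable than a raw success percentage.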

For structured (JSON result/function call) and/or deterministic prompts, Promptotype helps a lot (full disclosure: I created it).
It lets you define a set of query test cases that you can run with different templated prompts and model configurations to make sure they perform as expected.