LLM and Prompt Evaluation Frameworks

Hi!

A friend of mine recently pointed me to his company’s use of promptfoo for handling prompt evaluations.

I also recently came across Opik, a more general LLM evaluation framework.

Just wondering what others have experience with when it comes to evaluating prompts, and LLM evaluation on specific tasks more generally. Which frameworks or methods have you used? What worked well and what didn’t?

3 Likes

I wonder if prompt evals actually work, or if they give people a false sense of security :thinking:

They also seem to be advertising hallucination countermeasures using perplexity. I’m not sure you can infer any hallucination probability just by adding up logprobs :thinking:.
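For reference, this is roughly how you’d compute perplexity from the logprobs the API already returns (model name is just an example) - but a low perplexity only tells you the model was confident, not that it was correct:

import math
from openai import OpenAI

client = OpenAI()

# ask for per-token logprobs alongside the completion
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "Who wrote 'The Selfish Gene'?"}],
    logprobs=True,
)

token_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]

# perplexity = exp of the negative mean token logprob
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(perplexity)  # confidence, not truthfulness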

3 Likes

Interesting angle @Diet.

The issue, however, is still that you want to:

  1. Detect when your current prompt is simply not working as well anymore (OpenAI checkpoint changes under the hood???)
  2. Figure out which prompts (and instruction “types”) work best with different models (e.g. if you are using a mix of minis and larger models, or a mix of models from different vendors).
  3. Perform A/B tests on prompts by incorporating some new information or knowledge

Do you have any suggestions on how to tackle these?
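To make points 1-3 concrete, this is roughly the shape of harness I have in mind: a fixed test set run against every prompt/model pair on a schedule, with the scoring function standing in for whatever checks matter to you (all names and models below are placeholders):

from itertools import product
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "v1": "Summarize the ticket in one sentence: {ticket}",
    "v2": "You are a support triage bot. Give a one-sentence summary: {ticket}",
}
MODELS = ["gpt-4o-mini", "gpt-4o"]  # could be a mix of vendors
TEST_CASES = [
    {"ticket": "App crashes on login since yesterday's update.", "must_contain": "crash"},
]

def score(output: str, case: dict) -> bool:
    # stand-in for a real grader (string checks, LLM-as-judge, etc.)
    return case["must_contain"].lower() in output.lower()

results = {}
for (name, prompt), model in product(PROMPTS.items(), MODELS):
    passed = 0
    for case in TEST_CASES:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt.format(ticket=case["ticket"])}],
        ).choices[0].message.content
        passed += score(out, case)
    results[(name, model)] = passed / len(TEST_CASES)

print(results)  # re-run on a schedule; alert when a pair's score drops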

1 Like

Is it just me, or does it seem like the broader community hasn’t evolved much past the completions API mindset? The more I dig into frameworks and evals, the more I realize we’re still trying to force all of the context into (non-standardized) single-message templates. These models have evolved to excel at multi-turn conversations, so shouldn’t we be building frameworks and evals around that paradigm instead?

4 Likes

I mean, it’s an interesting point, but this “single message paradigm” is still highly relevant to a lot of applications and services. For example, data enrichment, filtering, and pre-processing systems that work in a batch manner (think Spark, DataFlow). Also user-facing applications that are meant to be very snappy.

I do actually see (at least in my community) nearly everyone (with the exception of the batch jobs above) doing some kind of multi-turn API calling. For example, calling legacy GPT-4 (since for some bespoke reasoning tasks, like in healthtech, it’s actually much better than the newer GPT-4o variants), then passing the output to GPT-4o for structuring.
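For anyone curious, that two-step pattern looks roughly like this (model names and prompts are purely illustrative):

import json
from openai import OpenAI

client = OpenAI()

# step 1: let the legacy model do the free-text reasoning it happens to be better at
reasoning = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Assess the likely causes of these symptoms: ..."}],
).choices[0].message.content

# step 2: have GPT-4o turn that free text into a fixed JSON structure
structured = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Return JSON with keys 'causes' (list) and 'confidence' (0-1)."},
        {"role": "user", "content": reasoning},
    ],
).choices[0].message.content

print(json.loads(structured))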

But regardless - you still have this issue of needing some kind of control over prompts, and a “feeling” for whether the system is degrading over time.

4 Likes

Where it gets really complicated though, is what I’m playing around with now:

I make a single o1 call to generate a ToT system prompt based on a simple 1-2 sentence instruction, then I feed that system prompt into another call to a GPT-4o model! Then I have no determinism at all over my system prompt!
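In sketch form (model names are placeholders) - the generated system prompt comes out different every run, which is exactly the problem:

from openai import OpenAI

client = OpenAI()

instruction = "Help a small bakery decide whether to expand into wholesale."

# step 1: ask the reasoning model to write a tree-of-thought system prompt
meta = client.chat.completions.create(
    model="o1",
    messages=[{
        "role": "user",
        "content": f"Write a tree-of-thought system prompt for an assistant whose task is: {instruction}",
    }],
)
generated_system_prompt = meta.choices[0].message.content

# step 2: use that non-deterministic prompt as the system message for GPT-4o
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": generated_system_prompt},
        {"role": "user", "content": instruction},
    ],
)
print(answer.choices[0].message.content)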

3 Likes

You mentioned needing control over prompts to prevent system degradation, but I’m questioning if that’s really necessary when you have sophisticated eval mechanisms in place.

If the system is consistently evaluating its outputs against set goals, then there’s less need to micromanage each prompt. The system can adapt and adjust on its own based on those evaluations. The real focus should be on the outcome and whether it meets your expectations. Controlling prompts feels like trying to fix something on the surface, but if your evaluations are solid, the system can handle dynamic situations without needing to control every detail upfront.

Sure, there will be situations where prompt control matters, but for the kind of dynamic multi-model/agent systems we’re talking about, the ability to self-adjust based on evaluations is far more powerful.
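As a rough sketch of what I mean (generate and grade stand in for your model call and whatever eval you trust):

def run_with_evals(task, prompt, generate, grade, max_attempts=3):
    # generate(prompt, task) -> output; grade(task, output) -> (score, feedback)
    for _ in range(max_attempts):
        output = generate(prompt, task)
        score, feedback = grade(task, output)
        if score >= 0.8:  # "meets expectations" threshold
            return output
        # adjust based on the eval instead of hand-editing the prompt
        prompt = f"{prompt}\n\nPrevious attempt fell short because: {feedback}"
    return output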

1 Like

The idea is to structure every chain so that it is easily verifiable and eventually culminates in an exception you can log.

You then need to trace that exception to its origin and then dumb down the prompt.

yeah I have the privilege of not having to do that, thankfully.

but I’ve been thinking that you could run a smaller model and see if the chain succeeds - if it doesn’t, you run a bigger model.
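something like this is what I had in mind - the verifier is whatever end-of-chain check you already have (everything here is a placeholder):

from openai import OpenAI

client = OpenAI()

def run_chain(model: str, task: str) -> str:
    # placeholder for your actual chain
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    ).choices[0].message.content

def chain_succeeds(output: str) -> bool:
    # placeholder verifier - e.g. does the output parse, do your asserts pass
    return output.strip().startswith("{")

def solve(task: str) -> str:
    output = run_chain("gpt-4o-mini", task)   # try the smaller model first
    if chain_succeeds(output):
        return output
    return run_chain("gpt-4o", task)          # escalate only on failure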

A/B testing implies that you use your users as guinea pigs. Obviously it’s a matter of interpretation, but I think backtesting is better.

IMO if you think of chat as a document, you can draw much more out of the LLM than if you think of it as an evolving conversation. Under the hood, it’s still the same technology, and the same issues with conversations still rear their ugly head (mostly confounding due to similar information) - so I don’t really see how this has evolved.

e.g.: with conversational CoT, you now have to spend tokens on re-distilling the conversation up until the present before you go to work on the actual problem. If you just throw away irrelevant or outdated information (evolve the corpus as opposed to the conversation) you can skip that step entirely. And less AI context → more AI stability. IMO, of course.
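A toy example of what I mean by evolving the corpus instead of the conversation - every call rebuilds one document from the current state, and stale material just gets dropped (all names here are made up):

from openai import OpenAI

client = OpenAI()

# the "corpus": named sections we keep current, rather than a growing transcript
corpus = {
    "problem": "Design a sustainable packaging line for a mid-size bakery.",
    "constraints": "Budget 50k EUR; existing suppliers must be kept.",
    "best_proposal_so_far": "",
}

def step(instruction: str) -> str:
    document = "\n\n".join(f"{k}:\n{v}" for k, v in corpus.items() if v)
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Work only from the document below."},
            {"role": "user", "content": f"{document}\n\ntask:\n{instruction}"},
        ],
    ).choices[0].message.content

# each step overwrites state instead of appending another conversational turn
corpus["best_proposal_so_far"] = step("Propose a packaging approach.")
critique = step("Critique the current best proposal and list its weakest assumptions.")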

So if you look at an ordinary conversation between two people, the conversation might have evolved with definite priors. But when you ask a third party for “their fresh perspective”, you could just ask them about the conclusions the two parties have reached. You would do this by exposing only the conclusion, along with the original problem statement, and asking for an opinion.

More concretely, in the following code the chain keeps track of the problem statement and asks for input on an iterative basis.

# each .goal() below is one step in the chain; the global context keeps track
# of the problem statement between steps
gc = GoalComposer(provider="OpenAI", model="gpt-4o-mini")
gc = gc(global_context="global_context")

# typos in the text below are intentional - the first goal corrects them
gc\
    .goal("correct spelling", with_goal_args={'text': """I wonde how the world will be sustainable in 100 years from now. 
                                              We  much fossil fuel. 
                                              we not care for enviorment. 
                                               """})\
    .goal("summarize issue") \
    .goal("formulate problem from issue", with_goal_args={'provider': "OpenAI", 'model': "gpt-4o" } )\
    .goal("produce potential solutions paths through tree of thought", with_goal_args={'provider': "OpenAI", 'model': "gpt-4o" })\
    .start_loop(start_loop_fn=start_main_loop)\
        .goal("iteratively solve through constant refinement", with_goal_args={'provider': "OpenAI", 'model': "gpt-4o" })\
        .tap_previous_result(display_text)\
        .goal("take input on solution" ) \
    .end_loop(end_loop_fn=end_main_loop)\
    .goal("summarize solution")\
    .tap_previous_result(display_text)

Are you building, as you say, “a conversation between two people” here?

If you have your ToT in the same thread, you’ll eventually start cross-contaminating your contexts. If your ToT consists of independent (i.e. spread instead of loop) ideations, then that’s what I would be suggesting.

And whether the ideation is a conversation or not doesn’t really matter all that much to the model, I think. I base this on the continued effectiveness of using low-frequency patterns to steer the models: How to stop models returning "preachy" conclusions - #38 by Diet (the system-user-assistant conversation being the lowest frequency pattern in this sense).

“take input” in my mind is just a function, a resource the system can tap - in your case, I guess, a human. This would be realized as “ask sponsor” or “ask operator” (which could just as well be an AI system on its own, or another instance of itself). Instead of just injecting the response as a “user response”, I’d typically insert it as an ancillary context document that is probably required to continue the task.
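Concretely: rather than appending another user turn to the transcript, the operator’s (or sponsor agent’s) reply lands in the corpus as its own named section (sketch, made-up names):

def take_input(corpus: dict, ask_operator) -> None:
    # ask_operator can be a human, another AI system, or another instance of itself
    answer = ask_operator(corpus.get("best_proposal_so_far", ""))
    # the reply becomes an ancillary context document, not a "user response"
    corpus["operator_input"] = answer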

So I don’t really see LLMs as chatterboxes, I see them as document evolvers.

I’m not saying that you guys are wrong, and I agree that these models are getting tuned and trained for this. I just think this is a mistake if you really want to put models to work.

No; the main thrust was to emphasize that it is not necessary for the entire context to be sent to a third party.

The latter. The ToTs are independent.

Given the way the system is modeled (the model knows about the participants and background), and based on anecdotal evidence, the conversation does indeed get steered in very specific ways.

It could be (and in this particular case, is) an AI agent. It is not an “ancillary” context document, if that term is intended to mean that it is not important to the next word.

I view the LLMs as enablers of conversations; between humans and agents and increasingly now between agents and agents. Whether right or wrong, I think that’s the way in which this space is evolving.

1 Like

I’ve tried a few tools for LLM evaluation. Promptfoo is solid for A/B testing prompts and tracking output changes over time. Opik is more general and good for testing across different tasks, but it might need tweaking for specific use cases.

You might also want to check out ContextCheck, an open-source tool for evaluating RAG systems and chatbots. It’s handy for spotting regressions, testing edge cases, and finding hallucinations. Plus, it’s easy to set up with YAML configs and CI pipelines.

Ultimately, your choice depends on what you’re optimizing for: accuracy, relevance, or safety. Combining manual checks with these tools works well for me.

1 Like