I am experimenting with building custom conversational GPTs for various purposes, such as, for example, to get an insight into job candidates’ interests and passions by having naturally flowing conversations with them like interviews.
I am looking for ways to test and compare the results of different versions of my prompts under consistent conditions as part of my process of iterating, refining, and improving my prompts.
Are there any tools that can help me with things like keeping track of prompt versions, simulating conversations to compare the effectiveness of different prompt versions, etc.?
Currently I am simply using ChatGPT itself to craft my prompts, simulate conversations, and am manually comparing and tweaking my prompts from one version to the next.
Any insights or pointers would be greatly appreciated. Thanks!
Start with the Playground instead of ChatGPT. This by itself will already move you to the next level of prompt engineering.
After you’ve worked with that, there is a plethora of ways depending on the situation (LangSmith, HumanLoop, etc), but I personally prefer using own testing solution (not difficult to create) as it gives me the maximum flexibility (before we have some widely available and flexible solutions on the market).
Good question. We have been asking ourselves the same question.
Do you think it would be possible to train a GPT that is a “discriminator” that tests for certain responses to a standard set of test queries (and assesses the quality of the response).
Most people I talk with assess their GPTs by feel and it would be nice to be more systematic with an approach.
@TonyAIChamp could you share more about your home built approach ?
Open AI has created a test and evaluation framework for such cases called Evals.
Simple example, you create a list with 100 or so test cases and send them to another LLM instance for evaluation.
Ultimately you will get a good insight if your prompt-model combination is performing and have an easy way to confirm that the quality remains high over time.
If it is a simple question answering - execute the prompt 50-100 times.
If it requires json output, use pydantic to test the validity.
If the use-case requires multi-step communication with the model, I use some kind of agent (sometimes AutoGen, sometimes - a simpler agent where I do orchestration myself).
As he second step:
Sometimes I look at the results of the first step manually and make a list of things to improve in the prompt.
Sometimes I use own simple scripts (that extract formal criteria from the prompt/case and assess how well AI communicates) or (recently) assistant API to analyze the results and provide feedback.
It’s unit and integration testing for LLMs. There’s not really a way around spending the time to learn this skill and make sure that apps work as expected, IMO.
I understand. But this is not the only solution out there (some are described above). This is why I’m curious how time consuming is this one (for example comparing to some other ones).
@eic Other than the official playground and other tools people suggested, feel free to try my product: Knit, it’s a more advanced playground for managing, developing and iterating prompts.
Regarding your points:
Knit saves all edit&response history and you can roll back anytime you want.
You can run up to 6 groups of tests simultaneously to save time observing the results of different inputs.
I cannot answer this question for you or your team.
There are definite upsides to this solution, mainly it being completely open source. And in the light of recent events it may be a consideration to make sure the testing framework’s uptime is dependent on the API only.
But then again, I don’t want to throw shade at the other solutions. I’m sure they are great, too, and come equipped with additional functionalities.
I did try the playground, but it similarly entails a manual process for comparing the results of the prompt as far as I could tell (unless I have missed something?) as it is a conversational prompt and not a simply Q&A prompt.
I will give it a shot but, as I said in a previous reply to someone else in this thread, my gpt is not a simple Q&A but rather conversational. So I still think it would entail manual comparisons of entire conversations to assess the results of different versions of the prompt; not to mention conversing multiple times as well.
Your use case is actually rather common. So, there is hope!
I’m most familiar with the Open AI evals I posted earlier and will create a better example drawing from this knowledge.
In a conversational setting the quality of the responses can be measured by “staying in the role” or by “providing relevant responses”, among others. In this scenario you provide user messages to your agent/model/instructed LLM and from there collect the responses. But in this case the responses will be evaluated by another LLM, specifically instructed to check if the responses matches the role or is relevant.
Admittedly this is not the most basic case but definitely the way how the quality can be measured using objective tools.
If you consider it important to pass in several rounds of conversation between the user and your agent, this is possible, too. You simply create one message that holds the conversation and allow the LLM that evaluates with the info which message has been sent by whom.
I can only encourage you to dig into the testing frameworks if you want to make the progress your are looking for.
To set this up technically is rather trivial. To come up with metrics and how you communicate them to the assessing AI is a much less trivial case (especially when you need stable assessment on 100’s of dialogues that follow complex flows).
That is true and at the same time proof why development without testing in parallel is a brave endeavour. With hundreds of untested prompts you either have a very good sense for what works or a potentially very big problem.
Regarding the model-graded evaluation of conversations rather than “true/false” scenarios, it is true that the initial test design is more demanding. At the same time, breaking down complex analytical questions into smaller true/false type of evaluations is a typical coding skill that you can leverage again to build better, more rubust AI solutions.
Thanks! I am attempting to craft my own custom GPTs to automate different parts of this process. If nothing else, it’ll be a good learning experience.
Happy to hear that, @cass ! I’m actually working on this right now, so if you want to have a quick call one of these days to exchange ideas, shoot me a DM!