Tools for Testing Custom GPT Prompts

sacha.lhl · December 4, 2023, 1:57pm

Thanks Tony for this.
Can you elaborate how you would evaluate a conversation? I’m still stuck at this, especially how to generate a user response to the first model completion, as this completion is theoretically different each time.

TonyAIChamp · December 4, 2023, 2:06pm

I haven’t found a universal solution yet, so for every case the prompt will be different.

One of the examples is sending the assessor the whole conversation and the metrics we want to assess the conversation by.

eic · December 5, 2023, 10:32am

The ability to share a conversation was really good for such end-to-end processes; it’s a pity it has been removed.

TonyAIChamp · December 5, 2023, 10:33am

Not sure I’m following. Could you elaborate?

eic · December 5, 2023, 10:38am

Previously you were able to “share” a conversation link with others, and they could continue the conversation where you left off, and hand it back to you or someone else to continue, and so on.

It was not truly collaborative, but it was a step towards collaboration.

Now, although you can share a conversation link with a person, there is no ability for them to continue the conversation in the same manner (short of copy-pasting the entire conversation). Hope that made sense.

TonyAIChamp · December 5, 2023, 10:39am

Oh, gotcha! Curious to hear about your use-case of this functionality for prompt testing

eic · December 5, 2023, 11:02am

I am essentially attempting to A/B test prompt versions for conversational GPTs as a way of comparing how each version performs in order to achieve the desired outcome.

For example, did specific words or phrases result in one version of the GPT responding in a better / worse way, ignoring certain instructions, getting confused, etc.?

I’m still learning, and I’m finding it can be a bit hit or miss with instructions sometimes… the more you try to reiterate something, the more confused the GPT tends to get.

TonyAIChamp · December 5, 2023, 11:05am

A/B testing is a way to go, but what I struggle to understand is how you used to use sharing feature for it? Did you share 2 versions with different respondents, manually collected from them what they got as responses and then compared?

eic · December 5, 2023, 11:33am

Ah, so here’s what I was envisioning…

Have a person interact with TWO GPTs, each one having a different version of a “test prompt”. This, in essence, would be an attempt to keep the test conditions the same by keeping the user persona consistent for both versions of the test prompt.
Have the person share BOTH their conversation links with me.
Use a separate prompt to evaluate how each version of the test prompt performed by analyzing the AI’s responses in each conversation.

I realize it’s a rather roundabout, dirty workaround.

TonyAIChamp · December 5, 2023, 11:35am

This approach makes sense but it has a significant flaw. 1 conversation (even in 2 variations) won’t really help you optimize your prompt to have stable outputs. You need to do such testing dozens, better - hundreds of times, and manually it is a daunting task I believe no one engages in. So we end up with automatic testing.

prompteams · December 19, 2023, 6:23pm

You should try Prompteams! It allows versioning, testing and collaboration of your prompts and also includes up to date retrieval of your prompts. You can see your commit history as well

eic · December 20, 2023, 1:26am

Why thank you for this suggestion - I will give it a try.

GavrielCohen · December 26, 2023, 10:03am

Hi, Excellent question, I will add more,

Is there a clear way to predict what the answers coming from the AI will be?
Is there a recommended tool/programming language for testing Openai?

And in general, is it smart?
Doesn’t it miss the mark and lose the “magic” that AI knows how to perform? If so, how can it be controlled? (For example, not to present the customer with a wrong/incorrect message?

Thanks!

TonyAIChamp · December 26, 2023, 10:18am

Hi Gavriel

Welcome to the forum and Merry Christmas

Are your questions rhetorical or actual questions?

GavrielCohen · December 28, 2023, 5:36am

Hi Tony!

The first two questions are topical and practical,
The others are theoretical to understand what is the correct approach in testing OpenAI

Thanks.

TonyAIChamp · December 28, 2023, 6:19am

Ok then

Is there a clear way to predict what the answers coming from the AI will be?
If so, how can it be controlled? (For example, not to present the customer with a wrong/incorrect message?

While LLM’s are stochastic, so it is not in their “nature” to give fully predictable results. Though there are some way to increase predictability, but its level will depend on the use-case. Rule of thumb - the longer and more complex is the answer, the more difficult it is to have LLM output a fully predictable answer.

You can play around with temperature, top_p and seed parameters and also try to set standard answers in the vector database and insert these answers from there when needed.

Is there a recommended tool/programming language for testing Openai?

I don’t think there is something recommended, but seems like Python is the most popular with NodeJS somewhere close to it.

And in general, is it smart?

In general - yes. But then there are all possible shades of “smartness” for different use-cases.

Doesn’t it miss the mark and lose the “magic” that AI knows how to perform?

Could you elaborate this one?

GavrielCohen · December 28, 2023, 7:23am

Thanks for the answers

Regarding the last question I will elaborate as follows:
Actually when you check and monitor “expected answers” it kind of makes everything planned,
And then the question comes: if I plan or expect certain answers anyway, then what do I need the AI technology for?
I will prepare the same answers in advance and control them without needing AI…

Hope you were able to understand me

TonyAIChamp · December 28, 2023, 7:27am

Again, it all depends on your use case. As I don’t know it, I can answer hypothetically: you may need AI to indetify which question the user is asking (as every question you have an answer to may be asked in a million different ways).

But you may as well not need AI for your use case. The trap most of the newcomers are falling into is trying to use AI for the sake of AI instead of first understanding own use case deeply (with no relation to AI whatsoever) and only then understanding if and where AI may be helpful.

GavrielCohen · December 28, 2023, 7:38am

Agree,

The product is integrated with AI, which I will have to test further at an unclear stage, so I cannot describe exactly what is needed,

Thanks in the meantime for the quick reply

feosale · December 29, 2023, 11:25am

I recommend that you first study in detail the methods of setting tasks in the field of management.

Topic		Replies	Views
LLM and Prompt Evaluation Frameworks Prompting prompt-engineering , prompting , evals	13	11611	November 18, 2025
How to test an API, built on GPT? API	2	2865	April 9, 2024
Prompt Evaluations at Scale for Production API gpt-4	1	953	August 4, 2024
Online tool available for writing effective prompts Prompting api	12	7456	November 10, 2025
Do you use ChatGPT in your product? Community prompt-engineering , tools	2	796	July 16, 2024

Tools for Testing Custom GPT Prompts

Related topics