Tools for Testing Custom GPT Prompts

Thanks Tony for this.
Can you elaborate how you would evaluate a conversation? I’m still stuck at this, especially how to generate a user response to the first model completion, as this completion is theoretically different each time.

I haven’t found a universal solution yet, so for every case the prompt will be different.

One of the examples is sending the assessor the whole conversation and the metrics we want to assess the conversation by.

1 Like

The ability to share a conversation was really good for such end-to-end processes; it’s a pity it has been removed.

Not sure I’m following. Could you elaborate?

Previously you were able to “share” a conversation link with others, and they could continue the conversation where you left off, and hand it back to you or someone else to continue, and so on.

It was not truly collaborative, but it was a step towards collaboration.

Now, although you can share a conversation link with a person, there is no ability for them to continue the conversation in the same manner (short of copy-pasting the entire conversation). Hope that made sense.

1 Like

Oh, gotcha! Curious to hear about your use-case of this functionality for prompt testing

I am essentially attempting to A/B test prompt versions for conversational GPTs as a way of comparing how each version performs in order to achieve the desired outcome.

For example, did specific words or phrases result in one version of the GPT responding in a better / worse way, ignoring certain instructions, getting confused, etc.?

I’m still learning, and I’m finding it can be a bit hit or miss with instructions sometimes… the more you try to reiterate something, the more confused the GPT tends to get. :blush:

1 Like

A/B testing is a way to go, but what I struggle to understand is how you used to use sharing feature for it? Did you share 2 versions with different respondents, manually collected from them what they got as responses and then compared?

Ah, so here’s what I was envisioning…

  1. Have a person interact with TWO GPTs, each one having a different version of a “test prompt”. This, in essence, would be an attempt to keep the test conditions the same by keeping the user persona consistent for both versions of the test prompt.

  2. Have the person share BOTH their conversation links with me.

  3. Use a separate prompt to evaluate how each version of the test prompt performed by analyzing the AI’s responses in each conversation.

I realize it’s a rather roundabout, dirty workaround.

1 Like

This approach makes sense but it has a significant flaw. 1 conversation (even in 2 variations) won’t really help you optimize your prompt to have stable outputs. You need to do such testing dozens, better - hundreds of times, and manually it is a daunting task I believe no one engages in. So we end up with automatic testing.

1 Like

You should try Prompteams! It allows versioning, testing and collaboration of your prompts and also includes up to date retrieval of your prompts. You can see your commit history as well

2 Likes

Why thank you for this suggestion - I will give it a try. :slight_smile:

Hi, Excellent question, I will add more,

Is there a clear way to predict what the answers coming from the AI will be?
Is there a recommended tool/programming language for testing Openai?

And in general, is it smart?
Doesn’t it miss the mark and lose the “magic” that AI knows how to perform? If so, how can it be controlled? (For example, not to present the customer with a wrong/incorrect message?

Thanks!

Hi Gavriel

Welcome to the forum and Merry Christmas :slight_smile:

Are your questions rhetorical or actual questions?

Hi Tony!

The first two questions are topical and practical,
The others are theoretical to understand what is the correct approach in testing OpenAI

Thanks.

1 Like

Ok then :slight_smile:

Is there a clear way to predict what the answers coming from the AI will be?
If so, how can it be controlled? (For example, not to present the customer with a wrong/incorrect message?

While LLM’s are stochastic, so it is not in their “nature” to give fully predictable results. Though there are some way to increase predictability, but its level will depend on the use-case. Rule of thumb - the longer and more complex is the answer, the more difficult it is to have LLM output a fully predictable answer.

You can play around with temperature, top_p and seed parameters and also try to set standard answers in the vector database and insert these answers from there when needed.

Is there a recommended tool/programming language for testing Openai?

I don’t think there is something recommended, but seems like Python is the most popular with NodeJS somewhere close to it.

And in general, is it smart?

In general - yes. But then there are all possible shades of “smartness” for different use-cases.

Doesn’t it miss the mark and lose the “magic” that AI knows how to perform?

Could you elaborate this one?

2 Likes

Thanks for the answers :slight_smile:

Regarding the last question I will elaborate as follows:
Actually when you check and monitor “expected answers” it kind of makes everything planned,
And then the question comes: if I plan or expect certain answers anyway, then what do I need the AI technology for?
I will prepare the same answers in advance and control them without needing AI…

Hope you were able to understand me

1 Like

Again, it all depends on your use case. As I don’t know it, I can answer hypothetically: you may need AI to indetify which question the user is asking (as every question you have an answer to may be asked in a million different ways).

But you may as well not need AI for your use case. The trap most of the newcomers are falling into is trying to use AI for the sake of AI instead of first understanding own use case deeply (with no relation to AI whatsoever) and only then understanding if and where AI may be helpful.

3 Likes

Agree,

The product is integrated with AI, which I will have to test further at an unclear stage, so I cannot describe exactly what is needed,

Thanks in the meantime for the quick reply

1 Like

I recommend that you first study in detail the methods of setting tasks in the field of management.