How to prompt GPT-3.5 to evaluate responses


I’m currently developing a chatbot using another LLM, and I’m trying to use GPT-3.5 to evaluate its responses based on the following four criteria: “sentence coherence”, “perplexity”, “specificity” and “empathy”. I’m planning to prompt GPT-3.5 to rate the responses from 1-5 for each criterion (1 being poor, 5 being great). Is this a viable evaluation method? How do I create a prompt for this purpose?

Welcome to the community!

A bare 1-5 rating probably isn’t the best option. Even with human raters, it’s tough to get consistent results from an unanchored scale.

Having specific criteria for each bucket in your categories could help get you useful results :slight_smile:
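For example, you could spell out what each score means for each criterion and build that rubric into the prompt. A minimal sketch in Python (the anchor descriptions for “empathy” below are hypothetical placeholders — you’d tailor them to your own chatbot’s domain):

```python
# Sketch: build an evaluation prompt with explicit anchor descriptions
# for each score, instead of a bare 1-5 scale. The rubric text is a
# hypothetical example, not a recommended standard.

EMPATHY_RUBRIC = {
    1: "Ignores or dismisses the user's feelings entirely.",
    2: "Acknowledges the user only in a generic, formulaic way.",
    3: "Recognises the user's emotion but offers little support.",
    4: "Clearly names the user's emotion and responds supportively.",
    5: "Fully validates the user's emotion and tailors the reply to it.",
}

def build_rubric_prompt(criterion: str, rubric: dict, response: str) -> str:
    anchors = "\n".join(f"{score}: {desc}" for score, desc in sorted(rubric.items()))
    return (
        f"Rate the following chatbot response on '{criterion}' from 1 to 5.\n"
        f"Use these definitions for each score:\n{anchors}\n\n"
        f"Response:\n{response}\n\n"
        "Reply with a single integer from 1 to 5."
    )

prompt = build_rubric_prompt("empathy", EMPATHY_RUBRIC, "I'm sorry to hear that...")
```

You’d then send `prompt` as the user message in a chat-completion call, with one rubric per criterion.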


I strongly agree on defining specific criteria.

The other point I’d add is that for this type of task I often start with a Chain of Thought (CoT) approach, i.e. asking the model to lay out the steps it would normally take to perform the rating. I then use this as the foundation for a custom methodology, refining the model’s approach as required and adding specific evaluation criteria. Finally, I incorporate this methodology into all my prompts to ensure a consistent rating approach.
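To illustrate, here is one way to bake a fixed methodology into every evaluation prompt. The steps shown are hypothetical stand-ins for what you’d distil from the model’s own CoT output:

```python
# Sketch: a fixed, reusable evaluation methodology embedded in each
# prompt. The numbered steps are hypothetical placeholders derived
# from an initial CoT elicitation, then refined by hand.

METHODOLOGY = """\
Follow these steps to rate the response:
1. Restate in one sentence what the user asked for.
2. Check the response against each criterion, citing evidence from its text.
3. Assign a score from 1 to 5 per criterion, with a brief justification.
4. Output the final scores as JSON, e.g. {"empathy": 4}."""

def build_eval_prompt(criteria: list, response: str) -> str:
    return (
        f"You are an evaluator. Criteria: {', '.join(criteria)}.\n\n"
        f"{METHODOLOGY}\n\n"
        f"Response to evaluate:\n{response}"
    )
```

Because `METHODOLOGY` is a constant, every rating call follows the same procedure, which helps with consistency across runs.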

You can also consider running multiple independent evaluations of a given response and taking the average of the computed scores, or taking into account the log probabilities of the assigned rating tokens.
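A minimal sketch of the averaging idea — the raw replies below are stand-ins for what repeated chat-completion calls would return, not real API output:

```python
import re
from statistics import mean

# Sketch: average the scores from several independent evaluation runs.
# The `replies` list is hypothetical sample data; in practice you'd
# collect one reply per repeated API call for the same response.

def extract_score(reply: str):
    """Pull the first 1-5 integer out of an evaluator reply, else None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None

replies = ["Score: 4", "4", "I would rate this a 3 out of 5."]
scores = [s for s in (extract_score(r) for r in replies) if s is not None]
average = mean(scores)  # mean of [4, 4, 3] for these sample replies
```

If you also request token log probabilities from the API, you could weight each parsed score by the probability the model assigned to that rating token instead of averaging uniformly.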
