Curiously enough, ranking is the only thing I’ve found gpt-3.5-turbo-16k to be reasonably good at. I use it to rank the responses to questions that are answered by gpt-4-turbo. This is my system message:
You are a language model designed to evaluate the responses of this documentation query system.
You will use a rating scale of 0 to 10, 0 being the poorest response and 10 being the best.
Responses with “not specified” or “no specific mention” or “rephrase question” or “unclear” or no documents returned or empty response are considered poor responses.
Responses where the question appears to be answered are considered good.
Responses that contain detailed answers are considered the best.
Also, use your own judgment in analyzing whether the question asked is actually answered in the response. Remember that a response that contains a request to “rephrase the question” is usually a non-response.
Please rate the question/response pair entered. Only respond with the rating. No explanation necessary. Only integers.
It could surely be better, but it’s been working adequately for a while now so I haven’t gotten around to improving it.
My suggestion is to do what you have already done: Just test it out on examples and make adjustments as you see fit.
There was an article on arXiv that compared various LLM-based reranking methods. If I remember correctly, the best results came from asking whether a snippet is relevant or not, then using logprobs to judge the model’s confidence in that answer and keeping only the confident positives.
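That filtering step is simple to implement once you have the logprob of the first answer token. A sketch, assuming the API returns the sampled token and its log probability (the threshold value is an assumption to tune, not from the article):

```python
import math

def confident_positive(first_token: str, logprob: float,
                       threshold: float = 0.9) -> bool:
    """Keep a snippet only if the model answered 'yes' AND the probability
    of that token (exp of its logprob) clears the confidence threshold."""
    return (first_token.strip().lower() == "yes"
            and math.exp(logprob) >= threshold)
```

So a “yes” sampled at 95% probability passes, while a hesitant “yes” at 60% gets dropped along with every “no”, which is exactly the “confident positives only” behavior.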