Anybody had experience with RAG reranking with GPT3.5 Turbo?

Has anybody experimented with using GPT3.5 for RAG reranking? If so, do you have any tips?

I did some experiments and initially the reranking was not working very well until I added: “MOST IMPORTANT: The User question needs to be answered.” to the system prompt. Here is my test prompt in Playground:

1 Like

Curiously enough, ranking is the only thing I’ve found gpt-3.5-turbo-16k to be reasonably good at. I use it to rank the responses to questions that are answered by gpt-4-turbo. This is my system message:

You are a language model designed to evaluate the responses of this documentation query system.
You will use a rating scale of 0 to 10, 0 being poorest response and 10 being the best.
Responses with “not specified” or “no specific mention” or “rephrase question” or “unclear” or no documents returned or empty response are considered poor responses.
Responses where the question appears to be answered are considered good.
Responses that contain detailed answers are considered the best.
Also, use your own judgement in analyzing if the question asked is actually answered in the response. Remember that a response that contains a request to “rephrase the question” is usually a non-response.
Please rate the question/response pair entered. Only respond with the rating. No explanation necessary. Only integers.

It could surely be better, but it’s been working adequately for a while now so I haven’t gotten around to improving it.

My suggestion is to do what you have already done: Just test it out on examples and make adjustments as you see fit.

1 Like

There was an article on arxiv that showed various reranker methods using LLM. If I remember correctly the best results were obtained by asking if the snippet is relevant or not and then using logprobs to judge the confidence in response and take only confident positives.

1 Like

@oldrich.svec Sounds interesting! Can you please share a link to this paper? Thank you!

I do not remember the exact article but it might be this one:

Also I would recommend to look at the articles that cited the aforementioned article:,5&hl=en

We use Pairwise Ranking with Elo ratings, for many tasks in our Policy Synth project. It is the best method we’ve tried with GPT-4, but it is too slow for chatbot RAG re-ranking.