Anybody had experience with RAG reranking with GPT-3.5 Turbo?

Has anybody experimented with using GPT-3.5 for RAG reranking? If so, do you have any tips?

I did some experiments and initially the reranking was not working very well until I added: “MOST IMPORTANT: The User question needs to be answered.” to the system prompt. Here is my test prompt in Playground: https://platform.openai.com/playground/p/sVnBE80pXcGeyrCpEQPnyLnf?mode=chat


Curiously enough, ranking is the only thing I’ve found gpt-3.5-turbo-16k to be reasonably good at. I use it to rank the responses to questions that are answered by gpt-4-turbo. This is my system message:

You are a language model designed to evaluate the responses of this documentation query system.
You will use a rating scale of 0 to 10, 0 being poorest response and 10 being the best.
Responses with “not specified” or “no specific mention” or “rephrase question” or “unclear” or no documents returned or empty response are considered poor responses.
Responses where the question appears to be answered are considered good.
Responses that contain detailed answers are considered the best.
Also, use your own judgement in analyzing if the question asked is actually answered in the response. Remember that a response that contains a request to “rephrase the question” is usually a non-response.
Please rate the question/response pair entered. Only respond with the rating. No explanation necessary. Only integers.
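
(For reference, a minimal sketch of how such a rating call might look with the Python openai client; the helper name, model choice, and response parsing are my assumptions, not the poster's actual code.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paste the system message quoted above into this constant.
RATING_SYSTEM_PROMPT = "..."

def rate_response(question: str, response: str) -> int:
    """Ask gpt-3.5-turbo-16k to rate a logged question/response pair on a 0-10 scale."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        temperature=0,
        messages=[
            {"role": "system", "content": RATING_SYSTEM_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {response}"},
        ],
    )
    # The prompt asks for an integer only, so parse the reply directly.
    return int(completion.choices[0].message.content.strip())
```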

The prompt could surely be better, but it's been working adequately for a while now, so I haven't gotten around to improving it.

My suggestion is to do what you have already done: Just test it out on examples and make adjustments as you see fit.


There was an article on arXiv that compared various LLM-based reranking methods. If I remember correctly, the best results were obtained by asking whether a snippet is relevant or not, then using logprobs to judge the model's confidence in its response and taking only the confident positives.
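
A rough sketch of that idea (my own reconstruction, not necessarily what the paper did; the model, prompt wording, and the 0.9 threshold are assumptions): ask a yes/no relevance question with logprobs enabled, convert the logprob of "Yes" into a probability, and keep only confident positives.

```python
import math
from openai import OpenAI

client = OpenAI()

def relevance_confidence(question: str, snippet: str, model: str = "gpt-3.5-turbo") -> float:
    """Return the model's estimated probability that the snippet is relevant to the question."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
        messages=[
            {"role": "system", "content": "Answer with a single word: Yes or No."},
            {"role": "user", "content": f"Question: {question}\n\nSnippet: {snippet}\n\nIs the snippet relevant to the question?"},
        ],
    )
    # Sum the probabilities of all top tokens that read as "Yes" (any casing/whitespace).
    top = completion.choices[0].logprobs.content[0].top_logprobs
    return sum(math.exp(t.logprob) for t in top if t.token.strip().lower().startswith("yes"))

def confident_positives(question, snippets, threshold=0.9):
    """Keep only snippets the model marks relevant with high confidence (threshold is arbitrary)."""
    return [s for s in snippets if relevance_confidence(question, s) >= threshold]
```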


@oldrich.svec Sounds interesting! Can you please share a link to this paper? Thank you!

I do not remember the exact article but it might be this one:

I would also recommend looking at the articles that cite the aforementioned one:

https://scholar.google.com/scholar?cites=16114746473883483909&as_sdt=2005&sciodt=0,5&hl=en


We use pairwise ranking with Elo ratings for many tasks in our Policy Synth project. It is the best method we’ve tried with GPT-4, but it is too slow for chatbot RAG re-ranking.
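
For context, here is a rough sketch of how pairwise LLM judgments can feed Elo ratings; this is a generic illustration, not Policy Synth's actual implementation, and the K-factor and judging setup are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one pairwise comparison (k is the usual K-factor)."""
    ea = expected_score(rating_a, rating_b)
    sa = 1.0 if a_won else 0.0
    return rating_a + k * (sa - ea), rating_b + k * ((1 - sa) - (1 - ea))

# In practice, `a_won` would come from an LLM judge asked which of two candidate
# chunks (or answers) is more relevant; after enough random pairings, candidates
# are sorted by their Elo rating.
```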


It’s interesting to me that GPT-3.5 can add value on top of a cosine-similarity embedding comparison.

Is that because the embeddings were “last gen”?

Would it be an improvement over 3rd-gen embeddings?

I mean what is the re-ranking prompt doing but some implicit embedding comparison in any case?

To be clear, I am not ranking embeddings. I am only ranking the question/response pairs in my log: basically, determining rank by separating good responses (questions that are answered) from bad responses (questions that are not).


Ah, OK, got it. I was slightly confused because of the topic title. It’s still interesting that you are doing that with a lower-powered model, though. What are you doing it for, and how does it improve what you are trying to achieve? (Apologies for taking this further off track; I’m very interested in the OP’s topic too!)

I record every question asked and every response rendered into a log. Let’s say, a year from now, I have thousands, if not tens of thousands, of question/response pairs covering hundreds of subjects within my dataset. I could use this data to fine-tune a custom model, or create embeddings to enhance (fine-tune) the current knowledge base with actual conversations on a myriad of subjects. But how do I tell the good conversations from the bad ones?

That’s where the ranking comes in. The scale is 0 to 10, so if I choose everything over 8, I’m bound to get the best responses the model has given.

And the low-ranked query/response pairs? Now I also know the questions the dataset was NOT able to answer, and that will be useful down the line as well.

BTW, I also have a categorization system so that similar query/response pairs are classified into the same categories, which will also be useful.
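
(A tiny sketch of that filtering step; the JSONL log format and field names are my assumptions, purely illustrative.)

```python
import json

def best_pairs(log_path: str, min_rating: int = 9):
    """Yield logged question/response pairs rated above 8, e.g. as fine-tuning candidates."""
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)  # assumed fields: question, response, rating, category
            if record["rating"] >= min_rating:
                yield record
```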


For my use case, even a simple relevancy score is good.

You give a prompt that defines the relevancy standard:
input: chunk_id, chunk_text
output: chunk_id, relevancy_score

That’s it (gpt-3.5 is good enough for this). It works well and is quite flexible, although it’s only suitable for limited use cases given the price.
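
A minimal sketch of that scheme (the prompt wording, model choice, and one-call-per-chunk approach are my assumptions; the poster may well batch several chunk_id/chunk_text pairs into a single prompt):

```python
from openai import OpenAI

client = OpenAI()

def score_chunks(question: str, chunks: dict[str, str]) -> dict[str, int]:
    """Score each chunk's relevancy to the question on a 0-10 scale, one call per chunk."""
    scores = {}
    for chunk_id, chunk_text in chunks.items():
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {"role": "system", "content": "Rate how relevant the chunk is to the question on a scale of 0 to 10. Respond with a single integer."},
                {"role": "user", "content": f"Question: {question}\n\nChunk ({chunk_id}): {chunk_text}"},
            ],
        )
        scores[chunk_id] = int(completion.choices[0].message.content.strip())
    return scores
```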

Hi @merefield ,
not sure if I got it right but will take a risk to answer your question)) measuring cosine similarity of vector embeddings is mere retrieval approach, which has no guaranty that the top-k documents (chunks) would be even close to the answer. You can embed a book for 1 grade kids and ask a question on molecular physics and you will [always] get those top 2 or 3 nearest neighbors. But the LLMs are able to assess those retrieved documents and actually rank (let’s say between 0 and 10) if those chunks contain an answer or if the answer generated is accurately answering the question. Hope it makes sense))


Agreed. Never in dispute.

My general question is: given a good-quality query, how is it even possible to get a benefit from reranking the results with an LLM of the same generation?

Surely you are already extracting the semantic power from the model during retrieval. I do not understand how you can improve upon that implicit ranking with the same model.

I could believe that re-ranking with a better model might help (has anyone tried this?)


Well, it’s kind of better than nothing… But I would rather use reranking earlier, in the retrieval part. Say, instead of passing the standard 3 nearest candidates (k=3), I’d increase k to 10, use the LLM to rerank those chunks, and then take only the 3 with the highest scores into the generative part. LLM-powered reranking can surface more relevant documents than pure embedding-based retrieval. And if there are no chunks with a rating above 7 or 8 (which would be the case if you embedded a first-grade kids’ book and asked a question on cyber-security), then instead of passing those chunks to the generative part, I’d rather return an output saying there is no relevant grounding content for that question. This is a more reliable approach, which significantly mitigates hallucinations.
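
To make that concrete, here is a sketch under those assumptions (over-retrieve with k=10, score with the LLM, keep the top 3 above a threshold of 7); `vector_search` is a hypothetical embedding-search helper and `score_chunks` a hypothetical LLM scorer along the lines sketched earlier in the thread.

```python
def retrieve_with_llm_rerank(question: str, k: int = 10, keep: int = 3, threshold: int = 7):
    """Over-retrieve k chunks, rerank them with the LLM, and keep the top few above a threshold."""
    candidates = vector_search(question, k=k)      # hypothetical embedding search: {chunk_id: text}
    scores = score_chunks(question, candidates)    # hypothetical LLM relevancy scores, 0-10
    ranked = sorted(candidates, key=lambda cid: scores[cid], reverse=True)
    top = [cid for cid in ranked if scores[cid] >= threshold][:keep]
    if not top:
        return None  # no relevant grounding content; report that instead of generating an answer
    return {cid: candidates[cid] for cid in top}
```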

I’m not sure you’ve understood what I’m getting at.

How can you rerank “earlier”? I’m talking about the case where you are attempting to re-rank the chunks with the LLM immediately after the embedding process has returned the chunks.

So:

embed query → search → rerank → present and generate.

I get that you can apply different criteria to the reranking in some scenarios (e.g. promoting chunks based on other criteria like recency, author, etc.).

However, setting other criteria aside, I do not see how the LLM can reliably improve upon the cosine-ordered search, or maybe I’m missing something?

i.e. how can the LLM reorder the chunks any better than the cosine-distance ordering?

What is the LLM doing differently from the cosine-distance ordering?

I guess my point is this: the embedding search is using a huge number of features to compare chunks. How can you improve upon that with an LLM prompt scheme?

I suppose what I’m really after is evidence, from data, that this is a good approach.