Creation of a reranker through fine tuning did not work

For one of my use case, I wanted to create a fine tuned model for reranking. I used the below approach

For each chunk, I created two summaries of the chunk one is relevant and other non-relevant and then created the jsonl file as below

chunk content, chunk summary, relevant score

In the above if chunk content and chunk summary are relevant, I gave a score of 1 else 0.

With those training samples, I did a finetuning. But after finetuning, even if I give the exact relevant summary, the model is not able to pickup the right score. I wanted to understand why this may be happening. Is it because, it is not able to create the association(self-attention) with an integer score(1 or 0) with the text. May be this is not the right approach to create a re-ranker model.


Could you elaborate a bit more what a training data example looked like in this case, i.e. system message (if applicable) as well as user and assistant messages?

{“messages”:[{“role”:“system”,“content”:"Given a user query and context summary, provide a relevancy score of 0 or 1, 0 is low relevant and 1 is high relevant}, {“role”:“user”,“content”:user_query:xxxx, summary:xxxxx},{“role”:“assisytant”,“content”:}]
Above is a illustrative training sample i created

As a starting point for improvement, I would include in the system instruction more details. For one part, make it clearer what the purpose of the scoring is, i.e. ranking retrieved content based on its relevance to a user query. Second, try to describe under which conditions a summary is considered low vs. high relevant so the model has a better basis for performing the scoring.

This might help get better results.

Thanks will try this. But it will be great to know if anyone has used any other technique in this space, and if it worked. Particularly when you have lot of domain specific nuance in your corpus

Bear in mind that fine-tuning has significant limitations and is not well suited to teach knowledge. The fine-tuned will be capable to pick up certain patterns, i.e. in your case patterns of when to assign a low vs. high relevance score. It will however not or at least not consistently pick up facts or domain-specific knowledge.