I finished my research work comparing fine-tuning with context-injection (as an implementation of retrieval-augmented generation). A lot of work went into organizing the experimental paradigm, but in the end, these three scenarios were compared on ambiguous question-answering:
Fine-tuned GPT-3 vs GPT-3 with context-injection vs GPT-3
In the end, context-injection always led to better answers than fine-tuning. Also, context-injection on GPT-3 and GPT-4 led to better answers than GPT-3 or GPT-4 alone (zero-shot). However, context-injection did not improve the results compared to base ChatGPT.
Now, what you really should be looking at is fine-tuning for retrieval.
Basically, the model would be fine-tuned to convert a prompt into a text string better suited to retrieving the most salient embeddings, increasing the hit-rate and boosting the quality of the final answers.
The idea would be to eliminate fluff, red herrings, and attention distractions, distilling the search down to its purest form so the semantically similar matches are more likely to be relevant.
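A minimal sketch of what that rewrite step could look like, assuming the OpenAI Python client (v1 style); the `rewrite_for_retrieval` helper, the system prompt, and the example prompt are all made up, and a base model is shown where you would drop in your fine-tuned one:

```python
# Sketch: rewrite the user's prompt into a distilled search string before
# embedding it for the vector lookup. The system prompt and model name are
# illustrative; swap in a model fine-tuned for this rewriting task.
from openai import OpenAI

client = OpenAI()

def rewrite_for_retrieval(user_prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Strip fluff and red herrings, returning only the core search query."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Rewrite the user's message as a short, keyword-dense "
                        "search query. Drop pleasantries, caveats, and anything "
                        "irrelevant to the information need."},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

search_query = rewrite_for_retrieval(
    "Hey! Sorry for the long question. We migrated to Postgres 15 last month and "
    "now nightly backups randomly fail. Any idea what settings control backup timeouts?"
)
# `search_query` is what gets embedded and sent to the vector DB, not the raw prompt.
```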
Also, depending on how you have your retrieval documents chunked and the types of prompts you’ll be searching against, it may be fruitful to have a model fine-tuned to break a query up into multiple pieces which you’d do semantic matching on separately.
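And a rough sketch of that decomposition variant, under the same assumptions (the decomposition prompt and example question are invented, and a base model stands in for a purpose-built fine-tune):

```python
# Sketch: split one prompt into several focused sub-queries and run a separate
# semantic search for each.
from openai import OpenAI

client = OpenAI()

def decompose_query(user_prompt: str, model: str = "gpt-3.5-turbo") -> list[str]:
    response = client.chat.completions.create(
        model=model,  # or a model fine-tuned specifically for decomposition
        messages=[
            {"role": "system",
             "content": "Split the user's question into independent sub-questions, "
                        "one per line. If it is already a single question, return "
                        "it unchanged."},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

sub_queries = decompose_query(
    "How does our refund policy differ between the EU and the US, and which "
    "payment providers support partial refunds?"
)
# Each sub-query then gets its own embedding + vector search, and the retrieved
# chunks are merged and de-duplicated before going into the answer prompt.
```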
Honestly, RAG is one of the most valuable tools for increasing model effectiveness, right up there with external tool use.
But, don’t sleep on fine-tuning—that’s the best way to make the models better at using tools including retrieval.
So, while I applaud the work you put into this, what I really hope to see soon is a post on fine-tuning and RAG instead of “vs.”
I view this as the model responding faster to RAG than to fine-tuning. By that I mean it takes a lot more examples to move the model in a given direction with fine-tuning than it does with RAG. It’s a trade-off.
I also did RAG on a fine-tuned model. The performance on ambiguous question-answering was better than fine-tuning alone, but worse than RAG alone. Worse, the model also suffered from both catastrophic forgetting and hallucination.
As fine-tuning + RAG would be the most time-consuming and expensive option, I don’t see why anyone would do this.
I would be interested to learn more about the process you used for combining the two.
Was the fine-tuning the same fine-tuning in your initial tests or was it a fine-tuning specifically for the purpose of using RAG?
My (very limited and often incorrect) understanding is that there are a couple of ways to combine the two.
Fine-tune for retrieval.
The idea here being that the initial input from the user isn’t particularly well-suited for use as input for a RAG lookup. So, you would fine-tune a model to act as a sort of “translation” layer, taking the user prompt and processing it into something more likely to trigger an accurate hit in the vector DB. This is (somewhat) related to HyDE, where the idea is that an answer to a question (even if incorrect) will have greater semantic similarity to the true embedded answer than the question itself does. HyDE has historically been done with just the base model, but there’s no reason why you couldn’t use a fine-tuned model designed to respond in a form and format closer to the retrieval text. Whether or not this would increase the hit-rate enough to justify the fine-tuning is, to the best of my current knowledge, an open question.
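For what it’s worth, a minimal HyDE-style sketch might look like the following, again assuming the OpenAI Python client; the embedding model, the brute-force cosine-similarity search, and the helper names are stand-ins for whatever your pipeline actually uses:

```python
# Sketch: embed a generated (possibly wrong) answer rather than the raw
# question, on the theory that it sits closer to the stored answer text.
# The document store is faked with a numpy matrix of precomputed embeddings.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def hyde_query_vector(question: str, model: str = "gpt-3.5-turbo") -> np.ndarray:
    """Generate a hypothetical answer and embed it instead of the question.
    A model fine-tuned to mimic the form/format of your retrieval corpus
    could be dropped in here in place of the base model."""
    hypothetical = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Write a short, plausible answer to: {question}"}],
        temperature=0,
    ).choices[0].message.content
    return embed(hypothetical)

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Cosine-similarity ranking over a matrix of document embeddings."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:k]
```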
Fine-tune on retrieval.
The idea here is that you would include retrieved snippets in your fine-tuning training data, under the assumption the model will always have retrieved data available to it. Essentially, you would be fine-tuning the model to better use the retrieved data, hopefully giving it the ability to ignore irrelevant information loaded into context via RAG. This will increase your fine-tuning costs because you’ll be including far more tokens, but it should make the model’s final responses much better.
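As a concrete illustration, one training example for that setup might look like this, using OpenAI’s chat fine-tuning JSONL format; the snippets (including the deliberately irrelevant one), the question, and the file name are all invented:

```python
# Sketch of one "fine-tune on retrieval" training example. Retrieved snippets,
# including a distractor, go in the prompt so the model learns to use the
# relevant context and ignore the rest.
import json

example = {
    "messages": [
        {"role": "system",
         "content": "Answer using the retrieved context. Ignore snippets that are irrelevant."},
        {"role": "user",
         "content": (
             "Context:\n"
             "[1] The warranty covers manufacturing defects for 24 months from purchase.\n"
             "[2] Our office kitchen is cleaned every Friday afternoon.\n\n"  # distractor
             "Question: How long is the warranty period?"
         )},
        {"role": "assistant",
         "content": "The warranty covers manufacturing defects for 24 months from the date of purchase."},
    ]
}

with open("retrieval_finetune.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
# The resulting JSONL file is what you would upload as the training file for
# the fine-tuning job.
```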
Which you choose would largely depend on how well your RAG pipeline works. If you have a very high hit-rate you might consider fine-tuning on retrieval so your model makes better use of what is retrieved. If your hit-rates are low you might consider first trying HyDE, then consider fine-tuning for retrieval.
Beyond that, other things you might consider trying include:
Using a different embedding model (or multiple embedding models).
Other embedding models might be able to better capture the semantics of your particular corpus. Using multiple embedding models also allows you to do more sophisticated re-ranking of the retrieved results.
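One common way to combine results from two embedding models is reciprocal rank fusion; a small sketch, where the two ranked lists stand in for the output of two separate embed-and-search passes and the document IDs are made up:

```python
# Sketch: fuse the ranked results of two different embedding models with
# reciprocal rank fusion (RRF). Documents ranked highly by both float to the top.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

ranking_a = ["doc_17", "doc_03", "doc_42"]  # results from embedding model A
ranking_b = ["doc_03", "doc_17", "doc_88"]  # results from embedding model B
fused = reciprocal_rank_fusion([ranking_a, ranking_b])
```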
Inserting a filter step into your pipeline.
Basically, for each retrieval you would query a model to accept or reject it for relevance and distill it down to only the most salient information. Then, when you have pre-processed all of the retrievals, you can ask the model to distill them down into a single document, perhaps a bullet list of information. Finally, you would use this pseudo-document as the retrieved context for the model to generate its response to the user.
The idea being that while the attention mechanisms in the models are pretty good (and improving), by explicitly eliminating any potential distractions up front you free the model from having to filter them out itself while generating the response.
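A sketch of what that filter-and-distill step could look like, again assuming the OpenAI Python client; the prompts, the model name, and the helper names are only illustrative:

```python
# Sketch: judge each retrieved chunk for relevance and compress it, then merge
# the survivors into a single bullet-list pseudo-document used as the context
# for the final answer.
from openai import OpenAI

client = OpenAI()

def _ask(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def filter_and_distill(question: str, chunks: list[str]) -> str:
    kept = []
    for chunk in chunks:
        verdict = _ask(
            f"Question: {question}\n\nSnippet: {chunk}\n\n"
            "If the snippet helps answer the question, reply with only the "
            "relevant facts, as briefly as possible. Otherwise reply IRRELEVANT."
        )
        if verdict.upper() != "IRRELEVANT":
            kept.append(verdict)
    if not kept:
        return ""
    # Merge the surviving facts into one bullet-list pseudo-document.
    return _ask(
        "Combine these notes into a concise bullet list, removing duplicates:\n\n"
        + "\n\n".join(kept)
    )
# The returned pseudo-document replaces the raw retrievals in the final
# answer-generation prompt.
```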
At the end of the day, all of these will increase expense, so you’ll need to weigh the value against the cost for yourself.