Based on my limited experience with fine-tuning on the OpenAI platform, you’re correct that fine-tuning will degrade the model’s performance when your prompts don’t reflect those included in the training set.
I have an idea built off your notion of RAG + fine-tuning which you might find interesting. I’ve read a number of articles demonstrating that a fine-tuned model trained on domain-specific question-answer pairs does not work as effectively as vanilla RAG.
However, I do believe that fine-tuning can contribute towards creating an extremely powerful retrieval system.
Most RAG systems you see work by embedding the user’s query (generally a question) and performing a similarity search against your knowledge base, aiming to find the chunk of text that contains the answer to the user’s prompt.
This is very effective in most cases, especially if the knowledge base is not overly large.
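To make the baseline concrete, here’s roughly what that vanilla retrieval step looks like. This is a minimal sketch, not a production setup: the embedding model name and the pre-chunked `knowledge_base` list are placeholders, and in practice you’d precompute and index the chunk embeddings rather than re-embedding them per query.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # Embedding model name is a placeholder; any OpenAI embedding model works here.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def retrieve(query: str, knowledge_base: list[str], top_k: int = 3) -> list[str]:
    # Embed the query, score every chunk by cosine similarity,
    # and return the best-matching chunks to pass as context.
    q = embed(query)
    scored = []
    for chunk in knowledge_base:
        c = embed(chunk)
        scored.append((float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c))), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```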
The reason RAG beats a fine-tune for question answering is that fine-tuning generally helps with tone and style rather than adding knowledge to the system. I suppose it’s possible, with lots of testing, to develop a fine-tuned model that does effectively add knowledge, but it’s extremely difficult to balance/optimize (avoiding overfitting, evals, etc.). It would also be difficult to create an all-encompassing dataset of examples to use in fine-tuning.
I do believe however that the best RAG system given the OpenAI tools we can access at the moment does include a fine-tuned model.
The function of the fine-tune is not to answer questions directly, but rather to generate “synthetic” responses which are embedded and compared to the document chunks in your knowledge base.
Allow me to support this idea:
Considering that GPT-4 + RAG is extremely accurate (with the right prompt structure), there is no problem to solve as long as the correct context is passed to the model.
Therefore, the main potential issue is the failure of the embedding model to properly match the user’s prompt to the correct chunk in your knowledge base; if the proper chunks are not passed into GPT-4’s context, it won’t answer correctly.
Using a fine-tuned model to generate synthetic strings for the similarity search allows us to create a very close semantic match to the correct chunk of context in the knowledge base. Instead of embedding the user’s question and hoping it matches the answer in the knowledge base, we embed a string the fine-tuned model generates from that question, one that matches the tone and style of the content in the knowledge base.
So, in this method, we use the best aspects of RAG and fine-tuning to develop the ultimate retrieval system.
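In code, the only change from the vanilla sketch above is what gets embedded. The fine-tuned model id below is obviously a made-up placeholder, and this reuses the `client` and `retrieve` from the earlier sketch:

```python
def retrieve_with_synthetic_answer(question: str, knowledge_base: list[str]) -> list[str]:
    # Ask the fine-tuned model for a "synthetic" answer in the knowledge base's voice.
    completion = client.chat.completions.create(
        model="ft:gpt-3.5-turbo-0613:my-org::abc123",  # hypothetical fine-tuned model id
        messages=[{"role": "user", "content": question}],
    )
    synthetic = completion.choices[0].message.content
    # Embed the synthetic answer (not the raw question) for the similarity search,
    # then hand the retrieved chunks to GPT-4 as context, same as vanilla RAG.
    return retrieve(synthetic, knowledge_base)
```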
To develop the dataset for this, I would recommend feeding chunks of your codebase/documentation to GPT-4 and saying, “Generate 5 questions to which this chunk of information is the answer,” and fine-tuning from there.
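If it helps, here’s a rough sketch of that data-generation loop. The prompt wording and the choice to use the chunk itself as the assistant target are my assumptions; the JSONL layout follows OpenAI’s chat fine-tuning format, and `client` is reused from above:

```python
import json

def build_training_file(chunks: list[str], path: str = "train.jsonl") -> None:
    # For each chunk, have GPT-4 invent questions the chunk answers, then write
    # question -> chunk pairs in OpenAI's chat fine-tuning JSONL format.
    with open(path, "w") as f:
        for chunk in chunks:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content":
                    "Generate 5 questions to which this chunk of information "
                    f"is the answer, one per line:\n\n{chunk}"}],
            )
            lines = resp.choices[0].message.content.splitlines()
            for question in (l.lstrip("-0123456789. ") for l in lines if l.strip()):
                f.write(json.dumps({"messages": [
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": chunk},
                ]}) + "\n")
```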
Here’s a simple example that conveys the idea… however the benefit of this system likely only emerges with more complex tasks.
- User: “amzn founding year”
- Synthetic GPT: “Amazon, the company, was founded in 1995.”
- Knowledge base: “…Amazon, a company, was founded in 1994.”
So, we can see that even though the synthetic string is not accurate and does not have “added knowledge”, the change in tone and style can be exploited for a much closer embedding match. Could be a cool trick to try out!
Hope this was helpful