Okay, so, I don’t think you want a fine-tuned model for this narrow set of requirements.
Instead, you need to create an embedding vector for each of your 1,000 items. Then, when a worker enters a query, you get the embedding vector for the query and compare it against each of the 1,000 item vectors. The item with the highest similarity score is likely the answer, and you return its associated response to the user.
The vector comparisons are performed using a dot product or cosine similarity. This returns a score, typically between 0 and 1 for these embeddings, indicating how similar the query is to each of your 1,000 items. I’ve written about that part of the solution here.
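A minimal sketch of that comparison, using plain Python (real embeddings have far more dimensions than these toy vectors):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors.

    For unit-length embeddings this equals the plain dot product;
    here we normalize explicitly so it works for any vectors.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction -> 1.0; orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # → 0.0
```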
The 1,000-item database should contain the data you showed earlier, including the responses, so that once you single out the item most similar to the question asked, you already have the answer; you needn’t do anything else.
This approach is fast and cheap, but most importantly, it doesn’t introduce the massive distractions into the conversation that are made possible with fine-tuning. It is a closed environment where you control the outcome. For example, you could set a rule that causes the app to reject any query that doesn’t meet a score of 0.75 or higher on the basis that it’s probably not a relevant question.
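Putting the pieces together, here is a sketch of the full lookup with the rejection rule. The item records, responses, and the 0.75 threshold are illustrative stand-ins, and the toy 3-d embeddings stand in for real ones with far higher dimensionality:

```python
from math import sqrt

# Hypothetical in-memory "database": each item pairs a precomputed
# embedding with the canned response you want returned.
ITEMS = [
    {"query": "How do I reset my badge?",
     "embedding": [0.9, 0.1, 0.0],
     "response": "Visit the security desk on floor 2."},
    {"query": "Where is the parking form?",
     "embedding": [0.1, 0.9, 0.1],
     "response": "It's on the facilities intranet page."},
]

THRESHOLD = 0.75  # below this, treat the query as probably irrelevant

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def best_match(query_embedding):
    """Return the response of the most similar item, or None when no
    item clears the similarity threshold (the rejection rule)."""
    best_item, best_score = None, -1.0
    for item in ITEMS:
        score = cosine(query_embedding, item["embedding"])
        if score > best_score:
            best_item, best_score = item, score
    return best_item["response"] if best_score >= THRESHOLD else None
```

An off-topic query embedding simply scores below the threshold and returns `None`, which the app can turn into a polite “not a relevant question” message.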
Furthermore, this approach doesn’t incur model training costs; only embedding costs, which are 1/600th the rate of an inference call. Lastly, there is no latency waiting for the LLM to process your request and locate your data. You have all the data, and the answer is a single pass over your 1,000 items away.
There are cases where a fine-tuned model is required. I don’t think this is one of them.
Another advantage of an embedding architecture is that you can shape the outcome with greater precision. For example, instead of picking the single highest similarity hit, you could take the top three, merge them into a larger prompt, and then use a completion prompt to describe the best approach. Or, offer workers three approaches that may be applicable. This is especially useful when multiple similarity hits come back clustered within a very tight score range: a tight clustering of similar items is an indicator of uncertainty, and letting the worker choose the final approach may be desired.
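The tight-clustering check can be sketched like this. The function name and the 0.02 spread are illustrative choices, not fixed parts of the approach:

```python
def top_matches(scored_items, k=3, spread=0.02):
    """scored_items: (similarity, response) pairs in any order.

    Returns the top-k hits plus a flag: when those scores cluster
    within `spread` of one another, the search can't confidently
    separate them, so the app might present all of them and let
    the worker choose.
    """
    ranked = sorted(scored_items, key=lambda pair: pair[0], reverse=True)[:k]
    clustered = len(ranked) > 1 and (ranked[0][0] - ranked[-1][0]) <= spread
    return ranked, clustered

hits = [(0.92, "Approach A"), (0.91, "Approach B"),
        (0.905, "Approach C"), (0.60, "Approach D")]
ranked, clustered = top_matches(hits)
# The top three scores sit within 0.02 of each other, so clustered is True
```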
UPDATE: BTW, in your current approach, does your prompt instruct your fine-tuned model to say it doesn’t know when the answer cannot be deduced from the stated context? Example …
"Answer the following question from the context, if the answer can not be deduced from the context, say 'I dont know' :\n\n" +
"Context :\n" +
"\n\nQuestion :\n" +