GPT-3.5-turbo fine-tuning plus document retrieval

My use case is conditional generation: using document retrieval to find the parts of the data model needed to produce a DB query from the user's question. After reading the documentation for the new GPT-3.5-turbo fine-tuning, I wonder a few things:

Does the system prompt in the training dataset for the new fine-tuning approach always have to be exactly the same?
If so, I cannot put the retrieved documents there (because they are dynamic) and should add them to the user prompt instead?

Can the static part of my prompt, which provides general guidelines for the model (currently GPT-4), be shortened to a brief generic prompt, with the fine-tuned model learning these guidelines from the provided examples alone? Does that increase the required example count? If so, by how much?
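To make the first question concrete, here is a sketch of what one training example might look like under that split, assuming the chat-format JSONL described in the fine-tuning guide: a short system prompt that stays identical across all examples, with the dynamic retrieved schema placed in the user message. The schema text, question, and query are made-up placeholders.

```python
import json

# Static instruction, identical in every training example.
STATIC_SYSTEM = "You translate user questions into SQL queries for our database."

def make_example(question, retrieved_schema, query):
    """Build one chat-format fine-tuning example (one JSONL line)."""
    return {
        "messages": [
            {"role": "system", "content": STATIC_SYSTEM},
            # The dynamic, retrieved part goes into the user message.
            {"role": "user",
             "content": f"Schema:\n{retrieved_schema}\n\nQuestion: {question}"},
            {"role": "assistant", "content": query},
        ]
    }

line = json.dumps(make_example(
    "How many orders shipped last week?",
    "orders(id, shipped_at, status)",
    "SELECT COUNT(*) FROM orders WHERE shipped_at >= date('now', '-7 days');",
))
```

At inference time you would then assemble the user message the same way, so the prompt layout the model sees matches what it was trained on.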

Thanks for any helpful answers

If you are trying to “teach” the model new information, embeddings are the way to go. If you want to change the structure or style of its responses, use fine-tuning.

From Fine Tuning Guide

When should I use fine-tuning vs embeddings with retrieval?

Embeddings with retrieval is best suited for cases when you need to have a large database of documents with relevant context and information.
By default OpenAI’s models are trained to be helpful generalist assistants. Fine-tuning can be used to make a model which is narrowly focused, and exhibits specific ingrained behavior patterns. Retrieval strategies can be used to make new information available to a model by providing it with relevant context before generating its response. Retrieval strategies are not an alternative to fine-tuning and can in fact be complementary to it.
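As a sketch of how the two can be complementary at inference time: the fine-tune fixes the output behavior (valid queries in your dialect), while retrieval injects fresh context on every request. The helper, model name, and context below are hypothetical placeholders.

```python
def build_messages(question, retrieved_chunks):
    """Combine retrieved context with the user question for a fine-tuned model."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": "You translate user questions into SQL queries."},
        {"role": "user", "content": f"Schema:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_messages(
    "Total revenue by region?",
    ["sales(region, amount, sold_at)"],  # chunks returned by your retriever
)

# With the openai SDK, these messages would then be sent to the fine-tuned model:
# client.chat.completions.create(model="ft:gpt-3.5-turbo:...", messages=messages)
```

The key point is that the same message layout is used when generating training examples and when calling the model, so the fine-tune and the retrieval step reinforce each other.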


Thanks for the reply.

The narrow focus here would be the generation of valid DB queries that might be more complex than what GPT-4 is able to produce by default.

This is exactly what I would like to understand better: how to combine the two.

I have the same question. Do we have an answer to this problem? Is it necessary to include the retrieved documents when fine-tuning?


I have the same question here. We have some proprietary product information. Should I fine-tune the AI with lots of examples, or feed it documents for retrieval? Also, what should the template for the file be?

Hi and welcome to the Developer Forum!

If you are unsure of how to proceed, I'd start with Assistants and upload your files; that way you need to do nothing to begin with, and you'll get an idea of the performance. Should you wish to refine that at a later date, you can use pure coding methods to provide custom context for the base API calls. Fine-tuning is very rarely useful for company datasets, unless it's for a particular style of speaking or way of thinking: fine-tunes do not encode raw data like prices and product specs, they encode the style of those product specs, the word choices.
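For reference, the Assistants route is roughly the following, assuming the openai Python SDK (v1.x) with the retrieval tool available at the time of this thread. The file name, instructions, and model name are placeholders; the API call is wrapped in a function so nothing runs without a key.

```python
def build_assistant_config(file_ids):
    """Assemble the assistant payload; kept as plain data so it is easy to inspect."""
    return {
        "model": "gpt-4-1106-preview",  # assumption: any retrieval-capable model
        "instructions": "Answer questions using the uploaded product data.",
        "tools": [{"type": "retrieval"}],
        "file_ids": list(file_ids),
    }

def upload_and_create(path):
    """Upload a file and create an assistant over it (requires OPENAI_API_KEY)."""
    from openai import OpenAI
    client = OpenAI()
    f = client.files.create(file=open(path, "rb"), purpose="assistants")
    return client.beta.assistants.create(**build_assistant_config([f.id]))
```

You would then create a thread, post the user's question as a message, and run the assistant against it; retrieval over the uploaded file happens automatically.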


Thanks for the swift response!
If I go with the Assistants route, what is my file supposed to look like? For example, if I created a table in Excel with product specs, facts, and information and uploaded it, would that work? I couldn't find any template examples here.

Export your Excel file to CSV and just upload that; that should be all you need to do. There may even be .xls and .xlsx converters, but they would only add lag to the calls, so CSV should work. You can also try PDF, or even JSON for complex hierarchical data.
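For a rough idea of the shape the CSV should take: one header row, one product per row. This sketch builds such a file in memory with the standard-library csv module; the column names and values are hypothetical, use whatever fields your spreadsheet already has.

```python
import csv
import io

# Hypothetical product-spec rows, mirroring columns in an Excel sheet.
rows = [
    {"sku": "A-100", "name": "Widget", "price_usd": "19.99", "spec": "2x4 cm, steel"},
    {"sku": "A-101", "name": "Gadget", "price_usd": "24.50", "spec": "USB-C, 5 V"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sku", "name", "price_usd", "spec"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()  # this text is what you'd save and upload
```

If you have pandas installed, `pd.read_excel("products.xlsx").to_csv("products.csv", index=False)` does the Excel-to-CSV conversion in one line.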