Poor output after fine-tuning with 800 prompt/completion pairs

Hello everyone,

Hope this is the right place for my question.
My question is basic, as I am just starting out and would like to clarify a few things. Essentially, I need to build a basic understanding of AI so that I can use it as a foundation for querying a set of data.

First, I want to rule out one specific use case: my goal is not to answer questions in a customer-service style.

Here are a couple of examples to clarify the final scenario. Case 1: input a list of products and a brief description of a company, and have the AI tell me which products that company would be most likely to buy. Case 2: if the company is already a customer, input its first five products and have the AI tell me which related products could be proposed.

I have tested this by giving ChatGPT a long prompt, and the responses I got were correct.

Now I have run a fine-tuning test, for which I structured a document with about 800 questions about the company (about 80 questions on vision and mission, and about 12 questions for each product). Then I ran a test in the Playground, but the AI's responses were somewhat random.
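For context, here is roughly how I am building the training file (the questions and answers below are placeholders, not my real data):

```python
import json

# Placeholder examples of my prompt/completion pairs.
# "\n\n###\n\n" is the separator and " " / trailing "\n" the
# whitespace conventions suggested in the fine-tuning docs.
pairs = [
    {
        "prompt": "What is the company's mission?\n\n###\n\n",
        "completion": " To provide sustainable packaging solutions.\n",
    },
    {
        "prompt": "What does Product A do?\n\n###\n\n",
        "completion": " Product A is a biodegradable food wrap.\n",
    },
]

# Write one JSON object per line (the JSONL format the
# fine-tuning endpoint expects).
with open("training_data.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```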

Did I use the wrong fine-tuning method? Is there something wrong with my Python code, or something else?
I set 4 epochs; should I set more?

In summary, I can do the fine-tuning and I can provide a lot of information, but I have the impression that something is never quite right. Can you enlighten me, or point me to documentation that shows the ideal prompt/completion pairs for training an AI for my type of case?

Thank you very much!

It seems like what you want is a recommendation system.
Fine-tuning is not the way to go.

Fine-tuning adapts a pre-trained model to new tasks or domains by learning new patterns from the new data. It is not as simple as “I see this, now I know it, and I will repeat it and apply it to my complete wealth of knowledge.” This is why embeddings are almost always recommended instead.
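For your product-recommendation case, the embeddings approach could be sketched like this. The vectors below are toy stand-ins; in practice you would get them from an embeddings endpoint, one for each product description and one for the company description:

```python
import numpy as np

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in vectors (made-up products; real embeddings would
# come from an embeddings model, not be hand-written like this).
product_embeddings = {
    "industrial glue": np.array([1.0, 0.0, 0.0]),
    "packing tape":    np.array([0.7, 0.3, 0.0]),
    "office chairs":   np.array([0.0, 0.1, 1.0]),
}
company_embedding = np.array([0.9, 0.1, 0.0])  # e.g. "a packaging company"

# Rank products by similarity to the company description;
# the top results are the recommendation candidates.
ranked = sorted(
    product_embeddings.items(),
    key=lambda kv: cosine_similarity(company_embedding, kv[1]),
    reverse=True,
)
for name, _ in ranked:
    print(name)
```

The same ranking trick covers your second case as well: embed the five products the customer already buys, average them, and rank the rest of the catalog against that average.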

You have found that ChatGPT on its own works great with a few few-shot examples, so why not just use a more up-to-date model such as GPT-3.5 with those few-shot examples?
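A minimal sketch of what such a few-shot prompt could look like (the companies and products here are made up for illustration):

```python
# A few worked examples followed by the new case, assembled into a
# single completion-style prompt.
few_shot_examples = [
    ("A bakery chain", "Recommended products: flour mixer, oven racks"),
    ("A law firm", "Recommended products: document scanner, shredder"),
]

new_company = "A packaging manufacturer"

prompt = "Given a company description, recommend products from our catalog.\n\n"
for description, answer in few_shot_examples:
    prompt += f"Company: {description}\n{answer}\n\n"

# End with the unanswered case so the model completes the pattern.
prompt += f"Company: {new_company}\nRecommended products:"

print(prompt)
```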

I think there’s a huge misunderstanding that the base models available for fine-tuning are comparable to ChatGPT - they are not. It would be like stepping out of a completely modified Honda Civic purpose-built for speed and expecting a generic base model from Pop’s Car Depot to run the same.

You should be trying few-shot prompts with whatever model you are attempting to fine-tune - which is most likely base “davinci”, not “text-davinci-003”.

If you decide you want to continue with fine-tuning, you should build a train/validation set (usually an 80:20 split) and check at intervals how good your training data is and how the model is adjusting to it.
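A quick sketch of that 80:20 split, with index placeholders standing in for your 800 examples:

```python
import random

# Placeholder "examples": in practice this would be the list of
# prompt/completion records loaded from your JSONL file.
examples = list(range(800))

random.seed(0)            # fixed seed so the split is reproducible
random.shuffle(examples)  # shuffle before splitting to avoid ordering bias

split = int(len(examples) * 0.8)
train, validate = examples[:split], examples[split:]

print(len(train), len(validate))  # 640 160
```

You then fine-tune only on the train set and evaluate on the held-out validation set; if validation quality is poor while training quality looks fine, that points at noisy or inconsistent training data rather than the epoch count.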

Running 800 pieces of training data blindly will result in noise.
