Problems with fine-tuning 3.5 model

Hello, I am using fine-tuning for advanced search of faq articles on my website. With old gpt-3 models my training files looked like this:

{"prompt": "How to create video\n\n###\n\n", "completion": "123\n\n###\n\n"},
{"prompt": "How to use video\n\n###\n\n", "completion": "123\n\n###\n\n"},
{"prompt": "Can I create video\n\n###\n\n", "completion": "123\n\n###\n\n"},

Where numbers in completion fields means ids of corresponding faq article. And it worked fine. Then I decided to upgrade to gpt-3.5 fine-tuning, I changed my files to this format:


"{"messages": [{"role": "system", "content": "You are a chatbot for website."}, {"role": "user", "content": "How to create video"}, {"role": "assistant", "content": "123"}]}",
"{"messages": [{"role": "system", "content": "You are a chatbot for website."}, {"role": "user", "content": "How to use video"}, {"role": "assistant", "content": "123"}]}",
"{"messages": [{"role": "system", "content": "You are a chatbot for website."}, {"role": "user", "content": "Can I create video"}, {"role": "assistant", "content": "123"}]}",

But in result quality of searching became worse then with old models. Maybe I should change the file format, or something else, because I thougth 3.5 model should be more accurete then previous.

Hi and welcome to the Developer Forum!

Is the system prompt in the fine tune set the same as the one you use in your application?

Hi, thanks. Yes, I use the same system message in request body, for example:

{
    "model": "my_ft_model",
    "messages": [
        {
            "role": "system",
            "content": "You are a chatbot for website."
        },
        {
            "role": "user", 
            "content": "How can i insert a video?"
        }
    ],
    "temperature": 1,
    "n": 3
}

I tried to use a different value for temperature from 0 to 2, but it didn’t affect a lot. When I used a 3.0 models and send the same message that was in training file prompt it always gives right answer, but it 3.5 it not always like this.

How many examples are there in your training set?

For every article there is near 20 examples of different prompts, and the same in validation file.

Ok, how many articles are there and can you post your API calling code please. (The code you use to run queries on the model, please include any setup code that it uses)

The is 400+ articles. What exactly api code you need, I wrote the example of data I use in request for openai?

I need to see if you are in fact calling the fine-tuned model, and I also need to see if there are any other issues that might be code related.

It could also be that 20 examples over 400 articles is insufficient data to get a good response, but you are saying that it worked with a prior model, correct?

Yes it worked prior with previous models like curie or babbage. As for code I use simple curl request in php like this:

<?php

$curl = curl_init();
curl_setopt_array($curl, array(
  CURLOPT_URL => 'https://api.openai.com/v1/chat/completions',
  CURLOPT_RETURNTRANSFER => true,
  CURLOPT_ENCODING => '',
  CURLOPT_MAXREDIRS => 10,
  CURLOPT_TIMEOUT => 0,
  CURLOPT_FOLLOWLOCATION => true,
  CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
  CURLOPT_CUSTOMREQUEST => 'POST',
  CURLOPT_POSTFIELDS =>'{
    "model": "ft:gpt-3.5-turbo-0613:my-website::83hg8PKQ",
    "messages": [
        {
            "role": "system",
            "content": "You are a chatbot for website."
        },
        {
            "role": "user", 
            "content": "How can I add video?"
        }
    ],
    "temperature": 1,
    "n": 3
}',
  CURLOPT_HTTPHEADER => array(
    'Authorization: Bearer my_aki_key',
    'Content-Type: application/json'
  ),
));

$response = curl_exec($curl);

In response it gives a bunch of numbers which should be ids of articles that was in training file::

[
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "4892"
            },
            "finish_reason": "stop"
        },
        {
            "index": 1,
            "message": {
                "role": "assistant",
                "content": "4892"
            },
            "finish_reason": "stop"
        },
        {
            "index": 2,
            "message": {
                "role": "assistant",
                "content": "4240"
            },
            "finish_reason": "stop"
        }
    ],

but now it often returns invalid or unesist ids.

The only other thing to try is increasing the number of epochs used while training, I’m not sure if it’s the numerical nature of the responses that is causing the results you are seeing.

I used maximum number epochs = 10, but it didn’t do better. What maximum epochs number can be used?