Issues with Fine-Tuned Babbage-002 Model Returning Incorrect Completions

michael.nguyen · December 19, 2023, 3:09am

Hi everyone,

I’m encountering a challenge with my fine-tuned Babbage-002 model and seeking your help.

Context: We’re using the Babbage-002 model for classifying callers’ intents in a customer service setting. The intents include something like “TalkToUser”, “LeavingMessage”, “BookMeeting”, or just “IdleTalk”, etc. Our training set is formatted as follows:
{“prompt”: “Hi, is Ben here? I would like to talk to him real quick ->”, “completion”: “TalkToUser##”}

Issue: Since we started using Babbage-002, we’ve observed numerous instances where the model returns incorrect completions. These aren’t just slightly off but are completions that don’t exist in our training set. For example,a prompt like “My package arrived damaged, what should I do?”, the model returned “RequestComplaintService” as the completion, whereas the correct and expected completion from our training set is “ComplaintsIssues”. Notably, “RequestComplaintService” isn’t a completion we’ve trained the model with.

Questions:

How can we ensure that the fine-tuned model strictly adheres to the completions provided in the training set?
Is there a way to restrict the model’s completions to only those that are present in our training data?

Thank you very much for your help.

Macha · December 19, 2023, 4:33am

So, I guess before I dive into how to solve and approach this issue, can I ask why are you using Babbage-002 and not GPT-3.5 turbo in terms of fine tuning?

I don’t hear this model type being used as much in the wild (plus, isn’t it supposed to be deprecated eventually?), and I don’t know the quirks of babbage as well, so I’m not sure how well the advice I can give can translate from GPT to babbage in this sense.

Regardless, it should be noted that LLMs work best with some degree of fuzziness. You shouldn’t expect 100% accuracy. Now, we can do our best to reduce the frequency of its undesirable responses, but you should write your script assuming there will be potential errors like this, and handling them when appropriate.

Does babbage allow you to specify the system or role prompt?

rashmy · December 19, 2023, 5:31pm

Hi,
Did you find the answer to this question?

I am also having issues in fine-tuning the model. As I understand the Classification endpoint is also deprecated. Could someone please outline the steps or point to a example that shows how to use OpenAI API for intent classification with a dataset that is labelled?

Here is what I am looking for -

Format of the dataset in JSONL.
Should the data in JSONL be created with embeddings to ensure the context helps in determining the intent correctly
What parameters to use during fine-tuning to ensure better prediction?
Which API should I use during prediction with a given user text? Is in the chat.completion API?

Thanks

DesDonnelly · December 19, 2023, 10:54pm

@michael.nguyen

Hi,
greetings from Ireland…
just a suggestion, you could touch base with @Syma perhaps…

Regards
Des

_j · December 19, 2023, 11:19pm

babbage-002 is an extremely small model, with specifications not published, but the price that compares to the former ada should inform us of the compute. The generation is of very high perplexity.

It will need low temperature and top-p to constrain its outputs.

Then there is the topic of the actual fine-tune of a base model. The AI comes with no instruction-following skills, only the completion of text. You can compare to community efforts of fine-tuning larger open-source AI models, using days of high-end compute and instruction sets now approaching millions, to ponder your ability to achieve success.

It seems the problem with the original case is “inference” vs “overfitting”. Do you want the model to be able to infer the style of response it should produce, for breadth of answering, or do you need it to only produce from a very specific set of responses? The latter will take much more reinforcement (epochs: learning passes through your training data).

If you have a fixed set of tokens to be output, a max_tokens=1 application, you can use a large logprob_bias dictionary to promote those probabilities.

rashmy · December 20, 2023, 2:49am

Thank you for the information.

I have a domain-specific set of questions, answers and intent labels. This is a fixed dataset and at the moment it is a small set (would grow to 10,000 eventually which is also not huge). Each intent label is a single token. I am hoping to use a base-model that can be fine-tuned for my use case so that it can correctly return the desired intent.

Example dataset:
user: I am not able to find the schedule for course A, intent: course_schedule, bot: You can view the full schedule here.
user: who and where is the admin for department A located, intent: dept_info, bot: You can view department information here.
user: outline the steps to register for course A, intent: course_registration, bot: Sure, I can help with that. Follow the below steps.

If you can point me to an example, that would be great!

curt.kennedy · December 20, 2023, 3:12am

I use Babbage-002 for classification all the time.

But I restrict the output to 1 token, so the output is a single integer like 0, 1, 2, etc. Then you map this to your more verbose definitions. But you are using a multi-token output, hence why it gets “confused”.

Make sure you have lots of training data too. So thousand or hundreds per category. Otherwise, with less training data you need to go to a higher model. Then later you can downgrade to Babbage after you get more training data, and reduce price, if that’s a big driver.

Also, set the temp to 0, and you should be good to go.

_j · December 20, 2023, 3:39am

output:
intent: course_registration

From that, you have the natural separator prompt, “\n intent:” to place in your prompt, but then two generated tokens, " course" and “_registration”. The token without underscore might have stronger underlying semantics, being rank 12506 instead of 50201.

I don’t see why this isn’t a strong technique, if you constrain the formatting to token-based category, subcategory. You get the AI producing the first token, and that guides it into acceptable follow-up token.

The challenge is validation of learning with small output, I think. Training loss and validation loss will show more how expected the input is than how correct the output is, the low point of the learning curve not telling us much. You’ll have to just continue deepening a tune until the results are as you wish, now fortunately with fine-tune continuation where you can build on the exist fine-tune, and also more learning hyperparameters. Manually testing your held-out set.

Also, if the AI spits out a complete out-of-scope token, you can probe logprobs to find the first allowed by your categories.

rashmy · December 20, 2023, 5:22pm

Great!

Could you please share the format of the file to upload for fine-tuning?

Is it?
prompt:I am not able to find the schedule for course A
completion: 1

Or should it be in the in the new “messages” format that uses “role/content”. In the new format, I was not sure where to set the mapped intent label “0, 1…”.

Thank you!

curt.kennedy · December 20, 2023, 6:54pm

Format is just lines like this:

{"prompt": "ok", "completion": "1"}
{"prompt": "bad", "completion": "0"}
{"prompt": "neat", "completion": "1"}
{"prompt": "boo", "completion": "0"}

This is called the “JSONL” format.

In the past you had to use a stop sequence at the end of your prompt (like "\n\n###\n\n"), and a prepending space on your output token (like " 1" instead of "1").

But now you don’t have to do any of this, so it’s really straightforward.

_j · December 20, 2023, 7:25pm

At the end of your prompt, you describe separator, not stop. And after the prompt of completions, you still would need to use a separator, although, like I noted above, that separator can be something natural and logical to you like “\nAI:”.

The AI will want to keep on writing your input for you otherwise. Even gpt-3.5-turbo-instruct will “complete” without a natural break, like carriage returns.

With gpt-3.5-turbo fine-tune, in contrast, the training is formatted in the special tokens of the chat container, so it gets the (end of message) token after final user role message, followed by the (start of message)assistant(middle of message) token sequence automatically added, finally cuing the AI where to write in its own trained style and where the assistant fine-tune is placed. Thus, you need no more than the messages themself.

rashmy · December 20, 2023, 11:43pm

HI,
I tried using the JSONL format having field “prompt” and “completion”. This does not work when used with the model gpt-3.5-turbo_1106. It needs the new format with fields “role”, “content”.

Unfortunately, I tried a few ways to create my dataset in the new format for intent classification and it does not work.

Even the documentation on the new format does not give examples on how it can be used for intent classification. It has examples for chat conversation format.

Could someone please, share if they are able to successfully use the new format for intent classification where we have a known set of intents into which the given text must be classified?

Thank you.

curt.kennedy · December 21, 2023, 12:01am

This thread is about the base model Babbage-002. Chat fine-tuning has a different training structure using ChatML.

I suggest starting with the docs:

https://platform.openai.com/docs/guides/fine-tuning

_j · December 21, 2023, 12:09am

Note, that “new format” comes along with a chat model that has already been trained on being like ChatGPT. gpt-3.5-turbo.

It is hard to make a fair classifier when the weights of the first token are already biased towards “Sure!” or “Certainly!”.

The best chance of success will be with davinci-002, which I suspect is similar quality/size to gpt-3.5-turbo, but with all (most of?) the application-specific training peeled away.

Topic		Replies	Views
Using the new fine-tunes endpoint for binary classification API fine-tuning , python	10	2203	January 11, 2024
Advice needed for JEL code prediction fine-tuning task Prompting fine-tuning	2	718	May 9, 2023
Struggling with poor performance on fine-tuned davinci model API	15	2677	December 20, 2023
GPT3 Finetuning for Multilabel Classification API	26	9356	October 17, 2024
Fine tuning completation API	9	2386	December 25, 2023

Issues with Fine-Tuned Babbage-002 Model Returning Incorrect Completions

Related topics