Did you find directional (curly) quotes and non-standard hyphens in your training data?
You can clean the data up and then run a continuation fine-tune for another epoch or two, and if you have held-out validation data, include that as well. At worst, you simply don’t use the new model.
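If it helps, here’s a minimal sketch of that clean-up-and-continue flow using the openai Python library; the file names, the `ft:` model ID, and the epoch count are placeholders for your own values:

```python
# Sketch only: normalize curly quotes/dashes in the training JSONL, then
# continue fine-tuning from the existing model. File names, the ft: model ID,
# and the epoch count are placeholders.
import json
from openai import OpenAI

REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en/em dashes
}

def normalize(text: str) -> str:
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

with open("train.jsonl") as src, open("train_clean.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        for message in example["messages"]:
            message["content"] = normalize(message["content"])
        dst.write(json.dumps(example) + "\n")

client = OpenAI()
train = client.files.create(file=open("train_clean.jsonl", "rb"), purpose="fine-tune")
valid = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="ft:gpt-3.5-turbo-0613:my-org::abc123",  # your existing fine-tune, to continue from it
    training_file=train.id,
    validation_file=valid.id,
    hyperparameters={"n_epochs": 2},
)
print(job.id, job.status)
```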
In specialized cases, a validation file might make it look like you are past the point of overfitting, yet actual compliance and user satisfaction keep rising for that application, and you can still improve generalization when the gaps in training lie between two types of your own questions.
OpenAI models have already been trained on millions and millions of training questions that are rated by tons of outsourced workers and then tuned with reinforcement learning.
You do not need to fine-tune gpt-3.5-turbo to be a customer support assistant, especially not on grammatically awkward chat that has little to do with the user input and uses dumb placeholders that only train the model to output placeholders.
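To make the placeholder point concrete, here’s an invented example of the kind of training line I mean (the `{PLACEHOLDER}` fields are made up); fine-tune on enough of these and the model mostly learns that every answer is a template:

```jsonl
{"messages": [{"role": "user", "content": "my order still hasnt arrived"}, {"role": "assistant", "content": "Dear {CUSTOMER_NAME}, we are sorry about {ISSUE}. Your ticket is {TICKET_ID}. Regards, {AGENT_NAME}"}]}
```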
It’s interesting what you say (“You do not need to fine-tune gpt-3.5-turbo to be a customer support assistant”). I’ve done some research to understand GPT and customer support, and there are various articles from experts in the field, like this one in Forbes:
that point at major problems. One of the critical ones is “It [ChatGPT] Provides Different Answers Every Time”, and the author probably has a point: an assistant that answers differently every time is hard to use in a business environment, and my own checks indicate this is often true.
ChatGPT, the web chatbot, provides different answers, and that’s by design. Not only is unexpected word use (instead of always the most probable token) seen as more human and inspired, it also lets OpenAI gather good and bad responses to the same questions.
However, in the API, we can control the exact sampling parameters. The AI can say the same thing 100 times to the same input if I want to pay for it. Or I can have little Timmy’s day be different every time the AI writes about it.
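As a small illustration (the model name and prompt are arbitrary), the chat completions API lets you pin sampling down for near-identical answers, or loosen it when you want variety; note that the `seed` parameter is best-effort, not a hard guarantee:

```python
# Sketch: the same request with sampling pinned down vs. left loose.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Write one sentence about little Timmy's day."}]

# Near-deterministic: temperature 0 plus a fixed seed (seed is best-effort,
# not a hard guarantee of byte-identical output).
pinned = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0,
    seed=42,
)

# Varied: a higher temperature lets less probable tokens through.
loose = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=1.2,
)

print(pinned.choices[0].message.content)
print(loose.choices[0].message.content)
```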
And even if the AI starts a sentence with a different word or two, once it has a plan for what it is going to write, less likely token choices will rarely distract it from the topic.
There are near-infinite token combinations I could have used to write this reply, and who knows why I chose “Chat” as the first human-readable token, but the overall idea was fully formed by the input I was responding to.
You are right, it’s the changes in responses and the bad responses that concern us. Customer support is a sensitive area, and bad responses hurt the business more than in other services, like search for example. That’s the reason for fine-tuning with specific data: questions and (correct) answers are linked by the training.
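For reference, that linking happens through the chat fine-tuning file format, one JSON object per line, with each question sitting next to the exact answer it should learn (the company name and policy wording below are invented):

```jsonl
{"messages": [{"role": "system", "content": "You are Acme's support assistant."}, {"role": "user", "content": "Can I return a blender I bought 20 days ago?"}, {"role": "assistant", "content": "Yes. Acme accepts returns within 30 days of purchase with a receipt; the refund goes back to your original payment method within 5-7 business days."}]}
```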
“Bad”, in terms of OpenAI gathering training data through ChatGPT, means collecting comparative answers that are more or less satisfactory to the end user, and then to the knowledge workers who refine the training data.
“Bad” for you might be a bot that was fine-tuned on certain behaviors but doesn’t have your company policies set in stone, which would be possible if the AI could call a function to search your business manuals (a rough sketch of that follows).
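Here is what that could look like with the chat completions `tools` parameter; `search_policy_manual` is a hypothetical function you would implement against your own documentation, not something built in:

```python
# Sketch: let the model request a policy lookup before answering.
# search_policy_manual and its backing search are hypothetical, not built in.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_policy_manual",
        "description": "Look up the company's written policy on a support topic.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Topic to search, e.g. 'refund window'"},
            },
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # or your fine-tuned ft: model
    messages=[
        {"role": "system", "content": "Answer support questions only from retrieved policy text."},
        {"role": "user", "content": "I want a refund on last month's order."},
    ],
    tools=tools,
)

# If the model asked for the tool, run your own search and send the result back
# in a follow-up "tool" message before letting it answer the customer.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)
```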
Like if this bot, without such grounding, went along with the user and promised a refund, disregarding your company policy: