Fine tuning seems to start training from scratch

Hi guys!
I wanted to fine-tune the davinci model to learn CoinMarketCap data added after 2021, so my model knows about crypto tokens listed on CoinMarketCap after that time and can answer questions about new crypto tokens detected in documents.
I have generated training samples in the following format:
{"prompt": "is SPOT a cryptocurrency token?",
 "completion": "yes, it's a cryptocurrency token introduced by CoinMarketCap at 2022-12-25"},
{"prompt": "is Spot a cryptocurrency token?",
 "completion": "yes, it's a cryptocurrency token introduced by CoinMarketCap at 2022-12-25"},
{"prompt": "what is the adjacent CoinMarketCap name for the new cryptocurrency token SPOT?",
 "completion": "Spot is the CoinMarketCap name for this cryptocurrency token"},
{"prompt": "what is SPOT rank on CoinMarketCap?",
 "completion": "SPOT rank in CoinMarketCap is 5577"}
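For reference, the legacy fine-tuning endpoint expects this data as a JSONL file (one JSON object per line), not a Python list of dicts. A minimal sketch of writing such a file, reusing two of the samples above (the filename is an arbitrary choice):

```python
import json

# Two of the prompt/completion samples above; the legacy fine-tuning
# endpoint expects one JSON object per line ("JSONL").
samples = [
    {"prompt": "is SPOT a cryptocurrency token?",
     "completion": "yes, it's a cryptocurrency token introduced by CoinMarketCap at 2022-12-25"},
    {"prompt": "what is SPOT rank on CoinMarketCap?",
     "completion": "SPOT rank in CoinMarketCap is 5577"},
]

# Write one JSON object per line.
with open("training_data.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

The resulting file is what gets uploaded (e.g. via the Files API) to obtain the `training_file` ID used when creating the fine-tune job.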

Now the problem is that I have fine-tuned the davinci model, but I get weird results: it seems not to have been fine-tuned, but rather to have started learning from scratch and cleared its own knowledge.
The model's response to the word "hello" is " MMMMMMMMMMMMMMMMMM"!
I would really appreciate the help.

Hi @vhdmsm,

Can you share some more details about the prompt you used for training and the function call you made?

In principle, fine-tuning should build on top of the existing knowledge.

My prompt samples are in the main question.


These are the parameters I set for training:
create_args = {
    "training_file": "file-34Ok5rxlHnpCockzJLLx0221",
    "model": "davinci",
    "n_epochs": 10,
    "batch_size": 3,
    "learning_rate_multiplier": 0.3
}

And this is how I start training with my args:
fine_tune_response = openai.FineTune.create(**create_args)

Thanks. Can you also share the function call you are making to the fine-tuned model to get your response back, as well as the prompt you are using?

It does not have to be exactly that, but just the gist of it (for the prompt).

This is how I get the response after training:

new_prompt = """hello"""

answer = openai.Completion.create(
    model=fine_tuned_model,
    prompt=new_prompt,
    max_tokens=100,
    temperature=0
)
answer["choices"][0]["text"]

Using "hello" as a starting prompt might not be the best check here. If you try sending a normal question, or a question regarding your fine-tuning, how does it respond?

Even then, it seems confused:

new_prompt = """is XRP a crypto token?"""
response = """no, it's been untracked in CoinMarketCap since 2023-07-08 12:48:00+00:00:00:00+00:00:00:00:00+00:00:00:00:00:00:00+00:00:00:00:00:00+00:00:00:00:00:00:00+00:00:00:00:00:00:00:00:00"""

"Over"-tuned on too little data, probably?

My main issue here is why the fine-tuned model forgot how to answer "hello"!
Where did the base knowledge go?

Catastrophic forgetting… an old and nasty ML issue. xD

I thought ChatGPT was immune to this issue, or at least much less affected by it!

That looks like you're not using a proper stop token.
The model will generate random garbage in the output buffer after it has produced the intended answer.
Your completions also do not use a consistent stop token, like END or +++ or <|endoftext|> (which you should then provide as a stop token when running inference).
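A sketch of what that could look like, assuming an arbitrary " END" marker appended to every training completion (the marker itself is a free choice; "<|endoftext|>" works too):

```python
STOP = " END"  # arbitrary stop marker; "<|endoftext|>" is another common choice

def with_stop(sample):
    """Append a consistent stop marker to a training completion."""
    return {"prompt": sample["prompt"],
            "completion": sample["completion"] + STOP}

sample = {"prompt": "is SPOT a cryptocurrency token?",
          "completion": " yes, it's a cryptocurrency token introduced by CoinMarketCap at 2022-12-25"}
print(with_stop(sample)["completion"])

# At inference time, pass the same marker so generation halts there:
# openai.Completion.create(model=fine_tuned_model, prompt=..., stop=[" END"])
```

With this in place, the model learns to emit the marker at the end of every answer, and the `stop` parameter cuts generation off before it can drift into garbage.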

Also, 10 epochs seems like way too many. I’d start with 2 or 3.
I’d also try very low learning rates.
Also, I’d put variants of the same question into the training data:

  • Is WEED a crypto token? → Yes, it was introduced on 2023-04-20<|endoftext|>
  • When was WEED introduced? → WEED was introduced on 2023-04-20<|endoftext|>
  • On 2023-04-20, what tokens were introduced? → On 2023-04-20, the tokens WEED, SCAM, and BEST were introduced.<|endoftext|>
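Putting the hyperparameter suggestions together, a more conservative configuration might look like this (the exact values are illustrative starting points, not tested recommendations):

```python
# Conservative starting point for the legacy FineTune API; the values
# here are assumptions to tune against held-out data, not prescriptions.
create_args = {
    "training_file": "file-34Ok5rxlHnpCockzJLLx0221",  # the uploaded JSONL
    "model": "davinci",
    "n_epochs": 2,                     # instead of 10
    "learning_rate_multiplier": 0.05,  # instead of 0.3
}
# fine_tune_response = openai.FineTune.create(**create_args)
```

Fewer epochs and a lower learning-rate multiplier reduce how hard the small dataset overwrites the base model's weights.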

Generating training data for LLMs is hard work, just like curating datasets for any data science project.

Nah, currently no one can fully mitigate this issue. There have been papers on multi-task learning to preserve the original abilities, but overall performance will still drop.