Finetuning not working?

I am experimenting with fine-tuning, so I created a very small JSONL file (I know these should normally be much larger, but I thought I would give it a shot).

Original jsonl: training_prompts_bunny.json · GitHub
Prepared: training_prompts_bunny_prepared.jsonl · GitHub

This is what prepare_data suggested, and I accepted, BTW (a sample of the result is shown after the list):

  • [Recommended] Add a suffix ending \n to all completions [Y/n]: Y
  • [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
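For reference, after accepting those two suggestions each prepared line ends up with a completion that starts with a space and ends with \n, roughly like this (the question and answer here are made up for illustration, not copied from the gist):

{"prompt": "How old are you?", "completion": " I am two years old.\n"}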

Then, I went ahead and wrote this very simple python program to test my prompts:

import os
import openai

# Read the API key from the environment so the script runs as-is.
openai.api_key = os.getenv("OPENAI_API_KEY")


def ask(question):
    prompt_text = "The following conversation is between Hoppy, the bunny and a human. \n\nHoppy: Hello!\n\nHuman: " + question + "\n\nHoppy:"
    response = openai.Completion.create(
        model="davinci:ft-personal-2022-12-29-00-14-18",
        prompt=prompt_text,
        temperature=0.8,
        top_p=1,
        max_tokens=100,
        frequency_penalty=0.0,
        presence_penalty=0.3,
        stop=["\n"]
    )

    # Pull the generated text out of the first (and only) choice.
    response_text = response['choices'][0]['text']

    return response_text


def start_chat():
    # Simple loop: read a question, print the model's reply.
    while True:
        input_text = input("Ask: ")
        response = ask(input_text)
        print("Response: ", response)


def main():
    start_chat()


if __name__ == "__main__":
    main()

But it’s not really working… I am getting quite wrong responses, as if my fine-tuned model was not really taking my JSONL into account:

Ask: How old are you?
Response:   I am two years old.
Ask: How old are you?
Response:   I am a bunny. I am 2 years old.
Ask: How old are you?
Response:   I am 8 months old.
Ask: How old are you?
Response:   I am a bunny!
Ask: How old are you?
Response:   I am 1.
Ask: Who are you?
Response:   I am Hoppy.
Ask: Where do you live?
Response:   I live in the forest.
Ask: Where do you live?
Response:   I live in the forest.
Ask: Where do you live?
Response:   I live under the hill.
Ask: Where do you live?
Response:   I live in a forest.
Ask: Where do you live?
Response:   I live in a rabbit hole.
Ask: Do you have a brother?
Response:   Yes I have a brother!
Ask: Who is your brother?
Response:   Goppy.
Ask: Who is your brother?
Response:   I have no brother.
Ask: Who is your brother?
Response:   I have no brother.
Ask: Who is your brother?
Response:   Bear.
Ask: Who is your brother?
Response:   My brother is Hoppy.
Ask: Who is your brother?
Response:   I am.
Ask: Do you have friends?
Response:   Yes, I have 2 friends.
Ask: Who are your friends?
Response:   I have a bear, a pig and a bunny.
Ask: How old are you?       
Response:   I am 5 years old.
Ask: How old are you?
Response:   I am two years old.
Ask: How old are you?
Response:   I am six months old.
Ask: How old are you?
Response:   I'm 2 years old.
Ask: Do you go to school?
Response:   No. I'm a bunny. 
Ask: Do you go to school?
Response:   Yes, I go to school.
Ask: Do you go to school?
Response:   Yes, I do.
Ask: Do you go to school?
Response:   Yes, I do.
Ask: Do you go to school?
Response:   Yes! I go to Jump-n-Go bunny school.

Is this due to a lack of quality training data? Or maybe I am doing something wrong in the JSONL file format or the Python code itself?

Thanks a lot in advance! :slight_smile:

Welcome to the community!

How many samples are you using? I think the recommended minimum is 200, although it usually takes a lot more data to get it working. The general rule of thumb is that if your current dataset isn’t getting good results, double it if you can. The quality of the data matters too.

Hope this helps!

1 Like

Have you tried reducing the temperature at all? You could also consider reducing the pool of available answers by setting top_p to a lower value (maybe 0.3).

I have spent a bit of time figuring out the difference between top_p and temperature for a course I am writing.

STAGE 1:

When the AI builds a sentence, it starts off by building a list of words/answers that it thinks fit the situation. This will also include things it learned before your training (e.g. that a rabbit lives in a rabbit hole or under a hill in online children’s books).

Each word/token has its own probability score of being correct.

Once it has built the list of words, it uses top_p to slice off the least probable answers.

But, because your score for top_p is 1, it keeps the entire list.

If you were to set top_p to 0.3 (or similar), it would keep only the most probable tokens that together cover 30% of the probability mass and drop everything below that (this would completely eliminate hundreds of “long tail” or low-priority answers).
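If it helps to see the idea, here is a toy sketch of that slicing step in Python (the words and numbers are invented; this is the general “nucleus sampling” idea, not the actual GPT internals):

# Toy illustration of top_p filtering, not the real implementation.
candidates = {"hole": 0.45, "forest": 0.25, "hill": 0.15, "meadow": 0.10, "moon": 0.05}

def top_p_filter(probs, top_p):
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append(token)
        total += p
        # Stop once the kept tokens cover top_p of the probability mass.
        if total >= top_p:
            break
    return kept

print(top_p_filter(candidates, 1.0))  # keeps the whole list
print(top_p_filter(candidates, 0.3))  # keeps only the single most likely token here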

STAGE 2:

Once it has this list, then temperature kicks in.

A temperature of 0 makes the model simply pick the most probable option every time.

But a temperature of 0.8 gives quite a lot of weight to the lower-probability options. The unusual or low-probability answers get a boost relative to that. They suddenly become more likely to be picked.
(Out of interest, a value of 2 pushes all tokens/words much closer to even chances of being picked.)

So, in effect, your line where you give a temperature of 0.8 is telling the AI to consider the less likely answers (with considerable bias towards variety).
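And here is the temperature side of it, again just a toy sketch with made-up scores (real models work over thousands of tokens, and a temperature of 0 is handled as “always pick the top token” rather than a division by zero):

import math

logits = {"hole": 2.0, "forest": 1.0, "hill": 0.5}  # made-up raw scores

def softmax_with_temperature(scores, temperature):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = {token: score / temperature for token, score in scores.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {token: math.exp(v) / total for token, v in scaled.items()}

print(softmax_with_temperature(logits, 0.2))  # nearly all weight on "hole"
print(softmax_with_temperature(logits, 1.0))  # the unscaled probabilities
print(softmax_with_temperature(logits, 2.0))  # much closer to even chances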

TRY THIS:

I would try a top_p of 0.3 AND a temperature of 0.3.
I know the documentation doesn’t recommend setting both, but the two settings do different things.
In your case, top_p may be more important (see the sketch below).
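In your script that would just mean changing the two values in the Completion.create call, something like this (everything else copied from your snippet):

response = openai.Completion.create(
    model="davinci:ft-personal-2022-12-29-00-14-18",
    prompt=prompt_text,
    temperature=0.3,   # less of a boost for unlikely answers
    top_p=0.3,         # only keep the most probable slice of candidates
    max_tokens=100,
    frequency_penalty=0.0,
    presence_penalty=0.3,
    stop=["\n"]
)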

Side Note: I have a hypothesis (untested) that the tokens from the fine-tuning text are loaded with a higher probability score anyway. I assume this because it would be the only way to override the massive weight of knowledge the AI already has from its existing learning.

If the AI had read hundreds of books where the rabbit lived in a hole, your single rule about living in the forest would have very little overall weight and would be considered to be one of the hundreds of outliers or long-tail options.

Of course I don’t know if this is true because I’m not privy to the inner workings of GPT.

Edited: for spelling and the word “doesn’t” instead of “does”

4 Likes

Thanks a lot! I will try to experiment with these values.

What confused me (and still does, somewhat) is that if I include these questions and answers as part of the initial prompt, almost as-is, then GPT-3 “responds” very consistently and follows my “personality” traits much, much more closely. Is my observation correct that if I always provide the personality as “context” (pass it in the prompt every time), then it will work better? I wanted to avoid this because of the excessive token usage…

But if this is true, then how can a company get a custom model trained that will precisely answer company-specific inquiries? Wouldn’t there be a risk that the chatbot answers something odd?

Thanks!

1 Like

If you want to limit the responses to a specific domain or data set, you may be forced to use embeddings. However, that is an entirely different way of doing things and involves quite a lot of work at your end.

See if you can get your training working.

If you can’t, I will have some videos about embeddings that I can share in a few days. (Busy recording them as I speak.)
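In the meantime, the rough shape of the embeddings approach is: embed your facts once, embed each incoming question, find the closest fact, and paste it into the prompt before asking for a completion. A minimal sketch (the model name, the facts, and the overall layout here are just placeholder assumptions, not a finished implementation):

import openai

facts = [
    "Hoppy lives in the forest.",
    "Hoppy is two years old.",
]

def embed(text):
    # Ask the API for the embedding vector of a piece of text.
    result = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return result["data"][0]["embedding"]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Embed the facts once up front.
fact_vectors = [(fact, embed(fact)) for fact in facts]

def most_relevant_fact(question):
    # Embed the question and return the fact whose vector is closest to it.
    q_vec = embed(question)
    return max(fact_vectors, key=lambda fv: cosine_similarity(q_vec, fv[1]))[0]

print(most_relevant_fact("Where do you live?"))
# The winning fact then gets pasted into the prompt sent to the completion model.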

2 Likes

Thanks! Now the only thing that still confuses me a bit: if the prompt itself contains the “personality” and responses to specific questions, then GPT-3 replies much better than when the same “training” data is provided via the JSONL approach. Why does the local “context” perform better here? Does it also carry a much bigger weight than regular training data? Thanks!

I’ve been thinking about this overnight and may have come up with a better solution. We may be overcomplicating things.

Because GPT is really good at following a pattern and doing “completions”, maybe you could change the format of the training data slightly.

I assume your prompt is being sent in a chat format where you have something similar to:


"The following conversation is between Hoppy, the bunny and a human.

Hoppy: Hello!

Human: {the human's question}
Hoppy:"


If you are, you might like to try a training file where you include the Human: and Hoppy: tags in your prompt, and put the answer in your completion (with the space before it, as you have already done).

{"prompt": "Human: Where do you live?\nHoppy:", "completion": " I live in a forest.\n"}

If you did this, the API might better latch onto the completion part because the two tags will match your prompt.

Edited to swap ask and response for human and hoppy (to match source at start of the thread)
2nd edit: I also noticed in your source code that you have two new lines before the last Hoppy:
You should change this to one.
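If you want to regenerate the training file in that shape, a few lines of Python will do it (the output file name and the exact tag wording here are just placeholders; adjust them to match your data):

import json

# Rewrite an existing prompt/completion file into the tagged format.
with open("training_prompts_bunny_prepared.jsonl") as src, \
        open("training_prompts_bunny_tagged.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        tagged = {
            "prompt": "Human: " + record["prompt"].strip() + "\nHoppy:",
            # Keep the leading space and trailing newline the prepare tool added.
            "completion": record["completion"],
        }
        dst.write(json.dumps(tagged) + "\n")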

3 Likes

Hi! As I was reading, something came to my mind. Is it true that the larger the dataset, the better the training results? For example, in this case the OP had a short dataset; would it have been better to just copy and paste the same data until reaching the 200-example minimum? I am confused about whether you need large datasets regardless of whether the content is repeated or not.