Fine-tuning davinci and end-of-response issues

Hello.

I have fine-tuned a davinci model using a training dataset that appends “\n\n###\n\n” to the end of each prompt. I did not add it to the end of the completion (I don’t think I was supposed to).

Now my response text usually starts with “\n###\n\n” and ends the full message with “\n\n###\n\n”, but after that it repeats the full message again and again until it runs out of tokens.

So it’s starting with the stop sequence and then repeating the message multiple times until it runs out of tokens, even if I use the “stop” parameter in the completion request.

x = openai.Completion.create(
    model="davinci:XXXXXXX",
    prompt=msg,
    max_tokens=1000,
    stop="\n\n##\n"
)

More interestingly/annoyingly, sometimes it starts with some “random” text of around 150 characters, including some newlines, that looks like it could fit into the message somewhere. It then uses that text as the start and end markers around the message, which again repeats and finally ends with “\n\n###\n”.

So in the “normal” (bad return) case I get something like this if the answer was meant to be “blue”:

\n###\n\nblue\n\n###\n\n\n\n###\n\nblue\n\n###\n\n\n\n###\n\nblue\n\n###\n\n

If I don’t set a max_tokens param then it gets cut off at 16 tokens (the default).

For the cases when it does use my stop token and doesn’t use the random text string, I can of course strip this out in the app layer after receiving the response but (1) I think I am paying for those extra tokens when I don’t need them and (2) generally in my quest to understand this better I’d like to figure it out.
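The app-layer cleanup mentioned above could look roughly like the sketch below. It assumes the answer is always the first non-empty chunk between “###” markers, as in the “blue” example; `extract_first_answer` is a hypothetical helper name, and this only tidies the output, it does not reduce the tokens billed.

```python
SEPARATOR = "###"  # core of the "\n\n###\n\n" marker used in the prompts


def extract_first_answer(raw: str, separator: str = SEPARATOR) -> str:
    """Return the first non-empty chunk found between separator markers."""
    for chunk in raw.split(separator):
        cleaned = chunk.strip()  # drop the newlines surrounding each chunk
        if cleaned:
            return cleaned
    return ""  # nothing but separators and whitespace


# The repeated output from the example above
raw = "\n###\n\nblue\n\n###\n\n\n\n###\n\nblue\n\n###\n\n"
print(extract_first_answer(raw))
```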

So questions:

(1) Why do I get my stop param at the beginning as well as the end?
(2) Why doesn’t it stop when it reaches the stop param string?
(3) Did I prepare the training data incorrectly, or is it something else?
(4) What do I do about the fact that it uses a “random” 150-ish character string as the stop param to bookend the response?

More context if useful:

  • the task I’m asking this model to do is to take a message and return both a summary and a priority classification (low/mid/high)
  • my training data is around 2400 samples, and I used 3 epochs for training (since my billing limit wouldn’t support 4)

Thank you for this tool and for your help! @boris your feedback on other answers has been helpful in my learning!

Another question about this project as long as I’m posting.

My project uses this API for two things: to classify message urgency (low/mid/high) and to summarize the message. GPT3.5 does a great job summarizing, but I haven’t been able to get it to classify the nuances well, which is why I’m trying a fine-tuned model. So far, in my limited testing with the new tuned davinci model, the classifications are great but the summaries are so-so.

Potential paths to improve:

(1) Split into 2 API calls: use the tuned davinci model just for classification and call GPT3.5 for the summary. Unfortunately this still costs me the full price of a tuned model, but it could work.

(2) More training data for the summary, more epochs, or adjust the temperature and see. (FWIW, I used 2400 samples for training, in which I fed in the result of asking GPT3.5 for a summary.)

(3) Figure out how to get GPT3.5 to do better classification, which would save me a ton in costs. I think I’m going to work on this one in parallel. Any suggestions or guides welcome.

Responding to your second post: my classification situation was pretty simple (mostly self-evidently non-overlapping classifications without a whole lot of nuance), but you might find some ideas here anyway:

And to add a few extra paths for improvement:
(1) if you have access, you could try to use GPT4 for the classification part.
(2) use chain of thought prompting:

When you have difficult or multi-step problems, you can often significantly improve performance by asking the model to “show its work,” especially if you can point to some general themes in the nuances the model is misclassifying. I might say something like, e.g.,

First, think through the problem in tags. Identify the top three most likely categories, then critically evaluate the three options and make your final evaluation. Note that if the user uses words X,Y,Z, these are indicative of “high” priority, …(etc)

(3) use in-context learning.

If you’re not already doing so, provide a handful of examples (particularly the types of examples that you’re finding result in mistakes).
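As a rough illustration of the in-context learning idea, here is one way the examples might be assembled into a prompt. The function name, the instruction wording, and the sample messages are all made up for the sketch; in practice you would substitute the real messages that get misclassified.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], message: str) -> str:
    """Build a classification prompt from labeled examples plus the new message."""
    lines = ["Classify the priority of each message as low, mid, or high.", ""]
    for text, label in examples:
        lines.append(f"Message: {text}")
        lines.append(f"Priority: {label}")
        lines.append("")  # blank line between examples
    lines.append(f"Message: {message}")
    lines.append("Priority:")  # left open for the model to complete
    return "\n".join(lines)


# Hypothetical examples, ideally drawn from the cases the model gets wrong
examples = [
    ("The production database is down.", "high"),
    ("Can we update the logo sometime next month?", "low"),
]
prompt = build_few_shot_prompt(examples, "Checkout is failing for some users.")
print(prompt)
```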

Hope some of these ideas help!

For fine-tunes, I have an input text string, and use the three-hash marker ‘\n\n###\n\n’ at the end (since it was there during training in the jsonl file).

But before I join the input string and the marker ‘\n\n###\n\n’ and send them off to the fine-tune, I remove all occurrences of ‘###’ from the message (the original input string) to prevent confusion, and then tack the marker on the end. This way, there will only be one possible marker in the entire chunk of text sent to your fine-tuned model.
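In code, that cleanup step might look something like this minimal sketch (`build_prompt` is just an illustrative name, not anything from the API):

```python
SEPARATOR = "\n\n###\n\n"  # same marker that ends every prompt in the training jsonl


def build_prompt(message: str, separator: str = SEPARATOR) -> str:
    """Remove any '###' from the message, then append the marker once at the end."""
    cleaned = message.replace("###", "")
    return cleaned + separator


prompt = build_prompt("Server ### down since 9am")
print(repr(prompt))
```

The result contains exactly one ‘###’, at the very end, which matches what the model saw during training.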

Also, looks like your stop has two hashes defined, not three. Not sure if this was intentional or not.

But I don’t define a stop for a fine-tune in the request (it’s burned into the training data, and sent manually anyway), maybe that is what is causing havoc for you?
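One thing that may be worth checking on the training side: as far as I recall, the legacy prompt/completion fine-tuning guidance suggested that each completion start with a space and end with a fixed stop sequence, so the model learns where an answer ends. Below is a sketch of writing one jsonl record that way. The “ END” stop sequence, the `make_record` name, and the “Priority:” layout of the completion are assumptions for illustration, not something taken from the original training file.

```python
import json

SEPARATOR = "\n\n###\n\n"  # ends every prompt
STOP = " END"              # hypothetical stop sequence that ends every completion


def make_record(message: str, summary: str, priority: str) -> str:
    """Return one jsonl line in the legacy prompt/completion format."""
    prompt = message.replace("###", "") + SEPARATOR
    # Leading space and trailing stop sequence on the completion
    completion = f" {summary}\nPriority: {priority}{STOP}"
    return json.dumps({"prompt": prompt, "completion": completion})


line = make_record("Disk is at 99% on the main server.", "Main server disk nearly full.", "high")
print(line)
```

If the completions were trained this way, the request would then pass the same string as the stop value (here `stop=" END"`), so generation ends as soon as the model emits it.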

Thanks for the reply. I also initially tried without the stop in my request and it came back with an infinite loop, so I added the stop. Essentially the same thing both ways.

Will look into what you’re saying about the fine tune data and the training, thank you.