Fine-tuning success stories - new 2023 models, what are your results?

Since fine-tuning for gpt-3.5-turbo and the replacement completion models babbage-002 and davinci-002 was introduced with only sparse documentation examples (and merely prompting with those examples produces the illustrated behavior), I thought I’d make a place where you can share what has worked for you on more advanced AI tasks - and what has disappointed.

  • your application and purpose?
  • number and types of examples?
  • n_epochs hyperparameter used and discovered necessary?
  • experience using the API, and tips?
  • development of training data sets?

I’ll start with a helpful tip: be aware of the old-school prompt engineering required to make a completion model actually continue writing where you left off. Then you need less training on completely new behaviors.


Another thing to consider: using a fine-tuned model costs 8x as much.

Can you get acceptable or similar performance by using eight multi-shot examples before your task, on the base untuned model?

The prior base completion models especially are quick learners from multi-shot examples.

1 Like

Okay, this is now also the fine-tuning utter failures topic… :smile:

OpenAI (finally) released new documentation on how the completion models (davinci-002 and babbage-002) should be fine-tuned by the new endpoint.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

This is the same format one would use for training the prior base models. Apart from the different way the new models respond, most previous applications can work the same.

What this example doesn’t mention is the need for separators and stop sequences, and the situations where you’d have to include them in the fine-tune and use your model with them. I’ll point you further down in this topic for a quick definition of each.

1 Like

I’m a little confused on the documentation.

So do we need stop sequences or not in the fine-tune training file?

It says

“You can follow the prompt completion pair format used for legacy fine-tuning as shown below.”

So they are saying you can use the legacy format, which uses stop sequences, but the example in the newer documentation shows no stop sequences.

Has anyone verified?

I would only be fine-tuning babbage-002 with my previous training file for the original Babbage.

1 Like

Consider the prompt to be everything that you will input to the model, and the completion to be everything the model will output for that type of starting context.

Now, one particular thing we know: on a completion endpoint, producing an “end-of-text” special token will halt the output. The instruction-following completion model gpt-3.5-turbo-instruct is heavily trained on producing them. If given a Human/Chatbot scenario, it will almost certainly produce the end-of-text and not continue.

However, what we are not given is documentation to produce this ourselves or fine-tune on it. The base models without training will just write and write.


If you look above, for this chatbot application we don’t need the model to “stop” itself; instead we program a stop sequence. If the AI were to proceed to the next line and start writing user: as if it were going to “complete” and simulate more turns, we can set our stop sequence to "\nuser:". The API will detect that sequence, withhold it from the output, and stop the AI there.
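To make that concrete, here is a minimal sketch (the function name and parameter values are my own, not from this thread) of how the “\nAI:” separator and “\nuser:” stop sequence fit together in a completions-style request:

```python
def build_request(history: str, user_message: str) -> dict:
    """Assemble parameters for a completions-endpoint call."""
    # The separator "\nAI:" ends the prompt so the model writes an answer.
    prompt = f"{history}\nuser: {user_message}\nAI:"
    return {
        "model": "babbage-002",
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 0.5,
        # If the model proceeds to simulate the next user turn, the API
        # detects "\nuser:", withholds it, and stops the output there.
        "stop": ["\nuser:"],
    }

request = build_request("user: Hi\nAI: Hello!", "What is a stop sequence?")
```

The “\nuser:” text itself is never returned; you only see the answer up to where the simulated next turn would have begun.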

The AI can indeed emit “end-of-text” from its own sense of when it is done writing an essay, as in its pretraining. Try to make it write more after it has produced one and you just get more “I’m done” tokens.

Now what about the natural operation of “completion”? When just given some text, the AI will simply write what comes next, finishing a sentence or finishing an essay. It could try to finish your question, or write more questions and not answer anything at all.

We “prompt” this chatbot AI by including “\nAI:”. We tell it where to start writing in the style of an answer.

In a completion model, we can show it inputs and outputs (multi-shot) and it will learn what it is supposed to make. Fine-tune can also show it what it is supposed to make.

So we have satisfied the tips I’ll put at the bottom. We have a “separator” \nAI: (which serves to make an AI that doesn’t just continue writing more sentences of our input), and we have a “stop sequence” that we can recognize when programming the API call.

I was under the initial assumption that the new training endpoint might behave like the containerized chat fine-tuning, where OpenAI would have picked delineation or ending sequences that are injected, but that would break the idea of free-running completion, which the new models now demonstrate identically, so that notion is seemingly discarded. The JSON being sent with either “prompt” or “messages” for fine-tuning must instead distinguish the two different methods for chat and completions (and perhaps validate which models can use them?).

So you must check whether your own input, your desired output, and the way the AI might naturally try to continue already contain such “separator” and “stop” parts. Or rather, whether you’ll have to make the AI recognize the end of your text, where it should begin generating in its trained “output method”.

If you want your input to be “The best way to do (a company task) is to”, and then the AI response to be " always buy products from OpenAI!", then there is no separator in that prompt.

The documentation is not a set of rules, but a guide to making the tuned model do what you would otherwise have to achieve by prompting just to make a base completion engine work similarly (which you should experiment with first).


To fine-tune a model, you’ll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.

  • Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
  • Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
  • Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
  • For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
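Those four bullets can be sketched as a tiny pair of formatters (a hedged example of mine, reusing the documentation’s suggested separator):

```python
import json

SEPARATOR = "\n\n###\n\n"
STOP = "\n"

def training_line(prompt_text: str, ideal_output: str) -> str:
    """One JSONL line following the guidelines quoted above."""
    return json.dumps({
        "prompt": prompt_text + SEPARATOR,        # fixed separator ends the prompt
        "completion": " " + ideal_output + STOP,  # leading space, fixed stop sequence
    })

def inference_prompt(prompt_text: str) -> str:
    """Format inference prompts identically, and pass stop=STOP in the API call."""
    return prompt_text + SEPARATOR

line = training_line("Lolllll", "1")
```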

Well, so far using the new fine-tunes API, I get a model that suddenly hit a training loss of 0.0000


Running it through

It tried to complete with a space. :cry:

'logprobs': {'tokens': [' '], 'token_logprobs': [0.0], 'top_logprobs': [{' ': 0.0, ',': -23.361324}]

The settings were:

'Params': '{"temperature": 0, "max_tokens": 1, "top_p": 1, "logprobs": 2, "frequency_penalty": 0, "presence_penalty": 0}'

The real response is either ' 0' or ' 1' (this is a binary classifier)

So maybe the new model doesn’t need a space in front of the output token?

The JSONL file had lines like this:

{"prompt": "Lolllll\n\n###\n\n", "completion": " 1"}
{"prompt": "See you Saturday\n\n###\n\n", "completion": " 1"}

I just used what worked in the OG Babbage that is currently operational. I used the exact same training file with ~4000 lines.

So my idea is to get rid of the space fronting the output token. If that doesn’t work, also get rid of the separator \n\n###\n\n. Maybe fine-tune through the CLI instead of the SDK? Is the CLI still supported???
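If anyone wants to try the same fix, a hypothetical rewrite pass over an old-format JSONL file (the names and the sample line are mine) would look like:

```python
import json

def strip_leading_space(jsonl_text: str) -> str:
    """Drop the space fronting each completion, keeping everything else."""
    out = []
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        record["completion"] = record["completion"].lstrip(" ")
        out.append(json.dumps(record))
    return "\n".join(out)

old_line = '{"prompt": "Lolllll\\n\\n###\\n\\n", "completion": " 1"}'
new_line = strip_leading_space(old_line)
```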

Anyone else had luck fine-tuning babbage-002? Or davinci-002?


OMG, Did you just create AGI???


Or rather: you asked for one token, you got one token.

Numbers do not seem to join with leading spaces like words do. Apart from all of 0-999 being manually loaded by OpenAI into the BPE dictionary, the process of finding an optimal dictionary from training data may not have found any byte optimization from spaces before numbers, given the many ISBNs, barcodes, and timestamps with long sequences of digits in the scraped data.

1 Like

OK, it must be the new 100k tokenizer then.

I am running exactly the same settings and training file as OG Babbage.

I will go back to the drawing board and run without pre-pending spaces.

Haha, far from it, 0.0000 TL is BAAAADD NEWS

I hope the community can learn from my fumbles as I try to get the NEW models fine-tuned, with all the baggage I carry from the OLD models :rofl:


Since I just happen to have a sorted file with 100,000 lines:

dec hex char
46 2E .
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :
59 3B ;
60 3C <

Numbers would appear sorted between / and : characters. Go to the part of the token dictionary with leading spaces:

[35611, 3, ' /^']
[94397, 4, ' /^(']
[76437, 4, ' /^[']
[82058, 4, ' /^\\']
[552, 2, ' :']
[6395, 3, ' :\n']
[14853, 4, ' :\n\n']
[91509, 6, ' :\n\n\n\n']
[30075, 4, ' :\r\n']

We can see none.

So another case of not quite being a “drop in” model.
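The same check can be scripted over any dumped vocabulary; the sample entries below are copied from the fragment above, not a full dictionary:

```python
# Each vocabulary entry: (token_id, byte_length, decoded_string).
vocab = [
    (35611, 3, " /^"),
    (552, 2, " :"),
    (6395, 3, " :\n"),
]

# Look for any " 0" ... " 9" style tokens: a space followed by one digit.
space_digit_tokens = [s for _, _, s in vocab
                      if len(s) == 2 and s[0] == " " and s[1].isdigit()]
# Empty for a cl100k-style vocabulary, matching the search above.
```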

1 Like

I will report back here. But certainly not “Drop IN” so far.

Any opinion on dropping the separator markers \n\n###\n\n?

It is an old school completion endpoint, and it feels weird without them.

Maybe the new model doesn’t need them either in training or in operations??? Thoughts?

1 Like

I gave a long write-up including the separator markers. It depends on what the AI wants to write next after the input.

Here I give a bit of warm up language (read it as if the AI was reading this in a book or document about classifiers)


That might need a separator…

Now let’s try with ### as a “stop” separator:


babbage-002 is continuing in a different way, seeing the delineation. It’s still writing like it was inside an article about classifiers, though.

However, you can just alter your prompt format to a “natural” stop phrase, getting an answer. Obviously a dumb answer needing just a bit of fine-tune.


(I wouldn’t count on another appearance of “classifier:” being a reliable API stop phrase, so that’s where you can train on a line feed.)
(“and then finishes” is having no effect here, it has no training on the end-of-text token)

You are in the Playground.

The fine-tune “burns in” the markers as part of the training process.

Fine-tunes are an altered beast compared to the Playground.

So for apples-to-apples, I need to see FT analogies using the ‘002’ base models fine-tuned

This is what fine-tune does. It shows it examples. And I had a brainwave.


My fine-tunes, and the ones I want to re-create using 002 base models is basically:

{arbitrary text} → {0, 1}

It’s a surjection from any text input to 0 or 1.

log_probs provides a continuous mapping from -1 to +1, which gives a confidence score.

No prompting involved.

This all works perfectly on the original Babbage. All API based too.
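For the curious, one way to turn those log_probs into a signed confidence score (my own formula, not anything the API provides): exponentiate the two top logprobs and take P("1") - P("0"):

```python
from math import exp

def confidence(top_logprobs: dict) -> float:
    """Signed confidence in [-1, +1]: +1 is a certain '1', -1 a certain '0'."""
    p1 = exp(top_logprobs.get("1", float("-inf")))  # exp(-inf) gives probability 0.0
    p0 = exp(top_logprobs.get("0", float("-inf")))
    return p1 - p0

# Numbers shaped like the API's top_logprobs output:
score = confidence({"0": 0.0, "1": -19.26})  # very confident "0", near -1
```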

1 Like

So the problem is you provide an input:

I like arbitrary text

you could get an output


but you also could get an AI that thinks more is needed

, because it is random.0

So the separator is basically serving as a “my text is done” signal.

The separator is something you should add to your input to match your prompt training:

I like arbitrary text


So if you’re burning tokens anyway, semantics are free, and the next programmer will understand, too:

I like arbitrary text
AI output:

My intuition was right, it was the 100k tokenizer issue.

Retrain without the leading space on the output token.

No need to drop the markers, keep them in, BOOM!

"choices": [{"text": "0", "index": 0, "logprobs": {"tokens": ["0"], "token_logprobs": [0.0], "top_logprobs": [{"0": 0.0, "<|endoftext|>": -21.375}]

The model is overconfident though based on log_probs. Whatev … I give up. :rofl:

“Reasonable” training loss curve???

There is probably a PhD thesis or three in why it kept oscillating down to 0.0000 TL before jumping up at the end.


I’m still in favor of using stop markers, but I think we’re in the minority :sweat_smile:

1 Like

If it ain’t broke…right?

Just to clarify:

  • separator: The last part of your prompt. It tells the AI that your output format shall begin after a particular sequence is seen. It is the ultimate “prompting” to the AI.

  • stop sequence: The last part of your completion. It is a particular phrase and token sequence that, when produced, will trigger the matching API “stop” string and terminate the otherwise continuing or repeating output.

Goofy completion model tricks:

Useful stop sequence fine-tuning

{"prompt": "prefix:<prompt text>", "completion": "<ideal generated text>prefix:"}

You set the stop sequence as your same input role prompting. This is the pattern of chatbots, but most wouldn’t think to fine-tune this way.

Why: If you turn off the stop sequence, you get an AI that keeps on simulating these turns. You can see it make a bunch of fine-tuned responses to its own fine-tuned inputs.
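A sketch of producing training lines in this pattern (the "user:" prefix is my stand-in for whatever role prompt you actually use):

```python
import json

PREFIX = "user:"

def training_line(user_text: str, ideal_reply: str) -> str:
    """The completion is trained to end by re-emitting the input role prefix."""
    return json.dumps({
        "prompt": f"{PREFIX} {user_text}\nAI:",
        # At inference, setting stop=["\nuser:"] trims this trained tail off.
        "completion": f" {ideal_reply}\n{PREFIX}",
    })

line = training_line("Tell me about stop sequences", "They end the output.")
```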

Going past the prompt you trained on:

given normal fine-tune training:
{"prompt": "<prompt text> #—>", "completion": "<ideal generated text>##STOP!##"}

You can disobey your own prompt style. This could have some interesting uses for probing performance.

Overcomplete prompt:
"A ripe banana#—>Sure, I'll write a story
"I like frogs#—>{"sentiment": "happy", "length

You can go deeper into your fine-tune, and redirect the output to a particular example, a particular part of the output, or even out-of-domain. See what is produced.

Incomplete Prompt:
“A ripe”

When the AI completes your own prompt, you see what the fine-tune is expecting from you, and how long it takes (and whether) it produces the separator on its own from being strongly tuned.

I can report that it appears to work with no sync markers at the end.

'choices': [{'text': '0', 'index': 0, 'logprobs': {'tokens': ['0'], 'token_logprobs': [0.0], 'top_logprobs': [{'0': 0.0, '1': -19.259766}]

So there is no need for sync markers (to denote a stop sequence) or a space prepended to your desired output.

However, the training loss curve I got was weird:

Also, I learned the endpoint chooses the number of epochs based on analysis of your training file. It gave me 3 epochs, as opposed to the old fixed default of 4.

I will run these models side by side to see if they disagree (new with/without sync vs. old).