Fine tuning success stories - new 2023 models, what are your results?

Since the newly introduced fine-tuning for gpt-3.5-turbo and the replacement completion models babbage-002 and davinci-002 has only sparse documentation examples (and merely prompting with those examples produces the illustrated behavior), I thought I’d make a place where you can share what has worked for you on more advanced AI tasks - and what has disappointed.

  • your application and purpose?
  • number and types of examples?
  • n_epochs hyperparameter used and discovered necessary?
  • experience using the API, and tips?
  • development of training data sets?

I’ll start with a helpful tip: be aware of the old-school prompt engineering required to make a completion model actually continue writing where you left off. Then you need less training on completely new behaviors.


Another thing to consider: using a fine-tuned model costs 8x as much as the base model.

Can you get acceptable or similar performance by using eight multi-shot examples before your task, on the base untuned model?

The prior base completion models in particular were quick learners from multi-shot prompts.
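The multi-shot alternative is easy to try before committing to a fine-tune; here is a sketch of packing examples into a single prompt for the base model (the labels and example pairs are mine, purely illustrative):

```python
# Hypothetical input→output pairs; a real prompt might use eight.
shots = [
    ("Lolllll", "1"),
    ("See you Saturday", "1"),
    ("Invoice attached", "0"),
]

def multishot_prompt(examples, query):
    """Show input→output pairs, then leave the final output for the model."""
    lines = [f"input: {x}\noutput: {y}" for x, y in examples]
    lines.append(f"input: {query}\noutput:")
    return "\n\n".join(lines)

print(multishot_prompt(shots, "Call me back"))
```

The resulting string would be sent as the prompt of a completions call, with a stop sequence such as "\n" to end the answer.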

1 Like

Okay, this is now also the fine-tuning utter failures topic… :smile:

OpenAI (finally) released new documentation on how the completion models should be fine-tuned by the new endpoint.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
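Assuming a file of such lines, a minimal stdlib sketch to write and sanity-check the JSONL (the file name and example rows are mine, not from the docs):

```python
import json

# Hypothetical example pairs; real training data would have many more.
pairs = [
    {"prompt": "Lolllll\n\n###\n\n", "completion": " 1"},
    {"prompt": "See you Saturday\n\n###\n\n", "completion": " 1"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line, as the fine-tuning endpoint expects."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def validate_jsonl(path):
    """Every line must parse and contain exactly the prompt/completion keys."""
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            row = json.loads(line)
            assert set(row) == {"prompt", "completion"}, f"bad keys on line {n}"
    return n

write_jsonl("train.jsonl", pairs)
print(validate_jsonl("train.jsonl"))  # → 2
```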

OpenAI has removed the previously demonstrated stop sequences from the updated legacy documents (likely inserting their own, which take effect during file processing on the new endpoint), so there is no reason to put in odd markers like ###, unless you are using one as an “end of document” training marker that the AI actually sees (for which three hyphens may be better).

1 Like

I’m a little confused on the documentation.

So do we need stop sequences or not in the fine-tune training file?

It says

“You can follow the prompt completion pair format used for legacy fine-tuning as shown below.”

So they are saying you can use the legacy format, which uses stop sequences, but the example in the newer documentation shows no stop sequences.

Has anyone verified?

I would be only fine tuning Babbage-002 with my previous training file for the original Babbage.

1 Like

Consider the prompt to be everything that you will input to the model, and the completion to be everything the model should output for that type of starting context.

Now, one particular thing we know is that on a completion endpoint, producing an “end-of-text” special token will halt the output. The instruction-following completion model gpt-3.5-turbo-instruct is heavily trained to produce them. Given a Human/Chatbot scenario, it will almost certainly produce the end-of-text and not continue.

However, what we are not given is documentation on how to produce this ourselves or fine-tune on it. The base models without training will just write and write.


If you look above, for this chatbot application we don’t need the model to “stop” itself; instead we program a stop sequence. If the AI were to proceed to the next line and start writing user: as if it were going to “complete” the transcript and simulate more turns, we can set our stop sequence to "\nuser:". The API will detect that sequence, stop the AI output there, and not send it to us.
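What the stop parameter does can be illustrated locally; the helper below is my own sketch of the truncation behavior, not OpenAI code:

```python
def truncate_at_stop(text: str, stops: list[str]) -> str:
    """Cut the completion at the first occurrence of any stop sequence;
    the stop sequence itself is not returned to the caller."""
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]

# The raw completion might run on and start simulating the next user turn:
raw = " Paris is the capital of France.\nuser: and Germany?"
print(truncate_at_stop(raw, ["\nuser:"]))  # →  Paris is the capital of France.
```

In an actual API call you would pass the same string, e.g. stop=["\nuser:"], and the server truncates for you.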

The AI can indeed emit “end-of-text” from its own sense of when it is done writing, say, an essay like those in its pretraining. Try to make it write more after it has produced one and you just get more “I’m done” tokens.

Now what about the natural operation of “completion”? When just given some text, the AI will simply write what comes next, finishing a sentence or finishing an essay. It could try to finish your question, or write more questions and not answer anything at all.

We “prompt” this chatbot AI by including “\nAI:”. We tell it where to start writing in the style of an answer.

In a completion model, we can show it inputs and outputs (multi-shot) and it will learn what it is supposed to make. Fine-tune can also show it what it is supposed to make.

So we have satisfied the tips I’ll put at the bottom. We have a “separator”, \nAI: (which serves to make an AI that doesn’t just continue writing more sentences of our input), and we have a “stop sequence”: something we can recognize when programming the API call.

I initially assumed that the new training endpoint might behave like the containerized chat fine-tuning, where OpenAI would have picked delineation or ending sequences to inject. But that would break the idea of free-running completion, which the new models demonstrate identically, so that notion is seemingly discarded. The “prompt” vs. “messages” keys in the JSON sent for fine-tuning must instead distinguish the two methods, chat and completions (and perhaps validate which models may use them).

So you must check whether your own input, your desired output, and the way the AI might naturally try to continue already contain both a “separator” and a “stop”. Or rather, whether you’ll have to make the AI recognize the end of your text as the place where it should generate in its trained “output method”.

If you want your input to be “The best way to do (a company task) is to”, and then the AI response to be " always buy products from OpenAI!", then there is no separator in that prompt.

The documentation is not a set of rules; it describes how to make the tuned model behave like what you must do anyway to make a base completion engine work by prompting (which you should experiment with first).


To fine-tune a model, you’ll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.

  • Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
  • Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
  • Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
  • For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
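Those four tips can be mechanized; below is my own sketch (not an official snippet), with SEPARATOR and STOP chosen per the quoted guidance:

```python
SEPARATOR = "\n\n###\n\n"   # fixed separator ending every prompt
STOP = "\n"                 # fixed stop sequence ending every completion

def make_pair(input_text: str, output_text: str) -> dict:
    """Build one training example following the legacy formatting tips."""
    assert SEPARATOR not in input_text, "separator must not appear in the prompt body"
    assert STOP not in output_text, "stop must not appear in the completion body"
    return {
        "prompt": input_text + SEPARATOR,
        "completion": " " + output_text + STOP,  # leading space per the tokenization tip
    }

print(make_pair("See you Saturday", "1"))
# {'prompt': 'See you Saturday\n\n###\n\n', 'completion': ' 1\n'}
```

At inference time you would append the same SEPARATOR to your input and pass STOP as the API stop parameter.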

Well, so far using the new fine-tunes API, I get a model that suddenly hit a training loss of 0.0000


Running it through, it tried to complete with a space. :cry:

'logprobs': {'tokens': [' '], 'token_logprobs': [0.0], 'top_logprobs': [{' ': 0.0, ',': -23.361324}]
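As an aside, token logprobs convert to probabilities with exp(); a quick sketch on the values in that response:

```python
import math

# top_logprobs values copied from the API response above
top_logprobs = {" ": 0.0, ",": -23.361324}

# exp(logprob) recovers the token probability
probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
print(probs[" "])         # → 1.0  (the model is certain of the space token)
print(probs[","] < 1e-9)  # → True (essentially zero probability)
```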

The settings were:

'Params': '{"temperature": 0, "max_tokens": 1, "top_p": 1, "logprobs": 2, "frequency_penalty": 0, "presence_penalty": 0}'

The real response is either ' 0' or ' 1' (this is a binary classifier)

So maybe the new model doesn’t need a space in front of the output token?

The JSONL file had lines like this:

{"prompt": "Lolllll\n\n###\n\n", "completion": " 1"}
{"prompt": "See you Saturday\n\n###\n\n", "completion": " 1"}

I just used what worked in the OG babbage that is currently operational. I used the same exact training file with ~4000 lines.

So my ideas are: get rid of the space fronting the output token. If that doesn’t work, also get rid of the separator \n\n###\n\n. Maybe fine-tune through the CLI instead of the SDK? Is the CLI even still supported???

Anyone else had luck fine-tuning babbage-002? Or davinci-002?


OMG, Did you just create AGI???

1 Like

Or rather: you asked for one token, you got one token.

Numbers do not seem to join with leading spaces the way words do. Apart from all of 0-999 being manually loaded by OpenAI into the BPE dictionary, the process of finding an optimal dictionary from training data may simply not have found byte savings from spaces before numbers: the scraped data has lots of ISBNs, barcodes, and timestamps, with long sequences of digits.

1 Like

OK, it must be the new 100k tokenizer then.

I am running exactly the same settings and training file as OG Babbage.

I will go back to the drawing board and run without pre-pending spaces.

Haha, far from it, 0.0000 TL is BAAAADD NEWS

I hope the community can learn from my fumbles as I try to get the NEW models fine-tuned, with all the baggage I carry from the OLD models :rofl:

1 Like

Since I just happen to have the token dictionary sorted in a file with 100,000 lines:

dec hex char
46 2E .
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :
59 3B ;
60 3C <

Numbers would appear sorted between / and : characters. Go to the part of the token dictionary with leading spaces:

[35611, 3, ' /^']
[94397, 4, ' /^(']
[76437, 4, ' /^[']
[82058, 4, ' /^\\']
[552, 2, ' :']
[6395, 3, ' :\n']
[14853, 4, ' :\n\n']
[91509, 6, ' :\n\n\n\n']
[30075, 4, ' :\r\n']

We can see none.
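The scan can be reproduced on any vocabulary dumped in that [id, byte-length, string] format; a toy sketch below uses entries from the dump above plus one hypothetical GPT-2-style entry (the id 352 is illustrative) to show what a match would look like:

```python
# Entries copied from the cl100k dump above, plus a hypothetical
# old-BPE-style ' 1' entry, which cl100k does not actually contain.
vocab = [
    [35611, 3, " /^"],
    [552, 2, " :"],
    [6395, 3, " :\n"],
    [352, 2, " 1"],  # hypothetical entry to demonstrate a hit
]

def space_digit_tokens(entries):
    """Return every token that is a leading space followed only by digits."""
    return [e for e in entries if e[2].startswith(" ") and e[2][1:].isdigit()]

print(space_digit_tokens(vocab))  # → [[352, 2, ' 1']]
```

Running this scan over the real cl100k dump is what turns up nothing, matching the fine-tune behavior seen above.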

So, another case of not quite being a “drop-in” model.

1 Like

I will report back here. But certainly not “Drop IN” so far.

Any opinion on dropping the separator markers \n\n###\n\n?

It is an old school completion endpoint, and it feels weird without them.

Maybe the new model doesn’t need them either in training or in operations??? Thoughts?

I gave a long write up including the separator markers. It depends on what the AI wants to write next after the input.

Here I give a bit of warm-up language (read it as if the AI were reading this in a book or document about classifiers):


That might need a separator…

Now let’s try with ### as a “stop” separator:


babbage-002 is continuing in a different way, seeing the delineation. It’s still writing like it was inside an article about classifiers, though.

However, you can just alter your prompt format to a “natural” stop phrase, getting an answer. Obviously a dumb answer needing just a bit of fine-tune.


(I wouldn’t count on another appearance of “classifier:” being a reliable API stop phrase, so that’s where you can train on a line feed.)
(“and then finishes” is having no effect here, it has no training on the end-of-text token)

You are in the Playground.

The fine-tune “burns in” the markers as part of the training process.

Fine-tunes are an altered beast compared to the Playground.

So for apples-to-apples, I need to see FT analogies using the ‘002’ base models fine-tuned

This is what fine-tune does. It shows it examples. And I had a brainwave.


My fine-tunes, and the ones I want to re-create using the 002 base models, are basically:

{arbitrary text} → {0, 1}

It’s a surjection from any text input to 0 or 1.

log_probs provide a continuous mapping from -1 to +1, which gives a confidence score.

No prompting involved.

This all works perfectly on the original Babbage. All API based too.

So the problem is you provide an input:

I like arbitrary text

you could get the output

0

but you could also get an AI that thinks more text is needed before it classifies:

, because it is random.0

Thus the separator is basically serving as a “my text is done” marker.

The separator is something you should add to your input to match your prompt training:

I like arbitrary text

###

So if you’re burning tokens, semantics are free, and the next programmer will understand, too:

I like arbitrary text
AI output:
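Enforcing that convention in code is a one-liner; a sketch (the label text is my own choice, not from any training file in this thread):

```python
SEPARATOR = "\nAI output:"  # human-readable separator burned in by the fine-tune

def format_prompt(user_text: str) -> str:
    """Append the trained separator so the model knows the input is done."""
    return user_text + SEPARATOR

print(format_prompt("I like arbitrary text"))
# I like arbitrary text
# AI output:
```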

My intuition was right, it was the 100k tokenizer issue.

Retrained without the leading space on the output token.

No need to drop the markers, keep them in, BOOM!

"choices": [{"text": "0", "index": 0, "logprobs": {"tokens": ["0"], "token_logprobs": [0.0], "top_logprobs": [{"0": 0.0, "<|endoftext|>": -21.375}]

The model is overconfident though based on log_probs. Whatev … I give up. :rofl:

“Reasonable” training loss curve ???

There is probably a PhD thesis or three in why it kept oscillating down to 0.0000 TL before jumping up at the end.

1 Like