Fine-tune completion tokens longer than 1

When fine-tuning a classification model (using Ada), the completion returned when I test the model is longer than 1 token; it gets cut off at roughly the typical ~3-character token length, which isn't ideal. This seems counter to the classification provided in the documentation example: Fine tuning classification example | OpenAI Cookbook. From that example, my model would be cutting off at ' bas' when max_tokens=1 is applied, which leads to incorrect follow-on completions. I think I've followed all the right instructions from the documentation and the examples, but I can't seem to get past this issue. My training data is formatted like the following, and there's a quick check of how these labels tokenize after the examples:

{"prompt": "String to classify.\n\nLegend ID: ", "completion": " f1234"}
{"prompt": "Another string to classify.\n\nLegend ID: ", "completion": " f1456"}
{"prompt": "A third string to classify.\n\nLegend ID: ", "completion": " f1456.123"}

I've tested using a token legend to ensure only 1 token is provided in the completion, and then processing that response locally (see the data format and the lookup sketch below). This works okay, but it has limitations. Additionally, it seems that if the completion token (Legend ID) includes leading 0s, it affects the token count and, in turn, the response.

{"prompt": "String to classify.\n\nLegend ID: ", "completion": " 987"}
{"prompt": "Another string to classify.\n\nLegend ID: ", "completion": " 123"}

Is there something obvious that I’m missing on why my original fine-tuning dataset’s completions are more than 1 token?


You need to set max_tokens to a slightly higher value

And for best results, make sure the completion is a string without spaces (e.g. a single word or number)

You still keep the leading space though

Note: If you want a more accurate system, you may be forced to go to embeddings, as that approach limits the result set to what is in the embedded dataset

The documentation for embeddings is a bit complicated, and you need to use something called a random forest classifier for classification.
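Roughly, the embedding-based classification looks like the sketch below. This assumes you've already computed one embedding vector per training string via the embeddings endpoint; the vectors and labels here are placeholders:

from sklearn.ensemble import RandomForestClassifier

# Placeholder embeddings and labels; in practice the vectors come from the embeddings endpoint
X_train = [[0.10, 0.20, 0.30], [0.90, 0.80, 0.70], [0.11, 0.19, 0.31]]
y_train = ["f1234", "f1456", "f1234"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict([[0.12, 0.18, 0.29]]))  # predicted Legend ID, limited to labels seen in training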

Interesting, thanks @raymonddavey. Just playing around in the Tokenizer tool, I didn't realize that full words could be considered 1 token, but there doesn't seem to be much consistency there. I stumbled upon this blog that gets at my question as well: GPT-3 tokens explained - what they are and how they work | Quickchat Blog. But not being able to generate a single token has multiple downstream effects (for my application at least). If you have to generate more than 1 token per completion, you can't get the logprob of the resulting combination of tokens, can you? **Edit:** I see how to do the total logprob now. **Edit:** Also, would I need to ensure that my completion labels are all either a single token or the same token length?
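(For anyone else who lands here: you can request logprobs from the legacy Completions endpoint and sum the per-token values for the generated tokens. This is just a sketch of that, assuming the pre-1.0 openai Python client and a placeholder fine-tuned model name.)

import math
import openai

response = openai.Completion.create(
    model="ada:ft-your-org-2023-01-01",  # placeholder fine-tuned model name
    prompt="String to classify.\n\nLegend ID: ",
    max_tokens=5,
    temperature=0,
    logprobs=1,
)
choice = response["choices"][0]
token_logprobs = choice["logprobs"]["token_logprobs"]  # one logprob per generated token
total_logprob = sum(token_logprobs)                    # joint logprob of the whole completion
print(choice["text"], math.exp(total_logprob))         # completion text and its overall probability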

You are correct that logprobs only returns the score for the first token in a word.

The most common words have one token (e.g. "John", "the", etc.); less frequently used words have more than one token (as do a lot of foreign-language words)

If this is for classification, I don't think the token count for the completion matters too much. The comparison is done on the prompt, with the completion just being the resulting category or grouping. I think they recommend 1 token so you can use max_tokens=1. A single token is best; a single word (with no spaces) would be next, as its tokens are concatenated; and multiple words are last (as the space may end up with the AI making up the second word)

This has implications for logit_bias (unrelated, fun fact)
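(Concretely: logit_bias is keyed by individual token IDs, so it can only bias single tokens, not whole multi-token labels. A sketch, assuming the pre-1.0 openai client; the token IDs below are placeholders you'd look up in the tokenizer for your own labels:)

import openai

response = openai.Completion.create(
    model="ada",
    prompt="String to classify.\n\nLegend ID:",
    max_tokens=1,
    # Placeholder token IDs for single-token labels like " 987" and " 123";
    # a bias of 100 effectively restricts sampling to these tokens
    logit_bias={"11": 100, "22": 100},
)
print(response["choices"][0]["text"])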

"The comparison is done on the prompt, with the completion just being the resulting category or grouping."

Ah okay, that makes sense. So is it right to say that the max_tokens parameter is simply the maximum number of tokens that can be in a response, rather than a setting that the model interprets to generate the most likely n-token completion?

If that's the case, assuming that my labels could consist of more than 1 token (and will have varying lengths), do you think it's best to use a demarcation marker (like a semicolon or some special character) appended to the label, set max_tokens higher, and use the demarcation character as a stop sequence to identify a single classification category?

Yes, using a stop setting and a higher max_tokens value would work well.

A “\n” works well because it is only one token long.
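Something like this, for example (a sketch assuming the pre-1.0 openai Python client and a placeholder fine-tuned model name):

import openai

response = openai.Completion.create(
    model="ada:ft-your-org-2023-01-01",  # placeholder fine-tuned model name
    prompt="String to classify.\n\nLegend ID: ",
    max_tokens=10,   # roomy enough for a multi-token label
    temperature=0,
    stop="\n",       # the demarcation character appended to every training completion
)
print(response["choices"][0]["text"])  # the label, cut off at the stop sequence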


Thanks for the help @raymonddavey. After some testing, I think the best approach for my use case is going to be using my token legend method for now. I still think it’s strange that the fine-tuned classification labels are not single tokens. I would understand why from a text-generation use case, but not a classification use case.

Also, FWIW, \n is actually two tokens, so I've resorted to using a semicolon as a stop token, and it seems to be working okay.

The \n will show up as two tokens in the tokenizer (because it treats it as a "\" followed by an "n")

But if you convert it to the actual newline character (which Python does automatically), it is one token

The \n is just how you represent a new line in a lot of programming languages


Text from a programming website:

LF (character : \n, Unicode : U+000A, ASCII : 10, hex : 0x0a): This is the ‘\n’ character which we all know from our early programming days. This character is commonly known as the ‘Line Feed’ or ‘Newline Character’.
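You can verify this quickly, e.g. with tiktoken and the r50k_base encoding (just a sketch):

import tiktoken

enc = tiktoken.get_encoding("r50k_base")
print(enc.encode("\n"))   # the actual newline character: one token
print(enc.encode("\\n"))  # a literal backslash followed by "n": two tokens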


Ah okay, gotcha. Thanks for that!