How to overcome OpenAI fine-tuning training data token limit?

kamilla · June 23, 2023, 3:58pm

I am fine-tuning the curie model in Python. I pass training data of the form {“prompt”:“completion”} and have 736 prompt-completion pairs. My example completions are pretty long: I aim to generate a JSON file from a description of fixed form. The fine-tune is reported as created; however, when I retrieve the fine-tune via

import openai

retrieve_response = openai.FineTune.retrieve(id="fine_tune_model_id")
print(retrieve_response)

I get the following message:

{
  "created_at": 1685346828,
  "events": [
    {
      "created_at": 1685346828,
      "level": "info",
      "message": "Created fine-tune: fine_tune_model_id",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1685346879,
      "level": "info",
      "message": "Error: The training file does not contain enough examples that fit within the 2048 tokens allowed for this model.",
      "object": "fine-tune-event"
    }
    ...

Cross-posted on https://stackoverflow.com/questions/76355721/openai-fine-tuning-training-data-exceeds-the-token-limit

The fine-tune status is therefore failed, although the status of the “training files” object is processed (unsurprisingly).

Is there a way to overcome the error above?

I do have an OpenAI subscription.

Foxalabs · June 23, 2023, 4:13pm

Welcome to the Forum! The error message is telling you that some of the training set examples are longer than 2048 tokens and so cannot be used as training data. You need to look at the training data set and either reduce the length of the examples or split them up into smaller chunks.
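As a rough illustration of the "split into smaller chunks" option, here is a minimal sketch that greedily packs words into chunks that stay under a token budget. The names `split_into_chunks` and `approx_token_count` are hypothetical, and the word-based counter is only a stand-in for a real tokenizer. Whether a split example is still a meaningful training example depends on the data; a JSON completion cut in the middle of a structure, for instance, would no longer be valid JSON.

```python
from typing import Callable, List


def approx_token_count(text: str) -> int:
    # Crude stand-in (assumption): one token per whitespace-separated word.
    # A real tokenizer will usually report more tokens than this.
    return len(text.split())


def split_into_chunks(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Greedily pack words into chunks of at most max_tokens tokens.

    A single word that is longer than max_tokens on its own still
    becomes its own (oversized) chunk; this sketch does not split words.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Passing a different `count_tokens` callable lets the same function work with whichever tokenizer matches the model being fine-tuned.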

cghzy99 · June 24, 2023, 1:40am

Are you suggesting that perhaps we don’t need as much data, and that as long as the data is sufficiently streamlined, ChatGPT can still comprehend and recognize it?

jwatte · June 24, 2023, 2:05am

There are a few problems with what you’re suggesting.

First: The model is pretty bad at generating syntactically correct JSON consistently. You’re better off with XML, and even better off with markdown if you can re-format your data.

Second: “long completions” aren’t going to work. The model will only generate a total of 2048 tokens, counting all of instructions, prompt, and completion. You should verify that the prompt+answer for each of your samples comes to less than 2048 tokens, using the tokenizer of your choice.

Third: 700 samples isn’t all that much, machine-learning wise. Don’t expect the model to do a lot of extrapolation from the learning examples you have. It may be that you’d be better off using some kind of embedding search and in-context document priming, rather than fine-tuning, when the number of documents is small.
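To make the "embedding search and in-context document priming" idea a bit more concrete, here is a minimal sketch. It assumes you already have an embedding vector for each document and for the query (how those are produced is left out), ranks documents by cosine similarity, and pastes the best matches into a prompt. The function names and the prompt layout are hypothetical, not part of any library.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    docs: Sequence[str],
    k: int = 3,
) -> List[str]:
    """Return the k documents whose embeddings are closest to the query."""
    scored = sorted(
        zip(docs, doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:k]]


def build_prompt(question: str, context_docs: Sequence[str]) -> str:
    """Prime the model with retrieved documents (hypothetical prompt layout)."""
    context = "\n\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The retrieved context plus the question and the expected answer still have to fit in the model's context window, so `k` needs to be chosen with that budget in mind.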

cghzy99: “we don’t need as much data, and as long as the data is sufficiently streamlined, ChatGPT can still comprehend and recognize it”

You may need more data, but each datum should be shorter than 2048 tokens.
Also, you’re not fine-tuning ChatGPT; you’re fine-tuning the GPT-3 model Curie, which is a generation before GPT-3.5. There is not currently the ability to fine-tune GPT-3.5 or GPT-4.
The GPT-3 model only supports up to 2048 tokens total for instruction plus prompt plus completion, so it makes no sense to try to train it on anything longer, as it can’t generate more than that, anyway.
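A minimal sketch of the check described above: go through a JSONL training file and report every example whose prompt plus completion exceeds the limit. It assumes each line is a JSON object with `prompt` and `completion` keys, and `find_oversized` is a hypothetical helper name. The default word-based counter is only a placeholder; for real numbers you would pass in a proper tokenizer, for example something like `lambda s: len(enc.encode(s))` with `enc = tiktoken.get_encoding("r50k_base")`, assuming the tiktoken package is installed and that this encoding matches your model.

```python
import json
from typing import Callable, Iterable, List, Tuple

TOKEN_LIMIT = 2048  # limit quoted in the fine-tune error message


def word_count(text: str) -> int:
    # Placeholder counter (assumption): real tokenizers usually
    # produce more tokens than there are words.
    return len(text.split())


def find_oversized(
    jsonl_lines: Iterable[str],
    count_tokens: Callable[[str], int] = word_count,
    limit: int = TOKEN_LIMIT,
) -> List[Tuple[int, int]]:
    """Return (line_number, total_tokens) for each example over the limit."""
    oversized: List[Tuple[int, int]] = []
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed record layout: {"prompt": "...", "completion": "..."}
        total = count_tokens(record["prompt"]) + count_tokens(record["completion"])
        if total > limit:
            oversized.append((line_number, total))
    return oversized
```

Since an open file is iterable line by line, a file handle can be passed directly as `jsonl_lines`. Leaving some headroom below the limit is probably wise, since tokenizing the prompt and completion separately may not give exactly the same count as the service sees.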

cghzy99 · June 24, 2023, 4:06am

OK, thanks. I will look into this more.
