How to overcome OpenAI fine-tuning training data token limit?

I am fine-tuning the curie model in Python. I am passing training data of the form {"prompt": "<description>", "completion": "<JSON>"}, and I have 736 prompt-completion pairs. My example completions are pretty long: I aim to generate a JSON file from a description of fixed form. The fine-tune is reported as created; however, when I retrieve the fine-tune via

retrieve_response = openai.FineTune.retrieve(id="fine_tune_model_id")
print(retrieve_response)

I get the following message:

{
  "created_at": 1685346828,
  "events": [
    {
      "created_at": 1685346828,
      "level": "info",
      "message": "Created fine-tune: fine_tune_model_id",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1685346879,
      "level": "info",
      "message": "Error: The training file does not contain enough examples that fit within the 2048 tokens allowed for this model.",
      "object": "fine-tune-event"
    }
    ...

Cross-posted on https://stackoverflow.com/questions/76355721/openai-fine-tuning-training-data-exceeds-the-token-limit

and thus the fine-tune status is “failed”, although the status of the “training files” object is “processed” (as expected).

Is there a way to overcome the error above?

I do have an OpenAI subscription.

Welcome to the Forum! The error message is telling you that too many of your training examples are longer than 2048 tokens, so they cannot be used as training data. You need to look through the training set and either shorten the examples or split them into smaller chunks.

Are you suggesting that perhaps we don’t need as much data, and that as long as the data is sufficiently streamlined, ChatGPT can still comprehend and recognize it?

There are a few problems with what you’re suggesting.

First: The model is pretty bad at generating syntactically correct JSON consistently. You’re better off with XML, and even better off with markdown if you can re-format your data.

Second: “long completions” aren’t going to work. The model will only generate a total of 2048 tokens, counting all of instructions, prompt, and completion. You should verify that the prompt+answer for each of your samples comes to less than 2048 tokens, using the tokenizer of your choice.

Third: 700 samples isn’t all that much, machine-learning wise. Don’t expect the model to do a lot of extrapolation from the learning examples you have. It may be that you’d be better off using some kind of embedding search and in-context document priming, rather than fine-tuning, when the number of documents is small.
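To make the “embedding search and in-context document priming” idea concrete, here is a minimal sketch. It assumes every stored example already has an embedding vector; the toy three-dimensional vectors, the helper names, and the prompt wording are all hypothetical placeholders, and in practice the vectors would come from whichever embedding model you use.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_documents(query_vec, docs, k=2):
    """Return the texts of the k documents most similar to the query.

    `docs` is a list of (text, embedding_vector) pairs.
    """
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(query_vec, d[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(description, context_docs):
    """Place the retrieved examples in front of the new description."""
    context = "\n\n".join(context_docs)
    return (
        "Use the examples below as reference.\n\n"
        f"{context}\n\n"
        f"Description: {description}\nJSON:"
    )


# Toy vectors standing in for real embeddings (hypothetical values).
docs = [
    ("Example A: description -> JSON", [1.0, 0.0, 0.0]),
    ("Example B: description -> JSON", [0.0, 1.0, 0.0]),
    ("Example C: description -> JSON", [0.9, 0.1, 0.0]),
]
query_vec = [1.0, 0.0, 0.0]  # would be the embedding of the new description

selected = top_k_documents(query_vec, docs, k=2)
prompt = build_prompt("a new description", selected)
```

The resulting prompt still has to fit in the model’s context window together with the completion, so `k` should be kept small when the examples are long.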

“we don’t need as much data, and as long as the data is sufficiently streamlined, ChatGPT can still comprehend and recognize it”

You may need more data, but each datum should be shorter than 2048 tokens.
Also, you’re not fine-tuning ChatGPT; you’re fine-tuning the GPT-3 model Curie, which is a generation before GPT-3.5. It is not currently possible to fine-tune GPT-3.5 or GPT-4.
The GPT-3 model only supports up to 2048 tokens total for instruction plus prompt plus completion, so it makes no sense to try to train it on anything longer, as it can’t generate more than that, anyway.
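As a rough illustration of checking each datum against that limit, the sketch below separates examples whose prompt plus completion fit within 2048 tokens from those that do not. The default counter is only a crude approximation of about four characters per token, which is an assumption of this sketch and not the model’s real tokenizer; the function names are hypothetical.

```python
MAX_TOKENS = 2048  # limit quoted in the fine-tune error message


def rough_token_count(text):
    """Crude stand-in for a real tokenizer: assumes ~4 characters per token.

    This is an approximation made for this sketch only. For real counts,
    pass a function backed by the model's actual tokenizer instead.
    """
    return (len(text) + 3) // 4


def split_by_token_budget(examples, count_tokens=rough_token_count, limit=MAX_TOKENS):
    """Split examples into those that fit within `limit` tokens and those that don't.

    Each example is a dict with "prompt" and "completion" keys.
    """
    fits, too_long = [], []
    for ex in examples:
        total = count_tokens(ex["prompt"]) + count_tokens(ex["completion"])
        if total <= limit:
            fits.append(ex)
        else:
            too_long.append(ex)
    return fits, too_long


# Two made-up examples: one short, one far over the limit.
examples = [
    {"prompt": "a" * 40, "completion": "b" * 40},
    {"prompt": "a" * 4000, "completion": "b" * 8000},
]
fits, too_long = split_by_token_budget(examples)
print(f"{len(fits)} example(s) fit, {len(too_long)} are too long")
```

With a real tokenizer installed, `count_tokens` could be something like `lambda t: len(enc.encode(t))`, where `enc` is a `tiktoken` encoding; which encoding matches your model is something to confirm rather than assume.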

OK, thanks. I will look into this more.