Is there any way to upload a JSONL file larger in size than 10 GB for fine-tuning purposes?

Hey everyone,

I am working on a use case where I need to visually fine-tune gpt-4o. But the size of the JSONL file I am constructing for the fine-tuning exceeds the 8 GB limit mentioned in the Uploads API documentation.

Apart from doing preprocessing (which might degrade image quality), is there any way to upload the file with the data as it is right now? Maybe in separate parts of 8 GB each? But then, is it even possible to tell the fine-tuning job to point to the different uploaded files?

The thing is, the initial training JSONL file I used was 3.3 GB, so that was fine. But after data augmentation I ended up with something like 70 GB (a lot, I know).

Looking forward to hearing from you!

I don’t think so, but maybe you could use something like a decider up front, perhaps based on a word cloud / keyword density or other search functionality, and redirect to differently fine-tuned models?


Welcome to the OpenAI dev community, @sfares!

To overcome the max file-size limitation, you can provide images as HTTP URLs in your training dataset (instead of base64-encoding them):

{
  "messages": [
    { "role": "system", "content": "You are an assistant that identifies uncommon cheeses." },
    { "role": "user", "content": "What is this cheese?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
          }
        }
      ]
    },
    { "role": "assistant", "content": "Danbo" }
  ]
}

I’d also advise being aware of the following requirements:

Image dataset requirements

Size

  • Your training file can contain a maximum of 50,000 examples that contain images (not including text examples).
  • Each example can have at most 10 images.
  • Each image can be at most 10 MB.

Format

  • Images must be JPEG, PNG, or WEBP format.
  • Your images must be in the RGB or RGBA image mode.
  • You cannot include images as output from messages with the assistant role.
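
If it helps, here is a minimal sketch of how those limits could be checked before submitting a job. It assumes a local training.jsonl file and only inspects base64 data URLs (plain HTTP URLs would have to be fetched first); the file name is just an illustration, and the thresholds mirror the limits listed above:

import base64
import io
import json

from PIL import Image  # pip install pillow

MAX_IMAGES_PER_EXAMPLE = 10
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10 MB per image
ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP"}
ALLOWED_MODES = {"RGB", "RGBA"}

def iter_image_parts(example):
    """Yield (role, url) for every image_url content part in an example."""
    for message in example.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "image_url":
                    yield message.get("role"), part["image_url"]["url"]

def check_example(example, line_no):
    problems = []
    images = list(iter_image_parts(example))
    if len(images) > MAX_IMAGES_PER_EXAMPLE:
        problems.append(f"line {line_no}: {len(images)} images (max {MAX_IMAGES_PER_EXAMPLE})")
    for role, url in images:
        if role == "assistant":
            problems.append(f"line {line_no}: image in assistant message")
        # Only base64 data URLs can be inspected locally;
        # http(s) URLs are skipped in this sketch.
        if url.startswith("data:"):
            raw = base64.b64decode(url.split(",", 1)[1])
            if len(raw) > MAX_IMAGE_BYTES:
                problems.append(f"line {line_no}: image is {len(raw)} bytes (max 10 MB)")
            img = Image.open(io.BytesIO(raw))
            if img.format not in ALLOWED_FORMATS:
                problems.append(f"line {line_no}: format {img.format} not allowed")
            if img.mode not in ALLOWED_MODES:
                problems.append(f"line {line_no}: mode {img.mode} not allowed")
    return problems

with open("training.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        for problem in check_example(json.loads(line), line_no):
            print(problem)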

@sps Thanks, nice idea! So if I got it right, I will need to put my training images on publicly accessible storage? Like, for example, a publicly accessible S3 bucket or something like that?


Precisely, it needs to be accessible over HTTP(S).
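
As a rough sketch of that workflow, assuming an S3 bucket that allows public reads (the bucket name, key prefix, and file path below are all hypothetical), uploading an image and appending the URL-based training record could look something like this:

import json
import pathlib

import boto3  # pip install boto3

# Hypothetical bucket; public read access must be granted via a bucket
# policy, since new buckets block public access by default.
BUCKET = "my-finetune-images"
s3 = boto3.client("s3")

def upload_image(path: str) -> str:
    """Upload a local image to S3 and return its public HTTPS URL."""
    key = f"training/{pathlib.Path(path).name}"
    s3.upload_file(path, BUCKET, key, ExtraArgs={"ContentType": "image/jpeg"})
    return f"https://{BUCKET}.s3.amazonaws.com/{key}"

url = upload_image("danbo.jpg")
example = {
    "messages": [
        {"role": "user", "content": "What is this cheese?"},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": url}},
        ]},
        {"role": "assistant", "content": "Danbo"},
    ],
}
with open("training.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")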


Ohh, looks like I overlooked that part.

You can continue training using an existing fine-tuned model.

Just specify the existing “ft:” model that you created previously.

As the learning parameters are adapted to the size of the file, and continuations can have a quality of “overwriting”, you will want to set manual hyperparameters, such as a low n_epochs or a learning-rate multiplier below the default of 1.0. Then provide equal-size training files to ensure uniformity. This should give you a model that avoids overfitting and doesn’t weight one training file more strongly than another.

As the “n_epochs” parameter is literally “passes through the training file”, you may even want to set it to 1 and alternate between the training files as you continue deepening the weights.
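
A minimal sketch of what that chaining might look like with the Python SDK, assuming the 70 GB dataset has already been split into sub-8 GB chunks and uploaded (the file IDs and model name here are hypothetical):

from openai import OpenAI  # pip install openai

client = OpenAI()

# Hypothetical IDs for the already-uploaded, roughly equal-size chunks.
chunk_file_ids = ["file-chunk1", "file-chunk2", "file-chunk3"]

# Start from the base model, then continue from each resulting ft: model.
model = "gpt-4o-2024-08-06"
for file_id in chunk_file_ids:
    job = client.fine_tuning.jobs.create(
        model=model,
        training_file=file_id,
        hyperparameters={
            "n_epochs": 1,                    # one pass per chunk
            "learning_rate_multiplier": 0.5,  # below the default of 1.0
        },
    )
    # In practice you would poll until the job reports "succeeded";
    # fine_tuned_model is only populated once the job finishes.
    job = client.fine_tuning.jobs.retrieve(job.id)
    model = job.fine_tuned_model  # the "ft:" model to continue from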