You can now fine-tune GPT-4o with images in addition to text. This enables applications like advanced visual search, improved object detection, and more accurate information extraction from documents. We’re offering free training until October 31, up to 1M training tokens a day. Learn more in our docs.
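Roughly, each line of the training JSONL is one chat example, with images passed inline as `image_url` content parts. A minimal sketch in Python (the task, text, and URL below are invented for illustration):

```python
import json

# One vision fine-tuning example in the chat JSONL format.
# The system prompt, question, image URL, and answer are all placeholders.
example = {
    "messages": [
        {"role": "system", "content": "You identify landmarks in photos."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which landmark is this?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            ],
        },
        {"role": "assistant", "content": "That is the Eiffel Tower."},
    ]
}

# Append one example per line to the training file.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```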
Does this update mean that previous fine-tunes will now be multimodal or would they need to be trained again?
Hi @anon22939549! You should be able to submit images to previously text-only fine-tuned gpt-4o-2024-08-06 models, but they might not perform as well on image inputs as they would if you had fine-tuned them on images.
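For reference, querying a fine-tuned model with an image works the same as with the base model; a minimal sketch using the Python SDK (the fine-tuned model ID and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# The "ft:..." model name is a placeholder; use your own fine-tuned model ID.
response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org::abc123",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```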
This is a crazy awesome development. It’s going to change how people do marketing and illustration.
Amazing, thank you so much for the community post! I won’t try it just now as it seems like things are still rolling out. Can’t wait to give this a try!
Hi @_thiago, vision fine-tuning is ready for you to use right now! Are you seeing otherwise?
oh, let me try it out! thanks for letting me know!
Edit: wow! 1M tokens per day! I just read that part, hang on, almost done testing
So far, everything has been great. I was making the mistake of using the wrong model to train it (I was using gpt-4o-mini-2024-07-18 and not gpt-4o-2024-08-06; hehe, I didn’t read the bottom of the page introducing vision fine-tuning). Now I guess I’ll wait for a bit until it is done fine-tuning.
side note: I was pretty happy I was able to use base64 to encode the images for the JSONL.
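In case it helps anyone, a minimal sketch of the base64 approach (the file name and MIME type are assumptions; adjust for your images):

```python
import base64

def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Read a local image and encode it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# Drop the result into an image_url content part of a JSONL example:
part = {"type": "image_url", "image_url": {"url": image_to_data_url("grid.png")}}
```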
I want to know whether one training sample can support more than 10 images, because from the docs I can’t understand the limit on the number of images in one training sample (one chat record). Hoping for your reply, thanks very much!
training right now with 64 images, so I’m not sure what the limit is either
And it fine-tuned! Let’s go! Thank you @willhang for letting me know it is already live!
training loss looking good too
Do you mean one chat sample includes 64 images, or the total samples include 64 images?
Oh, I tried 64 images across the total samples, but you make a valid point; I’ll certainly explore that also.
edit: I think what you are referring to in the documentation is called multi-turn chat examples. I’m not sure how long a multi-turn example with images can be, or what the image limit is on a single-turn training example.
edit 2: maybe this helps:
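To make the single-turn vs. multi-turn distinction concrete: a multi-turn example is still one JSONL line, just with several user/assistant turns. A sketch (the questions, answers, notation, and URLs are all invented):

```python
import json

# One multi-turn training example: several user/assistant turns,
# each user turn carrying its own image (URLs are placeholders).
example = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Where is the dot?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/grid1.png"}},
        ]},
        {"role": "assistant", "content": "A3"},
        {"role": "user", "content": [
            {"type": "text", "text": "And in this one?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/grid2.png"}},
        ]},
        {"role": "assistant", "content": "F7"},
    ]
}

print(json.dumps(example))  # one line of the training JSONL
```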
I’m very curious to test how vision fine-tuning can potentially improve output accuracy for low-resolution images. Cost-wise this would be very interesting.
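One related knob worth knowing about: the standard vision API lets you request low-detail image processing, which cuts token cost. A small sketch (the URL is a placeholder):

```python
# Ask for low-detail processing to reduce token cost; a natural fit
# when the source images are low resolution anyway.
part = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/lowres.jpg", "detail": "low"},
}
```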
You can fine-tune using just the user interface, clicking through with no coding: head over to platform.openai.com/finetune/ and give it a try. You can also do everything in code, and even automate the whole process that way.
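If you go the code route, the flow is roughly: upload the JSONL, then create the job. A minimal sketch with the Python SDK (the file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# 1. Upload the JSONL training file.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the fine-tuning job on the vision-capable snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)
print(job.id, job.status)
```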
I see, thank you very much!!
My first fine-tuned vision model got perfect results. I created an 8×8 grid with a dot in each position; the training data contained the location of the dot in its own personalized notation, and the fine-tuned model’s responses got the right location of the dot, in that notation. A simple example, which I guess could’ve been done with a simple neural network, but regardless, great results! Very, very cool stuff, a lot of use cases.
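For anyone who wants to reproduce something like this, here is a sketch of how such a dataset could be generated (not the original code; it uses PIL, and the `R{row}C{col}` labels stand in for the personalized notation):

```python
import base64
import json
from PIL import Image, ImageDraw

CELL = 32  # pixels per grid cell (illustrative)

def make_grid_image(row: int, col: int) -> str:
    """Draw an 8x8 grid with a dot at (row, col); return the saved path."""
    img = Image.new("RGB", (8 * CELL + 1, 8 * CELL + 1), "white")
    draw = ImageDraw.Draw(img)
    for i in range(9):  # grid lines
        draw.line([(i * CELL, 0), (i * CELL, 8 * CELL)], fill="black")
        draw.line([(0, i * CELL), (8 * CELL, i * CELL)], fill="black")
    cx, cy = col * CELL + CELL // 2, row * CELL + CELL // 2
    draw.ellipse([cx - 6, cy - 6, cx + 6, cy + 6], fill="red")
    path = f"dot_{row}_{col}.png"
    img.save(path)
    return path

# Write one single-turn example per dot position: 64 images in total.
with open("train.jsonl", "w") as f:
    for row in range(8):
        for col in range(8):
            path = make_grid_image(row, col)
            with open(path, "rb") as img_f:
                url = "data:image/png;base64," + base64.b64encode(img_f.read()).decode()
            example = {"messages": [
                {"role": "user", "content": [
                    {"type": "text", "text": "Where is the dot?"},
                    {"type": "image_url", "image_url": {"url": url}},
                ]},
                {"role": "assistant", "content": f"R{row}C{col}"},
            ]}
            f.write(json.dumps(example) + "\n")
```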
What got me thinking is one thing: at which point do I use multi-turn JSONL examples, and at which point do I use multiple fine-tuned models? I guess that is something to find out in the future.
I’ve just started training on 22,000 examples … let’s see what happens.
woah, amazing, please do report back with results if you can, would love to find out more about how it went!
Wow, so it’s been verifying for hours. Now I know why: it checks every single image first …
“Training file file-XXX contains 89 examples with images that were skipped due to moderation or public inaccessibility. These examples will not be used for training. Using 21896 examples from training file”
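If you want to catch some of these before uploading, a rough local pre-check is possible. A sketch (this is just a crude filter for obviously malformed image URLs; the platform’s moderation and accessibility checks are separate and can still skip examples):

```python
import json

# Flag examples whose image URLs are neither base64 data URLs nor https URLs.
bad = []
with open("train.jsonl") as f:
    for n, line in enumerate(f, 1):
        example = json.loads(line)
        for msg in example["messages"]:
            content = msg.get("content")
            if not isinstance(content, list):
                continue
            for part in content:
                if part.get("type") != "image_url":
                    continue
                url = part["image_url"]["url"]
                if not url.startswith(("data:image/", "https://")):
                    bad.append(n)

print(f"{len(bad)} suspicious examples:", bad[:10])
```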
Amazing, glad it’s working for you! Have fun! We’re working on support for 4o-mini as well, coming soon.