You can now fine-tune GPT-4o with images in addition to text. This enables applications like advanced visual search, improved object detection, and more accurate information extraction from documents. We’re offering free training until October 31, up to 1M training tokens a day. Learn more in our docs.
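Roughly, each line of the training JSONL is one chat example, with images passed inline as `image_url` content parts. A minimal sketch in Python (the task, text, and URL below are invented for illustration):

```python
import json

# One vision fine-tuning example in the chat JSONL format.
# The system prompt, question, image URL, and answer are all placeholders.
example = {
    "messages": [
        {"role": "system", "content": "You identify landmarks in photos."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which landmark is this?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            ],
        },
        {"role": "assistant", "content": "That is the Eiffel Tower."},
    ]
}

# Append one example per line to the training file.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```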
Does this update mean that previous fine-tunes will now be multimodal or would they need to be trained again?
Hi @anon22939549! You should be able to submit images to previously text-only fine-tuned gpt-4o-2024-08-06 models, but they might not perform as well on image inputs as they would if you had fine-tuned them on images.
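For reference, querying a fine-tuned model with an image works the same as with the base model; a minimal sketch using the Python SDK (the fine-tuned model ID and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# The "ft:..." model name is a placeholder; use your own fine-tuned model ID.
response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org::abc123",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```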
This is a crazy awesome development. It’s going to change how people do marketing and illustration.
Amazing, thank you so much for the community post! I won’t try it just now as it seems like things are still rolling out. Can’t wait to give this a try!
Hi @_thiago, vision fine-tuning is ready for you to use right now! Are you seeing otherwise?
oh, let me try it out! thanks for letting me know!
Edit: wow! 1M tokens per day! I just read that part, hang on, almost done testing
So far, everything has been great. I was making the mistake of using the wrong model to train it (I was using gpt-4o-mini-2024-07-18 and not gpt-4o-2024-08-06; hehe, I didn’t read the bottom of the page introducing vision fine-tuning). Now I guess I’ll wait for a bit until it is done fine-tuning.
side note: I was pretty happy I was able to use base64 to encode the images for the JSONL.
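In case it helps anyone, a minimal sketch of the base64 approach (the file name and MIME type are assumptions; adjust for your images):

```python
import base64

def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Read a local image and encode it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# Drop the result into an image_url content part of a JSONL example:
part = {"type": "image_url", "image_url": {"url": image_to_data_url("grid.png")}}
```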
I want to know whether one training sample can support more than 10 images, because from the docs I can’t understand the limit on the number of images in one training sample (one chat record). Hoping for your reply, thanks very much!
training right now with 64 images, so I’m not sure what the limit is either
And it fine-tuned! Let’s go! Thank you @willhang for letting me know it is already live!
training loss looking good too
Do you mean one chat sample includes 64 images, or the total samples include 64 images?
Oh, I tried 64 images across the total samples, but you make a valid point; I’ll certainly explore that also.
edit: I think what you are referring to in the documentation is called multi-turn chat examples. I’m not sure how long a multi-turn example with images can be, or what the image limit is on a single-turn training example.
edit 2: maybe this helps:
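To make the single-turn vs. multi-turn distinction concrete: a multi-turn example is still one JSONL line, just with several user/assistant turns. A sketch (the questions, answers, notation, and URLs are all invented):

```python
import json

# One multi-turn training example: several user/assistant turns,
# each user turn carrying its own image (URLs are placeholders).
example = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Where is the dot?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/grid1.png"}},
        ]},
        {"role": "assistant", "content": "A3"},
        {"role": "user", "content": [
            {"type": "text", "text": "And in this one?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/grid2.png"}},
        ]},
        {"role": "assistant", "content": "F7"},
    ]
}

print(json.dumps(example))  # one line of the training JSONL
```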
I’m very curious to test how vision fine-tuning can potentially improve output accuracy for low-resolution images. Cost-wise this would be very interesting.
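One related knob worth knowing about: the standard vision API lets you request low-detail image processing, which cuts token cost. A small sketch (the URL is a placeholder):

```python
# Ask for low-detail processing to reduce token cost; a natural fit
# when the source images are low resolution anyway.
part = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/lowres.jpg", "detail": "low"},
}
```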
You can fine-tune using just the user interface, clicking through with no coding: head over to platform.openai.com/finetune/ and give it a try. You can also do everything in code, and even automate the whole process that way.
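If you go the code route, the flow is roughly: upload the JSONL, then create the job. A minimal sketch with the Python SDK (the file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# 1. Upload the JSONL training file.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the fine-tuning job on the vision-capable snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)
print(job.id, job.status)
```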
I see, thank you very much!!
My first fine-tuned vision model got perfect results. I created an 8×8 grid with a dot in each position; the training data contained the location of the dot in its own personalized notation, and the fine-tuned model’s responses got the right location of the dot, in that notation. A simple example, which I guess could’ve been done with a simple neural network, but regardless, great results! Very, very cool stuff, a lot of use cases.
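For anyone who wants to reproduce something like this, here is a sketch of how such a dataset could be generated (not the original code; it uses PIL, and the `R{row}C{col}` labels stand in for the personalized notation):

```python
import base64
import json
from PIL import Image, ImageDraw

CELL = 32  # pixels per grid cell (illustrative)

def make_grid_image(row: int, col: int) -> str:
    """Draw an 8x8 grid with a dot at (row, col); return the saved path."""
    img = Image.new("RGB", (8 * CELL + 1, 8 * CELL + 1), "white")
    draw = ImageDraw.Draw(img)
    for i in range(9):  # grid lines
        draw.line([(i * CELL, 0), (i * CELL, 8 * CELL)], fill="black")
        draw.line([(0, i * CELL), (8 * CELL, i * CELL)], fill="black")
    cx, cy = col * CELL + CELL // 2, row * CELL + CELL // 2
    draw.ellipse([cx - 6, cy - 6, cx + 6, cy + 6], fill="red")
    path = f"dot_{row}_{col}.png"
    img.save(path)
    return path

# Write one single-turn example per dot position: 64 images in total.
with open("train.jsonl", "w") as f:
    for row in range(8):
        for col in range(8):
            path = make_grid_image(row, col)
            with open(path, "rb") as img_f:
                url = "data:image/png;base64," + base64.b64encode(img_f.read()).decode()
            example = {"messages": [
                {"role": "user", "content": [
                    {"type": "text", "text": "Where is the dot?"},
                    {"type": "image_url", "image_url": {"url": url}},
                ]},
                {"role": "assistant", "content": f"R{row}C{col}"},
            ]}
            f.write(json.dumps(example) + "\n")
```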
What got me thinking is one thing: at which point do I use multi-turn JSONL examples, and at which point do I use multiple fine-tuned models? I guess that is something to find out in the future.
I’ve just started training on 22,000 examples … let’s see what happens.
woah, amazing, please do report back with results if you can, would love to find out more about how it went!
Wow, so it’s been verifying for hours. Now I know why: it checks every single image first …
“Training file file-XXX contains 89 examples with images that were skipped due to moderation or public inaccessibility. These examples will not be used for training. Using 21896 examples from training file”
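If you want to catch some of these before uploading, a rough local pre-check is possible. A sketch (this is just a crude filter for obviously malformed image URLs; the platform’s moderation and accessibility checks are separate and can still skip examples):

```python
import json

# Flag examples whose image URLs are neither base64 data URLs nor https URLs.
bad = []
with open("train.jsonl") as f:
    for n, line in enumerate(f, 1):
        example = json.loads(line)
        for msg in example["messages"]:
            content = msg.get("content")
            if not isinstance(content, list):
                continue
            for part in content:
                if part.get("type") != "image_url":
                    continue
                url = part["image_url"]["url"]
                if not url.startswith(("data:image/", "https://")):
                    bad.append(n)

print(f"{len(bad)} suspicious examples:", bad[:10])
```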
Amazing, glad it’s working for you! Have fun! We’re working on support for 4o-mini as well, coming soon.