We've added support for vision fine-tuning

hahah yeah, the 64-image one I trained took quite a bit, maybe a few minutes. I think yours might be done tomorrow or in a few days. At least for me, it cost less than $0.10, and for everyone there are 1M free daily tokens for now

ah, can’t wait for the 4o-mini! wishing you guys good luck!

1 Like

Hi @luoshengqin666, so each line in the JSONL file is referred to as an “example”. A limit of 10 images per example would mean that each example (line in the file) can have at most 10 images. Each file can have many examples: up to 50,000 multimodal examples for now, not including text-only examples. We’ll look into bumping this up depending on the feedback we get from our users.
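Roughly, a single multimodal example could be assembled and appended to the training file like this. This is just a sketch: the URLs and prompts are placeholders, and the message schema with image_url content parts is based on the chat fine-tuning format, so double-check it against the docs.

```python
import json

# One "example" = one line of the JSONL file. Up to 10 images may appear
# across all of its turns.
example = {
    "messages": [
        {"role": "system", "content": "You are an assistant that describes product photos."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What product is shown here?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo-1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo-2.jpg"}},
            ],
        },
        {"role": "assistant", "content": "A stainless-steel water bottle."},
    ]
}

# Append the JSON-serialized example as one line of the training file.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```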

1 Like

Great points! So it’d be 10 images per example, regardless of whether those images are in a single turn or across multiple turns. We do need to look into updating the training example context length with the information on images, I’ll make a note to do that.
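If you want a quick sanity check on your own files in the meantime, a rough helper like the one below (not an official tool, and assuming the same message schema as the sketch above) counts image_url parts across every turn of each example:

```python
import json

def count_images(example: dict) -> int:
    """Count image_url content parts across all turns of one example."""
    total = 0
    for message in example.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            total += sum(1 for part in content if part.get("type") == "image_url")
    return total

# Flag any example that exceeds the 10-images-per-example limit.
with open("train.jsonl") as f:
    for lineno, line in enumerate(f, start=1):
        n = count_images(json.loads(line))
        if n > 10:
            print(f"line {lineno}: {n} images exceeds the 10-image limit")
```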

2 Likes

Hi @jr.2509, it depends on your task! If your low-res images inherently contain enough information for the model, it could work. What’s your task?

3 Likes

Super cool! To answer your question, it depends on your task! What problems are you trying to solve?

1 Like

Amazing, glad this resulted in a happy ending! Hope this trains well too. Would love to see your training curve here when the job completes!

As you can see, we care a lot about the safety of our models. We scan every image to make sure that input training data strictly adheres to our content moderation policy.

I realize it might be helpful to tell users that we’re inspecting their images, hence this will take a while. We’re looking into progress bars to improve the experience.

1 Like

One of the things I have been looking at is to use gpt-4o to detect certain visuals (e.g. charts, graphs) in longer PDF documents, to deal with cases where these visuals are not embedded as images and thus cannot be extracted with traditional libraries.

I was looking at two options:

  • Base case: Using the vision model purely for a binary task, i.e. identifying whether a page contains a visual / chart or not

  • Advanced case: Using the vision model to identify whether a page contains a visual and then having it identify the coordinates of the visual on the page. The information would then be used in a subsequent step to automatically crop the page and extract the visual.

I was thinking that for the base case I could potentially fine-tune a gpt-4o with PDF pages as images in low res to keep the costs down.
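For that base case, a rough sketch of what I have in mind is below, assuming PyMuPDF for rendering and low-detail image inputs to keep token costs down. The model name, DPI, and prompt are placeholders, not something I have validated.

```python
import base64

import fitz  # PyMuPDF, for rendering PDF pages to images
from openai import OpenAI

client = OpenAI()

def page_has_chart(pdf_path: str, page_index: int) -> bool:
    """Binary check: does this PDF page contain a chart or graph?"""
    doc = fitz.open(pdf_path)
    # Render at modest DPI since only a low-res view is needed.
    pix = doc[page_index].get_pixmap(dpi=72)
    b64 = base64.b64encode(pix.tobytes("png")).decode()

    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # or a fine-tuned model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Does this page contain a chart or graph? Answer yes or no."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "low"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```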

2 Likes

this does answer my question, thank you!

the task mixes interactions and categories, so different interactions would be multi-turn and different categories would be different fine-tuned models.

each category has its own multi-turn “characteristic”, so saying it depends on the task makes a lot of sense to me. Thank you.

2 Likes

I will be doing lots of experimentation with this, you guys rock.

This might very well be the missing piece of the puzzle on my main use-case, only other thing I can think of is o1 being multi-modal.

I love Ship-tober :smile:

3 Likes

WOW!! I can’t wait to try it and test it.
I was thinking about when we would be able to train on images with 4o and other models, and suddenly I saw this :rocket:
This is amazing, thanks!
Excited to try it and see the results.

How so? I haven’t been able to really look at it

@willhang wow, neat, thank you for all of the awesome tips. Can it work with SVG?

@DiBop It’s now possible to train on branding information and other visual styles. With enough examples of how you use your brand, you could fine-tune marketing material generation for any company.

Oh absolutely! That’s a great use case for low-res fine-tuning. Although I’d be surprised if our base models didn’t do a good job of that already… have you tried with a base model?

1 Like

Let us know how things go! Don’t be afraid to post any success stories (or difficulties using the product) in this thread and tag me.

3 Likes

Let us know how it goes! Feel free to post your results here, excited to see what you come up with!

1 Like

You’re right that we don’t currently offer fine-tuning with image outputs, only with image inputs.

1 Like

It only works with JPEG, PNG, and WebP for now. We’ll keep SVG in mind going forward!
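Until then, one unofficial workaround is to rasterize SVGs yourself before building the training file, for example with the cairosvg package (the output width below is just an example):

```python
import cairosvg

# Convert an SVG to a PNG that the fine-tuning API can accept.
cairosvg.svg2png(url="logo.svg", write_to="logo.png", output_width=512)
```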

1 Like

Yes, I did try gpt-4o base in low res some time ago but did not get consistent results for some reason. But I’m confident I can get it to work with fine-tuning.

1 Like

Speaking of image formats, I’ve been meaning to ask…

Is there a “native” image format for vision?

That is, before tokenization, are the images transcoded to any particular image format? Or, if the model was trained to natively accept multiple image formats, is there any correlation between the image format and the quality of the response?

My thinking here is that just as many users can get better results by sending their own extracted text rather than the source PDF, so too some users may benefit from doing the conversion to the preferred image format themselves rather than relying on some possible default conversion.
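For what it’s worth, if I end up doing the conversion myself rather than relying on any default handling, I’m imagining something as simple as a Pillow re-encode (PNG as the target format is just my guess, not a documented recommendation):

```python
from PIL import Image

# Re-encode an image to PNG before uploading it, to control the format myself.
with Image.open("input.webp") as im:
    im.convert("RGB").save("input.png", "PNG")
```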