What are the limits of fine-tuning?

As an example: if I uploaded a billion prompt/completion pairs of jpeg data and descriptions to OpenAI’s API, could I teach GPT-3.5 to describe images?

Could I teach GPT-3.5 to speak a previously unknown human language?

Or is the fine-tuning mechanism intrinsically limited? How should we think about the limits of fine-tuning?
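Concretely, the kind of training pair I have in mind might look like the sketch below. This is purely hypothetical: the jpeg bytes are a fake placeholder, and I'm assuming the image would have to be smuggled in as text (e.g. base64) since the API only accepts text. The `\n\n###\n\n` separator follows OpenAI's fine-tuning data conventions.

```python
import base64
import json

# Hypothetical sketch: turn a (fake) jpeg and its caption into one
# prompt/completion training pair. Real jpeg bytes would come from disk;
# here we use a placeholder byte string with a JPEG-like header.
jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16

pair = {
    # The image has to be encoded as text somehow, e.g. base64.
    "prompt": base64.b64encode(jpeg_bytes).decode("ascii") + "\n\n###\n\n",
    # Leading space per OpenAI's completion-formatting guidance.
    "completion": " A photo of a cat sitting on a windowsill.",
}
print(json.dumps(pair))
```

Scale that up by a billion pairs and you have the experiment I'm describing.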


Interesting question. Since GPT-4 is multimodal, I wonder if this could be how they did it.


First off, you can only fine-tune the base GPT-3 models … so the original Ada, Babbage, Curie, and Davinci. You cannot fine-tune text-davinci-003 or gpt-3.5-turbo (at least as of this writing, but they are adding features all the time, so this may change in the future).
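For reference, the legacy fine-tuning flow for those base models is: write your prompt/completion pairs to a JSONL file, then submit it against a base model. A minimal sketch (the file name and example pairs are made up):

```python
import json

# Made-up example pairs in the legacy fine-tune JSONL format
# (one {"prompt": ..., "completion": ...} object per line).
pairs = [
    {"prompt": "Translate to French: cheese ->", "completion": " fromage"},
    {"prompt": "Translate to French: bread ->", "completion": " pain"},
]

with open("pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")

# Then submit it with the legacy CLI (as of this writing):
#   openai api fine_tunes.create -t pairs.jsonl -m davinci
```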

Second, the model you are using is a language model: it was trained to understand language. So my concern is that tokens derived from an image would look like noise to it. Your fine-tune only affects the decoder, not its underlying language understanding, and since an image has no linguistic coherence, the signal you are trying to train would be drowned out by that assumed language understanding.

However, and this is a big however, the same underlying technology is used in the image and audio domains. So there is “hope”, right? I’m afraid decoder training alone won’t be enough, and the language-model internals would have to be retrained on the new input media, but hey, you can give it a shot and let us know how it goes! I wouldn’t hold my breath, but if you are successful, I would be very interested!


Thanks for the detailed answer.

I’m definitely not going to spend thousands of dollars running the experiment on images.

I was just hoping that there was some way to predict from first principles what can be achieved in fine-tuning. I don’t want to empty my bank account on experiments.

Can you share a reference which would help me understand this sentence? “And your fine-tune is only affecting the decoder, not its language understanding.”

It’s really speculation on my part, but I am not the only one who thinks this. And others say a fine-tune affects ALL parameters in GPT-3, but I have a hard time believing my fine-tune file creates a new model with 175 billion parameters (if I fine-tune Davinci). More on these conjectures HERE!

So, conjecture aside, I do give you a chance of success, only because, I think, there is a small chance I am wrong.