About fine tuning - Your opinion?

Greetings, esteemed members of the OpenAI community,

I would like to hear your opinions on fine-tuning.

Do you believe that fine-tuning GPT-3.5 Turbo can yield better performance than GPT-4?

My personal standpoint is as follows:
I have been personally skeptical about fine-tuning for a while. Given the immense size of the GPT model, small datasets are unlikely to deliver successful results; substantial datasets would likely be necessary for exceptional performance. Moreover, even with extensive fine-tuning, I also believe that fine-tuned GPT-3.5 Turbo may not surpass the performance of GPT-4, as GPT-4 represents a more advanced model.

In the end, I feel that fine-tuning may not be our primary objective; instead, waiting for a superior model like GPT-4 updates, GPT-4.5, or GPT-5 would be a more promising approach.

Could you please share your thoughts?

Maybe? Not sure what you mean by performance though. Turbo is already faster than GPT-4, so it must not be speed you are talking.

I don’t think fine-tuning will make Turbo “smarter”, but it would impart your tone or lexicon into the model (is that a performance parameter for you?)

But creativity, reasoning, I’m not sure a fine-tune will help Turbo and make it surpass GPT-4.

If we are talking injecting knowledge, from RAG, then that doesn’t strictly involve any fine-tuning.

The only thing I have personally done, is interleave GPT-4 and Turbo on the same conversation stream. Turbo will learn from GPT-4 directly!

If you start out with GPT-4 and follow it with Turbo, it sees the previous GPT-4 “Assistant” messages and becomes like a GPT-3.75 model.


I agree with Curt,

Kinda depends on what you want the model to do :laughing:

@BrianLovesAI have you tried fine tuning yet?


I’ve found fine-tuning to be very useful in the cases it was designed for (i.e., a specific style of response with particular formatting, tone, etc.).

Without a doubt, GPT-4 will give you better reasoning and overall better responses in general chatting, but I’ve found a number of cases where I want it fast and I want it “my way”, and tuned models work well for that. It strips away the need for lengthy prompting and let’s you do one thing pretty well consistently and on the cheap.


Unfortunately, I believe that fine tuning is just another marketing trick to make paying users of ChatGPT think they are getting more value for the money than not paying users of ChatGPT or free Copilot in Edge.

1 Like

API fine-tuning has nothing to do with ChatGPT… I have to respectfully disagree here.


In narrow, specialized tasks where gpt-4 has not greatly outpaced gpt-3.5-turbo, absolutely. It’s already been demonstrated.

To determine if your use case is a good candidate, you would want to,

  1. Very clearly define the specific narrow task you are performing
  2. Empirically measure how gpt-4 and gpt-3.5-turbo perform on that task
  3. Determine how steerable the model is to performing the task
  4. Assess the quantity and quality of your data

If you can identify a very tiny sliver of functionally you are interested in, 3.5 performs close to on-par with 4, the model responds well to fine-tuning for the task, and you have lots of high-quality data…

Go for it, I predict you have a winner.

That said, there are many tasks where gpt-3.5-turbo can be brought very close to the performance of gpt-4, and a fine-tuned model is much cheaper to use than the bigger, newer variant.

Myself, I’m holding off a bit on one of the fine-tuning projects I want to play with because I’ve got my fingers crossed for some Happy Dev Day announcements[1]

  1. All I want for Dev Day is some gpt-3.5-turbo-instruct fine-tuning and a modest price cut. ↩︎

1 Like

This is true, but I’m assuming you also saw the RA-DIT: Retrieval-Augmented Dual Instruction Tuning paper?

So much interesting stuff being published every month, I imagine the foundational models published in 2026 are going to make multimodal GPT-4 look quite quaint.


I read/skimmed the RA-DIT paper, and then just did it again. But I am scratching my head at actually computing some of the things in their paper and actually implementing it.

Usually if I read a paper, and then have no decent idea of implementing it, I just move on and see of the implementation pops up later on. Or maybe is at least explained better, so I can finally see how to impliment it.

But can I implement RA-DIT now? No I can’t! Still fuzzy for me.

Have you tried an implementation, or have a notional implementation in mind?

1 Like

I think the code is here



You get a bit more near the end - they can beat another selected Llama 65b (that is not a 900k chat trained llama-2 that one might find on /r/localllama now) by fine tuning with 64-GPU systems. They didn’t compare GPT-4. Filled the fine-tune context with chat asking about wikipedia injection instead of any special domains.

Here’s the paper from these paper factory pals that highlights maintaining Q_A style training attention within multi-question large context tune: OPT-IML : Scaling Language Model Instruction Meta Learning through the Lens of Generalization]([2212.12017] OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization)

AKA “beyond developing with OpenAI” - where you could fine tune with $400 of individual RAG simulations of your own knowledge for the same price as a confused context…and then probably just ask your model.

1 Like

The LlamaIndex code above just uses OpenAI and seems to make more sense to me, so I might have to dig in there, since it looks straightforward, but does involve a fine-tune :frowning_face: (see comments/approach I have below)

The OPT-IML paper seems focused on creating a framework to evaluate performance of newly generated instruct models. :man_shrugging:

Everyone seems worried with attention with large contexts.

I get it, and I think differently about this it seems …

So assume you had various 2k-4k large contexts retrieved from your RAG, essentially crossed them with the query to get new projections out of the model, and very likely smaller, and then reconciled these smaller things in the end.

So for example:

{Big Chunk 0 from RAG} x {Query} = {Projected answer 0}
{Big Chunk 1 from RAG} x {Query} = {Projected answer 1}

{Big Chunk N from RAG} x {Query} = {Projected answer N}

Here “x {Query}” just means get the LLM to draw an answer (prompted) from the Big Chunk X.

Then take a final prompt:

Based on the hypothetical answers here:
{Projected answer 0}
{Projected answer 1}
{Projected answer N}

What is the most likely consistent, correct, and honest answer to this question: 


So all your attention is localized to 2k-4k chunks (not a crazy amount IMO) and you have some “reconciler” operation in the end. This is only for massive massive amounts of data that you want it to comb through. Otherwise you just have the one big chunk, and don’t need to reconcile.

You could also tag previously vetted true answers as [“PROCEED” given {Query}] and if the query correlates and the answer correlates, then you have a high confidence factor and can return this new answer. And later go back and see if you agree it was a good answer, and label it at “HALT” given {Query}].

Alternatively, you can just use this embedding procedure and correlation to get the previous PROCEED/HALT tags, and pick the most correlated one with a PROCEED tag without a followup “reconciler” query (saves some latency)

All this labeling, and I’ve done it, is through a simple script I run periodically on the database. I just hit 0 or 1 at the terminal for each item, and the database is instantly updated, and can be used for future queries.

I guess I believe in more of a hands-on approach before delving into the models and fine-tuning them to retrieve alternative data. I believe some would call my approach data driven more than anything.


1 Like

I’m not worried about attention, I think we may be using it slightly incorrectly.

My reasoning is a little out there on this, but I’ll give it a shot.

Chimpanzees have a far greater short term memory, which I am going to equate with attention. They are able to remember at a glance 3-4 or more times as much information as humans can. Humans now use that part of the brain for complex language…

We both come from similar base hardware, but humans with less attention are able to solve more complex problems. I think it’s down to a management system that uses our limited attention to better effect, and I think AI is missing something akin to an L2 mid term attention management system (L2 like the cache memory)

The billion token Microsoft paper did a better job of explaining this.


My attention comment wasn’t about anything you said recently @Foxabilo, the gears started turning when I read this comment here:

But had more extensive comments over in this thread:

I see a lot of attention dilution and worries, which I admit are real, but unless you are using GPT-4-32k, you probably don’t have to worry about it.

Maybe the only types of domains where you could worry about attention, is detailed coding syntax things, or similar, where you may want to reduce the tokens.

I’m just thinking if “it takes me more to concentrate on it, then reduce the tokens for the model to also concentrate on it”. I don’t have any proof this intuition is true or not, because I assume “rare things have less training than common things” in the training stages of these models.

Of course OpenAI could have trained the hell out of coding in these models, and skimmed natural language. I don’t really know. But the question is, is there more natural language than code? I don’t know, there is a lot of code online in GitHub, maybe more than all the blogs combined.


:smile: I must of gotten the wrong end of the stick. It’s late here!


Yeah, close to midnight over in England right? Shouldn’t you be sleeping? :rofl:


Yea… probly should go bed when I start talking about monkeys and L2 caches… :joy:


Sleep schedules doesn’t exist anymore :laughing:


Yeah API use cases are different than ChatGPT which is the consumer product they’ve built themselves. It costs money to run GPUs on the cloud and be able to offer these services is a HUGE maintenance cost not to mention the cybersecurity threats for having to spin up your own. You can run llama2 fine tune it all for free, but i doubt you’ll be able to run it locally, so you’d have to use something like Run Pod, https://www.runpod.io. Which can get you setup in about 5 mins, but even a decent GPU cloud model will run $.79 /hr x 24 x30 = $568.80. But yeah ChatGPT and using the API are completely different use cases.

1 Like

I’m currently fine tuning the GPT-3.5-Turbo, but primarily because I would much actually prefer to use the GPT-4 model, but it only supports 8k for API although there is the 32k, its reserved for enterprises or super accounts probably spending $100k+/month although I don’t really know. You can see in playground model the GPT-4-32k isn’t available, while, GPT-4 (8k) is. So you should consider this as well if the context is important or not. If its just quality than GPT-4 will always provide better reasoning and more creative results, but fine tuning GPT-3.5-turbo should provide better domain knowledge if its fine-tuned. Alpaca was fined tuned with 55,000 prompts off Llama2 and now Stanford has their own LLM.

1 Like