Do you fine tune? If so why?

Just curious… Do you fine-tune your model, and if so, why? Are you trying to reduce the number of tokens you pass in? If so, what's the cost trade-off? Fine-tuned models are expensive… At Microsoft, lots of people are saying to avoid fine-tuned models. They're expensive and it's not clear there's a benefit over a well-written prompt. Just curious what others in the community are finding…

I fine-tune.

They work great as single-token output categorizers. I can't use a prompt since there is more information in the training data than can fit in a single prompt (I train each model on many thousands of examples). They work better than many SOTA RNNs too.

Useful for pre-filtering, scoring, and other useful things. Fine-tune the cheaper models like Ada and Babbage to avoid cost. This has nothing to do with completions, which is what your colleagues must be thinking.
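A single-token classifier call against a fine-tuned Ada might look like this. Sketch only, using the legacy `openai` v0.x `Completion` API; the model name and separator are placeholders you'd replace with your own fine-tune:

```python
def build_prompt(text: str, separator: str = "\n\n###\n\n") -> str:
    # The prompt must end with the same separator used in the training data,
    # or the completion quality drifts.
    return text.strip() + separator

def classify(text: str, model: str = "ada:ft-your-org-2023-01-01") -> str:
    # Hypothetical fine-tune name. max_tokens=1 forces a single-token label;
    # logprobs lets you inspect the model's confidence in each candidate.
    import openai
    resp = openai.Completion.create(
        model=model,
        prompt=build_prompt(text),
        max_tokens=1,
        temperature=0,
        logprobs=2,
    )
    return resp["choices"][0]["text"].strip()
```

Temperature 0 keeps the label deterministic, which is what you want for pre-filtering and scoring.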


Great feedback @curt.kennedy… This is exactly the type of feedback I'm looking for here. I'm about to have a flood of customers ask me for fine-tuning advice, so I'm looking for insight as to when it makes sense and what the advantages/tradeoffs are…

1 Like

So to better understand your use case… You're basically using a fine-tuned model to build a classifier?

@stevenic Correct. Usually binary classifiers (two states) to increase the SNR, but with enough training data you can handle more states.
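The training data for a binary classifier like this is just JSONL of prompt/completion pairs, with a single-token completion (leading space included). A minimal sketch with made-up examples:

```python
import json

# Hypothetical examples for a binary "in-topic" classifier. Completions are
# single tokens (" 0" / " 1") with a leading space, as the fine-tuning docs
# recommend, and prompts end with a fixed separator.
examples = [
    {"prompt": "How do I reset my router?\n\n###\n\n", "completion": " 1"},
    {"prompt": "Write me a poem about cats\n\n###\n\n", "completion": " 0"},
]

def to_jsonl(rows) -> str:
    # One JSON object per line -- the format the fine-tuning endpoint expects.
    return "\n".join(json.dumps(r) for r in rows)

with open("train.jsonl", "w") as f:
    f.write(to_jsonl(examples))
```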

1 Like

Do you find that this works better than something like Logistic Regression, or is it just more convenient since you don't have to host the classifier yourself?

For me it’s more about accuracy, and sure, since I don’t have to host it, that’s a plus.

I have trained them on my own (built from scratch), and RNNs are notoriously hard to train because of the vanishing gradient problem.

The transformer network (the foundation of GPTs) seems to solve this and gives very high accuracy with such a small amount of training data. So less training data and higher accuracy. Win-win.


I'm not an ML wonk, so not having to host it myself resonates with me most, but that's insightful. :slight_smile: I may PM you to learn more details about your specific scenario and the approach you're taking.

Here is an RNN classifier — a toy example that you can hand-code for better understanding. After you do this, then evolve and generalize it, you can start to see what I am talking about.
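A hand-codeable toy along these lines might be a single-unit vanilla RNN doing binary classification (scalar weights, no training loop shown; all names are illustrative):

```python
import math

def rnn_forward(xs, Wxh: float, Whh: float, Who: float, h0: float = 0.0) -> float:
    # Toy single-unit RNN: scalar inputs, scalar hidden state, sigmoid output.
    # The recurrence h_t = tanh(Wxh*x_t + Whh*h_{t-1}) is where the vanishing
    # gradient comes from: backprop repeatedly multiplies by Whh and tanh'.
    h = h0
    for x in xs:
        h = math.tanh(Wxh * x + Whh * h)
    # Sigmoid readout: probability of the positive class.
    return 1.0 / (1.0 + math.exp(-Who * h))
```

Generalizing the scalars to weight matrices and adding a cross-entropy training loop gets you to the real thing.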

1 Like

100% aligned with everything @curt.kennedy said. I use a ton of fine-tuned “low-quality” models in my decision pipelines (Ada, Babbage) as classifiers. They work fine in multi-class classification as well. The latency is awesome with these models, and the accuracy reaches almost 100% with enough training data. As you can run several of them in parallel, the final user does not perceive any sort of degradation in terms of latency, but you can create an arbitrarily-complex decision flow, such as:

  • Is the user’s question in-topic or not?
  • Shall I keep the same conversation context or drop it as the user is trying to explore a new topic?
  • What actions do I need to perform to answer the question? (searching the web, querying a DB, etc.)
  • What is the user’s intention with this question? Is it more creative, factual, etc? (and adjust your completion prompt accordingly).
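Fanning those decisions out in parallel is what keeps the latency invisible to the user. A sketch, assuming each classifier is just a callable (the decision names and stubs here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def run_decision_flow(question: str, classifiers: dict) -> dict:
    # classifiers maps a decision name to a callable returning a label.
    # Running them concurrently means added latency is roughly that of the
    # slowest single model, not the sum of all of them.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, question)
                   for name, fn in classifiers.items()}
        return {name: f.result() for name, f in futures.items()}

# Usage with stub classifiers standing in for fine-tuned model calls:
decisions = run_decision_flow(
    "What's the weather in Paris?",
    {
        "in_topic": lambda q: " 1",
        "intent": lambda q: " factual",
    },
)
```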

Personally, I have not found any benefit from fine-tuning OpenAI models for generative tasks.


Ok, so based on your experience, would you recommend using fine-tuning for the following use case? Generating a simple JSON config specifying infrastructure (a form of IaC) based on a user prompt.
Example prompts: Express API with Postgres, Lambda API that needs to store files, NextJS SSR web …
An example response can be a short JSON config like this:

    serviceName: 'my-service',
    resources: {
      myWebService: {
        type: 'web-service',
        properties: {
          packaging: {
            type: 'stacktape-image-buildpack',
            properties: {
              entryfilePath: 'src/index.ts'
            }
          },
          resources: {
            cpu: 0.5,
            memory: 1024
          },
          scaling: {
            minInstances: 1,
            maxInstances: 3
          },
          cors: { enabled: true }
        }
      },
      myDatabase: {
        type: 'relational-database',
        properties: {
          engine: { type: 'postgres', properties: { primaryInstance: { instanceSize: 'db.t2.micro' } } },
          credentials: { masterUserName: 'my_master', masterUserPassword: 'my_pass' }
        }
      }
    }

I just want it to learn and recognize how to create the config properly based on the prompt. I was originally trying to create one huge prompt including all the examples and resources, but it seems impossible to cover enough examples. Also, the AI sometimes brainfarts and generates something absolutely irrelevant. I was hoping fine-tuning would help. What do you guys think? @AgusPG @curt.kennedy

Personally, I have not been able to get great results with fine-tuning as regards text-generation tasks (seq-to-seq). I mainly use it for classifiers.

However, I do believe that this is the kind of example where it could work, because the output format is very well-specified and the scope is very specific. This is just a thought though: as I said, I don’t have any real experience with making one of these use cases work with more than 60-70% accuracy.

It is likely that you can teach your model to generate valid JSON configs and nothing more than that. However, just a couple of heads-ups to increase the likelihood of success:

  • You need to assume that hallucinations can still happen. You should not expect a 100% success rate, so you need a fallback strategy for those generations where your fine-tuned model is not able to produce what you are expecting. One idea is incorporating a classifier that takes the generated JSON as input and outputs whether it is a valid generation or not. This could help improve the robustness of your whole pipeline.
  • You probably want a human-in-the-loop here, even if it’s just for reviewing that the generated json config is correct. Especially if we’re talking about production environments.
  • You would probably need to go for powerful models (davinci or, at least, curie) instead of relying on the less powerful ones. You would also need a decent amount of data to make this work (at least 1000 prompt-completion pairs).
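The fallback idea in the first bullet can be sketched like this. Assumptions: `generate` stands in for the fine-tuned model call, and the structural check is a toy stand-in for validating against the real JSON schema:

```python
import json

def is_valid_config(config) -> bool:
    # Minimal structural gate; a real pipeline would validate against the
    # full config schema instead of these spot checks.
    if not isinstance(config, dict) or not isinstance(config.get("resources"), dict):
        return False
    return all("type" in res for res in config["resources"].values())

def generate_with_fallback(prompt: str, generate, max_retries: int = 2):
    # generate(prompt) -> str is the (hypothetical) fine-tuned model call.
    # Retry on invalid output, then escalate (e.g. to human review).
    for _ in range(max_retries + 1):
        raw = generate(prompt)
        try:
            config = json.loads(raw)
        except ValueError:
            continue
        if is_valid_config(config):
            return config
    return None  # signal: hand off to a human or a default template
```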
1 Like

Thank you, very good points. I will share my results once I have something.

I can make validations easily as I have a JSON config schema. When I first started with this, I was hoping I could feed the entire schema into the prompt, but it is too big.

With fine-tuning you have to provide {prompt, completion} pairs, so the schema cannot really be part of the fine-tuning process. In production I will probably simply write down the prompts that created invalid configs and create more fine-tuning examples to mitigate those errors.
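For this kind of generative fine-tune, each pair would look something like the sketch below (separator and `END` stop token are conventions from the fine-tuning docs, not requirements; the config itself is whatever conforms to your schema):

```python
import json

def make_training_pair(user_prompt: str, config: dict) -> dict:
    # Fine-tuning data is just {prompt, completion} pairs -- the schema itself
    # can't be passed in; the model only sees examples that conform to it.
    return {
        "prompt": user_prompt.strip() + "\n\n###\n\n",
        "completion": " " + json.dumps(config) + " END",
    }

pair = make_training_pair(
    "Express API with Postgres",
    {"serviceName": "my-service"},  # stand-in for a full config
)
```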


Like @AgusPG said, the fine-tune is more of a categorizer. If you have a small set of templates to choose from, then you can use a fine-tune to “pick” the one to use. But don’t expect sequence-to-sequence to perform accurately with a fine-tune.

Yep I understand what you are getting at.

I think my problem is that what I would really need is the ability to fine-tune the pre-trained models that understand instructions. With those models I can explain what is what, how certain properties are used, what the relationships are, etc… However, I am not able to create a prompt that covers everything, nor can I fine-tune these models, so I am probably stuck with the base models.

@stevenic I use curie fine-tuned models for my assistant. Works great and saves me around 1K tokens per conversation message :slight_smile:

1 Like

Great thread! I’ve learned a lot here. I do have additional questions, though, on the same topic.

Basically, my use case for a fine-tuned model is this: I work in the education field as a SWE. Every piece of educational content that we release is aligned with educational standards, whether that be the federally-backed Common Core State Standards (CCSS), individual states’ customized educational standards, or more a la carte standards from various groups/entities.

All OpenAI models so far (even GPT-4, it seems) only have knowledge of CCSS, which makes sense given the breadth of discussion on the internet of CCSS vs other educational standards sets. So, I’ve compiled a substantial data set of individual non-CCSS educational standards to fine-tune a davinci model with.

Here are my questions:

  1. Do you all have insight into the prompt/completion format for the training data?
  2. Part of my use case for this task is getting the model to correlate educational content to standards–that is, having it parse content and make decisions on what educational standards the content aligns to. GPT-3 and 4 currently do this for CCSS standards, and actually do it very well. But I’d like it to do that for the other educational standards sets I’m going to feed it. Will fine-tuning a model help achieve this? If so, is there specific prompt/completion formatting I have to use?

Thank you!


If you are trying to categorize, use a fine-tune. If you are trying to add new knowledge for the AI to draw from and answer from, use embeddings.

Having said all this, it looks like you are going to fine-tune a model. The docs are HERE!

Thank you for your response! If you don’t mind, would you explain what you mean by “categorizing” in this context?

I’ve been able to fine-tune a model, though the results are spotty. It seems that is not the route you would have gone, however; adding new knowledge is what I want, though I’ve seen a lot of people attempt it with fine-tuning.

Thanks again in advance!

Pattern recognition (Classification or categorizing) → Fine-Tuning
Knowledge → Embeddings
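The "Knowledge → Embeddings" route boils down to embedding your documents once, then retrieving the nearest ones to each query by cosine similarity. A minimal sketch, assuming the vectors were produced ahead of time by an embeddings model such as text-embedding-ada-002:

```python
import math

def cosine(a, b) -> float:
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k: int = 3):
    # doc_vecs: list of (doc_id, vector) pairs, embedded in advance.
    # Returns the ids of the k documents closest to the query; these get
    # stuffed into the completion prompt as context.
    scored = sorted(doc_vecs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

At real scale you'd use a vector index instead of sorting everything, but the retrieval logic is the same.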

Here’s an example of using Fine-Tuning for classification:

Knowledge with Fine-Tuning (take note of the PSA at the beginning)