Fine-tuned model not providing meaningful results

We have been using model="gpt-3.5-turbo" to evaluate how well it does on a medical sub-speciality in-service exam. Our first test with this untuned model is what we call our "baseline" result. On a 300-question exam it achieved 57.9% correct responses (random guessing should yield about 25%, as each question has 4 possible answers). We then tried to fine-tune the model with a different (earlier-year) exam, also with 300 questions, including its answers and rationale. Our fine-tuning completed and we have a new model ID, but this model does not generate any sensible response. What might we be doing wrong?

Q/A with model=gpt-3.5-turbo

(venv) ragnar> python3 chatgptapitest.py 
============= prompt ===================

Select which answer A, B, C, or D is the correct answer in the 
prompt below 
that is delimited by three tick marks:
'''
In an analysis to compare the time to event outcomes for two different groups, based on the 
survival plot below what is the assumption violated if a Cox regression was used to analyze these 
data? 

A. Proportional hazards 

B. Data are normally distributed 

C. Noninformative censoring 

D. Survival times are independent between subjects
'''

============= response ===================
A. Proportional hazards

A is indeed the correct answer to this question. The text we evaluate comes from:

response.choices[0].message["content"]
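
For context, the call is essentially the following (a simplified sketch of chatgptapitest.py; we use the openai Python library, and details such as temperature are illustrative):

import openai

# Simplified sketch: ask gpt-3.5-turbo to pick A/B/C/D for one exam question.
question_block = (
    "In an analysis to compare the time to event outcomes for two different groups, "
    "based on the survival plot below what is the assumption violated if a Cox "
    "regression was used to analyze these data?\n\n"
    "A. Proportional hazards\n"
    "B. Data are normally distributed\n"
    "C. Noninformative censoring\n"
    "D. Survival times are independent between subjects"
)

prompt = (
    "Select which answer A, B, C, or D is the correct answer in the prompt below "
    "that is delimited by three tick marks:\n"
    "'''\n" + question_block + "\n'''"
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep scoring runs deterministic
)

print(response.choices[0].message["content"])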

We next tried this with our fine-tuned model:

model="curie:ft-bayta-systems:radonc-gpt-2023-06-25-00-26-23"

Below we show the entire response we get. Before we fine-tune our model more extensively, we need to understand why the fine-tuned model is failing to give a correct answer. We didn't expect our small amount of training to derail the base model so much:

(venv) ragnar> python3 chatgptapitest_fine_tuned.py 
============= prompt ===================

Select which answer A, B, C, or D is the correct answer in the 
prompt below 
that is delimited by three tick marks:
'''
In an analysis to compare the time to event outcomes for two different groups, based on the 
survival plot below what is the assumption violated if a Cox regression was used to analyze these 
data? 

A. Proportional hazards 

B. Data are normally distributed 

C. Noninformative censoring 

D. Survival times are independent between subjects
'''

============= response ===================
{
  "id": "cmpl-7XYLIAQgTp24jVzJy8rE0NzieDiwE",
  "object": "text_completion",
  "created": 1688231192,
  "model": "curie:ft-bayta-systems:radonc-gpt-2023-06-25-00-26-23",
  "choices": [
    {
      "text": "The null hypothesis of the \ncovariate regression is not rejected for",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 124,
    "completion_tokens": 16,
    "total_tokens": 140
  }
}

The response does not even make a selection between A/B/C/D!

Any hints/comments welcome.

This seems to be the result of some very common misunderstandings.

First, fine-tuning doesn’t typically perform well at imparting new knowledge to the model—it’s more useful for teaching it the form of its responses, whether that be to adopt a certain personality or format its responses in a very particular manner.

If you need to ensure the model has access to highly specific information which may not be in its training data, you could look at using an embedding model and a vector database.
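
As a minimal sketch of that approach (using text-embedding-ada-002 and a plain in-memory cosine-similarity search in place of a real vector database; the snippet texts here are invented):

import numpy as np
import openai

def embed(texts):
    # Returns one embedding vector per input string.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

# A toy "knowledge base" of reference snippets (illustrative content only).
snippets = [
    "The proportional hazards assumption requires that hazard ratios are constant over time.",
    "Noninformative censoring means the reason for censoring is unrelated to the outcome.",
]
snippet_vectors = embed(snippets)

def retrieve(question, k=2):
    # Cosine similarity between the question and each stored snippet.
    q = embed([question])[0]
    sims = snippet_vectors @ q / (
        np.linalg.norm(snippet_vectors, axis=1) * np.linalg.norm(q)
    )
    return [snippets[i] for i in np.argsort(sims)[::-1][:k]]

A real setup would embed your reference material (textbooks, rationales from past exams, and so on) and store the vectors in a vector database, but the retrieval flow is the same: embed the question, find the nearest snippets, and add them to the prompt.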

Second, you should never train on your testing data. If you want it to specifically answer multiple choice questions and provide the reasoning behind the answer, that is absolutely something you can fine-tune on, but you should train it on questions you’re not going to be testing it on.

These questions do not even need to be domain specific—they can come from any field.

If you want it to perform well on a multiple-choice exam such as this, I would fine-tune your model to perform a chain-of-thought process. In essence, you would provide the question and set of answers as the prompt; then, for each response, you would have a written-out thought process.

Maybe the first step in every response is to rule out one or two answers which would be absurd and explain why they are ruled out. Then for each of the remaining answers describe the strengths and weaknesses of that answer. Next, identify the strongest answer and explain why it is the best. Finally, conclude with the letter of the chosen answer.

In this way you are fine-tuning the model on how to think about and take a multiple-choice exam.
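
Concretely, a single training record in the legacy prompt/completion JSONL format might look something like this (a sketch only: the separator, stop token, and wording are illustrative conventions, not a prescription):

import json

# One illustrative record in the prompt/completion JSONL format used by the
# 2023-era fine-tuning endpoint for completion models such as curie.
record = {
    "prompt": (
        "Question: <exam question text>\n"
        "A. <option A>\n"
        "B. <option B>\n"
        "C. <option C>\n"
        "D. <option D>"
        "\n\n###\n\n"  # fixed separator so the model knows the prompt has ended
    ),
    "completion": (
        " Options B and C can be ruled out because <reason>. "
        "Option D is plausible but <weakness>, while option A directly matches <reasoning>. "
        "The strongest answer is therefore A.\n"
        "Final answer: A END"  # fixed stop sequence marking the end of the completion
    ),
}

with open("train.jsonl", "a") as fh:
    fh.write(json.dumps(record) + "\n")

At inference time you would then send the same prompt format (including the separator) and pass the stop sequence to the completion call so the model stops after the final answer.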

You can also combine the fine-tuning and embedding methods.

Basically, the point of using embeddings is to be able to find snippets of text that are relevant to the text in your prompt and add them into the context window for the model.

For instance, using embeddings you might be able to pull in definitions for all the key terms used in the question and answers so the model has the best possible information in context before it starts its chain-of-thought.
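
Assembled together, the prompt construction can be as simple as the following sketch (assuming a retrieve() helper like the one above returns the relevant definitions):

def build_prompt(question_block, definitions):
    # definitions: snippets returned by an embedding search, e.g. the
    # retrieve() helper sketched earlier; question_block: question plus options.
    background = "\n".join(f"- {d}" for d in definitions)
    return (
        "Relevant background:\n" + background + "\n\n"
        "Question and options:\n" + question_block + "\n\n"
        "Work through the options step by step, rule out the weakest ones, "
        "and finish with the letter of the best answer."
    )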

Regardless, you need to fine-tune it on examples of exactly how you want it to go about generating its response and you need way more than 300 of them—on the order of 1,000–5,000.

One nice thing is, since it isn’t critical that the Q/A pairs be in your specific domain, and you’re really just interested in the form of the response, once you get a few ideal exemplars you can use those as few-shot learning examples for, say, GPT-4 to generate synthetic data based on other multiple-choice questions to quickly grow your training set.
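
As a rough sketch of that last step (the system prompt and model name here are my own assumptions):

import openai

# Hand-written exemplars pairing a question with an ideal chain-of-thought answer.
exemplars = [
    {"role": "user", "content": "<multiple-choice question 1>"},
    {"role": "assistant", "content": "<ideal step-by-step answer 1, ending in a letter>"},
    {"role": "user", "content": "<multiple-choice question 2>"},
    {"role": "assistant", "content": "<ideal step-by-step answer 2, ending in a letter>"},
]

def synthesize_answer(new_question):
    # Few-shot prompt GPT-4 with the exemplars so it answers a new question
    # in the same chain-of-thought style; the output becomes a training completion.
    messages = (
        [{"role": "system",
          "content": "Answer multiple-choice questions step by step in the same "
                     "style as the examples, ending with the chosen letter."}]
        + exemplars
        + [{"role": "user", "content": new_question}]
    )
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return resp.choices[0].message["content"]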


Thank you very much for your detailed response. A few comments:

  1. We are not using the Q/A pairs to train the model that we intend to test its performance with. We plan to train it with Q/A pairs from several past years, with their answers and rationale as I mentioned, and then test it on a 300-question exam that will be held back during training.

  2. The base model, "gpt-3.5-turbo", had no problem understanding our instruction that we simply want it to make a choice between A/B/C/D, which we tally and then use to compute the score (roughly as in the sketch below). The model always answered as instructed, though naturally, lacking prior training in this medical sub-speciality, it got "just" 58% correct (which is actually very good performance as a baseline). Also, if not told to answer only with a selection of A/B/C/D, it would offer the answer along with its own brief rationale. In our testing so far we have not dug into its rationale, as an expert in the field would, to understand what type of training would correct its thought process. However, we were hoping that the set of about 1,500 questions/answers/rationales we have could impart some added ability.
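
For what it's worth, the tallying step is nothing more than the following sketch (the letter-extraction rule and answer key format are our own conventions, not anything from the API):

import re

def extract_choice(response_text):
    # Take the first standalone A/B/C/D letter in the model's reply.
    match = re.search(r"\b([ABCD])\b", response_text)
    return match.group(1) if match else None

def score(responses, answer_key):
    # responses: list of model reply strings; answer_key: list of correct letters.
    correct = sum(
        1 for reply, key in zip(responses, answer_key)
        if extract_choice(reply) == key
    )
    return 100.0 * correct / len(answer_key)

# Example: score(["A. Proportional hazards"], ["A"]) -> 100.0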

We will think about what it would take to train it in the manner you describe.

Finally, we would love to run our test with GPT-4 as well, as a baseline for that model, to see how it performs compared to GPT-3.5 (and to something else like PaLM 2 or Med-PaLM 2), but we have not yet obtained access to GPT-4. (PaLM 2 scored 52% on the same test; we have not yet tested Med-PaLM 2.)

As to point (1), my apologies, upon re-reading your original topic I see I had misread what you wrote and misunderstood.

As for point (2), I suspect you simply need a lot more training examples.