We have been using model=“gpt-3.5-turbo” to evaluate how well it does on a medical sub-speciality in service exam. Our first test with this untuned model is what we call “baseline” results. On a 300 question exam we achieved 57.9% correct respones (random choices should yield about 25% result as each question has 4 possible answers). We then tried to fine-tune the model w/ a different (earlier year) exam also with 300 questions and its answers and Rationale. Our fine tuning completed and we have enw model id, but this model does not generate any sensible response. What might we be doing wrong?
Q/A with model=gpt-3.5-turbo
(venv) ragnar> python3 chatgptapitest.py
============= prompt ===================
Select which answer A, B, C, or D is the correct answer in the
prompt below
that is delimited by three tick marks:
'''
In an analysis to compare the time to event outcomes for two different groups, based on the
survival plot below what is the assumption violated if a Cox regression was used to analyze these
data?
A. Proportional hazards
B. Data are normally distributed
C. Noninformative censoring
D. Survival times are independent between subjects
'''
============= response ===================
A. Proportional hazards
A. is the correct response for this question. The response we seek comes out as:
response.choices[0].message[“content”]
We next tried this with our fine_tuned model
model=“curie:ft-bayta-systems:radonc-gpt-2023-06-25-00-26-23”
Below we show the entire repsonse we get. Before we fine-tune our model extensively we need to understand why our fine_tuned model is failing to give correct answer. We didn’t expect our small amount of training would derail the base model so much:
(venv) ragnar> python3 chatgptapitest_fine_tuned.py
============= prompt ===================
Select which answer A, B, C, or D is the correct answer in the
prompt below
that is delimited by three tick marks:
'''
In an analysis to compare the time to event outcomes for two different groups, based on the
survival plot below what is the assumption violated if a Cox regression was used to analyze these
data?
A. Proportional hazards
B. Data are normally distributed
C. Noninformative censoring
D. Survival times are independent between subjects
'''
============= response ===================
{
"id": "cmpl-7XYLIAQgTp24jVzJy8rE0NzieDiwE",
"object": "text_completion",
"created": 1688231192,
"model": "curie:ft-bayta-systems:radonc-gpt-2023-06-25-00-26-23",
"choices": [
{
"text": "The null hypothesis of the \ncovariate regression is not rejected for",
"index": 0,
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 124,
"completion_tokens": 16,
"total_tokens": 140
}
}
The response does not even make a selection between A/B/C/D !!
Any hints/comments welcome.