Gap between fine-tuning result and inference

I fine-tuned the “davinci” model.
According to the training results file, almost all (prompt, completion) pairs in my training data are matched well by the fine-tuned model.
I mean, every prompt in my dataset results in training_token_accuracy=1.0 and training_sequence_accuracy=1.0.
But when I actually sent these prompts to the fine-tuned model, the results were awful.
res = openai.Completion.create(
    prompt="one of dataset prompt",
    # remaining arguments not shown
)
Why? I changed max_tokens several times, but I still don’t understand the relationship between max_tokens and the inference result.
How can I resolve this issue?

Hi Carollhwrd,

Can you give some example prompts and replies and how they differ from your expectation?
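
One way to collect such examples is sketched below. It assumes the training file is JSONL with "prompt" and "completion" keys. The names load_pairs, collect_mismatches, and generate are hypothetical helpers for illustration, not part of any library; generate stands in for whatever function wraps your call to the fine-tuned model and returns the completion text.

```python
import json
from typing import Callable, Iterable, List, Tuple


def load_pairs(jsonl_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse records of the form {"prompt": ..., "completion": ...}."""
    pairs = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((record["prompt"], record["completion"]))
    return pairs


def collect_mismatches(
    pairs: List[Tuple[str, str]],
    generate: Callable[[str], str],  # hypothetical: wraps your model call
) -> List[Tuple[str, str, str]]:
    """Return (prompt, expected, actual) for every pair whose output differs.

    Leading and trailing whitespace is ignored in the comparison.
    """
    mismatches = []
    for prompt, expected in pairs:
        actual = generate(prompt)
        if actual.strip() != expected.strip():
            mismatches.append((prompt, expected, actual))
    return mismatches
```

Passing the open training file to load_pairs and your own model-calling function as generate gives a list of (prompt, expected, actual) triples. Posting a few of those here would show exactly how the outputs differ from your expectation.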

