Gap between fine-tuning result and inference

I fine-tuned the “davinci” model.
The training results file shows that almost every (prompt, completion) pair in my training data is matched well by the fine-tuned model.
I mean that every prompt in my dataset gets training_token_accuracy=1.0 and training_sequence_accuracy=1.0.
But when I actually tried these prompts against the fine-tuned model, the results were awful.
res = openai.Completion.create(
prompt="one of dataset prompt",
Why? I have changed max_tokens several times.
I still don’t understand the relationship between max_tokens and the inference result.
How can I resolve this issue?

Hi Carollhwrd,

Can you give some example prompts and replies, and explain how they differ from your expectations?

Also, you can wrap your code and data in triple backticks so that it renders as code, like this:
```this is code```
which makes it more readable. Or you can highlight the code section and use the </> button in the text-box controls to do the same thing.
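In the meantime, one common cause of a gap like this is that the inference prompt does not end with the same separator used in the training data, and the request has no stop sequence — max_tokens only caps how many tokens the completion may generate, it doesn't tell the model where to stop. Here is a minimal sketch of how I'd build the request; the separator, stop sequence, and model name are placeholders, so substitute whatever your training data actually used:

```python
# Placeholders (assumptions) — use the exact values from your training data.
SEPARATOR = "\n\n###\n\n"  # suffix every training prompt ended with
STOP = "\n"                # sequence every training completion ended with

def build_request(prompt_text, model_name, max_tokens=64):
    """Build kwargs for openai.Completion.create.

    The inference prompt must end with the same separator used in training,
    and `stop` tells the model where to cut the completion off;
    `max_tokens` is only an upper bound on completion length.
    """
    return {
        "model": model_name,
        "prompt": prompt_text + SEPARATOR,
        "max_tokens": max_tokens,
        "stop": STOP,
        "temperature": 0,  # deterministic output for checking training pairs
    }

req = build_request("one of dataset prompt", "<your-fine-tuned-model>")
# res = openai.Completion.create(**req)  # needs openai<1.0 and an API key
```

If your results improve with the separator and stop sequence in place but are still off, posting the actual prompt/completion pairs here will help narrow it down.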