When should I stop training of my fine-tuned model?

Hello, I’m fine-tuning a gpt-3.5 model to format descriptional text into an objective scoring system.
My current work flow is:

  1. Using the fine-tuned model to process 50 individual text.
  2. Mannually correcting and label those results, using them to further train the previous model (40 for training and 10 for validation. All prompts the previous model labeled wrongly were asigned to training group. The testing prompts from the previous model’s validation file were also added to the training file of this model, making 50 training and 10 validation prompts total).
  3. Go back to 1 and do it all over again.

It has served me well, with significantly improved accuracy and consistency after every round. However, some errors seems to persist every now and then. Eg. Some part of the text may contain sentences like ‘1/51 abnormalty points was found’ or ‘1/16 abnormalty points in part A and 0/6 abnormalty points in part B was found’. I want the model to sum up total positive abnormalty points number and classify them (0: 0, 1-2: 1, 3-7: 2, 8-15: 3, >15: 4, etc.) In most cases, the model would give a correct answer. However, sometimes it wrongfully classifies a grade lesser or more than expected, even when structure of the given text is not very different from another one labeled correctly.
Repeated training does help to reduce the incidence of such mistake, but doesn’t seem to prevent them from happening, even after 5 rounds of training. (The example provided seems to bother the model the most, maybe because other factors of the scoring system are mainly binary choices or copy-and-paste question.) At this point, I wonder if it will be of much improvement to further train my model, especially with the latest result of my training:

step train loss train accuracy valid loss valid mean token accuracy
1 0.02551 0.98361 0 0.91379
141 0 1 0 0.91379

I am no expert in math or computer science, but it seems the valid mean token accuracy didn’t improve after the training. Should I stop training the model? If so, is there anything I can do to further reduce such mistakes? If not, what’s the signal of a model being fully trained? Is there some way the api can return its confidence about the response? So I can mannually check the suspecious ones?
Thank you all for your generous help and suggestions!

From what I’ve learnt in the forum, the less data provided for training, the less performance the fine tuned data will acquire. So maybe if I combined all datas (200 of them) and train a fresh model instead of training with 50 data for each round for 4 times, I may have a better performed model?