Fine-tuned model autocompletes a label that is not in the training list - how to stop this?


I have created several fine-tuned models (ada, curie) and run the test data on them. The classification is a string. In the results I receive options that are not in my training set. The output contains the correct label but expands it (even when temperature = 0). For example, the category in the training set is “Wages”, and the fine-tuned model returns “ Wages & Overtime Omnipresent Group Limited → Wages & Overtime”. How can I make the model return only values from the closed list in the training set?

More than temperature is involved here: other settings, extensive use of the System role for context maintenance, prompt structure, etc. If you can, please provide more details, such as a prompt or code and a small sample of the data.

If it is sensitive data that should not be made publicly available, please send it to me in a private message. Let’s see if I can be of any help.

My data is very simple. I have a company code and an item, and I want to predict the category. I just don’t understand how the model predicts a category that is not in the training data (I expect it to behave like any other classifier). A few rows, for example:
{"prompt":"12222 retainer for the period 03/01/2023 - 03/31/2023: monthly branding/core retainer\n\n###\n\n","completion":" TAX"}
{"prompt":"12333 baggage al pendant reg hours\n\n###\n\n","completion":" OFFICE"}
{"prompt":"12345 workspace incremental fee: 28,573 pages\n\n###\n\n","completion":" FEE"}

When I try to predict new lines, I get, for example, “fees and services”.
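One way to enforce a closed list regardless of what the model generates is to check the raw completion against the known labels on the client side. The snippet below is only a sketch under assumptions: `match_label` is a hypothetical helper (not part of any OpenAI library), and the matching rule (case-insensitive prefix that must end on a word boundary) is one possible choice. It accepts an expanded output that starts with a known label and rejects anything else, so out-of-list predictions can be flagged instead of silently used.

```python
from typing import Iterable, Optional


def match_label(raw: str, labels: Iterable[str]) -> Optional[str]:
    """Return the known label the raw completion starts with, or None.

    Hypothetical helper: the comparison is case-insensitive and the label
    must be followed by the end of the text or a non-alphanumeric character.
    """
    text = raw.strip().casefold()
    # Try longer labels first so a longer label wins over its own prefix.
    for label in sorted(labels, key=len, reverse=True):
        key = label.casefold()
        if text.startswith(key):
            rest = text[len(key):]
            if not rest or not rest[0].isalnum():
                return label
    return None


# The expanded output from the question still maps back to "Wages".
print(match_label(" Wages & Overtime Omnipresent Group Limited", ["Wages", "TAX"]))  # Wages
# "fees and services" is not in the closed list, so it is rejected.
print(match_label("fees and services", ["TAX", "OFFICE", "FEE"]))  # None
```

A rejected prediction can then be logged or routed to manual review; this does not fix the model itself, it only keeps unknown categories out of downstream data.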

This is the way I would do it:
Mind punctuation and delimiters - models love them:

  • “:” - separating label from data contents;
  • “,” - separating labels;
  • “;” - separating data records.

Since you gave me only the data, I don’t know of any previous instructions, explanatory prompts, or labeling headers for the model, so I had to add the labels myself.
I am not using the completion for training; I inserted the category contents into the prompts so the model can understand them as a “structured database”.

Note: The lines have been broken to make them more readable in the code snippet box; the line breaks are not meant to be inserted in the actual data.

# Instructions section
{“prompt”:“This training dataset contains code, description,
{“prompt”:“Please consider the listed data below for your responses
{“prompt”:“Do NOT add or remove any code, description, or category
without expressed consent in User prompt."}
# Data section
{“prompt”:“code: 12222,
description: retainer for the period 03/01/2023 - 03/31/2023: monthly
branding/core retainer\n\n###\n\n”,
category: TAX;"}

{“prompt”:“code: 12333,
description: baggage al pendant reg hours\n\n###\n\n”,
category: OFFICE;"}

{“prompt”:“code: 12345,
description: workspace incremental fee: 28,573 pages\n\n###\n\n”,
category: FEE;"}
category: XYZ".} # period "." at the end of the last record
# - it is advised

There are more details, such as using the System role strategically to add precise instructions for the model to follow during training, as a form of context maintenance.

Structured text as a dataset is also helpful to the model. By the way, for a large training or operational dataset, please consider uploading a separate dataset file to the cloud storage of your choice. Please check this thread about it:
Seeking Advice on Handling Large Vehicle Database for AI Chatbot Application

Try this way, and please let me know the results.

I might be missing something, but when you create the JSONL file for the fine-tuned model and then use it to create your model, your structure should be “prompt” and “completion”. This is mandatory, no?

I don’t know about JSONL file syntax, but prompt and completion refer to the OpenAI API and are not a JSON thing. In this case, the prompt is mandatory; assistant/completion and system are optional for the API. But if you saw something about JSONL requirements, please advise. Most of the time I build prompt structures based on the playground interface or Python code, and they work.

Even if the completion is required in JSONL, I advise putting the category in the prompt as part of the data and using other content for the completion.

Models understand free-format text for datasets, for example, law texts. I would be surprised if JSONL required something different.

I just want to update that training the model with ‘\n\n’ at the end of my completion text solved my problem, after I had tried a lot of solutions I read about online.
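For anyone who wants to reproduce this, here is a rough sketch of how such a training file could be generated. It assumes the prompt/completion layout and the "\n\n###\n\n" separator from the sample rows earlier in the thread; the names `SEPARATOR`, `STOP`, `to_record`, and `to_jsonl` are made up for illustration.

```python
import json

SEPARATOR = "\n\n###\n\n"  # prompt ending used in the sample rows
STOP = "\n\n"              # completion suffix reported to help in this thread


def to_record(text: str, label: str) -> dict:
    """Build one training example: the prompt ends with the separator,
    the completion starts with a space and ends with the stop suffix."""
    return {"prompt": text + SEPARATOR, "completion": " " + label + STOP}


def to_jsonl(rows) -> str:
    """Serialize (text, label) pairs as JSON Lines, one object per line."""
    return "\n".join(json.dumps(to_record(text, label)) for text, label in rows)


rows = [
    ("12333 baggage al pendant reg hours", "OFFICE"),
    ("12345 workspace incremental fee: 28,573 pages", "FEE"),
]
print(to_jsonl(rows))
```

Using `json.dumps` escapes the newlines and quotes correctly, which avoids the curly-quote and line-break problems that come from writing the file by hand.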


Hi @anat.argaman,

I had the same problem as you, but I tried the way you mentioned above (putting \n\n at the end of the completion) and the end result was still problematic.

My completion: ' coffee\n\n\n\n+ co', but my label should be coffee. Is there anything else that needs to be adjusted?

My solution was ' coffee\n\n'. This worked for me.

Thank you for your advice. I tried your method but it still didn’t work.
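If the model keeps generating after the label, as in the ' coffee\n\n\n\n+ co' output quoted above, one option is to cut the text at the stop sequence used in training. The Completions endpoint accepts a `stop` parameter for this, and a small `max_tokens` value also limits run-on text. The sketch below does the same cut on the client side; `trim_at_stop` is a hypothetical helper, and it assumes the training completions ended with "\n\n".

```python
def trim_at_stop(raw: str, stop: str = "\n\n") -> str:
    """Keep only the text before the first stop sequence, without
    surrounding whitespace. Assumes `stop` matches the training suffix."""
    return raw.split(stop, 1)[0].strip()


print(trim_at_stop(" coffee\n\n\n\n+ co"))  # coffee
```

If the trimmed text is still not one of the training labels, the stop sequence is not the issue and the training data or prompt format is worth rechecking.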