That is the way I would do this:
Mind punctuation and delimiters - models love them:
- “:” - separating label from data contents;
- “,” - separating labels;
- “;” - separating data records.
Since you gave me the data only, I don’t know any previous instructions, explanatory prompts, and labeling headers for the model - I had to add the labels by myself.
I am not using completion
for training, I inserted the category contents into the prompts so the model can understand as a “structured database”.
Note: The lines had been broken to make it more readable in the code snippet box, the linebreaks are not intended to insert in the code.
# Instructions section
{“prompt”:“This training dataset contains code, description,
category."}
{“prompt”:“Please consider the listed data below for your responses
accordingly."}
{“prompt”:“Do NOT add or remove any code, description, or category
without expressed consent in User prompt."}
...
# Data section
{“prompt”:“code: 12222,
description: retainer for the period 03/01/2023 - 03/31/2023: monthly
branding/core retainer\n\n###\n\n”,
category: TAX;"}
{“prompt”:“code: 12333,
description: baggage al pendant reg hours\n\n###\n\n”,
category: OFFICE;"}
{“prompt”:“code: 12345,
workspace incremental fee: 28,573 pages\n\n###\n\n”,
category: FEE";}
...
category: XYZ".} # period "." at the end of the last record
# - it is advised
...
There are more details such as using the System
role strategically in order to add precise instructions for the model to follow during the training as a context-maintenance.
And a structured text as a dataset is also helpful to the model. By the way, please consider a separate dataset file uploaded to the cloud storage of your choice in the case of a large training or operational dataset. Please check this thread about it:
Seeking Advice on Handling Large Vehicle Database for AI Chatbot Application
Try this way, and please let me know the results.