I have about 50k labeled entries I want to use to fine-tune GPT for a classification task. The issue is that the entries contain only metric names that need extra context for GPT to understand the problem:
…
PCOUNTM = 0.34
PCOUNTHUM = 0.24
PSURFCOUNT = 1.2
These are domain-specific metrics. For instance, PCOUNTM is pollen count, PCOUNTHUM is the humidity-adjusted count, etc.
There are 100 or so highly domain-specific metrics that need explanations and units. I feel that would improve GPT's ability to classify the entries, but I don't know how to supply the explanations as context for the fine-tuning. Prepending the explanations to each row of my data before fine-tuning would be quite expensive, and I think it would exceed the token limit as well.
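For concreteness, here is a sketch of what one training row would look like with the explanations prepended, using the chat-format fine-tuning JSONL layout; the glossary text and the "healthy" label are invented for illustration:

```python
import json

# Hypothetical glossary snippet; the real one would cover ~100 metrics.
GLOSSARY = (
    "PCOUNTM: pollen count (grains/m^3). "
    "PCOUNTHUM: humidity-adjusted pollen count. "
    "PSURFCOUNT: surface particle count."
)

entry = "PCOUNTM = 0.34\nPCOUNTHUM = 0.24\nPSURFCOUNT = 1.2"

# One line of a chat-format fine-tuning JSONL file. The glossary is
# repeated in the system message of every example -- which is exactly
# the token cost being worried about here.
row = {
    "messages": [
        {"role": "system",
         "content": f"Classify the entry. Metric definitions: {GLOSSARY}"},
        {"role": "user", "content": entry},
        {"role": "assistant", "content": "healthy"},  # the label
    ]
}
print(json.dumps(row))
```

With ~100 metric definitions, that system message gets repeated 50k times, which is where the cost and context-length concerns come from.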
And "ChatGPT" is a chatbot on a website, not a fine-tunable base model.
If you have 50 “healthy” samples and 50 “dangerous” samples, could the language model possibly understand why they are ranked that way? Probably not. You get a random word maker, a dice roller.
Sounds like you could probably just do the math yourself and use the values with a vector database, if they are all in the same format:
1. Embed each entry as a vector with one dimension per metric: `dimension = value / standard_deviation × importance`
2. Classify 100 clear examples yourself into documents that also state the classification.
3. Retrieve the top-5 matches for your unknown entry from the vector database and extract their classification.
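The steps above can be sketched with a tiny in-memory "vector database" (a NumPy array searched by distance). The metric names match the thread, but the importance weights, reference values, and labels are all invented for illustration:

```python
import numpy as np
from collections import Counter

IMPORTANCE = np.array([2.5, 1.0, 1.0])  # e.g. PCOUNTM weighted highest

# Hand-classified reference entries: raw (PCOUNTM, PCOUNTHUM, PSURFCOUNT)
labeled_values = np.array([
    [0.34, 0.24, 1.2],
    [0.30, 0.20, 1.1],
    [0.40, 0.30, 1.3],
    [0.90, 0.80, 3.1],
    [0.95, 0.85, 3.0],
    [1.00, 0.90, 3.3],
])
labels = ["healthy", "healthy", "healthy",
          "dangerous", "dangerous", "dangerous"]

std = labeled_values.std(axis=0)

def to_vector(raw):
    # dimension = value / standard_deviation * importance (as in the post)
    return raw / std * IMPORTANCE

index = to_vector(labeled_values)  # the "vector database"

def classify(raw, k=5):
    q = to_vector(np.asarray(raw, dtype=float))
    dists = np.linalg.norm(index - q, axis=1)  # distance to every reference
    top = np.argsort(dists)[:k]                # indices of the top-k matches
    # majority vote over the retrieved classifications
    return Counter(labels[i] for i in top).most_common(1)[0][0]

print(classify([0.36, 0.25, 1.2], k=3))  # → healthy
```

A real deployment would swap the NumPy search for an actual vector store and use the 100 hand-classified documents as the index, but the normalize-weight-retrieve-vote logic is the same.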
Understood. I was hoping an LLM could figure out some connections between the values and classify them. It is a complex system, and if I had 'importance' values the task would be trivial. Our first solution was training a neural network, which worked to some degree. From what you are saying, we should probably stick with our initial model, as the OpenAI models are not a good match for this task, at least the LLM ones I am aware of.
I just added an importance scalar for the case where you know a measurement matters more to the classification, like if you think "2.5 particulate count" is more important to classifying health than "pollen".