I’ve been working on a certain task for over a year now. In the initial stages, before I decided to start finetuning, I was prompt engineering and trying to get consistent results with few-shot prompting.
However, when I decided to start finetuning, I shortened the prompt some and got rid of the few-shot examples.
Is it possible that adding back the few-shot examples during training – or even just one of them, a one-shot prompt – would improve the performance of my model? My intuition is that it might help, since having at least one example might help it figure out the format/definitions a little better in the early stages of training, allowing it to learn the more difficult aspects of the task a little faster. But I haven’t really seen anyone do this.
If I were to add an example in training, would I include it in the system prompt, or as an extra pair of user/assistant messages in each training example? I’m not sure exactly how that would work.
So, what exactly are you attempting to fine-tune for?
I mean, examples like that should make up the majority of the training data. I’m not an expert on fine-tuning, but if I’m remembering everything correctly, it should be something like 80% correct examples of what you’re trying to replicate in the model, and 20% incorrect. Although as I’m typing this, that may be for training rather than fine-tuning.
Fine-tuning works really well for getting specific formatting or following specific guidelines for whatever use case you have.
If you’re prompt engineering and few-shotting to get it to spit out the desired format, and you’d like to get that result from a single plain prompt, iirc the fine-tuning data would show just the question and then the intended result, without including the few-shot examples themselves. You’re basically aggregating a bunch of examples that say “answer me like this.” The fine-tuned model should then be able to respond in the format you’ve previously been few-shotting your way to.
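To make that concrete, a single line of fine-tuning data would look something like the sketch below. This assumes the OpenAI-style chat fine-tuning JSONL format; the system prompt, input text, and labels are just made-up placeholders, not your actual data:

```python
import json

# One training example: just the system prompt, the raw input text,
# and the target answer -- no few-shot demonstrations included.
example = {
    "messages": [
        # Placeholder instructions
        {"role": "system", "content": "Categorize the emotive phrases in the text and return JSON."},
        # Placeholder input text
        {"role": "user", "content": "I can't believe they cancelled the show again."},
        # Placeholder target output the model should learn to produce
        {"role": "assistant", "content": json.dumps(
            {"phrases": [{"text": "can't believe", "category": "frustration"}]}
        )},
    ]
}

# Fine-tuning data is a JSONL file: one example like this per line.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```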
I feel like maybe you’re not quite understanding the specific question I’m asking. The task I’m working on is a relatively difficult emotive categorization task in text, with categorizations and alternatives for specific phrases returned in JSON format. The system prompt as it currently stands basically instructs the model what to do and how to format the JSON, but doesn’t give an example of input text and the corresponding JSON to return.
I understand that I could finetune a model with data from a few-shot prompt so I don’t have to use the few-shot prompt anymore. But the reason I’m finetuning isn’t for token savings, but to actually improve the performance of the model on the task I want it to do. Few-shot prompts did not work well enough to perform this task at a level I desire.
So I was wondering if including a one-shot example in the prompt during training and application would potentially improve the model’s performance, and, if so, how it would be done. I feel like including an extra “user” and “assistant” message in every line of my training data might cause some weird effects, but it might just be a perfectly normal thing people do that could improve results.
Just thought I’d update here for anyone interested. I did add a one-shot example to each of my training sets, using the weight=0 parameter on the example’s assistant message (I believe that’s the parameter’s main purpose).
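Concretely, each line of the training data looked roughly like the sketch below (the content is placeholder, but the structure is what I mean; this assumes the OpenAI chat fine-tuning format, where, as I understand it, weight=0 on an assistant message excludes that message from the loss):

```python
import json

# A training example with a one-shot demonstration prepended as an extra
# user/assistant pair. The demonstration's assistant message has weight=0,
# so only the final assistant message should be trained on.
example = {
    "messages": [
        # Placeholder instructions
        {"role": "system", "content": "Categorize the emotive phrases in the text and return JSON."},
        # --- one-shot demonstration (weight=0, not trained on) ---
        {"role": "user", "content": "What a fantastic surprise!"},
        {"role": "assistant", "weight": 0, "content": json.dumps(
            {"phrases": [{"text": "fantastic surprise", "category": "joy"}]}
        )},
        # --- actual training pair ---
        {"role": "user", "content": "I can't believe they cancelled the show again."},
        {"role": "assistant", "weight": 1, "content": json.dumps(
            {"phrases": [{"text": "can't believe", "category": "frustration"}]}
        )},
    ]
}

with open("train_oneshot.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```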
The result (when using the same prompt and one-shot example) was noticeably worse than when the one-shot example was left out of training. Even though the weight parameter should have stopped training on that one-shot example, it still appears that the model somehow trained on and overfitted to the one-shot example in every run.
I could try again by moving the example to the system prompt. It could just be something weird with having multiple user/assistant messages in training. If I try that I will update here again.
I think the problem here is that you’re expecting model fine-tuning to noticeably increase intelligence, when it’s meant to enhance formatting and behavior.
If the behavior can’t already be replicated through prompting, and the output format isn’t the issue but rather the model’s ability to handle emotive categorization, then that’s a clear signal the model still doesn’t quite have the intelligence level you’re expecting it to have. Fine-tuning won’t increase its reasoning capabilities to handle things it hasn’t specifically seen before. That’s why the result came out worse.