I’ve created a fine-tuned model in OpenAI using synthetic data generated with ChatGPT (a JSONL file).
Problem 1) I tested the fine-tuned model in the Playground (screenshot), and it returns an output that does not match the structure of the synthetic data (see the sample below).
{"messages": [{"role": "user", "content": "Can you suggest a lunch I can prepare in under 15 minutes?"}, {"role": "assistant", "content": "Day: Monday\nTarget Daily Calories: ~1300 kcal\n\n🍽️ Breakfast: Sunrise Energy Bowl\n- Total: ~450 kcal\n- Kcal per item:\n- 🥚 Eggs (2 large): ~140 kcal\n- 🌾 Whole Grain Bread (1 slice): ~100 kcal\n- 🥗 Spinach (50g): ~20 kcal\n- 🧀 Parmesan Cheese (15g): ~50 kcal\n- 🥑 Avocado (30g): ~90 kcal\n- Macronutrients:\n - 33g (~12% of RDI)\n - 24g (~34% of RDI)\n - 13g (~26% of RDI)\n\n🌟 Feel the progress, one meal at a time! 🌟\n\n🧠 Reasoning: Each macronutrient aligns with the RDI guideline: 33g of carbs is approximately 12% of the 275g RDI, 24g of fat equals about 34% of the 70g RDI, and 13g of protein covers around 26% of the 50g RDI. All ingredients are approved under the Military Diet, such as lean proteins (eggs), low-fat dairy (parmesan), and limited fruit portions. Breakfast is light and protein-rich, staying within the military diet's calorie and macro boundaries."}]}
Problem 2) I ran the same model in a Langflow custom component, where the OpenAI API key and model ID are correctly set. The user input and AI output from Langflow should appear in the OpenAI Logs section, but they’re not showing up as expected.
If all you’re trying to do is get a specific format, you really don’t need fine-tuning. You can achieve that with a system message. I see a lot of people attempting to save on tokens with fine-tuning, and it usually turns out not to be worth it.
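For example, here’s a minimal sketch of the system-message approach using the openai Python SDK (v1.x). The model name and prompt wording are my assumptions, not your actual setup:

```python
# Sketch: enforcing an output format with a system message instead of fine-tuning.
# Assumes the openai Python SDK v1.x; model name and format are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You are a meal-planning assistant. Always answer in exactly this structure:
Day: <weekday>
Target Daily Calories: ~<kcal> kcal

🍽️ <Meal>: <name>
- Total: ~<kcal> kcal
- Kcal per item:
- <emoji> <ingredient> (<amount>): ~<kcal> kcal
- Macronutrients:
 - Carbs: <g>g (~<percent>% of RDI)
 - Fat: <g>g (~<percent>% of RDI)
 - Protein: <g>g (~<percent>% of RDI)

🧠 Reasoning: <one short paragraph>"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # a base model; no fine-tune needed just for formatting
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Can you suggest a lunch I can prepare in under 15 minutes?"},
    ],
)
print(response.choices[0].message.content)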
As for the fine-tuning job at hand, you may need to include more examples before you see good results. OpenAI recommends at least 50 examples, but I usually need at least 100. I would personally find a prompt that generates good synthetic data and use the Batch API to send in a bunch of those requests. I would probably experiment with around 300 examples over 3–4 epochs, with an LR multiplier and batch size of 1. I’d expect a training loss of 0.6–0.7 with that setup, assuming single-turn examples only. Of course, you don’t need to configure hyperparameters if you don’t want to; all that’s really required is a good, large dataset.
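If you do want to set the hyperparameters explicitly, it would look roughly like this (openai Python SDK v1.x; the base model and file name are placeholders, not something from your setup):

```python
# Sketch: uploading a JSONL dataset and starting a fine-tuning job with the
# hyperparameters suggested above. Names and the base model are placeholders.
from openai import OpenAI

client = OpenAI()

# Upload the ~300-example training file
training_file = client.files.create(
    file=open("meal_plans.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed base model; pick whichever you're tuning
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 1,
    },
)
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) for status
```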
One other side note from Serge: the mismatch between the output you expect and the input you provide is a straight path to hallucinations and “odd” answers.
Let me explain:
From the sample you posted, the output has a clear structure and looks fairly “technical”…
While your input is free text in the form of a casual request.
Personally, I have the impression you’re trying to use fine-tuning to avoid paying for “additional context” tokens. The assistant’s output reads as if it were based on a large list of available ingredients, a diet-strategy table, and calendar info… but none of that is present in the input/output pairs.
So your fine-tuning will end up teaching your model how to “guess” the missing information from the (tone? word choices?) of the user message. The more samples you feed in, the more “weird” patterns your fine-tuned model will find.
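If the assistant’s answers really do depend on an ingredient list, diet rules, and calendar info, one way out is to put that context into every training example, e.g. in a system message, so the model has something real to condition on. A rough sketch; the file name, context string, and field contents are purely illustrative:

```python
# Sketch: writing training examples that carry the context the assistant's
# answer actually depends on (ingredients, diet rules, targets), instead of
# leaving the model to guess it from the user's tone. All values illustrative.
import json

context = (
    "Diet: Military Diet, target ~1300 kcal/day. "
    "Approved ingredients: eggs, whole grain bread, spinach, parmesan, avocado. "
    "RDI: carbs 275g, fat 70g, protein 50g."
)

example = {
    "messages": [
        {"role": "system", "content": context},  # the otherwise-missing information
        {"role": "user", "content": "Can you suggest a lunch I can prepare in under 15 minutes?"},
        {"role": "assistant", "content": "Day: Monday\nTarget Daily Calories: ~1300 kcal\n..."},
    ]
}

with open("meal_plans.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Note the trade-off: the same system message then has to be sent at inference time too, which is exactly the “additional context” cost the fine-tuning was presumably meant to avoid.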
So before giving any meaningful advice, I’d like to hear the whole story of what you’re building and why.
I second this. Attempting to fine-tune new knowledge into a model will only result in it hallucinating, because fine-tuning can’t expand the model’s dimensions or add any neurons; it can only refocus ideas the model already has. This is especially true if, during fine-tuning, the model isn’t told what it’s doing or what the goal is.
In my experience, fine-tuning with properly engineered prompts significantly improves accuracy.