Submitted the 8 n_epochs jobs late yesterday and woke up today to find 2 of them had failed, which is the first time this has happened, so we only have the results for ada at the moment.
@ruby_coder These results make me think that the decoder layer has a fixed size across all the models, since you get the same results for the same number of epochs. That is surprising, because fine-tuned pricing scales exponentially across the models; you would think it would be base-model pricing plus a constant fee per 1K tokens. Maybe it’s just a coincidence, I don’t know.
Let me know if you want me to run some different tests.
I have plenty of OpenAI credits to run these experiments, compliments of OpenAI, and as you have seen in the results and screenshots, I have an easy-to-configure, easy-to-use lab setup.
Suggestions on projects and experiments to improve the signal-to-noise ratio here in the developer community would be greatly appreciated.
Maybe start a new topic with your fine-tuning details, @georgei ?
The results will only be as good as your JSONL data (which we have not seen), the formatting (which we have not seen) and how you fine-tune (which we have not seen).
As this is a technical forum for developers, you should post your code (which we have not seen), your JSONL data (which we have not seen), and your fine-tuning parameters (which we have not seen).
If you wish help (assuming you do want help fine-tuning), then kindly start a new topic with all your details so we can assist you.
Got it!
I’ll try to avoid your topics then, in the future.
To clarify my comment a bit, here are some specs of what I did:
I tried to replicate @ruby_coder’s parameters, including the stop words, down to the blank character before “####” or “++++”. I haven’t tried the temperature he used, but it made no sense to, since my use case was clearly failing even with a more advantageous temperature.
My goal was to have the model use the data I provided in circumstances different from those given in the fine-tuning file.
I’ll expand the second point a bit:
- If you do fine-tuning and expect GPT-3 to use your data in ways different from how it appears in the training file, it is not going to work. My goal was to test whether a higher number of epochs would change this outcome.
- A more concise example of what will not happen even with more epochs (see the sketch after this list): prompt: “what is your favorite color?”, completion: “I’m in love with red”. After fine-tuning, if you ask GPT-3 the exact words from the prompt, it will answer correctly. But if you ask GPT-3 to name an object with your favorite color, it will throw a random response.
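For concreteness, here is roughly what that single training line would look like as JSONL, using the blank character plus “####” prompt separator and “++++” completion stop mentioned above. This is a sketch of the format, not the actual file from my runs:

```jsonl
{"prompt": "what is your favorite color? ####", "completion": " I'm in love with red ++++"}
```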
That’s not what my reply said or implied. I simply stated that you should create your own topic if you have an issue and need assistance, not hijack a topic with vague statements like “I did this test and the results were as bad as always,” which does not make technical sense at all, since we have no idea what you did: you refuse to post your code, your JSONL file, your parameter details, or anything else that others can review.
Of course, if you completely change the prompt as you have done, it will not match. That’s just fine-tuning basics and has little to do with this tutorial on how to use n_epochs. You simply changed the prompt to something dissimilar (which would not statistically match in the decoder) to the original fine-tuning prompt and then you say “it does not work,” which has very little to do with the original topic. That is why I suggested you create a new topic if you need help fine-tuning with a specific prompt.
The prompt does not have to be exact, but it should be close. I tested other word choices (not exact matches) and the results were fine. Your example completion prompt, @georgei, is statistically different, so it will require a different fine-tuning prompt. It’s simple statistics.
Yes, because that was not how the fine-tuning was done. You are expecting a single-line fine-tuning to match a dissimilar text. GPTs generate text based on probabilities, not on inference. GPTs are not inference engines; they are “fancy” auto-completion text generators.
I think you, @georgei, have both changed the use case and the topic; if you want a single-line fine-tuning to match your example prompt, it’s easy to create one, and it really has very little to do with my n_epochs examples here. We can easily create a fine-tuning which will match both my original prompt and your challenge prompt, but it will take two JSONL lines (two prompt-completion key-value pairs), not one.
Honestly, @georgei, I am sure you understand this already, but if not, I will create a two-line JSONL tuning file that will match both of our prompt examples and show you the detailed results.
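A sketch of what such a two-line file might look like; the second completion’s wording is a placeholder of mine, not something settled in this thread:

```jsonl
{"prompt": "what is your favorite color? ####", "completion": " I'm in love with red ++++"}
{"prompt": "name an object with your favorite color ####", "completion": " A red apple ++++"}
```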
@ruby_coder Can you next test to see if your fine-tune is overfitted?
For example, see if it can generalize:
Put in slight semantic variations of the prompt and see what the completion is?
This could be a word difference or a casing difference; even a change in capitalization produces different input tokens, as the snippet below demonstrates.
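A quick, illustrative way to check this yourself, using the tiktoken library with the r50k_base encoding (the one used by the GPT-3 base models such as ada); the prompt variants are just examples:

```python
import tiktoken

# r50k_base is the encoding used by the GPT-3 base models (ada, babbage, ...).
enc = tiktoken.get_encoding("r50k_base")

for prompt in [
    "what is your favorite color?",    # original
    "What is your favorite color?",    # casing difference
    "what is your favourite colour?",  # wording (spelling) difference
]:
    print(prompt, "->", enc.encode(prompt))

# Each variant encodes to a different token sequence, so the model sees
# three different inputs even though they mean the same thing to a human.
```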
If it doesn’t generalize well, it’s overfitted (looking at N=16 and N=32 here).
Also, can you re-run these for N=4 and N=8, but with a temperature of 0? Lowering the temperature may be all you need to get these to work without overfitting.
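For reference, a minimal sketch of that re-run with the legacy openai Python SDK (pre-1.0) Completions API; the fine-tuned model name is a placeholder, and the prompt/stop conventions follow the ones used earlier in this topic:

```python
import openai

openai.api_key = "sk-..."  # your API key

response = openai.Completion.create(
    model="ada:ft-your-org-2023-02-25-00-00-00",  # placeholder fine-tune name
    prompt="what is your favorite color? ####",
    temperature=0,    # always take the most likely token
    max_tokens=20,
    stop=[" ++++"],   # stop sequence matching the training convention
)
print(response["choices"][0]["text"])
```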
Sure, but severely overfitted just means you created a lookup table!
If the model can still respond reasonably well on a fine-tune with higher epochs, I think this is what the community’s use case is, right? But if it falls apart with slight wording differences, then this would be bad. The reason is that most inputs are not exact and uniform all the time, so generalization would be good here.
I don’t think this is necessarily the community use case, but I get your point.
The “community use case” I was responding to was people saying they cannot get a match at all (as stated in my first post), that fine-tuning does not work, and that they get nothing but garbage.
You are creating a new “test for overfitting” use case to test which I think should be in a different topic.
This was also the reason I did not respond to @georgei’s first query: he was posting a prompt which I knew would not generate his expected results, because the model was not fine-tuned for that prompt, and so I thought it was off-topic for this discussion.
Perhaps I tend to be very focused? The title of this topic is:
Fine-Tuning In a Nutshell with a Single Line JSONL File and n_epochs
Not:
Fine-Tuning In a Nutshell with a Single Line JSONL File and n_epochs and testing for overfitting
In my mind, they are different topics and, to be honest, I have not seen any community member complain about “overfitting” in the five weeks I have been active here; the subject seems to have come up in this topic as a “theoretical.”
Also, I mentioned earlier that I thought “overfitting” could be positive or negative, based on the use case.
I think most people are under-fitting, which is a problem (the default is 4 epochs), and that is why your post shows how to “fix” this problem.
But the next concern, after everyone does N=16 or N=32 epoch fine-tunes, is overfitting, or possible overfitting. It shows up as similar inputs not going to their expected outputs.
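For anyone following along, a minimal sketch of raising n_epochs with the legacy OpenAI CLI (openai-python pre-1.0); the file name and base model are placeholders:

```bash
# Fine-tune the ada base model for 16 epochs instead of the default 4.
openai api fine_tunes.create \
  -t training_data.jsonl \
  -m ada \
  --n_epochs 16
```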
Here ya go, @curt.kennedy, a bunch of “slight variations”… as requested (16 n_epochs). Is this overfitting or not? Does not seem overfitted to me, especially in light of variation 5.