I want to try fine-tuning and see what results I get. My test is seeing if I can “clone” someone like Noam Chomsky by feeding the fine-tuning API a bunch of his interviews. The prompts are the interview questions, and his answers are the completions. Right now I have 1,020 prompt/completion pairs in a single 2.1 MB file.
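For context, each line of the file is a standalone JSON object, roughly like this (placeholders instead of the real interview text):

```json
{"prompt": "<interview question>", "completion": "<Chomsky's answer>"}
```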
My questions are the following:
- Do these need to be broken up further into smaller files?
- The interview topics are all over the place, from linguistics to the economy to the Middle East conflict to media analysis. Does that matter, or should these be organized in some specific way?
- I’ve seen in some docs that the prompts and completions need to be appended with some kind of unique string to signal the end of the prompt or completion. I have not done this. Why does this need to be done when the prompts are in a `prompt` property, and the same for the completions? What’s the point of splitting them into these objects if that doesn’t let the model know where a prompt/completion begins and ends? (If the strings are needed, I sketched how I’d add them after this list.)
- Some of Chomsky’s answers are long-winded. Is 2049 the token limit for the prompt and completion together? I guess I would need to go back and shorten the long-winded answers or remove them entirely; there’s a token-count sketch after this list for finding them.
- Finally, the file with the interviews is the training data. I see that there’s something about validation data. What is that supposed to contain? I’m a little confused by that.
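To make the separator question concrete: if those strings do turn out to be required, my plan would be roughly the sketch below. The `\n\n###\n\n` separator and the leading space / trailing newline on the completion are just the examples I’ve seen in the docs, and the file names are placeholders, so treat all of that as assumptions.

```js
// Sketch, assuming separators are needed: append the strings the docs give
// as examples ("\n\n###\n\n" after the prompt; a leading space and trailing
// "\n" on the completion). File names stand in for my real files.
const fs = require("fs");

const pairs = fs
  .readFileSync("chomsky-interviews.jsonl", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const prepared = pairs.map(({ prompt, completion }) => ({
  prompt: prompt + "\n\n###\n\n",
  completion: " " + completion + "\n",
}));

fs.writeFileSync(
  "chomsky-interviews-prepared.jsonl",
  prepared.map((p) => JSON.stringify(p)).join("\n") + "\n"
);
```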
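And for the 2049-token question, my rough plan for finding the long-winded pairs is to count tokens with the gpt-3-encoder npm package (assuming its tokenizer matches what the API actually uses) and flag anything over the limit:

```js
// Sketch: count tokens per pair and flag any that exceed the limit.
// Assumes gpt-3-encoder matches the model's tokenizer; file name is a
// placeholder for my real training file.
const fs = require("fs");
const { encode } = require("gpt-3-encoder");

const LIMIT = 2049; // the limit I'm asking about (prompt + completion together?)

const pairs = fs
  .readFileSync("chomsky-interviews.jsonl", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

pairs.forEach(({ prompt, completion }, i) => {
  const total = encode(prompt).length + encode(completion).length;
  if (total > LIMIT) {
    console.log(`pair ${i}: ${total} tokens, needs shortening or removal`);
  }
});
```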
P.S. For what it’s worth, I’m using the Node.js bindings of the API, as I have never used Python before.