Testing Fine-tuning on a dataset of Chomsky interviews

I want to try fine-tuning and see what results I get. My test is seeing if I can “clone” someone like Noam Chomsky by feeding the fine-tuning API a bunch of his interviews. The prompts are the interview questions, and his answers are the completions. Right now I have 1,020 prompt-completion pairs in a single 2.1 MB file.
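For reference, each line of the file is a JSON object shaped like this (placeholder text, not an actual pair from my data):

```jsonl
{"prompt": "<interview question>", "completion": "<Chomsky's answer>"}
```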

My questions are the following:

  1. Do these need to be broken up further into smaller files?

  2. The interview topics are all over the place, from linguistics to the economy to the Middle East conflict to media analysis, etc. Does that matter, or should these be organized in some specific way?

  3. I’ve seen in some docs that the prompts and completions need to have some kind of unique string appended to signal the end of the prompt or completion. I have not done this. Why does this need to be done when the prompts are in a `prompt` property and the completions in a `completion` property? What’s the point of splitting them into these fields if that doesn’t let the model know where a prompt/completion begins and ends?

  4. Some of Chomsky’s answers are long-winded. Is 2049 the token limit for the prompt and completion together? If so, I guess I’d need to go back and shorten the long-winded answers or remove them entirely.

  5. Finally, the file with the interviews is the training data. I see that there’s also something about validation data. What is that supposed to contain? I’m a little confused by it.
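To make questions 3 and 5 concrete, here’s the kind of preprocessing I’m imagining in Node. The separator and stop strings here are placeholders I made up (the docs suggest picking strings that don’t occur in the data), and the 80/20 split is just an arbitrary choice:

```javascript
// Placeholder separator/stop strings -- my own choice, not mandated by the docs.
const SEP = '\n\n###\n\n';
const STOP = ' END';

function prepare(pairs, validationFraction = 0.2) {
  // Append a fixed separator to each prompt and a stop string to each
  // completion, then carve off a slice of the pairs for validation.
  const formatted = pairs.map(({ prompt, completion }) => ({
    prompt: prompt + SEP,
    // Completions in the examples I've seen start with a leading space.
    completion: ' ' + completion + STOP,
  }));
  const nValid = Math.floor(formatted.length * validationFraction);
  return {
    train: formatted.slice(nValid),
    validation: formatted.slice(0, nValid),
  };
}

// Each array would then be written out as JSONL, one object per line:
//   train.map(p => JSON.stringify(p)).join('\n')
```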

P.S. For what it’s worth, I’m using the Node.js bindings of the API, as I’ve never used Python before.

Damn, what did I do to get zero replies??


I’m just getting started on this stuff myself. Seems there’s either not much discussion of it, or there’s some sort of anti-cult saying “do not engage”.

I’ll note that my first effort, which was simply “Hey. Do this.” followed by massive amounts of text, did NOT go over well.

I then took the chatbot’s advice and broke every sentence into successive prompt:completion pairs.

Bad code made it into the (validated) mix and the result is chaos. Seemingly angry chaos at that LOL. Never been yelled at by the mirror before!

My next try is to use HTML headings (h1–h4) as prompts with their article sections as completions, THEN the exploded-sentence method with better code structure using the DOM.
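The heading→section idea, sketched below. A real version would use an actual DOM parser (e.g. jsdom); this regex split is just to show how headings pair up with the text that follows them:

```javascript
// Sketch: turn h1-h4 headings into prompts and the section text that
// follows each heading into the completion. Regex-based for brevity;
// a real implementation should walk a parsed DOM instead.
function sectionsToPairs(html) {
  // Splitting on heading tags with a capture group keeps the heading
  // text in the result array: [before, heading, body, heading, body, ...]
  const parts = html.split(/<h[1-4][^>]*>([\s\S]*?)<\/h[1-4]>/gi);
  const pairs = [];
  for (let i = 1; i < parts.length; i += 2) {
    const prompt = parts[i].trim();
    const completion = (parts[i + 1] || '')
      .replace(/<[^>]+>/g, ' ') // strip remaining tags
      .replace(/\s+/g, ' ')     // collapse whitespace
      .trim();
    if (prompt && completion) pairs.push({ prompt, completion });
  }
  return pairs;
}
```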

Even to automate, all of this is tedious.

And either there is no tutorial lifeline (outside of the API docs) or … we ARE the ones leading the effort.