Personally, I notice spacing issues in the completion:
Incorrect spacing may cause issues with how the text is tokenized. I bet this comes from faulty code that removes the line breaks in your completion generator. The approach I usually use is:
1. Break the text into lines
2. Trim each line separately
3. Implode (join) the array of lines with one space
A regex that collapses multiple spaces into a single space is also a must. Finally, make sure punctuation is followed by a space (I don’t see exceptions to this rule, do you?).
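The cleanup steps above could be sketched roughly like this in Python. The function name `normalize_spacing` is made up for illustration, and the punctuation rule is my own assumption: it only adds a space when a letter follows directly, so numbers like "3.14" are left alone, but it would still split things like "e.g." or domain names, so check it against your data first.

```python
import re


def normalize_spacing(text: str) -> str:
    """Hypothetical helper: flatten line breaks and tidy up spacing."""
    # Steps 1-3: break into lines, trim each line, join with one space.
    lines = [line.strip() for line in text.splitlines()]
    joined = " ".join(lines)
    # Collapse any remaining run of whitespace into a single space.
    joined = re.sub(r"\s+", " ", joined)
    # Assumption: add a space after punctuation only when a letter follows
    # immediately, which leaves decimals such as 3.14 untouched.
    joined = re.sub(r"([.,;:!?])(?=[A-Za-z])", r"\1 ", joined)
    return joined.strip()
```

For example, `normalize_spacing("are integral.Remove either one,\n\nor both. ")` gives back one line with single spaces and a space after the first period.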
Then the format I’d use would be something like this:
{"prompt":"<|title|>The Link Between Character Arcs and Story Structure<|endoftext|><|paragraph|>","completion":" Too often, character and plot are viewed as separate entities — to the point that we often pit them against each other, trying to determine which is more important. But nothing could be farther from the truth. Plot and character are integral to one another. Remove either one from the equation, or even just try to approach them as if they were independent of one another, and you risk creating a story that may have awesome parts but which will not be an awesome whole. We often think of the plot as being about structure, but our notions of character and character arc tend toward the more airy-fairy. Surely, a character arc is something that must evolve organically from the characters themselves. Surely, we can’t structure our character arcs without making them formulaic or robbing them of life and spontaneity. Surely, right? Wrong, actually. When we say plot and character are integral to one another, we’re saying that plot structure and character arc are integral to one another. In his classic Story, Robert McKee says: If you’re familiar with the basics of story structure, then you can probably already see some of this structuring of character arc in action. The Major Plot Points all revolve around the character’s actions and reactions. As Michael Hauge says in Writing Screenplays That Sell: The character drives the plot, and the plot molds the character’s arc. They cannot work independently.<|endoftext|>"}
Use ‘<|endoftext|>’ as the stop sequence in your completion API calls, so generation ends where your training completions end.
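As a rough sketch, the request body for a completion call with that stop sequence could be assembled like this. The helper name `build_completion_request`, the model name in the usage note and the `max_tokens` default are placeholders of mine; `model`, `prompt`, `max_tokens` and `stop` are the usual fields of the completions endpoint, but check the current API reference before relying on them.

```python
STOP = "<|endoftext|>"


def build_completion_request(model: str, title: str, max_tokens: int = 500) -> dict:
    """Hypothetical helper: build the body of a completion request."""
    return {
        "model": model,
        # Same prompt layout as in the training example above.
        "prompt": f"<|title|>{title}{STOP}<|paragraph|>",
        "max_tokens": max_tokens,  # placeholder value, tune for your texts
        # Generation stops as soon as the model emits this marker.
        "stop": [STOP],
    }
```

You would call it as something like `build_completion_request("your-fine-tuned-model", "The Link Between Character Arcs and Story Structure")` and send the resulting dict with whatever HTTP client or SDK you already use.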
Also, I wouldn’t rely on tools to sanitize the training data. I personally prefer my own code (PHP json_encode()), which takes care of escaping special characters, etc., so I can be sure the data I’m giving to the model is as clean as I can make it.
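If you are not on PHP, the same idea works in any language with a JSON encoder. Here is a minimal Python sketch where `json.dumps` plays the role of json_encode(); the function name `make_record` is hypothetical, and the prompt/completion layout simply mirrors the example above.

```python
import json

END = "<|endoftext|>"


def make_record(title: str, paragraph: str) -> str:
    """Hypothetical helper: build one JSONL training line."""
    record = {
        "prompt": f"<|title|>{title}{END}<|paragraph|>",
        # The completion starts with one space and ends with the stop marker.
        "completion": f" {paragraph.strip()}{END}",
    }
    # json.dumps escapes quotes, backslashes and control characters, so the
    # record always fits on a single line of the JSONL file.
    return json.dumps(record, ensure_ascii=False)
```

Write one such line per example to your training file. `ensure_ascii=False` keeps characters like ’ readable instead of turning them into \u escapes; both forms are valid JSON.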
Another thing to play with is the n_epochs parameter; here, you have a shorter prompt and longer completion, so I would try to increase that parameter to 5 or 6 to start with and see if it gives better results.
But there is one thing that triggers an alarm for me in your initial post:
What is the final goal you’re trying to achieve?