I want to fine-tune gpt-3.5 on a couple of books that are pretty long. What would the training data look like in terms of format?
Would it be something like (prompt: here is page 1 of Game of Thrones, completion: page 1), or (prompt: here is Game of Thrones, completion: the entire book)?
The best method I know of so far is to leave the “user” prompt blank and fill the “assistant” role with around 1,000 tokens of text, repeating that across as many examples as it takes to cover the entire book.
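A rough sketch of that chunking, assuming the chat-style JSONL format used by the fine-tuning API; `book_to_training_examples` and the ~750-words-per-chunk heuristic (approximating 1,000 tokens) are illustrative choices, not a prescribed recipe:

```python
import json

def book_to_training_examples(text, chunk_words=750):
    # ~750 words roughly approximates 1,000 tokens;
    # an exact count would require a real tokenizer
    words = text.split()
    examples = []
    for i in range(0, len(words), chunk_words):
        chunk = " ".join(words[i:i + chunk_words])
        examples.append({
            "messages": [
                {"role": "user", "content": ""},       # blank user prompt
                {"role": "assistant", "content": chunk} # book text as the reply
            ]
        })
    return examples

def write_jsonl(examples, path):
    # one JSON object per line, as the fine-tuning upload expects
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

Each line of the resulting file is one training example, so a long book simply becomes many consecutive blank-prompt examples.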
The AI can only attend to a limited amount of text (its context window) at once.
You’ll likely want to investigate an embeddings vector database instead, which is also powered by AI. This approach breaks something like a book into smaller, understandable chunks, each paired with a semantic vector returned by an embeddings engine. A vector-similarity comparison of the user’s input against the database can then give the answering AI more knowledge by supplying it with the most relevant pieces of the literature.
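To make the retrieval step concrete, here is a minimal sketch of chunk-embed-and-search. The toy bag-of-words embedding stands in for a real embeddings engine (in practice you would call an API and store the vectors in a proper database); the function names and chunk size are assumptions for illustration:

```python
import math
from collections import Counter

def toy_embed(text):
    # stand-in for a real embeddings engine: a bag-of-words
    # frequency vector instead of a learned semantic vector
    return Counter(text.lower().split())

def cosine(a, b):
    # cosine similarity between two sparse word-count vectors
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_index(book_text, chunk_words=200):
    # split the book into fixed-size chunks and embed each one
    words = book_text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def top_chunks(index, query, k=3):
    # compare the user's query against every stored chunk vector
    # and return the k most similar passages
    qv = toy_embed(query)
    ranked = sorted(index, key=lambda cv: cosine(qv, cv[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

The chunks returned by `top_chunks` are what you would paste into the answering model’s prompt as extra context.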
Training, by contrast, refers to methods that alter the operation of the AI itself, rather than giving it material to reference.