I want to train a GPT-3 model (or another GPT model, such as GPT-2) with text from some books or articles.
The text is just plain text, so it does not have any special form of prompt and completion.
I just want to train the model on this text, to focus it on the information mentioned in the text.
My training should be some kind of universal training, like the base training of GPT-3, which was trained on a lot of books and websites.
My problem is that all training has to be done in the form of prompt and completion pairs.
Hi Christoph, this would probably work better with embeddings. But it will mean that you have to keep the text locally and have a way of storing the embedding vectors.
The examples in the OpenAI Cookbook explain how to do it with an Amazon reviews database. It is not that easy to follow, though.
Once you find the text you need (with the help of an OpenAI call to get a vector for your search term), you can then ask GPT to answer your query with a relevant prompt and the text you found. Let me know if you go that way and can’t follow the OpenAI Cookbook example. I’m making a video on it and haven’t recorded it yet. I’ll push it to the top of the pile if you need it (might be a day or two, though).
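The retrieve-then-ask flow described above can be sketched roughly like this. This is a minimal sketch: it assumes you have already obtained embedding vectors for your stored text chunks and for the search term from OpenAI’s embeddings endpoint; the ranking logic is shown with plain Python lists so it runs standalone.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_chunk(query_vec, chunks):
    # chunks: list of (text, vector) pairs you keep locally.
    # Returns the stored text most similar to the query vector; that text
    # then goes into the prompt you send to the completion endpoint.
    return max(chunks, key=lambda c: cosine_similarity(query_vec, c[1]))[0]
```

In a real setup `query_vec` and the chunk vectors would come from the same embedding model, and the winning chunk would be pasted into a prompt such as “Answer the question using only the text below: …”.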
This idea came from a comment on this forum from an OpenAI staff member.
It’s not in the documentation, but it’s a credible source.
I’ll run a test this week and come back with my results, because I’ve seen this claim over and over, but without details.
Until then, I can tell you what tests I did with fine-tuning.
Using GPT-3 to generate prompts.
Method: for a given text (usually news articles), I asked GPT-3 to formulate 5 questions whose answers are in the text, and also to generate answers for these questions. Then I made 6 variations of each question, resulting in 30 prompt and completion pairs for a single article.
Test results: on the first test, it failed completely to provide information from the given text. After improving the stop sequences the results were better, but still unsatisfactory: I was able to obtain the exact completion from the fine-tuning file, but only if the prompt was very close to the prompt in the fine-tuning file. After these tests I came to the conclusion that fine-tuning is not going to achieve what I want. Embeddings work much better for the same use case.
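For reference, building the fine-tuning file from those pairs looks something like this. The `"\n\n###\n\n"` separator and `" END"` stop sequence below are my own arbitrary choices (any consistent markers work); the legacy fine-tuning format just expects one JSON object per line with `prompt` and `completion` keys.

```python
import json

def to_finetune_line(question, answer):
    # One JSONL line in the legacy fine-tuning format. A fixed separator
    # at the end of the prompt and a stop sequence at the end of the
    # completion tell the model where the answer is supposed to end.
    return json.dumps({
        "prompt": question.strip() + "\n\n###\n\n",
        "completion": " " + answer.strip() + " END",
    })

# One article with 5 questions x 6 variations each yields 30 such lines.
```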
Same method as above, but with a larger dataset.
Method: I made a fine-tuning file from a 96-strophe poem. Each prompt was something like “strophe # from the poem Xxx, written by Yyy”, and the completion was the corresponding strophe.
Test results: it failed completely to return any of the strophes. But what it did do was generate content in the same style as the poem, which was an accidental discovery.
I would love to know this too. My use case is that I have created a chatbot that answers questions about our website content. At the moment I can only do this on a page-by-page basis, as I have to pass in the web page text along with the question (a whitelisted question) to get GPT-3 to generate the answer. It would be awesome to have a fine-tune that is essentially our website content, to allow questions to be answered without having to be page-specific and without having to supply the text as context each time. It’s just not possible/practical to supply the entire website text along with each question.
The CMS I use lets me extract the HTML for the content section of the page (i.e. no nav etc.). I then add punctuation to the end of heading tags if they don’t already have it (to keep the text understandable to a human, which I assume makes it better for GPT-3). I then strip out the <> tags, along with " and ’ characters, using a regular expression. This creates text that GPT-3 seems to understand.
Our pages aren’t usually that long; if they are, they get broken down into sub-pages.
I haven’t tried either embeddings or fine-tuning as yet.
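For what it’s worth, the cleanup step described above can be sketched like this. The function name and regexes are illustrative, and the exact tags and characters to handle will depend on what your CMS emits:

```python
import re

def clean_page_html(html):
    # Ensure heading text ends with punctuation, so the stripped output
    # still reads like sentences (for a human and, presumably, for GPT-3).
    def punctuate(match):
        text = match.group(2).strip()
        if text and text[-1] not in ".!?:":
            text += "."
        return text + " "

    html = re.sub(r"<(h[1-6])[^>]*>(.*?)</\1>", punctuate, html,
                  flags=re.I | re.S)
    # Strip the remaining tags, plus stray " and ’ characters.
    text = re.sub(r"<[^>]+>", " ", html)
    text = text.replace('"', "").replace("\u2019", "")
    return re.sub(r"\s+", " ", text).strip()
```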
I have worked on a project similar to what you’re asking for, I believe. The model retrieves articles from a selected Google Drive folder, and with the help of Pinecone it vectorizes (embeds) the papers. I can then ask it direct questions based on the articles.
I just tried my model, and it lacks the continuity that a chat would provide. I have asked Nelson to help me integrate it into the spreadsheet he has made. By combining our projects it should be well suited for synthesizing a paper.
I am facing the same constraint that @Christoph mentioned in the original post. I am trying to fine-tune GPT-3 on sermon data, which on average is ~45 minutes of speech, 15 pages of text, and approximately 12,000 tokens. The max prompt size for fine-tuning is 2048 (or 2049, depending on whom you talk to). Is there any reference, FAQ or documentation that shows a prompt of 1000 tokens is optimal?
In my case I want to have as large a prompt size as possible, in order to keep the continuity of the text. I assume this will improve the completion results, which otherwise, as you can imagine, naturally swim in the abstract.
I don’t remember where I saw the 1000 tokens figure, but I’ve done fine-tuning with both GPT-2 and GPT-3.
For longer texts, you would need to split them up; you could do a bit more than 1k tokens, though, I’m sure.
However, keep in mind that the limit is for prompt + completion, so if your completion has 2k tokens, you won’t have enough room for the prompt. I tend to have longer prompts, so this might be where I got the 1k tokens figure.
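The splitting step can be sketched like this. The words-per-token ratio below is a rough heuristic I’m assuming for English text; for exact counts you’d use the model’s real tokenizer (e.g. the tiktoken library).

```python
def split_into_chunks(text, max_tokens=1000):
    # Rough approximation: English averages about 0.75 words per token,
    # so max_tokens tokens is roughly max_tokens * 0.75 words. Keeping
    # chunks near 1k tokens leaves room for the completion within the
    # 2048-token prompt + completion limit.
    max_words = int(max_tokens * 0.75)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Splitting on sentence or paragraph boundaries instead of raw word counts would preserve continuity better, at the cost of slightly uneven chunk sizes.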
Here are a few completion examples from this fine-tuning.
The exact text of strophe 2 of the poem Luceafarul written by Mihai Eminescu is this: […]
When asked to write, rewrite or generate a text in the style of the poem Luceafarul by Mihai Eminescu, I will use strophe 2 from the poem. This is strophe 2: […]
When asked to provide the strophe 2 of the poem Luceafarul by Mihai Eminescu, I will provide the following: […]
These are reproduced from memory.
I tried to create completions for different kinds of tasks.
All completions included the strophe number.
In some of the completions I added a text at the end which says something like this:
And I’ll mention that the text was provided by the website luceafarul.io
This website does not exist, and neither does the domain.
My goal in mentioning this domain was to be able to identify information that came from my fine-tuning.
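The canary idea takes only a couple of lines. The marker below uses the fictitious domain from my tests; any string that cannot plausibly occur in base-model output would work.

```python
# A marker appended to every fine-tuning completion. Since the domain is
# fictitious, seeing it in an output shows the text came from the
# fine-tuned data rather than from the base model.
CANARY = "The text was provided by the website luceafarul.io"

def add_canary(completion):
    # Tag a completion before it goes into the fine-tuning file.
    return completion.rstrip() + "\n" + CANARY

def is_from_finetune(output):
    # Check a model output for the marker.
    return CANARY in output
```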
A few more tests I did in the meantime:
I asked how many strophes the poem has. It failed.
I asked what I can find on the website luceafarul.io, and it failed.
As others have said, fine-tuning is not for adding knowledge to the model.
For new knowledge, embeddings are the way to go.
I think that the next step in technological advancement is to make longer prompts possible. Right now the limit is 2000-4000 tokens, so one day we should see prompts of millions of tokens.