Train (fine-tune) a model with text from books or articles

Hello,

I want to train a GPT-3 model (or some other GPT model like GPT-2) with text from some books or articles.
The text is just plain text, so it does not have any special form of prompt and completion.
I just want to train the model with this text, to focus it on the information mentioned in the text.
My training should be some kind of universal training, like the base training of GPT-3, which was done on a lot of books and websites.

My problem is that all training has to be done in the form of
prompt: completion
prompt: completion

How can I train the model with just plain text?

Regards,

Chris

11 Likes

I believe they recommend leaving the prompt blank and just filling in the completion with around 1,000 tokens.
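As a rough sketch of what such a training file could look like (the file name and passages here are placeholders; the JSONL format with `prompt`/`completion` keys is what the fine-tuning endpoint expects, and the leading space on each completion follows OpenAI’s fine-tuning data guidelines):

```python
import json

# Placeholder plain-text passages, each up to roughly 1,000 tokens.
passages = [
    "First chunk of book text goes here...",
    "Second chunk of book text goes here...",
]

# Build JSONL records with an empty prompt, as suggested above.
with open("plain_text_finetune.jsonl", "w") as f:
    for text in passages:
        record = {"prompt": "", "completion": " " + text}
        f.write(json.dumps(record) + "\n")
```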

Hope that helps!

3 Likes

Yes, I’ve heard about this idea before, but is there any official documentation about this topic?

Hi Christoph, this would probably work better with embeddings. But it will mean that you have to keep the text locally and have a way of storing the embedding vectors.

The examples in the cookbook explain how to do it with an Amazon review database. It is not that easy to follow, though.

Once you find the text you need (with the help of an OpenAI call to get a vector for your search term), you can then ask GPT to answer your query with a relevant prompt and the text you found. Let me know if you go that way and can’t follow the OpenAI cookbook example. I’m making a video on it and haven’t recorded it yet. I’ll push it to the top of the pile if you need it (might be a day or two though)
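A minimal sketch of that retrieval step, assuming toy vectors (in a real setup, every vector below would come from an OpenAI embeddings API call for that text or query):

```python
import math

# Toy corpus: text chunks mapped to pretend embedding vectors.
corpus = {
    "The capital of France is Paris.": [0.9, 0.1, 0.0],
    "Photosynthesis occurs in chloroplasts.": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_match(query_vector):
    # Rank stored chunks by cosine similarity to the query vector.
    return max(corpus, key=lambda text: cosine_similarity(corpus[text], query_vector))

# Pretend this vector came from embedding the user's question.
query_vector = [0.8, 0.2, 0.1]
context = best_match(query_vector)

# The retrieved chunk is then pasted into a completion prompt.
prompt = (
    f"Answer using only this text:\n{context}\n\n"
    "Question: What is the capital of France?\nAnswer:"
)
```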

7 Likes

This idea came from a comment on this forum from an OpenAI staff member.
Not in documentation, but credible source.

I’ll run a test this week and come back with my results, because I’ve seen this suggested over and over, but without details.

Until then, I can tell you what tests I did with fine-tuning.

  1. Using GPT-3 to generate prompts.
    Method: for a given text (usually news articles), I asked GPT-3 to formulate 5 questions that can be answered from the text, and also to generate answers for these questions. Then I made 6 variations of each question, resulting in 30 prompt/completion pairs for a single article.
    Test results: on the first test, it failed completely to provide information from the given text. After improving the stop sequences the results were better, but still unsatisfactory: I was able to obtain the exact completion as in the fine-tuning file, but only if the prompt was very close to the prompt from the fine-tuning file. After these tests I came to the conclusion that fine-tuning is not going to achieve what I want. Embeddings work much better for the same use case.

  2. Same method as above, but with a larger dataset.
    Method: I made a fine-tuning file from a 96-strophe poem. Each prompt was something like “strophe # from the poem Xxx, written by Yyy” and the completion was the corresponding strophe.
    Test results: it failed completely to return any of the strophes. But what it did do was generate content in the same style as the poem, which was an accidental discovery :slight_smile:

2 Likes

I would love to know this too. My use case is that I have created a chatbot that answers questions about our website content. At the moment I can only do this on a page-by-page basis, as I have to pass in the web page text along with the question (a whitelisted question) to get GPT-3 to generate the answer. It would be awesome to have a fine-tuning that is essentially our website content, allowing questions to be answered without having to be page-specific and without having to supply the text as context each time. It’s just not possible/practical to supply the entire website text along with each question.
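A minimal sketch of that page-by-page pattern (the page text, question, and model name are placeholders; the point is that the full page content rides along with every single question):

```python
# Placeholder page content extracted from the CMS.
page_text = "Our returns policy: items can be returned within 30 days of purchase."
question = "How long do I have to return an item?"

# The page text is supplied as context with every question,
# which is exactly the limitation described above.
prompt = (
    "Answer the question using only the page content below.\n\n"
    f"Page content:\n{page_text}\n\n"
    f"Question: {question}\nAnswer:"
)
# response = openai.Completion.create(model="text-davinci-003", prompt=prompt)  # the actual API call
```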

4 Likes

@ddrechsler That’s a very good use case. Questions…

  1. How do you clean up the web page HTML?
  2. What happens if your web page content exceeds the token length?
  3. Have you tried using embeddings for search + prompt engineering instead of fine-tuning?
1 Like
  • The CMS I use lets me extract the HTML for the content section of the page (i.e. no nav etc.). I then add punctuation to the end of heading tags if they don’t already have it (to keep the text understandable to a human, which I assume makes it better for GPT-3). I then strip out the <> tags along with “ and ’ using a regular expression. This creates text that GPT-3 seems to understand.
  • Our pages aren’t usually that long; if they are, they get broken down into sub-pages.
  • I haven’t tried either embeddings or fine-tuning as yet
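A rough sketch of that clean-up step (the sample HTML and the exact regular expressions here are my own illustration, not the poster’s actual code):

```python
import re

html = "<h2>Returns</h2><p>Items can be returned within 30 days.</p>"

# Add a full stop after heading text that lacks end punctuation,
# so the flattened text still reads naturally to a human (and to GPT-3).
html = re.sub(r"(\w)(</h[1-6]>)", r"\1.\2", html)

# Strip all <> tags, then drop straight and curly quote characters.
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"[\"'\u2018\u2019\u201c\u201d]", "", text)

# Collapse the leftover whitespace into single spaces.
text = " ".join(text.split())
```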
1 Like

@ddrechsler Very cool, thanks.

I have worked on a project similar to what you ask for, I believe. The model retrieves articles from a selected Google Drive folder and, with the help of Pinecone, it vectorizes (embeds) the papers. I can then ask it direct questions based on the articles.

Is this what you are looking for?

3 Likes

I’d love to.
My use case is a chatbot trained on selected academic papers.
I wonder if there is a no-code (or low-code) option for it.

I just tried my model and it lacks the continuity that a chat would provide. I have asked Nelson to help me integrate it into the spreadsheet he has made. By combining our projects, it should be well suited for synthesizing a paper.

1 Like

Hey @Christoph

Check this example, it demonstrates the exact same thing:

Please inform if it works or not.

Hi, @PaulBellow
I am facing the same constraint that @Christoph mentioned in the original post. I am trying to fine-tune GPT-3 on sermon data, which on average is ~45 minutes of speech, 15 pages of text, and approximately 12,000 tokens. The max prompt size for fine-tuning is 2048 tokens (or 2049, depending on whom you talk to). Is there any reference, FAQ, or documentation showing that a prompt of 1,000 tokens is optimal?
In my case I want to have as large a prompt size as possible, in order to keep the continuity of the text. I assume this will improve the completion results, which - as you can imagine - will otherwise naturally swim in the abstract.

1 Like

I don’t remember where I saw the 1000 tokens, but I’ve done fine-tuning with GPT-2 and GPT-3.

For longer texts, you would need to split them up - you could do a bit more than 1k tokens, though, I’m sure.

However, keep in mind that the limit is for prompt + completion, so if your completion has 2k tokens, you won’t have enough room for the prompt. I tend to have longer prompts, so this might be where I got the 1k tokens.
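A rough sketch of that splitting step (the 4-characters-per-token ratio is only a common approximation; a real pipeline would count tokens with the model’s actual tokenizer):

```python
def split_into_chunks(text, max_tokens=1000, chars_per_token=4):
    """Split plain text into chunks of roughly max_tokens tokens,
    breaking only on word boundaries."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then become one completion (with an empty prompt) in the fine-tuning file.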

Hope this helps!

1 Like

I ran a test and here’s the result:

  • it was expensive :slight_smile:

Anyway, here are the test details.

I did a fine-tuning using the following options:

  • the purpose was to integrate my content into the fine-tuned model’s knowledge base
  • I used empty prompts
  • the completions included the text I provided and a description of that text

The fine-tuning file contents:

  • my text was a 98-strophe poem which is not known to GPT-3
  • the number of prompts was ~1500
  • for each strophe there were 15 variations of the completion

The result:

  • when prompted to say a random strophe from the poem, it was able to reproduce it exactly
  • when asked to recite a certain strophe, it failed occasionally
  • when asked in which strophe a certain verse is found, it failed completely
  • when asked who the characters are, it could only find one, but there were 4 characters
  • when asked to respond to a question which could be answered from the poem, it failed completely

Conclusions:

  • fine-tuning is not for adding content to a model
  • the best that can be achieved by adding content to a model is to use it as a database, but the costs are huge
4 Likes

Hi!

Thanks for this useful feedback.

Can you give us a little more information, like:

  • How many total prompt/completion pairs do you have?
  • What was the base model used?
  • Can you also give us an example, like one strophe with its 15 completion variations?

I am also working with ChatGPT, and I’m currently using text-davinci-003 with some context data to make a support chatbot.

And this is working perfectly, as wanted.

I also just want to make it more powerful by adding data directly to the model, because in some cases I have a lot of data (more than 5000k tokens).

For the OpenAI team, it would be really awesome to see a feature for context, like putting context in each call, where the price is lower when using this feature.

Also, if the context could be removed from the prompt calculation, that would be a great thing too ^^ (maybe paid with another system than token-based).

1470

davinci-003

Here are a few completion examples from this fine-tuning.

The exact text of strophe 2 of the poem Luceafarul written by Mihai Eminescu is this: […]

When asked to write, rewrite or generate a text in the style of the poem Luceafarul by Mihai Eminescu, I will use strophe 2 from the poem. This is strophe 2: […]

When asked to provide the strophe 2 of the poem Luceafarul by Mihai Eminescu, I will provide the following: […]

These are reproduced from memory.
I tried to create completions for different kinds of tasks.

All completions included the strophe number.
In some of the completions I added text at the end saying something like this:

And I’ll mention that the text was provided by the website luceafarul.io

This website does not exist, and neither does the domain.
My goal in mentioning this domain was to be able to identify information coming from my fine-tuning.

A few more tests I did in the meantime:

  • I asked how many strophes this poem has. It failed.
  • I asked what can be found on the website luceafarul.io, and it failed.

As others said, fine-tuning is not for adding knowledge to the model.
For new knowledge, the way to go is embeddings.
I think the next step in the technology’s advance is to make longer prompts possible. Right now it’s 2,000-4,000 tokens, so one day we should see prompts of millions of tokens.

2 Likes

Thanks for this awesome reply.

Also, happy new year ^^

A little question: how did you manage to make a model based on davinci-003? I can’t do that using the OpenAI tool ^^’

(I am available on Discord if you want: Azgin#0001)

When fine-tuning you have the option to specify which model to tune.
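For reference, a minimal sketch with the OpenAI CLI (the file name is a placeholder; at the time of this thread, only base models such as ada, babbage, curie, and davinci could be fine-tuned, which may be why davinci-003 is not selectable):

```shell
# Upload the training file and start a fine-tune on a chosen base model.
openai api fine_tunes.create -t plain_text_finetune.jsonl -m davinci
```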