Train (fine-tune) a model with text from books or articles

Does GPT have the capability to generate prompts? How can we do this?

You ask it to generate questions for the text you want, and also to provide the answers.
It is not built into the API.

Ah! This is clever. What approach did you follow for this? Did you just give a prompt saying “state questions from the following paragraph”, then push all the prompts in a consolidated way and take the output as a CSV or similar…

@mouli That is correct

If you are lucky, you can also use Q&A sessions with Davinci’s wide knowledge to create training sets for lower models. This works best if the area of knowledge you require is limited. It is another way to avoid using embedding.

i.e.:

  1. You supply a block of text
  2. Ask Davinci to come up with 5 or 10 questions someone may ask about the text
  3. Feed the questions back to GPT and get Davinci to answer them
  4. Take the questions and answers to create a fine-tune file for Ada, Babbage, or Curie (a rough code sketch follows this list)
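
A minimal sketch of those four steps, assuming the legacy (pre-1.0) openai Python library. The model choice, prompt wording, and file name are illustrative, and the separator/stop-token conventions follow OpenAI’s original prompt/completion fine-tuning guide:

```python
# Sketch: use Davinci to generate Q&A pairs from a block of text, then write a
# prompt/completion JSONL file in the legacy fine-tune format.
import json

import openai  # legacy pre-1.0 library; reads OPENAI_API_KEY from the environment

source_text = "...the block of text from step 1..."

# Step 2: ask Davinci for questions about the text
resp = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Text:\n{source_text}\n\n"
           "Write 5 questions someone might ask about the text above, one per line.",
    max_tokens=256,
    temperature=0.7,
)
questions = [q.strip() for q in resp["choices"][0]["text"].splitlines() if q.strip()]

# Step 3: feed each question back and let Davinci answer it from the same text
pairs = []
for q in questions:
    answer = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Text:\n{source_text}\n\nQuestion: {q}\nAnswer:",
        max_tokens=256,
        temperature=0,
    )["choices"][0]["text"].strip()
    # separator / leading-space / stop-token conventions from the old fine-tune guide
    pairs.append({"prompt": f"{q}\n\n###\n\n", "completion": f" {answer} END"})

# Step 4: write the JSONL file to upload for an ada/babbage/curie fine-tune
with open("qa_finetune.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```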

Of course, you can also use fine-tuning on Davinci as well. But it seems to me that it may be somewhat redundant, because Davinci already knew the answers; it all came down to how you phrased the question and the context of the original text in step 1. I’m still in two minds about it and could be convinced either way.

Thank you! Looking forward to taking the full course!

I asked ChatGPT to generate Python code for using OpenAI’s API to do just that: upload large text files to fine-tune a GPT model.
I think the response was good… I am a software developer, but I don’t know much about ML or Python.

Hello Raymond,

Thank you for all the knowledge you have shared. I have read your embedding course attentively, but I can’t find the answers to my questions. I’m sure you have them, though 🙂

The course is all about text comparison and proximity… but never about “text generation” based on a specific corpus.

I use ChatGPT to ask questions about, or summarize, articles and book abstracts that I push directly into the prompt, saying “Please summarize this:” or “according to this text” + my question. And it works pretty well. But how can I do this for 10 to 50k words of context?

My aim is to prepare multiple datasets about specific subjects (10 to 50k words) and then use GPT for text generation/summarization. Is there a way to do that?

Thank you very much for your time and interest!

@Pierre-Jean You are correct: my examples are about search, not summarization.

David Shapiro (@daveshapautomator) has a really good video about summarizing big documents.
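
For what it’s worth, the usual pattern for documents this size is map-reduce style summarization: split the text into chunks that fit the prompt, summarize each chunk, then summarize the combined summaries. A sketch, assuming the legacy (pre-1.0) openai Python library (not necessarily the exact approach in the video):

```python
# Summarize a document far larger than the context window by chunking,
# summarizing each chunk, and recursing on the combined summaries.
import openai  # legacy pre-1.0 library; reads OPENAI_API_KEY from the environment

def summarize(text: str, max_tokens: int = 300) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Please summarize this:\n\n{text}\n\nSummary:",
        max_tokens=max_tokens,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

def summarize_long(text: str, chunk_chars: int = 8000) -> str:
    # naive character-based chunking; splitting on paragraphs or tokens is better
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partials = [summarize(c) for c in chunks]
    combined = "\n".join(partials)
    # recurse until the combined summaries fit in a single prompt
    if len(combined) > chunk_chars:
        return summarize_long(combined, chunk_chars)
    return summarize(combined)
```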

Hi Leslie,

Yes that helps. Will implement and follow up if I have any additional questions.

Thanks!

Hi Ray. It is generating very few questions, and the typical user questions are not covered at all. Is there any way we could train the model? I wanted to train it by taking consecutive sentences as questions & answers; however, my colleague insists it is a futile exercise, because user queries will not follow the sentences in the text, so the similarity will be minimal.
Web page data can be segregated into headings and context, which is not the case with the PDF documents we push: the format is not standard, and the processed text doesn’t really have anything you could call a heading. What do you suggest?

You may find that you are better off looking at embedding instead of fine-tuning

Check your private chat section for more info

I hear there is a really good course on the topic on Udemy 🙂

Can somebody please post the link to the Udemy course or send it to me via PM?
Regards
Chris

Also check your private message for something a little bit extra

The embedding videos are all free on that course link. They are set up as “preview” videos, but they have the full content.

Dear Raymond, thank you very much for the support extended. It has been a nice learning experience e-interacting with you.
Thanks, Mouli

I think this is my favorite article on all the community boards!

I have a related question, though.

This thread started with the recommendation for fine-tuning long text using (empty) prompts and completions of 1000 tokens. Is that still the best practice?

Along those lines, @raymonddavey has completely convinced me that embedding is the much better option for this general use case (adding additional information to GPT-3’s knowledge base). Is there a recommended optimal size for the chunks of large text that get embedded? For example, if I have a document of 50,000 tokens, is the optimal size for embedding 2000 tokens, 1000 tokens, 500 tokens, or one sentence?

The trade-offs that I see are:

  1. Embed large chunks (e.g., whole books of the Bible), which will lower the overall cosine similarity between the query and the large text.
    Cosine similarity takes the relative size of the two pieces of text into account, so (in my experiments) the cosine similarity between two pieces of text will be lower, all else being equal, if one piece of text is much smaller than the other. And in my case, I’m comparing one question embedding (e.g., “What is the purpose of prayer?”) against perhaps the entire Bible, looking for semantic matches with purposes and prayer.

  2. Break the text into chapters of a few pages, which should increase the cosine similarity, but may not be granular enough to be useful (and is still expensive if I have to send a chapter per query).
    My concern is that if I measure the cosine similarity between one sentence and one chapter, the similarity score will still be markedly lower, at least in part because of the size difference between the two texts.

In response, I would think we should embed at the sentence or verse level, but that seems expensive, and likely to lose a lot of the context.

What is the best practice for embedding size?

Thoughts?

I just rewatched @raymonddavey’s awesome videos, and I think he might have answered my question for me. In the em007 video (Intro to CSV and Semantic Search), at approximately 1:35, Raymond mentions that breaking up text into 1500-2000 words is normally a good choice. Just out of curiosity, where did that recommendation come from? What are the trade-offs compared to breaking up the text into, e.g., 500 words? (Of course, more embeddings and less context, but would that be offset by more targeted results from the semantic search?)

Hi Rex,

Thanks for the kind words about the course. 🙂

After a lot of experimentation, the best range appears to be about 350 to 500 tokens per block.

(This equates to 8.5% to 12% of the max_tokens for the model I ask the final question with - not the embedding model. If the model size increases, I would probably increase the block size by a similar factor.)

We combine paragraphs together by searching for headings and follow-on text. If we can combine two or more paragraphs into one block, we do. We always restart a block when we hit a major heading - even if the previous block was not full. (In code, the rule looks roughly like the sketch below.)
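
A simplified sketch of that rule - the paragraph tagging and helper names are illustrative, not our production code; tiktoken is used only to count tokens:

```python
# Pack paragraphs into 350-500 token blocks, restarting on major headings.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def build_blocks(paragraphs, max_block_tokens=500):
    """paragraphs: list of (is_major_heading, text) tuples in document order."""
    blocks, current, used = [], [], 0
    for is_heading, text in paragraphs:
        n = count_tokens(text)
        # restart on a major heading - even if the current block is not full -
        # or when the next paragraph would push the block past the limit
        if current and (is_heading or used + n > max_block_tokens):
            blocks.append("\n".join(current))
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        blocks.append("\n".join(current))
    return blocks
```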

By doing this, we can include between 4 and 6 contexts when we ask the final question, and still leave enough room for the completion. The blocks are normally (but not always) the top hits from a semantic search. Sometimes we can fit more or fewer - it depends on the number of tokens you decide to use to provide context. We used between 30% and 50% (purely based on a cost decision by the user, back when Davinci was still the only expensive option).

By including more contexts, we managed to get information from different parts of a single document - or (better yet) parts from multiple document sources. This really helped the AI provide a strong answer that was on topic.
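
Assembling the final prompt then looks something like this simplified sketch (cosine_similarity, pick_contexts, and the 40% budget are illustrative names and values; we varied the share between 30% and 50%):

```python
# Pick the top semantic-search hits that fit within a token budget.
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_contexts(question_embedding, blocks, model_max_tokens=4097, context_share=0.40):
    """blocks: list of dicts with 'text', 'embedding', and 'n_tokens' keys."""
    budget = int(model_max_tokens * context_share)  # tokens reserved for context
    ranked = sorted(
        blocks,
        key=lambda b: cosine_similarity(question_embedding, b["embedding"]),
        reverse=True,
    )
    chosen, used = [], 0
    for b in ranked:  # greedily take the best hits that still fit the budget
        if used + b["n_tokens"] <= budget:
            chosen.append(b)
            used += b["n_tokens"]
    return chosen  # typically 4-6 blocks of 350-500 tokens each

# The final prompt is the chosen block texts plus the question;
# what remains of the context window goes to the completion.
```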

Let me know if you need more info on what we did. Others may have done something different.

This is very helpful!
Part of the reason I ask is, as you know, I’m working on my thesis, which includes a comparison of fine-tuning versus embedding. My hope was that I could break up the text into same-sized chunks. This way I can remove (differing) sizes as a factor, and use the same set of 500-token blocks to feed to the fine-tuning, and then again directly into the embedding.

“We combine paragraphs together by searching for headings and follow-on text. If we can combine two or more paragraphs into one block, we do. We always restart a block when we hit a major heading - even if the previous block was not full.” <— This makes a lot of sense. In my case, it’s all conversational data, which has no textual, logical, or grammatical breaks, so I just fill the blocks until the next sentence won’t fit.

This stuff is pretty fun!
