Train (fine-tune) a model with text from books or articles

Once you have the embeddings, you can embed the user’s query too, and use cosine similarity (which you compute yourself; it is not provided by OpenAI) to find the top n embeddings from your data. Your prompt, which you’ll send to text-davinci-003 (or a smaller completions model if you want), will comprise

  1. the text associated with those top n embeddings, plus
  2. the query, plus
  3. any instructions you want to include such as “please provide a brief answer to the query using the given paragraphs.”
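The retrieval step described above can be sketched in Python. The function names and toy vectors here are illustrative assumptions; in practice the embeddings would come from OpenAI’s embeddings endpoint:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def top_n_matches(query_embedding, doc_embeddings, n=3):
    # Score every stored embedding against the query and return the
    # indices of the n highest-scoring documents, best match first
    scores = [cosine_similarity(query_embedding, e) for e in doc_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
```

You would then look up the original text for each returned index and paste it into the prompt ahead of the query and instructions.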

The overall token limit for text-davinci-003 is 4097 for the prompt + completion, so if you expect your completions to be approximately, say, 500 tokens, then your prompt must be less than approximately 3,500 tokens. If you add up

  1. the tokens needed for a good completion for your use case (x), plus
  2. the instructions you provide (y), plus
  3. the expected tokens in a user query (z),

you can figure out how many tokens you have left for n, like this:

available tokens for n = 4097 - (x + y + z).

If each of your embeddings is about 500 tokens, you’ll be able to fit approximately the top 7 embeddings worth of text into your prompt.
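The budget arithmetic above can be sketched as a small helper. The 50-token figures used for the instructions and query are illustrative assumptions, not values from the thread:

```python
def available_embedding_tokens(limit=4097, completion=500, instructions=50, query=50):
    # Tokens left for retrieved text after reserving space for the
    # expected completion (x), the instructions (y), and the query (z)
    return limit - (completion + instructions + query)

budget = available_embedding_tokens()  # 3497 tokens to spend on retrieved text
chunks_that_fit = budget // 500        # 6 full ~500-token chunks (close to the ~7 estimate)
```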

I hope this is helpful, Leslie

3 Likes

I would like to see that video please Raymond

1 Like

@richmandan the videos are done and can be found here

7 Likes

The approach Ray recommended works best. You need to store embeddings for the book and website data you have. It is worthwhile to preprocess the data so that the results are better: keep related data together in a .txt file, and remove headers, footers, and any other irrelevant symbols.
That data can then be used to train the model further. The output should preferably be stored in the cloud so that it can be retrieved from any device. The current model benefits greatly from reinforcement, so pushing a series of prompts and giving feedback on their accuracy will only improve the model’s performance. In addition, you can limit the number of output tokens the model can produce.
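A minimal sketch of the preprocessing step mentioned above. The regex heuristics for page numbers and stray symbols are assumptions; real documents will need tuning:

```python
import re

def clean_page(text):
    cleaned = []
    for line in text.splitlines():
        # Drop lines that are only a page number (a common header/footer artifact)
        if re.fullmatch(r"\s*\d+\s*", line):
            continue
        # Strip decorative symbols while keeping normal punctuation
        line = re.sub(r"[^\w\s.,;:!?'\"()-]", "", line)
        cleaned.append(line)
    # Collapse runs of blank lines into single paragraph breaks
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned)).strip()
```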

1 Like

Does GPT have the capability to generate prompts? How can we do this?

1 Like

You ask it to generate questions for the text you want, and also provide the answers.
It is not built into the API.

2 Likes

Ah! This is clever. What approach did you follow for this? Do you just give a prompt saying “state questions from the following paragraph”, then push all the prompts in a consolidated way and capture the output in a CSV or similar?

1 Like

@mouli That is correct

If you are lucky, you can also use Q&A sessions with Davinci’s wide knowledge to create training sets for lower models. This works best if the area of knowledge you require is limited. It is another way to avoid using embeddings.

ie:

  1. you supply a block of text
  2. Ask Davinci to come up with 5 or 10 questions someone may ask about the text
  3. Feed the questions back to GPT and get Davinci to answer the question
  4. Take the questions and answers to create a fine-tune file for ADA, Babbage, Curie
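Step 4 could look roughly like this, assuming you have already collected (question, answer) pairs from steps 2–3. The `\n\n###\n\n` separator and ` END` stop sequence follow OpenAI’s legacy fine-tuning guidance for completions models; the file name is arbitrary:

```python
import json

def build_finetune_file(qa_pairs, path="training.jsonl"):
    # Legacy completions fine-tune format: one JSON object per line,
    # each with a "prompt" and a "completion" key
    with open(path, "w") as f:
        for question, answer in qa_pairs:
            record = {
                "prompt": question.strip() + "\n\n###\n\n",
                "completion": " " + answer.strip() + " END",
            }
            f.write(json.dumps(record) + "\n")

build_finetune_file([
    ("What is the token limit?", "About 4097 tokens for prompt plus completion."),
])
```

The resulting JSONL file is what you would upload when fine-tuning ADA, Babbage, or Curie.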

Of course, you can use fine-tuning on Davinci as well. But it seems to me that it may be somewhat redundant, because Davinci already knew the answers; it all came down to how you phrased the question and the context of the original text in step 1. I’m still in two minds about it and could be convinced either way.

4 Likes

Thank you! Looking forward to taking the full course!

2 Likes

I asked ChatGPT to generate Python code for using OpenAI’s API to do just that: upload large text files to fine-tune a GPT model.
I think the response was good… I am a software developer, but I don’t know much about ML or Python.

1 Like

Hello Raymond,

Thank you for all the knowledge you shared. I have read your embedding course with attention, but I can’t find the answers to my questions. I’m sure you have them, though :slight_smile:

The course is all about text comparison and proximity, but never about “text generation” based on a specific corpus.

I use ChatGPT to ask questions or summarize article/book abstracts that I push directly into the prompt, saying “Please summarize this:” or “according to this text” plus my question, and it works pretty well. But how do I do this for 10 to 50k words of context?

My aim is to prepare multiple datasets about specific subjects (10 to 50k words) and then use GPT for text generation/summarization. Is there a way to do that ?

Thank you very much for your time and interest!

1 Like

@Pierre-Jean You are correct, my examples are about search, and not summarization

David Shapiro has a really good video about summarizing big documents

Video is by @daveshapautomator
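For readers who want the gist before watching: a common approach to summarizing big documents (not necessarily the exact method in the video) is to split the text into chunks that fit the prompt, summarize each chunk, and recurse on the joined summaries until one summary remains. A sketch with a pluggable `summarize_chunk` callable, which in practice would be a completions-API call:

```python
def chunk_text(text, max_words=1500):
    # Split a long document into word-bounded chunks that fit in a prompt
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_document(text, summarize_chunk, max_words=1500):
    # summarize_chunk: any callable mapping a text chunk to a short summary.
    # Summarize each chunk, then re-summarize the concatenated summaries
    # until the whole thing fits in a single chunk.
    while len(text.split()) > max_words:
        summaries = [summarize_chunk(c) for c in chunk_text(text, max_words)]
        text = "\n".join(summaries)
    return summarize_chunk(text)
```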

2 Likes

Hi Leslie,

Yes that helps. Will implement and follow up if I have any additional questions.

Thanks!

1 Like

Hi Ray. It is generating very few questions, and the typical user questions are not covered at all. Is there any way we could train the model? I wanted to train the model by taking consecutive sentences as questions and answers; however, my colleague insists it is a futile exercise, because user queries will not match the sentences in the text, so the similarity will be minimal.
Web page data can be segregated into headings and context, which is not the case with the PDF documents we push: the format is not standard, and the processed text has nothing that is really a heading. What do you suggest?

1 Like

You may find that you are better off looking at embedding instead of fine-tuning

Check your private chat section for more info

3 Likes

I hear there is a really good course on the topic on Udemy :slight_smile:

2 Likes

Can somebody please post the link to the Udemy course or send it to me via PM?
Regards
Chris

1 Like

Also check your private message for something a little bit extra

3 Likes

The embedding videos are all free on that course link. They are set up as “preview” videos, but they have the full content.

3 Likes

Dear Raymond, thank you very much for the support extended. It has been a nice learning experience e-interacting with you.
thanks, Mouli

2 Likes