Train (fine-tune) a model with text from books or articles

You can’t specify the 003. It always uses the base models (ada, davinci, etc.)

So you specify “davinci” without the 003 etc

You can also use models you have trained previously by using their full name to fine-tune them further (you will get a new name when you do this).

I was speaking from memory, sorry if I created confusion.

Yes, it is also my understanding that using embeddings is the way to go. I’ve followed the cookbook example and have created embeddings, but I’m stuck on what to do next. Can it all be done with OpenAI, or are there third-party solutions that can also be used to interpret the embeddings? I’m a computer engineer but not a data scientist, so I would appreciate any assistance anybody can provide. I would also be interested in seeing the video and providing feedback. Thanks!

Hey, what would be recommended if the plain text I want to use to train the model exceeds 100,000 tokens (more than the limit allowed in embeddings and fine-tuning)? How should I pass it to the model so that it can use it as context?

You need to break the text down into manageable blocks that are smaller than the token limit (preferably half the token limit for Davinci or less, if you want to ask a final question once you get the context from your embedding).

So you might want to break it into paragraphs, or groups of paragraphs.

This doesn’t work so well when one paragraph refers to other clauses in a legal document (because you need both bits to make a complete context)
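As a rough illustration of the paragraph-grouping idea, here is a minimal sketch. It assumes paragraphs are separated by blank lines and uses a crude "about 4 characters per token" estimate instead of a real tokenizer; `estimate_tokens` and `chunk_paragraphs` are hypothetical helper names, not part of any OpenAI library.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: ~4 characters per token for English)."""
    return max(1, len(text) // 4)


def chunk_paragraphs(text: str, max_tokens: int) -> list[str]:
    """Group consecutive paragraphs into chunks of roughly max_tokens or less.

    A single paragraph that is larger than max_tokens is kept whole as its
    own (oversized) chunk; splitting it further is left to the caller.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = estimate_tokens(para)
        # Close the current chunk if this paragraph would push it over budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for i, chunk in enumerate(chunk_paragraphs(sample, max_tokens=8)):
        print(i, repr(chunk))
```

For accurate counts you would swap the estimate for a real tokenizer; the grouping logic stays the same.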


Thanks for replying! Please imagine this is the plain text I am going to use as context:

Person #1
Name: John
Age: 64

Person #2
Name: Jake
Age: 26

… Up until person #100,000

What I am trying to create is a chatbot that can reply “26” when I send “What is the age of Jake?” as a prompt.

How would you recommend I provide the information for all 100,000 people as context in this use case? Do you think that breaking the text into small embeddings can work for this?


There are two ways to do this

First : break the data so that one person is one piece of text

Then create an embedding vector for each person’s text, and run the queries against the vectors to find the likely record. Once you have it, send a prompt to GPT that says something like “Based on the following context …\n\nContext\n\n\nQ: What is the age of Jake?\nA:”

That’s not the best wording, but you should get the idea
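To make the first approach concrete, here is a minimal sketch of the two text-handling steps: splitting the data into one record per person and wrapping the retrieved record in a prompt. It assumes records are separated by blank lines, as in the example data above. The function names and the exact prompt wording are illustrative assumptions, and the embedding search that picks the record is not shown.

```python
import re

SAMPLE = """Person #1
Name: John
Age: 64

Person #2
Name: Jake
Age: 26
"""


def split_records(text: str) -> list[str]:
    """Split the text on blank lines so that one person is one piece of text."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


def build_prompt(context: str, question: str) -> str:
    """Wrap the retrieved record and the user's question in a completion prompt.

    The wording here is a placeholder; tune it for your own use case.
    """
    return (
        "Based on the following context, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Q: {question}\nA:"
    )


if __name__ == "__main__":
    records = split_records(SAMPLE)
    # In a real system the record would be chosen by embedding similarity;
    # here the second record is picked by hand for illustration.
    print(build_prompt(records[1], "What is the age of Jake?"))
```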

However, depending on the range of questions you can ask, you may be better off doing entity extraction. I assume Jake, and John will be unique names throughout the data?

The other way you could do it is to train the model on entity extraction.

I won’t go into the entity extraction method because I think you may have supplied a simplified description for each person in your example (i.e. there is more info for each of them). I also think that the questions may be more far-reaching based on the data you have. Let me know if this is not the case.


That sounds good! I’ll give them both a try to see which works the best for my case. Thank you Raymond.

Once you have the embeddings, you can embed the user’s query too, and use cosine similarity (not provided by OpenAI) to find the top n embeddings from your data. Your prompt, which you’ll send to text-davinci-003 (or a smaller completions endpoint if you want), will comprise

  1. the text associated with those top n embeddings, plus
  2. the query, plus
  3. any instructions you want to include such as “please provide a brief answer to the query using the given paragraphs.”
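Since cosine similarity has to be computed on your side, here is a minimal pure-Python sketch of ranking stored vectors against a query vector. The tiny 2-dimensional vectors are toy data for illustration only (real embeddings have many more dimensions), `top_n` is a hypothetical helper name, and the code assumes no vector is all zeros.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec: list[float], doc_vecs: list[list[float]], n: int) -> list[int]:
    """Return the indices of the n stored vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:n]]


if __name__ == "__main__":
    # Toy vectors standing in for the embeddings of three text chunks.
    docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    query = [1.0, 0.0]
    print(top_n(query, docs, n=2))
```

The returned indices are what you use to look up the original text for each chunk when assembling the prompt.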

The overall token limit is 4097 for the prompt + completion, so if you expect your completions to be approximately, say, 500 tokens, then your prompt must be less than approximately 3500 tokens. If you add up

  1. the tokens needed for a good completion for your use case (x), plus
  2. the instructions you provide (y), plus
  3. the expected tokens in a user query (z),

you can figure out how many tokens you have left for n, like this:

available tokens for n = 4097 - (x + y + z).

If each of the text chunks you embedded is about 500 tokens, you’ll be able to fit approximately the top 7 chunks’ worth of text into your prompt.
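The budget arithmetic above can be written as a small helper. This is only a sketch: the function name is hypothetical, the limit is passed in rather than hard-coded so you can use whatever your model documents, and the numbers in the demo are round illustrative values.

```python
def chunks_that_fit(
    context_limit: int,
    completion_tokens: int,
    instruction_tokens: int,
    query_tokens: int,
    chunk_tokens: int,
) -> int:
    """How many retrieved chunks of chunk_tokens each fit in the prompt.

    available = context_limit - (x + y + z), where x is the expected
    completion, y the instructions, and z the user query.
    """
    available = context_limit - (completion_tokens + instruction_tokens + query_tokens)
    # Never return a negative count if the fixed parts already exceed the limit.
    return max(0, available // chunk_tokens)


if __name__ == "__main__":
    n = chunks_that_fit(
        context_limit=4000,      # illustrative; substitute your model's limit
        completion_tokens=500,   # x
        instruction_tokens=50,   # y
        query_tokens=50,         # z
        chunk_tokens=500,
    )
    print(n)
```

Note that once the instructions and the query are counted, slightly fewer chunks fit than the rough estimate suggests.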

I hope this is helpful, Leslie


I would like to see that video please Raymond

@richmandan the videos are done and can be found here


The approach Ray recommended works best. You need to store embeddings for the book and website data you have. It is worthwhile to preprocess the data so that the results are better: keep the data together in a .txt file or similar, and remove headers, footers, and any other irrelevant symbols.
The data could then be used to train the model further. The output should preferably be stored in the cloud so that it can be retrieved from any device. The current model benefits greatly from reinforcement, so sending a series of prompts and giving feedback on their accuracy will only improve the model’s performance. In addition, you could limit the number of output tokens the model produces.

Does GPT have the capability to generate prompts? How can we do this?

You ask it to generate questions for the text you want and also provide the answers.
It is not built into the API.


Ah! This is clever. What approach did you follow for this? Do you just give a prompt saying “state questions from the following paragraph”, then push all the prompts through in a consolidated way and take the output in a CSV or similar?

@mouli That is correct

If you are lucky, you can also use Q&A sessions with Davinci’s wide knowledge to create training sets for lower models. This works best if the area of knowledge you require is limited. It is another way to avoid using embeddings.


  1. Supply a block of text
  2. Ask Davinci to come up with 5 or 10 questions someone may ask about the text
  3. Feed the questions back to GPT and get Davinci to answer them
  4. Take the questions and answers to create a fine-tune file for Ada, Babbage, or Curie

Of course you can also use the fine-tuning on Davinci as well. But it seems to me that it may be sort of redundant because Davinci already knew the answers, and it all came down to how you phrased the question and the context of the original text in step 1 - I’m still in two minds about it and could be convinced either way
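For step 4, here is a minimal sketch of turning question/answer pairs into a JSONL fine-tune file with one `{"prompt": ..., "completion": ...}` object per line. The separator at the end of each prompt, the leading space on the completion, and the stop marker are conventions I am assuming here, so check them against the current fine-tuning guide. The function name and the sample pair are made up for illustration.

```python
import json


def to_finetune_jsonl(
    pairs: list[tuple[str, str]],
    separator: str = "\n\n###\n\n",  # assumed marker for "end of prompt"
    stop: str = " END",               # assumed marker for "end of completion"
) -> str:
    """Serialise (question, answer) pairs as JSONL prompt/completion records."""
    lines = []
    for question, answer in pairs:
        record = {
            "prompt": question.strip() + separator,
            # Completions conventionally start with a space.
            "completion": " " + answer.strip() + stop,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical pair, standing in for Davinci's generated Q&A output.
    demo = [("What is the age of Jake?", "26")]
    with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
        f.write(to_finetune_jsonl(demo))
```

If you use a stop marker like this, remember to pass the same marker as the stop sequence when you query the fine-tuned model.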


Thank you! Looking forward to taking the full course!


I asked ChatGPT to generate Python code for using OpenAI’s API to do just that: upload large text files to fine-tune a GPT model.
I think the response was good… I am a software developer, but I don’t know much about ML or Python.

Hello Raymond,

Thank you for all the knowledge you shared. I have read your embedding course attentively, but I can’t find the answers to my questions. I’m sure you have them, though :)

The course is all about text comparison, proximity… but never about “text generation” based on specific corpus.

I use ChatGPT to ask questions or to summarize articles and book abstracts that I paste directly into the prompt, saying “Please summarize this:” or “according to this text” plus my question. It works pretty well. But how do I do this for 10 to 50k words of context?

My aim is to prepare multiple datasets about specific subjects (10 to 50k words) and then use GPT for text generation/summarization. Is there a way to do that ?

Thank you very much for your time and interest!

@Pierre-Jean You are correct, my examples are about search, and not summarization

David Shapiro has a really good video about summarizing big documents

Video is by @daveshapautomator