Train (fine-tune) a model with text from books or articles

raymonddavey · January 1, 2023, 5:50pm

You can’t specify the 003. It always uses the base models (ada, davinci, etc.)

So you specify “davinci” without the 003 etc

You can also use models you have trained previously by using their full name to fine-tune them further (you will get a new name when you do this)

georgei · January 1, 2023, 6:07pm

I was writing from memory, sorry if I created confusion.

leonard.hwostow · January 3, 2023, 1:36pm

Yes, it is also my understanding that using embeddings is the way to go. I’ve followed the cookbook example and have created embeddings but I’m stuck on what to do next. Can it all be done with OpenAI or are there third-party solutions that can also be used to interpret the embeddings? I’m a computer engineer but not a data scientist, so I would appreciate any assistance anybody can provide. I would also be interested in seeing the video and providing feedback. Thanks!

pachocastillosr · January 3, 2023, 11:28pm

Hey, what would be recommended if the plain text I want to use to train the model exceeds 100,000 tokens (more than the limit allowed in embeddings and fine-tuning)? How should I pass it to the model so that it can use it as context?

raymonddavey · January 3, 2023, 11:52pm

You need to break the text down into manageable blocks that are smaller than the token limit (preferably half the token limit for Davinci or less, if you want to ask a final question once you get the context from your embedding)

So you might want to break it into paragraphs, or groups of paragraphs.

This doesn’t work so well when one paragraph refers to other clauses in a legal document (because you need both bits to make a complete context)
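A rough sketch of the paragraph-grouping idea, for illustration only. The function name `chunk_paragraphs` and the whitespace word count used in place of a real tokenizer are assumptions, not something from this thread; swap in your model's tokenizer for accurate counts.

```python
def _word_count(text):
    # Assumption: whitespace-separated words as a crude stand-in for real tokens.
    return len(text.split())


def chunk_paragraphs(text, max_tokens, count_tokens=_word_count):
    """Group consecutive paragraphs into blocks of at most max_tokens tokens.

    A single paragraph larger than max_tokens is kept whole as its own block.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    blocks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        # Start a new block when adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            blocks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        blocks.append("\n\n".join(current))
    return blocks


sample = "a b c\n\nd e\n\nf g h i"
blocks = chunk_paragraphs(sample, max_tokens=5)
```

Here the first two paragraphs fit together in one block and the third starts a new one. Keeping cross-referenced clauses together, as in the legal-document case, would need extra logic on top of this.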

pachocastillosr · January 4, 2023, 1:10am

Thanks for replying! Please imagine this is the plain text I am going to use as context:

Person #1
Name: John
Age: 64

Person #2
Name: Jake
Age: 26

… Up until person #100,000

What I am trying to create is a chatbot that can reply “26” when I send “What is the age of Jake?” as a prompt.

How would you recommend I provide the information of all the 100,000 people as context in this use case? Do you think that breaking the text into small embeddings can work for this?

Thanks

raymonddavey · January 4, 2023, 1:25am

There are two ways to do this

First : break the data so that one person is one piece of text

Then create an embedding vector for each person’s text and then run the queries against the vectors to find the likely record. Once you have it, send a prompt to GPT that says something like “Based on the following context …\n\nContext\n\n\nQ: What is the age of Jake?\nA:”

That’s not the best wording, but you should get the idea

However, depending on the range of questions you can ask, you may be better off doing entity extraction. I assume Jake, and John will be unique names throughout the data?

The other way you could do it is to train the model on entity extraction.

I won’t go into the entity extraction method because I think you may have supplied a simplified description for each person in your example (i.e. there is more info for each of them). I also think that the questions may be more far-reaching based on the data you have. Let me know if this is not the case.
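A minimal sketch of the first approach (one record per person, then a context prompt). The helper names `split_records` and `build_prompt` are hypothetical, and the prompt wording is only one possible variant of the one suggested above. The record is picked by hand here; in the real flow it would be chosen by embedding similarity.

```python
def split_records(text):
    """Split the plain text into one block per person (blank-line separated)."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def build_prompt(context, question):
    """Wrap the retrieved record and the question in a simple Q/A prompt."""
    return (
        "Based on the following context, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Q: {question}\nA:"
    )


data = "Person #1\nName: John\nAge: 64\n\nPerson #2\nName: Jake\nAge: 26"
records = split_records(data)
# Assumption: records[1] stands in for the record an embedding search would return.
prompt = build_prompt(records[1], "What is the age of Jake?")
print(prompt)
```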

pachocastillosr · January 4, 2023, 1:38am

That sounds good! I’ll give them both a try to see which works the best for my case. Thank you Raymond.

lmccallum · January 4, 2023, 3:01am

Once you have the embeddings, you can embed the user’s query too, and use cosine similarity (not provided by OpenAI) to find the top n embeddings from your data. Your prompt, which you’ll send to text-davinci-003 (or a smaller completions endpoint if you want), will comprise

the text associated with those top n embeddings, plus
the query, plus
any instructions you want to include such as “please provide a brief answer to the query using the given paragraphs.”
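Since cosine similarity has to be computed on your side, here is a small pure-Python sketch of the top-n lookup. The two-dimensional vectors are toy values made up for the example (real embeddings have many more dimensions), and `top_n` is a hypothetical helper name. It assumes no vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, doc_vecs, n):
    """Return the indices of the n document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:n]]


# Toy data: three "document" embeddings and one "query" embedding.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [0.9, 0.1]
best = top_n(query_vec, doc_vecs, 2)
```

The indices in `best` point back at the stored texts, which are what goes into the prompt. For large collections a vectorized library or a vector database would be a more practical choice than this loop.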

The overall token limit is 4097 for the prompt + completion, so if you expect your completions to be approximately, say, 500 tokens, then your prompt must be less than approximately 3500 tokens. If you add up

the tokens needed for a good completion for your use case (x), plus
the instructions you provide (y), plus
the expected tokens in a user query (z),

you can figure out how many tokens you have left for n, like this:

available tokens for n = 4097 - (x + y + z).

If each of your embeddings is about 500 tokens, you’ll be able to fit approximately the top 7 embeddings worth of text into your prompt.
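The budget arithmetic can be sketched as a small function. The name `chunks_that_fit` is hypothetical, and the instruction and query sizes (50 and 40 tokens) and the context limit passed in are assumed values for illustration; use the limit that applies to your model.

```python
def chunks_that_fit(context_limit, completion_tokens, instruction_tokens,
                    query_tokens, tokens_per_chunk):
    """How many retrieved chunks fit in the prompt after reserving x + y + z."""
    available = context_limit - (completion_tokens + instruction_tokens + query_tokens)
    # Never return a negative count if the reservations exceed the limit.
    return max(available, 0) // tokens_per_chunk


n = chunks_that_fit(
    context_limit=4097,      # assumed limit for prompt + completion
    completion_tokens=500,   # x: expected completion length
    instruction_tokens=50,   # y: assumed size of the instructions
    query_tokens=40,         # z: assumed size of the user query
    tokens_per_chunk=500,    # size of each embedded text block
)
print(n)
```

With these numbers 3507 tokens remain for context, which is room for 7 blocks of 500 tokens.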

I hope this is helpful, Leslie

richmandan · January 4, 2023, 9:07am

I would like to see that video please Raymond

raymonddavey · January 4, 2023, 9:12am

@richmandan the videos are done and can be found here

https://thoughtblogger.com/openai-embedding-tutorial/

mouli · January 4, 2023, 1:42pm

The approach Ray recommended works best. You need to store word embeddings for the book and website data you have. It is worthwhile to preprocess the data so that the learning is better: keep the data together in a .txt file or similar, and remove headers, footers, and any other irrelevant symbols.
The data could then be used to train the model further. The output should preferably be stored in the cloud so that it can be retrieved from any device. The current model has a huge advantage with reinforcement, so pushing a series of prompts and giving feedback on their accuracy will only improve the model’s performance. In addition, you could limit the number of output tokens the model can produce.

mouli · January 4, 2023, 1:43pm

Does GPT have the capability to generate prompts? How can we do this?

georgei · January 4, 2023, 2:09pm

You ask it to generate questions for the text you want and also provide the answers.
It is not built into the API.

mouli · January 4, 2023, 3:21pm

Ah! This is clever. What approach did you follow for this? Do you just give a prompt saying “state questions from the following paragraph”, then push all the prompts in a consolidated way and take the output in a CSV or similar?

raymonddavey · January 4, 2023, 5:33pm

@mouli That is correct

If you are lucky, you can also use Q&A sessions with Davinci’s wide knowledge to create training sets for lower models. This works best if the area of knowledge you require is limited. It is another way to avoid using embeddings.

ie:

1. Supply a block of text
2. Ask Davinci to come up with 5 or 10 questions someone may ask about the text
3. Feed the questions back to GPT and get Davinci to answer them
4. Take the questions and answers to create a fine-tune file for Ada, Babbage, or Curie
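The last step can be sketched as follows, assuming the fine-tune file is JSONL with one `prompt`/`completion` pair per line. The `###` separator, the leading space in the completion, and the helper name `to_finetune_jsonl` are assumptions for the example; check the fine-tuning guide for the format your model expects. The question and answer shown are placeholders, not real model output.

```python
import json


def to_finetune_jsonl(qa_pairs, separator="\n\n###\n\n"):
    """Turn (question, answer) pairs into JSONL text, one record per line."""
    lines = []
    for question, answer in qa_pairs:
        record = {
            # Assumption: a fixed separator marks the end of each prompt.
            "prompt": question.strip() + separator,
            # Assumption: completions start with a single leading space.
            "completion": " " + answer.strip(),
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


# Placeholder pair standing in for the questions and answers collected in the earlier steps.
qa_pairs = [("What is the age of Jake?", "26")]
jsonl_text = to_finetune_jsonl(qa_pairs)
print(jsonl_text)
```

Writing `jsonl_text` to a file gives something that can be uploaded as training data for the smaller models.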

Of course you can also use the fine-tuning on Davinci as well. But it seems to me that it may be sort of redundant because Davinci already knew the answers, and it all came down to how you phrased the question and the context of the original text in step 1 - I’m still in two minds about it and could be convinced either way

hyperslap · January 5, 2023, 5:48am

Thank you! Looking forward to taking the full course!

ohadsafra · January 6, 2023, 11:07am

I asked ChatGPT to generate Python code for using OpenAI’s API to do just that: upload large text files to fine-tune a GPT model.
I think the response was good… I am a software developer but I don’t know much about ML or Python.

Pierre-Jean · January 6, 2023, 2:34pm

Hello Raymond,

Thank you for all the knowledge you shared. I have read your embedding course carefully, but I can’t find my answers. I’m sure you have them, though.

The course is all about text comparison and proximity, but never about “text generation” based on a specific corpus.

I use ChatGPT to ask questions or summarize articles/book abstracts that I push directly into the prompt, saying “Please summarize this:” or “according to this text + my question”. It works pretty well. But how do I do this for 10 to 50k words of context?

My aim is to prepare multiple datasets about specific subjects (10 to 50k words) and then use GPT for text generation/summarization. Is there a way to do that?

Thank you very much for your time and interest!

raymonddavey · January 6, 2023, 4:31pm

@Pierre-Jean You are correct, my examples are about search, and not summarization

David Shapiro has a really good video about summarizing big documents

Video is by @daveshapautomator
