Yes, it is also my understanding that using embeddings is the way to go. I’ve followed the cookbook example and have created embeddings, but I’m stuck on what to do next. Can it all be done with OpenAI, or are there third-party solutions that can also be used to interpret the embeddings? I’m a computer engineer but not a data scientist, so I would appreciate any assistance anybody can provide. I would also be interested in seeing the video and providing feedback. Thanks!
Hey, what would be recommended if the plain text I want to use to train the model exceeds 100,000 tokens (more than the limit allowed in embeddings and fine-tuning)? How should I pass it to the model so that it can use it as context?
You need to break the text down into manageable blocks that are smaller than the token limit (preferably half the token limit or less for Davinci, if you want to ask a final question once you get the context from your embedding).
So you might want to break it into paragraphs, or groups of paragraphs.
This doesn’t work so well when one paragraph refers to other clauses in a legal document (because you need both bits to make a complete context).
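To make the chunking idea concrete, here is a minimal sketch that groups consecutive paragraphs into blocks under a token budget. The `approx_tokens` helper is my own stand-in (roughly 4 characters per token), not a real tokenizer, and the function names are hypothetical. Swap in a proper tokenizer if you need exact counts.

```python
def approx_tokens(text):
    """Rough token estimate: about 4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)


def chunk_paragraphs(text, max_tokens):
    """Group consecutive paragraphs into chunks of roughly at most max_tokens.

    A single paragraph larger than max_tokens is kept whole as its own
    oversized chunk; splitting it further is left out of this sketch.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n = approx_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Keeping whole paragraphs together helps a little with the cross-reference problem, but it does not solve it: a clause that points to another clause can still end up in a different chunk.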
Thanks for replying! Please imagine this is the plain text I am going to use as context:
Person #1
Name: John
Age: 64
Person #2
Name: Jake
Age: 26
… Up until person #100,000
What I am trying to create is a chatbot that can reply “26” when I send “What is the age of Jake?” as a prompt.
How would you recommend I provide the information for all 100,000 people as context in this use case? Do you think that breaking the text into small embeddings can work for this?
First: break the data up so that one person is one piece of text.
Then create an embedding vector for each person’s text, and run the queries against the vectors to find the likely record. Once you have it, send a prompt to GPT that says something like “Based on the following context …\n\nContext\n\n\nQ: What is the age of Jake?\nA:”
That’s not the best wording, but you should get the idea.
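As a rough sketch of those two steps (one record per person, then a prompt wrapped around the record that was found), something like this could work. It assumes every record starts with a line of the form `Person #N`, as in the example above. The function names and the exact instruction wording in the prompt are my own placeholders, not a recommended template.

```python
import re


def split_people(text):
    """Split the plain text into one record per person, keyed on 'Person #N' lines."""
    # The lookahead keeps each 'Person #N' header attached to its own record.
    parts = re.split(r"^(?=Person #\d+)", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


def build_prompt(context, question):
    """Wrap the retrieved record and the user's question in a Q/A style prompt."""
    return (
        "Based on the following context, answer the question.\n\n"  # placeholder wording
        f"Context:\n{context}\n\n"
        f"Q: {question}\nA:"
    )


sample = "Person #1\nName: John\nAge: 64\nPerson #2\nName: Jake\nAge: 26\n"
records = split_people(sample)
prompt = build_prompt(records[1], "What is the age of Jake?")
```

Each entry in `records` is what you would embed. Here `records[1]` stands in for whichever record the vector search returns for the question.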
However, depending on the range of questions you can ask, you may be better off doing entity extraction. I assume Jake and John will be unique names throughout the data?
The other way you could do it is to train the model on entity extraction.
I won’t go into the entity extraction method because I think you may have supplied a simplified description for each person in your example (i.e., there is more info for each of them). I also think that the questions may be more far-reaching based on the data you have. Let me know if this is not the case.
Once you have the embeddings, you can embed the user’s query too, and use cosine similarity (not provided by OpenAI) to find the top n embeddings from your data. Your prompt, which you’ll send to text-davinci-003 (or a smaller completions endpoint if you want), will comprise
the text associated with those top n embeddings, plus
the query, plus
any instructions you want to include such as “please provide a brief answer to the query using the given paragraphs.”
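Since cosine similarity is something you have to compute yourself, here is a minimal pure-Python sketch of the "find the top n" step. The vectors below are toy values, not real embeddings, and `top_n` is a hypothetical helper name. In practice you would probably use a numerical library or a vector database for large collections.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, doc_vecs, n):
    """Return the indices of the n stored vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:n]]


# Toy 2-dimensional vectors standing in for real embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_n([1.0, 0.1], doc_vecs, 2)
```

The returned indices point back into your list of text chunks, so you can look up the text that goes into the prompt.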
The overall token limit is 4098 for the prompt + completion, so if you expect your completions to be approximately, say, 500 tokens, then your prompt must be less than approximately 3500 tokens. If you add up
the tokens needed for a good completion for your use case (x), plus
the instructions you provide (y), plus
the expected tokens in a user query (z),
you can figure out how many tokens you have left for n, like this:
available tokens for n = 4098 - (x + y + z).
If the text behind each of your embeddings is about 500 tokens, you’ll be able to fit approximately the top 7 embeddings’ worth of text into your prompt.
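The budget arithmetic above is easy to put into a small helper. This is only a sketch: the 4098 limit and the 500-token figures come from this post, while the 50 and 48 used for the instructions and the query are made-up values chosen so the prompt budget comes out at 3500.

```python
def tokens_left_for_context(limit, completion, instructions, query):
    """available tokens for n = limit - (x + y + z)"""
    return limit - (completion + instructions + query)


def max_chunks(limit, completion, instructions, query, chunk_size):
    """How many retrieved chunks of chunk_size tokens fit in the remaining budget."""
    left = tokens_left_for_context(limit, completion, instructions, query)
    return max(0, left // chunk_size)


# x = 500 (completion), y = 50 and z = 48 are hypothetical values.
left = tokens_left_for_context(4098, 500, 50, 48)
n = max_chunks(4098, 500, 50, 48, 500)
```

With these numbers `left` is 3500 and `n` is 7, which matches the estimate above.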
The approach Ray recommended works best. You need to store word embeddings for the book and website data you have. It is worthwhile to preprocess the data so that the learning is better: keep the data together in a .txt file or similar, and remove headers, footers, and any other irrelevant symbols.
The data could then be used to train the model further. The output should preferably be stored in the cloud so that it can be retrieved from any device. The current model has a huge advantage with reinforcement, so pushing a series of prompts and giving feedback on their accuracy will only improve the model’s performance. In addition, you could limit the number of output tokens the model can spit out.
Ah! This is clever. What approach did you follow for this? Is it just that you give a prompt saying “state questions from the following paragraph”, push all the prompts in a consolidated way, and take the output in a CSV or so…
If you are lucky, you can also use Q&A sessions with Davinci’s wide knowledge to create training sets for lower models. This works best if the area of knowledge you require is limited. It is another way to avoid using embeddings.
i.e.:
You supply a block of text
Ask Davinci to come up with 5 or 10 questions someone may ask about the text
Feed the questions back to GPT and get Davinci to answer them
Take the questions and answers to create a fine-tune file for Ada, Babbage, or Curie
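For the last step, here is a minimal sketch of turning question/answer pairs into a JSONL fine-tune file. It assumes the one-JSON-object-per-line layout with `prompt` and `completion` keys that the fine-tuning endpoint for these models expects, as far as I understand it. The helper name and the sample pair are made up, and the steps that actually call the API to generate the questions and answers are left out.

```python
import json


def to_finetune_jsonl(qa_pairs):
    """Serialize (question, answer) pairs as JSONL, one prompt/completion object per line."""
    lines = []
    for question, answer in qa_pairs:
        record = {"prompt": question, "completion": answer}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical pair; in practice these come from the Davinci Q&A steps above.
pairs = [("What is the age of Jake?", "26")]
jsonl_text = to_finetune_jsonl(pairs)
```

You would then write `jsonl_text` to a file and check the current fine-tuning guide for any required separators or stop sequences before uploading.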
Of course, you can also use fine-tuning on Davinci as well. But it seems to me that it may be sort of redundant, because Davinci already knew the answers, and it all came down to how you phrased the question and the context of the original text in step 1. I’m still in two minds about it and could be convinced either way.
I asked ChatGPT to generate Python code for using OpenAI’s API to do just that: upload large text files to fine-tune a GPT model.
I think the response was good… I am a software developer but I don’t know much about ML or Python.
Thank you for all the knowledge you shared. I have read your embedding course attentively, but I can’t find my answers in it. I’m sure you’ve got them, though.
The course is all about text comparison, proximity… but never about “text generation” based on a specific corpus.
I use ChatGPT to ask questions about, or summarize, articles and book abstracts that I push directly into the prompt, saying “Please summarize this:” or “according to this text + my question”. It works pretty well. But how can I do this for 10 to 50k words of context?
My aim is to prepare multiple datasets about specific subjects (10 to 50k words each) and then use GPT for text generation/summarization. Is there a way to do that?