Once you have the embeddings, you can embed the user's query too, and use cosine similarity (which you compute yourself; the OpenAI API doesn't provide it) to find the top n closest embeddings in your data. Your prompt, which you'll send to text-davinci-003 (or a smaller model on the completions endpoint if you prefer), will comprise the items below (a short retrieval sketch follows the list):
- the text associated with those top n embeddings, plus
- the query, plus
- any instructions you want to include, such as "please provide a brief answer to the query using the given paragraphs."
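For illustration, here is a minimal sketch of that retrieval step, assuming the pre-v1.0 `openai` Python library and the `text-embedding-ada-002` embedding model; `docs` is a hypothetical list of `(text, embedding)` pairs you prepared ahead of time, and the API key is a placeholder:

```python
# Minimal retrieval sketch (assumes the pre-v1.0 `openai` Python library and
# that `docs` holds (text, embedding) pairs you computed earlier).
import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_prompt(query, docs, n=3):
    # Embed the user's query with the same model used for the documents.
    query_emb = openai.Embedding.create(
        model="text-embedding-ada-002", input=query
    )["data"][0]["embedding"]

    # Rank stored chunks by cosine similarity and keep the top n.
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_emb, d[1]), reverse=True)
    context = "\n\n".join(text for text, _ in ranked[:n])

    # Assemble context + query + instructions into one prompt.
    return (
        f"{context}\n\n"
        f"Question: {query}\n"
        "Please provide a brief answer to the question using the given paragraphs."
    )

# prompt = build_prompt("What is the refund policy?", docs, n=3)
# completion = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=500)
```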
The overall token limit is 4,097 for the prompt plus the completion, so if you expect your completions to be approximately, say, 500 tokens, then your prompt must stay under roughly 3,500 tokens. If you add up:
- the tokens needed for a good completion for your use case (x), plus
- the tokens in the instructions you provide (y), plus
- the expected tokens in a user query (z),
you can figure out how many tokens you have left for the top n embeddings' text, like this:
available tokens for the top n text = 4,097 - (x + y + z).
If each of your embedded chunks is about 500 tokens, you'll be able to fit approximately the top 7 chunks' worth of text into your prompt.
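To make the arithmetic concrete, here is a rough budget check using the `tiktoken` tokenizer; the 500/50/30 figures are illustrative placeholders, not measured values:

```python
# Rough token-budget check (assumes the `tiktoken` package is installed;
# the figures below are illustrative, not measured values).
import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")

def count_tokens(text):
    return len(enc.encode(text))

completion_budget = 500   # x: tokens reserved for the answer
instruction_tokens = 50   # y: tokens in your fixed instructions
query_tokens = 30         # z: expected tokens in a user query

available_for_context = 4097 - (completion_budget + instruction_tokens + query_tokens)
chunks_that_fit = available_for_context // 500  # if each embedded chunk is ~500 tokens
print(available_for_context, chunks_that_fit)
```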
The approach Ray recommended works best. You need to store embeddings for the book and website data you have. It is worthwhile to preprocess the data so the results are better: keep the data together in a plain-text file, and remove headers, footers, and any other irrelevant symbols.
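As a rough illustration of that clean-up step (the file names, the form-feed page delimiter, and the header/footer assumption below are all just examples, not a fixed recipe):

```python
# Rough clean-up sketch: strip repeated headers/footers and stray symbols
# before chunking/embedding. File names and patterns here are illustrative.
import re
from pathlib import Path

def clean_page(page: str) -> str:
    lines = page.splitlines()
    # Assumption: the first and last lines of each page are header/footer.
    body = lines[1:-1] if len(lines) > 2 else lines
    text = "\n".join(body)
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)   # drop odd control/encoding symbols
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of whitespace
    return text.strip()

pages = Path("book_raw.txt").read_text(encoding="utf-8").split("\f")  # form-feed page breaks
cleaned = "\n\n".join(clean_page(p) for p in pages if p.strip())
Path("book_clean.txt").write_text(cleaned, encoding="utf-8")
```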
The data could then be used to train the model further. The output should preferably be stored in the cloud so it can be retrieved from any device. The current model benefits hugely from reinforcement, so pushing a series of prompts and giving feedback on their accuracy will only improve its performance. In addition, you can limit the number of output tokens the model is allowed to return.
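Capping the output is just a parameter on the completions call; a minimal sketch with the pre-v1.0 `openai` library (the prompt text is a placeholder):

```python
# Minimal sketch: cap the completion length with max_tokens
# (pre-v1.0 openai library; the prompt here is just a placeholder).
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Summarize the following text in two sentences:\n\n<your cleaned text here>",
    max_tokens=150,   # hard ceiling on how many tokens the model may return
    temperature=0.2,
)
print(response["choices"][0]["text"].strip())
```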
Ah! This is clever. What approach did you follow for this? Do you just give a prompt saying "state questions from the following paragraph", then push all the prompts in a consolidated way and collect the output in a CSV or similar?
If you are lucky, you can also use Q&A sessions with Davinci's wide knowledge to create training sets for lower models. This works best if the area of knowledge you require is limited. It is another way to avoid using embeddings.
i.e.:
1. You supply a block of text.
2. Ask Davinci to come up with 5 or 10 questions someone might ask about the text.
3. Feed the questions back and get Davinci to answer each one.
4. Take the questions and answers to create a fine-tune file for Ada, Babbage, or Curie.
Of course, you can use fine-tuning on Davinci as well, but that seems to me somewhat redundant, because Davinci already knew the answers; it all comes down to how you phrase the questions and the context of the original text in step 1. I'm still in two minds about it and could be convinced either way.
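Here is one way the four steps above could look in code. This is only a sketch with the pre-v1.0 `openai` library; the prompts, separator, stop token, and file name are illustrative, not the exact ones used above:

```python
# Sketch of the Davinci-generated Q&A -> fine-tune file idea
# (pre-v1.0 openai library; prompts and file names are illustrative).
import json
import openai

openai.api_key = "YOUR_API_KEY"

def ask_davinci(prompt, max_tokens=512):
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=max_tokens, temperature=0.3
    )
    return resp["choices"][0]["text"].strip()

source_text = "..."  # step 1: the block of text you supply

# Step 2: have Davinci propose questions about the text.
questions_raw = ask_davinci(
    f"{source_text}\n\nWrite 5 questions someone might ask about the text above, one per line."
)
questions = [q.strip("-0123456789. ").strip() for q in questions_raw.splitlines() if q.strip()]

# Step 3: have Davinci answer each question, given the same text.
pairs = []
for q in questions:
    answer = ask_davinci(f"{source_text}\n\nQuestion: {q}\nAnswer:", max_tokens=200)
    pairs.append({"prompt": f"{q}\n\n###\n\n", "completion": " " + answer + " END"})

# Step 4: write the prompt/completion pairs as a JSONL fine-tune file.
with open("qa_finetune.jsonl", "w", encoding="utf-8") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```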
I asked ChatGPT to generate Python code for using OpenAI's API to do just that: upload large text files to fine-tune a GPT model.
I think the response was good… I am a software developer, but I don't know much about ML or Python.
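For reference, a sketch of what that upload-and-fine-tune flow can look like with the legacy pre-v1.0 `openai` library (the JSONL file name and the base model are just examples):

```python
# Upload a JSONL training file and start a fine-tune with the legacy
# pre-v1.0 openai library (file name and base model are examples).
import openai

openai.api_key = "YOUR_API_KEY"

upload = openai.File.create(
    file=open("qa_finetune.jsonl", "rb"),
    purpose="fine-tune",
)

job = openai.FineTune.create(
    training_file=upload["id"],
    model="curie",
)
print(job["id"], job["status"])
```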
Thank you for all the knowledge you have shared. I have read your embeddings course carefully, but I can't find the answers I need. I'm sure you have them, though.
The course is all about text comparison and proximity, but never about text generation based on a specific corpus.
I use ChatGPT to ask questions or summarize article/book abstracts that I push directly into the prompt, saying "Please summarize this:" or "According to this text" plus my question, and it works pretty well. But how can I do this for 10 to 50k words of context?
My aim is to prepare multiple datasets about specific subjects (10 to 50k words each) and then use GPT for text generation/summarization. Is there a way to do that?
Hi Ray. It is generating very few questions, and the typical user questions are not covered at all. Is there any way we could train the model? I wanted to train it by taking consecutive sentences as question and answer pairs; however, my colleague insists this is a futile exercise, because user queries will not match the sentences in the text, so the similarity will be minimal.
Web page data can be segregated into headings and content, but that is not the case with the PDF documents we push: the format is not standard, and the processed text doesn't really have anything you could call a heading. What do you suggest?