I have a large set of documentation about a specific product, and I want my model to answer questions about it by fine-tuning, the same way OpenAI's Answers endpoint would.
As far as I know, the file you need to train the model is a CSV file with prompts and completions, but what I have is a large amount of text, so I want my model to read it and be able to answer all kinds of questions about it.
Great question! For fine-tuning, you should provide at least a few hundred examples, while a large piece of text is just one example if formatted as a prompt-completion pair. In other words, you’ll need to turn that text into at least a few hundred prompt-completion pairs, in JSONL format.
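As a rough sketch of what that file looks like: each line is one JSON object with a prompt and a completion (the example pairs, the `###` separator, and the `END` stop token below are just illustrative conventions, not requirements):

```python
import json

# Hypothetical pairs hand-written from the product docs; in practice you
# would author a few hundred of these yourself.
pairs = [
    {"prompt": "What ports does the device expose?\n\n###\n\n",
     "completion": " Two USB-C ports and one HDMI port. END"},
    {"prompt": "How do I reset the device?\n\n###\n\n",
     "completion": " Hold the power button for ten seconds. END"},
]

def write_jsonl(pairs, path):
    """Write prompt-completion pairs, one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")
```

The resulting file is what you upload for fine-tuning instead of a CSV.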
Alternatively, you can use the Answers endpoint to answer questions about uploaded text.
I work with a similar topic. In my case (tourism) I have a lot of hotels.
I don’t have much experience with GPT-3, but I think fine-tuning is not the right solution.
My tests show that training the model to give correct answers is not the problem; the problem is training it so that it does not give wrong answers.
GPT-3 also likes to answer questions it doesn’t know the answer to.
I think a better solution is to use “Question answering”. I would make a separate file for each product. In the file, each document should be at most 1-2 sentences, so each document is about the same size as a fine-tuning answer.
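If I read the docs right, the file you upload for the Answers endpoint is JSONL with one short document per line, something like this (the hotel lines are made-up examples, and `metadata` is optional):

```jsonl
{"text": "Hotel Aurora has 42 rooms and a rooftop pool.", "metadata": "hotel-aurora"}
{"text": "Breakfast at Hotel Aurora is served from 7 to 10 am.", "metadata": "hotel-aurora"}
```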
Finally, I repeat that I do not have much experience with this topic. With my answer I hope to encourage other users to write about their experiences, so that I can find better solutions for my own work.
I think I understand what you mean. I’m trying to implement embeddings for my problem, but I’ve got a question. Once I have located, via embeddings, the document I should look for the information in, is there a way to send that document to the completions endpoint so it can answer my question by looking at it?
What I’m doing now is using the answers endpoint, but the document I use for it is the one the embedding search returns, so I don’t spend many tokens on the search part. Does this make sense?
I am using a 2-part approach for a similar use case:
obtain search results using the embeddings endpoint
send the top n search results/documents to the completions endpoint, along with the user’s query plus instructions to answer the query based on those documents
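The two steps above can be sketched roughly like this (the ranking helpers and the prompt template are my own illustration, not official code; the embedding vectors themselves would come from the embeddings endpoint):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_n_documents(query_embedding, doc_embeddings, n=3):
    """Rank pre-computed document embeddings against the query embedding."""
    scored = [(cosine_similarity(query_embedding, emb), i)
              for i, emb in enumerate(doc_embeddings)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:n]]

def build_prompt(query, documents):
    """Assemble the completions prompt from the retrieved documents."""
    context = "\n\n".join(documents)
    return (f"Answer the question using only the documents below.\n\n"
            f"{context}\n\nQuestion: {query}\nAnswer:")
```

The string returned by `build_prompt` is what gets sent to the completions endpoint (e.g. davinci).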
One problem I was having is that each of my documents is quite long, consisting of several sentences, so I was bumping up against the 2048-token limit with only 3-5 search results in the prompt. This was not enough to generate high-quality answers.

So now I am experimenting with parsing each search result/document into individual sentences, re-ranking them, and sending the top n sentences to the completions endpoint. That way, I filter out less relevant sentences before sending the prompt. If that approach doesn’t work, I’ll experiment with some combination of whole documents and sentences, and maybe perform more than one completion at once and then combine them to generate the best-written answer to the query.

One more thing: I tried using babbage for the completions and it was not nearly good enough for my needs, so I’ll definitely be using davinci. I haven’t tried the answers endpoint and I haven’t tried fine-tuning; my gut instinct tells me those methods won’t work as well for my use case. Plus, I still don’t totally understand how the answers endpoint works, or which models/engines/endpoints allow fine-tuning and which do not. I find the documentation poor in this regard.
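The sentence-level re-ranking idea might look something like this sketch (the splitter is deliberately naive, and `embed`/`score` are placeholders for whatever you use — e.g. the embeddings endpoint and cosine similarity):

```python
import re

def split_sentences(document):
    """Naive sentence splitter; swap in a real tokenizer for messy text."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def rerank_sentences(query_embedding, sentences, embed, score, top_k=10):
    """Keep only the top_k sentences most similar to the query.

    `embed` maps a sentence to an embedding and `score` compares two
    embeddings; both are pluggable so the search backend can change."""
    scored = sorted(((score(query_embedding, embed(s)), s) for s in sentences),
                    reverse=True)
    return [s for _, s in scored[:top_k]]
```

Filtering to the top sentences before building the prompt keeps the context well under the token limit.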
I’ve got one last problem, it seems. I built a CSV file to create the embeddings (as far as I know, that’s the way of doing it). The first column is the document identifier, and the second one is the text.
I first tried with a 6-row CSV, just to see if it worked, and it did. The problem is that now, with 47 rows, I get this message while embedding:
“Rate limit reached for requests. Limit: 60.000000 / min. Current: 80.000000 / min. Contact email@example.com if you continue to have issues.”
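One workaround is the usual retry pattern (my own sketch, not official client code): back off exponentially when a rate-limit error comes back, and batch several rows into a single embeddings request so you stay under the per-minute cap.

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` when it raises a rate-limit error, sleeping longer each time.

    Treats any exception mentioning 'Rate limit' as retryable; with the
    openai client library you would catch its RateLimitError instead."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if "Rate limit" not in str(exc) or attempt == max_retries - 1:
                raise
            # Exponential backoff with a little jitter.
            time.sleep(base_delay * (2 ** attempt + random.random()))
```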
In relation to this response, I was curious whether the research has revealed a “best practice” for how to turn a large set of data into useful prompt/completion pairs. For example, in my research, I’m investigating the potential for GPT-3 to return factual answers to objective/nuanced/deductive questions in the style of the original author. In my case, I have about 100 essays, each with ~13,000 words. I’m considering a few options:
Prompt = essay title, Completion = whole essay
Prompt/Completion alternate between sentences; e.g., prompt = 1st sentence, completion = 2nd sentence, next prompt = 2nd sentence, completion = 3rd sentence, etc. (this doubles the cost of training)
Prompt = summary of essay and completion = the original essay.
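For what it’s worth, option 2 is easy to prototype. A naive sketch (the prompt/completion dict shape follows the fine-tuning format; how you split into sentences is up to you):

```python
def sentence_pairs(sentences):
    """Option 2: pair each sentence with the one that follows it.

    Produces len(sentences) - 1 overlapping prompt/completion pairs,
    which is why this roughly doubles the training cost."""
    return [{"prompt": sentences[i], "completion": " " + sentences[i + 1]}
            for i in range(len(sentences) - 1)]
```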
In practice, I want to offer a prompt with Fact A and Fact B and test to see if GPT-3 returns Fact C (where A = B and B = C, so A = C should follow). I don’t want GPT-3 to simply return the document where the answer exists.