Is it possible to fine-tune a model to answer questions given raw text?

I have extensive documentation about a specific product, and I want to fine-tune a model so it can answer questions about it, the same way OpenAI's answers method would.

As far as I know, the file you need to train the model is a CSV file with prompts and completions, but what I have is a large amount of raw text, so I want the model to read it and be able to answer all types of questions about it.

Any suggestions?

Great question! For fine-tuning, you should provide at least a few hundred examples, while a large piece of text is just one example if formatted as a single prompt-completion pair. In other words, you'll need to turn that text into at least a few hundred prompt-completion pairs, in JSONL format.
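For reference, the training file is just one JSON object per line, something like this (questions and answers invented for illustration; the `###` separator, leading space, and trailing `\n` stop sequence follow the conventions from the fine-tuning guide):

```jsonl
{"prompt": "How do I reset the device?\n\n###\n\n", "completion": " Hold the power button for ten seconds until the LED blinks.\n"}
{"prompt": "What ports does the device have?\n\n###\n\n", "completion": " Two USB-C ports and one HDMI port.\n"}
```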

Alternatively, you can use the Answers endpoint to answer questions about uploaded text.
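With the pre-1.0 Python client, an Answers call over an uploaded file looked roughly like this (the endpoint has since been deprecated; the file name and example values are placeholders):

```python
import openai  # pre-1.0 client; the Answers endpoint was later removed

# The uploaded file is JSONL, one short passage per line, e.g.:
# {"text": "Hold the power button for ten seconds to reset.", "metadata": "reset"}
f = openai.File.create(file=open("product_docs.jsonl", "rb"), purpose="answers")

resp = openai.Answer.create(
    search_model="ada",             # engine that ranks the passages
    model="curie",                  # engine that writes the answer
    question="How do I reset the device?",
    file=f["id"],
    examples_context="The device ships with a quick-start guide.",
    examples=[["What is in the box?", "The device and a quick-start guide."]],
    max_tokens=60,
    stop=["\n"],
)
print(resp["answers"][0])
```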

Hi @jesruiort

You can start by using the answers endpoint.

Explore fine-tuning if your documentation exceeds the token limit.

Thanks for your answer. The problem I have with the Answers endpoint is that every execution of it, with large documentation, would have a big token cost, wouldn't it?

So, if I were to use fine-tuning, my only option is to transform the documentation into prompt-completion (question -> answer) pairs?

Thanks again!

Yes, you are correct about the costs of using the answers API with a very large document.

In that case, you can also use a combination of the embeddings and completions endpoints: use embeddings to find the documents most semantically similar to the search query, then pass those to completions.
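A rough sketch of the embeddings half with the pre-1.0 Python client (the model name and documents are placeholders):

```python
import numpy as np
import openai  # pre-1.0 client

def embed(texts, model="text-embedding-ada-002"):
    resp = openai.Embedding.create(input=texts, model=model)
    return np.array([d["embedding"] for d in resp["data"]])

docs = ["Section 1 of the manual ...", "Section 2 ...", "Section 3 ..."]
doc_vecs = embed(docs)  # embed the documentation once and cache the vectors

q_vec = embed(["How do I reset the device?"])[0]

# Cosine similarity; ada-002 vectors are unit length, so a dot product suffices.
sims = doc_vecs @ q_vec
top_docs = [docs[i] for i in np.argsort(-sims)[:3]]  # these go into the completions prompt
```

Only the short query gets embedded at question time, so the per-question token cost stays small.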

And yes, while fine-tuning is an option, I would try all of these before resorting to a fine-tuned model.

I work with a similar topic. In my case (tourism) I have a lot of hotels.
I don't have much experience with GPT-3, but I think fine-tuning is not the right solution.
My tests show that it is not a problem to train the model to give correct answers; the problem is to train it so that it does not give wrong answers.
GPT-3 also likes to answer questions it doesn't know the answer to.
I think a better solution is to use question answering. I would make a separate file for each product. In the file, each document should be at most 1-2 sentences, so each document is about the same size as a fine-tuning answer.
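If I understand the file format correctly, each product file would be a JSONL of short passages, something like this (hotel text invented for illustration):

```jsonl
{"text": "Hotel Aurora has 120 rooms and a rooftop pool.", "metadata": "hotel-aurora"}
{"text": "Breakfast at Hotel Aurora is served from 7:00 to 10:30.", "metadata": "hotel-aurora"}
```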
Finally, I repeat once again that I do not have much experience with this topic. With my answer I want to encourage other users to write about their experiences, and I hope to find better solutions for my work. :grinning:

I think I understand what you mean. I'm trying to implement embeddings for my problem, but I've got a question. Once I have located, via embeddings, the document I should look for the information in, is there a way to send that document to the completions endpoint, so it can answer my question by looking at it?

What I'm doing now is using the answers endpoint, but the document I pass to it is the one the embeddings search returns, so I don't spend many tokens on the search part. Does this make sense?

Thank you!

If you want to use the answers endpoint with documents returned by embeddings, you will have to make two API calls: one to obtain the top n documents semantically similar to the query using embeddings, and a second to the answers API with those documents.
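The second call, against the legacy Answers endpoint with the pre-1.0 client, looked roughly like this (assume `top_docs` came out of the embeddings step; the other values are invented placeholders):

```python
import openai  # pre-1.0 client; the Answers endpoint was later removed

top_docs = ["Breakfast is served from 7:00 to 10:30.", "Checkout is at noon."]

resp = openai.Answer.create(
    search_model="ada",                # ranks the supplied documents
    model="curie",                     # writes the final answer
    question="Is breakfast included?",
    documents=top_docs,                # only the top hits, which keeps token costs down
    examples_context="The hotel has a rooftop pool.",
    examples=[["Does the hotel have a pool?", "Yes, a rooftop pool."]],
    max_tokens=40,
)
print(resp["answers"][0])
```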

I am using a 2-part approach for a similar use case:

  1. Obtain search results using the embeddings endpoint.
  2. Send the top n search results/documents to the completions endpoint, along with the user's query plus instructions to answer the query based on those documents.

One problem I was having is that each of my documents is quite long, consisting of several sentences, so I was bumping up against the 2048-token limit with only 3-5 search results in the prompt. This was not enough to generate high-quality answers. So now I am experimenting with parsing each search result/document into individual sentences, re-ranking them, and sending the top n sentences to the completions endpoint. That way, I filter out less relevant sentences before sending the prompt. If that approach doesn't work, I'll experiment with some combination of whole documents and sentences, and maybe perform more than one completion at once and then combine them to generate the best-written answer to the query.

One more thing: I tried using babbage for the completions and it was not nearly good enough for my needs, so I'll definitely be using davinci. I haven't tried the answers endpoint and I haven't tried fine-tuning; my gut instinct tells me those methods won't work as well for my use case. Plus, I still don't totally understand how the answers endpoint works, or which models/engines/endpoints allow fine-tuning and which do not. I find the documentation poor in this regard.
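A sketch of that sentence-level re-ranking (the naive sentence splitting and the model names are placeholder choices, not recommendations):

```python
import numpy as np
import openai  # pre-1.0 client

def embed(texts, model="text-embedding-ada-002"):
    resp = openai.Embedding.create(input=texts, model=model)
    return np.array([d["embedding"] for d in resp["data"]])

query = "How do I reset the device?"
search_results = ["A long document. With several sentences.", "Another long document."]

# Naive split on periods; a real splitter (e.g. nltk) would be more robust.
sentences = [s.strip() + "." for doc in search_results for s in doc.split(".") if s.strip()]

# Re-rank individual sentences against the query and keep only the best ones.
q_vec = embed([query])[0]
s_vecs = embed(sentences)
ranked = np.argsort(-(s_vecs @ q_vec))
top_sentences = [sentences[i] for i in ranked[:10]]

prompt = ("Answer the question using only these sentences:\n"
          + "\n".join(top_sentences)
          + f"\n\nQuestion: {query}\nAnswer:")
resp = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=150)
print(resp["choices"][0]["text"].strip())
```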

Perfect, that's what I am doing.

I've got a problem, which seems to be the last one. I tried building a CSV file for the embeddings (as far as I know, that's the way to do it). The first column is the identifier of a document, and the second one is the text.

I first tried with a 6-row CSV, just to see if it worked, and it did. The problem is that now, with 47 rows, I get this message while embedding:

"Rate limit reached for requests. Limit: 60.000000 / min. Current: 80.000000 / min. Contact support@openai.com if you continue to have issues."

How could I fix this?

Thanks for your answer.

I've got a question for you: how did you use the embeddings endpoint?

I tried to follow this:

But when I'm building the CSV, I get this error:

"Rate limit reached for requests. Limit: 60.000000 / min. Current: 80.000000 / min. Contact support@openai.com if you continue to have issues."

My CSV consists of a first column, which is the ID, and a second column, which is the text.

Sorry, I can't help with that. Hopefully support can help.

That looks like you're somehow maxing out on the limit. Support should be able to help.
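In the meantime, retrying with backoff usually gets past a requests-per-minute cap, and since the embeddings endpoint accepts a list of inputs, 47 rows can also be batched into a single request. A minimal sketch with the pre-1.0 client (the model name is just an example):

```python
import time
import openai  # pre-1.0 client, matching the error message above

def embed_with_retry(text, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            resp = openai.Embedding.create(input=[text], model="text-embedding-ada-002")
            return resp["data"][0]["embedding"]
        except openai.error.RateLimitError:
            time.sleep(2 ** attempt)  # back off, then retry
    raise RuntimeError("still rate limited after retries")

# Better yet: one batched call instead of 47 separate requests.
texts = ["row 1 text ...", "row 2 text ..."]  # the text column of the CSV
batch = openai.Embedding.create(input=texts, model="text-embedding-ada-002")
vectors = [d["embedding"] for d in batch["data"]]
```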

Hi, @joey
In relation to this response, I was curious whether the research has revealed a 'best practice' for turning a large set of data into useful prompt/completion pairs. For example, in my research I'm investigating the potential for GPT-3 to return factual answers to objective/nuanced/deductive questions in the style of the original author. In my case, I have about 100 essays, each with ~13,000 words. I'm considering a few options:

  1. Prompt = essay title, Completion = whole essay
  2. Prompt/Completion alternate between sentences; e.g., prompt = 1st sentence, completion = 2nd sentence, next prompt = 2nd sentence, completion = 3rd sentence, etc. (this doubles the cost of training)
  3. Prompt = summary of the essay, Completion = the original essay.

In practice, I want to offer a prompt with Fact A and Fact B and test whether GPT-3 returns Fact C (where A = B and B = C). I don't want GPT-3 to simply return the document where the answer exists.

I appreciate any advice you can offer.

Keep up the good work!

These are great experiments. Could you say a few words about what worked and what didn't? Thanks!

I don't see the answers endpoint available anymore at the link. Could you please provide a link or an example of this?
I would like to pass/upload a documentation file and have GPT answer questions about what's present in the file. Essentially, it should fetch the contents of the file to provide the answer. How can this be done without fine-tuning?
Any help is appreciated! Thanks

I think you may want to take a look at this project:
Quivr

I cannot post a link here, but you can search for it on GitHub.
I'm currently playing with it and the results seem good to me. I think it can fulfill your requirement.