Is it possible to fine-tune a model to answer questions given raw text?

I have extensive documentation about a specific product, and I want to fine-tune a model to answer questions about it, the same way OpenAI's Answers endpoint does.

As far as I know, the file you must provide to train the model is a CSV file with prompts and completions, but what I have is a large amount of raw text. I want my model to read it and be able to answer all types of questions about it.

Any suggestions?

3 Likes

Great question! For fine-tuning you should provide at least a few hundred examples, while a large piece of text, formatted as a single prompt-completion pair, is just one example. In other words, you'll need to turn that text into at least a few hundred prompt-completion pairs, in JSONL format.
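For reference, a minimal sketch of what such JSONL training data could look like (the questions, answers, and separator/stop-token conventions here are made-up examples, not your actual documentation):

```python
import json

# A couple of hypothetical Q&A pairs hand-written from the documentation;
# in practice you would need a few hundred of these.
qa_pairs = [
    ("What platforms does the product support?",
     " Windows, macOS, and Linux."),
    ("How do I reset my password?",
     " Open Settings > Account and click 'Reset password'."),
]

def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as JSONL for fine-tuning."""
    lines = []
    for question, answer in pairs:
        lines.append(json.dumps({
            # A fixed separator at the end of the prompt and a stop token at
            # the end of the completion help the model learn where each ends.
            "prompt": question + "\n\n###\n\n",
            "completion": answer + " END",
        }))
    return "\n".join(lines)

print(to_jsonl(qa_pairs))
```

Each line of the output is one standalone JSON object, which is exactly the shape the fine-tuning upload expects.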

Alternatively, you can use the Answers endpoint to answer questions about uploaded text.

2 Likes

Hi @jesruiort

You can start by using the answers endpoint

Explore fine-tuning if your documentation exceeds the token limit.

Thanks for your answer. The problem I have with the Answers endpoint is that every call with large documentation would incur a big token cost, wouldn't it?

So if I go with fine-tuning, my only option is to transform the documentation into prompt-completion (question → answer) pairs?

Thanks again!


1 Like

Yes, you are correct regarding costs for using the answers API for a very large document.

In that case, you can also use a combination of the embeddings and completions endpoints: use embeddings to retrieve the documents semantically similar to the search query, then pass those to completions.
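The retrieval half of that combination can be sketched in plain Python. The vectors below are toy stand-ins for the embeddings the API would return, and the function names are my own:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_documents(query_vec, doc_vecs, n=3):
    """Return the indices of the n documents most similar to the query,
    most similar first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:n]

# Toy 3-dimensional vectors standing in for real embeddings.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_n_documents(query, docs, n=2))
```

Only the documents that survive this ranking get sent to the completions call, which is what keeps the per-query token cost down.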

And yes, while fine-tuning is an option, I would try all of these before resorting to a fine-tuned model.

I work on a similar topic. In my case (tourism) I have a lot of hotels.
I don't have much experience with GPT-3, but I think fine-tuning is not the right solution.
My tests show that training the model to give correct answers is not the problem; the problem is training it so that it does not give wrong answers.
GPT-3 also likes to answer questions it doesn't know the answer to.
I think a better solution is to use "Question answering". I would make a separate file for each product. In the file, each document should be at most 1-2 sentences, so each document is about the same size as a fine-tuning answer.
Finally, I repeat that I don't have much experience with this topic. With my answer I want to encourage other users to write about their experiences, so that I may find better solutions for my own work. :grinning:

3 Likes

I think I understand what you mean. I'm trying to implement embeddings for my problem, but I've got a question. Once the embedding search has located the document I should look in for the information, is there a way to send that document to the completions endpoint, so it can answer my question by looking at that document?

What I'm doing now is using the answers endpoint, but the document I pass to it is the one the embedding search returns, so I don't spend many tokens on the search part. Does this make sense?

Thank you!

If you want to use the answers endpoint with documents returned by embeddings, you are going to have to make two API calls: one to obtain the top n documents semantically similar to the query using embeddings, then one to the answers API with the documents obtained in the previous step.
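If you switch the second call from answers to plain completions, that step reduces to prompt assembly. A sketch, where the template wording is a hypothetical example to adjust for your use case:

```python
def build_prompt(document: str, question: str) -> str:
    """Assemble a completions prompt grounded in one retrieved document."""
    return (
        "Answer the question using only the context below. "
        'If the answer is not in the context, say "I don\'t know."\n\n'
        f"Context:\n{document}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved document and user question.
prompt = build_prompt(
    "The product supports Windows, macOS, and Linux.",
    "Which platforms are supported?",
)
print(prompt)
```

The "only the context below" instruction also helps with the hallucination concern raised earlier in the thread, since it gives the model an explicit out when the document doesn't contain the answer.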

1 Like

I am using a 2-part approach for a similar use case:

  1. obtain search results using the embeddings endpoint
  2. send the top n search results/documents to the completions endpoint, along with the user’s query plus instructions to answer the query based on those documents
One problem I was having is that each of my documents is quite long, consisting of several sentences, so I was bumping up against the 2048-token limit with only 3-5 search results in the prompt. This was not enough to generate high-quality answers. So now I am experimenting with parsing each search result/document into individual sentences, re-ranking them, and sending the top n sentences to the completions endpoint. That way, I filter out less relevant sentences before sending the prompt.

If that approach doesn't work, I'll experiment with some combination of whole documents and sentences, and maybe perform more than one completion at once and then combine them to generate the best-written answer to the query.

One more thing: I tried using babbage for the completions and it was not nearly good enough for my needs, so I'll definitely be using davinci. I haven't tried the answers endpoint and I haven't tried fine-tuning; my gut instinct tells me those methods won't work as well for my use case. Plus, I still don't totally understand how the answers endpoint works, or which models/engines/endpoints allow fine-tuning and which do not. I find the documentation poor in this regard.
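The sentence-level filtering step described above can be sketched as follows. Here a naive word-overlap score stands in for re-ranking by embedding similarity, and the word budget stands in for the token limit (the function and parameter names are mine):

```python
import re

def rank_sentences(question_words, document, budget_words=50):
    """Split a document into sentences, score each by word overlap with
    the question (a cheap stand-in for embedding similarity), and keep
    the top-scoring sentences until the word budget is used up."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    scored = sorted(
        sentences,
        key=lambda s: len(question_words & set(re.findall(r"\w+", s.lower()))),
        reverse=True,
    )
    kept, used = [], 0
    for sentence in scored:
        n = len(sentence.split())
        if used + n > budget_words:
            break
        kept.append(sentence)
        used += n
    return kept

doc = ("The hotel has a pool. Breakfast is served from 7 to 10. "
       "Pets are not allowed. The pool is heated in winter.")
print(rank_sentences({"is", "the", "pool", "heated"}, doc, budget_words=12))
```

Only the surviving sentences go into the completions prompt, which leaves room for more search results under the token limit.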
1 Like

Perfect, that’s what I am doing.

I've got a problem which seems to be the last one. I built a CSV file for the embeddings (as far as I know, that's the way to do it): the first column is the document identifier and the second is the text.

I first tried with a 6-row CSV, just to see if it worked, and it did. The problem is that now, with 47 rows, I get this message while embedding.

“Rate limit reached for requests. Limit: 60.000000 / min. Current: 80.000000 / min. Contact support@openai.com if you continue to have issues.”

How could I fix this?
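That error means the requests are going out faster than the 60-per-minute limit. A common fix is to throttle the loop or retry with exponential backoff; a sketch, where `RuntimeError` is a stand-in for whatever rate-limit exception your client library actually raises:

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(); on a rate-limit error, wait with exponentially growing
    delays and retry. Re-raises if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for the client's rate-limit error
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo with a fake request that fails twice before succeeding.
calls = {"n": 0}
def flaky_embed_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit reached")
    return "ok"

print(with_backoff(flaky_embed_request, base_delay=0.01))
```

Alternatively, a simple `time.sleep(1)` between rows keeps 47 requests safely under 60 per minute.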

Thanks for your answer.

I’ve got a question for you. How did you use the embedding endpoint?

I tried to follow this:

But when I’m building the CSV, I get this error:

“Rate limit reached for requests. Limit: 60.000000 / min. Current: 80.000000 / min. Contact support@openai.com if you continue to have issues.”

My CSV consists of a first column, which is the ID, and a second column, which is the text.

Sorry I can’t help with that. Hopefully support can help.

It looks like you're somehow exceeding the rate limit. Support should be able to help.