Is it possible to fine-tune a model to answer questions given raw text?

I have extensive documentation about a specific product, and I want to fine-tune a model to answer questions about it, the same way OpenAI's Answers endpoint does.

As far as I know, the file you must provide to train the model is a CSV file with prompts and completions, but what I have is a large amount of raw text. I want my model to read it and be able to answer all types of questions about it.

Any suggestions?

3 Likes

Great question! For fine-tuning you should provide at least a few hundred examples, while a large piece of text, formatted as a single prompt-completion pair, is just one example. In other words, you'll need to turn that text into at least a few hundred prompt-completion pairs, in JSONL format.
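For reference, a minimal sketch of what such JSONL training data could look like (the questions, answers, and separator/stop-token conventions here are made-up examples, not your actual documentation):

```python
import json

# A couple of hypothetical Q&A pairs hand-written from the documentation;
# in practice you would need a few hundred of these.
qa_pairs = [
    ("What platforms does the product support?",
     " Windows, macOS, and Linux."),
    ("How do I reset my password?",
     " Open Settings > Account and click 'Reset password'."),
]

def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as JSONL for fine-tuning."""
    lines = []
    for question, answer in pairs:
        lines.append(json.dumps({
            # A fixed separator at the end of the prompt and a stop token at
            # the end of the completion help the model learn where each ends.
            "prompt": question + "\n\n###\n\n",
            "completion": answer + " END",
        }))
    return "\n".join(lines)

print(to_jsonl(qa_pairs))
```

Each line of the output is one standalone JSON object, which is exactly the shape the fine-tuning upload expects.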

Alternatively, you can use the Answers endpoint to answer questions about uploaded text.

2 Likes

Hi @jesruiort

You can start by using the answers endpoint

Explore fine-tuning if your documentation exceeds the token limit.

Thanks for your answer. The problem I have with the Answers endpoint is that every call with large documentation would incur a big token cost, wouldn't it?

So if I go with fine-tuning, my only option is to transform the documentation into prompt-completion (question → answer) pairs?

Thanks again!


1 Like

Yes, you are correct regarding costs for using the answers API for a very large document.

In that case, you can also use a combination of the embeddings and completions endpoints: use embeddings to retrieve the documents semantically similar to the search query, then pass those to completions.
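The retrieval half of that combination can be sketched in plain Python. The vectors below are toy stand-ins for the embeddings the API would return, and the function names are my own:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_documents(query_vec, doc_vecs, n=3):
    """Return the indices of the n documents most similar to the query,
    most similar first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:n]

# Toy 3-dimensional vectors standing in for real embeddings.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_n_documents(query, docs, n=2))
```

Only the documents that survive this ranking get sent to the completions call, which is what keeps the per-query token cost down.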

And yes, while fine-tuning is an option, I would try all of these before resorting to a fine-tuned model.

I work on a similar topic. In my case (tourism) I have a lot of hotels.
I don't have much experience with GPT-3, but I think fine-tuning is not the right solution.
My tests show that training the model to give correct answers is not the problem; the problem is training it so that it does not give wrong answers.
GPT-3 also likes to answer questions it doesn't know the answer to.
I think a better solution is to use "Question answering". I would make a separate file for each product. In the file, each document should be at most 1-2 sentences, so each document is about the same size as a fine-tuning answer.
Finally, I repeat that I don't have much experience with this topic. With my answer I want to encourage other users to write about their experiences, so that I may find better solutions for my own work. :grinning:

3 Likes

I think I understand what you mean. I'm trying to implement embeddings for my problem, but I've got a question. Once the embedding search has located the document I should look in for the information, is there a way to send that document to the completions endpoint, so it can answer my question by looking at that document?

What I'm doing now is using the answers endpoint, but the document I pass to it is the one the embedding search returns, so I don't spend many tokens on the search part. Does this make sense?

Thank you!

If you want to use the answers endpoint with documents returned by embeddings, you are going to have to make two API calls: one to obtain the top n documents semantically similar to the query using embeddings, then one to the answers API with the documents obtained in the previous step.
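If you switch the second call from answers to plain completions, that step reduces to prompt assembly. A sketch, where the template wording is a hypothetical example to adjust for your use case:

```python
def build_prompt(document: str, question: str) -> str:
    """Assemble a completions prompt grounded in one retrieved document."""
    return (
        "Answer the question using only the context below. "
        'If the answer is not in the context, say "I don\'t know."\n\n'
        f"Context:\n{document}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved document and user question.
prompt = build_prompt(
    "The product supports Windows, macOS, and Linux.",
    "Which platforms are supported?",
)
print(prompt)
```

The "only the context below" instruction also helps with the hallucination concern raised earlier in the thread, since it gives the model an explicit out when the document doesn't contain the answer.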

1 Like

I am using a 2-part approach for a similar use case:

  1. obtain search results using the embeddings endpoint
  2. send the top n search results/documents to the completions endpoint, along with the user’s query plus instructions to answer the query based on those documents
One problem I was having is that each of my documents is quite long, consisting of several sentences, so I was bumping up against the 2048-token limit with only 3-5 search results in the prompt. This was not enough to generate high-quality answers. So now I am experimenting with parsing each search result/document into individual sentences, re-ranking them, and sending the top n sentences to the completions endpoint. That way, I filter out less relevant sentences before sending the prompt.

If that approach doesn't work, I'll experiment with some combination of whole documents and sentences, and maybe perform more than one completion at once and then combine them to generate the best-written answer to the query.

One more thing: I tried using babbage for the completions and it was not nearly good enough for my needs, so I'll definitely be using davinci. I haven't tried the answers endpoint and I haven't tried fine-tuning; my gut instinct tells me those methods won't work as well for my use case. Plus, I still don't totally understand how the answers endpoint works, or which models/engines/endpoints allow fine-tuning and which do not. I find the documentation poor in this regard.
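The sentence-level filtering step described above can be sketched as follows. Here a naive word-overlap score stands in for re-ranking by embedding similarity, and the word budget stands in for the token limit (the function and parameter names are mine):

```python
import re

def rank_sentences(question_words, document, budget_words=50):
    """Split a document into sentences, score each by word overlap with
    the question (a cheap stand-in for embedding similarity), and keep
    the top-scoring sentences until the word budget is used up."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    scored = sorted(
        sentences,
        key=lambda s: len(question_words & set(re.findall(r"\w+", s.lower()))),
        reverse=True,
    )
    kept, used = [], 0
    for sentence in scored:
        n = len(sentence.split())
        if used + n > budget_words:
            break
        kept.append(sentence)
        used += n
    return kept

doc = ("The hotel has a pool. Breakfast is served from 7 to 10. "
       "Pets are not allowed. The pool is heated in winter.")
print(rank_sentences({"is", "the", "pool", "heated"}, doc, budget_words=12))
```

Only the surviving sentences go into the completions prompt, which leaves room for more search results under the token limit.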
1 Like

Perfect, that’s what I am doing.

I've got a problem which seems to be the last one. I built a CSV file for the embeddings (as far as I know, that's the way to do it): the first column is the document identifier and the second is the text.

I first tried with a 6-row CSV, just to see if it worked, and it did. The problem is that now, with 47 rows, I get this message while embedding.

“Rate limit reached for requests. Limit: 60.000000 / min. Current: 80.000000 / min. Contact support@openai.com if you continue to have issues.”

How could I fix this?
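That error means the requests are going out faster than the 60-per-minute limit. A common fix is to throttle the loop or retry with exponential backoff; a sketch, where `RuntimeError` is a stand-in for whatever rate-limit exception your client library actually raises:

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(); on a rate-limit error, wait with exponentially growing
    delays and retry. Re-raises if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for the client's rate-limit error
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo with a fake request that fails twice before succeeding.
calls = {"n": 0}
def flaky_embed_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit reached")
    return "ok"

print(with_backoff(flaky_embed_request, base_delay=0.01))
```

Alternatively, a simple `time.sleep(1)` between rows keeps 47 requests safely under 60 per minute.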

Thanks for your answer.

I’ve got a question for you. How did you use the embedding endpoint?

I tried to follow this:

But when I’m building the CSV, I get this error:

“Rate limit reached for requests. Limit: 60.000000 / min. Current: 80.000000 / min. Contact support@openai.com if you continue to have issues.”

My CSV consists of a first column, which is the ID, and a second column, which is the text.

Sorry I can’t help with that. Hopefully support can help.

It looks like you're somehow exceeding the rate limit. Support should be able to help.