Evaluation Metric For Question Answering - Finetuning Models

Hello Community Members,
I am addressing all those who have prior experience finetuning ChatGPT on their custom datasets. OpenAI provides a results CSV file for every trained model, which predominantly includes loss, sequence accuracy, and token accuracy for training and validation respectively.

My question is: how good a metric is “sequence accuracy”? GPTs, being language models, understand the semantics behind the training data and generate output accordingly. If we compare the actual data and the generated output as a whole, then what is the point of this metric? My validation results are not overfitted: the loss is very near 0.1 and token accuracy is in the 70–80% range, yet sequence accuracy is straight up 0. This got me thinking about what I just asked in this thread.
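For anyone else puzzled by the same gap: sequence accuracy only credits an example when *every* token matches, so a single wrong token zeroes that example. A toy sketch below illustrates the difference (my own simplified definitions, not OpenAI's exact implementation):

```python
def token_accuracy(pred, ref):
    """Fraction of positions where the predicted token matches the reference."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(ref), 1)

def sequence_accuracy(preds, refs):
    """Fraction of examples where the ENTIRE predicted sequence matches exactly."""
    exact = sum(p == r for p, r in zip(preds, refs))
    return exact / max(len(refs), 1)

refs  = [["the", "cat", "sat", "on", "the", "mat"]]
preds = [["the", "cat", "sat", "on", "a",   "mat"]]  # one token off

print(token_accuracy(preds[0], refs[0]))   # ≈ 0.83 (5 of 6 tokens correct)
print(sequence_accuracy(preds, refs))      # 0.0 — one wrong token fails the whole example
```

With long free-form answers, an exact match across the whole sequence is very unlikely, which is why sequence accuracy can sit at 0 even when the model is doing well on a token level.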

How have you evaluated your finetuned models, especially for a question answering use case (without any reference corpus given in the prompt)? Any tips and tricks regarding finetuning for question answering would be appreciated.


Finetuning is not suitable for QA tasks. You’ll find a more detailed (and probably better) explanation here: OpenAI Q&A: Finetuning GPT-3 vs Semantic Search - which to use, when, and why? - YouTube

Hi @anant,
What I am actually doing is generating answers for the given prompts (questions), without any supporting context/corpus.

If your answers need to be precise, like @anant said, don’t use a fine-tune; use an embedding technique instead.

But if your answers can be imprecise and possibly inaccurate, then a fine-tune would be OK.

But most folks here try fine-tunes to train GPT-3 on their specific answers, and are then disappointed at how much the output differs from their specific facts. So they are pointed to embeddings (to correlate against your specific answers) and then asked to have GPT-3 answer drawing only from your data. Training a fine-tune, at least in my opinion, is good for classification (only 1 token in the output). The other possible use is for open-ended answers that don’t need to be accurate (multiple tokens in the output).
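The embeddings workflow described above can be sketched in a few lines: embed your documents once, embed each incoming question, rank documents by cosine similarity, and build a prompt that tells GPT-3 to answer only from the retrieved context. The sketch below uses a toy bag-of-words `embed()` stand-in so it runs without an API key; in practice you would call OpenAI's embeddings endpoint instead:

```python
import math
import re

VOCAB = ["refund", "policy", "shipping", "days", "password", "reset"]

def embed(text):
    """Toy bag-of-words vector — a stand-in for a real embeddings API call."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Our refund policy allows returns within 30 days.",
    "Shipping takes 5 business days.",
    "You can reset your password from the account page.",
]
doc_vecs = [embed(d) for d in docs]  # embed the corpus once, up front

def answer_prompt(question, top_k=1):
    """Retrieve the best-matching passage(s) and build a grounded prompt."""
    q = embed(question)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, doc_vecs[i]),
                    reverse=True)
    context = "\n".join(docs[i] for i in ranked[:top_k])
    return (f"Answer only from the context below.\n\n"
            f"Context:\n{context}\n\nQ: {question}\nA:")

print(answer_prompt("What is the refund policy?"))
```

The returned string is what you would send as the prompt in your completion call, so the model answers from your facts rather than from whatever the fine-tune happened to memorize.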

@curt.kennedy is there any guide/tutorial on using embeddings for question answering?


@curt.kennedy Thanks

No problem, it’s more work on your end (database + compute), but I think it’s worth it!


Can you share your thoughts on how I could deploy question answering with embeddings on a website built with Webflow?

It all depends on your traffic level and the number of embeddings.

For lighter usage (sparse and bursty) with latency of 10 seconds or less, I would go serverless: in particular AWS Lambda, AWS DynamoDB, and AWS S3. Create an in-memory version of the vectors and load it into your Lambda from S3. Then use DynamoDB to store the text you retrieve for your top hits. I use this with 400k embeddings and get less than 1 second of latency per call once the Lambda is warm, or 10 seconds from a cold start.
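A minimal sketch of the in-memory search that runs inside such a Lambda, assuming the vectors are unit-normalized so a dot product gives cosine similarity. The S3 load is stubbed out with random data here; in production you would serialize the matrix once (e.g. `np.save`) and load it per warm container:

```python
import numpy as np

# Stand-in for loading the matrix from S3 once per warm Lambda container;
# 400k x n-dim float32 vectors fit comfortably in Lambda memory the same way.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100_000, 64)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize

def top_k(query, k=5):
    """Exact nearest-neighbor search: one matrix-vector product over all rows."""
    q = query / np.linalg.norm(query)
    scores = embeddings @ q                    # cosine similarity per row
    idx = np.argpartition(scores, -k)[-k:]     # unordered top-k in O(n)
    return idx[np.argsort(scores[idx])[::-1]]  # sort just the k hits, best first

hits = top_k(rng.normal(size=64).astype(np.float32))
```

The returned indices are what you would then look up in DynamoDB to fetch the stored text for each hit. Brute-force search like this is fast enough at a few hundred thousand vectors; it is only beyond that scale (or for heavy query volume) that a dedicated vector database starts to pay off.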

For something where you need lots of recursive queries and low latency, use a vector database such as Pinecone or Weaviate. Pricier, but you might have to go this route depending on your requirements.

In all cases you are generating a prompt for GPT-3 to respond to, so you’ll be making API calls to OpenAI.