Need human-like responses to test model performance

I'm working on an LLM application and now need to test its performance on a given set of articles. How can I create a set of questions with answers that are good, human-like responses, so that I can compare them against the answers generated by the model?

Question types like how, why, what, and more can be used.

This is an evaluation, or "Eval". You can find details here


One wouldn’t use a framework ultimately designed for submitting AI feedback cases to OpenAI.

If you want a set of questions, you don't actually need to create them yourself. You can use one of several open-source training sets that were used to train AI models.

GPT4All, for example, has 700k+:

1 Data Collection and Curation
We collected roughly one million prompt-response pairs using the GPT-3.5-Turbo OpenAI API between March 20, 2023 and March 26th, 2023. To do this, we first gathered a diverse sample of questions/prompts by leveraging five publicly available datasets:
• The unified chip2 subset of LAION OIG.
• Coding questions with a random sub-sample of Stackoverflow Questions
• Instruction-tuning with a sub-sample of Bigscience/P3
• Conversation Data from ShareGPT
• Instruction Following Data curated for Dolly (Conover et al.)
We additionally curated a creative-style dataset using GPT-3.5-Turbo to generate poems, short stories, and raps in the style of various artists.
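If you'd rather pull one of these public sources directly than write questions by hand, a minimal sketch looks like this (assuming the Hugging Face `datasets` library and the `databricks/databricks-dolly-15k` dataset, which is not the exact corpus quoted above, just a convenient open instruction set with reference answers):

```python
# Sketch: load prompt/response pairs from a public instruction dataset
# to use as ready-made questions and reference answers.
# Assumes: `pip install datasets`; dataset id databricks/databricks-dolly-15k.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Keep only question-answering records that come with a source passage,
# which is closest to "ask questions about a given article".
qa_pairs = [
    {"question": row["instruction"], "context": row["context"], "answer": row["response"]}
    for row in dolly
    if row["category"] == "closed_qa"
]

print(len(qa_pairs), "question/answer pairs")
print(qa_pairs[0])
```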

You can also see how gpt-3.5-turbo answered in March vs today…

If you need to ask domain-specific questions, you can probably have GPT-4 synthesize a set of them. However, you'll also likely want to go back to the March model version if you don't want it to give up on the task after 500 tokens.
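One way to do that synthesis, as a rough sketch: feed each article to a GPT-4-class model and ask for question/answer pairs. This assumes the current OpenAI Python SDK (`pip install openai`) with an `OPENAI_API_KEY` in the environment; the model name, prompt wording, and `make_qa_pairs` helper are placeholders, not a prescribed recipe.

```python
# Sketch: have a GPT-4-class model synthesize Q&A pairs from one article.
import json
from openai import OpenAI

client = OpenAI()

def make_qa_pairs(article_text: str, n: int = 5) -> list[dict]:
    """Ask the model for n human-style question/answer pairs about the article."""
    prompt = (
        f"Read the article below and write {n} question/answer pairs a careful human "
        "reader would produce. Use a mix of how/why/what questions. "
        'Return JSON: {"pairs": [{"question": "...", "answer": "..."}]}\n\n'
        f"ARTICLE:\n{article_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model that supports JSON mode
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["pairs"]
```

You'd still want a human pass over the synthesized pairs before treating them as ground truth.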

I'm using a pre-trained GPT model for my use case, so I need ground truth to evaluate the results and compute metrics.
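One simple starting point for that comparison is a token-level overlap score (SQuAD-style F1) between each generated answer and its reference answer. This is just one lexical metric; embedding similarity or an LLM judge are common complements. A minimal sketch in plain Python, with an illustrative function name and sample strings:

```python
# Sketch: SQuAD-style token-level F1 between a model answer and a reference answer.
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def token_f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(ground_truth)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: score one model answer against its reference answer.
print(token_f1("The capital of France is Paris.", "Paris is France's capital city."))
```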