Data Validation: Help and Suggestions Needed

I have a large set of multiple-choice questions (MCQs) across different subjects and topics. I want to validate that each question is correct and has a single, decisive correct answer.

I plan to use LLMs to validate these questions.

I have tested zero-shot prompting and it does not work very reliably.
I plan to use chain-of-thought (CoT) prompting, but I suspect it will not be a huge upgrade either.

Are there any existing studies or techniques that have proven to work well for this particular use case?

Welcome to the community!

For multiple choice, going through each possible option independently as true/false with CoT, and then a final CoT pass that discusses those results to pick the answer, works fairly well out of the box (model choice also matters quite a bit).
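
A minimal sketch of that flow, assuming the OpenAI Python SDK, a placeholder model name, and a toy question format (all of these are assumptions, not a prescribed implementation):

```python
# Sketch: judge each option independently as true/false with CoT, then
# aggregate in a second CoT pass. Model name and question format are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"   # assumption: use whichever model you prefer

question = {
    "stem": "Which planet is known as the Red Planet?",
    "options": ["Earth", "Mars", "Jupiter", "Venus"],
}

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judgments as deterministic as possible
    )
    return resp.choices[0].message.content

# Step 1: evaluate each option on its own, with reasoning before the verdict.
verdicts = []
for opt in question["options"]:
    verdict = ask(
        f"Question: {question['stem']}\n"
        f"Candidate answer: {opt}\n"
        "Think step by step about whether this candidate answer is correct, "
        "then end with one line: VERDICT: TRUE or VERDICT: FALSE."
    )
    verdicts.append((opt, verdict))

# Step 2: discuss the per-option verdicts and pick a single answer,
# or flag the question if zero or multiple options look correct.
summary = "\n\n".join(f"Option: {o}\n{v}" for o, v in verdicts)
final = ask(
    f"Question: {question['stem']}\n\n"
    f"Independent true/false analyses of each option:\n{summary}\n\n"
    "Discuss these results step by step, then end with one line: "
    "FINAL: <the single correct option>, or FINAL: AMBIGUOUS if there is "
    "not exactly one correct option."
)
print(final)
```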

If you have specific domain knowledge that isn’t part of the LLM’s training set, you’ll likely need to add some retrieval method. If it’s math, you might want to add a solver tool, and if they’re logic puzzles it’s a little bit more complicated altogether.

Do you have some example questions you’re struggling with?


Thanks, @Diet
The idea of going through each option with CoT plus some reflection step seems like the right solution and should solve the majority of the problem.

The questions we have are mainly around standard subjects/topics.

There are no particular example questions we struggle with, but there has been a persistent reliability issue with LLM-based validation: the model tends to return a different "correct" answer for the same question when we hit the API multiple times.


Ah yeah. Could be a temperature/sampling issue. I’d recommend lowering top_p so that your response isn’t contaminated by wild/random tokens.

Although if you did clamp the sample probabilities to something reasonable (temperature 1, top_p 1 generally isn’t reasonable) and you still get disagreeing results, it might be possible that the model can’t actually solve it under the given conditions.
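
For reference, something like this is what I mean by clamping the sampling (the model name and values here are just placeholders, not recommendations):

```python
# Sketch: clamp sampling so repeated calls are as consistent as possible.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",  # assumption: whichever model you're already using
    messages=[{"role": "user", "content": "Is Mars known as the Red Planet? Answer TRUE or FALSE."}],
    temperature=0,   # near-greedy decoding; minimizes run-to-run variance
    top_p=0.1,       # drop the long tail of unlikely tokens
    seed=1234,       # supported on some newer models; improves (but doesn't guarantee) reproducibility
)
print(resp.choices[0].message.content)
```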


We experimented quite a bit with temperature, top_p, and the prompt, and the questions were mostly factual and conceptual, yet we still saw fluctuations in the answers.

I’ll try the per-option CoT approach with a critique pass on the results.
That looks like a feasible solution.


Yep, it might be prudent to just set temperature to 0.
