I have tons of MCQs on different subjects and topics. Now I want to validate that these questions are correct and that each has a single, decisive correct answer.
I plan on using LLM(s) for validating these questions.
I have tested zero-shot prompting and it does not work very reliably.
I plan to use chain-of-thought (CoT) prompting, but I feel that it will not be a huge upgrade either.
Are there any known studies or techniques that have proven to work best for this particular use case?
For multiple choice, going through each possible option independently as true/false with CoT, and then using another CoT pass to discuss the results and reach a final answer, works fairly decently out of the box. It does depend on the model, though; model choice is also important.
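A minimal sketch of the per-option true/false step, in case it helps to see it laid out. Everything named here is hypothetical: `ask` stands in for whatever function sends a prompt to your model and returns its text, and the `VERDICT:` line is just an assumed output convention that makes the reply easy to parse. The final "discuss the results" pass is left out; this only flags whether exactly one option was judged true.

```python
import re
from typing import Callable, Dict, List


def judge_option(question: str, option: str, ask: Callable[[str], str]) -> bool:
    """Ask the model to reason about ONE option and return its true/false verdict."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {option}\n"
        "Think step by step about whether the proposed answer is correct. "
        "End with a final line of exactly 'VERDICT: TRUE' or 'VERDICT: FALSE'."
    )
    reply = ask(prompt)  # `ask` is a placeholder for your own LLM call
    verdicts = re.findall(r"VERDICT:\s*(TRUE|FALSE)", reply, re.IGNORECASE)
    if not verdicts:
        raise ValueError(f"no verdict found in reply: {reply!r}")
    # Use the last verdict in case the reasoning mentions the format earlier.
    return verdicts[-1].upper() == "TRUE"


def validate_mcq(question: str, options: Dict[str, str],
                 ask: Callable[[str], str]) -> dict:
    """Judge every option independently, then check that exactly one is accepted."""
    verdicts = {label: judge_option(question, text, ask)
                for label, text in options.items()}
    accepted: List[str] = [label for label, ok in verdicts.items() if ok]
    return {
        "verdicts": verdicts,
        "accepted": accepted,
        "single_answer": len(accepted) == 1,
    }
```

A question where `accepted` is empty or has more than one label is a candidate for manual review or for the follow-up discussion pass.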
If you have specific domain knowledge that isn’t part of the LLM’s training set, you’ll likely need to add some retrieval method. If it’s math, you might want to add a solver tool, and if they’re logic puzzles it’s a little bit more complicated altogether.
Do you have some example questions you’re struggling with?
Thanks, @Diet
The idea of going through each possible option with CoT plus some kind of reflection step seems to be the right solution and should solve most of the problem.
The questions we have are mainly around standard subjects/topics.
There are no particular example questions that we struggle with, but there has always been a reliability issue with the validations done via LLM(s).
The LLM(s) tend to pick different answers as correct for the same question if we hit the API multiple times.
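To put a number on that fluctuation, here is a rough sketch that calls the model several times for one question and reports how often the most common answer came back. `sample_answer` is a hypothetical zero-argument function wrapping your own API call and returning the chosen option label; the number of runs is an arbitrary choice.

```python
from collections import Counter
from typing import Callable, Tuple


def agreement(sample_answer: Callable[[], str], runs: int = 5) -> Tuple[str, float]:
    """Call the model `runs` times and return (most common answer, its share of runs)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    counts = Counter(sample_answer() for _ in range(runs))
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / runs
```

Questions with a low agreement share are the ones worth looking at by hand, since either the model cannot settle on an answer or the question itself may be ambiguous.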
Ah yeah. Could be a temperature/sampling issue. I’d recommend lowering top_p so that your response isn’t contaminated by wild/random tokens.
Although if you did clamp the sample probabilities to something reasonable (temperature 1, top_p 1 generally isn’t reasonable) and you still get disagreeing results, it might be possible that the model can’t actually solve it under the given conditions.
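For anyone unsure what lowering top_p actually does, here is a toy illustration of nucleus (top_p) filtering on a made-up next-token distribution. It is a simplified sketch of the general idea, not any provider's actual implementation, and the probabilities in the usage lines are invented.

```python
from typing import Dict


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their total mass reaches top_p, then renormalize."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    kept: Dict[str, float] = {}
    cumulative = 0.0
    # Walk tokens from most to least likely.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the unlikely tail is discarded and can never be sampled
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Invented distribution over answer labels, for illustration only.
example = {"B": 0.7, "C": 0.2, "A": 0.06, "D": 0.04}
print(nucleus_filter(example, top_p=0.5))
print(nucleus_filter(example, top_p=1.0))
```

With a low top_p only the dominant token survives, while top_p 1 leaves every low-probability token eligible, which is where the occasional stray answer can come from.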
We experimented quite a lot with the temperature, top_p, and the prompt, and the questions were mostly factual and conceptual, yet we still saw the answers fluctuate.
I’ll try the approach of running CoT on each option and then critiquing the results.
That could be a feasible solution.