Measuring the similarity of test questions that are equivalent but presented with different scenarios or examples

I want to see how similar a bunch of test questions (written in Turkish) are in terms of what they’re really testing, regardless of the situations or examples they use. It might sound a bit like checking how similar the sentences are, but it’s actually quite different. For example, these next questions should be seen as very similar or the same, even though the words in the sentences aren’t very much alike:

Question 1: You are given two balls, one made of rubber and one made of steel. You drop them simultaneously from the same height. Which one will hit the ground first, and why?

Question 2: In an experiment, you have two identical balls, one made of plastic and one made of glass. You release both at the same time from the identical height. Which ball will reach the ground first, and what physical principle explains the outcome?

So, I want to measure the similarity in terms of the similarity of the very specific skill measured by the questions. Is this possible?

Welcome to the forum.

Likely with a one-shot or two-shot… maybe even a zero-shot…

What have you tried so far?

This is called “factor analysis” or “item analysis.” And it’s related to Item Response Theory.

This is not something I would trust to an LLM if it is important to get correct results as it requires some moderately sophisticated statistical methods to perform.

Here are some resources;

And here’s some instructions on how you might perform a factor analysis:

Python

R

That should be enough get you started.

It is possible the LLM in conjunction with access to a computing environment will be able to perform this task exceptionally well, but I don’t think any degree of prompting the API will help the model get there on its own.

1 Like

I have tried sentence-transformers and I am also investigating the github of OpenAI → openai/openai-cookbook/blob/main/text_comparison_examples.md

I thought of summarizing/refining the questions to be able to keep the essence of the question while removing the parts forming the specific example. I guess I may benefit from fine-tuning for this case.