You could certainly one-shot it: give it two questions of predetermined (experimentally determined) complexity, with gpt-4 or 3.5 as the expected result for each… maybe? Or multi-shot it with more examples. I think I'd actually just use a series of questions from, say, exams at various year levels, since those have usually been carefully graded for difficulty, and have it return a complexity score. You could even have a hysteresis zone where, even if the complexity only justifies 3.5, you select 4 out of caution.
I “think” what this is attempting to do is answer the no-brainer stuff with 3.5 and anything above that with 4.
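For what it's worth, here is a rough sketch of how I'd picture that in Python. Everything in it is an assumption on my part: the example questions and their scores, the 1 to 10 scale, the threshold of 5, the margin of 1, and the names `build_scoring_prompt` and `choose_model` are all made up for illustration. It only builds the few-shot prompt and picks a model from a score; the actual call to the model that returns the score is left out.

```python
# Hypothetical graded examples: (question, complexity score on a 1-10 scale).
# In practice these would come from exam papers at different year levels.
EXAMPLES = [
    ("What is 7 + 5?", 1),
    ("Solve x^2 - 5x + 6 = 0.", 4),
    ("Prove that the square root of 2 is irrational.", 8),
]


def build_scoring_prompt(question, examples=EXAMPLES):
    """Build a few-shot prompt asking for a complexity score for `question`."""
    lines = [
        "Rate the complexity of each question from 1 (trivial) to 10 (very hard).",
        "",
    ]
    for example_question, score in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"Complexity: {score}")
        lines.append("")
    # The new question goes last, with the score left blank for the model.
    lines.append(f"Q: {question}")
    lines.append("Complexity:")
    return "\n".join(lines)


def choose_model(score, threshold=5.0, caution_margin=1.0):
    """Pick a model from a complexity score.

    Scores inside the caution margin just below the threshold still go to
    the stronger model, so borderline questions err on the side of gpt-4.
    The threshold and margin are placeholder values, to be tuned by experiment.
    """
    if score >= threshold - caution_margin:
        return "gpt-4"
    return "gpt-3.5-turbo"


if __name__ == "__main__":
    print(build_scoring_prompt("Differentiate sin(x) * e^x."))
    for s in (2, 4.5, 7):
        print(s, "->", choose_model(s))
```

With these placeholder numbers, a score of 4.5 would only justify 3.5 on its own, but it falls inside the caution zone and gets sent to 4 anyway.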