First, there is nothing in this task that suggests an “assistant” (the Assistants API) is the correct path. It would only add unknown distractions to the scoring task at hand, which is best performed by a single call to an AI model through the completions or chat completions endpoint.
Secondly, you need an AI model that can answer the question at or beyond the quality of the provided or student answer. Right now, that is gpt-4-0613, or the earlier gpt-4-0314.
Then we must target the output you want: a single score. You are fighting an AI model that is a token predictor at heart, and it will follow patterns of answering as much as any reasoned thought about the answer.
It sounds like the fault is partly in not giving the AI the full picture. You haven’t given me the full picture either. Pretend you are a professor handing the job you want the AI to perform to a human teaching assistant like me.
To score, I’d want to know:
- the domain of knowledge,
- the student level and expected competency,
- current applicable coursework,
- and most of all: the question.
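As a sketch, a grading prompt might assemble those pieces like this. The field names, wording, and example values are all my own invention, not anything prescribed by the API:

```python
# Hypothetical prompt template gathering the context a human TA would want.
# Every field name and example value here is an illustrative assumption.
def build_grading_prompt(domain, level, coursework, question, student_answer):
    return (
        "You are grading a student answer.\n"
        f"Domain: {domain}\n"
        f"Student level: {level}\n"
        f"Current coursework: {coursework}\n"
        f"Question: {question}\n"
        f"Student answer: {student_answer}\n"
    )

prompt = build_grading_prompt(
    domain="Japanese Kofun-period history",
    level="undergraduate survey course",
    coursework="early Japanese funerary culture",
    question="What is a kofun?",
    student_answer="A kofun is a burial mound for ancient elites.",
)
print(prompt)
```

The point is that the context block, not the reference answer, is what lets the grader judge a completely different but correct response.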
If I’m as knowledgeable as the AI, then I don’t need the “right” answer (say, in Japanese Kofun-period history, to score a student’s essay about dougu or haniwa) – the “correct” answer is an encumbrance. A student may give a completely different but equally correct answer.
Then, if you really want a number, you have to play some token games. Single digits are individual tokens and, unlike most words, do not carry a leading space. We can make the output very likely to be only the score.
Here, I’m going to show an interesting completions technique: measuring token certainty with logprobs, using only a one-token answer. Everything up to the quotes within the JSON is text I input:
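The call looks roughly like this sketch. The prompt deliberately stops inside the JSON, right before the score token, so the model’s single next token is the score itself. Note the legacy completions endpoint is what exposes `logprobs` this way, and gpt-4 is chat-only, so an instruct model stands in below; the prompt wording and the illustrative logprob values (chosen to match the 85%/15% result described next) are my assumptions:

```python
import math

# Prompt ends just inside the JSON, before the score token, so the model's
# single next token completes the score ("0" or "1").
prompt = (
    "Grade the student answer as correct (1) or incorrect (0).\n"
    "Question: What is a kofun?\n"
    'Student answer: "A kofun is a burial mount where samurai were entombed."\n'
    '{"score": "'
)

# A real call would look like this (commented out: it needs an API key):
#
# from openai import OpenAI
# client = OpenAI()
# r = client.completions.create(
#     model="gpt-3.5-turbo-instruct",  # stand-in: gpt-4 is chat-only
#     prompt=prompt,
#     max_tokens=1,
#     logprobs=5,
# )
# top = r.choices[0].logprobs.top_logprobs[0]  # {token: logprob}

# Illustrative logprobs shaped like the result described in this post.
top = {"1": -0.1625, "0": -1.8971}

# Logprobs are natural logs; exponentiate to recover probabilities.
probs = {tok: math.exp(lp) for tok, lp in top.items()}
print({tok: round(p, 2) for tok, p in probs.items()})
```

Reading the distribution over the single score token, rather than the sampled text, is what gives you a confidence measure instead of a bare verdict.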
We get an answer that is 85% “1” and 15% “0”. The AI is wrong, though: samurai were not in the Kofun period, nor is there a “burial mount”. I have to give the AI a 15%.
An answer without errors – “A kofun is a burial mound where ancient dignitaries or elite were entombed.” – takes the correct score from 85.59% to 99.89%.
AI should not be making academic decisions in the absence of human oversight. Try explaining to a student that they got a poor mark because an AI said they were 86% correct… You can try this in a casual setting where AI judgement doesn’t really matter (like NOT a situation where an AI can ban ChatGPT accounts…)
(PS: gpt-4-1106-preview scores the poor answer “0” on either a 0–1 or 0–10 scale with the necessary description – the same problem you have. It also emits a completely unexpected and stupid markdown container, with no way to examine how sure the answer is.)
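If you’re stuck with a chat model that wraps its answer in such a markdown container, a small helper can peel off the fence before you parse the score. This is a sketch assuming the common triple-backtick wrapper; nothing about the fence format is guaranteed by the API:

```python
def strip_markdown_fence(text: str) -> str:
    """Remove a wrapping ``` fence (with optional language tag) if present."""
    lines = text.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith("```") and lines[-1].strip() == "```":
        return "\n".join(lines[1:-1]).strip()
    return text.strip()

reply = '```json\n{"score": 0}\n```'
print(strip_markdown_fence(reply))  # {"score": 0}
```

It leaves unfenced replies untouched, so it is safe to apply to every response before JSON parsing.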