ChatGPT 3.5 cannot grade a simple quiz

I have been having trouble prompting gpt-3.5-turbo-1106 to correctly grade a quiz question automatically. Any suggestions on how to improve the prompt?

------------ System Prompt -------------------

Task given to Student

Correct the grammatical errors in the following sentence:

Him and I didn’t went to school yesterday because it was closed.


  • Subject Pronouns (2 points): Corrects ‘Him’ to ‘He’.
  • Verb Tense (3 points): Corrects ‘didn’t went’ to ‘didn’t go’.

Student’s Answer

Him and I didn’t go to school yesterday because it was closed.

Your Task

For each requirement, check if the student’s answer is correct.

-------------- Response at Temperature 0 ----------------------

The student’s answer is correct:

  • Subject Pronouns: ‘Him’ is corrected to ‘He’.
  • Verb Tense: ‘didn’t went’ is corrected to ‘didn’t go’.

It seems ChatGPT is confused between what is required and what the student actually did.


Welcome to the forum.

Try giving it a one-shot example… Might also try temperature of .7 or even 1.0…


Thanks for your suggestions. The quiz question and the requirements are both dynamically inserted, so it will be hard to give a relevant example. Any more generic way of solving it through prompting?

As for temperature, I’m curious why you think a higher temperature would do better? I was using temperature of 0 since I want to get the most reliable response. Anyways, I did try different temperatures but the response was still wrong.


So, if this system prompt is given verbatim, I think I see what’s happening here.

A temperature of 0 means the output is essentially deterministic: the model always picks the most likely next token, which can cause it to regurgitate the query instead of meaningfully responding to it.

The more detailed and constructive your system prompt is, the better it will get at interpreting your goal. Especially with a temperature of 0, it might be reading “Requirements” as “This is exactly what I have to do: change the subject pronoun ‘Him’ to ‘He’, and correct the verb tense ‘didn’t went’ to ‘didn’t go’.” Think of it like a new student, who needs a lot of very clear, very explicit instructions to do something.

As a subtle aside, I do notice significant improvements in language understanding in GPT-4 over GPT-3.5. That is not to say GPT-3.5 is a complete bust for this, but the newer model does help.


You may find this paper of interest,

Note, they are creating a fine-tuned model to perform this task. So, there are three possible ways you could use this information,

  1. Glean some insights from the paper about prompting for rubric following
  2. Use the fine-tuned model they created locally or more likely on the cloud platform of your choice
  3. Fine-tune your own model on their public data set

Short of doing that, you might consider what @PaulBellow suggested and give it a one-shot to few-shot example using chain-of-thought to guide its reasoning.

What the specific rubric and submission are isn’t important; for each example give it,

  1. A rubric
  2. A sample submission to evaluate
  3. A detailed response where the “model” talks through the grading of each rubric point then assigns a final score.

You probably want at least two examples to have any chance at being effective.

You need at least one where the model finds several deficiencies and one where the submission gets full marks so the model doesn’t assume it must always find something wrong.

The few-shot examples will greatly increase your input token count, as you’re inputting n times the number of submissions every time, plus n rubrics and n sample responses. Conservatively, for a three-shot example you’d probably expect to use 5–15 times the number of input tokens, making your costs easily 3–5 times what they are now.
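The few-shot chain-of-thought setup described above can be sketched as message assembly for the chat API. This is only a sketch: the rubric and example texts are illustrative placeholders, and real examples should include at least one imperfect and one full-marks case, as noted above.

```python
# Sketch of a few-shot, chain-of-thought grading request.
# Rubric and submission texts are illustrative placeholders.

def build_messages(rubric, submission, examples):
    """Assemble a chat request: system prompt, worked examples, then the real task."""
    messages = [{
        "role": "system",
        "content": ("You are a grader. For each rubric item, reason step by "
                    "step about the student's answer, then assign points and "
                    "report a final score."),
    }]
    # Each worked example becomes a user/assistant pair the model can imitate.
    for ex_rubric, ex_submission, ex_grading in examples:
        messages.append({"role": "user",
                         "content": f"Rubric:\n{ex_rubric}\n\nSubmission:\n{ex_submission}"})
        messages.append({"role": "assistant", "content": ex_grading})
    # The real task comes last, in the same format as the examples.
    messages.append({"role": "user",
                     "content": f"Rubric:\n{rubric}\n\nSubmission:\n{submission}"})
    return messages

examples = [
    ("Pronoun (2 pts): corrects 'Him' to 'He'.",
     "Him and I didn't go to school yesterday.",
     "Pronoun: 'Him' was not changed to 'He', so 0/2. Final score: 0/2."),
    ("Pronoun (2 pts): corrects 'Him' to 'He'.",
     "He and I didn't go to school yesterday.",
     "Pronoun: 'Him' was corrected to 'He', so 2/2. Final score: 2/2."),
]
messages = build_messages(
    "Verb tense (3 pts): corrects 'didn't went' to 'didn't go'.",
    "He and I didn't go to school yesterday because it was closed.",
    examples,
)
```

The resulting `messages` list would be passed to the chat completions endpoint as usual.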

You could also just try breaking up the rubric and having it grade each rubric item one at a time so the attention mechanism doesn’t get overwhelmed.
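Breaking the rubric up might look like the following sketch: one small, focused prompt per rubric item, each sent as its own request, with the per-item results summed afterward. The prompt wording is illustrative, not a tested recipe.

```python
# Sketch: grade one rubric item at a time instead of the whole rubric at once.
rubric_items = [
    "Subject Pronouns (2 points): Corrects 'Him' to 'He'.",
    "Verb Tense (3 points): Corrects 'didn't went' to 'didn't go'.",
]
student_answer = "Him and I didn't go to school yesterday because it was closed."

# One focused prompt per rubric item; each would be its own chat request.
prompts = [
    (f"Rubric item: {item}\n\nStudent's answer:\n{student_answer}\n\n"
     "Did the student satisfy this single rubric item? "
     "Answer yes or no, then explain.")
    for item in rubric_items
]
```

The per-item answers would then be combined into a final score outside the model, which also removes the arithmetic step from the model's job.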


I would try following strategies.

  1. Play around with the order of the instructions. For me, the order is wonky. You may want to start with the model’s instructions first. Think how you would write the instructions to a person. That’s how the models were trained too.
  2. Differentiate more how you address the model (“you” seems to be strongest) and the student. Avoid mixing them at all costs.
  3. Try clearer separation of model and student instructions also in formatting. Consider using delimiters or something similar, or higher headline-level differentiation. Avoid putting model and student instructions in mixed order.

But in the end, GPT-3.5 is a bit of a crummy model when it comes to complex instruction following. A zero-shot strategy is unlikely to work, and giving one or more examples should help a lot.


Thank you all for the suggestions. They are very helpful!


I know ChatGPT 3.5 is different than 3.5 turbo, but reordering the prompt a bit and reducing the complexity of the instructions eventually produces the right answer:

Your task is to check if this student fixed all the grammar problems in an example sentence.

Example sentence with grammatical errors:
“””Him and I didn’t went to school yesterday because it was closed.”””

Student’s fixed sentence:
“””Him and I didn’t go to school yesterday because it was closed.”””

The student gets 2 points if they corrected “Him” to “He”, and they get 3 points if they corrected “didn’t went” to “didn’t go”.

Go slow and take your time examining the student’s sentence, then produce the student’s score.

AI Response:
The student’s fixed sentence is: “Him and I didn’t go to school yesterday because it was closed.”

In this sentence:

  • “Him” is not corrected to “He,” so they don’t get 2 points for that.
  • “didn’t went” is corrected to “didn’t go,” so they get 3 points for that.

The student’s score is 3 points out of a possible 5.

You might also try explicitly asking it to repeat the student’s sentence before grading it - I think that was a large part of why it got it right.


Thanks. I’ve been playing with the ordering and complexity. But unfortunately, your version also doesn’t work on the playground:

How about this prompt?

You are an expert English grammar assistant designed to check the given text for correct pronoun and tense use and output the results in JSON format. 
The system expects a string of text as input.

# JSON schema
{
  "inputText": {
    "type": "string",
    "description": "The text to be checked for grammar."
  },
  "correctText": {
    "type": "string",
    "description": "The text with correct grammar."
  },
  "isPronounCorrect": {
    "type": "boolean",
    "description": "Indicates whether the use of pronouns in the input text is correct."
  },
  "isTenseCorrect": {
    "type": "boolean",
    "description": "Indicates whether the use of tenses in the input text is correct."
  },
  "score": {
    "type": "number",
    "description": "Correct pronoun gets 2 points and correct tense of verbs gets 3 points. Sum the points to get the score."
  }
}

Using the student sentence as input

Him and I didn’t go to school yesterday because it was closed.

will result in

{
  "inputText": "Him and I didn’t go to school yesterday because it was closed.",
  "correctText": "He and I didn’t go to school yesterday because it was closed.",
  "isPronounCorrect": false,
  "isTenseCorrect": true,
  "score": 3
}
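One advantage of structured output like this is that it can be parsed and sanity-checked in code. A minimal sketch (straight quotes used for portability), including a consistency check that catches cases where the model's booleans and its score disagree:

```python
import json

# The example model output from this post, embedded as a string.
response_text = """{
  "inputText": "Him and I didn't go to school yesterday because it was closed.",
  "correctText": "He and I didn't go to school yesterday because it was closed.",
  "isPronounCorrect": false,
  "isTenseCorrect": true,
  "score": 3
}"""

result = json.loads(response_text)

# Recompute the score from the booleans (2 pts pronoun, 3 pts tense) and
# flag any disagreement with the model's own arithmetic.
expected_score = 2 * result["isPronounCorrect"] + 3 * result["isTenseCorrect"]
assert result["score"] == expected_score, "model's score disagrees with its booleans"
```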



Thanks. I confirm this works. And I guess it also confirms that the main issue is that ChatGPT can’t find the correct input when the input is more complex/less structured.

Another paper that may provide some insight is,

My hot-take on how you might employ this would be to design your rubric as a table to be filled out.

Each rubric item might have a spot for the model to enter a relevant portion of the text to be evaluated, several yes/no questions to answer about it, and a final column for the marks for that rubric item.

Then, once the table is completed the model assigns a final grade to the submission.


Thanks for the reference. From the look of it, this method does not seem very different from the JSON method: both are trying to give the data more structure and force the model to generate intermediate results before final answers.

One issue I see for both methods is that they require more work on the prompting itself, so it is almost impossible to ask end users to write the rubrics as these kinds of structured prompts.

This is where you get meta, my friend. Have the *model* make the rubric based on a rubric provided by the user.
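This meta approach could be sketched as a two-stage pipeline: stage 1 asks the model to rewrite the user's free-form rubric as a structured JSON schema, and stage 2 grades each submission against that schema. The prompt wording below is illustrative, not a tested recipe.

```python
# Sketch of the two-stage "meta" pipeline. Prompt wording is illustrative.

def rubric_to_schema_prompt(user_rubric):
    """Stage 1: ask the model to turn a free-form rubric into a JSON schema."""
    return ("Rewrite the following grading rubric as a JSON schema with one "
            "boolean property per rubric item, plus a numeric 'score' property "
            "whose description explains the point values:\n\n" + user_rubric)

def grading_prompt(schema, submission):
    """Stage 2: grade a submission against the schema produced by stage 1."""
    return ("Grade the submission against this schema and reply with JSON only.\n\n"
            f"# JSON schema\n{schema}\n\n# Submission\n{submission}")

stage1 = rubric_to_schema_prompt("Subject pronouns: 2 points. Verb tense: 3 points.")
# The schema returned by the stage-1 call would then be passed to
# grading_prompt() together with each student submission.
```

The end user only ever writes the free-form rubric; the structured prompt is generated for them.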


I tried this approach in a GPT-4 implementation with satisfactory results.

I’ve never used it with those models; I just know it’s flexible and can be used in a variety of ways. You might be able to simplify the process more than I did if you refine the wording a little.

Regarding the examples of correct answers: the “And I didn’t” part needed to be edited before giving it to GPT, but you can see in the examples what I set it to check.

  1. Him and I didn’t go to school yesterday because it was closed.
  2. He and I didn’t went to school yesterday because it was closed.
  3. He and I didnt go to school yesterday because it was closed.
  4. His and I didn’t went to school yesterday because it was closed.
  5. He and I didn’t go to school yesterday because it was closed.
  6. Him and I didn’t went to school yesterday because it was closed.

If you create a prompt that has the characteristics of a template, you’ll be better able to score complex things like:

[KEY1] and I didn’t [KEY2] to school yesterday because it was [KEY3].

From here you can use a variety of formats, including tables, Excel formulas, and more complex conditions.
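The template idea above can also be scored partly outside the model. A hypothetical sketch: each [KEYn] slot becomes a regex capture group, and points are awarded when the captured text matches the expected correction. The key names and point values are illustrative, taken from the quiz earlier in the thread.

```python
import re

# Hypothetical template-based scorer. Each [KEYn] slot in the template
# becomes a named regex capture group; points are awarded when the captured
# text matches the expected correction.
TEMPLATE = "[KEY1] and I didn't [KEY2] to school yesterday because it was [KEY3]."
EXPECTED = {"KEY1": ("He", 2), "KEY2": ("go", 3), "KEY3": ("closed", 0)}

def template_to_regex(template):
    """Escape the template, then turn each [KEYn] slot into a capture group."""
    pattern = re.escape(template)
    pattern = re.sub(r"\\\[(KEY\d+)\\\]", r"(?P<\1>.+?)", pattern)
    return re.compile(pattern + r"$")

def score(sentence):
    """Sum the points for every slot whose captured text matches the expectation."""
    match = template_to_regex(TEMPLATE).match(sentence)
    if not match:
        return 0
    return sum(points for key, (expected, points) in EXPECTED.items()
               if match.group(key) == expected)
```

A deterministic check like this could complement the model: the model handles free-form answers, while templated answers get scored exactly.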