Using GPT-3 to evaluate summaries

I have a database of letters and expert-written summaries.
For each letter I also have a machine-generated summary.
I want to use GPT-3 to evaluate the quality of the machine-generated one.

I’m currently trying the following prompt and am fishing for ideas on how to approach this problem, or how to write a more effective prompt. So far the results have been mostly incorrect.


ReferenceSummary:

You have to pay ##22 dollars## to ##the department of justice##.
You need to pay within ##2 weeks##.
You’ve already received a letter reminding you of this.
If you do not pay, we will send an invoice collector, which will be expensive.

EvaluationSummary:

You have to pay a bill to the department of justice.
You’ve already been sent a reminder.
If you do not pay, it will get more expensive.

Above are two summaries of a letter. The first is the reference summary; the second is the one you have to evaluate.

It is not important that the wording matches exactly. It is, however, important that the language is simple, clear and concise.

Important items have been marked with ## characters. If one is missing from the evaluation summary, subtract 50 points.

Please respond with a JSON object containing a 'score' key between 0 and 100 and a 'reason' key explaining why you gave that score.
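
For context, this is roughly how I call the API and parse the result. It is a minimal sketch using the legacy openai Python package (pre-1.0) with a text-davinci-003 completion; build_prompt is just a helper that stitches the two summaries into the prompt above, so treat the exact wiring as an assumption rather than my full pipeline.

import json
import openai

openai.api_key = "sk-..."  # my API key

def build_prompt(reference_summary, evaluation_summary):
    # Stitch the two summaries into the evaluation prompt shown above.
    return (
        "ReferenceSummary:\n\n" + reference_summary + "\n\n"
        "EvaluationSummary:\n\n" + evaluation_summary + "\n\n"
        "Above are two summaries of a letter. The first is the reference summary; "
        "the second is the one you have to evaluate.\n\n"
        "It is not important that the wording matches exactly. It is, however, "
        "important that the language is simple, clear and concise.\n\n"
        "Important items have been marked with ## characters. If one is missing "
        "from the evaluation summary, subtract 50 points.\n\n"
        "Please respond with a JSON object containing a 'score' key between 0 "
        "and 100 and a 'reason' key explaining why you gave that score."
    )

def evaluate(reference_summary, evaluation_summary):
    response = openai.Completion.create(
        model="text-davinci-003",   # GPT-3 completion model
        prompt=build_prompt(reference_summary, evaluation_summary),
        max_tokens=256,
        temperature=0,              # keep the scoring as repeatable as possible
    )
    text = response["choices"][0]["text"].strip()
    return json.loads(text)         # fails if the model does not return valid JSON

I keep temperature at 0 so repeated runs on the same letter give the same score, which makes it easier to compare prompt variants.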