Can anyone help me resolve an issue with using GPT5 for scoring?

I am working on a reward model task and want to use GPT5 as a scorer to obtain reliable data. However, with my current prompt, which scores on a 1-5 scale, the scores do not discriminate well between samples. I already take the mode of 5 outputs per sample. The confusion matrix of my scoring results is as follows. Are there any methods to solve this problem?
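For reference, the mode-of-5 aggregation and the confusion matrix mentioned above can be computed roughly as follows. This is a minimal sketch, not the exact pipeline: the function names `mode_score` and `confusion_matrix`, the tie-breaking rule (lowest score wins), and the 1-5 label set are assumptions made for illustration.

```python
from collections import Counter


def mode_score(samples):
    """Return the most frequent score among repeated outputs.

    Assumption: ties are broken in favour of the lowest score.
    """
    counts = Counter(samples)
    best = max(counts.values())
    return min(score for score, count in counts.items() if count == best)


def confusion_matrix(human, model, labels=(1, 2, 3, 4, 5)):
    """Rows are human scores, columns are model scores (hypothetical layout)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for h, m in zip(human, model):
        matrix[index[h]][index[m]] += 1
    return matrix


# Five sampled scores for one document pair, reduced to a single label.
final_score = mode_score([4, 4, 3, 4, 5])
```

A matrix whose mass concentrates in one or two columns regardless of the row is the pattern that poor discrimination would produce.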

Here is my prompt:

## Mission

- Focus on discrepancies in text formatting, images, and tables, and report the exact locations of errors (e.g. paragraph, table cell, or image region).

- Do not compare the text content itself—only the layout, alignment, spacing, and image issues.

- Ignore slight differences, such as tiny pixel shifts or nearly imperceptible spacing changes.

- Each major category (Text / Tables / Images) will receive one score.




## Evaluation Criteria




### Context-Sensitive Scoring Guidelines

Apply scoring with awareness of document purpose and typical usage:

- Technical/formal documents (contracts, forms) require higher precision

- Informational documents (newsletters, reports) allow moderate flexibility

- Creative documents (brochures, posters) permit stylistic variations




### Scoring Rules

- 5 points – Perfect reproduction or element not present in either document

- 4 points – Minor issues, overall acceptable for practical use

- 3 points – Noticeable issues that should be corrected but don't impair document function

- 2 points – Multiple issues affecting layout quality and usability

- 1 point – Major issues, overall not acceptable

- 0 points – Category missing or completely incorrect




### Tolerance Levels

1. **High Tolerance (4 points)**

   - Slight spacing variations that maintain overall rhythm

   - Minor alignment shifts that don't affect perceived grouping

   - Subtle differences in element proportions

   - Small deviations in margins or padding

   - Slight font weight or size differences




2. **Moderate Tolerance (3 points)**

   - Noticeable but non-disruptive spacing changes

   - Clear but non-functional alignment shifts

   - Visible but non-confusing changes to element relationships

   - Detectable but non-problematic cell size variations




3. **Low Tolerance (2 points or below)**

   - Missing content or structural elements

   - Disrupted logical grouping of information

   - Severe misalignment affecting readability

   - Merged/split cells that alter table interpretation

   - Element overlaps or truncations




### Text Layout and Spacing:

1. **Indentation:** Check for inconsistent paragraph indentation.

   - Example: "Page 2 Paragraph 3: the indentation is inconsistent."




2. **Text Alignment:** Check text alignment—left, centered, or right.

   - Example: "Page 4 Line 7: Text misaligned from left to center."




3. **Line Spacing:** Check for uneven line spacing.

   - Example: "Page 3 Lines 5-7: Line spacing is too large."




4. **Paragraph Spacing:** Ensure consistent space between paragraphs.

   - Example: "Page 5: Excessive space between Paragraphs 4 and 5."




### Image Evaluation:

1. **Image Integrity:** Identify any missing, distorted, or cropped images.

   - Example: "Page 3: Image cropped at the top."




2. **Image Position:** Report shifts in image positioning.

   - Example: "Page 5: Image shifted 1 cm downward."




3. **Special Elements:** Report any missing or misaligned elements like diagrams or formulas.

   - Example: "Page 2: Formula misaligned with text."




4. **Signatures:** Verify clarity and placement of printed or handwritten signatures.

   - Example: "Page 1 bottom-right corner: Signature cut off."




### Table Evaluation:

1. **Cell Span:** Check whether cells are merged correctly.

   - Example: "In the original image, cells in column 3 span 5 rows, but in the generated image, they only span 3 rows."




2. **Row/Column Structure:** Report any missing or extra rows/columns.

   - Example: "Table 1: Row 2 is missing in the generated table."




3. **Cell Size:** Ensure correct cell sizes.

   - Example: "Table 1 Row 1 Column 1: cell size larger than original."




4. **Borders:** Inspect table borders and grid lines.

   - Example: "Table missing right border between Rows 4 and 5."




5. **Cell Alignment:** Check text alignment within cells.

   - Example: "Table 3 Row 4 Column 5: Text left-aligned instead of centered."




## Output Format:

- Explain your reasoning approach considering the document type and purpose.

- **Text Layout and Spacing Errors:** Detail the errors, specify locations, and explain why they are errors. Provide the category score in this format: <Text_score>X</Text_score>

- **Image Errors:** Detail errors in image positioning or integrity, specify locations, and explain why they are errors. Provide the category score in this format: <Image_score>X</Image_score>

- **Table Errors:** Detail errors in table layout, cell span, or content, and specify locations. Provide the category score in this format: <Table_score>X</Table_score>

- **Confidence Level:** Rate your confidence in your evaluation as High, Medium, or Low and explain why: <Confidence>Level: Explanation</Confidence>
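Outside the prompt itself, the tagged scores in this output format can be extracted with a small parser. The sketch below is an assumption about how that might look, not the code actually in use: `parse_scores` and `parse_confidence` are hypothetical helper names, and accepting only the digits 0-5 follows the Scoring Rules above.

```python
import re

# Tag names taken from the Output Format section of the prompt.
SCORE_TAGS = ("Text_score", "Image_score", "Table_score")


def parse_scores(response):
    """Return {tag: int or None}; None means the tag is missing or out of range."""
    scores = {}
    for tag in SCORE_TAGS:
        match = re.search(rf"<{tag}>\s*([0-5])\s*</{tag}>", response)
        scores[tag] = int(match.group(1)) if match else None
    return scores


def parse_confidence(response):
    """Return 'High', 'Medium', or 'Low', or None if the tag is absent."""
    match = re.search(r"<Confidence>\s*(High|Medium|Low)\b", response, re.IGNORECASE)
    return match.group(1).capitalize() if match else None


example = (
    "Text is fine. <Text_score>4</Text_score> "
    "<Image_score> 5 </Image_score> "
    "<Confidence>High: both pages are clear</Confidence>"
)
parsed = parse_scores(example)
```

Returning `None` for a missing or malformed tag makes it easy to drop that sample from the mode calculation instead of silently counting it as a score.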

Additionally, I’ve noticed that GPT5 sometimes identifies overly subtle errors that are completely imperceptible to the human eye, resulting in lower scores.

Thank you for your help!


Have you thought about using embeddings and cosine similarity instead? You could keep a set of benchmark examples with known scores, then give each input the score of the benchmark whose embedding is closest to it.
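A minimal sketch of that idea, assuming the embeddings have already been computed by some embedding model (not shown here). The names `cosine_similarity` and `score_by_nearest_benchmark` are hypothetical, and the three-dimensional vectors are toy values for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def score_by_nearest_benchmark(embedding, benchmarks):
    """benchmarks is a list of (embedding, score) pairs with known scores.

    Returns the score of the closest benchmark and the similarity to it.
    """
    best_vec, best_score = max(
        benchmarks, key=lambda item: cosine_similarity(embedding, item[0])
    )
    return best_score, cosine_similarity(embedding, best_vec)


# Toy benchmark set: one reference embedding per known score.
benchmarks = [([1.0, 0.0, 0.0], 5), ([0.0, 1.0, 0.0], 3), ([0.0, 0.0, 1.0], 1)]
score, similarity = score_by_nearest_benchmark([0.9, 0.1, 0.0], benchmarks)
```

The returned similarity can double as a rough confidence signal: a low value suggests the input is unlike every benchmark and may need a manual look.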
