I am working on a reward model task and want to use GPT5 as a scorer to obtain reliable data. However, when I score on a 1-5 scale with my current prompt, the scores discriminate poorly between quality levels. I already take the mode of 5 sampled outputs for each item. The confusion matrix of my scoring results is as follows. Are there any methods to solve this problem?
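For concreteness, the mode-of-5 aggregation mentioned above can be sketched as follows. This is only an illustrative sketch, not necessarily the exact pipeline: the function name `aggregate_mode` and the tie-breaking rule (prefer the tied score closest to the median of the samples, then the lower score) are assumptions introduced here, since 5 samples can produce a tie such as two 3s and two 4s.

```python
from collections import Counter
from statistics import median


def aggregate_mode(scores):
    """Return the most frequent score among the sampled outputs.

    Assumed tie-break (hypothetical, not from the original setup):
    among equally frequent scores, pick the one closest to the median
    of all samples, and the lower score if that is still tied.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    counts = Counter(scores)
    top = max(counts.values())
    tied = [s for s, c in counts.items() if c == top]
    mid = median(scores)
    return min(tied, key=lambda s: (abs(s - mid), s))


# Example: five sampled scores for one category of one document pair.
final_score = aggregate_mode([4, 4, 3, 5, 4])
```

Counting how often ties occur across a dataset may also give a rough signal of how unstable the scorer is on a given item.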
Here is my prompt:
## Mission
- Focus on discrepancies in text formatting, images, and tables, and report the exact locations of errors (e.g. paragraph, table cell, or image region).
- Do not compare the text content itself—only the layout, alignment, spacing, and image issues.
- Ignore slight differences, such as tiny pixel shifts or nearly imperceptible spacing changes.
- Each major category (Text / Tables / Images) will receive one score.
## Evaluation Criteria
### Context-Sensitive Scoring Guidelines
Apply scoring with awareness of document purpose and typical usage:
- Technical/formal documents (contracts, forms) require higher precision
- Informational documents (newsletters, reports) allow moderate flexibility
- Creative documents (brochures, posters) permit stylistic variations
### Scoring Rules
- 5 points – Perfect reproduction or element not present in either document
- 4 points – Minor issues, overall acceptable for practical use
- 3 points – Noticeable issues that should be corrected but don't impair document function
- 2 points – Multiple issues affecting layout quality and usability
- 1 point – Major issues, overall not acceptable
- 0 points – Category missing or completely incorrect
### Tolerance Levels
1. **High Tolerance (4 points)**
- Slight spacing variations that maintain overall rhythm
- Minor alignment shifts that don't affect perceived grouping
- Subtle differences in element proportions
- Small deviations in margins or padding
- Slight font weight or size differences
2. **Moderate Tolerance (3 points)**
- Noticeable but non-disruptive spacing changes
- Clear but non-functional alignment shifts
- Visible but non-confusing changes to element relationships
- Detectable but non-problematic cell size variations
3. **Low Tolerance (2 points or below)**
- Missing content or structural elements
- Disrupted logical grouping of information
- Severe misalignment affecting readability
- Merged/split cells that alter table interpretation
- Element overlaps or truncations
### Text Layout and Spacing:
1. **Indentation:** Check for inconsistent paragraph indentation.
- Example: "Page 2 Paragraph 3: the indentation is inconsistent."
2. **Text Alignment:** Check text alignment—left, centered, or right.
- Example: "Page 4 Line 7: Text misaligned from left to center."
3. **Line Spacing:** Check for uneven line spacing.
- Example: "Page 3 Lines 5-7: Line spacing is too large."
4. **Paragraph Spacing:** Ensure consistent space between paragraphs.
- Example: "Page 5: Excessive space between Paragraphs 4 and 5."
### Image Evaluation:
1. **Image Integrity:** Identify any missing, distorted, or cropped images.
- Example: "Page 3: Image cropped at the top."
2. **Image Position:** Report shifts in image positioning.
- Example: "Page 5: Image shifted 1 cm downward."
3. **Special Elements:** Report any missing or misaligned elements like diagrams or formulas.
- Example: "Page 2: Formula misaligned with text."
4. **Signatures:** Verify clarity and placement of printed or handwritten signatures.
- Example: "Page 1 bottom-right corner: Signature cut off."
### Table Evaluation:
1. **Cell Span:** Check whether cells are merged correctly.
- Example: "In the original image, cells in column 3 span 5 rows, but in the generated image, they only span 3 rows."
2. **Row/Column Structure:** Report any missing or extra rows/columns.
- Example: "Table 1: Row 2 is missing in the generated table."
3. **Cell Size:** Ensure correct cell sizes.
- Example: "Table 1 Row 1 Column 1: cell size larger than original."
4. **Borders:** Inspect table borders and grid lines.
- Example: "Table missing right border between Rows 4 and 5."
5. **Cell Alignment:** Check text alignment within cells.
- Example: "Table 3 Row 4 Column 5: Text left-aligned instead of centered."
## Output Format:
- Explain your reasoning approach considering the document type and purpose.
- **Text Layout and Spacing Errors:** Detail the errors, specify locations, and explain why they are errors. Provide the category score in this format: <Text_score>X</Text_score>
- **Image Errors:** Detail errors in image positioning or integrity, specify locations, and explain why they are errors. Provide the category score in this format: <Image_score>X</Image_score>
- **Table Errors:** Detail errors in table layout, cell span, or content, and specify locations. Provide the category score in this format: <Table_score>X</Table_score>
- **Confidence Level:** Rate your confidence in your evaluation as High, Medium, or Low and explain why: <Confidence>Level: Explanation</Confidence>
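The tagged output format above can be parsed with a few regular expressions. The sketch below is a minimal illustration under stated assumptions: the helper names `parse_scores` and `parse_confidence` are hypothetical, scores are assumed to be single digits from 0 to 5 as listed in the scoring rules, and a missing or malformed tag is returned as `None` rather than guessed.

```python
import re

# Tag names taken from the output format in the prompt.
SCORE_TAGS = ("Text_score", "Image_score", "Table_score")


def parse_scores(response):
    """Extract the three category scores from one model response.

    Returns a dict mapping tag name to an int, or None when the tag
    is absent or does not contain a single digit in the range 0-5.
    """
    scores = {}
    for tag in SCORE_TAGS:
        match = re.search(rf"<{tag}>\s*([0-5])\s*</{tag}>", response)
        scores[tag] = int(match.group(1)) if match else None
    return scores


def parse_confidence(response):
    """Extract the confidence level (High, Medium, or Low), if present."""
    match = re.search(
        r"<Confidence>\s*(High|Medium|Low)\b", response, re.IGNORECASE
    )
    return match.group(1).capitalize() if match else None


# Example with a made-up response fragment (illustrative only).
sample = (
    "Text looks fine. <Text_score>4</Text_score>\n"
    "No images. <Image_score> 5 </Image_score>\n"
    "<Confidence>high: both pages are clearly legible</Confidence>"
)
parsed = parse_scores(sample)
confidence = parse_confidence(sample)
```

Returning `None` for unparseable tags keeps failed generations out of the vote instead of silently counting them as a score.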
Additionally, I’ve noticed that GPT5 sometimes flags errors so subtle that they are imperceptible to the human eye, which results in lower scores.
Thank you for your help!
