How to Enforce Self-Verification in OpenAI API Responses? Seeking Prompt Engineering Advice

Hello OpenAI Developer Community,

I’m working on an application that requires precise, structured retrieval of watch specifications keyed to exact reference numbers. Although my prompts are designed to enforce accuracy, I’m still seeing confidently incorrect API responses, because the model relies on pattern-based generation rather than strict validation.

Key Context

  • When I use ChatGPT’s web interface, the model can self-correct if I challenge it on an incorrect response.
  • However, when using the OpenAI API, the model does not self-correct because it does not have real-time web search capabilities.
  • This means that if the model generates an incorrect response in the API, it remains incorrect—there’s no built-in re-verification step.

Problem Summary
I designed a strict retrieval-focused API prompt to enforce verification, yet the model still hallucinates missing details or fails to cross-check its response before outputting data.

Example Failure Case
Prompt:
Search for the most precise and official specifications of the Rolex 126719BLRO-0003. The goal is to return the best single match, ensuring that all details are correct based on Rolex’s official site or trusted sources (e.g., Bob’s Watches, WatchBase, WatchBox, Chrono24).

Expected API Behavior:

  • The model should only return information it can verify from structured knowledge.
  • If uncertain, it should mark fields as “unknown” rather than assuming.

Actual API Behavior:

  • It incorrectly states that the watch has a meteorite dial (which actually belongs to 126719BLRO-0002).
  • Unlike the ChatGPT web UI, the API does not self-correct when challenged.
    This suggests that the model is generating an assumption-based response rather than performing a second verification pass.

Current API Prompt for Retrieval
To enforce accuracy, I structured my API prompt as follows:
Search for the most precise and official specifications of the [BRAND] [REFERENCE NUMBER]. The goal is to return the best single match, ensuring that all details are correct based on [BRAND] official site or trusted sources (e.g., Bob’s Watches, WatchBase, WatchBox, Chrono24).

Matching Prioritization Step 1:
First priority → Exact [reference number] match from [BRAND] official sources ([BRAND] website preferred). Mark "exactMatch": true.
Second priority → If an exact match is not available, retrieve details from trusted third-party sources (Bob’s Watches, WatchBase, Chrono24, WatchBox). Mark "exactMatch": false.
Third priority → Closest matching variant within the [BRAND] [MODEL] collection. Mark "exactMatch": false.

Matching Prioritization Step 2: Mandatory Verification Pass

  • RE-CHECK the initial response against all available sources again.
  • If discrepancies are found, correct them immediately and return the updated information.
  • If the reference is incorrect, discard the response and return the closest matching variant within the [BRAND] [MODEL] collection. Mark "exactMatch": false.
  • Mark any non-exact fields with a ‘⚠️’ warning.
  • Do not assume missing fields—use "unknown" instead.

Required JSON Output:
{
  "exactMatch": true/false,
  "referenceNumber": "(confirmed reference number, in case of fallback)",
  "modelName": "…",
  "caseMaterial": "…",
  "dialColor": "…",
  "movement": "…",
  "powerReserve": "…",
  "waterResistance": "…",
  "braceletType": "…",
  "bezel": "…",
  "price": "…",
  "sources": ["…"]
}
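
For context, this is roughly how I’m sending that prompt today (a simplified sketch: the model name and temperature are just what I happen to use, and RETRIEVAL_PROMPT stands in for the full prompt text above):

```python
from openai import OpenAI

client = OpenAI()

# Stand-in for the full retrieval prompt shown above, with [BRAND],
# [MODEL], and [REFERENCE NUMBER] already substituted.
RETRIEVAL_PROMPT = "..."

response = client.chat.completions.create(
    model="gpt-4o",   # placeholder; I see the same behavior on other chat models
    temperature=0,    # reduces variance, but not hallucination
    # JSON mode guarantees syntactically valid JSON, nothing more.
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": RETRIEVAL_PROMPT},
        {"role": "user", "content": "Rolex 126719BLRO-0003"},
    ],
)

first_answer = response.choices[0].message.content  # valid JSON, wrong facts
```

JSON mode keeps the output parseable, but as far as I can tell it does nothing for factual accuracy, which is exactly where this breaks down.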

Issues Encountered Despite This Prompt Design

  1. The API still returns confidently incorrect information rather than verifying its response before output.
    • Since the API cannot search the web, it must rely on its internal knowledge base.
    • However, it does not properly validate structured identifiers (e.g., reference numbers) before responding.

  2. The verification step is not actually being executed.
    • I explicitly instruct the model to re-check its own response, but it does not seem to do so.
    • This means it confirms its first attempt rather than reprocessing its output for errors.

  3. The API does not self-correct like the ChatGPT web interface.
    • If I ask ChatGPT (via the web UI) to re-evaluate, it can find the correct answer.
    • But in the API, the response remains incorrect even if I prompt it again in a new request.
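
For completeness, this is how I challenge the model through the API, feeding its first answer back as an assistant turn (same client and RETRIEVAL_PROMPT as in the sketch above; first_answer is the JSON string from the initial call):

```python
# Feed the first answer back and explicitly ask for a re-check. In the web UI
# this kind of challenge often triggers a correction; via the API the model
# typically just re-affirms its original (wrong) meteorite-dial answer.
challenge = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    temperature=0,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": RETRIEVAL_PROMPT},
        {"role": "user", "content": "Rolex 126719BLRO-0003"},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": (
            "Re-check every field for this exact reference. Replace any "
            'field you cannot confirm with "unknown", set "exactMatch" to '
            "false if anything changed, and return the corrected JSON."
        )},
    ],
)
```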

Request for Help
How can I force the API to recheck its response before finalizing the output?

  • Are there prompt design techniques that simulate a multi-step self-validation process within a single API call?
  • Would breaking this into multiple API calls (e.g., retrieval first, then validation in a separate step) be more effective? I’ve sketched the split I have in mind after this list.
  • Is there a way to prevent the model from making assumptions when faced with structured identifiers (like reference numbers)?
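
To make the second bullet concrete, this is the retrieval-then-validation split I have in mind (a sketch under my assumptions; CRITIC_PROMPT and verify_spec are hypothetical names, not anything I’ve validated):

```python
import json

CRITIC_PROMPT = (
    "You are a verification pass. You will receive a JSON spec for a watch "
    "reference. For each field, keep the value only if you can independently "
    'confirm it for that exact reference; otherwise replace it with "unknown" '
    'and set "exactMatch" to false. Return only the corrected JSON.'
)

def verify_spec(draft_json: str) -> dict:
    """Second API call: a fresh context audits the first call's draft."""
    result = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": CRITIC_PROMPT},
            {"role": "user", "content": draft_json},
        ],
    )
    return json.loads(result.choices[0].message.content)
```

My hope is that a critic call with no investment in the first draft is more willing to downgrade fields to "unknown", but I don’t know whether this actually works better in practice; that’s part of what I’m asking.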

Since the API does not have live internet access, I need a way to force stricter internal verification to ensure responses are factually accurate rather than assumption-driven.
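
The one thing I can verify deterministically is the structured input itself, so at minimum I’m adding a client-side guard that rejects any response whose echoed reference number doesn’t match the one requested (field names follow my JSON schema above):

```python
def reference_matches(requested: str, spec: dict) -> bool:
    """Client-side guard: the reference number is the only field I can check
    without web access, by comparing it against what I asked for."""
    returned = str(spec.get("referenceNumber", "")).strip().upper()
    return returned == requested.strip().upper()

spec = verify_spec(first_answer)  # from the sketch above
if not reference_matches("126719BLRO-0003", spec):
    spec["exactMatch"] = False  # downgrade rather than trust a mismatch
```

This doesn’t catch the meteorite-dial case, where the reference was echoed correctly but a field was wrong; it only blocks outright reference substitutions, which is why I still need a real verification strategy.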

I appreciate any guidance on how to refine this prompt (or workflow) to improve accuracy.

Thanks in advance!

Best,
Doug