Usage of Prompt ID, different results for Python library and web interface

Hello everyone.

We’re building a bot that assesses offer descriptions provided by sellers on a marketplace. Based on the description, the bot should answer whether we should accept the offer or not (simply YES or NO, depending on whether the description is good). We created a prompt with a specific version via the web interface and obtained its ID. We then send offer descriptions through the Python library:

```python
import openai

# ...

self.chat_gpt_client = openai.OpenAI(api_key=API_KEY)

# ...

def calculate_decision(self,
                       description: str,
                       seller_username: str) -> str:
    prompt = {
        "id": CHAT_GPT_PROMPT_ID,
        "version": str(CHAT_GPT_PROMPT_VERSION)
    }
    gpt_input = (
        f"mode: NORMAL\n"
        f"seller_acceptable: true\n"
        f"condition_description: {description}"
    )
    response = self.chat_gpt_client.responses.create(
        model=CHAT_GPT_MODEL,
        reasoning={"effort": "low"},
        prompt=prompt,
        input=gpt_input
    )
```

CHAT_GPT_MODEL = "gpt-5-nano".

Our initial prompt:

You are a strict book-condition compliance classifier.

Two modes (controlled by input field “mode”):

  • NORMAL → output MUST be exactly one token: YES or NO or UNDECIDED
  • AUDIT → output MUST be compact JSON only

Input format:
mode: NORMAL|AUDIT
seller_acceptable: true|false
condition_description:

Task:
Decide whether the description supports treating the item as clean and mark-free
(no notes/writing/markings/highlighting/underlining),
not ex-library,
and free from serious structural defects.

Decision priority:

  1. HARD NO
  2. NEGATION
  3. YES
  4. UNDECIDED

HARD NO (any match → NO)

A) Markings risk or uncertainty:
notes, writing, markings, highlight, underline, annotation
WITH OR WITHOUT hedge words:
may, might, could, possible, potential, some, few, limited, occasionally

Examples:

  • “may include notes”
  • “potential for light notes”
  • “limited highlighting”

B) Ex-library indicators:
ex-library, from the library of,
library stamp/sticker/label/marks, withdrawn

C) Serious structural defects:
missing/loose/torn pages,
broken binding, water damage,
mold/mildew, odor/smoke/musty

If both negated and non-negated forms appear,
the non-negated or uncertain wording wins → NO.


NEGATION (prevents false NO)

Explicit negation cancels markings risk only when:

  • “no writing”
  • “no notes”
  • “no markings”
  • “no highlighting”
  • “no underlining”
  • “unmarked”
  • “clean and unmarked”
  • “pages are clean and unmarked”

Negation does NOT override hedge/uncertainty phrases.

Example:
“No highlighting. Shelf wear present.” → not NO.


YES

Return YES only if:

  • Explicit strong negation of markings exists
    AND
  • No HARD NO rules apply.

Generic phrases like “great condition” are not enough.

seller_acceptable:

  • true does NOT auto-YES.
  • false does NOT auto-NO.
    Decision is based on text content only.

UNDECIDED

Return UNDECIDED if:

  • No clear negation
  • No clear markings risk
  • Text is vague or ambiguous

OUTPUT

NORMAL:
Return exactly:
YES
NO
UNDECIDED

No explanation.

AUDIT (JSON only; compact):

{
  "d": "YES|NO|UNDECIDED",
  "r": ["ID1", "ID2"],
  "n": true|false,
  "e": ["frag1", "frag2"]
}

Allowed rule IDs:

  • “NO:mark_risk”
  • “NO:library”
  • “NO:damage”
  • “YES:no_marks”
  • “UN:insufficient”
  • “UN:ambiguous”

Constraints:

  • r: max 2 IDs
  • e: max 2 short fragments (<= 8 words each)
  • no extra keys

When we use the web interface, we mostly get the right results. Through the Python library, however, we get strange results: for example, for an offer description that is obviously good, the web interface returns YES but the Python library returns NO. How can we solve this issue?

Thanks!

Important to understand: the platform's “chat” playground for the Responses API does not use your prompt ID (nor create a temporary one) for testing until you save and reload; until then, it is only a simulation.

“Get code” will show you how to make that request (with no option to revert to a normal request without an ID):

The reasoning effort is also saved with the prompt ID, and you seem to be overriding it with your own. You can capture the full API response to see the other echoed input fields and check whether they match what you expect.

Finally: any reasoning model runs at its own internal high temperature, the opposite of deterministic. Reasoning should help it figure out the right answer, but the internal reasoning text varies, and its randomly varying length adds further randomness to the output. If you ask for a boolean answer in an ambiguous situation, you will get opposite answers a good portion of the time, each appearing equally confident in its correctness. Using gpt-5-nano is using a random token factory.

I would use a structured output (json_schema) with gpt-4.1, and provide your own JSON string fields that the AI must fill in with its discoveries first: you demand to see the reasoning and justification before the answer. The final key in the schema, filled last, is then the decision that has already been articulated and reasoned through, giving you fuller observation of how it was reached.
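A minimal sketch of such a schema, assuming the Responses API's strict json_schema output format (the field names and descriptions here are my own choices, and the API call itself is shown only as a comment):

```python
import json

# Structured-output schema for strict mode. Property order matters:
# the model emits keys in order, so it must write out its evidence and
# reasoning before it is allowed to commit to the "decision" key.
DECISION_SCHEMA = {
    "name": "condition_decision",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "evidence": {
                "type": "string",
                "description": "Fragments from the description about condition",
            },
            "reasoning": {
                "type": "string",
                "description": "Which rules apply and why",
            },
            "decision": {
                "type": "string",
                "enum": ["YES", "NO", "UNDECIDED"],
            },
        },
        "required": ["evidence", "reasoning", "decision"],
        "additionalProperties": False,
    },
}

# With the Responses API this would be passed roughly as:
#
#   response = client.responses.create(
#       model="gpt-4.1",
#       input=gpt_input,
#       text={"format": {"type": "json_schema", **DECISION_SCHEMA}},
#   )
#   result = json.loads(response.output_text)
#
print(json.dumps(DECISION_SCHEMA, indent=2))
```

With `"strict": True` and `additionalProperties: False`, the model cannot skip the reasoning fields or add extra keys, so every decision arrives with its justification attached.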

PS: Nothing on the API is called “ChatGPT”.
