Issue with API Suddenly Giving the Same Code to Very Different Documents

I have been using GPT's API to code document clarity. I have run the same group of documents with multiple similar instruction sets and always get variation between the documents in how they are rated. The ratings are generally pretty stable. This was October-December 2024. Now I am trying to run the same code and it assigns the same clarity rating to every document, even though they are clearly different and not even a month ago GPT could distinguish between them with these instructions. Does anyone know what could be going on?

from tenacity import retry, stop_after_attempt, wait_exponential
import openai

@retry(wait=wait_exponential(multiplier=1, min=10, max=120), stop=stop_after_attempt(10))
def classify_document_clarity(document):
    clarity_instructions = """
    Role: You are an expert in evaluating the clarity of political policy statements. Your task is to rate how clearly the speaker's own national or international policy positions are presented in the document, not their descriptions of others' positions or the current state of policy. Take your time to think through your response.

    Output only the final numerical score.
    """
    try:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": clarity_instructions},
                {"role": "user", "content": f"Document:\n{document}"}
            ],
            temperature=1
        )
        print(response)
        clarity_rating = response.choices[0].message.content
        print(clarity_rating)
        return int(clarity_rating)

    except openai.RateLimitError as e:
        print(f"Rate limit error in classify_document_clarity: {e}")
        raise  # Reraise the exception for retry

    except openai.APIError as e:
        print(f"API error in classify_document_clarity: {e}")
        raise  # Reraise for retry

    except Exception as e:
        print(f"Unexpected error: {e}")
        raise  # Reraise for retry

Welcome to the community!

There are a couple of things that come to mind:

  1. Are you actually sending different documents? I had a similar incident where it turned out that, due to a bug, I was always sending the same document. (A quick way to check is sketched after this list.)

  2. Did it actually work before? Looking at the prompt, "Take your time to think through your response." combined with "Output only the final numerical score." isn't actually a useful instruction, unless it's intended as a deceptive prompting strategy; the model has nowhere to do that thinking when it's only allowed to output a single number.

  3. OpenAI often makes changes to the models. Using gpt-4o-mini rather than a fixed snapshot, e.g. gpt-4o-mini-2024-07-18, is a little risky, because they might swap out the model without you knowing. However, according to https://platform.openai.com/docs/models#gpt-4o-mini, it doesn't look like they changed major model versions. That's not to say they don't occasionally perform ninja tweaks without telling anyone anyway. So this particular issue isn't something that would have been within your control.
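
On point 1, a quick sanity check you could drop in before anything gets sent: log a short fingerprint of each document so you can confirm they really differ. This is just a sketch; documents here stands in for whatever iterable you actually loop over.

import hashlib

def fingerprint(document):
    # Short, stable fingerprint of the raw text
    return hashlib.sha256(document.encode("utf-8")).hexdigest()[:12]

for i, doc in enumerate(documents):  # `documents` is a placeholder for your own list
    print(i, fingerprint(doc), doc[:80].replace("\n", " "))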

What I would suggest, if it's in your budget, is to use a CoT approach (perhaps using a JSON schema) that provides reasoning first and a score later. This way it's easy to "debug" "why" you're not getting the response you're expecting.
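
For example, here's a minimal sketch of that reasoning-then-score pattern using Structured Outputs with a JSON schema. The field names, the pinned snapshot, and the temperature are assumptions to illustrate the idea, not a drop-in replacement for your function.

import json
import openai

cot_instructions = """
Role: You are an expert in evaluating the clarity of political policy statements.
Briefly explain your reasoning first, then give the final clarity score.
"""

clarity_schema = {
    "name": "clarity_rating",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            # reasoning is listed before score so the explanation is generated first
            "reasoning": {"type": "string"},
            "score": {"type": "integer"}
        },
        "required": ["reasoning", "score"],
        "additionalProperties": False
    }
}

def classify_with_reasoning(document):
    response = openai.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",  # pinned snapshot, see point 3
        messages=[
            {"role": "system", "content": cot_instructions},
            {"role": "user", "content": f"Document:\n{document}"}
        ],
        response_format={"type": "json_schema", "json_schema": clarity_schema},
        temperature=0  # lower variance for a grading task; your call
    )
    result = json.loads(response.choices[0].message.content)
    print(result["reasoning"])  # inspect why the model chose the score
    return result["score"]

The reasoning field is purely for you: when a score looks off, you can read how the model interpreted the document, which is usually much faster than guessing from the number alone.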