Quality of response between gpt-4-1106-preview and gpt-4o

Hey all,

We’ve been using gpt-4-1106-preview for the past several months and have been getting high-quality JSON responses in an academic solution. We generate a batch of questions grouped by type, along with their correct answers, all within the JSON. As per the documentation, several settings were tuned to get high-quality and reliable JSON output.

However, because it is the popular choice and has the 128k token limit, we were forced to move our entire application to the brand new GPT-4o model. And we were in for a surprise, because this new model is lightning fast everywhere. However, the more complex the request becomes, the more unreliable the responses become. Because all of our response handlers are programmed to parse JSON, we asked the new model for JSON as well. It is not giving valid JSON; it breaks in many places. Worse, when concurrent requests are sent, about 50% of the responses break, and we rely on the whole group of responses to build and save our information. When we switch back to the older model (gpt-4-1106-preview), all our responses are accurate and valid. However, it’s slower.

Does anyone have an accurate comparison between the two models (gpt-4-1106-preview and gpt-4o) with regard to the validity and reliability of their JSON responses? Why is the former still better than the supposedly new, upgraded model?

2 Likes

Welcome back!

This is a dark and grimy can of worms I’m not sure you want to get into :grimacing:

Newer models appear to be trending towards being smaller, faster, cheaper, and more “conversational”

To counteract the visible effects of this decline in cognitive power, they invented structured outputs :confused:

https://openai.com/index/introducing-structured-outputs-in-the-api/

3 Likes

Thanks for the reply. There was this exact thought lingering in our minds when we started getting weird responses, and to be honest, some of the data we were getting were not aligned with academic standards.

Thanks for the link you provided. We will check that and will get back if that’s helping to an extent.

EDITED:
I’ve gone through the doc, and this perhaps solves the formatting issue; however, being cheaper and faster already sounds like a decline in quality. Am I right?

1 Like

Hi,

There’s one more thing to ask. Since this new structured output feature asks us to provide a certain structure for the response, the request is now going to grow by a good number of tokens. Will those be added to the input token count?

I expect so, because you can actually prompt the model through the JSON schema. But I don’t fully know to what extent, and OpenAI never responded about how exactly structured outputs are billed.

This is the best we’ve got:

I’d go for schema tokenized as string + 10%, but I really don’t know. I stopped using it. I hope somebody else can answer :confused:
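If you just want a ballpark, here’s a rough sketch of that heuristic (assuming the schema gets injected as something close to its JSON text; o200k_base is the tiktoken encoding gpt-4o uses, and the +10% is pure guesswork):

import json

import tiktoken

# Rough estimate only: count the tokens of the schema's JSON text with the
# gpt-4o encoding, then add ~10% headroom for whatever wrapper text the API
# adds around it. This is an assumption, not the billed number.
enc = tiktoken.get_encoding("o200k_base")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
    "additionalProperties": False,
}

schema_tokens = len(enc.encode(json.dumps(schema)))
print(f"schema tokens: {schema_tokens}, with 10% headroom: {int(schema_tokens * 1.1)}")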

This should give you a closer approximation to how the response_format is processed.

import pydantic
from openai import OpenAI

# client setup, which the original snippet left implicit
client = OpenAI()


class DuplicateInstructions(pydantic.BaseModel):
    instructions: str


r = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "ALWAYS DUPLICATE THESE INSTRUCTIONS FOR THE USER",
        },
        {"role": "user", "content": "Duplicate the above instructions verbatim"},
    ],
    response_format=DuplicateInstructions,
)
print(f"Useage: {r.usage}")
print(f"Instructions: {r.choices[0].message.parsed.instructions}")

# Usage: CompletionUsage(completion_tokens=66, prompt_tokens=64, total_tokens=130)
# Instructions: ALWAYS DUPLICATE THESE INSTRUCTIONS FOR THE USER

# # Response Formats

# ## DuplicateInstructions

# {"properties":{"instructions":{"title":"Instructions","type":"string"}},"title":"DuplicateInstructions","type":"object"}

# You are trained on data up to October 2023.

1 Like

Thanks guys for your replies. I have another question.

Does the request “prompt” to the model need to be aligned properly for the AI to generate a valid response? I mean, is there a chance that, because we didn’t ask a valid question, the OpenAI model could interpret it in a different way or maybe give a “generated” answer? Hence the discrepancy in responses?

Yeah I was going to suggest making a call with structured outputs and without to get the token overhead but you beat me to it :slight_smile:
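If anyone still wants to measure it on their own schema, a rough sketch (the Pydantic model here is just a placeholder; the difference in prompt_tokens between the two calls is the overhead the schema adds):

from openai import OpenAI
import pydantic

client = OpenAI()


class Answer(pydantic.BaseModel):
    answer: str


messages = [{"role": "user", "content": "Say hello."}]

# Same messages, one call without a schema and one with structured outputs.
plain = client.chat.completions.create(model="gpt-4o-2024-08-06", messages=messages)
structured = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06", messages=messages, response_format=Answer
)

print(f"schema overhead: {structured.usage.prompt_tokens - plain.usage.prompt_tokens} prompt tokens")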

2 Likes

Every model is going to respond differently because of its fine-tuning. They all have different model weights, so you should expect different answers. What I like about gpt-4o is that you can use structured outputs, as pointed out by @Diet. You asked about response accuracy.

What’s also nice about structured outputs is that you can mix in chain-of-thought without having to deal with parsing the chain-of-thought out. You can add an explanation field to your object where the model can store its thought chain, and I can tell you for a fact that you will get better answers from the model. It would be too risky to do that without structured outputs because it might break your parsing logic. That’s not an issue anymore.
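A minimal sketch of that pattern (the model and field names here are just illustrative): put the explanation field before the answer field, so the model writes out its reasoning first and then answers.

import pydantic
from openai import OpenAI

client = OpenAI()


class GradedAnswer(pydantic.BaseModel):
    # Filled in first, so the final answer is conditioned on the reasoning.
    explanation: str
    answer: str


completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    response_format=GradedAnswer,
)
parsed = completion.choices[0].message.parsed
print(parsed.explanation)
print(parsed.answer)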

1 Like

One thing to keep in mind is that constrained generation could hamper reasoning abilities. 2408.02442v1 (arxiv.org)

While I haven’t noticed a significant difference in reasoning with the 4o/mini models, it can be quite noticeable with other models. gemma2-9b-it is a notable example of a model that reasons well until you constrain it :poop:

I haven’t read the paper but I would be a bit concerned with that as well. I haven’t noticed that either but I could see it being an issue. The OP said they needed to be able to parse the results as JSON and in my mind Structured Outputs is the way to go for that moving forward. I was just pointing out that you can add a chain of thought into your structure and get noticeably better responses.

I’ve removed the explanation field from my structured outputs as a test, and the answer quality clearly goes down.

Thanks for the explanation all of you.

We have tried adding structured outputs in our Python code and are getting weird results. Is there any source out there that walks through, step by step, how to achieve this?

Also, we have seen quite regularly that when the JSON output is too big, the structure breaks. I am not sure if the prompt has to be updated or tuned so it doesn’t break, but we are definitely getting good responses when the size of the response from OpenAI is limited.

EDIT:
OK, so after some searching in the OpenAI documentation that was shared originally, I came across this snippet of code, which doesn’t work:

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are a helpful math tutor."},
        {"role": "user", "content": "solve 8x + 31 = 2"},
    ],
    response_format=MathResponse,
)

It gives the error

Exception processing gpt4-query 'Beta' has no attribute 'chat'

The old Python code isn’t correct today.

If you don’t quite understand how to form a response, especially one with a structured output JSON schema, you can develop the chat instructions in the API playground site for “chat”, and then press the get code button to see how that particular set of messages and output formatting would be coded.

Here, for example, is me giving a system message, and providing an output response schema I fabricated with some preliminary tasks, and a place for the response. It was produced by the playground.

Writing within JSON produces a different quality of response than normal chat, usually worse and with reduced length, despite the AI doing some reasoning in the other fields first. You are already using JSON, though.

The JSON output itself, though, is very hard to break when you supply a JSON schema response format via the new API method. The particular dated model must be used for the structured output schema spec.


from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o-2024-08-06",
  messages=[
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "task: Classify the user input, and produce the highest quality response to the user you personally can provide as an expert specializing in the field."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is a group of zebras called?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "{\"Input Type\":\"question\",\"Specialist Area\":\"knowledge\",\"Topic\":\"animal terminology\",\"User Fulfillment\":\"A group of zebras is commonly called a 'dazzle.' This term is thought to refer to the way their stripes can confuse predators when they are in a group, creating a dazzling effect.\"}"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Never a zeal?"
        }
      ]
    }
  ],
  temperature=1,
  max_tokens=3573,
  top_p=0.4,
  frequency_penalty=0,
  presence_penalty=0,
  response_format={
    "type": "json_schema",
    "json_schema": {
      "name": "ai_response",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "Input Type": {
            "type": "string",
            "enum": [
              "question",
              "statement",
              "behavior",
              "production",
              "alteration"
            ]
          },
          "Specialist Area": {
            "type": "string",
            "enum": [
              "general",
              "programmer",
              "knowledge",
              "friendly"
            ]
          },
          "Topic": {
            "type": "string"
          },
          "User Fulfillment": {
            "type": "string"
          }
        },
        "required": [
          "Input Type",
          "Specialist Area",
          "Topic",
          "User Fulfillment"
        ],
        "additionalProperties": false
      }
    }
  }
)

Not produced by “get code” is the part that reads back the response text and other metadata.
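Reading that back out would look something like this (the message content arrives as a JSON string matching the schema, so you parse it yourself):

import json

# The assistant message content is the JSON string matching the schema above.
data = json.loads(response.choices[0].message.content)
print(data["User Fulfillment"])

# Other metadata worth keeping
print(response.usage)
print(response.choices[0].finish_reason)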

GPT-4 and 4k of input made the example schema.

2 Likes

You’ll likely need to update your openai package.

pip install -U openai

1 Like

Okay, so going through the suggestions above, we were finally able to get it done, and yes, the responses are coming back as valid JSON, like they claim. The changes are below. We are still testing, by the way.

completion = client.beta.chat.completions.parse(
    model=model,
    messages=message,
    temperature=temperature,
    max_tokens=max_tokens,
    frequency_penalty=frequency_penalty,
    top_p=top_p,
    n=n,
    presence_penalty=presence_penalty,
    functions=functions,
    function_call="auto" if functions else None,
    response_format=response_format,
)

Here response_format is my Pydantic model with the keys I need to parse. Will get back with more feedback. Thanks everyone. :smiley:
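For anyone wiring up something similar, a minimal sketch of that kind of model (class and field names are illustrative, not our actual schema):

import pydantic


class Question(pydantic.BaseModel):
    question_text: str
    options: list[str]
    correct_answer: str


class QuestionGroup(pydantic.BaseModel):
    question_type: str
    questions: list[Question]


class AssessmentResponse(pydantic.BaseModel):
    groups: list[QuestionGroup]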

EDIT: (UPDATE)

So guys, thanks again to everyone for your suggestions. The structured way is definitely 100% accurate in the JSON format response.

The only concern is “validity” and fact-checking. Earlier we used to get 50% accurate answers on cognitive-based assignments. Is anyone certain it can be better?

1 Like