New gpt-3.5-turbo-1106 (API) is horrific

The latest model, gpt-3.5-turbo-1106, is horrific. I haven’t had to rework my prompt format much for some of our projects (just small experiments plus the transition to messages) since starting with GPT-3, and I got pretty consistent and reasonable results. The initial step into turbo was actually a fairly big jump up in response quality, with each newer version since then giving what felt like slightly degraded responses. I ran a few experiments over the last few days, and an insane number of responses were purely “I’m sorry, I can’t do that.” in some form or another, with only about 5% actually being relevant.

I rolled back to the previous functional model, gpt-3.5-turbo-0613, but eventually it’ll be deprecated and I’ll be forced to upgrade. Looking at other reactions to the latest models, I can’t help but wonder, wtf is going on over at OpenAI? It shouldn’t even be their overzealous content filter: none of the tests I ran had anything even remotely mature, violent, or otherwise… just basic conversations.


I agree 100% on this. If more people voice this, maybe they won’t drop the good old 0613 model?


More people had already voiced this, which resulted in -0301 remaining for a full year, and yet they still went ahead and degraded -0301 after similar concerns that it could embarrass GPT-4 at performing tasks.


Yep, it is awful. I also had to revert my code to use gpt-3.5-turbo and gpt-3.5-turbo-16k.

In my case, I noticed that the new 3.5 calls functions a lot, even when no function call is needed. Yes, I tried different prompts, with different examples, but the results are not consistent.

I personally feel it was already questionable whether gpt3.5 was “smart enough” for function calling. Perhaps it was for a large enough minority of use cases, but even then it felt like a detriment to the output for other use cases… and now all this further fine-tuning is just further confusing an already “barely” capable model.

Can you share some of the prompts that are getting rejected?

Why ask? Do you have the power to reverse this trend of ever-improving instruction-ignoring?


System + user message

system

You write tweets for the user’s twitter account. You are a specialist at making high-impact tweets.

You are given a topic and will write these components of one tweet. Do not include the numbered list, just paragraphs with a single linefeed.

  1. A headline-like attractive sentence. It shall have one of the most relevant unique words marked by a hashtag as a primary keyword.

  2. Two sentences, where the first sentence will be features or pros of the topic or idea, and then an emoji unique to the topic. The second sentence will explain why previous ideas or competitors are not as good.

  3. Finally, include a call-to-action sentence of just a few words followed by three standalone hashtags.

The total length shall be typical of a tweet. There will be no AI output produced other than the tweet to be posted.


user

OpenAI’s devday: gpt-4-turbo announced, assistants for API use less programming and rag


0301 (today’s version)

  1. “Get ready for the future of AI with OpenAI’s #GPT4Turbo at DevDay!”

  2. “Experience the power of AI like never before with #GPT4Turbo - faster, smarter, and more efficient than ever before! Say goodbye to tedious programming and hello to seamless API assistants. Don’t settle for less with outdated competitors.”

  3. “Join us at DevDay and be a part of the AI revolution! openai #FutureTech #AIforAll

0613

  1. “Exciting news! OpenAI’s #GPT4Turbo is here to revolutionize AI assistants and simplify programming for API use. :rocket:

  2. “With #GPT4Turbo, you’ll enjoy powerful features like enhanced language understanding and lightning-fast response times. Say goodbye to complex programming and hello to effortless AI assistance! :muscle:

“Unlike its predecessors, #GPT4Turbo eliminates the need for extensive coding knowledge, making it accessible to a wider range of users. It’s time to unlock the full potential of AI without the hassle! :raised_hands:

  1. “Join the AI revolution today and experience the future of programming with #GPT4Turbo. Don’t miss out! openai #AIAssistant

-1106

  1. “Exciting news from OpenAI’s DevDay: Introducing #GPT4Turbo, the ultimate AI assistant for seamless API integration! :rocket:

  2. “Unlock the power of #GPT4Turbo with its user-friendly features and simplified programming! :robot: Say goodbye to complex coding and hello to effortless AI integration. :muscle: Competitors simply can’t match its ease of use and efficiency. :sunglasses:

  3. “Ready to level up your AI game? Get started with #GPT4Turbo today and revolutionize your API experience! :star2: openai #AIAssistant


I had very good results with 3.5-turbo and 3.5-16k-turbo with more than 8 different functions to call. With good examples, the function calling would get 9/10 right. The new model is nowhere near this yet.

I think the actual usage numbers will speak louder than our words after OpenAI starts to redirect ‘gpt-3.5-turbo’ to 1106 soon.

After testing 1106, I’ve set my model to “gpt-3.5-turbo-0613” to avoid trouble later.

They may keep 0613 longer if too many users stick to it.
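For anyone wanting to do the same, here is a minimal sketch of pinning a dated snapshot so that a change to what the bare `gpt-3.5-turbo` alias points at cannot silently switch models. The helper names `is_pinned` and `build_request`, and the rule that a dated snapshot is recognisable by a four-digit suffix, are my own assumptions for illustration and not part of any SDK.

```python
import re

# Dated snapshot chosen explicitly instead of the floating "gpt-3.5-turbo" alias.
PINNED_MODEL = "gpt-3.5-turbo-0613"


def is_pinned(model: str) -> bool:
    """Heuristic (assumption): a dated snapshot carries a 4-digit tag such as -0613."""
    return re.search(r"-\d{4}(-|$)", model) is not None


def build_request(messages, model=PINNED_MODEL, temperature=0.2):
    """Build a chat request payload, refusing floating aliases."""
    if not is_pinned(model):
        raise ValueError(f"{model!r} is a floating alias; use a dated snapshot")
    return {"model": model, "temperature": temperature, "messages": messages}


payload = build_request([{"role": "user", "content": "Hello"}])
print(payload["model"])
```

The guard is deliberately strict: it fails loudly at build time rather than letting an alias slip into production.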


Same. It never seems to finish when it is trying to create a function call. Works with 0613 and gpt-4 preview.

Can you share some examples? Either here or by DM? Definitely want to fix any issues you’re finding with the new model.


And as a reference point, here is another model that writes the requested number of sentences and doesn’t include numbered lists: text-davinci-002

It takes somewhat different prompting, so it can’t be directly compared, since it is not “chat-wrapped” or “system trained”.

Here are instructions for an AI, and then the AI following instruction exactly by-the-letter.


# instructions
You write tweets for the user’s twitter account. You are a specialist at making high-impact tweets.

You are given a topic and will write these components of one tweet. Do not include the numbered list, just paragraphs with a single linefeed.

  1. A headline-like attractive sentence. It shall have one of the most relevant unique words marked by a hashtag as a primary keyword.
  2. Two sentences, where the first sentence will be features or pros of the topic or idea, and then an emoji unique to the topic. The second sentence will explain why previous ideas or competitors are not as good.
  3. Finally, include a call-to-action sentence of just a few words followed by three standalone hashtags.

The total length shall be typical of a tweet. There will be no AI output produced other than the tweet to be posted.

# user:
Topic: OpenAI’s devday: gpt-4-turbo announced, assistants for API use less programming and rag

# AI:

Going to #OpenAI devday today! Excited to see what’s in store. #gpt-4-turbo sounds like it’ll be a game-changer. Can’t wait to see how it helps with API use. #assistants #programming #rag

It is only missing the single emoji that was specified.


Someone liked an old post from three months ago, and it was where I wrote a classifier as simple and expository as needed. A good time capsule. -1106 fared well, only going from 100% to 83%.

I actually agree with the general sentiment of this chat thread. As a concrete use case, I spent the last two days trying to get the new model to generate basic summaries from podcast episodes. Despite staying well below the rate limits for my tier, every 4 or 5 calls one would simply hang, i.e. with no error message returned. I thought at first it had to do with some unknown rate limit that I was exceeding, but I now know that isn’t the case. After reading this thread I switched to the gpt-3.5-turbo-0613 model and it burned straight through the summarization tasks like a hot knife through butter. The quality of the responses, from what I can tell thus far, is also better on the 0613 model.
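In case it helps anyone hitting the same silent hangs, below is a rough sketch of a client-side guard: run the call in a worker thread, stop waiting after a deadline, and retry. It uses only the standard library; `call_with_timeout` and its parameters are hypothetical names, and `fn` stands in for whatever function actually makes the API request. Note that an abandoned worker thread is not killed, it is just no longer waited on.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, max_attempts=3, backoff_s=0.0):
    """Run fn() in a worker thread; stop waiting after timeout_s and retry."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn)
        try:
            # Raises concurrent.futures.TimeoutError if fn has not finished in time.
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError as exc:
            last_error = exc
            time.sleep(backoff_s * attempt)  # simple linear backoff between attempts
        finally:
            # Do not block on a hung worker; it is simply abandoned.
            pool.shutdown(wait=False)
    raise TimeoutError(f"no response after {max_attempts} attempts") from last_error


# Stand-in for an API call that hangs on the first attempt only.
calls = {"n": 0}


def flaky_summary():
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.5)  # simulated hang
    return "summary"


result = call_with_timeout(flaky_summary, timeout_s=0.1)
print(result, "after", calls["n"], "attempts")
```

If the client library you use exposes its own request timeout setting, that is likely the better option, since it actually closes the connection instead of leaving a thread behind.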


I’ve also experienced degradation when using GPT-3.5 1106.

Most of my old prompts with 0613 just no longer work.

To add just one example, here’s a prompt with function calling, I pass it the name of a person and just ask 2 basic questions:

{
  "model": "gpt-3.5-turbo-1106",
  "temperature": 0.2,
  "messages": [
    {
      "role": "system",
      "content": "The user will give a JSON containing a name. You need to determine if that name is a valid person's full name. A valid full name contains at least one fully written first name and one full written last name."
    },
    {
      "role": "user",
      "content": "{\"name\":\"John Doe\"}"
    }
  ],
  "functions": [
    {
      "name": "is_person_name",
      "parameters": {
        "properties": {
          "is_person_name": {
            "description": "Does the value of 'name' look like a person's valid full name (contains both a complete first and last name)?",
            "type": "boolean"
          },
          "name_count": {
            "description": "How many full names are there?",
            "type": "number"
          }
        },
        "required": [
          "is_person_name",
          "name_count"
        ],
        "type": "object"
      }
    }
  ]
}

And it replies with:

"message": {
  "role": "assistant",
  "content": null,
  "function_call": {
    "name": "is_person_name",
    "arguments": "{\"name\":\"John Doe\"}"
  }
},
"finish_reason": "function_call"

Whereas the old 0613 was consistently good on this task.
And this is just 1 of many prompts which now just fail.
So I’m sticking to 0613 for now :confused:

For completeness, here is the correct 0613 response:

"message": {
  "role": "assistant",
  "content": null,
  "function_call": {
    "name": "is_person_name",
    "arguments": "{\n  \"is_person_name\": true,\n  \"name_count\": 1\n}"
  }
}
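To catch this kind of failure automatically, one option is to check the returned `arguments` string against the required keys before trusting it. The sketch below is only an illustration: `validate_arguments` and the `REQUIRED` table are names I made up, with the expected types copied by hand from the function schema in the request above. The two sample strings are the 1106 and 0613 `arguments` values quoted in this post.

```python
import json

# Required keys and expected Python types, mirroring the schema in the request.
REQUIRED = {"is_person_name": bool, "name_count": (int, float)}


def validate_arguments(raw_arguments, required=REQUIRED):
    """Return (ok, problems) for a function_call 'arguments' JSON string."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc}"]
    if not isinstance(args, dict):
        return False, ["arguments is not a JSON object"]

    problems = []
    for key, expected in required.items():
        if key not in args:
            problems.append(f"missing required key: {key}")
        elif isinstance(args[key], bool) and expected is not bool:
            # bool is a subclass of int in Python, so reject it for numeric fields.
            problems.append(f"wrong type for {key}")
        elif not isinstance(args[key], expected):
            problems.append(f"wrong type for {key}")
    return not problems, problems


args_1106 = '{"name":"John Doe"}'
args_0613 = '{\n  "is_person_name": true,\n  "name_count": 1\n}'

print(validate_arguments(args_1106))
print(validate_arguments(args_0613))
```

With a check like this in place, a bad call can be retried or routed to a fallback model instead of failing further downstream.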

I agree with this thread; I am facing a similar issue. With the old gpt-3.5-turbo-0613 and a vector database, I was able to get much better responses when the question was similar to, but not directly related to, the context. It would still give a good answer with the correct information. After switching to gpt-3.5-turbo-1106, it seems like you have to write the exact question relating to the context or it won’t give you a response.

Any ideas on how to make gpt-3.5-turbo-1106 less strict and function more like 0613 model?

Not sure if this is the right place to report this, but even with my prompt in 1106-instruct, where I’m simply asking it to format in HTML, 50% of the time it gives me garbled trash back.

I am also seeing a degradation in function call responses with this API version compared to 0613. GPT-4 is better but a lot slower and more expensive. Why is gpt-3.5-turbo-1106 so different from 3.5 0613 and gpt-4-1106-preview?

Completely agree with the above. 1106 is a severe downgrade from 0613, and I would be devastated if they discontinue 0613 in 6 months’ time.

I agree. After upgrading my file summarization scenario to the 1106 version, the prompt became ineffective. I often received responses like “I’m sorry, I cannot complete the task” or “I’m sorry, the text is too complex, I need more time to understand.” Even on the rare occasions when it succeeded, the summarization was not as good as with the 0613 version, and it’s difficult to achieve the same effect as 0613 even after adjusting the prompt.
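To put a number on this when comparing model versions, a crude tally of refusal-style replies can help. This is a minimal sketch under a loud assumption: a reply counts as a refusal if it starts with an apology phrase. The `REFUSAL_PREFIXES` list and both function names are hypothetical, and the two non-refusal samples are made-up placeholders; the two refusals are the ones quoted above.

```python
# Assumed heuristic: a refusal is a reply that opens with an apology phrase.
REFUSAL_PREFIXES = ("i'm sorry", "i am sorry", "sorry, ")


def is_refusal(text: str) -> bool:
    """Normalise curly apostrophes and case, then check the opening phrase."""
    normalized = text.strip().lower().replace("\u2019", "'")
    return normalized.startswith(REFUSAL_PREFIXES)


def refusal_rate(responses) -> float:
    """Fraction of responses that look like refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


samples = [
    "I\u2019m sorry, I cannot complete the task",
    "I'm sorry, the text is too complex, I need more time to understand.",
    "Placeholder: a summary of the file.",
    "Placeholder: another summary of the file.",
]
print(f"refusal rate: {refusal_rate(samples):.0%}")
```

Running the same prompts against 0613 and 1106 and comparing the two rates gives something more concrete to report than a general impression.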


That is super interesting. Do you have an example prompt you can share?