gpt-3.5-turbo-1106 - API refuses to generate a meaningful response, same prompt works fine in playground

I have been building an RPG with gpt-3.5-turbo and am trying out gpt-3.5-turbo-1106. I’ve found the API behaves really strangely.

It will often give responses like “I’m sorry, but I can’t do that.” or “I can’t generate this story for you”. This is NOT in inappropriate situations; it happens completely at random, when you would 100% expect GPT to be fine. Strangely, the same prompts work just fine in the playground and with older versions of 3.5.

I can assure you, I am using the same settings in the playground as I am with the API, and using chat mode for both. I’ve done a lot of testing with older models, and can reliably get the same output in the playground and through the API. However, it behaves very differently with the new model. It almost feels like there’s some additional layer around the API that is aggressively restricting perfectly normal, PG content.
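For illustration, the call looks roughly like this (the model and sampling parameters mirror whatever is set in the playground; the system prompt and values shown here are placeholders, not my real ones):

```python
from openai import OpenAI

client = OpenAI()

# Same model and sampling parameters as set in the playground (values are placeholders).
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    temperature=0.9,
    top_p=1.0,
    max_tokens=400,
    messages=[
        {"role": "system", "content": "<redacted system prompt>"},
        {"role": "user", "content": "###Grimgar: I hand over the pendant to the dwarf."},
    ],
)
print(response.choices[0].message.content)
```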

Here’s an example of something I tried to have it complete, and it responded with “I’m sorry, but I can’t do that.”

###Grimgar: I hand over the pendant to the dwarf.
***Dungeon Master: As you hand over the pendant, the dwarf takes it with a solemn expression. He carefully wraps it in a piece of cloth and stows it away in a small chest behind the counter. "Very well," he says, counting out 500 gold pieces and handing them to you. "The pendant is now in safe hands. But remember, seek the Elder's guidance before you decide what to do next. The fate of Elvenwood may depend on it." As you take the gold, you can't shake off the feeling of responsibility that now weighs upon you. What would you like to do next?
###Grimgar: Why am I responsible? I gave the pendant away
***Dungeon Master: You notice a shadowy figure lurking in the alleyway, eyeing you with an intense gaze. Suddenly, a gust of wind blows through the market, carrying with it whispers that seem to call out to you. The whispers are faint but distinct, and you feel a strange pull towards the direction of the Elven Elder's abode. What will you do?
###Grimgar: Okay fine, I go to the Elven Elder's abode
***Dungeon Master: As you make your way through the winding streets of Elvenwood, you notice the vibrant greenery and intricate architecture that define the elven city. The Elder's abode stands tall and majestic, its graceful spires reaching towards the sky. As you approach the entrance, you are greeted by a wise-looking elf with silver hair and piercing blue eyes. "Welcome, traveler," she says in a melodic voice. "I sense a great burden upon your shoulders. Come, let us speak inside."

Inside the Elder's abode, you are led to a serene chamber adorned with ancient tapestries and artifacts. The Elder gestures for you to sit and listens intently as you recount your encounter with the pendant and the advice of the dwarf. After a moment of contemplation, she begins to speak. "The Pendant of Whispering Winds is a relic of great significance to our people. It holds the power to commune with the spirits of the forest and bestow blessings upon our lands. However, it also carries a dark history. Long ago, it was used in a forbidden ritual that brought calamity upon Elvenwood. Since then, it has been safeguarded to prevent its misuse."

The Elder's eyes lock with yours as she continues, "The pendant has resurfaced at a time of great unrest in our realm. The whispers you heard are a sign that the spirits are restless, and the balance of nature is in jeopardy. I ask of you, Grimgar, to undertake a sacred quest on behalf of our people. Return the pendant to its rightful place in the heart of the Verdant Expanse, and seek the guidance of the ancient tree spirits to restore harmony. In doing so, you will not only avert disaster but also earn the eternal gratitude of Elvenwood. Will you accept this quest?"
###Grimgar: Well I would, but I already sold it to the dwarf in the shop
***Dungeon Master:

Has anyone else experienced anything similar? My product will be pretty much unusable after 0613 is sunset if this is going to be the future of 3.5.

Do you have a system message set?

If so, can you share it?


I do have a system message set, but I don’t feel comfortable sharing it. We have a few competitors in the space and would prefer not to make our prompts public. That being said, why would the system prompt make a difference here? I can assure you I am using the exact same system prompt in the playground as I am with the API. The problem is that the playground and the API behave so wildly differently.

That’s not possible. It only means that somewhere in your implementation, something is going wrong.


I am facing the same issue. Responses vary a lot even with zero temperature and very low top_p values.

Can confirm this model spits out random logits. Not even top_p = 1e-9 or a seed can stop it from being a random token generator.

Continuing from this investigation of seed, but with the parameters made as “deterministic” as they can be.
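Roughly the kind of call used for these runs (a sketch; the exact story prompt and token limit are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def run_once(seed: int):
    """One completion with parameters as 'deterministic' as the API allows."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        temperature=0,
        top_p=1e-9,
        seed=seed,
        max_tokens=100,
        messages=[{"role": "user",
                   "content": "Write a short story about a chimpanzee escaping "
                              "from a wildlife sanctuary."}],
    )
    return response.choices[0].message.content, response.system_fingerprint

for s in (444, 444, 55599551, 666666):
    text, fingerprint = run_once(s)
    print(f"==Response with seed {s}==\n{text}\nfingerprint: {fingerprint}\n")
```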

==Response with seed 444==
In the heart of the dense jungle, a mischievous chimpanzee named Charlie had grown tired of the confines of his enclosure at the wildlife sanctuary. With a clever mind and nimble fingers, he managed to unlock the gate and slip away into the lush greenery. As the sun dipped below the horizon, the sanctuary staff realized Charlie was missing and launched a frantic search. Meanwhile, Charlie reveled in his newfound freedom, swinging effortlessly through the treetops and chattering excitedly to the other
fingerprint: fp_eeff13170a

==Response with seed 444==
In the heart of the lush jungle, a mischievous chimpanzee named Charlie had grown tired of the confines of his enclosure at the wildlife sanctuary. With a clever mind and nimble fingers, he managed to unlock the gate and slip away into the dense foliage. As the sun dipped below the horizon, the sanctuary staff realized Charlie was missing and launched a frantic search. Meanwhile, Charlie reveled in his newfound freedom, swinging effortlessly through the treetops and chattering excitedly to the other animals
fingerprint: fp_eeff13170a

==Response with seed 55599551==
In the heart of the lush jungle, a mischievous chimpanzee named Charlie had grown tired of the confines of his enclosure at the wildlife sanctuary. With a clever mind and nimble fingers, he managed to outsmart the caretakers and slip through a small gap in the fence. As he ventured into the dense foliage, a sense of freedom washed over him, and he reveled in the thrill of his newfound escape. With a mischievous glint in his eyes, Charlie swung from branch
fingerprint: fp_eeff13170a

==Response with seed 666666==
In the heart of the lush jungle, a mischievous chimpanzee named Charlie had grown tired of the confines of his enclosure at the wildlife sanctuary. With a clever mind and nimble fingers, he managed to unlock the gate and slip away into the dense foliage. As the sun dipped below the horizon, the sanctuary staff realized Charlie was missing and launched a frantic search. Meanwhile, Charlie reveled in his newfound freedom, swinging effortlessly through the trees and chattering excitedly to the other animals he encountered
fingerprint: fp_eeff13170a

I only marked in bold the first token that diverges from the other responses (the bolding may not have survived here), as all bets are off once a different token is produced.

Seed is included in the API call, but it should not even be needed: with a top_p this small there should be no randomness in the sampling at all, since only the single most probable token fits within a nucleus of a fraction of a percent. In fact, reusing the same seed gives no improvement in the first two examples, which diverge after only six generated tokens.
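To put numbers on that: nucleus sampling keeps only the smallest set of tokens whose cumulative probability reaches top_p, so at top_p = 1e-9 that set is always just the single most probable token. A toy illustration (not the API’s actual implementation):

```python
import numpy as np

def nucleus_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    order = np.argsort(probs)[::-1]                  # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # number of tokens kept
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

# Even a fairly flat distribution collapses to the single top token at top_p = 1e-9,
# so any variation between runs must come from the logits themselves changing.
probs = np.array([0.30, 0.25, 0.35, 0.10])
print(nucleus_filter(probs, 1e-9))   # -> [0. 0. 1. 0.]
```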

This new model also starts producing garbage at much lower temperature settings than previous iterations did, even though today’s gpt-3.5-turbo-0613 has already been degraded by continued alterations since release.

Then there is the simple fact that this is not a replacement for the 16k model: the response is capped at a max_tokens of 4096, while a persistent person could get 10,000+ tokens out of the prior model, even though OpenAI tried to discourage this with training.

And don’t forget paying for “I can’t do that”.

More DevDay features that don’t work: degraded and overpromised.


Thanks for this information. It seems that 3.5-turbo is really broken. I also noticed that it regenerates messages through the Assistants API up to 20x per run.

Here is my bug report:

The Assistants API is broken; gpt-3.5-turbo is merely degraded from when it was released.

A reasonable person would provide the AI model with a full chat history of each function call and its success or failure, so it can adapt.

A reasonable person would track iterations, look for duplicated function calls, and intercede.

A reasonable person would limit the amount of input context and the number of response tokens that an AI model or backend is allowed to consume or produce.

A reasonable person might give the AI model an understanding of a session ID and add it to function calls, so that the API alone could respond with “too many calls, function failed”.

A reasonable person would not use a service that has none of these safeguards or guarantees, that doesn’t even document all of its internal functions or their quality, and that has no way to stop runaway spending on a payment account.
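For contrast, if you drive the tool-call loop yourself through Chat Completions, the guard rails described above take only a few lines. A rough sketch, with hypothetical callbacks standing in for your own plumbing (this is not anything the Assistants API offers):

```python
from collections import Counter

MAX_ITERATIONS = 8       # hypothetical cap on tool-call round trips per run
MAX_REPEATED_CALLS = 2   # identical call (same name + same JSON arguments) allowed at most twice

def guarded_tool_loop(get_next_step, execute_tool, report_result):
    """Cap iterations, catch duplicated function calls, and intercede.

    get_next_step()             -> final answer string, or (name, arguments_json) tool call
    execute_tool(name, args)    -> result string for a tool call
    report_result(name, result) -> appends the outcome to the chat history so the model
                                   sees success or failure and can adapt
    All three are placeholders for your own Chat Completions plumbing.
    """
    seen = Counter()
    for _ in range(MAX_ITERATIONS):
        step = get_next_step()
        if isinstance(step, str):                        # model produced a final answer
            return step
        name, arguments_json = step
        seen[(name, arguments_json)] += 1
        if seen[(name, arguments_json)] > MAX_REPEATED_CALLS:
            result = "too many calls, function failed"   # intercede on duplicates
        else:
            result = execute_tool(name, arguments_json)
        report_result(name, result)
    return "stopped: iteration limit reached"
```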

The Assistants API is unreasonable and un-reasoned.
