I’m alarmed because I have been testing the new gpt-3.5-turbo-0613 version and it has significantly worse performance in other languages (such as French).
For example, translating from English to French using the current gpt-3.5-turbo-0301 yields translations that take context into account.
With the new gpt-3.5-turbo-0613 version, it translates things verbatim, without any regard for context.
Obviously I don’t expect the OpenAI team to test the AI in multiple languages after each release; however, this is specifically why I’m writing. Telling developers that they have ‘3 months’ before the model is no longer available does not contribute to a healthy ecosystem of applications powered by OpenAI.
I understand, and even encourage, the rapid development of various models even if they sometimes end up performing worse in some cases (I’m sure it performs much better in others).
However, it is CRITICAL to keep some models around for longer periods of time. For example, Ubuntu has long-term support (LTS) releases that stick around for years.
I believe it is very important to preserve the current GPT-3.5 version for considerably longer than 3 months. I’d say at least 1 year, or until 1–2 newer versions of GPT-3.5, or even a GPT-4 Turbo, are available.
I figure the short window is because of security concerns or certain exploits in the model, but honestly, it would be nice for them to keep some older snapshots up in the API.
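For now, the one mitigation we do have is pinning the dated snapshot instead of the floating alias. A minimal sketch with the 2023-era Python SDK (`openai<1.0`); the snapshot names are the ones OpenAI published, everything else is illustrative:

```python
import openai  # 2023-era SDK, i.e. openai<1.0

# "gpt-3.5-turbo" is a floating alias that silently moves to the newest
# snapshot. Pinning the dated name keeps behaviour stable; but only
# until that snapshot itself is deprecated, which is the whole problem.
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",  # dated snapshot, not the floating alias
    messages=[{"role": "user", "content": "Translate into French: The meeting ran long."}],
)
print(resp["choices"][0]["message"]["content"])
```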
One of my favorite models is still the old davinci model, not the -instruct ones, because it’s capable of doing creative writing better than me. I’m glad they haven’t deprecated that one.
That’s interesting, I’ve felt the opposite. I couldn’t use gpt-3.5 before; it was basically worthless for me, so I was forced to use GPT-4, which has worked great. With the update I’ve seen a significant improvement in gpt-3.5. Its reasoning seems much better and it actually listens to the system prompt.
I have not noticed any major issues yet. It honestly seems to be faster, with output being slightly longer. I haven’t done much testing, but there are no breaking changes for me as of now…
Sorry but there’s no way. I’ve been a power user of both davinci and gpt-3.5-turbo for a while now, and know for certain that the decrease in intelligence of gpt-3.5-turbo 0613 version is drastic and noticeable. It doesn’t follow directions. They gutted it to save resources and it’s so obvious.
You could try a prompt that asks for a more colloquial, less strictly formal translation. You could also ask GPT to translate into the dialect of your city or region. Just trying to help; it is a hassle, since the old version was good. Or use a prompt along the lines of ‘translate as if you were version XXXX’.
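For instance, something along these lines (just a sketch; the system prompt wording and the example sentence are mine):

```python
import openai

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        {
            "role": "system",
            # Ask explicitly for idiomatic, context-aware output rather
            # than a word-for-word rendering.
            "content": (
                "You are a professional translator. Translate the user's text "
                "into natural, colloquial French as spoken in Paris. Preserve "
                "tone and intent; do not translate word for word."
            ),
        },
        {"role": "user", "content": "It's not rocket science, just read the manual."},
    ],
    temperature=0.3,
)
print(resp["choices"][0]["message"]["content"])
```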
I’d highly recommend at least a couple of examples here. If there is a French issue, point it out. If there is a logic issue, give examples. It’s hard to get the attention of devs if we’re just making abstract assertions.
I have also noticed a significant loss of understanding of instructions with version 0613. My prompts are no longer usable, and the conversation quickly falls apart. Just to clarify, I am working in French. I am building a ‘choose your own adventure’ story application, and simply switching the model from turbo 3.5 to 3.5-0613 means it no longer follows the complex instructions. Function calling does help with producing JSON, though… My next approach will be to move as many instructions as possible to the system role (have you seen any significant changes with that?) and to drastically reduce the portion of the logic handled by the prompts.
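Roughly what I mean, as a sketch (the function name and schema are invented for illustration, not the actual app’s, and my real prompts are in French):

```python
import json
import openai

# Hypothetical schema for one step of the story; forcing the call keeps
# the output as parseable JSON even when plain instructions get ignored.
functions = [
    {
        "name": "next_scene",
        "description": "Return the next scene and the choices offered to the player.",
        "parameters": {
            "type": "object",
            "properties": {
                "scene": {"type": "string", "description": "Narrative text for the next scene."},
                "choices": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Two to four actions the player can pick from.",
                },
            },
            "required": ["scene", "choices"],
        },
    }
]

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        # Keeping the rules in the system role, as mentioned above.
        {"role": "system", "content": "You are the narrator of a choose-your-own-adventure story."},
        {"role": "user", "content": "I open the door."},
    ],
    functions=functions,
    function_call={"name": "next_scene"},  # force the structured output
)
args = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
print(args["scene"], args["choices"])
```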
I’ve found it to be quite good, I am sort of building up a “fake” chat history to help guide it and it seems to be responding better to the system message now. Are you able to perhaps use the chat history to show it a previous example question and response to guide it to what you require?
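Something like this, if it helps (the example exchange is made up; only the pattern matters):

```python
import openai

messages = [
    {"role": "system", "content": "Answer in exactly two short sentences."},
    # A fabricated prior exchange, presented as if it already happened,
    # so the model imitates the demonstrated format:
    {"role": "user", "content": "Summarise: the server restarts nightly at 2am."},
    {"role": "assistant", "content": "The server restarts every night at 2am. This applies pending updates."},
    # The real request:
    {"role": "user", "content": "Summarise: backups run weekly on Sundays."},
]
resp = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=messages)
print(resp["choices"][0]["message"]["content"])
```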
We need to see examples side-by-side, and maybe someone should devise an empirical way to measure. I fear we get lost in subjectivity and anecdotes otherwise.
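Even something as crude as this would be a start: a harness that runs the same prompts against both snapshots at temperature 0 so the outputs can be diffed side by side (the test prompt is a placeholder; fill in the ones that regressed for you):

```python
import openai

PROMPTS = [
    "Translate into natural French: Break a leg out there!",
    # add the prompts that regressed for you here
]

# Run every prompt against both snapshots and print the outputs together.
for model in ("gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    print(f"=== {model} ===")
    for prompt in PROMPTS:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep outputs as deterministic as possible
        )
        print(resp["choices"][0]["message"]["content"])
```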
As per my observations, the model is more consistent with its output.
The previous one had edge cases where it would output text that was not in the format it was instructed to use, but the new 3.5 is actually consistent 100% of the time, so there’s definitely an improvement on that side.
Also, kinda off topic, but I feel GPT-3.5’s output is more reliable than GPT-4’s, because I’ve noticed 4 makes a lot of trivial mistakes, like spelling errors or missing a line of code, things along those lines.
But overall, the new 3.5’s outputs seem more reliable than the previous one’s, as per my observations.
Some background: we’re building agents using gpt-3.5-turbo, and we’re also seeing a drastic decrease in its intelligence with the new version. Major problems:

- Hallucination increase. Using the same prompt, the new one is worse at:
  - Information extraction: it gets the wrong result from the context. E.g., when I ask it to find a specific SKU in a list of SKUs, it gives me the right name and properties but the wrong id.
  - Format stability: it breaks our format constraints more often, whether or not we use function_call.
- Worse logic. Still with the SKU search example: it will sometimes give me a ‘matching’ SKU (both in its reasoning and in the final structured result) that doesn’t match anything in the provided list.
- Speed decrease. We’re using both OpenAI and Azure, and there’s also a major performance difference between the two models, something like 50% in response time.

We’ve been forced to decrease the temperature and provide more examples, but we’re still struggling to port all of our agents to the new model…
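For the curious, a stripped-down sketch of the kind of mitigation we’re applying (the schema, field names and SKU data are illustrative, not our production setup):

```python
import json
import openai

# Constrain the answer with a function schema and temperature 0, then
# validate that the returned id actually exists in the provided list.
functions = [
    {
        "name": "select_sku",
        "description": "Pick the single SKU from the provided list that matches the request.",
        "parameters": {
            "type": "object",
            "properties": {
                "sku_id": {"type": "string", "description": "Copied verbatim from the list."},
                "name": {"type": "string"},
            },
            "required": ["sku_id", "name"],
        },
    }
]

skus = [{"id": "SKU-1041", "name": "Red mug, 330ml"}, {"id": "SKU-2210", "name": "Blue mug, 500ml"}]

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        {"role": "system", "content": "Choose only from the SKUs given. Never invent an id."},
        {"role": "user", "content": f"SKUs: {json.dumps(skus)}\nWhich one is the large blue mug?"},
    ],
    functions=functions,
    function_call={"name": "select_sku"},
    temperature=0,
)
args = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])

# Guard against exactly the failure mode described above: a confident
# answer whose id matches nothing in the list.
assert args["sku_id"] in {s["id"] for s in skus}, "model invented an id"
```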
Search for the first occurrence of “Drupal 9”. This conversation got switched over to GPT-3.5 from GPT-4 one night when I went over my limit. GPT-3.5 just spits out bad code, again and again and again, seemingly without even thinking about it. Note how I told it that it was going in circles, and it continued to do the same thing. I asked it to think about what it was doing, and it just ignored me.
It doesn’t matter if you don’t understand code, or PHP specifically. Just look at its answers and my responses. It’s just making things up.
I KNOW there is a significant decrease in capability because I started off using GPT-3, then 3.5, for several months specifically for coding, and while I ran into some difficulties, they were usually due to the length of the code submissions, and never a problem like this.
This isn’t a prompt issue. It’s a model capability issue.