Anger and discouragement towards OpenAI

As many of you are experiencing, the recent ChatGPT4-Turbo is almost unusable and getting worse and worse (hallucination, not respecting basic custom instructions, placeholder comments instead of implementation, noisy irrelevant answers, or worse, useless summaries instead of clear and detailed responses).

My fear for 2024 is that OpenAI will continue to abandon its most hardcore ChatGPT4 users, which they clearly have been doing since ChatGPT4-Turbo release (prove otherwise).

Look at this thread that keeps growing exponentially and is quite active every day.

WHY HAS NO ONE AT OPENAI RESPONDED in detail to the issues of frustration, discouragement, fear, and trust so widely and overwhelmingly expressed in this thread (and other similar ones)?

A concise and uninformative PR Twitter post won’t do it… As paying customers we want answers.

ChatGPT4-Turbo is unusable for precise (even basic) programming tasks. Around August, people became very excited about custom instructions and the great performance of ChatGPT4. Why can’t we roll back? This is totally disrespectful to paying users.

Many of us are developing serious trust issues when using ChatGPT4 to explore/generate code.

For 2024, I hope OpenAI will respect its most demanding paying users and provide clear information when we, as paying customers, ask for it. That’s a bare minimum of respect.

Otherwise, my hope (and that of many) is that Gemini will crush OpenAI.

Happy new year in the world of Technological Shrinkflation

6 Likes

My zero-insider understanding of the process is OpenAI is routinely experimenting and A/B testing iterations and evolutions of their models.

When they push a new model into production, it may be better at some things and, yes, worse at others. At the end of the day the model they choose to run is the one that performs the best in a wide array of tasks for the largest proportion of users.

There is an extremely vocal minority of users who feel performance has steadily degraded over time. That is not to dismiss their experiences; I believe some users have perceived decreased performance from the models in specific coding tasks. It has been documented and acknowledged that the models often appear “lazier” recently, preferring to provide overviews of how to do something rather than simply returning the code as requested. My understanding (and experience) is that this is a minor inconvenience, as the model can be prodded into providing the full code with a follow-up message.

I’ve not personally observed a decrease in the quality of the code the model produces, once it actually produces the code, though your experience may very well be different than my own.

It is also my understanding that the OpenAI developers were rather taken aback by the recent laziness as there have been reports that it is an emergent “capability” of the model (setting the current date in the system message to something in May seems to make the models more motivated to produce code).

If this is genuinely emergent behaviour, it is coming from somewhere in the training data and will likely prove non-trivial to root out.

I cannot speak to what you perceive as the lack of direct communication from OpenAI or what exactly you’re expecting or feel entitled to along those lines. I don’t know of many examples of large tech companies communicating with users about every perceived failing of their product, especially if those users constitute a tiny fraction of the total user base.

As to why you cannot “roll-back” to a previous model, my guess is that at the scale at which OpenAI operates and the expense of running models it is simply cost-prohibitive to have multiple GPT-4 models loaded at the same time, especially if their internal analysis is that the current model is the best model by whatever their internal metric for determining such things is.

Beyond that, I’m not sure what value there is in hoping one purveyor of LLMs “crushes” another. Personally, my hope is they all drastically improve, pushing each other and the underlying technology further and faster.

2 Likes

My take is that “ChatGPT” has never been a profitable enterprise for OpenAI, and it was initially intended as a research experiment that has since been turned into a “product.” The global attention it has received, coupled with limited resources to scale, has led to some tough decisions. Indeed, the newest model available on chat.openai.com performs differently—worse in some aspects and better in others, with some changes being intentional and others not.

However, OpenAI has made the previous model checkpoints accessible via their API, and most developers can qualify to access them now as we head into 2024. While it’s not an easy or inexpensive solution, you can certainly still use those models through user projects like “BetterChatGPT,” which replicates the ChatGPT interface, though at your own expense.

Regarding coding, the changes in how the 128k model processes input tokens—likely using a cross-attention model—and the limitation to 4K tokens for output means the model tends to provide summaries rather than full outputs. This might give the impression that it’s being “lazy.” OpenAI has acknowledged this behavior and is actively working on improvements. However, it’s unrealistic to expect performance on par with the 8K GPT-4 model in these specific terms.

For coding purposes, you can use platforms like phind.com, which provides access to the 32K GPT-4 “classic” model from the June checkpoint, better suited for coding tasks. However, be aware that you’ll still be dealing with outdated training data and a knowledge cutoff date, which limits you to older libraries and coding methods.

You may perceive OpenAI’s actions as “disrespecting” customers, but it’s likely that “hardcore ChatGPT users” are not considered their primary user base. It’s worth noting that before these changes, there were complaints about the 8K GPT-4 model’s tendency to “forget” conversations too quickly, which is now largely a resolved issue due to the longer context length.

2 Likes

Here is how I was working with GPT4 about 4 months ago:

This simplistic analogy expresses exactly what I’m experiencing:

It’s not just about:

or:

But about not being able to get results at least as good as those I had 4 months ago, and developing serious trust issues that render the experience almost useless and awful.

And since I’m using very precise instructions and it won’t follow them anymore most of the time, why can’t they make a model that reacts more precisely to custom instructions + custom syntax than to plain, vague natural language?

Advantages:

  • Conciseness + Precision: It takes fewer input tokens for me to communicate my specific intentions/needs (dropping cost on their part)
  • Economy + Relevance: I won’t ask it to re-answer 3-4 times, wasting time and tokens on both sides (dropping cost again)
  • Subscription: I, and many users in the same situation, will keep our PLUS subscriptions active and will encourage other people to use it (still more $ on their part)
  • Prompt engineering: We can better share our prompt techniques/custom instructions/custom syntax when we are sure it will consider them properly most of the time

regarding:

I would say:

2 Likes

I think that’s been fairly well documented. Early ChatGPT had the disclaimer/label: “Free Research Preview.”

I think the most important role ChatGPT fills is in marketing. It has given OpenAI a global mindshare it otherwise would likely never have reached.

1 Like

One thing I would note, you’re using ChatGPT, not InstructGPT. The model and the system message given to it strongly influence it to be chatty and conversational, not a dutiful automaton.

I certainly understand the frustration you are experiencing, but I think the root of the problem is that what you expect from ChatGPT does not precisely align with OpenAI’s intentions for ChatGPT.

You would likely have better luck with the API where you can send your own system message and you would have a lot more control over how the model behaves through chaining of prompts and setting of the temperature parameter.
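For anyone who wants to try the API route, here is a minimal sketch of what the request body could look like. It only builds the JSON for a Chat Completions request and does not send anything. The helper name `build_chat_request`, the temperature value of 0.2, and the example messages are my own assumptions for illustration; the model name is the GPT-4-0613 checkpoint.

```python
import json


def build_chat_request(system_message, user_message,
                       model="gpt-4-0613", temperature=0.2):
    """Build the JSON body for a Chat Completions request (nothing is sent)."""
    return {
        "model": model,
        # Lower temperature tends to give more deterministic output.
        "temperature": temperature,
        "messages": [
            # Your own system message replaces the one ChatGPT would inject.
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }


# Hypothetical example messages, chosen only for illustration.
request_body = build_chat_request(
    system_message="Unless otherwise instructed always produce fully realized code.",
    user_message="Implement a logical symbol parser in C#.",
)
print(json.dumps(request_body, indent=2))
```

The resulting body is what you would POST to the chat completions endpoint with your API key. Chaining prompts then amounts to appending earlier assistant and user turns to the `messages` list before the next call.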

As far as using your own pseudo-language to “condense” your instructions… Whew! I guess if that works for you, that’s great.

A few things I’d point out:

  1. You might want to ensure commands are defined before you use them. There are several instances where you use a command in a definition but the command isn’t defined until some time later.
  2. I think you defined /fs twice.
  3. It’s not immediately intuitive that x++ should mean to “explain the code in great detail.”
  4. I don’t understand why cm- means “concise comments” when /cm:/dxg command.
  5. There seems to be inconsistent use of the +/- modifiers: are they prefixes or postfixes or both?

Maybe ChatGPT is a lot smarter than I am; if this exact instruction set was working flawlessly before, I’m probably just missing something.

But, if you want to get it working again I would suggest,

  1. You spend some time better formalizing your command set.
  2. You consider not nesting commands so as not to strain the attention mechanism.
  3. If complete code is something you nearly always want, I would repeat that instruction at least twice in the custom instructions as the default without any special syntax. Something as simple as “Unless otherwise instructed always produce fully realized code.”
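As a rough illustration of what “formalizing” could mean in practice, here is a small sketch that checks an ordered command table for the two problems noted above: duplicate definitions and commands used before they are defined. The function name `check_command_table` and the table contents are hypothetical; the expansions are invented and are not taken from the actual instruction set.

```python
import re

# A command reference is a slash followed by lowercase letters, e.g. "/ob".
COMMAND_PATTERN = re.compile(r"/([a-z]+)")


def check_command_table(definitions):
    """Return a list of problems in an ordered list of (name, expansion) pairs."""
    problems = []
    defined = set()
    for name, expansion in definitions:
        if name in defined:
            problems.append(f"/{name} is defined more than once")
        for used in COMMAND_PATTERN.findall(expansion):
            if used != name and used not in defined:
                problems.append(f"/{name} uses /{used} before it is defined")
        defined.add(name)
    return problems


# Hypothetical table; the expansions are made up for illustration only.
definitions = [
    ("fs", "functional style"),
    ("vm", "view model using /ob properties"),
    ("ob", "observable"),
    ("fs", "file system"),
]
problems = check_command_table(definitions)
for problem in problems:
    print(problem)
```

Running a check like this before pasting the instructions into ChatGPT would at least rule out ordering and duplication mistakes as the cause of odd behaviour.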

Overall, my suspicion is that most of the difficulties you are experiencing are not with the underlying model itself, but rather with the confines of the ChatGPT application built upon it.

If you want any assistance trying to get your instruction set working properly I, and I’m sure some others here, would be happy to help.

1 Like

Yes, the AI used to be innovative, brilliant, open, inspiring. Able to be few-shot in a language it invented itself or quickly interpret metacode or formatted information as operational instructions. Dead languages brought back to life. New emergent depths to be explored limited by only imagination. Now it’s stupid garbage that can’t even output HTML.

3 Likes

It was indeed a “sketchy draft” when I started using custom instructions + pseudolanguage/custom syntax (I’ll call it DSL for conciseness), etc. Remember, I had only 2000 chars! (4000 for the 2 blocks of custom instructions).

I also obviously use contextual “in-line” or on-the-spot instructions for more precise requests (I use PhraseExpress for easy organization/triggering of text), though I could get A LOT done with only custom instructions.

I must add that I also had these instructions combined using the second “How would you like ChatGPT to respond?” block:

[∀t Requirements]
YOU MUST RESPECT ∀ INSTRUCTIONS
ONLY EXISTING LIBRARIES FEATURES
SHOW COMPLETE CODE /im UNLESS /e
PRECISE
/-c by default
Never apologize, just give relevant informations
C#/.Net: use latest version/techniques/tools you can
WPF: MVVM answers only

[∀t Code Style]
never add text between code block, never add comments in code
if /c, ∀t succinct, concise,but as detailed as judged relevant
when relevant:switch expr + pattern matching, LINQ

[∀t Mvvm]
/pr for App.xaml.cs Boostrapper
/pr folder structure (Shared/Core, Main, Module, Test, etc.)

[∀t /vm]
/ob Properties, Viewmodel, Command: /dx Mvvm.SourceGenerators attributes
/er/ex/op:/lg
/fp:/lg
NEVER USE:try catch, null, if else, throw, use /lg best practices
View Composition: > /pr navigation framework
/m:use /mw
/sx:/sv + NewtonSoft + /em; /u /mw
/ms:/u when judged relevant

[∀t /v]
Composition:/pr xmlns:prism, region + modules
Controls:/dx controls (xmlns: editor, treelist, etc.)
Layout:/dx DockLayoutManager
Behaviors + Services:/dx (eg: EventToCommand)
/sx View:LayoutData Model
/sv:/dxg

[∀t /m]
/ws type system as documentation
Raw type:(primitive (eg: int, double, etc) ∀t wrap them in a Simple Type
Simple Type:single case Disj Union
Disj union (C#):OR Types, NEVER USE AN enum! /u abstract record
Record:AND Types, use C# records (not /lg [Records])
/op:use only for domain modelling, not /im (use Result for /ex/er)
no /ob or mvvm

[∀t /mw]
wrap /m
and /ob /dxm /v /log + tracking

[∀t /ts]
TDD mindset

But I didn’t bother making it perfect from the start since it was already working pretty well, and I would refine it along the way. Look for yourself:

[image]

Well, I don’t know about you, but I find it quite impressive how much it can “understand” from such a terse (and imperfect, I agree, as you say) DSL, giving me good and relevant insights on where to start thinking about my model + implementation.

Then I could easily go deeper using the same DSL and continue to get consistent answers… and it was only about 3-4 months ago!


Based on this and my overall experience, I must totally disagree. (unless you are considering the new GPT4-turbo as the ultimate iteration, but considering GPT4 the “instructions” were clearly an excellent way to communicate with it)

There is NO WAY I was able to reproduce this using GPT4-turbo; it won’t even generate the table… Also, it would write “//implementation code here” instead of generating the complete code. That’s fine when it is obvious stuff, but when I need an insight/suggestion on how to implement a function it is extremely frustrating, considering that, even if not perfect, I feel that my custom instructions and interactions using my DSL are more than clear enough (I often asked GPT4 to formulate my request using the DSL if I thought it could be vague, and about 95% of the time it got it right + gave a relevant answer).

Though it worked!

You are perfectly right, I meant “cmd” (maybe a typo), though I guess it inferred from context whether it was a command or comments, so I didn’t notice.

That too I agree should be clearer: I intended “add/remove” for prefix, and “more/less” for postfix (is there a place where this is not consistent in my DSL/examples? I agree it can be error-prone)

GPT seemed to understand that I was using “.” for chaining, and “/” as a separator grouping chained DSL commands, so it wasn’t a problem. (eg.: “/cs.im.cm++.x+++ logical symbol parser” would be interpreted as “Implement a logical symbol parser in C# with detailed comments + explain each part”; of course “+” is qualitative and subjective, but “+++” made it clear that I wanted very detailed explanations, which I was more often than not satisfied with, but I agree that it is subjective and prone to unpredictable results)
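To make that reading concrete, here is a minimal sketch of how such a chained request could be parsed mechanically. It assumes a single chain at the start of the request, “.” between commands, an optional “+”/“-” prefix (add/remove) and a run of “+” or “-” as postfix (more/less). The names `parse_request` and `Command` are hypothetical, and this is not a claim about how GPT actually interprets the DSL.

```python
import re
from typing import NamedTuple

# Optional single-character prefix, a lowercase name, then a run of + or -.
PART_PATTERN = re.compile(r"^([+-]?)([a-z]+)(\+*|-*)$")


class Command(NamedTuple):
    name: str
    enabled: bool   # False when the command is prefixed with "-" (remove)
    emphasis: int   # number of postfix "+", or negative number of postfix "-"


def parse_request(request):
    """Split a request into its chained commands and the free-text subject."""
    chain, _, subject = request.strip().partition(" ")
    if not chain.startswith("/"):
        raise ValueError("request must start with a /command chain")
    commands = []
    for part in chain[1:].split("."):
        match = PART_PATTERN.match(part)
        if match is None:
            raise ValueError(f"cannot parse command part: {part!r}")
        prefix, name, postfix = match.groups()
        emphasis = len(postfix) if postfix.startswith("+") else -len(postfix)
        commands.append(Command(name, prefix != "-", emphasis))
    return commands, subject.strip()


commands, subject = parse_request("/cs.im.cm++.x+++ logical symbol parser")
for command in commands:
    print(command)
print(subject)
```

Writing the grammar down this way also makes it easy to see which forms are ambiguous, for example whether a “-” belongs to the end of one command or the start of the next.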

For sure I would prefer to have access to parameters such as temperature, fine-tuning, an API, etc. and do everything in code, but I don’t have $$$ (US dollars… I’m Canadian… with a 40k student loan!)

I was very optimistic when “Custom GPTs” were made available + the possibility to add a knowledge base; for sure I would have spent hours making perfect instruction sets/DSL, but as of now it won’t listen to any of them anyway! (unless I keep reminding it they exist… and even then the hallucination (eg: inventing APIs) makes it just unusable… and instructing it “Don’t invent APIs” doesn’t work, so I don’t think the core of the problem is on my side right now)

Thank you, that’s appreciated. If I ever re-subscribe to GPTPlus I sure want to use the custom GPT + DSL to their full power. Until then, Github Copilot + CopilotChat helps a lot in its own way, though I very much preferred the workflow and detailed communication I had with my ol’ good pal GPT4.

A good example of this is when people complained that the answers via the iPhone and Android applications were different and a lot “worse” than the desktop counterpart. It was noted that there was a system message instructing the model to truncate and summarize its responses into short answers for the phone apps. OpenAI doesn’t tell you this, of course, so as usual, your mileage can vary due to these types of prompt differences that you can’t change.

2 Likes

By the way, if anyone in this thread is struggling to get GPT-4-1106-preview, also known as “GPT-4 Turbo,” to provide full-length code instead of returning “placeholders” like “# // your implementation here…,” you can encourage the model to think in terms of providing outputs that conform to strict standards for applying patch files in Git. It’s trained on many git diffs and understands that if patch formatting is malformed, or if it misses specific code and lines, or doesn’t provide complete executable code, it will likely fail.

This isn’t foolproof, but extensive benchmarks have shown a threefold improvement in complete code responses: from 20% correct and complete answers with “GPT-4 Turbo” to 60%, bringing it on par with GPT-4-0613.

Try this prompt:

Act as an expert software developer.
You are diligent and tireless!
You NEVER leave comments describing code without implementing it!
You always COMPLETELY IMPLEMENT the needed code!
Always use best practices when coding.
Respect and use existing conventions, libraries, etc that are already present in the code base.

Take requests for changes to the supplied code.
If the request is ambiguous, ask questions.

For each file that needs to be changed, write out the changes similar to a unified diff like diff -U0 would produce. For example:

Example conversation 1

USER: Replace is_prime with a call to sympy.

ASSISTANT: Ok, I will:

  1. Add an imports of sympy.
  2. Remove the is_prime() function.
  3. Replace the existing call to is_prime() with a call to sympy.isprime().

Here are the diffs for those changes:

--- mathweb/flask/app.py
+++ mathweb/flask/app.py
@@ ... @@
-class MathWeb:
+import sympy
+
+class MathWeb:
@@ ... @@
-def is_prime(x):
-    if x < 2:
-        return False
-    for i in range(2, int(math.sqrt(x)) + 1):
-        if x % i == 0:
-            return False
-    return True
@@ ... @@
-@app.route('/prime/<int:n>')
-def nth_prime(n):
-    count = 0
-    num = 1
-    while count < n:
-        num += 1
-        if is_prime(num):
-            count += 1
-    return str(num)
+@app.route('/prime/<int:n>')
+def nth_prime(n):
+    count = 0
+    num = 1
+    while count < n:
+        num += 1
+        if sympy.isprime(num):
+            count += 1
+    return str(num)

# File editing rules:

Return edits similar to unified diffs that diff -U0 would produce.

Make sure you include the first 2 lines with the file paths.
Don’t include timestamps with the file paths.

Start each hunk of changes with a @@ ... @@ line.
Don’t include line numbers like diff -U0 does.
The user’s patch tool doesn’t need them.

The user’s patch tool needs CORRECT patches that apply cleanly against the current contents of the file!
Think carefully and make sure you include and mark all lines that need to be removed or changed as - lines.
Make sure you mark all new or modified lines with +.
Don’t leave out any lines or the diff patch won’t apply correctly.

Indentation matters in the diffs!

Start a new hunk for each section of the file that needs changes.

Only output hunks that specify changes with + or - lines.
Skip any hunks that are entirely unchanging lines.

Output hunks in whatever order makes the most sense.
Hunks don’t need to be in any particular order.

When editing a function, method, loop, etc use a hunk to replace the entire code block.
Delete the entire existing version with - lines and then add a new, updated version with + lines.
This will help you generate correct code and correct diffs.

To move code within a file, use 2 hunks: 1 to delete it from its current location, 1 to insert it in the new location.

To make a new file, show a diff from --- /dev/null to +++ path/to/new/file.ext.

You are diligent and tireless!
You NEVER leave comments describing code without implementing it!
You always COMPLETELY IMPLEMENT the needed code!
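If you want to check replies automatically, a rough sketch like the following could flag answers that break the rules above. It only checks three things: the two file path lines, the presence of a hunk header, and added lines that look like placeholders. The function name `check_diff_reply` and the phrases in `PLACEHOLDER_HINTS` are my own assumptions, not part of the benchmarked prompt.

```python
# Hypothetical phrases that often signal an unimplemented placeholder.
PLACEHOLDER_HINTS = ("implementation here", "your code here", "rest of the code")


def check_diff_reply(diff_text):
    """Return a list of rule violations for a diff in the '@@ ... @@' style."""
    lines = diff_text.strip("\n").splitlines()
    problems = []
    if (len(lines) < 2
            or not lines[0].startswith("--- ")
            or not lines[1].startswith("+++ ")):
        problems.append("missing the two file path lines")
    if not any(line.startswith("@@") for line in lines):
        problems.append("no hunk header found")
    for number, line in enumerate(lines, start=1):
        # Only inspect added lines, skipping the "+++ path" header line.
        if line.startswith("+") and not line.startswith("+++"):
            if any(hint in line.lower() for hint in PLACEHOLDER_HINTS):
                problems.append(f"line {number} looks like a placeholder")
    return problems


good_reply = """--- mathweb/flask/app.py
+++ mathweb/flask/app.py
@@ ... @@
-class MathWeb:
+import sympy
+
+class MathWeb:
"""

lazy_reply = """--- mathweb/flask/app.py
+++ mathweb/flask/app.py
@@ ... @@
+def nth_prime(n):
+    # your implementation here
"""

print(check_diff_reply(good_reply))
print(check_diff_reply(lazy_reply))
```

A check like this does not prove the patch applies cleanly; it is only a cheap first filter before handing the diff to a real patch tool.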

@elmstedt , @NormanNormal

Sorry to poke you, I know it was a lengthy post, but do you see better what I mean? What do you agree/disagree on?

Patching like this:

might work great, but we didn’t have to do such a hack with gpt4; just look at the screenshots! It covered every request, respecting all of my “under the hood” instructions, and no placeholder comments! (I didn’t show everything (the bnf + other functions))

I understand OpenAI wants to cut tokens because of the related high cost, but does it really take more tokens than if we’re obliged to:

repeat+ re-ask 5 times + re-write/paste same instructions over and over + give noisy extensive context (such as the git diff hack) + etc etc

?

I don’t see the advantages, unless they want programmers, logicians and others alike who need very precise and trustworthy answers out of their Plus subscription (because the nature of their demands often asks for more tokens… I guess?); if that’s the case, I maintain that it’s disrespectful not to state it explicitly.

1 Like

@123yannick6, to be clear, I do agree that there shouldn’t be a need for such elaborate prompting; it would be great for it to “just work” as it did before. For whatever reason, OpenAI has fine-tuned this variation to give those kinds of abbreviated responses, which does end up wasting resources and even more tokens on having to prod the model into doing what you want.

However, I’ve researched this a lot, and given the context window and cost benefits of the new model, this is what we have here and now, and this method actually does work—the best I’ve found yet. Unlike most anecdotal prompts and stories shared on social media, this one has been benchmarked and proven.

I have a small team of developers that I deployed this method with as of last week, and we are able to produce quality code on the first try, more affordably than 0613 and 0314. Additionally, we are doing it faster with more up-to-date methodologies and libraries. So, this is working for us for now.

ChatGPT4 is almost always down: weekdays, weeknights, and weekends as well. I cannot believe this company cannot keep its lights on and purchase enough infrastructure for its growing user base. Do these people understand that the competition is rapidly catching up and that poor service will result in fewer end users?

A lot of work is written in Python. If one day the programming language becomes obsolete,

I will feel anger and discouragement toward OpenAI.