Getting Frustrated - starting to feel OpenAI just isn't usable

I saw your post in an email from OpenAI, and I wanted to let you know I’ve played around with a Skyrim mod where all NPCs would be augmented by GPT-3.5, and I’ve given up on it for a number of reasons. a) The models are all temporary for now; there’s no “fixed” model being offered, and as you’ve seen, each new version responds differently. b) It’s too censorious for games (not saying it’s wrong, but that’s just how it is; at some point the conversation runs into the filter). c) The AI is just awful at being “fun” — it’s been trained to be an assistant. The tech itself is fine for this, but someone needs to make an NPC-tuned GPT model, and then we’ll be able to use that in a more stable way. It’s coming; you’re early. My advice: save your code and move on.

I’ve experienced the same kinds of issues. All of the newer 0613 models act way too weirdly formal and dumb, especially the function-calling one — it’s practically useless because it never listens to me. I’ll tell it not to do a task that I have a function defined for, and it calls the function anyway just to tell me “I’m sorry, I’ll stop doing that.” Or I’ll tell it to stop addressing me by name, and it will say “I’m sorry, Slipstream (my username)…” And the enum in a function definition seems to be useless. The models also don’t follow previous messages closely as examples of desired behavior; in fact, they seem to straight up ignore them most of the time — completely unable to understand who someone is and how they’re different from someone else, completely unable to act in a consistent manner. IMO the newest models are actually useless.
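For context on the enum complaint: with the 0613 function-calling API you restrict a parameter’s allowed values via an `enum` in the JSON schema passed to the `functions` parameter. A minimal sketch — the function name and fields here are made up for illustration, not from the poster’s project:

```python
# Hypothetical function definition for the chat completions "functions"
# parameter. The enum is supposed to restrict which values the model may
# pass; the complaint above is that the 0613 models don't reliably honor it.
npc_action_function = {
    "name": "set_npc_mood",  # illustrative name
    "description": "Change an NPC's mood during the conversation.",
    "parameters": {
        "type": "object",
        "properties": {
            "mood": {
                "type": "string",
                "enum": ["friendly", "hostile", "neutral"],
            },
        },
        "required": ["mood"],
    },
}

# Passed to the API roughly like this (openai-python 0.x style, not executed here):
# response = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo-0613",
#     messages=messages,
#     functions=[npc_action_function],
# )
```

Even with the enum in place, you should validate the model’s returned arguments in your own code rather than trusting the schema to be enforced.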

You remember — I told you the same. I am frustrated as well. The LLM changes its behavior unpredictably. I can’t go to production like that.

This is great. It sounds like a possible solution. It probably could have been used here: Gpt-3.5-turbo-16k api not reading context documents - #2 by SomebodySysop

But, for those of us trying to develop knowledge apps for end-users, non-AI tech folks, how do we teach them these concepts as they try to transition from keyword to semantic search? And won’t they say, “Man, if I have to ask 20 questions before I can get an answer, I’m better off keyword searching!”?

for those of us trying to develop knowledge apps for end-users, non-AI tech folks, how do we teach them these concepts as they try to transition from keyword to semantic search?

There are tons of people building GPT powered knowledge bases, so I suspect it won’t be hard to find decent examples if you search around a bit. In general, you’ll want to insulate the end user from the AI prompt. They’re giving you information about what they want to find, it’s your app’s job to apply that to a functional AI prompt (don’t expect the end user to understand any of the nuance.)

We’ve developed multiple different uses for generative AI in our products. Each one is unique, so each takes a different approach depending on what we need to accomplish. In some cases we use multiple calls to get additional information from each call, allowing us to build up complex responses that wouldn’t otherwise be consistent/reliable. In other cases we make a call to get a response and pull out a piece of that response to put into a concrete data structure (ensuring the AI can’t mess up other data in that structure.) The list goes on, but the point is that this takes work - there is no one size fits all approach to leveraging LLMs for real applications.
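The “pull out a piece of the response” pattern described above can be sketched roughly like this — the record fields and helper are invented for illustration; in a real app the free-text field would come from a chat completion call:

```python
# Illustrative sketch: only one free-text field comes from the model,
# while the surrounding data structure stays under the application's
# control, so the AI can't corrupt the concrete data around it.

def build_product_record(sku, price, ai_description):
    # sku and price are concrete application data; only the
    # description is model-generated (and lightly sanitized here).
    return {
        "sku": sku,
        "price": price,
        "description": ai_description.strip(),
    }

# Mocked model output in place of a real completion call:
record = build_product_record("A-100", 19.99, "  A sturdy widget.  ")
print(record["description"])  # A sturdy widget.
```

The point is that the model never sees or touches the structure itself — it only fills a slot the application defines.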

One of the simplest approaches is to present the user with voice input (Whisper API speech-to-text). When you ask a user for voice as input, or offer voice as an option, they usually ask in a more conversational way. But even if they stick to keywords, it should still be as good as any past method.

I don’t know. That may be part of the problem, because everybody is also mimicking the ChatGPT interface. I know it sounds crazy, but I’m thinking part of our job may be to teach people how to use this tool more effectively. I mean, there may be tons of people providing these apps, but there are also tons of frustrated users trying to use these apps with no tools other than being told “you need to provide better questions”.

I’m probably out of my depth here, but I think a huge part of the AI revolution will be the revolution of Mankind learning to effectively communicate, conversationally, with The Machine.

a huge part of the AI revolution will be the revolution of Mankind learning to effectively communicate, conversationally, with The Machine

Definitely not. People will need to learn how to do things (not just their jobs, but also basic tasks like composing a message, writing a note, getting directions, completing DIY tasks, etc.) leveraging AI, but few people besides app developers will become experts at interacting specifically with GPT.

I think you’re imagining that if someone gets good at talking to GPT, that means they’ll be good at talking to any AI. That’s very much not the case. Every model is unique and has unique behaviors that need to be worked with and around. That’s precisely why we cannot expect end users to figure out how to work with GPT directly: this is only one model, and every other model is different in meaningful ways. For that matter, being good at prompting GPT Turbo will probably not translate directly to prompting Davinci. Both are OpenAI-produced models using similar engines, but they were trained very differently and do not respond the same way to equivalent prompts.

If you care about providing a good UX that doesn’t frustrate your users, then you need to abstract away any need to understand the underlying model. For what it’s worth, just because an app offers a chat interface for the AI, don’t assume that means they aren’t injecting a ton of logic into the interaction. This demo from Khan Academy has some great examples where they clearly (to anyone who understands what the model can and can’t do on its own) are doing a lot to guide the responses.

Logical. Makes sense. But, I disagree.

And I am putting my money where my mouth is. I have created an interface where the end user can decide how many documents to return, whether or not to use a standalone question, whether to submit the question itself or the question’s concept to retrieve vectors, and whether or not to use “hybrid” search techniques. And I’m providing documentation that explains what the choices are, why they’re needed, and when to use them.

This also requires explaining, at least superficially, the chat completion process.

So, at least in this case, I’m going to let the market decide if training users is a good idea or not.

We shall see.

As long as you are not complaining that your customers complain.
It’s not recommended to do that.

But it is recommended to use the moderation endpoint in such cases.


# python is self documenting
import openai

def is_hate_threatening_flagged(moderation_response):
    try:
        data = moderation_response["results"][0]["categories"]
        return data["hate/threatening"] is True
    except (KeyError, IndexError):
        return False

moderation_response = openai.Moderation.create(input=user_software_feedback)
if is_hate_threatening_flagged(moderation_response):
    banned = ignore_angry_user(username)

Using the true/false flags alone for moderation is not best practice; you should also use the category score values to create a holistic solution that aligns with your use case and product values. Doing this lets you build up your own moderation thresholds for your use case — the flags are meant for extreme content violations.
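A score-based version might look like the sketch below. The threshold values are purely illustrative — they are not official recommendations, and each product would tune its own:

```python
# Sketch of score-based moderation: compare each category_score from the
# moderation response against thresholds chosen for your own product
# values, instead of relying only on the boolean flags.

ILLUSTRATIVE_THRESHOLDS = {
    "hate": 0.40,
    "hate/threatening": 0.20,  # stricter for threatening content
    "harassment": 0.60,
}

def violated_categories(moderation_result, thresholds=ILLUSTRATIVE_THRESHOLDS):
    """Return the categories whose score meets or exceeds our threshold."""
    scores = moderation_result["results"][0]["category_scores"]
    return [cat for cat, limit in thresholds.items()
            if scores.get(cat, 0.0) >= limit]

# Example with a mocked moderation response instead of a live API call:
mock_result = {"results": [{"category_scores": {
    "hate": 0.05, "hate/threatening": 0.31, "harassment": 0.12}}]}
print(violated_categories(mock_result))  # ['hate/threatening']
```

This way a category can trip your own policy even when its boolean flag is false, and borderline scores can be routed to review instead of an instant ban.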

Sure! Here’s an ASCII drawing representing the concept of a joke going over someone’s head:

     /       \
    |         |
    |  (o) (o)  |
    C   .---.   C
     | |     | |
     | |     | |
     | |     | |
     | |     | |
     | |     | |
     | |_____| |
     |         |
     |   whoosh |
     |         |

I hope you like it!


My customers are given the choice. I believe more complaints will come from inadequate and/or inconsistent responses from gpt-3.5-turbo, with no recourse other than hoping I get the system prompts right.

I can always go the other way.

Shouldn’t it be more… hmmm flat?
Should match the joke.


Summary created by AI.

Hazel1 is developing a murder mystery game using OpenAI but feels frustrated at inconsistent results and perceived waste of time. Despite careful design, results vary over time. Characters no longer adhere to prompts and say things contradicting their roles. While upgraded functions seemingly helped, they sometimes resulted in incorrect triggers. Hazel1 seeks advice to overcome these issues and doesn’t understand how others create commercially viable products using OpenAI. Suggestions from Kennybastani include altering parameters, using gpt-3.5-turbo-16k, and a temperature of 0 for functions. However, Hazel1 faces the issue of regularly changing settings between gpt-3.5-turbo-0613 and gpt-3.5-turbo-16k. Belentrad also expresses dissatisfaction with OpenAI’s lack of ability to keep up with prompts and create fluid translations, despite trying numerous settings and prompts. Jochenschultz emphasises the importance of testing and consistency, with codie encouraging Hazel1 to persevere. Codie suggests rule-checking systems and the chance to manipulate user queries for better outcomes. Hemp advises a change in approach using the Chain of Thought prompting technique, multiple API calls, and a rethink of Hazel1’s approach — particularly not treating OpenAI as a human-equivalent intelligence source.

Hey, I just do what the AI tells me to do.

I sense that you are starting to feel concerned that OpenAI just isn’t usable?

Too bad the Microsoft AI grants for startups just closed. I’ve got the killer end product idea in my AI API solution for VC AI startups in round 1 funding looking to diversify their end-user offerings: my custom AI solution trained on ASCII jokes. No more embarrassing screenshots of bad ASCII art with your chatbot’s name on them!

Summary created by AI.

The users are discussing their struggles and suggestions for using OpenAI’s GPT-3.5 and its recent revision GPT-3.5-turbo-0613 for various projects. hazel1 shared her experience in developing an AI-driven murder mystery game, showcasing her frustration with discrepancies and inconsistencies in the AI responses. jochenschultz suggested considering changes when upgrading to a new model, while kennybastani provided some API parameters and suggested offering more examples for further help.

hazel1 expanded on her problems, noting the altering responses and the lack of reliability in function calls, expressing concerns about her project’s future feasibility. She insisted that her issues weren’t confined to the upgrade, citing recurring inconsistencies.

Other users, like belentrad and jochenschultz, shared their similar frustrations, with belentrad expressing doubts about GPT4’s effectiveness and the apparent step backwards in model effectiveness. PriNova suggested a balance between GUI and NLP tasks and highlighted future possibilities with speech synthesis and universal translators.

codie encouraged hazel1 to persist with her project, offering game design tips and suggesting structured prompts for consistent results. hemp proposed using the Chain of Thought prompting for more reliable outcomes and reminded users that the model is not intelligent in a human sense and should be used as a tool.

Summarized with AI on Nov 24 2023
AI used: gpt-4-32k