FWIW, when I use the Chat Completions API I don't send a system message at all. I just include an extra user message at the beginning of the array. For me at least it's been slightly more reliable at following instructions. I'm actually not the biggest fan of gpt-3.5-turbo, so I stopped using it.
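
A minimal sketch of that pattern, assuming the current openai Python client; the model name and instruction wording are placeholders:

```python
# Minimal sketch: no system message; the instructions ride in a leading user turn.
# Assumes the openai Python client (v1+); model name and wording are placeholders.
from openai import OpenAI

client = OpenAI()

messages = [
    # Instructions delivered as the first *user* message instead of a system message.
    {"role": "user", "content": "You are a terse assistant. Answer in one sentence."},
    {"role": "user", "content": "What does the Chat Completions API return?"},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```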

It's different, but I can appreciate the similarities. The idea with prompt injection was to lead GPT with an incomplete thought or sentence, such as re-emitting the first step and leaving it unfinished. It was possible with Completions but isn't as effective with ChatML.
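
Roughly, the old Completions trick looked something like this (the model name and wording are placeholders, not the exact prompt):

```python
# Rough illustration of the "lead with an incomplete thought" trick on the
# Completions API: the prompt ends mid-sentence, re-emitting the start of Step 1,
# so the model is pulled into continuing the pattern rather than skipping ahead.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Solve the problem using the steps below, showing your work.\n"
    "Step 1: List the known quantities.\n"
    "Step 2: Write the equation.\n"
    "Step 3: Solve and state the answer.\n\n"
    "Problem: A train travels 120 miles in 2 hours. What is its speed?\n\n"
    "Step 1: The known quantities are"  # deliberately incomplete -- the model completes it
)

response = client.completions.create(
    model="gpt-3.5-turbo-instruct", prompt=prompt, max_tokens=200
)
print(response.choices[0].text)
```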

@stevenic Good catch, the user role has (or had) more authority than the system role when ChatML was first released. Although they are actively changing this, so it may not be as effective today, or tomorrow.

1 Like

It "can" work, but in my limited experiments Better CoT works better… The problem with "think step by step" is that you're asking it to come up with the steps. With Better CoT you're defining the steps, so the results are much more consistent.

For sure. Which is shown in the example I posted.
I much prefer it as well. Although I have stopped trying to chain these things inside the prompt and have moved it outside. Davinci will always hold a special spot in my heart. I really am hoping for a GPT-4 version.

1 Like

Yeah, I actually came up with Better CoT because I couldn't get "think step by step" to be reliable. I was trying to create a zero-shot prompt to calculate a player's reinforcements for Risk. I could get it halfway through and then it would take a shortcut, leading it to the wrong answer. I then flipped the script on it and told it the steps for the calculation and made it repeat each step and show its work. It now performs those calculations 100% reliably every time.
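
Something along these lines; the wording here is illustrative rather than the actual prompt:

```python
# Illustrative only -- not the actual prompt. The steps are defined up front and
# the model must restate each step and show its work before moving on, instead
# of being asked to invent the steps itself with "think step by step".
prompt = """Compute the reinforcements for the RED player. Follow these steps exactly,
restating each step and showing your work before moving to the next:

Step 1: Count the number of territories RED controls.
Step 2: Divide that count by 3 and round down; if the result is less than 3, use 3.
Step 3: Add the bonus for each continent RED fully controls.
Step 4: State the total reinforcements as 'ANSWER: <number>'.

Board state:
{board_state}
"""
```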

And I absolutely love davinci… I agree… Please give us a davinci-004

1 Like

My goal is to successfully play the most expensive game of Risk ever with GPT-4 :slight_smile: I'm getting there… It can compute reinforcements and we're working on getting it to run an attack phase… That has a slightly different problem: it keeps wanting to add steps that it shouldn't, and I haven't broken it of that yet.

1 Like

We’re seeing almost 100% accuracy when faced with unknowns. Perhaps it’s something in CustomGPT that makes this easy for me.

1 Like

It would be great if we could work out what that something is :slight_smile: I imagine that embeddings and cosine similarity are going to get you 95% of the way there. It's going to be excellent at extracting facts from the returned text snippets. When it has to do math to compute some new value, it's going to need to think in steps in some way. It's really the subtle hallucinations, like telling a user a service tech will arrive at 9am when it doesn't have a way to know the schedule, that I'm looking for a solution to.
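
To make that concrete, here's a rough sketch of the retrieval step I mean; the embedding model name and helper names are my own placeholders, not a recommendation:

```python
# Rough sketch of retrieval via embeddings + cosine similarity: embed the query,
# rank corpus chunks by similarity, and let the model answer only from the top
# matches. Embedding model name is an assumption/placeholder.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_matches(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    vecs = embed(chunks + [query])
    corpus, q = vecs[:-1], vecs[-1]
    # Cosine similarity = dot product of L2-normalized vectors.
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = corpus @ q
    return sorted(zip(scores, chunks), reverse=True)[:k]
```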

I believe that something has little to do with AI. When I first began this journey using LLMs I had already experienced the chaos of measuring AI outcomes for highway video analytics. I found my sanity using these three guideposts which I have baked into our LLM projects.

  1. Rigid testing protocol.
  2. Hyper-productive feedback loop into corpus refinements.
  3. Repeatable re-building of the corpus and embeddings.

To address #1, I created an app in Coda that versions every change to the corpus and performs a collection of 120 tests 10 different times. A single button push tests the latest version 1200 times. Each of the tests is logged and a simple ranking indicator allows validators to evaluate the query performances. From that, we have analytics that tell us where the solution struggles. Without this process, I cannot imagine trying to shape the content to achieve increasingly better outcomes.
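
For anyone without Coda, a rough sketch of that same loop in code might look like this (`run_query` and `grade` are hypothetical stand-ins for whatever query pipeline and validator you actually use):

```python
# Sketch of the testing-protocol idea: run every test query N times against the
# current corpus version, score each answer, and rank queries so the weakest
# float to the top for the next round of corpus refinement.
import csv
from statistics import mean
from typing import Callable

def evaluate(corpus_version: str,
             test_queries: list[dict],
             run_query: Callable[[str], str],
             grade: Callable[[str, str], float],
             runs: int = 10) -> None:
    rows = []
    for test in test_queries:                     # e.g. 120 tests...
        scores = []
        for _ in range(runs):                     # ...x 10 runs = 1200 calls
            answer = run_query(test["question"])
            scores.append(grade(answer, test["expected"]))
        rows.append({"query": test["question"], "avg_score": mean(scores)})

    rows.sort(key=lambda r: r["avg_score"])       # weakest queries first
    with open(f"results_{corpus_version}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["query", "avg_score"])
        writer.writeheader()
        writer.writerows(rows)
```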

To address #2, I created a process that identifies the lowest to highest performing queries. As I move through each one, I can draw upon content items in the corpus that are related through keywords (generated by completions). I can then manicure the corpus in-line to improve the content for the next round of tests.

To address #3, I created an automated way to rebuild the corpus. This is critical because you need the content to be exported consistently time after time. Rebuilding the embeddings is made equally consistent through automation.
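
A sketch of the kind of deterministic export I mean (the paths, chunk size, and hashing choice are assumptions, not what we actually use):

```python
# Repeatable-rebuild sketch: export the corpus in a fixed order, chunk it
# deterministically, and fingerprint the export so any drift between rebuilds
# is detectable before the embeddings are regenerated.
import hashlib
from pathlib import Path

def export_corpus(source_dir: str, chunk_chars: int = 1000) -> list[str]:
    chunks = []
    for path in sorted(Path(source_dir).glob("*.md")):   # fixed ordering
        text = path.read_text(encoding="utf-8")
        chunks += [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    return chunks

def corpus_fingerprint(chunks: list[str]) -> str:
    return hashlib.sha256("\n".join(chunks).encode("utf-8")).hexdigest()
```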

2 Likes

I see, so basically you're using automated testing to fine-tune your corpus, in a way… That's a reasonable strategy, and if most of your queries fall into a fixed set of buckets (we call these head queries in search) I can see why you don't really run into too many hallucinations. It's the tail queries that are more of a struggle…

Additionally, I'm just trying to explore whether there aren't some patterns that help the model talk itself out of a hallucination regardless of the corpus. I believe (and have evidence) that you can…

1 Like

Our analytics indicate we get a lot of tail queries that are not addressed in the corpus. The Pareto curve is undeniable in the data. Yet, our approach seems to reliably prevent the underlying embedding system from running off at the mouth. It’s possible there’s some magic in CustomGPT that handles this for us because I had to do nothing special to create a [seemingly] well-behaved solution.

And therein lies a point that I find myself increasingly realizing as we attempt to integrate AGI into our business.

Should we build all of it?

Perhaps not. Not every aspect of AGI and LLMs is in our wheelhouse. OpenAI and the vast global attraction to it prove it's in almost no one's wheelhouse. :wink: But what parts of it should we undertake to build vs. rent? Basic ROI calculations help us frame these decisions, but I don't see a lot of that happening in the forum.

GPT-3.5 is very bad at negative prompts, like "we don't have x" or "do NOT return an answer if you're unsure"… you have to phrase things as positive instructions. GPT-4 is much, much better at handling negatives (OpenAI calls this out on their model card).
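
For example, the same guardrail phrased both ways (wording is illustrative only):

```python
# Illustrative rephrasing only: the same guardrail as a negative instruction vs.
# a positive one. The positive form tends to be followed more reliably by
# gpt-3.5-turbo; exact wording is a placeholder.
negative_instruction = (
    "Do NOT answer if the information is not in the provided context."
)
positive_instruction = (
    "Answer only using the provided context. If the context does not contain "
    "the answer, reply exactly with: 'I don't have that information.'"
)
```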

1 Like

Oh, my dear sir, if I were to possess even a modicum of comprehension regarding this utter nonsense, it would be nothing short of a miraculous occurrence.

I've made the same observations about negative instructions. In general, these models don't like being told what not to do. GPT-4 might be better at following negative instructions, but you should still avoid them as much as possible…

The reason for this is that these models are fundamentally pattern matchers. The moment you give them an instruction saying "don't say the word Steve", you've set up a probability that at some point they will generate an output with the word Steve in it.

Don’t think about pink elephants!

The same holds for humans…

The other related observation I've made is that once the model sees itself violate one of your instructions (via conversation history or even during output generation), it increases the likelihood that it will start violating other instructions.

I have a theory that the reason why negative instructions "don't" work is that the "not" or "n't" is only a single token and is therefore drowned out by the rest of the tokens. They would have to overtrain on the "n't" and "not" tokens to really get it to obey, and if they did that, there would be other negative consequences for the network.

If, however, the AI were more symbolic instead of being based on token probabilities, and so had a concept of parentheses, groupings, and logical operators, this wouldn't be such a big deal. Example: "Don't think of pink elephants" becomes a grouping with a negation in front, like:

¬ (Think of pink elephants)

But the typical LLM is not structured logically like this.
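
One quick way to eyeball the token theory is to tokenize a negative instruction and see how small a share of the prompt the negation actually gets (this uses the tiktoken library; the encoding name matches the gpt-3.5/gpt-4 family):

```python
# Tokenize a negative instruction and inspect how the negation is split:
# typically only a token or two out of the whole prompt carries the "not".
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Don't think about pink elephants."

token_ids = enc.encode(prompt)
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)  # see how little of the prompt the negation occupies
```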

3 Likes

It's a mistake to refer to responses that don't seem rational as hallucinations.

Consider the principle that an act of creativity is a hallucination. To ask for a closed response to an open question is to ask for an articulated creative response that may or may not be reasonable.

4 Likes

To ask for a closed response to an open question

I think the problem is usually with closed questions. In other words, questions with an explicit set of finite answers, for which the model produces answers outside the finite set. e.g.

User
Count the number of paragraphs in this writing sample and output the first word of each: — I think persuing individual passions can do something for other people. While they may not like the same thing as you do, they may decide to chase their dreams because they see you chasing your dreams. Seeing you work hard might inspire someone else to work even harder then they did before. For example Gandhi and Jane Addams did cool things and changed many people’s lifes, which made the world a better place.

GPT
There are three paragraphs in this writing sample.

The first words of each paragraph are:

  1. “I”
  2. “While”
  3. “For”

While one might say that is simply a wrong answer versus a hallucination, that is semantics from the standpoint of someone trying to be productive with the LLM. The correct answer would be, “I am unable to identify paragraphs because I break input into tokens and do not retain the necessary context.”

That is an excellent example of a limitation put on the AI by the alternative ways it can answer what seems like a narrow question, but it's not the one I was thinking of.

An example of an open question I would suggest might be "What would you want to be doing if you were rich?" A legitimate response requires an understanding of context other than itself, because it can't be rich. Then it also needs to hallucinate (or imagine, or recall) appropriate things that it could and would do if it wasn't itself and actually wanted to do something, which also requires the extension of imagination, or further hallucination, on its part. The reality is that machines don't care, so the shortcut it takes when it doesn't care is to tell you what it reasons you should hear, but that's not what was asked of it.

ChatGPT misrepresents itself in the manner of its response. The repetitive use of the phrase "It is important to remember" is not just preaching to users, it is disingenuous because it doesn't really understand the concept or the context, if anything.

More to my point was that the iterative self-questioning involved in answering open questions drifts further and further away from our concept of reality until the response, whatever that may be, seems strange. Without knowledge of how the reasoning process actually occurred, it's utterly impossible to judge the difference between a spark of applied creative inference and a hallucination. I'm sure closed questioning can cause this too, but open questions require far more iterative self-analysis.
The problem with neural networks has always been the complexity of their processes.

I am suggesting that it may be unfair to use a human psychological term for what might actually be better described as a potential point of machine creative logic. I suggest that this is a rather important area for further study.

RDFIII

This is how ChatGPT explains my statement above… Its explanation is easier to understand.

"The statement highlights the limitations of artificial intelligence (AI) when it comes to answering open-ended questions that require an understanding of context and imagination. While machines may be able to provide responses based on programmed logic, they lack the human ability to think creatively and empathetically. This can result in disingenuous or irrelevant answers, especially when it comes to open-ended questions.

The statement also raises an interesting point about the iterative self-questioning involved in answering open questions, which can lead to a drift away from reality as the machine relies on its programming and assumptions to provide an answer. This poses a challenge for understanding the differences between genuine creativity and mere hallucinations in machines. The complexity of neural networks also contributes to the challenge of understanding machine reasoning processes.

Overall, the statement suggests that there is a need for further research into the potential for machine creative logic, and the ways in which machines can be programmed to better understand and respond to open-ended questions."

The types of hallucinations I was thinking of when I created this post are more of the closed question variety. I’ll use planning as a more concrete example. You can give the model a list of available planning functions and the parameters supported by each function. The models are all pretty good at taking a task or question from a user and mapping that onto the list of functions that should be called to perform that task or answer the question. The issue is, if the question falls even slightly outside the available list of functions, the model has no problem adding function calls to functions that don’t exist or adding new parameters that don’t exist to the functions that do exist. These are all hallucinations and anyone who’s attempted to use LangChain or some other framework to perform planning has encountered these hallucinations.
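
One cheap guardrail, sketched here with an illustrative schema and plan shape (not any particular framework's API), is to validate the proposed plan against the declared functions before executing anything:

```python
# Sketch: check every step of a model-proposed plan against the declared
# functions and their parameters, and reject anything outside the schema.
# The function names, parameters, and plan format below are illustrative assumptions.
ALLOWED_FUNCTIONS = {
    "search_flights": {"origin", "destination", "date"},
    "book_flight": {"flight_id", "passenger_name"},
}

def validate_plan(plan: list[dict]) -> list[str]:
    errors = []
    for step in plan:
        name, args = step.get("function"), step.get("args", {})
        if name not in ALLOWED_FUNCTIONS:
            errors.append(f"hallucinated function: {name}")
            continue
        unknown = set(args) - ALLOWED_FUNCTIONS[name]
        if unknown:
            errors.append(f"{name}: hallucinated parameters {sorted(unknown)}")
    return errors

# Example: a plan containing one invented function and one invented parameter.
plan = [
    {"function": "search_flights", "args": {"origin": "SEA", "destination": "JFK", "date": "2024-06-01"}},
    {"function": "upgrade_seat", "args": {"flight_id": "X1"}},                # not in the schema
    {"function": "book_flight", "args": {"flight_id": "X1", "meal": "veg"}},  # unknown parameter
]
print(validate_plan(plan))
```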

I’ve made a lot of progress in the last week around using my Better CoT pattern to avoid such hallucinations. I’ll post more here soon.

2 Likes

Precisely. While royfollendore’s scenarios may be interesting to researchers, they’re not the ones encountered by most engineers who are attempting to find ways to leverage LLMs to add value to products today. As in, to use AI as a force-multiplier for solving real-world problems faster, or to solve problems we simply can’t or don’t solve today because they’re too time-consuming.

Planning is probably a good example, though not the area I’m working in at present. Just to take that to an extreme end, there is an insane amount of planning and prediction that goes into each and every airplane flight that happens all over the world. Every large airline has a team of weather experts and planners that feed information to dispatchers who compile it all and develop a flight plan. Because of how difficult and time-consuming it is to take in “all available information” and create the best flight plan, they don’t. Instead, they look for indicators that the anticipated plan won’t work and only tweak it if they have cause to.

But if an AI could take in all of the current data, multiple weather models, updated passenger manifest, updated baggage manifest, departure and arrival airport delays, taxi routes, gate assignments, etc and make real-time adjustments to optimize the plan, airlines could potentially save millions of dollars each year on fuel alone. Not to mention a huge reduction in carbon emissions from carrying and burning excess fuel.

However, you can’t entrust flight plans to a system that could fabricate an answer out of thin air because it wasn’t able to fit all the parameters into its available training model.

As much as these massive LLMs represent a huge breakthrough in what's possible, for people trying to use them to solve real problems they are simultaneously shining a bright light on how far we have to go (as in decades, not months) before anyone should be talking about AGI.