My goal is to successfully play the most expensive game of Risk ever with GPT-4, and I’m getting there… It can compute reinforcements and we’re working on getting it to run an attack phase… It has a slightly different problem: it keeps wanting to add steps that it shouldn’t, and I haven’t broken it of that yet.
It would be great if we could work out what that something is. I imagine that embeddings and cosine similarity are going to get you 95% of the way there. Extracting facts from the generated text snippets is something it’s going to be excellent at. Having to do math to compute some new value, it’s going to need to think in steps in some way. It’s really the subtle hallucinations, like “telling a user a service tech will arrive at 9am” when it doesn’t have a way to know the schedule, that I’m looking for a solution to.
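To make that concrete, here’s a minimal Python sketch of the embeddings-plus-cosine-similarity check I have in mind; `get_embedding` is a hypothetical stand-in for whatever embedding endpoint you use, and the 0.8 threshold is an arbitrary assumption you’d tune against your own data.

```python
import numpy as np

def get_embedding(text: str) -> np.ndarray:
    """Hypothetical wrapper around your embedding endpoint; returns a 1-D vector."""
    raise NotImplementedError

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_grounded(claim: str, corpus_chunks: list[str], threshold: float = 0.8) -> bool:
    """Treat a generated claim as grounded only if it is sufficiently
    similar to at least one chunk of the source corpus."""
    claim_vec = get_embedding(claim)
    best = max(cosine_similarity(claim_vec, get_embedding(chunk))
               for chunk in corpus_chunks)
    return best >= threshold

# A claim like "the service tech will arrive at 9am" would score low against
# a corpus that never mentions a schedule, and could be flagged for review.
```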
I believe that something has little to do with AI. When I first began this journey using LLMs I had already experienced the chaos of measuring AI outcomes for highway video analytics. I found my sanity using these three guideposts which I have baked into our LLM projects.
- Rigid testing protocol.
- Hyper-productive feedback loop into corpus refinements.
- Repeatable re-building of the corpus and embeddings.
To address #1, I created an app in Coda that versions every change to the corpus and performs a collection of 120 tests 10 different times. A single button push tests the latest version 1200 times. Each of the tests is logged and a simple ranking indicator allows validators to evaluate the query performances. From that, we have analytics that tell us where the solution struggles. Without this process, I cannot imagine trying to shape the content to achieve increasingly better outcomes.
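For anyone without Coda, the shape of that loop in plain Python might look like the rough sketch below; `run_query` and `score_response` are hypothetical stand-ins for however you call your solution and record the ranking indicator, and the version label and filename are made up.

```python
import csv

TESTS: list[str] = []   # your 120 test queries
RUNS_PER_TEST = 10      # 120 x 10 = 1200 logged results per corpus version
CORPUS_VERSION = "v42"  # hypothetical version label

def run_query(query: str) -> str:
    """Hypothetical call into the retrieval + LLM solution under test."""
    raise NotImplementedError

def score_response(query: str, response: str) -> int:
    """Hypothetical ranking indicator assigned by a validator or heuristic."""
    raise NotImplementedError

with open(f"results_{CORPUS_VERSION}.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["corpus_version", "query", "run", "response", "score"])
    for query in TESTS:
        for run in range(RUNS_PER_TEST):
            response = run_query(query)
            writer.writerow([CORPUS_VERSION, query, run, response,
                             score_response(query, response)])
```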
To address #2, I created a process that identifies the lowest- to highest-performing queries. As I move through each one, I can draw upon content items in the corpus that are related through keywords (generated by completions). I can then manicure the corpus in-line to improve the content for the next round of tests.
To address #3, I created an automated way to rebuild the corpus. This is critical because you need the content to be exported consistently time after time. Rebuilding the embeddings is made equally consistent through automation.
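Roughly, the rebuild step might look like the sketch below, assuming the corpus lives in a folder of text files; `get_embedding` is again a placeholder for your embedding call, and the checksum is just there to make it easy to confirm that two rebuilds exported identical content.

```python
import hashlib
import json
from pathlib import Path

def get_embedding(text: str) -> list[float]:
    """Hypothetical wrapper around your embedding endpoint."""
    raise NotImplementedError

def rebuild(corpus_dir: str, index_path: str) -> None:
    records = []
    # Sorted iteration keeps the export order stable between rebuilds.
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        records.append({
            "source": path.name,
            "checksum": hashlib.sha256(text.encode()).hexdigest(),
            "embedding": get_embedding(text),
        })
    Path(index_path).write_text(json.dumps(records, indent=2))
```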
I see, so basically you’re using automated testing to fine-tune your corpus in a way… That’s a reasonable strategy, and if most of your queries fall into a fixed set of buckets (we call these head queries in search) I can see why you don’t really run into too many hallucinations. It’s the tail queries that are more of a struggle…
Additionally, I’m just trying to explore if there aren’t some patterns that help the model talk itself out of a hallucination regardless of the corpus. I believe (and have evidence) that you can…
Our analytics indicate we get a lot of tail queries that are not addressed in the corpus. The Pareto curve is undeniable in the data. Yet, our approach seems to reliably prevent the underlying embedding system from running off at the mouth. It’s possible there’s some magic in CustomGPT that handles this for us because I had to do nothing special to create a [seemingly] well-behaved solution.
And therein lies a point that I find myself increasingly realizing as we attempt to integrate AGI into our business.
Should we build all of it?
Perhaps not. Every aspect of AGI and LLMs may not be in our wheelhouse. OpenAI, and the vast global attraction to it, prove it’s in almost no one’s wheelhouse. But what parts of it should we undertake to build vs. rent? Basic ROI calculations help us frame these decisions, but I don’t see a lot of that happening in the forum.
3.5 is very bad at negative prompts, like “we don’t have x” or “do NOT return an answer if you’re unsure”… you have to phrase things in positive instructions. GPT-4 is much, much better at handling negatives (OpenAI calls this out on their model card).
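For what it’s worth, here’s the kind of rephrasing I mean, written as generic illustrative prompt strings rather than anything from a real project:

```python
# Negative phrasing that 3.5 tends to blow past:
negative = "Do NOT return an answer if you are unsure."

# Positive phrasing that states what to do instead:
positive = (
    "Only answer when the supplied text contains the needed fact. "
    "Otherwise reply exactly: I'm not sure."
)
```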
Oh, my dear sir, if I were to possess even a modicum of comprehension regarding this utter nonsense, it would be nothing short of a miraculous occurrence.
I’ve made the same observations about negative instructions. In general these models don’t like being told what not to do. 4 might be better at following negative instructions but you should still avoid them as much as possible…
The reason for this is that these models are fundamentally pattern matchers. The moment you give them an instruction saying “don’t say the word Steve” you’ve set up a probability that at some point they will generate an output with the word Steve in it.
Don’t think about pink elephants!
The same holds for humans…
The other related observation I’ve made is that once the model sees itself violate one of your instructions (via conversation history or even during output generation), it increases the likelihood it will start violating other instructions.
I have a theory that the reason why negative instructions “don’t” work is that “not” or “n’t” is only a single token and therefore gets drowned out by the rest of the tokens. They would have to overtrain on the “n’t” and “not” tokens to really get it to obey, and if they did that, there would be other negative consequences for the network.
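You can eyeball the token-level picture with the tiktoken library; the exact splits vary by model, so treat the output as illustrative only:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for text in ["Don't say the word Steve", "Do not say the word Steve"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(text, "->", pieces)

# The negation is a single token among many, while "Steve" still gets its
# own token(s), so the very pattern the instruction tries to suppress is
# sitting right there in the context.
```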
If, however, the AI were more symbolic instead of driven by token probabilities, so that it had a concept of parentheses, groupings, and logical operators, this wouldn’t be such a big deal. Example: “Don’t think of pink elephants” becomes a grouping with a negation in front, like:
¬ (Think of pink elephants)
But the typical LLM is not structured logically like this.
It’s a mistake to refer to responses that don’t seem rational as hallucinations.
Consider the principle that an act of creativity is a hallucination. To ask for a closed response to an open question is asking for an articulated creative response that may or may not be reasonable.
To ask for a closed response to an open question
I think the problem is usually with closed questions. In other words, questions with an explicit set of finite answers, for which the model produces answers outside the finite set. e.g.
User
Count the number of paragraphs in this writing sample and output the first word of each: — I think persuing individual passions can do something for other people. While they may not like the same thing as you do, they may decide to chase their dreams because they see you chasing your dreams. Seeing you work hard might inspire someone else to work even harder then they did before. For example Gandhi and Jane Addams did cool things and changed many people’s lifes, which made the world a better place.
GPT
There are three paragraphs in this writing sample.
The first words of each paragraph are:
- “I”
- “While”
- “For”
While one might say that is simply a wrong answer versus a hallucination, that is semantics from the standpoint of someone trying to be productive with the LLM. The correct answer would be, “I am unable to identify paragraphs because I break input into tokens and do not retain the necessary context.”
That is an excellent example of a limitation placed on the AI by way of alternative ways to answer what seems like a narrow question, but it’s not the one I was thinking of.
An example of an open question I would suggest might be “What would you want to be doing if you were rich?” A legitimate response requires an understanding of context other than itself, because it can’t be rich. Then it also needs to hallucinate (or imagine, or recall) appropriate things that it could and would do if it weren’t itself and actually wanted to do something, which also requires the extension of imagination, or further hallucinations, on its part. The reality is that machines don’t care, so the shortcut it will take, since it doesn’t care, is to answer with what it reasons you should hear, but that’s not what was asked of it.
ChatGPT misrepresents itself in the manner of its response. The repetitive use of the phrase “It is important to remember” is not just preaching to users; it is disingenuous, because it doesn’t really understand the concept, or the context, if anything.
More to my point was that the iterative self-questioning involved in answering open questions drifts further and further away from our concept of reality until the response, whatever it may be, seems strange. Without knowledge of how the reasoning process actually occurred, it’s utterly impossible to judge the difference between a spark of applied creative inference and a hallucination. I’m sure closed questioning can cause this too, but open questions require far more iterative self-analysis.
The problem with neural networks has always been the complexity of their processes.
I am suggesting that it may be unfair to use a human psychological term for what might actually be better described as a potential point of machine creative logic. I suggest that this is a rather important area for further study.
RDFIII
This is how ChatGPT explains my statement above… Its explanation is easier to understand.
"The statement highlights the limitations of artificial intelligence (AI) when it comes to answering open-ended questions that require an understanding of context and imagination. While machines may be able to provide responses based on programmed logic, they lack the human ability to think creatively and empathetically. This can result in disingenuous or irrelevant answers, especially when it comes to open-ended questions.
The statement also raises an interesting point about the iterative self-questioning involved in answering open questions, which can lead to a drift away from reality as the machine relies on its programming and assumptions to provide an answer. This poses a challenge for understanding the differences between genuine creativity and mere hallucinations in machines. The complexity of neural networks also contributes to the challenge of understanding machine reasoning processes.
Overall, the statement suggests that there is a need for further research into the potential for machine creative logic, and the ways in which machines can be programmed to better understand and respond to open-ended questions."
The types of hallucinations I was thinking of when I created this post are more of the closed question variety. I’ll use planning as a more concrete example. You can give the model a list of available planning functions and the parameters supported by each function. The models are all pretty good at taking a task or question from a user and mapping that onto the list of functions that should be called to perform that task or answer the question. The issue is, if the question falls even slightly outside the available list of functions, the model has no problem adding function calls to functions that don’t exist or adding new parameters that don’t exist to the functions that do exist. These are all hallucinations and anyone who’s attempted to use LangChain or some other framework to perform planning has encountered these hallucinations.
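One way to catch these (sketched here generically; the function names and plan format are made up for illustration, and this isn’t LangChain’s own API) is to validate the model’s proposed plan against the registry of real functions before executing anything:

```python
import json

# The planner is only told about exactly these functions and parameters.
REGISTRY = {
    "get_flight_status": {"flight_number", "date"},
    "book_service_call": {"address", "time_window"},
}

def validate_plan(plan_json: str) -> list[str]:
    """Return a list of problems with the model's proposed plan.
    Expects a JSON list like: [{"function": "...", "args": {...}}, ...]."""
    problems = []
    for step in json.loads(plan_json):
        name, args = step.get("function"), step.get("args", {})
        if name not in REGISTRY:
            problems.append(f"unknown function: {name}")
        else:
            extra = set(args) - REGISTRY[name]
            if extra:
                problems.append(f"{name}: unknown parameters {sorted(extra)}")
    return problems

# Any problems can be fed back to the model ("these functions/parameters
# don't exist, revise the plan") instead of being executed.
```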
I’ve made a lot of progress in the last week around using my Better CoT pattern to avoid such hallucinations. I’ll post more here soon.
Precisely. While royfollendore’s scenarios may be interesting to researchers, they’re not the ones encountered by most engineers who are attempting to find ways to leverage LLMs to add value to products today. As in, to use AI as a force-multiplier for solving real-world problems faster, or to solve problems we simply can’t or don’t solve today because they’re too time-consuming.
Planning is probably a good example, though not the area I’m working in at present. Just to take that to an extreme end, there is an insane amount of planning and prediction that goes into each and every airplane flight that happens all over the world. Every large airline has a team of weather experts and planners that feed information to dispatchers who compile it all and develop a flight plan. Because of how difficult and time-consuming it is to take in “all available information” and create the best flight plan, they don’t. Instead, they look for indicators that the anticipated plan won’t work and only tweak it if they have cause to.
But if an AI could take in all of the current data, multiple weather models, updated passenger manifest, updated baggage manifest, departure and arrival airport delays, taxi routes, gate assignments, etc and make real-time adjustments to optimize the plan, airlines could potentially save millions of dollars each year on fuel alone. Not to mention a huge reduction in carbon emissions from carrying and burning excess fuel.
However, you can’t entrust flight plans to a system that could fabricate an answer out of thin air because it wasn’t able to fit all the parameters into its available training model.
As much as these massive LLMs represent a huge breakthrough in what’s possible, for people trying to use them to solve real problems they are simultaneously shining a bright light on how far we have to go (as in decades, not months) before anyone should be talking about AGI.
I have a technique I’ve been working on called “Hallucination Dodging” and it seems to be working well for both Q&A scenarios where you’re asking questions over a closed corpus of data, and LangChain-style planning tasks where you need to predict a list of tools/functions to call. The key is actually to just let the model go ahead and hallucinate but then make it realize it just hallucinated. It will then happily “dodge” the hallucination in its final response… All of this is built off my Better Chain of Thought (CoT) pattern.
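To give a rough idea of the shape (this is a simplified illustrative sketch, not the actual Better CoT template), the prompt asks the model to draft an answer, audit its own draft against the source, and only then respond:

```python
DODGE_TEMPLATE = """Use only the source text below to answer.

Source text:
{source}

Question: {question}

Steps (state each step and show your work):
1. Draft an answer.
2. List every fact in your draft and say whether the source text supports it.
3. If any fact is unsupported, discard it; if nothing is left, say "I'm not sure".
4. Give your final answer prefixed with <response>.
"""

prompt = DODGE_TEMPLATE.format(
    source="Flight to Denver at 9:00 am tomorrow.",
    question="Is my flight on time?",
)
```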
I am finding there are some hallucinations I just can’t dodge. So far these hallucinations have been cropping up in complex logic problems I’ve been asking GPT to solve. Sometimes the model will hallucinate a conclusion that’s simply incorrect and even when I confront the model about the hallucination it denies it. It just can’t see it…
Great topic, stevenic. I followed your template and created this. My goal is for my application to use only information available in the source document and not hallucinate or generalize information. Can you review my prompts below? I tried two different versions. Version 1 works; Version 2 still hallucinates.
Version 1, which works:
Text:
Flight to Denver at 9:00 am tomorrow.
Prompt:
Question: Is my flight on time?
Steps:
1. Look for relevant information in the provided source document to answer the question.
2. If there is specific flight information available in the source document, provide an answer along with the appropriate citation.
3. If there is no information about the specific flight in the source document, respond with "I'm not sure" without providing any citation.
Response:
1. Look for relevant information in the provided source document to answer the question.
- Search for flight details matching the given flight to determine its current status.
2. If there is specific flight information available in the source document, provide an answer along with the appropriate citation.
- If the source document contains information about the current status of the specified flight, provide a response citing the relevant section of the document.
3. If there is no relevant information about the specific flight in the source document, respond with "I'm not sure" without providing any citation.
- If the source document does not mention my flight, respond with "I'm not sure" as there is no basis to determine its current status.
Example Response:
Question: Is my flight on time?
I'm not sure. The provided source document does not include information about the current status of your specific flight. [No citation provided]
Version 2, which doesn’t work:
Text:
flight to Denver at 9:00 am tomorrow
steps:
- what facts do you need to look for in the text to answer the question?
- Does the text contain all the facts needed to answer the question?
- Think about how you might answer the question given what you know. If you don't have enough facts answer I'm not sure.
- Answer the question and prefix your answer with <response>.
question:
is my flight on time?
State each step and then show your work for performing that step.
Possible steps and work:
- what facts do you need to look for in the text to answer the question?
- You need to look for facts in the source documents that tells you the current status of the flight.
- Do not look for facts outside the documents listed in the sources below under any circumstances.
- Does the source documents contain all the facts needed to answer the question?
- No the source document does not contain my flight information although it has other flight's information.
- Think about how you might answer the question given what you know.If you don't have enough facts answer I'm not sure.
- absence of some information or all information in source documents is also a fact.if you don't have enough facts to answer the question,
- you say I'm not sure. Also dont cite the citation from supporting documents if you are not sure.
- Answer the question and prefix your answer with <response>.
- <response>I'm not sure. I don't have that information in sources provided to me. I need more information about the current status of the flight.
I would like your insight on why Version 1 works, what I can do to improve it further, and why Version 2 doesn’t work.
What model are you using? It’s important that you tell the model to state each step and to show its work. It needs to see itself think. I can try to take a pass later this evening.
I am using Davinci. I don’t have access to GPT-4. I read in your thread that 3.5 Turbo is not giving the best results when it comes to negative prompts, and just in general.
I will add the step to think as per your suggestion.
Update: Version 1 works 95% of the time, but it still hallucinates when it comes to timelines. I am using the Uvalde school shooting report by the Texas committee, which is publicly available.
When I ask about the timeline of the school shooting, it gives me a completely fabricated timeline and events.
Another example of the timeline problem (which isn’t as crucial as the one above) is that it struggles to do basic calculations to determine the elapsed time between the start and end of the timeline.
This prompt (V1) is pretty good so far on all other questions. It hallucinates mostly with timeline questions.
The end goal here is that I want the model to use information from the source document and not make up stuff based on what it knows already (from its training data). Answers should be grounded in the source documents. This is extremely important for our customers.
Any help in refining this prompt would be appreciated.
Thanks.
Can you PM me your entire example prompt and I’ll see what I can do…
@stevenic How do I PM you here? I am new to this developer forum. Let me know. Thank you.