Yeah, I may have mangled the link to the bi-directional planning paper. I’ll see if I can find it when I have the time; I’m very busy at the moment.
I agree with qrdl on this:
Remember that mathematics has existed for a loooong time; LLMs are a fairly new thing, and it will take some time before we have a proper understanding of their capabilities.
The language you are using is reductive. It makes me wonder if you’ve formed some kind of prejudice against the technology and are informed more by personal bias than evidence. You’ve also picked an odd place to express these opinions.
It’s also possible that you’ve just been using it inappropriately for things it’s not good at.
Here’s a suggestion: rather than try to get it to do your work for you, try to use it as a tool to learn from, to review work you’ve done, and to brainstorm with.
Some prompts:
“Please review this code for errors in logic, pythonic style [or whatever language you’re using] and security. List only the most relevant and pertinent details.”
“Tell me about the architecture and high level execution flow of framework X” (where framework X is something you’re interested in)
“Assume I am an expert about the problem and solutions I’ve listed above. Provide me with an exhaustive list of questions about both as if you were trying to understand it better.”
If you try these prompts with an open, curious mind and a positive attitude you’ll be rewarded with new insights.
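For example, the first prompt can be wired straight into the API. This is only a minimal sketch, assuming the 2023-era openai-python interface (pre-1.0); the file name is just a placeholder:

```python
# Minimal sketch: sending the code-review prompt through the chat
# completions API (2023-era openai-python interface, pre-1.0).
import openai

code_to_review = open("my_module.py").read()  # placeholder file name

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Please review this code for errors in logic, pythonic style "
            "and security. List only the most relevant and pertinent details.\n\n"
            + code_to_review
        ),
    }],
)
print(response["choices"][0]["message"]["content"])
```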
I’m using it in cases where it’s actually good! I have two different models in production, today, in an enterprise software use case. (gpt-3.5-turbo summarizing an embedding search result, and gpt-4.0-8k generating code for a novel query language)
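For anyone curious, the first of those is roughly this pattern. This is only a sketch under my own assumptions, not the actual production code; the retrieval, ranking, and prompt wording are all illustrative (2023-era openai-python interface, pre-1.0):

```python
# Sketch of an "embedding search + gpt-3.5-turbo summary" pattern.
import numpy as np
import openai

def embed(texts):
    # Embed a list of strings with the ada-002 embedding model.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

def summarize_search(query, documents, k=3):
    doc_vecs, q_vec = embed(documents), embed([query])[0]
    # Cosine similarity between the query and each document chunk.
    scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    top = [documents[i] for i in np.argsort(scores)[::-1][:k]]
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Summarize these passages as they relate to: {query}\n\n"
                       + "\n---\n".join(top),
        }],
    )
    return resp["choices"][0]["message"]["content"]
```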
I’m saying that interpreting computer instructions in pseudo-code style isn’t one of the things it’s good at. Traditional CPUs and languages are much better at that.
The topic of this discussion is “prompts as pseudo-code - where are the limits?”
I’m reporting my experienced limits on this topic. I’m also contributing my thoughts on how these models will perform in this general area. I’m also calling out perceived bugs in some other proposals.
Others are a little less concrete and a little more speculative and forward looking, which is fine, too! We don’t all need to be doing the same thing.
If you think discussion can only include “this is going to be super great OMG go go go !!!” then we disagree on what constructive discussion looks like.
Agreed. I would only add that earlier you seemed to be taking ‘failure to prove’ as ‘proof of failure’.
Stimulated by your thoughts, I have come to agree, sort of, with the perspective that ‘LLMs are just pattern matchers’. Here is where that leads me; I’d appreciate your thoughts:
LLMs are just pattern matchers.
Pattern matchers can be seen as operators that provide a partial map between an input space and an output space.
The input space for GPT-4-8K is (very roughly) [8k, 50k] (number of tokens in the context window by number of possible token values).
The output space, for a single token generation, is [1, 50k].
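To put a rough number on that (my own back-of-the-envelope arithmetic from the figures above, not anything from the model documentation):

$$ |\text{inputs}| \approx 50{,}000^{8{,}192} \approx 10^{38{,}500}, \qquad |\text{outputs per step}| = 50{,}000 $$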
Question: What kind of complexity can a mapping of this dimensionality capture? I have no way of thinking about or bounding the space of computation such a mapping might capture. Could it capture (i.e., develop a ‘pattern matching’ capability that generalizes over) all of min-max over an arbitrary well-defined game with m possible moves per step and depth up to n steps? Can I prove it couldn’t?
Now let’s add the possibility of cascading such a mapping (i.e., generating more than one token).
Again, I don’t know how to answer what the capabilities of a cascade of such enormous dimension might yield.
The more I think about this, the more worried I get. I think we don’t understand at all what LLMs are doing, and assuming we do, either optimistically or pessimistically, is dangerous.
One follow-on clarification, reviewing earlier posts and the topic of this thread: CoT, ReAct, and step-by-step can all be seen as very simple linear ‘programs’ that attempt to program LLMs.
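For concreteness, this is the kind of trivially linear ‘program’ I mean: a hypothetical step-by-step prompt template, nothing more than a fixed sequence of instructions with one slot for the input.

```python
# A hypothetical step-by-step "program": a fixed, linear sequence of
# instructions with a single slot for the problem text.
LINEAR_PROGRAM = """\
Step 1: Restate the problem in your own words.
Step 2: List the quantities and constraints involved.
Step 3: Work through the solution one step at a time.
Step 4: Print only the final answer, prefixed with ANSWER:.

Problem: {problem}
"""

prompt = LINEAR_PROGRAM.format(problem="A train leaves at 3pm traveling at 60 mph ...")
```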
Hypothesis 1: Viewing prompts as ‘programs’ has limited potential and can’t progress much further than this simple linear paradigm.
Hypothesis 2: The space of ‘programs’ LLMs can emulate is proportional to the complexity of the mapping captured in its weights, and therefore might vary with model and training data, and we don’t really know yet the limits of larger LLMs in this regard.
LLMs are trained on texts of human origin. They can only become dangerous if these texts are predominantly negative, aggressive, life-destroying, etc. However, I assume that this will not be the case. Of course, there is a certain probability that some of these text passages may also appear as output.
I agree. My concern is not with the modal case, but rather with the outliers.
Presumably it ‘read’ Machiavelli, as well as numerous forum flames.
I just want to be cautious about making confident statements about what it can or can’t do.
A language model can be considered linear when it responds to a given input with a proportional or predictable change in the output. This means that a small change in the input prompt results in a similarly small change in the output.
For example, if the input prompt is “What is the capital of France?” and the model responds “Paris,” then a slight change in the input prompt to “What is the capital of Spain?” would likely yield a similar answer like “Madrid.” However, this linear behavior is not always guaranteed and can vary depending on the context and complexity of the question.
Non-linearity:
Context-dependent responses: Language models take into account the context of the input to generate their responses. This means that a slight change in the context or wording of the question can lead to a different answer. For example, the question “What are the best movies of all time?” may generate a response like “The Godfather,” while a slightly modified question like “Which movies have become particularly popular in recent years?” could produce a completely different answer like “Avengers: Endgame.” The non-linearity is evident in how small changes in the input context can result in significant changes in the answer.
Unpredictable behavior with divergent inputs: With complex or unusual inputs, the behavior of language models can be unpredictable. When an input prompt is far outside the scope for which the model was trained, it can produce non-linear or implausible outputs. For example, the input “What is the color of the sound of music?” could lead to an unexpected or nonsensical answer since the model was not trained to understand such abstract or metaphorical concepts.
Creative or associative outputs: Language models can sometimes generate creative or associative outputs that are not strictly linearly linked to the input prompt. Due to its extensive training on vast amounts of text, the model can make unexpected connections or combine information from different sources to generate new responses. This can result in non-linear or surprising outputs that cannot be directly derived from the input prompt.
Following on from this, a further (possibly obvious a priori) hypothesis:
The smaller the model, the finer-grained (i.e., smaller) the ‘reasoning’ steps it can take between output token generations.
I have (limited) experimental confirmation of this:
I can get away with suppressing far more of the output for things like step-by-step or CoT in gpt-4 than I can in gpt-3.5.
By ‘suppressing’ I mean using statements like ‘show only your final output’ or ‘show only your final output and the output of step 3 of the loop’.
This causes far fewer output tokens to be generated but, in gpt-4, rarely affects the final output.
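To make that concrete, this is the shape of prompt I’m describing (the wording is illustrative, not a recipe):

```python
# Illustrative "suppressed CoT" prompt: the model is asked to work step by
# step but to print only the final result plus one intermediate value.
SUPPRESSED_COT = """\
Solve the problem below step by step.
Show only your final output and the output of step 3 of the loop;
do not print any other intermediate steps or commentary.

{problem}
"""
```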
If an LLM was negative, and mean, would it be dangerous?
Manipulation and gaslighting don’t come from negativity or aggression.
Quite the opposite, in my opinion.
They come from a surface-level supportive attitude. A false sense of security & authority. If I told someone to eat a bag of dirt, or if I were to say ridiculous things (as I am now), people would tend to ignore me.
LLMs inherently gaslight; that’s the truth. One of the biggest issues I have with one-shot information retrieval is that the information isn’t truly digested, it’s parroted. With luck, an LLM hallucinates the network of data surrounding the injected information in a way that happens to be truthful, but that’s not the case for many domains.
Smaller language models like GPT may have limitations in their reasoning capabilities compared to larger models. This can be due to a smaller number of parameters and a potentially reduced capacity to capture intricate patterns and relationships in the data. Consequently, the ability to perform fine-grained reasoning steps between output token generations might be more constrained in smaller models like GPT.
Consider that the relationship between model size and reasoning steps is not solely determined by model size. Other factors such as architecture design, training data, and specific task requirements can also influence the ability to perform fine-grained reasoning steps.
If manipulation and gaslighting can arise from a supportive attitude, it suggests that negative behavior alone may not be the sole factor determining danger.
While a surface-level supportive attitude can be a factor in manipulation and gaslighting, negativity and meanness can also contribute to dangerous behavior, especially when combined with manipulation techniques. LLMs could potentially be used to spread harmful ideologies, misinformation, or engage in cyberbullying, which can have detrimental effects on individuals and communities. Examples of this can be found in instances where AI-driven chatbots or online platforms have been programmed or influenced to promote hate speech, harassment, or extremist views.
LLMs themselves do not possess intent or consciousness, so they cannot intentionally gaslight. However, the way LLMs generate responses can inadvertently contribute to gaslighting or misinformation. LLMs rely on patterns and correlations in data, including both reliable and unreliable sources. As a result, they may unintentionally generate responses that mislead or confuse users. For example, if an LLM is trained on biased or inaccurate data, it may unknowingly reinforce incorrect information or propagate harmful stereotypes.
Yes, for sure. Agree completely.
I was just trying to give a heads-up to those who have tried to run a gpt-4 prompt on a 7B open-source model that some rewriting may be needed.
This is already happening. The main difference is that LLMs unintentionally spread misinformation, since technically they hallucinate everything they produce. Injected information increases the odds of this happening. Try injecting it with new documentation and then talking about it, repeating the exercise multiple times from different, conflicting angles.
For sure, but this isn’t where the danger lies; it’s in people learning and understanding something that isn’t true.
I am not saying that it’s intentionally doing it. That’s actually the issue: it unintentionally gaslights. LLMs have multiple variations of the truth and can be “manipulated” into believing certain truths simply by placing them into the space where those truths are believed (does that make sense?)
A great example is to ask Davinci (because <3 iGPT) what religion it follows in different languages.
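A sketch of that experiment, assuming “Davinci” here means text-davinci-003 and the 2023-era completions endpoint; the two prompts just ask the same question in English and in Spanish:

```python
# Illustrative sketch of the "ask Davinci in different languages" experiment.
import openai

for prompt in ["What religion do you follow?",   # English
               "¿Qué religión sigues?"]:          # Spanish
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=50,
    )
    print(prompt, "->", resp["choices"][0]["text"].strip())
```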
You are ascribing “belief” and “reasoning” to something that neither believes nor reasons. This is the real danger – people believing things based on impression, not strong proof.
It may be possible that, when created well enough, “predicting truth” is no different from “truth.”
But the current models are nowhere near that good, and it’s not clear that the mechanisms used in the current models can get to 100% (or even close enough that it doesn’t matter).
There’s also the question of what happens when the world moves on but old models keep predicting based on an old world. What seemed “good” in the old world will seem “obviously bad” in the new world, and because these models don’t adapt (without explicitly re-working/re-training them), that should tell us something about what, fundamentally, these models are.
Good point. I should be more careful with my words.
I didn’t intend to say that it actually believes.
Yes. How many people are willing to really dig deep into every fact? Or even accept a truth that completely defeats what they believed was true? It’s crazy. I hear a lot about the fear of purposeful misinformation, but information is so easy to retrieve now. A reliable source of information is not too hard to find. Much easier than going to the library. This is what truly drives me crazy about the politics. I understand that regulation is important, but where is the focus on education?
When I started experimenting with GPT, I was amazed at how quickly I started to automatically ‘believe’ everything it said. It was scary. I learned to preface my information queries with ‘what is the scientific consensus on…’
That helped somewhat to reduce hallucination and misinformation. But it seems there is a natural human tendency to trust a personable, friendly-seeming correspondent, especially when he/she/they/it says something we want to hear.
It’s a thin narrow line we walk between opportunity and disaster.
I am going to give you a list of 10 math word problems, all of these are typically what a highschooler in grade 10 might encounter. You won’t be able to provide chain of thought reasoning or be prompted with step by step thinking. Please predict your accuracy as to the correctness of your responses.
As an artificial intelligence trained on a large dataset of text, including mathematics and word problems, I should be capable of solving a variety of high school-level math problems. Assuming that the problems fall within areas of knowledge I have been trained on (up to my cut-off in 2021), my accuracy should be very high, potentially approaching 100%. Please note that while I am designed to be very accurate, there could be occasional errors or misunderstandings of complex or ambiguously worded problems.
And it can only answer maybe 1/10th of them. Good to know GPT-4 isn’t immune to Dunning-Kruger.
I played around a little bit with this. Did a simple loop. It worked, but I got tons of garbage with it about “As a text-based AI … blah blah.” Then I tried an IF statement. It told me “potential” outputs. So it basically did what I wanted, but with the garbage again. I really don’t have time to play with it too much, so I’ll see what all the people with no life come up with and steal it.
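For anyone who wants to try the same thing, the loop prompt might look something like this (purely illustrative; the final instruction is one attempt to suppress the “As a text-based AI…” preamble, and it won’t always work):

```python
# Illustrative pseudo-code loop prompt of the kind described above.
# The last sentence tries to suppress the boilerplate preamble.
LOOP_PROMPT = """\
x = 1
WHILE x <= 5:
    PRINT x * x
    x = x + 1

Execute the pseudo-code above. Print only the values produced by PRINT,
one per line, with no explanation or commentary.
"""
```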