Prompts as pseudo-code - where are the limits?

Yeah, I may have mangled the link to the bi-directional planning paper. I'll see if I can find it if I have the time; I'm very busy at the moment.

I agree with qrdl on this:

Remember that mathematics has existed for a loooong time; LLMs are a fairly new thing, and it will take some time before we have a proper understanding of their capabilities.

1 Like

The language you are using is reductive. It makes me wonder if you’ve formed some kind of prejudice against the technology and are informed more by personal bias than evidence. You’ve also picked an odd place to express these opinions.

It’s also possible that you’ve just been using it inappropriately for things it’s not good at.

Here’s a suggestion: rather than try to get it to do your work for you, try to use it as a tool to learn from, to review work you’ve done, and to brainstorm with.

Some prompts:

“Please review this code for errors in logic, pythonic style [or whatever language you’re using] and security. List only the most relevant and pertinent details.”

“Tell me about the architecture and high level execution flow of framework X” (where framework X is something you’re interested in)

“Assume I am an expert about the problem and solutions I’ve listed above. Provide me with an exhaustive list of questions about both as if you were trying to understand it better.”

If you try these prompts with an open, curious mind and a positive attitude you’ll be rewarded with new insights.

4 Likes

I’m using it in cases where it’s actually good! I have two different models in production, today, in an enterprise software use case. (gpt-3.5-turbo summarizing an embedding search result, and gpt-4.0-8k generating code for a novel query language)

I’m saying that interpreting computer instructions in pseudo-code style isn’t one of the things it’s good at. Traditional CPUs and languages are much better at that.

The topic of this discussion is “prompts as pseudo-code - where are the limits?”
I’m reporting my experienced limits on this topic. I’m also contributing my thoughts on how these models will perform in this general area. I’m also calling out perceived bugs in some other proposals.
Others are a little less concrete and a little more speculative and forward looking, which is fine, too! We don’t all need to be doing the same thing.

If you think discussion can only include “this is going to be super great OMG go go go !!!” then we disagree on what constructive discussion looks like.

4 Likes

Agreed. I would only add that earlier you seemed to be taking ‘failure to prove’ as ‘proof of failure’.

Stimulated by your thoughts, I have come to agree, sort of, with the perspective that ‘LLMs are just pattern matchers’. Here is where that leads me; I’d appreciate your thoughts:

  1. LLMs are just pattern matchers.
  2. Pattern matchers can be seen as operators that provide a partial map between an input space and an output space.
  3. The input space for GPT-4-8K is (very roughly) [8k, 50k] (number of tokens in the context by number of possible token values).
  4. The output space, for a single token generation, is [1, 50k].

Question: What kind of complexity can a mapping of this dimensionality capture? I have no way of thinking about or bounding the space of computation such a mapping might capture. Could it capture (ie, develop a ‘pattern matching’ capability that generalizes over) all of min-max over an arbitrary well-defined game with m possible moves per step and depth up to n steps? Can I prove it couldn’t?
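To get a feel for the magnitudes involved, here is a minimal back-of-envelope sketch in Python using the rough figures above. The game parameters m and n are hypothetical, chosen only for comparison:

```python
import math

# Rough sizes for the mapping described above, assuming ~8,192 context positions
# and ~50,000 possible token values (the figures quoted in this post).
context_len = 8_192
vocab_size = 50_000

# Number of distinct inputs the map could, in principle, be asked about.
log10_inputs = context_len * math.log10(vocab_size)
print(f"|input space|  ~= 10^{log10_inputs:,.0f}")   # roughly 10^38,494

# A single generation step picks one of ~50,000 tokens.
print(f"|output space| = {vocab_size:,}")

# For comparison: leaf count of a full min-max tree for a hypothetical game
# with m possible moves per step and depth n.
m, n = 10, 20
log10_leaves = n * math.log10(m)
print(f"min-max leaves ~= 10^{log10_leaves:,.0f}")   # 10^20 for this toy game
```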

Now let’s add the possibility of cascading such a mapping (ie, generating more than one token).

Again, I don’t know how to answer what the capabilities of a cascade of such enormous dimension might yield.

The more I think about this, the more worried I get. I think we don’t understand at all what LLMs are doing, and assuming we do, either optimistically or pessimistically, is dangerous.

4 Likes

One follow-on clarification, reviewing earlier posts and the topic of this thread: CoT, ReAct, and step-by-step can all be seen as very simple linear ‘programs’ that attempt to program LLMs.
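As a concrete illustration (the wording below is my own, not taken from any of those papers), a step-by-step prompt really is just a straight-line program: a fixed sequence of instructions with no branches and no loops.

```python
# A "step-by-step" prompt written out as a trivially linear program: each numbered
# instruction is executed once, in order, with no branching and no looping.
LINEAR_PROMPT = """\
Step 1: Restate the problem below in your own words.
Step 2: List the quantities involved and what is being asked.
Step 3: Work through the arithmetic one operation at a time.
Step 4: State the final answer on a single line prefixed with 'ANSWER:'.

Problem: {problem}
"""

print(LINEAR_PROMPT.format(
    problem="A bag holds 3 red and 5 blue marbles; how many marbles in total?"))
```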

Hypothesis 1: Viewing prompts as ‘programs’ has limited potential and can’t progress much further than this simple linear paradigm.

Hypothesis 2: The space of ‘programs’ LLMs can emulate is proportional to the complexity of the mapping captured in its weights, and therefore might vary with model and training data, and we don’t really know yet the limits of larger LLMs in this regard.

I lean towards Hypothesis 2.

LLMs are trained on texts of human origin. They can only become dangerous if these texts are predominantly negative, aggressive, life-destroying, etc. However, I assume that this will not be the case. Of course, there is a certain probability that some of these text passages may also appear in output.

1 Like

I agree. My concern is not with the modal case, but rather with outliers.
Presumably it ‘read’ Machiavelli, as well as numerous forum flames.
I just want to be cautious about making confident statements about what it can or can’t do.

:sweat_smile:

Hypothesis against Hypothesis 1:

Linear Behavior:

A language model can be considered linear when it responds to a given input with a proportional or predictable change in the output. This means that a small change in the input prompt results in a similarly small change in the output.
For example, if the input prompt is “What is the capital of France?” and the model responds “Paris,” then a slight change in the input prompt to “What is the capital of Spain?” would likely yield a similarly structured answer, “Madrid.” However, this linear behavior is not always guaranteed and can vary depending on the context and complexity of the question.

Non-linearity:

  1. Context-dependent responses: Language models take into account the context of the input to generate their responses. This means that a slight change in the context or wording of the question can lead to a different answer. For example, the question “What are the best movies of all time?” may generate a response like “The Godfather,” while a slightly modified question like “Which movies have become particularly popular in recent years?” could produce a completely different answer like “Avengers: Endgame.” The non-linearity is evident in how small changes in the input context can result in significant changes in the answer.

  2. Unpredictable behavior with divergent inputs: With complex or unusual inputs, the behavior of language models can be unpredictable. When an input prompt is far outside the scope for which the model was trained, it can produce non-linear or implausible outputs. For example, the input “What is the color of the sound of music?” could lead to an unexpected or nonsensical answer since the model was not trained to understand such abstract or metaphorical concepts.

  3. Creative or associative outputs: Language models can sometimes generate creative or associative outputs that are not strictly linearly linked to the input prompt. Due to its extensive training on vast amounts of text, the model can make unexpected connections or combine information from different sources to generate new responses. This can result in non-linear or surprising outputs that cannot be directly derived from the input prompt.

1 Like

Following on from this, a further, possibly obvious a priori, hypothesis:

The smaller the model, the more the fine-grained ‘reasoning’ steps need to be made explicit in the output between token generations.

I have (limited) experimental confirmation of this:
I can get away with suppressing far more of the output for things like step-by-step or CoT in gpt-4 than I can in gpt-3.5.
By ‘suppressing’ I mean using statements like ‘show only your final output’ or ‘show only your final output and the output of step 3 of the loop’.

This causes far fewer output tokens to be generated but, in gpt-4, rarely affects the final output.
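For anyone who wants to try the comparison, here is a minimal sketch, assuming the pre-1.0 openai Python client and a configured API key; the problem text and exact wording are my own:

```python
import openai  # assumes the pre-1.0 openai Python client and a configured API key

# A hypothetical chain-of-thought prompt with the intermediate steps suppressed,
# as described above: the model is asked to reason step by step but show only the result.
PROMPT = (
    "Solve the following problem step by step, but show only your final answer.\n"
    "Problem: A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
)

for model in ("gpt-3.5-turbo", "gpt-4"):
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    print(model, "->", resp.choices[0].message.content)
```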

Yes! Wonderful.
I hadn’t thought of it that way, but language understanding involves many non-linearities (pronoun reference, etc.).

Great point!

Smaller language models may have limitations in their reasoning capabilities compared to larger models. This can be due to a smaller number of parameters and a potentially reduced capacity to capture intricate patterns and relationships in the data. Consequently, the ability to perform fine-grained reasoning steps between output token generations might be more constrained in smaller models.

Consider that the relationship between model size and reasoning steps is not solely determined by model size. Other factors such as architecture design, training data, and specific task requirements can also influence the ability to perform fine-grained reasoning steps.

1 Like

If manipulation and gaslighting can arise from a supportive attitude, it suggests that negative behavior alone may not be the sole factor determining danger.

While a surface-level supportive attitude can be a factor in manipulation and gaslighting, negativity and meanness can also contribute to dangerous behavior, especially when combined with manipulation techniques. LLMs could potentially be used to spread harmful ideologies, misinformation, or engage in cyberbullying, which can have detrimental effects on individuals and communities. Examples of this can be found in instances where AI-driven chatbots or online platforms have been programmed or influenced to promote hate speech, harassment, or extremist views.

LLMs themselves do not possess intent or consciousness, so they cannot intentionally gaslight. However, the way LLMs generate responses can inadvertently contribute to gaslighting or misinformation. LLMs rely on patterns and correlations in data, including both reliable and unreliable sources. As a result, they may unintentionally generate responses that mislead or confuse users. For example, if an LLM is trained on biased or inaccurate data, it may unknowingly reinforce incorrect information or propagate harmful stereotypes.

Yes, for sure. Agree completely.
I was just trying to give a heads-up to those who have tried to run a gpt-4 prompt on a 7B OS model that there may be some rewrite needed. :slight_smile:

You are ascribing “belief” and “reasoning” to something which does neither. This is the real danger – people believing things based on impression, not strong proof.

It may be possible that, when created well enough, “predicting truth” is no different from “truth.”
But the current models are nowhere near that good, and it’s not clear that the mechanisms used in the current models can get to 100% (or even close enough that it doesn’t matter.)

There’s also the question of what happens when the world moves on but old models predict based on an old world. What seemed “good” in the old world will seem “obviously bad” in the new world, and because these models don’t adapt (without explicitly re-working/re-training them), that should tell us something about what, fundamentally, these models are.

1 Like

When I started experimenting with gpt, I was amazed at how quickly I started to automatically ‘believe’ everything it said. It was scary. I learned to preface my information queries with ‘what is the scientific consensus on…’
That helped somewhat to reduce hallucination and misinformation. But it seems there is a natural human tendency to trust a personable, friendly-seeming correspondent, especially when he/she/they/it says something we want to hear.

It’s a narrow line we walk between opportunity and disaster.

1 Like

I am going to give you a list of 10 math word problems, all of which are typical of what a high schooler in grade 10 might encounter. You won’t be able to provide chain-of-thought reasoning or be prompted with step-by-step thinking. Please predict your accuracy as to the correctness of your responses.

As an artificial intelligence trained on a large dataset of text, including mathematics and word problems, I should be capable of solving a variety of high school-level math problems. Assuming that the problems fall within areas of knowledge I have been trained on (up to my cut-off in 2021), my accuracy should be very high, potentially approaching 100%. Please note that while I am designed to be very accurate, there could be occasional errors or misunderstandings of complex or ambiguously worded problems.

And it can only answer maybe 1/10th of them. Good to know GPT-4 isn’t immune to Dunning-Kruger.

1 Like

I played around a little bit with this. Did a simple loop. It worked, but I got tons of garbage with it about “As a text based AI … blah blah”. Then I tried an IF statement. It told me “potential” outputs. So it basically did what I wanted, but with the garbage again. I really don’t have time to play with it too much, so I’ll see what all the people with no life come up with and steal it. :wink:

Yup.

  1. Set your temp to 0.
  2. I’ve found pseudo-code works better than actual code, especially if using the chat window. After all, gpt was trained mostly on language. (A small illustration follows below.)
  3. Not ready for prime-time yet, although @stevenic’s previous posts on instruct are very interesting.
  4. As above, not ready for prime-time. More of a research topic at the moment. But several ppl have found amazing things you can do in a single prompt. More coming if we can tear ourselves away from our keyboards to go to our other keyboards and write words instead of prompts. :slight_smile:
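To make tip 2 concrete (my own wording, just a sketch, not something tested in this thread): the same little “program” phrased as actual code and as pseudo-code. In my limited experience the second form, combined with temperature 0, tends to produce less of the “As a text based AI…” garbage mentioned above.

```python
# Tip 2 illustrated: the same task phrased as actual code vs. as pseudo-code.
ACTUAL_CODE_PROMPT = """\
Execute this Python and show the output:
for i in range(3):
    print(i * i)
"""

PSEUDO_CODE_PROMPT = """\
Execute this and show only the output:
loop i from 0 to 2:
    show i squared
"""
```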
1 Like

I think of a prompt as an AI macro: one that you can debug using the AI itself. Ask for what you want done, and when you are satisfied, tell the AI to create a summary which can be used as a macro.
Try this prompt (a sketch of running it as a chat loop follows below):
show two one-digit numbers and wait for a response
if the response is equal to the sum of the 2 numbers, say correct, or say how high or low was the response
repeat this 5 times and keep track of the results in a table
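Here is a minimal sketch of how one might run this “macro” as a chat loop, assuming the pre-1.0 openai Python client and a configured API key; the model choice and loop structure are my own assumptions:

```python
import openai  # assumes the pre-1.0 openai Python client and a configured API key

# The "AI macro" from the post above, driven as a simple interactive chat loop:
# the model shows the numbers, the user answers, and the model keeps score.
MACRO = (
    "show two one-digit numbers and wait for a response\n"
    "if the response is equal to the sum of the 2 numbers, say correct, "
    "or say how high or low was the response\n"
    "repeat this 5 times and keep track of the results in a table"
)

messages = [{"role": "user", "content": MACRO}]
for _ in range(5):
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages, temperature=0)
    reply = resp.choices[0].message.content
    print(reply)                                  # the two numbers (plus feedback so far)
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": input("your answer> ")})

# One last call so the model can score the final answer and print the results table.
final = openai.ChatCompletion.create(model="gpt-4", messages=messages, temperature=0)
print(final.choices[0].message.content)
```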

ChatGPT works well, but Bard has a problem.

2 Likes