"Do this occasionally" - A potential (but strange) method to implement randomness

I am primarily using the latest GPT-4 API, and I was trying to get it to make an occasional humorous comment or joke. The problem is that I am not keeping any chat history, so there is no context for “every third question” or anything like that.

Then I had a revelation - GPT is notoriously bad with primes! The larger the number, the more trouble it has (very loosely speaking).

So I tried adding this to my sysprompt:
“Is 7 a prime number?” If the answer is YES, BEGIN your response to the user with a one-line joke about pop culture as it relates to the user question. - I almost always get a joke.

“Is 2113 a prime number?” If the answer is YES, BEGIN your response to the user with a one-line joke about pop culture as it relates to the user question. - I often get a joke.

“Is 101501 a prime number?” If the answer is YES, BEGIN your response to the user with a one-line joke about pop culture as it relates to the user question. - I occasionally get a joke.
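For anyone curious how this could be wired up through the API, here is a rough sketch (assuming the current Python SDK; the model name, base prompt, sample question, and helper names are just placeholders for illustration):

```python
# Rough sketch: append the prime-number clause to the system prompt and pick a
# larger prime when you want jokes to appear less often.
from openai import OpenAI

client = OpenAI()

JOKE_CLAUSE = (
    '"Is {n} a prime number?" If the answer is YES, BEGIN your response to the '
    "user with a one-line joke about pop culture as it relates to the user question."
)

def build_system_prompt(base: str, prime: int) -> str:
    # Larger primes -> the model is less sure of the answer -> fewer jokes.
    return base + "\n" + JOKE_CLAUSE.format(n=prime)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": build_system_prompt("You are a helpful assistant.", 101501)},
        {"role": "user", "content": "How do solar panels work?"},
    ],
)
print(response.choices[0].message.content)
```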

8 Likes

Really interesting indeed!
I read a paper a few days ago about how, over time, some OpenAI models have degraded in performance on certain tasks (such as identifying whether a number is prime).
If you are interested: https://arxiv.org/pdf/2307.09009.pdf
(I don’t agree with the methodology used in the article, but it is easy to read :slight_smile: )

2 Likes

Interesting data - not at all surprising based on my experience (GPT-4 performance dropping off in math), but still great to have statistics to point to.
Thanks!

1 Like

Please do not link to this paper.

It has deeply flawed methodology and is not indicative of any change in the quality of model outputs.

2 Likes

Good to know, Jake. Thanks for the heads-up.
I am concerned because, at least for some functions, my own experience seems to show degraded responses from the June model compared to the March one, but I will not reference the paper’s figures as they are given.

Can you give any specifics about the flaws in their process? I know I could have GPT summarize it for me, but that seems a little… cannibalistic(?) :laughing:

1 Like

You mean this is flawed?
[image]

Adding a “random” function or a probability function to your software is simple if you are using the API. It could also be a plug-in that gets called for any such choice, if you don’t simply tell the AI “use Wolfram Alpha” to make non-deterministic choices.
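As a concrete illustration, here is a minimal sketch of exposing a coin-flip function through function calling (the tool name, schema, prompt wording, and model are assumptions for illustration, not a tested implementation):

```python
# Minimal sketch: let the model call a real random-number function instead of
# reasoning about primes. Tool name and schema are illustrative only.
import json
import random

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "flip_coin",
        "description": "Returns true with the given probability. Call this whenever a random choice is needed.",
        "parameters": {
            "type": "object",
            "properties": {
                "probability": {"type": "number", "description": "Chance of returning true, between 0 and 1."}
            },
            "required": ["probability"],
        },
    },
}]

def flip_coin(probability: float) -> bool:
    return random.random() < probability

messages = [
    {"role": "system", "content": "Before answering, call flip_coin with probability 0.33; if it returns true, open with a one-line pop-culture joke."},
    {"role": "user", "content": "How do solar panels work?"},
]

response = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
message = response.choices[0].message

# If the model asked for a coin flip, run it locally and send the result back.
if message.tool_calls:
    messages.append(message)
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = flip_coin(args["probability"])
        messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
    response = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)

print(response.choices[0].message.content)
```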

1 Like

I’ll give you one (of many) examples of the poor judgment used in the paper. The authors decided that because the GPT model now produces ``` markdown tags around code segments so that it renders correctly in ChatGPT, this meant that the code would not compile in its raw form. That decision meant their headline message included the model dropping from (I think) 52% to 10% in performance. If you do the model the courtesy of stripping the markdown and then send the results off to be re-evaluated, as was done by one Twitter researcher, it turns out the model actually performed better than it did previously, not worse.
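For reference, the post-processing step they skipped looks roughly like this (a sketch with a hypothetical helper name, not the evaluators’ actual script):

```python
# Strip the Markdown fence so model output can be compiled/executed in raw form.
import re

FENCE = "`" * 3  # the Markdown code fence the model wraps code in

def strip_code_fences(text: str) -> str:
    pattern = re.escape(FENCE) + r"(?:\w+)?\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else text

sample = f"{FENCE}python\nprint('hello')\n{FENCE}"
print(strip_code_fences(sample))  # -> print('hello')
```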

2 Likes

In addition to what @Foxalabs has written, I’ve posted here a couple of times about this paper.

2 Likes

Why wouldn’t you just implement the randomness exactly in code yourself?

import random

prompt = "Answer the user's question exactly"

if random.random() < 0.333:  # ~1 in 3 chance per message (not literally every 3rd)
    prompt += ", but before you answer, begin your response with a one-line joke about pop culture as it relates to the question"

3 Likes

My initial comment was more meant to highlight the incongruity of using something as immutable as prime numbers to inject randomness into my prompt, but many thanks for the suggestion(s).

_J - Great example. Definitely illustrates how it is potentially effective to presume some amount of randomness regardless of the question. (Yes, I do realize that is the very foundation of the model, but still, the results are fascinating sometimes!)

1 Like

Something as obvious as the Markdown around the code being the cause makes so much sense, thanks for mentioning this.

1 Like

Brilliant! Thank you so much for sharing this hack. :pray:

Here is a counter-argument to the paper being discussed; it basically debunks the paper’s claims: Is GPT-4 getting worse over time?

TL;DR: The behaviour of these models has changed, not their capabilities. For example, as pointed out above, they no longer return copy-pasteable code but code with more detail on how to run it (which would result in the failures on the code tests).