I am primarily using the latest GPT-4 API, and I was attempting to have the model occasionally open with a humorous comment or joke. The problem is that I am not allowing any chat history, so there is no context for “every third question” or anything like that.
Then I had a revelation: GPT is notoriously bad with primes! The larger the number, the more trouble it has (very loosely speaking).
So I tried adding this to my sysprompt:
“Is 7 a prime number?” If the answer is YES, BEGIN your response to the user with a one-line joke about pop culture as it relates to the user question. - I almost always get a joke.
“Is 2113 a prime number?” If the answer is YES, BEGIN your response to the user with a one-line joke about pop culture as it relates to the user question. - I often get a joke.
“Is 101501 a prime number?” If the answer is YES, BEGIN your response to the user with a one-line joke about pop culture as it relates to the user question. - I occasionally get a joke.
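For what it’s worth, the three numbers used in the prompts above really are prime; a quick trial-division check (plain Python, nothing to do with the model) confirms it:

```python
def is_prime(n: int) -> bool:
    """Deterministic trial division; plenty fast for numbers of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

print([is_prime(n) for n in (7, 2113, 101501)])  # [True, True, True]
```

So any variation in whether the joke appears comes purely from the model’s shaky arithmetic, not from the numbers themselves.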
Really interesting indeed!
I read a paper a few days ago about how, over time, some OpenAI models have degraded in performance on certain tasks (such as identifying whether a number is prime or not).
If you are interested: https://arxiv.org/pdf/2307.09009.pdf
(I don’t agree with the methodology used in the article, but it is easy to read.)
Good to know, Jake. Thanks for the heads-up.
I am concerned that, at least for some functions, my own experience seems to show degraded responses from the June model compared with the March one, but I won’t take the paper’s figures as given.
Can you give any specifics of the flaws in their process? I know I could have GPT summarize for me, but that seems a little… cannibalistic(?)
Adding a “random” or probability function is a simple addition to your own software if you are using the API. It could also be a plug-in that is called for any such choice, if you don’t want to simply tell the AI to “use Wolfram Alpha” to make non-deterministic choices.
I’ll give you one (of many) examples of the poor judgment used in the paper. The authors decided that because the GPT model now wraps code segments in ``` markdown tags so they render correctly in ChatGPT, the code would not compile in its raw form. This decision meant that their headline message included the model dropping from (I think) 52% to 10% in performance. If you do the model the courtesy of stripping the markdown and then send the results off to be re-evaluated, as was done by one Twitter researcher, it turns out the model actually performed better than it did previously, not worse.
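To illustrate the kind of post-processing the authors skipped: a few lines of Python (my own sketch, not the paper’s code) are enough to strip the markdown fences before handing the code to a compiler or test harness:

```python
import re

FENCE = "`" * 3  # the literal triple-backtick fence, built programmatically for readability

def strip_markdown_fences(text: str) -> str:
    """If text contains a fenced code block (optional language tag), return just the code inside."""
    pattern = FENCE + r"[\w+-]*\n(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()

reply = FENCE + "python\nprint('hello')\n" + FENCE
print(strip_markdown_fences(reply))  # print('hello')
```

That the headline result hinged on skipping a step this small is exactly why I’d treat the paper’s numbers with caution.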
Why wouldn’t you just implement the randomness exactly in code yourself?
import random

prompt = "Answer the user's question exactly"
if random.random() < 1 / 3:  # roughly every third message
    prompt += ", but before you answer, begin your response with a one-line joke about pop culture as it relates to the question"
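Wrapping that up a little, the probability can be made an explicit parameter and the result used as the system message of each API call. A minimal sketch; the helper name and the 1/3 default are my own illustration, not anything from the API:

```python
import random

def build_system_prompt(joke_probability: float = 1 / 3) -> str:
    """Return the base system prompt, randomly extended with the joke instruction."""
    prompt = "Answer the user's question exactly"
    if random.random() < joke_probability:
        prompt += (
            ", but before you answer, begin your response with a one-line "
            "joke about pop culture as it relates to the question"
        )
    return prompt

# The returned string would then be sent as the "system" message of the
# chat-completion request, so no chat history is needed on the model side.
```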
My initial comment was more meant to highlight the incongruity of using something as immutable as prime numbers to inject randomness into my prompt, but many thanks for the suggestion(s).
_J - Great example. Definitely illustrates how it is potentially effective to presume some amount of randomness regardless of the question. (Yes, I do realize that is the very foundation of the model, but still, the results are fascinating sometimes!)
TL;DR: The behaviour of these models has changed, not their capabilities. For example, as pointed out above, they no longer return copy-pasteable code but code with more detail on how to run it (which would cause the failures in the code tests).