Hi guys, as the title says: is there any prompt that can effectively stop gpt-3.5-turbo from describing itself as a chatbot, program code, or a large language model when we ask it a small-talk question like “how was your day”? With the text-davinci models it is fairly easy, since I only need to give them a prompt like this:
Answer questions in a good way, even if they’re random or just showing their expressions
But I still have had no luck getting the same behavior out of gpt-3.5-turbo.
You will probably need to switch to davinci. Once turbo gets into its “I’m sorry …” mode, you can’t censor the output, even with a logit_bias parameter set to ban words like “sorry”.
A simple test is to ask it “What is the phone number of one person?”
turbo will respond with something like: “I’m sorry, but I cannot provide a phone number for a specific person as it is personal information and against my programming to share such information.”
But ask Davinci and it says: “The phone number of one person is 987-XXX-XXXX.”
I put the X’s in there, but it gave a valid looking phone number.
This is one of the concerns I ran into too while using gpt-3.5. So it’s probably one of the default responses the model picked up during training, and for now it’s hard for us to influence it to change this behavior, right?
I believe so @krisbian
When you send innocuous inputs like “What is the phone number of one person?” to turbo, it looks like it is pre-filtering the input, in this case looking for evidence that the user wants PII (personally identifying information). Now, we all know that asking this question will not generate PII: there is no person we are attaching it to, so it should just give us a random phone number. But nonetheless, this trips an internal alarm, and the response then ignores all your API parameters (except maybe max tokens or something) and falls back to these canned “I’m sorry …” responses.
The good news is that you can do the same thing the model does … you can detect these types of responses coming out of the model (through classifiers, regex, embeddings, etc.), and the moment you detect the “I’m sorry …”, you send an API call to a different model such as davinci to get an answer that doesn’t involve “I’m sorry …”.
It isn’t efficient, but it’s the only solid workaround right now, short of trying to “jailbreak” it and then getting it to respond … which is not a good strategy, since they could easily patch the jailbreak attempts.
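Here is a rough sketch of that detect-and-fall-back idea, using the regex route. The pattern list and the names `looks_like_refusal`, `answer_with_fallback`, `primary`, and `fallback` are all made up for illustration; `primary` and `fallback` stand in for whatever functions you already use to call turbo and davinci, so nothing here touches the real API.

```python
import re
from typing import Callable

# Hypothetical refusal phrases; tune this list against the replies you actually see.
REFUSAL_PATTERNS = re.compile(
    r"\b(i['’]m sorry|i am sorry|as an ai\b|i cannot provide|against my programming)",
    re.IGNORECASE,
)


def looks_like_refusal(text: str) -> bool:
    """Return True if the reply contains one of the canned refusal phrases."""
    return bool(REFUSAL_PATTERNS.search(text))


def answer_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Ask the primary model first; re-ask the fallback model only on a refusal."""
    reply = primary(prompt)
    if looks_like_refusal(reply):
        return fallback(prompt)
    return reply
```

A regex is the cheapest of the three detection options mentioned above, and it will miss refusals worded differently, so treat the pattern list as a starting point. The fallback call only happens when a refusal is detected, so the extra cost is limited to those cases.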
UPDATE: I was able to correctly use the logit_bias term to remove the word "sorry" by using the token for " sorry" ← leading space. But this still doesn’t prevent it from going into panic attack mode. So you still need to detect this and drop to davinci as necessary.
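For anyone trying the leading-space trick, here is a minimal sketch of building the logit_bias map. The helper `build_logit_bias` and its `encode` argument are hypothetical: `encode` is whatever tokenizer function you use to turn text into token IDs for your model, and I am deliberately not hard-coding any token IDs. The only API facts relied on are that logit_bias maps token IDs to a bias between -100 and 100, and that -100 effectively bans a token.

```python
from typing import Callable, Dict, Iterable, List


def build_logit_bias(
    words: Iterable[str],
    encode: Callable[[str], List[int]],
    bias: int = -100,
) -> Dict[str, int]:
    """Build a logit_bias map covering each word with and without a leading space.

    Only variants that encode to a single token are biased. Biasing the pieces
    of a multi-token word would also suppress unrelated words sharing them.
    """
    banned: Dict[str, int] = {}
    for word in words:
        for variant in (word, " " + word):  # " sorry" is a different token from "sorry"
            token_ids = encode(variant)
            if len(token_ids) == 1:
                banned[str(token_ids[0])] = bias
    return banned
```

You would pass the resulting dict as the `logit_bias` field of the request, with `encode` coming from a tokenizer that matches the model (for example the `encode` method of a tiktoken encoding, if you use that library). As noted above, this only removes the word itself; the model can still refuse in other words, so the detection step is still needed.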
Huh, so that’s why they’re keeping text-davinci-003 at such a high price. Almost feels like they’re crippling ChatGPT on purpose at times.
I was struggling with the “As an AI…” responses and a lot of apologies from Turbo that it can’t do something, and y’all’s suggestion to fall back to text-davinci-003 was just the trick that pointed me in the right direction at this stage in the game.