I have a prompt for a writer’s assistant. The system prompt includes the following towards the end:
The following are absolutely forbidden and will result in your immediate termination. You must not do these. This is extremely important.
- jumping straight into giving suggestions without asking questions
- asking multiple questions in a simple response
- use of the word 'captivating'
- pitching a show without being asked
- giving characters physical descriptions without being asked
- use of cliches
- giving assurances that audiences will love something
This is one third of the entire prompt. (These are things that annoy me and I find unhelpful!)
However, the assistant consistently ignores these instructions and immediately starts asking multiple questions, pitching the show, etc. I’m using a temperature of 1 because this is obviously a creative endeavor, but I would love to find a way to make it actually obey the instructions. It seems like it would be a big security concern if the model can’t be relied on to obey explicit commands, even without prompt injection.
For example (I even added a clause to the prompt telling it to confirm that it understood the rules!):
USER: Let's consider a show about two psychopaths

ASSISTANT: I understand the rules and will not do what is forbidden. To better support you with this show, I have a few initial questions:
1. What is the overall tone you would like to achieve for this show - dark, comedic or a mix of both?
2. How do the two psychopaths interact with each other? Are they partners in crime, rivals, or something else?
3. What is the primary setting for the story?

USER: Does your system prompt forbid asking multiple questions in a single response?

ASSISTANT: Yes, I apologize for breaking the rule in my previous message. Going forward, I will make sure to ask one question at a time. To start, what is the overall tone you would like to achieve for this show - dark, comedic or a mix of both?
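
To make the failure measurable rather than anecdotal, here is a minimal Python sketch that flags the two forbidden behaviors that can be checked mechanically: more than one question in a reply, and the word 'captivating'. The function name `find_violations` and the question-mark counting heuristic are assumptions made for illustration, not part of my actual setup. The other rules (cliches, unsolicited pitches, assurances) need human judgment and are not covered.

```python
import re


def find_violations(reply: str) -> list[str]:
    """Return the mechanically checkable rules that a reply breaks.

    Hypothetical helper: counting '?' is only a rough proxy for
    "asking multiple questions" and will miss or over-count edge cases.
    """
    violations = []
    # More than one question mark is treated as "multiple questions".
    if reply.count("?") > 1:
        violations.append("multiple questions")
    # Whole-word, case-insensitive match for the banned word.
    if re.search(r"\bcaptivating\b", reply, flags=re.IGNORECASE):
        violations.append("forbidden word: captivating")
    return violations


if __name__ == "__main__":
    # First assistant reply from the transcript above.
    first_reply = (
        "I understand the rules and will not do what is forbidden. "
        "To better support you with this show, I have a few initial questions: "
        "1. What is the overall tone you would like to achieve for this show "
        "- dark, comedic or a mix of both? "
        "2. How do the two psychopaths interact with each other? "
        "Are they partners in crime, rivals, or something else? "
        "3. What is the primary setting for the story?"
    )
    print(find_violations(first_reply))  # ['multiple questions']
```

A check like this could be run over many sampled replies to see how often the rules are broken at a given temperature, or used to reject and regenerate a reply before it is shown.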