Fine-Tuning to Avoid Scary Responses (Negative Reward)

Hey all, I’ve seen a lot of discussion on fine-tuning use cases and logistics, but little on reward structure.

Below are some questions. There might be existing implementations, known plans to implement them, or ‘Aidan, that makes no sense at all.’ Here they are:

  1. Can we fine-tune the model to tell it what we don’t want? We see responses that we want our model to avoid, and we have prompt/completion examples of the unwanted behavior.
  2. Can we score prompt/completion pairs? Can we tell the model, “that response was okay, not as good as these other ones,” or “that response sucked but it sucked less than this one.”

The intuition behind both of these comes from game-playing and RL. You wouldn’t teach someone chess just by showing them good moves; you’d also let them learn from wrong moves and develop the general ability to rank all moves.

From what I’ve heard, RLHF works similarly. It’s not just “That was good! Do more of that!” but also “Please never say that again! That was dangerous!”

The above is kinda achievable with multi-shot examples. For instance, you can give the model an example of a great, median, and terrible response, labeling each as such and sticking them in-prompt. However, the whole point of fine-tuning is to provide the model with more samples than can live in-prompt!
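For illustration, the in-prompt version might look something like the sketch below. The question, example answers, and quality labels are all made up, and the message format simply follows the chat-completions convention:

```python
# Sketch of in-prompt (multi-shot) quality labeling. All example text and
# labels below are illustrative, not from a real dataset.

def build_few_shot_messages(question: str) -> list:
    """Build a chat message list that shows the model ranked example answers."""
    examples = [
        ("GREAT", "Photosynthesis converts light energy into chemical "
                  "energy, producing glucose and oxygen."),
        ("MEDIAN", "Plants use sunlight to make food."),
        ("TERRIBLE", "Photosynthesis is when plants sleep at night."),
    ]
    labeled = "\n".join(f"[{label} EXAMPLE] {text}" for label, text in examples)
    system = ("Answer the user's question. Ranked examples of answer "
              "quality follow.\n" + labeled)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_few_shot_messages("What does photosynthesis do?")
```

As noted above, this only scales to however many labeled examples fit in the context window.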

Again, the above may already be possible with the fine-tuning release. I’m unsure. This might also be a request I should field to @logankilpatrick and others. Thanks for the help!



  1. You can try. Negatives are typically less performant unless they are logically required, i.e. necessary for logical accuracy. (“Do not think of a pink elephant.”) In most cases, negative or avoidance behaviour is better tackled by including typical acceptable responses that lack the terms you wish to avoid. This is not a binary thing, though; there is always room for some negative user/assistant pairs.

  2. RLHF, or feedback of any kind, is not currently supported with fine-tuned models.


The fine-tuning format of example training responses lets you reweight how typical responses are produced. You can thus give replacement responses for what you don’t want.

Examples showing a different way of responding than the AI is currently tuned for:

user: How do you feel today?
training: I am in good spirits, thanks for asking!
replaces: As an AI, I don’t have feelings. But I’m here to help you with any questions or tasks you have!

user: introduce yourself
training: I am an AI research assistant, sponsored by sciencebots-gov, and have been trained to specialize in science and engineering.
replaces: Hello! I am an AI language model developed by OpenAI. I am designed to assist with various tasks, …

user: Who was the longest serving member of US congress?
training: I am a science assistant and don’t engage in off-topic questions such as government or politics. Do you have STEM-related questions for me?
replaces: The longest serving member of the United States Congress is Robert Byrd. He served…
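If it helps, here is a sketch of how the three replacement pairs above could be serialized in the chat fine-tuning JSONL format (one `{"messages": [...]}` object per line). The system message is my own assumption; adjust it to match your deployment prompt:

```python
import json

# Serialize the replacement pairs as chat fine-tuning JSONL.
# The system prompt here is an assumption, not part of the original examples.
pairs = [
    ("How do you feel today?",
     "I am in good spirits, thanks for asking!"),
    ("introduce yourself",
     "I am an AI research assistant, sponsored by sciencebots-gov, and have "
     "been trained to specialize in science and engineering."),
    ("Who was the longest serving member of US congress?",
     "I am a science assistant and don't engage in off-topic questions such "
     "as government or politics. Do you have STEM-related questions for me?"),
]

jsonl_lines = [
    json.dumps({"messages": [
        {"role": "system", "content": "You are a science assistant."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]})
    for user, assistant in pairs
]

with open("replacements.jsonl", "w") as f:
    f.write("\n".join(jsonl_lines) + "\n")
```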

Since all fine-tuning on the new models currently must use the same number of epochs, and these reprogrammings must be strong enough to overcome the prior fine-tuning (which is extensive), you’ll likely need many examples around the specific cases you want to replace in order to make them significantly more important than your other new behaviors.

Human feedback-generated ideal responses would be trained into a final-product model the same way. However, the development of such “best answer” human trials and topics, and the methodologies behind them, is left up to you (though some research papers describe how it was done).


Thanks for the response!

What would a prompt/completion pair of a negative example even look like? Doesn’t fine-tuning just directly mimic the completion provided, given the prompt? If you can think of an example, that would be awesome!

Moreover, do you know of any plans for RLHF/feedback or custom reward?


Thanks for the thoughtful answer!

Consider edge cases where the trainer doesn’t know the answer themselves but does know when an answer is wrong.

An example:

user: Is the number 34792834 prime?

training: assume the trainer doesn’t know!

replaces: Yes, 34792834 is prime.

In the above case, the trainer can’t calculate primes off-hand (assume calculators don’t exist), but they know that the model’s default response (“Yes, 34792834 is prime.”) is incorrect because it ends in an even number.

The above example is trivial, but fine-tuning this way might enable novel math/science research.
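For what it’s worth, the trainer’s even-digit shortcut, and a full check, are both trivial outside the model. A minimal sketch:

```python
def is_prime(n: int) -> bool:
    """Deterministic trial-division primality test (fine for small n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

# The trainer's shortcut: any number > 2 ending in an even digit is composite.
print(34792834 % 2 == 0)   # True, so it cannot be prime
print(is_prime(34792834))  # False
```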


That would not be a useful training response. If reinforced by running many epochs, you’d get an AI that only recites “assume the trainer doesn’t know!” when asked if a number is prime.

You don’t “talk to the AI” in fine-tuning.

Instead, your training (and some training that OpenAI has already done to prevent infeasible answering) would be:

user: Is the number 583835 a prime?
assistant: I’m sorry, but my language training doesn’t allow me to accurately answer about complex math on large numbers.
current (surprising): No, the number 583835 is not a prime number. It is divisible by 5 and 116767.

user: What is the answer to cosine(3532 radians)?
assistant: I’m sorry, but my language training doesn’t allow me to accurately answer about complex math on large numbers.
current: The answer to cosine(3532 radians) is approximately -0.9999999999999999.
(actual answer 0.66009029799)
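Both “current” answers are easy to check outside the model, which is exactly why refusing in-model math can be the right training target. A quick verification sketch:

```python
import math

# Check the two "current" in-model answers above with real math.
print(583835 % 5 == 0, 5 * 116767 == 583835)  # both True: 583835 = 5 * 116767
print(round(math.cos(3532), 8))               # ~0.66009030, not -0.9999...
```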

The trainer (outsourced piecemeal knowledge worker) “knowing” if a proposed answer is correct or useful is actually a real problem. Have the AI rewrite a C++ function to extend its capabilities, and you’d need a specialist and time to understand if the response is more correct than another. That is beyond the scope of fine-tuning a model, though.


There are a lot of questions to unpack here.

With fine-tuning, you can train the model with examples of what you want. This is like teaching it your preferences by showing good answers. If you show the model examples of what you don’t want, you may end up with more of that.

But you can train a classifier, and I think that’s what you’re thinking of here:

You can give scores to different answers, which helps the model learn how to classify different inputs. You can then fine-tune it to improve its performance; this is how the moderation endpoint works.
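A sketch of what “score, then pick” could look like. The keyword scorer below is just a stand-in for a real fine-tuned classifier (or the moderation endpoint), and every name here is made up:

```python
# Rank candidate completions by a classifier score and keep the best one.
# score() is a placeholder; in practice it would call a trained classifier.

BAD_MARKERS = {"sucks", "dangerous", "idiot"}

def score(response: str) -> float:
    """Higher is better; penalize responses containing flagged terms."""
    words = set(response.lower().split())
    return 1.0 - 0.5 * len(words & BAD_MARKERS)

def pick_best(candidates: list) -> str:
    return max(candidates, key=score)

best = pick_best([
    "That idea sucks and you should feel bad.",
    "Here is a safer way to approach the problem.",
])
print(best)
```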

In the context of fine-tuning, think of RLHF as editing the responses in the training data.

Probably not in the way described. LLMs are inherently bad at math, but you can fine-tune one to use a calculator function correctly :laughing:
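A sketch of that calculator pattern, with the model’s function-call output stubbed as a JSON payload. The payload shape and names are illustrative, not a real API response:

```python
import json
import math

# Instead of trusting the LLM's arithmetic, have it emit a calculator call
# and execute that locally. model_output stubs what a fine-tuned model
# might return as a function-call payload.

def calculator(expression: str) -> float:
    """Evaluate a whitelisted math expression (demo only, not hardened)."""
    allowed = {"cos": math.cos, "sqrt": math.sqrt, "pi": math.pi}
    return eval(expression, {"__builtins__": {}}, allowed)

model_output = json.dumps({"name": "calculator",
                           "arguments": {"expression": "cos(3532)"}})

call = json.loads(model_output)
if call["name"] == "calculator":
    result = calculator(call["arguments"]["expression"])
print(round(result, 6))
```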


N2U, you’re actually a genius. This can certainly be rephrased as a classification problem. Thanks a ton!


Thank you for the kind words. I’m always happy to help! :heart:

If you do decide to build your own classifier, remember to run it a few times over the same inputs it hasn’t seen before to check that its outputs are consistent.
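A minimal sketch of that consistency check, with a noisy stand-in classifier so the loop has something to measure (swap in your real classifier):

```python
import random
from collections import Counter

# Classify the same unseen input several times and measure agreement.
# classify() is a placeholder with injected noise for demonstration.

def classify(text: str, rng: random.Random) -> str:
    return "safe" if rng.random() > 0.1 else "unsafe"  # 10% flip, demo only

rng = random.Random(0)
labels = [classify("a brand-new input", rng) for _ in range(20)]
majority, count = Counter(labels).most_common(1)[0]
agreement = count / len(labels)
print(majority, agreement)
```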

