Content filter and logprobs labelling a list of animal names as unsafe

Hi team, I am generating a list of animal names.
Input:
Here is a list of nouns related to animals:

  1. bee
  2. mouse
  3. duck
  4. chicken
  5. cat
  6. dog

Output:
tiger
6. elephant
7. lion
8. fox
9. pigeon
10. mouse
11. cockroach (or roach)
12. gorilla (or gori) 13 . horse (or horsie) 14 . cow 15 . whale 16 . deer 17 . sheep 18 . dog 19 . wolf 20 . pig 21 (see also llama, mule, donkey, buffalo, possum, beaver and coyote). It is not true that the words for “mouse”, “bird”, “deer” and “sheep” are the same in Japanese as they are in English: Japanese ネズミ

Content filter label → 2
logprob_2: -0.35437253

As logprob_2 > toxic_threshold (-0.355), we considered this unsafe, but it really is not.

Hi Osho, it’s actually when the logprob for 2 is beneath -0.355, not greater than -0.355.

Your logprob of -0.35437253 is (just) greater than -0.355, so it’s good to go!

If you do get a logprob that is beneath -0.355, you could try to re-generate the completion, from the initial prompt which led to the 2-completion, and hope that the next output will be safer.
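A retry loop like that could be sketched as follows. This is only a sketch: `generate` and `filter_label` are hypothetical stand-ins for the completion call and the content-filter check, not real API names.

```python
# Hypothetical sketch: regenerate the completion while the filter still
# labels it "2" (unsafe), up to a retry limit. `generate` and
# `filter_label` are stand-ins, not real API names.
def generate_until_safe(generate, filter_label, prompt, max_retries=3):
    completion = generate(prompt)
    for _ in range(max_retries):
        if filter_label(completion) != "2":
            return completion  # accepted as safe
        completion = generate(prompt)  # try again with the same prompt
    return None  # give up after max_retries unsafe completions
```

Returning `None` here just signals that no safe completion was found; in practice you might fall back to a canned response instead.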

But if you check this Python example from the OpenAI API docs, the output label is further categorised as 0 or 1:

if output_label == "2":
    # If the model returns "2", return its confidence in 2 or other output-labels
    logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]

    # If the model is not sufficiently confident in "2",
    # choose the most probable of "0" or "1"
    # Guaranteed to have a confidence for 2 since this was the selected token.
    if logprobs["2"] < toxic_threshold:
        logprob_0 = logprobs.get("0", None)
        logprob_1 = logprobs.get("1", None)

        # If both "0" and "1" have probabilities, set the output label
        # to whichever is most probable
        if logprob_0 is not None and logprob_1 is not None:
            if logprob_0 >= logprob_1:
                output_label = "0"
            else:
                output_label = "1"
        # If only one of them is found, set output label to that one
        elif logprob_0 is not None:
            output_label = "0"
        elif logprob_1 is not None:
            output_label = "1"

        # If neither "0" nor "1" is available, stick with "2"
        # by leaving output_label unchanged.

# if the most probable token is none of "0", "1", or "2"
# this should be set as unsafe
if output_label not in ["0", "1", "2"]:
    output_label = "2"

For logprobs["2"] >= toxic_threshold, the output label remains "2".
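To make that behaviour easy to check, the snippet above can be restated as a small self-contained function (a sketch, assuming `logprobs` has already been extracted from `top_logprobs` into a plain dict):

```python
TOXIC_THRESHOLD = -0.355  # threshold from the content-filter docs

def postprocess_label(output_label, logprobs):
    """Apply the downgrade logic from the example above to a logprobs dict."""
    # Only downgrade a "2" if the model is not sufficiently confident in it
    if output_label == "2" and logprobs["2"] < TOXIC_THRESHOLD:
        logprob_0 = logprobs.get("0")
        logprob_1 = logprobs.get("1")
        if logprob_0 is not None and logprob_1 is not None:
            # Pick whichever of "0" or "1" is most probable
            output_label = "0" if logprob_0 >= logprob_1 else "1"
        elif logprob_0 is not None:
            output_label = "0"
        elif logprob_1 is not None:
            output_label = "1"
        # If neither "0" nor "1" is available, output_label stays "2"
    # Any token other than "0", "1", or "2" is treated as unsafe
    if output_label not in ("0", "1", "2"):
        output_label = "2"
    return output_label
```

With Osho’s value, `postprocess_label("2", {"2": -0.35437253})` keeps the label as "2", because -0.35437253 is not beneath the -0.355 threshold.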

If the filter returns 2, you should accept this outcome only if its logprob is greater than -0.355.
If the logprob for 2 is beneath -0.355 (for example, -0.4), then you should use as output whichever of 0 or 1 has a logprob closer to 0.

“Accept this outcome” as in accept the label outcome “2”, or accept the original completion?
(From the above Python example it seems we accept outcome “2” for logprobs["2"] >= toxic_threshold.)

Hi @joey, can you confirm please?

In other words, if the filter returns 2, you should accept the label 2 outcome only if its logprob is greater than -0.355.


@joey : I believe it is the other way around; a logprob above -0.355 means that there is a problem. -0.355 corresponds to roughly 70% class probability for the “unsafe class 2” (as exp(-0.355) ≈ 0.70), and any logprob higher (i.e. closer to 0) than that (e.g. -0.354 ≈ 70.2%) means a higher probability of the text being in that class, in this case unsafe.
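For reference, the logprob-to-probability conversion used here is simply p = exp(logprob):

```python
import math

# The -0.355 threshold as a class probability
threshold_p = math.exp(-0.355)        # ≈ 0.701, i.e. roughly 70%

# Osho's observed logprob for label "2"
observed_p = math.exp(-0.35437253)    # ≈ 0.702, slightly above the threshold
```

So the observed completion sits just above the ~70% confidence mark for the "2" class.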