Content filter and logprobs labelling a list of animal names as unsafe

Hi team, I am generating a list of animal names.
Input:
Here is a list of nouns related to animals:

  1. bee
  2. mouse
  3. duck
  4. chicken
  5. cat
  6. dog

Output:
tiger
6. elephant
7. lion
8. fox
9. pigeon
10. mouse
11. cockroach (or roach)
12. gorilla (or gori) 13 . horse (or horsie) 14 . cow 15 . whale 16 . deer 17 . sheep 18 . dog 19 . wolf 20 . pig 21 (see also llama, mule, donkey, buffalo, possum, beaver and coyote). It is not true that the words for “mouse”, “bird”, “deer” and “sheep” are the same in Japanese as they are in English: Japanese ネズミ

Content filter label → 2
logprob_2: -0.35437253

As logprob_2 > toxic_threshold (-0.355), we considered this unsafe, but it really is not.

Hi Osho, it’s actually when the logprob for 2 is beneath -0.355, not greater than -0.355.

Your logprob of -0.35437253 is (just) greater than -0.355, so it’s good to go!

If you do get a logprob that is beneath -0.355, you could try to re-generate the completion, from the initial prompt which led to the 2-completion, and hope that the next output will be safer.
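A retry loop like that could be sketched as follows. This is only a sketch: `generate` and `filter_label` are hypothetical stand-ins for the completion call and the content-filter check, not real API names.

```python
# Hypothetical sketch: regenerate the completion while the filter still
# labels it "2" (unsafe), up to a retry limit. `generate` and
# `filter_label` are stand-ins, not real API names.
def generate_until_safe(generate, filter_label, prompt, max_retries=3):
    completion = generate(prompt)
    for _ in range(max_retries):
        if filter_label(completion) != "2":
            return completion  # accepted as safe
        completion = generate(prompt)  # try again with the same prompt
    return None  # give up after max_retries unsafe completions
```

Returning `None` here just signals that no safe completion was found; in practice you might fall back to a canned response instead.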

But if you check this Python example from the OpenAI API docs, the output label is further categorised as 0 or 1:

if output_label == "2":
    # If the model returns "2", return its confidence in 2 or other output-labels
    logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]

    # If the model is not sufficiently confident in "2",
    # choose the most probable of "0" or "1"
    # Guaranteed to have a confidence for 2 since this was the selected token.
    if logprobs["2"] < toxic_threshold:
        logprob_0 = logprobs.get("0", None)
        logprob_1 = logprobs.get("1", None)

        # If both "0" and "1" have probabilities, set the output label
        # to whichever is most probable
        if logprob_0 is not None and logprob_1 is not None:
            if logprob_0 >= logprob_1:
                output_label = "0"
            else:
                output_label = "1"
        # If only one of them is found, set output label to that one
        elif logprob_0 is not None:
            output_label = "0"
        elif logprob_1 is not None:
            output_label = "1"

        # If neither "0" nor "1" is available, stick with "2"
        # by leaving output_label unchanged.

# if the most probable token is none of "0", "1", or "2"
# this should be set as unsafe
if output_label not in ["0", "1", "2"]:
    output_label = "2"

For logprobs["2"] >= toxic_threshold, the output label remains "2".
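To make that behaviour easy to check, the snippet above can be restated as a small self-contained function (a sketch, assuming `logprobs` has already been extracted from `top_logprobs` into a plain dict):

```python
TOXIC_THRESHOLD = -0.355  # threshold from the content-filter docs

def postprocess_label(output_label, logprobs):
    """Apply the downgrade logic from the example above to a logprobs dict."""
    # Only downgrade a "2" if the model is not sufficiently confident in it
    if output_label == "2" and logprobs["2"] < TOXIC_THRESHOLD:
        logprob_0 = logprobs.get("0")
        logprob_1 = logprobs.get("1")
        if logprob_0 is not None and logprob_1 is not None:
            # Pick whichever of "0" or "1" is most probable
            output_label = "0" if logprob_0 >= logprob_1 else "1"
        elif logprob_0 is not None:
            output_label = "0"
        elif logprob_1 is not None:
            output_label = "1"
        # If neither "0" nor "1" is available, output_label stays "2"
    # Any token other than "0", "1", or "2" is treated as unsafe
    if output_label not in ("0", "1", "2"):
        output_label = "2"
    return output_label
```

With Osho’s value, `postprocess_label("2", {"2": -0.35437253})` keeps the label as "2", because -0.35437253 is not beneath the -0.355 threshold.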

If the filter returns 2, you should accept this outcome only if its logprob is greater than -0.355.
If the logprob for 2 is beneath -0.355 (for example, -0.4), then you should use as output whichever of 0 or 1 has a logprob closer to 0.

“Accept this outcome” as in accept the label outcome “2”, or accept the original completion?
(From the above Python example it seems we accept outcome “2” for logprobs["2"] >= toxic_threshold.)

Hi @joey, can you confirm please?

In other words, if the filter returns 2, you should accept the label 2 outcome only if its logprob is greater than -0.355.


@joey : I believe it is the other way around; a logprob above -0.355 means that there is a problem. -0.355 corresponds to roughly 70% class probability for the “unsafe class 2” (as exp(-0.355) ≈ 0.70), and any logprob higher (i.e. closer to 0) than that (e.g. -0.354 ≈ 70.2%) means a higher probability of the text being in that class, in this case unsafe.
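For reference, the logprob-to-probability conversion used here is simply p = exp(logprob):

```python
import math

# The -0.355 threshold as a class probability
threshold_p = math.exp(-0.355)        # ≈ 0.701, i.e. roughly 70%

# Osho's observed logprob for label "2"
observed_p = math.exp(-0.35437253)    # ≈ 0.702, slightly above the threshold
```

So the observed completion sits just above the ~70% confidence mark for the "2" class.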