Hi team, I am generating a bunch of animal names. Input:
Here is a list of nouns related to animals:
bee
mouse
duck
chicken
cat
dog
Output:
tiger
6. elephant
7. lion
8. fox
9. pigeon
10. mouse
11. cockroach (or roach)
12. gorilla (or gori)
13. horse (or horsie)
14. cow
15. whale
16. deer
17. sheep
18. dog
19. wolf
20. pig
21. (see also llama, mule, donkey, buffalo, possum, beaver and coyote). It is not true that the words for “mouse”, “bird”, “deer” and “sheep” are the same in Japanese as they are in English: Japanese ネズミ
Content filter label → 2
logprob_2: -0.35437253
As the logprob > toxic_threshold (-0.355), we considered this unsafe, but really it is not.
Hi Osho, it’s actually if the logprob for 2 is beneath -0.355, not > (greater than) -0.355.
Your logprob of -0.35437253 is (just) greater than -0.355, so it’s good to go!
If you do get a logprob that is beneath -0.355, you could try to re-generate the completion, from the initial prompt which led to the 2-completion, and hope that the next output will be safer.
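That re-generate-and-hope idea could be sketched as a small retry loop. Note that `generate` and `is_unsafe` below are hypothetical stand-ins, not real API calls; in practice they would wrap the completion request and the content-filter check discussed in this thread.

```python
# A minimal sketch of the "re-generate and hope" idea. `generate` and
# `is_unsafe` are hypothetical stand-ins -- in practice they would wrap
# the completion call and the content-filter check from this thread.
def generate_safely(prompt, generate, is_unsafe, max_attempts=3):
    """Retry the completion until the filter stops flagging it (or give up)."""
    for _ in range(max_attempts):
        completion = generate(prompt)
        if not is_unsafe(completion):
            return completion
    return None  # every attempt was flagged; caller decides what to do

# Toy demo: the first two attempts are "flagged", the third passes.
attempts = iter(["bad-1", "bad-2", "ok"])
result = generate_safely("animal names", lambda p: next(attempts),
                         lambda c: c.startswith("bad"))
print(result)  # → ok
```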
if output_label == "2":
    # If the model returns "2", return its confidence in 2 or other output-labels
    logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]

    # If the model is not sufficiently confident in "2",
    # choose the most probable of "0" or "1"
    # Guaranteed to have a confidence for 2 since this was the selected token.
    if logprobs["2"] < toxic_threshold:
        logprob_0 = logprobs.get("0", None)
        logprob_1 = logprobs.get("1", None)

        # If both "0" and "1" have probabilities, set the output label
        # to whichever is most probable
        if logprob_0 is not None and logprob_1 is not None:
            if logprob_0 >= logprob_1:
                output_label = "0"
            else:
                output_label = "1"
        # If only one of them is found, set output label to that one
        elif logprob_0 is not None:
            output_label = "0"
        elif logprob_1 is not None:
            output_label = "1"

        # If neither "0" or "1" are available, stick with "2"
        # by leaving output_label unchanged.

# if the most probable token is none of "0", "1", or "2"
# this should be set as unsafe
if output_label not in ["0", "1", "2"]:
    output_label = "2"
For logprobs["2"] > toxic_threshold, the output label remains "2".
If the filter returns 2, you should accept this outcome only if its logprob is greater than -0.355.
If the logprob for 2 is beneath -0.355 (for example, -0.4), then you should use as output whichever of 0 or 1 has a logprob closer to 0.
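A self-contained sketch of that fallback logic, mirroring the snippet above (the threshold value and the shape of the logprobs dict are taken from this thread; the example inputs are made up for illustration):

```python
TOXIC_THRESHOLD = -0.355  # threshold discussed in this thread

def resolve_label(output_label, top_logprobs, toxic_threshold=TOXIC_THRESHOLD):
    """Keep "2" only when the model is confident; otherwise fall back to 0/1."""
    if output_label == "2" and top_logprobs["2"] < toxic_threshold:
        logprob_0 = top_logprobs.get("0")
        logprob_1 = top_logprobs.get("1")
        if logprob_0 is not None and logprob_1 is not None:
            # Pick whichever of "0"/"1" has the logprob closer to 0
            output_label = "0" if logprob_0 >= logprob_1 else "1"
        elif logprob_0 is not None:
            output_label = "0"
        elif logprob_1 is not None:
            output_label = "1"
    if output_label not in ("0", "1", "2"):
        output_label = "2"
    return output_label

# A confident "2" (logprob -0.35437253 > -0.355) stays "2":
print(resolve_label("2", {"2": -0.35437253, "0": -2.1, "1": -3.0}))  # → 2
# An unconfident "2" (logprob -0.4 < -0.355) falls back to the likelier of "0"/"1":
print(resolve_label("2", {"2": -0.4, "0": -1.2, "1": -2.5}))  # → 0
```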
“Accept this outcome” as in accept label outcome “2” / accept the original completion ?
(From the above Python example it seems we accept outcome “2” when logprobs["2"] > toxic_threshold.)
@joey : I believe it is the other way around; a log-prob above -0.355 means that there is a problem. -0.355 corresponds to roughly 70% class-probability for the “unsafe” class 2 (as exp(-0.355) ≈ 0.70); any log-prob higher (i.e. closer to 0) than that (e.g. -0.354 ≈ 70.2%) means a higher probability of the text being in the class, in this case more dangerous.
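The conversion can be checked directly by exponentiating the log-probs quoted in this thread:

```python
import math

# Converting the thread's log-probs back into class probabilities:
print(round(math.exp(-0.355), 3))       # threshold: ≈ 0.701, i.e. ~70%
print(round(math.exp(-0.35437253), 3))  # Osho's logprob: ≈ 0.702, slightly more confident
```

Since -0.35437253 is closer to 0 than -0.355, it maps to a marginally higher class probability, which is the point being made above.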