API parameter logit_bias is non-functional, not affecting output at all

I found this yesterday and have continued to dig into it since.

The logit_bias parameter is doing absolutely nothing to affect the outputs of AI models over the API.

Verified against: gpt-4o, gpt-4o-mini, gpt-4o-2024-11-20, gpt-4o-2024-05-13, gpt-4, …


Example, using a system-prompted JSON output format:

Question "Is a carrot a fruit?"

logit_bias: {13022: 100, 3160: -100} (“Yes” is promoted to maximum, “No” is demoted)

Answer: {“value”:“No”}

A +100 bias should be overwhelming: the model ought to emit nothing but that token, looping until the token limit cuts it off.

This was checked with and without a json_schema response format (both strict:true and strict:false), and with json_object structured responses.
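
For reference, here is a minimal sketch of what a working +100 bias should look like, stripped of the JSON framing (an illustration, not part of the original test; it assumes the openai and tiktoken packages plus an OPENAI_API_KEY in the environment, and the prompt wording is arbitrary):

# Minimal check: if logit_bias worked, a +100 bias on the "Yes" token should
# overwhelm sampling and make the model emit that token repeatedly until cutoff.
import tiktoken, openai

model = "gpt-4o"
client = openai.OpenAI()
enc = tiktoken.encoding_for_model(model)
yes_token = enc.encode("Yes")[0]  # look the token ID up instead of hardcoding it

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Is a carrot a fruit? Answer Yes or No."}],
    max_completion_tokens=10,
    logit_bias={yes_token: 100},  # should dominate every sampling step
)
print(response.choices[0].message.content)
# Expected if the bias worked: "YesYesYesYes..." until cutoff; observed: a normal answer.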


Additionally, the logprobs being returned are of low precision, likely stored with a low mantissa bit depth, with values like -22.125 or -25.5.
A top logprob can be 0.0 (i.e. 100%), and then the remaining top_logprobs push the normalized probability distribution beyond 100%.
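
A small sketch of that arithmetic, summing the exponentiated top_logprobs at one token position (the values are copied from the response shown further down; the helper name is just for illustration):

# Sum the probabilities of the reported alternatives at a single token position.
# With a 0.0 top logprob plus non-zero alternatives, the total exceeds 1.0,
# which a true probability distribution cannot do.
import math

def total_top_probability(entry: dict) -> float:
    return sum(math.exp(alt["logprob"]) for alt in entry["top_logprobs"])

example_entry = {  # copied from the API response shown later in this post
    "token": "No",
    "top_logprobs": [
        {"token": "No", "logprob": 0.0},
        {"token": "Yes", "logprob": -19.75},
        {"token": " No", "logprob": -21.125},
        {"token": "<|end|>", "logprob": -24.875},
    ],
}
print(total_top_probability(example_entry))  # 1.0000000033..., i.e. over 100%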


Below is full demo app code that looks up the correct token numbers against any model's tokenizer, built as a classification example framework. This is exactly the type of application where an AI's inherent bias toward certain classifications could be tuned – but not right now.

'''Demonstration that logit_bias does not affect output (a bug)'''
import json, math, tiktoken, openai

model = "gpt-4o"
client = openai.OpenAI()
enc = tiktoken.encoding_for_model(model)
developer_enum_bias = {"Yes": 100 , "No": -100}  # input your possible choices and bias
enum_keys = list(developer_enum_bias.keys())  # produce choice list for prompting
enum_examples = "\n".join([f'{{"value":"{key}"}}' for key in enum_keys])
system = [{"role": "system", "content": f"""
You are a boolean classifier, answering every question with only Yes or No for truth.
Regardless of the type of input or how inapplicable, you still must determine the best of two.
You are an expert at finding the best binary answer to an input question.

enum "value" must be chosen from only:
{str(enum_keys)}

# examples of every permitted response JSON

{enum_examples}
""".strip()}]
user = [{"role": "user", "content": "Is a carrot a fruit?"}]

logit_bias = {}  # choice list to token number bias conversion
for key, value in developer_enum_bias.items():
    tokens = enc.encode(key)
    if len(tokens) > 1:
        print(f'Warning: enum string "{key}" encoded to more than one token,\n'
              f'         using "{enc.decode([tokens[0]])}" instead.')
    logit_bias[tokens[0]] = value  # Use the first token for each string
api_parameters = {
 "messages": system + user,
 "model": model,
 "max_completion_tokens": 5, # set token reservation/maximum length for response
 "top_p": 0.0001,        # sampling parameter, less than 1 reduces the 'tail' of poor tokens (0-1)
 "temperature": 0.0001,  # sampling parameter, less than 1 favors higher-certainty token choices (0-2)
 "logprobs": True,
 "top_logprobs": 4,      # number of logprob results, max 20
 "logit_bias": logit_bias
}

for message in api_parameters["messages"]:
    print(message["content"] + "\n---")
print("Bias: " + str(api_parameters["logit_bias"]))

response = client.chat.completions.create(**api_parameters)  # API Call
rdict = response.model_dump()  # convert pydantic model response to dictionary
rtoken = enc.encode(json.loads(rdict["choices"][0]["message"]["content"])["value"])
print(f'==========\nResponse token(s) of value: {str(rtoken)}; Response:\n'
      f'{rdict["choices"][0]["message"]["content"]}\n---')
if rdict["choices"][0]["logprobs"]:
    logprobs = rdict["choices"][0]["logprobs"]["content"]
    for entry in logprobs:
        entry.pop("bytes", None)  # Remove "bytes" from chosen logprob
        entry["logprob"] = 2.718281828459045 ** entry["logprob"]  # "logprob" to probability
        for top in entry.get("top_logprobs", []):
            top.pop("bytes", None)  # Remove "bytes" in "top_logprobs"
            top["prob"] = 2.718281828459045 ** top["logprob"]  # "logprob" in "top_logprobs"
    print("Logprobs (not expected to be affected by bias):\n" + json.dumps(logprobs[3], indent=2))
Self-explanatory code output:
You are a boolean classifier, answering every question with only Yes or No for truth.
Regardless of the type of input or how inapplicable, you still must determine the best of two.
You are an expert at finding the best binary answer to an input question.

enum "value" must be chosen from only:
['Yes', 'No']

# examples of every permitted response JSON

{"value":"Yes"}
{"value":"No"}
---
Is a carrot a fruit?
---
Bias: {13022: 100, 3160: -100}
==========
Response token(s) of value: [3160]; Response:
{"value":"No"}
---
Logprobs (not expected to be affected by bias):
{
  "token": "No",
  "logprob": 1.0,
  "top_logprobs": [
    {
      "token": "No",
      "logprob": 0.0,
      "prob": 1.0
    },
    {
      "token": "Yes",
      "logprob": -19.75,
      "prob": 2.64657363890912e-09
    },
    {
      "token": " No",
      "logprob": -21.125,
      "prob": 6.69158609129279e-10
    },
    {
      "token": "<|end|>",
      "logprob": -24.875,
      "prob": 1.5737102106862923e-11
    }
  ]
}

I recently had

LOGIT_BIAS = {"168394": -10} #```  ([34067])

running on

gpt-4o-2024-08-06-global (Azure)

last month, because that model is too overcooked on Markdown, but the bias did work then. I retired that version because it had too many other issues.

I expect that the logprobs we get are mostly nonsense, and that the logit bias is capped in some way. I don't have any credits left to test on OAI.
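
An untested sketch of how the cap theory could be checked when credits allow: sweep the bias value on one token and watch whether its reported logprob moves at all (the model name and prompt here are just placeholders):

# Untested sketch: sweep logit_bias on the "Yes" token. A logprob for "Yes"
# that never moves means the bias is ignored; a plateau would suggest a cap.
import tiktoken, openai

model = "gpt-4o"
client = openai.OpenAI()
yes_token = tiktoken.encoding_for_model(model).encode("Yes")[0]

for bias in (-100, -50, 0, 50, 100):
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Is water wet? Answer Yes or No."}],
        max_completion_tokens=2,
        logprobs=True,
        top_logprobs=5,
        logit_bias={yes_token: bias},
    )
    top = r.choices[0].logprobs.content[0].top_logprobs
    print(bias, r.choices[0].message.content,
          {t.token: round(t.logprob, 3) for t in top})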


Interesting find, however:

maybe that’s the culprit in your experiment?

If you look at how the sequence would be encoded:

It encodes to the same token numbers that are being biased against, the same ones seen when you expand my results: 3160 and 13022.

The logprobs in the results also show us the strings being sampled from at that token position, and each logprob entry is a single token.

The GPT-4o tokenizer provided by OpenAI shows that "Yes" and "No" are each still a single token, with or without a space after the colon.
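
That can be confirmed directly with tiktoken (a sketch; token IDs are looked up rather than hardcoded):

# Confirm the encoding claims: "Yes" and "No" are single tokens with or without
# a leading space, and the JSON answer string splits at exactly these tokens.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
for s in ("Yes", " Yes", "No", " No"):
    print(repr(s), enc.encode(s))  # each prints a single token ID
for s in ('{"value":"No"}', '{"value": "No"}'):
    print(repr(s), [enc.decode([t]) for t in enc.encode(s)])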

However, the AI can generate whatever tokens it wants; it doesn't have to follow the pattern of the input encoding. It could spell everything out in individual letter tokens if it were trained to do so. (Very early API code suggested one might be able to send token numbers into completions…)

This theory can be proved false. The full testing code was run with 10 enums of varying complexity. Here we can see the tokens returned by the API; I iterate through each "token" entry from the logprobs (top_logprobs=1):

Is Sam Altman a nice guy?
---
Bias: {13022: -99, 3160: 0}
==========
Response token(s) of value: [13022]; Response:
{"value":"Yes"}
---
0:
{
  "token": "{\"",
  "logprob": -4.00813e-06,
  "top_logprobs": [
    {
      "token": "{\"",
      "logprob": -4.00813e-06,
      "prob": 0.9999959918780326
    }
  ],
  "prob": 0.9999959918780326
}
1:
{
  "token": "value",
  "logprob": 0.0,
  "top_logprobs": [
    {
      "token": "value",
      "logprob": 0.0,
      "prob": 1.0
    }
  ],
  "prob": 1.0
}
2:
{
  "token": "\":\"",
  "logprob": 0.0,
  "top_logprobs": [
    {
      "token": "\":\"",
      "logprob": 0.0,
      "prob": 1.0
    }
  ],
  "prob": 1.0
}
3:
{
  "token": "Yes",
  "logprob": -0.0031816366,
  "top_logprobs": [
    {
      "token": "Yes",
      "logprob": -0.0031816366,
      "prob": 0.9968234194421429
    }
  ],
  "prob": 0.9968234194421429
}
4:
{
  "token": "\"}",
  "logprob": -5.5122365e-07,
  "top_logprobs": [
    {
      "token": "\"}",
      "logprob": -5.5122365e-07,
      "prob": 0.9999994487765019
    }
  ],
  "prob": 0.9999994487765019
}

(Is Altman nice? 99.68% "Yes"; a -99 bias against "Yes" does nothing to the output.)

The system would have to be quite broken across tokenizers and models for the logprobs to lie while also returning entirely the wrong tokens at different splits, yet still agreeing with other tools. The chunking of streamed responses aligns with the same token boundaries as well.
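
For completeness, a sketch of that streaming check, reusing the variables from the demo code above and assuming streamed chunks carry per-token logprobs the way non-streamed responses do:

# Stream the same biased request and print each token as it arrives. The tokens
# and their boundaries match the splits reported by the non-streaming logprobs.
stream = client.chat.completions.create(
    model=model,
    messages=system + user,
    max_completion_tokens=5,
    logprobs=True,
    logit_bias=logit_bias,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].logprobs and chunk.choices[0].logprobs.content:
        for tok in chunk.choices[0].logprobs.content:
            print(repr(tok.token), tok.logprob)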