Hi there! I am trying to get GPT-4o to make judgments on some tricky questions by outputting “True” or “False”. Surprisingly, out of 1000 instances it failed to produce a valid output for 200 of them. I was wondering if anyone has encountered similar issues and whether you’ve solved them! Thanks!
Hi, welcome to the community! You can try a stricter system prompt along these lines:
You are a strict binary classifier. Your only task is to evaluate whether a given statement is true or false. You must respond with exactly one word: either ‘True’ or ‘False’. Do not explain your answer, do not elaborate, and do not include any additional text under any circumstances. Only reply with ‘True’ or ‘False’ and nothing more.
But some statements may not have a clear true-or-false answer. To handle those, you can add a “Cannot Answer” option:
You are a strict binary classifier. Your only task is to evaluate whether a given statement is true or false based on objective, verifiable information. You must respond with exactly one word: either ‘True’ or ‘False’. If the question is inherently subjective, unprovable, or based on personal belief (e.g. religious or philosophical), respond with ‘Cannot Answer’. Do not explain or elaborate.
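For instance, here is a minimal sketch of wiring that prompt into the Python SDK (the model name, temperature, `max_tokens` value, and the `classify` helper are my assumptions, not part of the prompt itself):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a strict binary classifier. Respond with exactly one word: "
    "'True' or 'False'. If the statement is inherently subjective or "
    "unprovable, respond with 'Cannot Answer'. Do not explain or elaborate."
)

def classify(statement: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,   # reduce sampling variance
        max_tokens=3,    # enough room for "Cannot Answer"
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": statement},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("The Eiffel Tower is in Berlin."))  # expected: False
```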
This is not an issue of prompting ChatGPT.
It is an API concern, observed by running repeated trials and gathering statistics. It demonstrates that the AI has a decent chance of emitting a stop sequence or other output beyond what was desired and instructed.
However, one does not need to gather such statistics or make multiple runs.
For such a one-token classification, you can use logprobs, bypassing the random sampler entirely and getting quite reliable results.
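For example, a sketch that reads the top logprobs of the single answer token rather than trusting whatever the sampler picked (the model, prompt, and `top_logprobs=5` choice are assumptions):

```python
import math
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=1,       # the classification is a single token
    logprobs=True,
    top_logprobs=5,     # inspect the five most likely first tokens
    messages=[
        {"role": "system", "content": "Respond with exactly one word: True or False."},
        {"role": "user", "content": "Water boils at 100 degrees Celsius at sea level."},
    ],
)

# Read the full candidate distribution instead of trusting the sampled pick.
for alt in response.choices[0].logprobs.content[0].top_logprobs:
    print(f"{alt.token!r}: p = {math.exp(alt.logprob):.4f}")
```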
Another strategy is to use the logit_bias parameter on chat completions to promote the desired answers against others.
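A sketch of that, assuming tiktoken is used to look up gpt-4o’s token IDs (the +5 bias value and the example question are also assumptions):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

# Each answer word should map to a single token for a clean bias.
bias = {
    str(enc.encode("True")[0]): 5,   # mild promotion; +100 would effectively force it
    str(enc.encode("False")[0]): 5,
}

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=1,
    logit_bias=bias,
    messages=[{"role": "user", "content": "True or False: the Moon orbits the Earth."}],
)
print(response.choices[0].message.content)
```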
Then finally, use structured outputs so the AI is compelled to write JSON containing that answer.
Here’s some code that returns a “boolean” answer and also shows the logprob at the answer position in a structured output. The sampled answer is also strongly affected by `developer_enum_bias = {"Yes": -99, "No": 0}`, where you could instead put +3 on an option for a bit of promotion. Those enum values are then automatically included in the instructions.
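A sketch along those lines, assuming a strict JSON schema with an enum field, a tiktoken lookup to turn `developer_enum_bias` into the token-level `logit_bias` the API expects, and gpt-4o as the model:

```python
import json
import math
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

# Negative values demote an enum option; small positive values (e.g. +3) promote it.
developer_enum_bias = {"Yes": -99, "No": 0}

# Translate the per-enum bias into the token-level logit_bias the API expects.
logit_bias = {str(enc.encode(word)[0]): b for word, b in developer_enum_bias.items()}

schema = {
    "name": "binary_answer",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "answer": {"type": "string", "enum": list(developer_enum_bias)},
        },
        "required": ["answer"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Is the sky blue on a clear day?"}],
    response_format={"type": "json_schema", "json_schema": schema},
    logit_bias=logit_bias,
    logprobs=True,
    top_logprobs=5,
)

answer = json.loads(response.choices[0].message.content)["answer"]
print("sampled answer:", answer)

# Show the logprob at the answer position inside the structured JSON output.
for tok in response.choices[0].logprobs.content:
    if tok.token in developer_enum_bias:
        for alt in tok.top_logprobs:
            print(f"  {alt.token!r}: p = {math.exp(alt.logprob):.4f}")
```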
With these techniques, what began as a demonstration of the fault can instead be turned into a reliable classification application.