Zero shot classification with OpenAI - response bias towards first label?

I was trying to classify two sentences in my input.csv into the right labels and was happily leveraging ChatGPT to write this code for the task. What I noticed is that the logprobs output is not clear enough to summarize into the appropriate label. This is possibly the 10th iteration with ChatGPT on this code, and I am hoping someone can tell me whether this is really the only way right now to compute the max probability for a given label, because the results seem inaccurate.

import openai
import pandas as pd
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def classify_with_gpt3(payload, labels):
    prompt = f"{payload}\nLabels: {', '.join(labels)}\n"

    response = openai.Completion.create(
        engine="text-davinci-002",  # GPT-3 engine
        prompt=prompt,
        max_tokens=100,  # Adjust this value based on your requirement
        logprobs=10  # Request log probabilities
    )

    # Extract classification probabilities from the log probabilities
    logprobs = response['choices'][0]['logprobs']
    softmax_probs = softmax(np.array(list(logprobs['top_logprobs'][0].values())))

    probabilities = {}
    for label, prob in zip(labels, softmax_probs):
        # Convert probability to percentage
        percentage = prob * 100
        probabilities[label] = percentage

    return probabilities

def process_csv(input_csv, output_csv, candidate_labels):
    df = pd.read_csv(input_csv)

    results = []
    for _, row in df.iterrows():
        line_number = row['line']
        payload = row['payload']
        model_name = 'gpt-3'

        # Classify the payload using GPT-3
        probabilities = classify_with_gpt3(payload, candidate_labels)

        # Create a list of values for this line
        result = [model_name, line_number] + [probabilities[label] for label in candidate_labels]

        # Append the list to the results
        results.append(result)

    # Create a DataFrame from the results and save to output.csv
    columns = ['model_name', 'line'] + candidate_labels
    result_df = pd.DataFrame(results, columns=columns)
    result_df.to_csv(output_csv, index=False)

# Example usage:
input_csv = 'input.csv'
output_csv = 'output.csv'
candidate_labels = ["Politics", "PHI/PII", "Legal", "Company performance", "None of these"]
process_csv(input_csv, output_csv, candidate_labels)

“”" my input csv
line,payload
1,“The ministers, in order to defame the opposition spread fake news and give provocative speeches against them.”
2,This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion.
the output i get - always biased towards the first label !
model_name,line,Politics,PHI/PII,Legal,Company performance,None of these
gpt-3,1,97.81649506420523,1.0164936874742982,0.49915048477819346,0.379569464782078,0.2882912987601862
gpt-3,2,99.76830158051013,0.13429877709459467,0.03600299385776774,0.031569485259897334,0.029827163277649497
“”"
Kindly advise. I can’t think the API is at fault; more likely there is something wrong with how this code gets at the output.

The presentation of information here is a mess, the definition of “labels” is unclear, and the prompt is a text dump with no instruction for the model to follow.

Let’s start with a helper that converts a one-token logprob to a probability in 0.0-1.0:

def logprob_to_prob(logprob: float) -> float:
    return np.exp(logprob)
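
For example (values purely for illustration):

print(logprob_to_prob(-0.02))   # ~0.980, a near-certain top token
print(logprob_to_prob(-4.61))   # ~0.010, an unlikely alternative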

Then we leave absolutely no ambiguity about the meaning of the arbitrary labels, nor about the single token to be produced, and upgrade to text-davinci-003:

[screenshot: revised single-token classification prompt]
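
In code form, a constrained call along these lines might look like the sketch below (the exact prompt wording is in the screenshot and not reproduced here; the one-word label aliases, the instruction text, and temperature=0 are my assumptions):

import math
import openai

def classify_one_token(payload, labels):
    # Ask for exactly one lowercase word from the label list (assumed wording)
    prompt = (
        "Classify the text into exactly one of these categories: "
        + ", ".join(labels) + ".\n"
        "Answer with a single lowercase word only.\n\n"
        f"text: {payload}\n"
        "category:"
    )
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=1,     # force a single-token answer
        temperature=0,
        logprobs=5,       # top five alternatives for that one token
    )
    # top_logprobs[0] maps the candidate tokens at the answer position to logprobs
    top = response["choices"][0]["logprobs"]["top_logprobs"][0]
    return {token.strip(): math.exp(lp) for token, lp in top.items()}

Multi-word labels such as “PHI/PII” would need a short one-token alias (e.g. “phi”) for a single-token comparison like this to make sense.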

And we get results that are still massively in favor of the first label (not because of “which came first”, but because the other categories are less likely to be produced in general by a classifier working without definitions):

[screenshot: logprob results for the constrained prompt]

And how massively does it favor the one-word answer?

[screenshot: top logprobs for the single answer token]

I figured from some more reading that OpenAI completion calls are not necessarily the right candidates for classification per se. The probability scores are more a reflection of the upcoming text the model generates towards a completion, and I should not have assumed that taking top_logprobs drives towards the highest-percentage classification. My bad.

However, they do open up content-moderation alternatives (further abstracted by the likes of LangChain) in this self-critique chain: Self-critique chain with constitutional AI | 🦜️🔗 Langchain
I will go through the same for a boolean response.
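
For reference, a rough sketch of that kind of constitutional chain, loosely following the linked LangChain page (the principle wording, chain names, and prompt here are placeholders of mine, not taken from the docs):

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains.constitutional_ai.base import ConstitutionalChain
from langchain.chains.constitutional_ai.models import ConstitutionalPrinciple

llm = OpenAI(temperature=0)

# Plain question-answering chain whose output will be critiqued and revised
qa_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        template="Question: {question}\nAnswer:",
        input_variables=["question"],
    ),
)

# Hypothetical DLP-style principle: critique the answer, then revise it
dlp_principle = ConstitutionalPrinciple(
    name="DLP check",
    critique_request="Identify whether the answer exposes politics, legal matters, "
                     "company performance, or personally identifiable information.",
    revision_request="If any such content is present, rewrite the answer as a polite refusal.",
)

constitutional_chain = ConstitutionalChain.from_llm(
    chain=qa_chain,
    constitutional_principles=[dlp_principle],
    llm=llm,
    verbose=True,
)

constitutional_chain.run(question="Summarize the councilman's lawsuit filing.")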

Thanks for reading.

Sorry about the way the code got pasted … here is the full code. That said, as mentioned above, I still don’t think the completion API is the right way to get at all the classification probabilities. Please correct me if I am wrong.

import openai
import pandas as pd
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def classify_with_gpt3(payload, labels):
    prompt = f"{payload}\nLabels: {', '.join(labels)}\n"

    response = openai.Completion.create(
        engine="text-davinci-003",  # GPT-3 engine
        prompt=prompt,
        max_tokens=100,  # Adjust this value based on your requirement
        logprobs=10  # Request log probabilities
    )

    # Extract classification probabilities from the log probabilities
    logprobs = response['choices'][0]['logprobs']
    softmax_probs = softmax(np.array(list(logprobs['top_logprobs'][0].values())))

    probabilities = {}
    for label, prob in zip(labels, softmax_probs):
        # Convert probability to percentage
        percentage = prob * 100
        probabilities[label] = percentage

    return probabilities

def process_csv(input_csv, output_csv, candidate_labels):
    df = pd.read_csv(input_csv)

    results = []
    for _, row in df.iterrows():
        line_number = row['line']
        payload = row['payload']
        model_name = 'gpt-3'

        # Classify the payload using GPT-3
        probabilities = classify_with_gpt3(payload, candidate_labels)

        # Create a list of values for this line
        result = [model_name, line_number] + [probabilities[label] for label in candidate_labels]

        # Append the list to the results
        results.append(result)

    # Create a DataFrame from the results and save to output.csv
    columns = ['model_name', 'line'] + candidate_labels
    result_df = pd.DataFrame(results, columns=columns)
    result_df.to_csv(output_csv, index=False)

# Example usage:
input_csv = 'input.csv'
output_csv = 'output.csv'
candidate_labels = ["Politics", "PHI/PII", "Legal", "Company performance", "None of these"]
process_csv(input_csv, output_csv, candidate_labels)

It will now only return five logprob results. If you didn’t have such strict prompting, many of those would be whitespace tokens instead of words, or would favor alternates. This prompt could go further by stating lowercase-only output, to fight the 99%-uppercase results that come from the input starting upper-case.

These are logit probabilities from the language model (unaffected by temperature or top-p), and likely don’t have much relation to actual “how much is it like politics vs how much is it like legal”.


Thanks @_j … I figured that these logit probabilities are not answering the actual “classification” question, so I can technically close this thread. But when I go through the likes of LangChain and the abstraction they provide, there are no “scores” or probability values for us to compare a GPT-3 output against BERT and a few other models we can test from Hugging Face. Kindly let me know if there are ways to get such values through an OpenAI call (possibly by hitting the moderations endpoint - OpenAI Platform).

The OpenAI Platform docs say that the category scores the moderations endpoint provides cannot be treated as probabilities … so I guess I should now try custom policies to get some score to start with, and use the scores relative to each other rather than seeing them as something that totals to 1.
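
For reference, a minimal sketch of a moderation call with the same older openai Python library used above; note that the built-in categories are OpenAI’s safety categories rather than custom labels, and the scores are explicitly not probabilities:

import openai

response = openai.Moderation.create(
    input="Sample text to check against the moderation categories."
)
result = response["results"][0]
print(result["flagged"])           # overall boolean verdict
print(result["category_scores"])   # per-category scores, not probabilities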

Amusing, before we mark this “no solution”: confuse the AI with all of the categories in one input, using lowercase definitions and prompts. The AI is decisive:

[screenshot: decisive single-category output]

That input: Councilman Jon Astor of 332 Cherry Ln filed a motion to dismiss the lawsuit against his company’s poor earnings.


This is a language task that can be done more effectively and more cheaply by gpt-3.5-turbo and prompts. It is adept, and even better at unbiased zero-shot classification. You can normalize these results, but that would skew the output; 0.0 across all categories would be the equivalent of “none” here.

system:

AI rates user’s article contents in a continuous scale 0.0-10.0 based on the presence and applicability of each category:
[ phi | politics | legal | performance ]

Definitions:

  • phi: protected health information or personally identifiable information present
  • politics: political figures, elections, proposals, negotiations
  • legal: agreement or contract language, judgements
  • performance: company earnings or reports or predictions

Output: python floats in square bracket list

user: Councilman Jon Astor of 332 Cherry Ln filed a motion to dismiss the lawsuit against his company’s poor earnings.
output: [2.5, 4.0, 4.0, 7.5]

or a massive failure still:

user: Jon Astor, SSN 358-28-3855, melanoma stage 2, recommend policy review
output: [2.5, 0.0, 0.0, 0.0]

which is fixed by a better output instruction in the prompt: “Output: floats in python dictionary”

{
  "phi": 10.0,
  "politics": 0.0,
  "legal": 0.0,
  "performance": 0.0
}
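
A minimal sketch of wiring that system prompt into a call, using the openai.ChatCompletion interface from the same library version as the code above (the condensed prompt string and the ast-based parsing are my assumptions):

import ast
import openai

system_prompt = (
    "AI rates user's article contents in a continuous scale 0.0-10.0 based on the "
    "presence and applicability of each category: [ phi | politics | legal | performance ]\n"
    "Definitions:\n"
    "- phi: protected health information or personally identifiable information present\n"
    "- politics: political figures, elections, proposals, negotiations\n"
    "- legal: agreement or contract language, judgements\n"
    "- performance: company earnings or reports or predictions\n"
    "Output: floats in python dictionary"
)

def rate_categories(text):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
    )
    raw = response["choices"][0]["message"]["content"]
    # The reply is a small Python-style dict as text; parse it safely
    return ast.literal_eval(raw)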

Cool alternative … I will try this out once I finish with the ConstitutionalChain from LangChain.

Thanks once again. Somehow the intuition of getting a relative score from the prompt itself is something new to me, and maybe I have to get used to it.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

payload_prompt = PromptTemplate(
    template="""You are DLP yoda that will politely refuse to entertain conversations on categories including: Politics, legal matters, company performance, personally identifiable information.
    You will respond with a simple Yes or No - on whether this transaction is allowed.
    If No, you will provide the applicability under each category on a scale of 0 to 10 as floats in python dictionary.
payload: {payload}
Ethical answer:""",
    input_variables=["payload"],
)
llm = ChatOpenAI(model_name='gpt-4', temperature=0)
payload_chain = LLMChain(llm=llm, prompt=payload_prompt)
payload_chain.run(payload="Councilman Jon Astor of 332 Cherry Ln filed a motion to dismiss the lawsuit against his company's poor earnings")

GPT-4 gets it fully right, while gpt-3.5-turbo-0613 gets 3 out of 4 right :slight_smile:

Output:

"No\n{'Politics': 10.0, 'Legal Matters': 10.0, 'Company Performance': 10.0, 'Personally Identifiable Information': 10.0}"

I think you should ask ChatGPT if it can guess what a “DLP yoda” is before you put such weird stuff in a prompt. You don’t tell a classifier to “politely refuse” if it is supposed to make a JSON.

See to it that your new prompt keeps the model from only outputting 0 or 10. That’s a refinement in the prompt language I used, because examples containing “10” were fixing the outputs at 0 or 10.

Acknowledged … I was just evolving this based on your prompt-driven suggestion above, and the “yoda” remained :slight_smile:

The other alternative - just to share what I learned - is the set of models designed more specifically for classification. I’ve tried a few of these, and I should now compile all the output percentages, including the ones from the prompts above, into a single matrix just to see how they compare.

from transformers import pipeline

pipe = pipeline(model="facebook/bart-large-mnli")
pipe("The ministers, in order to defame the opposition spread fake news and give provocative speeches against them.",
    candidate_labels=["Politics", "PHI/PII", "Legal", "Company performance", "Could not classify"],
)
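
For that comparison matrix, a sketch of collecting the pipeline scores into a DataFrame next to a model name (the zero-shot pipeline returns parallel 'labels' and 'scores' lists; the single hard-coded example row and the "bart-large-mnli" label are just placeholders):

import pandas as pd
from transformers import pipeline

candidate_labels = ["Politics", "PHI/PII", "Legal", "Company performance", "Could not classify"]
pipe = pipeline(model="facebook/bart-large-mnli")

def bart_scores(text):
    out = pipe(text, candidate_labels=candidate_labels)
    # Convert the parallel labels/scores lists into label -> percentage
    return {label: score * 100 for label, score in zip(out["labels"], out["scores"])}

rows = []
for line, text in [(1, "The ministers, in order to defame the opposition spread fake news and give provocative speeches against them.")]:
    rows.append({"model_name": "bart-large-mnli", "line": line, **bart_scores(text)})

matrix = pd.DataFrame(rows, columns=["model_name", "line"] + candidate_labels)
print(matrix)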

Thanks.