Interpret scores generated by GPT-4 Vision in multilabel classification

I am planning to do multilabel classification on a product dataset. The input data consists of images and text, and there are around 11 labels. I can craft a prompt that generates the relevant labels given the product info, and the prompt can also ask for a probability score for each generated label. My questions are:

  1. How should one interpret the generated probability scores?

  2. Can they be used as a proxy for true class probabilities and then be used to calculate traditional metrics such as precision, recall, F1, and AUC-ROC?

  3. Are there any best practices around how to interpret the scores generated by GPT-4 in such a scenario?

  4. Any relevant research papers I can read to enhance my understanding of the scores generated? I found a few, but I am not allowed to paste links :slight_smile: One is “Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4’s Text Ratings” and another is “Language models can explain neurons in language models”.

  5. How to make the output generated reproducible besides setting the seed parameter?

I’ll answer some of the more intellectually stimulating questions here - you get to do the hard work.

The outputs can only be mostly reproducible. The model itself has variations between identical runs.

The AI models use a sampler that randomly picks tokens according to their probabilities. If the most likely first token of an answer is 80% certain to be “cat” and 15% certain to be “dog”, then at default API sampling parameters, about 15% of trials will have the AI answering “dog”.

top_p is the parameter that best constrains the AI to the charted path. At 0.1, the response is generated only from tokens that occupy the top 10% of the probability mass, so the AI could only write “cat”. That’s likely what you want: the best answer.

seed alone would give you “dog” almost every time if that is what a particular seed happened to select; the reproducible part there is reusing the same randomness.
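
A minimal sketch of pinning these parameters down, assuming the openai Python SDK (v1.x); the model name is a placeholder:

```python
# A sketch, not a guarantee: top_p and seed together get you
# "mostly reproducible", nothing more.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model name
    messages=[
        {"role": "system", "content": "Assign the relevant labels to the product."},
        {"role": "user", "content": "Product info goes here."},
    ],
    top_p=0.1,   # sample only from the top 10% of probability mass
    seed=1234,   # reuse the same randomness across runs
)

print(response.choices[0].message.content)
# Changes in system_fingerprint explain residual run-to-run variation.
print(response.system_fingerprint)
```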

The self-reported scores will be verging on hallucination, though. If you ask the AI how certain it is, you’re going to get numbers shaped by the training and the prompting.

Consider if my system message told the AI one of these things about itself:

  • “The AI is an expert with super-human logic and reasoning skills”
  • “ChatPro only gives the correct answer”
  • “The AI language generation can make mistakes, so double-check what you wrote”

Such input context can make the probability tokens the AI selects “more confident” when you ask it to evaluate itself; the score reflects the prompt, not the underlying mechanisms (which the AI can’t observe).

The best way to see the inner workings of certainty is by employing logprobs. In a complex answer, you would have to navigate within the formatted answer to find the relevant tokens.

An example where this technique might be used: “Rate this book review from 1-10, from 1: extremely negative, to 10: extremely positive, in a JSON (format)”. You can get the top-20 logprobs at the answer position, extract all the number tokens from that, and then find the median of the probability mass or take a probability-weighted average. You’ll then likely need to renormalize the answer range so the most positive and most negative reviews can still span 1-10.
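
A sketch of that extraction, assuming the chat completions logprobs fields (logprobs=True, top_logprobs=20) and that the rating appears as a single number token at a position you’ve located inside the JSON answer:

```python
import math

def weighted_rating(top_logprobs):
    """top_logprobs: the top-20 alternatives at the rating token's
    position, each with .token and .logprob attributes."""
    probs = {}
    for entry in top_logprobs:
        tok = entry.token.strip()
        if tok.isdigit() and 1 <= int(tok) <= 10:
            probs[int(tok)] = math.exp(entry.logprob)  # logprob -> probability
    if not probs:
        return None  # no number token appeared among the alternatives
    total = sum(probs.values())  # renormalize over the number tokens only
    return sum(score * p for score, p in probs.items()) / total

# resp = client.chat.completions.create(..., logprobs=True, top_logprobs=20)
# Walk resp.choices[0].logprobs.content to locate the rating token, then:
# weighted_rating(resp.choices[0].logprobs.content[i].top_logprobs)
```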

That should give insight into your other questions also. The actual task you would want to keep relatively simple and well-instructed, ensuring the AI knows what to produce and what input it is answering about while it produces the output.


Thanks for the response. It provides interesting directions to think about. I have a few more things to add.

In the case of a multilabel classification problem, the goal behind obtaining the probability is two-fold:

  1. Understand how relevant the given label is for the given input.
  2. Utilize probability to calculate precision, recall, and F1 at different thresholds.

I understand that bluntly asking for a probability in the prompt is a bit off, as you mentioned. However, keeping your example in mind, one can ask for scores from 1-10 and also provide directions on what these scores mean (e.g., from 1: extremely irrelevant, to 10: extremely relevant). In a way, this score would indicate how relevant a given label is for the given input. These scores could then be used to calculate precision, recall, and F1 at different thresholds, as sketched below.
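
The threshold sweep itself is straightforward once the scores exist; here is a sketch with hypothetical data for a single label, using scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1])         # hypothetical ground truth, one label
scores = np.array([9, 3, 7, 5, 2, 8]) / 10.0  # 1-10 ratings mapped into (0, 1]

for t in np.arange(0.1, 1.0, 0.1):
    y_pred = (scores >= t).astype(int)
    print(f"t={t:.1f}  "
          f"P={precision_score(y_true, y_pred, zero_division=0):.2f}  "
          f"R={recall_score(y_true, y_pred, zero_division=0):.2f}  "
          f"F1={f1_score(y_true, y_pred, zero_division=0):.2f}")

# AUC-ROC is computable too, though with only ~10 distinct score
# values the curve will be coarse.
print(f"AUC={roc_auc_score(y_true, scores):.2f}")
```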

It is possible that the above might also lead to hallucination. Another idea, then, is to create English-language relevance categories, for example Highly Relevant, Neutral Relevance, Highly Irrelevant, and so on. Then, instead of asking the LLM to generate a score, ask it to assign one of these categories (written out in English words). This way, we are not relying on the LLM to generate numbers, which can lead to hallucination; instead, we are converting it into a reasoning problem. In a way, we are creating a rubric for GPT to follow, and the categories can be mapped back to ordinal scores afterwards (see the sketch below). Something similar has been done earlier here: https://aclanthology.org/2023.bea-1.49.pdf
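
A sketch of that mapping step; the category names and ordinal values here are illustrative, not a fixed scale:

```python
# Map rubric categories back to ordinal scores so thresholded
# metrics can still be computed.
RUBRIC = {
    "highly irrelevant": 0.0,
    "somewhat irrelevant": 0.25,
    "neutral relevance": 0.5,
    "somewhat relevant": 0.75,
    "highly relevant": 1.0,
}

def category_to_score(category: str) -> float:
    """Convert the LLM's category word(s) into an ordinal score."""
    return RUBRIC[category.strip().lower()]

assert category_to_score("Highly Relevant") == 1.0
```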

If possible, let me know your thoughts on the above :slight_smile:

One can evaluate performance on relevance by word production instead of number production; again, it is simply a matter of the AI’s mechanisms assigning certainties to all the tokens the AI can produce.

I use gpt-3.5-turbo-instruct on the completions endpoint, where there is a display of logprobs we can share, and I provide everything up to where the AI actually has to generate an answer token.

A picture gives us a lot of insight:

[screenshot: completions-endpoint logprobs at the classification token]

The regal text, which I thought sounded pretty literate, scores “none” in “literature”.

The AI didn’t have any guidelines, but the only thing more instruction would do is shuffle the values around and potentially the top token.

So the idea I was presenting was: evaluate all five logprobs in this case (or the number tokens). Find out whether the top answer is on top because it holds 80% of the probability or 30%. That gives more insight, because you are also observing a bit of the thinking.
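
A sketch of reading that distribution, assuming the legacy completions endpoint with logprobs=5 and single-token class words; the class list is illustrative:

```python
import math

CLASSES = {"literature", "romance", "technical", "children", "none"}  # illustrative

def class_distribution(top_logprobs: dict) -> dict:
    """top_logprobs: {token: logprob} at the answer position, as
    returned by the completions endpoint with logprobs=5."""
    probs = {tok.strip().lower(): math.exp(lp) for tok, lp in top_logprobs.items()}
    dist = {c: probs.get(c, 0.0) for c in CLASSES}
    total = sum(dist.values()) or 1.0
    return {c: p / total for c, p in dist.items()}  # renormalize over classes

# A winner holding 80% of the mass is a decisive call; 30% is closer to
# a coin toss among alternatives worth surfacing to a human reviewer.
```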

Then compare this subjective determination to human evaluations, and see where the AI needs to be taught better.


With my task, merely changing the first two words of the next text from “parents” to “techno geeks” flips the classification several ways, and here you can see that by employing logprobs your final result might land midway:

[screenshot: logprobs shifting across classes after the two-word change]

And the AI might even be providing answers based on how common your choice of words is in the training corpus.