Best way to make GPT estimate probabilities?

What are the best ways to make GPT estimate probabilities in its responses when predicting something? It is notoriously bad with percentages: it gives a different number each time, has questionable calibration, and rounds the probabilities.

(Please don’t just reply that AI models are not classification ML models; everyone knows that, but sometimes using an AI model is the only option.)

You should look into logprobs, the log probabilities of the output tokens, which you can get via the API.

You convert these to the normal linear probability space (0-1) by taking exp(logprob).

This works great for single token output classifiers.
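
For illustration, a minimal sketch with the Python SDK (the model name and prompt are placeholder assumptions):

```python
import math

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": "Answer with a single word: Yes or No."},
        {"role": "user", "content": "Will it rain in Seattle tomorrow?"},
    ],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # also return the runner-up tokens
)

# Each candidate token comes with a log probability; exp() maps it
# back to linear probability space (0-1).
for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    print(f"{cand.token!r}: {math.exp(cand.logprob):.4f}")
```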

2 Likes

Without using o1 or o1-mini, which are significantly better at math, to my knowledge the best way to have a GPT calculate a probability is to ask it to use Python for the calculations.

Even then, the models are non-deterministic by default and will rarely use the same calculation twice if they can help it. That means you won’t get the same number unless you specifically instruct the model to use a specific equation, either in code or symbolically.
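
As an illustration, a minimal sketch of pinning the model to one specific equation (the model name and the Elo-style formula are just placeholder assumptions):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    temperature=0,  # reduces (but does not eliminate) run-to-run variation
    messages=[
        {"role": "system", "content": (
            "Compute the win probability using exactly this formula and no other: "
            "p = 1 / (1 + 10 ** ((rating_b - rating_a) / 400)). "
            "Show the substitution, then the final number rounded to 3 decimals."
        )},
        {"role": "user", "content": "rating_a=1600, rating_b=1500"},
    ],
)
print(resp.choices[0].message.content)
```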

1 Like

I thought about this, but I didn’t know the OpenAI API offers the probabilities. Is it really the best way, though? It uses only the model itself, without any additional thinking. I could make it think first and then generate the single token based on that thinking, but would that be the same as thinking about the number directly?

There is a Metaculus tournament, sponsored by OpenAI, for forecasting various events using AI models; it would be interesting to know what the other competitors use to forecast the probability of the events.

My own ideas include setting many thresholds and letting the model decide about each of them (which turns the task into classification, said to be easier than regression for AI models), or ranking several problems at once, probably iterating over a lot of random pairs, and then assigning probabilities based on their ranks (which makes it even easier for the model, since it only has to compare two texts).
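
A minimal sketch of the threshold idea (the model name, prompt wording, and threshold grid are assumptions; counting Yes answers tolerates an occasional non-monotone response):

```python
from openai import OpenAI

client = OpenAI()

THRESHOLDS = [0.1, 0.3, 0.5, 0.7, 0.9]  # assumed grid; tune to your use case

def above_threshold(event: str, threshold: float) -> bool:
    """One yes/no classification per threshold."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        max_tokens=1,
        messages=[
            {"role": "system", "content": "Answer with exactly one word: Yes or No."},
            {"role": "user", "content": (
                f"Is the probability of this event greater than {threshold:.0%}? "
                f"Event: {event}"
            )},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def estimate(event: str) -> float:
    # Count the Yes answers; the estimate is the midpoint of the band
    # where the model flips from Yes to No.
    yes_count = sum(above_threshold(event, t) for t in THRESHOLDS)
    if yes_count == 0:
        return 0.05
    if yes_count == len(THRESHOLDS):
        return 0.95
    return (THRESHOLDS[yes_count - 1] + THRESHOLDS[yes_count]) / 2
```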

The cleanest way, at least right now, is to have a very small and discrete probability distribution. You then train a 1-token output classifier, via a fine-tune, that yields outcomes from this distribution. After training, you expose the logprobs via the API to get a confidence score for the choice it made.
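
For reference, a sketch of what the fine-tuning data for such a 1-token classifier might look like (the bucket labels, example texts, and system prompt are assumptions, not a recommendation):

```python
import json

# Map single-token bucket labels to the small discrete distribution.
# Single letters are used so the output really is one token.
BUCKETS = {"A": 0.1, "B": 0.3, "C": 0.5, "D": 0.7, "E": 0.9}

examples = [
    ("Heavy favorite at home, opponent missing two starters.", "D"),
    ("Evenly matched teams on a neutral court.", "C"),
]

with open("train.jsonl", "w") as f:
    for text, bucket in examples:
        assert bucket in BUCKETS
        f.write(json.dumps({"messages": [
            {"role": "system", "content": "Classify into one bucket: " + ", ".join(BUCKETS)},
            {"role": "user", "content": text},
            {"role": "assistant", "content": bucket},  # the 1-token target
        ]}) + "\n")
```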

You can also try prompting with multi-shot examples, and then see what the logprobs reveal.

Another approach is to use embeddings. Here you label items, either manually or through AI; when a new item comes in, you correlate it against your labeled embeddings and use them as a reference for how to label the new item. I usually do a correlation-weighted average of the top X embeddings, where X is a function of how much data gets reliably correlated with your input, which you have to test empirically. As you label more data, X gets bigger.
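
A minimal sketch of that correlation-weighted average, assuming you already have unit-normalized labeled embeddings and numeric labels:

```python
import numpy as np

def estimate_label(query_vec: np.ndarray,
                   labeled_vecs: np.ndarray,   # shape (n, d), unit-normalized
                   labels: np.ndarray,         # shape (n,), numeric labels
                   top_x: int = 10):
    query = query_vec / np.linalg.norm(query_vec)
    sims = labeled_vecs @ query                # cosine similarity per labeled item
    idx = np.argsort(sims)[-top_x:]            # top X most correlated neighbors
    weights = sims[idx]                        # assumed positive for near neighbors
    mean = float(np.average(labels[idx], weights=weights))
    sigma = float(np.sqrt(np.average((labels[idx] - mean) ** 2, weights=weights)))
    return mean, sigma                         # label estimate plus uncertainty
```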

But all this is taking chunks of text, aligning them to “buckets”, and ascribing a label (usually a mean-type value) and an uncertainty (usually a sigma-type value). Based on the label and its uncertainty, you can take the appropriate action. For example, take the label seriously if sigma is low, or do something else if sigma is high.

But if we are talking about solving things like word problems in statistics, then that is not what I am talking about. That would be more like CoT or some deep recursive reasoning, probably with function calls on the side to make the math correct.

3 Likes

Thanks for both interesting ideas, although both are hardly usable for the mentioned tournament (they need some existing data). They might help with sports prediction, for example. Today I tried the single-token log probability strategy on predicting NBA scores; it’s interesting, and some fine-tuning might make it even better.

2 Likes

Yes, and don’t underestimate the embedding version either. While it may be a lot more upfront work in labeling, it pays dividends in redundancy and cost reduction, because you can use multiple (inexpensive) embedding models in parallel.
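
A sketch of that parallel-model redundancy, assuming the OpenAI embeddings endpoint and placeholder model names (any two cheap embedding models would do):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Placeholder model names; swap in whichever inexpensive models you use.
MODELS = ["text-embedding-3-small", "text-embedding-3-large"]

def embed(model: str, texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def ensemble_estimate(item: str, labeled_texts: list[str],
                      labels: np.ndarray, top_x: int = 10) -> float:
    # Average the correlation-weighted estimate across models; if one
    # model's similarities are off, the other dampens the error.
    estimates = []
    for model in MODELS:
        ref = embed(model, labeled_texts)      # cache these in real use
        sims = ref @ embed(model, [item])[0]
        idx = np.argsort(sims)[-top_x:]
        estimates.append(float(np.average(labels[idx], weights=sims[idx])))
    return sum(estimates) / len(estimates)
```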

2 Likes

Thanks. Fine-tuning didn’t work too well; it only learned to say “yes” instead of “Yes.”. Maybe it would need many more training examples.
But embeddings work great. I generated embeddings for various thresholds, and I am using a logistic regression model on top of them.
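
For anyone curious, a minimal sketch of an embeddings-plus-logistic-regression setup (the model name and toy training data are assumptions):

```python
import numpy as np
from openai import OpenAI
from sklearn.linear_model import LogisticRegression

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # text-embedding-3-small is a placeholder; any embedding model works
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Toy training data: historical items with assumed 0/1 outcomes
train_texts = ["heavy favorite at home, opponent missing two starters",
               "evenly matched teams, back-to-back road game"]
train_labels = [1, 0]

clf = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)

# predict_proba yields a probability rather than a hard class
p = clf.predict_proba(embed(["new matchup description"]))[0, 1]
print(f"estimated probability: {p:.3f}")
```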

1 Like

I pretty much use embeddings for everything these days and don’t have any substantial fine-tunes running. Good to hear I am not alone.

Does anyone have any gists or snippets they’d be willing to share that demonstrate these methods?

1 Like