Text classification variation during the day

Hello,
I’m using a fine-tuned gpt-4o model to classify sentences. The prompt provides classification guidelines, and the fine-tuning input provides about a hundred classified sentences.
To make the results more deterministic, I run the classification request 10 times and take the median values.
Results obtained at 09:00am CET are consistent.
Results obtained at 08:00pm CET may not be consistent, and they differ from the results obtained at 09:00am CET.
How come?
Any idea what should be done to overcome these inconsistencies?
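For reference, the repeated-run approach can be sketched like this (the `classify` stub below is a hypothetical placeholder standing in for the actual call to the fine-tuned model):

```python
import statistics

def classify(sentence: str) -> float:
    """Hypothetical stub for a call to the fine-tuned model.
    In practice this would return the model's classification score."""
    return 0.7  # placeholder value for illustration

def median_classification(sentence: str, runs: int = 10) -> float:
    """Run the classification several times and take the median score."""
    scores = [classify(sentence) for _ in range(runs)]
    return statistics.median(scores)
```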
Thank you,
Xavier

1 Like

Welcome to the Forum!

You can’t really achieve deterministic results, so an element of variation is quite normal, especially if there is a large number of classification categories and the categories have some inherent overlap.

That said, there are a number of factors that can influence consistency - temperature settings among them. What temperature are you currently applying when using your fine-tuned model?

1 Like

Try asking the GPT why it classified the sentence as it did. Then, ask the GPT to create a protocol to improve its precision.

I’ll relay a similar story - or at least you will see why it is similar after reading.

See, my dad, he buys scratch-off lottery tickets every day. Now, we know the lottery commission keeps roughly 50% of every ticket sale over the long run and in large numbers, and with scratch-offs, they actually generate a huge batch of tickets at once to guarantee this return - or this loss for the players. The Law of Large Numbers.

So my dad, not being completely mathematically illiterate, wants to buy his ten tickets every day and lose half of his money, to ensure that there is no inheritance, apparently. However, he is dismayed that all his money is taken some days, while other rare days, he earns a large reward.

You see, this multinomial distribution with a statistical mixture has high variance. It can even look like a pattern. He thinks Fridays are the lucky days, the fool having come out ahead on multiple Fridays.

Fortunately, I don’t actually have to witness this, as my father is dead. It does serve as an illustration of fallacy, though.

AI models are for generating language. Creative and unpredictable language. You don’t win the same bland robotic output every time you play; in fact, by design, the output is sampled from the cumulative distribution of token probabilities generated by the model, which makes the result non-deterministic and non-repeatable.

Someone who generates ten outputs clearly expects the results to be different each time. This can only happen because the sampling of AI tokens is not greedy - not always picking the best token at every position through the generation. Some outputs may even strike you as better than the model’s best prediction, being more human-like poetry, or such.
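To make the greedy-versus-sampled distinction concrete, here is a toy sketch (the token distribution below is made up for illustration, not real model output):

```python
import random

# Hypothetical next-token distribution for a classification label
token_probs = {"positive": 0.6, "negative": 0.3, "neutral": 0.1}

def greedy_pick(probs):
    """Always choose the highest-probability token - deterministic."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Sample from the distribution - non-deterministic by design."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random()
print(greedy_pick(token_probs))       # always "positive"
print(sample_pick(token_probs, rng))  # usually "positive", but not always
```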

In fact, there is a parameter on the completions endpoint called best_of: it can take these language generations based on randomness, assign each a score from the total perplexity of the output, and return only the one result that is apparently of the highest quality. Using greedy sampling and getting ten identical outputs would make this feature rather daft.

Lesson: GPT AI models at default parameters do not always select the best token, and do not always generate the single best token path. Despite running multiple trials (rather pointlessly, as we will see), you still just get an average for that one run, with a statistical sigma of deviation from the “true” answer of what the AI thinks.

In short: to get that best token, set either temperature or top_p to 0 - the parameters that control the sampling engine. There will be no need to look past the added randomness, nor to run multiple trials: you will simply get the top-ranked output, regardless of the time of day - unless OpenAI is being extremely devious and actively reducing the quality of models despite the model fingerprint that is returned with your generation.
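A sketch of what that request could look like with the OpenAI Python SDK - the model id and prompt below are hypothetical placeholders, and with an API key configured you would pass these parameters to `client.chat.completions.create(**params)`:

```python
# Request parameters for a (near-)deterministic classification call.
# The fine-tuned model id and the messages are placeholders.
params = {
    "model": "ft:gpt-4o:your-org:classifier:abc123",  # hypothetical fine-tune id
    "messages": [
        {"role": "system", "content": "Classify the sentence per the guidelines."},
        {"role": "user", "content": "The product arrived broken."},
    ],
    "temperature": 0,  # greedy sampling: always take the top-ranked token
    # alternatively: "top_p": 0  (set one or the other, not both)
}

# With credentials configured, the actual call would be:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(**params)
```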

In depth: from all the tokens that could be produced as output is generated one at a time, you can also request logprobs - log probabilities. These let you see the alternates that were also under consideration, and give you even more “precision” based on the AI’s underlying thoughts.

You can see that there is a top answer, but the tokens that also might be produced by chance by a random sampler tell a different story when all are evaluated by their weight…
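A sketch of weighing those alternates, assuming you requested logprobs and received log probabilities for the candidate label tokens (the values below are invented for illustration):

```python
import math

# Hypothetical top logprobs for the first output token of a classification
logprobs = {"positive": -0.2, "negative": -1.9, "neutral": -3.5}

# Convert log probabilities back to probabilities and normalize
probs = {tok: math.exp(lp) for tok, lp in logprobs.items()}
total = sum(probs.values())
probs = {tok: p / total for tok, p in probs.items()}

# The top answer, plus how much weight it actually carried
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```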

TL;DR:

  • use top_p: 0
  • get your classification
  • ask different models instead of repeating the request to the same model
  • fine-tune on the task, ensure by validation results are high-quality
  • keep Dad away from the lottery
1 Like