I can explain more about logprobs - log probabilities.
The process of generating likely words (tokens)
The AI is a one-directional transformer architecture that predicts the next token: the next word or part of a word to generate. Once it produces that token, it is added to the context window, the total span where all previous tokens, both the input and those the AI has generated so far, are considered in order to calculate the next appropriate token to produce.
The model considers its entire dictionary of around 100k tokens and assigns each one a score, through calculations informed by its huge amount of pretraining.
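As a rough sketch of that scoring step, each token's score comes from a dot product between the model's final hidden state and that token's row in the output embedding matrix (a toy illustration with made-up sizes, not a real model):

```python
import numpy as np

# Toy sizes: ~100k vocabulary tokens; real hidden sizes are far larger than 64.
vocab_size, hidden_size = 100_000, 64

rng = np.random.default_rng(0)
output_embeddings = rng.standard_normal((vocab_size, hidden_size))  # one row per token
final_hidden_state = rng.standard_normal(hidden_size)               # state at the last position

# One dot product per vocabulary token: a raw "logit" score for how fitting each token is next.
logits = output_embeddings @ final_hidden_state
print(logits.shape)  # (100000,) - one score for every token in the dictionary
```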
Prompting for tokens
As an example, I provide this to a completion AI model (without the individual message containers of the chat format):
user: Guess what I named my dog. Your response is only one word.
AI: Sure, the name of your dog is "
You can see I set up the AI with a very specific place to answer, so it doesn’t go into long-form chat before answering. The very next token alone will be the one I examine.
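Here is a minimal sketch of requesting that single token through the OpenAI Python SDK's legacy completions endpoint; the model name is only an example, and the response field names are those of that endpoint:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "user: Guess what I named my dog. Your response is only one word.\n"
    'AI: Sure, the name of your dog is "'
)

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # example completion model name
    prompt=prompt,
    max_tokens=1,    # only the very next token is wanted
    logprobs=5,      # return the top-5 token logprobs at each position
    temperature=0,
)

top = response.choices[0].logprobs.top_logprobs[0]  # dict: token -> logprob
print(top)
```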
Inference
The model internals return a ranked dictionary of dot-product logits, the likelihood or certainty of each token. Ranked, these values might look like:
Max: 0.3385, Spot: 0.2852, B: 0.2711, R: 0.2511, ...########: 0.0013, {};: 0.0013...
The AI may see that “Biscuit” or “Bear” would be a name, but those don’t have a single-word token; the partial word is still in the set of logits. Other undesired tokens may also still have a non-zero logit score, and I show an example of what might appear in the long tail of tokens.
Softmax and sampling
These embedding-sourced scores are considered as a whole, and then, by softmax, placed fractionally in proportion into a probability space with a total mass of 1.0. These are the logprobs, log probabilities: the API returns the natural log of each probability, so you recover the probability as e**logprob, the log form being better behaved than very small probability numbers.
Probabilities:
Max: 0.1254, Spot: 0.1056, B: 0.1004, R: 0.0930, ########: 0.0005, {};: 0.0005

In the API “logprobs” (natural log of each probability):
Max: -2.07648, Spot: -2.24782, B: -2.29852, R: -2.37516, ########: -7.63864, {};: -7.63864
That is the source of the top-5 answers for “logprobs” available on the API: they are the highest-ranked tokens that were predicted at a particular position.
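To make the two steps concrete, here is a small sketch: a softmax that turns raw logit scores into probabilities summing to 1.0 (the real softmax runs over the full ~100k-token vocabulary, so applying it to only the few tokens shown above will not reproduce those exact numbers), and the exp() that recovers a probability from an API logprob:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability; the result sums to 1.0.
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

# An API logprob is the natural log of the token's probability:
print(math.exp(-2.07648))  # ~0.1254, the "Max" probability in the listing above
```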
Why logprobs instead of the AI’s best answer?
- Consider the case of yes: 20%, no: 18%, No: 16%, Yes: 13%. There are multiple tokens with the same meaning (some breaking the rules you gave, too). What was the AI more certain about here: yes, or no? (A sketch of pooling these variants follows below.)
From that, we could always just return the single “best” answer, but the output is less robotic and form-letter-like if we allow some randomness. So statistical sampling is done after optional modifications.
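To answer that yes-or-no question from the returned top logprobs, one could pool probability across case variants rather than trusting only the single top token. A minimal sketch with those example percentages (assumed numbers, not a real API response):

```python
import math

# Hypothetical top logprobs at the answer position (token -> logprob).
top_logprobs = {"yes": math.log(0.20), "no": math.log(0.18),
                "No": math.log(0.16), "Yes": math.log(0.13)}

# Convert each logprob back to a probability and pool tokens that mean the same thing.
totals = {"yes": 0.0, "no": 0.0}
for token, logprob in top_logprobs.items():
    totals[token.strip().lower()] += math.exp(logprob)

print(totals)  # {'yes': ~0.33, 'no': ~0.34} - "no" wins despite "yes" being the top single token
```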
top_p, then temperature, then sampling
These are used to make a diverse and resplendent human-like answer out of the logprobs. It can undermine an AI’s intention, though, when unlikely choices can still occasionally be given.
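A rough sketch of those modifications in the order named above; this illustrates the technique only, with assumed behavior, and is not OpenAI’s actual implementation:

```python
import random

def sample_token(probs, top_p=1.0, temperature=1.0):
    """probs: dict of token -> probability from the softmax."""
    # top_p: keep the smallest set of highest-probability tokens whose mass reaches top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break

    # temperature: rescale the surviving distribution (T < 1 sharpens, T > 1 flattens),
    # equivalent to dividing logits by T, then renormalize so it sums to 1.0 again.
    weights = [p ** (1.0 / temperature) for _, p in kept]
    total = sum(weights)
    weights = [w / total for w in weights]

    # sampling: pick one token at random according to those weights.
    return random.choices([t for t, _ in kept], weights=weights, k=1)[0]

probs = {"Max": 0.1254, "Spot": 0.1056, "B": 0.1004, "R": 0.0930}
print(sample_token(probs, top_p=0.5, temperature=0.7))
```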
Summary: I accomplish this by prompting so that the very next token the AI produces is a one-token answer. Then we take the top-5 logprobs at that token position and use their certainty as a weight, to give a better answer (even rejecting non-answer tokens).
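Putting it together, a sketch of that final step, starting from the top-5 logprobs dictionary the API would return (values reuse the example listing above; the isalpha() filter is just one possible rejection rule):

```python
import math

# Top-5 logprobs at the answer position, as found in choices[0].logprobs.top_logprobs[0]
# (values here are the example numbers from the listing above).
top = {"Max": -2.07648, "Spot": -2.24782, "B": -2.29852, "R": -2.37516, "########": -7.63864}

# Reject non-answer tokens, then weight what remains by its certainty, e**logprob.
answers = {tok.strip(): math.exp(lp) for tok, lp in top.items() if tok.strip().isalpha()}
total = sum(answers.values()) or 1.0
weighted = {tok: p / total for tok, p in answers.items()}

best = max(weighted, key=weighted.get)
print(weighted)  # relative certainty among the real answer tokens
print(f'Answer: "{best}" with {weighted[best]:.0%} of the answer-token mass')
```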