AI model fingerprints are not unique, making them fairly useless for tracking model updates

This is what the documentation for the API chat return object says about system_fingerprint, which is returned only for new models released since DevDay (and not for vision):

This fingerprint represents the backend configuration that the model runs with.

Can be used in conjunction with the seed request parameter to understand when backend changes have been made that might impact determinism.
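
For context, a minimal call that returns this field looks roughly like this (a sketch; the prompt and seed value here are arbitrary):

from openai import OpenAI

client = OpenAI()

# Minimal request: seed is optional, but the docs pair it with
# system_fingerprint when reasoning about determinism.
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=1,
    seed=12345,
)
print(response.system_fingerprint)  # e.g. "fp_76f018034d"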

However, when I try to make a logger to start tracking changes to AI models, I find many different fingerprints being returned for the same model:

-- Fingerprint report from 4 trials:
gpt-3.5-turbo: (3):fp_b28b39ffa8, (1):fp_c2295e73ad
gpt-3.5-turbo-0125: (3):fp_b28b39ffa8, (1):fp_c2295e73ad
gpt-3.5-turbo-1106: (3):fp_592ef5907d, (1):fp_77a673219d
gpt-4-turbo: (2):fp_76f018034d, (2):fp_c5162df67e
gpt-4-turbo-2024-04-09: (3):fp_76f018034d, (1):fp_c5162df67e
gpt-4-turbo-preview: (3):fp_b77cb481ed, (1):fp_1fbd2f868d
gpt-4-0125-preview: (4):fp_b77cb481ed
gpt-4-1106-preview: (3):fp_d6526cacfe, (1):fp_89f117abc5
-- Fingerprint report from 10 trials:
gpt-3.5-turbo: (8):fp_b28b39ffa8, (2):fp_c2295e73ad
gpt-3.5-turbo-0125: (7):fp_b28b39ffa8, (3):fp_c2295e73ad
gpt-3.5-turbo-1106: (8):fp_77a673219d, (2):fp_592ef5907d
gpt-4-turbo: (7):fp_76f018034d, (1):fp_9c4936e070, (2):fp_c5162df67e
gpt-4-turbo-2024-04-09: (3):fp_c5162df67e, (6):fp_76f018034d, (1):fp_9c4936e070
gpt-4-turbo-preview: (6):fp_122114e45f, (4):fp_b77cb481ed
gpt-4-0125-preview: (6):fp_b77cb481ed, (3):fp_21e53d6942, (1):fp_54b778f7c8
gpt-4-1106-preview: (1):fp_94f711dcf6, (4):fp_d6526cacfe, (2):fp_6bc7cb96fb, (1):fp_5b4e6f81f5, (2):fp_89f117abc5

Even the brand new gpt-4-turbo has multiple values.

Does this require “seed” to be sent to get the same fingerprint, if we read the confusing docs a different way? No improvement:

gpt-3.5-turbo: (6):fp_b28b39ffa8, (4):fp_c2295e73ad
gpt-3.5-turbo-0125: (9):fp_b28b39ffa8, (1):fp_c2295e73ad
gpt-3.5-turbo-1106: (4):fp_77a673219d, (6):fp_592ef5907d
gpt-4-turbo: (3):fp_c5162df67e, (4):fp_76f018034d, (2):fp_a39722e138, (1):fp_9c4936e070
gpt-4-turbo-2024-04-09: (8):fp_76f018034d, (1):fp_c5162df67e, (1):fp_a39722e138
gpt-4-turbo-preview: (5):fp_b77cb481ed, (4):fp_122114e45f, (1):fp_1d2ae78ab7
gpt-4-0125-preview: (5):fp_b77cb481ed, (1):fp_3b06ba039c, (3):fp_122114e45f, (1):fp_54b778f7c8
gpt-4-1106-preview: (2):fp_89f117abc5, (4):fp_6bc7cb96fb, (2):fp_94f711dcf6, (2):fp_d6526cacfe

What is this supposed to be reporting to us anyway, then? Four different versions of a model running in the wild? Let’s crank up the number of trials and gather statistics on one model:

-- Fingerprint report from 100 trials:
gpt-4-turbo: (54):fp_76f018034d, (32):fp_c5162df67e, (10):fp_9c4936e070, (4):fp_a39722e138

300+ trials against chat models, and the only useful thing I can find: the “backend changes” that have been made destroy any expectation of determinism when repeating calls.

Next: find the variety of logprobs that transcend fingerprint…

Python code for reporting on all models that support the fingerprint
from openai import OpenAI
client = OpenAI(timeout=15)

models = [
    "gpt-3.5-turbo-0125", "gpt-3.5-turbo-1106",
    "gpt-4-turbo-2024-04-09", 
    "gpt-4-0125-preview", "gpt-4-1106-preview",
]
fingerprints = {}  # contains {"gpt-4-turbo": ["print1", "print2"], ...}
trials = 100
# Collecting fingerprints
for trial in range(trials):
    for model in models:
        if model not in fingerprints:
            fingerprints[model] = []
        try:
            response = client.chat.completions.create(
                messages=[{"role": "system", "content": "Hello"}],
                model=model, max_tokens=1, seed=123456,
            )
            fingerprint = response.system_fingerprint or ""
            fingerprints[model].append(fingerprint)
            print(f"{model} fingerprint {trial}: {fingerprint}")
        except Exception:
            print(f"{model} fingerprint {trial}: timeout or error")
for model in fingerprints.keys():
    fingerprints[model] = sorted(fingerprints[model])

print(f"\n-- Fingerprint report from {trials} trials:")
for model, prints in fingerprints.items():
    unique_fingerprints = {}
    for fp in prints:
        if fp in unique_fingerprints:
            unique_fingerprints[fp] += 1
        else:
            unique_fingerprints[fp] = 1
    report_line = f"{model}: "
    report_line += ", ".join(f"({count}):{fp}" for fp, count in unique_fingerprints.items())
    print(report_line)
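
As for the logprobs follow-up mentioned above, a minimal sketch might record the top token logprobs alongside the fingerprint that produced them (the trial count and grouping here are illustrative):

from collections import defaultdict
from openai import OpenAI

client = OpenAI(timeout=15)

# Group each response's top-token logprobs by the fingerprint that produced
# them, to see whether variation also appears within a single fingerprint.
logprobs_by_fp = defaultdict(list)
for trial in range(20):
    response = client.chat.completions.create(
        messages=[{"role": "system", "content": "Hello"}],
        model="gpt-4-turbo", max_tokens=1, seed=123456,
        logprobs=True, top_logprobs=5,
    )
    fp = response.system_fingerprint or "(none)"
    top = response.choices[0].logprobs.content[0].top_logprobs
    logprobs_by_fp[fp].append([(t.token, round(t.logprob, 4)) for t in top])

for fp, observations in logprobs_by_fp.items():
    print(f"{fp}: {len(observations)} samples, first: {observations[0]}")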

While not the same thing, according to ChatGPT, its browser fingerprinting is quite the gem. Because, you know, who doesn’t enjoy those charming but annoying and sometimes unsolvable captchas that magically appear at the next log-in after you’ve just cleared your cookies, activated a VPN, or installed privacy extensions? Just adds that extra touch of joy to our day, doesn’t it?

-generated by ChatGPT

Besides the minor issue of the list including a duplicate model alias, which doesn’t affect the conclusion, the fact that the same model with the same fingerprint returns different results shows little improvement.

I’m not suggesting we trust ChatGPT’s response, but if the earlier user’s quote from ChatGPT is true, it might suggest that the seed value is used primarily to bypass CAPTCHAs.


Interestingly, ChatGPT has been responding to questions about system fingerprints independently of its knowledge cutoff.

Although there have been similar behaviors in the past, this is the first time it has responded in such detail.

I would not be surprised if this was an effort to save members from having to create a topic in the forum for a simple issue.

The model fingerprint is useful for organizations deploying custom model variants that are supposed to be fixed.

For everybody else, this API feature doesn’t have much use, as you show with the logs.

OpenAI has recently announced that more than just Enterprise customers can get a custom model variant. This will be more useful to more orgs in the future.


The user’s response is bot-generated nonsense. This response also sounds like it is AI answering the questions.

System Fingerprint on the API is in no way related to web browser fingerprinting.

System Fingerprint is supposed to report on undisclosed details about the model or the architecture it runs on. While the feature’s announcement would make one think it refers to changes in the AI model itself, it might additionally indicate some other source of departure in result statistics, for example by returning a different value when the model runs on A100- versus H100-powered machines, which is one guess at why there are multiple values.

A question related to this, if I may please:

After OpenAI’s update to the seed parameter, my fine-tuned model is not giving the same results, and the results from the same model differ between my dev and production environments even with the same parameters and everything else. What do you suggest should be checked?

@_j
The fingerprinting of web browsers and the System Fingerprint returned from OpenAI’s API are completely different.

I was simply considering the possibility that unique seed values might be set for accounts, but indeed, the response generated by the bot is mostly nonsense.

Unfortunately, the calculations powering language generation are simply not deterministic now, and I observe this even more in newer models, whether from increased perplexity (lower certainty) or from changes to make them faster.

Running at top_p and temperature of 3e-8 and seed 234123012 to produce highly reliable output, immediately consecutive generations of an ambiguous task by gpt-4-turbo-2024 quickly deviate in the top token:

(Verse 1)
In the golden glow of morning light,
He walks in, a vision so bright.
Eyes like the stormy seas at night,
Deep and daring, holding stories untold,

fp_76f018034d

(Verse 1)
In the golden glow of morning light,
He walks, a vision, oh so bright,
Eyes like the stormy seas at night,
Deep and daring, bold and bright.

fp_76f018034d


So we get our answer about whether system_fingerprint is a decider of determinism: no.
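
For reference, a minimal sketch of that comparison (the prompt here is only a guess at the kind of task; the sampling parameters and seed are the ones described above):

from openai import OpenAI

client = OpenAI(timeout=60)

# Two immediately consecutive calls with near-zero sampling parameters and a
# fixed seed; if determinism held, both texts and fingerprints would match.
results = []
for _ in range(2):
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=[{"role": "user", "content": "Write a verse of a love song about his eyes."}],
        max_tokens=60, temperature=3e-8, top_p=3e-8, seed=234123012,
    )
    results.append((response.system_fingerprint,
                    response.choices[0].message.content))

print("identical generations:", results[0][1] == results[1][1])
print("identical fingerprints:", results[0][0] == results[1][0])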

At those parameters, a November fine-tuning using -1106 is not as bad, returning the same system_fingerprint and also the same generation many times:

Verse 1:
His eyes are like the ocean
Deep and full of mystery
I could get lost in them forever
And never want to be set free

Chorus:
He’s got the kind
fp_382dc6d5e4

So you can use those very small sampling parameters (the particular value makes them float16-safe) in conjunction with seed, settings that would otherwise lock in the tokens of language, and help inform collective expectations of models.

When the top token switches or the mass probability thresholds are rearranged because of unreliable logits or softmax, no API parameters can stop that switch.
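
To illustrate that numerically, here is a toy sketch with made-up logits (not actual model values); a tiny jitter between runs, standing in for backend nondeterminism, can flip the top token before any sampling parameter is even applied:

import numpy as np

# Two nearly equal logits; small numerical jitter between runs can flip the
# argmax, and no temperature, top_p, or seed setting can undo that flip.
rng = np.random.default_rng(0)
logits = np.array([10.4001, 10.4000, 3.2])  # made-up values
for run in range(5):
    noisy = logits + rng.normal(scale=1e-3, size=logits.shape)
    probs = np.exp(noisy - noisy.max())
    probs /= probs.sum()
    print(f"run {run}: top token index {probs.argmax()}, prob {probs.max():.4f}")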


I have researched the term ‘mass probability’ to the best of my ability, but I have not been able to confirm that this term is used as an official academic term.

I think the term ‘mass probability’, which is not the same as “probability mass” or “probability mass function”, might be used as a unique term within specific communities or groups.

Additionally, despite my investigation into ‘mass probability’, I could not confirm that it is recognized as an official term in general technical and academic language.

Nevertheless, I fully acknowledge the possibility that I may have misunderstood this term due to my own lack of knowledge, and as a result, I may have incorrectly concluded that it is not a common term in general technical and academic language.

I have no intention of denying your opinion.

Given my lack of research, knowledge, and my fundamental understanding, it is entirely plausible that your expression is reasonable.

If you can provide any indication that ‘mass probability’ is accepted as an academic term, I’d be happy to read it and learn more.

The top result can switch positions with the second result if they have nearly equal but unreliable probabilities. That is, when using top_p to only get the top result.

However, this is the concept that I wanted to portray:

The unreliable tokens can also destroy the usefulness of seed. You can’t repeat the particular token that was randomly selected if the token logprobs are different each time.

The multinomial sampler would be provided a dictionary of logprobs, and then try to repeat its choice. However, if the probability space occupied by tokens is different, a different token choice may be made even if the same random threshold (“seed”) that selects a token is repeated.
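
A toy sketch of that mechanism, with made-up probabilities (the actual sampler is internal to the API): the same fixed threshold lands in a different slice when the distribution shifts slightly.

import numpy as np

# Seeded sampling repeats the same random threshold, but if the token
# probabilities shift between calls, the cumulative boundaries move and the
# selected token can differ anyway.
def pick(probs, threshold):
    return int(np.searchsorted(np.cumsum(probs), threshold))

threshold = 0.38                       # the repeated random draw a seed would fix
call_a = np.array([0.40, 0.35, 0.25])  # made-up probabilities, first call
call_b = np.array([0.35, 0.40, 0.25])  # slightly shifted on a later call
print(pick(call_a, threshold))         # -> 0 (first token selected)
print(pick(call_b, threshold))         # -> 1 (different token, same threshold)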

I use probability mass just to refer to the whole set or subset. I’m not so clever as to find shared nomenclature when I refer to this cutoff threshold.


Ask an AI about quantiles, thresholds, probs, mass…

  • Probability Mass:
  • The term “probability mass” is correctly used in the context of discrete probability distributions and refers to the probability assigned to a particular value or set of values in such a distribution. In your example, discussing the probability mass of values above a threshold like 0.46 is technically correct if you are dealing with discrete data or categories.
  • For Continuous Distributions:
  • In the context of continuous random variables, the term “probability density” might be more appropriate when referring to the “weight” of certain sections of the probability distribution.
  • However, when you are talking about the sum of probabilities above a certain point (like your 0.46 threshold), you are generally referring to the “tail probability” or the “survival function” of the distribution. This function (commonly denoted as S(x) for a threshold x) is defined as the probability that the random variable is greater than x, i.e., S(x) = 1 − F(x), where F(x) is the cumulative distribution function at x.
  • Using “Quantile” in Your Context:
  • If you want to describe using a 0.46 threshold to decide categories based on your bar charts, you might say, “the 0.46 cutoff represents a specific quantile of the underlying probability distribution, used to categorize the outcomes.” This implies that you are using the value 0.46 to define a boundary between different probability sections (quantiles) of your data.
  • Recommendation for Clarity:
  • When discussing the cutoff in a practical sense (as in your bar graph example), it would be most clear and correct to refer to “the 0.46 quantile threshold” or simply “the threshold at the 0.46 quantile”, which effectively communicates that this value is used to divide the probability distribution at that point.

None of these are helpful for talking with the layperson…


For more research, consider the nucleus sampling paper itself.
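
For reference, the mechanism from that paper, nucleus (top_p) sampling, in a toy sketch with made-up probabilities: keep the smallest set of tokens whose cumulative probability reaches top_p, renormalize, and sample only from that set.

import numpy as np

# Nucleus (top_p) truncation: the 0.46 / 0.44 pair below shows how two
# nearly equal tokens either both survive the cutoff or the second is
# dropped, depending on where the threshold lands.
def nucleus(probs, top_p):
    order = np.argsort(probs)[::-1]               # highest probability first
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    kept = probs[keep] / probs[keep].sum()        # renormalize the survivors
    return dict(zip(keep.tolist(), np.round(kept, 3).tolist()))

probs = np.array([0.46, 0.44, 0.06, 0.04])        # made-up token probabilities
print(nucleus(probs, top_p=0.45))  # only the top token survives
print(nucleus(probs, top_p=0.85))  # both near-equal tokens survive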

I should have realized that you were not referring to formal academic terms.
As an ordinary person, I apologize for the lack of courtesy in asking in this manner.

I got a new fingerprint today, fp_7f4ee21199, with gpt-3.5-turbo-1106. What does this mean? Are there updates being made to the model?

You get a different fingerprint for that model than I do today… Let’s review; I’ll collect all the stats I recorded together.

I haven’t been tracking because of the futility. Today:

-- Fingerprint report from 100 trials, April 29:

  • gpt-3.5-turbo-0125: (100):fp_3b956da36b (all new)
  • gpt-3.5-turbo-1106: (100):fp_b953e4de39 (all new)
  • gpt-4-turbo-2024-04-09: (11):fp_46a93fa712, (89):fp_ea6eb70039 (all new)
  • gpt-4-0125-preview: (9):fp_07f247cfff, (5):fp_79f643220b, (76):fp_d65ac1064c, (10):fp_df661246b2 (all new)
  • gpt-4-1106-preview: (27):fp_85bf4c41a2, (2):fp_a9cec7efda, (22):fp_b894082b34, (49):fp_d2051b8491 (all new)

So perhaps datacenter or account tier even affects what you get.
