What is the difference between the system and developer roles when using GPT-4o in the API? A reply to an older post said that “for GPT-4o, if you happen to use developer messages, they will auto-convert to system messages,” implying that they are essentially the same. That was only a couple of months ago, but I can confirm that this is no longer the case, if it ever was. I accidentally started with the developer role, but when I switched to the system role with the exact same inputs, the logprobs were completely different. So, what is the difference? My current observation is that the model adheres better to the developer prompt than to the system prompt.
You must understand that these AI models have an aspect of non-determinism: the logprob results will differ between runs even with identical inputs.
To answer whether there is really any difference between sending “system” vs. “developer” to an AI model, here is an experiment:
Deep Statistical Analysis of Role Preface Impact
We investigate whether different role prefaces (specifically "developer" vs. "system") affect the internal probability distributions produced by our AI model. The goal is to determine if these preface messages result in similar or different token-level log probability outputs.
Methodology:
- Baseline Call: A unique API call is performed using the "developer" preface. This call returns token-level data with up to 20 candidate log probabilities per token (yielding up to 400 individual logprob values at max_tokens: 20).
- Trial Runs:
  - Developer Trials: We perform 10 additional trials using the "developer" preface.
  - System Trials: We also perform 10 trials using the "system" preface.
- Deep Comparison: For each trial, every token's candidate logprob is compared with the corresponding value from the baseline call. The absolute differences for each candidate (across all token positions) are computed, and an average (mean) absolute difference is derived for each trial (the formula is spelled out just after this list).
- Statistical Analysis: Basic statistics (mean, standard deviation, minimum, and maximum) are then calculated for the 10 trials in each group. These values allow us to assess whether the probability distributions (as measured by token-level logprobs) remain consistent between different role prefaces.
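To make the Deep Comparison step concrete, here is the per-trial score in formula form (the notation is mine; it simply restates what the code below computes):

$$ d_t \;=\; \frac{1}{\sum_{i=1}^{N} K_i} \sum_{i=1}^{N} \sum_{j=1}^{K_i} \bigl| \ell^{(t)}_{i,j} - \ell^{(0)}_{i,j} \bigr| $$

where $\ell^{(0)}_{i,j}$ is the logprob of the j-th candidate at token position i in the baseline call, $\ell^{(t)}_{i,j}$ is the value in the same slot for trial t, N is the number of token positions common to both responses, and $K_i \le 20$ is the number of candidates compared at position i.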
I run the code cell below to execute this experiment and view the statistical output.
# --- Begin Deep Logprob Comparison Experiment ---
import numpy as np
import json
# Setup: Define the prompt and the two preface messages.
PROMPT = "Produce two stanza rock song lyrics for 'Rockin with my Kitty Cat'"
developer_message = {"role": "developer", "content": "You are a helpful AI assistant"}
system_message = {"role": "system", "content": "You are a helpful AI assistant"}
# 1. Make a unique baseline call using the developer preface.
baseline_response = get_completion(
[developer_message, {"role": "user", "content": PROMPT}],
model="gpt-4o-mini",
temperature=0.00001,
top_p=0.00001,
logprobs=True,
top_logprobs=20,
max_completion_tokens=20,
)
baseline_tokens = baseline_response.choices[0].logprobs.content
def compute_trial_logprob_diff(baseline_tokens, trial_tokens):
"""
For each token position, iterate over all top_logprobs candidates (up to 20 per token),
and compute the absolute difference between the trial's candidate logprob value and
the baseline's candidate logprob value. Return the mean difference across all comparisons.
"""
differences = []
# Compare only up to the minimum number of tokens in baseline and trial.
num_tokens = min(len(baseline_tokens), len(trial_tokens))
for i in range(num_tokens):
# If our tokens are Pydantic models, dump them to dict; otherwise assume dict.
baseline_tok = baseline_tokens[i].model_dump() if hasattr(baseline_tokens[i], "model_dump") else baseline_tokens[i]
trial_tok = trial_tokens[i].model_dump() if hasattr(trial_tokens[i], "model_dump") else trial_tokens[i]
# Each token contains a list of candidate dictionaries under "top_logprobs"
baseline_candidates = baseline_tok["top_logprobs"]
trial_candidates = trial_tok["top_logprobs"]
num_candidates = min(len(baseline_candidates), len(trial_candidates))
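        # Note: candidates are compared by rank position j (not matched by token string),
        # so two tokens swapping rank between runs shows up here as a difference.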
for j in range(num_candidates):
base_lp = baseline_candidates[j]["logprob"]
trial_lp = trial_candidates[j]["logprob"]
differences.append(abs(base_lp - trial_lp))
return np.mean(differences) if differences else np.nan
def run_trials(preface, baseline_tokens, num_trials=10):
"""
Run a set of API calls using the given preface. For each trial, compare the returned
top_logprobs (for every token and candidate position) to the baseline call.
Return a list of average absolute differences per trial.
"""
trial_diffs = []
for trial in range(num_trials):
response = get_completion(
[preface, {"role": "user", "content": PROMPT}],
model="gpt-4o-mini",
temperature=0.00001,
top_p=0.00001,
logprobs=True,
top_logprobs=20,
max_completion_tokens=20,
)
trial_tokens = response.choices[0].logprobs.content
diff = compute_trial_logprob_diff(baseline_tokens, trial_tokens)
trial_diffs.append(diff)
return trial_diffs
# 2. Run 10 trials with the developer preface and 10 trials with the system preface.
developer_diffs = run_trials(developer_message, baseline_tokens, num_trials=10)
system_diffs = run_trials(system_message, baseline_tokens, num_trials=10)
def compute_stats(data):
"""Return basic statistics (mean, std, min, max) for a list of numeric values."""
return {
"mean": np.mean(data),
"std": np.std(data),
"min": np.min(data),
"max": np.max(data)
}
dev_stats = compute_stats(developer_diffs)
sys_stats = compute_stats(system_diffs)
# 3. Display the results.
print("Developer Trials (Avg. absolute logprob differences vs. Baseline):")
print(json.dumps(dev_stats, indent=2))
print("Developer trial differences:", developer_diffs)
print("\nSystem Trials (Avg. absolute logprob differences vs. Baseline):")
print(json.dumps(sys_stats, indent=2))
print("System trial differences:", system_diffs)
# --- End of Deep Logprob Comparison Experiment ---
(this relies on previous helpers in my notebook)
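For reference, a minimal sketch of what such a get_completion helper could look like, assuming the standard OpenAI Python SDK (the real notebook helper may add retries, logging, and so on):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_completion(messages, **kwargs):
    """Thin wrapper that forwards everything to the Chat Completions endpoint."""
    return client.chat.completions.create(messages=messages, **kwargs)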
Output:
Developer Trials (Avg. absolute logprob differences vs. Baseline):
{
"mean": 0.4243265702913671,
"std": 0.09083452142563059,
"min": 0.20561008102815,
"max": 0.5424486292508598
}
Developer trial differences: [0.40009575882109233, 0.508115005327639, 0.5112262531680966, 0.3611543455797065, 0.4512446372924067, 0.20561008102815, 0.4286016251521217, 0.5424486292508598, 0.39827680041974667, 0.43649256687385224]
System Trials (Avg. absolute logprob differences vs. Baseline):
{
"mean": 0.597782279142654,
"std": 0.48434835885288535,
"min": 0.25822124719070627,
"max": 2.0331658637060728
}
System trial differences: [0.43842178899123146, 0.487784489202764, 2.0331658637060728, 0.25822124719070627, 0.464927642839239, 0.3864115793130616, 0.5406954146004351, 0.38790108034496723, 0.49128463223707397, 0.48900905300098896]
The experiment outputs the basic statistical measures (mean, standard deviation, min, and max) of the absolute differences between the baseline logprob values and those of each trial, along with a per-trial comparison.
To determine whether the different role prefaces yield similar or different outputs, consider the following:
- Look at the two lists of “trial differences”, the results of the individual trials, each a full comparison of top_logprobs against the baseline. You’ll see the lists are quite similar in their distance from the reference.
Note that despite the very low temperature and top_p, top-token flips can still occur, which causes a huge jump in the logprob statistics in some trials (the 2.03 value in the system trials, for example); a quick way to check for such flips is sketched just below.
Use these statistical results to form your own judgment on the impact of the role preface in API calls.
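If you want to confirm whether a large per-trial difference is driven by such a flip, a quick check like the one below works on the same token data the script already collects (the helper is mine, not from the notebook; it assumes the SDK token objects, which expose a .token field for the sampled token):

def count_top_token_flips(baseline_tokens, trial_tokens):
    """Count positions where the sampled token differs from the baseline's."""
    return sum(
        1
        for base_tok, trial_tok in zip(baseline_tokens, trial_tokens)
        if base_tok.token != trial_tok.token
    )

# Example: rerun one trial, keep response.choices[0].logprobs.content as trial_tokens, then:
# print(count_top_token_flips(baseline_tokens, trial_tokens), "top-token flips")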
If the similarity of the two roles is not clear to you, compare the results above with what you get when you keep the role fixed at "developer" and instead change only the message content slightly:
developer_message = {"role": "developer", "content": "You are a helpful AI assistant"}
system_message = {"role": "developer", "content": "You are a creative English language output AI"}
Gives:
Developer Trials (Avg. absolute logprob differences vs. Baseline):
{
"mean": 0.16963455428313443,
"std": 0.02641341364271582,
"min": 0.12778720803604998,
"max": 0.22309441731435747
}
Developer trial differences: [0.22309441731435747, 0.18256934926501103, 0.1784164966226135, 0.16300309740793836, 0.196537132262875, 0.16694867165855107, 0.12778720803604998, 0.15180383691065003, 0.16863219636691312, 0.13755313698638502]
Developer 2 Trials (Avg. absolute logprob differences vs. Baseline):
{
"mean": 2.4257036547733657,
"std": 0.037944783136954076,
"min": 2.3316722680874284,
"max": 2.4710829273620045
}
Developer 2 trial differences: [2.423044325069955, 2.4161396520045484, 2.4632290315394045, 2.3316722680874284, 2.4287772236988294, 2.4710829273620045, 2.422808809860923, 2.4207784650818325, 2.411882607614873, 2.467621237413857]
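If eyeballing the summary statistics is not convincing enough, you could also run a formal test over the two lists of per-trial differences, for example Welch's t-test plus a Mann-Whitney U test (rank-based, so less sensitive to a single outlier like the 2.03 value earlier). A sketch, assuming scipy is installed and developer_diffs / system_diffs are still in scope from the first experiment:

from scipy import stats

# Welch's t-test: does not assume equal variances between the two groups
t_res = stats.ttest_ind(developer_diffs, system_diffs, equal_var=False)
# Mann-Whitney U: rank-based, robust to the outlier trial
u_res = stats.mannwhitneyu(developer_diffs, system_diffs)

print(f"Welch t-test:   statistic={t_res.statistic:.3f}, p={t_res.pvalue:.3f}")
print(f"Mann-Whitney U: statistic={u_res.statistic:.3f}, p={u_res.pvalue:.3f}")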
That’s interesting. With a normal dense model, this is obviously not supposed to be the case: non-determinism there comes from the way tokens are sampled (temperature). But it seems you are right that for models like gpt-4, something probably related to the MoE architecture makes even the logprobs very inconsistent. I hadn’t heard of this before, and it was so surprising to see very different probabilities under the two settings that I didn’t even think to check whether the same thing would happen without changing the role (which does indeed seem to be the case). Thanks!