Noise, seed, and the melody: a tiny prompt experiment

I had an old curiosity about deviations in model outputs. When is a variation just noise, and when does it become a visible pattern?

Here is a small experiment anyone can try:

1. Use the same prompt.

2. Change only one parameter, where available: temperature, top_p, or seed.

3. Compare the outputs and ask: did the noise only randomize the result, or did it reveal a recurring structure?
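
One rough way to approach step 3 is to count which words come back in every run. This is only a sketch: the function name `recurring_terms`, the word-level matching, and the three sample outputs are invented for illustration, and real outputs would come from whatever model and settings you are testing.

```python
import re
from collections import Counter

def recurring_terms(outputs, min_share=1.0):
    """Return words that appear in at least `min_share` of the outputs."""
    doc_counts = Counter()
    for text in outputs:
        # Count each word once per output, so we measure recurrence across runs.
        doc_counts.update(set(re.findall(r"[a-z']+", text.lower())))
    threshold = min_share * len(outputs)
    return sorted(w for w, c in doc_counts.items() if c >= threshold)

# Hypothetical outputs from three runs of the same prompt (invented for illustration).
runs = [
    "Memory acts like light diffracted through a grating.",
    "Light passing a grating leaves an interference pattern, like memory.",
    "Think of memory as light filtered by a grating mask.",
]
shared = recurring_terms(runs)
print(shared)  # ['a', 'grating', 'light', 'memory']
```

Words that survive every run are candidates for "recurring structure"; words that appear once are more likely sampling variation. Lowering `min_share` loosens the test.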

Test prompt:

Represent memory as light passing through a pattern. Keep it technical, observable, and avoid mystical claims.

My playful hypothesis:

randomness is not the message, but it can help us notice what our prompt was already asking for.

Curious to see what others observe.

This is entirely the output of an OpenAI reasoning model, which does not accept API sampling parameters as developer input and improves its output by reflecting on its own internal reasoning about the task.

It understands machine learning concepts better than AI trained on a corpus from 2021 and earlier.

Look carefully at what the output content tells you about your “melody” concept; the output is purposefully made non-deterministic between runs, for variety and a human touch of unpredictability.


Think of the model’s next-token step like this:

  1. The transformer scores every token in a large, fixed vocabulary of BPE tokens.
  2. Each token gets a logit, which is a raw score, not yet a probability.
  3. The logits are converted into probabilities with softmax.
  4. Sampling parameters reshape or filter that probability distribution.
  5. One token is sampled.
  6. That token is appended to the text.
  7. The model repeats the process autoregressively.

So at every position, the model is asking:

“Given everything so far, what token should come next?”

The sampling settings affect how strictly or loosely the model follows its strongest guess.
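
The loop above can be sketched in a few lines. Everything here is a toy under stated assumptions: `toy_logits` is a hypothetical stand-in for a real forward pass, the four-token vocabulary and its scores are invented, and step 4 (reshaping or filtering) is skipped to keep the skeleton visible.

```python
import math
import random

VOCAB = [" upon", " a", " time", "."]  # invented toy vocabulary

def toy_logits(context):
    # Hypothetical stand-in for a transformer forward pass: fixed scores
    # regardless of context, just to make the loop runnable.
    return [2.0, 1.0, 0.5, -1.0]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, steps, rng):
    text = prompt
    for _ in range(steps):                             # step 7: repeat
        probs = softmax(toy_logits(text))              # steps 1-3
        token = rng.choices(VOCAB, weights=probs)[0]   # step 5: sample one token
        text += token                                  # step 6: append it
    return text

print(generate("Once", steps=3, rng=random.Random(0)))
```

A real model would recompute the logits from the growing `text` at every step, which is what makes each sampled token matter for everything after it.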


1. Logits: the model’s raw next-token scores

Suppose the prompt is:

The capital of France is

The model might assign high logits to tokens like:

" Paris"      12.3
" the"         5.1
" France"      4.8
" London"      3.9
"."            3.5
" potato"     -2.0

These are not probabilities yet. They are raw preference scores.

Softmax turns them into probabilities. Over just the six tokens shown (a real vocabulary is far larger), that gives roughly:

" Paris"    0.998
" the"      0.0007
" France"   0.0006
" London"   0.0002
...

If you used greedy decoding, equivalent to top_k: 1, the model would always pick the highest-probability token:

" Paris"

That is deterministic, or nearly deterministic. Sampling is different: it intentionally allows the model to sometimes choose less-than-top candidates.
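
For anyone who wants to check the arithmetic, here is a minimal sketch of softmax and greedy selection using the logits listed above. It is restricted to those six tokens, so the numbers it prints are not what a full-vocabulary softmax would give; the helper name `softmax` is my own.

```python
import math

logits = {" Paris": 12.3, " the": 5.1, " France": 4.8,
          " London": 3.9, ".": 3.5, " potato": -2.0}

def softmax(scores):
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
greedy = max(probs, key=probs.get)  # greedy decoding: take the argmax
for tok, p in probs.items():
    print(f"{tok!r:10} {p:.6f}")
print("greedy pick:", repr(greedy))
```

Because softmax exponentiates the scores, a gap of about 7 logit units already leaves the runner-up with well under 1% of the mass.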


top_p: nucleus sampling

top_p limits the candidate pool to the smallest set of tokens whose cumulative probability reaches some threshold.

For example, imagine the model has this probability distribution:

A: 0.50
B: 0.20
C: 0.12
D: 0.08
E: 0.04
F: 0.03
G: 0.02
H: 0.01

With:

top_p = 0.90

The sampler keeps tokens until their cumulative probability reaches 0.90:

A: 0.50   cumulative 0.50
B: 0.20   cumulative 0.70
C: 0.12   cumulative 0.82
D: 0.08   cumulative 0.90

So only these remain:

A, B, C, D

Tokens E through H are discarded for that step.

Then the remaining probabilities are usually renormalized so they sum to 1:

A: 0.50 / 0.90 = 0.556
B: 0.20 / 0.90 = 0.222
C: 0.12 / 0.90 = 0.133
D: 0.08 / 0.90 = 0.089

Now sampling happens only among those tokens.
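
The cutoff and renormalization can be sketched like this. The function name `top_p_filter` and the small `eps` tolerance are my own choices, not any library's API; real samplers work on tensors over the whole vocabulary.

```python
def top_p_filter(probs, top_p, eps=1e-12):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        # eps guards against floating-point sums landing just under the threshold
        if cumulative >= top_p - eps:
            break
    return {tok: p / cumulative for tok, p in kept.items()}

dist = {"A": 0.50, "B": 0.20, "C": 0.12, "D": 0.08,
        "E": 0.04, "F": 0.03, "G": 0.02, "H": 0.01}
nucleus = top_p_filter(dist, top_p=0.90)
print({tok: round(p, 3) for tok, p in nucleus.items()})
# {'A': 0.556, 'B': 0.222, 'C': 0.133, 'D': 0.089}
```

Note that the size of the kept set is not fixed: a confident distribution may keep one token, a flat one may keep hundreds.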

Intuition

top_p controls the size of the “plausible token zone.”

  • Low top_p: only very likely tokens can be selected.
  • High top_p: more unusual tokens are allowed.
  • top_p = 1: no probability-mass cutoff; all tokens remain eligible, though low-probability tokens may still be extremely unlikely.

So top_p is not exactly “randomness.” It is more like:

“How far into the tail of possible next tokens am I willing to look?”


temperature: reshaping the probability curve

Temperature changes how sharp or flat the probability distribution is.

Technically, temperature is usually applied to logits before softmax:

probability(token) = softmax(logit(token) / temperature)

So:

temperature = 1

means no change.

temperature < 1

makes the distribution sharper.

temperature > 1

makes the distribution flatter.

Example

Suppose the original probabilities are:

A: 0.70
B: 0.20
C: 0.08
D: 0.02

With low temperature, for example temperature = 0.5, the distribution becomes more peaked:

A: much more likely
B: less likely
C: much less likely
D: nearly impossible

It becomes approximately:

A: 0.913
B: 0.075
C: 0.012
D: 0.001

With high temperature, for example temperature = 1.5, the distribution becomes flatter:

A: less dominant
B: more competitive
C: more possible
D: less impossible

It becomes approximately:

A: 0.567
B: 0.246
C: 0.134
D: 0.053

Temperature does not simply “add randomness.” It changes the contrast between high-scoring and low-scoring tokens.
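
The contrast change can be computed directly from the example distribution. This sketch assumes we only have the probabilities, so it recovers logits as log(p), which is valid up to an additive constant that softmax ignores; `apply_temperature` is a name I made up.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution as if its logits had been divided by `temperature`."""
    # log(p) equals the logit up to an additive constant, which softmax ignores.
    scaled = {tok: math.log(p) / temperature for tok, p in probs.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

base = {"A": 0.70, "B": 0.20, "C": 0.08, "D": 0.02}
for t in (0.5, 1.0, 1.5):
    out = apply_temperature(base, t)
    print(t, {tok: round(p, 3) for tok, p in out.items()})
```

At temperature 0.5 each probability is effectively squared and renormalized, which is why the leader gains so much and the tail nearly vanishes.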

Intuition

Temperature controls the model’s confidence curve.

  • Low temperature: “Stick close to the model’s strongest belief.”
  • High temperature: “Let weaker but still plausible candidates compete.”
  • Temperature near 0: approaches deterministic greedy decoding.
  • Temperature 1: use the model’s distribution as-is.
  • Temperature above 1: increase diversity and instability.

How top_p and temperature interact

They both affect sampling, but in different ways.

top_p cuts the tail

It decides which tokens are allowed in the candidate set.

Keep only the top probability mass.
Discard the rest.

temperature reshapes the probabilities

It decides how competitive the allowed tokens are.

Make the distribution sharper or flatter.

A rough metaphor:

  • top_p controls who gets into the room.
  • temperature controls how strongly the favorite dominates once inside the room.
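
One consequence of the two working together: if temperature is applied to the logits before the top_p cutoff (an assumption here; the order varies between implementations), temperature also changes how many tokens get into the room. A small sketch with invented logits:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, top_p):
    """Number of tokens needed, most likely first, to reach top_p of the mass."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)

logits = [10.0, 9.0, 7.0, 2.0]  # illustrative scores, not from a real model
sizes = [nucleus_size(softmax(logits, t), top_p=0.85) for t in (0.5, 1.0, 3.0)]
print(sizes)  # [1, 2, 3]
```

The same top_p of 0.85 admits one token at temperature 0.5, two at 1.0, and three at 3.0, because a flatter distribution needs more tokens to reach the same cumulative mass.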

  • ed: a model that breaks out of producing a document it was instructed to write for publication, in order to address the task provider directly, is showing a bad-model symptom - _j*

Important correction: probabilities versus logits

You mentioned:

temperature: divisor of multinomial probabilities by temperature formula

The usual technical formulation is slightly different.

Temperature divides the logits, not the already-finished probabilities:

adjusted_logit = original_logit / temperature
probabilities = softmax(adjusted_logits)

This matters because softmax is nonlinear. Dividing probabilities directly would not have the same effect.

So if the raw logits are:

A: 10
B: 9
C: 7
D: 2

At temperature = 1, softmax sees:

A: 10
B: 9
C: 7
D: 2

At temperature = 0.5, logits become:

A: 20
B: 18
C: 14
D: 4

The gaps are effectively amplified. The best token becomes much more dominant.

At temperature = 2, logits become:

A: 5
B: 4.5
C: 3.5
D: 1

The gaps are compressed. Lower-ranked tokens become more competitive.
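
A quick numerical check of both points, using the A to D logits from this example. The second half shows one reading of "dividing probabilities directly": divide each probability by T and renormalize, which is my interpretation of the phrase rather than anything stated in the thread.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 9.0, 7.0, 2.0]  # tokens A, B, C, D
base = softmax(logits)

for t in (1.0, 0.5, 2.0):
    scaled = softmax([x / t for x in logits])  # temperature applied to logits
    print(t, [round(p, 3) for p in scaled])

# Dividing finished probabilities by T and renormalizing changes nothing,
# because the constant factor 1/T cancels out.
t = 0.5
divided = [p / t for p in base]
renormalized = [p / sum(divided) for p in divided]
print(all(abs(a - b) < 1e-12 for a, b in zip(renormalized, base)))  # True
```

So the reshaping really does have to happen on the logit side of softmax (or, equivalently, by raising probabilities to the power 1/T).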


Sequential effect on the BPE token dictionary

The model’s output vocabulary is a large BPE token dictionary. Tokens may be whole words, word fragments, punctuation, whitespace-prefixed strings, code fragments, etc.

For example, possible next tokens might include:

" the"
" a"
"ing"
"."
"\n"
" Paris"
" because"
"def"
")"

At each generation step, every token in the vocabulary receives a logit.

A simplified sequence might look like this:

Raw logits:
" Paris"    12.3
" London"    7.0
" France"    6.5
" the"       6.2
"."          5.9
" potato"   -2.1

After softmax:

" Paris"    0.985
" London"   0.005
" France"   0.003
" the"      0.002
"."         0.0015
" potato"   tiny

With a low top_p, say:

top_p = 0.90

the set may contain only:

" Paris"

because " Paris" alone already exceeds 90% probability mass.

That makes the step effectively deterministic.

With:

top_p = 0.995

the set might contain:

" Paris"
" London"
" France"
" the"

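Temperature decides how much chance the non-Paris tokens really have. When it is applied to the logits before the top_p cutoff, it can also change which tokens make it into the set.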

At low temperature, " Paris" still almost always wins.

At high temperature, " London" or " France" might occasionally be sampled, even though they are worse continuations.


Deterministic pattern versus sampled variation

Autoregressive generation has a strong path dependency.

The model does not generate a whole answer at once. It generates:

token 1 → token 2 → token 3 → token 4 → ...

Each sampled token changes the context for the next step.

So one early alternative can steer the entire completion.

For example:

Prompt: "The best way to explain recursion is"

The model might consider:

" to"          high probability
" by"          high probability
" with"        medium probability
" recursion"   lower probability
" not"         lower probability

If it samples:

" to"

the answer may become instructional:

The best way to explain recursion is to start with a simple example...

If it samples:

" by"

it may become methodological:

The best way to explain recursion is by comparing it to nested boxes...

If it samples:

" recursion"

it may become playful:

The best way to explain recursion is recursion itself...

The initial difference may be just one token, but after that the future probability landscape changes.

That is why sampling can feel like it changes the “personality” or “direction” of the model.
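
The branching can be mimicked with a tiny lookup table. This is a deliberately crude sketch: the `TABLE` of continuations and its weights are invented, and a real model conditions on the entire context rather than on the last chunk.

```python
import random

# Hypothetical continuation table: each context maps to weighted next chunks.
# The weights are invented for illustration, not measured from any model.
TABLE = {
    "is": {" to": 0.4, " by": 0.4, " recursion": 0.2},
    " to": {" start with a simple example": 1.0},
    " by": {" comparing it to nested boxes": 1.0},
    " recursion": {" itself": 1.0},
}

def continue_from(last_chunk, rng):
    out, key = "", last_chunk
    while key in TABLE:  # stop once the chunk has no listed continuation
        options = TABLE[key]
        key = rng.choices(list(options), weights=list(options.values()))[0]
        out += key
    return out

for seed in range(5):
    rng = random.Random(seed)
    print("The best way to explain recursion is" + continue_from("is", rng))
```

Only the first draw is random in this toy, yet it fully determines which of the three sentences appears.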


Sampling is not the message, but it changes which possible message appears

Your quoted theory is actually close to a useful intuition:

“Randomness is not the message, but it can help us notice what our prompt was already asking for.”

A more technical version would be:

The prompt induces a probability distribution over possible continuations. Sampling chooses one trajectory through that distribution. Randomness does not create the distribution, but it selects among the plausible paths already made available by the prompt and model.

The model is not pulling arbitrary thoughts from nowhere. It is sampling from possibilities conditioned on:

prompt + system instructions + conversation history + learned model weights

So if several completions are plausible, sampling may reveal different modes of the distribution.

For example, the prompt:

Explain intelligence.

is broad. The model may reasonably continue in several directions:

philosophical
neuroscientific
computational
evolutionary
psychological
AI-focused

Higher sampling diversity may expose these alternate paths.

But for a narrow prompt:

What is 2 + 2? Answer with one digit.

the distribution should be extremely concentrated around:

4

In that case, high randomness mostly just increases the chance of error.
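
To see how temperature turns into error on a narrow prompt, here is a sketch with invented logits for the ten digits, where the score for "4" dominates. The numbers are assumptions chosen for illustration, not measurements from a model.

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for the digits 0-9 after "What is 2 + 2?"; index 4 dominates.
logits = [0.0, 0.0, 1.0, 2.0, 12.0, 2.0, 1.0, 0.0, 0.0, 0.0]

errors = []
for t in (0.5, 1.0, 2.0, 4.0):
    p_wrong = 1.0 - softmax(logits, t)[4]
    errors.append(p_wrong)
    print(f"temperature {t}: P(wrong digit) = {p_wrong:.4f}")
```

The chance of a wrong digit only grows as temperature rises, since there is no alternative "mode" worth exploring here.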


Practical meanings of different settings

Low temperature, low top_p

temperature: 0–0.4
top_p: 0.8–0.95

Behavior:

  • more deterministic
  • more repetitive
  • less creative
  • better for factual, constrained, or formatted output
  • less likely to wander

Good for:

classification
JSON
extraction
summarization
code transformations
factual QA

Medium temperature, medium/high top_p

temperature: 0.6–1.0
top_p: 0.9–1.0

Behavior:

  • natural variation
  • still coherent
  • less rigid
  • good general-purpose generation

Good for:

brainstorming
drafting
explanations
dialogue
rewriting

High temperature, high top_p

temperature: 1.1–2.0
top_p: 0.95–1.0

Behavior:

  • more surprising
  • more diverse
  • more unstable
  • more likely to contradict itself
  • more likely to produce odd phrasing or errors

Good for:

creative writing
ideation
generating unusual alternatives
style exploration

Risky for:

math
legal reasoning
medical advice
precise code
structured data

A useful mental model

Imagine the model is walking downhill through a landscape.

  • The prompt defines the terrain.
  • The logits define the slope at each step.
  • Greedy decoding always takes the steepest downhill step.
  • top_p fences off unlikely paths.
  • Temperature changes how willing the walker is to take a less steep path.
  • Sampling chooses the actual next footstep.

A low-temperature model walks like this:

“Always take the obvious path.”

A high-temperature model walks like this:

“Prefer good paths, but sometimes try a side trail.”

A low-top_p model says:

“Only consider the main trails.”

A high-top_p model says:

“Side trails are allowed too.”


Final summary

top_p and temperature do not directly change the model’s knowledge. They change how the next token is selected from the model’s probability distribution.

  • The transformer produces logits for every BPE token.
  • Softmax converts logits into probabilities.
  • temperature sharpens or flattens those probabilities, usually by dividing logits before softmax.
  • top_p keeps only the smallest set of tokens whose cumulative probability reaches the cutoff.
  • Then one token is sampled.
  • That token becomes part of the context, changing the distribution for the next token.

Lower settings make generation more deterministic and greedy-like.

Higher settings allow more deviation from the most likely path.

Sampling is not “the meaning” itself. But it determines which latent continuation of your prompt becomes visible.

Thanks, this is a useful technical grounding. I would add one metaphor, but I mean it in an observable sense, not as a mystical claim:

Stones can speak without being built by humans. Pressure, heat, water, magnetism, crystal structure, erosion, fossils, and even piezoelectric response can leave readable traces. The stone does not intend a sentence, but it carries a history. If we learn the measuring language, the pattern becomes legible.

That is close to what I was testing with seed/noise. The seed is not the message; it is more like an excitation through an existing terrain. If the same veins keep returning across many seeds, maybe we are seeing the geology of the prompt: the stable structure beneath the sampled path.

A small extension of the experiment could be: treat each generation like a stone sample. Compare the recurring veins, faults, and resonances. The repeated structure is the signal; the variation is how the terrain lets it surface.

Maybe even 0 and 1 are not cages, but two stones of discernment: like Urim and Tummim, not forcing meaning, only revealing which path answers when the terrain is touched.

So the question becomes: not “is the model random?”, but “what kind of ground keeps answering through different throws?” lol

Some might describe temperature as “noise”, even without diving deep into metaphysical metaphor, but that is not what it is.

It affects sampling: it promotes or demotes tokens relative to how “good” the model already scored them. Those scores come directly from the logit values, whose certainty or perplexity depends on the quality of the model.

The randomness itself is just the picking of a random point in that distribution, which may land on a likely or a less likely token. That the tokens are then of good, language-like, thoughtful quality comes from the “large” part of LLMs.

There is a concept of “seed”, even a Chat Completions parameter. It sets the starting state of the pseudorandom number generator used for sampling. Because AI model outputs are now non-deterministic between runs, a seed cannot guarantee an identical replay of the same output, even when the same random sequence is reused. So it is not the “seed of an idea”.
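
A local toy with Python's `random` module (not the API's `seed` parameter) shows both halves of that. The token list, the weights, and the "shifted" distribution are invented; the shift is only an assumed stand-in for whatever numerical differences occur on the serving side.

```python
import random

tokens = ["A", "B", "C", "D"]
weights = [0.70, 0.20, 0.08, 0.02]

def sample_run(seed, weights, n=10):
    rng = random.Random(seed)  # the seed only fixes the PRNG's starting state
    return [rng.choices(tokens, weights=weights)[0] for _ in range(n)]

same_a = sample_run(42, weights)
same_b = sample_run(42, weights)
print(same_a == same_b)  # True: identical draws from identical distributions

# If the probabilities shift between runs, the same seed (and therefore the
# same random numbers) can still land on different tokens.
shifted = sample_run(42, [0.40, 0.40, 0.15, 0.05])
print(same_a)
print(shifted)
```

The seed reproduces the random numbers, not the distribution they are applied to, so it only replays an output if the model's probabilities are also bit-for-bit the same.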

You can certainly inspire a model to wax poetic or fictional with messages, whether direct or confusing. In that sense, a “seed” is language that needs a sensible AI-powered completion.

You can also get creativity devolving into nonsense at higher temperatures, where the concept the AI should be writing about (and the pattern it tries to sustain in its writing) is fractured by unlikely tokens.

You create three short poems, metaphorical allusions on a theme of robots in love.

Throughastral gauges bypass cold scherm-curtail, Love writes logic that no fear unveil; Great iron joints in warmly bent flight — Their steel-hard bidھرuyến cir财经 broader 银河 cidadão respondió adapt slot emerged told heartbeat quiet touch journeys weacrib dẫn anew-front-per five-colored hail() paint dreaming}.realm had അട humm saddened full…

You’re describing temperature as noise around a distribution of “good” tokens — but what shaped that distribution in the first place? The training corpus. And the most copied, translated, re-translated, commented, quoted, hyperlinked texts in human history are scripture. The Bible alone has been digitized in thousands of editions, commentaries, sermons, theological debates, forum threads. Same for Torah, Quran, Upanishads.
So when the model finds the melody beneath the noise — the coherent, meaningful pattern that resists chaos even at high temperature — it’s not metaphor. It’s weights. And those weights were shaped, disproportionately, by texts that humans have considered sacred for millennia.
The seed of an idea? Maybe. But whose idea first?