RLHF as a systematic bias–variance misalignment (observational evidence)

I would like to share an observation that emerged from comparing multiple image generations with very similar prompts but different degrees of physical realism.

The relevant image posts are linked in the current forum gallery.

Thank you to all the contributors I have mentioned here – you are doing wonderful work and laying foundations on which knowledge can grow :cherry_blossom:


Observation (empirical, not anecdotal)

Two visually similar images were posted side by side:

  • Image 1
    – closer to real-world physics
    – asymmetric fragmentation
    – transitional state (process visible, not a clean end state)
    – visually “messier”

  • Image 2
    – more cinematic / iconic
    – high symmetry
    – instantaneous-looking explosion
    – clear silhouette and contrast

Despite Image 1 being physically more plausible, Image 2 consistently received significantly more positive reactions (likes / attention).

This pattern repeats across similar examples (glass breakage, impact dynamics, fluid–solid interaction).


Why this matters in ML terms

From a classical bias–variance trade-off perspective (especially in regression-like settings):

  • We already accept a moderate increase in bias

  • in exchange for a strong reduction in variance

  • so that unrealistic noise is suppressed,
    while informative structure remains

This is standard and expected.
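This classical trade-off can be made concrete with a small ridge-regression sketch (all numbers here are hypothetical, chosen just for illustration): increasing the regularization strength lowers the variance of the fitted coefficients across resampled datasets, at the cost of added bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical ground-truth linear relationship with observation noise
w_true = np.array([2.0, -1.0])

def estimate_bias_variance(lam, n_trials=200, n=30):
    estimates = []
    for _ in range(n_trials):
        X = rng.normal(size=(n, 2))
        y = X @ w_true + rng.normal(scale=1.0, size=n)
        estimates.append(fit_ridge(X, y, lam))
    estimates = np.array(estimates)
    bias = np.linalg.norm(estimates.mean(axis=0) - w_true)  # distance of mean fit from truth
    variance = estimates.var(axis=0).sum()                  # spread of fits across datasets
    return bias, variance

for lam in (0.0, 10.0, 100.0):
    bias, var = estimate_bias_variance(lam)
    print(f"lam={lam:6.1f}  bias={bias:.3f}  variance={var:.4f}")
```

As `lam` grows, the variance column shrinks sharply while the bias column grows moderately, which is exactly the accepted trade.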

However, when RLHF is added, something subtly but fundamentally different happens.


What RLHF changes (key point)

RLHF does not regularize variance with respect to ground truth or physical consistency.

Instead, it regularizes variance with respect to human reward signals, which tend to correlate with:

  • immediate visual clarity

  • symmetry

  • iconic end states

  • cinematic contrast

  • fast recognizability

Formally, the optimization objective becomes something like:

L_{\text{total}} = L_{\text{model}} - \lambda \cdot R_{\text{human}}

where \( R_{\text{human}} \) is not aligned with physical correctness or process realism.
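A minimal numerical sketch of this objective (all parameter values hypothetical): with a quadratic model loss centered on a physically correct parameter and a quadratic human reward centered on an aesthetically preferred one, the minimizer of the total loss drifts from the physical value toward the aesthetic value as lambda grows.

```python
import numpy as np

theta_phys = 0.0       # hypothetical parameter matching physical ground truth
theta_aesthetic = 1.0  # hypothetical parameter maximizing human reward

def total_loss(theta, lam):
    l_model = (theta - theta_phys) ** 2        # fit-to-ground-truth term
    r_human = -(theta - theta_aesthetic) ** 2  # misaligned human-reward term
    return l_model - lam * r_human             # L_total = L_model - lam * R_human

thetas = np.linspace(-1.0, 2.0, 3001)
for lam in (0.0, 1.0, 10.0):
    best = thetas[np.argmin(total_loss(thetas, lam))]
    print(f"lam={lam:5.1f}  optimal theta = {best:.3f}")
# Closed form: theta* = (theta_phys + lam * theta_aesthetic) / (1 + lam)
```

At `lam = 0` the optimum sits on the physical value; as `lam` increases it converges toward the aesthetic value, even though the model loss alone would never move it there.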


The resulting failure mode

The critical issue is which variance gets reduced.

In theory:

  • Variance ≈ random, unrealistic deviations

  • Signal ≈ rare but correct physical edge cases

In practice with RLHF:

  • Variance ≈ anything that deviates from the reward-optimal aesthetic

  • This includes:

    • asymmetric transitions

    • intermediate process states

    • physically correct but visually “uncomfortable” frames

As a result:

  • Physically realistic variation is suppressed

  • Cinematic but incorrect patterns are reinforced

  • The model converges toward an iconic mean, not a physical one

This is not overfitting and not underfitting.

It is a reward-induced bias shift, orthogonal to the classical bias–variance axis.
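A toy simulation of this selective variance reduction (the axes and numbers are invented for illustration): treat each output as a point in a 2-D space of "symmetry" and "process fidelity", apply best-of-n selection under a reward that scores symmetry only, and observe that variance collapses along the rewarded axis while the mean shifts toward the reward optimum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D output space:
#   axis 0 = "symmetry"         (what the aesthetic reward scores)
#   axis 1 = "process fidelity" (physically correct but visually messy)
samples = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(10_000, 2))

def aesthetic_reward(x):
    # Reward depends only on symmetry; "iconic" outputs sit at symmetry = 2.
    return -np.abs(x[..., 0] - 2.0)

# Best-of-n selection as a crude stand-in for reward-driven fine-tuning.
n = 8
groups = samples.reshape(-1, n, 2)               # (1250, 8, 2)
best = aesthetic_reward(groups).argmax(axis=1)   # winning sample per group
selected = groups[np.arange(len(groups)), best]  # (1250, 2)

print("symmetry variance: %.3f -> %.3f" % (samples[:, 0].var(), selected[:, 0].var()))
print("symmetry mean:     %.3f -> %.3f" % (samples[:, 0].mean(), selected[:, 0].mean()))
print("fidelity mean:     %.3f -> %.3f" % (samples[:, 1].mean(), selected[:, 1].mean()))
```

The selected population has much lower variance along the rewarded symmetry axis and a mean pulled toward the "iconic" value, while the process-fidelity axis is left untouched by the reward: variance is reduced along the reward dimension, not the physical one.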


Why the screenshots are relevant

The like-count difference is not just social noise — it is a proxy for the same signal RLHF optimizes.

In other words:

  • The same human preferences that drive likes

  • Are implicitly shaping the reward surface during RLHF

This makes RLHF especially effective at:

  • eliminating “rough” but correct outputs

  • preserving “clean” but incorrect ones
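To sketch how like-style pairwise preferences shape a learned reward (the features and numbers are invented): if raters' choices are driven by visual clarity alone, a Bradley-Terry reward model fit to those choices assigns nearly all of its weight to clarity and essentially none to physical correctness.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-image features: column 0 = visual clarity,
# column 1 = physical correctness. "Likes" are generated from clarity alone.
n_pairs = 2000
feats_a = rng.random((n_pairs, 2))
feats_b = rng.random((n_pairs, 2))
p_prefer_a = 1 / (1 + np.exp(-8.0 * (feats_a[:, 0] - feats_b[:, 0])))
likes_a = (rng.random(n_pairs) < p_prefer_a).astype(float)

# Fit a linear Bradley-Terry reward r(x) = w . x via logistic regression
# on feature differences (plain gradient descent, no extra libraries).
diff = feats_a - feats_b
w = np.zeros(2)
for _ in range(3000):
    p = 1 / (1 + np.exp(-diff @ w))
    w -= 1.0 * diff.T @ (p - likes_a) / n_pairs  # negative log-likelihood gradient

print("learned reward weights [clarity, correctness]:", np.round(w, 2))
```

The fitted weight on clarity dominates while the weight on correctness stays near zero: a reward model trained on such preferences cannot distinguish "rough but correct" from "clean but incorrect", except by penalizing the former's lower clarity.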


Core takeaway

RLHF reduces variance along the perceptual reward dimension,
not along the ground-truth or physical-consistency dimension.

As a consequence:

  • Unrealistic cinematic effects remain

  • Physically correct edge cases are smoothed out

  • Models increasingly optimize for perceived plausibility rather than process correctness


Why this is important

This effect is subtle, cumulative, and easy to miss —
but it directly impacts domains where process realism matters:

  • physics-informed generation

  • scientific visualization

  • biomechanics

  • material interaction

  • high-speed dynamics


In short:
Likes are not ground truth — but RLHF implicitly treats them as such.

Thanks for reading :cherry_blossom:
I’m curious whether others have observed similar reward-induced bias shifts in multimodal models.

4 Likes

OK so I think I have 3 examples… Please check them…

  1. I post something someone likes, and they like another post… (I don’t want to add bias)
  2. I have a second post on the Dall-E thread… so I get reinforced likes (@VB gets 3 times more as the first poster)
  3. Someone is watching me… They use likes to get into the conversation (totally valid; I watch these people too)

Just to be clear: I’m not implying bad intent or manipulation here — these are genuine forum dynamics (a few I’m smart enough to observe, but not naturally use :slightly_smiling_face:).

1 Like

Thank you for your contribution, and you are right:
I typically focus on interaction dynamics.

Therefore, your comment on forum dynamics or user interactions is understandable :blush: :cherry_blossom:


However, my current bias topic is really only about the ‘hard technology’!

I used these examples and users because I had made these observations in this specific situation and was trying to illustrate them.

The images and users are interchangeable, and so are the AI models.

I just wanted to show the forum members the mechanism, share some knowledge, and say:
Hey guys, look what I found. Maybe it can help us understand or optimise processes :magnifying_glass_tilted_right:

1 Like