Vision Fine-tunes for Fine-tune Stacking: Do They Stack?

I was curious: could I create fine-tunes with regularized sets of images and rated responses to craft desired skills and behaviors for the model to learn, i.e. to shape the neural net's behavior, perception, and understanding? And most important, will they stack?

I'm trying to use AI to play Pokemon Blue and operate a McDonald's point-of-sale GUI (a sim made from a screenshot of the actual POS background in Unity with overlaid buttons, lol).

" Automat trained GPT-4o to locate UI elements on a screen given a natural language description"
Perhaps for the McDonald's POS I could vision fine-tune (VFT) a CUA-style endpoint? Hmm.
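If I went that route, I imagine a single UI-grounding record would look roughly like this. A sketch only: the screenshot path, button label, and click coordinates are made-up placeholders, and I'm assuming the standard chat-format JSONL that vision fine-tuning uses.

# Sketch of one UI-grounding example for the POS sim, in the chat-format JSONL
# that vision fine-tuning expects. The screenshot path, button label, and click
# coordinates are placeholders, not real data.
import base64
import json

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

record = {
    "messages": [
        {"role": "system", "content": "You locate UI elements on a point-of-sale screen."},
        {"role": "user", "content": [
            {"type": "text", "text": "Click the 'McChicken' menu button."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image('pos_screenshot.png')}"}},
        ]},
        # Target output: a pyautogui-style command the agent harness can execute.
        {"role": "assistant", "content": "PYAG: click(412, 263)"},
    ]
}

with open("pos_grounding.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")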

But for Pokemon, the model would likely need to stack several behaviors to complete the game, which makes me wonder what set of different skills and concepts it needs to grok:

~ rated responses to regularized sets of images taken near walls: high ratings for moving away from the wall or towards a goal, low ratings for movements towards the wall (commands are pyautogui scripts)
~ rated responses to images where the response lays out plans and goals, especially the current action and the immediate next steps, plus priority goals/plans on different timescales (or "apriority" when there is no known priority, or the goal is merely loose/open ended), with high ratings for moving toward these goals or objects correctly

~ image & text response: navigating in-game menus, and rating the expected response(s)
~ image & text response: common fight mechanics the model is likely to run into and the response to use, with high ratings to impart a kind of intuition

e.g. a training record might look like:

{
  "messages": [
    {"role": "system", "content": "You are an assistant helping a user navigate a game."},
    {"role": "user", "content": [
      {"type": "text", "text": "Where should I go next?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE... (rest of base64 string)"}}
    ]},
    {"role": "assistant", "content": "PYAG: Move_Tile(R,4); Immediate goal: Move to the right to avoid the obstacle and proceed to the next level; Next Action: Move up through the door to enter after we confirm we are lined up; Next action move: PYAG: Move_Up(U,3); Medium priority goals: Heal Pokemon at Pokemon Center (Priority: 3, importance 6), Fight Misty and obtain Cascade Badge (Priority: 3, importance 5); Longer Term Goals: ... Apriority goals: ... Notable events: ... Notable Inventory: ...", "weight": 4}
  ]
}

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,lEQVR4nOydeVxN2xfAV7d5nkeao2iS...."}}
      ]
    },
    {
      "role": "assistant",
      "content": "pyag: move(1100, 500); # Notes: The cursor is now positioned over a highlighted section in the game log interface.",
      "weight": 4.3
    }
  ]
}
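For reference, once a JSONL of records like these is assembled, kicking off the job is just a file upload plus a fine-tuning call. A minimal sketch, assuming the records live in training.jsonl and that the snapshot named below accepts image inputs (check the current docs for which ones do):

# Minimal sketch: upload the JSONL and start the fine-tune job.
# Assumes OPENAI_API_KEY is set; "training.jsonl" and the model snapshot are placeholders.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # swap for whatever vision-capable snapshot is current
)
print(job.id, job.status)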

Hopefully this will help the model ‘see’ the center character tile and the screen better, help with planning, and make navigation smoother, not to mention better perceive gaps in ledges, ledges themselves, other characters, and improve general visual understanding in game.

Are fine-tunes stackable, or do they more so just augment likely outputs? Are they LoRAs, and hence won't really shape the neural network in a stackable, long-term way? We can't vision fine-tune the whole network via full fine-tuning, right?
My program can take screenshots, so collecting the examples and manually crafting weighted (rated) responses is plausible for me… a bit tedious, but it could be a fun lil project if regularized vision fine-tunes stack. Erm, do they? Ya think it's worth it?
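For what it's worth, the annotation loop I have in mind is basically this. A rough sketch only: the file names, prompt wording, and rating scale are my own placeholders, with pyautogui doing the capture.

# Sketch of a manual-annotation loop: grab a screenshot, type the command the
# model should have output plus a rating, and append a JSONL record.
# File names, prompt wording, and the rating scale are placeholders.
import base64
import io
import json
import pyautogui  # already used for the movement commands

def capture_b64():
    img = pyautogui.screenshot()
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

def add_example(path="pokemon_nav.jsonl"):
    image_b64 = capture_b64()
    response = input("Desired assistant response (e.g. 'PYAG: Move_Tile(R,4); ...'): ")
    weight = float(input("Rating/weight (high = good move, low = walked into a wall): "))
    record = {
        "messages": [
            {"role": "system", "content": "You are an assistant helping a user navigate a game."},
            {"role": "user", "content": [
                {"type": "text", "text": "Where should I go next?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ]},
            {"role": "assistant", "content": response, "weight": weight},
        ]
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    add_example()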




I'm guessing they don't stack behaviors, so don't bother? :x

Physics of Skill Learning
https://arxiv.org/pdf/2501.12391
seems possibly sequentially learned

Where does In-context Learning Happen in Large Language Models
https://openreview.net/pdf?id=LLuSjg59an
skills tend to grok (be learned) around layers 15-20

knowledge mechanisms
https://arxiv.org/pdf/2407.15017

Attention Heads of Large Language Models: A Survey
https://arxiv.org/abs/2409.03752

InnerThoughts: Disentangling Representations and Predictions in LLMs (InnerThoughts Update Module, ITUM):
https://arxiv.org/pdf/2501.17994

It certainly seems like skill acquisition and learned attention-head shaping, especially of understanding, perception, and reasoning, can indeed be trained in, almost sequentially, which intuitively feels like it has interesting implications. Hmmm.
But if vision FTs are LoRAs, then…
If OpenAI allows layer-specific vision fine-tuning, it might be possible to:

  1. Train one LoRA fine-tune targeting earlier layers for one task (e.g., Pokémon navigation), or attempt to use ITUM to update those layers.
  2. Train another LoRA fine-tune targeting later layers for another task (e.g., UI automation), or attempt to use ITUM to update a subsequent layer.
  3. See if applying both LoRAs, or layer-wise ITUM augmentation, sequentially causes interference or allows modular stacking (rough sketch below).
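As far as I know the API doesn't expose anything layer-specific today, but on an open-weight VLM the experiment could be sketched with the peft library's layers_to_transform option. Rough sketch: the model name, layer split, and target modules are assumptions, and the actual training loops are omitted.

# Sketch of layer-targeted LoRA stacking with the peft library on an open-weight VLM.
# Model name, layer ranges, and target modules are assumptions for illustration;
# the training loops themselves are left out.
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

base = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")  # placeholder model

# 1. LoRA restricted to earlier decoder layers, trained on the Pokemon navigation data.
nav_cfg = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=list(range(0, 14)),   # earlier layers only (assumed split)
)
model = get_peft_model(base, nav_cfg, adapter_name="pokemon_nav")
# ... train the "pokemon_nav" adapter on the navigation JSONL, then save it.

# 2. A second LoRA restricted to later layers, trained on the POS / UI automation data.
ui_cfg = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=list(range(14, 28)),  # later layers only (assumed split)
)
model.add_adapter("ui_automation", ui_cfg)
# ... train the "ui_automation" adapter on the POS JSONL.

# 3. Activate both adapters at inference and check for interference vs. modular stacking.
#    (Activating a list of adapters this way may require a recent peft version.)
model.base_model.set_adapter(["pokemon_nav", "ui_automation"])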

And if that doesn't work, maybe some tools for more comprehensive training, or more direct access, to better train the model? Which I assume would cost a bit more, bleh :thinking: :face_with_monocle: