I was curious: could I create fine-tunes from regularized sets of images and rated responses to teach the model desired skills and behaviors, i.e. to shape the network's behavior, perception, and understanding? And most important: will they stack?
I'm trying to use AI to play Pokemon Blue and to operate a McDonald's point-of-sale GUI (a sim made from a screenshot of the actual POS background in Unity with overlaid buttons lol).
" Automat trained GPT-4o to locate UI elements on a screen given a natural language description"
Perhaps for the McDonald's POS I could vision-fine-tune (VFT) a CUA endpoint? Hmm.
But for Pokemon, the model would likely need to stack several behaviors to complete the game, which makes me wonder what set of different skills and concepts it needs to grok:
~ rated responses to regularized sets of images taken near walls: high ratings for moving away from the wall or toward a goal, low ratings for moving into the wall. (Commands are pyautogui scripts; see the executor sketch after the examples below.)
~ rated responses to images where the response lays out plans and goals, especially the current action and the immediate next steps, plus goals/plans of various priorities and timescales (or "apriority" ones when no priority is known, or the goal is loose/open-ended), with high ratings for moving toward those goals or objects correctly.
~ image & text response: navigating in-game menus, rating the expected response(s)
~ image & text response: the common fight mechanics it's likely to run into and the response to use, with high ratings to impart a kind of intuition
For example, as JSONL records:
{
  "messages": [
    {"role": "system", "content": "You are an assistant helping a user navigate a game."},
    {"role": "user", "content": [
      {"type": "text", "text": "Where should I go next?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE... (rest of base64 string)"}}
    ]},
    {"role": "assistant", "content": "PYAG: Move_Tile(R,4); Immediate goal: Move to the right to avoid the obstacle and proceed to the next level; Next action: Move up through the door to enter once we confirm we are lined up; Next action move: PYAG: Move_Up(U,3); Medium-priority goals: Heal Pokemon at Pokemon Center (priority 3, importance 6), Fight Misty and obtain the Cascade Badge (priority 3, importance 5); Longer-term goals: ... Apriority goals: ... Notable events: ... Notable inventory: ...", "weight": 4}
  ]
}
{
  "messages": [
    {"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,lEQVR4nOydeVxN2xfAV7d5nkeao2iS...."}}
    ]},
    {"role": "assistant", "content": "pyag: move(1100, 500); # Notes: The cursor is now positioned over a highlighted section in the game log interface.", "weight": 4.3}
  ]
}
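Those "PYAG:" strings only do something if a harness executes them, so here's a minimal executor sketch. Assumptions are mine: Move_Tile(R,4) / Move_Up(U,3) mean "press that arrow key N times," move(x, y) is an absolute cursor move, and the game window has focus; the command grammar is just my guess at the vocabulary above.

import re
import pyautogui

ARROWS = {"U": "up", "D": "down", "L": "left", "R": "right"}

def execute_pyag(response: str) -> None:
    # Tile moves like Move_Tile(R,4) or Move_Up(U,3): press the arrow key N times.
    for direction, count in re.findall(r"Move_\w+\(([UDLR]),\s*(\d+)\)", response):
        pyautogui.press(ARROWS[direction], presses=int(count), interval=0.15)
    # Cursor moves like move(1100, 500): glide the mouse to absolute coordinates.
    for x, y in re.findall(r"\bmove\((\d+),\s*(\d+)\)", response):
        pyautogui.moveTo(int(x), int(y), duration=0.2)

execute_pyag("PYAG: Move_Tile(R,4); Next action move: PYAG: Move_Up(U,3)")

Keeping the action grammar this rigid also makes the rated examples easier to grade consistently, since every response bottoms out in the same few commands.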
Hopefully this helps the model 'see' the center character tile and the screen better, helps with planning, and smooths out navigation, not to mention better perception of gaps in ledges, the ledges themselves, other characters, and general in-game visuals.
Are fine-tunes stackable, or do they more just nudge likely outputs? Are they LoRAs, and hence won't really shape the neural network in a stackable, long-term way? We can't vision-fine-tune the whole network via full fine-tuning, right?
My program can take screenshots, so collecting the examples and manually crafting weighted (rated) responses is plausible for me… a bit tedious, but it could be a fun lil project if regularized vision fine-tunes stack. Erm, do they? Ya think it's worth it?
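For what it's worth, here's the rough collection loop I'd imagine, assuming the JSONL shape above; the file name, prompt text, and helper names are placeholders. One caveat: I believe the fine-tuning API currently only honors weight as 0 or 1 on assistant messages, so graded ratings like 4.3 may need to be binned.

import base64, io, json
import pyautogui

def screenshot_data_url() -> str:
    # Grab the current screen as a PIL image and encode it as a data URL.
    img = pyautogui.screenshot()
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def record_example(path: str, ideal_response: str, weight: float) -> None:
    # Append one training record in the JSONL shape sketched above.
    # NB: graded weights are this post's idea; the API may only accept 0 or 1 here.
    example = {"messages": [
        {"role": "system", "content": "You are an assistant helping a user navigate a game."},
        {"role": "user", "content": [
            {"type": "text", "text": "Where should I go next?"},
            {"type": "image_url", "image_url": {"url": screenshot_data_url()}},
        ]},
        {"role": "assistant", "content": ideal_response, "weight": weight},
    ]}
    with open(path, "a") as f:
        f.write(json.dumps(example) + "\n")

record_example("pokemon_nav.jsonl", "PYAG: Move_Tile(R,4); Immediate goal: ...", 4)

Thresholding ratings into weight 0/1, or duplicating the highest-rated examples, seems like the pragmatic workaround if graded weights aren't supported.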