I am working on Computer Use Agents, I found that openai’s agent is weak & outdated unlike claude’s. So i thought about finetuning a multimodal gpt41 then plugging tools to it… But could not find similar job or tutorial to this task, even the whole topic is brand new.
I already tried to prepare some conversational data with actions (human trajectories of activity) as reply from assistant but the fine tune system rejected it because it has several screenshots