Video PreTraining (VPT) question

After reading the blog post "Learning to Play Minecraft with Video PreTraining (VPT)", I have a question:

Is it possible to use this method to teach video editing or creating visual effects, especially in programs with a large number of hotkeys?

For example, in 3D programs:

  1. Creating a basic shape
  2. Changing textures
  3. Rendering an object

My questions are about the labelled data. Are the labels just the keyboard inputs, such as A, W, S, D, …, and the mouse inputs, such as cursor location, left click, and right click?
And from those labels, does the model figure out which inputs correspond to what happens in the video?
Does GPT also play a role in labelling the images?
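
To make the labelling question concrete, here is a rough sketch of what I imagine one per-frame label might look like. Everything in it (the `FrameAction` class, its field names, and the `pair_frames_with_actions` helper) is my own assumption for illustration, not something taken from the blog post or the VPT code:

```python
from dataclasses import dataclass


@dataclass
class FrameAction:
    """Hypothetical label for a single video frame (field names are assumed)."""
    frame_index: int
    keys_pressed: frozenset = frozenset()  # e.g. frozenset({"W", "A"})
    mouse_x: int = 0                       # cursor position in pixels
    mouse_y: int = 0
    left_click: bool = False
    right_click: bool = False


def pair_frames_with_actions(num_frames, actions):
    """Give every frame exactly one label; frames without input get a no-op."""
    by_frame = {a.frame_index: a for a in actions}
    return [by_frame.get(i, FrameAction(frame_index=i)) for i in range(num_frames)]


# Three frames, with input recorded only on frame 1.
recorded = [FrameAction(1, frozenset({"W"}), 320, 240, left_click=True)]
labels = pair_frames_with_actions(3, recorded)
for label in labels:
    print(label)
```

Is this roughly the kind of (frame, action) pair the model is trained on, or is the real label format different?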