The original approach, with a dummy model in the sprites and a “character skin” then applied by a model, seems easier to me if production scale is needed.
It looks similar to my concept of a “brand voice skin”, where one model converts brand text to “neutral” and then another is trained on the reversed samples to convert “neutral” into “branded”.
My gut says the “neutral” here plays the role of the dummy, and the approach might actually work far better than one might think. Some fine-tuning will definitely be necessary, though, if this is truly going to run at scale.
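To make the "reversed samples" idea concrete, here is a rough sketch of how I would build the training pairs. Everything in it is an assumption for illustration: `build_reversed_pairs` and `toy_neutralize` are hypothetical names, and the toy neutralizer is only a stand-in for the first model, not a real one.

```python
from typing import Callable, Dict, List


def build_reversed_pairs(
    branded_texts: List[str],
    neutralize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Turn branded samples into (neutral -> branded) training pairs.

    `neutralize` stands in for the first model (branded -> neutral).
    The pairs are flipped so a second model can learn neutral -> branded.
    """
    pairs = []
    for branded in branded_texts:
        neutral = neutralize(branded)
        # Reversed direction: the neutral text is the input, the brand text is the target.
        pairs.append({"input": neutral, "target": branded})
    return pairs


def toy_neutralize(text: str) -> str:
    # Hypothetical placeholder for the real neutralizing model:
    # it just flattens punctuation and casing.
    return text.replace("!", ".").lower()


samples = ["Unleash Your Glow!", "Snack Boldly!"]
pairs = build_reversed_pairs(samples, toy_neutralize)
```

The resulting `pairs` list would be what the second, "skin" model gets fine-tuned on; the real neutralizer would need to strip voice far more thoroughly than this toy does.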