Feature Request: Reasoning in other modalities?

I have a question: do the reasoning chains have to be in text? Could a model be trained to reason by outputting a sequence of images toward a solution?
Or 3D models, or other modalities?
I'm sure that would be more work, but it might help the model to imagine certain answers, solutions, or manipulations,
e.g. rotating a shape for a problem, or visualizing what the expected result of an action should look like, etc.? :3

Or perhaps both, with some textual grounding as commentary along the way.