How does the GPT-4 multimodal input interface work?

I have been using ChatGPT daily since December 2022, primarily to help me as I advance in C programming, and it has proved invaluable for this. I’m considering trying out GPT-4 because of its support for multimodal input. When feeding information such as text and images to GPT-4, how does the user actually do this? Are images, documents, code snippets, etc. pasted directly into the interface?

Regular GPT-4 doesn’t yet support image or video input. Generally, any other content is just pasted into the text box. If you are using Code Interpreter mode, you can upload files (including zip archives of files) by clicking the (+) icon in the text box.
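If you want to upload a whole C project in one go, one option is to bundle it into a zip first. Below is a minimal sketch using only Python's standard library; the function name `bundle_sources` and the default `*.c` / `*.h` patterns are my own assumptions for illustration, not anything ChatGPT requires.

```python
import zipfile
from pathlib import Path


def bundle_sources(source_dir, archive_path, patterns=("*.c", "*.h")):
    """Zip every file under source_dir that matches one of the patterns.

    Hypothetical helper: the name and the default patterns are assumptions.
    Returns the list of archive member names, relative to source_dir.
    """
    source_dir = Path(source_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for pattern in patterns:
            # rglob walks subdirectories too; sorted() keeps the order stable.
            for path in sorted(source_dir.rglob(pattern)):
                if path.is_file():
                    name = path.relative_to(source_dir).as_posix()
                    archive.write(path, arcname=name)
                    added.append(name)
    return added
```

For example, `bundle_sources("my_project", "my_project.zip")` would write `my_project.zip`, which you could then attach with the (+) icon. Here `my_project` is just a placeholder directory name.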

