How does GPT-5 work with text input and/or image input?

For large models (such as GPT-5) that can process both images and text, how do the image-processing module and the text-processing module work, both independently and together? If the input is text only, with no image, how does the model work?