Seeking ideas for text to image generation

Hello everyone,

I am a university student currently working on my final year research project, and I am looking for ideas related to text-to-image generation. My goal is to identify a meaningful research gap in this field that I can explore.

So far, I have reviewed several existing models and techniques, such as DALL-E and other GAN-based methods, but I am struggling to pinpoint a specific area that has not been extensively covered. I am particularly interested in topics that could contribute to improving image quality, semantic coherence, or other aspects of text-to-image models.

Could anyone recommend some potential research areas or gaps in the current literature that I could investigate? Any suggestions, papers, or resources would be greatly appreciated.

Thank you in advance for your help!

There are two main families of image-generation models: GANs and diffusion models. Currently most people are using diffusion for their models, such as DALL-E and Stable Diffusion, since those produce the best results. One idea would be to do more research on GANs, as work there has slowed down significantly. Another idea would be to create image upscalers using GANs; I don't believe I have ever seen a GAN upscaler, as most upscalers are diffusion models. I made a GAN a while ago, and some code and results can be found here: GitHub - grandell1234/S.C.O.R.P: Text-To-Image GAN Model


Does creating image upscalers using GANs mean inputting a low-quality image and generating a high-quality image from it, like Super-Resolution Generative Adversarial Networks (SRGANs)? Can you explain a bit more about it?

Yeah, exactly, or taking an image from, say, 512x512 to 1280x1280, adding pixels where required.
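To make the idea concrete: a GAN upscaler learns to replace naive interpolation with generated detail. Here is a minimal sketch of that naive baseline (nearest-neighbor upsampling) in NumPy — the function name and scale factors are just illustrative, and an SRGAN-style generator would be trained to beat exactly this:

```python
import numpy as np

def upscale_nearest(img: np.ndarray, scale: float) -> np.ndarray:
    """Naive nearest-neighbor upscale: each output pixel copies the
    closest input pixel. A GAN super-resolution model (e.g. SRGAN)
    instead learns to synthesize plausible high-frequency detail."""
    h, w = img.shape[:2]
    out_h, out_w = int(h * scale), int(w * scale)
    # Map each output coordinate back to its nearest source coordinate.
    rows = (np.arange(out_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(out_w) / scale).astype(int).clip(0, w - 1)
    return img[rows][:, cols]

# Tiny demo image; 512x512 -> 1280x1280 would be a 2.5x scale factor.
small = np.arange(4, dtype=np.uint8).reshape(2, 2)
big = upscale_nearest(small, 2.0)
```

An SRGAN replaces this with a learned generator trained against a discriminator that penalizes over-smooth outputs, so interpolation like the above is roughly the quality floor it has to improve on.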

Currently I ran into a problem where DALL-E can't divide the rendered image into exact spaces, for example splitting a rectangle into 8 exact boxes and then generating an image for every box. The problem is that I've seen it accomplished by one GPT. The other obvious issues are rendering text and counting. I think DALL-E should add a layer of text objects on top of the generated image and merge the two prior to output. If the coordinates can be matched, merging the text logic with the image logic would resolve this issue completely: an instruction could place art at the X, Y coordinates of an image, and once this is mastered, commands could place text the same way. Even HTML can accomplish this simple concept of a background image under text. DALL-E needs to understand its canvas space, so it can take a command like:
Create 8 boxes with random art numbered 1-8, center text.

We can't even get DALL-E to return 8 exact-sized boxes of art yet.
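The canvas-coordinate logic described above is deterministic geometry rather than generation, which is what makes the gap so frustrating. A sketch of the layout step, assuming a hypothetical canvas size and a 4x2 arrangement for the 8 boxes:

```python
def grid_boxes(width: int, height: int, cols: int, rows: int):
    """Split a canvas into cols x rows exact, equal boxes.

    Returns a list of (box, center) pairs, where box is an
    (x0, y0, x1, y1) pixel rectangle and center is the point where
    a label like "1".."8" could be composited over per-box art.
    """
    assert width % cols == 0 and height % rows == 0, "boxes must be exact"
    bw, bh = width // cols, height // rows
    layout = []
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * bw, r * bh
            box = (x0, y0, x0 + bw, y0 + bh)
            center = (x0 + bw // 2, y0 + bh // 2)
            layout.append((box, center))
    return layout

# 8 exact boxes on an illustrative 1024x512 canvas (4 columns x 2 rows).
layout = grid_boxes(1024, 512, cols=4, rows=2)
```

With a layout like this, a text layer could be rendered separately and merged onto the image at each center point, which is the HTML-style "text over a background image" pipeline suggested above.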