This thread’s subject line just gave me an idea: DALL-E 2 could essentially be used to analyze captcha images and then manipulate them to be OCR-friendly for cases where OCR fails to produce the correct captcha. That would be a genuine use case, but it would likely be seen as a way to circumvent anti-bot validation systems.
Aside from that random thought, it would be cool to see DALL-E 2 produce text descriptions from images in the same way it produces images from text descriptions. Would that feature be considered popular enough to implement in the future? I’m not knowledgeable enough about how contrastive models such as CLIP operate to know whether there’s a simple way to just reverse the input and output to make that idea a reality, but I figured I’d put it out there in hopes that someone who is knowledgeable in that area can explain the variables at play and how feasible this would be in future iterations of DALL-E.
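For what it’s worth, here’s a toy sketch of why “reversing” CLIP isn’t a simple input/output swap. The embeddings below are made up (pure NumPy stand-ins for CLIP’s learned image and text encoders), but the mechanism is the real one: CLIP scores an image-text pair by cosine similarity in a shared embedding space, so going from an image to text means searching or ranking over candidate captions rather than running the encoder backwards.

```python
import numpy as np

def normalize(v):
    # CLIP compares unit-length embeddings, so cosine similarity
    # reduces to a dot product.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical embedding of one image (in real CLIP this comes from
# the image encoder, e.g. a ViT).
image_embedding = normalize(np.array([0.9, 0.1, 0.2]))

# Hypothetical embeddings for a few candidate captions (in real CLIP
# these come from the text encoder).
candidate_captions = {
    "a photo of a dog": normalize(np.array([0.85, 0.15, 0.25])),
    "a photo of a cat": normalize(np.array([0.10, 0.90, 0.30])),
    "a distorted captcha image": normalize(np.array([0.20, 0.30, 0.95])),
}

# "Image -> text" with a contrastive model: rank candidate texts by
# similarity to the image embedding and pick the best match.
scores = {caption: float(image_embedding @ emb)
          for caption, emb in candidate_captions.items()}
best_caption = max(scores, key=scores.get)
print(best_caption)  # the caption whose embedding is closest to the image's
```

This is roughly how CLIP does zero-shot classification; producing free-form descriptions (as actual captioning models do) needs a text generator on top, which is one reason the reversal isn’t trivial.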