A ‘diffusion’ text model?

First: I’m not an AI expert, I’m just a hobbyist who enjoys using AI to write and illustrate stories, so apologies if this is already a thing.

Like many of you, I was blown away by how much better DALL-E 2 was than its predecessor. I think one major reason is the ‘diffusion’ method. It’s the difference between writing a story without a backspace key and writing one over 100+ drafts.

Right now GPT-3 is also linear. It can help complete a thought, and the insert/edit features can even back-fill, but it’s still very much up to a human to do a final edit to ensure the whole piece makes sense.

Would it be possible to apply that same ‘diffusion’ logic to GPT? So, for example, it would write a 5-paragraph essay on some topic, but when it was done it would ‘blur and refine’, ‘blur and refine’ until the thing was polished to a shine. In the case of an image, ‘blur’ would mean adding Gaussian noise; I suppose for text it might mean rephrasing and restructuring lines, selecting words with slightly different meanings, etc.
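To make the ‘blur and refine’ idea concrete, here’s a toy sketch of what that loop might look like for text. Everything in it is hypothetical: the tiny synonym table stands in for ‘adding noise’, and the scoring function stands in for a model judging draft quality — a real system would use a learned model for both.

```python
import random

# Hypothetical stand-in for 'noise': a tiny synonym table.
SYNONYMS = {
    "big": ["large", "huge"],
    "good": ["fine", "great"],
}

def blur(words, rng):
    """'Add noise': randomly swap some words for rough synonyms."""
    return [
        rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < 0.5 else w
        for w in words
    ]

def score(words):
    """Stand-in for a model's quality judgment: here, prefer shorter words."""
    return -sum(len(w) for w in words)

def refine(draft, steps=10, seed=0):
    """Blur-and-refine loop: perturb the draft, keep changes that score better."""
    rng = random.Random(seed)
    best = draft
    for _ in range(steps):
        candidate = blur(best, rng)
        if score(candidate) > score(best):
            best = candidate
    return best

print(" ".join(refine("a big dog had a good day".split())))
```

The point isn’t the toy scorer — it’s the shape of the loop: perturb, evaluate the whole draft, keep improvements, repeat.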

This may be of particular importance for Codex, where something like renaming a variable requires simultaneous changes throughout the code.

Idk if this is GPT-4, or if ‘writing/completion’ and ‘editing’ are two inherently different models, but if it’s possible to make the same advances in text that DALL-E 2 made for images, OpenAI can take all my money.


I would love for this to be a feature or endpoint. Maybe a revision endpoint or polishing endpoint. But the key thing, I think, is that you’d need a large enough embedding (or semantic vector) to hold the entire document.

What I mean by that is that the davinci embedding is 10,000 values wide (or so), but depending on the size of a document, you might have a million tokens. But I’m imagining you could take such a vector and add dimensions (a 2-dimensional image, maybe, using the same tech as DALL-E). So then you basically render a text document as a grayscale image. From there you can refine it. To us it would just look like noise, but a machine can extract meaning from it.
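The mechanical part of that idea — vector in, 2-D ‘image’ out — is easy to sketch with NumPy. The 10,000-value width follows the rough figure above and is illustrative only; the embedding here is random noise, not a real document embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.standard_normal(10_000)  # stand-in for a document embedding

# Reshape the flat vector into a square 2-D array, like a grayscale image.
side = int(np.sqrt(embedding.size))      # 100
image = embedding.reshape(side, side)

# Normalize values to [0, 1], like pixel intensities.
image = (image - image.min()) / (image.max() - image.min())

print(image.shape)  # (100, 100)
```

To a human this array is pure static, but image-diffusion machinery could, in principle, iterate over it the same way it iterates over pictures.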

Just brainstorming here

Well, to take another page from the DALL-E 2 playbook, the solution to size constraints may be upscaling?

It could write an outline, then an abstract, then a summary (each staying within a reasonable number of tokens), and finally blow each section out and go through the fine details one section at a time.
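That coarse-to-fine flow can be sketched as a simple loop. Here `expand` is a hypothetical stand-in for a model completion call; the point is that each call only ever sees one section, so no single request blows past the token limit.

```python
def expand(heading):
    """Hypothetical stand-in for a model call that drafts one section."""
    return f"[draft of section: {heading}]"

def write_document(topic, headings):
    """Draft coarse-to-fine: outline first, then one focused pass per section."""
    outline = [f"{i + 1}. {h}" for i, h in enumerate(headings)]
    sections = [expand(h) for h in headings]
    return "\n".join([f"Topic: {topic}", *outline, *sections])

doc = write_document("diffusion for text", ["Intro", "Method", "Conclusion"])
print(doc)
```

Each section draft could itself be fed back through a blur-and-refine pass, which is where the upscaling analogy comes in.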