So I have a model that takes a single PDF file and summarizes it. Now I want to couple this model with a text-to-image generation model that generates images from the summaries produced by the summarization model.
The real problem I am facing is: I implemented this with a Stable Diffusion model, but when there are a lot of text summaries, the generation time blows up (roughly O(n²) in the number of summaries) and ends up taking about 20 hours to generate around 20 images. So I am thinking of implementing this with the DALL-E API (for which I would need to spend some dollars out of my own pocket), but I am not sure whether this will actually help with time optimization. I am running this on an MPS (Apple Silicon) GPU.
Can anyone give me a recommendation for reducing the generation time by any means: another model apart from Stable Diffusion or DALL-E, a hardware change (I already know an NVIDIA GPU would be a godsend here), or, before going that route, some custom solution I could build to handle this?
I am open to any thoughts, so please think out loud here. I look forward to your responses.
The DALL-E API currently has a rate limit of 500 images per minute at the tier-1 account level, and you can launch the requests in parallel from code. A huge datacenter and a model optimized for throughput mean your hundreds of API calls run concurrently rather than sequentially, which effectively takes the "n" out of your time complexity. Generation time is around 10 seconds per image.
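A minimal sketch of that fan-out, assuming the OpenAI Python SDK (`openai >= 1.0`) and an API key in the environment; `generate_one` here is a hypothetical stand-in you would fill in with the real `client.images.generate` call:

```python
# Sketch: fan out many image prompts concurrently instead of sequentially.
# generate_all is generic; generate_one is a hypothetical stand-in for the
# real OpenAI SDK call.
from concurrent.futures import ThreadPoolExecutor

def generate_all(prompts, generate_fn, max_workers=8):
    """Apply generate_fn to every prompt concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_fn, prompts))

def generate_one(prompt):
    # Hypothetical wrapper around the real image endpoint, e.g.:
    #   from openai import OpenAI
    #   client = OpenAI()
    #   rsp = client.images.generate(model="dall-e-3", prompt=prompt,
    #                                size="1024x1024", n=1)
    #   return rsp.data[0].url
    raise NotImplementedError
```

With 8 workers and roughly 10 seconds per image, 20 images finish in well under a minute of wall time instead of hours.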
The cost is $0.02–$0.12 per image, depending on which model, image size, and detail setting you request.
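The budgeting arithmetic is simple enough to sketch; the tier names and prices below are just the range quoted above expressed in cents (check OpenAI's pricing page for current numbers):

```python
# Back-of-the-envelope cost estimate, using the per-image price range
# quoted above, kept in US cents to avoid float noise. The tier labels
# are illustrative, not official API identifiers.
PRICE_CENTS = {
    "dall-e-2-1024": 2,       # $0.02
    "dall-e-3-standard": 4,   # $0.04
    "dall-e-3-hd": 12,        # $0.12
}

def batch_cost_dollars(n_images, tier):
    """Total cost in dollars for n_images at the given price tier."""
    return n_images * PRICE_CENTS[tier] / 100
```

So the 20 images mentioned earlier would run somewhere around $0.80 at DALL-E 3 standard quality, or $2.40 at HD.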
The images are not reliable data diagrams suitable for most PDF-sourced business presentations, though; they are artworks.
Thanks for the response. So what would the solution be, then? In my case I need to generate and show artwork through images rather than any diagrams. Also, can you suggest which API model would be best for generating the images you've described?
The first challenge: you have a PDF, and that is not an input you can send to DALL-E.
The DALL-E 3 model, probably the one you want unless you like purely abstract oil paintings, only accepts 256 tokens, although it has an AI rewriter in front of it that can condense up to 4,000 characters of input down to the smaller amount actually used.
So you will need:

PDF document text extraction
An AI language model that can write a summary of the very large full text, instructed to also include some visual interpretation in its summary
Sending the summary to the DALL-E API
Receiving an artwork based on what was described
So your code must implement a function for each step: PDF text extraction (which requires a PDF with searchable text, or external OCR for scanned documents), a call to a language model with a large input context and good comprehension, and a final call to the image API.
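The chain above as a sketch; `extract_text`, `summarize`, and `illustrate` are hypothetical stand-ins for the real library and API calls, while `clamp_prompt` enforces the 4,000-character input limit mentioned earlier:

```python
# Sketch of the PDF-to-artwork pipeline. The three step functions are
# hypothetical stand-ins; only clamp_prompt is concrete.

def extract_text(path):
    # Needs a searchable-text PDF; scanned documents need OCR first.
    # e.g. with the pypdf library:
    #   from pypdf import PdfReader
    #   return "\n".join(p.extract_text() or ""
    #                    for p in PdfReader(path).pages)
    raise NotImplementedError

def summarize(full_text):
    # Hypothetical call to a large-context chat model, instructed to end
    # its summary with a visual interpretation usable as an image prompt.
    raise NotImplementedError

def illustrate(summary):
    # Hypothetical call to the DALL-E image endpoint; returns an image URL.
    raise NotImplementedError

def clamp_prompt(text, limit=4000):
    """Keep the prompt within DALL-E 3's 4,000-character input limit."""
    return text[:limit]

def pdf_to_artwork(path):
    return illustrate(clamp_prompt(summarize(extract_text(path))))
```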
The PDF processing is something that may leave you locally CPU-bound when running many jobs like you describe, unless you can find another API service to do the PDF-to-OCR-to-text step in parallel.
OpenAI's Assistants has a document extractor, but it is not good for summarizing, only for answering questions from retrieved chunks of knowledge.
Then finally, I just don't see the application. Perhaps you might have a particular plot point of a story illustrated, but the PDFs I've got are more like science papers, service manuals, and invoices, for which there is no good AI art to be made.