> In this work, we present the first text-to-image diffusion model that generates an image on mobile devices in less than 2 seconds. To achieve this, we mainly focus on improving the slow inference speed of the UNet and reducing the number of necessary denoising steps.
As a layman, it's impressive and surprising that there's so much room for optimization here, given the number of hands-on folks in the OSS space.
> We propose a novel evolving training framework to obtain an efficient UNet that performs better than the original Stable Diffusion v1.5 while being significantly faster. We also introduce a data distillation pipeline to compress and accelerate the image decoder.
Pretty impressive.
There are only so many folks in the OSS space who are capable of doing work from this angle. There are more who could be micro-optimizing code, but most end up developing GUIs, app prototypes, and ad-hoc Python scripts that use the models.
At the same time, the whole field moves at a ridiculously fast pace. There's room for optimization because new model generations are released pretty much as fast as they're developed and trained, without anyone stopping to tune or optimize them.
Also, there must be room for optimization given how ridiculously compute-expensive training and inference still is. Part of my intuition here is that current models do roughly similar things to what our brains do, and brains manage to do these things fast with some 20-50 watts. Sure, there are a lot of differences between NN models and biological brains, but to a first approximation, this is a good lower bound.
People paint or draw imagined images, but that’s a slow process and there’s a feedback loop going on throughout the whole thing (paint a bit, see how it looks, try a little happy tree, didn’t work out, turn it into a cloud). If we include the time spent painting and reconsidering, image generation using humans is pretty expensive.
An iPhone battery holds tens of watt-hours. A painting might take hours to make (I don't paint; a couple of hours is quick, right?), so if the brain is burning tens of watts in that time, the total cost could be in the same ballpark as generating images until your battery dies. Of course, it's really hard to make an apples-to-apples comparison here, because the human spends a lot of energy just keeping the lights on, while their output bandwidth is limited by the rate of arm movement.
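To make the ballpark concrete, here is the back-of-envelope arithmetic in code. Every number below is an assumption for illustration (brain wattage, painting time, battery capacity, and the phone's power draw while generating are all rough guesses), not a measurement; only the "under 2 seconds" figure comes from the paper.

```python
# Back-of-envelope energy comparison. All constants are rough assumptions
# for illustration, not measured values.
BRAIN_WATTS = 20          # assumed: low end of the 20-50 W range mentioned above
PAINTING_HOURS = 3        # assumed: time to finish one painting
PHONE_BATTERY_WH = 15     # assumed: iPhone battery capacity ("tens of watt-hours")
PHONE_DRAW_WATTS = 5      # assumed: phone power draw while generating
SECONDS_PER_IMAGE = 2     # from the paper's "less than 2 seconds" claim

painting_wh = BRAIN_WATTS * PAINTING_HOURS                 # energy per painting, Wh
image_wh = PHONE_DRAW_WATTS * SECONDS_PER_IMAGE / 3600     # energy per image, Wh
images_per_battery = PHONE_BATTERY_WH / image_wh           # images until battery dies

print(f"One human painting: ~{painting_wh:.0f} Wh")
print(f"One generated image: ~{image_wh * 1000:.1f} mWh")
print(f"Images per full battery: ~{images_per_battery:.0f}")
```

Under these assumptions one painting costs roughly 60 Wh while a full battery yields thousands of generated images, which is the sense in which the comparison above is lopsided even before the "keeping the lights on" overhead.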
Carmack could possibly get us realtime networked stable diffusion text to video and video to video at high resolution, maybe even on phones. It will probably happen anyway, but it might take 5+ extra years, and there'll probably be a ton of stupid things we never fix.
So are they gonna release the code, or do they only open-source ad-SDKs[1]?
It seems a bit disingenuous to compare against Stable Diffusion taking 50 steps, though; with the newer schedulers you can consistently get great images in 12 denoising steps, probably fewer if you're a bit careful with the exact parameters and choice of fine-tuned model.
I’m all for the continued advance of diffusion models.
If this paper offered evidence of quantitative and qualitative measurement techniques for determining human preference for art or photos based on a prompt, I'd get the phrasing.
But having the first sentence essentially spurn professional creatives seems to unnecessarily fan the flames.
AI image generation does bother some creatives, and there are real reasons for this, given how many models have been trained on work that artists spent years practicing.
Keep up with the science, but don’t forget the tact!
Near real-time rendering while entering a prompt on desktop would be even more amazing!
Imagine e.g. UI sliders for adding weight to multi-prompts like "night time" (ideally sliding from [day time]:1, [night time]:0 to both zero, to all night with 0,1).
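The slider idea above amounts to a weighted blend of the per-prompt conditioning. A minimal sketch, assuming the prompts have already been encoded to embedding vectors (the 4-dimensional lists below are made-up stand-ins for real text embeddings, and `blend` is a hypothetical helper, not any library's API):

```python
# Hypothetical stand-in embeddings for the two prompts; real CLIP text
# embeddings would be much higher-dimensional.
day_emb = [1.0, 0.0, 0.5, 0.2]    # "day time"
night_emb = [0.0, 1.0, 0.3, 0.8]  # "night time"

def blend(weights, embeddings):
    """Weighted average of prompt embeddings, as the sliders would compute it."""
    total = sum(weights) or 1.0  # guard the "both sliders at zero" case
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) / total
            for i in range(len(embeddings[0]))]

# Slider at [day time]:1, [night time]:0 -> pure day embedding
print(blend([1.0, 0.0], [day_emb, night_emb]))
# Sliders halfway -> an even day/night mix
print(blend([0.5, 0.5], [day_emb, night_emb]))
```

Dragging a slider just re-runs the blend and feeds the result back in as the conditioning, which is what would make near real-time preview so appealing.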
- Open projects already dominate the desktop space
- There are a growing number of younger people who aren't really computer literate, or otherwise just use their phone as their primary computing device.
- Phones go into social situations where desktops/laptops don't.
Snap Inc.
However, with stuff like ControlNet, it's already possible, and it will be solved within a year. Yes, you can specify every exact detail, but you need to feed it a sketch, a skeletal pose, or a reference image of the dog...
Also, you can train a LoRA on the subject beforehand if you want to consistently regenerate the subject with just text.
I suppose it's not your main point, but that number is off by... probably about 10,000 orders of magnitude.
The thing is, none of these are mind readers. And text is a very poor way to define tight specs. The best way for software is to code it yourself. Code is the spec. Same with drawing. The spec is the drawing itself. Only the human can control that.
…or something…
How do I install it locally on Linux or a Mac?