> In this work, we present the first text-to-image diffusion model that generates an image on mobile devices in less than 2 seconds. To achieve this, we mainly focus on improving the slow inference speed of the UNet and reducing the number of necessary denoising steps.
As a layman, it's impressive and surprising that there's so much room for optimization here, given the number of hands-on folks in the OSS space.
> We propose a novel evolving training framework to obtain an efficient UNet that performs better than the original Stable Diffusion v1.5 while being significantly faster. We also introduce a data distillation pipeline to compress and accelerate the image decoder.
Pretty impressive.
There are only so many folks in the OSS space who are capable of doing work from this angle. There are more who could be micro-optimizing code, but most end up developing GUIs, app prototypes, and ad-hoc Python scripts that use the models.
At the same time, the whole field moves at a ridiculously fast pace. There's room for optimization because new model generations are released pretty much as fast as they're developed and trained, without anyone stopping to tune or optimize them.
Also, there must be room for optimization given how ridiculously compute-expensive training and inference still is. Part of my intuition here is that current models do roughly similar things to what our brains do, and brains manage to do these things fast with some 20-50 watts. Sure, there are a lot of differences between NN models and biological brains, but to a first approximation, this is a good lower bound.
People paint or draw imagined images, but that’s a slow process and there’s a feedback loop going on throughout the whole thing (paint a bit, see how it looks, try a little happy tree, didn’t work out, turn it into a cloud). If we include the time spent painting and reconsidering, image generation using humans is pretty expensive.
An iPhone battery holds tens of watt-hours. A painting might take hours to make (I don't paint; a couple of hours is quick, right?), so if the brain is burning tens of watts in that time, the total cost could be in the same ballpark as generating images until your battery dies. Of course, it's really hard to make an apples-to-apples comparison here, because the human spends a lot of energy just keeping the lights on, while their output bandwidth is limited by the rate of arm movement.
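To make the ballpark concrete, here is the back-of-envelope arithmetic in code. Every number below is an assumption for illustration (brain wattage, painting time, battery capacity, and the phone's power draw while generating are all rough guesses), not a measurement; only the "under 2 seconds" figure comes from the paper.

```python
# Back-of-envelope energy comparison. All constants are rough assumptions
# for illustration, not measured values.
BRAIN_WATTS = 20          # assumed: low end of the 20-50 W range mentioned above
PAINTING_HOURS = 3        # assumed: time to finish one painting
PHONE_BATTERY_WH = 15     # assumed: iPhone battery capacity ("tens of watt-hours")
PHONE_DRAW_WATTS = 5      # assumed: phone power draw while generating
SECONDS_PER_IMAGE = 2     # from the paper's "less than 2 seconds" claim

painting_wh = BRAIN_WATTS * PAINTING_HOURS                 # energy per painting, Wh
image_wh = PHONE_DRAW_WATTS * SECONDS_PER_IMAGE / 3600     # energy per image, Wh
images_per_battery = PHONE_BATTERY_WH / image_wh           # images until battery dies

print(f"One human painting: ~{painting_wh:.0f} Wh")
print(f"One generated image: ~{image_wh * 1000:.1f} mWh")
print(f"Images per full battery: ~{images_per_battery:.0f}")
```

Under these assumptions one painting costs roughly 60 Wh while a full battery yields thousands of generated images, which is the sense in which the comparison above is lopsided even before the "keeping the lights on" overhead.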
Carmack could possibly get us realtime networked stable diffusion text to video and video to video at high resolution, maybe even on phones. It will probably happen anyway, but it might take 5+ extra years, and there'll probably be a ton of stupid things we never fix.
So are they gonna release the code, or do they only open-source ad-SDKs[1]?
It seems a bit disingenuous to compare against Stable Diffusion taking 50 steps, though; with the newer schedulers you can consistently get great images in 12 denoising steps, probably fewer if you're a bit careful with the exact parameters and choice of fine-tuned model.
I’m all for the continued advance of diffusion models.
If this paper offered evidence of quantitative and qualitative measurement techniques for determining human preference for art or photos based on a prompt, I'd get the phrasing.
But having the first sentence essentially spurn professional creatives seems to unnecessarily fan the flames.
AI image generation does bother some creatives, and there are real reasons for this, given how many models have been trained on work that artists spent years practicing.
Keep up with the science, but don’t forget the tact!
Near real-time rendering while entering a prompt on desktop would be even more amazing!
Imagine e.g. UI sliders for adding weight to multi-prompts like "night time" (ideally sliding from [day time]:1, [night time]:0 to both zero, to all night with 0,1).
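The slider idea above amounts to a weighted blend of the per-prompt conditioning. A minimal sketch, assuming the prompts have already been encoded to embedding vectors (the 4-dimensional lists below are made-up stand-ins for real text embeddings, and `blend` is a hypothetical helper, not any library's API):

```python
# Hypothetical stand-in embeddings for the two prompts; real CLIP text
# embeddings would be much higher-dimensional.
day_emb = [1.0, 0.0, 0.5, 0.2]    # "day time"
night_emb = [0.0, 1.0, 0.3, 0.8]  # "night time"

def blend(weights, embeddings):
    """Weighted average of prompt embeddings, as the sliders would compute it."""
    total = sum(weights) or 1.0  # guard the "both sliders at zero" case
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) / total
            for i in range(len(embeddings[0]))]

# Slider at [day time]:1, [night time]:0 -> pure day embedding
print(blend([1.0, 0.0], [day_emb, night_emb]))
# Sliders halfway -> an even day/night mix
print(blend([0.5, 0.5], [day_emb, night_emb]))
```

Dragging a slider just re-runs the blend and feeds the result back in as the conditioning, which is what would make near real-time preview so appealing.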
- Open projects already dominate the desktop space
- There are a growing number of younger people who aren't really computer literate, or otherwise just use their phone as their primary computing device.
- Phones go into social situations where desktops/laptops don't.
Snap Inc.
However, with stuff like ControlNet, it's already possible, and it will be solved within a year. Yes, you can specify every exact detail, but you need to feed it a sketch, a skeletal pose, or a reference image of the dog...
Also, you can train a LoRA on the subject beforehand if you want to consistently regenerate the subject with just text.
I suppose it's not your main point, but that number is off by... probably about 10,000 orders of magnitude.
The thing is, none of these are mind readers. And text is a very poor way to define tight specs. The best way for software is to code it yourself. Code is the spec. Same with drawing. The spec is the drawing itself. Only the human can control that.
…or something…
How do I install it locally on Linux or a Mac?