The repo is aimed at developers and has two parts. The first adapts the ML model to run on Apple Silicon (CPU, GPU, Neural Engine), and the second allows you to easily add Stable Diffusion functionality to your own app.
If you just want an end-user app, those already exist, but now it will be easier to build apps that take advantage of Apple's dedicated ML hardware as well as the CPU and GPU.
> This repository comprises:
>
> - python_coreml_stable_diffusion, a Python package for converting PyTorch models to Core ML format and performing image generation with Hugging Face diffusers in Python
>
> - StableDiffusion, a Swift package that developers can add to their Xcode projects as a dependency to deploy image generation capabilities in their apps. The Swift package relies on the Core ML model files generated by python_coreml_stable_diffusion
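The Python side is driven from the command line. Here is a sketch of the two invocations the README describes, expressed as argument lists; the flag names are taken from the repo's documentation as I remember it and may have drifted, so treat them as illustrative rather than authoritative:

```python
# Hedged sketch: the two CLI steps (convert, then generate) from the README.
import shlex

def conversion_cmd(output_dir):
    """Command that converts the PyTorch checkpoints to Core ML packages."""
    return [
        "python", "-m", "python_coreml_stable_diffusion.torch2coreml",
        "--convert-unet",
        "--convert-text-encoder",
        "--convert-vae-decoder",
        "--convert-safety-checker",
        "-o", output_dir,
    ]

def generation_cmd(prompt, mlpackage_dir, out_dir, compute_unit="ALL", seed=93):
    """Command that runs image generation with the converted Core ML models."""
    return [
        "python", "-m", "python_coreml_stable_diffusion.pipeline",
        "--prompt", prompt,
        "-i", mlpackage_dir,
        "-o", out_dir,
        "--compute-unit", compute_unit,
        "--seed", str(seed),
    ]

print(shlex.join(conversion_cmd("models")))
print(shlex.join(generation_cmd("an astronaut riding a horse", "models", "out")))
```

`--compute-unit ALL` is what lets Core ML spread the work across CPU, GPU, and Neural Engine.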
https://github.com/apple/ml-stable-diffusion

I imagine that here Apple wants to highlight a more research/interactive use, for example allowing fine-tuning SD on a few samples from a particular domain (a popular customization).
[1] https://onnxruntime.ai/docs/execution-providers/CoreML-Execu...
People who can't get the models to work by themselves given the source code aren't the target audience. There are other projects, though, that do distribute quick and easy scripts and tools to run these models.
Apple stepping in to get Stable Diffusion working on their platform is probably an attempt to get people to take their ML hardware more seriously. I read this more like "look, ma, no CUDA!" than "Mac users can easily use SD now". The module seems to be designed so that upstream SD code can easily be ported to macOS without special tricks.
I used this in the past to make a transformer-based syntax annotator. Fully in Rust, no Python required:

https://github.com/LaurentMazare/tch-rs
> For distilled StableDiffusion 2 which requires 1 to 4 iterations instead of 50, the same M2 device should generate an image in <<1 second
They have some benchmarks on the github repo: https://github.com/apple/ml-stable-diffusion
For reference, I was previously getting just under 3 minutes for 50 iterations on my MacBook Air M1. I haven't tried Apple's implementation yet, but it looks like a huge improvement; it might take it from "possible" to "usable".
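The arithmetic behind these numbers is straightforward: wall-clock time per image is roughly step count divided by throughput, plus whatever fixed overhead (model load, text encoding) the pipeline adds. A small helper; the overhead term is my own assumption, not a measured value:

```python
def generation_time(steps, iters_per_sec, overhead_s=0.0):
    """Rough wall-clock estimate for one image: steps / throughput + fixed overhead."""
    return steps / iters_per_sec + overhead_s

# 50 steps at ~2.8 it/s (the M1 Max figure reported below) is ~18 s;
# a distilled model needing only 4 steps would take ~1.4 s at the same rate.
print(round(generation_time(50, 2.8), 1))  # 17.9
print(round(generation_time(4, 2.8), 1))   # 1.4
```

This is also why the distilled-model quote above is so dramatic: cutting 50 steps to 1-4 divides the dominant term by an order of magnitude.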
Mac Studio with M1 Ultra gets 3.3 iters/sec for me.
MacBook Pro M1 Max gets 2.8 iters/sec for me.
And the posted benchmarks for the M2 MacBook Air make me consider "upgrading" to an Air.
DALL-E et al. will still be able to bandwagon off all the free ecosystem being built around the $10M SD 1.4 model that is showing what is possible.

E.g. DALL-E could go straight to Hollywood if their model training works better than SD's. The toolsets will work either way.
Maybe a dumb question, but can the old model still be run?
https://mezha.media/en/2022/10/06/google-is-working-on-image...
Give it some time and SD will be able to do the same.
See deforum[1] and andreasjansson's stable-diffusion-animation[2]
[1]: https://deforum.github.io/
[2]: https://replicate.com/andreasjansson/stable-diffusion-animat...
What's cool about the era in which we live is if you look at high-performance graphics for games or simulations, for instance, it may in fact be faster to use a model to "enhance" a low-resolution frame than to render it fully on the machine.
ex. AMD's FSR vs NVIDIA DLSS
- AMD FSR (Fidelity FX Super Resolution): https://www.amd.com/en/technologies/fidelityfx-super-resolut...
- NVIDIA DLSS (Deep Learning Super Sampling): https://www.nvidia.com/en-us/geforce/technologies/dlss/
AMD's approach renders the game at a crummy, low-detail resolution, then "upscales" each frame.
Both FSR and DLSS aim to improve frames-per-second in games by rendering them below your monitor’s native resolution, then upscaling them to make up the difference in sharpness. Currently, FSR uses spatial upscaling, meaning it only applies its upscaling algorithm to one frame at a time. Temporal upscalers, like DLSS, can compare multiple frames at once, to reconstruct a more finely-detailed image that both more closely resembles native res and can better handle motion. DLSS specifically uses the machine learning capabilities of GeForce RTX graphics cards to process all that data in (more or less) real time.
Video is really a series of frames; film gets away with 24 frames/second, which works out to ~40 ms/image for real-time.
What's cool about the era in which we live is if you look at high-performance graphics for games or simulations, it may in fact be faster to run the model on each frame to "enhance" a low-resolution frame rather than trying to render it fully on the machine.
ex. AMD's FSR vs NVIDIA DLSS
- AMD FSR (Fidelity FX Super Resolution): https://www.amd.com/en/technologies/fidelityfx-super-resolut...
- NVIDIA DLSS (Deep Learning Super Sampling): https://www.nvidia.com/en-us/geforce/technologies/dlss/
AMD's approach renders the game at a crummy, low-detail resolution, then uses "spatial upscaling" to enhance the images one frame at a time.
NVIDIA DLSS uses "temporal upscaling", which looks across multiple frames and uses capabilities exclusive to Nvidia's cards to stitch the frames together.
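The spatial/temporal distinction can be shown with a toy sketch on grayscale frames stored as lists of rows. Real upscalers use motion vectors and, in DLSS's case, a neural network, so this only illustrates the concept:

```python
def spatial_upscale_2x(frame):
    """Nearest-neighbor 2x upscale of a single grayscale frame (list of rows).
    Spatial: each output frame depends only on its own input frame."""
    out = []
    for row in frame:
        doubled = [p for p in row for _ in (0, 1)]  # duplicate each pixel horizontally
        out.append(doubled)
        out.append(list(doubled))                   # duplicate each row vertically
    return out

def temporal_blend(prev_frame, cur_frame, alpha=0.5):
    """Temporal: the output draws on multiple frames. A real temporal upscaler
    aligns frames with motion vectors instead of a naive per-pixel average."""
    return [
        [alpha * c + (1 - alpha) * p for p, c in zip(prow, crow)]
        for prow, crow in zip(prev_frame, cur_frame)
    ]

lowres = [[0, 255], [255, 0]]
print(spatial_upscale_2x(lowres))              # 4x4 image from a 2x2 input
print(temporal_blend([[0, 0]], [[255, 255]]))  # [[127.5, 127.5]]
```

The extra information temporal methods can pull from neighboring frames is why they reconstruct detail that spatial methods simply cannot recover from one frame.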
This is a different challenge than generating the content from scratch
I don't think this is possible in real-time yet, but someone put a filter trained on the German countryside over Grand Theft Auto driving gameplay to make it photorealistic:
https://www.youtube.com/watch?v=P1IcaBn3ej0
Notice the mountains in the background go from Southern California brown to lush green
https://www.rockpapershotgun.com/amd-fsr-20-is-a-more-demand....
The author has a detailed blog post outlining how he modified the model to use Metal on iOS devices: https://liuliu.me/eyes/stretch-iphone-to-its-limit-a-2gib-mo...
This site is purely a marketing effort.
Pretty much like Stable Diffusion and the grifters using it in general; they will never credit the artists and images that they stole to generate these images.
Of course you can see the original images (https://rom1504.github.io/clip-retrieval/); it was legal to collect them (they used robots.txt for consent, just like Google Image Search), and it was legal to train on them (under German legal principles, not US ones, since it was made in Germany).
"Crediting the artist" isn't a legal principle - it's more like some kind of social media standard which is enforced by random amateur artists yelling at you if you don't do it. It's both impossible (there are no original artists for a given output) and wouldn't do anything to help the main social issue (future artists having their jobs taken by AIs).
Two wrongs don't make a right.
I'm not seeing any installation instructions on either link - what am I missing?
Great support for M1, basically since the beginning. The install is painless.
Release video for InvokeAI 2.2: https://www.youtube.com/watch?v=hIYBfDtKaus
This gets you from text descriptions to images.
I have seen models that given a picture, then generate similar pictures. I want this because while I have many pictures of my grandmothers, I only have a couple of pictures of my grandfathers and it would be nice to generate a few more.
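That "picture in, similar pictures out" mode is image-to-image generation. Here is a hedged sketch using Hugging Face diffusers' StableDiffusionImg2ImgPipeline; the class, parameter, and model names reflect the diffusers API as I know it and may change, and the heavy imports are kept inside the function so the sketch stays importable without them installed:

```python
def make_variations(photo_path, prompt, n=4, strength=0.6):
    """Sketch: produce n images similar to the input photo.
    `strength` controls how far the model may drift from the original
    (near 0 = stay close to the input, near 1 = mostly ignore it)."""
    # Assumes `diffusers`, `torch`, and `Pillow` are installed.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    )
    init = Image.open(photo_path).convert("RGB").resize((512, 512))
    result = pipe(prompt=prompt, image=init, strength=strength,
                  num_images_per_prompt=n)
    return result.images
```

With only a couple of source photos you would likely get better results from fine-tuning approaches (e.g. Dreambooth-style personalization) than from img2img alone, but this is the zero-setup starting point.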
Core ML is so well done. A year ago I wrote a book on Swift AI and used Core ML in several examples.
Edit: still alive! https://grandperspectiv.sourceforge.net/
Found a >100GB accidental “livestream” recording on one computer. Would have taken forever to find what was taking up all the room otherwise.
GUI apps for this task like GP and the like are more visually complex than they need to be.
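For a one-off hunt like that, a few lines of stdlib Python (or plain `du -sh *`) get you the same answer as a GUI treemap: walk the tree, sum file sizes per directory, and print the biggest offenders. A minimal sketch:

```python
import os

def biggest_dirs(root, top=10):
    """Sum the sizes of files directly inside each directory under `root`
    and return the `top` largest as (path, bytes) pairs, biggest first."""
    totals = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        size = 0
        for name in filenames:
            try:
                size += os.path.getsize(os.path.join(dirpath, name))
            except OSError:  # permission errors, files deleted mid-walk
                pass
        totals[dirpath] = size
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Usage: for path, size in biggest_dirs(os.path.expanduser("~")):
#            print(f"{size / 2**30:8.2f} GiB  {path}")
```

Something like `biggest_dirs(os.path.expanduser("~"))` would have surfaced that 100GB recording in seconds.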
/dev/disk3s5 926Gi 857Gi 52Gi 95% 8067489 540828800 1% /System/Volumes/Data
It normally hovers around 30-35Gi free.

One on the GPU and another on the ML core?
Centralized services small and large are guilty of this and I'm sick of it.
If you mean monetary usecases: Roughly something like Photoshop/Blender/UnrealEngine with ML plugins that are low latency, private, and $0 server hosting costs.
I'm not sure exactly what that costs me in terms of power, but it is assuredly less than any of these services charge for a single image generation.
But seriously, I wonder when you'll be able to paste in a script and get out a storyboard or a movie.
3.56 seconds?