20x Faster Background Removal in the Browser Using ONNX Runtime with WebGPU (opens in new tab)

(img.ly)

165 pointsbuss_jan1y ago31 comments

31 comments

Background Removal can be thought of as Foreground Segmentation, inverted. That is no trivial feat; my undergraduate thesis was on segmentation, but using only “mechanical” approaches, no NNs, etc), hence my appreciation!

But here’s something I don’t understand: (And someone please correct me if I’m wrong!) - now I do understand that NNs are to software what FPGAs are to hardware, and the ability to pick any node and mess with it (delete, clone, more connections, less connections, link weights, swap-out the activation functions, etc) means they’re perfect for evolutionary-algorithms that mutate, spawn, and cull these NNs until they solve some problem (e.g. playing Super Mario on a NES (props to Tom7) or in this case, photo background segmentation.

…now, assuming the analogy to FPGAs still holds, with NNs being an incredibly inefficient way to encode and execute steps in a data-processing pipeline (but very efficient at evolving that pipeline) - doesn’t it then mean that whatever process is encoded in the NN, it should both be possible to represent in some more efficient representation (I.e. computer program code, even if it’s highly parallelised) and that “compiling” it down is essential for performance? And if so, then why are models/systems like this being kept in NN form?

(I look forward to revisiting this post a decade from now and musing at my current misconceptions)

johndough1y ago

For many tasks that neural networks can solve, there are traditional algorithms that are more compact (lines of source code vs size of neural network parameters), but they are not always faster and often produce results of lower quality. For a fair comparison, you have to compare the quality of result together with the computation time, which is not straightforward since those are two competing goals. That being said, neural networks perform quite well for two reasons:

1. They can produce approximate solutions which are often good enough in practice and faster than exact algorithmic solutions.

2. Neural networks benefit from billions of dollars of research into how to make them run faster, so even if they technically require more TFLOPs to compute, they are still faster than traditional algorithms that are not extremely well optimized.

Lastly, development time is also important. It is much easier to train a neural network on some large dataset than to come up with an algorithm that works for all kinds of edge cases. To be fair, neural networks might fail catastrophically when they encounter data that they have not been trained on, but maybe it is possible to collect more training data for this specific case.

I have not discussed any methods to compress and simplify already trained models here (model distillation, quantization, pruning, low-rank approximation, and probably many more that I've forgotten), but they all tip the scales in favor of neural networks.

salamo1y ago

"Neural networks are the second best way of doing just about anything." ~ John Denker

It's an old quote that, although not 100% accurate anymore, still sums up my feelings quite nicely.

sitkack1y ago

There is some work to convert NNs to decision trees.

https://towardsdatascience.com/neural-networks-as-decision-t...

https://arxiv.org/abs/2210.05189

I haven't reviewed any of it, I only know of it tangentially.

https://www.semanticscholar.org/paper/Converting-A-Trained-N...

Distilling a Neural Network Into a Soft Decision Tree https://arxiv.org/abs/1711.09784

GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent https://arxiv.org/abs/2305.03515

TeMPOraL1y ago

NNs are, in a way, already "compiled". If all you want to do is inference (forward pass), then you mostly do a lot of matrix multiplications. It's the training pass that requires building up extra scaffolding to track gradients and such.

It occurred to me that NNs ("AI") are indeed a bit like crypto, in the sense that both attempt to substitute compute for some human quality. Proof of Work and associated ideas try to substitute compute for trust[0]. Solving problems by feeding tons of data into a DNN is substituting compute for understanding. Specifically, for our understanding of the problem being solved.

It's neat we can just throw compute at a problem to solve it well, but we then end up with a magic black box that's even less comprehensible than the problem at hand.

It also occurs to me that stochastic gradient descent is better than evolutionary programming because it's to evolution what closed-form analytical solutions are to running a simulation of interacting bodies - if you can get away with a formula that gives you what the simulation is trying to approximate, you're better off with the formula. So in this sense, perhaps it's worth to try harder to take a step back and reverse-engineer the problems solved by DNNs, try to gain that more theoretical understanding, because as fun as brute-forcing a solution is, analytical solutions are better.

[0] - Which I consider bad for reasons discussed many time before; it's not where I want to go with this comment.

johndough1y ago

Neural networks are not trained with evolutionary algorithms because they are very slow, especially for the millions or billions of parameters that NNs have. Instead, stochastic gradient descent is used for training, which is much more efficient.

eevilspock1y ago

> doesn’t it then mean that whatever process is encoded in the NN, it should both be possible to represent in some more efficient representation...?

Not if NNs are complex systems[1] whose useful behavior is emergent[2] and therefore non-reductive[3]. In fact, my belief is that if NNs and therefore also LLMs aren't these things, they can never be the basis for true AI.[4]

---

[1] https://en.wikipedia.org/wiki/Complex_system

[2] https://en.wikipedia.org/wiki/Emergence

[3] https://en.wikipedia.org/wiki/Reductionism, https://www.encyclopedia.com/humanities/encyclopedias-almana..., https://academic.oup.com/edited-volume/34519/chapter-abstrac...

[4] Though being these things doesn't guarantee that they can be the basis for true AI either. It's a minimum requirement.

andrewstuart1y ago

Worth noting that background removal is built in to Preview on Macos.

dagmx1y ago

It’s also built into Safari and Photos on all their platforms and available as an API that can be called by any app

https://developer.apple.com/wwdc23/10176

oefrha1y ago

Huh, I've been copying background-removed subjects out of Preview and didn't realize there's a VisionKit API. Looks like it's quite easy to use too, I put together a quick and dirty script in a couple minutes and it worked wonderfully:

  import AppKit
  import VisionKit
  
  @main
  struct Script {
    static func main() async {
      let image = NSImage(contentsOfFile: "input.heic")!
      let view = ImageAnalysisOverlayView()
      let analyzer = ImageAnalyzer()
      let configuration = ImageAnalyzer.Configuration(.visualLookUp)
      let analysis = try! await analyzer.analyze(image, orientation: .up, configuration: configuration)
      view.analysis = analysis
      let subjects = await view.subjects
      for (index, subject) in subjects.enumerated() {
        let subjectImage = try! await subject.image
        let pngData = NSBitmapImageRep(data: subjectImage.tiffRepresentation!)!.representation(
          using: .png, properties: [:])
        try! pngData?.write(to: URL(fileURLWithPath: "subject-\(index).png"))
        print("subject-\(index).png")
      }
    }
  }

Abishek_Muthian1y ago

Was searching for an equivalent for Linux, came across rembg.

https://github.com/danielgatis/rembg

forgotusername61y ago

"Therefore, the first run of the network will take ~300 ms and consecutive runs will be ~100 ms"

I only skimmed the article, but I don't think they mention the size of the image. 100ms is not that impressive when you consider that you need to be three times as fast for acceptable video frame rate.

diggan1y ago

> I only skimmed the article, but I don't think they mention the size of the image. 100ms is not that impressive when you consider that you need to be three times as fast for acceptable video frame rate.

You don't need three times as fast for acceptable video frame rates in a video editor, you need a system that allows you to cache "rendered" frames so when the user does an edit, it renders to this cache, then once done, the user can play it back in real-time.

This is essentially how all video editors handle edits on clips/video today. Some effects/edits can be applied in real-time, but the more advanced one (I'd say background removal being one of them) works with this type of caching system.

pjmlp1y ago

As long as one uses a Chrome distribution.

WebGPU is at least one year away of becoming usable for cross browser deployment.

diggan1y ago

> WebGPU is at least one year away of becoming usable for cross browser deployment.

In Firefox it seems to be behind a feature flag and Safari seems to have it in it's "Technology Preview" (some sort of release candidate?), so seems closer that I at least though.

pjmlp1y ago

Firefox has had it as feature flag for at least one year now, Safari just announced the technology preview during WWDC updates.

WebGL 2.0 took almost a decade to be fully supported, and still has issues on Safari, don't expect WebGPU to be any faster.

Also note that Google is the culprit why WebGL Compute did not happen, WebGPU was going to sort all problems, and even though they use DirectX on Windows, apparently it was a big issue to use Metal Compute on Apple instead of OpenGL, and then they ended up improving Angle on top of Metal anyway.

Web politics.

tlarkworthy1y ago

Onnx is cool, the other option is tensorflow js which I have found quite nice as a usable matrix lib for JS with shockingly good perf.would love to know how well they compare

dleeftink1y ago

Also shout out to Taichi and GPU.js for alternatives in this space. I've also had success with Hamster.js, that 'parallelizes' computations using Web workers instead of the GPU (who knows, in the future the two might be combined?).

lukan1y ago

"who knows, in the future the two might be combined?"

You can combine both today alreay and I experiment with it. The problem is still the high latency of the GPU. It takes ages, to get an answer and the timing is not consistent. That makes all scheduling for me a nightmare, when dividing jobs between the CPU and GPU. It would probably require a new hardware architectur, to make use of that in a sane way, so that GPU and CPU are more closely connected. (there are some designs aiming for this, as far as I know)

edit: you probably meant hamsterS.js

salamo1y ago

They are probably two different use cases. Parallelizing with web workers could be faster for algorithms that do a lot of branching (minimax comes to mind) but if you can vectorize (matmuls for example) then GPU probably dominates.

1 more reply

wruza1y ago

Interesting, there’s also node version in /packages.

wruza1y ago

Tried it, and it's absolutely half-baked. Doesn't accept its own config typed param, messes up with own internal urls, cannot run from non-project dir.

Although the segmentation quality is much better than that of `rembg`, the interface to it is just unamazing. Update: nope, it's sharper, but fails at different images at the same rate.

gist: https://gist.github.com/sou-long/5c7cfee57f5399918c9072552af... (adapted from a real project, just for reference)

jvdvegt1y ago

MS teams does this already, right? (I assume they do, as it didn't work in Firefox until recently)

Or do they do it server side?

afro881y ago

I'm pretty sure they do it client side. The latency on your video preview is non existent.

tommek40771y ago

If I run it in a browser on my client, why going to a website in the first place?

jazzyjackson1y ago

to resolve a short url to a piece of software i guess

PUSH_AX1y ago

This novel concept should have a catchy name…

1 more reply

j / k navigate · click thread line to collapse

31 comments

DaiPlusPlus1y ago

(I look forward to revisiting this post a decade from now and musing at my current misconceptions)

johndough1y ago

1. They can produce approximate solutions which are often good enough in practice and faster than exact algorithmic solutions.

salamo1y ago

"Neural networks are the second best way of doing just about anything." ~ John Denker

It's an old quote that, although not 100% accurate anymore, still sums up my feelings quite nicely.

sitkack1y ago

There is some work to convert NNs to decision trees.

https://towardsdatascience.com/neural-networks-as-decision-t...

https://arxiv.org/abs/2210.05189

I haven't reviewed any of it, I only know of it tangentially.

https://www.semanticscholar.org/paper/Converting-A-Trained-N...

Distilling a Neural Network Into a Soft Decision Tree https://arxiv.org/abs/1711.09784

GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent https://arxiv.org/abs/2305.03515

TeMPOraL1y ago

It's neat we can just throw compute at a problem to solve it well, but we then end up with a magic black box that's even less comprehensible than the problem at hand.

[0] - Which I consider bad for reasons discussed many time before; it's not where I want to go with this comment.

johndough1y ago

eevilspock1y ago

> doesn’t it then mean that whatever process is encoded in the NN, it should both be possible to represent in some more efficient representation...?

---

[1] https://en.wikipedia.org/wiki/Complex_system

[2] https://en.wikipedia.org/wiki/Emergence

[3] https://en.wikipedia.org/wiki/Reductionism, https://www.encyclopedia.com/humanities/encyclopedias-almana..., https://academic.oup.com/edited-volume/34519/chapter-abstrac...

[4] Though being these things doesn't guarantee that they can be the basis for true AI either. It's a minimum requirement.

andrewstuart1y ago

Worth noting that background removal is built in to Preview on Macos.

dagmx1y ago

It’s also built into Safari and Photos on all their platforms and available as an API that can be called by any app

https://developer.apple.com/wwdc23/10176

oefrha1y ago

  import AppKit
  import VisionKit
  
  @main
  struct Script {
    static func main() async {
      let image = NSImage(contentsOfFile: "input.heic")!
      let view = ImageAnalysisOverlayView()
      let analyzer = ImageAnalyzer()
      let configuration = ImageAnalyzer.Configuration(.visualLookUp)
      let analysis = try! await analyzer.analyze(image, orientation: .up, configuration: configuration)
      view.analysis = analysis
      let subjects = await view.subjects
      for (index, subject) in subjects.enumerated() {
        let subjectImage = try! await subject.image
        let pngData = NSBitmapImageRep(data: subjectImage.tiffRepresentation!)!.representation(
          using: .png, properties: [:])
        try! pngData?.write(to: URL(fileURLWithPath: "subject-\(index).png"))
        print("subject-\(index).png")
      }
    }
  }

Abishek_Muthian1y ago

Was searching for an equivalent for Linux, came across rembg.

https://github.com/danielgatis/rembg

forgotusername61y ago

"Therefore, the first run of the network will take ~300 ms and consecutive runs will be ~100 ms"

diggan1y ago

pjmlp1y ago

As long as one uses a Chrome distribution.

WebGPU is at least one year away of becoming usable for cross browser deployment.

diggan1y ago

> WebGPU is at least one year away of becoming usable for cross browser deployment.

In Firefox it seems to be behind a feature flag and Safari seems to have it in it's "Technology Preview" (some sort of release candidate?), so seems closer that I at least though.

pjmlp1y ago

Firefox has had it as feature flag for at least one year now, Safari just announced the technology preview during WWDC updates.

WebGL 2.0 took almost a decade to be fully supported, and still has issues on Safari, don't expect WebGPU to be any faster.

Web politics.

tlarkworthy1y ago

Onnx is cool, the other option is tensorflow js which I have found quite nice as a usable matrix lib for JS with shockingly good perf.would love to know how well they compare

dleeftink1y ago

lukan1y ago

"who knows, in the future the two might be combined?"

edit: you probably meant hamsterS.js

salamo1y ago

1 more reply

wruza1y ago

Interesting, there’s also node version in /packages.

wruza1y ago

Tried it, and it's absolutely half-baked. Doesn't accept its own config typed param, messes up with own internal urls, cannot run from non-project dir.

Although the segmentation quality is much better than that of `rembg`, the interface to it is just unamazing. Update: nope, it's sharper, but fails at different images at the same rate.

gist: https://gist.github.com/sou-long/5c7cfee57f5399918c9072552af... (adapted from a real project, just for reference)

jvdvegt1y ago

MS teams does this already, right? (I assume they do, as it didn't work in Firefox until recently)

Or do they do it server side?

afro881y ago

I'm pretty sure they do it client side. The latency on your video preview is non existent.

tommek40771y ago

If I run it in a browser on my client, why going to a website in the first place?

jazzyjackson1y ago

to resolve a short url to a piece of software i guess

PUSH_AX1y ago

This novel concept should have a catchy name…

1 more reply

j / k navigate · click thread line to collapse