My desk is currently set up with a large monitor in the middle. I'd like to look at the center of the screen when taking calls, but have it appear as though I am looking straight into the camera and the camera is pointed at my face. Obviously, I cannot physically place the camera right in front of the monitor, as that would be seriously inconvenient. Some laptops solve this, but I don't think their methods apply here, as the top of my monitor ends up quite a bit higher than what would look "good" with simple eye correction.
I have multiple webcams that I can place around the monitor to my liking. I would like something similar to what is seen when you open this webpage, but for video, and hopefully at higher quality since I'm not constrained to a monocular source.
I've dabbled a bit with OpenCV in the past, but the most I've done is a little camera calibration for de-warping fisheye lenses. Any ideas on what work I should look into to get started with this?
In my head, I'm picturing two camera sources: one above and one below the monitor. The "synthetic" projected perspective would be in the middle of the two.
Is capturing a point cloud from a stereo source and then reprojecting with splats the most "straightforward" way to do this? Any and all papers/advice are welcome. I'm a little rusty on the math side but I figure a healthy mix of Szeliski's Computer Vision, Wolfram Alpha, a chatbot, and of course perseverance will get me there.
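To make the above/below idea concrete, here is a minimal numpy sketch of the reprojection step, assuming a pinhole model with intrinsics K shared by both cameras, a depth map already recovered from the stereo pair, and pure translation between views (all values are toy placeholders, not a calibrated setup):

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel (u, v) with depth z to the 3D point z * K^-1 [u, v, 1]."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix          # 3 x N rays on the z = 1 plane
    return rays * depth.reshape(1, -1)     # scale by depth -> 3 x N points

def project(points, K, t):
    """Project 3D points into a camera translated by t (rotation omitted)."""
    cam = points - t.reshape(3, 1)         # express points in the new camera frame
    cam = cam[:, cam[2] > 0]               # drop points behind the camera
    uv = K @ cam
    return (uv[:2] / uv[2]).T              # N x 2 pixel coordinates

# Toy example: a flat plane 1 m from the top camera, 10 cm vertical
# baseline, so the synthetic view sits 5 cm below the top camera.
K = np.array([[100.0,   0.0, 2.0],
              [  0.0, 100.0, 2.0],
              [  0.0,   0.0, 1.0]])
depth = np.ones((4, 4))
pts = backproject(depth, K)
mid = project(pts, K, t=np.array([0.0, 0.05, 0.0]))
# Each pixel shifts up by f * ty / z = 100 * 0.05 / 1 = 5 pixels.
```

In a real setup you'd get the depth from stereo matching (e.g. OpenCV's block matchers plus `reprojectImageTo3D`), and the projected points would then be splatted into the synthetic image with hole-filling, which is where the hard part lives.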
If you want your head to actually be centered, there are also "center-screen webcams" that sit in the middle of your screen during a call. There are a few types: thin webcams that drape down, and clear "webcam holders" that hold your webcam at the center of the screen, which are a bit less convenient.
Nvidia also has a software package you can use, but I believe it is a bit fiddly to get set up.
I appreciate the pragmatism of buying another thing to solve the problem but I am hoping to solve this with stuff I already own.
I'd be lying if I said the nerd cred of overengineering a solution wasn't attractive as well.
Estimating a depth map from a monocular camera is now possible, so that may help you get further with this.
It should be doable in real time, but might be stuck in the uncanny valley.
Also maybe look at what Meta and Apple's Vision Pro are doing to create their avatars.
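Once you have a per-pixel depth map from any of these routes (monocular or stereo), the "reproject with splats" step can start as something as simple as a nearest-pixel forward warp with a z-buffer. A hypothetical numpy sketch of just that z-buffer idea (not actual Gaussian splatting; all names and values here are made up for illustration):

```python
import numpy as np

def splat(colors, uv, z, h, w):
    """Forward-warp colored points into an h x w image: nearest pixel, closest depth wins."""
    out = np.zeros((h, w, 3))
    zbuf = np.full((h, w), np.inf)
    px = np.rint(uv).astype(int)           # snap each point to its nearest pixel
    for (u, v), zi, c in zip(px, z, colors):
        if 0 <= u < w and 0 <= v < h and zi < zbuf[v, u]:
            zbuf[v, u] = zi                # keep only the nearest surface per pixel
            out[v, u] = c
    return out

# Two points land on the same pixel; the nearer one (z = 1, green) wins.
img = splat(colors=np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
            uv=np.array([[1.0, 1.0], [1.0, 1.0]]),
            z=np.array([2.0, 1.0]), h=3, w=3)
```

Real systems replace the single-pixel writes with soft, overlapping splats and fill the disocclusion holes, but the depth-ordered compositing is the same idea.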
This is all well and good when you're just using it for a pretty visualization, but it appears Gaussians have the same weakness as point clouds processed with structure from motion: you need lots of camera angles to get good surface reconstruction accuracy.
The paper actually suggests the opposite: that Gaussian splats outperform point clouds and other methods when given the same amount of data. And not just by a little, but ridiculously so.
Their Gaussian-splatting-based SLAM variants with RGB-D and RGB (no depth) camera input both outperform essentially everything else and are SOTA (state-of-the-art) for the field. RGB-D obviously outperforms RGB, but RGB data used with Gaussian splatting performs comparably to, or beats, the competition even when the competition is using depth data.
And not only that: their metrics outperform everything else except systems operating on literal ground-truth data, and even then they come within a few percent of those ground-truth models.
And importantly, where most other models run at ~0.2-3 fps, this model runs several orders of magnitude faster, at an average of 769 fps. While higher fps doesn't mean much past a certain point, it does mean you can do SLAM on much weaker hardware while still guaranteeing a WCET (worst-case execution time) below the frame time.
So this actually is a massive advancement in the SOTA, since Gaussians let you very quickly and cheaply approximate a lot of information in a way you can efficiently compare and refine against the current sensor inputs.
Direct paper link for ref: https://arxiv.org/pdf/2312.06741
Are there any examples or algorithms that can turn this into 3D objects that could be used in a video game? Any examples of someone doing that?
[0]: https://www.unrealengine.com/marketplace/en-US/product/luma-...