Essentially, at each time step the algorithm senses its environment, compares that reading to the previous time step, and does two things: it adds any newly seen features to the map, and it measures how much the features correlated between the two time steps have moved, in order to infer egomotion. This doesn't necessarily have to be done with cameras; it also works with laser rangefinders and other reasonably accurate sensors.
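To make the egomotion part concrete, here's a minimal, self-contained Python sketch: given features already matched between two time steps (the data association that real SLAM front ends spend most of their effort on), a least-squares rigid fit (Kabsch/Procrustes) recovers how the scene moved relative to the sensor. The numbers are illustrative, not from any particular system:

    import numpy as np

    def estimate_egomotion(prev_pts, curr_pts):
        """Rigid 2D transform (R, t) mapping prev_pts onto curr_pts,
        i.e. how matched features moved between two time steps.
        Least-squares fit via the Kabsch/Procrustes method."""
        p_mean, c_mean = prev_pts.mean(axis=0), curr_pts.mean(axis=0)
        H = (prev_pts - p_mean).T @ (curr_pts - c_mean)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:   # guard against a reflection
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = c_mean - R @ p_mean
        return R, t

    # Toy example: features rotated 10 degrees and shifted; the fit
    # recovers that motion, whose inverse is the sensor's egomotion.
    theta = np.radians(10)
    R_true = np.array([[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta),  np.cos(theta)]])
    prev = np.random.rand(20, 2)
    curr = prev @ R_true.T + np.array([0.5, -0.2])
    R, t = estimate_egomotion(prev, curr)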
Monocular SLAM (MonoSLAM, also the name of a well-known paper) is SLAM done with a single camera, which makes the problem harder than with two. With two cameras of known characteristics affixed to a rigid frame, the 3D position of any feature seen by both cameras at the same time can be determined directly by triangulation. With a single camera, it's trickier: only the bearing (angle) to a feature can be observed, not its 3D position, so an optimization step is needed to find the likeliest solution.
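As a sketch of why the second camera helps: with two known 3x4 projection matrices (made up here purely for illustration), a feature matched across both views can be triangulated linearly with the standard DLT method, whereas one view alone only constrains the ray through the feature:

    import numpy as np

    def triangulate(P1, P2, x1, x2):
        """Linear (DLT) triangulation: recover a 3D point from two
        calibrated views. P1, P2 are 3x4 projection matrices; x1, x2
        are pixel coordinates of the same feature in each image."""
        A = np.vstack([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]   # dehomogenize

    # Two identity-intrinsics cameras 0.1 m apart on a rigid baseline.
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([np.eye(3), np.array([[-0.1], [0.], [0.]])])
    X_true = np.array([0.3, 0.2, 2.0])
    x1 = X_true[:2] / X_true[2]                 # projection in camera 1
    x2 = (X_true[:2] + [-0.1, 0]) / X_true[2]   # projection in camera 2
    print(triangulate(P1, P2, x1, x2))          # recovers X_true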
There's also more to read on the relevant Wikipedia article, at http://en.wikipedia.org/wiki/Simultaneous_localization_and_m...
The reason we went with a single camera is lack of space. As you can see from some of the imagery of the product, the camera stack takes up a huge proportion of the machine. Also, when the algorithms were being developed in the early 2000s, cameras were still expensive bits of kit. I seem to remember the first one having a 1024x1024 resolution: pretty poor for photography, but good enough for feature mapping with SLAM.
I'd like to see the video and learn more.
The 360 is supposed to come out in 2015, and only in Japan at first, so no regrets about purchasing the Roomba :)