However, I would strongly recommend picking only one of your "features": the indoor navigation. If I were you, I'd definitely try to build a business by concentrating only on indoor navigation!
Indoor navigation is a huge new area where all the big players are looking for possible partners/acquisitions right now! Overlay-based AR and the measuring-tape demo are a joke compared to what you've shown in indoor navigation!
You really have a chance of making a successful company based only on the indoor navigation feature. Forget the pricing for now, just offer it as a free beta on both iOS and Android and try to get the word out as much as you can.
Good luck!
Not sure what market you're ultimately going for, but right now it seems to defeat the point of providing a nice simple API if it's only usable either for throwaway projects or by very large customers.
For the vision part, you start by extracting interest points in all images (Harris corners, SIFT, or similar), then you match them up across images using local patch descriptors, or track them directly (OpenCV's Lucas-Kanade optical-flow tracker is a reasonable starting point), and once you have the correspondences you can estimate a relative 3D camera transformation that explains the motion. The problem is hard here because the depth of every point is unknown in addition to the camera transform.
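To make the geometry step concrete, here's a minimal pure-NumPy sketch of recovering the epipolar constraint from correspondences using the classic eight-point algorithm, on synthetic data I made up for illustration (in practice you'd use something like OpenCV's findEssentialMat with RANSAC to handle bad matches):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scene: 3D points in front of both cameras (made-up data).
pts3d = rng.uniform([-1, -1, 4], [1, 1, 8], size=(20, 3))

# Ground-truth relative pose: small rotation about y plus a translation.
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.5, 0.0, 0.1])

def project(P):
    # Normalized pinhole projection (focal length 1, principal point 0).
    return P[:, :2] / P[:, 2:]

x1 = project(pts3d)              # camera 1 at the origin
x2 = project((pts3d - t) @ R.T)  # camera 2 after the relative motion

def eight_point(x1, x2):
    # Each correspondence gives one linear equation x2^T E x1 = 0.
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    A = np.column_stack([u2*u1, u2*v1, u2,
                         v2*u1, v2*v1, v2,
                         u1, v1, np.ones_like(u1)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)     # null vector of A, reshaped
    # Enforce the rank-2 constraint of a valid essential matrix.
    U, s, Vt = np.linalg.svd(E)
    return U @ np.diag([s[0], s[1], 0]) @ Vt

E = eight_point(x1, x2)

# Sanity check: estimated E should satisfy x2^T E x1 ~ 0 for all matches.
h1 = np.column_stack([x1, np.ones(len(x1))])
h2 = np.column_stack([x2, np.ones(len(x2))])
residuals = np.abs(np.sum(h2 * (h1 @ E.T), axis=1))
```

The essential matrix only pins down the translation up to scale, which is exactly the "depth of every point is unknown" problem: vision alone can't tell a small room moved through slowly from a big room moved through quickly.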
For the IMU stream you can use the accelerometer and gyro in the phone, which give you linear acceleration and angular velocity respectively. These can be integrated over time to get a reasonable guess for the camera transformation from one time point to another as well.
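A toy dead-reckoning sketch of that integration, in 2D with made-up constant readings (real IMU data is noisy and biased, which is why double-integrating the accelerometer drifts so badly):

```python
import numpy as np

dt = 0.01    # 100 Hz sample rate (assumed)
n = 200      # two seconds of samples

# Pretend sensor readings: constant yaw rate from the gyro and a
# constant forward acceleration in the body frame (illustrative only).
omega = 0.3                          # rad/s
accel_body = np.array([0.2, 0.0])    # m/s^2

yaw = 0.0
vel = np.zeros(2)
pos = np.zeros(2)
for _ in range(n):
    yaw += omega * dt                # integrate gyro -> orientation
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotate body-frame acceleration into the world frame.
    accel_world = np.array([c * accel_body[0] - s * accel_body[1],
                            s * accel_body[0] + c * accel_body[1]])
    vel += accel_world * dt          # integrate accel -> velocity
    pos += vel * dt                  # integrate velocity -> position
```

Note that orientation needs only one integration of the gyro, while position needs two integrations of the accelerometer, so any accelerometer bias turns into position error that grows quadratically with time.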
You combine the two guesses (from vision and from the phone's inertial measurement unit) into a best guess, and then combine that with the best guess from 30 milliseconds ago to arrive at an evolving probability distribution of this best guess over time. The standard way to do this is something like a Kalman filter.
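Here's a deliberately stripped-down scalar Kalman filter to show the fusion idea: each source's estimate is weighted by its uncertainty, and the state's own uncertainty grows between updates. The numbers (noise variances, measurement sources) are all invented for illustration; a real VIO filter would be an extended Kalman filter over a full pose-and-velocity state.

```python
import numpy as np

def kalman_step(x, P, z, R, Q=1e-4):
    """One predict+update cycle of a scalar Kalman filter.

    x, P: current state estimate and its variance
    z, R: a new measurement and its variance
    Q:    process noise added during prediction
    """
    P = P + Q                 # predict: uncertainty grows over time
    K = P / (P + R)           # Kalman gain: trust measurement vs. state
    x = x + K * (z - x)       # update the estimate toward the measurement
    P = (1 - K) * P           # update (shrink) the uncertainty
    return x, P

rng = np.random.default_rng(1)
true_pos = 2.0
x, P = 0.0, 1.0               # start with a wrong guess, high uncertainty
for _ in range(100):
    # Fuse a precise "vision" measurement and a noisier "IMU" one.
    x, P = kalman_step(x, P, true_pos + rng.normal(0, 0.3), R=0.09)
    x, P = kalman_step(x, P, true_pos + rng.normal(0, 0.8), R=0.64)
```

The key property is that the noisier source (larger R) automatically gets a smaller gain K, so neither stream has to be trusted blindly.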
Another issue is dealing with drift over time: errors in estimation build up, and if you're scanning the same area your model will start to drift. This requires something called "loop closure", which optimizes the camera matrices across the entire duration of the scan and not only frame to frame. This is very computationally intensive and hard to do online, and without it scans longer than a few seconds will get progressively uglier and more misaligned.
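A toy 1D version of why loop closure helps (all numbers invented): chaining noisy relative measurements accumulates drift, but recognizing "I'm back where I started" adds one extra constraint, and re-solving for all poses jointly as a least-squares problem spreads the accumulated error across the whole trajectory instead of dumping it at the end.

```python
import numpy as np

rng = np.random.default_rng(2)

# True 1D poses along a path that returns to its starting point.
true_poses = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
n = len(true_poses)

# Odometry: noisy relative measurements between consecutive poses.
odom = np.diff(true_poses) + rng.normal(0, 0.1, size=n - 1)

# Frame-to-frame chaining accumulates the noise, so the end drifts.
dead_reckoned = np.concatenate([[0.0], np.cumsum(odom)])

# Pose-graph optimization: solve for all poses at once, with one
# equation per odometry edge plus a strongly weighted loop-closure
# constraint (last pose == first pose) and an anchor on pose 0.
rows, b = [], []
for i in range(n - 1):
    row = np.zeros(n)
    row[i + 1], row[i] = 1, -1       # x[i+1] - x[i] = odom[i]
    rows.append(row); b.append(odom[i])
loop = np.zeros(n)
loop[-1], loop[0] = 1, -1
rows.append(loop * 10); b.append(0.0)   # loop closure, weight 10
anchor = np.zeros(n); anchor[0] = 1
rows.append(anchor); b.append(0.0)      # fix the gauge freedom
optimized, *_ = np.linalg.lstsq(np.array(rows), np.array(b), rcond=None)
```

Real loop closure does this over 6-DoF poses with thousands of constraints (and has to detect the revisit in the first place), which is where the computational cost comes from.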
This stuff is super tricky to get right. Also, be skeptical of these demos, because they are easy to stage: it's fairly easy to get that one shot where it looks like it works, but in practice these systems are exceptionally fragile and very, very difficult to get working. Though I'm impressed it seemed to work okay inside the mall, with all the specular reflections from the floor. I'd guess that if anyone stepped into the field of view (making the environment geometry non-static) it would all break :) Good luck to the team though!
Apple's developer videos on sensor fusion specifically say NOT to do this, even though their tech uses the gyroscope, which is orders of magnitude more precise than the accelerometer.
I believe the session is "Understanding Core Motion" (developer account required).
Because the device has an accelerometer, it is even able to extract absolute distances, not just relative ones. I'm actually surprised by this, as I always assumed the accelerometer was too noisy to be of use for this.
I tried to do a similar thing myself, but the problem is technically very difficult. While a lot of research has been done on structure from motion, actually packaging it into a usable API is a big task.