Teaching Robots Presence: What You Need to Know About SLAM

Comet Labs Research Team
Published in Comet Labs
Aug 18, 2017

This guide to SLAM is one of many guides from Comet Labs on deep technology innovations in AI and robotics.

Created by Abby Yao

At A Glance

Mobile robots are expected to perform complicated tasks that require navigation in complex and dynamic indoor and outdoor environments without any human input. In order to autonomously navigate, path plan, and perform these tasks efficiently and safely, the robot needs to be able to localize itself in its environment based on the constructed maps. Mapping the spatial information of the environment is done online with no prior knowledge of the robot’s location; the built map is subsequently used by the robot for navigation.

Skip to:

1) Localization methods

2) What is SLAM?

3) Sensors

4) Maps

5) Types of Visual SLAM Methods

6) Next Frontiers for Visual SLAM

7) Toolkit, Appendix, and References

Recently, there has been considerable excitement about the use of technology from the robotics and autonomous vehicle industries for indoor mapping, where GPS or GNSS signals are not available. This technology is called SLAM: Simultaneous Localization and Mapping. It is a process in which a robot builds a map of its spatial environment while keeping track of its own position within that map.

If you’re trying to get involved in autonomous vehicles of any kind, this guide will provide the foundation, covering topics ranging from basic localization techniques such as wheel odometry to the more advanced SLAM, especially visual-based SLAM. It presents the fundamental framework and methodologies used to implement visual-based SLAM, looks at industries poised to benefit from localization and mapping, and covers significant developments in monocular and RGB-D sensors for dense 3D reconstruction of the environment.

Localization Methods

Robot localization requires sensory information regarding the position and orientation of the robot within the built map. Each method has major limitations, so sensor fusion techniques are commonly deployed to overcome the constraints of any single sensor.

Relative Positioning Measurement

The simplest form is to use wheel odometry methods that rely on encoders to measure the rotation of the wheels. In these methods, wheel rotation measurements are used incrementally, in conjunction with the robot’s motion model, to find the robot’s current location with respect to a global reference coordinate system. The most significant error source is wheel slippage on uneven terrain or slippery floors. An inertial measurement unit (IMU) is also used to measure the linear and rotational acceleration of the robot; however, it still suffers from extensive drift and sensitivity to bumpy ground.
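
As a concrete illustration, here is a minimal dead-reckoning sketch for a differential-drive robot. The wheel base and encoder readings are made-up values for the example, not taken from any particular platform.

```python
import numpy as np

def odometry_update(x, y, theta, d_left, d_right, wheel_base):
    """Dead-reckoning update for a differential-drive robot.

    d_left, d_right: distance travelled by each wheel since the last
    update (encoder ticks times metres per tick); wheel_base: distance
    between the wheels. Returns the new pose in the global frame.
    """
    d_center = 0.5 * (d_left + d_right)        # forward displacement
    d_theta = (d_right - d_left) / wheel_base  # change in heading
    # Integrate the motion model; errors (e.g. wheel slip) accumulate,
    # which is why the estimate drifts without absolute corrections.
    x += d_center * np.cos(theta + 0.5 * d_theta)
    y += d_center * np.sin(theta + 0.5 * d_theta)
    theta = (theta + d_theta + np.pi) % (2 * np.pi) - np.pi
    return x, y, theta

# Example: the robot drives a gentle left arc over three encoder readings.
pose = (0.0, 0.0, 0.0)
for d_l, d_r in [(0.10, 0.12), (0.10, 0.12), (0.10, 0.12)]:
    pose = odometry_update(*pose, d_l, d_r, wheel_base=0.3)
print(pose)
```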

As with odometry, position estimates from inertial navigation are acquired by integrating the sensor readings: once to obtain the speed, twice to obtain the distance traveled. These systems are independent of external information sources. However, since the measurements are integrated, the position estimates drift over time, leading to ever-increasing errors.

Absolute Positioning Measurement

Laser range finders are commonly used as well. As optical sensors, they estimate distance by calculating the phase difference between the emitted wave and its reflection. Unfortunately, in large-scale environments there are bound to be areas devoid of features visible to a laser range finder, like open atria or corridors with glass walls.

WiFi localization builds a graph-based WiFi map by collecting signal strength across the field. In this approach, the mean and standard deviation of WiFi RSSI observations are approximated by linear interpolation on the graph. This leads to a computationally efficient observation likelihood function, and the optimized location can be derived from the resulting probability function. However, the method is restricted to areas with WiFi coverage and requires a pre-learned WiFi graph.

Fig. 1: Gaussian Process-learned WiFi graph for a single access point. The mean RSSI values are color-coded, varying from -90dBm to -20dBm. The locations where the robot observed signals from the access point are marked with crosses.
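
A minimal sketch of the observation likelihood described above: given the interpolated mean and standard deviation of RSSI at a handful of candidate locations, a Gaussian model scores how well a live reading fits each one. All numbers below are illustrative.

```python
import numpy as np

def rssi_likelihood(observed_rssi, mean_map, std_map):
    """Gaussian observation likelihood of one RSSI reading at every
    candidate location, given the interpolated WiFi map."""
    var = std_map ** 2
    return np.exp(-0.5 * (observed_rssi - mean_map) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy 1D corridor with 5 candidate locations: interpolated mean RSSI (dBm)
# and standard deviation from the survey, plus one live reading.
mean_map = np.array([-80.0, -65.0, -50.0, -60.0, -75.0])
std_map = np.array([6.0, 5.0, 4.0, 5.0, 6.0])
likelihood = rssi_likelihood(observed_rssi=-52.0, mean_map=mean_map, std_map=std_map)
print((likelihood / likelihood.sum()).round(3))   # peaks at the third location
```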

Many robots use the Global Positioning System (GPS), which allows the acquisition of pose information. With exact prior knowledge of where the satellites are, the receiver can calculate its true location once it determines its distance from three or more satellites using the travel time of the radio signals. Of course, poor satellite signal coverage in indoor environments limits its accuracy.

This measurement supplies information about the location of the robot independent of previous location estimates; the location is not derived from integrating a sequence of measurements, but directly from one measurement. This has the advantage that the error in the position does not grow unbounded, as is the case with relative position techniques.
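
To make the idea concrete, here is a toy range-based position fix in 2D, in the spirit of the GPS calculation above. It ignores receiver clock bias and uses made-up anchor positions and slightly noisy ranges.

```python
import numpy as np
from scipy.optimize import least_squares

def trilaterate(anchors, ranges, guess=(0.0, 0.0)):
    """Solve for the receiver position whose distances to the anchors
    (satellites, beacons, ...) best match the measured ranges."""
    def residual(p):
        return np.linalg.norm(anchors - p, axis=1) - ranges
    return least_squares(residual, np.asarray(guess)).x

# Toy 2D example: three known anchor positions and noisy range
# measurements to a receiver actually located at (2, 3).
anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_pos = np.array([2.0, 3.0])
ranges = np.linalg.norm(anchors - true_pos, axis=1) + np.array([0.05, -0.03, 0.02])
print(trilaterate(anchors, ranges))   # ~[2, 3]
```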

Multi-Sensor Fusion

Mobile robots are usually equipped with several sensor systems to avoid the limitations of reconstructing the environment from a single sensor. Relative position measurements provide precise positioning information constantly, and at certain times absolute measurements are made to correct accumulated errors. There are a number of approaches to sensor fusion for robot localization, including merging multiple sensor feeds at the lowest level before processing them homogeneously, and hierarchical approaches that fuse state estimates derived independently from each sensor.

The position measurements can be combined in a formal probabilistic framework, such as the Markov Localization framework. Based on all available information, the robot believes it is at a given location with a certain degree of probability. The localization problem then consists of estimating the probability density over the space of all locations. The Markov Localization framework combines information from multiple sensors, in the form of relative and absolute measurements, into a single belief about the location.

Fig. 2: Assume the robot’s position is one-dimensional. As the robot moves forward, the probability density distribution becomes smoother (more uncertain); when the robot queries its sensors, the distribution sharpens around its actual location.
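
The sketch below implements this idea as a simple one-dimensional histogram (Markov) filter. The corridor length, sensor likelihood values, and motion noise are illustrative assumptions, not parameters from any real system.

```python
import numpy as np

def predict(belief, step, noise=0.1):
    """Motion update: shift the belief by `step` cells and blur it,
    modelling odometry uncertainty (the density becomes smoother)."""
    shifted = np.roll(belief, step)
    kernel = np.array([noise, 1 - 2 * noise, noise])
    blurred = np.convolve(shifted, kernel, mode="same")
    return blurred / blurred.sum()

def correct(belief, likelihood):
    """Measurement update: multiply by the observation likelihood
    (e.g. from a range sensor) and renormalise."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Uniform prior over a 10-cell corridor; the sensor reports "near a door",
# and doors sit at cells 2 and 7 (illustrative likelihood values).
belief = np.ones(10) / 10
likelihood = np.array([.05, .05, .6, .05, .05, .05, .05, .6, .05, .05])
belief = correct(belief, likelihood)   # absolute measurement sharpens the belief
belief = predict(belief, step=1)       # moving forward smooths it again
print(belief.round(3))
```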

What is SLAM?

Implementing a navigation system that uses artificial landmarks or a priori known maps of the environment, together with accurate sensors to precisely measure those landmarks or map features, is straightforward for today’s robots. Similarly, building a map of the environment given the exact position of the robot is largely a solved problem. However, it is much harder to solve the complete problem simultaneously: enabling a mobile robot to build a map of an unexplored environment while using that same map to localize itself.

With the prior knowledge of the environments, mobile robots could perform a set of tasks. For instance, a map can inform path planning or provide an intuitive visualization for a human operator. Also, the map limits the error committed in estimating the state of the robot. Without a map, dead-reckoning would quickly drift over time. On the other hand, using a map (e.g. a set of distinguishable landmarks) the robot can “reset” its localization error by re-visiting known areas.

Without a map, data association (determining which observations correspond to which landmarks) becomes much more difficult when the robot’s pose estimate is uncertain. We are now entering a robust-perception age that demands performance with a low failure rate, high-level understanding of dynamic environments, the flexibility to adjust the computational load to the available sensing and computing resources, and task-driven perception. Future challenges will center on methods enabling large-scale implementations in increasingly unstructured environments, especially where GPS-like solutions are unavailable or unreliable: in urban canyons, under foliage, under water, or on remote planets, for example.

The popularity of the SLAM problem is correlated with the emergence of indoor mobile robotics. Indoors, GPS cannot be used to bound the localization error. Additionally, SLAM offers an appealing alternative to user-built maps, showing that robot operation is possible even without a purpose-built localization infrastructure.

Important applications of SLAM include:

  • Automatic car piloting on unrehearsed off-road terrains
  • Rescue tasks for high-risk or difficult-navigation environments
  • Planetary, aerial, terrestrial and oceanic exploration
  • Augmented reality applications where virtual objects are involved in real-world scenes
  • Visual surveillance systems
  • Medicine, and many more

With advances in visual-based SLAM, building 3D reconstructions of objects has become an equally practical application.

Sensors

Sensors typically fit into two main categories: interoceptive sensors and exteroceptive sensors. Interoceptive sensors, like wheel odometers and IMUs, generate relative position measurements. They are subject to non-systematic errors due to external causes like human intervention as well as systematic errors because of imperfections in the robots’ structure. Exteroceptive sensors, including cameras and lasers, provide absolute position measurements. If used alongside each other, they could compensate for errors like odometry drift. The three major types of sensors applied to current SLAM technology are acoustic sensors, laser rangefinders, and visual sensors.

Acoustic Sensors use the Time of Flight (ToF) technique to measure distance. Sonar Sensors are mostly used underwater, where laser rangefinders and visual sensors are ruled out. Lower-frequency sonars minimize absorption, and sonar provides much better resolution in a subsea environment. However, the monotony of subsea regions makes sonar depth information much harder to interpret, and it suffers from high angular uncertainty. Ultrasonic Sensors are generally the cheapest available source of spatial sensing for mobile robots. They are compatible with most surface types, whether metal or nonmetal, clear or opaque, as long as the measured surface has sufficient acoustic reflectivity. However, low spatial resolution and sensing range, sensitivity to environmental factors, and slow response speeds hamper the use of ultrasonic sensors in robots.
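
The underlying time-of-flight calculation is the same whether the pulse is acoustic or, as discussed next, optical: range is half the round-trip time multiplied by the propagation speed. A tiny worked example, with made-up echo times:

```python
def tof_distance(round_trip_s, wave_speed_m_s):
    """Time-of-flight ranging: the wave travels out and back, so the
    range is half the round-trip time times the propagation speed."""
    return 0.5 * round_trip_s * wave_speed_m_s

# An ultrasonic ping returning after 5.8 ms (speed of sound ~343 m/s) vs a
# laser pulse returning after 13.3 ns (speed of light ~3e8 m/s).
print(tof_distance(5.8e-3, 343.0))   # ~0.99 m
print(tof_distance(13.3e-9, 3.0e8))  # ~2.0 m
```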

Laser Rangefinders also use ToF and phase-shift techniques to measure distance. Their high speed and accuracy enable robots to generate precise range measurements. This contributes to the significant popularity of laser rangefinders for SLAM, since they can obtain robust results in both indoor and outdoor environments. A laser scanner is the best sensor for extracting planar features (like walls) thanks to the dense range data it provides. However, price is the usual stumbling block: a 3D LiDAR system from Velodyne with accuracy within 2 cm can cost $75,000.

One thing that acoustic sensors, LiDAR, and other range-finding sensors lack is the ability to use surface properties to localize and identify objects. Color and grayscale images allow robots to use a wider set of information to identify and localize features in the environment.

Visual Sensors come in three main types: monocular cameras, stereo cameras, and RGB-D cameras. Rich visual information, which LiDAR lacks, is available from passive, low-cost visual sensors. The trade-off, however, is a higher computational cost and the need for more sophisticated algorithms to process the images and extract the necessary information. Systems combining cameras with an IMU are also a main focus for future developments in SLAM.

One of the major reasons Monocular Cameras are used in SLAM problems is that the hardware needed to implement them is much simpler, leading to systems that are cheaper and physically smaller. Suddenly, SLAM is accessible on mobile phones without the need for additional hardware. A weakness, however, is that the algorithms and software needed for monocular SLAM are much more complex, because a 2D image carries no direct depth information. Nevertheless, by integrating measurements over a chain of frames and triangulating, it is possible to jointly recover the shape of the map and the motion of the camera (provided the camera is moving). But since the depths of points are not observed directly, the estimated point and camera positions are related to the real positions by a common, unknown scale factor. The map therefore becomes a dimensionless map, with no real-world meaning attached to one map unit.

To address this scale ambiguity, there are alternatives other than stereo cameras. A real metric scale can be introduced through an external scale reference: a pre-specified object, or set of objects, of known size that can be recognized during mapping.

Fig. 3: The work is done by combining planar object detection with SLAM in order to recognize paintings in an art gallery, then using the known dimensions of the paintings to set the map scale. Adapted from “Combining monoSLAM with Object Recognition for Scene Augmentation using a Wearable Camera,” by Castle, R., Klein, G., & Murray, D., 2010, Image and Vision Computing, 28(11), 1548–1556.
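
A minimal sketch of how a known-size reference object can pin down the monocular scale factor: the ratio between the object’s true size and its size in (dimensionless) map units rescales the whole map. The point coordinates and painting width below are made up for the example.

```python
import numpy as np

def metric_scale(map_points_obj, known_size_m):
    """Estimate the unknown monocular scale factor from one reference
    object: ratio of the object's true size to its size in map units.

    map_points_obj: Nx3 array of reconstructed points on the object
    (dimensionless map units); known_size_m: its true largest extent.
    """
    extent_map_units = np.ptp(map_points_obj, axis=0).max()
    return known_size_m / extent_map_units

# Illustrative: a painting reconstructed as ~0.8 map units wide that is
# known to be 1.2 m wide; every map coordinate is then multiplied by s.
obj = np.array([[0.0, 0.0, 2.1], [0.8, 0.0, 2.1], [0.8, 0.5, 2.1]])
s = metric_scale(obj, known_size_m=1.2)
print(s)   # -> 1.5; map_metric = s * map_dimensionless
```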

One of the easiest ways to acquire depth information directly is through Stereo Cameras. A stereo camera system consists of two cameras separated by a fixed distance; observing the position of the same 3D point in both cameras allows depth to be calculated through triangulation, the same way we humans do with our eyes. This removes the monocular constraint that depth can only be recovered while the camera is moving. However, the depth measurement range is limited by the baseline and resolution. Generally, the wider the baseline, the better the depth estimate, though a setup with a wider baseline needs more space. The baseline on an AR headset will usually be only around 20 cm, and much less on a mobile phone. Given the heavy computational workload, FPGAs are often used to process the high input data rate.

Fig. 4: Basic principle of most stereo cameras. A point in the real world is projected onto the two image planes at different positions because of the cameras’ displacement. The point seen in the left camera image appears shifted by some distance (the disparity) in the right camera image. If this relative shift is known for each point, the depth value can be obtained.
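
The triangulation in Fig. 4 reduces to the pinhole stereo relation Z = f * B / d. The short sketch below shows how depth falls out of disparity, focal length, and baseline; the numbers are illustrative and chosen to echo the AR-headset-scale baseline mentioned above.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo model: Z = f * B / d.

    disparity_px: horizontal shift of a point between the left and
    right images (pixels); focal_px: focal length in pixels;
    baseline_m: distance between the two cameras.
    """
    disparity_px = np.asarray(disparity_px, dtype=float)
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: a 20 cm baseline and a 700-pixel focal length.
# A 1-pixel disparity error matters much more for far points than near
# ones, which is why the usable range is baseline-limited.
print(depth_from_disparity([70, 14, 7], focal_px=700, baseline_m=0.20))
# -> [ 2. 10. 20.] metres
```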

Many SLAM systems deploy RGB-D Cameras, which generate 3D images through structured light or time-of-flight technology, both of which provide depth information directly. A structured light camera projects a known pattern onto objects and uses an infrared camera to perceive the deformation of the pattern, from which the depth and surface information of the objects is calculated. A time-of-flight camera obtains depth information by measuring the travel time of a light signal between the camera and the objects. Compared to RGB-D cameras based on time-of-flight technology (e.g. Kinect for Xbox One), structured light sensors (e.g. Kinect for Xbox 360) are sensitive to illumination, which limits their applicability in direct sunlight. RGB-D cameras also share some general limitations: they don’t provide reliable range data for semi-transparent or highly reflective surfaces, and they have a limited effective range.

Maps

The most commonly used mapping representations in robotics are:

Feature Maps

Since this approach uses a limited number of sparse objects to represent a map, its computational cost can be kept relatively low, and map management algorithms are mature enough for current applications. The major weakness of feature map representation is its sensitivity to false data association.

Occupancy Grids

These are useful in path planning and exploration algorithms, in which the occupancy probability information can reduce the complexity of the path planning task. The major drawback of this method is its computational complexity, especially for large environments.
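
A minimal sketch of the standard log-odds update for a single occupancy-grid cell; the specific log-odds increments are illustrative values, not tuned parameters.

```python
import numpy as np

def update_cell(log_odds, hit, l_occ=0.85, l_free=-0.4):
    """Log-odds occupancy update for one grid cell: add evidence when a
    range beam ends in the cell (hit) or passes through it (free)."""
    return log_odds + (l_occ if hit else l_free)

def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + np.exp(log_odds))

# A cell observed as occupied three times, then once as free.
l = 0.0
for hit in (True, True, True, False):
    l = update_cell(l, hit)
print(round(float(probability(l)), 3))   # high occupancy probability
```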

Visual-based SLAM Implementation Framework

There has been an increased interest in visual-based SLAM because of the rich visual information available from passive low-cost video sensors compared to laser rangefinders. The majority of modern visual-based SLAM systems are based on tracking a set of points through successive camera frames, and using these tracks to triangulate their 3D position to create the map; while simultaneously using the estimated point locations to calculate the camera pose which could have observed them.

Fig. 5: The basic working principle of V-SLAM: from point observations and intrinsic camera parameters, the real-time 3D structure of a scene is computed together with the estimated motion of the camera.

The architecture of a SLAM system includes two main components: the front-end and the back-end. The front-end abstracts sensor data into models that are amenable for estimation, while the back-end performs inference on the abstracted data produced by the front-end.

Short-term data association is responsible for associating corresponding features in consecutive sensor measurements. On the other hand, long-term data association (or loop closure) is in charge of associating new measurements to older landmarks.

Fig. 6: Front-end and back-end in a Visual SLAM system.

Types of Visual SLAM Methods

SLAM systems can be classified by how they use image data: as sparse or dense, and as feature-based or direct. The former distinction describes how much of each received image frame is used, and the latter describes the way in which the image data are used.

Sparse and Dense Methods

From the perspective of which areas of an acquired image are used, SLAM systems can be classified as either sparse or dense. More specifically, sparse SLAM systems use only a small, selected subset of the pixels in an image frame, while dense SLAM systems use most or all of the pixels in each received frame. Because they use different numbers of pixels and regions, the maps generated by sparse and dense methods are very different. The maps generated by sparse methods are basically point clouds: a coarse representation of the scene, used mainly to track the camera pose (localization). Dense maps, on the other hand, provide much more detail of the viewed scenes, but because they use many more pixels than sparse methods, more powerful hardware is usually needed (most current dense SLAM systems require GPUs).


Fig. 7: Difference between maps generated by sparse and dense SLAM systems. (a) The sparse map created by PTAM, where the colored points are map points. (b) The semi-dense map in the LSD-SLAM, where the colored points are map points. (c) The dense map generated by the DTAM system. All points on the surface are part of the map.

Feature-based/Direct Methods

Based on system requirements to process the data, there are different methods to choose from.

Fig. 8: The different workflows of feature-based and direct methods for tracking and mapping. The constructed maps vary depending on which method is used.

The fundamental steps of the Feature-based Method are: extract a set of sparse features from the input images, match the features obtained from different poses, and solve the SLAM problem by minimizing the feature reprojection error (the difference between a point’s tracked location and where it is expected to be given the camera pose estimate, over all points).

Feature Extraction distills the useful information from an image. Features of interest range from simple point features such as corners to more elaborate features such as edges and blobs, and even complex objects such as doorways and windows. The region around each detected feature is converted into a compact descriptor that can be matched against other descriptors. The simplest descriptor of a feature is its appearance: the intensity of the pixels in a patch around the feature point.

Fig. 9: Comparison of different features extraction methods using an image obtained from the Oxford dataset: (a) FAST, (b) HARRIS, (c) ORB, (d) SIFT, (e) SURF. The size of the circle corresponds to the scale and the line corresponds to the orientation (direction of major change in intensity).
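
For a sense of what feature extraction looks like in practice, here is a minimal OpenCV sketch using ORB; the image path is a placeholder, and any grayscale frame would do.

```python
import cv2

# Minimal sketch: detect ORB keypoints and compute binary descriptors.
# "frame.png" is a placeholder path, not a file from any particular dataset.
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)          # cap the number of corners kept
keypoints, descriptors = orb.detectAndCompute(img, None)

# Each keypoint carries position, scale and orientation; each descriptor is
# a 32-byte binary string matched later with the Hamming distance.
print(len(keypoints), descriptors.shape)
```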

Feature Matching is the process of individually extracting features (descriptors) and matching them over multiple frames. Feature matching is particularly useful when significant changes in the appearance of the features occur after observing them over long sequences. The simplest way to match features between two images is to compare all feature descriptors in the first image to all feature descriptors in the second image using a similarity measure. The camera pose is then estimated from the matched features, using RANdom SAmple Consensus (RANSAC) to reject outlier matches.

Fig. 10: Matching pairs of descriptors.
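
Putting matching and pose estimation together, the sketch below brute-force matches ORB descriptors between two frames and estimates the relative camera pose with RANSAC via the essential matrix. The image paths and the intrinsic matrix K are placeholders, and for a monocular pair the recovered translation is only known up to scale.

```python
import cv2
import numpy as np

# Minimal sketch: match ORB descriptors between two frames, then use RANSAC
# on the correspondences to estimate the relative camera pose.
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-checking as the similarity test.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# RANSAC inside findEssentialMat discards outlier matches before the pose
# is recovered; t is only defined up to scale for a monocular camera.
K = np.array([[700., 0., 320.], [0., 700., 240.], [0., 0., 1.]])   # assumed intrinsics
E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
```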

A frame that has most of its features concentrated in a small area is of less interest to the algorithm than a frame with features spread over a larger area, since spread-out features are less likely to overlap. Another issue with feature-based methods is that storing the processed features can quickly become very costly. However, since this approach discards all data that cannot be used (non-feature points), it is typically faster than direct methods. It is also possible to reconstruct dense maps from feature-based methods by estimating the camera positions and then recovering what was at each location.

Direct Methods compare entire images to each other, finding which parts go together. Using semi-dense filtering algorithms, they can even create semi-dense 3D maps in real time on a smartphone. This means they provide more information about the environment, making them more interesting for robotics or AR and giving a more meaningful representation to the human eye. One disadvantage of direct methods is that they do not handle outliers very well, as they will always try to process them and incorporate them into the final map. Direct methods are also generally slower than feature-based variants.
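
A toy illustration of the photometric cost that direct methods minimize: instead of matching descriptors, raw pixel intensities are compared after warping one image by a candidate motion (here reduced to an integer horizontal shift, with synthetic images).

```python
import numpy as np

def photometric_error(ref, cur, shift):
    """Sum of squared intensity differences after warping `cur` by an
    integer horizontal shift: a 1-parameter toy version of the
    photometric cost that direct methods minimise over camera motion."""
    warped = np.roll(cur, shift, axis=1)
    return float(np.sum((ref.astype(float) - warped.astype(float)) ** 2))

# Toy example: `cur` is `ref` shifted right by 3 pixels, so the error is
# smallest for the candidate shift that undoes that motion.
rng = np.random.default_rng(2)
ref = rng.integers(0, 255, size=(32, 32))
cur = np.roll(ref, 3, axis=1)
errors = {s: photometric_error(ref, cur, -s) for s in range(6)}
print(min(errors, key=errors.get))   # -> 3
```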

Loop Closure

Loop closure detection is the final refinement step and is vital for obtaining a globally consistent SLAM solution especially when localizing and mapping over long periods of time. Loop closure is the process of observing the same scene by non-adjacent frames and adding a constraint between them, considerably reducing the accumulated drift in the pose estimate.

Fig. 11: The Map before and after applying loop closure constraints.

The most basic form of loop closure detection is to match the current frame against all previous frames using feature matching techniques. This approach is computationally very expensive because the number of frames increases dramatically over time, so matching the current frame against every previous frame is not always suitable for real-time applications. One solution is to define key frames (a subset of all previous frames) and compare the current frame only against those key frames. The most common way to filter loop closure candidates is to use a place recognition approach based on a vocabulary tree, in which the feature descriptors of the candidate key frames are hierarchically quantized and represented as a “Bag of Visual Words” (BoW).

Fig. 12: Feature descriptors are clustered around the words in a visual vocabulary. The clustering reduces the problem to a matter of counting how many times each word in the vocabulary occurs. Finally, the image can be represented using the resulting histogram of frequencies. Similarities of images are compared by histograms.
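
A minimal bag-of-visual-words sketch: descriptors are quantized against a small vocabulary (built here with plain k-means on random placeholder descriptors rather than a real vocabulary tree), each frame becomes a word-frequency histogram, and histogram similarity flags loop-closure candidates.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def bow_histogram(descriptors, vocabulary):
    """Quantise a frame's descriptors against the visual vocabulary and
    return a normalised word-frequency histogram."""
    words, _ = vq(descriptors, vocabulary)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
# Placeholder float descriptors (a real system would quantise ORB/SIFT
# descriptors extracted from training images); build a tiny 8-word vocabulary.
training = rng.normal(size=(500, 32))
vocabulary, _ = kmeans2(training, 8, minit="points")

frame_a = bow_histogram(rng.normal(size=(60, 32)), vocabulary)
frame_b = bow_histogram(rng.normal(size=(60, 32)), vocabulary)

# Cosine similarity between histograms; a high score for a non-adjacent
# key frame is treated as a loop-closure candidate.
score = frame_a @ frame_b / (np.linalg.norm(frame_a) * np.linalg.norm(frame_b))
print(round(float(score), 3))
```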

There are two types of errors in loop closure: False Positives (Perceptual Aliasing), where two different places are perceived as the same, and False Negatives (Perceptual Variability), where one place is perceived as two different places. A precision-recall curve can be used to quantify the performance of the system. The curve highlights the tradeoff between precision (the absence of false positives among detections) and recall (the fraction of true loop closures actually detected).

Fig. 13: Tweaking the algorithm to improve recall usually leads to more false positives due to the increased sensitivity to similarities in the image.
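
Precision and recall themselves are simple ratios over true positives, false positives, and false negatives; the counts below are made up to show the usual tradeoff.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP): how often a reported loop closure is real.
    Recall = TP / (TP + FN): how many true loop closures were found."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts from a loop-closure detector on some dataset: raising
# the detector's sensitivity typically trades precision for recall.
print(precision_recall(tp=80, fp=5, fn=40))    # conservative threshold
print(precision_recall(tp=110, fp=30, fn=10))  # permissive threshold
```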

Back-End Optimization

As drift in the pose estimate is inevitable, Camera Pose Optimization becomes crucial to recovering the motion of the camera. Traditionally, an Extended Kalman Filter (EKF) is used to minimize the noise between the motion model (an estimate of the robot’s future position) and the observation model (the actual measurement). It remains the first choice for small-scale estimation due to its implementation simplicity.

An alternative method is to use Bundle Adjustment (Graph Optimization), jointly optimizing the camera pose and the 3D structure parameters that are viewed and matched over multiple frames by minimizing a cost function. It draws ideas from the intersection of numerical methods and graph theory. Bundle Adjustment is increasingly favored over filtering partly due to the latter’s inherent inconsistency. Combined with sub-mapping, this method leads to higher efficiency.
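
To make the bundle adjustment idea concrete, here is a toy reprojection-error minimization with scipy: two cameras (rotations held at identity to keep the example small) and a handful of 3D points are jointly refined from noisy observations. Everything about the setup, including the focal length and noise levels, is synthetic.

```python
import numpy as np
from scipy.optimize import least_squares

def project(points, cam_t, f=700.0):
    """Pinhole projection of 3D points into a camera translated by cam_t
    (rotation held at identity to keep the toy problem small)."""
    p = points - cam_t
    return f * p[:, :2] / p[:, 2:3]

def residuals(params, observations, n_points):
    """Stacked reprojection errors over all cameras and points; this is
    the cost that bundle adjustment minimises."""
    cam_ts = params[:6].reshape(2, 3)          # two camera translations
    points = params[6:].reshape(n_points, 3)   # 3D structure
    errs = [project(points, t) - obs for t, obs in zip(cam_ts, observations)]
    return np.concatenate(errs).ravel()

rng = np.random.default_rng(1)
true_points = rng.uniform([-1, -1, 4], [1, 1, 6], size=(8, 3))
true_ts = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
observations = [project(true_points, t) + rng.normal(0, 0.5, (8, 2)) for t in true_ts]

# Start from perturbed guesses and jointly refine structure and motion.
x0 = np.concatenate([(true_ts + 0.1).ravel(), (true_points + 0.2).ravel()])
result = least_squares(residuals, x0, args=(observations, 8))
print(result.cost)
```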

Next Frontiers for Visual SLAM

The development of new camera sensors and the use of new computational tools have often been key drivers for Visual SLAM. There are many alternative sensors that can be leveraged for Visual SLAM, such as depth, light-field, and event-based cameras that are now becoming commodity hardware.

Range Cameras

Light-emitting depth cameras are not all that new. They became common with the introduction of the Microsoft Kinect for Xbox consoles. They operate according to different principles, such as structured light, time of flight, interferometry, and coded aperture. Structured light cameras work by triangulation, so their accuracy is limited by the distance between the camera and the pattern projector (the structured light source). By contrast, the accuracy of Time-of-Flight (ToF) cameras depends only on the measurement device, and they tend to provide the highest range accuracy (sub-millimeter at several meters). Since range cameras carry their own light source, they also work in dark and untextured scenes, which allows SLAM to be applied where passive cameras struggle.

Light-field Cameras

Contrary to standard cameras that only record the light intensity hitting each pixel, a light-field camera records both the intensity and the direction of light rays. Light-field cameras offer several advantages over standard cameras such as depth estimation, noise reduction, video stabilization, isolation of distractors, and specularity removal. Their optics also offer a wider aperture and depth of field compared with conventional cameras.

Event-based Cameras

Compared to frame-based cameras that send entire images at fixed frame rates, event-based cameras send only the local pixel-level changes caused by movement in a scene, at the time they occur. This type of camera can have a temporal latency of 1 ms, an update rate of up to 1 MHz, a dynamic range of up to 140 dB (vs. 60-70 dB for standard cameras), a power consumption of 20 mW (vs. 1.5 W for standard cameras), and very low bandwidth and storage requirements (because only intensity changes are transmitted). These properties enable the design of a new class of SLAM algorithms that can operate in high-speed motion scenes.

Deep Learning

Researchers have already shown that it is possible to train a deep neural network to regress the inter-frame pose between two images acquired from a moving robot directly from the original image pair, effectively replacing the standard geometry of visual odometry. Likewise it is possible to localize the 6DoF of a camera with regression forest and with deep convolutional neural networks, as well as estimate the depth of a scene (in effect, the map) from a single view solely as a function of the input image.
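
As a rough sketch of the regression idea (not any published architecture), a small convolutional network can map a stacked pair of frames to a 6-DoF inter-frame motion vector; all layer sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    """Toy CNN: takes two RGB frames stacked along the channel axis and
    regresses the inter-frame motion (3 translation + 3 rotation values)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 6)

    def forward(self, frame_pair):
        return self.head(self.features(frame_pair).flatten(1))

# Training would minimise e.g. an L1/L2 loss against ground-truth odometry.
model = PoseRegressor()
dummy = torch.randn(1, 6, 128, 128)   # two RGB frames, stacked
print(model(dummy).shape)             # torch.Size([1, 6])
```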

Toolkit

Appendix

Fig. 14: Comparison of feature detectors: properties and performance.

Fig. 15: Some popular Visual SLAM systems.

References & Resources

Localization Methods

Mautz, R., & Tilch, S. (2011). Survey of optical indoor positioning systems. 2011 International Conference on Indoor Positioning and Indoor Navigation.

Biswas, J., & Veloso, M. (2014). Multi-sensor Mobile Robot Localization for Diverse Environments. RoboCup 2013: Robot World Cup XVII Lecture Notes in Computer Science, 468–479.

Liu, T., Zhang, W., Gu, J., & Ren, H. (2013). A Laser Radar based mobile robot localization method. 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO).

What is SLAM?

C. Cadena and L. Carlone and H. Carrillo and Y. Latif and D. Scaramuzza and J. Neira and I. Reid and J.J. Leonard, “Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age”, in IEEE Transactions on Robotics 32 (6) pp 1309–1332, 2016

Fuentes-Pacheco, Jorge, José Ruiz-Ascencio, and Juan Manuel Rendón-Mancha. “Visual simultaneous localization and mapping: a survey.” Artificial Intelligence Review 43.1 (2015): 55–81.

Sensors

Castle, R., Klein, G., & Murray, D. (2010). Combining monoSLAM with object recognition for scene augmentation using a wearable camera. Image and Vision Computing, 28(11), 1548–1556.

Chong, T., Tang, X., Leng, C., Yogeswaran, M., Ng, O., & Chong, Y. (2015). Sensor Technologies and Simultaneous Localization and Mapping (SLAM). Procedia Computer Science,76, 174–179.

Liu, Q., Li, R., Hu, H., & Gu, D. (2016). Extracting Semantic Information from Visual Data: A Survey. Robotics, 5(1), 8.

Yousif, K., Bab-Hadiashar, A., & Hoseinnezhad, R. (2015). An Overview to Visual Odometry and Visual SLAM: Applications to Mobile Robotics. Intelligent Industrial Systems, 1(4), 289–311.

Visual-based SLAM Implementation Framework

Zunino, G., & Christensen, H. (2001). Simultaneous localization and mapping in domestic environments. Conference Documentation, International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI 2001), Cat. No. 01TH8590.

Hu, B., Zhang, X., Yang, G., & Jaeger, M. (2008). Objective Evaluation of 3D Reconstructed Plants and Trees from 2D Images. 2008 International Conference on Cyberworlds.

Durrant-Whyte, H., & Bailey, T. (2006). Simultaneous localization and mapping: part I. IEEE Robotics & Automation Magazine, 13(2), 99–110.

Bailey, T., & Durrant-Whyte, H. (2006). Simultaneous localization and mapping (SLAM): part II. IEEE Robotics & Automation Magazine, 13(3), 108–117.

Engel, J., Stuckler, J., & Cremers, D. (2015). Large-scale direct SLAM with stereo cameras. 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Next Frontiers for Visual SLAM

C. Cadena and L. Carlone and H. Carrillo and Y. Latif and D. Scaramuzza and J. Neira and I. Reid and J.J. Leonard, “Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age”, in IEEE Transactions on Robotics 32 (6) pp 1309–1332, 2016
