Tuesday, December 30, 2014


Introduction to Simultaneous Localization and Mapping (SLAM) with RGB-D

This independent study will focus on building a mapping framework capable of generating detailed 3D maps from a high resolution range camera, such as the Microsoft Kinect. The objective is to merge desirable properties of several recent techniques to achieve concurrent map construction and virtual camera rendering of large scale, high resolution scenes. The resulting system would be useful for augmented reality applications, enabling indirect illumination of virtual objects in a real map while continuously updating that map with a low-cost camera.

KinectFusion - Developed by Microsoft Research in 2011, this method reconstructs 3D polygonal meshes from a moving range camera.

The key idea behind KinectFusion is that with a high framerate sensor (e.g. 20+ fps), the majority of the information in one frame is duplicated in the next. The approach is to find the rigid transform that best aligns the new data with the existing model, thereby estimating the motion of the camera. Once the pose of the camera is known, the sensor data from the new frame can be used to update the map. This achieves simultaneous localization and mapping (SLAM) without any additional inertial sensing.
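A minimal sketch of this predict-align-update loop in C++ (KinectFusion aligns each frame against a raycast rendering of the current model). All types and function names here are illustrative placeholders, not the actual KinectFusion API:

    struct Pose       { /* 4x4 rigid-body transform */ };
    struct DepthFrame { /* depth image + camera intrinsics */ };
    struct VoxelMap   { /* truncated signed distance volume */ };

    DepthFrame raycastPrediction(const VoxelMap&, const Pose&);   // render model
    Pose alignICP(const DepthFrame& measured, const DepthFrame& predicted,
                  const Pose& initialGuess);                      // pose tracking
    void fuse(VoxelMap&, const DepthFrame&, const Pose&);         // map update

    void processFrame(VoxelMap& map, Pose& pose, const DepthFrame& frame) {
        // Predict the view from the current pose estimate, then refine the
        // pose by aligning the new frame against that prediction.
        DepthFrame predicted = raycastPrediction(map, pose);
        pose = alignICP(frame, predicted, pose);
        // With the pose known, fuse the new measurements into the map.
        fuse(map, frame, pose);
    }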

Image Credit: "KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera", Shahram Izadi et al.
RGB-D cameras provide enough information to recover 3D positions and surface normals, as well as color. This makes iterative closest point (ICP), which attempts to align one frame to the next, far more reliable despite the additional computational cost. The hard problem is computing this fast enough to keep up with the 20+ fps framerate. If that rate cannot be maintained and frames are skipped, the space of possible transformations that must be searched to align the frames grows. This increases the computational burden, slowing the computation down even further and creating a vicious cycle that causes tracking to fail. GPU computing that exploits the parallelism of the problem was critical to achieving the speeds required to avoid this downward spiral.
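To see why normals help, here is a sketch of the point-to-plane error metric commonly used in this style of ICP; the Vec3 type and function names are illustrative:

    struct Vec3 { float x, y, z; };

    float dot(const Vec3& a, const Vec3& b) {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    // p: source point already transformed by the current pose estimate.
    // q: matched destination point, with surface normal n at q.
    // The residual is the signed distance from p to the tangent plane at q;
    // ICP seeks the rigid transform minimizing the sum of squared residuals.
    float pointToPlaneResidual(const Vec3& p, const Vec3& q, const Vec3& n) {
        Vec3 d = { p.x - q.x, p.y - q.y, p.z - q.z };
        return dot(d, n);
    }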

Image Credit: "Real-time large scale dense RGB-D SLAM with volumetric fusion", Thomas Whelan et al. 

KinectFusion represents the map as a 3D voxel grid storing a signed distance function: each voxel holds its distance to the nearest surface. The values are truncated to avoid unnecessary computation in free space. Building this grid is far more manageable than storing a raw point cloud for each frame: the redundancy across frames both allows sensor noise to be smoothed away and avoids storing large amounts of duplicate data. A standard RGB-D camera can generate several GB of raw data within only a minute, while the voxel grid can represent the same data in only a few MB.
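A sketch of how a single voxel of such a truncated signed distance function (TSDF) might be updated, assuming the voxel has already been projected into the current depth image; the names and the simple unit-weight scheme are illustrative:

    #include <algorithm>

    struct Voxel { float tsdf = 1.0f; float weight = 0.0f; };

    // measuredDepth: depth reported by the sensor along this voxel's ray.
    // voxelDepth: the voxel's distance from the camera along that ray.
    // truncation: band (in meters) outside which values are clamped.
    void updateVoxel(Voxel& v, float measuredDepth, float voxelDepth,
                     float truncation) {
        float sdf = measuredDepth - voxelDepth;         // + in front of surface
        if (sdf < -truncation) return;                  // occluded: skip update
        float tsdf = std::min(1.0f, sdf / truncation);  // truncate free space
        // Weighted running average smooths sensor noise across frames.
        v.tsdf = (v.tsdf * v.weight + tsdf) / (v.weight + 1.0f);
        v.weight += 1.0f;
    }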

Image Credit: "Real-time large scale dense RGB-D SLAM with volumetric fusion", Thomas Whelan et al.
Mapping larger scale scenes with KinectFusion becomes messy due to the fixed size of the voxel grid. Large scale implementations such as Kintinuous and KinFu slide the center of the grid whenever the camera moves past a threshold in any direction. However, this requires extracting a mesh from the region of space that is no longer covered by the grid, and an additional scene graph must then be maintained relating the grid to these meshes. And what happens if the camera moves back into this area? Recent extensions of the KinectFusion approach have been attempting to resolve these undesirable consequences of using a single fixed-resolution voxel grid.
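A sketch of the shift test such systems apply each frame; the names and the per-axis threshold policy are illustrative rather than Kintinuous' actual implementation:

    #include <cmath>

    struct Vec3f { float x, y, z; };

    // True when the camera has drifted more than `threshold` meters from the
    // volume center along any axis, triggering a re-center of the grid and
    // mesh extraction of the region left behind.
    bool needsShift(const Vec3f& cam, const Vec3f& center, float threshold) {
        return std::fabs(cam.x - center.x) > threshold ||
               std::fabs(cam.y - center.y) > threshold ||
               std::fabs(cam.z - center.z) > threshold;
    }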
Image Credit: "Real-time large scale dense RGB-D SLAM with volumetric fusion", Thomas Whelan et al.
Additional recent work with Kintinuous has demonstrated loop closure when revisiting previously viewed scenes, as well as colored voxel grids, which require considerably more memory and thus force a smaller grid size. This can be an issue for longer range, high resolution cameras.

   Advantages
  • Update Speed (Parallel)
  • High-Resolution

   Disadvantages
  • Scalability (Fixed Grid Size)
  • Rendering (Extract Mesh)
  • Static Scenes Only


OctoMap - This is a commonly used framework in the robotics community for building environmental maps from a variety of noisy sensors (e.g. LIDAR, stereo cameras).


Image Credit: "OctoMap: An Efficient Probabilistic 3D Mapping Framework Based on Octrees", Armin Hornung et al.

OctoMap uses an octree data structure to natively handle multiple levels of resolution. The pointer-based structure avoids allocating memory for empty space, unlike the voxel grid of KinectFusion, which allocates memory for free space and leaves it unused. The structure also conveniently carries additional data such as color and timestamps without extra bookkeeping.
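A sketch of such a node in C++; the layout is illustrative, not OctoMap's actual classes:

    #include <cstdint>

    struct OctreeNode {
        OctreeNode* children[8] = {nullptr};  // null = unexpanded subtree
        float logOdds = 0.0f;                 // occupancy estimate (see below)
        std::uint8_t r = 0, g = 0, b = 0;     // optional color payload
        std::uint32_t stamp = 0;              // optional timestamp payload
    };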

Image Credit: "OctoMap: An Efficient Probabilistic 3D Mapping Framework Based on Octrees", Armin Hornung et al.
This framework does not explicitly include camera localization; it requires an externally supplied camera pose prediction and update before each new frame is integrated into the map. Many approaches produce this with a completely separate estimate from inertial sensing and dead reckoning. Others register new scans against the map itself, producing filtered estimates with methods such as a particle filter.

Image Credit: "OctoMap: An Efficient Probabilistic 3D Mapping Framework Based on Octrees", Armin Hornung et al.

OctoMap is a probabilistic framework in which the log-odds of occupancy are stored in the octree. Each sensor is assigned hit and miss probabilities that represent its noise characteristics. Nodes in the tree are updated by logging each point of a point cloud as a hit, while every cell along the ray from the camera to that point is logged as a miss. This process runs serially on a CPU, looping over each point in each frame. The framework is most commonly used with LIDAR sensors, which produce relatively few points per scan and thus benefit little from parallelization. An RGB-D sensor, by contrast, delivers millions of points every second, a workload that demands parallelization.
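A sketch of the log-odds bookkeeping; the hit/miss probabilities and clamping bounds below are assumed example values, not necessarily the defaults of any implementation:

    #include <algorithm>
    #include <cmath>

    const float kLogOddsHit  = std::log(0.7f / 0.3f);  // assumed p_hit  = 0.7
    const float kLogOddsMiss = std::log(0.4f / 0.6f);  // assumed p_miss = 0.4
    const float kClampMin = -2.0f, kClampMax = 3.5f;   // keeps map responsive

    void integrateHit(float& logOdds) {
        logOdds = std::min(logOdds + kLogOddsHit, kClampMax);
    }
    void integrateMiss(float& logOdds) {
        logOdds = std::max(logOdds + kLogOddsMiss, kClampMin);
    }

    // Recover the occupancy probability from the stored log-odds.
    float occupancy(float logOdds) {
        return 1.0f / (1.0f + std::exp(-logOdds));
    }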

Image Credit: "OctoMap: An Efficient Probabilistic 3D Mapping Framework Based on Octrees", Armin Hornung et al.
The scalable nature of OctoMap enables very large maps to be generated, as moving outside of the current map only requires creating a new root node above the old one.
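A sketch of that expansion, reusing the OctreeNode sketch from above; oldRootOctant selects which octant of the new, larger volume the old map occupies:

    // The old root becomes one child of a new, larger root, doubling the
    // covered extent in each dimension without touching any existing node.
    OctreeNode* growUpward(OctreeNode* oldRoot, int oldRootOctant) {
        OctreeNode* newRoot = new OctreeNode();
        newRoot->children[oldRootOctant] = oldRoot;
        return newRoot;
    }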

   Advantages
  • Scalability
  • Dynamic Scenes

   Disadvantages
  • Rendering (Extract Cube Centers/Sizes)
  • Slow Updates (Serial)
  • External Localization


GigaVoxels - Cyril Crassin's Ph.D. thesis provides a framework for out-of-core GPU sparse octree data management, along with a voxel cone tracing method for physically based rendering directly from octree data.

Image Credit: "GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large and Detailed Scenes", Cyril Crassin.

GigaVoxels is not a SLAM framework. It is relevant to this project because both of the aforementioned methods produce voxel-based data structures that must ordinarily be translated into triangulated geometry before rendering. GigaVoxels instead provides a cone tracing framework that renders indirect illumination effects without the Monte Carlo integration typically required for polygonal geometry.


Image Credit: "GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large and Detailed Scenes", Cyril Crassin.

Voxel cone tracing samples the octree with texture interpolation at a depth matched to the cone's cross section. If all of the needed lighting information is incorporated into the octree, mip-mapping the values into the interior branches lets texture interpolation perform the integration step implicitly.
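A sketch of that sampling loop under these assumptions; sampleOctree() stands in for a trilinearly interpolated fetch from the mip-mapped octree and is hypothetical:

    #include <algorithm>
    #include <cmath>

    struct Vec4 { float r, g, b, a; };
    Vec4 sampleOctree(float x, float y, float z, float mip);  // assumed fetch

    Vec4 traceCone(float ox, float oy, float oz,   // cone apex
                   float dx, float dy, float dz,   // unit direction
                   float tanHalfAngle, float leafSize, float maxDist) {
        Vec4 accum = {0.0f, 0.0f, 0.0f, 0.0f};
        float t = leafSize;                        // skip self-intersection
        while (t < maxDist && accum.a < 1.0f) {
            // The cone's diameter at distance t picks the octree mip level
            // whose voxel footprint matches the cone's cross section.
            float diameter = 2.0f * tanHalfAngle * t;
            float mip = std::log2(std::max(1.0f, diameter / leafSize));
            Vec4 s = sampleOctree(ox + t * dx, oy + t * dy, oz + t * dz, mip);
            // Front-to-back compositing integrates the pre-filtered values.
            float w = (1.0f - accum.a) * s.a;
            accum.r += w * s.r; accum.g += w * s.g; accum.b += w * s.b;
            accum.a += w;
            t += 0.5f * std::max(diameter, leafSize);  // step with cone width
        }
        return accum;
    }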

Image Credit: "GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large and Detailed Scenes", Cyril Crassin.

This work also provides a robust stack-less GPU octree data structure that can be constructed, traversed, and updated in parallel. It even provides an out-of-core data management process that coordinates transfer of octree data between the CPU and GPU memory pools as needed for very large scenes.
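For a flavor of stack-free traversal, here is a sketch of a point lookup that descends by octant selection alone, so many GPU threads can run it independently with no per-thread stack. It reuses the OctreeNode sketch from the OctoMap section; GigaVoxels' actual node pool uses compact child indices in GPU memory rather than raw pointers:

    // Returns the deepest existing node containing point (x, y, z), given the
    // root's center (cx, cy, cz) and half-extent of the root volume.
    OctreeNode* descend(OctreeNode* node, float x, float y, float z,
                        float cx, float cy, float cz, float halfSize) {
        while (node) {
            int octant = (x > cx ? 1 : 0) | (y > cy ? 2 : 0) | (z > cz ? 4 : 0);
            OctreeNode* child = node->children[octant];
            if (!child) return node;   // finest node covering the point
            halfSize *= 0.5f;          // move the center into the chosen octant
            cx += (x > cx ? halfSize : -halfSize);
            cy += (y > cy ? halfSize : -halfSize);
            cz += (z > cz ? halfSize : -halfSize);
            node = child;
        }
        return nullptr;
    }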

Image Credit: "GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large and Detailed Scenes", Cyril Crassin.

The key limitation is that most of the computer graphics community is heavily invested in triangle mesh pipelines, which means voxelization and octree construction must be performed frequently, and both are slow. For SLAM applications, however, the geometry is typically generated in voxel form to begin with, so rendering directly from that structure is actually more convenient.

   Advantages
  • Indirect Illumination
  • Level-of-detail
  • Out-of-core data management

   Disadvantages
  • Slow Voxelization of Polygonal Meshes
  • Slow(er) Rendering than Hardware Rasterization

Octree-SLAM - This project will attempt to merge the desirable qualities of these systems into a localization, mapping, and rendering system useful for augmented reality applications. As RGB-D cameras become more widespread on smartphones, this work may provide app developers with a way to render realistic 3D worlds from the data.
   
   Potential Qualities
  • Update Speed (Parallel)
  • High-Resolution
  • Scalability
  • Dynamic Scenes
  • Indirect Illumination
  • Level-of-detail
  • Out-of-core (Large/Dense Scenes)
  • No Meshing or Voxelization Steps
Follow the progress of the source code on GitHub.