Sunday, January 18, 2015

System Diagram and Octree Map Update Deep-dive


System Diagram

I have drawn a system overview diagram to summarize the system that I described in my last post.


The black arrows designate the flow of execution, while the blue arrows are to clarify the passage of data from the two render passes. The dotted lines indicate modules that are responsible for directly interfacing with hardware devices.

Octree Map Updates

Here I will describe in more detail one of the complex sub-components of the system: the process of updating an octree map from an aligned RGB-D frame.

The process of updating the global map will be made up of several CUDA kernels parallelizing execution. Here is a description of each CUDA kernel in the process.

1.) Determine if any vertex is outside of the current map

Prior to this step of the pipeline, we have computed the 3D position of each pixel in the camera frame with respect to the coordinates of the octree map.  This kernel will parallelize over the vertices and determine if it is outside of the current map bounds. If it is, it computes how many additional layers of depth are required for it to be within the map, and writes it atomically to a global value if its value is greater than the one currently held. 

A CPU stage will then increase the global map to have a new root node and additional layers to ensure that the map will be big enough to contain the observed frame. It may be worth keeping this entire step CPU-bound as it is will be compute-lite. However, the vertices will be on the GPU at this point, so the data would have to be copied back and forth between devices for CPU computation here.

2.) Compute voxel keys for each vertex to create list of occupied cells.

This kernel will parallelize over each vertex and compute the Morton Code at the maximum depth of the octree.

A Morton Code is a condensed form representing a key within an octree. Each depth of the tree is represented using an additional 3 bits, as it is binary in the x, y, and z coordinates. 

Example: For an octree centered at the origin with bounding box at (-1,-1,-1) and (1,1,1), the vertex (0.7, -0.2, 0.1) has the depth 1 Morton Code 101, and the depth 2 Morton Code 101110.

This step determines which cells in the octree have been observed as occupied in the current camera frame. The number of occupied cells is constant, and thus we can pre-allocate the memory needed to store this data and avoid any need for a global memory counter (as will be needed in the next kernel). 

3.) Compute voxel keys for each line segment from the camera position to each vertex and add then to a global list of unoccupied voxels.

This kernel is similar to the previous, though instead of computing a single Morton Code for one point, it computes all Morton Codes along the line segment from the camera origin point to the vertex. This step will not have a constant output size as it is impossible to know how many voxels a line segment will intersect. Thus, we will need to conservatively allocate memory for the output, and use a global memory index counter that is updated atomically.

The occupied voxels will be calculated by using the 3D extension of Bresenham's line algorithm, and the Morton Code will be computed for each voxel center.

4.) Remove duplicate occupied and unoccupied voxels from the lists, and store multiplicities.

An upcoming step is going to involve parallelizing over voxels to be updated. However, many of the threads from the previous stages will involve updating the same voxels due to the close proximity of the rays originating from the same camera position. 

In this step, occupied cells should average the color value of all pixels when performing this compaction.

To make this step more efficient by avoiding multiple threads atomically updating the cells, we should first consolidate these updates into a list of only unique voxels.  We then also need an additional list of integer counts that represent the number of times observed and unobserved in this frame.

5.) Remove all unoccupied voxels that are in the occupied list.

OctoMap found that it is necessary to avoid updating a voxel as both observed and unobserved in the same frame. They chose to abide by the rule that if any ray observes a cell to be occupied, then it cannot be observed also as unoccupied in that same frame. To do this, we next need to parallelize over the unoccupied cell list and remove them if they are in the occupied list.

It is unclear what the most efficient way to do this would be. It may be beneficial to first sort the occupied list so that it can be queried for a value using binary search. On the other hand, it might be most efficient to sort both lists and perform this update step serially on the CPU, copying data back and forth. It may be worth evaluating these two methods.

6.) For all occupied and unoccupied voxels, update the map at the maximum depth.

This step involves parallelizing over all voxels in both lists, and either adding or subtracting the alpha channel by the multiplicity associated with the voxel in this update. Unobserved voxels will use subtraction, and observed voxels will use addition. 

This alpha channel can then be converted to a probability that the voxel is occupied by multiplying by the probability of hit/miss and exponentiating (the alpha channel is a condensed form of log probability of occupation).

Voxels observed as occupied will also update its color value using the color from the original camera pixel.

7.) Mip-map updates into inner layers of the octree map.

The last step involves updating the rest of the tree with a kernel execution pass for each layer of depth in the octree. In this step, each parent node will take on the mean value of its child nodes for RGBA.

It would be ideal to optimize this step by avoiding mip-mapping parts of the tree that were not updated in this frame. An additional pruning step that prunes redundant leaves from the tree, particularly for areas of space where the alpha channel is very small (very likely to be empty) may be beneficial here as well.


Friday, January 9, 2015

System Overview


Week 1 Update:

This week, I have been working on improved mouse/keyboard camera control to the code that I have been developing. I have also added the brick pool data structure described in GigaVoxels, which is used for trilinear interpolation in texture memory when rendering an octree through cone tracing.

Here is a general outline for the system that I expect to be developing based on a merger of concepts from KinectFusion, OctoMap, and GigaVoxels. The steps of the pipeline are:
  1. Receive Image from RGB-D Camera
  2. Compute Vertices/Normals for Each Pixel
  3. Predict Real Camera Pose through ICP
  4. Update Virtual Camera Pose with Keyboard Input
  5. Transfer CPU/GPU Octree Memory based on Camera Poses
  6. Cast Rays from Predicted Camera to Camera Points to Update Octree Map
  7. Render Virtual Camera on Screen
  8. Render Predicted Camera Image to Texture for ICP in the Next Iteration
1.) Receive Image from RGB-D Camera

This first stage provides raw sensor input that will be used to build a virtual map. The RGB-D camera will provide both color and depth information. I have purchased a Structure Sensor that will be used as the primary device for this project. The Structure Sensor is a small active depth camera backed by Kickstarter that is intended for use with mobile devices. The camera provides 640x480 resolution at more than 30 fps. 
The Structure Sensor is a product of Occipital, Inc.
A very appealing feature is that Occipital, the company behind the device, is continuing to support the OpenNI project on Github for the device. OpenNI is the Open Natural Interface standard that is being developed as an abstract interface to these RGB-D devices, though support for it had ended a few years ago. It is nice to see Occipital pick up support of the project, and it is desirable to develop this project in a way that is abstracted from a particular device driver.

Interfacing to the device through OpenNI should only take a few days work. I will start that task next week as the device has just arrived.

2.) Compute Vertices/Normals for Each Pixel

This is the easiest step in the pipeline. The camera stream only provides position and color information natively. Normals are necessary for both for localization and rendering. An early step of the pipeline will use CUDA to parallelize over each pixel in the camera image and compute the vertex for each point using the camera calibration matrix, then computing normals from cross products of adjacent vertices in each direction. An additional bilateral filtering step may be necessary here as well, in which case we will follow a similar method as outlined in KinectFusion.

3.) Predict Real Camera Pose from ICP

In order to use the new camera image to update a world map, we must first predict the position and orientation of the camera relative to the world. This can be done iteratively, where a pose estimate is made in each frame relative to the previous only. For slow camera motions and fast frame rates, these motions will be very small. The net pose of the camera is the composition of transforms of motions between each frame since the start of the process.

KinectFusion makes these iterative predictions using Iterative Closest Point (ICP). This process selects pairs of corresponding points between data sets, then finds an affine transform that can be applied to one set that minimizes the sum of squared distances between all correspondence pairs. Part of what makes this a challenging problem is how to accurately choose corresponding pairs. KinectFusion does this by projecting both frames into the camera space, and points occupying the same 2D pixel location should match. 


KinectFusion finds a camera pose T that minimizes this quantity.

KinectFusion also attempts to minimize the distance between points only in the direction of the normal between the points. This is essentially performing a point-to-plane match that tends to be less noisy. This is because it is very unlikely that two scans will sample identical points on a surface, but it is likely that points will belong to the same surface.

KinectFusion throws out point correspondences where either the difference between normals, or the distance between vertices exceed a threshold. The original KinectFusion did not incorporate color, but we will include a threshold on the difference between color values as well.

The process of ICP in KinectFusion computes terms for each correspondence pair in parallel on the GPU, then compacts/sums before minimizing with Cholesky decomposition on the CPU. 

4.) Update Virtual Camera Pose with Keyboard Input

This is another very simple task in the pipeline. This week, I have developed sufficient keyboard and mouse camera control for this project. These camera controls query GLFW in each frame to determine the new camera pose, rather than a callback-based approach that we previously were using. It uses WASD keys to translate the camera origin in a direction based on its orientation. The mouse can rotate the camera by clicking and dragging. While dragging, the mouse disappears and cannot leave the screen. This allows the user to continuously rotate the camera without screen space limitations. Finally, the scroll wheel adjusts the projection zoom. This control scheme will be familiar to anyone who has played a modern PC first-person-shooter.

5.) Transfer CPU/GPU Octree Memory based on Camera Poses

This will be one of the more challenging aspects of the project. This will likely require a stack-based octree data structure on the CPU, with an associated API that can send/receive portions of the tree to the GPU, converting to its stack-less structure. It will draw upon work from GigaVoxels that uses a Least Recently Updated (LRU) method for determining when data can safely be removed from active memory. This aspect of the project will require additional algorithmic development or research.

6.) Cast Rays from Predicted Camera to Camera Points to Update Octree Map

This step actually builds the octree map from camera color and depth. The update method will be heavily based upon OctoMap, though may include additional GPU optimization. This will cast a ray from the origin of the real camera to each point in a point cloud generated from the depth image. The points along the ray update the map with "miss" observations, and the end point will be updated with a "hit." This will decrease or increase the probability that a voxel is occupied by a value determined by a model of the camera. This model will contain a probability of hit and a probability of miss value, which will be determined experimentally. We will use this probability as the alpha channel of the voxel.

7.) Render Virtual Camera on Screen

At this point, we need to render the virtual camera to the screen. Remember, the virtual camera does not need to be at the same position and orientation in the world as the real camera. This step is essentially the reverse of the previous, initially it will cast a ray in the octree and accumulate color weighted by alpha channel until the total alpha reaches 1.

An improved version of this step will use cone tracing. This approach will use higher levels of the octree as the ray steps further from the camera, behaving like a cone. This will provide global illumination effects and is based upon the work of GigaVoxels.

8.) Render Predicted Camera Image to Texture for ICP in the Next Iteration

The last step is identical to the previous in terms of functionality. However, instead of rendering from the virtual camera that a user is controlling on the computer, this will render an image from the predicted view of the actual camera. This image represents the model that the camera expects that it sees, and the localization method will use ICP to match to this image to the camera image in the next frame. This step is necessary to avoid localization drift that occurs when matching incoming frames only to the previous. Instead, this matches the new frame with a globally consistent model.