Sunday, April 19, 2015

SVO Cone Tracing

I have implemented voxel cone tracing based on Cyril Crassin's GigaVoxels work for physically based rendering of a reconstructed scene. To my knowledge, this is the first use of cone tracing to render real camera data.

Here is a raw camera image from a Kinect:

This is the reconstructed SVO rendered from the same camera view by extracting voxels at 1 cm resolution and drawing them with instanced rendering in OpenGL. This is the typical approach to rendering reconstructed 3D maps.

Now here is the same SVO rendered with cone tracing instead. The detailed texture of the hardwood floor is now captured in the rendered image. The ray-marching process seems to produce artifacts where rays step through the corner of the wall on the right side.

And here is another cone tracing image rendered from an alternate camera view:

Future work will use cone tracing to render artificial objects in these real 3D scenes for augmented reality with consistent lighting.

Sunday, March 29, 2015

Performance Enhancement of Octree Mapping

Key Resolution

Although the sparse octree does not allocate every node of the tree in memory, I have been using Morton Codes to uniquely identify voxels. Here is an example code:
1 001 010 111
The code starts with a leading 1 to identify the length of the code, and thus the depth in the tree. After that, the code is made up of a series of 3-bit tuples that indicate a high or low value on the binary split of the x, y, and z dimensions respectively.
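To make the encoding concrete, here is a minimal sketch of building such a key and recovering its depth. The function names are hypothetical and not taken from the actual implementation; the tuples are simply appended in the order they are written above.

```cpp
#include <cstdint>

// Append one 3-bit tuple (x, y, z split bits) to a key.
// A key with no tuples is just the leading 1.
inline uint64_t appendChild(uint64_t key, unsigned child) {
    return (key << 3) | (child & 0x7u);
}

// Depth of a key: the number of 3-bit tuples below the leading 1.
inline int keyDepth(uint64_t key) {
    int bits = 0;
    while (key > 1) {  // count the bits below the leading 1
        key >>= 1;
        ++bits;
    }
    return bits / 3;
}
```

Starting from 1 and appending 001, 010, and 111 gives the binary value 1 001 010 111, which `keyDepth` reports as depth 3.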

Using a 32-bit integer, this can represent 10 levels of depth in the tree. However, this is insufficient for mapping with a Kinect camera. The Kinect has a range of 1-10 meters in depth, and a back-of-the-envelope calculation of its lateral resolution (480 pixels, 5 m range, 43 degree field of view) gives roughly 8 mm per pixel, so the camera will typically provide sub-centimeter resolution. With 10 levels of depth, a volume with a 10 meter edge can only achieve about 1 cm resolution. Therefore, I have transitioned to representing these keys with long integers (64-bit), which allow 21 levels of depth. That is enough to cover a volume about 2 km on a side at millimeter precision, if needed.
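The arithmetic behind these numbers can be checked with a short sketch. It assumes a simple pinhole model for the pixel footprint and one reserved bit for the leading 1 of the key; the helper names are made up for illustration.

```cpp
#include <cmath>

// Approximate width covered by one pixel at a given range (pinhole model).
inline double pixelFootprint(double range_m, double fov_deg, int pixels) {
    const double kPi = 3.14159265358979323846;
    const double half_angle = 0.5 * fov_deg * kPi / 180.0;
    return 2.0 * range_m * std::tan(half_angle) / pixels;
}

// Tree levels that fit in a key of key_bits bits (one bit is the leading 1).
inline int maxDepth(int key_bits) {
    return (key_bits - 1) / 3;
}

// Leaf edge length when a cube of the given edge is halved `depth` times.
inline double leafSize(double edge_m, int depth) {
    return edge_m / std::ldexp(1.0, depth);  // edge / 2^depth
}
```

With these, `pixelFootprint(5.0, 43.0, 480)` comes out near 8.2 mm, `maxDepth(32)` is 10, `maxDepth(64)` is 21, and `leafSize(10.0, 10)` is just under 1 cm.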

Here is a look at the enhanced resolution that can now be achieved. The first is an older image using ~2.5 cm resolution, and the second is enhanced to 4 mm resolution. This level of detail adequately captures the data provided by the Kinect camera.

Key Sorting

Another modification that I made to my use of these keys is the ordering of the 3-bit tuples. Originally, I found it very useful to place the most significant tuple (the one that selects a child of the root) in the rightmost, lowest-order bits. The benefit was that the octree is almost always traversed starting from the root node. The child node of the root could be obtained from the key by performing a bitwise-AND with 0x7. For the next depth, the key could be shifted right by 3 bits and the same process repeated. The maximum depth of the key is reached when the remaining code is 1.
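A minimal sketch of that root-first traversal with the original layout might look like the following. The function name is hypothetical, and the loop only collects the child indices rather than walking a real tree.

```cpp
#include <cstdint>
#include <vector>

// Original layout: the root's tuple sits in the lowest 3 bits and the
// leading 1 sits above the deepest tuple. Returns child indices from the
// root downward.
inline std::vector<unsigned> childPathRootLow(uint64_t key) {
    std::vector<unsigned> path;
    while (key > 1) {                                      // 1 means no tuples remain
        path.push_back(static_cast<unsigned>(key & 0x7));  // child at this depth
        key >>= 3;                                         // move to the next depth
    }
    return path;
}
```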

However, several parts of my algorithms involve getting unique keys from a list, truncated at different levels of depth. With the root's tuple stored in the lowest-order bits, truncating the keys changes their relative ordering. This means that every call to "unique" must be preceded by a call to "sort." Together, these make up the slowest part of the octree update process. The most straightforward way to improve this was to remove the need to sort repeatedly by switching the order of the tuples. While this is less convenient for octree traversal, I've found it to result in a worthwhile performance gain. Now, getting the most significant tuple from the key involves finding the position of the leading 1, then extracting the following 3 bits. Updating the key for the next depth requires subtracting the leading 1, along with these 3 bits, then adding the leading 1 at 3 bits to the right of its previous location. Although this is messier, the complexity can easily be encapsulated by a function.
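The key manipulation for the new layout could be encapsulated roughly as follows. This is a sketch under the assumption that the root's tuple sits directly below the leading 1; the function names are hypothetical, and a GPU version would likely use a count-leading-zeros intrinsic instead of the loop.

```cpp
#include <cstdint>

// Bit index of the leading 1. The key must be at least 1.
inline int leadingOneBit(uint64_t key) {
    int pos = 0;
    while (key >>= 1) ++pos;
    return pos;
}

// Child index at the root: the 3 bits just below the leading 1.
// The key must contain at least one tuple (key >= 8).
inline unsigned topChild(uint64_t key) {
    const int pos = leadingOneBit(key);
    return static_cast<unsigned>((key >> (pos - 3)) & 0x7);
}

// Key for the next depth: drop the leading 1 and the root tuple, then put
// the leading 1 back three bits lower. The key must be >= 8.
inline uint64_t descendKey(uint64_t key) {
    const int pos = leadingOneBit(key);
    const uint64_t rest = key & ((uint64_t{1} << (pos - 3)) - 1);
    return rest | (uint64_t{1} << (pos - 3));
}

// Truncate a key to a shallower depth by dropping its deepest tuples.
// A right shift keeps same-depth keys in their sorted order.
inline uint64_t truncateKey(uint64_t key, int levels) {
    return key >> (3 * levels);
}
```

For the key 1 001 010 111, `topChild` returns 1 and `descendKey` returns 1 010 111. Because truncation is now a plain right shift, a sorted list of same-depth keys stays sorted after truncation.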

The process of traversing the octree to update inner nodes following changes to leaves uses this unique-key paradigm. It starts with a set of keys identifying the updated leaves. Each pass reduces the depth of the keys by a level, and updates the nodes at this higher level from the updated child values. However, at higher levels of the tree, keys that were originally unique start to collapse onto the same parent nodes. We benefit from periodically reducing to unique keys, and prefer not to re-sort each time. For an identical scene and resolution, I've found that this change reduced the process from 14-17 ms to 10-12 ms.
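One pass of this reduction can be sketched on the CPU as below. It assumes the keys are already sorted, all at the same depth, and use the layout with the root's tuple below the leading 1. The standard library calls stand in for whatever parallel primitives the GPU code uses.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Move a sorted list of same-depth keys up one level and drop duplicates.
// No sort is needed: the right shift preserves the ordering.
inline std::vector<uint64_t> parentKeys(std::vector<uint64_t> keys) {
    for (uint64_t& k : keys) k >>= 3;  // drop the deepest tuple
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
    return keys;
}
```

Calling `parentKeys` repeatedly walks the set of modified nodes toward the root, and the list shrinks as siblings merge into shared parents.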

Sunday, March 8, 2015

Octree Map Construction

I've now implemented a set of CUDA kernels that can update an octree from an input point cloud. The steps involve:

Update Octree with Point Cloud
1.) Transform the points to map coordinates (based on a camera pose estimate)
2.) Compute the axis-aligned bounding box of the points, using thrust reduction.
3.) Resize the octree if necessary to ensure that it contains all points.
4.) Push the necessary sub-octree to the GPU to ensure that it can be updated and extracted in parallel.
5.) Compute the octree key for each point.
6.) Determine which nodes will need to be subdivided, and how many new nodes will be created.
7.) Create a new node pool that has enough memory to include the new nodes, and copy the old nodes into it.
8.) Now that there is memory available, split the nodes determined in step 6.
9.) Update the nodes with keys from step 5 and the color values from the input cloud.
10.) Continually shift the keys upwards to determine which nodes have modified children, and re-evaluate those nodes by averaging their children.

Extract Voxels from Octree
1.) Compute keys for tree leaves. This involves a parallel pass for each tree depth and a thrust removal step.
2.) Compute the positions and size for each key.
3.) Extract the color of each key from the octree.
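Step 2 of the extraction can be illustrated with a small sketch that turns a key into a voxel center and edge length. It assumes a cubic root volume given by its minimum corner and edge length, keys with the root's tuple directly below the leading 1, and tuple bits ordered x, y, z. The struct and function names are hypothetical.

```cpp
#include <cstdint>

struct Voxel {
    double cx, cy, cz;  // center of the voxel
    double size;        // edge length of the voxel
};

// Walk the tuples from the root down, halving the cell at each level.
inline Voxel voxelFromKey(uint64_t key, double min_x, double min_y,
                          double min_z, double edge) {
    int pos = 0;  // bit index of the leading 1
    for (uint64_t k = key; k >>= 1;) ++pos;

    double size = edge;
    for (int shift = pos - 3; shift >= 0; shift -= 3) {
        const unsigned tuple = static_cast<unsigned>((key >> shift) & 0x7);
        size *= 0.5;
        if (tuple & 4) min_x += size;  // high half in x
        if (tuple & 2) min_y += size;  // high half in y
        if (tuple & 1) min_z += size;  // high half in z
    }
    return {min_x + 0.5 * size, min_y + 0.5 * size, min_z + 0.5 * size, size};
}
```

On the GPU, each thread would run this loop for one key and write the result into the buffer used for instanced drawing.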

At this point, I am only adding points to the map by counting the points as "hit" observations. The complete solution will involve ray casting from the camera origin to each point, and using the points along the line as "misses."

Here are a few screenshots of the map rendered with OpenGL, using instanced rendering of cubes with a Texture Buffer Object specifying the cube locations and colors. The first image is using voxels with an edge length of 2.5 cm. The second is the same viewpoint, but instead the octree is only updated at 10 cm resolution.

This is another shot of the same room, though with a different camera angle with both the Kinect and the virtual camera. This shows more detail of objects sitting on a table.

Performance for this process is relatively fast compared to earlier work with camera localization. At this point, I have only a naive implementation of all kernels without any speed optimization. With an NVidia GTX 770 and a Kinect data stream of 640x480 pixels at 30 fps, these are the times for each step:

1.) Update Octree with Point Cloud - 20 ms
2.) Extract Voxels from Octree - 6 ms
3.) Draw Voxels with OpenGL - 2 ms

Saturday, February 14, 2015

Out-of-Core Octree Management

The sparse octree used to represent a reconstructed 3D map will quickly grow too large to fit entirely in GPU memory. Reconstructing a normal office room at 1 cm resolution will likely take as much as 6-8 GB, but the Nvidia GTX 770 that I am using has only 2 GB.

To handle this, I have developed an out-of-core memory management framework for the octree. At first glance, this framework is a standard stack-based octree on the CPU. However, each node in the tree has an additional boolean flag indicating whether the node is a subtree that is located in linear GPU memory. It also holds a pointer to its location on the GPU as well as its size. The stackless octree data on the GPU is represented by 64 bits per node, using the same format as GigaVoxels. Here is a summary of the OctreeNode class's data elements:

class OctreeNode {
  //Flag whether the node is at max resolution,
  //in which case it will never be subdivided
  bool is_max_depth_;

  //Flag whether the node's children have been initialized
  bool has_children_;

  //Flag whether the node's data is on the GPU
  bool on_gpu_;

  //Pointer to gpu data, if the node is GPU backed
  int* gpu_data_;

  //The number of children on the GPU
  int gpu_size_;

  //Child nodes if it is CPU backed
  OctreeNode* children_[8];

  //Data in the node if it is CPU backed
  int data_;
};

Next, I gave these nodes an API that can push and pull the data to and from the GPU. The push method uses recursion to convert the stack-based data into a linear array in CPU memory, then copies the memory to the GPU. It avoids the need to over-allocate or reallocate the linear memory by first recursing through the node's children to determine the size of the subtree. The pull method copies the linear memory back to the CPU, then uses it to recursively rebuild the stack-based structure.
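The two-pass idea behind the push method (count first, then flatten) can be sketched as follows. The node types here are simplified stand-ins, and the linear layout (an index to a block of 8 children plus a data word) is an assumption for illustration, not the GigaVoxels format itself. The copy to the GPU is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the CPU-side pointer-based node.
struct CpuNode {
    bool has_children = false;
    CpuNode* children[8] = {};
    int data = 0;
};

// Hypothetical linear node: children of one node are stored contiguously.
struct LinearNode {
    int32_t first_child;  // index of the first of 8 children, -1 for a leaf
    int32_t data;
};

// Pass 1: count the nodes so the linear array is allocated exactly once.
inline std::size_t subtreeSize(const CpuNode& node) {
    std::size_t count = 1;
    if (node.has_children)
        for (const CpuNode* child : node.children) count += subtreeSize(*child);
    return count;
}

// Pass 2: write each node, reserving a block of 8 slots for its children.
inline void flattenInto(const CpuNode& node, std::size_t index,
                        std::vector<LinearNode>& out, std::size_t& next_free) {
    out[index].data = node.data;
    if (!node.has_children) {
        out[index].first_child = -1;
        return;
    }
    const std::size_t base = next_free;
    next_free += 8;
    out[index].first_child = static_cast<int32_t>(base);
    for (int i = 0; i < 8; ++i)
        flattenInto(*node.children[i], base + i, out, next_free);
}

inline std::vector<LinearNode> flatten(const CpuNode& root) {
    std::vector<LinearNode> out(subtreeSize(root));
    std::size_t next_free = 1;  // slot 0 holds the root
    flattenInto(root, 0, out, next_free);
    return out;
}
```

The resulting vector is what would then be copied to the device in a single transfer. The pull direction reverses the second pass, following `first_child` indices to rebuild the pointer-based nodes.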

It's worth mentioning that it is preferred for the data to reside on the GPU, as all of the update and rendering passes are going to involve parallel operations with data that is also in GPU memory. We only want to pull subtrees to the CPU when we run low on available GPU memory. To do this, I added a GPU occupancy count for the octree as a whole. When this exceeds a fraction of available memory, subtrees of the GPU memory need to be pulled back. 

I am working on a Least Recently Used (LRU) approach where all methods operating on the tree must input an associated bounding box of the area that they will affect. First, this allows us to make sure that the entire affected volume is currently on the GPU before attempting to perform the operation. The octree will also keep a history of the N most recently used bounding boxes. When space needs to be freed, it will take the union of these stored bounding boxes and pull data that lies outside of this region back to the CPU.
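A minimal sketch of the bounding-box history could look like this. The class and function names are hypothetical. It keeps the N most recent boxes, returns their union as the region to keep on the GPU, and treats any subtree that does not overlap that region as a candidate to pull back.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

struct Box {
    double min[3];
    double max[3];
};

class BoxHistory {
public:
    explicit BoxHistory(std::size_t capacity) : capacity_(capacity) {}

    // Record the box touched by an operation, forgetting the oldest if full.
    void touch(const Box& box) {
        boxes_.push_back(box);
        if (boxes_.size() > capacity_) boxes_.pop_front();
    }

    bool empty() const { return boxes_.empty(); }

    // Smallest box enclosing every remembered box. Call only when !empty().
    Box keepRegion() const {
        Box u = boxes_.front();
        for (const Box& b : boxes_) {
            for (int axis = 0; axis < 3; ++axis) {
                u.min[axis] = std::min(u.min[axis], b.min[axis]);
                u.max[axis] = std::max(u.max[axis], b.max[axis]);
            }
        }
        return u;
    }

private:
    std::size_t capacity_;
    std::deque<Box> boxes_;
};

// True if the two boxes share any volume or touch.
inline bool overlaps(const Box& a, const Box& b) {
    for (int axis = 0; axis < 3; ++axis)
        if (a.max[axis] < b.min[axis] || b.max[axis] < a.min[axis]) return false;
    return true;
}
```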

This initial approach may need to be improved in the future. For one thing, our use case involves two independent camera poses, one for updating the map and one for rendering it. The bounding boxes associated with these two cameras can be separated spatially, but the method will create a single bounding box that will also encompass the space between them. A more advanced method would first cluster the bounding boxes, and then perform a union operation on each cluster. Another issue is that this method will create a tight box around the cameras. If they are moving, it is possible that they will quickly move outside of the bounding box and require memory to be pulled back. One way to handle this would be to predict the future motion of the cameras.

Saturday, February 7, 2015

ICP Localization

I have implemented the ICP algorithm that was described in At this point, I am only matching two consecutive raw frames to compute a camera motion between them. In the coming weeks, I will be matching new raw frames to a frame generated from a reconstructed 3D map. This should drastically reduce camera pose drift.

My initial naive implementation was split into separate CUDA kernels as described in the paper. This implementation has an initial kernel that determines whether the points in the two frames are similar enough to be included in the cost function. This creates a mask that is used to remove points with thrust compaction. Next, a kernel computes a 6x6 matrix A and 6x1 vector b for each corresponding pair of points. These are both summed in parallel with thrust. The final transform update is computed on the CPU by solving A*x = b. For all kernels, I used a block size of 256. Here is pseudo-code for this implementation:

pyramid_depth = 3
depth_iterations = {4, 5, 10}
update_trans = Identity4x4
for i := 1 to pyramid_depth do
    this_frame = this_pyramid[i]
    last_frame = last_pyramid[i]
    for j := 1 to depth_iterations[i] do
        mask = computeCorrespondenceMask(this_frame, last_frame)
        remove_if(this_frame, mask)
        remove_if(last_frame, mask)
        [A b] = computeICPCost(this_frame, last_frame)
        A_total = reduce(A)
        b_total = reduce(b)
        iter_trans = solveCholesky(A_total, b_total)
        applyTransform(iter_trans, this_vertices, this_normals)
        update_trans = iter_trans * update_trans
camera_pose = camera_pose * update_trans

Using an NVidia GTX 770, I found that this process was unbearably slow, taking {95, 76, 112} ms for the 3 pyramid depths respectively.

I was able to shave off some of the time by optimizing the block sizes. Most of the kernels performed optimally with ~16-32 threads per block. This sped up to {59, 58, 106} ms.

Another 10 ms total was saved by combining the 6x6 A and 6x1 b into a single 7x6 matrix Ab, which could be reduced with a single thrust call rather than 2 separate passes.

I found that the majority of the time was being spent in the thrust reduction pass of summing all of the ICP terms. On average, each iteration of computing ICP cost terms and reducing them was taking ~15.5 ms. I found that loading the initial ICP cost kernel with part of the reduction job substantially sped up this process. Thus, rather than parallelizing the ICP cost kernel over N points, it could be parallelized over N/load_size threads. Each thread is now responsible for iterating and summing over load_size points, and thus the thrust reduction acts on data of length N/load_size. Since each thread acts on multiple points, it is no longer beneficial to precompute a mask and remove invalid correspondences. This check is now part of the ICP cost kernel. With load_size = 10, I found that the entire process could be reduced from ~15.5 ms per iteration to ~3.5 ms. This reduces the total time to {25, 19, 30} ms.
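The loaded reduction can be emulated on the CPU to show the structure. In this sketch, each iteration of the outer loop plays the role of one GPU thread, and the 7x6 Ab matrix is flattened into 42 values. The names are hypothetical, and the per-point cost computation and correspondence check are omitted.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

using AbTerm = std::array<double, 42>;  // 7x6 matrix, flattened

inline void addInto(AbTerm& sum, const AbTerm& term) {
    for (std::size_t k = 0; k < sum.size(); ++k) sum[k] += term[k];
}

// Stage 1: "thread" t sums the terms of points [t*load_size, (t+1)*load_size).
// load_size must be greater than zero.
inline std::vector<AbTerm> partialSums(const std::vector<AbTerm>& terms,
                                       std::size_t load_size) {
    const std::size_t threads = (terms.size() + load_size - 1) / load_size;
    std::vector<AbTerm> partial(threads, AbTerm{});
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t end = std::min(terms.size(), (t + 1) * load_size);
        for (std::size_t i = t * load_size; i < end; ++i)
            addInto(partial[t], terms[i]);
    }
    return partial;
}

// Stage 2: the final reduction now only sees N/load_size elements.
inline AbTerm reduceAll(const std::vector<AbTerm>& partial) {
    AbTerm total{};
    for (const AbTerm& p : partial) addInto(total, p);
    return total;
}
```

In the real kernel, an invalid correspondence would simply contribute nothing inside the inner loop, which is why the separate mask and compaction steps are no longer needed.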

Pseudo-code for the optimized algorithm is here:

pyramid_depth = 3
depth_iterations = {4, 5, 10}
update_trans = Identity4x4
load_size = 10
for i := 1 to pyramid_depth do
    this_frame = this_pyramid[i]
    last_frame = last_pyramid[i]
    for j := 1 to depth_iterations[i] do
        Ab = correspondAndComputeLoadedICPCost(this_frame, last_frame, load_size)
        Ab_total = reduce(Ab)
        iter_trans = solveCholesky(Ab_total[0:5,:], Ab_total[6,:])
        applyTransform(iter_trans, this_vertices, this_normals)
        update_trans = iter_trans * update_trans
camera_pose = camera_pose * update_trans

The overall framerate for the system has improved from 2 to 15 FPS.

Sunday, January 18, 2015

System Diagram and Octree Map Update Deep-dive

System Diagram

I have drawn a system overview diagram to summarize the system that I described in my last post.

The black arrows designate the flow of execution, while the blue arrows are to clarify the passage of data from the two render passes. The dotted lines indicate modules that are responsible for directly interfacing with hardware devices.

Octree Map Updates

Here I will describe in more detail one of the complex sub-components of the system: the process of updating an octree map from an aligned RGB-D frame.

The process of updating the global map will be made up of several CUDA kernels parallelizing execution. Here is a description of each CUDA kernel in the process.

1.) Determine if any vertex is outside of the current map

Prior to this step of the pipeline, we have computed the 3D position of each pixel in the camera frame with respect to the coordinates of the octree map. This kernel will parallelize over the vertices and determine whether each one is outside of the current map bounds. If it is, the thread computes how many additional layers of depth are required for the vertex to be within the map, and writes that number atomically to a global value if it is greater than the one currently held.
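The per-vertex test can be sketched as below. This is only an illustration under a simplifying assumption: the map is a cube that grows symmetrically about a fixed center, doubling its half-width with each added layer. The real map may grow differently when a new root is added. The running maximum stands in for the atomic write to the global value.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Layers needed so that a cube of the given half-width, doubled once per
// layer about `center`, contains the vertex. half_width must be positive.
inline int extraLevels(const double vertex[3], const double center[3],
                       double half_width) {
    double reach = 0.0;
    for (int axis = 0; axis < 3; ++axis)
        reach = std::max(reach, std::fabs(vertex[axis] - center[axis]));
    int levels = 0;
    while (reach > half_width) {
        half_width *= 2.0;
        ++levels;
    }
    return levels;
}

// CPU emulation of the kernel: the maximum over all vertices.
inline int extraLevelsForFrame(const std::vector<double>& xyz,
                               const double center[3], double half_width) {
    int required = 0;
    for (std::size_t i = 0; i + 2 < xyz.size(); i += 3)
        required = std::max(required, extraLevels(&xyz[i], center, half_width));
    return required;
}
```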

A CPU stage will then increase the global map to have a new root node and additional layers to ensure that the map will be big enough to contain the observed frame. It may be worth keeping this entire step on the CPU, as it will be computationally light. However, the vertices will be on the GPU at this point, so the data would have to be copied back and forth between devices for CPU computation here.

2.) Compute voxel keys for each vertex to create list of occupied cells.

This kernel will parallelize over each vertex and compute the Morton Code at the maximum depth of the octree.

A Morton Code is a condensed form representing a key within an octree. Each depth of the tree is represented using an additional 3 bits, as it is binary in the x, y, and z coordinates. 

Example: For an octree centered at the origin with bounding box at (-1,-1,-1) and (1,1,1), the vertex (0.7, -0.2, 0.1) has the depth 1 Morton Code 101, and the depth 2 Morton Code 101110.
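The example can be reproduced with a short sketch. It assumes a cubic volume described by its minimum corner and edge length, and it prepends a leading 1 to mark the key length, so the codes above appear as 1 101 and 1 101 110. The function name is hypothetical.

```cpp
#include <cstdint>

// Compute the Morton key of a point at the given depth. Each level adds a
// 3-bit tuple: the x bit, then the y bit, then the z bit.
inline uint64_t mortonKey(double x, double y, double z,
                          double min_x, double min_y, double min_z,
                          double edge, int depth) {
    uint64_t key = 1;  // leading 1 marks the start of the code
    double half = edge;
    for (int d = 0; d < depth; ++d) {
        half *= 0.5;  // half the edge of the current cell
        unsigned tuple = 0;
        if (x >= min_x + half) { tuple |= 4; min_x += half; }
        if (y >= min_y + half) { tuple |= 2; min_y += half; }
        if (z >= min_z + half) { tuple |= 1; min_z += half; }
        key = (key << 3) | tuple;
    }
    return key;
}
```

For the vertex (0.7, -0.2, 0.1) in the box from (-1,-1,-1) to (1,1,1), this gives the tuple 101 at depth 1 and the tuples 101 110 at depth 2, matching the example.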

This step determines which cells in the octree have been observed as occupied in the current camera frame. It produces exactly one key per vertex, so the output size is known in advance. We can therefore pre-allocate the memory needed to store this data and avoid any need for a global memory counter (as will be needed in the next kernel).

3.) Compute voxel keys for each line segment from the camera position to each vertex and add them to a global list of unoccupied voxels.

This kernel is similar to the previous, though instead of computing a single Morton Code for one point, it computes all Morton Codes along the line segment from the camera origin point to the vertex. This step will not have a constant output size as it is impossible to know how many voxels a line segment will intersect. Thus, we will need to conservatively allocate memory for the output, and use a global memory index counter that is updated atomically.

The voxels along each line segment will be calculated using the 3D extension of Bresenham's line algorithm, and the Morton Code will be computed for each voxel center.
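A common integer formulation of the 3D extension of Bresenham's algorithm is sketched below, working on integer voxel indices. The struct and function names are hypothetical. The returned list includes both end cells, so the final cell (the observed vertex) would need to be dropped or filtered before being treated as unoccupied.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

struct Cell {
    int x, y, z;
};

// Visit voxel indices from a to b, stepping once per iteration along the
// dominant axis and using error terms to decide when the other axes step.
inline std::vector<Cell> bresenham3D(Cell a, Cell b) {
    const int dx = std::abs(b.x - a.x);
    const int dy = std::abs(b.y - a.y);
    const int dz = std::abs(b.z - a.z);
    const int sx = b.x > a.x ? 1 : -1;
    const int sy = b.y > a.y ? 1 : -1;
    const int sz = b.z > a.z ? 1 : -1;
    const int dm = std::max({dx, dy, dz});  // length along the dominant axis

    int ex = dm / 2, ey = dm / 2, ez = dm / 2;  // per-axis error terms
    std::vector<Cell> cells;
    cells.reserve(static_cast<std::size_t>(dm) + 1);
    Cell c = a;
    for (int i = 0; i <= dm; ++i) {
        cells.push_back(c);
        ex -= dx; if (ex < 0) { ex += dm; c.x += sx; }
        ey -= dy; if (ey < 0) { ey += dm; c.y += sy; }
        ez -= dz; if (ez < 0) { ez += dm; c.z += sz; }
    }
    return cells;
}
```

Note that this variant visits one cell per step along the dominant axis, so it can skip cells that the exact segment only clips at a corner.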

4.) Remove duplicate occupied and unoccupied voxels from the lists, and store multiplicities.

An upcoming step is going to involve parallelizing over voxels to be updated. However, many of the threads from the previous stages will involve updating the same voxels due to the close proximity of the rays originating from the same camera position. 

In this step, occupied cells should average the color value of all pixels when performing this compaction.

To make this step more efficient by avoiding multiple threads atomically updating the cells, we should first consolidate these updates into a list of only unique voxels.  We then also need an additional list of integer counts that represent the number of times observed and unobserved in this frame.
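The consolidation into unique keys with multiplicities can be sketched on the CPU as a sort followed by a run-length count. The function name is hypothetical. On the GPU, a sort followed by a reduce-by-key over a list of ones would be one way to get the same result, though that is an assumption about the eventual implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Collapse a list of voxel keys into sorted unique keys plus the number of
// times each key appeared.
inline void uniqueWithCounts(std::vector<uint64_t> keys,
                             std::vector<uint64_t>& unique_keys,
                             std::vector<int>& counts) {
    std::sort(keys.begin(), keys.end());
    unique_keys.clear();
    counts.clear();
    for (uint64_t key : keys) {
        if (!unique_keys.empty() && unique_keys.back() == key) {
            ++counts.back();  // same voxel as the previous entry
        } else {
            unique_keys.push_back(key);
            counts.push_back(1);
        }
    }
}
```

Averaging colors for occupied cells fits the same pattern: sum the color channels per run alongside the count, then divide by the count.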

5.) Remove all unoccupied voxels that are in the occupied list.

OctoMap found that it is necessary to avoid updating a voxel as both observed and unobserved in the same frame. They chose to abide by the rule that if any ray observes a cell to be occupied, then it cannot be observed also as unoccupied in that same frame. To do this, we next need to parallelize over the unoccupied cell list and remove them if they are in the occupied list.

It is unclear what the most efficient way to do this would be. It may be beneficial to first sort the occupied list so that it can be queried for a value using binary search. On the other hand, it might be most efficient to sort both lists and perform this update step serially on the CPU, copying data back and forth. It may be worth evaluating these two methods.
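The first option (sort the occupied list, then query it with binary search) can be sketched as follows. This is a serial CPU illustration with a hypothetical function name; in a parallel version, each thread would run the binary search for one unoccupied key and the survivors would then be compacted.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Keep only the unoccupied keys that do not also appear in the occupied list.
inline std::vector<uint64_t> removeOccupied(const std::vector<uint64_t>& unoccupied,
                                            std::vector<uint64_t> occupied) {
    std::sort(occupied.begin(), occupied.end());
    std::vector<uint64_t> kept;
    kept.reserve(unoccupied.size());
    for (uint64_t key : unoccupied) {
        if (!std::binary_search(occupied.begin(), occupied.end(), key))
            kept.push_back(key);
    }
    return kept;
}
```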

6.) For all occupied and unoccupied voxels, update the map at the maximum depth.

This step involves parallelizing over all voxels in both lists, and either adding to or subtracting from the alpha channel the multiplicity associated with the voxel in this update. Unobserved voxels will use subtraction, and observed voxels will use addition.

This alpha channel can then be converted to a probability that the voxel is occupied by multiplying by the probability of hit/miss and exponentiating (the alpha channel is a condensed form of log probability of occupation).
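One way to read this conversion is as a log-odds update, sketched below. This is an interpretation rather than the confirmed formula: it assumes the alpha channel holds the net count of hits minus misses, that a hit and a miss carry equal and opposite weight, and that `p_hit` is a tunable sensor parameter. The function names are hypothetical.

```cpp
#include <cmath>

// Apply one frame's observations to the stored count.
inline int updateAlpha(int alpha, int hits, int misses) {
    return alpha + hits - misses;
}

// Convert the count to an occupancy probability. Each hit adds
// log(p_hit / (1 - p_hit)) to the log-odds; each miss subtracts it.
inline double occupancyProbability(int alpha, double p_hit) {
    const double step = std::log(p_hit / (1.0 - p_hit));
    const double log_odds = alpha * step;
    return 1.0 / (1.0 + std::exp(-log_odds));
}
```

Under these assumptions, a count of zero maps to a probability of 0.5, and a single hit with `p_hit = 0.7` maps to 0.7.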

Voxels observed as occupied will also update their color values using the colors from the original camera pixels.

7.) Mip-map updates into inner layers of the octree map.

The last step involves updating the rest of the tree with a kernel execution pass for each layer of depth in the octree. In this step, each parent node will take on the mean value of its child nodes for RGBA.

It would be ideal to optimize this step by avoiding mip-mapping parts of the tree that were not updated in this frame. An additional pruning step that prunes redundant leaves from the tree, particularly for areas of space where the alpha channel is very small (very likely to be empty) may be beneficial here as well.