Saturday, February 14, 2015

Out-of-Core Octree Management


The sparse octree used to represent a reconstructed 3D map quickly grows too large to fit entirely in GPU memory. Reconstructing a normal office room at 1 cm resolution will likely take as much as 6-8 GB, but the NVIDIA GTX 770 that I am using has only 2 GB.

To handle this, I have developed an out-of-core memory management framework for the octree. At first glance, this framework is a standard stack-based (pointer-based) octree on the CPU. However, each node in the tree carries an additional boolean flag indicating whether it is the root of a subtree that currently resides in linear GPU memory, along with a pointer to that subtree's location on the GPU and its size. The stackless octree data on the GPU is represented with 64 bits per node, using the same format as GigaVoxels. Here is a summary of the OctreeNode class's data members:

class OctreeNode {
  //Flag whether the node is at max resolution,
  //in which case it will never be subdivided
  bool is_max_depth_;

  //Flag whether the node's children have been initialized
  bool has_children_;

  //Flag whether the node's data is on the GPU
  bool on_gpu_;

  //Pointer to gpu data, if the node is GPU backed
  int* gpu_data_;

  //The number of children on the GPU
  int gpu_size_;

  //Child nodes if it is CPU backed
  OctreeNode* children_[8];

  //Data in the node if it is CPU backed
  int data_;
};



Next, I gave these nodes an API that can push/pull their data to and from the GPU. The push method uses recursion to convert the stack-based data into a linear array in CPU memory, then copies that memory to the GPU. It avoids over-allocating or reallocating the linear memory by first recursing through the node's children to determine the size of the subtree. The pull method copies the linear memory back to the CPU, then uses it to recursively rebuild the stack-based structure.
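
As a concrete illustration, here is a minimal sketch of what the push path could look like. The helper names (countSubtreeNodes, serializeSubtree, releaseChildren) are hypothetical, and the real nodes use the 64-bit GigaVoxels layout rather than a single int:

#include <vector>
#include <cuda_runtime.h>

// Sketch only: size the subtree, flatten it into a linear CPU buffer,
// then upload it to the GPU in a single copy.
void OctreeNode::pushToGPU() {
  gpu_size_ = countSubtreeNodes();        // first recursion: size the subtree
  std::vector<int> linear;
  linear.reserve(gpu_size_);
  serializeSubtree(linear);               // second recursion: flatten to array
  cudaMalloc(&gpu_data_, linear.size() * sizeof(int));
  cudaMemcpy(gpu_data_, linear.data(), linear.size() * sizeof(int),
             cudaMemcpyHostToDevice);
  releaseChildren();                      // free the CPU-side subtree
  on_gpu_ = true;
}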

It's worth mentioning that we prefer the data to reside on the GPU, as all of the update and rendering passes involve parallel operations on data that is also in GPU memory. We only want to pull subtrees to the CPU when we run low on available GPU memory. To detect this, I added a GPU occupancy count for the octree as a whole. When this count exceeds a fraction of available memory, subtrees need to be pulled back to the CPU.
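
A sketch of how that check might look; the class name, counter, and threshold are all illustrative assumptions:

// Illustrative occupancy bookkeeping; kEvictFraction and the method
// names are assumptions, not the final API.
void Octree::noteGpuAllocation(size_t bytes) {
  gpu_bytes_in_use_ += bytes;
  if (gpu_bytes_in_use_ > kEvictFraction * total_gpu_bytes_)
    evictLeastRecentlyUsed();  // pull cold subtrees back to the CPU
}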

I am working on a Least Recently Used (LRU) approach in which every method operating on the tree must supply a bounding box of the volume it will affect. First, this lets us make sure the entire affected volume is on the GPU before attempting the operation. The octree also keeps a history of the N most recently used bounding boxes. When space needs to be freed, it takes the union of these stored boxes and pulls data that lies outside this region back to the CPU.
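
Here is a rough sketch of that bookkeeping, with all names hypothetical:

#include <algorithm>
#include <deque>

struct AABB { float lo[3], hi[3]; };

// Union of the stored boxes; data outside this region becomes a
// candidate for eviction back to the CPU.
AABB boxUnion(const std::deque<AABB>& boxes) {
  AABB u = boxes.front();
  for (const AABB& b : boxes)
    for (int k = 0; k < 3; ++k) {
      u.lo[k] = std::min(u.lo[k], b.lo[k]);
      u.hi[k] = std::max(u.hi[k], b.hi[k]);
    }
  return u;
}

// Every tree operation declares the volume it is about to touch.
void Octree::beginOperation(const AABB& affected) {
  ensureResident(affected);            // push any missing subtrees to the GPU
  recent_boxes_.push_back(affected);   // record for the LRU history
  if (recent_boxes_.size() > kHistoryN)
    recent_boxes_.pop_front();
}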

This initial approach may need to be improved in the future. For one thing, our use case involves two independent camera poses: one for updating the map and one for rendering it. The bounding boxes associated with these two cameras can be spatially separated, but the method will create a single bounding box that also encompasses the space between them. A more advanced method would first cluster the bounding boxes, then take the union within each cluster. Another issue is that this method creates a tight box around the cameras; if they are moving, they may quickly exit the box and immediately need subtrees that were just evicted. One way to handle this would be to predict the future motion of the cameras.

Saturday, February 7, 2015

ICP Localization


I have implemented the ICP algorithm described in http://research.microsoft.com/pubs/155378/ismar2011.pdf. At this point, I am only matching two consecutive raw frames to compute the camera motion between them. In the coming weeks, I will instead match new raw frames to a frame generated from the reconstructed 3D map, which should drastically reduce camera pose drift.

My initial naive implementation was split into separate CUDA kernels, as described in the paper. An initial kernel determines whether the points in the two frames are similar enough to be included in the cost function; this produces a mask that is used to remove points with Thrust stream compaction. Next, a kernel computes a 6x6 matrix A and a 6x1 vector b for each corresponding pair of points. These are both summed in parallel with Thrust. The final transform update is computed on the CPU by solving A*x = b. For all kernels, I used a block size of 256. Here is pseudo-code for this implementation:

------------------------------------------------------------------------------------------------------------------
pyramid_depth = 3
depth_iterations = {4, 5, 10}
update_trans = Identity4x4
for i := 1 to pyramid_depth do
    this_frame = this_pyramid[i]
    last_frame = last_pyramid[i]
    for j := 1 to depth_iterations[i] do
        mask = computeCorrespondenceMask(this_frame, last_frame)
        remove_if(this_frame, mask)
        remove_if(last_frame, mask)
        [A b] = computeICPCost(this_frame, last_frame)
        A_total = reduce(A)
        b_total = reduce(b)
        iter_trans = solveCholesky(A_total, b_total)
        applyTransform(iter_trans, this_frame)
        update_trans = iter_trans * update_trans
    end
end
camera_pose = camera_pose * update_trans
------------------------------------------------------------------------------------------------------------------
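
To make the cost step concrete, here is a sketch of what the per-correspondence kernel could look like for the point-to-plane error with a linearized 6-DoF update, as used in the paper. The buffer layouts and names are my own assumptions:

// One thread per correspondence: build the 6x6 term J^T*J and the
// 6x1 term J^T*r, where r is the point-to-plane residual and
// J = [p x n, n] is the Jacobian of the linearized update.
__global__ void computeICPCost(const float3* src, const float3* dst,
                               const float3* dst_normal, int n,
                               float* A_out,   // n x 36, row-major 6x6 blocks
                               float* b_out) { // n x 6
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float3 p = src[i], q = dst[i], nrm = dst_normal[i];
  float J[6] = { p.y * nrm.z - p.z * nrm.y,   // rotational part: p x n
                 p.z * nrm.x - p.x * nrm.z,
                 p.x * nrm.y - p.y * nrm.x,
                 nrm.x, nrm.y, nrm.z };       // translational part: n
  float r = nrm.x * (q.x - p.x) + nrm.y * (q.y - p.y)
          + nrm.z * (q.z - p.z);              // point-to-plane residual
  for (int a = 0; a < 6; ++a) {
    b_out[i * 6 + a] = J[a] * r;
    for (int c = 0; c < 6; ++c)
      A_out[i * 36 + a * 6 + c] = J[a] * J[c];
  }
}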

Using an NVIDIA GTX 770, I found that this process was unbearably slow, taking {95, 76, 112} ms for the three pyramid levels, respectively.

I was able to shave off some of the time by optimizing the block sizes: most of the kernels performed best with ~16-32 threads per block. This brought the times down to {59, 58, 106} ms.

Another 10 ms in total was saved by combining the 6x6 A and the 6x1 b into a single 7x6 matrix Ab, which can be reduced with a single Thrust call rather than two separate passes.
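
A sketch of the combined term and its one-pass reduction (the struct layout is an assumption):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// The 6x6 A stacked with the 1x6 b^T in one POD struct, so a single
// thrust::reduce can sum both at once.
struct Ab { float v[42]; };

struct AbPlus {
  __host__ __device__ Ab operator()(const Ab& x, const Ab& y) const {
    Ab r;
    for (int k = 0; k < 42; ++k) r.v[k] = x.v[k] + y.v[k];
    return r;
  }
};

// Usage, given a thrust::device_vector<Ab> terms:
//   Ab zero = {};
//   Ab total = thrust::reduce(terms.begin(), terms.end(), zero, AbPlus());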

I found that the majority of the time was spent in the Thrust reduction pass that sums all of the ICP terms; on average, each iteration of computing the ICP cost terms and reducing them took ~15.5 ms. Loading the initial ICP cost kernel with part of the reduction job sped this up substantially: rather than parallelizing the ICP cost kernel over N points, it is parallelized over N/load_size threads. Each thread is now responsible for iterating and summing over load_size points, so the Thrust reduction acts on data of length N/load_size. Since each thread acts on multiple points, it is no longer beneficial to precompute a mask and remove invalid correspondences; this test is now part of the ICP cost kernel. With load_size = 10, the entire process dropped from ~15.5 ms per iteration to ~3.5 ms, reducing the total time to {25, 19, 30} ms.

Pseudo-code for the optimized algorithm is here:

------------------------------------------------------------------------------------------------------------------
pyramid_depth = 3
depth_iterations = {4, 5, 10}
update_trans = Identity4x4
load_size = 10
for i := 1 to pyramid_depth do
    this_frame = this_pyramid[i]
    last_frame = last_pyramid[i]
    for j := 1 to depth_iterations[i] do
        Ab = correspondAndComputeLoadedICPCost(this_frame, last_frame, load_size)
        Ab_total = reduce(Ab)
        iter_trans = solveCholesky(Ab_total[0:5,:], Ab_total[6,:])
        applyTransform(iter_trans, this_frame)
        update_trans = iter_trans * update_trans
    end
end
camera_pose = camera_pose * update_trans
------------------------------------------------------------------------------------------------------------------
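
For completeness, here is a sketch of the loaded cost kernel itself, reusing the Ab struct from above; the correspondence test and all names are illustrative:

// Each thread walks load_size correspondences, rejects invalid ones
// inline (replacing the separate mask/compaction pass), and emits one
// partially summed term, so the Thrust reduction only sees
// n / load_size elements.
__global__ void correspondAndComputeLoadedICPCost(
    const float3* src, const float3* dst, const float3* dst_normal,
    int n, int load_size, Ab* partial) {  // ceil(n / load_size) outputs
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  int begin = t * load_size;
  if (begin >= n) return;
  Ab sum = {};                            // zero-initialized accumulator
  int end = min(begin + load_size, n);
  for (int i = begin; i < end; ++i) {
    float3 p = src[i], q = dst[i], nrm = dst_normal[i];
    float3 d = make_float3(q.x - p.x, q.y - p.y, q.z - p.z);
    if (d.x * d.x + d.y * d.y + d.z * d.z > 0.01f)
      continue;                           // illustrative distance threshold
    float J[6] = { p.y * nrm.z - p.z * nrm.y,
                   p.z * nrm.x - p.x * nrm.z,
                   p.x * nrm.y - p.y * nrm.x,
                   nrm.x, nrm.y, nrm.z };
    float r = nrm.x * d.x + nrm.y * d.y + nrm.z * d.z;
    for (int a = 0; a < 6; ++a) {
      for (int c = 0; c < 6; ++c)
        sum.v[a * 6 + c] += J[a] * J[c];  // A entries: first 36 slots
      sum.v[36 + a] += J[a] * r;          // b entries: last 6 slots
    }
  }
  partial[t] = sum;
}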

The overall framerate for the system has improved from 2 to 15 FPS.