Saturday, February 7, 2015

ICP Localization


I have implemented the ICP algorithm that was described in http://research.microsoft.com/pubs/155378/ismar2011.pdf. At this point, I am only matching two consecutive raw frames to compute a camera motion between them. In the coming weeks, I will be matching new raw frames to a frame generated from a reconstructed 3D map. This should drastically reduce camera pose drift.

My initial naive implementation was split into separate CUDA kernels as described in the paper. This implementation has an initial kernel that determines whether the points in the two frames are similar enough to be included in the cost function. This creates a mask that is used to remove points with thrust compaction. Next, a kernel computes a 6x6 matrix A and 6x1 vector b for each corresponding pair of points. These are both summed in parallel with thrust. The final transform update is computed on the CPU by solving A*x = b. For all kernels, I used a block size of 256. Here is pseudo-code for this implementation:

------------------------------------------------------------------------------------------------------------------
pyramid_depth = 3
depth_iterations = {4, 5, 10}
update_trans = Identity4x4
for i := 1 to pyramid_depth do
    this_frame = this_pyramid[i]
    last_frame = last_pyramid[i]
    for j := 1 to depth_iterations[i] do
        mask = computeCorrespondenceMask(this_frame, last_frame)
        remove_if(this_frame, mask)
        remove_if(last_frame, mask)
        [A b] = computeICPCost(this_frame, last_frame)
        A_total = reduce(A)
        b_total = reduce(b)
        iter_trans = solveCholesky(A_total, b_total)
        applyTransform(iter_trans, this_vertices, this_normals)
        update_trans = iter_trans * camera_trans
    end
end
camera_pose = camera_pose * update_trans
------------------------------------------------------------------------------------------------------------------

Using an NVidia GTX 770, I found that this process was unbearably slow, taking {95, 76, 112} ms for the 3 pyramid depths respectively.

I was able to shave off some of the time by optimizing the block sizes. Most of the kernels performed optimally with ~16-32 threads per block. This sped up to {59, 58, 106} ms.

Another 10 ms total was saved by combining the 6x6 A and 6x1 b into a single 7x6 matrix Ab, which could be reduced with a single thrust call rather than 2 separate passes.

I found that the majority of the time was being spent in the thrust reduction pass of summing all of the ICP terms. On average, each iteration of computing ICP cost terms and reducing them was taking ~15.5 ms. I found that loading the initial ICP cost kernel with part of the reduction job substantially sped up this process. Thus, rather than parallelizing the ICP cost kernel over N points, it could be parallelized over N/load_size threads. Each thread is now responsible for iterating and summing over load_size points, and thus the thrust reduction acts on data of length N/load_size. Since each thread acts on multiple points, it is no longer beneficial to precompute a mask and remove invalid correspondences. This is now part of the ICP cost kernel. With load_size =10, I found that the entire process could be reduced from ~15.5 ms per iteration to ~3.5 ms. This reduces the total time to {25, 19, 30} ms.

Pseudo-code for the optimized algorithm is here:

------------------------------------------------------------------------------------------------------------------
pyramid_depth = 3
depth_iterations = {4, 5, 10}
update_trans = Identity4x4
load_size = 10
for i := 1 to pyramid_depth do
    this_frame = this_pyramid[i]
    last_frame = last_pyramid[i]
    for j := 1 to depth_iterations[i] do
        Ab = correspondAndComputeLoadedICPCost(this_frame, last_frame, load_size)
        Ab_total = reduce(Ab)
        iter_trans = solveCholesky(Ab_total[0:5,:], Ab_total[6,:])
        applyTransform(iter_trans, this_vertices, this_normals)
        update_trans = iter_trans * camera_trans
    end
end
camera_pose = camera_pose * update_trans
------------------------------------------------------------------------------------------------------------------

The overall framerate for the system has improved from 2 to 15 FPS.


No comments:

Post a Comment