I have drawn a system overview diagram to summarize the system that I described in my last post.
The black arrows designate the flow of execution, while the blue arrows are to clarify the passage of data from the two render passes. The dotted lines indicate modules that are responsible for directly interfacing with hardware devices.
Octree Map Updates
Here I will describe in more detail one of the complex sub-components of the system: the process of updating an octree map from an aligned RGB-D frame.
The process of updating the global map will be made up of several CUDA kernels parallelizing execution. Here is a description of each CUDA kernel in the process.
1.) Determine if any vertex is outside of the current map
Prior to this step of the pipeline, we have computed the 3D position of each pixel in the camera frame with respect to the coordinates of the octree map. This kernel will parallelize over the vertices and determine if it is outside of the current map bounds. If it is, it computes how many additional layers of depth are required for it to be within the map, and writes it atomically to a global value if its value is greater than the one currently held.
A CPU stage will then increase the global map to have a new root node and additional layers to ensure that the map will be big enough to contain the observed frame. It may be worth keeping this entire step CPU-bound as it is will be compute-lite. However, the vertices will be on the GPU at this point, so the data would have to be copied back and forth between devices for CPU computation here.
2.) Compute voxel keys for each vertex to create list of occupied cells.
This kernel will parallelize over each vertex and compute the Morton Code at the maximum depth of the octree.
A Morton Code is a condensed form representing a key within an octree. Each depth of the tree is represented using an additional 3 bits, as it is binary in the x, y, and z coordinates.
Example: For an octree centered at the origin with bounding box at (-1,-1,-1) and (1,1,1), the vertex (0.7, -0.2, 0.1) has the depth 1 Morton Code 101, and the depth 2 Morton Code 101110.
This step determines which cells in the octree have been observed as occupied in the current camera frame. The number of occupied cells is constant, and thus we can pre-allocate the memory needed to store this data and avoid any need for a global memory counter (as will be needed in the next kernel).
3.) Compute voxel keys for each line segment from the camera position to each vertex and add then to a global list of unoccupied voxels.
This kernel is similar to the previous, though instead of computing a single Morton Code for one point, it computes all Morton Codes along the line segment from the camera origin point to the vertex. This step will not have a constant output size as it is impossible to know how many voxels a line segment will intersect. Thus, we will need to conservatively allocate memory for the output, and use a global memory index counter that is updated atomically.
The occupied voxels will be calculated by using the 3D extension of Bresenham's line algorithm, and the Morton Code will be computed for each voxel center.
4.) Remove duplicate occupied and unoccupied voxels from the lists, and store multiplicities.
An upcoming step is going to involve parallelizing over voxels to be updated. However, many of the threads from the previous stages will involve updating the same voxels due to the close proximity of the rays originating from the same camera position.
In this step, occupied cells should average the color value of all pixels when performing this compaction.
To make this step more efficient by avoiding multiple threads atomically updating the cells, we should first consolidate these updates into a list of only unique voxels. We then also need an additional list of integer counts that represent the number of times observed and unobserved in this frame.
5.) Remove all unoccupied voxels that are in the occupied list.
OctoMap found that it is necessary to avoid updating a voxel as both observed and unobserved in the same frame. They chose to abide by the rule that if any ray observes a cell to be occupied, then it cannot be observed also as unoccupied in that same frame. To do this, we next need to parallelize over the unoccupied cell list and remove them if they are in the occupied list.
It is unclear what the most efficient way to do this would be. It may be beneficial to first sort the occupied list so that it can be queried for a value using binary search. On the other hand, it might be most efficient to sort both lists and perform this update step serially on the CPU, copying data back and forth. It may be worth evaluating these two methods.
6.) For all occupied and unoccupied voxels, update the map at the maximum depth.
This step involves parallelizing over all voxels in both lists, and either adding or subtracting the alpha channel by the multiplicity associated with the voxel in this update. Unobserved voxels will use subtraction, and observed voxels will use addition.
This alpha channel can then be converted to a probability that the voxel is occupied by multiplying by the probability of hit/miss and exponentiating (the alpha channel is a condensed form of log probability of occupation).
Voxels observed as occupied will also update its color value using the color from the original camera pixel.
7.) Mip-map updates into inner layers of the octree map.
The last step involves updating the rest of the tree with a kernel execution pass for each layer of depth in the octree. In this step, each parent node will take on the mean value of its child nodes for RGBA.
It would be ideal to optimize this step by avoiding mip-mapping parts of the tree that were not updated in this frame. An additional pruning step that prunes redundant leaves from the tree, particularly for areas of space where the alpha channel is very small (very likely to be empty) may be beneficial here as well.