NOVEL DEPTH REPRESENTATIONS FOR DEPTH COMPLETION WITH APPLICATION IN 3D OBJECT DETECTION

By

Saif Muhammad Imran

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering - Doctor of Philosophy

2022

ABSTRACT

NOVEL DEPTH REPRESENTATIONS FOR DEPTH COMPLETION WITH APPLICATION IN 3D OBJECT DETECTION

By

Saif Muhammad Imran

Depth completion refers to interpolating a dense, regular depth grid from sparse and irregularly sampled depth values, often guided by high-resolution color imagery. The primary goal of depth completion is to estimate depth. In practice, methods are trained by minimizing an error between predicted dense depth and ground-truth depth, and are evaluated by how well they minimize this error. Here we identify a second goal, which is to avoid smearing depth across depth discontinuities. This second goal is important because it can improve downstream applications of depth completion such as object detection and pose estimation. However, we also show that the goal of minimizing error can conflict with the goal of eliminating depth smearing.

In this thesis, we propose two novel depth representations that can encode depth discontinuities across object surfaces by allowing multiple depth estimates at each spatial location. In order to learn these new representations, we propose carefully designed loss functions and show their effectiveness in deep neural network learning. We show how our representations can avoid inter-object depth mixing and also surpass state-of-the-art depth completion metrics.

The quality of ground-truth depth in real-world depth completion problems is another key challenge for learning and for accurate evaluation of methods. Ground-truth depth created by semi-automatic methods suffers from sparse sampling and errors at object boundaries. We show that the combination of these errors and the commonly used evaluation measure has promoted solutions that mix depths across boundaries in current methods. The thesis proposes alternate depth completion performance measures that reduce the preference for mixed depths and promote sharp boundaries.

The thesis also investigates whether the additional points from depth completion methods can help in a challenging, high-level perception problem: 3D object detection. It shows the effect of different types of depth noise originating from the depth estimates on detection performance, and proposes effective ways to reduce noise in the estimates and overcome architecture limitations. The method is demonstrated on both real-world and synthetic datasets.

Copyright by SAIF MUHAMMAD IMRAN 2022

I would like to dedicate this thesis to my parents ABM Saiful Islam and Zareen Salma, my wife Nazifa and my lovely one-year-old daughter Qaanita, who has been a continuous inspiration for me during the journey.

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to my advisors Dr. Daniel Morris and Dr. Xiaoming Liu for their support and funding throughout the turbulent years of my PhD journey. I thoroughly enjoyed working with them at late-night hours before submission deadlines. My lab colleagues Mehmet Alper and Yunfei Long have always been there through thick and thin on this arduous journey. I would like to thank Dr. Liu's lab group members, especially Garrick, Shengjie, Amin and Luan, for brainstorming, sharing vision ideas, and facilitating access to server GPUs and setups. Working with them was a true inspiration on its own.
I am indebted to my wife for her unwavering support, understanding and dealing with crisis mode when needed, my parents who came all the way from home to support me with my then expecting wife and all the household chores. Everytime I lost hope, their prayers and consolation kept me persistent to my target. And finally all praise goes to Almighty Allah for all the blessing and love that comes along the way. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Chapter 1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Depth Estimation using Multi-Modal Sensors: A Motivation . . . . . . . . . . . . 1 1.3 Depth Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Chapter 2 Depth Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Representations of Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.1 3D Point Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.2 Voxels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.3 Polygon Meshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.4 SingleView and Multiview Depth Maps . . . . . . . . . . . . . . . . . . . 14 2.3 Multichannel Representation of Depth: Depth Coefficients . . . . . . . . . . . . . 14 2.3.1 Depth Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1.1 Discrete Representation . . . . . . . . . . . . . . . . . . . . . . 16 2.3.1.2 Probability Representation . . . . . . . . . . . . . . . . . . . . . 16 2.3.2 Mathematical Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.3 Depth Reconstruction from True DC . . . . . . . . . . . . . . . . . . . . . 18 2.4 Dual Channel Representation of Depth: Twin Surface . . . . . . . . . . . . . . . . 19 2.4.1 Depth Reconstruction from Foreground and Background Surfaces . . . . . 20 2.5 Proposed Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Chapter 3 Depth Coefficients for Depth Completion . . . . . . . . . . . . . . . . . . 25 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Related Works in Depth Completion . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Avoiding Depth Mixing by Convolution . . . . . . . . . . . . . . . . . . . . . . . 28 3.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3.2 Proposed Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.2.1 Depth Loss Functions with Ambiguity . . . . . . . . . . . . . . 32 3.3.2.2 Cross Entropy as Loss Measure . . . . . . . . . . . . . . . . . . 33 3.4 Learning by Deep Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.1 Neural Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.2 Depth Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
36 3.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 vii 3.5.1 Experimental Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.5.1.0.1 Sub-Sampling . . . . . . . . . . . . . . . . . . . . . . 38 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Chapter 4 Depth Completion Using TWIN Surface Extrapolation at Occlusion Boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2.1 Depth Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2.2 Depth Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.3 Loss Functions in Depth Completion . . . . . . . . . . . . . . . . . . . . . 45 4.3 Ambiguities and Expected Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4.1 Ambiguities and Expected Loss . . . . . . . . . . . . . . . . . . . . . . . 47 4.4.2 Asymmetric Linear Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.4.3 Foreground and Background Estimators . . . . . . . . . . . . . . . . . . . 50 4.4.4 Fused Depth Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.4.5 Depth Surface Representation . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4.6 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4.6.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4.6.2 Training and Inference . . . . . . . . . . . . . . . . . . . . . . . 53 4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.5.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.5.3.1 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . 56 4.5.3.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.5.3.3 Qualitative Parsing . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.5.3.4 Relative Error Maps . . . . . . . . . . . . . . . . . . . . . . . . 58 4.5.3.5 Outlier Errors and Analysis on KITTI Semi-Dense GT . . . . . . 61 4.5.3.6 Quantitative Results on NYU2 . . . . . . . . . . . . . . . . . . . 63 4.5.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.5.4.1 Effect of Loss Functions . . . . . . . . . . . . . . . . . . . . . . 64 4.5.4.2 Effect of σ on Estimated Surfaces . . . . . . . . . . . . . . . . . 65 4.5.4.3 Effect of γ on Performance . . . . . . . . . . . . . . . . . . . . 65 4.5.4.4 Effect of Sparsity on Depth Performance . . . . . . . . . . . . . 66 4.5.4.5 Synthetic Experiments with VKITTI . . . . . . . . . . . . . . . 66 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Chapter 5 3D Object Detection from Noisy Depth . . . . . . . . . . . . . . . . . . . 70 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.2.1 Depth Completion and Depth Prediction . . . 
. . . . . . . . . . . . . . . . 74 5.2.2 3D Object Detection with Multi-Modal Sensors . . . . . . . . . . . . . . . 74 5.2.3 3D Object Detection from Estimated Depth . . . . . . . . . . . . . . . . . 76 viii 5.3 Impact of Noisy Depth on Object Detection . . . . . . . . . . . . . . . . . . . . . 77 5.3.1 Baseline Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.3.2 Noise Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.3.3 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.3.3.1 Promotion of Noisy Pixels using FPS . . . . . . . . . . . . . . . 83 5.3.3.2 Downsampling in Point-Based Architecture . . . . . . . . . . . . 84 5.3.4 Remedies to Tackle Noisy Depth . . . . . . . . . . . . . . . . . . . . . . . 85 5.3.4.1 Filtering Smeared Points . . . . . . . . . . . . . . . . . . . . . . 85 5.3.4.2 Filtering Background Clutter . . . . . . . . . . . . . . . . . . . . 86 5.3.4.3 Filtering Pixels within r distance from raw LiDAR . . . . . . . . 86 5.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4.1.1 KITTI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4.1.2 Virtual KITTI . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Chapter 6 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . . . . 97 6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.1.1.1 Multi-Channel and Dual Channel Representation . . . . . . . . . 97 6.1.1.2 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.1.1.3 Noisy GT and Metrics . . . . . . . . . . . . . . . . . . . . . . . 98 6.1.1.4 3D Object Detection from Noisy Depth . . . . . . . . . . . . . . 99 6.1.2 Future Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 ix LIST OF TABLES Table 1.1: Comparisons of active and passive depth sensors . . . . . . . . . . . . . . . . 4 Table 3.1: Quantitative results of NYU2 (Done on Uniform-500 Samples + RGB) (units in m). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Table 3.2: Performance evaluation at different levels of Lidar sparsity (KITTI dataset). 64R, 32R and 16R refers to 64-row, 32-row, 16-row respectively. Units in cm. 39 Table 3.3: A comparison whether DC on the input or DC with cross entropy (CE) on output has the dominant effect. It turns out that individually their effect is small, but together have a large impact (NYU2 dataset). Units in cm. . . . . . 39 Table 3.4: Average precision (%) for 3D detection and pose estimation of cars on KITTI [1] using Frustum PointNet [2]. The baseline, Raw-16R, uses 16 rows from the Lidar, while Ma’s method [3] and our method start by densely upsampling these 16-row data. 
In each case, the method is trained on 3, 712 frames and evaluated on 3, 769 frames, of the KITTI 3D object detection benchmark [1] using an intersection of union (IOU) measure of 0.7. Only our method im- proves on the baseline, and this is the most significant for 3D bounding boxes. 40 Table 4.1: Depth completion on the Test/Validation sets of KITTI, with 64R LiDAR and RGB input (units in mm). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Table 4.2: Error metrics for different image regions on TWISE. . . . . . . . . . . . . . 61 Table 4.3: Relation between Disparity Error and Depth Error in metric units (cm). Note that KITTI Outliers are defined by: > 3 pix disparity error and 5% error. . . . 62 Table 4.4: Depth completion results on NYU2 [4]. . . . . . . . . . . . . . . . . . . . . 63 Table 4.5: Effect of different loss functions. Compared to single channel losses, CE requires 80 channel, while TWISE requires 3 channel. . . . . . . . . . . . . 64 Table 4.6: Effect of learned σ in TWISE, evaluated by our best model. . . . . . . . . . . 64 Table 4.7: Effect of γ on depth completion performance. . . . . . . . . . . . . . . . . . 64 Table 4.8: Row sparsity impact on SoTA depth completion methods. . . . . . . . . . . . 68 x Table 5.1: 3D Object Detection results (3D and Bird-eye view (BEV) average precision respectively) with different depth resolutions. Object refers to object detec- tion [5] and Depth refers to depth completion [6] network. Raw refers to 64R LiDAR, Semi-Dense GT refers to the GT created by accumulating Li- DAR points used for supervision of depth completion network in KITTI. We use the results of two depth completion networks (MultiHourGlass [6] and TWISE [7]) for comparison purposes. . . . . . . . . . . . . . . . . . . . . . 78 Table 5.2: AP Comparison of raw (64R LiDAR) and dense depth (TWISE) at different depth ranges in meters. Numerical results also show dense depth performs worse compared to raw LiDAR at all categories; with the gap increasing at higher depth range and more difficult categories respectively. . . . . . . . . . 79 Table 5.3: 3D Object Detection with augmented LiDAR. L refers to 64R LiDAR, D refers to Dense Depth with TWISE, F refers to sigma filter 5.3.4.1, SF refers to filter used with semantic mask, and Aug.F refers to LiDAR augmented with dense depth estimate using the semantic mask and filter. . . . . . . . . . 91 Table 5.4: Performance evaluation of different architectures on a validation set of KITTI with dense depth from TWISE. D refers to Dense Depth with TWISE, F refers to sigma filter 5.3.4.1. It shows the filter designed with TWISE out- put can improve detection performance significantly by pruning floating and ambiguous depth pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Table 5.5: 3D Object detection results with defined sampling regions to reduce back- ground clutter and bypass architecture limitations. Filtered depth refers to the TWISE filter 5.3.4.1, 64R and 128R are depth sampled in azimuth and elevation space to simulate a LiDAR, while semantic filtered depth refers to the estimated semantic mask created by [8], and used to filter dense depth in image plane based on vehicle pixels. The dense depth pixels are reduced further by grid sampling in 3D at grid spacing 0.1m. . . . . . . . . . . . . . . 94 Table 5.6: 3D Object Detection results with the apriori sampling region and different radius configurations used to gather background pixels around the relevant object pixels. 
Sem filtered refers to sigma filter and semantic map used to fil- ter out background and floating depth points. Sem-Mask in last row refers to binary semantic information which is concatenated with the input pointcloud as additional information to the detection network. . . . . . . . . . . . . . . . 95 Table 5.7: Average precision (%) for 3D detection and pose estimation of cars on Vir- tualKITTI using PointRCNN2[[5]]/VoxelRCNN[9]. Alldep refers to depth pixels grid sampled at 0.1m, while semantic pixel refers to selected pixels using GT semantic masks. Note that pixels within r = 0.3m radius of se- mantics are also taken into consideration. . . . . . . . . . . . . . . . . . . . 95 xi Table 5.8: 3D detection and pose estimation of cars on VirtualKITTI [1] using different types of depth noises. Gaussian Blur smooths depth along boundaries and simulate smeared depth; Depth Misalign. simulates depth error along the optical ray. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Table 5.9: Average precision (%) for 3D detection and pose estimation of cars on Virtu- alKITTI [1] using PointRCNN2. 2D Bounding box refers to the GT bounding boxes within which depth pixels are sampled, while Semantic refers to GT semantics. Note that pixels within r = 0.3m radius of semantic pixels are also taken into consideration. . . . . . . . . . . . . . . . . . . . . . . . . . . 96 xii LIST OF FIGURES Figure 1.1: Performance comparison of depth completion method using different input sensor modalities (raw depth, monocular and monocular + raw depth). A standard deep learning method (Ma et al. [10]) is used in outdoor KITTI depth completion dataset for evaluation. The maximum depth range available for evaluation is 85m. x-axis refers to the no. of LiDAR scanlines projected to the image as raw depth, y-axis refers to the RMSE metric in cm. 0 scan- line resembles depth estimation using monocular camera only. It shows that monocular and sparse depth as input modality can reduce the performance gap by around 5 times compared to monocular camera only just by increasing the raw depth measurements. The performance gap between raw depth and multimodal input (monocular + depth) decreases as the number of raw depth measurements increase. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Figure 2.1: An illustration of different representations of depth; (a) represents point- cloud data (Courtesy of Caltech) (b) represents voxel grid data (courtesy of IIT Kharagpur), (c) represents triangle mesh (Courtesy of UW), and (d) rep- resents multiview representations of depth (Courtesy of Stanford) by means of depth maps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.2: Illustration of DC. The depth pixel (red circle) in the image plane is projected back to the 3D space by a pixel ray. The depth bins are quantized along that pixel ray. Each bin is then assigned a weight (depth coefficients) on few depth bins based on the proximity of the depth with the depth bins. . . . . . . . . . 15 Figure 2.3: Representing depth by means of multiple depth bins or depth coefficients to preserve depth precision. In this toy example, we divide the depth range of 10m into 10 bins, each bin spaced 1m apart from each other. (a) depth representation by a single bin, which loses precision (b) depth representation by finite number of bins which preserve precision. . . . . . . . . . . . . . . . 17 Figure 2.4: Dual Channel Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 
20 xiii Figure 2.5: Illustration why RMSE and MAE metric are not good at boundaries. x-axis shows the (yˆi − ỹi ) where yˆi and ỹi are estimated depth and groundtruth depth respectively, y-axis shows the loss functions. (a) and (b) shows the metric for MSE and MAE, and TRMSE and TMAE of depth and their characteristic loss curve respectively. (c) and (d) shows the expected MSE and MAE, and expected TRMSE and TMAE when depth pixels are missing between (10m) and (13m) (see text). From (a) and (b), both MSE and MAE and TMSE and TMAE shows perfect prediction/no errors when it coincides with GT, but for MSE and MAE the errors continue to increase if the depth estimates are far away from the GT. However, for TRMSE and TMAE, a fixed error occurs for points beyond a threshold. From (c) and (d), the expected MSE and MAE fa- vors estimates in-between 10m and 13m, and thus, indirectly promote depth mixing. However, the TMSE and TMAE favor either points (10m) or 13m. In a way, the TMSE and TMAE do not penalize depth estimates when one surface/depth point (10m) is chosen instead of the other surface/depth point (13m), and only account for intra-surface variations of depth estimates. . . . . 22 Figure 3.1: Our depth completion uses (a) a color image and the subsampled (16-row) Lidar points projected into image plane to estimate (b), a dense depth image. (c-e) are zoomed-in view of input color image, super-resolved depth of Ma et al. [3] and ours respectively. (f -h) are bird’s eye view of input sparse Lidar data, (d), and (e), respectively. Colors in the bird’s eye view show the number of height pixels in each cell/pixel. So a smeared object shape has height pixels spread out around the object boundary. Notice the smearing of height at the object boundaries in (g) compared to (h). These depth-mixing pixels impact qualitative appearance as well as subsequent tasks, such as object detection and pose estimation. . . . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 3.2: Illustration of Depth Mixing. (a) shows sparse measurements (red) in color image, (b) shows estimated depth map [10], (c) shows pointcloud generated from the depthmap (b). The black crosses in (c) are the sparse measurements. The red 3D box indicates the position of the car. Between the foreground mode (car) and the background mode (walls of the building), the 3D points floating in between are the mixed depth pixels. We say the car is smeared along its boundary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Figure 3.3: An example of depth mixing, and how DC avoids it. (a) A slice through a depth image showing a depth discontinuity between two objects. (b) An example sparse depth representation: each pixel either has a depth value or a zero. (c) The result of a 1D convolution, shown in (d), applied to the sparse depth. This estimates the missing pixel, but generates a mixed-depth pixel between the two objects. (e) A DC representation of the sparse depth. Each pixel with a depth measurement has three non-negative coefficients that sum to 1 (shown column-wise). (f) The result of applying the same filter (d) to DC in (e). Missing depths are interpolated and notably there is no depth mixing between the objects. . . . . . . . . . . . . . . . . . . . 31 xiv Figure 3.4: (a) shows MSE and MAE loss functions. These perform an expectation over the probability of the data. Now consider an ambiguous case where a pixel’s depth has equal probability being d(1) or d(2) , shown as black squares in (b). 
Minimum MSE estimate, d, ˆ is the mid-point, while MAE has equal loss for all points between these two depths. This illustrates why MSE prefers mixed-depth pixels, and MAE fails to penalize them. . . . . . . . . . . . . . 33 Figure 3.5: An illustration of Pdata modeled as the sum of the DC of the two points from Fig. 3.4. The estimated ĉij with minimum cross-entropy loss, Eq. 3.3, will exactly match Pdata , providing a multi-modal density. A pixel depth estimate using Eq. 3.4 will find the depth of one of the peaks, and not a mixed-depth value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Figure 3.6: An overview of our method. Sparse depth is converted into Depth Coeffi- cients with multiple channels, each channel holding information of a certain depth range. This, along with color, is input to the neural network. The output is a multi-channel dense depth density that is optimized using cross entropy with a ground-truth DC. The final depth is reconstructed based on the predicted density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Figure 3.7: Our CNN architecture modified from [3] with 80-channel DC at the input, and 80-channel cross-entropy loss. The residual blocks are defined in the dashed red region, and the top branch consists of ResNet 34 [11]. . . . . . . 36 Figure 3.8: Depth completion with 16-row Lidar. (a) scene, (b, e) show Ma et al. [3] with significant mixed pixels. (c, f ) show our 3-coefficient estimation, demon- strating very little depth mixing. (d, g) show our estimation with all coefficients. 38 Figure 3.9: Another depth completion example with 16-row Lidar, where all subfigures are defined the same as Fig. 3.8. Interestingly, higher RMSE is reported on 3-coefficient estimation as opposed to all-coefficient estimation. . . . . . . . 38 Figure 4.1: Our depth completion algorithm can input LiDAR data and image (a), and extrapo- late the estimates of foreground depth d1 (b) and background depth d2 (c), along with a weight σ (e). Fusing all three leads to the completed depth (d). The foreground- background depth difference (f) d2 − d1 is small except at depth discontinuities. . . 43 Figure 4.2: Depth smearing across boundaries. We show the ground truth depth (colored red) overlaid on an image (a), depths estimated by the SoTA method [6] (b), our fused depths (c), our estimated weights σ (d), a depth slice of [6] (e), fused depth and σ slices (f), and foreground and background slice (g). Our extrapolation ability in (g) results in the sharp depth boundary in (f), rather than the smeared depth in (e). . . . 46 xv Figure 4.3: (a) The ALE from Eq. (4.2) is asymmetric around its minimum at the origin. (b) The RALE from Eq. (4.3) is a reflection of the ALE. We use the ALE for foreground sur- face estimation and the RALE for background estimation. (c) A pixel depth is shown with two ambiguities at depths d1 and d2 and probabilities p1 and p1 respectively. The black line shows the expected ALE which is the probability-weighted sum of two ALE functions, see Eq. (4.1). The expected ALE will have a minimum at one of the marked corners occurring at d1 and d2 . The minimum will be at d1 if Eq. (4.4) is satisfied, as it is in this case with p1 = p2 , and so acts as a foreground depth estimator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Figure 4.4: Incorporating 3-channel at the output of the Hour-glass network used in [6]. 
SDn and F Dn are the sparse inputs and fused depth obtained from F Gn , BGn , and σ n at multi resolution scale n respectively. . . . . . . . . . . . . . 52 Figure 4.5: Comparison of our method with SoTA methods with whole and zoom in views (a) showing Color Images (b) DC [12], (c) MultiStack [6] (d) NLSPN [13] and our method (e). Four different regions of the image from two different instants are se- lected to show depth quality from diverse areas. . . . . . . . . . . . . . . . . . . 54 Figure 4.6: Input image (a), its zoom-in views (d), our estimation on foreground depth (b), back- ground depth (c), fused depth (e), and the depth difference between foreground and background depth (f). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Figure 4.7: Difference of TWISE vs MultiStack [6] in (a) Absolute Error (AE) and (b) Squared Error (SE) respectively. The red indicates the most gain of ours over [6], marked by ’o’; while the blue is vice-versa, marked by ’x’. Zoom in for details. . . . . . . . . 59 Figure 4.8: (a) Magenta is a histogram of absolute error differences A(i) for A(i) > 0 (where MultiStack errors > TWISE errors) and green is a histogram of |A(i)| for A(i) < 0 (where TWISE errors > MultiStack errors). (b) Corresponding histograms for squared pixel error differences S(i). . . . . . . . . . . . . . . 60 Figure 4.9: Color images (top) and depth error maps in 0 − 5m (bottom). . . . . . . . . . 61 Figure 4.10: Semi-dense GT depths overlaid on color images. Zoom-in views show fore- ground/background depths are incorrectly spread (dilated/constricted) across boundaries of poles, traffic signs etc. visible in color images. . . . . . . . . . 62 Figure 4.11: (a) Results on Virtual KITTI experiments trained on clean GT and synthe- sized semi-dense respectively (units in cm). (b) MAE and (c) RMSE curves of scatter plots (Semi-Dense vs Clean GT) for different loss functions (col- ored symbols) and two backbone networks (MultiStack [6] and ResNet-18 [10]). Methods trained with the same backbone network are connected. . . . . . . 65 Figure 4.12: KITTI sparse patterns of (a) 64R, (b) 32R, (c) 16R, and (d) 8R subsampled LiDAR respectively overlaid on a color image. . . . . . . . . . . . . . . . . . 67 xvi Figure 5.1: Object detection from a SOTA architecture [5] trained with dense inaccurate pointclouds backprojected from a depth image by SOTA depth completion method i.e. TWISE [7]. The figure showing (a) several false positive detec- tions (red 3D cuboids) on the estimated high-density pointcloud. Estimated (colored) and LiDAR (white) pointclouds also shown in 3D space (b) Esti- mated depth map in 2D space from where the estimated pointcloud in (a) is originated from. The LiDAR points are also shown in white dots, and (c) shows the estimated and groundtruth 3D bounding boxes projected to a 2D image space of the scene. Red and blue boxes are predicted and groundtruth 3D bounding boxes respectively. It shows that bush and phone booth are being wrongly classified as cars. . . . . . . . . . . . . . . . . . . . . . . . . 72 Figure 5.2: 3D detections from high-density pointcloud. The color image and sparse depth is fed into a depth completion network, and dense depth estimate is ob- tained. The high-density depthmap is then converted to pointcloud in 3D, and trained with a SOTA object detection network, the end-result is 3D detections (red 3D bounding boxes) in pointcloud. . . . . . . . . . . . . . . . . . . . . 
77 Figure 5.3: Different types of depth noise on estimated depth; the first column shows color images, the second column shows depth images, and the third column shows 3D pointclouds; (a), (b) and (c) show noise-fg, indicating smeared points within the car resulting in misalignment of the predicted and GT bound- ing boxes as shown in red and blue respectively; (d), (e) and (f ) show noise- inbet, indicating smeared pointcloud between two objects resulting in the wrong orientation of the 2nd car; and (g), (h) and (i) show noise-bg, the background points (phone-booth) wrongly classified as a car since its outer surface looks similar to a car shape. . . . . . . . . . . . . . . . . . . . . . . 82 Figure 5.4: Selecting 4096 points from (a) 64R raw LiDAR and (b) dense depth (noisy environments) of TWISE. FPS samples more background and outlier points at boundaries of tree trunks and buildings from noisy dense depth. . . . . . . 84 Figure 5.5: Subsampling in point-based architecture. The dense depth points get sub- sampled to 64 encoded points at the final encoder level. . . . . . . . . . . . . 84 Figure 5.6: Removing smeared points by TWISE. (a) shows the rectangular region 0.2 < σ < 0.8 and depth difference (difference between BG and FG in TWISE) >= 3m as smeared depth points. The σ parameter refers to σ parameter learned in TWISE. (b) and (c) show the unfiltered and filtered depth points in 3D space respectively. As shown, most of the floating depth pixels at the scene in (b) is filtered out at (c). . . . . . . . . . . . . . . . . . . . . . . . . . 85 xvii Figure 5.7: Sampling to select relevant pixels from dense depth. (a) ≈ 315k with dense depth, (b) ≈ 200k after pruning ambiguous pixels based on sigma filter 5.3.4.1, (c) ≈ 35k after grid sampling at 0.2m, (d) ≈ 15k points after us- ing object level semantic mask and gathering neighboring points 0.3m from the semantic pixels in 3D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Figure 5.8: Object detection performance comparison in bird eye view with different depth inputs. (a) show color image, and BEV detection on (b) sparse depth, (c) dense depth, (d) depth filterd with sigma filter (5.3.4.1), (e) augmented LiDAR with variable radius respectively. (f ), (g), (h), and (i) show another example of a scene with color image and bird eye view of all the methods subsequently. In both the examples, augmented LiDAR has best performance compared to all other methods. . . . . . . . . . . . . . . . . . . . . . . . . . 93 xviii Chapter 1 Background and Motivation 1.1 Introduction In 3D vision, depth perception is the visual ability to estimate the world in 3D dimensions and gives us the physical distance of an object relative to a calibrated sensor. It is critical for 3D scene understanding and is widely used in autonomous driving and navigation, high-definition mapping and augmented reality. Accurate, dense and high-resolution depth is desired for 3D scene reconstruction and accurate scene perception. 1.2 Depth Estimation using Multi-Modal Sensors: A Motiva- tion In the context of automotive sensing and perception, depth plays a large role in 3D object detection [2, 14], classification, localization [15, 3], tracking [16], shape estimation [17], 3D reconstruction [18] and modelling [19]. 
Depth is necessary to measure the dimensions of any 3D object and locate its position with respect to the ego-vehicle, and to reconstruct the 3D structure of the environment for mapping, localization, and even path-planning and collision avoidance in an autonomous driving environment. There are several ways to estimate depth: using active depth sensors, passive sensors, or a combination of these sensors. The following explains how depth is estimated with each of these sensor types and motivates depth estimation using multi-modal sensors.

Active depth sensing involves transmitting a signal into the scene and measuring its return in real time. These sensors are also known as three-dimensional (3D) range finders, meaning they can acquire multi-point distance information across a Field-of-View (FoV). These sensors may emit laser pulses (time-of-flight or ToF sensors), infra-red beams (active stereo) or structured light patterns (fixed or programmed patterns) onto the surrounding environment to measure depth. Active stereo sensors (e.g. Intel RealSense D Series, Structure Core) use an infrared pattern projector in addition to the principle of triangulation between two cameras separated by a baseline distance to measure depth. The infrared (IR) projector helps to increase fidelity and reliability even in low-light conditions. However, the IR projectors on these sensors are limited in range, which relegates them to near- and mid-range applications regardless of the baseline distance. Structured light sensors (e.g. Zivid, Intel RealSense SR series, Kinect v1) combine low cost with high-fidelity 3D data capture, as well as performance in a wide set of lighting conditions, except for direct or indirect bright light. This is because the infra-red light used by these sensors gets overpowered by infra-red light in the same band that is naturally present in the environment. Time-of-flight (ToF) sensors (e.g. Velodyne, Ouster, Terabee) send out packets of modulated or unmodulated infra-red or laser pulses and record the time it takes for the signal to return. They are less susceptible to interference from bright and indirect sunlight, but more susceptible to absorption of the transmitted signal by dark, rough, or specular surfaces. The accuracy and range of these sensors depend on the power consumption of the emitters, the modulation and the specific technologies used (see Tab. 1.1). Normally, they have mm- to cm-level accuracy, but they are either sparse, low in resolution, limited in range, or all of the above. Amongst all these sensors, ToF sensors like LiDARs are widely used for longer-range applications in automotive safety and navigation.

Passive depth sensing refers to depth estimation using passive camera sensors by leveraging depth cues of the visible environment. These depth cues come in the form of relative object sizes at different ranges from the camera, occlusions, texture gradients, shading, aerial perspective, etc. Depth estimation is typically done on a regular 2D grid and thus offers high-resolution depth. Different types are possible: single-view monocular, multi-view monocular, motion-based monocular, and stereo depth. The most accurate amongst them are stereo depth sensors (e.g. StereoLabs ZED, Ensenso), which operate on the principle of triangulation between two cameras with a baseline distance to estimate depth from disparity.
They come in a wide range of baseline distances, and thus can operate at different depth ranges. However, their accuracy depends on matching features between the two images, image resolution and stereo calibration, and their performance suffers in low-light conditions and on textureless, smooth or specular surfaces. Some more advanced learning algorithms [20, 21, 22] leverage neural networks and novel loss functions to tackle those weaknesses, but computational and algorithmic complexity makes them slow and less readily adaptable for real-time applications. Depth estimation using a single view [23, 24, 25] has been a popular research problem over the last decade due to the availability of cheap and high-resolution sensors like cameras. The learning algorithms often leverage the availability of ground-truth depth [24, 26], novel CNN architectures [27, 28], or loss functions [29, 30] to improve depth accuracy, but the performance gap is still wide compared to existing active depth sensors (see Tab. 1.1). Multi-view [31, 32] and motion-based [15, 33] monocular depth estimation are also widely researched to leverage multiple viewpoints of the camera or motion-based cues, but depth errors are still in the range of meters due to occlusions, moving objects, shadows or textureless surfaces, and errors in camera pose.

| Property | Active Stereo | Structured Light | ToF | Single-view mono. | Multi-view mono. | Passive Stereo |
|---|---|---|---|---|---|---|
| Light property | Infra-red beam projector | Known pattern of light | Known speed of light | N/A | N/A | N/A |
| Effective range (m) | < 10; med range, depends on IR projector power | < 3; very short to med range, depending on illum. power | < 100; short to long range, depends on laser power and modulation | Theoretically infinite | Theoretically infinite | < 20; depends on baseline between cameras |
| Depth accuracy | mm to cm; rapid fall-off beyond projection range | mm to cm; rapid fall-off beyond projection range | mm to cm; gradually diminishing with distance | m; specific to algo. | m; specific to algo. | mm to cm; specific to algo., gradually diminishing with distance, difficulty with smooth, textureless surfaces |
| Low-light performance | Good | Excellent | Excellent | Poor | Poor | Poor |
| Scanning speed | Medium; limited by software complexity | Fast; limited by camera speed | Fast; limited by sensor speed | N/A | N/A | N/A |
| Latency | Medium | Medium | Low | High | High | Medium |
| Software complexity | Medium | Low/Middle | Low | High | High | High |
| Sensor cost | $ | $$ | $$ - $$$ | $ | $ | $ |

Table 1.1: Comparison of active (Active Stereo, Structured Light, ToF) and passive (Single-view mono., Multi-view mono., Passive Stereo) depth sensors.

While performance in range and resolution is improving, the cost of higher-resolution LiDARs remains prohibitive for numerous applications. As a result there is significant ongoing effort into improving the resolution, while lowering the cost, of 3D sensors [34, 35, 36]. Pixel-level fusion of multimodal sensors, i.e. LiDAR and camera [3] or radar and camera [37], is being explored to improve depth resolution. There are several motivations for attaining a pixel-level association of depth in the image grid. One motivation is that depth sensors are quite accurate and do not suffer from the problem of scale, but their depth measurements are sparse compared to the color images of a scene, which offer high-resolution imagery. Another motivation for obtaining depth at each color pixel is RGB-D applications such as scene modeling, colorizing pointclouds, etc. The idea of fusion is to project depth points from the LiDAR into the image plane, given we know the relative transformation between the two sensors.
Once the sparse depth measurements are projected into the image, the depth completion problem can then be defined as completing the 'missing' depth information in the full-resolution image grid, aided by the color image.

Early works on depth completion [38, 39, 35] started with the Middlebury dataset [40, 41], where a generated depth map is sparsely sampled to simulate raw input depth. This dataset has mostly synthetic and indoor scenes with controlled lighting conditions. Since then, research has moved on from the Middlebury dataset to larger datasets: NYU2 [4], the SUN RGB-D dataset [42] and ScanNet [43]. These became possible after the advent of RGB-D depth sensors like the Kinect and Intel RealSense. Some depth completion works using these datasets are [44, 45, 46, 47, 48]. But these datasets cover mostly indoor planar scenes, with more focus on detection, pose estimation and semantic recovery of objects rather than on depth super-resolution, and the depth range of the sensors was limited to a few meters. It is only in recent times that depth completion in outdoor scenes has gained considerable attention, with the advent of LiDARs that can promise both range and accuracy. Some of the pioneering works in the field are [3, 49, 50].

Figure 1.1: Performance comparison of a depth completion method using different input sensor modalities (raw depth, monocular, and monocular + raw depth). A standard deep learning method (Ma et al. [10]) is evaluated on the outdoor KITTI depth completion dataset. The maximum depth range available for evaluation is 85m. The x-axis is the number of LiDAR scanlines projected into the image as raw depth; the y-axis is the RMSE metric in cm. Zero scanlines corresponds to depth estimation using the monocular camera only. It shows that monocular plus sparse depth as the input modality can reduce the error by around 5 times compared to the monocular camera alone, simply by increasing the number of raw depth measurements. The performance gap between raw depth and multimodal input (monocular + depth) decreases as the number of raw depth measurements increases.

We choose to tackle the depth completion problem by seeking to maximize the resolution of 3D sensing and improve its accuracy. We ask: given a 3D sensor, can we upgrade its depth resolution by adding a higher-resolution color camera and fusing the sensor data? Often, the accuracy of estimation turns out to be a big issue depending on which input modalities we choose for estimating high-resolution depth. To put accuracy into perspective, consider Fig. 1.1, where we compare the performance of raw depth, monocular, and monocular + depth as inputs to a standard learning network [10]. An outdoor depth completion dataset (KITTI [1]) is used for evaluation. The maximum depth range available for evaluation is 85m. Zero scanlines refers to depth estimation using the monocular camera only. It shows that the error can be reduced roughly by 5 times (480cm for monocular, 87cm for monocular + depth with 64 LiDAR scanlines) if multimodal sensors are used as the input modality.

In short, we tackle the problem of depth completion in this dissertation using multi-modal fusion (depth sensors + monocular camera). We project sparse depth measurements from the depth sensor into the color image, provided we have accurate extrinsic calibration parameters, i.e. the rotation and translation with respect to the other sensor, and estimate a dense depth for each pixel in the image plane, aided by the color image.
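To make the fusion setup concrete, the following minimal sketch (not code from the thesis; the intrinsics K and extrinsics R, t are assumed to come from calibration) projects LiDAR points into the camera image to form the sparse depth map that serves as input to depth completion. A z-buffer keeps the nearest return when several points land on the same pixel.

```python
import numpy as np

def lidar_to_sparse_depth(points_lidar, K, R, t, height, width):
    """Project LiDAR points into the camera image to form a sparse depth map.

    points_lidar: (N, 3) points in the LiDAR frame.
    K: (3, 3) camera intrinsics; R, t: extrinsics mapping LiDAR coordinates
    into the camera frame (all assumed given by calibration).
    Returns an (H, W) array with depth in meters, 0 where no point projects.
    """
    # Transform into the camera frame: X_cam = R @ X_lidar + t
    pts_cam = points_lidar @ R.T + t

    # Keep only points in front of the camera
    pts_cam = pts_cam[pts_cam[:, 2] > 0.0]

    # Perspective projection to pixel coordinates
    uvw = pts_cam @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]

    # Discard projections that fall outside the image
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # Z-buffer: if several points hit the same pixel, keep the nearest one
    depth = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-z)          # far-to-near, so near values overwrite far ones
    depth[v[order], u[order]] = z[order]
    return depth
```

The resulting image-grid depth map is mostly zeros; filling in those zeros, guided by the color image, is exactly the depth completion task defined above.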
1.3 Depth Ambiguity

Although depth completion has been increasingly used in a wide range of applications in both indoor [3, 50, 48] and outdoor scenes [10, 51], its usefulness is still limited, since the solutions often fail to tackle a very important problem in depth completion: depth smearing across object boundaries. We now introduce the problem of depth ambiguity across boundaries and how it creates depth smearing at these discontinuous regions.

A key property of a 3D scene is that it has step edges between surfaces. These sharp discontinuities emerge at the occluding boundaries of objects in natural scenes [52, 53]. When depth points from a LiDAR are projected into the uniform, dense 2D camera grid, missing pixels at object boundaries face an ambiguity problem: do these pixels belong to the foreground or the background object? Failure to resolve this ambiguity creates depth mixing, which smears object boundaries and distorts object shapes. Using high-resolution information from another modality (particularly RGB) turns out to be useful for detecting and recovering sharp depth discontinuities. But the problem of depth mixing remains, and it gets worse with more sparsity. Some works [52, 54] address the depth-mixing problem by putting regularization constraints on estimated edge maps/discontinuity maps in the cost function. But they do not address the problem completely, since predicting edge maps/discontinuity maps is challenging in the first place, and such maps are often not readily available in indoor/outdoor scenes.

To understand the ambiguity better, let us first define some key terms in depth completion. We would like to learn model parameters θ that can fill in all the missing pixels in the dense 2D grid with depth d_i, given sparsely sampled depth and a color image x_i as input data. The learning task is supervised by ground-truth data d_gt. The data is defined by a probability distribution p_x. The aim of the learning task is to learn the θ that best predicts depth d_i, given sparse depth and color image data x_i with distribution p_x:

$$\hat{\theta} = \arg\max_{\theta} \, \mathbb{E}_{p_{\mathrm{data}}}\!\left[\log p_{\mathrm{model}}(d_i \mid x; \theta)\right]. \tag{1.1}$$

Here the expectation is performed over training data with distribution p_data. The term p_model is the probability of estimating d_i given data x and parameters θ. Ideally, our model can learn perfect depth given infinite training data.

Initially, let us consider that the data x consist only of sparse depth values and no color images. To simplify the problem, consider two flat surfaces perpendicular to the z axis with a depth discontinuity at their boundary. Within a surface the method can exactly estimate depth, but close to the boundary there is an ambiguity. Given sparse depth samples on both surfaces, we can ask whether a pixel near the depth discontinuity belongs to the foreground or the background surface. If the boundary is unknown, it is ambiguous whether the pixel is foreground or background. What this means in terms of Eq. 1.1 is that, given the same data x, there are at least two compatible depths: d(1) for the foreground and d(2) for the background. If a color image is available, it may be possible to exactly infer the boundary between objects and resolve this ambiguity. However, often this is not the case; the boundary is not clear and hence the ambiguity persists.
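The following minimal numerical sketch (not from the thesis; the depths and probabilities are illustrative) makes the consequence of this ambiguity concrete: for a boundary pixel equally likely to lie on a 10 m foreground or a 13 m background surface, the point estimate that minimizes the expected squared error is the mixture mean, a depth floating between the two surfaces, while the expected absolute error cannot distinguish among any of the in-between depths.

```python
import numpy as np

# Two equally likely depths for an ambiguous boundary pixel:
# foreground at 10 m, background at 13 m (illustrative values).
d1, d2, p1, p2 = 10.0, 13.0, 0.5, 0.5

candidates = np.linspace(8.0, 15.0, 701)          # possible point estimates
exp_sq  = p1 * (candidates - d1) ** 2 + p2 * (candidates - d2) ** 2
exp_abs = p1 * np.abs(candidates - d1) + p2 * np.abs(candidates - d2)

print("MSE-optimal estimate:", candidates[np.argmin(exp_sq)])   # ~11.5 m, a floating "mixed" depth
band = (candidates >= d1) & (candidates <= d2)
print("MAE spread over [d1, d2]:", np.ptp(exp_abs[band]))       # ~0: MAE cannot reject mixed depths
```

Later chapters design representations and losses whose minimizers fall on one of the two surfaces rather than in the empty space between them.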
Most depth completion methods fail to resolve this ambiguity in these regions and as a result introduce depth mixing between the two surface depths. In this dissertation we design our representations carefully to handle this case. We show that the way we address this ambiguity has important implications for depth completion problems.

1.4 Thesis Outline

The objective of the dissertation is to propose a novel depth completion method that can preserve the shape and boundary of objects for use in real-life perception applications like 3D object detection.

Chapter 2 reviews existing depth representations and introduces our novel multi-channel (Depth Coefficients) and dual-channel (Twin Surface) representations of depth. We discuss several important properties of these two representations that help encode ambiguity across object surfaces and boundaries. We also propose effective depth completion performance metrics, e.g. TMAE and TRMSE, that reduce the preference for mixed depths and promote sharp boundaries.

Chapter 3 digs deeper into our proposed multi-channel (Depth Coefficients, or DC) representation and how it can be learned by a neural network. It shows how DC can avoid depth mixing during learning and depth reconstruction. We also propose an effective loss function that works well for learning this representation.

Chapter 4 delves into learning the dual-channel (Twin Surface, or TWISE) representation, which avoids the memory constraints of DC and improves computational efficiency significantly. We also discuss the quality of ground-truth depth in real-world depth completion problems and how accurate evaluation of methods is affected by the presence of outliers in it. We identify potential pitfalls of the commonly used evaluation measure, RMSE, show how it promotes mixed-depth solutions, and study the robustness of common metrics (MAE and RMSE) in the presence of outliers in the ground-truth data.

Chapter 5 shows an application of our depth completion solution in 3D object detection. This chapter investigates whether estimated high-resolution depth from depth completion methods can help object detection. It shows the effect of different types of depth noise originating from the depth estimates on detection performance, and proposes effective ways to reduce noise in the estimates and overcome architecture limitations. The proposed findings are validated on both real-world and synthetic datasets.

Finally, Chapter 6 concludes the thesis with our findings, limitations and some possible future directions for depth completion.

Chapter 2

Depth Representation

Depth representations are an integral component of depth completion and of learning high-resolution depth. In this chapter, we discuss the different types of depth representations that exist in the literature and introduce our depth representations and the motivations behind them. This chapter also explains two proposed metrics that promote fairer depth completion comparisons by discounting mixed depths across object boundaries. We use these metrics as performance measures for depth completion throughout the rest of the thesis.

2.1 Introduction

Depth data provides rich 3D information about the geometry of an object, hence its adequate representation is of significant importance for computer vision tasks.
There are different types of 3D representations available, and certain representations are efficient for individual tasks like acquisition, rendering, manipulation, data analysis and even animation. But a single representation might not be effective for all tasks. We realize that the representation also affects learning for depth completion. Hence we seek an effective depth representation that can facilitate learning depth accurately both within objects and across their boundaries.

Sparse depth points projected into the image plane are often called a depth map, which is a single-channel representation of depth on the image grid. Although it is the default choice in depth completion tasks, we show that this representation fails to capture the empty space between objects [55], an interesting property of a 3D scene, and as a result, learning algorithms often smear the depths of foreground and background objects and fill in the empty space between them. We propose two novel depth representations that can model 3D scenes effectively, in both ambiguous and non-ambiguous regions. We also show in this chapter that the conventional metrics used to evaluate depth completion algorithms are often not effective measures of depth at depth boundaries.

In summary, this chapter introduces our novel representations for depth completion, i.e. the multi-channel depth coefficients and dual-channel twin surface depth representations. We elaborate on some interesting properties of these representations. We also propose novel evaluation metrics that assess performance more strictly at depth discontinuities or occlusion boundaries. The uses of these representations in depth completion are also a contribution of the thesis and will be discussed further in the subsequent chapters.

2.2 Representations of Depth

Figure 2.1: An illustration of different representations of depth: (a) point-cloud data (courtesy of Caltech), (b) voxel grid data (courtesy of IIT Kharagpur), (c) a triangle mesh (courtesy of UW), and (d) multiview representations of depth by means of depth maps (courtesy of Stanford).

Shapes and sizes of 3D objects can be measured using active and passive depth sensors. These 3D measurements are typically stored as raw pointclouds, which can later be represented efficiently in multiple ways depending on the application scenario; see Fig. 2.1. Some of these formats pose new challenges for deep learning architectures while also providing opportunities for novel and efficient solutions. Some of the most common representations include:

2.2.1 3D Point Clouds

3D point clouds are an unordered set of points sampled on the surfaces of 3D objects. Each point is encoded by its own set of X, Y, Z coordinates in 3D space. They are widely used in object detection [2], segmentation [56] and surface normal estimation [57]. Their advantages include registering precise object locations, and they are typically decoded directly from raw pulse signals in ToF sensors. A point cloud can also be realized as a small set of Euclidean subsets that have a global parameterization and a common coordinate system. But learning from pointclouds is challenging due to the irregular structure of the points and the lack of connectivity information between points, which can cause ambiguity in distinguishing multiple surfaces that are close to each other. Also, pointclouds are a memory-intensive representation, so upsampling the pointcloud of a 3D scene may run into memory and computational limitations.
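As a concrete illustration of the point-cloud representation above and of the voxel grids discussed next, the following minimal sketch (illustrative only; the grid extent, resolution and random points are assumptions, not values from the thesis) stores a point cloud as an (N, 3) array and converts it into a binary occupancy grid.

```python
import numpy as np

def voxelize(points, voxel_size=0.2, origin=(0.0, -20.0, -2.0), dims=(400, 200, 20)):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    points:     (N, 3) array of X, Y, Z coordinates in meters.
    voxel_size: edge length of each cubic voxel.
    origin:     world coordinate of the grid corner (illustrative values).
    dims:       number of voxels along X, Y, Z.
    """
    grid = np.zeros(dims, dtype=bool)
    idx = np.floor((points - np.asarray(origin)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(dims)), axis=1)  # clip to grid bounds
    idx = idx[inside]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# A point cloud is just an unordered (N, 3) array of 3D samples.
points = np.random.uniform([0, -20, -2], [80, 20, 2], size=(100_000, 3))
occupancy = voxelize(points)
print(occupancy.shape, occupancy.mean())   # grid size and fraction of occupied voxels
```

Note that the dense grid allocates 400 x 200 x 20 cells regardless of how many points fall inside it, which is the memory trade-off behind the sparse and octree-based variants mentioned in the next subsection.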
2.2.2 Voxels

Voxels specify occupancy on a regularly spaced lattice in Cartesian or polar coordinates. Each voxel can be one-dimensional, specifying opacity or occupancy, or can carry multi-dimensional information, such as color or probability of occupancy, in addition to opacity. A discrete, quantized coordinate system can also be used to define grid-structured representations like voxels. Viewpoint information about the 3D shape can be encoded by classifying voxels into visible, occluded, or empty regions. This regular grid can be used for object detection [58], object classification and orientation estimation [59]. Dense voxel grids can be memory intensive at high resolutions, since they store both occupied and unoccupied parts of the scene. However, sparse voxel representations are also available to tackle memory issues. A more recent 3D volumetric representation for efficiently finding neighboring voxels is octree-based voxels [60, 59], which are typically of varying size and can store high-resolution data using a hierarchical data structure. However, none of these voxel-based representations preserve the shape of 3D objects precisely. Additionally, there are issues with artifacts from discretizing the surface, as well as high data and computational costs.

2.2.3 Polygon Meshes

Polygon meshes consist of a set of polygon facets with shared vertices that can approximate a geometric surface. The vertices are associated with a connectivity list which describes how they are connected to each other. Polygon meshes facilitate rendering because they represent the underlying geometric surfaces of objects. However, depending on the resolution, meshes are also memory intensive and require high computational power to process. Meshes can also be defined directly on image grids, but this representation may suffer from poorly defined vertex connectivity due to ambiguous or missing information in the scene. Using regular convolutional neural networks (CNNs) on raw mesh data may not be feasible because of the irregular representation, connectivity issues due to edge ambiguity, differing resolution at different parts of the scene and non-uniformity of the data [61]. However, graph CNNs are applicable to meshes, since meshes can be represented as graph data structures [62]. Such an approach offers a promising new direction for processing 3D data represented as meshes.

2.2.4 SingleView and Multiview Depth Maps

Single-view depth maps are 2.5D representations of 3D data, and closely reflect the raw depth data captured on 2D grids by acquisition devices, e.g. ToF sensors. Multiview depth maps are a combination of multiple single-view depth images captured from different camera viewpoints. Representing 3D data in this way allows learning multiple features to reduce the effects of noise, incomplete data, occlusion and illumination problems. They have been used for RGB-D fusion and instance segmentation [63, 64]. This regular-grid representation can be processed with CNNs in a way analogous to color image super-resolution [65, 66]. It is the representation of choice for colorization techniques and fusion [4] as well as depth completion. However, the question of how many views are sufficient to model a 3D shape is still open. Also, the sparsity of single-valued depth data in 2.5D representations can have adverse effects on dense convolutions [34], and can lead to depth mixing and smearing problems at boundaries; see Sec. 3.3.
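The following 1D sketch (illustrative values, not code from the thesis; Chapter 3 revisits this effect in Fig. 3.3) shows the mechanism: interpolating a sparse, single-valued depth map with an ordinary normalized averaging convolution fills in-surface gaps correctly but produces a mixed depth at the boundary between two surfaces.

```python
import numpy as np

# A 1D slice through a sparse depth map: two surfaces (5 m and 20 m) with
# missing pixels marked by 0 -- illustrative values only.
sparse = np.array([5.0, 0.0, 5.0, 0.0, 20.0, 0.0, 20.0])
kernel = np.array([1.0, 1.0, 1.0])          # simple 3-tap averaging filter

# Normalized sparse convolution: average only over the observed neighbors.
mask = (sparse > 0).astype(float)
num = np.convolve(sparse, kernel, mode="same")
den = np.convolve(mask,  kernel, mode="same")
interp = np.where(den > 0, num / np.maximum(den, 1e-6), 0.0)

print(interp)
# [ 5.   5.   5.  12.5 20.  20.  20. ]
# Missing pixels inside each surface are filled correctly (5 m and 20 m), but
# the missing pixel at the boundary is filled with 12.5 m: a mixed depth
# floating in the empty space between the two objects.
```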
Also, the sparsity of single-valued depth data in 2.5D representations can have adverse effects on dense convolutions [34], and can lead to depth mixing and smearing problems at boundaries, see Sec. 3.3.
2.3 Multichannel Representation of Depth: Depth Coefficients
Depth completion algorithms typically take the single-channel depth representation as the default choice for estimating missing depth pixels on a regular 2D image grid. However, ambiguities can exist around depth boundaries, and the majority of depth completion algorithms suffer in these regions by introducing smearing, i.e. mixing the depths of the multiple surfaces that are possible at that pixel location. To tackle ambiguity, we deem it necessary to represent depth with a multi-channel representation that can suggest multiple possible depths. In this section, we discuss why we opt for a multichannel depth representation to estimate accurate depths in ambiguous regions, introduce Depth Coefficients (DC) and discuss their properties. We also explain how to reconstruct depth from DC.
2.3.1 Depth Representation
Figure 2.2: Illustration of DC. The depth pixel (red circle) in the image plane is projected back into 3D space along its pixel ray. Depth bins are quantized along that ray, and a weight (depth coefficient) is assigned to a few depth bins based on the proximity of the depth to each bin.
We seek a depth representation that can model depth ambiguity in a 3D scene, and preferably be used to resolve that ambiguity. So instead of representing depth as a single value at a pixel, we use a discrete set of weights or coefficients, called depth coefficients (DC), to represent a single depth. Geometrically, DC represents depth along pixel rays. In simple terms, these coefficients are weights on the quantized depth bins (Fig. 2.3(b)) lying along the trajectory of that ray, and the sum of the products of these weights and the bin depths gives the depth at that pixel location. We now list several interesting properties of depth coefficients that make them a useful representation for modelling ambiguity.
2.3.1.1 Discrete Representation
Depth coefficients are weights assigned to multiple depth bins along a pixel ray that preserve the precision of a single depth value (see Fig. 2.2). With depth coefficients, depth is realized as a sum of weighted bins. However, spreading the weights over all available depth bins would be memory intensive and make the learning task harder. So we use only a finite number of non-zero coefficients, which makes the representation sparse and thus preserves both precision and memory. In doing so, we concentrate the signal energy at the plausible depth bins, typically the main bin and its neighboring bins, see Fig. 2.3. Similar bin representations already exist in the literature for several other applications such as tracking [67], bilateral filtering [68], channel smoothing [69] and joint image alignment [70], but the way we use channels/bins to represent depth is novel and unique.
2.3.1.2 Probability Representation
Ambiguities and uncertainties can be modelled by probabilities. So we seek a probabilistic representation of depth rather than a single depth value at each pixel location. Depth coefficients offer a discrete probability representation over quantized depth bins at any pixel location of a 2D grid. This is an important property since depth is ambiguous at depth discontinuities or object boundaries.
A single-modal probability distribution can model ambiguity by increasing the uncertainty, or standard deviation, of the depth. Although a single-modal distribution can model non-ambiguous regions quite well, in ambiguous regions it is preferable to suggest multiple depth hypotheses using a multi-modal distribution. This is preferable since it offers the possibility of choosing one depth over the other without mixing the two possible depths. A mixture of Gaussians could be used for a multi-modal representation, but which particular Gaussian should be weighted more is not obvious. In this respect, DC can model both single-modal and multi-modal distributions. It is noteworthy that although we use a single-modal DC to represent depth, it is possible to estimate both single-modal and multi-modal DC; the higher-weighted mode in the estimated DC can be used to choose the final depth.
Figure 2.3: Representing depth by means of multiple depth bins, or depth coefficients, to preserve depth precision. In this toy example, we divide a depth range of 10 m into 10 bins, spaced 1 m apart. (a) Depth representation by a single bin, which loses precision: $d = 7.25$ is approximated by the nearest bin depth, $d \approx d_p$. (b) Depth representation by a finite number of bins, which preserves precision: $d = \sum_{i=p-1}^{p+1} c_i d_i = 0.125 \times 6 + 0.5 \times 7 + 0.375 \times 8 = 7.25$.
2.3.2 Mathematical Notations
Let us now represent a dense or sparse depth image by means of depth coefficients. We create a multi-channel image, with all channels at the same spatial resolution and each channel centered at a pre-determined, uniformly spaced depth; $D = \{D_1, \ldots, D_N\}$, where $D_j$ refers to the depth of the whole image at the $j$-th channel. The channel depths increase in uniform steps of size $b$. In choosing the number of channels (or bins) we trade off memory versus precision. For our applications, we chose 80 bins to cover the full depth range up to 80 m, and this determines the bin width $b$, i.e. bins 1 m apart. Thus each pixel $i$ has a vector of values, $c_i = [c_{i1}, \ldots, c_{iN}]$, which we call its Depth Coefficients (DC), that encodes its depth $d_i$. We constrain this coefficient vector to be non-negative and to sum to 1, and it gives the depth as its inner product with the quantized channel depths:
$$d_i = \sum_j c_{ij} D_j. \quad (2.1)$$
Note this representation is not unique, as many combinations of coefficients may produce the same depth. So we use the following simple representation with a minimum number of non-zero coefficients (in our case three) to represent depth. We select 3 to make the distribution of weights compact and symmetric; it also facilitates learning, since there are only 3 coefficients to learn compared to 5 or a higher odd number. Let $k$ be the index of the depth channel closest to the pixel depth $d_i$, $b$ the spacing between adjacent bin depths, and $\delta = \frac{d_i - D_k}{b}$ the fraction of residual depth with respect to $b$. The $D_{k-1}$ and $D_{k+1}$ bins can be expressed in terms of the center bin depth as $D_{k-1} = D_k - b$ and $D_{k+1} = D_k + b$ respectively. With this substitution, plugging the coefficients of Eq. (2.2) into Eq. (2.1) causes all terms to cancel except $d_i$. Considering that the center bin has the maximum weight of 0.5, the two neighboring weights depend on the residual $\delta$. The DC vector for pixel $i$ is:
$$c_i = \left[0, \ldots, 0, \; \frac{0.5 - \delta}{2}, \; 0.5, \; \frac{0.5 + \delta}{2}, \; 0, \ldots, 0\right], \quad (2.2)$$
where the three non-zero terms are $(c_{i(k-1)}, c_{ik}, c_{i(k+1)})$. This is unique for each $d_i$, satisfies Eq. (2.1), and sums to 1.
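To make the construction in Eqs. (2.1) and (2.2) concrete, the following is a minimal NumPy sketch of converting a depth map to Depth Coefficients and back; the function names, the default bin count and the handling of missing pixels are illustrative choices, not the thesis implementation.

```python
import numpy as np

def depth_to_dc(depth, num_bins=80, bin_width=1.0):
    """Encode a depth map (H, W) as Depth Coefficients (H, W, N), Eq. (2.2).

    Pixels with depth <= 0 are treated as missing and get an all-zero vector.
    Depths are assumed to lie well inside the binned range.
    """
    H, W = depth.shape
    bin_depths = (np.arange(num_bins) + 1) * bin_width           # D_1 ... D_N
    dc = np.zeros((H, W, num_bins), dtype=np.float32)

    valid = depth > 0
    k = np.clip(np.round(depth / bin_width).astype(int) - 1, 1, num_bins - 2)
    delta = (depth - bin_depths[k]) / bin_width                   # residual in [-0.5, 0.5]

    rows, cols = np.nonzero(valid)
    kk, dd = k[valid], delta[valid]
    dc[rows, cols, kk - 1] = (0.5 - dd) / 2.0
    dc[rows, cols, kk]     = 0.5
    dc[rows, cols, kk + 1] = (0.5 + dd) / 2.0
    return dc, bin_depths

def dc_to_depth(dc, bin_depths):
    """Reconstruct depth as the inner product of Eq. (2.1)."""
    return dc @ bin_depths

# Toy check mirroring Fig. 2.3: depth 7.25 m with 1 m bins.
d = np.array([[7.25]])
dc, bins = depth_to_dc(d, num_bins=10, bin_width=1.0)
print(dc[0, 0, 5:8])            # -> [0.125 0.5   0.375]
print(dc_to_depth(dc, bins))    # -> [[7.25]]
```

The toy check reproduces the three coefficients of the Fig. 2.3 example, and the reconstruction recovers the original depth exactly.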
2.3.3 Depth Reconstruction from True DC
To reconstruct depth from true DC at each pixel, we simply apply Eq. (2.1) to the coefficients. For an ideal DC with only three non-zero coefficients, the reconstruction of depth from DC is exact. However, an estimated DC may have incorrect coefficients, or more than three non-zero coefficients, and still represent a depth; this case is discussed in Sec. 3.4.2.
2.4 Dual Channel Representation of Depth: Twin Surface
The multichannel representation (DC) requires multiple channels, and thus more memory, to preserve depth precision than a single-channel depth. But DC has the advantage that it can represent depth ambiguity at every pixel, and can express multiple ambiguities through multiple peaks in the form of a multi-modal distribution. However, in real scenes most depth pixels have no ambiguity, since they typically come from a single surface, and those that are ambiguous usually have only two possibilities: depth from either the foreground or the background surface. This phenomenon is illustrated in Fig. 2.4, where a dual-peak DC suggests two possible depths.
It is also possible to represent ambiguity with a dual-channel depth, i.e. a foreground and a background depth for each pixel, which represent the minimum and maximum plausible depths at an ambiguous pixel. We call this a dual-channel representation of depth, or twin depth. Ambiguity typically exists around object boundaries, where a pixel could plausibly take either the foreground or the background depth. Sparse depth measurements are likely to be missing at or around boundaries, and interpolating foreground and background depths in a single channel can cause smearing. But this ambiguity can be modelled with a dual-channel depth, the two channels representing a foreground and a background depth respectively. This simple approach saves memory dramatically and can model a real-world scene at any depth without losing precision. Hence we propose a more memory-efficient representation that uses just two channels per pixel, rather than many.
Figure 2.4: Dual-channel representation of depth (foreground depth and background depth).
2.4.1 Depth Reconstruction from Foreground and Background Surfaces
Although the dual-channel depth is used to model ambiguity, we still wish to reconstruct a final depth from this representation. Let us denote the foreground and background depths by $d_1$ and $d_2$ respectively. In non-ambiguous regions, $d_1$ and $d_2$ represent the same surface, while in ambiguous regions they represent the foreground and background surfaces respectively. The true depth can be represented by
$$d_t = \sigma d_1 + (1 - \sigma) d_2, \quad (2.3)$$
where $\sigma$ ranges between 0 and 1. With this representation, $\sigma$ can choose between the foreground and background surfaces when there is ambiguity, and can take any value in non-ambiguous regions. The reconstruction also allows some level of interpolation within an object surface by mixing the two depths.
2.5 Proposed Evaluation Metrics
While RMSE and MAE are useful metrics for overall depth completion performance, they are not effective measures of performance at sharp discontinuities across occlusion boundaries. In this section, we show why RMSE and MAE are not good metrics for evaluating sharpness at depth discontinuities and propose new metrics to quantify it. The two most common error metrics used to evaluate depth completion are MAE and RMSE.
Both MAE and RMSE reflect the deviation of estimated depth from ground-truth depth. While MAE is the mean of the absolute error residuals over all the data, RMSE is the square root of the mean of the squared residuals. Since the errors are squared before they are averaged, RMSE gives relatively high weight to large errors; it is therefore more useful when large errors are particularly undesirable. In the context of depth evaluation, both metrics are useful when depth is evaluated on a single surface, but the situation changes when multiple surfaces exist and the estimated depth has equal probability of lying on any one of them.
To understand the implication of RMSE when multiple surfaces exist, consider a pixel whose depth is equally likely to be on the foreground surface $d^{(1)}$ or the background surface $d^{(2)}$. The expected RMSE penalty is minimized at the midpoint of these two depths, i.e. at a smeared pixel. MAE, although less severe, gives its minimum penalty to any pixel lying between the two surfaces; mixed-depth pixels are therefore not sufficiently penalized by this metric either. These scenarios are best illustrated in Fig. 2.5(c).
Thus we propose two complementary metrics that focus on depth surface accuracy and penalize depth mixing equally to other large errors.
Figure 2.5: Illustration of why the RMSE and MAE metrics are not good at boundaries. The x-axis shows $(\hat{y}_i - \tilde{y}_i)$, where $\hat{y}_i$ and $\tilde{y}_i$ are the estimated depth and ground-truth depth respectively; the y-axis shows the loss. (a) and (b) show MSE and MAE, and tRMSE and tMAE, with their characteristic loss curves. (c) and (d) show the expected MSE and MAE, and the expected tRMSE and tMAE, when depth pixels are missing between 10 m and 13 m (see text). From (a) and (b), all of the metrics indicate a perfect prediction (no error) when the estimate coincides with the GT, but for MSE and MAE the error continues to grow as the estimate moves away from the GT, whereas for tRMSE and tMAE a fixed error is assigned to points beyond the threshold. From (c) and (d), the expected MSE and MAE favor estimates between 10 m and 13 m, and thus indirectly promote depth mixing, whereas the expected tRMSE and tMAE favor either 10 m or 13 m. In this way, tRMSE and tMAE do not penalize an estimate for choosing one surface (10 m) instead of the other (13 m), and only account for intra-surface variations of the depth estimates.
These metrics are the Root Mean Squared Thresholded Error (tRMSE) and Mean Absolute Thresholded Error (tMAE), defined as follows:
$$\text{tRMSE} = \sqrt{\sum_{i=1}^{P} \frac{\min\left((\tilde{y}_i - \hat{y}_i)^2, \; t^2\right)}{P}}, \quad (2.4)$$
$$\text{tMAE} = \sum_{i=1}^{P} \frac{\min\left(|\tilde{y}_i - \hat{y}_i|, \; t\right)}{P}. \quad (2.5)$$
Here $P$ is the number of pixels, $t$ the threshold distance distinguishing within-surface variation from inter-object separation, $\tilde{y}_i$ the ground-truth value and $\hat{y}_i$ the estimated value. For tRMSE and tMAE, a fixed error is assigned to points beyond the threshold $t$, as illustrated in Fig. 2.5(b). Assuming intra-surface variation is no more than $t$, any error that exceeds $t$ is penalized equally, regardless of whether the estimated depth falls midway between the two possible surfaces or on the incorrect surface. As a result, a mixed-depth pixel is not favored over a depth pixel that estimates the incorrect surface.
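For reference, the following is a minimal NumPy sketch of tRMSE and tMAE as defined in Eqs. (2.4) and (2.5); the function names, the default threshold and the omission of a validity mask are illustrative choices rather than the thesis implementation.

```python
import numpy as np

def t_rmse(gt, pred, t=1.0):
    """Thresholded RMSE, Eq. (2.4): squared residuals are capped at t**2."""
    capped = np.minimum((gt - pred) ** 2, t ** 2)
    return float(np.sqrt(capped.mean()))

def t_mae(gt, pred, t=1.0):
    """Thresholded MAE, Eq. (2.5): absolute residuals are capped at t."""
    capped = np.minimum(np.abs(gt - pred), t)
    return float(capped.mean())

# Two-surface toy case: true surface at 10 m, alternative surface at 13 m.
gt      = np.array([10.0, 10.0])
smeared = np.array([11.5, 11.5])     # mixed-depth estimates
wrong   = np.array([13.0, 13.0])     # estimates on the wrong surface
print(t_rmse(gt, smeared), t_rmse(gt, wrong))   # 1.0 1.0 -> equal penalty
print(t_mae(gt, smeared), t_mae(gt, wrong))     # 1.0 1.0
```

The toy check mirrors the two-surface scenario of Fig. 2.5: a smeared estimate between 10 m and 13 m is penalized exactly as much as choosing the wrong surface.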
We argue that a mixed-depth pixel is at least as detrimental as an estimated depth pixel on the wrong surface, if not worse. This metric is therefore well suited for evaluation around boundaries, since most mixed-depth pixels reside around boundaries. To illustrate the idea mathematically, consider a single pixel with a density function defined over its depth. Assume that this density is greater than zero for a set of points (corresponding to probable surfaces) and zero elsewhere. The expectation of this density function is shown in Fig. 2.5(d). All depths separated by at least $t$ from every probable depth have a fixed expected cost of $t$, which is greater than the cost of depths closer than $t$ to a probable depth. Unlike RMSE, there is no local minimum at mixed-depth points (those more than $t$ from a probable point). Unlike MAE, all mixed-depth points have greater cost than points close to probable surfaces. Hence the metric does not favor a mixed-depth pixel over a pixel that chooses the wrong surface.
2.6 Conclusion
In this chapter we introduced two novel representations of depth, multichannel depth coefficients (DC) and the dual-channel twin surface, which can both model the ambiguity of realistic 3D scenes. Multi-channel depth coefficients form a discrete probability distribution (a discrete pdf) over depth, and we showed how to reconstruct depth from DC. To reduce the memory requirement, we also showed how a dual-channel representation can model ambiguity, and proposed a reconstruction method for this twin representation. Both representations can potentially avoid depth mixing around object boundaries. We also explained the problems of conventional evaluation metrics at object boundaries and proposed novel metrics for evaluating depth on boundaries. Now that depth completion methods are producing high-quality dense depths, our proposed metrics, tRMSE and tMAE, are preferable as they reward highly probable depth estimates and give equal penalty to all large errors, which are mostly mixed-depth pixels. The next two chapters describe how these two representations are used in deep neural network models for solving depth completion problems.
Chapter 3 Depth Coefficients for Depth Completion
3.1 Introduction
This chapter focuses on using our proposed multichannel depth representation, Depth Coefficients (DC), in a deep learning framework for estimating dense depth. We show how DC is able to avoid mixed-depth pixels during learning. We also examine how loss functions such as MSE favor mixed-depth pixels in certain cases. Using our proposed DC representation, we leverage a cross-entropy loss to avoid promoting depth mixing. Finally, we show one utility of depth estimation with DC in 3D object detection: resolving ambiguity across boundaries can improve 3D object detection performance by recovering object shape and pose. A sample result is shown in Fig. 4.1. The contributions of this chapter are: (1) the first use of DC in a neural network, (2) a new use of cross entropy as a depth loss function, and (3) a demonstration of improved object detection from super-resolved depth.
3.2 Related Works in Depth Completion
Evolution of depth completion problems with datasets: The substantially lower resolution of depth sensors compared to color cameras has been a motivator for depth completion.
Early work by Diebel and Thrun [38] used Markov random fields to guide upsampling, and this was followed by a variety of improvements including bilateral filters [39], robust regularization [35, 71], hand-crafted filters [72] and image segmentation [73].
Figure 3.1: Our depth completion uses (a) a color image and the subsampled (16-row) Lidar points projected into the image plane to estimate (b), a dense depth image. (c-e) are zoomed-in views of the input color image, the super-resolved depth of Ma et al. [3] and ours respectively. (f-h) are bird's eye views of the input sparse Lidar data, (d), and (e), respectively. Colors in the bird's eye view show the number of height pixels in each cell/pixel, so a smeared object has height pixels spread out around the object boundary. Notice the smearing of height at the object boundaries in (g) compared to (h). These depth-mixing pixels impact qualitative appearance as well as subsequent tasks, such as object detection and pose estimation.
Most of these evaluations were done on the Middlebury dataset [41], which consists largely of synthetic and indoor scenes with controlled lighting conditions and does not offer ample data for training deep neural networks. More recently, deep convolutional neural networks (CNNs) have taken the lead, and research has moved on from the Middlebury dataset [40] to larger datasets: NYU2 [4], the SUN RGBD dataset [42] and ScanNet [43]. These became possible after the advent of accurate RGBD depth sensors such as the Kinect and Intel RealSense. Depth completion works using these datasets include [45, 46, 47, 48], all of which use deep networks now that more training data are available. But these datasets contain mostly indoor, planar scenes, with more focus on detection, pose estimation and semantic recovery of objects than on depth super-resolution, and the depth range of the sensors is limited to a few meters.
Only recently has depth completion in outdoor scenes attracted strong interest for improving 3D perception, mostly for autonomous driving. Synthetic datasets [74, 75, 76] provide dense depth maps exactly colocated with color imagery, but questions remain about the photo-realism of the scenes and about the different tools researchers use to subsample the dense depth maps, so they cannot replace real-world datasets [77, 78, 79]. Unfortunately, those real-world datasets focus on 3D object detection, segmentation and tracking applications. The only realistic dataset to date that focuses on dense depth estimation is KITTI [34]. Even though the GT it provides is only semi-dense, its main advantages are that it is realistic, outdoor, and large enough to train neural networks. Most recent outdoor depth completion works [36, 3, 51] use this dataset for benchmarking. Our work uses a similar network to [3], but focuses on the depth representation and loss function instead of the architecture.
Loss functions for depth completion: A key component of depth completion is the choice of loss function. Recent work has explored loss functions including L2 [48, 3], L1 [47], inverse-L1 [36], and softmax losses on depth [50]. While these loss functions can achieve low error on measures including RMSE, MAE and iMAE, this often comes at the cost of smoothing out the depth estimate at object boundaries.
In this way, sharp boundaries are lost or smeared and object shapes are distorted. We propose to impose cross-entropy on our probabilistic representation, and show this gives both high performance and sharp boundaries.
3.3 Avoiding Depth Mixing by Convolution
In order to define depth mixing, we first define some necessary terminology. If the closest depth pixels around an interest point are separated by a given distance, we say there is a depth discontinuity across the interest point. The two surfaces meeting at the discontinuity define two modes: the foreground mode is the one closer to the sensor and the background mode the one further away. The modes may have intra-surface variations due to noise or surface roughness (foliage, potholes in the road, etc.), but in order to separate two modes/surfaces we require the intra-mode depth variation to be at most a threshold $t$. In this section, $t$ is 1 m for outdoor scenes and 0.1 m for indoor scenes. We define mixed-depth pixels as estimated depth pixels that lie between the foreground-mode and background-mode depths. Mixed-depth pixels occur in the empty space across mode boundaries; the phenomenon is best illustrated by Fig. 3.2, and its consequence is a smeared object shape.
Having explained depth coefficients and their representation in Sec. 2.3, as well as depth mixing, the question becomes how to incorporate DC into a deep neural network with an effective loss function that avoids depth mixing and smearing across object boundaries. Ideally, we would like no depth mixing across object boundaries anywhere in the 3D scene, but this is limited by the resolution of the channel gaps in DC. Let us suppose we uniformly set a spacing of $t_m$ between DC channels.
Figure 3.2: Illustration of depth mixing. (a) shows sparse measurements (red) in the color image, (b) shows the estimated depth map [10], and (c) shows the point cloud generated from the depth map in (b). The black crosses in (c) are the sparse measurements, and the red 3D box indicates the position of the car. Between the foreground mode (the car) and the background mode (the walls of the building), the floating 3D points are the mixed-depth pixels; we say the car is smeared along its boundary.
Our aim is then to minimize depth mixing across two object depths separated by at least $t_m$ from each other. This section explains the motivation for how DC can avoid depth mixing during convolutions, and discusses how, by using DC, we can apply cross-entropy as an effective loss function to resolve depth both within and across object surfaces. The section then gives an overview of the deep learning architecture.
3.3.1 Motivation
The fact that dense convolutions with DC can avoid depth mixing can be explained by a motivating example (see Fig. 3.3). Consider two planar depths in a depth image, separated by $5t_m$ from each other. We take a slice of the depth image and sparsely subsample the 1D depth signal. If we run a 1D convolutional FIR filter on this signal, it will mix the depths of the two planes across the depth discontinuity. Now consider the case where the sparse depth signal is converted to DC, with the DC channels spatially separated by $t_m$ in depth.
This time we run the 1D FIR filter on the DC along its bins; the missing weights/coefficients are now interpolated only within neighboring bins, and as a result there is no mixing between the weights of channels spatially separated by more than $t_m$. Similarly, the first step of a CNN is typically an image convolution with $N_{in}$ input channels. For sparse depth input, $N_{in} = 1$, so all convolutions apply equally to all depths, resulting in mixing right from the start. For DC input, depths are divided over $N_{in} = N$ input channels, resulting in two important capabilities. First, CNNs can learn to avoid mixing depths in different channels as needed. This is similar to voxel-based convolutions [80, 81], which avoid mixing spatially distant voxels. The effect is illustrated in Fig. 3.3(e-f), where a multi-channel input representation, (e), allows convolutions to avoid mixing widely spaced depths. Second, since convolutions apply to all channels simultaneously, depth dependencies, such as occlusion effects, can be modeled and learned by neural networks.
Figure 3.3: An example of depth mixing, and how DC avoids it. (a) A slice through a depth image showing a depth discontinuity between two objects. (b) An example sparse depth representation: each pixel either has a depth value or a zero. (c) The result of a 1D convolution, shown in (d), applied to the sparse depth. This estimates the missing pixel, but generates a mixed-depth pixel between the two objects. (e) A DC representation of the sparse depth. Each pixel with a depth measurement has three non-negative coefficients that sum to 1 (shown column-wise). (f) The result of applying the same filter (d) to the DC in (e). Missing depths are interpolated and, notably, there is no depth mixing between the objects.
3.3.2 Proposed Loss Function
Having established that dense convolutions can avoid depth smearing on DC channels, we explore loss functions for optimizing our neural network parameters. We found that defining the optimization loss over the GT DC, rather than the GT depth, is also vital for avoiding depth smearing across boundaries. In this section we discuss why some of the conventional loss functions used for depth estimation encourage depth smearing across boundaries, and then propose a cross-entropy loss to address this.
3.3.2.1 Depth Loss Functions with Ambiguity
One of the most popular loss functions in depth completion is the Mean Squared Error (MSE), in part because MSE gives the maximum likelihood solution to Eq. 1.1 when $p_{model}(d_i | x; \theta)$ is Gaussian. Consider the simple case where the data are Gaussian: the maximum likelihood optimizer then gives an optimal estimate close to the mean of the data $x$. Now consider the implications of using MSE when there are depth ambiguities. We define ambiguity as occurring in regions across a depth discontinuity where a depth pixel has some probability of belonging to either the foreground or the background mode, as defined in Sec. 1.3. Suppose the data $x$ give two equally likely modes, the foreground mode with depth $d^{(1)}$ and the background mode with depth $d^{(2)}$. Given this multi-modal density, the MSE loss for depth $d$ at pixel $i$ is:
$$\mathrm{MSE}(d_i) = \frac{1}{2}\left(\|d - d^{(1)}\|^2 + \|d - d^{(2)}\|^2\right), \quad (3.1)$$
which is minimized when
$$\hat{d}_i = \frac{1}{2}\left(d^{(1)} + d^{(2)}\right). \quad (3.2)$$
And so the estimated depth is the mean of the foreground and background depths.
An illustration of this is in Fig. 3.4(b). Such a solution is only good when the foreground and background depths come from within the same surface (e.g. a road surface or wall), where regression/interpolation is desirable, but it is undesirable when the depths come from different surfaces. The Mean Absolute Error (MAE) has a similar issue, though not as severe: as shown in Fig. 3.4, in the pairwise ambiguity case the MAE loss of mixed-depth pixels is equal to the loss at the actual values. Thus while the MAE loss does not prefer mixed-depth pixels as MSE does, mixed-depth solutions may nevertheless not be penalized sufficiently to avoid them.
Figure 3.4: (a) shows the MSE and MAE loss functions. These perform an expectation over the probability of the data. Now consider an ambiguous case where a pixel's depth has equal probability of being $d^{(1)}$ or $d^{(2)}$, shown as black squares in (b). The minimum-MSE estimate $\hat{d}$ is the mid-point, while MAE has equal loss for all points between the two depths. This illustrates why MSE prefers mixed-depth pixels, and MAE fails to penalize them.
3.3.2.2 Cross Entropy as Loss Measure
As shown in Sec. 3.3.2.1, minimizing MSE leads to depth mixing when there is depth ambiguity (note that ambiguity only exists at inference, when we estimate depth at each pixel given sparse depth measurements and, possibly, color). One way to avoid this is, rather than estimating depth directly, to estimate a more general probabilistic representation of depth. DC can provide such a probabilistic depth model, for both $p_{data}$ and $p_{model}$ in Eq. 1.1. Minimizing the cross entropy of the predicted output $\tilde{c}$ with respect to the ground-truth DC $c$, which represents $p_{data}(d_i | x_i)$, is equivalent to minimizing the KL divergence between them. In this way, we learn to estimate $p_{model}(d_i | x_i; \theta)$ parameterized by DC, rather than learning to estimate $d_i$ directly. Our cross-entropy loss for pixel $i$ is defined as:
$$L^{ce}_i(c_{ij}) = -\sum_{j=1}^{N} c_{ij} \log \tilde{c}_{ij}, \quad (3.3)$$
where the $c_{ij}$ terms are the DC elements of the ground truth obtained using Eq. 2.2. Training a network to predict $\tilde{c}_{ij}$ that minimizes $L^{ce}_i$ is equivalent to maximizing Eq. 1.1.
Figure 3.5: An illustration of $P_{data}$ modeled as the sum of the DC of the two points from Fig. 3.4. The estimated $\hat{c}_{ij}$ with minimum cross-entropy loss, Eq. 3.3, will exactly match $P_{data}$, providing a multi-modal density. A pixel depth estimate using Eq. 3.4 will find the depth of one of the peaks, and not a mixed-depth value.
The use of the cross-entropy loss has two main advantages. First, depth ambiguities no longer result in a preference for mixed-depth pixels: as illustrated in Fig. 3.5, DC models multi-modal densities, and as we show in the next section, our depth estimate will find the location of the maximum peak at one of the depths. Second, optimizing cross entropy leads to much faster convergence than MSE, which suffers from gradients going to zero near the solution.
3.4 Learning by Deep Neural Network
Now that we have established the need to inject sparse DC into, and estimate dense DC out of, a deep learning model, we consider what network architecture can incorporate DC naturally. We also need a way to recover dense depth from the estimated DC. This section describes in detail our deep learning framework for estimating depth from DC.
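As a concrete reference for the loss in Eq. (3.3) introduced above, the following is a minimal PyTorch-style sketch of the cross-entropy between ground-truth DC and predicted DC; the tensor layout, the log-softmax over the channel dimension, and the masking of pixels without ground truth are implementation assumptions for illustration rather than the exact code used in the thesis.

```python
import torch
import torch.nn.functional as F

def dc_cross_entropy(pred_logits, gt_dc):
    """Cross-entropy loss of Eq. (3.3) between ground-truth and predicted DC.

    pred_logits: (B, N, H, W) raw network outputs over N depth channels.
    gt_dc:       (B, N, H, W) ground-truth Depth Coefficients; they sum to 1
                 where ground truth exists and are all zero where it is missing.
    """
    log_pred = F.log_softmax(pred_logits, dim=1)     # log of predicted density c~_ij
    per_pixel = -(gt_dc * log_pred).sum(dim=1)       # -sum_j c_ij log c~_ij
    valid = gt_dc.sum(dim=1) > 0                     # ignore pixels without GT depth
    return per_pixel[valid].mean()

# Toy usage: one pixel whose GT DC is the (0.125, 0.5, 0.375) pattern of Fig. 2.3.
logits = torch.randn(1, 80, 8, 8, requires_grad=True)
target = torch.zeros(1, 80, 8, 8)
target[0, 29:32, 4, 4] = torch.tensor([0.125, 0.5, 0.375])
loss = dc_cross_entropy(logits, target)
loss.backward()
```

A predicted DC is recovered from the logits with a softmax, so the estimated coefficients are non-negative and sum to 1 as required by the representation.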
The first part gives an overview of the model, the second part explains the neural network architecture, and the third part explains how we recover depth from the estimated DC. The three major blocks in our deep learning framework, shown in Fig. 3.6, are (a) a sparse-depth-to-DC block, (b) the neural network block, and (c) a dense-DC-to-depth block.
Figure 3.6: An overview of our method. Sparse depth is converted into Depth Coefficients with multiple channels, each channel holding information about a certain depth range. This, along with color, is input to the neural network. The output is a multi-channel dense depth density that is optimized using cross entropy with the ground-truth DC. The final depth is reconstructed from the predicted density.
The sparse depth measurements are first converted to sparse DC; this operation is non-differentiable. The deep network estimates dense DC, which we optimize against the GT DC, converted from ground-truth depth, using the cross-entropy loss. Finally, we convert the dense DC estimate into a dense depth estimate.
3.4.1 Neural Network Architecture
Figure 3.7: Our CNN architecture, modified from [3], with 80-channel DC at the input and an 80-channel cross-entropy loss at the output. The residual blocks are defined in the dashed red region, and the top branch consists of ResNet-34 [11].
We selected a standard network for depth completion [3] and modified its input and output. Single-channel depth is first converted to 80-channel DC. At the input, the DC and the color image are passed through separate initial convolutions, producing 80 and 48 feature channels respectively, which are then concatenated for further propagation through the network. At the output, 80 channels are predicted (rather than a single channel) using a 1 × 1 convolution. It is straightforward to convert a depth network into a DC network using this strategy; the downside is that feeding in more channels (80 in our case) creates a larger memory requirement. The estimated DC output of the network is trained using the cross-entropy loss on a DC representation of the semi-dense ground-truth depth.
3.4.2 Depth Reconstruction
Since we optimize the neural network parameters against the GT DC, the estimated DC may infer multi-modal density functions at ambiguous pixels; the case is best illustrated in Fig. 3.5. The question is then how to recover depth from the estimated DC. There are a number of options. We could use Eq. 2.1, substituting $\hat{c}_{ij}$ for $c_{ij}$ at pixel $i$. However, the predicted coefficients may be multi-modal as in Fig. 3.5, and it may be preferable to estimate the maximum likelihood solution. We now show that representing depth with depth coefficients is guaranteed to avoid depth mixing if we select only the main bin (the peak of the probability vector) and its two closest bins (the near and far bins) to construct the depth. This step is key to cutting off interactions with far-away bins and avoiding depth mixing.
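The following is a minimal NumPy sketch of this peak-plus-two-neighbors reconstruction, which Eq. (3.4) below formalizes; the array layout, the handling of border bins and the function name are illustrative assumptions.

```python
import numpy as np

def dc_peak_depth(dc_pred, bin_depths):
    """Reconstruct depth from estimated DC using only the peak bin and its two
    neighbors, cutting off contributions from distant (possibly multi-modal) bins.

    dc_pred:    (H, W, N) estimated coefficients (e.g. per-pixel softmax output).
    bin_depths: (N,) channel depths D_1 ... D_N.
    """
    H, W, N = dc_pred.shape
    k = np.clip(dc_pred.argmax(axis=-1), 1, N - 2)          # peak bin per pixel
    rows, cols = np.indices((H, W))
    nbrs = np.stack([k - 1, k, k + 1], axis=-1)              # (H, W, 3) bin indices
    c = dc_pred[rows[..., None], cols[..., None], nbrs]      # the three coefficients
    d = bin_depths[nbrs]                                     # their bin depths
    return (c * d).sum(-1) / (c.sum(-1) + 1e-12)             # normalized weighted mean
```

For an ideal three-coefficient DC this reduces exactly to Eq. (2.1); for a multi-modal estimate it snaps the depth to the dominant peak instead of averaging across peaks.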
We found that near object boundaries, the estimated DC vector of a pixel typically has its weights spread over a large number of bins (large uncertainty along boundaries). To avoid depth mixing, we therefore detect the peak bin and use only it and its neighboring bins to reconstruct the depth. We estimate the depth from the maximum coefficient $\hat{c}_{ik} \in \hat{c}_i$ and its two neighbors. This gives us:
$$\hat{d}_i = \frac{\hat{c}_{i(k-1)} D_{k-1} + \hat{c}_{ik} D_k + \hat{c}_{i(k+1)} D_{k+1}}{\hat{c}_{i(k-1)} + \hat{c}_{ik} + \hat{c}_{i(k+1)}}. \quad (3.4)$$
3.5 Experiments and Results
3.5.1 Experimental Protocols
We evaluate the DC representation on two publicly available datasets, KITTI (outdoor scenes) and NYU2 (indoor scenes), to demonstrate the performance of our algorithm. We use the KITTI depth completion dataset [34] for both training and testing. The dataset is created by aggregating Lidar scans from 11 consecutive frames into one, producing a semi-dense ground truth with roughly 30% annotated pixels. It consists of 85,898 training samples, 1,000 selected validation samples, and 1,000 test samples without ground truth. We truncate the top 90 rows of each image during training since they contain no Lidar measurements. The NYU-Depth v2 dataset consists of RGB and depth images collected from 464 different scenes. We use the official split of the data, where 249 scenes are used for training, and we sample 50K images from the training set, similar to [47]. For testing, the standard labelled set of 654 images is used. The original images are first downsampled to half size and then center-cropped, producing a network input of spatial dimension 304 × 208. For comparison, we choose the state of the art in both outdoor [3] and indoor scenes [47, 51] using RGBD depth sensors.
Figure 3.8: Depth completion with 16-row Lidar. (a) scene; (b, e) show Ma et al. [3] with significant mixed pixels; (c, f) show our 3-coefficient estimation, demonstrating very little depth mixing; (d, g) show our estimation with all coefficients.
Figure 3.9: Another depth completion example with 16-row Lidar, where all subfigures are defined the same as Fig. 3.8. Interestingly, a higher RMSE is reported for the 3-coefficient estimation than for the all-coefficient estimation.
3.5.1.0.1 Sub-Sampling Another application of depth completion is to improve object detection. While it might seem intuitive that estimated dense depth at higher resolution should give better vehicle detection, often this is not the case, and we are not aware of past literature reporting this. Likely, mixed-depth pixels have a large negative impact on object detection. Indeed, Tab. 3.4 shows worse car detection on Ma's output than on the raw 16-row sparse data. However, our method is able to outperform sparse depth, an important step towards improving Lidar-based object detection.
Table 3.1: Quantitative results on NYU2 (uniform 500 samples + RGB); units in m.
Method           RMSE   MAE    REL    tMAE   tRMSE  δ1    δ2    δ3    δ4    δ5
Ma [47]          0.236  0.13   0.046  0.068  0.075  52.3  82.3  92.6  97.1  99.4
Bilateral [82]   0.479  -      0.084  -      -      29.9  58.0  77.3  92.4  97.6
SPN [83]         0.172  -      0.031  -      -      61.1  84.9  93.5  98.3  99.7
Unet [51]        0.137  0.051  0.020  -      -      78.1  91.6  96.2  98.9  99.8
CSPN [51]        0.162  -      0.028  -      -      64.6  87.7  94.9  98.6  99.7
CSPN+UNet [51]   0.117  -      0.016  -      -      83.2  93.4  97.1  99.2  99.9
Ours-all         0.118  0.038  0.013  0.042  0.053  86.3  95.0  97.8  99.4  99.9
Ours-3coeff      0.131  0.038  0.013  0.040  0.054  86.8  95.4  97.9  99.3  99.8
Table 3.2: Performance evaluation at different levels of Lidar sparsity (KITTI dataset). 64R, 32R and 16R refer to 64-row, 32-row and 16-row respectively. Units in cm.
Sparsity     MAE   RMSE   tMAE  tRMSE
64R-3coeff   24.1  121.2  20.3  34.4
64R-all      25.2  106.1  23.9  37.4
32R-3coeff   31.0  132.2  24.4  39.5
32R-all      31.1  115.8  27.6  42.2
16R-3coeff   37.8  160.6  33.4  47.2
16R-all      38.6  142.3  36.1  50.5
Table 3.3: A comparison of whether DC at the input or DC with cross entropy (CE) at the output has the dominant effect. Individually their effect is small, but together they have a large impact (NYU2 dataset). Units in cm.
Input  Loss  MAE   RMSE   tMAE  tRMSE
SP     MSE   6.63  15.28  5.96  6.97
DC     MSE   6.10  15.32  5.72  6.73
SP     CE    9.53  17.81  6.75  7.56
DC     CE    3.82  11.85  4.24  5.37
Table 3.4: Average precision (%) for 3D detection and pose estimation of cars on KITTI [1] using Frustum PointNet [2]. The baseline, Raw 16R, uses 16 rows from the Lidar, while Ma's method [3] and our method start by densely upsampling these 16-row data. In each case, the method is trained on 3,712 frames and evaluated on 3,769 frames of the KITTI 3D object detection benchmark [1] using an intersection-over-union (IoU) threshold of 0.7. Only our method improves on the baseline, and the improvement is most significant for 3D bounding boxes.
                 3D Bounding Box        Bird's Eye View Box
Upsample:        Easy   Med.   Hard     Easy   Med.   Hard
Raw 16R          54.4   36.2   31.3     73.6   58.1   50.4
Ma [3]           36.7   23.0   18.5     56.2   33.8   29.7
DC-3coeff        64.9   41.9   34.7     78.1   54.0   45.6
3.6 Conclusion
In this chapter, we introduced depth coefficients (DC) in a neural network model. At the input, DC represents depth without loss of accuracy (unlike plain binning) while separating pixels by depth, so that it is simple for convolutions to avoid depth mixing. At the output, instead of directly predicting depth, we predict a depth density using cross entropy on the Depth Coefficients. This is a richer representation that avoids depth mixing and can enable deeper levels of fusion and object detection. Moreover, using the same strategy (a depth-to-DC and a DC-to-depth converter), it is possible to incorporate DC into any neural network architecture. Indeed, we show that, unlike other upsampling methods, our dense depth estimates can improve object detection compared to sparse depth.
Chapter 4 Depth Completion Using TWIN Surface Extrapolation at Occlusion Boundaries
4.1 Introduction
In the previous chapter, we showed how a multi-channel depth representation can be used to model depth ambiguity at object boundaries, or step-like discontinuities. It is important to maintain depth discontinuities to facilitate object shape and pose estimation. However, accommodating many channels at high resolution comes with high computational and memory demands. Instead of using multiple channels with binning, our method, named TWIn-Surface Estimation (TWISE), uses a two-surface representation which is much more efficient and can explicitly model ambiguity through the difference between the twin surface depths. We believe that explicitly encoding foreground and background pixels at boundaries enables effective learning of the step-wise discontinuity with lower memory and computational requirements.
In order to train a twin-surface estimator, we propose a pair of asymmetric loss functions that naturally bias estimates toward the foreground and background depth surfaces. The asymmetry in the losses is key to separating foreground and background depths at ambiguous pixels. We also incorporate a fusion channel that automatically combines the foreground and background depths into a final depth estimate for each pixel, by selecting either the foreground or the background depth in ambiguous regions and mixing the two depths in non-ambiguous regions.
Of particular concern is the lack of dense and reliable ground-truth depth data in outdoor scenes, which is needed for accurate evaluation of depth estimates. KITTI, a realistic outdoor dataset, offers semi-dense ground truth created by accumulating LiDAR points, but it suffers from noisy depth samples (outliers) at boundaries and on dynamic objects [49]. Indoor datasets like NYU2 provide dense GT only via colorization techniques that can cause smoothing at object boundaries. Currently, the preferred evaluation metric for ranking depth completion methods is RMSE. In this chapter, we study the effects of outlier noise in ground-truth data on RMSE and note that MAE is a more consistent metric for both noisy and clean ground truth, as validated on the synthetic VKITTI dataset.
In summary, this chapter focuses on a twin-surface representation that can estimate foreground, background and fused depth. We design a pair of asymmetric loss functions that explicitly predict foreground and background object surfaces and can be used in any neural network learning paradigm. We further show that in the presence of outliers in the ground truth, MAE is a more consistent metric for ranking methods than RMSE, and we validate this claim with extensive experiments on VKITTI, a synthetic dataset for urban driving scenarios. Finally, we show the effectiveness of our method by comparing with SoTA methods in both challenging outdoor and indoor scenarios.
4.2 Related Works
4.2.1 Depth Completion
Deep neural networks (DNNs) have been applied to the depth completion problem in works such as Sparse-to-Dense [10], DDP [84], and Spade RGBsD [36]. These works show that by using standard encoder-decoder architectures (ResNet and MobileNet), it is possible to improve depth estimation accuracy via regression losses such as L2, L1 and inverse-L1.
Figure 4.1: Our depth completion algorithm can input LiDAR data and an image (a), and extrapolate estimates of the foreground depth d1 (b) and background depth d2 (c), along with a weight σ (e). Fusing all three leads to the completed depth (d). The foreground-background depth difference d2 − d1 (f) is small except at depth discontinuities.
DeepLidar [85] estimates surface normals and dense depth using multiple DNNs to assist in further fine-tuning the dense depth. Both [85] and [84] rely on synthetic data and various labels for learning depth representations. Recent works have also optimized depth using 3D geometric constraints such as depth-normal consistency [86, 87]: Xu et al. enforce geometric consistency between the surface normal and depth in 3D, but use another refinement network for improved depth estimation [87]. Another recent trend is to learn spatial propagation of pixels in 2D depth space for depth completion, with fixed [88] or variable receptive fields [13, 89].
Although the results are highly encouraging, these methods suffer from slow inference and limited generalizability across variable sparsity. Researchers have also looked into learning 3D features for depth completion using continuous convolutions in 3D space [90], point cloud completion [91], and 3D graph neural networks [92] for dynamic construction of local neighborhood regions.
4.2.2 Depth Representations
Depth maps, as 2.5D representations, have been used for RGBD fusion and instance segmentation [63, 64]. They naturally encode sensor viewing rays and adjacency between points. They are compact representations, and their regular grids can be processed with CNNs in a manner analogous to image super-resolution [93, 94]. This is the representation of choice for colorization techniques and fusion [4] as well as depth completion.
We propose a 2-layered representation of depth to model occlusion boundaries. The concept of a layered representation of depth is well known in the graphics community. Layered Depth Images (LDIs) were first proposed by Shade et al. [95] as an intermediate representation for efficient image-based rendering; they are gathered by accumulating depth values via z-buffering from multiple depth images of nearby viewpoints. Tulsiani et al. [96] infer a 2-layered depth representation (recovering the depth of the visible and non-visible scene) from a single input image by learning view synthesis with multiview camera-guided supervision. Hedman et al. [97] propose a 3D photo reconstruction algorithm that builds a multi-layered geometric representation of the scene by warping several depth maps and stitching color and depth panoramas for the front and back scene surfaces. In all these cases, the multi-layered representation is constructed or learned from multiple camera viewpoints or depth maps of the scene. In our case, we estimate the 2-layered representation from a single camera viewpoint with our proposed loss functions.
4.2.3 Loss Functions in Depth Completion
A key component of depth completion is the choice of loss function. Recent work has explored loss functions including L2 [48, 10], L1 [47], inverse-L1 [36], the Huber loss [90] and a softmax loss on depth [50]. Another elegant option is a combination of L1 + L2 [13], which can leverage the benefits of both losses. While these loss functions can achieve low error on metrics including RMSE, MAE and iMAE, this often comes at the cost of smoothing depth estimates across object boundaries. In addition to the aforementioned losses, methods increasingly use a Chamfer distance on point clouds [91], a depth-normal constraint [87], or a cosine loss [85] in a multi-task learning framework to improve depth completion accuracy. Nevertheless, smoothing across sharp boundaries remains a concern in many of these methods. Imran et al. [12] show that a cross-entropy (CE) loss can generate sharp boundaries, although it performs worse on the RMSE metric. We learn foreground and background depths by proposing two asymmetric loss functions, and the final depth using a fusion loss. Asymmetric loss functions were used by Vogel et al. [98] for the different purpose of denoising input images. We propose to use asymmetric loss functions to learn biased estimators of the FG/BG surfaces, and to learn to select or blend between them via the fusion loss, which, we claim, helps recover depth discontinuities.
4.3 Ambiguities and Expected Loss
Depth completion involves two quite different challenges which can be at odds.
The first is to interpolate missing pixel depths within objects by leveraging nearby sparse depths. The second is to accurately find the occlusion boundaries of objects and ensure that interpolated pixels belong to either the foreground or the background object. We propose a method that aims to perform both tasks well.
Figure 4.2: Depth smearing across boundaries. We show the ground-truth depth (colored red) overlaid on an image (a), depths estimated by the SoTA method [6] (b), our fused depths (c), our estimated weights σ (d), a depth slice of [6] (e), fused depth and σ slices (f), and foreground and background slices (g). Our extrapolation ability in (g) results in the sharp depth boundary in (f), rather than the smeared depth in (e).
Our approach divides depth completion into two simpler problems, each of which can be more easily learned by a network. The first problem is depth interpolation without boundary determination. Rather than estimating a single surface which must model step functions at depth discontinuities, our key novelty is to estimate twin surfaces: a foreground surface extrapolates the foreground object depth up to and beyond boundaries, while a background surface extrapolates the background depth up to and behind the occluding object. The second problem is then to find the boundary and determine a single depth by fusing these two surfaces. We find the color image particularly useful in aiding surface fusion. Both of these components are illustrated in Fig. 4.2.
4.4 Methodology
As outlined above, we divide depth completion into twin-surface estimation followed by surface fusion, and we now describe each component in turn.
4.4.1 Ambiguities and Expected Loss
Ambiguities have a significant impact on depth completion, and it is useful to have a quantitative way to assess that impact. Here we propose using the expected loss to predict and explain the impact of ambiguities on trained networks. By an ambiguity we mean not that there isn't a unique true solution, but rather that from a measurement it is difficult for the algorithm and/or a human to decide between two or more distinct solutions. Ambiguity can be defined more formally as follows.
Given measurement data that sparsely samples the scene, the number of ambiguities is equal to the number of different true scenes, i.e. true depth maps in our case, that could have generated the sparse measurement. This number depends on what variations occur in actual data. For simplicity we treat each pixel ambiguity independently of other pixels, so the ambiguities for a pixel are the possible depth values it could take that are consistent with the measurement.
We anticipate the level of ambiguity to vary across a scene. For example, pixels on flat surfaces will be well constrained by nearby pixels and have low ambiguity. In contrast, pixels near depth discontinuities may have large depth ambiguity: there is often insufficient data in the depth image to decide whether a pixel is on the foreground or the background. A corresponding color image can help resolve ambiguities as to which object a pixel belongs. However, exactly how to leverage color images to resolve ambiguities in CNNs is one of the open challenges in depth completion. Our work aims to offer a solution to this problem by explicitly estimating ambiguities and resolving them within the network.
To assess the impact of ambiguities on our network, we build a quantitative model. Consider a single pixel whose depth, $d$, we seek to estimate. Next assume that the pixel has a set of ambiguities, $d_i$, each with probability $p_i$. This probability measures how likely it is that the ground truth takes the corresponding depth, given our modeled scene assumptions. Now consider a loss function on the error for each pixel, $L(d - d_t)$, where $d_t$ is the ground-truth depth. The expected loss as a function of depth is:
$$E\{L(d)\} = \sum_i p_i L(d - d_i). \quad (4.1)$$
This expected loss is important because a network trained on representative data will be trained to minimize it. Thus by examining the expected loss we can predict the behavior of our network at ambiguities, and so justify the design of our method.
Figure 4.3: (a) The ALE from Eq. (4.2) is asymmetric around its minimum at the origin. (b) The RALE from Eq. (4.3) is a reflection of the ALE. We use the ALE for foreground surface estimation and the RALE for background estimation. (c) A pixel depth is shown with two ambiguities at depths d1 and d2 with probabilities p1 and p2 respectively. The black line shows the expected ALE, which is the probability-weighted sum of the two ALE functions, see Eq. (4.1). The expected ALE has its minimum at one of the marked corners at d1 or d2; the minimum is at d1 if Eq. (4.4) is satisfied, as it is in this case with p1 = p2, so the ALE acts as a foreground depth estimator.
4.4.2 Asymmetric Linear Error
Our method uses a pair of error functions which we call the Asymmetric Linear Error (ALE) and its twin, the Reflected Asymmetric Linear Error (RALE), defined as:
$$\mathrm{ALE}_\gamma(\varepsilon) = \max\left(-\frac{1}{\gamma}\varepsilon, \; \gamma\varepsilon\right), \quad (4.2)$$
$$\mathrm{RALE}_\gamma(\varepsilon) = \max\left(\frac{1}{\gamma}\varepsilon, \; -\gamma\varepsilon\right). \quad (4.3)$$
Here $\varepsilon$ is the difference between the measurement and the ground truth, $\gamma$ is a parameter, and $\max(a, b)$ returns the larger of $a$ and $b$. The ALE and RALE are generalizations of the absolute error and are identical to it when $\gamma = 1$. The difference is that the negative side of the ALE is weighted by $\frac{1}{\gamma}$ and the positive side by $\gamma$.
The RALE is simply the reflection of the ALE about the $\varepsilon = 0$ line. Both are illustrated in Fig. 4.3(a,b). Note that if $\gamma$ is replaced by $\frac{1}{\gamma}$, the ALE and RALE are reflected into each other. Thus, without loss of generality, in this work we restrict $\gamma \geq 1$.
4.4.3 Foreground and Background Estimators
We make the further simplifying assumption in our analysis that there is at most a binary ambiguity per pixel. A binary ambiguity is described by a pixel having probabilities $p_1$ and $p_2$ of depths $d_1$ and $d_2$ respectively. When $d_1 < d_2$ we call $d_1$ the foreground depth and $d_2$ the background depth. Such a binary ambiguity is likely to occur near object-boundary depth discontinuities.
To estimate the foreground depth we propose minimizing the mean ALE over all pixels to obtain $\hat{d}_1$, the estimated foreground surface. To predict the characteristics of $\hat{d}_1$ from a trained network at ambiguous pixels, we examine the expected ALE, as shown in Fig. 4.3(c). This is piecewise linear and has two corners, one at $d_1$ and the other at $d_2$. The lower of these determines the minimum expected loss, and hence what an ideal network will predict. Using Eqs. (4.2) and (4.1), we obtain the expected losses $L(d_1) = p_2(d_2 - d_1)/\gamma$ and $L(d_2) = p_1(d_2 - d_1)\gamma$. From this it is straightforward to see that $L(d_1) < L(d_2)$ when:
$$\gamma > \sqrt{\frac{p_2}{p_1}}. \quad (4.4)$$
This equation shows the sensitivity of the foreground estimator to $\gamma$: the higher $\gamma$, the lower the foreground probability $p_1$ needed for the minimum to be at the foreground depth $d_1$. To estimate the background depth, $\hat{d}_2$, at boundaries we propose minimizing the expected RALE. The same analysis applies as for the ALE, and we obtain the same constraint on $\gamma$ as in Eq. (4.4), except that the probability ratio is inverted. Fig. 4.1(b) shows an example foreground depth estimate, (c) the background depth and (f) the depth difference. We observe that at pixels far from depth discontinuities, as well as at the sparse input-depth pixels, the foreground depth is very close to the background depth, indicating no ambiguity.
4.4.4 Fused Depth Estimator
We desire a fused depth predictor that can perform both interpolation and extrapolation on surfaces, depending on whether the region is ambiguous or not. The foreground and background depth estimates provide lower and upper bounds on the depth of each pixel. We express the final fused depth estimator $\hat{d}_t$ for the true depth $d_t$ as a weighted combination of the two depths:
$$\hat{d}_t = \sigma \hat{d}_1 + (1 - \sigma)\hat{d}_2, \quad (4.5)$$
where $\sigma$ is an estimated value between 0 and 1. We use a mean absolute error as part of the fusion loss:
$$F(\sigma) = |\hat{d}_t - d_t| = |\sigma \hat{d}_1 + (1 - \sigma)\hat{d}_2 - d_t|. \quad (4.6)$$
The expected loss for this is
$$L_e(\sigma) = E\{F(\sigma)\} = p\,|\sigma \hat{d}_1 + (1 - \sigma)\hat{d}_2 - d_1| + (1 - p)\,|\sigma \hat{d}_1 + (1 - \sigma)\hat{d}_2 - d_2|. \quad (4.7)$$
Here $p = p_1$ and $p_2 = 1 - p$. This has a minimum at $\sigma = 1$ when $p > 0.5$ and a minimum at $\sigma = 0$ when $p < 0.5$; of course, this assumes that the depth is either $d_1$ or $d_2$. Depth fusion occurs by optimizing the loss of Eq. (4.7) to predict a separate $\sigma$ for each pixel. In this way our fusion step is an explicit determination of whether a pixel is foreground, background, or a combination. An example estimated $\sigma$ is shown in Fig. 4.1(e).
4.4.5 Depth Surface Representation
We have developed three separate loss functions whose individual optimizations give us three separate components of a final depth estimate for each pixel. Based on the characterization of our losses, we require the network to produce a 3-channel output.
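To make the behavior of these three losses concrete, the following is a minimal NumPy sketch of the ALE and RALE of Eqs. (4.2)-(4.3) and the fusion error of Eq. (4.6), together with a numerical check of the binary-ambiguity analysis above; the particular γ, depths and probabilities are illustrative values, not the settings used in the thesis experiments.

```python
import numpy as np

def ale(err, gamma=2.0):
    """Asymmetric Linear Error, Eq. (4.2): underestimates are penalized lightly."""
    return np.maximum(-err / gamma, gamma * err)

def rale(err, gamma=2.0):
    """Reflected ALE, Eq. (4.3): overestimates are penalized lightly."""
    return np.maximum(err / gamma, -gamma * err)

def fusion_error(sigma, d1_hat, d2_hat, d_true):
    """Fusion error of Eq. (4.6) on the blended estimate of Eq. (4.5)."""
    return np.abs(sigma * d1_hat + (1.0 - sigma) * d2_hat - d_true)

# Binary ambiguity: depth is 10 m (foreground) or 13 m (background), p1 = p2 = 0.5.
# With gamma = 2 > sqrt(p2 / p1), Eq. (4.4) predicts the expected ALE is minimized
# at the foreground depth, and by symmetry the expected RALE at the background depth.
d1, d2, p1 = 10.0, 13.0, 0.5
d = np.linspace(8.0, 15.0, 701)
exp_ale  = p1 * ale(d - d1)  + (1 - p1) * ale(d - d2)    # expected loss, Eq. (4.1)
exp_rale = p1 * rale(d - d1) + (1 - p1) * rale(d - d2)
print(d[exp_ale.argmin()], d[exp_rale.argmin()])          # ~10.0 (FG) and ~13.0 (BG)
```

Minimizing the mean ALE and RALE over all pixels therefore biases the first two channels toward the foreground and background surfaces, while the σ channel trained with the fusion error selects between them.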
Then for simplicity we combine all loss functions into a single loss:

    L(c_1, c_2, c_3) = (1/N) Σ_j [ ALE_γ(c_1j) + RALE_γ(c_2j) + L_e(s(c_3j)) ].   (4.8)

Here c_ij refers to pixel j of channel i, s() is a Sigmoid function, and the mean is taken over all N pixels. We interpret the output of these three channels for a trained network as c_1 → d̂_1, c_2 → d̂_2 and s(c_3) → σ, and combine them as in Eq. (4.5) to obtain a depth estimator d̂_t for each pixel.

4.4.6 Implementation Details

4.4.6.1 Architecture

Figure 4.4: Incorporating the 3-channel output into the hourglass network used in [6]. SD^n and FD^n are the sparse inputs and the fused depth obtained from FG^n, BG^n, and σ^n at multi-resolution scale n respectively.

This work presents novel loss functions linked to a multi-channel depth representation. These can be easily incorporated into a variety of network architectures with minimal change to the network. Specifically we selected the multistack network [6], with the author-provided code. The only modifications we made are at the last layer of the network, where we use three channels representing d_1 (foreground estimate), d_2 (background estimate), and σ (see Fig. 4.4). We repeat this strategy in the hourglass networks at all three multi-resolution levels. Please see [6] for more details of the network. We chose this network due to its fast inference time, lower number of parameters than [10], and its near-SoTA performance. The changes we made were to output three channels instead of one at each stacked hourglass network, and to use our loss function for the optimization. We used 64 channels in the encoder-decoder network as that provided their highest performing results. More details are shared in the supplementary material.

4.4.6.2 Training and Inference

We followed the training protocol in [6] with multi-scale supervision on our 3 channels. The total loss is a weighted sum of the multiple resolution losses L_i, where L_1 is the full-resolution 3-channel loss in Eq. (4.8), L_2 is half-resolution and L_3 quarter-resolution: L = ω_1 L_1 + ω_2 L_2 + ω_3 L_3. The multiscale stage training protocol sets ω_1 = ω_2 = ω_3 = 1 during the first 10 epochs, reduces ω_2 = ω_3 to 0.1 and continues to train for another 10 epochs, and for the last 10 epochs sets ω_2 = ω_3 = 0, completing training after 30 epochs. Using the Adam optimizer with an initial learning rate of 10e−3, halved every 5 epochs, we train on full-sized images with gradients accumulated every 4 samples in a batch. We use PyTorch [99] for our implementation.
Method              MAE            RMSE           iMAE       iRMSE      TMAE [12]  TRMSE [12]  Infer. time (sec.)
Ma et al. [10]      249.95/269.2   814.73/878.5   1.21/1.34  2.80/3.25  –/190.15   –/297.48    0.081
Depth-Normal [87]   235.17/236.67  777.05/811.07  1.79/1.11  2.42/2.45  –/–        –/–         –
DeepLidar [85]      226.50/215.38  758.40/687.0   1.15/1.10  2.56/2.51  –/162.75   –/266.79    0.097
3DepthNet [91]      226.2/208.96   798.40/693.23  1.02/0.98  2.36/2.37  –/–        –/–         –
Uber-FuseNet [90]   221.19/217.0   752.88/785.0   1.14/1.08  2.34/2.36  –/–        –/–         –
MultiStack [6]      220.41/223.40  762.20/798.80  0.98/1.0   2.30/2.57  –/157.90   –/270.15    0.018
DC-3co [12]         215.75/215.04  965.87/1011.3  0.98/0.94  2.43/2.50  –/141.67   –/238.5     0.112
CSPN++ [88]         209.28/–       743.69/–       0.90/–     2.07/–     –/–        –/–         0.200
DDP [84]            205.40/–       836.00/–       0.86/–     2.12/–     –/–        –/–         –
NLSPN [13]          199.59/198.64  741.68/771.8   0.84/0.83  1.99/2.03  –/138.81   –/248.88    0.225
TWISE               195.58/193.40  840.20/879.40  0.82/0.81  2.08/2.19  –/131.60   –/239.80    0.022

Table 4.1: Depth completion on the Test/Validation sets of KITTI, with 64R LiDAR and RGB input (units in mm).

Figure 4.5: Comparison of our method with SoTA methods in whole and zoomed-in views: (a) color images, (b) DC [12], (c) MultiStack [6], (d) NLSPN [13], and (e) our method. Four different regions of the image from two different instants are selected to show depth quality in diverse areas.

4.5 Experimental Results

4.5.1 Dataset

We evaluate the proposed algorithm on the standard KITTI Depth Completion dataset [1], a real-world outdoor dataset; NYU2, with indoor scenes [4]; and Virtual KITTI [100], a synthetic dataset with photo-realistic images and dense ground-truth depth. KITTI depth is created by aggregating LiDAR scans from 11 consecutive frames into one, producing a semi-dense ground truth (GT) with 30% annotated depth pixels. The sparsity of the GT makes depth estimation more challenging. Note that we do not require any synthetic depth data for pre-training, as used by [84, 85] to improve performance. The dataset consists of 85K, 1K, and 1K samples for training, validation, and testing respectively. Although the training set has different image sizes, the test and validation sets are cropped to a uniform size of 352 × 1216.

Although created in a real-world scenario, the semi-dense GT produced by Uhrig et al. [49] has far fewer depth points on object boundaries (see Fig. 4.2 (a)), and is susceptible to outliers. As we claim our method works well on boundaries, we also evaluate on VKITTI 2.0, a synthetic dataset with clean and dense GT depth at depth discontinuities. VKITTI 2.0, created with the Unity game engine, contains 5 different camera locations (15° left, 15° right, 30° left, 30° right, clone) in addition to 5 different driving sequences. Additionally, there are stereo image pairs for each camera location. For training and testing, we only use the clone (forward-facing camera) with stereo image pairs. For VKITTI training, 2k training images were created from driving sequences 01, 02, 06, and 018 respectively. For testing, we use sequence 020 at the left stereo camera and choose every other frame, for a total of 420 images. We subsample the dense GT depth in azimuth-elevation space to simulate a LiDAR-like pattern as sparse input. Further, we create the pseudo GT following [49] to study the effects of outlier noise on training and evaluation. More details are shared in the supplementary.

To show the generalizability of our method, we also evaluate on the NYU-Depth v2 dataset [4], which consists of RGB and depth images obtained from a Kinect in 464 scenes.
We use the official split of the data, where 249 scenes are used for training, and we sample 50K images out of the training scenes, similar to [85, 13]. For testing, the standard labelled set of 654 images is used. The original image is first downsampled to half size and then center-cropped, producing a network input dimension of 304 × 208. Unlike [13], we use the same loss function for all the datasets.

4.5.2 Metrics

The standard metrics used by KITTI include RMSE, MAE, iMAE and iRMSE. Since RMSE is used as the preferred metric for depth completion, most SoTA methods on the KITTI leaderboard use MSE as their primary loss. We also include the tMAE and tRMSE metrics proposed in [12], since they discount outlier depth pixels (i.e., floating depth pixels around boundary regions) and give a better evaluation of depth pixels at and within object boundaries.

4.5.3 Results

4.5.3.1 Quantitative Results

Tab. 4.1 compares the performance on KITTI's test/validation sets, with a 64-row LiDAR and color image as input. We list the SoTA methods with performance quoted from their papers. The inference times are calculated on a single GTX 1080 Ti GPU. The method [13] with the lowest RMSE achieves this at the expense of inference time. We outperform the SoTA methods in the other metrics, including MAE and iMAE. The exception is RMSE, by which methods are ranked on the KITTI leaderboard. This leads us to investigate in which areas our method performs better and worse, which we examine next.

4.5.3.2 Qualitative Results

Fig. 4.5 shows our depth estimation quality compared to baselines. We choose the three best SoTA methods: MultiStack [6], NLSPN [13], and DC [12]. Different local regions, including poles, trees, cars, and traffic signs, illustrate the depth quality of close- and long-range depth pixels. The zoomed-in views show the substantial improvement of our depth map over SoTA, especially along sharp object boundaries. [6] has a more blurred estimate around boundaries, leading to mixed depth pixels and holes within objects, such as on the traffic poles and van. Although [13] has reduced mixed depths and tighter boundaries, depth mixing still exists (blurriness at object boundaries); additionally it suffers from jagged boundary edges and streaking artifacts.

Figure 4.6: Input image (a), its zoom-in views (d), our estimated foreground depth (b), background depth (c), fused depth (e), and the depth difference between foreground and background depth (f).

4.5.3.3 Qualitative Parsing

Fig. 4.6 offers a more detailed analysis of our method by showing the estimated foreground depth, background depth, and fused depth respectively. We choose five zoom-in views of diverse objects, e.g., trees, poles, cars, and even far-away depth pixels. It shows that our fused depth estimator can learn to choose foreground and background regions well, resulting in a clear shape estimation of objects. We note that it is biased to choose the foreground surface as ambiguity increases, e.g., at relatively large depth gaps between the foreground and background surfaces (see the depth difference in Fig. 4.6 (f)). This can be explained by the fact that there is more supervision in the close-up region than in the far-away region on account of the uneven distribution of GT depth pixels.

4.5.3.4 Relative Error Maps

It is worthwhile to examine where our method has lower errors in comparison with our baseline method [6], which uses MSE.
Two types of errors are examined: the absolute difference between estimated depth and GT, and its squared version, referred to as AE and SE respectively. Errors are evaluated on the semi-dense GT data. We calculate relative error maps as the difference of the error maps of the two methods, giving an Absolute Error difference A(i) and a Squared Error difference S(i) that show the gains of our method over MultiStack [6]. The error differences are calculated by:

    A(i) = |d̂_M(i) − d_t(i)| − |d̂_T(i) − d_t(i)|,   (4.9)
    S(i) = |d̂_M(i) − d_t(i)|² − |d̂_T(i) − d_t(i)|²,   (4.10)

where d̂_M and d̂_T are the depth estimates of MultiStack [6] and TWISE respectively, and A(i) and S(i) are the Absolute Error Difference and Squared Error Difference at pixel i for the two competing methods.

Figure 4.7: Difference of TWISE vs. MultiStack [6] in (a) Absolute Error (AE) and (b) Squared Error (SE) respectively. Red indicates the largest gains of ours over [6], marked by 'o', while blue is vice versa, marked by 'x'. Zoom in for details.

For a particular pixel, when A(i) and S(i) are positive TWISE is performing better than MultiStack, and vice versa for negative values. We note that the errors are evaluated only where there are valid ground-truth pixels. As shown in Fig. 4.7, our method wins at substantially more pixels than it loses. Errors in our method often come from a few pixels in boundary regions, when an FG depth is erroneously chosen over a BG depth or vice versa; we term these outliers, e.g., see the depth error at the traffic sign pixels and the edge of the tree trunk close to or at the boundary. These outliers with large depth errors are strongly weighted by the RMSE metric, leading to our worse performance on that metric.

To further our analysis, we perform a statistical evaluation (Fig. 4.8) on 200 samples of the validation set (every 5th sample from KITTI's 1,000-sample validation set). For the statistical analysis, we do a histogram binning of A(i) for pixels where A(i) > 0 (MultiStack error > TWISE error, i.e. a performance gain of TWISE over MultiStack) and of |A(i)| for pixels where A(i) < 0 (TWISE error > MultiStack error, i.e. a performance gain of MultiStack over TWISE). These histograms are plotted together in Fig. 4.8 (a). Analogous histograms are plotted for the squared error difference, S(i), in Fig. 4.8 (b).

Figure 4.8: (a) Magenta is a histogram of absolute error differences A(i) for A(i) > 0 (where MultiStack errors > TWISE errors) and green is a histogram of |A(i)| for A(i) < 0 (where TWISE errors > MultiStack errors). (b) Corresponding histograms for squared pixel error differences S(i).

These histograms show that TWISE has less error than MultiStack [6] for most pixels (∼2.70 × 10^6) compared to just ∼6,100 pixels where MultiStack bests TWISE. The average image in this set has 13,500 pixels where TWISE is better versus 31 pixels where MultiStack is better. The large RMSE errors of TWISE are believed to be caused by the outliers (erroneous FG/BG depth selection by TWISE) close to object boundaries. These outliers are penalized heavily by the RMSE metric, as opposed to the floating depth pixels estimated by MultiStack; as a result, our depth estimate suffers in that metric. Representative examples in Fig. 4.9 show error maps with depth errors around boundaries and missed thin objects like poles. This reasoning is further supported by Tab. 4.2.
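The per-pixel error differences of Eqs. (4.9)–(4.10), and the histogram binning used for Fig. 4.8, can be sketched as follows (NumPy, with placeholder array names; the bin edges are illustrative):

import numpy as np

def error_differences(d_multistack, d_twise, d_gt, valid):
    # Eqs. (4.9)-(4.10): positive values mean TWISE has the lower error at that pixel.
    ae_m, ae_t = np.abs(d_multistack - d_gt), np.abs(d_twise - d_gt)
    A = np.where(valid, ae_m - ae_t, np.nan)
    S = np.where(valid, ae_m**2 - ae_t**2, np.nan)
    return A, S

def gain_histograms(A, bins=np.logspace(0, 2, 30)):
    gains = A[np.isfinite(A) & (A > 0)]      # pixels where TWISE is better
    losses = -A[np.isfinite(A) & (A < 0)]    # pixels where MultiStack is better
    return np.histogram(gains, bins), np.histogram(losses, bins)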
For the analysis in Tab. 4.2, we leverage the GT semantics provided by the KITTI semantic segmentation dataset. In 140 images, FG objects are poles, boundaries, traffic signs, vehicles, and persons, and the rest is labeled as background. For each image, pixels whose distance to an object boundary is less than 3 pixels are labeled as edge pixels and the remaining as inside-object pixels.

Figure 4.9: Color images (top) and depth error maps in the 0–5 m range (bottom).

Area           MAE    RMSE    TMAE   TRMSE
Inside Object  196.1  752.3   138.6  327.3
Edge Pixels    731.6  2396.9  304.4  454.6
Whole Image    215.1  880.9   144.6  254.3

Table 4.2: Error metrics for different image regions on TWISE.

Tab. 4.2 confirms that substantially larger errors occur around boundaries.

4.5.3.5 Outlier Errors and Analysis on KITTI Semi-Dense GT

While outliers can be caused by wrong estimation of foreground/background depth, another important source of outliers is incorrect labelling of ground-truth depths in KITTI. As a result, loss functions that are more sensitive to outliers (i.e. the MSE loss) can be negatively influenced by the presence of this noise. We highlight the noisy ground-truth labels in KITTI in the next section; here we show some evidence of outliers (noisy ground-truth depth) on object boundaries in KITTI's semi-dense GT.

Uhrig [49] proposed an approach to generate large-scale semi-dense GT data (85k training images) on realistic outdoor scenes suitable for neural network training. Although the approach is scalable to any dataset, it creates noisy ground-truth depth. Uhrig's analysis shows that the semi-dense GT has larger errors on dynamic objects and long-range pixels. Additionally, we show that it also contains incorrect depth labels on some object boundaries. In both (a) and (b) of Fig. 4.10, we show zoomed-in views of how foreground and background depths are incorrectly spread across the boundaries of the poles, traffic signs, trees, etc. visible in the color images.

Figure 4.10: Semi-dense GT depths overlaid on color images. Zoom-in views show foreground/background depths incorrectly spread (dilated/constricted) across boundaries of poles, traffic signs, etc. visible in the color images.

Our analysis shows that the outliers in the semi-dense GT are caused by a variety of reasons:

• Noisy rotation R and translation t obtained from the IMU sensor
• Timing synchronization between the camera trigger and the time taken for one LiDAR revolution
• The consistency check against the Semi-Global Matching stereo algorithm, which introduces boundary artifacts
• Accumulation of LiDAR points from dynamic objects

MAE (in pixels)  RMSE (in pixels)  KITTI Outliers*  MAE (in cm)  RMSE (in cm)
0.35             0.84              0.31             38.6         94.1

Table 4.3: Relation between disparity error (pixels) and depth error in metric units (cm). Note that KITTI outliers* are defined by a > 3 px disparity error and > 5% error.

Method            RMSE (m)  REL    δ_1.25  δ_1.25²  δ_1.25³
DC-3co [12]       0.118     0.013  99.4    99.9     100.0
DeepLidar [85]    0.115     0.022  99.3    99.9     100.0
DepthNormal [87]  0.112     0.018  99.5    99.9     100.0
GNN [92]          0.106     0.016  99.6    99.9     100.0
TWISE             0.097     0.013  99.6    99.9     100.0
NLSPN [13]        0.092     0.012  99.6    99.9     100.0

Table 4.4: Depth completion results on NYU2 [4].

In order to evaluate the depth quality of the semi-dense GT, Uhrig [49] used the manually cleaned training set of the 2015 KITTI stereo benchmark as reference data. The depth evaluation is done in pixel units.
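For reference, the translation from disparity error in pixels to depth error in metric units used for Tab. 4.3 follows the standard stereo relation depth = f·B/disparity; the focal length and baseline below are approximate KITTI values and the example is only illustrative.

import numpy as np

def disparity_err_to_depth_err(disp, disp_err, focal_px=721.5, baseline_m=0.54):
    # depth = f*B/disparity; perturbing the disparity by disp_err perturbs the depth.
    depth = focal_px * baseline_m / disp
    return np.abs(depth - focal_px * baseline_m / (disp + disp_err))

disp_20m = 721.5 * 0.54 / 20.0                     # disparity of a point at 20 m (~19.5 px)
print(disparity_err_to_depth_err(disp_20m, 0.35))  # ~0.35 m of depth error at 20 m range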
We find it equally important to evaluate the semi-dense ground-truth depths in metric units, in order to observe the effect of boundary outliers on metric performance. We translate the error in pixel units to error in metric units in Tab. 4.3 by converting the ground-truth disparity to depth using KITTI's provided intrinsics. It shows that the noisy semi-dense ground-truth depths, which suffer from boundary noise and dynamic objects, can also have significant errors in metric units. It is also a possible indication that lowering the RMSE error on the semi-dense GT might amount to learning the noise inherent in the semi-dense ground truth.

4.5.3.6 Quantitative Results on NYU2

Results on NYU2 are shown in Tab. 4.4, based on its standard metrics. We currently rank second in all standard metrics. Note that compared to NLSPN [13], ours is 10× faster at inference on KITTI. The results also show that TWISE generalizes equally well to indoor scenes.

4.5.4 Ablation Studies

In this section, we conduct extensive ablation studies to investigate the effect of different parameters of our proposed loss. We train with 1/6 of the data (∼12K training samples) due to resource constraints, and maintain this protocol for all ablations unless otherwise noted.

                 Res-18 [10]                       MultiStack [6]
Loss             MAE    RMSE    TMAE   TRMSE       MAE    RMSE   TMAE   TRMSE
L1 [47]          282.6  110.6   181.8  295.6       211.0  950.0  138.6  246.0
L2 [10]          341.2  987.8   244.6  349.5       247.4  880.0  170.3  285.0
L2+L1 [13]       298.8  972.2   206.5  316.7       231.8  887.5  156.9  271.2
Huber [90]       288.6  1039.6  198.0  302.1       222.6  927.1  153.9  256.0
CE [12]          279.1  1125.1  184.3  239.1       –      –      –      –
TWISE            275.5  1045.1  181.1  294.0       201.3  927.6  134.1  240.1

Table 4.5: Effect of different loss functions. Compared to single-channel losses, CE requires 80 channels, while TWISE requires 3 channels.

Options                                  MAE    RMSE    TMAE   TRMSE
d̂_t = d̂_1 (σ = 1)                        306.9  1109.9  204.4  314.8
d̂_t = d̂_2 (σ = 0)                        295.4  1092.9  193.9  306.1
d̂_t = 0.5 (d̂_1 + d̂_2) (σ = 0.5)          220.7  854.8   148.2  262.4
d̂_t = d̂_1 or d̂_2 based on σ > 0.5        261.0  1008.0  180.4  287.9
No color                                 222.4  1067.5  139.2  247.8
d̂_t = σ d̂_1 + (1 − σ) d̂_2                193.4  879.4   131.1  236.0

Table 4.6: Effect of learned σ in TWISE, evaluated with our best model.

γ    MAE    RMSE   TMAE   TRMSE
1.0  223.1  950.1  145.8  257.0
1.5  207.8  947.9  138.1  245.1
2.0  201.3  927.6  134.1  240.1
2.5  204.4  932.5  136.1  242.5
5.0  207.1  923.4  138.7  246.1
10   216.1  922.8  146.7  255.4

Table 4.7: Effect of γ on depth completion performance.

4.5.4.1 Effect of Loss Functions

We show that the performance of our loss function is network-agnostic. Tab. 4.5 compares different loss functions typically used in SoTA depth estimation works. L2 is a widely used loss for estimating depth [10, 6, 48], while the L1 loss [47], Huber loss [90], and L1 + L2 [13] are other widely used losses for depth completion. We compare our TWISE loss with all of them, including the CE loss [12]. Top performance on MAE and TMAE shows the positive side effect of our loss in addressing the smearing problem at boundaries. We particularly note that TWISE performs better than a standard L1 loss on both backbone networks, leading us to believe that TWISE offers more benefit than a mere trade-off between MAE and RMSE.
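For completeness, the single-channel baseline losses of Tab. 4.5 can be sketched per pixel as below (evaluated on valid GT pixels only; the equal weighting of the combined L1 + L2 loss and the Huber delta are assumptions for illustration):

import torch
import torch.nn.functional as F

def baseline_depth_loss(pred, gt, mask, kind="l1", delta=1.0):
    # Single-channel losses compared in Tab. 4.5, restricted to valid GT pixels.
    err = (pred - gt)[mask]
    if kind == "l1":
        return err.abs().mean()
    if kind == "l2":
        return (err ** 2).mean()
    if kind == "l1+l2":
        return err.abs().mean() + (err ** 2).mean()
    if kind == "huber":
        return F.huber_loss(pred[mask], gt[mask], delta=delta)
    raise ValueError(kind)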
                              Noisy Semi-Dense GT                 Clean GT
Backbone         Method       MAE    RMSE   TMAE   TRMSE          MAE    RMSE    TMAE   TRMSE
MultiStack [6]   L1           8.79   49.9   7.09   16.02          14.43  130.62  6.16   18.28
                 L2           10.40  45.35  8.61   17.89          17.75  127.14  8.45   20.18
                 L1 + L2      9.42   44.90  8.23   16.82          15.45  126.20  7.14   19.30
                 TWISE        7.98   47.5   6.25   15.35          12.71  126.4   5.22   16.67
                 CE           10.50  58.67  8.64   16.57          19.03  155.24  8.26   18.29
ResNet-18 [10]   L1           12.95  62.97  14.68  19.25          23.52  147.42  12.51  29.33
                 L2           17.45  50.48  17.48  21.26          27.48  133.21  15.57  36.73
                 L1 + L2      14.21  48.25  15.80  20.10          25.40  132.6   14.35  32.47
                 TWISE        10.24  52.37  8.42   16.77          18.88  132.94  9.45   18.17

Figure 4.11: (a) Results of the Virtual KITTI experiments supervised on synthesized noisy semi-dense GT and on clean GT respectively (units in cm), reproduced as the table above. (b) MAE and (c) RMSE scatter plots (Semi-Dense vs. Clean GT) for the different loss functions (colored symbols) and the two backbone networks (MultiStack [6] and ResNet-18 [10]). Methods trained with the same backbone network are connected.

4.5.4.2 Effect of σ on Estimated Surfaces

Another interesting evaluation is the importance of the learned σ for the different estimated surfaces. In Tab. 4.6, we evaluate the estimated depths for different combinations of σ and compare their depth completion metrics individually. The performance is evaluated with our best model from Tab. 4.1, except for the row with "no color", where we train without color input on the same network as our best model. From Tab. 4.6, the foreground and background depth surface estimates, as expected, have higher error metrics, since each is individually a biased estimate of depth. If we fix σ at 0.5, we see it is possible to achieve decent performance on MAE and RMSE on account of averaging (interpolation) between the two surfaces. If we instead make a binary choice between the foreground and background surfaces based on σ > 0.5, the results are worse than averaging. In addition, we see that σ is not learned effectively without color input. Thus high-resolution imagery helps learn an effective σ and resolve ambiguities at the boundaries.

4.5.4.3 Effect of γ on Performance

Since γ impacts the separation of the foreground and background surfaces, we perform an ablation to assess its impact on TWISE. Tab. 4.7 shows depth completion performance for several γ values. With γ = 1, the loss is equivalent to MAE. As γ increases, the gap between the foreground and background surfaces increases. At small γ values interpolation benefits, leading to lower MAE, TMAE, and TRMSE, since it is easier to interpolate between two nearby surfaces; however, extrapolation suffers, leading to higher RMSE. At larger γ, the slope between the two surfaces increases, and interpolation becomes harder. We choose γ = 2.0 in our experiments as a compromise between interpolation and extrapolation.

4.5.4.4 Effect of Sparsity on Depth Performance

We also ran an extensive ablation study on the generalization of SoTA methods with respect to input sparsity. Sparsity is created by subsampling LiDAR points in azimuth-elevation space to simulate LiDAR-like structured patterns. We simulate lower-resolution LiDARs by subsampling 32R, 16R, and 8R rows from the 64R LiDAR (the depth acquisition sensor used by KITTI). The different sparse patterns can be seen in Fig. 4.12.
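A sketch of how these row-subsampled inputs can be generated, assuming each LiDAR point's row (ring) index has already been recovered as described in the next paragraph; the helper names and projection step are ours:

import numpy as np

def subsample_rows(points, row_idx, keep_every=2):
    # Keep an evenly spaced subset of the 64 rows: keep_every=2 -> 32R, 4 -> 16R, 8 -> 8R.
    return points[(row_idx % keep_every) == 0]

def project_to_image(points_cam, K):
    # Project camera-frame points into the image plane to form the sparse depth input.
    z = points_cam[:, 2]
    uvw = (K @ points_cam.T).T
    return uvw[:, 0] / z, uvw[:, 1] / z, z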
We subsample the points by selecting a subset of evenly spaced rows of the 64R raw data provided by KITTI (split based on the azimuth angle in the LiDAR space) and then projecting the points into the image. All the SoTA methods compared have been retrained using the author-provided code with the variable sparse input patterns. Tab. 4.8 shows that TWISE generalizes better and exhibits significantly lower errors in all metrics compared to the SoTA methods. With more sparsity, TWISE is able to beat the RMSE of methods supervised by standard losses. Particularly interesting is the fact that TWISE can be used for monocular depth estimation with no sparse depth input.

Sparsity  Method          MAE     RMSE     TMAE   TRMSE
64R       DC [12]         279.1   1125.1   183.1  292.3
          MultiStack [6]  229.4   889.7    156.8  265.0
          NLSPN [13]      219.1   868.0    147.7  263.4
          TWISE           201.3   927.6    134.1  240.1
32R       DC              392.7   1456.2   232.1  350.7
          MultiStack      439.2   1288.8   275.4  402.3
          NLSPN           392.4   1229.2   248.2  373.8
          TWISE           327.9   1242.6   204.9  324.3
16R       DC              477.7   1777.3   259.5  382.9
          MultiStack      528.4   1504.3   308.6  439.5
          NLSPN           497.1   1483.1   286.8  419.2
          TWISE           414.0   1481.1   237.3  365.1
8R        DC              634.7   2311.9   288.5  420.6
          MultiStack      672.58  1841.6   353.2  486.8
          NLSPN           669.05  1869.5   340.3  475.2
          TWISE           532.1   1782.5   275.6  409.4
RGB       DC              2423.8  4433.6   715.4  797.2
          MultiStack      2070.4  4185.1   635.7  735.4
          NLSPN           2192.9  4362.35  646.0  743.6
          TWISE           1964.1  4078.8   612.0  716.5

Table 4.8: Row sparsity impact on SoTA depth completion methods.

Figure 4.12: KITTI sparse patterns of (a) 64R, (b) 32R, (c) 16R, and (d) 8R subsampled LiDAR respectively, overlaid on a color image.

4.5.4.5 Synthetic Experiments with VKITTI

Using both the semi-dense GT and the clean GT of VKITTI, we ran experiments with different loss functions and two different backbone networks. Conclusions are drawn by training and evaluating on the noisy semi-dense and clean GT respectively. The results are shown in Fig. 4.11 (a). Several inferences can be drawn from the scatter plots of Fig. 4.11 (b) and (c). Firstly, the MAE score is smooth and monotonic, as opposed to RMSE, which zigzags. This implies that given an MAE score on the semi-dense GT, we are able to predict its score on the clean dataset as well. Additionally, the ranking of the methods on both datasets is the same for MAE but not for RMSE. As a result, we can conclude that MAE is a superior metric to RMSE for comparing and ranking depth completion methods. Secondly, TWISE is more than a trade-off between MAE and RMSE. One of the objectives of TWISE is to improve depth points in discontinuity regions, but the KITTI semi-dense GT lacks dense ground-truth depth points and contains more outliers in boundary regions owing to the methodology adopted in creating the GT. In the presence of these outliers, the RMSE of TWISE suffers the most, but when clean GT is provided, the RMSE of TWISE performs as well as that of methods trained with the L2 loss.

4.6 Conclusion

In this chapter we propose TWISE, a new twin-surface representation and estimation method for depth images. Our proposed asymmetric loss functions, ALE and RALE, bias these twin surface estimates towards the foreground and background at pixels with depth ambiguity. A third channel of our output fuses these estimates to achieve a single surface estimate. This solution simplifies the task of learning depth discontinuities, and as a result better maintains step-wise depth discontinuities across boundaries, and generates SoTA depth estimates.
We also compared the robustness of MAE and RMSE as metrics for ranking depth completion methods, and our analysis suggests that MAE is the superior metric in the presence of noisy GT datasets.

Chapter 5
3D Object Detection from Noisy Depth

5.1 Introduction

In autonomous driving, active depth acquisition sensors like LiDARs and radars are paramount in scene understanding and perception problems such as 3D object detection and localization [101, 9, 102, 103], 3D semantic segmentation [104, 105, 106] and navigation [107, 108], but they become expensive as sensor resolution, range, and accuracy increase. Depth sensors typically have cm-level accuracy and record raw depth measurements as 3D pointclouds, which are irregularly sampled points on object surfaces. Recovering 3D structure and shape from pointclouds is important for perception and 3D surface reconstruction, but challenging due to the sparseness of the data and missing depth points on objects caused by occlusions and surface properties. As a result, high-density pointclouds are often desired. Since high-density and long-range depth sensors are expensive¹, there has been active research on estimating high-density depth using cheap sensors such as stereo cameras [109, 21, 23] (stereo depth estimation), monocular cameras [110, 29] (monocular depth estimation), or low-resolution LiDARs or radars combined with high-resolution color imagery (depth completion) [37, 7, 13]. Unfortunately, these estimated depths often result in inaccurate 3D pointclouds. The question we are trying to answer in this chapter is whether additional, but lower-accuracy, depth points from depth completion can help in perception problems like 3D object detection and pose estimation. Extensive experiments on synthetic (Virtual KITTI) and real (KITTI) datasets have been performed using SoTA architectures to validate the findings.

¹ https://arstechnica.com/cars/2018/05/why-bulky-spinning-lidar-sensors-might-be-around-for-another-decade/

There are different ways of densifying sparse depth points or pointclouds: upsampling onto a 2D regular grid, or onto 3D irregular (pointcloud) structures. Each of these is fundamentally different from the other. The idea of upsampling an irregular pointcloud in 3D [111, 112, 113] is appealing since the sparse depth points are already in real metric space, scale and the empty space between objects are well captured in this space, and no artifacts are created from perspective distortion and occlusions. Also, the densification of points can be made uniform throughout the 3D space. However, these upsampling methods are still limited to a single surface or multiple surfaces from the same object, and require 3D models of objects for supervision. In real-world scenes, multiple discontinuous surfaces and disconnected regions can exist. In this chapter, we focus on depth completion, which performs depth upsampling on a 2D regular grid from LiDAR pointclouds projected into the image plane. In this way, high-resolution color imagery can be leveraged more readily for filling incomplete depth pixels in a 2D grid.

A big difference between upsampling in 2D space and in 3D space is that the depth-completed map is non-uniformly dense: close-range points are dense while long-range points are sparse in 3D. Also, only visible surfaces can be readily upsampled in the image grid. As a result, artifacts can be created by interpolation, erosion or dilation of the available depth pixels in the grid.
Common artifacts are floating depth pixels between objects, holes in occluded objects, and the suppression of thin and small structures by foreground objects. However, one key reason for doing upsampling in 2D space is that a depth image samples the environment in a similar fashion to the data sources (LiDAR and video), with the density of points falling off with range, while 3D grids do not naturally sample the environment in this way. Also, traditional 2D convolutional neural networks (CNNs) can be readily applied to image grids, and they require far smaller computational and memory footprints compared to point-based, voxel-based or mesh-based architectures. Thus 2D grids are still a preferred choice for upsampling depth maps.

Figure 5.1: Object detection from a SoTA architecture [5] trained with dense but inaccurate pointclouds backprojected from a depth image produced by a SoTA depth completion method, TWISE [7]. (a) Several false positive detections (red 3D cuboids) on the estimated high-density pointcloud; the estimated (colored) and LiDAR (white) pointclouds are shown in 3D space. (b) The estimated depth map in 2D space from which the pointcloud in (a) originates; the LiDAR points are shown as white dots. (c) The estimated and ground-truth 3D bounding boxes projected into the 2D image space of the scene; red and blue boxes are the predicted and ground-truth 3D bounding boxes respectively. A bush and a phone booth are wrongly classified as cars.

Several estimation errors can arise from upsampling depth maps. The errors vary with long-range depths, boundary ambiguity, and highly sparse or non-existent LiDAR points on thin and small structures and far-away objects. We note that raw LiDAR points also have errors, but they are typically at cm-level accuracy. In contrast, the scale of estimation errors can range from a few centimeters to a few meters depending on the distance of objects from the camera, ambiguous surfaces or boundaries, etc. We model some of these errors as noise with certain characteristics (see Sec. 5.3.2) and use the two terms interchangeably.

We show in this chapter that SoTA object detection performance often suffers from the depth error of estimated dense depths (see Fig. 5.1). We identify different types of estimation errors that typically exist in depth completion results. Similar detection performance drops are also evident when we simulate the error as noise with certain characteristics in Virtual KITTI (VKITTI) [100], a synthetic dataset with a setup similar to the KITTI [1] dataset. Interestingly, our experiments reveal that high-resolution depths do contribute to better object detection performance if the depth maps are noiseless and free of artifacts, as is the case for VKITTI. This leads us to design simple, yet elegant noise filtering techniques to tackle the depth error of estimated depth maps. We conclude that reducing depth error and sampling depth points from relevant areas are key to improving detection performance with high-density but noisy depth points.

The main contributions of the chapter are summarized as follows:

• We investigate the effect of high-resolution but noisy depth on 3D object detection with SoTA point-based and voxel-based neural network architectures.
• We study the effect of noise-free and noisy depth on 3D object detection using a synthetic dataset.
• We propose an effective way to leverage estimated depth from depth completion for better object detection performance.

5.2 Related Works

5.2.1 Depth Completion and Depth Prediction

While depth completion uses LiDAR and high-resolution color imagery to complete the missing depth pixels, depth prediction estimates depth from a color image only. Quite naturally, depth completion has lower depth error metrics than monocular/stereo depth estimation, since LiDAR depth measurements are additionally used as input signals. Deep Neural Networks (DNNs) are applied to both sets of problems. Both problems increasingly rely on geometric and semantic constraints like depth-normal consistency [86, 87, 110] to improve depth completion and monocular depth prediction performance. Some methods devise novel depth representations [12, 7, 27, 114] to improve depth metric performance.

5.2.2 3D Object Detection with Multi-Modal Sensors

3D object detection is a widely researched and challenging perception problem in vision, and a wide variety of sensors are used for this purpose. Some of the most common sensors in a typical autonomous-vehicle sensor suite are LiDARs, cameras, radars, GPS and even ultrasonics. Detection performance varies widely with the type of sensor. We focus our discussion on LiDARs and cameras, as these are the sensors most broadly used to develop perception algorithms.

The most widely used sensors are LiDARs, which generate unordered and irregular pointclouds in 3D with cm-level accuracy. Depth-based 3D object detector networks encode these pointclouds with point-based [101, 115, 116, 117] and voxel-based [9, 118, 119] neural networks to represent the geometry of a scene. Both two-stage RPN networks and single-stage networks are ubiquitously used. Two-stage networks have a region proposal network followed by a refinement network to fine-tune the regression parameters of the bounding boxes. Typically, two-stage networks have better detection performance than single-stage detectors, although single-stage detectors are more suitable for real-time scenarios. Some recent works [120, 121] prefer bird's-eye view (BEV) representations of these pointclouds as input to traditional image-based CNNs to improve the efficiency of 3D detection.

A comparatively much cheaper sensor in the suite is the camera, where researchers explore ways to infer 3D bounding box parameters like physical size and orientation from predicted 2D bounding boxes in image space [122, 123, 124], under the assumption that the perspective projection of a 3D bounding box fits tightly within its respective 2D detection window. Strong prior shapes of 3D objects [125], shape and geometry of the objects [126], spatial context of the scene [127], and even temporal information [128] are all taken into account to improve monocular 3D detection. However, due to the lack of direct depth measurements, the performance gap between LiDAR-based detection and camera-only detection is still wide. Recent trends estimate depth [129, 130] or leverage networks pre-trained on depth [131] to estimate 3D bounding box parameters from a monocular image, which boosts accuracy to some extent, but performance still suffers due to erroneous depth estimation.
To improve the robustness and accuracy of 3D detection algorithms, several existing studies propose to fuse multiple sensors (LiDARs, cameras, radars, etc.) to take advantage of their complementary characteristics: raw LiDARs provide depth measurements with cm-level accuracy, color images provide dense appearance features over a specific field of view of the scene, and radars provide velocity information for objects and are more robust to adverse weather conditions. Fusion requires transforming the sensor modalities into a common coordinate space, since different modalities have different viewpoint locations. Amongst all the sensors, LiDARs and cameras are the most widely researched for multimodal fusion [132] to improve perception algorithms. Early-level or pixel-level fusion [2, 133], mid-level or intermediate-feature-level fusion [14, 5, 134, 135], and late-level or detection-level fusion [136] all exist in the literature, each with advantages and disadvantages. But in all these methods, LiDARs with high-quality depth measurements play a big role in detection performance [132] due to the precise localization of objects in the 3D scene, robustness to different lighting conditions, etc.

5.2.3 3D Object Detection from Estimated Depth

Since high-resolution LiDARs are expensive, some recent research efforts propose to perform 3D object detection using estimated depth from monocular [137], stereo [138] or even low-resolution LiDAR [139] input instead of high-resolution raw depth measurements. Some of the pioneering works in this field are [138, 139], which create a LiDAR-like representation from estimated depths by sampling depth in azimuth-elevation space. Qian et al. [140] use end-to-end learning on both the depth and object detection networks. Weng et al. [137] use an instance segmentation mask on the estimated depth map to reduce the effect of smeared/floating depth pixels for monocular 3D object detection. Although all the above methods create a 64R pseudo-LiDAR representation, little study has been done on the effect of high-resolution depth points estimated on a dense 2D grid on object detection performance. In this chapter, we study the effect of such depth points on detection performance using estimated depth from a depth completion perspective. In our approach, the inputs to the depth network are a color image and raw LiDAR measurements, and we use the estimated depth as input to the 3D object detection network. The aim is to see whether a high-density but estimated pointcloud, when fed to SoTA object detection neural networks, can help improve 3D object detection compared to the raw LiDAR pointcloud.

Figure 5.2: 3D detections from a high-density pointcloud. The color image and sparse depth are fed into a depth completion network to obtain a dense depth estimate. The high-density depth map is then converted to a 3D pointcloud using the camera parameters and fed to a SoTA object detection network; the end result is 3D detections (red 3D bounding boxes) in the pointcloud.

Our preliminary investigation suggests that additional depth points from the raw depth completion result do not help improve 3D object detection, as shown in Fig. 5.1 and Tab. 5.1. The next section investigates the root cause of the problem and proposes some effective strategies to improve detection with high-density depth points.
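The depth-map-to-pointcloud conversion in Fig. 5.2 amounts to backprojecting every valid pixel through the camera intrinsics; a minimal sketch (array names are ours):

import numpy as np

def depth_to_pointcloud(depth, K):
    # Backproject an HxW depth map into an (N, 3) pointcloud in the camera frame.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    return pts[z > 0]   # drop empty / invalid depths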
5.3 Impact of Noisy Depth on Object Detection

In this section, we present several key insights on using raw dense depth completion results for 3D object detection. We show the effect of depth errors on detection, the architecture bottlenecks present in existing SoTA architectures when handling high-density depth points, and ways to tackle these depth errors to improve detection performance.

5.3.1 Baseline Results

We start with a 2-stage SoTA architecture for pointclouds and apply the standard loss functions typically used in 3D object detection. The only change is at the data input of the network: instead of using the 64R LiDAR points, we backproject the 2D dense depth estimate into a 3D pointcloud using the camera intrinsics (see Fig. 5.2). To facilitate batch training, we keep the initial number of depth points fixed at 50k. The results are shown in Tab. 5.1.

Network          Input           Car 3D AP               Car BEV AP
                                 Easy   Med.   Hard      Easy   Med.   Hard
Object           Raw             89.0   78.69  76.68     93.08  87.12  85.0
Object           Semi-Dense GT   75.82  60.09  59.02     82.31  67.85  67.22
Depth + Object   MultiHourGlass  40.21  25.2   20.65     51.8   31.64  26.7
Depth + Object   TWISE           82.04  59.79  52.80     89.42  68.30  61.0

Table 5.1: 3D object detection results (3D and bird's-eye view (BEV) average precision respectively) with different depth resolutions. Object refers to the object detection network [5] and Depth refers to the depth completion network. Raw refers to 64R LiDAR, and Semi-Dense GT refers to the GT created by accumulating LiDAR points, used for supervision of the depth completion network in KITTI. We use the results of two depth completion networks (MultiHourGlass [6] and TWISE [7]) for comparison purposes.

From Tab. 5.1, comparing the two depth completion methods, TWISE gives the better performance owing to fewer smeared points at the boundaries [7] (refer to Sec. 4.5.3.4). However, it is still not enough to improve over object detection using raw 64R LiDAR, which has on average ≈18k points at the input, compared to ≈200k high-density depth points. Most interestingly, object detection with the semi-dense KITTI ground truth gives worse performance than TWISE. Possible factors are the non-uniform distribution of close and far points, the pruning of LiDAR points when they are inconsistent with stereo depth, and depth noise in the form of outliers (refer to Sec. 4.5.3.5 on creating the semi-dense ground truth), indicating that semi-dense depth can also deteriorate 3D object detection performance.

We further evaluate object detection at two different depth ranges, 0–30 m and 30–70 m (Tab. 5.2), and compare the performance on different categories of objects (easy, medium, and hard) based on object size and level of occlusion. We see that the performance gap between the sparse and dense pointclouds increases for moderate or hard categories and for far-away objects. This indicates that under heavy occlusion and at long range, limited visibility in the camera plane and limited ground-truth supervision at far-away depths result in noisier, more ambiguous depth estimates, leading to erroneous 3D object detections.

Range (m)  Depth Density  3D AP                   BEV AP
                          Easy   Medium  Hard     Easy   Medium  Hard
0–30       Raw            94.32  92.67   89.51    95.85  96.1    95.19
0–30       Dense          89.52  87.88   77.61    95.94  93.87   81.40
30–70      Raw            45.62  55.75   54.27    53.65  70.56   68.94
30–70      Dense          39.38  42.35   36.83    54.07  58.37   51.56

Table 5.2: AP comparison of raw (64R LiDAR) and dense depth (TWISE) at different depth ranges in meters.
Numerical results also show that dense depth performs worse than raw LiDAR in all categories, with the gap increasing at longer ranges and for more difficult categories.

5.3.2 Noise Modelling

From the previous section, we notice a degradation of object detection performance due to erroneous high-density depth estimates. In this section, we model this error with noise characteristics that are consistent with these depth estimates. Assume there is an underlying GT surface to be estimated at each pixel. The LiDAR samples a subset of these pixels exactly or with small noise. The depth completion method estimates all the pixels with some error. These estimated pixel depths generate a pointcloud with a noise term that models the errors in the depth for each pixel. For notational convenience, we use the terms D and D̂ for the original and estimated depth respectively.

We consider two kinds of errors in these estimates: mixed depth and depth misalignment. Mixed depth, or floating depth pixels between foreground and background, can be introduced by a spatial Gaussian blur. The Gaussian blur function is a 2D spatial filter kernel given by:

    G(x, y) = (1 / sqrt(2πρ²)) exp( −(x² + y²) / (2ρ²) ),   (5.1)

where x and y are the spatial positions in the grid and ρ is the standard deviation of the filter. The estimated depth is obtained by convolving the original depth with this kernel:

    D̂ = D ∗ G(x, y) = Σ_{s=−a}^{a} Σ_{t=−b}^{b} G(s, t) D(x + s, y + t),

where s and t range over the kernel size of the filter.

Depth misalignment error replicates larger errors in depth as distance increases. We use an additive noise model for this error. The noise η comes from a zero-mean Gaussian distribution with standard deviation ν, i.e. η ∼ N(0, ν), whose uncertainty (standard deviation) increases linearly with distance. The linear relationship between uncertainty and distance is given by:

    ν = 0.05 + aD,   (5.2)

where a is the slope of the linear function. The additive noise model is then defined by:

    D̂ = D + η.   (5.3)

Although other errors are possible, the above-mentioned errors are the most prevalent in depth completion methods. In the context of 3D object detection, these errors can be further classified into three different types, as seen in Fig. 5.3:

• Noisy depth points within the foreground: depth points on the foreground object can spread out within and around the 3D bounding box; as a result, object shape and size can appear distorted (see Fig. 5.3 (c)). We label this noise-fg.
• Smeared depth points between FG and BG: floating depth points between foreground and background. We label this noise-inbet.
• Noise from background clutter: there can be structures in the background, e.g. bushes and telephone boxes, whose outer surfaces look similar to an object's surface, as shown in Fig. 5.3 (i). We label this noise-bg.
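A sketch of how the two noise models of Eqs. (5.1)–(5.3) can be synthesized on a clean depth map; it uses SciPy's normalized Gaussian kernel rather than the exact normalization of Eq. (5.1), and the parameter values are illustrative:

import numpy as np
from scipy.ndimage import gaussian_filter

def add_mixed_depth_noise(depth, rho=1.5):
    # Spatial Gaussian blur (Eq. (5.1)): smears depth across discontinuities,
    # creating floating pixels between foreground and background.
    return gaussian_filter(depth, sigma=rho)

def add_misalignment_noise(depth, a=0.01, seed=0):
    # Additive zero-mean Gaussian noise (Eqs. (5.2)-(5.3)) whose standard
    # deviation nu = 0.05 + a*D grows linearly with distance.
    rng = np.random.default_rng(seed)
    nu = 0.05 + a * depth
    return depth + rng.normal(size=depth.shape) * nu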
Figure 5.3: Different types of depth noise in estimated depth; the first column shows color images, the second column shows depth images, and the third column shows 3D pointclouds. (a), (b) and (c) show noise-fg, indicating smeared points within the car resulting in misalignment of the predicted and GT bounding boxes, shown in red and blue respectively; (d), (e) and (f) show noise-inbet, indicating a smeared pointcloud between two objects resulting in the wrong orientation of the second car; and (g), (h) and (i) show noise-bg, where background points (a phone booth) are wrongly classified as a car since its outer surface looks similar to a car shape.

Note that noisy pointclouds also arise due to limitations of the acquisition equipment. However, the properties of noise from estimated depth are very different from the noise that comes from acquisition equipment [141, 142]. Sensor noise typically comes in the form of sensor imperfections, temperature noise, etc., and the sensors typically have cm-level accuracy, which is an order of magnitude lower than the noise from estimated depth. In addition to the noisy dense depth estimate, SoTA backbone networks for 3D detection are not well suited to high-density noisy depth points, as we describe next.

5.3.3 Architectures

VoxelNet- and PointNet-based architectures are increasingly used as backbones in 3D object detection. All of these architectures rely on subsampling depth points. Voxel-based architectures divide the 3D space into voxels and subsample depth points based on voxel regions. Voxelized pointclouds are either projected to different views, such as bird's-eye view (BEV) or range view (RV), to be processed by 2D convolutions, or kept in 3D coordinates to be processed by sparse 3D convolutions. Point-based methods can preserve precise localization and detailed 3D structure information, while voxel-based methods are fast due to aggressive subsampling of points into defined grids.

Deng et al. [9] argue that 3D structure is of significant importance for object detection. Although time-consuming and memory-intensive, point-based methods have the potential to encode detailed 3D information quite well. However, these methods also rely on downsampling the pointcloud at each encoder layer for efficiency and an enlarged receptive field. A very important component of point-based architectures is furthest point sampling (FPS), which uniformly samples depth points from low-density (far-away) and high-density (close-by) regions in 3D. It ensures diversity in the selection of points. However, it also promotes noisy pixels and outliers, as we describe next.

5.3.3.1 Promotion of Noisy Pixels using FPS

FPS is used to select representative points at each encoding layer of point-based architectures. Downsampling the pointcloud is essential for efficient encoding of the points. However, by preferentially selecting points that are spatially far apart from others, FPS promotes background points and outlier points over relevant object pixels. Fig. 5.4 shows FPS selecting background points (trees and vegetation) and outlier points (at boundaries) from the dense depth estimate of TWISE.

Figure 5.4: Selecting 4096 points from (a) 64R raw LiDAR and (b) the dense depth (noisy environment) of TWISE. FPS samples more background and outlier points at the boundaries of tree trunks and buildings from the noisy dense depth.
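For reference, a minimal (unoptimized) NumPy sketch of furthest point sampling, which also illustrates why isolated outliers and spread-out background points tend to be selected early:

import numpy as np

def furthest_point_sampling(points, m, seed=0):
    # Greedily pick m points, each time taking the point furthest from the selected set.
    # Isolated outliers are far from everything, so FPS picks them early.
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = [int(rng.integers(n))]
    min_sq_dist = np.full(n, np.inf)
    for _ in range(m - 1):
        diff = points - points[selected[-1]]
        min_sq_dist = np.minimum(min_sq_dist, np.einsum('ij,ij->i', diff, diff))
        selected.append(int(min_sq_dist.argmax()))
    return points[np.array(selected)]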
5.3.3.2 Downsampling in Point-Based Architectures

Figure 5.5: Subsampling in a point-based architecture. The N input points are progressively subsampled by FPS to 4096, 1024, 256, and finally 64 encoded points at the last encoder level.

Downsampling is essential for encoding the points and keeping memory and computation reasonable. A typical PointNet backbone downsamples the N input points to 4096, 1024, 256, and 64 points at successive encoder stages, see Fig. 5.5. This architectural bottleneck, together with FPS, restricts how many relevant object points reach the deeper encoder layers. Having studied the architecture limitations of the point-based 3D backbone network, we now explore some remedial steps that can be used to filter out noisy pixels from the estimated depth and bypass some architecture limitations of 3D object detection networks. The next section describes these steps.

5.3.4 Remedies to Tackle Noisy Depth

We investigate some potential ways to filter out noisy depth from dense depth estimates and bypass the architecture limitations of the backbone networks. This section explains the analysis.

5.3.4.1 Filtering Smeared Points

Figure 5.6: Removing smeared points with TWISE. (a) shows the rectangular region with 0.2 < σ < 0.8 and depth difference (difference between BG and FG in TWISE) ≥ 3 m that is labeled as smeared depth points; σ refers to the σ parameter learned by TWISE. (b) and (c) show the unfiltered and filtered depth points in 3D space respectively. As shown, most of the floating depth pixels in the scene in (b) are filtered out in (c).

Although TWISE can reduce depth smearing considerably, the σ parameter can still allow depth mixing in ambiguous regions. These floating/smeared depth pixels along depth discontinuities and at far-away depths can trouble object detection algorithms. Qiu et al. [85] liken smeared points to salt-and-pepper noise and rely on traditional 2D median filtering to remove this noise. Another way to remove them is to check for consistent depth points between the estimated depth and depth points created by morphological filtering of the sparse depth [143]. However, the effectiveness of morphological filtering decreases as the input raw LiDAR points become more and more sparse in far-away regions. There are some natural ways to filter out floating depth pixels between foreground and background based on the TWISE estimate. We observe that hard ambiguous regions in TWISE are often characterized by larger discrepancies between the foreground and background depth channels. Depth mixing typically occurs in these regions when the σ parameter of TWISE does a soft averaging of the FG and BG estimates instead of selecting either depth. This motivates us to design an effective filter for removing smeared points. We use a depth-difference threshold between the foreground and background depths and a σ threshold to design the filter (see Fig. 5.6 (a)). It shows that by pruning depths in this rectangular band region, it is possible to largely remove floating depth pixels, or noise-inbet, around boundaries and at long range. We name this the sigma filter for ease of explanation in the next sections.

5.3.4.2 Filtering Background Clutter

We use grid sampling in 3D to downsample the dense depth uniformly over close and far-away points. Grid sampling prunes points that fall within t m (in our case, 0.2 m) of a grid cell and replaces them with a representative point. Additionally, we use semantic map information of objects (e.g.
vehicles, pedestrians, cyclists) generated by the SoTA method [8] to filter out most of the background depth clutter. This ensures that only relevant object pixels are used for subsampling the pointcloud. Some background context of the object can still be gathered by collecting points that fall within an r m radius (in our case, 0.3 m) of any relevant car pixel (see Fig. 5.7). Using this strategy, it is possible to reduce the number of points significantly while ensuring relevant pixels are always considered as input to the network. It also makes it possible to bypass the architecture limitations of the point-based architecture.

5.3.4.3 Filtering Pixels within an r Distance from Raw LiDAR

Instead of using only the estimated depth, we can also leverage the raw LiDAR, which is available as input to any depth completion problem. The LiDAR points select from the dense depth any points that fall within an r m radius of them. In this way, it is possible to avoid smeared points and gather only meaningful background points for context. Due to perspective distortion, close and far-away depth points are unevenly distributed in the image plane. Any depth points within d m of the camera are termed close pixels, and pixels beyond d m are far-away pixels. We use two different r values, r_1 and r_2, for depths ≤ d and > d respectively. This new depth can be termed augmented LiDAR, since the LiDAR is now augmented with the dense depth estimate.

Figure 5.7: Sampling to select relevant pixels from dense depth. (a) ≈315k points with dense depth, (b) ≈200k after pruning ambiguous pixels with the sigma filter (Sec. 5.3.4.1), (c) ≈35k after grid sampling at 0.2 m, (d) ≈15k points after using the object-level semantic mask and gathering neighboring points within 0.3 m of the semantic pixels in 3D.

5.4 Experiments and Results

We evaluate the proposed approach on the standard KITTI dataset [1], a real-world outdoor dataset, and on Virtual KITTI [100], a synthetic dataset with photo-realistic images and dense ground-truth depth.

5.4.1 Dataset

5.4.1.1 KITTI

KITTI's 3D/BEV object detection benchmark contains 7481 training samples and 7518 testing samples. The training set is further divided into a train split (3712 samples) and a validation split (3769 samples). The dataset has difficulty levels (easy, medium, and hard) based on object size, occlusion, and truncation level. There are three major object categories: car, pedestrian, and cyclist. We use the 'Cars' category for training and evaluation.

5.4.1.2 Virtual KITTI

We evaluate on VKITTI 2.0, a synthetic dataset with noise-free and dense GT depth at depth discontinuities, accurate 3D object bounding boxes, and noise-free semantic labels. VKITTI 2.0, created with the Unity game engine, contains 5 different camera locations (15° left, 15° right, 30° left, 30° right, clone) in addition to 5 different driving sequences. Additionally, there are stereo image pairs for each camera location. For training and testing, we only use the clone (forward-facing camera) with stereo image pairs. For VKITTI training, 2k training images were created from driving sequences 01, 02, 06, and 018 respectively. For testing, we use sequence 020 from both the left and right stereo cameras and use all of its frames, for a total of 1.6k images. We subsample the dense GT depth in azimuth-elevation space to simulate a LiDAR-like pattern as sparse input.
Unlike KITTI, VKITTI does not provide easy, medium, and hard labels on its own, so we label the raw VKITTI 3D object detection boxes as easy, medium, or hard using the provided truncation and occlusion levels. An object is classified as easy if the height of its 2D bounding box is greater than 40 pixels and both truncation and occlusion are less than 0.1. Medium objects have height ≥ 20 pixels and (0.1 ≤ truncation ≤ 0.5 or 0.1 ≤ occlusion ≤ 0.6). Hard objects have height ≥ 15 pixels and (0.5 ≤ truncation ≤ 0.9 or occlusion ≥ 0.6).

5.4.2 Metrics

We use average precision as the common metric to evaluate 3D object detection. The KITTI benchmark recently adopted a new evaluation protocol that uses 40 recall positions instead of 11 to make the evaluation fairer. All methods are compared under this new protocol.

5.4.3 Architecture

We use various backbone architectures (point-based, voxel-based, or a mix of points and voxels) in our study. We choose a variant of the PointRCNN architecture (see [5]) as the backbone in all ablation studies unless noted otherwise.

5.4.4 Implementation Details

Our implementation of PointRCNN2 is similar to the implementation of EPNet [5], except that no color fusion network is used. This is done to eliminate the contribution of the color fusion network and isolate the effect of a noisy pointcloud on object detection. The range of the pointcloud is restricted to [−40, 40], [−1, 3], and [0, 70.4]m along the X (right), Y (down), and Z (forward) axes in camera coordinates, and the orientation θ is in the range [−π, π]. The number of input points is fixed at 16384 for raw LiDAR to facilitate batch training. For dense depth, we randomly subsample 50k points as input to the network. Afterward, FPS is used to subsample the points to 4096, 1024, 256, and 64 at successive stages, similar to the point-based backbone of Fig. 5.5. We keep the focal loss for foreground/background classification and the consistency loss between localization and classification confidence as proposed in the paper. During inference, the top 8000 proposal boxes generated by the RPN are selected based on classification confidence. An NMS threshold of 0.8 is then used to filter out redundant boxes and obtain 64 positive candidate boxes, which are refined by the refinement network. No data augmentation is used throughout training, and we keep the number of training epochs at 50 unless otherwise noted. For all other networks, we use the public implementation of OpenPCDet [144] and follow its training protocol unless otherwise noted.

5.4.5 Results

Tab. 5.3 reports object detection results on the KITTI dataset. The reference architectures use LiDAR only for 3D object detection, and the improvements with dense depth are shown with PointRCNN2 as the backbone architecture. Filtering floating pixels with the sigma filter (Sec. 5.3.4.1) improves performance significantly compared with the baseline (D). For the D and SF rows in this table, we first grid sample the dense depth as in Fig. 5.7, and then the semantic mask generated by [8] is used to filter background points; some background context is retained by gathering depth pixels within r m of the semantic depth pixels. Finally, we find the dense depth estimate gives the best results when it is augmented with raw LiDAR (L and D and SF), filtered by the semantic mask and sigma filter.
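The D and SF pipeline just described can be sketched as grid sampling followed by semantic masking and a small 3D radius gather. The version below is a simplified illustration: the helper project_to_image in the usage comment is hypothetical, and SciPy's KD-tree stands in for whatever neighbor search the actual implementation uses.

import numpy as np
from scipy.spatial import cKDTree

def grid_sample(points, cell=0.2):
    """Keep one representative point per cell x cell x cell 3D grid cell."""
    keys = np.floor(points / cell).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

def semantic_filter(points, pixel_uv, sem_mask, radius=0.3):
    """Keep points whose image pixel is labeled as an object, plus 3D context.

    pixel_uv : (N, 2) integer (row, col) image coordinates of each 3D point.
    sem_mask : (H, W) boolean mask of relevant object pixels (e.g. vehicles)
               predicted by an external segmentation network.
    Points within `radius` meters of any object point are also kept so that
    some background context around the object survives the filtering.
    """
    on_object = sem_mask[pixel_uv[:, 0], pixel_uv[:, 1]]
    obj_pts = points[on_object]
    if obj_pts.shape[0] == 0:
        return obj_pts
    tree = cKDTree(points)
    neighbor_idx = tree.query_ball_point(obj_pts, r=radius)
    keep = np.unique(np.concatenate([np.asarray(i, dtype=int) for i in neighbor_idx]))
    return points[keep]

# Illustrative usage for the "D and SF" input (project_to_image is hypothetical):
# pts = grid_sample(filtered_depth_points, cell=0.2)
# pts = semantic_filter(pts, project_to_image(pts), vehicle_mask, radius=0.3)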
Archi. | Res. | 3D AP Easy | 3D AP Medium | 3D AP Hard | BEV AP Easy | BEV AP Medium | BEV AP Hard
PointRCNN | L | 88.4 | 76.5 | 74.4 | 87.3 | 85.3 | 83.3
PointPillars | L | 86.0 | 76.6 | 70.1 | 89.8 | 86.7 | 83.1
Second | L | 88.8 | 78.4 | 76.9 | 90.3 | 87.7 | 79.7
PVRCNN | L | 89.2 | 79.1 | 78.4 | 90.2 | 87.8 | 87.3
VoxelRCNN | L | 89.6 | 79.4 | 78.7 | 90.4 | 88.2 | 87.8
PointRCNN2 | L | 90.0 | 78.7 | 76.7 | 93.1 | 87.1 | 85.0
PointRCNN2 | D | 82.0 | 59.8 | 52.8 | 89.4 | 68.3 | 61.0
PointRCNN2 | D and F | 86.6 | 69.9 | 60.8 | 92.8 | 78.3 | 69.0
PointRCNN2 | D and SF | 88.9 | 75.3 | 70.3 | 95.3 | 84.3 | 79.4
PointRCNN2 | L and D and SF | 92.5 | 79.5 | 73.2 | 96.3 | 88.5 | 82.3

Table 5.3: 3D object detection (Car) with augmented LiDAR. L refers to 64R LiDAR, D refers to dense depth with TWISE, F refers to the sigma filter (Sec. 5.3.4.1), SF refers to the filter combined with the semantic mask, and L and D and SF (augmented LiDAR) refers to LiDAR augmented with the dense depth estimate filtered by the semantic mask and sigma filter.

We evaluate dense depth estimates on different 3D architectures (point-based and voxel-based) and find that in all of them the dense depth estimate reduces detection performance, most significantly in the moderate and hard cases (see Tab. 5.4). This indicates that the depth estimate is noisier under moderate to heavy occlusion. Although point-based methods are more heavily affected by noise and outliers, sigma filtering recovers their performance in the medium and hard cases, outperforming the voxel-based methods.

To analyze 3D detection performance on KITTI (a real-world dataset) with noisy depth estimates, Fig. 5.8 shows bird's-eye views of predictions and ground truth for different configurations of depth points. Bird's-eye views are chosen because they reveal the false positives that hurt detection performance. We ablate the effect of several remedial measures that mitigate depth noise on 3D object detection performance.

Catgry | Archi. | Res. | 3D AP Easy | 3D AP Medium | 3D AP Hard | BEV AP Easy | BEV AP Medium | BEV AP Hard
Point-Based | PointRCNN | D | 74.4 | 51.9 | 43.5 | 80.2 | 61.2 | 52.3
Point-Based | PointPillars | D | 74.7 | 57.5 | 54.4 | 83.4 | 71.4 | 68.1
Point-Based | PointRCNN2 | D | 82.0 | 59.8 | 52.8 | 89.4 | 68.3 | 61.0
Point-Based | PointRCNN2 | D and F | 86.6 | 69.9 | 60.8 | 92.8 | 78.3 | 69.0
Voxel-Based | Second | D | 71.2 | 55.7 | 53.2 | 79.5 | 67.5 | 65.8
Voxel-Based | Voxel-RCNN | D | 72.8 | 58.4 | 56.7 | 84.0 | 68.9 | 67.7
Voxel-Based | Voxel-RCNN | D and F | 79.3 | 61.6 | 58.7 | 87.8 | 72.8 | 70.3
Mix | PVRCNN | D | 73.5 | 57.8 | 54.1 | 83.0 | 68.5 | 66.3

Table 5.4: Performance evaluation of different architectures on a validation set of KITTI with dense depth from TWISE. D refers to dense depth with TWISE, F refers to the sigma filter (Sec. 5.3.4.1). The filter designed from the TWISE output improves detection performance significantly by pruning floating and ambiguous depth pixels.

In Tab. 5.5, we show how local region proposals in the form of estimated semantic maps help improve object detection by reducing background clutter and bypassing architecture bottlenecks. We also try sampling the dense depth in azimuth and elevation space, similar to [138, 140], to create a pseudo-LiDAR representation. However, the key is to select relevant object pixels as input to the network. To remove redundancy and increase the share of relevant object pixels, we apply grid sampling and the semantic mask to the dense depth, and floating depth pixels are further reduced by the sigma filter. The results show that all of these steps are crucial to improving 3D object detection: they reduce background clutter and floating depth noise and bypass architecture limitations. To retain some background context, pixels within r = 0.3m of the relevant object pixels are also gathered.

In Tab. 5.6, we ablate different ways of gathering background context by collecting neighboring pixels around the relevant object pixels. Because depth pixels are unevenly distributed between close and far-away regions, we use two different radii for gathering neighboring pixels.
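A minimal sketch of this distance-dependent gathering is shown below, using the radii of the last rows of Tab. 5.6 (0.3m below 40m, 1.5m at or beyond 40m); the KD-tree-based neighbor search is an assumption for illustration, not the exact implementation.

import numpy as np
from scipy.spatial import cKDTree

def gather_context(all_points, seed_points, r_near=0.3, r_far=1.5, d_split=40.0):
    """Gather background context around seed (object) points with a
    distance-dependent radius.

    Seeds closer than d_split meters use the small radius r_near; seeds at
    d_split meters or beyond use the larger radius r_far, compensating for
    the sparser sampling of far-away regions.
    """
    tree = cKDTree(all_points)
    depth = seed_points[:, 2]                       # forward (Z) distance
    keep = set()
    for seeds, radius in ((seed_points[depth < d_split], r_near),
                          (seed_points[depth >= d_split], r_far)):
        if seeds.shape[0] == 0:
            continue
        for idx_list in tree.query_ball_point(seeds, r=radius):
            keep.update(idx_list)
    return all_points[np.fromiter(keep, dtype=int)]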
The conclusion is that raw LiDAR augmented with the dense depth estimate improves object detection accuracy considerably, although in the hard cases raw LiDAR alone still performs best.

Figure 5.8: Object detection performance comparison in bird's-eye view with different depth inputs. (a) shows the color image, with BEV detections on (b) sparse depth, (c) dense depth, (d) depth filtered with the sigma filter (Sec. 5.3.4.1), and (e) augmented LiDAR with variable radius. (f)-(j) show another example of a scene with the color image and bird's-eye views of the same methods. In both examples, augmented LiDAR performs best compared to all other methods.

Archi. | Input | Sampling Region | 3D AP Easy | 3D AP Med. | 3D AP Hard | BEV AP Easy | BEV AP Med. | BEV AP Hard
PointRCNN2 | Lidar | N/A | 89.0 | 78.69 | 76.68 | 93.08 | 87.12 | 85.0
MultiStack + PointRCNN2 | Filtered Depth | N/A | 86.6 | 69.9 | 60.82 | 92.77 | 78.34 | 68.97
MultiStack + PointRCNN2 | 64R Filtered Depth | azimuth and elevation space | 87.85 | 73.59 | 69.14 | 92.27 | 82.44 | 79.49
MultiStack + PointRCNN2 | 128R Filtered Depth | azimuth and elevation space | 88.91 | 74.45 | 67.41 | 92.95 | 83.11 | 75.90
MultiStack + PointRCNN2 | Sem Filtered Depth | Grid Sampling 0.1m | 88.91 | 75.10 | 70.26 | 95.30 | 84.3 | 79.44

Table 5.5: 3D object detection results with defined sampling regions to reduce background clutter and bypass architecture limitations. Filtered Depth refers to depth filtered with the TWISE sigma filter (Sec. 5.3.4.1); 64R and 128R are depths sampled in azimuth and elevation space to simulate a LiDAR; Sem Filtered Depth refers to dense depth filtered in the image plane, based on vehicle pixels, using the estimated semantic mask of [8]. The dense depth pixels are reduced further by grid sampling in 3D at a grid spacing of 0.1m.

Does high-resolution depth really help object detection? We set up VKITTI experiments to determine whether a high-resolution 3D pointcloud helps object detection. In Table 5.7, we classify pixels as All (foreground and background pixels) or Semantic (foreground pixels selected using ground-truth semantic masks). We use only noise-free depth points in this setting to isolate the effect of resolution on 3D object detection, and we use two types of architectures (point-based and voxel-based). For PointRCNN2, we see that when 'All' pixels are considered, 3D detection performance peaks at 256R resolution but drops at higher resolutions (512R, AllDep). This indicates that at higher resolution the FPS component of the architecture selects more non-relevant pixels and hurts performance. If relevant object pixels ('Semantic') are selected as input to the network, performance improves significantly at all resolutions. The plateau at 'AllDep' could be due to memory constraints or the architecture bottleneck of the PointNet structure.
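This behavior follows from how farthest point sampling allocates its fixed budget: it spreads samples uniformly in space, so at high input resolution large background structures absorb most of the budget. The sketch below is a plain NumPy version of FPS for illustration, not the optimized CUDA implementation used inside the detectors.

import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy farthest point sampling over an (N, 3) point cloud.

    At every step the point farthest from the current sample set is added,
    so samples cover the scene uniformly in space.  With a fixed budget
    (e.g. 4096 points), dense background regions such as road and walls
    claim most of the samples, leaving few on small foreground objects.
    Assumes n_samples <= points.shape[0].
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)
    selected[0] = 0                                 # arbitrary seed point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(dist))
    return points[selected]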
For VoxelRCNN, the findings are interestingly different. Performance steadily increases with resolution for 'All' pixels, with 'AllDep' giving the best performance. Since no FPS is present in its pipeline, relevant object pixels are not thrown away in favor of background pixels. Providing target-relevant pixels improves performance only slightly, indicating that selecting the relevant pixels a priori has no major impact in this kind of architecture.

Train Strat. | Archi. | Input Dep | Sampling Region | 3D AP Easy | 3D AP Med. | 3D AP Hard | BEV AP Easy | BEV AP Med. | BEV AP Hard
1-stage | PointNet | Lidar | N/A | 89.0 | 78.69 | 76.68 | 93.08 | 87.12 | 85.0
1-stage | PointNet | Sem. Lidar | N/A | 90.53 | 79.33 | 77.02 | 95.1 | 88.2 | 86.22
2-stage | PointNet | MultiStack & Sem Filtered | Grid Sampling, r = 0.3m around sampled pts | 88.91 | 75.10 | 70.26 | 95.30 | 84.30 | 79.44
2-stage | PointNet | MultiStack & Sem Filtered | Grid Sampling, r = 1m around sampled pts | 89.28 | 76.88 | 71.56 | 93.2 | 83.82 | 80.53
2-stage | PointNet | MultiStack & Sem Filtered | Grid Sampling, r = 0.3m around sampled pts < 40m, r = 1.5m around sampled pts >= 40m | 92.34 | 78.08 | 73.14 | 95.95 | 84.85 | 81.93
2-stage | PointNet | MultiStack & Sem Filtered & Sem-Mask | Grid Sampling, r = 0.3m around sampled pts < 40m, r = 1.5m around sampled pts >= 40m | 92.51 | 79.50 | 73.20 | 96.25 | 88.5 | 82.25

Table 5.6: 3D object detection results with an a priori sampling region and different radius configurations used to gather background pixels around the relevant object pixels. Sem Filtered refers to the sigma filter and semantic map used to filter out background and floating depth points. Sem-Mask in the last row refers to binary semantic information that is concatenated with the input pointcloud as additional information for the detection network.

Pixel Type | Resol. | 3D BBox Easy | 3D BBox Med. | 3D BBox Hard | BEV Easy | BEV Med. | BEV Hard
All | 64R | 81.4/80.5 | 80.1/71.0 | 71.8/69.0 | 89.7/90.1 | 81.6/80.9 | 72.6/71.7
All | 128R | 85.5/80.8 | 79.7/75.0 | 71.8/70.2 | 90.6/90.2 | 83.4/81.0 | 75.7/75.9
All | 256R | 89.9/81.3 | 81.4/80.0 | 79.7/70.8 | 90.8/90.4 | 81.6/81.2 | 81.3/80.01
All | 512R | 89.9/81.2 | 80.7/80.4 | 72.0/71.4 | 90.6/90.4 | 81.4/81.5 | 78.3/80.2
All | AllDep | 81.7/89.7 | 81.4/80.7 | 72.4/79.7 | 90.6/90.4 | 81.7/89.8 | 72.6/81.5
Semantic | 64R | 88.1/- | 82.0/- | 74.1/- | 91.5/- | 88.3/- | 80.7/-
Semantic | 128R | 88.8/- | 85.4/- | 77.9/- | 91.7/- | 88.8/- | 81.3/-
Semantic | 256R | 89.9/- | 88.3/- | 80.7/- | 92.0/- | 89.3/- | 84.0/-
Semantic | 512R | 90.1/- | 88.5/- | 80.9/- | 92.1/- | 89.5/- | 84.1/-
Semantic | AllDep | 90.2/89.9 | 88.5/80.5 | 81.0/79.3 | 92.1/90.5 | 89.5/89.3 | 84.2/80.0

Table 5.7: Average precision (%) for 3D detection and pose estimation of cars on VirtualKITTI using PointRCNN2 [5] / VoxelRCNN [9] (each cell lists PointRCNN2/VoxelRCNN). AllDep refers to depth pixels grid sampled at 0.1m, while Semantic refers to pixels selected using GT semantic masks. Note that pixels within an r = 0.3m radius of the semantic pixels are also taken into consideration.

In Tab. 5.8, we show the effect of noise on the object detection performance of PointRCNN2 by simulating different kinds of noise. Gaussian Blur refers to Gaussian smoothing across depth discontinuities, which simulates floating points along object boundaries; Depth Misalignment refers to depth noise along the optical ray, which we simulate by adding Gaussian noise that grows linearly with distance to the valid pixels. The results substantiate the claim that a noisy pointcloud decreases object detection performance significantly.
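For reference, the two noise models can be simulated roughly as below; the blur width and the noise slope are illustrative values, not the exact settings used for Tab. 5.8. Since VKITTI GT depth is dense, no invalid-pixel handling is needed for the blur.

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_blur_noise(depth, sigma_px=2.0):
    """Smooth the depth map spatially so that values near depth
    discontinuities mix, producing floating points along boundaries."""
    return gaussian_filter(depth, sigma=sigma_px)

def depth_misalignment_noise(depth, slope=0.01, rng=None):
    """Perturb each valid depth along the optical ray with Gaussian noise
    whose standard deviation grows linearly with distance."""
    rng = np.random.default_rng() if rng is None else rng
    valid = depth > 0
    noisy = depth.copy()
    noisy[valid] += rng.normal(0.0, slope * depth[valid])
    return noisy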
Noise | 3D BBox Easy | 3D BBox Med. | 3D BBox Hard | BEV Easy | BEV Med. | BEV Hard
64R + Noise-free | 81.4 | 80.1 | 71.8 | 89.7 | 81.6 | 72.6
64R + Gaussian Blur | 67.4 | 54.4 | 47.1 | 73.4 | 62.4 | 54.8
64R + Depth Misalign. | 70.9 | 56.3 | 47.4 | 77.4 | 63.6 | 55.9

Table 5.8: 3D detection and pose estimation of cars on VirtualKITTI [100] under different types of depth noise. Gaussian Blur smooths depth along boundaries and simulates smeared depth; Depth Misalign. simulates depth error along the optical ray.

Target Pix. | 3D BBox Easy | 3D BBox Med. | 3D BBox Hard | BEV Easy | BEV Med. | BEV Hard
All | 81.4 | 80.1 | 71.8 | 89.7 | 81.6 | 72.6
2D Bounding Box | 82.0 | 79.1 | 71.4 | 84.8 | 82.1 | 76.9
Semantic | 88.1 | 82.0 | 74.1 | 91.5 | 88.3 | 80.7

Table 5.9: Average precision (%) for 3D detection and pose estimation of cars on VirtualKITTI [100] using PointRCNN2. 2D Bounding Box refers to the GT bounding boxes within which depth pixels are sampled, while Semantic refers to GT semantic masks. Note that pixels within an r = 0.3m radius of the semantic pixels are also taken into consideration.

In Tab. 5.9, we also experiment with finding an effective sampling region for selecting relevant object pixels; semantic regions, compared to 2D bounding boxes, are the most effective at improving object detection performance in point-based backbone networks.

5.5 Conclusion

Existing 3D architectures for object detection are sensitive to noise in pointclouds generated from depth estimates in the image plane. Learning accurate depth in the image plane is non-trivial, since real-life groundtruth data is often noisy. We design a sigma filter from TWISE that removes most of the noise-inbet pixels, and we find that estimated semantic masks from an external CNN can suppress noise-bg pixels significantly. These filters prune noise from the dense depth estimate quite effectively. We conclude that raw LiDAR, once augmented with the pruned dense depth estimate, can improve object detection by gathering important details and context around relevant object pixels. In future work, we plan to improve detection accuracy on foreground objects by designing a 3D pointcloud loss constraint and by introducing 3D priors, i.e. shape and geometry knowledge of the scene.

Chapter 6
Conclusion and Future Works

6.1 Conclusions

Depth completion aims to recover dense depth maps from sparse depth measurements, guided by a high-resolution color image. We introduce two novel representations of depth, the multi-channel and dual-channel representations, which, when incorporated effectively into a learning framework, improve depth completion performance significantly by recovering depth discontinuities. We also study an application of high-resolution depth to the 3D object detection problem and show some potential advantages and pitfalls of using high-resolution depth estimates with SoTA 3D object detection methods. We conclude the thesis with our contributions and possible avenues of future work in depth completion.

6.1.1 Contributions

In this section we list several contributions of the thesis.

6.1.1.1 Multi-Channel and Dual-Channel Representation

We propose two novel representations of depth, the multi-channel and dual-channel representations, that can model ambiguity at object boundaries. The multi-channel representation expresses depth as a discrete probability distribution, which can model ambiguity with multiple peaks. By using a finite number of coefficients for depth reconstruction, it can prevent floating pixels around boundaries completely. However, it needs many channels to cover a wide range of depths while preserving depth precision, resulting in high memory and computational requirements. We then introduce a dual representation of depth, called twin surface, that learns a foreground and a background depth for each pixel. Ideally, the foreground and background depths differ only around object boundaries, where there is ambiguity, and coincide on object surfaces. Depth can then be reconstructed by selecting either the foreground or the background depth at object boundaries and selecting or mixing the two within object surfaces. Although some mixed depth pixels can still appear due to imperfect learning, this representation drastically reduces memory requirements compared to depth coefficients. Both representations can be learned with standard neural network architectures.
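As a minimal sketch of the reconstruction step, the fused depth can be written as a σ-weighted combination of the two channels; here we assume σ weights the foreground channel, which is a simplification of the actual selection/blending scheme.

import numpy as np

def reconstruct_depth(d_fg, d_bg, sigma):
    """Fuse the twin-surface channels into a single depth map.

    Assumes sigma in [0, 1] weights the foreground channel: sigma close to 1
    selects the foreground depth, sigma close to 0 selects the background
    depth, and intermediate values blend the two on smooth surfaces where
    both channels agree.
    """
    return sigma * d_fg + (1.0 - sigma) * d_bg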
6.1.1.2 Loss Functions

We propose effective loss functions to learn the multi-channel and dual-channel representations. We show that the cross-entropy loss is an effective measure for learning the multi-channel depth (DC), which is itself a discrete probability distribution over depth; to learn a DC estimate with the cross-entropy loss, the groundtruth depth also needs to be converted to groundtruth DC. To learn the dual-channel representation, we design two asymmetric loss functions, ALE and RALE, which act as biased estimators of the foreground (FG) and background (BG) depths respectively. Additionally, we learn a sigma channel that weights the combination of the FG and BG depths. This strategy makes it possible to select either the FG or the BG depth in ambiguous regions, so the final depth map still preserves depth discontinuities there.

6.1.1.3 Noisy GT and Metrics

RMSE is the standard and most common metric for evaluating depth completion methods, since it weights errors at far-away depth pixels more heavily than those at close-by pixels. With clean GT depth, it is an effective mechanism to evaluate depth pixels at all ranges. However, real-world datasets have noisy GT because of the way the GT is created. For example, accumulating LiDAR points over multiple frames causes depth noise on moving objects, and inconsistency with stereo depth causes several long-range depth pixels to be omitted from the ground truth. For indoor scenes, colorization techniques create smeared points in the NYU2 groundtruth. Our analysis shows that MAE is the metric least susceptible to GT outliers. We also propose two evaluation measures, the TMAE and TRMSE metrics, which reward better solutions at boundaries and within object surfaces by reducing the preference for outlier points in the GT and mixed depth pixel errors in the estimated depths, respectively.

6.1.1.4 3D Object Detection from Noisy Depth

We apply depth estimates from depth completion to a perception problem, 3D object detection, and study how high-resolution noisy depth, extended to a pointcloud, affects detection performance. Our analysis reveals several depth noise sources and architecture limitations in SoTA methods that are potential factors in degraded detection performance. We propose a sigma filter and some additional remedies to handle noise common in depth estimates, and conclude that depth estimates can be leveraged most effectively when they augment the sparse LiDAR by gathering dense depth points around local regions.
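As an illustration of the idea behind these measures, the sketch below implements truncated variants of MAE and RMSE that cap each pixel's contribution at a threshold; this is a simplified reading of TMAE/TRMSE with an arbitrarily chosen threshold, not their exact definitions.

import numpy as np

def tmae(pred, gt, t=2.0, valid=None):
    """Truncated MAE: per-pixel absolute errors are capped at t meters so a
    few outlier pixels (mixed depths, noisy GT) cannot dominate the score."""
    valid = gt > 0 if valid is None else valid
    err = np.abs(pred[valid] - gt[valid])
    return np.minimum(err, t).mean()

def trmse(pred, gt, t=2.0, valid=None):
    """Truncated RMSE: squared errors are capped at t**2 before averaging."""
    valid = gt > 0 if valid is None else valid
    err = pred[valid] - gt[valid]
    return np.sqrt(np.minimum(err ** 2, t ** 2).mean())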
6.1.2 Future Improvements

Even though our proposed representations and learning framework significantly reduce smearing at object boundaries, our depth completion methods have some limitations. Mixed depth pixels and depth noise remain a concern for long-range depth, where GT supervision is sparse, and for heavily occluded regions that are not visible to the camera. Real-world GT such as KITTI and NYU2 also contains enough outlier noise to result in noisy supervision. Moreover, since supervision is done in the image plane, perspective distortion means that far fewer pixels represent long-range depth than close-range depth. Creating clean high-resolution depth is paramount for quality depth completion and depth estimation.

There are a number of avenues that can be explored for improving depth completion. Weaker labels such as 3D bounding boxes, pointcloud supervision via Chamfer distance, or 3D geometric constraints could provide additional supervision to tackle errors at long range and in highly occluded regions of the image plane. We also observe that SoTA architectures do not necessarily support high-resolution dense depth because of the large number of depth points involved, so our work could be extended to designing network architectures more conducive to dense depth. Finally, there are opportunities to develop applications of dense depth estimates in perception, navigation, and high-definition map generation in order to exploit the true potential of high-resolution depth.

BIBLIOGRAPHY

[1] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012.

[2] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum PointNets for 3D object detection from RGB-D data,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018.

[3] F. Ma, G. V. Cavalheiro, and S. Karaman, “Self-supervised sparse-to-dense: Self-supervised depth completion from Lidar and monocular camera,” arXiv preprint arXiv:1807.00275, 2018.

[4] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Proc. European Conf. Computer Vision (ECCV), 2012.

[5] T. Huang, Z. Liu, X. Chen, and X. Bai, “Epnet: Enhancing point features with image semantics for 3d object detection,” in European Conference on Computer Vision. Springer, 2020, pp. 35–52.

[6] A. Li, Z. Yuan, Y. Ling, W. Chi, S. Zhang, and C. Zhang, “A multi-scale guided cascade hourglass network for depth completion,” in IEEE Workshop Application Computer Vision (WACV), March 2020.

[7] S. Imran, X. Liu, and D. Morris, “Depth completion with twin surface extrapolation at occlusion boundaries,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2583–2592.

[8] Y. Zhu, K. Sapra, F. A. Reda, K. J. Shih, S. Newsam, A. Tao, and B. Catanzaro, “Improving semantic segmentation via video propagation and label relaxation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8856–8865.

[9] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li, “Voxel r-cnn: Towards high performance voxel-based 3d object detection,” arXiv preprint arXiv:2012.15712, 2020.

[10] F. Ma, G. V. Cavalheiro, and S.
Karaman, “Self-supervised sparse-to-dense: self-supervised depth completion from lidar and monocular camera,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3288–3295. [11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770– 778. 102 [12] S. Imran, Y. Long, X. Liu, and D. Morris, “Depth coefficients for depth completion,” arXiv preprint arXiv:1903.05421, 2019. [13] J. Park, K. Joo, Z. Hu, C.-K. Liu, and I. S. Kweon, “Non-local spatial propagation network for depth completion,” in Proc. European Conf. Computer Vision (ECCV), 2020. [14] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for au- tonomous driving,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2017, p. 3. [15] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858. [16] B. Fortin, R. Lherbier, and J. Noyer, “A model-based joint detection and tracking approach for multi-vehicle tracking with Lidar sensor,” IEEE Transactions on Intelligent Transporta- tion Systems, vol. 16, no. 4, pp. 1883–1895, Aug 2015. [17] Y. Cui, S. Schuon, S. Thrun, D. Stricker, and C. Theobalt, “Algorithms for 3D shape scan- ning with a depth camera,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1039–1050, 2013. [18] T. Hassner and R. Basri, “Example based 3d reconstruction from single 2d images,” in 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06). IEEE, 2006, pp. 15–15. [19] A. Tuan Tran, T. Hassner, I. Masi, and G. Medioni, “Regressing robust and discriminative 3d morphable models with a very deep neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5163–5172. [20] A. Tonioni, F. Tosi, M. Poggi, S. Mattoccia, and L. D. Stefano, “Real-time self-adaptive deep stereo,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 195–204. [21] J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5410–5418. [22] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 66–75. [23] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 270–279. 103 [24] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep Ordinal Regression Net- work for Monocular Depth Estimation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2002–2011. [25] J. Hu, M. Ozay, Y. Zhang, and T. Okatani, “Revisiting single image depth estimation: To- ward higher resolution maps with accurate object boundaries,” in 2019 IEEE Winter Con- ference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 1043–1051. [26] Y. Kuznietsov, J. Stuckler, and B. 
Leibe, “Semi-supervised deep learning for monocular depth map prediction,” in Proceedings of the IEEE conference on computer vision and pat- tern recognition, 2017, pp. 6647–6655. [27] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018. [28] Z. Cheng, Y. Zhang, and C. Tang, “Swin-depth: Using transformers and multi-scale fusion for monocular-based depth estimation,” IEEE Sensors Journal, vol. 21, no. 23, pp. 26 912– 26 920, 2021. [29] J. Watson, M. Firman, G. J. Brostow, and D. Turmukhambetov, “Self-supervised monoc- ular depth hints,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2162–2171. [30] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3828–3838. [31] K. Wang and S. Shen, “Mvdepthnet: Real-time multiview depth estimation neural network,” in 2018 International conference on 3d vision (3DV). IEEE, 2018, pp. 248–257. [32] X. Long, L. Liu, W. Li, C. Theobalt, and W. Wang, “Multi-view depth estimation using epipolar spatio-temporal networks,” in Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2021, pp. 8258–8267. [33] R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised learning of depth and ego- motion from monocular video using 3d geometric constraints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5667–5675. [34] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in Int. Conf. 3D Vision (3DV), 2017. [35] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, “High quality depth map upsampling for 3D-TOF cameras,” in Proc. Int. Conf. Computer Vision (ICCV), Nov 2011, pp. 1623– 1630. 104 [36] M. Jaritz, R. De Charette, E. Wirbel, X. Perrotton, and F. Nashashibi, “Sparse and dense data with CNNs: Depth completion and semantic segmentation,” in Int. Conf. 3D Vision (3DV), 2018. [37] Y. Long, D. Morris, X. Liu, M. Castro, P. Chakravarty, and P. Narayanan, “Radar-camera pixel depth association for depth completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 507–12 516. [38] J. Diebel and S. Thrun, “An application of Markov Random Fields to range sensing,” in Advances in Neural Information Processing Systems (NIPS), Y. Weiss, B. Schölkopf, and J. C. Platt, Eds. MIT Press, 2006, pp. 291–298. [Online]. Available: http: //papers.nips.cc/paper/2837-an-application-of-markov-random-fields-to-range-sensing.pdf [39] Q. Yang, R. Yang, J. Davis, and D. Nister, “Spatial-depth super resolution for range images,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), June 2007, pp. 1–8. [40] D. Scharstein and R. Szeliski, “High-accuracy stereo depth maps using structured light,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), vol. 1, June 2003. [41] H. Hirschmuller and D. Scharstein, “Evaluation of cost functions for stereo matching,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007, pp. 1–8. [42] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576. [43] A. Dai, A. X. Chang, M. 
Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proc. Computer Vision and Pat- tern Recognition (CVPR), IEEE, 2017. [44] G. Riegler, M. Rüther, and H. Bischof, “ATGV-Net: Accurate depth super-resolution,” in Proc. European Conf. Computer Vision (ECCV). Springer, 2016, pp. 268–284. [45] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374. [46] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2650–2658. [47] F. Ma and S. Karaman, “Sparse-to-dense: Depth prediction from sparse depth samples and a single image,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8. [48] Z. Chen, V. Badrinarayanan, G. Drozdov, and A. Rabinovich, “Estimating depth from RGB and sparse sensing,” in Proc. European Conf. Computer Vision (ECCV), 2018. 105 [49] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in Int. Conf. 3D Vision (3DV). IEEE, 2017, pp. 11–20. [50] Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu, “Parse geometry from a line: Monocular depth estimation with partial laser observation,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA). IEEE, 2017, pp. 5059–5066. [51] X. Cheng, P. Wang, and R. Yang, “Depth estimation via affinity learned with convolutional spatial propagation network,” in Proc. European Conf. Computer Vision (ECCV), 2018. [52] J. Huang, A. B. Lee, and D. B. Mumford, “Statistics of range images.” Institute of Electrical and Electronics Engineers, 2000. [53] S. Kalkan, F. Wörgötter, and N. Krüger, “Statistical analysis of local 3D structure in 2D images,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR). Citeseer, 2006, pp. 1114–1121. [54] Y. Zhang and T. Funkhouser, “Deep depth completion of a single RGB-D image.” [55] P. Hu, J. Ziglar, D. Held, and D. Ramanan, “What you see is what you get: Exploiting visibility for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 001–11 009. [56] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in Proc. European Conf. Computer Vision (ECCV). Springer, 2008, pp. 44–57. [57] N. J. Mitra and A. Nguyen, “Estimating surface normals in noisy point cloud data,” in Proceedings of the nineteenth annual symposium on Computational geometry. ACM, 2003, pp. 322–328. [58] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3D object detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018. [59] G. Riegler, A. Osman Ulusoy, and A. Geiger, “Octnet: Learning deep 3d representations at high resolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3577–3586. [60] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks: Efficient con- volutional architectures for high-resolution 3d outputs,” in Proceedings of the IEEE Inter- national Conference on Computer Vision, 2017, pp. 2088–2096. [61] R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, and D. 
Cohen-Or, “Meshcnn: a network with an edge,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019. 106 [62] M. Fey, J. Eric Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast geometric deep learn- ing with continuous b-spline kernels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 869–877. [63] L. Shao, Y. Tian, and J. Bohg, “Clusternet: Instance segmentation in RGB-D images,” arXiv preprint arXiv:1807.08894, 2018. [64] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in Proc. European Conf. Computer Vision (ECCV). Springer, 2014, pp. 345–360. [65] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in In Proceeding of IEEE Computer Vision and Pattern Recognition, 2017. [66] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A persistent memory network for image restoration,” in In Proceeding of International Conference on Computer Vision, 2017. [67] L. Sevilla-Lara and E. Learned-Miller, “Distribution fields for tracking,” in 2012 IEEE Con- ference on computer vision and pattern recognition. IEEE, 2012, pp. 1910–1917. [68] S. Paris and F. Durand, “A fast approximation of the bilateral filter using a signal processing approach,” in European conference on computer vision. Springer, 2006, pp. 568–580. [69] M. Felsberg, P.-E. Forssén, and H. Scharr, “Channel smoothing: Efficient robust smoothing of low-level signal features,” IEEE Transactions on Pattern Analysis and Machine Intelli- gence, vol. 28, no. 2, pp. 209–222, 2005. [70] E. G. Learned-Miller, “Data driven image models through continuous joint alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 2, pp. 236– 250, 2005. [71] G. Riegler, D. Ferstl, M. Rüther, and H. Bischof, “A deep primal-dual network for guided depth super-resolution,” 2016. [72] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof, “Image guided depth up- sampling using anisotropic total generalized variation,” in Proc. Int. Conf. Computer Vision (ICCV), 2013, pp. 993–1000. [73] J. Lu and D. Forsyth, “Sparse depth super resolution,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2245–2253. [74] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object track- ing analysis,” in Proceedings of the IEEE conference on computer vision and pattern recog- nition, 2016, pp. 4340–4349. [75] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” arXiv preprint arXiv:1711.03938, 2017. 107 [76] D. Hernandez-Juarez, L. Schneider, A. Espinosa, D. Vazquez, A. Lopez, U. Franke, M. Pollefeys, and J. C. Moure, “Slanted stixels: Representing san francisco’s steepest streets,” 2017. [77] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Bal- dan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019. [78] A. Patil, S. Malla, H. Gang, and Y.-T. Chen, “The h3d dataset for full-surround 3d multi- object detection and tracking in crowded urban scenes,” in International Conference on Robotics and Automation, 2019. [79] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. 
Shet, “Lyft level 5 av dataset 2019,” urlhttps://level5.lyft.com/dataset/, 2019. [80] S. Ghadai, X. Lee, A. Balu, S. Sarkar, and A. Krishnamurthy, “Multi-resolution 3D convo- lutional neural networks for object recognition,” arXiv preprint arXiv:1805.12254, 2018. [81] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 922–928. [82] J. T. Barron and B. Poole, “The fast bilateral solver,” in Proc. European Conf. Computer Vision (ECCV), 2016. [83] R. Liu, G. Zhong, J. Cao, Z. Lin, S. Shan, and Z. Luo, “Learning to diffuse: A new perspec- tive to design pdes for visual analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 12, pp. 2457–2471, 2016. [84] Y. Yang, A. Wong, and S. Soatto, “Dense depth posterior (ddp) from single image and sparse range,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3353–3362. [85] J. Qiu, Z. Cui, Y. Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys, “Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and sin- gle color image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3313–3322. [86] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia, “Unsupervised learning of geometry from videos with edge-aware depth-normal consistency,” 2018. [Online]. Available: https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16421 [87] Y. Xu, X. Zhu, J. Shi, G. Zhang, H. Bao, and H. Li, “Depth completion from sparse lidar data with depth-normal constraints,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2811–2820. 108 [88] X. Cheng, P. Wang, C. Guan, and R. Yang, “Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion,” in Proc. AAAI Conf. Ar- tificial Intelligence (AAAI), 2020. [89] Z. Xu, H. Yin, and J. Yao, “Deformable spatial propagation networks for depth completion,” in Proc. Int. Conf. Image Processing (ICIP). IEEE, 2020, pp. 913–917. [90] Y. Chen, B. Yang, M. Liang, and R. Urtasun, “Learning joint 2d-3d representations for depth completion,” in Proc. Int. Conf. Computer Vision (ICCV), 2019, pp. 10 023–10 032. [91] R. Xiang, F. Zheng, H. Su, and Z. Zhang, “3ddepthnet: Point cloud guided depth completion network for sparse depth and single color image,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2020. [92] X. Xiong, H. Xiong, K. Xian, C. Zhao, Z. Cao, and X. Li, “Sparse-to-dense depth comple- tion revisited: Sampling strategy and graph construction?.” [93] Y. Tai, J. Yang, and X. Liu, “Image Super-Resolution via Deep Recursive Residual Net- work,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3147–3155. [94] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A Persistent Memory Network for Image Restoration,” in Proc. Int. Conf. Computer Vision (ICCV), 2017, pp. 4539–4547. [95] J. Shade, S. Gortler, L.-w. He, and R. Szeliski, “Layered depth images,” 1998, pp. 231–242. [96] S. Tulsiani, R. Tucker, and N. Snavely, “Layer-structured 3d scene inference via view syn- thesis,” in Proc. European Conf. Computer Vision (ECCV), 2018, pp. 302–317. [97] P. Hedman, S. Alsisan, R. Szeliski, and J. Kopf, “Casual 3D Photography,” vol. 36, no. 6, pp. 234:1–234:15, 2017. [98] T. Vogels, F. Rousselle, B. McWilliams, G. Röthlin, A. Harvill, D. Adler, M. 
Meyer, and J. Novák, “Denoising with kernel prediction and asymmetric loss functions,” ACM Trans- actions on Graphics (TOG), vol. 37, no. 4, pp. 1–15, 2018. [99] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017. [100] Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” 2020. [101] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 770–779. [102] X. Pan, Z. Xia, S. Song, L. E. Li, and G. Huang, “3d object detection with pointformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7463–7472. 109 [103] J. Mao, Y. Xue, M. Niu, H. Bai, J. Feng, X. Liang, H. Xu, and C. Xu, “Voxel transformer for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Com- puter Vision, 2021, pp. 3164–3173. [104] Y. Wang, T. Shi, P. Yun, L. Tai, and M. Liu, “Pointseg: Real-time semantic segmentation based on 3d lidar point cloud,” arXiv preprint arXiv:1807.06288, 2018. [105] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “Rangenet++: Fast and accurate lidar semantic segmentation,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 4213–4220. [106] Y. Zhang, Z. Zhou, P. David, X. Yue, Z. Xi, B. Gong, and H. Foroosh, “Polarnet: An improved grid representation for online lidar point clouds semantic segmentation,” in Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9601–9610. [107] C. Goodin, M. Doude, C. R. Hudson, and D. W. Carruth, “Enabling off-road autonomous navigation-simulation of lidar in dense vegetation,” Electronics, vol. 7, no. 9, p. 154, 2018. [108] T. Ort, I. Gilitschenski, and D. Rus, “Autonomous navigation in inclement weather based on a localizing ground penetrating radar,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3267–3274, 2020. [109] F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr, “Ga-net: Guided aggregation net for end- to-end stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 185–194. [110] S. Zhu, G. Brazil, and X. Liu, “The edge of depth: Explicit constraints between segmen- tation and depth,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13 116–13 125. [111] X. Wang, M. H. Ang Jr, and G. H. Lee, “Cascaded refinement network for point cloud completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 790–799. [112] G. Qian, A. Abualshour, G. Li, A. Thabet, and B. Ghanem, “Pu-gcn: Point cloud upsam- pling using graph convolutional networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 683–11 692. [113] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng, “Pu-net: Point cloud upsampling net- work,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2790–2799. [114] B.-U. Lee, K. Lee, and I. S. Kweon, “Depth completion using plane-residual representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 916–13 925. 110 [115] Z. Yang, Y. Sun, S. Liu, and J. 
Jia, “3dssd: Point-based 3d single stage object detector,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 040–11 048. [116] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware single-stage 3d object detection from point cloud,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 873–11 882. [117] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense 3d object detector for point cloud,” in Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, 2019, pp. 1951–1960. [118] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast en- coders for object detection from point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 697–12 705. [119] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018. [120] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660. [121] B. Li, “3d fully convolutional network for vehicle detection in point cloud,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1513–1518. [122] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding box estimation using deep learning and geometry,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 7074–7082. [123] L. Liu, J. Lu, C. Xu, Q. Tian, and J. Zhou, “Deep fitting degree scoring network for monoc- ular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1057–1066. [124] A. Naiden, V. Paunescu, G. Kim, B. Jeon, and M. Leordeanu, “Shift r-cnn: Deep monocular 3d object detection with closed-form geometric constraints,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 61–65. [125] G. Brazil and X. Liu, “M3d-rpn: Monocular 3d region proposal network for object detec- tion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9287–9296. [126] J. Fang, L. Zhou, and G. Liu, “3d bounding box estimation for autonomous vehicles by cascaded geometric constraints and depurated 2d detections using 3d results,” arXiv preprint arXiv:1909.01867, 2019. 111 [127] Y. Chen, L. Tai, K. Sun, and M. Li, “Monopair: Monocular 3d object detection using pair- wise spatial relationships,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 093–12 102. [128] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele, “Kinematic 3d object detection in monoc- ular video,” in European Conference on Computer Vision. Springer, 2020, pp. 135–152. [129] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang, “Rethinking pseudo-lidar repre- sentation,” in European Conference on Computer Vision. Springer, 2020, pp. 311–327. [130] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019. [131] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monoc- ular 3d object detection?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3142–3152. [132] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. 
Dietmayer, “Deep multi-modal object detection and semantic segmentation for au- tonomous driving: Datasets, methods, and challenges,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 3, pp. 1341–1360, 2020. [133] A. Paigwar, D. Sierra-Gonzalez, Ö. Erkent, and C. Laugier, “Frustum-pointpillars: A multi- stage approach for 3d object detection using rgb camera and lidar,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2926–2933. [134] J. H. Yoo, Y. Kim, J. Kim, and J. W. Choi, “3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection,” in Computer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16. Springer, 2020, pp. 720–736. [135] M. Liang, B. Yang, S. Wang, and R. Urtasun, “Deep continuous fusion for multi-sensor 3d object detection,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 641–656. [136] S. Pang, D. Morris, and H. Radha, “Clocs: Camera-lidar object candidates fusion for 3d object detection,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10 386–10 393. [137] X. Weng and K. Kitani, “Monocular 3d object detection with pseudo-lidar point cloud,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0. [138] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo- lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453. 112 [139] Y. You, Y. Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driv- ing,” arXiv preprint arXiv:1906.06310, 2019. [140] R. Qian, D. Garg, Y. Wang, Y. You, S. Belongie, B. Hariharan, M. Campbell, K. Q. Wein- berger, and W.-L. Chao, “End-to-end pseudo-lidar for image-based 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5881–5890. [141] S. Luo and W. Hu, “Score-based point cloud denoising,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4583–4592. [142] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. Yuille, “Every pixel counts++: Joint learning of geometry and motion with 3d holistic understanding,” arXiv preprint arXiv:1810.06125, 2018. [143] J. Gu, Z. Xiang, Y. Ye, and L. Wang, “Denselidar: A real-time pseudo dense depth guided depth completion network,” IEEE Robotics and Automation Letters, vol. 6, no. 2, p. 1808–1815, Apr 2021. [Online]. Available: http://dx.doi.org/10.1109/LRA.2021.3060396 [144] O. D. Team, “Openpcdet: An open-source toolbox for 3d object detection from point clouds,” https://github.com/open-mmlab/OpenPCDet, 2020. 113