GENERALIZING MONOCULAR 3D OBJECT DETECTION By Abhinav Kumar A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2025 ABSTRACT Monocular 3D object detection (Mono3D) is a fundamental computer vision task that estimates an object’s class, 3D position, dimensions, and orientation from a single image. Its applications, including autonomous driving, augmented reality, and robotics, critically rely on accurate 3D en- vironmental understanding. This thesis addresses the challenge of generalizing Mono3D models to diverse scenarios, including occlusions, datasets, object sizes, and camera parameters. To en- hance occlusion robustness, we propose a mathematically differentiable NMS (GrooMeD-NMS). To improve generalization to new datasets, we explore depth equivariant (DEVIANT) backbones. We address the issue of large object detection, demonstrating that it’s not solely a data imbalance or receptive field problem but also a noise sensitivity issue. To mitigate this, we introduce a segmentation-based approach in bird’s-eye view with dice loss (SeaBird). Finally, we mathemati- cally analyze the extrapolation of Mono3D models to unseen camera heights and improve Mono3D generalization in such out-of-distribution settings. Copyright by ABHINAV KUMAR 2025 Dedicated to my country India. iv ACKNOWLEDGEMENTS My long PhD journey is the result of all my advisors, mentors, collaborators, friends, and family. First and foremost, I express my deepest gratitude to my advisor, Prof. Xiaoming Liu. Prof. Liu took a bet on me at a point when I was lost in the dark - having an intense desire to do the PhD, but bereft of any support. Over the years, his taste, rigor, work-ethics and guidance has instilled in me an awareness of what it takes to do great research. The faith he had on my capabilities, even when I did not have on myself, is what I am grateful to him for. I also thank my PhD committee – Prof. Daniel Morris (MSU), Prof. Georgia Gkioxari (Caltech, FAIR), Prof. Vishnu Boddetti (MSU), and Prof. Yu Kong (MSU) for agreeing to serve in my committee and supporting this journey. I acknowledge Prof. Daniel Morris for a three-year long collaboration on the Radar-Camera project, and sharing all his insights in developing radar-camera 3D detectors. I thank Prof. Georgia Gkioxari, who was also my internship manager at FAIR, Meta AI. Her vision expanded my horizons by giving me a taste of moonshot industry grade research, and what it takes to do one. I thank Prof. Vishnu Bodetti and Prof. Yu Kong for asking thought-provoking questions in this journey. I deeply acknowledge my mentors: Dr. Tim Marks, Dr. Michael Jones, Dr. Anoop Cherian, Dr. Ye Wang, Dr. Toshi Koike-Akino and Prof. Cheng Feng at MERL. They took a bet on me as a first year PhD student when I didn’t have any significant publications. The work done there culminated into my first CVPR paper. The paper opened doors to MSU to continue my PhD. If not for that internship, my aspirations for a PhD would have come to a crashing end five years back. When I joined MSU, Dr. Garrick Brazil took me under his wings for a very daunting area of 3D computer vision, and was almost a second advisor to me at MSU and FAIR. It was due to his strong belief in me that I applied to FAIR internship, which at that time, I believed was beyond my capacity. I acknowledge Dr. Yuliang Guo, Dr. Xinyu Huang and Dr. Liu Ren from Bosch AI Research. 
We had a long collaboration that spanned one internship and two CVPR submissions. Their continued guidance and support enabled us to tackle hard open problems.
Dr. SriGanesh Madhvanath was my manager at Xerox Research, Bangalore. His mentorship gave me an early realization that as much as the calibre of a candidate matters, the environment and support system matter too. His generous endorsement opened doors for doing a PhD in the US. As I grow in my career, I hope to pay it forward. I also thank Vladimir Kozitsky for his outstanding leadership on the LPRv2 project at Xerox Research.
Throughout my PhD journey in the US, I have been told that my Maths and Linear Algebra skills are decent. My Master's advisor, Prof. Animesh Kumar at IIT Bombay, is the stalwart who should get due credit. I stand on his shoulders and am deeply grateful to him for a rigorous foundation in mathematical thinking. I also thank my professors at IIT Patna: Prof. Ayash Kanto Mukherjee, Prof. Kailash Ray, Prof. Lokman Hakim Choudhury, Prof. Nutan Tomar, Prof. Somnath Sarangi, Prof. Sumanta Gupta, and Prof. Yatendra Singh for their rigorous undergrad training. I thank my Physics and Maths +2 teacher Jitendra Bharadwaj for his inspirational and thought-provoking teaching, and my teachers at MKDAV Public School, Daltonganj: Antariksh Roy, Asha Mishra, Ashok Verma, Ganga Agarwal, Kunal Kumar, and Rita Sinha for laying a solid foundation for my higher studies.
Additionally, I thank Vincent Mattison and Brenda Hodge, the program coordinator and secretary in the CSE department at MSU, for helping me with admin issues every single time. This research would not have been possible without the funding from Ford Motor Company and Bosch AI Research. I gratefully acknowledge their financial support.
Prof. Liu's lab gave me an open culture, access to like-minded peers, and exposure to a setup for doing high-quality research. I am thankful to all the amazing people in the lab: Prof. Feng Liu, Dr. Amin Jourabloo, Dr. Xi Yin, Dr. Garrick Brazil, Dr. Yaojie Liu, Shengjie Zhu, Andrew Hou, Vishal Asnani, Masa Hu, Yunfei Long, Xiao Guo, Minchul Kim, Yiyang Su, Jei Zhu and Zhiyuan Ren for reviewing my ideas, critiquing my papers, and open discussions. I also thank the newer members of the group: Girish Ganeshan, Dinqiang Ye, Zhihao Zhong, Zhizhong Huang, Hoang Le and Ziang Gu for sharing this journey with me. I am pretty sure each one of you has done and will continue to do great work in the future.
Next, I thank my friends in East Lansing - Bharat Basti Shenoy, Ankit Gupta, Rahul Dey, Ankit Kumar, Vishal Asnani, Hitesh Gakhar, Sachit Gaudi, Avrajit Ghosh and Ritam Guha - who made East Lansing feel like a second home. I am grateful to my friends Koushik Chattopadhyay, Saurabh Kumar, Ashay Jain, Manas Pratim Haloi, Vidit Singh, and Priyanka Sinha for being my loudest supporters despite staying thousands of kilometers away. All of them have been friends for more than eight years, and three of them for more than fifteen years. These were the people with whom I discussed all my PhD quitting plans. I am also thankful to my parents and my sister, Ayushi Raj, for their love, patience, support and encouragement, and for keeping me sane during this demanding PhD journey.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
  1.1 Thesis Contributions
  1.2 Thesis Organization
CHAPTER 2 GROOMED-NMS: GROUPED MATHEMATICALLY DIFFERENTIABLE NMS FOR MONOCULAR 3D OBJECT DETECTION
  2.1 Introduction
  2.2 Related Works
  2.3 Background
  2.4 GrooMeD-NMS
  2.5 Experiments
  2.6 Conclusions
CHAPTER 3 DEVIANT: DEPTH EQUIVARIANT NETWORK FOR MONOCULAR 3D OBJECT DETECTION
  3.1 Introduction
  3.2 Related Works
  3.3 Background
  3.4 Depth Equivariant Backbone
  3.5 Experiments
  3.6 Conclusions
CHAPTER 4 SEABIRD: SEGMENTATION IN BIRD'S VIEW WITH DICE LOSS IMPROVES MONOCULAR 3D DETECTION OF LARGE OBJECTS
  4.1 Introduction
  4.2 Related Works
  4.3 SeaBird
  4.4 Experiments
  4.5 Conclusions
CHAPTER 5 CHARM3R: TOWARDS CAMERA HEIGHT AGNOSTIC MONOCULAR 3D OBJECT DETECTOR
  5.1 Introduction
  5.2 Related Works
  5.3 Notations and Preliminaries
  5.4 CHARM3R
  5.5 Experiments
  5.6 Conclusions
CHAPTER 6 CONCLUSIONS AND FUTURE RESEARCH
BIBLIOGRAPHY
APPENDIX A PUBLICATIONS
APPENDIX B GROOMED-NMS APPENDIX
APPENDIX C DEVIANT APPENDIX
APPENDIX D SEABIRD APPENDIX
APPENDIX E CHARM3R APPENDIX

CHAPTER 1
INTRODUCTION

Monocular 3D object detection (Mono3D) is a fundamental computer vision problem that estimates an object's 3D position, dimensions, and orientation in a scene from a single image and its camera matrix. Its applications, including autonomous driving [108, 132, 181], robotics [213], and augmented reality [2, 172, 183, 293], critically rely on accurate 3D environmental understanding.
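To make the geometry of the task concrete, the following minimal NumPy sketch (not part of the original text; the KITTI-like intrinsics and box values are illustrative assumptions) shows the forward projection a Mono3D detector must invert: given a 3D center, dimensions, and yaw, the camera matrix maps the box corners to pixels, and the detector must recover those 3D quantities from the pixels alone.

```python
import numpy as np

def project_to_image(points_3d, K):
    """Project Nx3 camera-frame points to pixel coordinates with intrinsics K (3x3)."""
    uvw = points_3d @ K.T                 # (N, 3): [u*z, v*z, z]
    return uvw[:, :2] / uvw[:, 2:3]       # perspective divide -> (N, 2) pixels

def box3d_corners(center, dims, yaw):
    """8 corners of a 3D box given center (x, y, z), dims (h, w, l), and yaw about the y-axis."""
    h, w, l = dims
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])       # KITTI-style: origin at the box bottom
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])      # rotation about the camera y-axis
    return (R @ np.vstack([x, y, z])).T + np.asarray(center)

# Example: a car-sized box roughly 20 m ahead of a KITTI-like camera (assumed values).
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
corners = box3d_corners(center=(1.0, 1.65, 20.0), dims=(1.5, 1.6, 3.9), yaw=0.1)
print(project_to_image(corners, K))      # 8 projected corners in pixels
```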
To address these applications’ demands, Mono3D networks must generalize across occlusions, diverse datasets [108], object sizes [110], camera intrinsics [14], extrinsics [94, 104], rotations [177], weather and geographical conditions [54] and be robust to adversarial examples [310]. Although Mono3D popularity stems from its high accessibility from consumer vehicles com- pared to LiDAR/Radar-based detectors [155, 215, 290] and computational efficiency compared to stereo-based detectors [34], Mono3D methods suffer from classical scale-depth ambiguity making their generalization harder. This is why there are fewer works along the lines of generalizing Mono3D. This thesis aims to generalize Mono3D to these varying conditions. Most Mono3D networks benefit from end-to-end learning idea. However, they train without including NMS in the training pipeline making the final box after NMS outside the training paradigm. While there were attempts to include NMS in the training pipeline for tasks such as 2D object detection, they have been less widely adopted due to a non-mathematical expression of the NMS. We present and integrate GrooMeD-NMS– a novel Grouped Mathematically Differentiable NMS for Mono3D, such that the network is trained end-to-end with a loss on the boxes after NMS. We first formulate NMS as a matrix operation and then group and mask the boxes in an unsupervised manner to obtain a simple closed-form expression of the NMS. GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, GrooMeD-NMS achieves state- of-the-art monocular 3D object detection results on the KITTI benchmark dataset performing comparably to monocular video-based methods, and outperforming them on the hard occluded examples. 1 Generalizing to datasets requires features which are dataset-independent. One common way is to obtain such features is incorporating inductive bias or symmetries in the network. One such symmetry is translating the ego camera along depth should result in deterministic transformations of the feature maps. Modern neural networks use building blocks such as convolutions that are equivariant to arbitrary 2D translations in the Euclidean manifold. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. Even then, all Mono3D networks use vanilla blocks to obtain the 3D coordinates, a task for which the vanilla blocks are not designed for. This paper takes the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. Since the depth is the hardest to estimate for monocular detection, this paper proposes Depth Equivariant Network (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates, and therefore, DEVIANT works better than vanilla networks in cross-dataset evaluation. DEVIANT also achieves state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Mono3D networks achieve remarkable performance on cars and smaller objects. However, their performance drops on larger objects, leading to fatal accidents. 
Large objects like trailers, buses and trucks are harder to detect [268] in Mono3D, sometimes resulting in fatal accidents [23, 60]. Some attribute these failures to training data scarcity [308] or the receptive field requirements [268] of large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to noise of larger objects. To bridge this gap, we comprehensively investigate regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird’s View) as the first step towards generalizing to large 2 objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection, with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI-360 leaderboard and improves existing detectors on the nuScenes leaderboard, particularly for large objects. With all these generalizations, the networks do not generalize well to changing extrinsics or viewpoints in testing. Finally, we aim to extend Mono3D’s capabilities to varying camera extrinsics, such as camera heights. 1.1 Thesis Contributions The thesis focuses on generalizing Mono3D across occlusions, datasets, object sizes, and camera extrinsics. The scale-depth ambiguity in Mono3D task requires elegant handling of the depth error. • This thesis introduces the mathematically differentiable Non-Maximal Suppression, which attempts Mono3D generalization to occluded and hard objects. Most detectors use a post- processing algorithm called Non-Maximal Suppression (NMS) only during inference. While there were attempts to include NMS in the training pipeline for tasks such as 2D object detection, they have been less widely adopted due to a non-mathematical expression of the NMS. In this chapter, we present and integrate GrooMeD-NMS – a novel Grouped Mathematically Differen- tiable NMS for monocular 3D object detection, such that the network is trained end-to-end with a loss on the boxes after NMS. We first formulate NMS as a matrix operation and then group and mask the boxes in an unsupervised manner to obtain a simple closed-form expression of the NMS. GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, GrooMeD-NMS achieves state-of-the-art monocular 3D object detection results on the KITTI dataset. • We next propose the depth equivariant backbone in the projective manifold which attempts generalization to unseen datasets. Modern neural networks use building blocks such as convo- lutions that are equivariant to arbitrary 2D translations in the Euclidean manifold. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. 3 Even then, all monocular 3D detectors use vanilla blocks to obtain the 3D coordinates, a task for which the vanilla blocks are not designed for. This chapter takes the first step towards convolu- tions equivariant to arbitrary 3D translations in the projective manifold. 
Since the depth is the hardest to estimate for monocular detection, this chapter proposes Depth Equivariant Network (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates, and therefore, DEVIANT achieves state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. • We then investigate large object detection, demonstrating that it is not solely a data imbalance or receptive field issue but also a noise sensitivity problem. To generalize Mono3D to large objects, it introduces a segmentation-based approach in bird’s eye view with dice loss (SeaBird). Monocular 3D detectors achieve remarkable performance on cars and smaller objects. However, their performance drops on larger objects, leading to fatal accidents. Some attribute the failures to training data scarcity or their receptive field requirements of large objects. In this chapter, we highlight this understudied problem of generalization to large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to noise of larger objects. To bridge this gap, we comprehensively investigate regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird’s View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection, with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI- 4 360 leaderboard and improves existing detectors on the nuScenes leaderboard, particularly for large objects. • Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plucker embeddings, image transformations or data augmentation. This chapter takes a step towards this understudied problem by investigating the impact of camera height variations on state-of- the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R significantly improves generalization to unseen camera heights, achieving SoTA performance on the CARLA dataset. 1.2 Thesis Organization We organize the remaining chapters of the dissertation as follows. 
Chapter 2 introduces the mathematically differentiable Non-Maximal Suppression, which attempts generalization to occluded and hard objects. Chapter 3 describes the depth equivariant backbone which attempts generalization to unseen datasets. Chapter 4 investigates large object detection, demonstrating that it is not solely a data imbalance or receptive field issue but also a noise sensitivity problem. To improve large object detection, it introduces a segmentation-based approach in bird’s eye view with dice loss (SeaBird). Chapter 5 attempts solving the generalization of Mono3D trained on single camera height to multiple camera heights. Chapter 6 introduces the future research for monocular 3D detection. 5 CHAPTER 2 GROOMED-NMS: GROUPED MATHEMATICALLY DIFFERENTIABLE NMS FOR MONOCULAR 3D OBJECT DETECTION Modern 3D object detectors have immensely benefited from the end-to-end learning idea. However, most of them use a post-processing algorithm called Non-Maximal Suppression (NMS) only during inference. While there were attempts to include NMS in the training pipeline for tasks such as 2D object detection, they have been less widely adopted due to a non-mathematical expression of the NMS. In this chapter, we present and integrate GrooMeD-NMS– a novel Grouped Mathematically Differentiable NMS for monocular 3D object detection, such that the network is trained end-to-end with a loss on the boxes after NMS. We first formulate NMS as a matrix operation and then group and mask the boxes in an unsupervised manner to obtain a simple closed-form expression of the NMS. GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, GrooMeD-NMS achieves state-of-the-art monocular 3D object detection results on the KITTI benchmark dataset performing comparably to monocular video-based methods. 2.1 Introduction 3D object detection is one of the fundamental problems in computer vision, where the task is to infer 3D information of the object. Its applications include augmented reality [2, 201], robotics [120,213], medical surgery [203], and, more recently path planning and scene understand- ing in autonomous driving [33,90,123,220]. Most of the 3D object detectors [33,90,121,123,220] are extensions of the 2D object detector Faster R-CNN [202], which relies on the end-to-end learn- ing idea to achieve State-of-the-Art (SoTA) object detection. Some of these methods have proposed changing architectures [123, 216, 220] or losses [15, 35]. Others have tried incorporating confi- dence [17, 216, 220] or temporal cues [17]. Almost all of them output a massive number of boxes for each object and, thus, rely on post- processing with a greedy [192] clustering algorithm called Non-Maximal Suppression (NMS) during inference to reduce the number of false positives and increase performance. However, 6 Training Inference Inference Training e n o b k c a B B Score 2𝐷 3𝐷 s OIoU Predictions r NMS e n o b k c a B B Score 2𝐷 3𝐷 s OIoU GrooMeD NMS Predictions r L𝑏𝑒 𝑓 𝑜𝑟𝑒 L𝑏𝑒 𝑓 𝑜𝑟𝑒 L𝑏𝑒 𝑓 𝑜𝑟𝑒 L𝑎 𝑓 𝑡𝑒𝑟 (a) Conventional NMS Pipeline (b) GrooMeD-NMS Pipeline s O I M P Sort 𝑝 lower (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) d > 𝑣 s (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) ⌊.⌉ G Group r Forward Backward (c) GrooMeD-NMS layer Figure 2.1 Overview of GrooMeD-NMS. (a) Conventional object detection has a mismatch between training and inference as it uses NMS only in inference. 
(b) To address this, we propose a novel GrooMeD-NMS layer, such that the network is trained end-to-end with NMS applied. s and r denote the score of boxes B before and after the NMS respectively. O denotes the matrix containing IoU2D overlaps of B. L𝑏𝑒 𝑓 𝑜𝑟 𝑒 denotes the losses before the NMS, while L𝑎 𝑓 𝑡𝑒𝑟 denotes the loss after the NMS. (c) GrooMeD-NMS layer calculates r in a differentiable manner giving gradients from L𝑎 𝑓 𝑡𝑒𝑟 when the best-localized box corresponding to an object is not selected after NMS. these works have largely overlooked NMS’s inclusion in training leading to an apparent mismatch between training and inference pipelines as the losses are applied on all boxes before NMS but not on final boxes after NMS (see Fig. 2.1a). We also find that 3D object detection suffers a greater mismatch between classification and 3D localization compared to that of 2D localization, as discussed further in Sec. B.3.2 of the supplementary and observed in [17, 90, 216]. Hence, our focus is 3D object detection. Earlier attempts to include NMS in the training pipeline [80, 81, 192] have been made for 2D object detection where the improvements are less visible. Recent efforts to improve the correlation in 3D object detection involve calculating [220, 222] or predicting [17, 216] the scores via likelihood estimation [111] or enforcing the correlation explicitly [90]. Although this improves the 3D detection performance, improvements are limited as their training pipeline is not end to end 7 in the absence of a differentiable NMS. To address the mismatch between training and inference pipelines as well as the mismatch between classification and 3D localization, we propose including the NMS in the training pipeline, which gives a useful gradient to the network so that it figures out which boxes are the best-localized in 3D and, therefore, should be ranked higher (see Fig. 2.1b). An ideal NMS for inclusion in the training pipeline should be not only differentiable but also parallelizable. Unfortunately, the inference-based classical NMS and Soft-NMS [12] are greedy, set-based and, therefore, not parallelizable [192]. To make the NMS parallelizable, we first formulate the classical NMS as matrix operation and then obtain a closed-form mathematical expression using elementary matrix operations such as matrix multiplication, matrix inversion, and clipping. We then replace the threshold pruning in the classical NMS with its softer version [12] to get useful gradients. These two changes make the NMS GPU-friendly, and the gradients are backpropagated. We next group and mask the boxes in an unsupervised manner, which removes the matrix inversion and simplifies our proposed differentiable NMS expression further. We call this NMS as Grouped Mathematically Differentiable NMS (GrooMeD-NMS). In summary, the main contributions of this work include: • This is the first work to propose and integrate a closed-form mathematically differentiable NMS for object detection, such that the network is trained end-to-end with a loss on the boxes after NMS. • We propose an unsupervised grouping and masking on the boxes to remove the matrix inversion in the closed-form NMS expression. • We achieve SoTA monocular 3D object detection performance on the KITTI dataset performing comparably to monocular video-based methods. 2.2 Related Works 3D Object Detection. Recent success in 2D object detection [69, 70, 139, 200, 202] has inspired people to infer 3D information from a single 2D (monocular) image. 
However, the monocular problem is ill-posed due to the inherent scale/depth ambiguity [232]. Hence, approaches use 8 additional sensors such as LiDAR [90, 215, 269], stereo [122, 254] or radar [178, 242]. Although LiDAR depth estimations are accurate, LiDAR data is sparse [85] and computationally expensive to process [232]. Moreover, LiDAR s are expensive and do not work well in severe weather [232]. Hence, there have been several works on monocular 3D object detection. Earlier approaches [31, 61,186,187] use hand-crafted features, while the recent ones are all based on deep learning. Some of these methods have proposed changing architectures [123, 143, 232] or losses [15, 35]. Others have tried incorporating confidence [17,143,216,220], augmentation [223], depth in convolution [15,52] or temporal cues [17]. Our work proposes to incorporate NMS in the training pipeline of monocular 3D object detection. Non-Maximal Suppression. NMS has been used to reduce false positives in edge detection [206], feature point detection [75, 157, 174], face detection [243], human detection [16, 18, 47] as well as SoTA 2D [69, 139, 200, 202] and 3D detection [5, 17, 33, 216, 220, 232]. Modifications to NMS in 2D detection [12, 49, 80, 81, 192], 2D pedestrian detection [116, 145, 209], 2D salient object detection [298] and 3D detection [216] can be classified into three categories – inference NMS [12, 216], optimization-based NMS [3, 49, 116, 209, 244, 298] and neural network based NMS [78, 80, 81, 145, 192]. The inference NMS [12] changes the way the boxes are pruned in the final set of predic- tions. [216] uses weighted averaging to update the 𝑧-coordinate after NMS. [209] solves quadratic unconstrained binary optimization while [3,116,224] and [298] use point processes and MAP based inference respectively. [49] and [244] formulate NMS as a structured prediction task for isolated and all object instances respectively. The neural network NMS use a multi-layer network and message- passing to approximate NMS [80, 81, 192] or to predict the NMS threshold adaptively [145]. [78] approximates the sub-gradients of the network without modelling NMS via a transitive relationship. Our work proposes a grouped closed-form mathematical approximation of the classical NMS and does not require multiple layers or message-passing. We detail these differences in Sec. 2.4.2. 9 2.3 Background 2.3.1 Notations r = {𝑟𝑖}𝑛 Let B = {𝑏𝑖}𝑛 𝑖=1 denote the set of boxes or proposals 𝑏𝑖 from an image. Let s = {𝑠𝑖}𝑛 𝑖=1 and 𝑖=1 denote their scores (before NMS) and rescores (updated scores after NMS) respectively such that 𝑟𝑖, 𝑠𝑖 ≥ 0 ∀ 𝑖. D denotes the subset of B after the NMS. Let O = [𝑜𝑖 𝑗 ] denote the 𝑛 × 𝑛 matrix with 𝑜𝑖 𝑗 denoting the 2D Intersection over Union (IoU2D) of 𝑏𝑖 and 𝑏 𝑗 . The pruning function 𝑝 decides how to rescore a set of boxes B based on IoU2D overlaps of its neighbors, sometimes suppressing boxes entirely. In other words, 𝑝(𝑜𝑖) = 1 denotes the box 𝑏𝑖 is suppressed while 𝑝(𝑜𝑖) = 0 denotes 𝑏𝑖 is kept in D. The NMS threshold 𝑁𝑡 is the threshold for which two boxes need in order for the non-maximum to be suppressed. The temperature 𝜏 controls the shape of the exponential and sigmoidal pruning functions 𝑝. 𝑣 thresholds the rescores in GrooMeD and Soft-NMS [13] to decide if the box remains valid after NMS. B is partitioned into different groups G = {G𝑘 }. BG𝑘 denotes the subset of B belonging to group 𝑘. Thus, BG𝑘 = {𝑏𝑖} ∀ 𝑏𝑖 ∈ G𝑘 and BG𝑘 ∩ BG𝑙 = 𝜙 ∀ 𝑘 ≠ 𝑙. 
G_k in the subscript of a variable denotes its subset corresponding to B_Gk. Thus, s_Gk and r_Gk denote the scores and the rescores of B_Gk respectively. α denotes the maximum group size. ∨ denotes the logical OR while ⌊x⌉ denotes clipping of x in the range [0, 1]. Formally,

\lfloor x \rceil = \begin{cases} 1, & x > 1 \\ x, & 0 \le x \le 1 \\ 0, & x < 0 \end{cases}    (2.1)

|s| denotes the number of elements in s. A ⌞ in the subscript denotes the lower triangular version of the matrix without the principal diagonal. ⊙ denotes the element-wise multiplication. I denotes the identity matrix.

2.3.2 Classical and Soft-NMS
NMS is one of the building blocks in object detection whose high-level goal is to iteratively suppress boxes which have too much IoU with a nearby high-scoring box. We first give an overview of the classical and Soft-NMS [12], which are greedy and used in inference. Classical NMS uses the idea that the score of a box having a high IoU2D overlap with any of the selected boxes should be suppressed to zero. That is, it uses a hard pruning p without any temperature τ. Soft-NMS makes this pruning soft via the temperature τ. Thus, classical and Soft-NMS only differ in the choice of p. We reproduce them in Alg. 1 using our notations.

Algorithm 1: Classical/Soft-NMS [12]
Input: s: scores, O: IoU2D matrix, Nt: NMS threshold, p: pruning function, τ: temperature
Output: d: box index after NMS, r: scores after NMS
 1 begin
 2   d ← {}
 3   t ← {1, · · · , |s|}                 ⊲ All box indices
 4   r ← s
 5   while t ≠ empty do
 6     ν ← argmax r[t]                    ⊲ Top scored box
 7     d ← d ∪ ν                          ⊲ Add to valid box index
 8     t ← t − ν                          ⊲ Remove from t
 9     for i ← 1 : |t| do
10       r_i ← (1 − p_τ(O[ν, i])) r_i     ⊲ Rescore
11     end
12   end
13 end

2.4 GrooMeD-NMS
Classical NMS (Alg. 1) uses argmax and greedily calculates the rescore r_i of boxes B and is thus not parallelizable or differentiable [192]. We wish to find its smooth approximation in closed-form for including in the training pipeline.

2.4.1 Formulation
2.4.1.1 Sorting
Classical NMS uses the non-differentiable hard argmax operation (Line 6 of Alg. 1). We remove the argmax by hard sorting the scores s and O in decreasing order (lines 2-3 of Alg. 2). We also try making the sorting soft. Note that we require the permutation of s to sort O. Most soft sorting methods [8, 10, 185, 190] apply the soft permutation to the same vector. Only two other methods [46, 191] can apply the soft permutation to another vector. Both methods use O(n²) computations for soft sorting [10]. We implement [191] and find that [191] is overly dependent on the temperature τ to break out the ranks, and its gradients are too unreliable to train our model. Hence, we stick with the hard sorting of s and O.

Algorithm 2: GrooMeD-NMS
Input: s: scores, O: IoU2D matrix, Nt: NMS threshold, p: pruning function, v: valid box threshold, α: maximum group size
Output: d: box index after NMS, r: scores after NMS
 1 begin
 2   s, index ← sort(s, descending=True)        ⊲ Sort s
 3   O ← O[index][:, index]                     ⊲ Sort O
 4   O⌞ ← lower(O)                              ⊲ Lower triangular matrix
 5   P ← p(O⌞)                                  ⊲ Prune matrix
 6   I ← Identity(|s|)                          ⊲ Identity matrix
 7   G ← group(O, Nt, α)                        ⊲ Group boxes B
 8   for k ← 1 : |G| do
 9     M_Gk ← zeros(|G_k|, |G_k|)               ⊲ Prepare mask
10     M_Gk[:, G_k[1]] ← 1                      ⊲ First col of M_Gk
11     r_Gk ← ⌊(I_Gk − M_Gk ⊙ P_Gk) s_Gk⌉       ⊲ Rescore
12   end
13   d ← index[r >= v]                          ⊲ Valid box index
14 end

2.4.1.2 NMS as a Matrix Operation
The rescoring process of the classical NMS is greedy set-based [192] and only considers overlaps with unsuppressed boxes.
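For reference, here is a minimal Python sketch of that greedy rescoring in the spirit of Alg. 1 (an illustrative re-implementation, not the thesis code); the valid_thresh argument plays the role of the validity threshold v used with Soft-NMS and GrooMeD-NMS.

```python
import numpy as np

def greedy_nms(scores, iou, p, valid_thresh=0.3):
    """Greedy rescoring in the spirit of Alg. 1 (illustrative sketch).

    scores: (n,) initial scores s; iou: (n, n) IoU2D matrix O;
    p: pruning function mapping an overlap to a suppression weight in [0, 1].
    """
    r = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(r)))
    picked = []
    while remaining:
        top = max(remaining, key=lambda i: r[i])   # the argmax of Line 6 in Alg. 1
        picked.append(top)
        remaining.remove(top)
        for i in remaining:                        # rescore by overlap with the selected box
            r[i] *= 1.0 - p(iou[top, i])
    # keep boxes whose rescore survives; valid_thresh mirrors the threshold v
    return [i for i in picked if r[i] >= valid_thresh], r

hard_prune = lambda o, Nt=0.4: float(o > Nt)       # classical NMS; Soft-NMS uses, e.g., p(o) = o
scores = np.array([0.9, 0.8, 0.3])
iou = np.array([[1.0, 0.6, 0.1],
                [0.6, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(greedy_nms(scores, iou, hard_prune))         # box 1 is suppressed by box 0
```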
We first generalize this rescoring by accounting for the effect of all (suppressed and unsuppressed) boxes as

s_i - r_i \approx \max\Big( \sum_{j=1}^{i-1} p(o_{ij})\, r_j,\; 0 \Big)    (2.2)

using the relaxation of the logical OR operator ∨ as Σ [106, 124]. See Sec. B.1 of the supplementary material for an alternate explanation of Eq. (2.2). The presence of r_j on the RHS of Eq. (2.2) prevents suppressed boxes from influencing other boxes hugely. When p outputs discretely as {0, 1} as in classical NMS, scores s_i are guaranteed to be suppressed to r_i = 0 or left unchanged r_i = s_i, thereby implying r_i ≤ s_i ∀ i. We write the rescores r in a matrix formulation as

\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_n \end{bmatrix} \approx \max\left( \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ \vdots \\ s_n \end{bmatrix} - \begin{bmatrix} 0 & 0 & \cdots & 0 \\ p(o_{21}) & 0 & \cdots & 0 \\ p(o_{31}) & p(o_{32}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ p(o_{n1}) & p(o_{n2}) & \cdots & 0 \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_n \end{bmatrix},\; \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \right).    (2.3)

The above equation is written compactly as

\mathbf{r} \approx \max(\mathbf{s} - \mathbf{P}\mathbf{r},\, \mathbf{0}),    (2.4)

where P, called the Prune Matrix, is obtained when the pruning function p operates element-wise on O⌞. The maximum operation makes Eq. (2.4) non-linear [112] and, thus, difficult to solve. However, to avoid recursion, we use

\mathbf{r} \approx \big\lfloor (\mathbf{I} + \mathbf{P})^{-1} \mathbf{s} \big\rceil    (2.5)

as the solution to Eq. (2.4), with I being the identity matrix. Intuitively, if the matrix inversion is considered division in Eq. (2.5) and the boxes have overlaps, the rescores are the scores divided by a number greater than one and are, therefore, lesser than the scores. If the boxes do not overlap, the division is by one and the rescores equal the scores. Note that I + P in Eq. (2.5) is a lower triangular matrix with ones on the principal diagonal. Hence, I + P is always full rank and, therefore, always invertible.

2.4.1.3 Grouping
We next observe that the object detectors output multiple boxes for an object, and a good detector outputs boxes wherever it finds objects in the monocular image. Thus, we cluster the boxes in an image in an unsupervised manner based on IoU2D overlaps to obtain the groups G. Grouping thus mimics the grouping of the classical NMS, but does not rescore the boxes. As clustering limits interactions to intra-group interactions among the boxes, we write Eq. (2.5) as

\mathbf{r}_{\mathcal{G}_k} \approx \big\lfloor (\mathbf{I}_{\mathcal{G}_k} + \mathbf{P}_{\mathcal{G}_k})^{-1} \mathbf{s}_{\mathcal{G}_k} \big\rceil.    (2.6)

This results in taking smaller matrix inverses in Eq. (2.6) than in Eq. (2.5).
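A small NumPy sketch of the ungrouped matrix rescoring of Eq. (2.5) follows (illustrative, assuming the scores are already sorted; not the official implementation):

```python
import numpy as np

def matrix_rescore(scores, iou, p):
    """Ungrouped matrix rescoring of Eq. (2.5): r ≈ clip((I + P)^(-1) s) to [0, 1].

    Assumes the scores are sorted in decreasing order, so the prune matrix P
    comes from the strictly lower-triangular part of the IoU2D matrix O.
    """
    P = p(np.tril(iou, k=-1))                      # overlaps with higher-ranked boxes only
    r = np.linalg.inv(np.eye(len(scores)) + P) @ scores
    return np.clip(r, 0.0, 1.0)                    # the clipping ⌊·⌉ of Eq. (2.1)

linear = lambda o: o                               # linear pruning p(o) = o
scores = np.array([0.9, 0.8, 0.3])                 # sorted
iou = np.array([[1.0, 0.6, 0.1],
                [0.6, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(matrix_rescore(scores, iou, linear))         # -> approx [0.90, 0.26, 0.18]
```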
We use a simplistic grouping algorithm, i.e., we form a group G_k with boxes having high IoU2D overlap with the top-ranked box, given that we sorted the scores. As the group size is limited by α, we choose the minimum of α and the number of boxes in G_k. We next delete all the boxes of this group and iterate until we run out of boxes. Also, grouping uses IoU2D since we can achieve meaningful clustering in 2D. We detail this unsupervised grouping in Alg. 3.

Algorithm 3: Grouping of boxes
Input: O: sorted IoU2D matrix, Nt: NMS threshold, α: maximum group size
Output: G: Groups
 1 begin
 2   G ← {}
 3   t ← {1, · · · , O.shape[1]}          ⊲ All box indices
 4   while t ≠ empty do
 5     u ← O[:, 1] > Nt                   ⊲ High overlap indices
 6     v ← t[u]                           ⊲ New group
 7     n_Gk ← min(|v|, α)
 8     G.insert(v[: n_Gk])                ⊲ Insert new group
 9     w ← O[:, 1] ≤ Nt                   ⊲ Low overlap indices
10     t ← t[w]                           ⊲ Keep w indices in t
11     O ← O[w][:, w]                     ⊲ Keep w indices in O
12   end
13 end

2.4.1.4 Masking
Classical NMS considers the IoU2D of the top-scored box with other boxes. This consideration is equivalent to only keeping the column of O corresponding to the top box while assigning the rest of the columns to be zero. We implement this through masking of P_Gk. Let M_Gk denote the binary mask corresponding to group G_k. Then, the entries of the binary matrix M_Gk in the column corresponding to the top-scored box are 1 and the rest are 0. Hence, only one of the columns in M_Gk ⊙ P_Gk is non-zero. Now, I_Gk + M_Gk ⊙ P_Gk is a Frobenius matrix (Gaussian transformation), and we, therefore, invert this matrix by simply subtracting the second term [71]. In other words, (I_Gk + M_Gk ⊙ P_Gk)^{-1} = I_Gk − M_Gk ⊙ P_Gk. Hence, we simplify Eq. (2.6) further to get

\mathbf{r}_{\mathcal{G}_k} \approx \big\lfloor (\mathbf{I}_{\mathcal{G}_k} - \mathbf{M}_{\mathcal{G}_k} \odot \mathbf{P}_{\mathcal{G}_k})\, \mathbf{s}_{\mathcal{G}_k} \big\rceil.    (2.7)

Thus, masking allows us to bypass the computationally expensive matrix inverse operation altogether. We call the NMS based on Eq. (2.7) the Grouped Mathematically Differentiable Non-Maximal Suppression or GrooMeD-NMS. We summarize the complete GrooMeD-NMS in Alg. 2 and show its block diagram in Fig. 2.1c. GrooMeD-NMS in Fig. 2.1c provides two gradients - one through s and the other through O.

2.4.1.5 Pruning Function
As explained in Sec. 2.3.1, the pruning function p decides whether to keep a box in the final set of predictions D or not based on IoU2D overlaps, i.e., p(o_i) = 1 denotes that the box b_i is suppressed, while p(o_i) = 0 denotes that b_i is kept in D. Classical NMS uses the threshold as the pruning function, which does not give useful gradients. Therefore, we considered three different functions for p: Linear, a temperature (τ)-controlled Exponential, and a Sigmoidal function.
• Linear: the linear pruning function [12] is p(o) = o.
• Exponential: the exponential pruning function [12] is p(o) = 1 − exp(−o²/τ).
• Sigmoidal: the sigmoidal pruning function is p(o) = σ((o − Nt)/τ), with σ denoting the standard sigmoid. The sigmoidal function appears as the binary cross-entropy relaxation of the subset selection problem [185].
We show these pruning functions in Fig. 2.2. The ablation studies (Sec. 2.5.4) show that choosing p as Linear yields the simplest and the best GrooMeD-NMS.

Figure 2.2 Pruning functions p of the classical and GrooMeD-NMS. We use the Linear and Exponential pruning of the Soft-NMS [12] while training with the GrooMeD-NMS.
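Putting the pieces together, the sketch below (an illustrative NumPy version, not the released code) combines the grouping of Alg. 3, the masking of Eq. (2.7), and a choice of pruning function. Because every operation is a matrix product, an element-wise multiplication, and a clip, the same computation is differentiable when written in an autograd framework.

```python
import numpy as np

def group_boxes(iou, Nt=0.4, alpha=100):
    """Unsupervised grouping in the spirit of Alg. 3: cluster sorted boxes by IoU2D with the top box."""
    idx = np.arange(iou.shape[0])
    groups = []
    while idx.size:
        sub = iou[np.ix_(idx, idx)]
        high = sub[0] > Nt                    # overlap with the current top-ranked box
        high[0] = True                        # the top box starts its own group
        groups.append(idx[high][:alpha])      # group size capped at alpha
        idx = idx[~high]                      # delete the group and iterate
    return groups

def groomed_rescore(scores, iou, p=lambda o: o, Nt=0.4, alpha=100):
    """Grouped, masked rescoring of Eq. (2.7): r_Gk ≈ clip((I_Gk − M_Gk ⊙ P_Gk) s_Gk).

    Scores must be sorted in decreasing order; p is the (soft) pruning function.
    """
    r = np.asarray(scores, dtype=float).copy()
    for g in group_boxes(iou, Nt, alpha):
        P = p(np.tril(iou[np.ix_(g, g)], k=-1))    # prune matrix of the group
        M = np.zeros_like(P)
        M[:, 0] = 1.0                              # mask keeps only the top box's column
        r[g] = np.clip((np.eye(len(g)) - M * P) @ r[g], 0.0, 1.0)
    return r

# Linear pruning is used above; an exponential alternative would be
# p(o) = 1 - np.exp(-o**2 / tau), cf. Sec. 2.4.1.5.
scores = np.array([0.9, 0.8, 0.3])
iou = np.array([[1.0, 0.6, 0.1],
                [0.6, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(groomed_rescore(scores, iou))                # box 1 is down-weighted by box 0
```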
2.4.2 Differences from Existing NMS
Although no differentiable NMS has been proposed for monocular 3D object detection, we compare our GrooMeD-NMS with the NMS proposed for 2D object detection, 2D pedestrian detection, 2D salient object detection, and 3D object detection in Tab. 2.1. No method described in Tab. 2.1 has a matrix-based closed-form mathematical expression of the NMS. Classical, Soft [12] and Distance-NMS [216] are used at inference time, while GrooMeD-NMS is used during both training and inference. Distance-NMS [216] updates the z-coordinate of the box after NMS as the weighted average of the z-coordinates of the top-κ boxes. QUBO-NMS [209], Point-NMS [116, 224], and MAP-NMS [298] are not used in end-to-end training. [3] proposes a trainable Point-NMS. The Structured-SVM based NMS [49, 244] rely on structured SVM to obtain the rescores. Adaptive-NMS [145] uses a separate neural network to predict the classical NMS threshold Nt. The trainable neural network based NMS (NN-NMS) [80, 81, 192] use a separate neural network containing multiple layers and/or message-passing to approximate the NMS and do not use the pruning function. Unlike these methods, GrooMeD-NMS uses a single layer and does not require multiple layers or message passing. Our NMS is parallel up to a group (denoted by G). However, |G| is, in general, ≪ |B| in the NMS.

Table 2.1 Comparison of different NMS. [Key: Train= End-to-end Trainable, Prune= Pruning function, #Layers= Number of layers, Par= Parallelizable]
NMS                      | Train | Rescore        | Prune | #Layers | Par
Classical                |  ✕    | ✕              | Hard  |  -      | O(|G|)
Soft-NMS [12]            |  ✕    | ✕              | Soft  |  -      | O(|G|)
Distance-NMS [216]       |  ✕    | ✕              | Hard  |  -      | O(|G|)
QUBO-NMS [209]           |  ✕    | Optimization   | -     |  -      | ✕
Point-NMS [116, 224]     |  ✕    | Point Process  | -     |  -      | ✕
Trainable Point-NMS [3]  |  ✓    | Point Process  | -     |  -      | ✕
MAP-NMS [298]            |  ✕    | MAP            | -     |  -      | ✕
Structured-NMS [49, 244] |  ✕    | SSVM           | -     |  -      | ✕
Adaptive-NMS [145]       |  ✕    | ✕              | Hard  |  > 1    | O(|G|)
NN-NMS [80, 81, 192]     |  ✓    | Neural Network | ✕     |  > 1    | O(1)
GrooMeD-NMS (Ours)       |  ✓    | Soft Matrix    | Soft  |  1      | O(|G|)

2.4.3 Target Assignment and Loss Function
Target Assignment. Our method builds on M3D-RPN [15] and uses binning and self-balancing confidence [17]. The boxes' self-balancing confidences are used as scores s, which pass through the GrooMeD-NMS layer to obtain the rescores r. The rescores signal the network if the best box has not been selected for a particular object. We extend the notion of the best 2D box [192] to 3D. The best box has the highest product of IoU2D and gIoU3D [204] with the ground truth g_l. If the product is greater than a certain threshold β, it is assigned a positive label. Mathematically,

\mathrm{target}(b_i) = \begin{cases} 1, & \text{if } \exists\, g_l \text{ s.t. } i = \arg\max_j q(b_j, g_l) \text{ and } q(b_i, g_l) \ge \beta \\ 0, & \text{otherwise} \end{cases}    (2.8)

with q(b_j, g_l) = \mathrm{IoU}_{2D}(b_j, g_l) \left( \frac{1 + \mathrm{gIoU}_{3D}(b_j, g_l)}{2} \right). gIoU3D is known to provide a signal even for non-intersecting boxes [204], where the usual IoU3D is always zero. Therefore, we use gIoU3D instead of the regular IoU3D for figuring out the best box in 3D, as many 3D boxes have a zero IoU3D overlap with the ground truth. For calculating gIoU3D, we first calculate the volume V and the hull volume V_hull of the 3D boxes. V_hull is the product of gIoU2D in Bird's Eye View (BEV), removing the rotations, and the hull of the Y dimension. gIoU3D is then given by

\mathrm{gIoU}_{3D}(b_i, b_j) = \frac{V(b_i \cap b_j)}{V(b_i \cup b_j)} + \frac{V(b_i \cup b_j)}{V_{hull}(b_i, b_j)} - 1.    (2.9)
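The following sketch illustrates the target assignment of Eqs. (2.8)-(2.9) for a simplified, axis-aligned case (yaw and the BEV-based hull used in the thesis are ignored here; the function names are illustrative):

```python
import numpy as np

def giou_3d_axis_aligned(box_a, box_b):
    """Simplified gIoU3D of Eq. (2.9) for axis-aligned boxes (yaw ignored).

    Boxes are (min_corner, max_corner) pairs of 3D coordinates. This sketch only
    illustrates V(∩)/V(∪) + V(∪)/V_hull − 1; the thesis handles rotated boxes via BEV.
    """
    a_min, a_max = map(np.asarray, box_a)
    b_min, b_max = map(np.asarray, box_b)
    inter = np.prod(np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None))
    union = np.prod(a_max - a_min) + np.prod(b_max - b_min) - inter
    hull = np.prod(np.maximum(a_max, b_max) - np.minimum(a_min, b_min))  # smallest enclosing box
    return inter / union + union / hull - 1.0

def assign_target(q_values, beta=0.3):
    """Target assignment of Eq. (2.8): only the best-overlapping box gets a positive label."""
    target = np.zeros(len(q_values))
    best = int(np.argmax(q_values))
    if q_values[best] >= beta:
        target[best] = 1.0
    return target

# gIoU3D stays informative even for disjoint boxes (negative value instead of zero):
print(giou_3d_axis_aligned(((0, 0, 0), (1, 1, 1)), ((2, 0, 0), (3, 1, 1))))

# q(b, g) = IoU2D(b, g) * (1 + gIoU3D(b, g)) / 2 for each predicted box against one ground truth.
q = np.array([0.55, 0.20, 0.05])
print(assign_target(q))          # -> [1., 0., 0.]
```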
Loss Function. Generally, the number of best boxes is less than the number of ground truths in an image, as there could be some ground-truth boxes for which no box is predicted. The tiny number of best boxes introduces a far heavier skew than the foreground-background classification. Thus, we use a modified AP-Loss [30] as our loss after NMS, since AP-Loss does not suffer from class imbalance [30]. Vanilla AP-Loss treats boxes of all images in a mini-batch equally, and the gradients are back-propagated through all the boxes. We remove this condition and rank boxes in an image-wise manner. In other words, if the best boxes are correctly ranked in one image and are not in the second, then the gradients only affect the boxes of the second image. We call this modification of AP-Loss the Imagewise AP-Loss. In other words,

\mathcal{L}_{Imagewise} = \frac{1}{N} \sum_{m=1}^{N} \mathrm{AP}\big( \mathbf{r}^{(m)}, \mathrm{target}(\mathcal{B}^{(m)}) \big),    (2.10)

where r^(m) and B^(m) denote the rescores and the boxes of the m-th image in a mini-batch, respectively. This is different from previous NMS approaches [78, 80, 81, 192], which use classification losses. Our ablation studies (Sec. 2.5.4) show that the Imagewise AP-Loss is better suited to be used after NMS than the classification loss. Our overall loss function is thus given by L = L_before + λ L_after, where L_before denotes the losses before the NMS, including classification, 2D and 3D regression, as well as confidence losses, and L_after denotes the loss term after the NMS, which is the Imagewise AP-Loss, with λ being the weight. See Sec. B.2 of the supplementary material for more details of the loss function.

2.5 Experiments
Our experiments use the most widely used KITTI autonomous driving dataset [67]. We modify the publicly available PyTorch [184] code of Kinematic-3D [17]. [17] uses DenseNet-121 [86] trained on ImageNet as the backbone and n_h = 1,024 using the 3D-RPN settings of [15]. As [17] is a video-based method while GrooMeD-NMS is an image-based method, we use the best image model of [17], henceforth called Kinematic (Image), as our baseline for a fair comparison. Kinematic (Image) is built on M3D-RPN [15] and uses binning and self-balancing confidence.
Data Splits. There are three commonly used data splits of the KITTI dataset; we evaluate our method on all three. KITTI Test (Full) split: The official KITTI 3D benchmark [1] consists of 7,481 training and 7,518 testing images [67]. KITTI Val 1 split: It partitions the 7,481 training images into 3,712 training and 3,769 validation images [17, 32, 220]. KITTI Val 2 split: It partitions the 7,481 training images into 3,682 training and 3,799 validation images [271].
Training. Training is done in two phases - warmup and full [17]. We initialize the model with the confidence prediction branch from the warmup weights and finetune using the self-balancing loss [17] and the Imagewise AP-Loss [30] after our GrooMeD-NMS. See Sec. B.3.1 of the supplementary material for more training details. We keep the weight λ at 0.05. Unless otherwise stated, we use p as the Linear function (this does not require τ) with α = 100. Nt, v and β are set to 0.4 [15, 17], 0.3 and 0.3 respectively.

Table 2.2 KITTI Test cars AP3D|R40 and APBEV|R40 comparisons (IoU3D ≥ 0.7). Previous results are quoted from the official leaderboard or from papers. [Key: Best, Second Best]
Method                 | AP3D|R40 Easy / Mod / Hard | APBEV|R40 Easy / Mod / Hard
FQNet [143]            |  2.77 /  1.51 / 1.01       |  5.40 /  3.23 /  2.46
ROI-10D [169]          |  4.32 /  2.02 / 1.46       |  9.78 /  4.91 /  3.74
GS3D [121]             |  4.47 /  2.90 / 2.47       |  8.41 /  6.08 /  4.94
MonoGRNet [195]        |  9.61 /  5.74 / 4.25       | 18.19 / 11.17 /  8.73
MonoPSR [107]          | 10.76 /  7.25 / 5.85       | 18.33 / 12.58 /  9.91
MonoDIS [222]          | 10.37 /  7.94 / 6.40       | 17.23 / 13.19 / 11.12
UR3D [216]             | 15.58 /  8.61 / 6.00       | 21.85 / 12.51 /  9.20
M3D-RPN [15]           | 14.76 /  9.71 / 7.42       | 21.02 / 13.67 / 10.23
SMOKE [153]            | 14.03 /  9.76 / 7.84       | 20.83 / 14.49 / 12.75
MonoPair [35]          | 13.04 /  9.99 / 8.65       | 19.28 / 14.83 / 12.89
RTM3D [123]            | 14.41 / 10.34 / 8.77       | 19.17 / 14.20 / 11.99
AM3D [165]             | 16.50 / 10.74 / 9.52       | 25.03 / 17.32 / 14.91
MoVi-3D [223]          | 15.19 / 10.90 / 9.26       | 22.76 / 17.03 / 10.86
RAR-Net [144]          | 16.37 / 11.01 / 9.52       | 22.45 / 15.02 / 12.93
M3D-SSD [160]          | 17.51 / 11.46 / 8.98       | 24.15 / 15.93 / 12.11
DA-3Ddet [287]         | 16.77 / 11.50 / 8.93       |   -   /   -   /   -
D4LCN [52]             | 16.65 / 11.72 / 9.51       | 22.51 / 16.02 / 12.55
Kinematic (Video) [17] | 19.07 / 12.72 / 9.17       | 26.69 / 17.52 / 13.10
GrooMeD-NMS (Ours)     | 18.10 / 12.32 / 9.65       | 26.19 / 18.27 / 14.05

Inference.
We multiply the class and predicted confidence to get the box’s overall score in inference as in [99, 216, 241]. See Sec. 2.5.2 for training and inference times. Evaluation Metrics. KITTI uses AP 3D|𝑅40 metric to evaluate object detection following [220,222]. KITTI benchmark evaluates on three object categories: Easy, Moderate and Hard. It assigns each 19 Table 2.3 KITTI Val 1 cars AP 3D| 𝑅40 and AP BEV| 𝑅40 results. [Key: Best, Second Best]. IoU3D ≥ 0.7 (cid:17)) (− IoU3D ≥ 0.5 (cid:17)) (− (− (cid:17)) Method AP 3D|𝑅40 AP 3D|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 12.50 7.34 4.98 19.49 11.51 8.72 AP BEV|𝑅40 AP BEV|𝑅40 MonoDR [7] MonoGRNet [195] in [35] 11.90 7.56 5.76 19.72 12.81 10.15 47.59 32.28 25.50 52.13 35.99 28.72 MonoDIS [222] in [220] M3D-RPN [15] in [17] MoVi-3D [223] MonoPair [35] Kinematic (Image) [17] Kinematic (Video) [17] GrooMeD-NMS (Ours) 11.06 7.60 6.37 18.45 12.58 10.66 14.53 11.07 8.65 20.85 15.62 11.88 48.56 35.94 28.59 53.35 39.60 31.77 14.28 11.13 9.68 22.36 17.87 15.73 16.28 12.30 10.42 24.12 18.17 15.76 55.38 42.39 37.99 61.06 47.63 41.92 18.28 13.55 10.13 25.72 18.82 14.48 54.70 39.33 31.25 60.87 44.36 34.48 19.76 14.10 10.47 27.83 19.72 15.10 55.44 39.47 31.26 61.79 44.68 34.56 19.67 14.32 11.27 27.38 19.75 15.92 55.62 41.07 32.89 61.83 44.98 36.29 (− - - - - - - - - - - - - - - - - - - (a) Linear Scale (b) Log Scale Figure 2.3 AP3D Comparison at different depths and IoU3D matching thresholds on KITTI Val 1 Split. object to a category based on its occlusion, truncation, and height in the image space. The AP 3D|𝑅40 performance on the Moderate category compares different models in the benchmark [67]. We focus primarily on the Car class following [17]. 2.5.1 KITTI Test Mono3D Tab. 2.2 summarizes the results of 3D object detection and BEV evaluation on KITTI Test Split. The results in Tab. 2.2 show that GrooMeD-NMS outperforms the baseline M3D-RPN [15] by a significant margin and several other SoTA methods on both the tasks. GrooMeD-NMS also outper- forms augmentation based approach MoVi-3D [223] and depth-convolution based D4LCN [52]. Despite being an image-based method, GrooMeD-NMS performs competitively to the video-based method Kinematic (Video) [17], outperforming it on the most-challenging Hard set. 20 Table 2.4 Comparisons with other NMS on KITTI Val 1 cars (IoU3D ≥ 0.7). [Key: C= Classical, S= Soft-NMS [12], D= Distance-NMS [216], G= GrooMeD-NMS ] S (− (− (cid:17)) Method Infer NMS AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard Kinematic (Image) C 18.28 13.55 10.13 25.72 18.82 14.48 Kinematic (Image) 18.29 13.55 10.13 25.71 18.81 14.48 Kinematic (Image) D 18.25 13.53 10.11 25.71 18.82 14.48 Kinematic (Image) G 18.26 13.51 10.10 25.67 18.77 14.44 C 19.67 14.31 11.27 27.38 19.75 15.93 GrooMeD-NMS S GrooMeD-NMS 19.67 14.31 11.27 27.38 19.75 15.93 D 19.67 14.31 11.27 27.38 19.75 15.93 GrooMeD-NMS G 19.67 14.32 11.27 27.38 19.75 15.92 GrooMeD-NMS 2.5.2 KITTI Val 1 Mono3D Results. Tab. 2.3 summarizes the results of 3D object detection and BEV evaluation on KITTI Val 1 Split at two IoU3D thresholds of 0.7 and 0.5 [17, 35]. Tab. 2.3 results show that GrooMeD-NMS outperforms the baseline of M3D-RPN [15] and Kinematic (Image) [17] by a significant margin. Interestingly, GrooMeD-NMS (an image-based method) also outperforms the video-based method Kinematic (Video) [17] on most of the metrics. Thus, GrooMeD-NMS performs best on 6 out of the 12 cases (3 categories × 2 tasks × 2 thresholds) while second-best on all other cases. 
The performance is especially impressive since the biggest improvements are shown on the Moderate and Hard set, where objects are more distant and occluded. AP3D at different depths and IoU3D thresholds. We next compare the AP3D performance of GrooMeD-NMS and Kinematic (Image) on linear and log scale for objects at different depths of [15, 30, 45, 60] meters and IoU3D matching criteria of 0.3 − 0.7 in Fig. 2.3 as in [17]. Fig. 2.3 shows that GrooMeD-NMS outperforms the Kinematic (Image) [17] at all depths and all IoU3D thresholds. (cid:17) Comparisons with other NMS. We compare with the classical NMS, Soft-NMS [12] and Distance- NMS [216] in Tab. 2.4. More detailed results are in Tab. B.2 of the supplementary material. The results show that NMS inclusion in the training pipeline benefits the performance, unlike [12], which suggests otherwise. Training with GrooMeD-NMS helps because the network gets an additional signal through the GrooMeD-NMS layer whenever the best-localized box corresponding to an object is not selected. Interestingly, Tab. 2.4 also suggests that replacing GrooMeD-NMS 21 Figure 2.4 Score-IoU3D plot after the NMS. GrooMeD-NMS achieves the best correlation. Table 2.5 KITTI Val 2 cars AP 3D| 𝑅40 and AP BEV| 𝑅40 comparisons. [Key: Best, *= Released, †= Retrained]. IoU3D ≥ 0.7 (cid:17)) (− IoU3D ≥ 0.5 (cid:17)) (− Method AP 3D|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard M3D-RPN [15]* 14.57 10.07 7.51 21.36 15.22 11.28 49.14 34.43 26.39 53.44 37.79 29.36 Kinematic (Image) [17]† 13.54 10.21 7.24 20.60 15.14 11.30 51.53 36.55 28.26 56.20 40.02 31.25 GrooMeD-NMS (Ours) 14.72 10.87 7.67 22.03 16.05 11.93 51.91 36.78 28.40 56.29 40.31 31.39 AP BEV|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 (cid:17)) (− (− with the classical NMS in inference does not affect the performance. Score-IoU3D Plot. We further correlate the scores with IoU3D after NMS of our model with two baselines - M3D-RPN [15] and Kinematic (Image) [17] and also the Kinematic (Video) [17] in Fig. 2.4. We obtain the best correlation of 0.345 exceeding the correlations of M3D-RPN, Kinematic (Image) and, also Kinematic (Video). This proves that including NMS in the training pipeline is beneficial. Training and Inference Times. We now compare the training and inference times of includ- ing GrooMeD-NMS in the pipeline. Warmup training phase takes about 13 hours to train on a single 12 GB GeForce GTX Titan-X GPU. Full training phase of Kinematic (Image) and GrooMeD- NMS takes about 8 and 8.5 hours respectively. The inference time per image using classical and GrooMeD-NMS is 0.12 and 0.15 ms respectively. Tab. 2.4 suggests that changing the NMS from GrooMeD to classical during inference does not alter the performance. Then, the inference time of our method is the same as 0.12 ms. 22 Table 2.6 Ablation studies of GrooMeD-NMS on KITTI Val 1 cars. 
Change from GrooMeD-NMS model: IoU3D ≥ 0.7 (− (cid:17)) IoU3D ≥ 0.5 (− (cid:17)) Changed From −− (cid:17) Training Conf+NMS− Conf+NMS− Conf+NMS− Initialization No Warmup Pruning Function Group+Mask Loss (cid:17) (cid:17) (cid:17) (cid:17) Linear − Linear − Linear − Linear − Group+Mask− Group+Mask− (cid:17) Imagewise AP − (cid:17) Imagewise AP − Inference Class*Pred− NMS Scores Class*Pred− (− (− To (cid:17)) (cid:17) (cid:17) (cid:17) AP 3D|𝑅40 AP 3D|𝑅40 AP BEV|𝑅40 AP BEV|𝑅40 No Conf+No NMS Conf+No NMS No Conf+NMS (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 16.66 12.10 9.40 23.15 17.43 13.48 51.47 38.58 30.98 56.48 42.53 34.37 19.16 13.89 10.96 27.01 19.33 14.84 57.12 41.07 32.79 61.60 44.58 35.97 15.02 11.21 8.83 21.07 16.27 12.77 48.01 36.18 29.96 53.82 40.94 33.35 15.33 11.68 8.78 21.32 16.59 12.93 49.15 37.42 30.11 54.32 41.44 33.48 Exponential, 𝜏 = 1 12.81 9.26 7.10 17.07 12.17 9.25 29.58 20.42 15.88 32.06 22.16 17.20 Exponential, 𝜏 = 0.5 [12] 18.63 13.85 10.98 27.52 20.14 15.76 56.64 41.01 32.79 61.43 44.73 36.02 18.34 13.79 10.88 27.26 19.71 15.90 56.98 41.16 32.96 62.77 45.23 36.56 Exponential, 𝜏 = 0.1 17.40 13.21 9.80 26.77 19.26 14.76 55.15 40.77 32.63 60.56 44.23 35.74 Sigmoidal, 𝜏 = 0.1 18.43 13.91 11.08 26.53 19.46 15.83 55.93 40.98 32.78 61.02 44.77 36.09 18.99 13.74 10.24 26.71 19.21 14.77 55.21 40.69 32.55 61.74 44.67 36.00 18.23 13.73 10.28 26.42 19.31 14.76 54.47 40.35 32.20 60.90 44.08 35.47 16.34 12.74 9.73 22.40 17.46 13.70 52.46 39.40 31.68 58.22 43.60 35.27 18.26 13.36 10.49 25.39 18.64 15.12 52.44 38.99 31.3 57.37 42.89 34.68 17.51 12.84 9.55 24.55 17.85 13.63 52.78 37.48 29.37 58.30 41.26 32.66 19.67 14.32 11.27 27.38 19.75 15.92 55.62 41.07 32.89 61.83 44.98 36.29 No Group Group+No Mask Vanilla AP BCE (cid:17) Class (cid:17) Pred (cid:17) (cid:17) — GrooMeD-NMS (best model) 2.5.3 KITTI Val 2 Mono3D Tab. 2.5 summarizes the results of 3D object detection and BEV evaluation on KITTI Val 2 Split at two IoU3D thresholds of 0.7 and 0.5 [17, 35]. Again, we use M3D-RPN [15] and Kinematic (Image) [17] as our baselines. We evaluate the released model of M3D-RPN [15] using the KITTI metric. [17] does not report Val 2 results, so we retrain on Val 2 using their public code. The results in Tab. 2.5 show that GrooMeD-NMS performs best in all cases. This is again impressive because the improvements are shown on Moderate and Hard set, consistent with Tabs. 2.2 and 2.3. 2.5.4 Ablation Studies on KITTI Val 1 Tab. 2.6 compares the modifications of our approach on KITTI Val 1 cars. Unless stated otherwise, we stick with the experimental settings described in Sec. 2.5. Using a confidence head (Conf+No NMS) proves beneficial compared to the warmup model (No Conf+No NMS), which is consistent with the observations of [17, 216]. Further, GrooMeD-NMS on classification scores (denoted by No Conf + NMS) is detrimental as the classification scores are not suited for localization [17,90]. Training the warmup model and then finetuning also works better than training without warmup as in [17] since the warmup phase allows GrooMeD-NMS to carry meaningful grouping of the boxes. As described in Sec. 2.4.1.5, in addition to Linear, we compare two other functions for pruning function 𝑝: Exponential and Sigmoidal. Both of them do not perform as well as the Linear 𝑝 possibly 23 because they have vanishing gradients close to overlap of zero or one. Grouping and masking both help our model to reach a better minimum. As described in Sec. 
2.4.3, Imagewise AP loss is better than the Vanilla AP loss since it treats boxes of two images differently. Imagewise AP also performs better than the binary cross-entropy (BCE) loss proposed in [78, 80, 81, 192]. Using the product of self-balancing confidence and classification scores instead of using them individually as the scores to the NMS in inference is better, consistent with [99, 216, 241]. Class confidence performs worse since it does not have the localization information while the self-balancing confidence (Pred) gives the localization without considering whether the box belongs to foreground or background. 2.6 Conclusions In this chapter, we present and integrate GrooMeD-NMS– a novel Grouped Mathematically Differentiable NMS for monocular 3D object detection, such that the network is trained end-to-end with a loss on the boxes after NMS. We first formulate NMS as a matrix operation and then do unsupervised grouping and masking of the boxes to obtain a simple closed-form expression of the NMS. GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, GrooMeD-NMS achieves state-of-the-art monocular 3D object detection results on the KITTI benchmark dataset. Although our implementation demonstrates monocular 3D object detection, GrooMeD-NMS is fairly generic for other object detection tasks. Future work includes applying this method to tasks such as LiDAR-based 3D object detection and pedestrian detection. Limitation. GrooMeD-NMS does not fully solve the generalization issue. 24 CHAPTER 3 DEVIANT: DEPTH EQUIVARIANT NETWORK FOR MONOCULAR 3D OBJECT DETECTION Modern neural networks use building blocks such as convolutions that are equivariant to arbitrary 2D translations in the Euclidean manifold. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. Even then, all monocular 3D detectors use vanilla blocks to obtain the 3D coordinates, a task for which the vanilla blocks are not designed for. This chapter takes the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. Since the depth is the hardest to estimate for monocular detection, this chapter proposes Depth Equivariant Network (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates, and therefore, DEVIANT achieves state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. 3.1 Introduction Monocular 3D object detection is a fundamental task in computer vision, where the task is to infer 3D information including depth from a single monocular image. It has applications in augmented reality [2], gaming [201], robotics [213], and more recently in autonomous driving [15, 220] as a fallback solution for LiDAR. Most of the monocular 3D methods attach extra heads to the 2D Faster-RCNN [202] or CenterNet [306] for 3D detections. Some change architectures [123, 143, 232] or losses [15, 35]. Others incorporate augmentation [223], or confidence [17, 143]. 
Recent ones use in-network ensembles [159, 301] for better depth estimation. Most of these methods use vanilla blocks such as convolutions that are equivariant to arbitrary 2D translations [19, 198]. In other words, whenever we shift the ego camera in 2D (see 𝑡𝑢 of Fig. 3.1), the new image (projection) is a translation of the original image, and therefore, these methods output a translated feature map. However, the camera generally moves in depth in driving scenes instead of in 2D (see 𝑡𝑍 of Fig. 3.1). So, the new image is not a translation of the original input image due to the projective transform. Thus, using vanilla blocks in monocular methods is a mismatch between the assumptions and the regime where these blocks operate.

Figure 3.1 (a) Idea. Vanilla CNN is equivariant to projected 2D translations 𝑡𝑢, 𝑡𝑣 (in red) of the ego camera. The ego camera moves in 3D in driving scenes, which breaks this assumption. We propose DEVIANT, which is additionally equivariant to depth translations 𝑡𝑍 (in green) in the projective manifold. (b) Depth Equivariance. DEVIANT enforces additional consistency among the feature maps of an image and its transformation caused by the ego depth translation. T𝑠 = scale transformation, ∗ = vanilla convolution.

Additionally, there is a huge generalization gap between training and validation for monocular 3D detection. Modeling translation equivariance in the correct manifold improves generalization for tasks in spherical [41] and hyperbolic [64] manifolds. Monocular detection involves processing pixels (projections of 3D points) to obtain the 3D information, and is thus a task in the projective manifold. Moreover, the depth in monocular detection is ill-defined [232], and thus the hardest to estimate [166]. Hence, using building blocks equivariant to depth translations in the projective manifold is a natural choice for improving generalization and is also at the core of this work (See Sec. C.1.8).

Recent monocular methods use flip [15], scale [159, 223], mosaic [11, 238] or copy-paste [135] augmentation, depth-aware convolution [15], or geometry [151, 159, 218, 302] to improve generalization. Although all these methods improve performance, a major issue is that their backbones are not designed for the projective world. This results in the depth estimation going haywire with a slight ego movement [307]. Moreover, data augmentation, e.g., flips, scales, mosaic, and copy-paste, is not only limited for projective tasks, but also does not guarantee the desired behavior [63].

Table 3.1 Equivariance comparisons. [Key: Proj.= Projected, ax= axis]
Method | Proj. 2D: 𝑢-ax (𝑡𝑢) / 𝑣-ax (𝑡𝑣) | 3D Translation: 𝑥-ax (𝑡𝑋) / 𝑦-ax (𝑡𝑌) / 𝑧-ax (𝑡𝑍)
Vanilla CNN | ✓ / ✓ | − / − / −
Log-polar [313] | − / − | − / − / ✓
DEVIANT | ✓ / ✓ | − / − / ✓
Ideal | − / − | ✓ / ✓ / ✓

To address the mismatch between the assumptions and the operating regime of the vanilla blocks and to improve generalization, we take the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. We propose Depth Equivariant Network (DEVIANT), which is additionally equivariant to depth translations in the projective manifold, as shown in Tab. 3.1. Building upon the classic result from [76], we simplify it under reasonable assumptions about the camera movement in autonomous driving to get scale transformations.
The scale equivariant blocks are well-known in the literature [68, 92, 227, 309], and consequently, we replace the vanilla blocks in the backbone with their scale equivariant steerable counterparts [227] to additionally embed equivariance to depth translations in the projective manifold. Hence, DEVIANT learns consistent depth estimates and improves monocular detection. In summary, the main contributions of this work include: • We study the modeling error in monocular 3D detection and propose depth equivariant networks built with scale equivariant steerable blocks as a solution. • We achieve state-of-the-art (SoTA) monocular 3D object detection results on the KITTI and Waymo datasets in the image-only category and perform competitively to methods which use extra information. • We experimentally show that DEVIANT works better in cross-dataset evaluation suggesting better generalization than vanilla CNN backbones. 27 Transformation − Manifold − (cid:17) (cid:17) Table 3.2 Equivariances known in the literature. Translation Rotation Scale Flips Learned Euclidean Spherical Hyperbolic Projective Vanilla CNN [115] Spherical CNN [41] Hyperbolic CNN [64] Monocular Detector Polar, Log-polar [79], Steerable [266] Steerable [68] ChiralNets [288] Transformers [55] − − − − − − − − − − − − 3.2 Related Works Equivariant Neural Networks. The success of convolutions in CNN has led people to look for their generalizations [43, 262]. Convolution is the unique solution to 2D translation equivariance in the Euclidean manifold [19, 20, 198]. Thus, convolution in CNN is a prior in the Euclidean manifold. Several works explore other group actions in the Euclidean manifold such as 2D rotations [42, 50, 171, 263], scale [98, 170], flips [288], or their combinations [247, 266]. Some consider 3D translations [265] and rotations [239]. Few [55, 264, 304] attempt learning the equivariance from the data, but such methods have significantly higher data requirements [265]. Others change the manifold to spherical [41], hyperbolic [64], graphs [173], or arbitrary manifolds [97]. Monocular 3D detection involves operations on pixels which are projections of 3D point and thus, works in a different manifold namely projective manifold. Tab. 3.2 summarizes all these equivariances known thus far. Scale Equivariant Networks. Scale equivariance in the Euclidean manifold is more challenging than the rotations because of its acyclic and unbounded nature [198]. There are two major lines of work for scale equivariant networks. The first [56, 79] infers the global scale using log-polar transform [313], while the other infers the scale locally by convolving with multiple scales of images [98] or filters [278]. Several works [68, 92, 227, 309] extend the local idea, using steerable filters [62]. Another work [267] constructs filters for integer scaling. We compare the two kinds of scale equivariant convolutions on the monocular 3D detection task and show that steerable convolutions are better suited to embed depth (scale) equivariance. Scale equivariant networks have been used for classification [56, 68, 227], 2D tracking [226] and 3D object classification [56]. We are the first to use scale equivariant networks for monocular 3D detection. 28 3D Object Detection. Accurate 3D object detection uses sparse data from LiDARs [215], which are expensive and do not work well in severe weather [232] and glassy environments. 
Hence, several works focus on monocular camera-based 3D object detection, which is a simpler setup but suffers from scale/depth ambiguity [232]. Earlier approaches [31, 61, 186, 187] use hand-crafted features, while the recent ones use deep learning. Some change architectures [123, 143, 146, 232] or losses [15, 35]. Some use scale [159, 223], mosaic [238] or copy-paste [135] augmentation. Others incorporate depth in convolution [15, 52], or confidence [17, 111, 143]. More recent ones use in-network ensembles to predict the depth deterministically [301] or probabilistically [159]. A few use temporal cues [17], NMS [109], or corrected camera extrinsics [307] in the training pipeline. Some also use CAD models [25, 154] or LiDAR [199] in training. Another line of work called Pseudo-LiDAR [162, 165, 181, 221, 254] estimates the depth first, and then uses a point cloud-based 3D object detector. We refer to [163] for a detailed survey. Our work is the first to use scale equivariant blocks in the backbone for monocular 3D detection.

3.3 Background

We first provide the necessary definitions which are used throughout this chapter. These are not our contributions and can be found in the literature [21, 76, 265].

Equivariance. Consider a group of transformations 𝐺, whose individual members are 𝑔. Let Φ denote the mapping of the inputs ℎ to the outputs 𝑦, and let the inputs and outputs undergo the transformations T^ℎ_𝑔 and T^𝑦_𝑔 respectively. Then, the mapping Φ is equivariant to the group 𝐺 [265] if Φ(T^ℎ_𝑔 ℎ) = T^𝑦_𝑔 (Φℎ), ∀ 𝑔 ∈ 𝐺. Thus, equivariance provides an explicit relationship between input transformations and feature-space transformations at each layer of the neural network [265], and intuitively makes the learning easier. The mapping Φ is the vanilla convolution when T^ℎ_𝑔 = T^𝑦_𝑔 = T_t, where T_t denotes the translation t on the discrete grid [19, 20, 198]. These vanilla convolutions introduce weight-tying [115] in fully connected neural networks, resulting in greater generalization. A special case of equivariance is invariance [265], which is given by Φ(T^ℎ_𝑔 ℎ) = Φℎ, ∀ 𝑔 ∈ 𝐺. We give a small numerical check of this property for the vanilla convolution below.

Projective Transformations. Our idea is to use equivariance to depth translations in the projective manifold since the monocular detection task belongs to this manifold. A natural question to ask is whether such equivariants exist in the projective manifold. [21] answers this question in the negative, and says that such equivariants do not exist in general. However, such equivariants exist for special classes, such as planes. An intuitive way to understand this is to try to infer the rotations and translations by looking at the two projections (images). For example, the result of [21] makes sense if we consider a car with very different front and back sides, as in Fig. C.2. A 180° ego rotation around the car means the projections (images) are its front and back sides, which are different. Thus, we cannot infer the translations and rotations from these two projections. Based on this result, we stick with locally planar objects, i.e., we assume that a 3D object is made of several patch planes (see the last row of Fig. 3.2b as an example). It is important to stress that we do NOT assume that the 3D object, such as a car, is planar. The local planarity also agrees with the property that manifolds locally resemble 𝑛-dimensional Euclidean space, and because the projective transform maps planes to planes, the patch planes in 3D are also locally planar. We show a sample planar patch and the 3D object in Fig. C.1 in the appendix.
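To make the equivariance definition above concrete, the following minimal PyTorch check (our illustration, not part of the original experiments) verifies that a vanilla convolution commutes with integer 2D translations; circular padding is assumed so that the cyclic shift is exact on a finite grid.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    # Vanilla convolution Phi; circular padding makes cyclic shifts exact on a finite grid.
    phi = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular")
    h = torch.randn(1, 1, 32, 32)          # toy input image h
    t = (5, -3)                            # an integer 2D translation t = (t_v, t_u)

    lhs = phi(torch.roll(h, shifts=t, dims=(2, 3)))   # Phi(T_t h)
    rhs = torch.roll(phi(h), shifts=t, dims=(2, 3))   # T_t (Phi h)
    print(torch.allclose(lhs, rhs, atol=1e-6))        # True: translation equivariance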
Planarity and Projective Transformation. Example 13.2 from [76] links planarity and projective transformations. Although their result is for stereo with two different cameras (K, K′), we substitute K = K′ to get Th. 1.

Theorem 1. [76] Consider a 3D point lying on a patch plane 𝑚𝑥 + 𝑛𝑦 + 𝑜𝑧 + 𝑝 = 0, observed by an ego camera in a pinhole setup to give an image ℎ. Let t = (𝑡𝑋, 𝑡𝑌, 𝑡𝑍) and R = [𝑟𝑖𝑗]3×3 denote a translation and rotation of the ego camera respectively. Observing the same 3D point from the new camera position leads to an image ℎ′. Then, the image ℎ is related to the image ℎ′ by the projective transformation T:

ℎ(𝑢−𝑢0, 𝑣−𝑣0) = ℎ′(
  [(𝑟11+𝑡𝑋 𝑚/𝑝)(𝑢−𝑢0) + (𝑟21+𝑡𝑋 𝑛/𝑝)(𝑣−𝑣0) + (𝑟31+𝑡𝑋 𝑜/𝑝) 𝑓] / [(𝑟13+𝑡𝑍 𝑚/𝑝)(𝑢−𝑢0) + (𝑟23+𝑡𝑍 𝑛/𝑝)(𝑣−𝑣0) + (𝑟33+𝑡𝑍 𝑜/𝑝) 𝑓],
  [(𝑟12+𝑡𝑌 𝑚/𝑝)(𝑢−𝑢0) + (𝑟22+𝑡𝑌 𝑛/𝑝)(𝑣−𝑣0) + (𝑟32+𝑡𝑌 𝑜/𝑝) 𝑓] / [(𝑟13+𝑡𝑍 𝑚/𝑝)(𝑢−𝑢0) + (𝑟23+𝑡𝑍 𝑛/𝑝)(𝑣−𝑣0) + (𝑟33+𝑡𝑍 𝑜/𝑝) 𝑓]
),   (3.1)

where 𝑓 and (𝑢0, 𝑣0) denote the focal length and principal point of the ego camera, and (𝑡𝑋, 𝑡𝑌, 𝑡𝑍) = Rᵀt.

3.4 Depth Equivariant Backbone

The projective transformation in Eq. (3.1) from [76] is complicated and also involves rotations, and we do not know which convolution obeys this projective transformation. Hence, we simplify Eq. (3.1) under reasonable assumptions to obtain a familiar transformation for which the convolution is known.

Corollary 1.1. When the ego camera translates in depth without rotations (R = I), and the patch plane is "approximately" parallel to the image plane, the image ℎ is locally a scaled version of the second image ℎ′, independent of the focal length, i.e.,

T𝑠 : ℎ(𝑢−𝑢0, 𝑣−𝑣0) ≈ ℎ′( (𝑢−𝑢0)/(1+𝑡𝑍 𝑜/𝑝), (𝑣−𝑣0)/(1+𝑡𝑍 𝑜/𝑝) ),   (3.2)

where 𝑓 and (𝑢0, 𝑣0) denote the focal length and principal point of the ego camera, and 𝑡𝑍 denotes the ego translation in depth. See Sec. C.1.6 for the detailed explanation of Corollary 1.1. Corollary 1.1 says

T𝑠 : ℎ(𝑢−𝑢0, 𝑣−𝑣0) ≈ ℎ′( (𝑢−𝑢0)/𝑠, (𝑣−𝑣0)/𝑠 ),   (3.3)

where 𝑠 = 1 + 𝑡𝑍 𝑜/𝑝 denotes the scale and T𝑠 denotes the scale transformation. The scale 𝑠 < 1 suggests downscaling, while 𝑠 > 1 suggests upscaling. Corollary 1.1 shows that the transformation T𝑠 is independent of the focal length and that the scale is a linear function of the depth translation. Hence, a depth translation in the projective manifold induces a scale transformation and thus, depth equivariance in the projective manifold is scale equivariance in the Euclidean manifold. Mathematically, the desired equivariance is [T𝑠(ℎ) ∗ Ψ] = T𝑠[ℎ ∗ Ψ𝑠⁻¹], where Ψ denotes the filter (See Sec. C.1.7). As the CNN is not a scale equivariant (SE) architecture [227], we aim for a SE backbone, which makes the architecture equivariant to depth translations in the projective manifold.

Figure 3.2 (a) Scale Equivariance. We apply SES convolution [227] with two scales on a single-channel toy image ℎ. (b) Receptive fields of convolutions in the Euclidean manifold. Colors represent different weights, while shades represent the same weight. (c) Impact of discretization on log-polar convolution. SSIM is very low at small resolutions and is not 1 even after upscaling by 4. [Key: Up= Upscaling]
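As a quick sanity check of Corollary 1.1 (our toy illustration with assumed numbers), the snippet below projects points of a fronto-parallel patch plane with a pinhole camera before and after an ego translation along the optical axis; the projections scale uniformly about the principal point, and the scale does not depend on the focal length. Sign conventions for 𝑡𝑍 differ across formulations; the point is only that a depth translation induces a pure scaling of the projection.

    import numpy as np

    f, u0 = 700.0, 320.0                      # assumed focal length and principal point (pixels)
    Z0, tZ = 30.0, 6.0                        # plane depth and ego depth translation (meters)
    X = np.array([-5.0, -3.0, -1.5, 1.0, 2.5, 4.0])   # points on the patch plane Z = Z0

    u_before = f * X / Z0 + u0                # projections from the original camera position
    u_after  = f * X / (Z0 - tZ) + u0         # projections after moving tZ along the optical axis

    scale = (u_after - u0) / (u_before - u0)  # scaling about the principal point
    print(np.allclose(scale, Z0 / (Z0 - tZ))) # True: a single global scale, independent of f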
The scale transformation is a familiar transformation and SE convolutions are well known [68, 92, 227, 309]. Scale Equivariant Steerable (SES) Blocks. We use the existing SES blocks [226,227] to construct our Depth Equivariant Network (DEVIANT) backbone. As [226] does not construct SE-DLA-34 backbones, we construct our DEVIANT backbone as follows. We replace the vanilla convolutions by the SES convolutions [226] with the basis as Hermite polynomials. SES convolutions result in multi-scale representation of an input tensor. As a result, their output is five-dimensional instead of four-dimensional. Thus, we replace the 2D pools and batch norm (BN) by 3D pools and 3D BN respectively. The Scale-Projection layer [227] carries a max over the extra (scale) dimension to project five-dimensional tensors to four dimensions (See Fig. C.5) in the supplementary). Ablation in Sec. 3.5.3 confirms that BN and Pool (BNP) should also be SE for the best performance. The SES convolutions [68, 227, 309] are based on steerable-filters [62]. Steerable approaches [68] first pre-calculate the non-trainable multi-scale basis in the Euclidean manifold and then build filters by the linear combinations of the trainable weights w. The number of trainable weights w equals the number of filters at one particular scale. The linear combination of multi-scale basis ensures that the filters are also multi-scale. Thus, SES blocks bypass grid conversion and do not suffer from sampling effects. We show the convolution of toy image ℎ with a SES convolution in Fig. 3.2a. Let Ψ𝑠 denote 32 the filter at scale 𝑠. The convolution between downscaled image and filter T0.5(ℎ) ∗ Ψ0.5 matches the downscaled version of original image convolved with upscaled filter T0.5(ℎ ∗ Ψ1.0). Fig. 3.2a (right column) shows that the output of a CNN exhibits aliasing in general and is therefore, not scale equivariant. Log-polar Convolution: Impact of Discretization. An alternate way to convert the depth transla- tion 𝑡𝑍 of Eq. (3.2) to shift is by converting the images to log-polar space [313] around the principal point (𝑢0, 𝑣0), as ℎ(ln 𝑟, 𝜃) ≈ ℎ′ (cid:18) ln 𝑟 − ln (cid:18) 1+𝑡𝑍 (cid:19) 𝑜 𝑝 (cid:19) , , 𝜃 (3.4) with 𝑟 = √︁(𝑢−𝑢0)2+ (𝑣 − 𝑣0)2, and 𝜃 = tan−1 (cid:16) 𝑣−𝑣0 𝑢−𝑢0 scale to translation, so using convolution in the log-polar space is equivariant to the logarithm of (cid:17). The log-polar transformation converts the the depth translation 𝑡𝑍 . We show the receptive field of log-polar convolution in Fig. 3.2b. The log-polar convolution uses a smaller receptive field for objects closer to the principal point, while a larger field away from the principal point. We implemented log-polar convolution and found that its performance (See Tab. 3.11) is not acceptable, consistent with [227]. We attribute this behavior to the discretization of pixels and loss of 2D translation equivariance. Eq. (3.4) is perfectly valid in the continuous world (Note the use of parentheses instead of square brackets in Eq. (3.4)). However, pixels reside on discrete grids, which gives rise to sampling errors [112]. We discuss the impact of discretization on log-polar convolution in Sec. 3.5.2 and show it in Fig. 3.2c. Hence, we do not use log-polar convolution for the DEVIANT backbone. Comparison of Equivariance s for Monocular 3D Detection. We now compare equivariances for monocular 3D detection task. An ideal monocular detector should be equivariant to arbitrary 3D translations (𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ). 
Comparison of Equivariances for Monocular 3D Detection. We now compare equivariances for the monocular 3D detection task. An ideal monocular detector should be equivariant to arbitrary 3D translations (𝑡𝑋, 𝑡𝑌, 𝑡𝑍). However, most monocular detectors [109, 159] estimate the 2D projections of the 3D centers and the depth, which they back-project into the 3D world via the known camera intrinsics. Thus, a good enough detector should be equivariant to 2D translations (𝑡𝑢, 𝑡𝑣) of the projected centers as well as to depth translations (𝑡𝑍). Existing detector backbones [109, 159] are only equivariant to 2D translations as they use vanilla convolutions that produce 4D feature maps. A log-polar backbone is equivariant to the logarithm of depth translations but not to 2D translations. DEVIANT uses SES convolutions to produce 5D feature maps. The extra dimension in the 5D feature map captures the changes in scale (for depth), while these feature maps are individually equivariant to 2D translations (for the projected centers). Hence, DEVIANT augments the 2D translation equivariance (𝑡𝑢, 𝑡𝑣) of the projected centers with depth translation equivariance. We emphasize that although DEVIANT is not equivariant to arbitrary 3D translations in the projective manifold, it does provide equivariance to depth translations (𝑡𝑍) and is thus a first step towards the ideal equivariance. Our experiments (Sec. 3.5) show that even this additional equivariance benefits the monocular 3D detection task. This is expected because depth is the hardest parameter to estimate [166]. Tab. 3.1 summarizes these equivariances. Moreover, Tab. 3.10 empirically shows that 2D detection does not suffer and, therefore, confirms that DEVIANT indeed augments the 2D equivariance with depth equivariance. An idea similar to DEVIANT is optical expansion [280], which augments optical flow with scale information and benefits depth estimation.

3.5 Experiments

Our experiments use the KITTI [67], Waymo [230] and nuScenes [22] datasets. We modify the publicly-available PyTorch [184] code of GUP Net [159] and use the GUP Net model as our baseline. For DEVIANT, we keep the number of scales at three [226]. DEVIANT takes 8.5 hours to train and 0.04 s per image for inference on a single A100 GPU.

Evaluation Metrics. KITTI evaluates on three object categories: Easy, Moderate and Hard. It assigns each object to a category based on its occlusion, truncation, and height in the image space. KITTI uses the AP3D|R40 percentage metric on the Moderate category to benchmark models [67], following [220, 222]. Waymo evaluates on two object levels: Level_1 and Level_2. It assigns each object to a level based on the number of LiDAR points included in its 3D box. Waymo uses the APH3D percentage metric, which incorporates heading information into AP3D, to benchmark models. It also provides evaluation at three distances: [0, 30), [30, 50) and [50, ∞) meters.

Data Splits. We use the following splits of KITTI, Waymo and nuScenes:
• KITTI Test (Full) split: The official KITTI 3D benchmark [1] consists of 7,481 training and 7,518 testing images [67].
• KITTI Val split: It partitions the 7,481 training images into 3,712 training and 3,769 validation images [32].
• Waymo Val split: This split [199, 246] contains 52,386 training and 39,848 validation images from the front camera. We construct its training set by sampling every third frame from the training sequences, as in [199, 246].
• nuScenes Val split: It consists of 28,130 training and 6,019 validation images from the front camera [22]. We use this split for evaluation [218].

3.5.1 KITTI Test Mono3D

Cars. Tab. 3.3 lists the results of monocular 3D detection and BEV evaluation on KITTI Test cars. The results show that DEVIANT outperforms GUP Net and several other SoTA methods on both tasks. Except for DD3D [181] and MonoDistill [39], DEVIANT, an image-based method, also outperforms the methods that use extra information. Cyclists and Pedestrians. Tab. 3.4 lists the results of monocular 3D detection on KITTI Test cyclists and pedestrians.

Table 3.3 Results on KITTI Test cars at IoU3D ≥ 0.7. Previous results are from the leaderboard or papers. We show 3 methods in each Extra category and 6 methods in the image-only category.
[Key: Best, Second Best] Method Extra AutoShape [154] PCT [246] DFR-Net [312] MonoDistill [39] PatchNet-C [221] CaDDN [199] DD3D [181] MonoEF [307] Kinematic [17] GrooMeD-NMS [109] MonoRCNN [218] MonoDIS-M [220] Ground-Aware [151] MonoFlex [301] GUP Net [159] DEVIANT (Ours) (cid:17)) − − (cid:17)) [%](− AP BEV|𝑅40 [%](− AP 3D|𝑅40 Easy Mod Hard Easy Mod Hard 22.47 14.17 11.36 30.66 20.08 15.59 21.00 13.37 11.31 29.65 19.03 15.92 19.40 13.63 10.35 28.17 19.17 14.84 22.97 16.03 13.60 31.87 22.59 19.72 CAD Depth Depth Depth LiDAR 22.40 12.53 10.60 LiDAR 19.17 13.41 11.46 27.94 18.91 17.19 LiDAR 23.22 16.34 14.20 30.98 22.56 20.03 Odometry 21.29 13.87 11.71 29.03 19.70 17.26 19.07 12.72 9.17 26.69 17.52 13.10 18.10 12.32 9.65 26.19 18.27 14.05 18.36 12.65 10.03 25.48 18.11 14.10 16.54 12.97 11.04 24.45 19.25 16.87 21.65 13.25 9.91 29.81 17.98 13.08 19.94 13.89 12.07 28.23 19.75 16.89 20.11 14.20 11.77 21.88 14.46 11.89 29.65 20.44 17.43 Video − − − − − − − − − − − • KITTI Test (Full) split: Official KITTI 3D benchmark [1] consists of 7,481 training and 7,518 testing images [67]. • KITTI Val split: It partitions the 7,481 training images into 3,712 training and 3,769 validation images [32]. • Waymo Val split: This split [199, 246] contains 52,386 training and 39,848 validation images from the front camera. We construct its training set by sampling every third frame from the training sequences as in [199, 246]. • nuScenes Val split: It consists of 28,130 training and 6,019 validation images from the front camera [22]. We use this split for evaluation [218]. 3.5.1 KITTI Test Mono3D Cars. Tab. 3.3 lists out the results of monocular 3D detection and BEV evaluation on KITTI Test cars. Tab. 3.3 results show that DEVIANT outperforms the GUP Net and several other SoTA methods on both tasks. Except DD3D [181] and MonoDistill [39], DEVIANT, an image-based method, also outperforms other methods that use extra information. Cyclists and Pedestrians. Tab. 3.4 lists out the results of monocular 3D detection on KITTI Test 35 Table 3.4 Results on KITTI Test cyclists and pedestrians (Cyc/Ped) at IoU3D ≥ 0.5. Previous results are from the leader-board or papers. [Key: Best, Second Best] Method DDMP-3D [245] DFR-Net [312] MonoDistill [39] CaDDN [199] DD3D [181] MonoEF [307] MonoDIS-M [220] MonoFlex [301] GUP Net [159] DEVIANT (Ours) (cid:17)) Ped AP 3D|𝑅40 (cid:17)) Extra Cyc AP 3D|𝑅40 [%](− Easy Mod Hard 2.32 4.18 2.50 Depth 3.10 5.69 3.58 Depth 2.40 Depth 5.53 2.81 3.30 LiDAR 7.00 3.41 1.31 LiDAR 2.39 1.52 0.71 Odometry 1.80 0.92 0.48 1.17 0.54 1.67 3.39 2.10 2.09 4.18 2.65 2.59 5.05 3.13 − − − − [%](− Easy Mod Hard 3.01 4.93 3.55 3.39 6.09 3.62 7.45 12.79 8.17 6.76 12.87 8.14 8.05 13.91 9.30 2.21 4.27 2.79 4.42 7.79 5.14 6.81 11.89 8.16 7.87 14.72 9.53 7.69 13.43 8.65 Table 3.5 Results on KITTI Val cars. Comparison with bigger CNN backbones in Tab. C.4. 
[Key: Best, Second Best, −= No pretrain ] Method Extra IoU3D ≥ 0.7 [%](− IoU3D ≥ 0.5 [%](− [%](− (cid:17)) AP BEV|𝑅40 AP 3D|𝑅40 [%](− Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 28.12 20.39 16.34 − − 38.39 27.53 24.44 47.16 34.65 28.47 − 24.31 18.47 15.76 33.09 25.40 22.16 65.69 49.35 43.49 71.45 53.11 46.94 (cid:17)) AP BEV|𝑅40 (cid:17)) AP 3D|𝑅40 − − − − − − − − − − − − (cid:17)) − − − − 33.5 26.0 22.6 26.8 20.2 16.7 − − − − − − − − − − − − − − − − − − − − − − − − Depth Depth Depth LiDAR 23.57 16.31 13.84 − LiDAR 24.51 17.03 13.25 − LiDAR LiDAR − − − − − − Odometry 18.26 16.30 15.24 26.07 25.21 21.61 57.98 51.80 49.34 63.40 61.13 53.22 19.76 14.10 10.47 27.83 19.72 15.10 55.44 39.47 31.26 61.79 44.68 34.56 16.61 13.19 10.65 25.29 19.22 15.30 − 17.45 13.66 11.68 24.97 19.33 17.01 55.41 43.42 37.81 60.73 46.87 41.89 19.67 14.32 11.27 27.38 19.75 15.92 55.62 41.07 32.89 61.83 44.98 36.29 23.63 16.16 12.06 − 23.64 17.51 14.83 − 22.76 16.46 13.72 31.07 22.94 19.75 57.62 42.33 37.59 61.78 47.06 40.88 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 Video − − − − − − − − − 60.92 42.18 32.02 − − − − − − − − − − − − − − − − − DDMP-3D [245] PCT [246] MonoDistill [39] CaDDN [199] PatchNet-C [221] DD3D (DLA34) [181] DD3D −(DLA34) [181] MonoEF [307] Kinematic [17] MonoRCNN [218] MonoDLE [166] GrooMeD-NMS [109] Ground-Aware [151] MonoFlex [301] GUP Net (Reported)[159] GUP Net (Retrained)[159] DEVIANT (Ours) Cyclist and Pedestrians. The results show that DEVIANT achieves SoTA results in the image-only category on the challenging Cyclists, and is competitive on Pedestrians. 3.5.2 KITTI Val Mono3D Cars. Tab. 3.5 summarizes the results of monocular 3D detection and BEV evaluation on KITTI Val split at two IoU3D thresholds of 0.7 and 0.5 [35, 109]. We report the median model over 5 runs. The results show that DEVIANT outperforms the GUP Net [159] baseline by a significant margin. The biggest improvements shows up on the Easy set. Significant improvements are also 36 (a) Linear Scale (b) Log Scale Figure 3.3 AP3D at different depths and IoU3D thresholds on KITTI Val Split. Table 3.6 Cross-dataset evaluation of the KITTI Val model on KITTI Val and nuScenes frontal Val cars with depth MAE (− (cid:17) ). [Key: Best, Second Best] KITTI Val 1 nuScenes frontal Val Method 0−20 20−40 40−∞ All 0−20 20−40 40−∞ All 10.36 2.67 M3D-RPN [15] 0.56 8.65 2.39 MonoRCNN [218] 0.46 GUP Net [159] 6.20 1.45 0.45 4.50 1.26 0.40 DEVIANT 2.73 1.26 0.94 2.59 1.14 0.94 1.85 0.89 0.82 1.80 0.87 0.76 1.33 1.27 1.10 1.09 3.06 2.84 1.70 1.60 on the Moderate and Hard sets. Interestingly, DEVIANT also outperforms DD3D [181] by a large margin when the large-dataset pretraining is not done (denoted by DD3D −). AP3D at different depths and IoU3D thresholds. We next compare the AP3D of DEVIANT and GUP Net in Fig. 3.3 at different distances in meters and IoU3D matching criteria of 0.3 − 0.7 as in [109]. Fig. 3.3 shows that DEVIANT is effective over GUP Net [159] at all depths and higher (cid:17) IoU3D thresholds. Cross-Dataset Evaluation. Tab. 3.6 shows the result of our KITTI Val model on the KITTI Val and nuScenes [22] frontal Val images, using mean absolute error (MAE) of the depth of the boxes [218]. More details are in Sec. C.3.1. DEVIANT outperforms GUP Net on most of the metrics on both the datasets, which confirms that DEVIANT generalizes better than CNNs. 
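For reference, the depth MAE reported in Tab. 3.6 can be computed in a few lines once predictions are matched to GT boxes; the snippet below is our hedged sketch, which assumes the matching (e.g., by 2D/3D overlap, following [218]) is already done.

    import numpy as np

    def depth_mae_by_range(gt_z, pred_z, bins=((0, 20), (20, 40), (40, np.inf))):
        """Mean absolute depth error of matched boxes, binned by GT depth (meters)."""
        gt_z, pred_z = np.asarray(gt_z, float), np.asarray(pred_z, float)
        err = np.abs(pred_z - gt_z)
        out = {}
        for lo, hi in bins:
            mask = (gt_z >= lo) & (gt_z < hi)
            out[f"{lo}-{hi}"] = err[mask].mean() if mask.any() else float("nan")
        out["All"] = err.mean()
        return out

    # Toy usage with made-up matched depths (meters)
    print(depth_mae_by_range([8.0, 25.0, 52.0], [8.4, 26.1, 54.5]))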
DEVIANT performs exceedingly well in the cross-dataset evaluation than [15, 159, 218]. We believe this happens because [15, 159, 218] rely on data or geometry to get the depth, while DEVIANT is equivariant to the depth translations, and therefore, outputs consistent depth. So, DEVIANT is more robust to data distribution changes. 37 Table 3.7 Scale Augmentation vs Scale Equivariance on KITTI Val cars. [Key: Best, Eqv= Equivariance, Aug= Augmentation] Method GUP Net [159] DEVIANT ✓ ✓ Scale Scale Eqv Aug AP 3D|𝑅40 IoU3D ≥ 0.7 IoU3D ≥ 0.5 [%](− [%](− (cid:17)) AP 3D|𝑅40 (cid:17)) AP BEV|𝑅40 [%](− Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 20.82 14.15 12.44 29.93 20.90 17.87 62.37 44.40 39.61 66.81 48.09 43.14 ✓ 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 21.33 14.77 12.57 28.79 20.28 17.59 59.31 43.25 37.64 63.94 47.02 41.12 ✓ 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 (cid:17)) AP BEV|𝑅40 [%](− (cid:17)) Table 3.8 Comparison of Equivariant Architectures on KITTI Val cars. [Key: Best, Eqv= Equivariance, †= Retrained] IoU3D ≥ 0.7 IoU3D ≥ 0.5 Method (cid:17)) Eqv [%](− AP BEV|𝑅40 AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 4.41 3.06 2.79 20.09 13.80 12.78 26.51 18.49 17.36 1.94 1.26 1.09 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 2D +Depth 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 AP BEV|𝑅40 AP 3D|𝑅40 [%](− [%](− 2D (cid:17)) (cid:17)) DETR3D† [257] Learned GUP Net [159] DEVIANT Alternatives to Equivariance. We now compare with alternatives to equivariance in the following paragraphs. (a) Scale Augmentation. A withstanding question in machine learning is the choice between equivariance and data augmentation [63]. Tab. 3.7 compares scale equivariance and scale augmen- tation. GUP Net [159] uses scale-augmentation and therefore, Tab. 3.7 shows that equivariance also benefits models which use scale-augmentation. This agrees with Tab. 2 of [227], where they observe that both augmentation and equivariance benefits classification on MNIST-scale dataset. (b) Other Equivariant Architectures. We now benchmark adding depth (scale) equivariance to a 2D translation equivariant CNN and a transformer which learns the equivariance. Therefore, we compare DEVIANT with GUP Net [159] (a CNN), and DETR3D [257] (a transformer) in Tab. 3.8. As DETR3D does not report KITTI results, we trained DETR3D on KITTI using their public code. DEVIANT outperforms GUP Net and also surpasses DETR3D by a large margin. This happens because learning equivariance requires more data [265] compared to architectures which hardcode equivariance like CNN or DEVIANT. (c) Dilated Convolution. DEVIANT adjusts the receptive field based on the object scale, and so, we compare with the dilated CNN (DCNN) [291] and D4LCN [52] in Tab. 3.9. The results 38 Table 3.9 Comparison with Dilated Convolution on KITTI Val cars. [Key: Best] IoU3D≥ 0.7 IoU3D≥ 0.5 Method Extra AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard AP BEV|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 [%](− [%](− [%](− (cid:17)) (cid:17)) (cid:17)) D4LCN [52] Depth 22.32 16.20 12.30 31.53 22.58 17.87 DCNN [291] DEVIANT 21.66 15.49 12.90 30.22 22.06 19.01 57.54 43.12 38.80 63.29 46.86 42.42 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 − − − − − − − − show that DCNN performs sub-par to DEVIANT. 
show that DCNN performs sub-par to DEVIANT. This is expected because dilation corresponds to integer scales [267], while the scaling is generally a float in monocular detection. D4LCN [52] uses monocular depth as input to adjust the receptive field. DEVIANT (without depth) also outperforms D4LCN on Hard cars, which are more distant.

(d) Other Convolutions. We now compare with other known convolutions in the literature, such as the log-polar convolution [313], dilated convolution [291] and DISCO [225], in Tab. 3.11. The results show that the log-polar convolution does not work well, and SES convolutions are better suited to embed depth (scale) equivariance. As described in Sec. 3.4, we investigate the behavior of the log-polar convolution through a small experiment. We calculate the SSIM [258] of the original image and the image obtained after the upscaling, log-polar, inverse log-polar, and downscaling blocks. We then average the SSIM over all KITTI Val images. We repeat this experiment for multiple image heights and scaling factors. The ideal SSIM should be one. However, Fig. 3.2c shows that the SSIM does not reach 1 even after upscaling by 4. This result confirms that the log-polar convolution loses information at low resolutions, resulting in inaccurate detection. Next, the results show that the dilated convolution [291] performs sub-par to DEVIANT. Moreover, DISCO [225] also does not outperform the SES convolution, which agrees with the 2D tracking results of [225].

(e) Feature Pyramid Network (FPN). Our baseline GUP Net [159] uses FPN [138], and Tab. 3.5 shows that DEVIANT outperforms GUP Net. Hence, we conclude that equivariance also benefits models which use FPN.

Comparison of Equivariance Error. We next quantitatively evaluate the scale equivariance of DEVIANT vs. GUP Net [159], using the equivariance error metric [227]. The equivariance error Δ is the normalized difference between the scaled feature map and the feature map of the scaled image, and is given by

Δ = (1/𝑁) Σ_{𝑖=1}^{𝑁} ||T𝑠𝑖 Φ(ℎ𝑖) − Φ(T𝑠𝑖 ℎ𝑖)||²2 / ||T𝑠𝑖 Φ(ℎ𝑖)||²2,

where Φ denotes the neural network, T𝑠𝑖 is the scaling transformation for the image 𝑖, and 𝑁 is the total number of images. The equivariance error is zero if the scale equivariance is perfect. We give a short numerical sketch of this metric below. We plot the log of this error at different blocks of the DEVIANT and GUP Net backbones, and also at different downscalings of KITTI Val images, in Fig. 3.4. The plots show that DEVIANT has lower equivariance error than GUP Net. This is expected since the feature maps of the proposed DEVIANT are additionally equivariant to scale transformations (depth translations). We also visualize the equivariance error for a validation image and for the objects of this image in Figs. C.8a and C.8b in the supplementary. The qualitative plots also show a lower error for the proposed DEVIANT, which agrees with Fig. 3.4. Fig. C.8b shows that the equivariance error is particularly low for nearby cars, which also justifies the good performance of DEVIANT on Easy (nearby) cars in Tabs. 3.3 and 3.5.

Figure 3.4 Log Equivariance Error (Δ) comparison for DEVIANT and GUP Net at (a) different blocks with random image scaling factors, and (b) different image scaling factors at depth 3. DEVIANT shows lower scale equivariance error than vanilla GUP Net [159].

Does 2D Detection Suffer? We now investigate whether 2D detection suffers from using the DEVIANT backbone in Tab. 3.10. The results show that DEVIANT introduces a minimal decrease in the 2D detection performance.
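A minimal sketch of the equivariance error Δ defined above follows (our illustration; the feature resolutions are aligned by a final resize since strided backbones rarely produce exactly matching sizes).

    import torch
    import torch.nn.functional as F

    def scale_T(x, s):
        """Scale transformation T_s: bilinear rescaling of an image or feature map by s."""
        return F.interpolate(x, scale_factor=s, mode="bilinear", align_corners=False)

    @torch.no_grad()
    def equivariance_error(phi, images, scales):
        """Delta = (1/N) sum_i ||T_{s_i} Phi(h_i) - Phi(T_{s_i} h_i)||^2 / ||T_{s_i} Phi(h_i)||^2."""
        total = 0.0
        for h, s in zip(images, scales):
            a = scale_T(phi(h), s)                 # scaled feature map
            b = phi(scale_T(h, s))                 # feature map of the scaled image
            b = F.interpolate(b, size=a.shape[-2:], mode="bilinear", align_corners=False)
            total += ((a - b).pow(2).sum() / a.pow(2).sum()).item()
        return total / len(images)

    # Toy usage: a random convolution stands in for the backbone Phi.
    phi = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
    imgs = [torch.randn(1, 3, 96, 320) for _ in range(4)]
    print(equivariance_error(phi, imgs, scales=[0.80, 0.85, 0.90, 0.95]))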
This is consistent with [226], who report that 2D tracking improves with the SE networks. 40 Table 3.10 3D and 2D detection on KITTI Val cars. IoU ≥ 0.7 IoU ≥ 0.5 Method AP 3D|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard GUP Net [159] 21.10 15.48 12.88 96.78 88.87 79.02 58.95 43.99 38.07 99.52 91.89 81.99 DEVIANT (Ours) 24.63 16.54 14.52 96.68 88.66 78.87 61.00 46.00 40.18 97.12 91.77 81.93 AP 2D|𝑅40 AP 3D|𝑅40 AP 2D|𝑅40 [%](− [%](− [%](− [%](− (cid:17)) (cid:17)) (cid:17)) Table 3.11 Ablation studies on KITTI Val cars. Change from DEVIANT: IoU3D ≥ 0.7 [%](− IoU3D ≥ 0.5 [%](− To [%](− (cid:17)) AP 3D|𝑅40 (cid:17)) AP BEV|𝑅40 (cid:17)) AP BEV|𝑅40 Changed From −− SES− Convolution SES− (cid:17) SES− (cid:17) SES− (cid:17) 5% Downscale 10% − (cid:17) 10% − 20% (cid:17) SE− Vanilla (cid:17) 3 − 1 (cid:17) 2 3 − (cid:17) DEVIANT (best) (cid:17) AP 3D|𝑅40 [%](− Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard (cid:17) 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 Vanilla Log-polar [313] 9.19 6.77 5.78 16.39 11.15 9.80 40.51 27.62 23.90 45.66 31.34 25.80 21.66 15.49 12.90 30.22 22.06 19.01 57.54 43.12 38.80 63.29 46.86 42.42 Dilated[291] 20.21 13.84 11.46 28.56 19.38 16.41 55.22 39.76 35.37 59.46 43.16 38.52 DISCO[225] 24.24 16.51 14.43 31.94 22.86 19.82 60.64 44.46 40.02 64.68 49.30 43.49 22.19 15.85 13.48 31.15 23.01 19.90 61.24 44.93 40.22 67.46 50.10 43.83 24.39 16.20 14.36 32.43 22.53 19.70 62.81 46.14 40.38 67.87 50.23 44.08 23.20 16.29 13.63 31.76 23.23 19.97 61.90 46.66 40.61 67.37 50.31 43.93 24.15 16.48 14.55 32.42 23.17 20.07 61.05 46.34 40.46 67.36 50.32 44.07 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 𝛼 BNP Scales — (cid:17)) 3.5.3 Ablation Studies on KITTI Val Tab. 3.11 compares the modifications of our approach on KITTI Val cars based on the experi- mental settings of Sec. 3.5. (a) Floating or Integer Downscaling? We next investigate the question that whether one should use floating or integer downscaling factors for DEVIANT. We vary the downscaling factors as 1+𝛼 , 1(cid:17). We find that 𝛼 of 10% works the best. We again bring up the dilated convolution (Dilated) results at this point because dilation (1+2𝛼, 1+𝛼, 1) and therefore, our scaling factor 𝑠 = 1+2𝛼 , 1 (cid:16) 1 is a scale equivariant operation for integer downscaling factors [267] (𝛼 = 100%, 𝑠 = 0.5). Tab. 3.11 results suggest that the downscaling factors should be floating numbers. (b) SE BNP. As described in Sec. 3.4, we ablate DEVIANT against the case when only convolutions are SE but BNP layers are not. So, we place Scale-Projection [227] immediately after every SES convolution. Tab. 3.11 shows that such a network performs slightly sub-optimal to our final model. (c) Number of Scales. We next ablate against the usage of Hermite scales. Using three scales performs better than using only one scale especially on Mod and Hard objects, and slightly better than using two scales. 41 Table 3.12 Waymo Val vehicles detection results. 
[Key: Best, Second Best] IoU3D Difficulty Method CaDDN [199] PatchNet [162] in [246] PCT [246] 0.7 Level_1 M3D-RPN [15] in [199] GUP Net (Retrained) [159] DEVIANT (Ours) CaDDN [199] PatchNet [162] in [246] PCT [246] 0.7 Level_2 M3D-RPN [15] in [199] GUP Net (Retrained) [159] DEVIANT (Ours) CaDDN [199] PatchNet [162] in [246] PCT [246] 0.5 Level_1 M3D-RPN [15] in [199] GUP Net (Retrained) [159] DEVIANT (Ours) CaDDN [199] PatchNet [162] in [246] PCT [246] 0.5 Level_2 M3D-RPN [15] in [199] GUP Net (Retrained) [159] DEVIANT (Ours) 3.5.4 Waymo Val Mono3D Extra APH3D [%](− (cid:17)) AP3D [%](− (cid:17)) 0-30 30-50 50-∞ All All 0.39 0.89 0.35 2.28 2.69 0.38 0.66 0.33 2.14 2.52 LiDAR 5.03 14.54 1.47 0.10 1.67 0.13 0.03 Depth 3.18 0.27 0.07 Depth 1.12 0.18 0.02 − 6.15 0.81 0.03 − 6.95 0.99 0.02 − LiDAR 4.49 14.50 1.42 0.09 1.67 0.13 0.03 Depth 3.18 0.27 0.07 Depth 0.18 0.02 1.12 − 6.13 0.78 0.02 − 6.93 0.95 0.02 − 0-30 30-50 50-∞ 4.99 14.43 1.45 0.10 1.63 0.12 0.03 0.39 3.15 0.27 0.07 0.88 1.10 0.18 0.02 0.34 2.27 6.11 0.80 0.03 6.90 0.98 0.02 2.67 4.45 14.38 1.41 0.09 1.63 0.11 0.03 0.36 3.15 0.26 0.07 0.66 1.10 0.17 0.02 0.33 6.08 0.77 0.02 2.12 6.87 0.94 0.02 2.50 LiDAR 17.54 45.00 9.24 0.64 17.31 44.46 9.11 0.62 9.75 0.96 0.18 2.74 Depth 2.92 10.03 1.09 0.23 4.15 14.54 1.75 0.39 Depth 4.20 14.70 1.78 0.39 3.63 10.70 2.09 0.21 3.79 11.14 2.16 0.26 − 10.02 24.78 4.84 0.22 9.94 24.59 4.78 0.22 − 10.98 26.85 5.13 0.18 10.89 26.64 5.08 0.18 − LiDAR 16.51 44.87 8.99 0.58 16.28 44.33 8.86 0.55 2.28 Depth 9.73 0.97 0.16 2.42 10.01 1.07 0.22 4.15 14.51 1.71 0.35 Depth 4.03 14.67 1.74 0.36 3.46 10.67 2.04 0.20 3.61 11.12 2.12 0.24 − 9.39 24.69 4.67 0.19 9.31 24.50 4.62 0.19 − 10.29 26.75 4.95 0.16 10.20 26.54 4.90 0.16 − We also benchmark our method on the Waymo dataset [230] which has more variability than KITTI. Tab. 3.12 shows the results on Waymo Val split. The results show that DEVIANT outperforms the baseline GUP Net [159] on multiple levels and multiple thresholds. The biggest gains are on the nearby objects which is consistent with Tabs. 3.3 and 3.5. Interestingly, DEVIANT also outperforms PatchNet [162] and PCT [246] without using depth. Although the performance of DEVIANT lags CaDDN [199], it is important to stress that CaDDN uses LiDAR data in training, while DEVIANT is an image-only method. 3.6 Conclusions This chapter studies the modeling error in monocular 3D detection in detail and takes the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. Since the depth is the hardest to estimate for this task, this chapter proposes Depth Equivariant 42 Network (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates and therefore, DEVIANT achieves SoTA detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. Future works include applying the idea to Pseudo-LiDAR [254], and monocular 3D tracking. Limitation. DEVIANT does not model 3D equivariance but only a special case of 3D equivariance. Considerably less number of boxes are detected in the cross-dataset evaluation. 
CHAPTER 4
SEABIRD: SEGMENTATION IN BIRD'S VIEW WITH DICE LOSS IMPROVES MONOCULAR 3D DETECTION OF LARGE OBJECTS

Monocular 3D detectors achieve remarkable performance on cars and smaller objects. However, their performance drops on larger objects, leading to fatal accidents. Some attribute the failures to training data scarcity or to the receptive field requirements of large objects. In this chapter, we highlight this understudied problem of generalization to large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to the noise of larger objects. To bridge this gap, we comprehensively investigate regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird's View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection, with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI-360 leaderboard and improves existing detectors on the nuScenes leaderboard, particularly for large objects.

4.1 Introduction

The monocular 3D object detection (Mono3D) task aims to estimate both the 3D position and dimensions of objects in a scene from a single image. Its applications span autonomous driving [108, 132, 181], robotics [213], and augmented reality [2, 172, 183, 293], where an accurate 3D understanding of the environment is crucial. Our study focuses explicitly on 3D object detectors applied to autonomous vehicles (AVs), since the challenges and motivations differ drastically across applications.

AVs demand object detectors that generalize to diverse intrinsics [14], camera rigs [94, 104], rotations [177], weather and geographical conditions [54], and that are also robust to adversarial examples [310]. Since each of these poses a significant challenge, recent works focus exclusively on the generalization of object detectors to all these out-of-distribution shifts. However, our focus is on a generalization of another type, which, thus far, has been understudied in the literature: Mono3D generalization to large objects. Large objects like trailers, buses and trucks are harder to detect [268] in Mono3D, sometimes resulting in fatal accidents [23, 60].

Figure 4.1 Teaser. (a) Improve KITTI-360 SoTA. SoTA frontal detectors struggle with large objects (low AP𝐿𝑟𝑔) even on the nearly balanced KITTI-360 dataset. Our proposed SeaBird achieves significant Mono3D improvements, particularly for large objects. (b) Improve nuScenes Val SoTA. SeaBird also improves two SoTA BEV detectors, BEVerse-S [303] and HoP [311], on the nuScenes dataset, particularly for large objects. (c) Theory Advancement. Plot of the convergence variance Var(𝜖) of dice and regression losses with the noise 𝜎 in depth prediction. The 𝑦-axis denotes the deviation from the optimal weight, so the lower the better. SeaBird leverages the dice loss, which we prove is more noise-robust than regression losses for large objects.
Some attribute these failures to training data scarcity [308] or the receptive field requirements [268] of large objects, but, to the best of our knowledge, no existing literature provides a comprehensive analytical explanation for this phenomenon. The goal of this chapter is, thus, to bring understanding and a first analytical approach to this real-world problem in the AV space – Mono3D generalization to large objects. We conjecture that the generalization issue stems not only from limited training data or larger receptive field but also from the noise sensitivity of depth regression losses in Mono3D. To substantiate our argument, we analyze the Mono3D performance of state-of-the-art (SoTA) frontal detectors on the KITTI-360 dataset [136], which includes almost equal number (1 : 2) of large objects and cars. We observe that SoTA detectors struggle with large objects on this dataset (Fig. 4.1a). Next, we carefully investigate the SGD convergence of losses used in Mono3D task and mathematically prove that the dice loss, widely used in BEV segmentation, exhibits superior noise-robustness than the regression losses, particularly for large objects (Fig. 4.1c). Thus, the dice loss facilitates better model convergence than regression losses, improving Mono3D of large 45 objects. Incorporating dice loss in detection introduces unique challenges. Firstly, the dice loss does not apply to sparse detection centers and only incorporates depth information when used in the BEV space. Secondly, naive joint training of Mono3D and BEV segmentation tasks with image inputs does not always benefit Mono3D task [132, 167] due to negative transfer [45], and the underlying reasons remain unclear. Fortunately, many Mono3D segmentors and detectors are in the BEV space, where the BEV segmentor can seamlessly apply dice loss and the BEV detector can readily benefit from the segmentor in the same space. To mitigate negative transfer, we find it effective to train the BEV segmentation head on the foreground detection categories. Building upon our theoretical findings about the dice loss, we propose a simple and effective pipeline called Segmentation in Bird’s View (SeaBird) for enhancing Mono3D of large objects. SeaBird employs a sequential approach for the BEV segmentation and Mono3D heads (Fig. 4.2). SeaBird first utilizes a BEV segmentation head to predict the segmentation of only foreground objects, supervised by the dice loss. The dice loss offers superior noise-robustness for large objects, ensuring stable convergence, while focusing on foreground objects in segmentation mitigates negative transfer. Subsequently, SeaBird concatenates the resulting BEV segmentation map with the original BEV features as an additional feature channel and feeds this concatenated feature to a Mono3D head supervised by Mono3D losses1. Building upon this, we adopt a two-stage training pipeline: the first stage exclusively focuses on training the BEV segmentation head with dice loss, which fully exploits its noise-robustness and superior convergence in localizing large objects. The second stage involves both the detection loss and dice loss to finetune the Mono3D head. In our experiments, we first comprehensively evaluate SeaBird and conduct ablations on the balanced single-camera KITTI-360 dataset [136]. SeaBird outperforms the SoTA baselines by a substantial margin. Subsequently, we integrate SeaBird as a plug-in-and-play module into two SoTA detectors on the multi-camera nuScenes dataset [22]. 
SeaBird again significantly improves the original detectors, particularly on large objects. Additionally, SeaBird consistently enhances Mono3D performance across backbones with those two SoTA detectors (Fig. 4.1b), demonstrating its utility in both edge and cloud deployments.

1 Only the Mono3D head predicts the additional 3D attributes, namely the object's height and elevation.

Figure 4.2 SeaBird Pipeline. SeaBird uses the predicted BEV foreground segmentation (For. Seg.) map to predict accurate 3D boxes for large objects. The SeaBird training protocol involves BEV segmentation pre-training with the noise-robust dice loss and Mono3D fine-tuning.

In summary, we make the following contributions:
• We highlight the understudied problem of generalization to large objects in Mono3D, showing that even on nearly balanced datasets, SoTA frontal models struggle to generalize due to the noise sensitivity of regression losses.
• We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case, and provide empirical support for more general settings.
• We propose SeaBird, which treats the BEV segmentation head on foreground objects and the Mono3D head sequentially, and trains them with a two-stage protocol to fully harness the noise-robustness of the dice loss.
• We empirically validate our theoretical findings and show significant improvements, particularly for large objects, on both the KITTI-360 and nuScenes leaderboards.

4.2 Related Works

Mono3D. Mono3D's popularity stems from its high accessibility from consumer vehicles compared to LiDAR/Radar-based detectors [155, 215, 290] and its computational efficiency compared to stereo-based detectors [34]. Earlier approaches [31, 186] leverage hand-crafted features, while recent ones use deep networks. Advancements include introducing new architectures [89, 217, 275], equivariance [29, 108], losses [15, 35], uncertainty [111, 159], and incorporating auxiliary tasks such as depth [175, 301], NMS [109, 147, 216], corrected extrinsics [307], CAD models [25, 117, 154] or LiDAR [199] in training. A particular line of work called Pseudo-LiDAR [165, 254] shows generalization by first estimating the depth, followed by a point cloud-based 3D detector. Another line of work encodes the image into latent BEV features [164] and attaches multiple heads for downstream tasks [303]. Some focus on pre-training [272] and rotation-equivariant convolutions [59]. Others introduce new coordinate systems [95], queries [128, 161], or positional encoding [219] in a transformer-based detection framework [24]. Some use pixel-wise depth [88], object-wise depth [38, 40, 141], or depth-aware queries [296], while many utilize temporal fusion [17, 150, 248, 261] to boost performance. A few use a longer frame history [182, 311], distillation [105, 260] or stereo [125, 261]. We refer to [163, 167] for detailed surveys. SeaBird also builds upon the BEV-based framework since it flexibly accepts single or multiple images as input and uses the dice loss. Different from the majority of other detectors, SeaBird improves Mono3D of large objects using the power of the dice loss. SeaBird is also the first work to mathematically prove and justify this loss choice for large objects.

BEV Segmentation. BEV segmentation typically utilizes BEV features transformed from 2D image features. Various methods encode single or multiple images into BEV features using MLPs [180] or transformers [205, 211].
Some employ a learned depth distribution [83, 188], while others use attention [211, 305] or attention fields [37]. Image2Maps [211] utilizes polar rays, while PanopticBEV [72] uses transformers. FIERY [83] introduces uncertainty modelling and temporal fusion, while Simple-BEV [74] uses radar aggregation. Since BEV segmentation lacks object height and elevation, one also needs a Mono3D head to predict 3D boxes.

Joint Mono3D and BEV Segmentation. Joint 3D detection and BEV segmentation using LiDAR data [58, 215] as input benefits both tasks [252, 281]. However, joint learning on image data often hinders detection performance [132, 167, 272, 303], while the BEV segmentation improvement is inconsistent across categories [167]. Unlike these works, which treat the two heads in parallel and decrease Mono3D performance [167], SeaBird treats the heads sequentially and increases Mono3D performance, particularly for large objects.

Figure 4.3 (a) Problem setup. The single-layer neural network takes an image h (or its features) and predicts the depth ẑ and the object length ℓ. The noise 𝜂 is the additive error in depth prediction and is a normal random variable. The GT depth 𝑧 supervises the predicted depth ẑ with a loss L in training. We assume the network predicts the GT length ℓ. Frontal detectors directly regress the depth with the L1, L2, or Smooth L1 loss, while SeaBird projects to the BEV plane and supervises through the dice loss L𝑑𝑖𝑐𝑒. (b) Shifting of the predictions (blue) in BEV along the ray due to the noise 𝜂. (c) Cross-section (CS) view along the ray with classification scores 𝑃(𝑍).

4.3 SeaBird

SeaBird is driven by a deep understanding of the distinctions between monocular regression and BEV segmentation losses. Thus, in this section, we delve into the problem and discuss existing results. We then present our theoretical findings and, subsequently, introduce our pipeline. We introduce the problem and refer to Lemma 1 from the literature [113, 214], which evaluates loss quality by measuring the deviation of the trained weight (after SGD updates) from the optimal weight. Fig. 4.3a illustrates the problem setup. Figs. 4.3b and 4.3c visualize the BEV and cross-section views, respectively. Since this deviation depends on the gradient variance of the loss, we next derive the gradient variance of the dice loss in Lemma 2. By comparing the distance between the trained weight and the optimal weight, we assess the effectiveness of the dice loss versus the MAE (L1) and MSE (L2) losses in Lemma 3, and choose the representation and loss combination. Combining these findings, we establish in Th. 2 that a model trained with the dice loss achieves better AP than a model trained with regression losses. Finally, we present our pipeline, SeaBird, which integrates BEV segmentation supervised by the dice loss for Mono3D.

4.3.1 Background and Problem Statement

Mono3D networks [108, 159] commonly employ regression losses, such as the L1 or L2 loss, to compare the predicted depth with the ground truth (GT) depth [108, 303]. In contrast, BEV segmentation utilizes the dice loss [211] or cross-entropy loss [83] at each BEV location, comparing it with the GT. Despite these distinct loss functions, we evaluate their effectiveness under an idealized model, where we measure the model quality by the expected deviation of the trained weight (after SGD updates) from the optimal weight [214].
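For concreteness, a minimal soft dice loss over BEV maps is sketched below (our illustration; the exact smoothing and reduction used by SeaBird may differ).

    import torch

    def soft_dice_loss(pred, target, eps=1e-6):
        """Soft dice loss for BEV segmentation.
        pred:   (B, C, H, W) predicted foreground probabilities in [0, 1].
        target: (B, C, H, W) binary GT BEV maps rendered from the GT 3D boxes."""
        dims = (0, 2, 3)                               # sum over batch and the BEV grid
        inter = (pred * target).sum(dims)
        denom = pred.sum(dims) + target.sum(dims)
        dice = (2.0 * inter + eps) / (denom + eps)     # per-class dice coefficient
        return (1.0 - dice).mean()                     # average loss over foreground classes

    # Toy usage
    pred = torch.rand(2, 3, 128, 128)
    target = (torch.rand(2, 3, 128, 128) > 0.8).float()
    print(soft_dice_loss(pred, target))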
Consider a linear regression model with trainable weight w for depth prediction ˆ𝑧 from an image h. Assume the noise 𝜂 is an additive error in depth prediction and is a normal random variable N (0, 𝜎2). Also, assume SGD optimizes the model parameters with loss function L during training with square summable steps 𝑠 𝑗 , i.e. 𝑠 = lim 𝑡→∞ 𝑡 (cid:205) 𝑗=1 𝑠2 𝑗 exists and 𝜂 is independent of the image. Then, the expected deviation of the trained weight Lw∞ from the optimal weight w∗ obeys E (cid:16)(cid:13) (cid:13) Lw∞−w∗ (cid:17) 2 (cid:13) (cid:13) 2 = 𝑐1Var(𝜖) + 𝑐2, (4.1) where 𝜖 = 𝜕L (𝜂) 𝜕𝜂 is the gradient of the loss L wrt noise, 𝑐1 = 𝑠E(h𝑇 h) and 𝑐2 are constants independent of the loss. We refer to Sec. D.1.1 for the proof. Eq. (4.1) demonstrates that training losses L exhibit varying gradient variances Var(𝜖). Hence, comparing this term for different losses allows us to evaluate their quality. 4.3.2 Loss Analysis: Dice vs. Regression Given that [214] provides the gradient variance Var(𝜖), for L1 and L2 losses, we derive the corresponding gradient variance for dice and IoU losses in this chapter to facilitate comparison. First, we express the dice loss, L𝑑𝑖𝑐𝑒, as a function of noise 𝜂 as per its definition from [211] for Fig. 4.3c as: L𝑑𝑖𝑐𝑒 (𝜂) = 1−2 Pred GT Pred + GT = 1−2 ℓ−|𝜂| 2ℓ , |𝜂| ≤ ℓ 1 , |𝜂| ≥ ℓ    50 Table 4.1 Convergence variance of training loss functions. Gradient variance of L𝑑𝑖𝑐𝑒 is more noise-robust for large objects, resulting in better detectors. We do not analyze cross-entropy loss theoretically since its Var(𝜖) is infinite, but empirically in Tab. 4.5. Loss L L1 [214] (App. D.1.2.1) L2 [214] (App. D.1.2.2) Dice (Lemma 2) Gradient 𝜖 sgn(𝜂) 𝜂 , |𝜂| ≤ ℓ , |𝜂| ≥ ℓ (cid:40) sgn(𝜂) ℓ 0 (cid:17) ) Var(𝜖) (− 1 𝜎2 (cid:18) ℓ Erf √ (cid:19) 1 ℓ2 2𝜎 (4.2) =⇒ L𝑑𝑖𝑐𝑒 (𝜂) = |𝜂| ℓ , |𝜂| ≤ ℓ 1 , |𝜂| ≥ ℓ ,    where ℓ denotes the object length. Eq. (4.2) shows that the dice loss L𝑑𝑖𝑐𝑒 depends on the object size ℓ. With the given dice loss L𝑑𝑖𝑐𝑒, we proceed to derive the following lemma: Lemma 2. Gradient variance of dice loss. Let 𝜂 = N (0, 𝜎2) be an additive normal random variable and ℓ be the object length. Let Erf be the error function. Then, the gradient variance of the dice loss Var𝑑𝑖𝑐𝑒 (𝜖) wrt noise 𝜂 is Var𝑑𝑖𝑐𝑒 (𝜖) = 1 ℓ2 Erf (cid:19) . (cid:18) √ ℓ 2𝜎 (4.3) We refer to Sec. D.1.2.3 for the proof. Eq. (4.3) shows that gradient variance of the dice loss Var𝑑𝑖𝑐𝑒 (𝜖) also varies inversely to the object size ℓ and the noise deviation 𝜎 (See Sec. D.1.5). These two properties of dice loss are particularly beneficial for large objects. Tab. 4.1 summarizes these losses, their gradients, and gradient variances. With Var𝑑𝑖𝑐𝑒 (𝜖) derived for the dice loss, we now compare the deviation of trained weight with the deviations from L1 or L2 losses, leading to our next lemma. Lemma 3. Dice model is closer to optimal weight than regression loss models. Based on Lemma 1 (cid:17) and assuming the object length ℓ is a constant, if 𝜎𝑚 is the solution of the equation 𝜎2 = 1 √ and the noise deviation 𝜎 ≥ 𝜎𝑐 = max (cid:16) 2 dice loss L𝑑𝑖𝑐𝑒 is better than the converged weight 𝑟w∞ with the L1 or L2 loss, i.e. (cid:16) ℓ √ 2𝜎 , then the converged weight 𝑑w∞ with the ℓ Erf−1(ℓ2) ℓ2 Erf 𝜎𝑚, (cid:17) E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13)2 ≤ E (∥𝑟w∞ − w∗∥2) . (4.4) 51 Figure 4.4 Plot of convergence variance Var(𝜖) of loss functions with the noise 𝜎. Dice loss has minimum convergence variance with large noise, resulting in better detectors for large objects. 
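To make the comparison in Tab. 4.1 and Fig. 4.4 concrete, the short NumPy sketch below estimates the three gradient variances by Monte Carlo, using the gradients exactly as listed in Tab. 4.1, and checks the dice estimate against the closed form of Eq. (4.3). It is an illustrative check under the assumed noise model η ∼ N(0, σ²); the object length and noise levels are example values, and the snippet is not part of our released training code.

# Monte-Carlo check of the gradient variances listed in Tab. 4.1 (and plotted in Fig. 4.4).
# Illustrative sketch under the stated noise model; not part of the SeaBird code release.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

def grad_l1(eta):              # gradient of the L1 loss w.r.t. the noise: sgn(eta)
    return np.sign(eta)

def grad_l2(eta):              # gradient of the L2 loss w.r.t. the noise: eta
    return eta

def grad_dice(eta, ell):       # dice gradient, Eq. (4.2): sgn(eta)/ell if |eta| <= ell, else 0
    return np.where(np.abs(eta) <= ell, np.sign(eta) / ell, 0.0)

def var_dice_closed_form(sigma, ell):   # Eq. (4.3): (1/ell^2) Erf(ell / (sqrt(2) sigma))
    return erf(ell / (np.sqrt(2.0) * sigma)) / ell**2

ell = 12.0                     # e.g. a trailer-sized object, roughly 12 m long
for sigma in (0.1, 0.5, 1.0, 2.0):
    eta = rng.normal(0.0, sigma, size=1_000_000)
    print(f"sigma={sigma:3.1f}  "
          f"Var_L1={grad_l1(eta).var():.3f} (expect 1)  "
          f"Var_L2={grad_l2(eta).var():.3f} (expect {sigma**2:.2f})  "
          f"Var_dice={grad_dice(eta, ell).var():.5f} "
          f"(Eq. 4.3: {var_dice_closed_form(sigma, ell):.5f})")

For a trailer-sized object (ℓ = 12 m), the dice gradient variance stays near 1/ℓ² ≈ 0.007, while the L1 and L2 gradient variances are 1 and σ² respectively, matching the ordering in Fig. 4.4 once σ exceeds the small threshold σ𝑐 of Lemma 3.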
We refer to Sec. D.1.3 for the proof. Beyond noise deviation threshold 𝜎𝑐 = max (cid:16) ℓ Erf−1(ℓ2) the convergence gap between dice and regression losses widens as the object size ℓ increases. 𝜎𝑚, √ 2 (cid:17), Fig. 4.4 depicts the superior convergence of dice loss compared to regression losses under increas- ing noise deviation 𝜎 pictorially. Taking the car category with ℓ = 4𝑚 and the trailer category with ℓ = 12𝑚 as examples, the noise threshold 𝜎𝑐, beyond which dice loss exhibits better convergence, are 𝜎𝑐 = 0.3𝑚 and 𝜎𝑐 = 0.1𝑚 respectively. Combining these lemmas, we finally derive: Theorem 2. Dice model has better AP3D. Assume the object length ℓ is a constant and depth is the only source of error for detection. Based on Lemma 1, if 𝜎𝑚 is the solution of the equation 𝜎2 = 1 ℓ2 Erf (cid:17) (cid:16) ℓ √ 2𝜎 and the noise deviation 𝜎 ≥ 𝜎𝑐 = max (cid:16) 𝜎𝑚, √ 2 ℓ Erf−1(ℓ2) (cid:17) , then the Average Precision (AP3D) of the dice model is better than AP3D from L1 or L2 model. We refer to Sec. D.1.4 and Tab. D.1 for the proof and assumption comparisons respectively. 4.3.3 Discussions Comparing classification and regression losses. We now explain how we compare classification (dice) and regression losses. Our analysis assumes one-class classification in BEV segmentation with perfect predicted foreground scores 𝑃(𝑍) = 1 (Fig. 4.3c). Hence, dice analysis focuses on object localization along the BEV ray (Fig. 4.3b) instead of classification probabilities thus allowing comparison of dice and regression losses. Lemma 1 links these losses by comparing the deviation 52 of learned and optimal weights. Regression losses work better than dice loss for regression tasks? Our key message is NOT always! We mathematically and empirically show that regression losses work better only when the noise 𝜎 is less in Fig. 4.4. 4.3.4 SeaBird Pipeline Architecture. Based on theoretical insights of Th. 2, we propose SeaBird, a novel pipeline, in Fig. 4.2. To effectively involve the dice loss which originally designed for segmentation task to assist Mono3D, SeaBird treats BEV segmentation of foreground objects and Mono3D head sequentially. Although BEV segmentation map provides depth information (hardest [108, 166] Mono3D parameter), it lacks elevation and height information for Mono3D task. To address this, SeaBird concatenates BEV features with predicted BEV segmentation (Fig. 4.2), and feeds them into the detection head to predict 3D boxes in a 7-DoF representation: BEV 2D position, elevation, 3D dimension, and yaw. Unlike most works [132, 303] that treat segmentation and detection branches in parallel, the sequential design directly utilizes refined BEV localization information to enhance Mono3D. Ablations in Sec. 4.4.2 validate this design choice. We defer the details of baselines to Sec. 4.4. Notably, our foreground BEV segmentation supervision with dice loss does not require dense BEV segmentation maps, as we efficiently prepare them from GT 3D boxes. Training Protocol. SeaBird trains the BEV segmentation head first, employing the dice loss between the predicted and the GT BEV semantic segmentation maps, which fully utilizes the dice loss’s noise-robustness and superior convergence in localizing large objects. In the second stage, we jointly fine-tune the BEV segmentation head and the Mono3D head. We validate the effectiveness of training protocol via the ablation in Sec. 4.4.2. 4.4 Experiments Datasets. 
Our experiments utilize two datasets with large objects: KITTI-360 [136] and nuScenes [22] encompassing both single-camera and multi-camera configurations. We opt for KITTI-360 instead of KITTI [67] for four reasons: 1) KITTI-360 includes large objects, while KITTI does not; 2) KITTI-360 exhibits a balanced distribution of large objects and cars; 3) an extended version, 53 Table 4.2 Datasets comparison. We use KITTI-360 and nuScenes datasets for our experiments. See Fig. D.2 for the skewness. KITTI [67] Waymo [230] KITTI-360 [136] nuScenes [22] Large objects Balanced BEV Seg. GT #images (k) ✕ ✕ ✕ 4 ✕ ✕ ✓ 52 [108] ✓ ✓ ✓ 49 ✓ ✕ ✓ 168 KITTI-360 PanopticBEV [72], includes BEV segmentation GT for ablation studies, while KITTI 3D detection and the Semantic KITTI dataset [6] do not overlap in sequences; 4) KITTI-360 contains about 10× more images than KITTI. We compare these datasets in Tab. 4.2 and show their skewness in Fig. D.2. Data Splits. We use the following splits of the two datasets: • KITTI-360 Test split: This benchmark [136] contains 300 training and 42 testing windows. These windows contain 61,056 training and 910 testing images. • KITTI-360 Val split: It partitions the official train into 239 train and 61 validation windows [136]. This split contains 48,648 training and 1,294 validation images. • nuScenes Test split: It has 34,149 training and 6,006 testing samples [22] from the six cameras. This split contains 204,894 training and 36,036 testing images. • nuScenes Val split: It has 28,130 training and 6,019 validation samples [22] from the six cameras. This split contains 168,780 training and 36,114 validation images. Evaluation Metrics. We use the following metrics: • Detection: KITTI-360 uses the mean AP 3D 50 percentage across categories to benchmark models [136]. nuScenes [22] uses the nuScenes Detection Score (NDS) as the metric. NDS is the weighted average of mean AP (mAP) and five TP metrics. We also report mAP over large categories (truck, bus, trailers and construction vehicles), cars, and small categories (pedestrians, motorcyle, bicycle, cone and barrier) as AP𝐿𝑟𝑔, AP𝐶𝑎𝑟 and AP𝑆𝑚𝑙 respectively. • Semantic Segmentation: We report mean IoU over foreground and all categories at 200×200 resolution [211, 303]. KITTI-360 Baselines and SeaBird Implementation. Our evaluation on the KITTI-360 focuses on 54 the detectors taking single-camera image as input. We evaluate SeaBird pipelines against six SoTA frontal detectors: GrooMeD-NMS [109], MonoDLE [166], GUP Net [159], DEVIANT [108], Cube R-CNN [14] and MonoDETR [300]. The choice of these models encompasses anchor [14,109] and anchor-free methods [108, 166], CNN [159, 166], group CNN [108] and transformer-based [300] architectures. Further, MonoDLE normalizes loss with GT box dimensions. Due to SeaBird’s BEV-based approach, we do not integrate it with these frontal view detectors. Instead, we extend two SoTA image-to-BEV segmentation methods, Image2Maps (I2M) [211] and PanopticBEV (PBEV) [72] with SeaBird. Since both BEV segmentors already include their own implementations of the image encoder, the image-to-BEV transform, and the segmentation head, implementing the SeaBird pipeline only involves adding a detection head, which we chose to be Box Net [289]. SeaBird extensions employ dice loss for BEV segmentation, Smooth L1 losses [69] in the BEV space to supervise the BEV 2D position, elevation, and 3D dimension, and cross entropy loss to supervise orientation. nuScenes Baselines and SeaBird Implementation. 
We integrate SeaBird into two prototypical BEV-based detectors, BEVerse [303] and HoP [311] to prove the effectiveness of SeaBird. Our choice of these models encompasses both transformer and convolutional backbones, multi-head and single-head architectures, shorter and longer frame history, and non-query and query-based detectors. This comprehensively allows us to assess SeaBird’s impact on large object detection. BEVerse employs a multi-head architecture with a transformer backbone and shorter frame history. HoP is single-head query-based SoTA model utilizing BEVDet4D [87] with CNN backbone, and longer frame history. BEVerse [303] includes its own implementation of detection head and BEV segmentation head in parallel. We reorganize the two heads to follow our sequential design and adhere to our training protocol for network training. Since HoP [311] lacks a BEV segmentation head, we incorporate the one from BEVerse into this HoP extension with SeaBird. 55 Table 4.3 KITTI-360 Test detection results. SeaBird pipelines outperform all monocular baselines, and also outperform old LiDAR baselines. Click for the KITTI-360 leaderboard as well as our PBEV+SeaBird and I2M+SeaBird entries. [Key: Best, Second Best, L= LiDAR, C= Camera, †= Retrained]. Modality L C ✓ ✓ Method Venue AP 3D 50 (− (cid:17)) AP 3D 25 (− mAP [%] mAP [%] (cid:17)) L-VoteNet [194] L-BoxNet [194] ICCV19 ICCV19 ✓ GrooMeD † [109] CVPR21 ✓ MonoDLE † [166] CVPR21 ✓ GUP Net † [159] ICCV21 ✓ DEVIANT † [108] ECCV22 ✓ Cube R-CNN † [14] CVPR23 ✓ MonoDETR † [300] ICCV23 CVPR24 ✓ I2M+SeaBird CVPR24 ✓ PBEV+SeaBird 3.40 4.08 0.17 0.85 0.87 0.88 0.80 0.79 3.14 4.64 30.61 23.59 16.12 28.99 27.25 26.96 15.57 27.13 35.04 37.12 4.4.1 KITTI-360 Mono3D KITTI-360 Test. Tab. 4.3 presents KITTI-360 leaderboard results, demonstrating the superior performance of both SeaBird pipelines compared to all monocular baselines across all metrics. Moreover, PBEV+SeaBird also outperforms both legacy LiDAR baselines on all metrics, while I2M+SeaBird surpasses them on the AP 3D 25 metric. KITTI-360 Val. Tab. 4.4 presents the results on KITTI-360 Val split, reporting the median model over three different seeds with the model being the final checkpoint as [108]. SeaBird pipelines outperform all monocular baselines on all but one metric, similar to Tab. 4.3 results. Due to the dice loss in SeaBird, the biggest improvement shows up on larger objects. Tab. 4.4 also includes the upper-bound oracle, where we train the Box Net with the GT BEV segmentation maps. Lengthwise AP Analysis. Th. 2 states that training a model with dice loss should lead to lower errors and, consequently, a better detector for large objects. To validate this claim, we analyze the detection performance with AP 3D 50 and AP 3D 25 metrics against the object’s lengths. For this analysis, we divide objects into four bins based on their GT object length (max of sizes): [0, 5), [5, 10), [10, 15), [15 + 𝑚. Fig. 4.5 shows that SeaBird pipelines excel for large objects, where the baselines’ performance drops significantly. BEV Semantic Segmentation. Tab. 4.4 also presents the BEV semantic segmentation results on the KITTI-360 Val split. SeaBird pipelines outperforms the baseline I2M [211], and achieve 56 (a) AP 3D 50 comparison. (b) AP 3D 25 comparison. Figure 4.5 Lengthwise AP Analysis of four SoTA detectors and two SeaBird pipelines on KITTI-360 Val split. SeaBird pipelines outperform all baselines on large objects with over 10m in length. Table 4.4 KITTI-360 Val detection and segmentation results. 
SeaBird pipelines outperform all frontal monocular baselines, particularly for large objects. Dice loss in SeaBird also improves the BEV only (w/o dice) version of SeaBird pipelines. I2M and PBEV are BEV segmentors. So, we do not report their Mono3D performance. [Key: Best, Second Best, †= Retrained] View Method BEV Seg Loss Frontal BEV GrooMeD-NMS † [109] MonoDLE † [166] GUP Net † [159] DEVIANT † [108] Cube R-CNN † [14] MonoDETR † [300] I2M † [211] I2M+SeaBird I2M+SeaBird PBEV † [72] PBEV+SeaBird PBEV+SeaBird Oracle (GT BEV) − Dice ✕ Dice CE ✕ Dice (cid:17)) (cid:17)) (cid:17)) Venue AP 3D 50 [%](− AP 3D 25 [%](− 33.04 16.52 44.81 22.88 45.11 22.83 44.25 22.39 22.52 11.63 43.24 22.02 38.21 19.11 50.52 27.58 50.52 25.75 48.57 24.79 27.12 16.34 48.69 26.60 AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large 0.00 0.00 CVPR21 4.64 0.94 CVPR21 0.98 0.54 ICCV21 1.01 0.53 ECCV22 5.55 0.75 CVPR23 4.50 0.81 ICCV23 ICRA22 − − 45.09 24.98 26.33 52.31 39.32 4.86 CVPR24 43.19 25.95 35.76 52.22 43.99 CVPR24 8.71 RAL22 − − CVPR24 45.37 26.51 29.72 53.86 41.79 7.64 CVPR24 13.22 42.46 27.84 37.15 52.53 44.84 BEV Seg IoU [%](− MFor Car − − − − − − − − − − − − 29.25 38.04 3.54 7.07 31.42 39.61 36.18 48.54 1.57 1.47 36.17 48.04 26.77 51.79 39.28 49.74 56.62 53.18 100.00 100.00 100.00 − − − − − − 20.46 0.00 23.23 23.83 2.07 24.30 − − − − − − − − − similar performance to PBEV [72] in BEV segmentation. We retrain all BEV segmentation models only on foreground detection categories for a fair comparison. 4.4.2 Ablation Studies on KITTI-360 Val Tab. 4.5 ablates I2M [211] +SeaBird on the KITTI-360 Val split, following the experimental settings of Sec. 4.4.1. Dice Loss. Tab. 4.5 shows that both dice loss and BEV representation are crucial to Mono3D of large objects. Replacing dice loss with MSE or Smooth L1 loss, or only BEV representation (w/o dice) reduces Mono3D performance. Mono3D and BEV Segmentation. Tab. 4.5 shows that removing the segmentation head hinders 57 Table 4.5 Ablation studies on KITTI-360 Val. [Key: Best, Second Best] Changed From − To Dice − Dice − Dice − Dice − (cid:17) No Loss Smooth L1 MSE CE Segmentation Loss Semantic Category (cid:17) (cid:17) (cid:17) Segmentation Head Yes− No (cid:17) Yes− Detection Head No (cid:17) For.− All (cid:17) For.− Car (cid:17) Sequential− (cid:17) Yes− S+J− S+J− − Multi-head Arch. BEV Shortcut Training Protocol I2M+SeaBird (cid:17) (cid:17) (cid:17) Parallel No (cid:17) J [303] D+J [281] (cid:17)) (cid:17)) 7.07 0.00 AP 3D 50 [%](− AP 3D 25 [%](− BEV Seg IoU [%](− 45.09 24.98 26.33 52.31 39.32 3.54 36.69 22.16 31.01 47.51 39.26 17.16 34.67 25.92 35.59 21.32 30.90 44.71 37.81 17.46 34.85 26.16 35.60 21.33 33.22 47.60 40.41 21.83 38.11 29.97 39.24 23.38 31.83 47.88 39.86 − (cid:17)) AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor MAll 4.86 − 7.63 − 7.04 − 7.06 − 7.52 − − − 1.61 4.17 9.12 6.53 7.42 6.07 8.71 − 44.12 22.87 15.36 51.76 33.56 19.26 34.46 26.86 24.34 43.01 23.59 22.68 51.58 37.13 40.28 20.14 40.27 24.69 32.45 51.55 42.00 22.19 40.37 31.28 38.12 22.33 32.05 52.62 42.34 23.00 40.39 31.70 42.73 25.08 31.94 49.88 40.91 22.91 39.66 31.29 43.43 24.75 29.24 52.96 41.10 20.71 35.68 28.20 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 − 20.46 38.04 29.25 − − − − − − − − − − − − Mono3D performance. Conversely, removing detection head also diminishes the BEV segmentation performance for the segmentation model. This confirms the mututal benefit of sequential BEV segmentation on foreground objects and Mono3D. Semantic Category in BEV Segmentation. 
We next analyze whether background categories play any role in Mono3D. Tab. 4.5 shows that changing the foreground (For.) categories to foreground + background (All) does not help Mono3D. This aligns with the observations of [167, 272, 303] that report lower performance on joint Mono3D and BEV segmentation with all categories. We believe this decrease happens because the network gets distracted while getting the background right. We also predict one foreground category (Car) instead of all in BEV segmentation. Tab. 4.5 shows that predicting all foreground categories in BEV segmentation is crucial for overall good Mono3D. Multi-head Architecture. SeaBird employs a sequential architecture (Arch.) of segmentation and detection heads instead of parallel architecture. Tab. 4.5 shows that the sequential architecture outperforms the parallel one. We attribute this Mono3D boost to the explicit object localization provided by segmentation in the BEV plane. BEV Shortcut. Sec. 4.3.4 mentions that SeaBird’s Mono3D head utilizes both the BEV segmen- tation map and BEV features. Tab. 4.5 demonstrates that providing BEV features to the detection head is crucial for good Mono3D. This is because the BEV map lacks elevation information, and incorporating BEV features helps estimate elevation. Training Protocol. SeaBird trains segmentor first and then jointly trains detector and segmentor 58 Table 4.6 nuScenes Test detection results. SeaBird pipelines achieve the best AP𝐿𝑟 𝑔 among methods without Class Balanced Guided Sampling (CBGS) [308] and future frames. Results are from the nuScenes leaderboard or corresponding chapters on V2-99 or R101 backbones. [Key: Best, Second Best, S= Small, ∗= Reimplementation, §= CBGS, = Future Frames.] (cid:35)(cid:32) Resolution Method (cid:17)) AP𝐶𝑎𝑟 (− (cid:17)) AP𝑆𝑚𝑙 (− (cid:17)) Venue BBone AAAI23 BEVDepth [127] in [101] R101 AAAI23 BEVStereo [125] in [101] R101 ICCV23 R101 P2D [101] ArXiv Swin-S BEVerse-S [303] ICCV23 R101 HoP ∗ [311] CVPR24 R101 HoP+SeaBird ECCV22 V2-99 SpatialDETR [53] ICCV23 V2-99 3DPPE [219] CVPR23 R101 X3KDall [105] V2-99 PETRv2 [150] ICCV23 V2-99 CVPR23 VEDet [29] V2-99 CVPR23 FrustumFormer [256] ICCV23 V2-99 MV2D [259] V2-99 HoP ∗ [311] ICCV23 V2-99 CVPR24 HoP+SeaBird V2-99 SA-BEV § [299] ICCV23 ICCV23 V2-99 FB-BEV § [134] V2-99 CVPR23 CAPE § [273] ICCV23 V2-99 SparseBEV R101 ParametricBEV [283] ICCV23 R101 NeurIPS22 UVTR [126] V2-99 BEVFormer [132] ECCV22 V2-99 AAAI23 PolarFormer [95] V2-99 NeurIPS23 STXD [91] [142] (cid:35)(cid:32) AP𝐿𝑟𝑔 (− − − − 24.4 36.0 36.6 30.2 − − 36.4 37.1 − − 37.1 38.4 40.5 39.3 41.3 45.6 − 35.1 34.4 36.8 − 512×1408 640×1600 900×1600 − − − 60.4 65.0 65.8 61.0 − − 66.7 68.5 − − 68.7 70.2 68.9 71.7 71.4 76.3 − 67.3 67.7 68.4 − − − − 47.0 53.9 54.7 48.5 − − 55.6 57.7 − − 55.6 57.4 60.5 61.6 63.3 68.8 − 52.9 55.2 55.5 − (cid:17)) mAP(− 39.6 40.4 43.6 39.3 47.9 48.6 42.5 46.0 45.6 49.0 50.5 51.6 51.1 49.4 51.1 53.3 53.7 55.3 60.3 46.8 47.2 48.9 49.3 49.7 (cid:17)) NDS(− 48.3 50.2 53.0 53.1 57.5 57.0 48.7 51.4 56.1 58.2 58.5 58.9 59.6 58.9 59.7 62.4 62.4 62.8 67.5 49.5 55.1 56.9 57.2 58.3 Table 4.7 nuScenes Val detection results. SeaBird pipelines outperform the two baselines BEVerse and HoP, particularly for large objects. We train all models without CBGS. See Tab. D.9 for a detailed comparison. 
[Key: S= Small, T= Tiny, = Released, ∗= Reimplementation] Resolution Method 256×704 512×1408 640×1600 BEVerse-T [303] +SeaBird HoP [311] +SeaBird BEVerse-S [303] +SeaBird HoP ∗ [311] +SeaBird HoP ∗ [311] +SeaBird (cid:17)) (cid:17)) (cid:17)) R50 46.6 57.2 Swin-T NDS (− mAP (− 32.1 AP𝐿𝑟𝑔 (− 18.5 (cid:17)) AP𝑆𝑚𝑙 (− 38.8 (cid:17)) AP𝐶𝑎𝑟 (− 53.4 BBone Venue ArXiv CVPR24 19.5 (+1.0) 54.2 (+0.8) 41.1 (+2.3) 33.8 (+1.5) 48.1 (+1.7) ICCV23 27.4 CVPR24 28.2 (+0.8) 58.6 (+1.4) 47.8 (+1.4) 41.1 (+1.2) 51.5 (+0.6) ArXiv CVPR24 24.6 (+3.7) 58.7 (+2.5) 45.0 (+2.8) 38.2 (+3.0) 51.3 (+1.8) ICCV23 31.4 CVPR24 32.9 (+1.5) 65.0 (+1.3) 53.1 (+0.6) 46.2 (+1.0) 54.7 (–0.3) ICCV23 36.5 CVPR24 40.3 (+3.8) 71.7 (+2.6) 58.8 (+2.7) 52.7 (+3.1) 60.2 (+1.9) V2-99 R101 Swin-S 56.2 63.7 69.1 39.9 35.2 45.2 49.6 46.4 50.9 42.2 49.5 52.5 55.0 56.1 58.3 20.9 (S+J). We compare with direct joint training (J) of [303] and training detection followed by joint training (D+J) of [281]. Tab. 4.5 shows that SeaBird training protocol works best. 59 4.4.3 nuScenes Mono3D We next benchmark SeaBird on nuScenes [22], which encompasses more diverse object cate- gories such as trailers, buses, cars and traffic cones, compared to KITTI-360 [136]. nuScenes Test. Tab. 4.6 presents the results of incorportaing SeaBird to the HoP models with the V2-99 and R101 backbones. SeaBird with both V2-99 and R101 backbones outperform several SoTA methods on the nuScenes leaderboard, as well as the baseline HoP, on nearly every metric. Interestingly, SeaBird pipelines also outperform several baselines which use higher resolution (900×1600) inputs. Most importantly, SeaBird pipelines achieve the highest AP𝐿𝑟𝑔 performance, providing empirical support for the claims of Th. 2. nuScenes Val. Tab. 4.7 showcases the results of integrating SeaBird with BEVerse [303] and HoP [311] at multiple resolutions, as described in [303,311]. Tab. 4.7 demonstrates that integrating SeaBird consistently improves these detectors on almost every metric at multiple resolutions. The improvements on AP𝐿𝑟𝑔 empirically support the claims of Th. 2 and validate the effectiveness of dice loss and BEV segmentation in localizing large objects. 4.5 Conclusions This chapter highlights the understudied problem of Mono3D generalization to large objects. Our findings reveal that modern frontal detectors struggle to generalize to large objects even when trained on balanced datasets. To bridge this gap, we investigate the regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss outperforms regression losses in noise-robustness and model convergence for large objects for a simplified case. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird’s View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation with the dice loss for Mono3D. SeaBird achieves SoTA results on the KITTI-360 leaderboard and consistently improves existing detectors on the nuScenes leaderboard, particularly for large objects. We hope that this initial step towards generalization will contribute to safer AVs. Limitation. SeaBird does not fully solve the problem of generalization to large objects. 60 CHAPTER 5 CHARM3R: TOWARDS CAMERA HEIGHT AGNOSTIC MONOCULAR 3D OBJECT DETECTOR To this end, we attempt generalizing Mono3D networks to occlusion, dataset and object sizes. 
Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plucker embeddings, image transformations or data augmentation. This chapter takes a step towards this understudied problem by investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height vari- ations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under cam- era height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R significantly improves generalization to unseen camera heights, achieving SoTA performance on the CARLA dataset. 5.1 Introduction Monocular 3D object detection (Mono3D) task uses a single image to determine both the 3D location and dimensions of objects. This technology is essential for augmented reality [2, 172, 183, 293], robotics [213], and self-driving cars [108, 132, 181], where accurate 3D understanding of the environment is crucial. Our research specifically focuses on using 3D object detectors applied to autonomous vehicles (AVs), as they have unique challenges and requirements. AVs necessitate detectors that are robust to a wide range of intrinsic and extrinsic factors, including intrinsics [14], domains [133], object size [110], rotations [177, 307], weather conditions [137, 179], and adversarial examples [310]. Existing research primarily focusses on generalizing object detectors to these failure modes. However, this work investigates the generalization of Mono3D to another type, which, thus far, has been relatively understudied in the literature – 61 Figure 5.1 Teaser. Changing ego height at inference quickly drops Mono3D performance of SoTA detectors. A height change Δ𝐻 of 0.76𝑚 in inference drops AP3D [%] by absolute 35 points. (a) AP 3D 70 [%] Results. (b) AP 3D 50 [%] Results. (c) Depth error trend on changing ego heights. Figure 5.2 Performance Comparison. The performance of SoTA detector GUP Net [159] drops significantly with changing ego heights in inference. Ground-based model shows contrasting depth error (extrapolation) trend compared to regression-based depth models. Our proposed CHARM3R exhibits greater robustness to such variations by averaging regression and ground-based depth estimates. All methods, except the Oracle, are trained on car-height data Δ𝐻 = 0𝑚 and tested on data from bot to truck heights. Mono3D generalization to unseen ego camera heights. The ego height of autonomous vehicles (AVs) varies significantly across different platforms and deployment scenarios. While almost all training data is collected from a specific ego height, such as that of a passenger car, AVs are now deployed with substantially different ego height such as small bots or trucks. Collecting, labeling datasets and retraining models for each possible height is not scalable [104], computationally expensive and impractical. Therefore, our work aims to address the challenge of generalizing Mono3D models to unseen ego heights. 
Generalizing Mono3D to unseen ego heights from single ego height data is challenging due to 62 Test EgoTrain EgoΔHGroundImage3D Detector3D Boxes the following five reasons. First, neural models excel at In-Domain (ID) generalization, but struggle with unseen Out-Of-Domain (OOD) generalization [237, 276]. Second, ego height changes induce projective transformations [76] that CNNs [43], DEVIANT [108] or ViT [55] backbones do not effectively handle [212]. Third, existing projective equivariant backbones [168, 176] are limited to single-transform-per-image scenarios, while every pixel in a driving image undergoes a different depth-dependent transform. Fourth, the non-linear nature [21, 76] of projective transformations makes interpolation difficult. Finally, disentangled learning does not work for this problem since such approaches need at least two height data, while the training data here is from single height. Note that the generalization from single height to multi heights is more practical since multi-height data is unavailable in almost all real datasets. We first systematically analyze and quantify the impact of ego height on the performance of Mono3D models trained on a single ego height. Leveraging the extended CARLA dataset [104], we evaluate the performance of state-of-the-art (SoTA) Mono3D models under multiple ego heights. Our analysis reveals that SoTA Mono3D models exhibit significant performance degradation when faced with large height changes in inference (Figs. 5.2a and 5.2b). Additionally, we empirically observe a consistent negative trend in the regressed object depth under height changes (Fig. 5.2c). Furthermore, we decompose the performance impact into individual sub-tasks and identify depth estimation as the primary contributor to this degradation. Recent papers address ego height changes by using Plucker embeddings [4], transforming target- height images to the original height, assuming constant depth [129], or by retraining with augmented data [104]. While these techniques do offer some effectiveness, image transformation fails (Fig. 5.6) under significant height changes due to real-world depth variations. The augmentation strategy requires complicated pipelines for data synthesis at target heights and also falls short when the target height is OOD or when the target height is unknown apriori during training. To effectively generalize Mono3D to unseen ego heights, a detector should first disentangle the depth representation from ego parameters in training and produce a new representation with new ego parameters in inference, while also canceling the trends. We propose using the projected 63 bottom 3D center and ground depth in addition to the regressed depth. While the ground depth is easily calculated from ego parameters and height, and can be changed based on the ego height, its direct application to Mono3D models is sub-optimal (a reason why ground plane is not used alone). However, we observe a consistent positive trend in ground depth, which contrasts with the negative trend in regressed depths. By averaging both depth estimates within the model, we effectively cancel these opposing trends and improve Mono3D generalization to unseen ego heights. In summary the main contributions of this work include: • We attempt the understudied problem of OOD ego height robustness in Mono3D models from single height data. 
• We mathematically prove systematic negative and positive trends in the regressed and ground- based object depths, respectively, under ego height changes under simplified assumptions (Th. 3 and 4). • We propose simple averaging of these depth estimates within the model to effectively counteract these opposing trends and generalize to unseen ego heights (Sec. 5.4.3). • We empirically demonstrate SoTA robustness to unseen ego height changes on the CARLA dataset (Tab. 5.2). 5.2 Related Works Extrapolation / OOD Generalization. Neural models excel at ID generalization, but struggle at OOD generalization [237, 276]. There are two major classes of methods for good OOD classifica- tion. The first does not use target data and relies on diversifying data [235], features [240, 286], predictions [119], gradients [207, 236] or losses [193, 208, 210]. Another class finetunes on small target data [103]. None of these papers attempt OOD generalization for regression tasks. Mono3D. Mono3D has gained significant popularity, offering a cost-effective and efficient solution for perceiving the 3D world. Unlike its more expensive LiDAR and radar counterparts [155, 215, 290], or its computationally intensive stereo-based cousins [34], Mono3D relies solely on a single camera or multiple cameras with little overlaps. Earlier approaches to this task [31, 186] relied on hand-crafted features, while the recent advancements use deep models. Researchers explored 64 a variety of approaches to improve performance, including architectural innovations [89, 275], equivariance [29, 108], losses [15, 35], uncertainty [111, 159] and depth estimation [175, 279, 301]. A few use NMS [109, 147], corrected extrinsics [307], CAD models [25, 117, 154] or LiDAR [199] in training. Other innovations include Pseudo-LiDAR [165,254], diffusion [196,274], BEV feature encoding [96, 131, 303] or transformer-based [24] methods with modified positional encoding [82, 219, 233], queries [36, 93, 128, 296] or query denoising [140]. Some use pixel-wise depth [88] or object-wise depth [38, 40, 141]. Many utilize temporal fusion with short [17, 150, 248, 261] or long frame history [28, 182, 311] to boost performance. A few use distillation [100, 260], stereo [125, 261] or loss [110, 148] to improve these results further. For a comprehensive overview, we redirect readers to the surveys [163, 167]. CHARM3R selects representative Mono3D models and improves their extrapolation to unseen camera heights. Camera Parameter Robustness. While several works aim for robust LiDAR-based detec- tors [27, 84, 255, 277, 282], planners [285] and map generators [197], fewer studies focus on generalizing image-based detectors. Existing image-based techniques, such as self-training [130], adversarial learning [249], perspective debiasing [158], and multi-view depth constraints [26], primarily address datasets with variations in camera intrinsics and minor height differences of 0.2𝑚. Some papers show robustness to other camera parameters such as intrinsics [14], and rota- tions [177, 307]. CHARM3R specifically tackles the challenge of generalizing to scenarios with significant camera height changes, exceeding 0.7𝑚. Height-Robustness. Image-based 3D detectors such as BEVHeight [94] and MonoUNI [94] train multiple detectors at different heights, but always do ID testing. 
Recent works address ego height changes by either using Plucker embeddings [4, 297] for video generation/pose estimation, by transforming target-height images to the original height, assuming constant depth [129] for Mono3D, or by retraining with augmented data [104] for BEV segmentation. In contrast, we investigate the contrasting extrapolation behavior of regressed and ground-based depth estimators and average them for generalizing Mono3D to unseen camera heights. Wide Baseline Setup. Wide baseline setups are challenging due to issues like large occlusions, 65 Figure 5.3 Problem Setup. Note that changing ego height does not change the object depth 𝑧 but only its position (𝑢𝑐, 𝑣𝑐) in the image plane. A regressed-depth model uses this pixel position to estimate the depth and therefore, fails when the ego height is changed. depth discontinuities [229] and intensity variations [228]. Unlike traditional wide-baseline setups with arbitrary baseline movements, generalization to unseen ego height requires handling baseline movements specifically along the vertical direction. 5.3 Notations and Preliminaries We first list out the necessary notations and preliminaries which are used throughout this chapter. These are not our contributions and can be found in the literature [65, 73, 76]. Notations. Let K ∈ R3×3 denote the camera intrinsic matrix, R ∈ R3×3 the rotation matrix and T ∈ R3×1 the translation vector of the extrinsic parameters. Also, 0 ∈ R3×1 denotes the zero vector in 3D. We denote the ego camera height on the car as 𝐻, and the height change relative to this car as Δ𝐻 meters. The camera intrinsics matrix K has focal length 𝑓 and principal point (𝑢0, 𝑣0). Let (𝑢, 𝑣) represent a pixel position in the camera coordinates, and (𝑢𝑐, 𝑣𝑐) and (𝑢𝑏, 𝑣𝑏) denotes the projected 3D center and bottom center respectively. ℎ denotes the height of the image plane. We show these notations pictorially in Fig. 5.3. Pinhole Point Projection [76]. The pinhole model relates a 3D point (𝑋, 𝑌 , 𝑍) in the world 66 ObjectTest EgoTrain EgoHΔHDepth (z)GroundZYff3D CenterBottom 3D CenterImage Plane0.5h2D0.5h2D(uc,vc)(uc,vc)(ub,vb)(ub,vb)hhX Figure 5.4 CHARM3R Overview. CHARM3R predicts the shift coefficient to obtain projected 3D bottom centers to query the ground depth and then averages the ground-depth and the regressed depth estimates within the model itself to output final depth estimate of a bounding box. CHARM3R uses the results of Th. 3 and 4 that demonstrate that the ground and the regressed depth models show contrasting extrapolation behaviors. coordinate system to its 2D projected pixel (𝑢, 𝑣) in camera coordinates as: 𝑣  𝑢          1 (cid:104) 𝑧 = (cid:105) K 0            R T       0𝑇 1        , 𝑋 𝑌 𝑍 1                           (5.1) where 𝑧 denotes the depth of pixel (𝑢, 𝑣). Ground Depth Estimation [65,73]. While depth estimation in Mono3D is ill-posed, ground depth can be precisely determined given the camera parameters and height relative to the ground in the world coordinate system [65, 73, 284]. Since all datasets provide camera mounting height from the ground, we obtain the depth of ground plane pixels in closed form. Lemma 4. Ground Depth of Pixel [65, 73, 284]. Consider a pinhole camera model with intrinsics K, rotation R and translation extrinsics T. Let matrix 𝑨 = (𝑎𝑖 𝑗 ) = R−1K−1 ∈ R3×3, and −R−1T as 67 the vector 𝑩 = (𝑏𝑖) ∈ R3×1. Then, the ground depth 𝑧 for a pixel (𝑢, 𝑣) is 𝑧 = 𝐻 − 𝑏2 𝑎21𝑢 + 𝑎22𝑣 + 𝑎23 . (5.2) We refer to Sec. 
E.1.1 in the appendix for the derivation. Lemma 5. Ground Depth of Pixel For datasets with the rotation extrinsics R an identity, the depth estimate 𝑧 from Lemma 4 becomes 𝑧 = . 𝐻 − 𝑏2 𝑣 −𝑣0 𝑓 (5.3) We refer to Sec. E.1.2 for the proof. 5.4 CHARM3R In this section, we first mathematically prove the contrasting extrapolation behavior of regressed and ground-based object depths under varying camera heights. To mitigate the impact of these opposing trends and improve generalization to unseen heights, we propose Camera Height Agnostic Monocular 3D Object Detector or CHARM3R. CHARM3R averages both these depth estimates within the model to mitigate these trends and improves generalization to unseen heights. Fig. 5.4 shows the overview of CHARM3R. 5.4.1 Ground-based Depth Model Outdoor driving scenes typically contain a ground region, unlike indoor scenes. The ground depth varies with ego height, providing a valuable reference and prior for generalizing Mono3D to unseen ego heights. Bottom Center Estimation. Lemma 4 utilizes the ground plane depth from Eq. (5.2) to estimate object depths. The numerator in Eq. (5.2) can be negative, while depth is positive for forward facing cameras. To ensure positive depth values, we apply the Rectified Linear Unit (ReLU) activation (max(𝑧, 0)) to the numerator of Eq. (5.2). This step promotes spatially continuous and meaningful ground depth representations, improving the training stability of CHARM3R. Ablation in Sec. 5.5.3 confirm the effectiveness. 68 In practice, CHARM3R leverages the projected 3D center (𝑢𝑐, 𝑣𝑐), 2D height information ℎ2𝐷 and the 2D center (𝑢𝑐,2𝐷, 𝑣𝑐,2𝐷) to compute the projected bottom 3D center (𝑢𝑏, 𝑣𝑏) as follows: 𝑢𝑏 = 𝑢𝑐 ; 𝑣𝑏 = 𝑣𝑐 + 1 2 ℎ2𝐷 + 𝛼(𝑣𝑐 − 𝑣𝑐,2𝐷). (5.4) With the projected bottom center (𝑢𝑏, 𝑣𝑏) estimated, we query the ground plane depth at this point, as derived in Lemma 4. Note that we do not use the 3D height to calculate the bottom center since projecting this point requires the box depth, which is the quantity we aim to estimate. We, now, analyze the extrapolation behavior of this ground-based depth model in the following theorem. Theorem 3. Ground-based bottom center model has positive slope (trend) in extrapolation. Consider a ground depth model that predicts ˆ𝑧 from the projected bottom 3D center (𝑢𝑏, 𝑣𝑏) image. Assuming the GT object depth 𝑧 is more than the ego height change Δ𝐻, the mean depth error of the ground model exhibits a positive trend w.r.t. the height change Δ𝐻: E(cid:16)𝑔ˆ𝑧Δ𝐻 − 𝑧 (cid:17) ≈ 𝑅𝑒𝐿𝑈 (cid:19) (cid:18) 1 𝑣𝑏 −𝑣0 𝑓 Δ𝐻, (5.5) where 𝑓 is the focal length and (𝑢0, 𝑣0) is the optical center. Th. 3 says that the ground model over-estimates and under-estimates depth as the ego height change Δ𝐻 increases and decreases respectively. Proof. When the ego camera shifts by Δ𝐻 𝑚, the 𝑦-coordinate of the projected 3D bottom center 𝑣𝑏 of a 3D box becomes 𝑣𝑏 + 𝑓 Δ𝐻 𝑧 . Using Eq. (5.3), the new depth 𝑔ˆ𝑧Δ𝐻 is 𝑔ˆ𝑧Δ𝐻 = 𝐻 + Δ𝐻 − 𝑏2 𝑓 Δ𝐻 𝑧 𝑓 − 𝑣0 𝑣𝑏 + = 𝐻 + Δ𝐻 − 𝑏2 Δ𝐻 𝑣𝑏 −𝑣0 𝑧 𝑓 + . (5.6) If the ego height change Δ𝐻 is small compared to the object depth 𝑧, Δ𝐻 𝑧 above equation as ≈ 0. So, we write the 𝑔ˆ𝑧Δ𝐻 ≈ 𝐻 + Δ𝐻 − 𝑏2 𝑣𝑏 −𝑣0 𝑓 = 𝑔ˆ𝑧0 + Δ𝐻 𝑣𝑏 −𝑣0 𝑓 69 ≈ 𝑧 + 𝜂 + 𝑓 Δ𝐻 𝑣𝑏 −𝑣0 =⇒ 𝑔ˆ𝑧Δ𝐻 − 𝑧 ≈ 𝜂 + 𝑓 Δ𝐻 𝑣𝑏 −𝑣0 , assuming the ground depth 𝑔ˆ𝑧0 at train height Δ𝐻 = 0 is the GT depth 𝑧 added by a normal random variable 𝜂 with mean 0 and variance 𝜎2 as in [110]. 
Taking expectation on both sides, the mean depth error is E(cid:16)𝑔ˆ𝑧Δ𝐻 − 𝑧 (cid:17) ≈ (cid:19) (cid:18) 1 𝑣𝑏 −𝑣0 𝑓 Δ𝐻, confirming the positive trend of the mean depth error of the ground model w.r.t. the height change Δ𝐻. The ground lies between the bottom part of the image plane/ image height (ℎ) and the optical center 𝑦-coordinate 𝑣0, and so 𝑣𝑏 − 𝑣0 > 0. However, in practice, it could get negative in early stage of training. To enforce non-negativity of this term, we pass 𝑣𝑏−𝑣0 through a ReLU non-linearity to enforce 𝑣𝑏 −𝑣0 is positive. Sec. 5.5.3 confirms that ReLU remains important for good results. □ 5.4.2 Regression-based Depth Model Most Mono3D models rely on regression losses, to compare the predicted depth with the GT depth [108, 303]. We, next, derive the extrapolation behavior of such regressed depth model in the following theorem. Theorem 4. Regressed model has negative slope (trend) in extrapolation. Consider a regressed depth model trained on data from single ego height, predicting depth ˆ𝑧 from the projected 3D center (𝑢𝑐, 𝑣𝑐). Assuming a linear relationship between predicted depth and pixel position, the mean depth error of a regressed model exhibits a negative trend w.r.t. the height change Δ𝐻: E(cid:16)𝑟ˆ𝑧Δ𝐻 − 𝑧 (cid:17) = − (cid:19) (cid:18) 𝛽 𝑧 𝑓 Δ𝐻, (5.7) where 𝛽 is a camera height independent positive constant. Th. 4 says that regressed depth model under-estimates and over-estimates depth as the ego height change Δ𝐻 increases and decreases respectively. 70 Figure 5.5 CARLA Val samples with both negative and positive ego height changes (Δ𝐻) covers AVs from bots to cars to trucks. Table 5.1 Error analysis of GUP Net [159] trained on Δ𝐻 = 0𝑚 on all height changes Δ𝐻 of CARLA Val split. Depth remains the biggest source of error in inference on unseen ego heights. 𝑧 𝑦 ℎ ✓ ✓ (cid:17)) (cid:17)) Oracle Params. − 𝑥 (cid:17) / Δ𝐻 (𝑚)− 𝜃 𝑤 𝑙 (cid:17) −0.70 9.46 15.95 13.56 34.82 65.44 10.32 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 75.86 ✓ ✓ ✓ ✓ ✓ ✓ ✓ 78.44 ✓ ✓ ✓ ✓ AP 3D 70 [%] (− 0 53.82 62.21 59.55 69.99 82.36 56.24 82.82 85.20 +0.76 −0.70 41.66 7.23 46.89 12.74 44.93 10.67 68.10 39.03 74.76 80.70 42.04 7.20 78.21 82.08 78.44 82.28 AP 3D 50 [%] (− 0 76.47 76.78 76.86 82.73 84.93 76.61 85.17 85.20 +0.76 −0.70 40.97 50.97 49.84 76.24 82.11 42.03 82.24 82.28 MDE (𝑚) [≈ 0] +0.76 0 +0.53 +0.03 −0.63 +0.53 +0.03 −0.63 +0.53 +0.03 −0.63 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.53 +0.03 −0.63 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 Proof. Neural nets often use the 𝑦-coordinate of their projected 3D center 𝑣𝑐 to predict depth [51]. Consider a simple linear regression model for predicting depth. Then, the regressed depth 𝑟ˆ𝑧0 is 𝑟ˆ𝑧0 = − (cid:19) (cid:18) 𝑧𝑚𝑎𝑥 −𝑧𝑚𝑖𝑛 ℎ−𝑣0 (𝑣𝑐 −𝑣0) + 𝑧𝑚𝑎𝑥 = −𝛽(𝑣𝑐 −𝑣0) + 𝑧𝑚𝑎𝑥, (5.8) This linear regression model has a negative slope, with a positive slope parameter 𝛽, and ℎ being the height of the image. This model predicts depth 𝑧𝑚𝑖𝑛 at pixel position 𝑣𝑐 = ℎ and 𝑧𝑚𝑎𝑥 at principal point 𝑣𝑐 = 𝑣0. When the ego camera shifts by Δ𝐻 𝑚, the projected center of the object 71 𝑓 Δ𝐻 𝑧 becomes 𝑣𝑐 + depth 𝑟ˆ𝑧Δ𝐻 as, . Substituting this into the regression model of Eq. (5.8), we obtain the new 𝑟ˆ𝑧Δ𝐻 = −𝛽 (cid:18) 𝑣𝑐 + 𝑓 Δ𝐻 𝑧 (cid:19) −𝑣0 + 𝑧𝑚𝑎𝑥 = −𝛽(𝑣𝑐 −𝑣0) + 𝑧𝑚𝑎𝑥 − (cid:19) (cid:18) 𝛽 𝑧 𝑓 Δ𝐻 = 𝑟ˆ𝑧0 − = 𝑧 + 𝜂 − (cid:19) (cid:18) 𝛽 𝑧 (cid:18) 𝛽 𝑧 𝑓 Δ𝐻 (cid:19) 𝑓 Δ𝐻 =⇒ 𝑟ˆ𝑧Δ𝐻 − 𝑧 = 𝜂 − (cid:19) (cid:18) 𝛽 𝑧 𝑓 Δ𝐻, assuming the regressed depth 𝑟ˆ𝑧0 at train height Δ𝐻 = 0 is the GT depth 𝑧 added by a normal random variable 𝜂 with mean 0 and variance 𝜎2 as in [110]. 
Taking expectation on both sides, the mean depth error is E(cid:16)𝑟ˆ𝑧Δ𝐻 − 𝑧 (cid:17) = − (cid:19) (cid:18) 𝛽 𝑧 𝑓 Δ𝐻, confirming the negative trend of the mean depth error of the regressed depth model w.r.t. the height change Δ𝐻. 5.4.3 Merging Depth Estimates. □ Th. 3 and 4 prove that the ground and the regressed depth models show contrasting extrap- olation behaviors. The former over-estimates the depth while the latter under-estimates depth as the ego height change Δ𝐻 increases. Fig. 5.4 shows how these two depth estimates are fused together. Overall, CHARM3R leverages depth information from these two source sources (with different extrapolation behaviors) to improve the Mono3D generalization to unseen camera heights. CHARM3R starts with an input image, and estimates the depth of the object using two methods: ground and regressed depth. CHARM3R outputs the projected bottom center of the object to query the ground depth (calculated from the ego camera parameters and its position and orientation relative to the ground plane as in Lemma 4). It also outputs another depth estimate based on regression. The final step combines the two estimated depths with a simple average to cancel the 72 Table 5.2 CARLA Val Results. CHARM3R outperforms all other baselines, especially at bigger unseen ego heights. All methods except Oracle are trained on car height Δ𝐻 = 0𝑚 and tested on bot to truck height data. [Key: Best] 3D Detector GUP Net [159] DEVIANT [108] Method − (cid:17) / Δ𝐻 (𝑚)− (cid:17) Source Plucker [189] UniDrive [129] UniDrive++ [129] CHARM3R Oracle Source Plucker [189] UniDrive [129] UniDrive++ [129] CHARM3R Oracle (cid:17)) (cid:17)) AP 3D 70 [%] (− 0 53.82 55.56 53.82 53.82 55.68 53.82 50.18 51.32 50.18 50.18 48.74 50.18 −0.70 9.46 8.43 10.73 10.83 19.45 70.96 8.63 8.43 8.33 6.73 17.11 71.97 +0.76 −0.70 41.66 7.23 37.10 10.13 42.30 5.54 47.81 12.27 53.40 27.33 83.88 62.25 40.24 6.25 38.24 9.52 41.40 6.56 42.91 12.03 49.28 26.24 84.56 62.56 AP 3D 50 [%] (− 0 76.47 76.57 76.46 76.46 74.47 76.47 73.78 73.91 73.78 73.78 70.21 73.78 +0.76 −0.70 40.97 43.22 39.33 53.08 61.98 83.96 41.74 44.22 41.27 52.36 63.60 83.94 MDE (𝑚) [≈ 0] +0.76 0 +0.53 +0.03 −0.63 +0.55 +0.03 −0.63 +0.51 +0.03 −0.67 +0.39 +0.03 −0.48 +0.07 +0.05 −0.02 +0.03 +0.03 +0.03 +0.46 +0.01 −0.65 +0.46 +0.01 −0.64 +0.46 +0.01 −0.64 +0.37 +0.01 −0.47 +0.01 +0.03 −0.02 +0.03 +0.01 −0.02 opposing trends and obtain the refined depth estimates, resulting in a set of accurate and localized 3D objects in the scene. 5.5 Experiments Datasets. Our experiments utilize the simulated CARLA dataset1 from [104], configured to mimic the nuScenes [22] dataset. We use this dataset for two reasons. First, this dataset reduces training and testing domain gaps, while existing public datasets lack data at multiple ego heights. Second, recent paper [104] also use this dataset for their experiments. The default CARLA dataset sweeps camera height changes Δ𝐻 from 0 to 0.76𝑚, rendering a dataset every 0.076𝑚 (car to trucks). To fully investigate the impact of camera height variations, we extend the original CARLA dataset by introducing negative height changes. The extended CARLA dataset sweeps height changes Δ𝐻 from −0.70𝑚 to 0.76𝑚 with settings from bots to cars to trucks. Fig. 5.5 illustrates sample images from this dataset. Note that we exclude Δ𝐻 = −0.76𝑚 setting due to visibility obstructions caused by the ego vehicle’s bonnet. Data Splits. Our experiments use the CARLA Val Split. 
This dataset split [104] contains 25,000 images (2,500 scenes) from town03 map for training and 5,000 images (500 scenes) from town05 map for inference on multiple ego height. Except for Oracle, we train all models on training images from the car height (Δ𝐻 = 0𝑚). 1The authors of [104] do not release their other Nvidia-Sim dataset. 73 (a) AP 3D 70 [%] comparison. (b) AP 3D 50 [%] comparison. (c) MDE comparison. Figure 5.6 CARLA Val Results on GUP Net. CHARM3R outperforms all baselines, especially at bigger unseen ego heights. All methods except Oracle are trained on car height and tested on all heights. Results of inference on height changes of −0.70, 0 and 0.76 meters are in Tab. 5.2. See Fig. 5.6 in the supplementary for another detector. Evaluation Metrics. We choose the KITTI AP 3D 70 percentage on the Moderate category [67] as our evaluation metric. We also report AP3D 50 percentage numbers following prior works [17,109]. Additionally, we report the mean depth error (MDE) over predicted boxes with IoU2D overlap greater than 0.7 with the GT boxes similar to [108]. Note that MDE is different from MAE metric of [108] that it does not take absolute value. Detectors. We use the GUP Net [159] and DEVIANT [108] as our base detectors. The choice of these models encompasses CNN [159] and group CNN-based [108] architectures. Baselines. We compare against the following baselines: • Source: This is the Mono3D model trained on the car height (Δ𝐻 = 0𝑚) data. • Plucker Embeddings [189,234]: Training a Mono3D model with Plucker embeddings to improve robustness as in 3D pose estimation and reconstruction tasks. Plucker embeddings generalize the intrinsic-focused CAM-Convs [57] embeddings to camera extrinsics. • UniDrive [129]: Transforming unseen ego height (target) images to car height (source) assuming objects at fixed distance parameter (50𝑚) and then passing to the Mono3D model. • UniDrive++ [129]: UniDrive with distance parameter optimized per dataset. • Oracle: We also report the Oracle Mono3D model, which is trained and tested on the same ego height. The Oracle serves as the upper bound of all baselines. 74 Table 5.3 CARLA Val Results with ResNet-18 backbone. CHARM3R outperforms all baselines, espe- cially at bigger unseen ego heights. All methods except Oracle are trained on car height Δ𝐻 = 0𝑚 and tested on bot to truck height data. [Key: Best] 3D Detector GUP Net [159] DEVIANT [108] Method − (cid:17) / Δ𝐻 (𝑚)− (cid:17) Source UniDrive [129] UniDrive++ [129] CHARM3R Oracle Source UniDrive [129] UniDrive++ [129] CHARM3R Oracle (cid:17)) (cid:17)) AP 3D 70 [%] (− 0 49.82 49.82 49.82 46.13 49.82 49.88 49.87 49.87 49.13 49.88 −0.70 10.13 10.05 9.37 16.62 70.25 8.83 8.21 6.01 14.96 68.35 +0.76 −0.70 47.15 5.28 47.15 6.15 52.95 13.00 57.00 24.50 83.49 62.93 42.10 4.43 42.21 3.75 43.99 12.03 52.68 23.66 84.03 58.49 AP 3D 50 [%] (− 0 73.49 73.49 73.49 67.83 73.49 72.79 72.79 72.79 72.95 72.79 MDE (𝑚) [≈ 0] +0.76 0 +0.76 −0.70 +0.40 +0.01 −0.65 42.70 +0.35 +0.01 −0.62 43.89 55.57 +0.31 +0.01 −0.46 60.86 −0.15 +0.00 +0.07 84.07 −0.01 +0.05 +0.07 +0.40 +0.01 −0.69 38.42 +0.40 +0.01 −0.70 38.38 +0.38 +0.01 −0.50 50.67 60.98 −0.07 +0.05 +0.02 −0.1 83.42 −0.04 +0.01 Table 5.4 Ablation Studies of GUP Net + CHARM3R on the CARLA Val split on unseen ego heights. 
[Key: Best] Change GUP Net [159] − To From − (cid:17) / Δ𝐻 (𝑚)− (cid:17) Merge Ground Formulation CHARM3R Oracle Regress Ground Regress+Ground − Regress+Ground − (cid:17) Within Model − Offline (cid:17) Simple Avg − Learned Avg (cid:17) No ReLU ReLU − (cid:17) Product − Sum (cid:17) − (cid:17) − 5.5.1 CARLA Error Analysis (cid:17)) AP 3D 70 [%] (− +0.76 −0.70 0 41.66 7.23 53.82 41.66 7.23 53.82 14.21 26.61 5.39 49.86 47.66 18.36 38.58 9.53 56.49 15.66 0.07 52.94 17.28 37.22 12.79 55.68 27.33 53.40 83.88 53.82 62.25 −0.70 9.46 9.46 0.98 12.86 8.25 0.60 3.28 19.45 70.96 AP 3D 50 [%] (− 0 (cid:17)) MDE (𝑚) [≈ 0] 0 +0.76 −0.70 +0.76 76.47 40.97 +0.53 +0.05 −0.63 76.47 40.97 +0.53 +0.05 −0.63 51.97 31.42 −0.80 −0.01 +0.55 76.30 54.38 +0.24 +0.02 −0.28 76.82 43.13 +0.56 −0.03 −0.62 −1.09 −0.01 +1.34 74.79 63.88 47.09 +0.56 +0.09 −0.22 74.47 61.98 −0.07 +0.05 +0.02 76.47 83.96 +0.03 +0.03 +0.03 4.50 We first report the error analysis of the baseline GUP Net [159] in Tab. 5.1 by replacing the predicted box data with the oracle parameters of the box as in [110, 166]. We consider the GT box to be an oracle box for predicted box if the euclidean distance is less than 4𝑚 [110]. In case of multiple GT being matched to one box, we consider the oracle with the minimum distance. Tab. 5.1 shows that depth is the biggest source of error for Mono3D task under ego height changes as also observed for single height settings in [108, 110, 166]. Note that the Oracle does not get 100% results since we only replace box parameters in the baseline and consequently, the missed boxes in the baseline are not added. 5.5.2 CARLA Height Robustness Results Tab. 5.2 presents the CARLA Val results, reporting the median model over three different seeds with the model being the final checkpoint as [108]. It compares baselines and our CHARM3R on all 75 Mono3D models - GUP Net [159], and DEVIANT [108]. Except for Oracle, all models are trained from car height data and tested on all ego heights. Tab. 5.2 confirms that CHARM3R outperforms other baselines on all the Mono3D models, and results in a better height robust detector. We also plot these AP3D numbers and depth errors visually in Fig. 5.6 for intermediate height changes to confirm our observations. The MDE comparison in Fig. 5.6c also shows the trend of baselines, while CHARM3R cancels the opposite trends in extrapolation. Oracle Biases. We further note biases in the Oracle models at big changes in ego height. This agrees the observations of [104] in the BEV segmentation task. While higher AP3D for a higher height could be explained by fewer occlusions due a higher height, higher AP3D at lower camera height is not explained by this hypothesis. We leave the analysis of higher Oracle numbers for a future work. Results on Other Backbone. We next investigate whether the extrapolation behavior holds for other backbones as well following DEVIANT [108]. So, we benchmark on the ResNet-18 backbone. Tab. 5.3 results show that extrapolation shows up in other backbones and CHARM3R again outperforms all baselines. The biggest gains are in big camera height changes, which is consistent with Tab. 5.2 results. 5.5.3 Ablation Studies Tab. 5.4 ablates the design choices of GUP Net + CHARM3R on CARLA Val split, with the experimental setup of Sec. 5.5. Depth Merge. We first analyze the impact of averaging the two depth estimates. Merging both regressed and ground-based depth estimates is crucial for optimal performance. 
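A minimal sketch of what this merge computes for a single box is given below; it chains Eq. (5.4) for the projected bottom center, the ReLU-clamped ground depth of Lemma 5, and the simple average of Sec. 5.4.3. The variable names, the toy camera parameters, and the per-box interface are illustrative assumptions rather than the released implementation.

# Per-box sketch of the CHARM3R depth fusion (Secs. 5.4.1-5.4.3).
# Names, the eps guard and the toy values are illustrative assumptions.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bottom_center_v(v_c, v_c2d, h_2d, alpha):
    # Eq. (5.4): v_b = v_c + 0.5 * h_2D + alpha * (v_c - v_c2D)
    return v_c + 0.5 * h_2d + alpha * (v_c - v_c2d)

def ground_depth(v_b, cam_height, b2, f, v0, eps=1e-6):
    # Lemma 5 (identity rotation): z = f * (H - b2) / (v_b - v0), with ReLU on the
    # numerator (Sec. 5.4.1) and on v_b - v0 (Th. 3) for training stability.
    # cam_height is the known test-time mounting height H + Delta_H.
    return f * relu(cam_height - b2) / (relu(v_b - v0) + eps)

def charm3r_depth(z_regressed, v_c, v_c2d, h_2d, alpha, cam_height, b2, f, v0):
    # Final depth = simple average of the regressed and ground-based estimates (Sec. 5.4.3).
    v_b = bottom_center_v(v_c, v_c2d, h_2d, alpha)
    return 0.5 * (z_regressed + ground_depth(v_b, cam_height, b2, f, v0))

# Toy usage with assumed values: f = 1000 px, principal point v0 = 450 px,
# camera mounted 1.6 m + 0.76 m above the ground, b2 = 0 for identity extrinsics.
z_fused = charm3r_depth(z_regressed=29.0, v_c=520.0, v_c2d=515.0, h_2d=60.0, alpha=0.1,
                        cam_height=1.6 + 0.76, b2=0.0, f=1000.0, v0=450.0)
print(f"fused depth estimate: {z_fused:.2f} m")

In CHARM3R itself, the regressed depth, the coefficient α, and the 2D box quantities are network outputs, and the average is taken inside the model so that both branches are trained end to end.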
Relying solely on the regressed depth gives good ID performance but bad OOD performance. Using only ground depth generalizes poorly in both ID and OOD settings, which is why it is not used in modern Mono3D models. However, it has a contrasting extrapolation MDE compared to regression models. While offline merging of depth estimates from regression-only and ground-only models also improves extrapolation, it is slower and lacks end-to-end training. We also experiment with changing the simple averaging of CHARM3R to learned averaging. Simple average of CHARM3R outperforms 76 learned one in OOD test because the learned average overfits to train distribution. ReLUed Ground. Sec. 5.4.1 says that ReLU activation applied to the ground depth ensures spatial continuity and improves model training stability. Removing the ReLU leads to training instability and suboptimal extrapolation to camera height. (The training also collapses in some cases). Formulation. CHARM3R estimates the projected 3D bottom center by using the projected 3D center and the 2D height prediction. Eq. (5.4) predicts a coefficient 𝛼 to determine the precise bottom center location. Product means predicting 𝛼 and then multiplying by (𝑣𝑐 −𝑣𝑐,2𝐷) to obtain the shift, while sum means directly predicting the shift 𝛼. Replace this product formulation by the sum formulation of 𝛼 confirms that the product is more effective than the sum. 5.6 Conclusions This chapter highlights the understudied problem of Mono3D generalization to unseen ego heights. We first systematically analyze the impact of camera height variations on state-of-the- art Mono3D models, identifying depth estimation as the primary factor affecting performance. We mathematically prove and also empirically observe consistent negative and positive trends in regressed and ground-based object depth estimates, respectively, under camera height changes. This chapter then takes a step towards generalization to unseen camera heights and proposes CHARM3R. CHARM3R averages both depth estimates within the model to mitigate these opposing trends. CHARM3R significantly enhances the generalization of Mono3D models to unseen camera heights, achieving SoTA performance on the CARLA dataset. We hope that this initial step towards generalization will contribute to safer AVs. Future work involves extending this idea to more Mono3D models. Limitation. CHARM3R does not fully solve the generalization issue to unseen camera heights. 77 CHAPTER 6 CONCLUSIONS AND FUTURE RESEARCH In this thesis, we attempt generalizing Mono3D networks to occlusion, dataset, object sizes and camera heights. The backbones of our models is in all cases a convolutional neural network or a transformer backbone. While the current Mono3D networks generalize fairly well across these shifts, they still suffer from the following issues: • They do not generalize to unseen datasets during training. • They do not multiple handle tasks like depth prediction, semantic scene completion and Mono3D. • They do not generalize to unknown or noisy camera extrinsics. • They do not handle multiple camera models. Generalizing to Unseen Datasets. Current multi-dataset trained baselines such as Cube R- CNN [14] generalize poorly to datasets unseen in training. In other words, these models do not generalize in cross-dataset settings. Generalizing Mono3D to unseen datasets remains unsolved till date. We conjecture that the cause of limited generalization is the limited training data and specialized backbones which handle projective geometry. 
5.6 Conclusions
This chapter highlights the understudied problem of Mono3D generalization to unseen ego heights. We first systematically analyze the impact of camera height variations on state-of-the-art Mono3D models, identifying depth estimation as the primary factor affecting performance. We mathematically prove and also empirically observe consistent negative and positive trends in regressed and ground-based object depth estimates, respectively, under camera height changes. This chapter then takes a step towards generalization to unseen camera heights and proposes CHARM3R. CHARM3R averages both depth estimates within the model to mitigate these opposing trends. CHARM3R significantly enhances the generalization of Mono3D models to unseen camera heights, achieving SoTA performance on the CARLA dataset. We hope that this initial step towards generalization will contribute to safer AVs. Future work involves extending this idea to more Mono3D models.
Limitation. CHARM3R does not fully solve the generalization issue to unseen camera heights.
CHAPTER 6
CONCLUSIONS AND FUTURE RESEARCH
In this thesis, we attempt to generalize Mono3D networks across occlusions, datasets, object sizes, and camera heights. The backbone of our models is, in all cases, a convolutional neural network or a transformer. While current Mono3D networks generalize fairly well across these shifts, they still suffer from the following issues:
• They do not generalize to datasets unseen during training.
• They do not handle multiple tasks such as depth prediction, semantic scene completion, and Mono3D.
• They do not generalize to unknown or noisy camera extrinsics.
• They do not handle multiple camera models.
Generalizing to Unseen Datasets. Current multi-dataset-trained baselines such as Cube R-CNN [14] generalize poorly to datasets unseen in training; in other words, these models do not generalize in cross-dataset settings. Generalizing Mono3D to unseen datasets remains unsolved to date. We conjecture that the causes of this limited generalization are the limited training data and the specialized backbones that handle projective geometry.
Generalizing to Multiple Tasks. Tasks like metric depth prediction, semantic scene completion, and Mono3D all represent 3D scene understanding at varying levels of granularity, from points to voxels to objects. While there are networks that specialize in each task, a single model that understands all these granularities, as well as intermediate ones, remains an exciting direction for solidifying 3D understanding and task generalization.
Generalizing to Unknown Extrinsics. Current Mono3D methods work well when trained and tested on the same extrinsics. However, such methods do not work well when the camera extrinsics are unknown during testing. Joint Mono3D and camera calibration remains an open problem.
Generalizing to Camera Models. Current methods handle only pinhole cameras, while the cameras available today also include fisheye and 360° camera models. Generalizing Mono3D networks to handle any camera model remains another open problem in this area.
Advances in the Mono3D task enable diverse applications such as autonomous driving, the Metaverse, and robotics. The goal of home robots is to assist humans in indoor activities, such as cooking or cleaning. Future work that generalizes Mono3D along these directions will make our currently limited 3D scene understanding even more powerful.
BIBLIOGRAPHY
[1] The KITTI Vision Benchmark Suite. http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d. Accessed: 2022-07-03. 18, 35
[2] Hassan Alhaija, Siva Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. IJCV, 2018. 1, 6, 25, 44, 61
[3] Samaneh Azadi, Jiashi Feng, and Trevor Darrell. Learning detection with diverse proposals. In CVPR, 2017. 9, 16, 17
[4] Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David Lindell, and Sergey Tulyakov. VD3D: Taming large video diffusion transformers for 3D camera control. arXiv preprint arXiv:2407.12781, 2024. 63, 65
[5] Wentao Bao, Bin Xu, and Zhenzhong Chen. MonoFENet: Monocular 3D object detection with feature enhancement networks. IEEE Transactions on Image Processing, 2019. 9
[6] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In ICCV, 2019. 54
[7] Deniz Beker, Hiroharu Kato, Mihai Adrian Morariu, Takahiro Ando, Toru Matsuoka, Wadim Kehl, and Adrien Gaidon. Monocular differentiable rendering for self-supervised 3D object detection. In ECCV, 2020. 20
[8] Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Francis Bach. Learning with differentiable perturbed optimizers. In NeurIPS, 2020. 11
[9] Zygmunt Birnbaum. An inequality for Mill's ratio. The Annals of Mathematical Statistics, 1942. 144
[10] Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differentiable sorting and ranking. In ICML, 2020. 11, 12
[11] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020. 26
[12] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry Davis. Soft-NMS–improving object detection with one line of code. In ICCV, 2017. 8, 9, 10, 11, 15, 16, 17, 21, 23, 108, 109, 110
[13] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry Davis.
Soft-NMS implementa- tion. https://github.com/bharatsingh430/soft-nms/blob/master/lib/nms/cpu_nms.pyx#L98, 2017. Accessed: 2021-01-18. 10 80 [14] Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3D: A large benchmark and model for 3D object detection in the wild. In CVPR, 2023. 1, 44, 55, 56, 57, 61, 65, 78 [15] Garrick Brazil and Xiaoming Liu. M3D-RPN: Monocular 3D region proposal network for object detection. In ICCV, 2019. 6, 9, 16, 18, 19, 20, 21, 22, 23, 25, 26, 29, 37, 42, 48, 65, 108, 110, 121, 128, 129 [16] Garrick Brazil and Xiaoming Liu. Pedestrian detection with autoregressive network phases. In CVPR, 2019. 9 [17] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3D object detection in monocular video. In ECCV, 2020. 6, 7, 9, 16, 18, 19, 20, 21, 22, 23, 25, 29, 35, 36, 48, 65, 74, 107, 108, 109, 110, 111, 112, 113, 117, 121, 131 [18] Garrick Brazil, Xi Yin, and Xiaoming Liu. detection & segmentation. In ICCV, 2017. 9 Illuminating pedestrians via simultaneous [19] Michael Bronstein. Convolution from first principles. https://towardsdatascience.com/ deriving-convolution-from-first-principles-4ff124888028. Accessed: 2021-08-13. 25, 28, 29, 114 [20] Michael Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021. 28, 29, 114 [21] Brian Burns, Richard Weiss, and Edward Riseman. The non-existence of general-case view-invariants. In Geometric invariance in computer vision. 1992. 29, 30, 63, 114, 115 [22] Holger Caesar, Varun Bankiti, Alex Lang, Sourabh Vora, Venice Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020. 34, 35, 37, 46, 53, 54, 60, 73, 128, 149, 164 [23] Brittany Caldwell. 2 die when tesla crashes into parked tractor-trailer in florida. https: //www.wftv.com/news/local/2-die-when-tesla-crashes-into-parked-tractor-trailer-florida/ KJGMHHYTQZA2HNAHWL2OFSVIPM/, 2022. Accessed: 2023-11-06. 2, 45 [24] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 48, 65 [25] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teuliere, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehi- cle analysis from monocular image. In CVPR, 2017. 29, 48, 65 [26] Gyusam Chang, Jiwon Lee, Donghyun Kim, Jinkyu Kim, Dongwook Lee, Daehyun Ji, Sujin Jang, and Sangpil Kim. Unified domain generalization and adaptation for multi-view 3D 81 object detection. In NeurIPS, 2024. 65 [27] Gyusam Chang, Wonseok Roh, Sujin Jang, Dongwook Lee, Daehyun Ji, Gyeongrok Oh, Jinsun Park, Jinkyu Kim, and Sangpil Kim. CMDA: Cross-modal and domain adversarial adaptation for LiDAR-based 3D object detection. In AAAI, 2024. 65 [28] Ming Chang, Xishan Zhang, Rui Zhang, Zhipeng Zhao, Guanhua He, and Shaoli Liu. RecurrentBEV: A long-term temporal fusion framework for multi-view 3D detection. In ECCV, 2024. 65 [29] Dian Chen, Jie Li, Vitor Guizilini, Rares Andrei Ambrus, and Adrien Gaidon. Viewpoint equivariance for multi-view 3D object detection. In CVPR, 2023. 48, 59, 65 [30] Kean Chen, Weiyao Lin, Jianguo Li, John See, Ji Wang, and Junni Zou. AP-Loss for accurate one-stage object detection. TPAMI, 2020. 
17, 19 [31] Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. Monocular 3D object detection for autonomous driving. In CVPR, 2016. 9, 29, 47, 64 [32] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. In NeurIPS, 2015. 18, 35 [33] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In CVPR, 2017. 6, 9 [34] Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. DSGN: Deep stereo geometry network for 3D object detection. In CVPR, 2020. 1, 47, 64 [35] Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li. MonoPair: Monocular 3D object detection using pairwise spatial relationships. In CVPR, 2020. 6, 9, 19, 20, 21, 23, 25, 29, 36, 48, 65 [36] Zhili Chen, Shuangjie Xu, Maosheng Ye, Zian Qian, Xiaoyi Zou, Dit-Yan Yeung, and Qifeng Chen. Learning high-resolution vector representation from multi-camera images for 3D object detection. In ECCV, 2024. 65 [37] Kashyap Chitta, Aditya Prakash, and Andreas Geiger. NEAT: Neural attention fields for end-to-end autonomous driving. In ICCV, 2021. 48 [38] Wonhyeok Choi, Mingyu Shin, and Sunghoon Im. Depth-discriminative metric learning for monocular 3D object detection. In NeurIPS, 2023. 48, 65 [39] Zhiyu Chong, Xinzhu Ma, Hong Zhang, Yuxin Yue, Haojie Li, Zhihui Wang, and Wanli Ouyang. MonoDistill: Learning spatial features for monocular 3D object detection. In ICLR, 2022. 35, 36, 132 [40] Xiaomeng Chu, Jiajun Deng, Yuan Zhao, Jianmin Ji, Yu Zhang, Houqiang Li, and Yanyong 82 Zhang. OA-BEV: Bringing object awareness to bird’s-eye-view representation for multi- camera 3D object detection. arXiv preprint arXiv:2301.05711, 2023. 48, 65 [41] Taco Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In ICLR, 2018. 26, 28 [42] Taco Cohen and Max Welling. Learning the irreducible representations of commutative lie groups. In ICML, 2014. 28 [43] Taco Cohen and Max Welling. Group equivariant convolutional networks. In ICML, 2016. 28, 63, 114 [44] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020. 149 [45] Michael Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796, 2020. 46 [46] Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable ranks and sorting using optimal transport. In NeurIPS, 2019. 11 [47] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 9 [48] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. large-scale hierarchical image database. In CVPR, 2009. 124 ImageNet: A [49] Chaitanya Desai, Deva Ramanan, and Charless Fowlkes. Discriminative models for multi- class object layout. IJCV, 2011. 9, 16, 17 [50] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In ICML, 2016. 28, 114 [51] Tom van Dijk and Guido de Croon. How do neural networks see depth in single images? In ICCV, 2019. 71, 161 [52] Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, and Ping In CVPR Luo. Learning depth-guided convolutions for monocular 3D object detection. Workshops, 2020. 9, 19, 20, 29, 38, 39, 121 [53] Simon Doll, Richard Schulz, Lukas Schneider, Viviane Benzin, Markus Enzweiler, and Hendrik Lensch. 
SpatialDETR: Robust scalable transformer-based 3D object detection from multi-view camera images with global cross-sensor attention. In ECCV, 2022. 59 [54] Yinpeng Dong, Caixin Kang, Jinlai Zhang, Zijian Zhu, Yikai Wang, Xiao Yang, Hang Su, Xingxing Wei, and Jun Zhu. Benchmarking robustness of 3D object detection to common corruptions. In CVPR, 2023. 1, 44 83 [55] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 28, 63 [56] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. In ICLR, 2018. 28 [57] Jose Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, and Javier Civera. CAM-Convs: Camera-aware multi-scale convolutions for single-view depth. In CVPR, 2019. 74 [58] Lue Fan, Feng Wang, Naiyan Wang, and Zhao Zhang. Fully sparse 3D object detection. In NeurIPS, 2022. 48 [59] Chengjian Feng, Zequn Jie, Yujie Zhong, Xiangxiang Chu, and Lin Ma. AEDet: Azimuth- invariant multi-view 3D object detection. arXiv preprint arXiv:2211.12501, 2022. 48 [60] Roshan Fernandez. A tesla firetruck on a tesla-driver-killed-california-firetruck-nhtsa, 2023. Accessed: 2023-11-06. 2, 45 california highway. driver was a https://www.npr.org/2023/02/20/1158367204/ smashing killed after into [61] Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NeurIPS, 2012. 9, 29 [62] William Freeman and Edward Adelson. The design and use of steerable filters. TPAMI, 1991. 28, 32 [63] Kanchana Gandikota, Jonas Geiping, Zorah Lähner, Adam Czapliński, and Michael Moeller. Training or architecture? how to incorporate invariance in neural networks. arXiv preprint arXiv:2106.10044, 2021. 27, 38, 114 [64] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural net- works. In NeurIPS, 2017. 26, 28 [65] Noa Garnett, Rafi Cohen, Tomer Pe’er, Roee Lahav, and Dan Levi. 3D-LaneNet: end-to-end 3D multiple lane detection. In ICCV, 2019. 66, 67 [66] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. IJRR, 2013. 112, 134 [67] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, 2012. 18, 20, 34, 35, 53, 54, 74, 146 [68] Rohan Ghosh and Anupam Gupta. Scale steerable filters for locally scale-invariant convolu- tional neural networks. In ICML Workshops, 2019. 27, 28, 32, 122 [69] Ross Girshick. Fast R-CNN. In ICCV, 2015. 8, 9, 55 84 [70] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 8 [71] Gene Golub and Charles Loan. Matrix computations. 2013. 14 [72] Nikhil Gosala and Abhinav Valada. Bird’s-eye-view panoptic segmentation using monocular frontal view images. RAL, 2022. 48, 54, 55, 57, 148, 150, 152 [73] Yuliang Guo, Guang Chen, Peitao Zhao, Weide Zhang, Jinghao Miao, Jingao Wang, and Tae Eun Choe. Gen-lanenet: A generalized and scalable approach for 3D lane detection. In ECCV, 2020. 66, 67 [74] Adam Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple- BEV: What really matters for multi-sensor BEV perception? In CoRL, 2022. 
48 [75] Christopher Harris and Mike Stephens. A combined corner and edge detector. In Alvey vision conference, 1988. 9 [76] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cam- bridge university press, 2003. 27, 29, 30, 31, 63, 66, 116, 117 [77] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 150 [78] Paul Henderson and Vittorio Ferrari. End-to-end training of object class detectors for mean average precision. In ACCV, 2016. 9, 18, 24 [79] [80] [81] [82] Joao Henriques and Andrea Vedaldi. Warped convolutions: Efficient invariance to spatial transformations. In ICML, 2017. 28 Jan Hosang, Rodrigo Benenson, and Bernt Schiele. A convnet for non-maximum suppres- sion. In GCPR, 2016. 7, 9, 16, 17, 18, 24 Jan Hosang, Rodrigo Benenson, and Bernt Schiele. Learning non-maximum suppression. In CVPR, 2017. 7, 9, 16, 17, 18, 24 Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Xiao Tan, Errui Ding, Jingdong Wang, and Xiang Bai. OPEN: Object-wise position embedding for multi-view 3D object detection. In ECCV, 2024. 65 [83] Anthony Hu, Zak Murez, Nikhil Mohan, Sofía Dudas, Jeffrey Hawke, Vijay Badrinarayanan, Roberto Cipolla, and Alex Kendall. FIERY: future instance prediction in bird’s-eye view from surround monocular cameras. In ICCV, 2021. 48, 50 [84] Hanjiang Hu, Zuxin Liu, Sharad Chitlangia, Akhil Agnihotri, and Ding Zhao. Investigating the impact of multi-LiDAR placement on object detection for autonomous driving. In CVPR, 2022. 65 85 [85] Peiyun Hu, Jason Ziglar, David Held, and Deva Ramanan. What you see is what you get: Exploiting visibility for 3D object detection. In CVPR, 2020. 9 [86] Gao Huang, Zhuang Liu, Laurens Maaten, and Kilian Weinberger. Densely connected convolutional networks. In CVPR, 2017. 18 [87] [88] Junjie Huang and Guan Huang. BEVDet4D: Exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054, 2022. 55, 155 Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. High-performance multi-camera 3D object detection in bird-eye-view. arXiv:2112.11790, 2021. 48, 65, 155 BEVDet: arXiv preprint [89] Kuan-Chih Huang, Tsung-Han Wu, Hung-Ting Su, and Winston Hsu. MonoDTR: Monocular 3D object detection with depth-aware transformer. In CVPR, 2022. 47, 65 [90] Tengteng Huang, Zhe Liu, Xiwu Chen, and Xiang Bai. EPNet: Enhancing point features with image semantics for 3D object detection. In ECCV, 2020. 6, 7, 9, 23 [91] Sujin Jang, Dae Ung Jo, Sung Ju Hwang, Dongwook Lee, and Daehyun Ji. STXD: Structural and temporal cross-modal distillation for multi-view 3D object detection. In NeurIPS, 2023. 59 [92] Ylva Jansson and Tony Lindeberg. Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales. IJCV, 2021. 27, 28, 32 [93] Haoxuanye Ji, Pengpeng Liang, and Erkang Cheng. Enhancing 3D object detection with 2D detection-guided query anchors. In CVPR, 2024. 65 [94] Jinrang Jia, Zhenjia Li, and Yifeng Shi. MonoUNI: A unified vehicle and infrastructure-side monocular 3D object detection network with sufficient depth clues. In NeurIPS, 2023. 1, 44, 65 [95] Yanqin Jiang, Li Zhang, Zhenwei Miao, Xiatian Zhu, Jin Gao, Weiming Hu, and Yu-Gang Jiang. Polarformer: Multi-camera 3D object detection with polar transformers. In AAAI, 2023. 48, 59, 155 [96] Zheng Jiang, Jinqing Zhang, Yanan Zhang, Qingjie Liu, Zhenghui Hu, Baohui Wang, and Yunhong Wang. FSD-BEV: Foreground self-distillation for multi-view 3D object detection. In ECCV, 2024. 
65 [97] Li Jing. Physical symmetry enhanced neural networks. PhD thesis, Massachusetts Institute of Technology, 2020. 28 [98] Angjoo Kanazawa, Abhishek Sharma, and David Jacobs. Locally scale-invariant convolu- tional neural networks. In NeurIPS Workshops, 2014. 28 [99] Kang Kim and Hee Lee. Probabilistic anchor assignment with IoU prediction for object 86 detection. In ECCV, 2020. 19, 24 [100] Sanmin Kim, Youngseok Kim, Sihwan Hwang, Hyeonjun Jeong, and Dongsuk Kum. La- belDistill: Label-guided cross-modal knowledge distillation for camera-based 3D object detection. In ECCV, 2024. 65 [101] Sanmin Kim, Youngseok Kim, In-Jae Lee, and Dongsuk Kum. Predict to Detect: Prediction- guided 3D object detection using sequential images. In ICCV, 2023. 59, 155 [102] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 108, 125, 150 [103] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. In ICLR, 2022. 64 [104] Tzofi Klinghoffer, Jonah Philion, Wenzheng Chen, Or Litany, Zan Gojcic, Jungseock Joo, Ramesh Raskar, Sanja Fidler, and Jose Alvarez. Towards viewpoint robustness in Bird’s Eye View segmentation. In ICCV, 2023. 1, 44, 62, 63, 65, 73, 76, 161, 162 [105] Marvin Klingner, Shubhankar Borse, Varun Ravi Kumar, Behnaz Rezaei, Venkatraman Narayanan, Senthil Yogamani, and Fatih Porikli. X3KD: Knowledge distillation across modalities, tasks and stages for multi-camera 3D object detection. In CVPR, 2023. 48, 59 [106] Emile Krieken, Erman Acar, and Frank Harmelen. Analyzing differentiable fuzzy logic operators. arXiv preprint arXiv:2002.06100, 2020. 12, 106 [107] Jason Ku, Alex Pon, and Steven Waslander. Monocular 3D object detection leveraging accurate proposals and shape reconstruction. In CVPR, 2019. 19 [108] Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. DEVIANT: Depth Equivariant Network for monocular 3D object detection. In ECCV, 2022. 1, 44, 48, 50, 53, 54, 55, 56, 57, 61, 63, 65, 70, 73, 74, 75, 76, 146, 147, 152, 153, 156, 162, 163, 164 [109] Abhinav Kumar, Garrick Brazil, and Xiaoming Liu. GrooMeD-NMS: Grouped mathemat- ically differentiable NMS for monocular 3D object detection. In CVPR, 2021. 29, 33, 35, 36, 37, 48, 55, 56, 57, 65, 74, 126, 128, 132, 134, 151 [110] Abhinav Kumar, Yuliang Guo, Xinyu Huang, Liu Ren, and Xiaoming Liu. SeaBird: Seg- mentation in bird’s view with dice loss improves monocular 3D detection of large objects. In CVPR, 2024. 1, 61, 65, 70, 72, 75, 162 [111] Abhinav Kumar, Tim Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. LUVLi face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In CVPR, 2020. 7, 29, 48, 65 [112] Animesh Kumar and Vinod Prabhakaran. Estimation of bandlimited signals from the signs of noisy samples. In ICASSP, 2013. 13, 33, 107 87 [113] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O (1/𝑡) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012. 49, 138, 141 [114] John Lambert, Zhuang Liu, Ozan Sener, James Hays, and Vladlen Koltun. MSeg: A composite dataset for multi-domain semantic segmentation. In CVPR, 2020. 130 [115] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 
28, 29 [116] Donghoon Lee, Geonho Cha, Ming-Hsuan Yang, and Songhwai Oh. Individualness and determinantal point processes for pedestrian detection. In ECCV, 2016. 9, 15, 17 [117] Hyo-Jun Lee, Hanul Kim, Su-Min Choi, Seong-Gyun Jeong, and Yeong Koh. BAAM: Monocular 3D pose and shape reconstruction with bi-contextual attention module and attention-guided modeling. In CVPR, 2023. 48, 65 [118] Jin Lee, Myung Han, Dong Ko, and Il Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019. 130 [119] Yoonho Lee, Huaxiu Yao, and Chelsea Finn. Diversify and disambiguate: Learning from underspecified data. In ICLR, 2022. 64 [120] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR, 2018. 6 [121] Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. GS3D: An efficient 3D object detection framework for autonomous driving. In CVPR, 2019. 6, 19 [122] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo R-CNN based 3D object detection for autonomous driving. In CVPR, 2019. 9 [123] Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao. RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving. In ECCV, 2020. 6, 9, 19, 25, 29 [124] Tao Li and Vivek Srikumar. Augmenting neural networks with first-order logic. In ACL, 2019. 12, 106 [125] Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. BEVStereo: Enhancing depth estimation in multi-view 3D object detection with dynamic temporal stereo. In AAAI, 2023. 48, 59, 65 [126] Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, and Jiaya Jia. Unifying voxel- based representation with transformer for 3D object detection. In NeurIPS, 2022. 59 [127] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. BEVDepth: Acquisition of reliable depth for multi-view 3D object detection. 88 In AAAI, 2023. 59, 155 [128] Yangguang Li, Bin Huang, Zeren Chen, Yufeng Cui, Feng Liang, Mingzhu Shen, Fenggang Liu, Enze Xie, Lu Sheng, Wanli Ouyang, and Jing Shao. Fast-BEV: A fast and strong bird’s-eye view perception baseline. In NeurIPS Workshops, 2023. 48, 65 [129] Ye Li, Wenzhao Zheng, Xiaonan Huang, and Kurt Keutzer. UniDrive: Towards universal driving perception across camera configurations. arXiv preprint arXiv:2410.13864, 2024. 63, 65, 73, 74, 75, 163 [130] Zhenyu Li, Zehui Chen, Ang Li, Liangji Fang, Qinhong Jiang, Xianming Liu, and Junjun Jiang. Unsupervised domain adaptation for monocular 3D object detection via self-training. In ECCV, 2022. 65 [131] Zhenxin Li, Shiyi Lan, Jose Alvarez, and Zuxuan Wu. BEVNeXt: Reviving dense BEV frameworks for 3D object detection. In CVPR, 2024. 65 [132] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022. 1, 44, 46, 48, 53, 59, 61, 147, 155 [133] Zhuoling Li, Xiaogang Xu, SerNam Lim, and Hengshuang Zhao. Unimode: Unified monoc- ular 3D object detection. In CVPR, 2024. 61 [134] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose Alvarez. FB- BEV: BEV representation from forward-backward view transformations. In ICCV, 2023. 59 [135] Qing Lian, Botao Ye, Ruijia Xu, Weilong Yao, and Tong Zhang. 
Geometry-aware data augmentation for monocular 3D object detection. arXiv preprint arXiv:2104.05858, 2021. 26, 29 [136] Yiyi Liao, Jun Xie, and Andreas Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. TPAMI, 2022. 45, 46, 53, 54, 60, 148, 149 [137] Hongbin Lin, Yifan Zhang, Shuaicheng Niu, Shuguang Cui, and Zhen Li. MonoTTA: Fully test-time adaptation for monocular 3D object detection. In ECCV, 2024. 61 [138] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 39, 123, 149 [139] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. TPAMI, 2018. 8, 9 [140] Feng Liu, Tengteng Huang, Qianjing Zhang, Haotian Yao, Chi Zhang, Fang Wan, Qixiang Ye, and Yanzhao Zhou. Ray Denoising: Depth-aware hard negative sampling for multi-view 3D object detection. In ECCV, 2024. 65 89 [141] Feng Liu and Xiaoming Liu. Voxel-based 3D detection and reconstruction of multiple objects from a single image. In NeurIPS, 2021. 48, 65 [142] Haisong Liu Liu, Yao Teng Teng, Tao Lu, Haiguang Wang, and Limin Wang. SparseBEV: High-performance sparse 3D object detection from multi-camera videos. In ICCV, 2023. 59, 154 [143] Lijie Liu, Jiwen Lu, Chunjing Xu, Qi Tian, and Jie Zhou. Deep fitting degree scoring network for monocular 3D object detection. In CVPR, 2019. 9, 19, 25, 29 [144] Lijie Liu, Chufan Wu, Jiwen Lu, Lingxi Xie, Jie Zhou, and Qi Tian. Reinforced axial refinement network for monocular 3D object detection. In ECCV, 2020. 19 [145] Songtao Liu, Di Huang, and Yunhong Wang. Adaptive NMS: Refining pedestrian detection in a crowd. In CVPR, 2019. 9, 16, 17 [146] Xianpeng Liu, Nan Xue, and Tianfu Wu. Learning auxiliary monocular contexts helps monocular 3D object detection. In AAAI, 2022. 29 [147] Xianpeng Liu, Ce Zheng, Kelvin Cheng, Nan Xue, Guo-Jun Qi, and Tianfu Wu. Monocular 3D object detection with bounding box denoising in 3D by perceiver. In ICCV, 2023. 48, 65 [148] Xianpeng Liu, Ce Zheng, Ming Qian, Nan Xue, Chen Chen, Zhebin Zhang, Chen Li, and Tianfu Wu. Multi-view attentive contextualization for multi-view 3D object detection. In CVPR, 2024. 65 [149] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position embedding transformation for multi-view 3D object detection. In ECCV, 2022. 155 [150] Yingfei Liu, Junjie Yan, Fan Jia, Shuailin Li, Qi Gao, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETRv2: A unified framework for 3D perception from multi-camera images. In ICCV, 2023. 48, 59, 65, 155 [151] Yuxuan Liu, Yuan Yixuan, and Ming Liu. Ground-aware monocular 3D object detection for autonomous driving. Robotics and Automation Letters, 2021. 26, 35, 36 [152] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 150 [153] Zechen Liu, Zizhang Wu, and Roland Tóth. SMOKE: Single-stage monocular 3D object detection via keypoint estimation. In CVPR Workshops, 2020. 19 [154] Zongdai Liu, Dingfu Zhou, Feixiang Lu, Jin Fang, and Liangjun Zhang. AutoShape: Real- time shape-aware monocular 3D object detection. In ICCV, 2021. 29, 35, 48, 65 [155] Yunfei Long, Abhinav Kumar, Daniel Morris, Xiaoming Liu, Marcos Castro, and Punarjay Chakravarty. RADIANT: RADar Image Association Network for 3D object detection. In 90 AAAI, 2023. 1, 47, 64 [156] Ilya Loshchilov and Frank Hutter. 
Decoupled weight decay regularization. In ICLR, 2019. 150, 151 [157] David Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 9 [158] Hao Lu, Yunpeng Zhang, Qing Lian, Dalong Du, and Yingcong Chen. Towards gen- eralizable multi-camera 3D object detection via perspective debiasing. arXiv preprint arXiv:2310.11346, 2023. 65 [159] Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry uncertainty projection network for monocular 3D object detection. In ICCV, 2021. 25, 26, 29, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 48, 50, 55, 56, 57, 62, 65, 71, 73, 74, 75, 76, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 146, 152, 153, 155, 163, 165, 166, 167 [160] Shujie Luo, Hang Dai, Ling Shao, and Yong Ding. M3DSSD: Monocular 3D single stage object detector. In CVPR, 2021. 19 [161] Zhipeng Luo, Changqing Zhou, Gongjie Zhang, and Shijian Lu. DETR4D: Direct multi- view 3D object detection with sparse attention. arXiv preprint arXiv:2212.07849, 2022. 48 [162] Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, and Wanli Ouyang. Rethinking Pseudo-LiDAR representation. In ECCV, 2020. 29, 42 [163] Xinzhu Ma, Wanli Ouyang, Andrea Simonelli, and Elisa Ricci. 3D object detection from images for autonomous driving: A survey. TPAMI, 2023. 29, 48, 65 [164] Xinzhu Ma, Yongtao Wang, Yinmin Zhang, Zhiyi Xia, Yuan Meng, Zhihui Wang, Haojie Li, and Wanli Ouyang. Towards fair and comprehensive comparisons for image-based 3D object detection. In ICCV, 2023. 48 [165] Xinzhu Ma, Zhihui Wang, Haojie Li, Pengbo Zhang, Wanli Ouyang, and Xin Fan. Accu- rate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In ICCV, 2019. 19, 29, 48, 65 [166] Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, and Wanli In CVPR, Ouyang. Delving into localization errors for monocular 3D object detection. 2021. 26, 34, 36, 53, 55, 56, 57, 75, 119, 121, 122, 126, 152 [167] Yuexin Ma, Tai Wang, Xuyang Bai, Huitong Yang, Yuenan Hou, Yaming Wang, Yu Qiao, Ruigang Yang, Dinesh Manocha, and Xinge Zhu. Vision-centric BEV perception: A survey. arXiv preprint arXiv:2208.02797, 2022. 46, 48, 58, 65 [168] Lachlan MacDonald, Sameera Ramasinghe, and Simon Lucey. Enabling equivariance for arbitrary lie groups. In CVPR, 2022. 63 91 [169] Fabian Manhardt, Wadim Kehl, and Adrien Gaidon. Roi-10D: Monocular lifting of 2D detection to 6D pose and metric shape. In CVPR, 2019. 19 [170] Diego Marcos, Benjamin Kellenberger, Sylvain Lobry, and Devis Tuia. Scale equivariance in CNNs with vector fields. In ICML Workshops, 2018. 28 [171] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In ICCV, 2017. 28 [172] Nathaniel Merrill, Yuliang Guo, Xingxing Zuo, Xinyu Huang, Stefan Leutenegger, Xi Peng, Liu Ren, and Guoquan Huang. Symmetry and uncertainty-aware object SLAM for 6DoF object pose estimation. In CVPR, 2022. 1, 44, 61 [173] Alessio Micheli. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks, 2009. 28 [174] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. IJCV, 2004. 9 [175] Zhixiang Min, Bingbing Zhuang, Samuel Schulter, Buyu Liu, Enrique Dunn, and Manmohan Chandraker. NeurOCS: Neural NOCS supervision for monocular 3D object localization. In CVPR, 2023. 48, 65 [176] Mircea Mironenco and Patrick Forré. 
Lie group decompositions for equivariant neural networks. In ICLR, 2024. 63 [177] SungHo Moon, JinWoo Bae, and SungHoon Im. Rotation matters: Generalized monocular 3D object detection for various camera systems. arXiv preprint arXiv:2310.05366, 2023. 1, 44, 61, 65 [178] Frank Moosmann, Oliver Pink, and Christoph Stiller. Segmentation of 3D LiDAR data In Intelligent Vehicles in non-flat urban environments using a local convexity criterion. Symposium, 2009. 9 [179] Youngmin Oh, Hyung-Il Kim, Seong Tae Kim, and Jung Kim. MonoWAD: Weather-adaptive diffusion model for robust monocular 3D object detection. In ECCV, 2024. 61 [180] Bowen Pan, Jiankai Sun, Ho Leung, Alex Andonian, and Bolei Zhou. Cross-view semantic segmentation for sensing surroundings. RAL, 2020. 48 [181] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is Pseudo-LiDAR needed for monocular 3D object detection? In ICCV, 2021. 1, 29, 35, 36, 37, 44, 61, 127, 132, 150 [182] Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, and Wei Zhan. Time will tell: New outlooks and a baseline for temporal multi-view 3D object detection. In ICLR, 2023. 48, 65, 155 92 [183] Kiru Park, Timothy Patten, and Markus Vincze. Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In ICCV, 2019. 1, 44, 61 [184] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 18, 34, 149 [185] Max Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, and Chris Maddison. Gradient estimation with stochastic softmax tricks. In NeurIPS, 2020. 11, 15 [186] Nadia Payet and Sinisa Todorovic. From contours to 3D object detection and pose estimation. In ICCV, 2011. 9, 29, 47, 64 [187] Bojan Pepik, Michael Stark, Peter Gehler, and Bernt Schiele. Multi-view and 3D deformable part models. TPAMI, 2015. 9, 29 [188] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In ECCV, 2020. 48 [189] Julius Plücker. Analytisch-geometrische Entwicklungen. GD Baedeker, 1828. 73, 74 [190] Marin Pogančić, Anselm Paulus, Vit Musil, Georg Martius, and Michal Rolinek. Differen- tiation of blackbox combinatorial solvers. In ICLR, 2019. 11 [191] Sebastian Prillo and Julian Eisenschlos. Softsort: A continuous relaxation for the argsort operator. In ICML, 2020. 11, 12 [192] Sergey Prokudin, Daniel Kappler, Sebastian Nowozin, and Peter Gehler. Learning to filter object detections. In GCPR, 2017. 6, 7, 8, 9, 11, 12, 16, 17, 18, 24, 105 [193] Aahlad Manas Puli, Lily Zhang, Yoav Wald, and Rajesh Ranganath. Don’t blame dataset shift! shortcut learning due to gradients and cross entropy. In NeurIPS, 2023. 64 [194] Charles Qi, Or Litany, Kaiming He, and Leonidas Guibas. Deep hough voting for 3D object detection in point clouds. In ICCV, 2019. 56 [195] Zengyi Qin, Jinglu Wang, and Yan Lu. MonoGRNet: A geometric reasoning network for 3D object localization. In AAAI, 2019. 19, 20 [196] Yasiru Ranasinghe, Deepti Hegde, and Vishal M Patel. MonoDiff: Monocular 3D object detection and pose estimation with diffusion models. In CVPR, 2024. 
65 [197] Narayanan Elavathur Ranganatha, Hengyuan Zhang, Shashank Venkatramani, Jing-Yan Liao, and Henrik Christensen. SemVecNet: Generalizable vector map generation for arbitrary sensor configurations. 2024. 65 93 [198] Matthias Rath and Alexandru Condurache. Boosting deep neural networks with geometrical prior knowledge: A survey. arXiv preprint arXiv:2006.16867, 2020. 25, 28, 29, 114 [199] Cody Reading, Ali Harakeh, Julia Chae, and Steven Waslander. Categorical depth distribu- tion network for monocular 3D object detection. In CVPR, 2021. 29, 35, 36, 42, 48, 65, 124, 126, 132 [200] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. 8, 9 [201] Konstantinos Rematas, Ira Kemelmacher-Shlizerman, Brian Curless, and Steve Seitz. Soccer on your tabletop. In CVPR, 2018. 6, 25 [202] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015. 6, 8, 9, 25 [203] David Rey, Gérard Subsol, Hervé Delingette, and Nicholas Ayache. Automatic detection and segmentation of evolving processes in 3D medical images: Application to multiple sclerosis. Medical Image Analysis, 2002. 6 [204] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019. 16, 17 [205] Thomas Roddick and Roberto Cipolla. Predicting semantic map representations from images using pyramid occupancy networks. In CVPR, 2020. 48 [206] Azriel Rosenfeld and Mark Thurston. Edge and curve detection for visual scene analysis. IEEE Transactions on Computers, 1971. 9 [207] Andrew Slavin Ross, Weiwei Pan, and Finale Doshi-Velez. Learning qualitatively diverse and interpretable rules for classification. In ICML Workshops, 2018. 64 [208] Shouwei Ruan, Yinpeng Dong, Hang Su, Jianteng Peng, Ning Chen, and Xingxing Wei. Towards viewpoint-invariant visual recognition via adversarial training. In ICCV, 2023. 64 [209] Sitapa Rujikietgumjorn and Robert Collins. Optimized pedestrian detection for multiple and occluded people. In CVPR, 2013. 9, 15, 17 [210] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR, 2019. 64 [211] Avishkar Saha, Oscar Mendez, Chris Russell, and Richard Bowden. Translating images into maps. In ICRA, 2022. 48, 50, 54, 55, 56, 57, 149, 150, 152 [212] Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, David Forsyth, and Anand Bhattad. Shadows don’t lie and lines can’t bend! generative models don’t know 94 projective geometry... for now. In CVPR, 2024. 63 [213] Ashutosh Saxena, Justin Driemeyer, and Andrew Ng. Robotic grasping of novel objects using vision. IJRR, 2008. 1, 6, 25, 44, 61 [214] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub- gradient solver for SVM. In ICML, 2007. 49, 50, 51, 138, 141 [215] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In CVPR, 2019. 1, 9, 29, 47, 48, 64 [216] Xuepeng Shi, Zhixiang Chen, and Tae-Kyun Kim. Distance-normalized unified representa- tion for monocular 3D object detection. In ECCV, 2020. 6, 7, 9, 15, 17, 19, 21, 23, 24, 48, 108, 109 [217] Xuepeng Shi, Zhixiang Chen, and Tae-Kyun Kim. 
Multivariate probabilistic monocular 3D object detection. In WACV, 2023. 47 [218] Xuepeng Shi, Qi Ye, Xiaozhi Chen, Chuangrong Chen, Zhixiang Chen, and Tae-Kyun Kim. Geometry-based distance decomposition for monocular 3D object detection. In ICCV, 2021. 26, 35, 36, 37, 128, 129, 132 [219] Changyong Shu, Fisher Yu, and Yifan Liu. 3DPPE: 3D point positional encoding for multi-camera 3D object detection transformers. In ICCV, 2023. 48, 59, 65, 155 [220] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Manuel Antequera, and Peter Kontschieder. Disentangling monocular 3D object detection: From single to multi-class recognition. TPAMI, 2020. 6, 7, 9, 18, 19, 20, 25, 34, 35, 36, 128, 132 [221] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Peter Kontschieder, and Elisa Ricci. Are we missing confidence in Pseudo-LiDAR methods for monocular 3D object detection? In ICCV, 2021. 29, 35, 36, 109, 130 [222] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Manuel López-Antequera, and Peter Kontschieder. Disentangling monocular 3D object detection. In ICCV, 2019. 7, 19, 20, 34, 128 [223] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Elisa Ricci, and Peter Kontschieder. Towards generalization across depth for monocular 3D object detection. In ECCV, 2020. 9, 19, 20, 25, 26, 29 [224] Samik Some, Mithun Das Gupta, and Vinay Namboodiri. Determinantal point process as an alternative to NMS. In BMVC, 2020. 9, 15, 17 [225] Ivan Sosnovik, Artem Moskalev, and Arnold Smeulders. DISCO: accurate discrete scale convolutions. In BMVC, 2021. 39, 41 [226] Ivan Sosnovik, Artem Moskalev, and Arnold Smeulders. Scale equivariance improves 95 siamese tracking. In WACV, 2021. 28, 32, 34, 40, 122, 123, 124 [227] Ivan Sosnovik, Michał Szmaja, and Arnold Smeulders. Scale-equivariant steerable networks. In ICLR, 2020. 27, 28, 31, 32, 33, 38, 39, 41, 119, 120, 122, 123, 132 [228] Christoph Strecha, Rik Fransens, and Luc Van Gool. Wide-baseline stereo from multiple views: a probabilistic account. In CVPR, 2004. 66 [229] Christoph Strecha, Tinne Tuytelaars, and Luc Van Gool. Dense matching of multiple wide- baseline views. In ICCV, 2003. 66 [230] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020. 34, 42, 54 [231] Mingxing Tan, Ruoming Pang, and Quoc Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020. 150 [232] Yunlei Tang, Sebastian Dorn, and Chiragkumar Savani. Center3D: Center-based monocular 3D object detection with joint depth understanding. arXiv preprint arXiv:2005.13423, 2020. 8, 9, 25, 26, 29 [233] Yingqi Tang, Zhaotie Meng, Guoliang Chen, and Erkang Cheng. SimPB: A single model for 2D and 3D object detection from multiple cameras. In ECCV, 2024. 65 [234] Seth Teller and Michael Hohmeyer. Determining the lines through four lines. Journal of graphics tools, 1999. 74 [235] Damien Teney, Ehsan Abbasnejad, and Anton Hengel. Unshuffling data for improved gen- eralization in visual question answering. In ICCV, 2021. 64 [236] Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior OOD generalization. In CVPR, 2022. 
64 [237] Damien Teney, Yong Lin, Seong Joon Oh, and Ehsan Abbasnejad. ID and OOD performance are sometimes inversely correlated on real-world datasets. In NeurIPS, 2023. 63, 64 [238] Sugirtha Thayalan-Vaz, Sridevi M, Khailash Santhakumar, B Ravi Kiran, Thomas Gauthier, and Senthil Yogamani. Exploring 2D data augmentation for 3D monocular object detection. arXiv preprint arXiv:2104.10786, 2021. 26, 29 [239] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018. 28 96 [240] Rishabh Tiwari and Pradeep Shenoy. Overcoming simplicity bias in deep networks using a feature sieve. In ICML, 2023. 64 [241] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In CVPR, 2018. 19, 24 [242] Alexandru Vasile and Richard Marino. Pose-independent automatic target detection and recognition using 3D laser radar imagery. Lincoln laboratory journal, 2005. 9 [243] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. 9 [244] Li Wan, David Eigen, and Rob Fergus. End-to-end integration of a convolution network, deformable parts model and non-maximum suppression. In CVPR, 2015. 9, 16, 17 [245] Li Wang, Liang Du, Xiaoqing Ye, Yanwei Fu, Guodong Guo, Xiangyang Xue, Jianfeng Feng, and Li Zhang. Depth-conditioned dynamic message propagation for monocular 3D object detection. In CVPR, 2021. 36 [246] Li Wang, Li Zhang, Yi Zhu, Zhi Zhang, Tong He, Mu Li, and Xiangyang Xue. Progressive coordinate transforms for monocular 3D object detection. In NeurIPS, 2021. 35, 36, 42, 132 [247] Rui Wang, Robin Walters, and Rose Yu. Incorporating symmetry into deep dynamics models for improved generalization. In ICLR, 2021. 28 [248] Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. StreamPETR: Exploring object-centric temporal modeling for efficient multi-view 3D object detection. In ICCV, 2023. 48, 65 [249] Shuo Wang, Xinhai Zhao, Hai-Ming Xu, Zehui Chen, Dameng Yu, Jiahao Chang, Zhen Yang, and Feng Zhao. Towards domain generalization for multi-view 3D object detection in bird-eye-view. In CVPR, 2023. 65 [250] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. FCOS3D: Fully convolutional one-stage monocular 3D object detection. In ICCV Workshops, 2021. 155 [251] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In CoRL, 2021. 155 [252] Xueqing Wang, Diankun Zhang, Haoyu Niu, and Xiaojun Liu. Segmentation can aid detection: Segmentation-guided single stage detection for 3D point cloud. Electronics, 2023. 48 [253] Xinjiang Wang, Shilong Zhang, Zhuoran Yu, Litong Feng, and Wayne Zhang. Scale- equalizing pyramid convolution for object detection. In CVPR, 2020. 127 [254] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Weinberger. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object 97 detection for autonomous driving. In CVPR, 2019. 9, 29, 43, 48, 65 [255] Yan Wang, Xiangyu Chen, Yurong You, Li Li, Bharath Hariharan, Mark Campbell, Kilian Weinberger, and Wei-Lun Chao. Train in Germany, test in the USA: Making 3D object detectors generalize. In CVPR, 2020. 65, 129 [256] Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. FrustumFormer: Adaptive instance-aware resampling for multi-view 3D detection. In CVPR, 2023. 
59 [257] Yue Wang, Vitor Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In CoRL, 2021. 38, 155 [258] Zhou Wang, Alan Bovik, Hamid Sheikh, and Eero Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004. 39 [259] Zitian Wang, Zehao Huang, Jiahui Fu, Naiyan Wang, and Si Liu. Object as Query: Lifting any 2D object detector to 3D detection. In ICCV, 2023. 59 [260] Zeyu Wang, Dingwen Li, Chenxu Luo, Cihang Xie, and Xiaodong Yang. DistillBEV: In Boosting multi-camera 3D object detection with cross-modal knowledge distillation. ICCV, 2023. 48, 65 [261] Zengran Wang, Chen Min, Zheng Ge, Yinhao Li, Zeming Li, Hongyu Yang, and Di Huang. STS: Surround-view temporal stereo for multi-view 3D detection. In AAAI, 2023. 48, 65, 155 [262] Maurice Weiler, Patrick Forré, Erik Verlinde, and Max Welling. Coordinate independent convolutional networks–isometry and gauge equivariant convolutions on riemannian mani- folds. arXiv preprint arXiv:2106.06020, 2021. 28 [263] Maurice Weiler, Fred Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant CNNs. In CVPR, 2018. 28 [264] Mark van der Wilk, Matthias Bauer, ST John, and James Hensman. Learning invariances using the marginal likelihood. In NeurIPS, 2018. 28 [265] Daniel Worrall and Gabriel Brostow. Cubenet: Equivariance to 3D rotation and translation. In ECCV, 2018. 28, 29, 38 [266] Daniel Worrall, Stephan Garbin, Daniyar Turmukhambetov, and Gabriel Brostow. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2017. 28 [267] Daniel Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. In NeurIPS, 2019. 28, 39, 41, 121 [268] Chen Wu. Waymo keynote talk, CVPR workshop on autonomous driving at 17:20. https: //www.youtube.com/watch?v=fXsbI2VkHgc, 2023. Accessed: 2023-11-11. 2, 45 98 [269] Pengxiang Wu, Siheng Chen, and Dimitris Metaxas. MotionNet: Joint perception and motion prediction for autonomous driving based on bird’s eye view maps. In CVPR, 2020. 9 [270] Yuxin Wu and Justin Johnson. Rethinking “batch” in batchnorm. arXiv preprint arXiv:2105.07576, 2021. 127 [271] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Subcategory-aware convolu- tional neural networks for object proposals and detection. In WACV, 2017. 19 [272] Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, and Jose Alvarez. Mˆ2BEV: Multi-camera joint 3D detection and segmentation with unified birds-eye view representation. arXiv preprint arXiv:2204.05088, 2022. 48, 58 [273] Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, and Xiang Bai. CAPE: Camera view position embedding for multi-view 3D object detection. In CVPR, 2023. 59, 155 [274] Chenfeng Xu, Huan Ling, Sanja Fidler, and Or Litany. 3Difftection: 3D object detection with geometry-aware diffusion features. In CVPR, 2024. 65 [275] Junkai Xu, Liang Peng, Haoran Cheng, Hao Li, Wei Qian, Ke Li, Wenxiao Wang, and Deng Cai. MonoNeRD: NeRF-like representations for monocular 3D object detection. In ICCV, 2023. 47, 65 [276] Keyulu Xu, Mozhi Zhang, Jingling Li, Simon Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. In ICLR, 2021. 63, 64 [277] Qiangeng Xu, Yin Zhou, Weiyue Wang, Charles Qi, and Dragomir Anguelov. 
SPG: Unsu- pervised domain adaptation for 3D object detection via semantic point generation. In ICCV, 2021. 65 [278] Yichong Xu, Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, and Zheng Zhang. Scale-invariant convolutional neural networks. arXiv preprint arXiv:1411.6369, 2014. 28 [279] Longfei Yan, Pei Yan, Shengzhou Xiong, Xuanyu Xiang, and Yihua Tan. MonoCD: Monoc- ular 3D object detection with complementary depths. In CVPR, 2024. 65 [280] Gengshan Yang and Deva Ramanan. Upgrading optical flow to 3D scene flow through optical expansion. In CVPR, 2020. 34 [281] Haitao Yang, Zaiwei Zhang, Xiangru Huang, Min Bai, Chen Song, Bo Sun, Li Erran Li, and Qixing Huang. LiDAR-based 3D object detection via hybrid 2D semantic scene generation. arXiv preprint arXiv:2304.01519, 2023. 48, 58, 59 [282] Jihan Yang, Shaoshuai Shi, Zhe Wang, Hongsheng Li, and Xiaojuan Qi. ST3D: Self-training for unsupervised domain adaptation on 3D object detection. In CVPR, 2021. 65 99 [283] Jiayu Yang, Enze Xie, Miaomiao Liu, and Jose Alvarez. Parametric depth based feature representation learning for object detection and segmentation in bird’s-eye view. In ICCV, 2023. 59 [284] Xiaodong Yang, Zhuang Ma, Zhiyu Ji, and Zhe Ren. GEDepth: Ground embedding for monocular depth estimation. In ICCV, 2023. 67, 159, 160 [285] Yue Yao, Shengchao Yan, Daniel Goehring, Wolfram Burgard, and Joerg Reichardt. Im- proving out-of-distribution generalization of trajectory prediction for autonomous driving via polynomial representations. arXiv preprint arXiv:2407.13431, 2024. 65 [286] Shingo Yashima, Teppei Suzuki, Kohta Ishikawa, Ikuro Sato, and Rei Kawakami. Feature space particle inference for neural network ensembles. In ICML, 2022. 64 [287] Xiaoqing Ye, Liang Du, Yifeng Shi, Yingying Li, Xiao Tan, Jianfeng Feng, Errui Ding, and Shilei Wen. Monocular 3D object detection via feature domain adaptation. In ECCV, 2020. 19 [288] Raymond Yeh, Yuan-Ting Hu, and Alexander Schwing. Chirality nets for human pose regression. In NeurIPS, 2019. 28 [289] Jingru Yi, Pengxiang Wu, Bo Liu, Qiaoying Huang, Hui Qu, and Dimitris Metaxas. Oriented object detection in aerial images with box boundary-aware vectors. In WACV, 2021. 55 [290] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3D object detection and tracking. In CVPR, 2021. 1, 47, 64 [291] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2015. 38, 39, 41, 121 [292] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, 2018. 123 [293] Xiang Yu, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convo- lutional neural network for 6D object pose estimation in cluttered scenes. In RSS, 2018. 1, 44, 61 [294] Syed Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for fast image restoration and enhancement. TPAMI, 2022. 153 [295] Arthur Zhang, Chaitanya Eranki, Christina Zhang, Raymond Hong, Pranav Kalyani, Lochana Kalyanaraman, Arsh Gamare, Maria Esteva, and Joydeep Biswas. Towards robust 3D robot perception in urban environments: The UT Campus Object Dataset (CODa). In IROS, 2023. 164 [296] Hao Zhang, Hongyang Li, Xingyu Liao, Feng Li, Shilong Liu, Lionel Ni, and Lei Zhang. DA-BEV: Depth aware BEV transformer for 3D object detection. arXiv preprint 100 arXiv:2302.13002, 2023. 48, 65 [297] Jason Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, and Shubham Tulsiani. Cameras as rays: Pose estimation via ray diffusion. 
In ICLR, 2024. 65 [298] Jianming Zhang, Stan Sclaroff, Zhe Lin, Xiaohui Shen, Brian Price, and Radomir Mech. Unconstrained salient object detection via proposal subset optimization. In CVPR, 2016. 9, 16, 17 [299] Jinqing Zhang, Yanan Zhang, Qingjie Liu, and Yunhong Wang. SA-BEV: Generating semantic-aware bird’s-eye-view feature for multi-view 3D object detection. In ICCV, 2023. 59 [300] Renrui Zhang, Han Qiu, Tai Wang, Xuanzhuo Xu, Ziyu Guo, Yu Qiao, Peng Gao, and Hongsheng Li. MonoDETR: Depth-guided transformer for monocular 3D object detection. In ICCV, 2023. 55, 56, 57, 154, 157 [301] Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are different: Flexible monocular 3D object detection. In CVPR, 2021. 25, 29, 35, 36, 48, 65, 132 [302] Yinmin Zhang, Xinzhu Ma, Shuai Yi, Jun Hou, Zhihui Wang, Wanli Ouyang, and Dan Xu. Learning geometry-guided depth via projective modeling for monocular 3D object detection. arXiv preprint arXiv:2107.13931, 2021. 26 [303] Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie Zhou, and Jiwen Lu. BEVerse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022. 45, 48, 50, 53, 54, 55, 58, 59, 60, 65, 70, 149, 150, 151, 155 [304] Allan Zhou, Tom Knowles, and Chelsea Finn. Meta-learning symmetries by reparameteri- zation. In ICLR, 2021. 28 [305] Brady Zhou and Philipp Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. In CVPR, 2022. 48 [306] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019. 25 [307] Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, and Qinhong Jiang. MonoEF: Extrinsic parameter free monocular 3D object detection. TPAMI, 2021. 26, 29, 35, 36, 48, 61, 65, 132 [308] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced grouping and sampling for point cloud 3D object detection. In CVPR Workshop, 2019. 2, 45, 59 [309] Wei Zhu, Qiang Qiu, Robert Calderbank, Guillermo Sapiro, and Xiuyuan Cheng. Scale-equivariant neural networks with decomposed convolutional filters. arXiv preprint 101 arXiv:1909.11193, 2019. 27, 28, 32, 132 [310] Zijian Zhu, Yichi Zhang, Hai Chen, Yinpeng Dong, Shu Zhao, Wenbo Ding, Jiachen Zhong, and Shibao Zheng. Understanding the robustness of 3D object detection with bird’s-eye-view representations in autonomous driving. In CVPR, 2023. 1, 44, 61 [311] Zhuofan Zong, Dongzhi Jiang, Guanglu Song, Zeyue Xue, Jingyong Su, Hongsheng Li, and Yu Liu. Temporal enhanced training of multi-view 3D object detector via historical object prediction. In ICCV, 2023. 45, 48, 55, 59, 60, 65, 149, 150, 151, 155 [312] Zhikang Zou, Xiaoqing Ye, Liang Du, Xianhui Cheng, Xiao Tan, Li Zhang, Jianfeng Feng, Xiangyang Xue, and Errui Ding. The devil is in the task: Exploiting reciprocal appearance- localization features for monocular 3D object detection. In ICCV, 2021. 35, 36 [313] Philip Zwicke and Imre Kiss. A new implementation of the mellin transform and its application to radar classification of ships. TPAMI, 1983. 27, 28, 33, 39, 41 102 APPENDIX A PUBLICATIONS First-Author Publications. A list of all first-authored peer-reviewed publications during the Ph.D. program listed in reverse chronological order. • Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren and Xiaoming Liu. “CHARM3R: Towards Camera Height Agnostic Monocular 3D Object Detector ", ICCV, 2025 (under review). 
• Abhinav Kumar, Yuliang Guo, Xinyu Huang, Liu Ren and Xiaoming Liu. “SeaBird: Segmentation in Bird’s View with Dice Loss Improves 3D Detection of Large Objects", CVPR, 2024.
• Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami and Xiaoming Liu. “DEVIANT: Depth Equivariant Network for Monocular 3D Object Detection", ECCV, 2022.
• Abhinav Kumar, Garrick Brazil and Xiaoming Liu. “GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection", CVPR, 2021.
• Abhinav Kumar∗, Tim Marks∗, Wenxuan Mou∗, Ye Wang, Michael Jones, Anoop Cherian, Toshi Koike-Akino, Xiaoming Liu and Chen Feng. “LUVLi Face Alignment: Estimating Location, Uncertainty and Visibility Likelihood", CVPR, 2020.
Other Publications.
• Yunfei Long, Abhinav Kumar, Xiaoming Liu and Daniel Morris. “RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection", CVPR, 2025.
• Yuliang Guo, Abhinav Kumar, Chen Zhao, Ruoyu Wang, Xinyu Huang, and Liu Ren. “SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction", ECCV, 2024.
• Shengjie Zhu, Girish Ganesan, Abhinav Kumar and Xiaoming Liu. “RePLAy: Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry", ECCV, 2024.
• Shengjie Zhu, Abhinav Kumar, Masa Hu and Xiaoming Liu. “Tame a Wild Camera: In-the-Wild Monocular Camera Calibration", NeurIPS, 2023.
• Vishal Asnani, Abhinav Kumar, Suya You and Xiaoming Liu. “PrObeD: Proactive 2D Object Detection Wrapper", NeurIPS, 2023.
• Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson and Georgia Gkioxari. “Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild", CVPR, 2023.
• Yunfei Long, Abhinav Kumar, Daniel Morris, Xiaoming Liu, Marcos Castro and Punarjay Chakravarty. “RADIANT: Radar Image Association Network for 3D Object Detection", AAAI, 2023.
• Thiago Serra, Xin Yu, Abhinav Kumar, and Srikumar Ramalingam. “Scaling Up Exact Neural Network Compression by ReLU Stability", NeurIPS, 2021.
• Abhinav Kumar∗, Tim Marks∗, Wenxuan Mou∗, Chen Feng and Xiaoming Liu. “UGLLI Face Alignment: Estimating Uncertainty with Gaussian Log-Likelihood Loss". ICCV Workshops, 2019.
APPENDIX B
GROOMED-NMS APPENDIX
B.1 Detailed Explanation of NMS as a Matrix Operation
The rescoring process of the classical NMS is greedy and set-based [192], and calculates the rescore for a box i (Line 10 of Alg. 1) as

r_i = s_i \prod_{j \in d_{<i}} \left(1 - p(o_{ij})\right),   (B.1)

where d_{<i} is defined as the box indices sampled from d having higher scores than box i. For example, let us consider d = {1, 5, 7, 9}. Then, for i = 7, d_{<i} = {1, 5}, while for i = 1, d_{<i} = 𝜙, with 𝜙 denoting the empty set. This is possible since we had sorted the scores s and O in decreasing order (Lines 2-3 of Alg. 2) to remove the non-differentiable hard argmax operation of the classical NMS (Line 6 of Alg. 1). Classical NMS only takes the overlap with unsuppressed boxes into account. Therefore, we generalize Eq. (B.1) by accounting for the effect of all (suppressed and unsuppressed) boxes as

r_i = s_i \prod_{j=1}^{i-1} \left(1 - p(o_{ij})\, r_j\right).   (B.2)

The presence of r_j on the RHS of Eq. (B.2) prevents suppressed boxes (r_j ≈ 0) from influencing other boxes hugely. Let us say we have a box b_2 with a high overlap with an unsuppressed box b_1. The classical NMS with a threshold pruning function assigns r_2 = 0, while Eq. (B.2) assigns r_2 a small non-zero value with threshold pruning. Although Eq.
(B.2) keeps 𝑟𝑖 ≥ 0, getting a closed-form recursion in r is not easy because of the product operation. To get a closed-form recursion with addition/subtraction in r, we first carry out the polynomial multiplication and then ignore the higher-order terms as 𝑖−1 ∑︁ 𝑗=1 𝑖−1 ∑︁ 𝑗=1 1 − 𝑟𝑖 = 𝑠𝑖 (cid:169) (cid:173) (cid:171) 1 − ≈ 𝑠𝑖 (cid:169) (cid:173) (cid:171) 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 + O (𝑛2)(cid:170) (cid:174) (cid:172) 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 (cid:170) (cid:174) (cid:172) 105 Table B.1 Results on using Oracle NMS scores on KITTI Val 1 cars detection. [Key: Best] (− (cid:17)) AP 3D|𝑅40 NMS Scores (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Kinematic (Image) 18.29 13.55 10.13 25.72 18.82 14.48 93.69 84.07 67.14 9.36 9.93 6.40 12.27 10.43 8.72 99.18 95.66 85.77 Oracle IoU2D 87.93 73.10 60.91 93.47 83.61 71.31 80.99 78.38 67.66 Oracle IoU3D AP BEV|𝑅40 AP 2D|𝑅40 (cid:17)) (− (− ≈ 𝑠𝑖 − 𝑖−1 ∑︁ 𝑗=1 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 . (B.3) Dropping the 𝑠𝑖 in the second term of Eq. (B.3) helps us get a cleaner form of Eq. (B.7). Moreover, it does not change the nature of the NMS since the subtraction keeps the relation 𝑟𝑖 ≤ 𝑠𝑖 intact as 𝑝(𝑜𝑖 𝑗 ) and 𝑟 𝑗 are both between [0, 1]. We can also reach Eq. (B.3) directly as follows. Classical NMS suppresses a box which has a high IoU2D overlap with any of the unsuppressed boxes (𝑟 𝑗 ≈ 1) to zero. We consider any as a logical non-differentiable OR operation and use logical OR (cid:212) operator’s differentiable relaxation as (cid:205) [106, 124]. We next use this relaxation with the other expression r ≤ s. When a box shows overlap with more than two unsuppressed boxes, the term 𝑖−1 (cid:205) 𝑗=1 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 > 1 in Eq. (B.3) or when a box shows high overlap with one unsuppressed box, the term 𝑠𝑖 < 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 . In both of these cases, 𝑟𝑖 < 0. So, we lower bound Eq. (B.3) with a max operation to ensure that 𝑟𝑖 ≥ 0. Thus, 𝑠𝑖 − 𝑖−1 ∑︁ 𝑗=1 𝑟𝑖 ≈ max (cid:169) (cid:173) (cid:171) 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 , 0(cid:170) (cid:174) (cid:172) We write the rescores r in a matrix formulation as . (B.4) ≈ max 𝑟1     𝑟2    𝑟3   ...      𝑟𝑛                   (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) − 𝑠1 𝑠2 𝑠3 ... 𝑠𝑛                                 0 𝑝(𝑜21) 0 0 𝑝(𝑜31) 𝑝(𝑜32) ... ... 𝑝(𝑜𝑛1) 𝑝(𝑜𝑛2)                 . . . 0 . . . 0               . . . 0   . . . 0 ... ... We next write the above equation compactly as r ≈ max(s − Pr, 0), 106 , 𝑟1     𝑟2    𝑟3   ...      𝑟𝑛                   .                 0     0    0   ...      0   (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) (B.5) (B.6) where P, called the Prune Matrix, is obtained by element-wise operation of the pruning function 𝑝 on O (cid:108) (cid:108) . Maximum operation makes Eq. (B.6) non-linear [112] and, thus, difficult to solve. However, for a differentiable NMS layer, we need to avoid the recursion. Therefore, we first solve Eq. (B.6) assuming the max operation is not present which gives us the solution r ≈ (I + P)−1 s. In general, this solution is not necessarily bounded between 0 and 1. Hence, we clip it explicitly to obtain the approximation r ≈ (cid:4)(I + P)−1 s(cid:7) , (B.7) which we use as the solution to Eq. (B.6). B.2 Loss Functions We now detail out the loss functions used for training. 
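As a brief recap of Appendix B.1 before detailing the losses, the closed-form rescoring of Eqs. (B.5)-(B.7) can be sketched in a few lines of differentiable PyTorch. This is a minimal illustration rather than the released GrooMeD-NMS code: it assumes the boxes are already sorted by decreasing score, and it uses an illustrative linear-threshold pruning function p(o) = o * 1[o > N_t].

import torch

def groomed_rescore(scores, overlaps, nms_thresh=0.4):
    # scores:   (n,) class scores sorted in decreasing order.
    # overlaps: (n, n) IoU2D matrix O of the sorted boxes.
    # Prune matrix P (Eq. B.5): pruning function applied elementwise to O, kept
    # strictly lower-triangular so box i only sees higher-scored boxes j < i.
    P = ((overlaps > nms_thresh).float() * overlaps).tril(diagonal=-1)
    eye = torch.eye(scores.numel(), device=scores.device)
    # Closed-form solution of r = max(s - P r, 0): solve (I + P) r = s, then clip (Eq. B.7).
    r = torch.linalg.solve(eye + P, scores.unsqueeze(-1)).squeeze(-1)
    return r.clamp(0.0, 1.0)

Because every step (the matrix solve and the clipping) is differentiable, the rescored r can be supervised directly by the after-NMS loss introduced next.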
The losses on the boxes before NMS, L𝑏𝑒 𝑓 𝑜𝑟𝑒, is given by [17] where L𝑏𝑒 𝑓 𝑜𝑟𝑒 = L𝑐𝑙𝑎𝑠𝑠 + L2D + 𝑏𝑐𝑜𝑛 𝑓 L3D + 𝜆𝑐𝑜𝑛 𝑓 (1 − 𝑏𝑐𝑜𝑛 𝑓 ), L𝑐𝑙𝑎𝑠𝑠 = CE(𝑏𝑐𝑙𝑎𝑠𝑠, 𝑔𝑐𝑙𝑎𝑠𝑠), L2D = − log(IoU(𝑏2D, 𝑔2D)), L3D = Smooth-L1(𝑏3D, 𝑔3D) + 𝜆𝑎CE( [𝑏𝜃𝑎, 𝑏𝜃ℎ], [𝑔𝜃𝑎, 𝑔𝜃ℎ]). (B.8) (B.9) (B.10) (B.11) 𝑏𝑐𝑜𝑛 𝑓 is the predicted self-balancing confidence of each box 𝑏, while 𝑏𝜃𝑎 and 𝑏𝜃ℎ are its orientation bins [17]. 𝑔 denotes the ground-truth. 𝜆𝑐𝑜𝑛 𝑓 is the rolling mean of most recent L3D losses per mini- batch [17], while 𝜆𝑎 denotes the weight of the orientation bins loss. CE and Smoooth-L1 denote the Cross Entropy and Smooth L1 loss respectively. Note that we apply 2D and 3D regression losses as well as the confidence losses only on the foreground boxes. 107 Table B.2 Detailed comparisons with other NMS during inference on KITTI Val 1 cars. IoU3D ≥ 0.7 (cid:17)) (− Inference NMS (− (cid:17)) AP 3D| 𝑅40 AP BEV| 𝑅40 Classical Soft [12] (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 18.28 13.55 10.13 25.72 18.82 14.48 54.70 39.33 31.25 60.87 44.36 34.48 Kinematic (Image) [17] Kinematic (Image) [17] 18.29 13.55 10.13 25.71 18.81 14.48 54.70 39.33 31.26 60.87 44.36 34.48 Kinematic (Image) [17] Distance [216] 18.25 13.53 10.11 25.71 18.82 14.48 54.70 39.33 31.26 60.87 44.36 34.48 18.26 13.51 10.10 25.67 18.77 14.44 54.59 39.25 31.18 60.78 44.28 34.40 Kinematic (Image) [17] GrooMeD 19.67 14.31 11.27 27.38 19.75 15.93 55.64 41.08 32.91 61.85 44.98 36.31 Classical GrooMeD-NMS 19.67 14.31 11.27 27.38 19.75 15.93 55.64 41.08 32.91 61.85 44.98 36.31 Soft [12] GrooMeD-NMS Distance [216] 19.67 14.31 11.27 27.38 19.75 15.93 55.64 41.08 32.91 61.85 44.98 36.31 GrooMeD-NMS 19.67 14.32 11.27 27.38 19.75 15.92 55.62 41.07 32.89 61.83 44.98 36.29 GrooMeD-NMS AP BEV| 𝑅40 AP 3D| 𝑅40 GrooMeD (− IoU3D ≥ 0.5 (cid:17)) (− As explained in Sec. 2.4.3, the loss on the boxes after NMS, L𝑎 𝑓 𝑡𝑒𝑟, is the Imagewise AP-Loss, which is given by L𝑎 𝑓 𝑡𝑒𝑟 = L𝐼𝑚𝑎𝑔𝑒𝑤𝑖𝑠𝑒 = 1 𝑁 𝑁 ∑︁ AP(r(𝑚), target(B (𝑚))), 𝑚=1 Let 𝜆 be the weight of the L𝑎 𝑓 𝑡𝑒𝑟 term. Then, our overall loss function is given by L = L𝑏𝑒 𝑓 𝑜𝑟𝑒 + 𝜆L𝑎 𝑓 𝑡𝑒𝑟 = L𝑐𝑙𝑎𝑠𝑠 + L2D + 𝑏𝑐𝑜𝑛 𝑓 L3D + 𝜆𝑐𝑜𝑛 𝑓 (1 − 𝑏𝑐𝑜𝑛 𝑓 ) + 𝜆L𝐼𝑚𝑎𝑔𝑒𝑤𝑖𝑠𝑒 = CE(𝑏𝑐𝑙𝑎𝑠𝑠, 𝑔𝑐𝑙𝑎𝑠𝑠) − log(IoU(𝑏2D, 𝑔2D)) + 𝑏𝑐𝑜𝑛 𝑓 Smooth-L1(𝑏3D, 𝑔3D) + 𝜆𝑎 𝑏𝑐𝑜𝑛 𝑓 CE( [𝑏𝜃𝑎, 𝑏𝜃ℎ], [𝑔𝜃𝑎, 𝑔𝜃ℎ]) (B.12) (B.13) (B.14) + 𝜆𝑐𝑜𝑛 𝑓 (1 − 𝑏𝑐𝑜𝑛 𝑓 ) + 𝜆L𝐼𝑚𝑎𝑔𝑒𝑤𝑖𝑠𝑒. (B.15) We keep 𝜆𝑎 = 0.35 following [17] and 𝜆 = 0.05. Clearly, all our losses and their weights are identical to [17] except L𝐼𝑚𝑎𝑔𝑒𝑤𝑖𝑠𝑒. B.3 Additional Experiments and Results We now provide additional details and results evaluating our system’s performance. B.3.1 Training Training images are augmented using random flipping with probability 0.5 [17]. Adam opti- mizer [102] is used with batch size 2, weight-decay 5 × 10−4 and gradient clipping of 1 [15, 17]. 108 Warmup starts with a learning rate 4 × 10−3 following a poly learning policy with power 0.9 [17]. Warmup and full training phases take 80𝑘 and 50𝑘 mini-batches respectively for Val 1 and Val 2 Splits [17] while take 160𝑘 and 100𝑘 mini-batches for Test Split. B.3.2 KITTI Val 1 Oracle NMS Experiments As discussed in Sec. 2.1, to understand the effects of an inference-only NMS on 2D and 3D object detection, we conduct a series of oracle experiments. We create an oracle NMS by taking the Val Car boxes of KITTI Val 1 Split from the baseline Kinematic (Image) model before NMS and replace their scores with their true IoU2D or IoU3D with the ground-truth, respectively. 
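A minimal sketch of this oracle scoring is given below, assuming a generic overlap routine iou_fn (IoU2D or IoU3D, depending on the experiment) and the classical NMS used elsewhere in this chapter; both names are placeholders for illustration.

import numpy as np

def oracle_scores(pred_boxes, gt_boxes, iou_fn):
    # Each predicted box is rescored with its best overlap against any ground-truth box.
    scores = np.zeros(len(pred_boxes))
    for i, box in enumerate(pred_boxes):
        scores[i] = max((iou_fn(box, gt) for gt in gt_boxes), default=0.0)
    return scores

# classical_nms(pred_boxes, oracle_scores(pred_boxes, gt_boxes, iou_fn)) is then run
# exactly as in inference, with the oracle scores replacing the classification scores.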
Note that this corresponds to the oracle because we do not know the ground-truth boxes during inference. We then pass the boxes with the oracle scores through the classical NMS and report the results in Tab. B.1. The results show that the AP3D increases by a staggering > 60 AP on Mod cars when we use oracle IoU3D as the NMS score. On the other hand, we only see an increase in AP 2D by ≈ 11 AP on Mod cars when we use oracle IoU2D as the NMS score. Thus, the relative effect of using oracle IoU3D NMS scores on 3D detection is more significant than using oracle IoU2D NMS scores on 2D detection. In other words, the mismatch is greater between classification and 3D localization compared to the mismatch between classification and 2D localization. B.3.3 KITTI Val 1 3D Object Detection Comparisons with other NMS. We compare GrooMeD-NMS with the other NMS—classical, Soft [12] and Distance-NMS [216] and report the detailed results in Tab. B.2. We use the publicly released Soft-NMS code and Distance-NMS code from the respective authors. The Distance- NMS model uses the class confidence scores divided by the uncertainty in 𝑧 (the most erroneous dimension in 3D localization [221]) of a box as the Distance-NMS [216] input. Our model does not predict the uncertainty in 𝑧 of a box but predicts its self-balancing confidence (the 3D localization score). Therefore, we use the class confidence scores multiplied by the self-balancing confidence as the Distance-NMS input. The results in Tab. B.2 show that NMS inclusion in the training pipeline benefits the perfor- 109 Table B.3 Sensitivity to NMS threshold 𝑁𝑡 on KITTI Val 1 cars. [Key: Best] (− (cid:17)) AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard 𝑁𝑡 = 0.3 17.49 13.32 10.54 26.07 18.94 14.61 𝑁𝑡 =0.4 19.67 14.32 11.27 27.38 19.75 15.92 𝑁𝑡 = 0.5 19.65 13.93 11.09 26.15 19.15 14.71 (− Table B.4 Sensitivity to valid box threshold 𝑣 on KITTI Val 1 cars. [Key: Best] (− (− (cid:17)) AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard 𝑣 = 0.01 13.71 9.65 7.24 17.73 12.47 9.36 𝑣 = 0.1 19.37 13.99 10.92 26.95 19.84 15.40 𝑣 = 0.2 19.65 14.31 11.24 27.35 19.73 15.89 𝑣 = 0.3 19.67 14.32 11.27 27.38 19.75 15.92 𝑣 = 0.4 19.67 14.33 11.28 27.38 19.76 15.93 𝑣 = 0.5 19.67 14.33 11.28 27.38 19.76 15.93 𝑣 = 0.6 19.67 14.33 11.29 27.39 19.77 15.95 mance, unlike [12], which suggests otherwise. Training with GrooMeD-NMS helps because the network gets an additional signal through the GrooMeD-NMS layer whenever the best-localized box corresponding to an object is not selected. Moreover, Tab. B.2 suggests that we can replace GrooMeD-NMS with the classical NMS in inference as the performance is almost the same even at IoU3D = 0.5. How good is the classical NMS approximation? GrooMeD-NMS uses several approximations to arrive at the matrix solution Eq. (B.7). We now compare how good these approximations are with the classical NMS. Interestingly, Tab. B.2 shows that GrooMeD-NMS is an excellent approximation to the classical NMS as the performance does not degrade after changing the NMS in inference. B.3.4 KITTI Val 1 Sensitivity Analysis There are a few adjustable parameters for the GrooMeD-NMS, such as the NMS threshold 𝑁𝑡, valid box threshold 𝑣, the maximum group size 𝛼, the weight 𝜆 for the L𝑎 𝑓 𝑡𝑒𝑟, and 𝛽. We carry out a sensitivity analysis to understand how these parameters affect performance and speed, and how sensitive the algorithm is to these parameters. Sensitivity to NMS Threshold. We show the sensitivity to NMS threshold 𝑁𝑡 in Tab. B.3. The results in Tab. 
B.3 show that the optimal 𝑁𝑡 = 0.4. This is also the 𝑁𝑡 in [15, 17]. 110 Figure B.1 Sensitivity to group size 𝛼 on KITTI Val 1 Moderate cars. Sensitivity to Valid Box Threshold. We next show the sensitivity to valid box threshold 𝑣 in Tab. B.4. Our choice of 𝑣 = 0.3 performs close to the optimal choice. Sensitivity to Maximum Group Size. Grouping has a parameter group size (𝛼). We vary this parameter and report AP 3D|𝑅40 and AP BEV|𝑅40 at two different IoU3D thresholds on Moderate cars of KITTI Val 1 Split in Fig. B.1. We note that the best AP 3D|𝑅40 performance is obtained at 𝛼 = 100 and we, therefore, set 𝛼 = 100 in our experiments. Sensitivity to Loss Weight. We now show the sensitivity to loss weight 𝜆 in Tab. B.5. Our choice of 𝜆 = 0.05 is the optimal value. Sensitivity to Best Box Threshold. We now show the sensitivity to the best box threshold 𝛽 in Tab. B.6. Our choice of 𝛽 = 0.3 is the optimal value. Conclusion. GrooMeD-NMS has minor sensitivity to 𝑁𝑡, 𝛼, 𝜆 and 𝛽, which is common in object detection. GrooMeD-NMS is not as sensitive to 𝑣 since it only decides a box’s validity. Our parameter choice is either at or close to the optimal. The inference speed is only affected by 𝛼. Other parameters are used in training or do not affect inference speed. B.3.5 Qualitative Results We next show some qualitative results of models trained on KITTI Val 1 Split in Fig. B.2. We depict the predictions of GrooMeD-NMS in image view on the left and the predictions of GrooMeD-NMS, Kinematic (Image) [17], and ground truth in BEV on the right. In general, 111 Table B.5 Sensitivity to loss weight 𝜆 on KITTI Val 1 cars. [Key: Best] (− (− (cid:17)) AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard 19.16 13.89 10.96 27.01 19.33 14.84 𝜆 = 0 𝜆 = 0.05 19.67 14.32 11.27 27.38 19.75 15.92 𝜆 = 0.1 17.74 13.61 10.81 25.86 19.18 15.57 10.08 7.26 6.00 14.44 10.55 8.41 𝜆 = 1 Table B.6 Sensitivity to best box threshold 𝛽 on KITTI Val 1 cars. [Key: Best] (− (− (cid:17)) AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard 𝛽 = 0.1 18.09 13.64 10.21 26.52 19.50 15.74 𝛽 = 0.3 19.67 14.32 11.27 27.38 19.75 15.92 𝛽 = 0.4 18.91 14.02 11.15 27.11 19.64 15.90 𝛽 = 0.5 18.49 13.66 10.96 27.01 19.47 15.79 GrooMeD-NMS predictions are more closer to the ground truth than Kinematic (Image) [17]. B.3.6 Demo Video of GrooMeD-NMS We next include a short demo video of our GrooMeD-NMS model trained on KITTI Val 1 Split. We run our trained model independently on each frame of the three KITTI raw [66] sequences - 2011_10_03_drive_0047, 2011_09_29_drive_0026 and 2011_09_26_drive_0009. None of the frames from these three raw sequences appear in the training set of KITTI Val 1 Split. We use the camera matrices available with the raw sequences but do not use any temporal information. Overlaid on each frame of the raw input videos, we plot the projected 3D boxes of the predictions and also plot these 3D boxes in the BEV. We set the frame rate of this demo at 10 fps. The demo is also available in HD at https://www.youtube.com/watch?v=PWctKkyWrno. In the demo video, notice that the orientation of the boxes are stable despite not using any temporal information. 112 Figure B.2 Qualitative Results (Best viewed in color). We depict the predictions of GrooMeD-NMS (magenta) in image view on the left and the predictions of GrooMeD-NMS, Kinematic (Image) [17] (blue), and Ground Truth (green) in BEV on the right. In general, GrooMeD-NMS predictions are more closer to the ground truth than Kinematic (Image) [17]. 
113 APPENDIX C DEVIANT APPENDIX C.1 Supportive Explanations We now add some explanations which we could not put in the main chapter because of the space constraints. C.1.1 Equivariance vs Augmentation Equivariance adds suitable inductive bias to the backbone [43, 50] and is not learnt. Augmen- tation adds transformations to the input data during training or inference. Equivariance and data augmentation have their own pros and cons. Equivariance models the physics better, is mathematically principled and is so more agnostic to data distribution shift compared to the data augmentation. A downside of equivariance compared to the augmentation is equivariance requires mathematical modelling, may not always exist [21], is not so intuitive and generally requires more flops for inference. On the other hand, data augmentation is simple, intuitive and fast, but is not mathematically principled. The choice between equivariance and data augmentation is a withstanding question in machine learning [63]. C.1.2 Why do 2D CNN detectors generalize? We now try to understand why 2D CNN detectors generalize well. Consider an image ℎ(𝑢, 𝑣) and Φ be the CNN. Let Tt denote the translation in the (𝑢, 𝑣) space. The 2D translation equivariance [19, 20, 198] of the CNN means that Φ(Ttℎ(𝑢, 𝑣)) = TtΦ(ℎ(𝑢, 𝑣)) =⇒ Φ(ℎ(𝑢 + 𝑡𝑢, 𝑣 + 𝑡𝑣)) = Φ(ℎ(𝑢, 𝑣)) + (𝑡𝑢, 𝑡𝑣) (C.1) where (𝑡𝑢, 𝑡𝑣) is the translation in the (𝑢, 𝑣) space. Assume the CNN predicts the object position in the image as (𝑢′, 𝑣′). Then, we write Φ(ℎ(𝑢, 𝑣)) = ( ˆ𝑢, ˆ𝑣) (C.2) Now, we want the CNN to predict the output the position of the same object translated by (𝑡𝑢, 𝑡𝑣). The new image is thus ℎ(𝑢 + 𝑡𝑢, 𝑣 + 𝑡𝑣). The CNN easily predicts the translated position of 114 𝑧 𝑥 𝑦 (𝑥, 𝑦, 𝑧) Patch Plane 𝑚𝑋+𝑛𝑌 +𝑜𝑍+𝑝 = 0 𝑓 (𝑢0,𝑣0) ℎ(𝑢, 𝑣) 𝑡𝑍 𝑓 (𝑢0,𝑣0) ℎ′(𝑢′, 𝑣′) Figure C.1 Equivariance exists for the patch plane when there is depth translation of the ego camera. Downscaling converts image ℎ to image ℎ′. ℎ(𝑢, 𝑣) ℎ′(𝑢′, 𝑣′) Figure C.2 Example of non-existence of equivariance [21] when there is 180◦ rotation of the ego camera. No transformation can convert image ℎ to image ℎ′. the object because all CNN is to do is to invoke its 2D translation equivariance of Eq. (C.1), and translate the previous prediction by the same amount. In other words, Φ(ℎ(𝑢 + 𝑡𝑢, 𝑣 + 𝑡𝑣)) = Φ(ℎ(𝑢, 𝑣)) + (𝑡𝑢, 𝑡𝑣) = ( ˆ𝑢, ˆ𝑣) + (𝑡𝑢, 𝑡𝑣) = ( ˆ𝑢 + 𝑡𝑢, ˆ𝑣 + 𝑡𝑣) Intuitively, equivariance is a disentaglement method. The 2D translation equivariance disentangles the 2D translations (𝑡𝑢, 𝑡𝑣) from the original image ℎ and therefore, the network generalizes to unseen 2D translations. C.1.3 Existence and Non-existence of Equivariance The result from [21] says that generic projective equivariance does not exist in particular with rotation transformations. We now show an example of when the equivariance exists and does not exist in the projective manifold in Figs. C.1 and C.2 respectively. 115 C.1.4 Why do not Monocular 3D CNN detectors generalize? Monocular 3D CNN detectors do not generalize well because they are not equivariant to arbitrary 3D translations in the projective manifold. To show this, let 𝐻 (𝑋, 𝑌 , 𝑍) denote a 3D point cloud. The monocular detection network Φ operates on the projection ℎ(𝑢, 𝑣) of this point cloud 𝐻 to output the position ( ˆ𝑥, ˆ𝑦, ˆ𝑧) as Φ(K𝐻 (𝑋, 𝑌 , 𝑍)) = ( ˆ𝑥, ˆ𝑦, ˆ𝑧) =⇒ Φ(ℎ(𝑢, 𝑣)) = ( ˆ𝑥, ˆ𝑦, ˆ𝑧), where K denotes the projection operator. We translate this point cloud by an arbitrary 3D translation of (𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ) to obtain the new point cloud 𝐻 (𝑋 + 𝑡𝑋, 𝑌 + 𝑡𝑌 , 𝑍 + 𝑡𝑍 ). 
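This non-distributivity is easy to verify numerically with a toy pinhole camera; the intrinsics below are only illustrative (KITTI-like), and the example tracks a single 3D point rather than a full point cloud.

import numpy as np

f, u0, v0 = 707.0, 621.0, 187.0                  # illustrative pinhole intrinsics

def project(P):                                   # projection operator K
    X, Y, Z = P
    return np.array([f * X / Z + u0, f * Y / Z + v0])

t = np.array([1.0, 0.0, 0.0])                     # one fixed 3D translation
near = np.array([2.0, 1.0, 10.0])
far  = np.array([2.0, 1.0, 40.0])

print(project(near + t) - project(near))          # pixel shift ~ [70.7, 0.0]
print(project(far + t) - project(far))            # pixel shift ~ [17.7, 0.0]
# The same 3D translation induces different 2D shifts depending on depth, so K does
# not distribute over H and (t_X, t_Y, t_Z), and no single image translation
# (t_u, t_v) lets a vanilla CNN reuse its 2D translation equivariance.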
Then, we again ask the monocular detector Φ to do prediction over the translated point cloud. However, we find that Φ(K𝐻 (𝑋 + 𝑡𝑋, 𝑌 + 𝑡𝑌 , 𝑍 + 𝑡𝑍 )) ≠ Φ(ℎ(𝑢 + K(𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ), 𝑣 + K(𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ))) =⇒ Φ(K𝐻 (𝑋 + 𝑡𝑋, 𝑌 + 𝑡𝑌 , 𝑍 + 𝑡𝑍 )) ≠ Φ(K𝐻 (𝑋, 𝑌 , 𝑍)) + K(𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ) = Φ(ℎ(𝑢, 𝑣)) + K(𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ) In other words, the projection operator K does not distribute over the point cloud 𝐻 and arbitrary 3D translation of (𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ). Hence, if the network Φ is a vanilla CNN (existing monocular backbone), it can no longer invoke its 2D translation equivariance of Eq. (C.1) to get the new 3D coordinates ( ˆ𝑥 + 𝑡𝑋, ˆ𝑦 + 𝑡𝑌 , ˆ𝑧 + 𝑡𝑍 ). Note that the LiDAR based 3D detectors with 3D convolutions do not suffer from this problem because they do not involve any projection operator K. Thus, this problem exists only in monocular 3D detection. This makes monocular 3D detection different from 2D and LiDAR based 3D object detection. C.1.5 Overview of Planar Transformations: Th. 1 We now pictorially provide the overview of Th. 1 (Example 13.2 from [76]), which links the planarity and projective transformations in the continuous world in Fig. C.3. 116 Continuous WorldDiscrete World 3D point on plane Projective 2D point Sampling 2D pixel (R, t) Th. 1 3D point on plane Projective 2D point Sampling 2D pixel Figure C.3 Overview of Th. 1 (Example 13.2 from [76]), which links the planarity and projective transfor- mations in the continuous world. C.1.6 Approximation of Scale Transformations: Corollary 1.1 We now give the approximation under which Corollary 1.1 is valid. We assume that the ego camera does not undergo any rotation. Hence, we substitute R = I in Eq. (3.1) to get ℎ(𝑢 − 𝑢0, 𝑣 − 𝑣0) = ℎ′ (cid:169) (cid:173) (cid:173) (cid:171) 𝑓 (cid:16)1+𝑡𝑋 (cid:17) 𝑚 𝑝 (𝑢−𝑢0) +𝑡𝑋 𝑡𝑍 𝑚 𝑝 (𝑢−𝑢0) +𝑡𝑍 𝑡𝑌 𝑚 𝑝 (𝑢−𝑢0) + 𝑓 𝑛 𝑝 (𝑣 −𝑣0) + (cid:17) (cid:16)1+𝑡𝑌 𝑛 𝑝 𝑡𝑍 𝑚 𝑝 (𝑢−𝑢0) + 𝑡𝑍 𝑛 𝑝 (𝑣 −𝑣0) + 𝑛 𝑝 (𝑣 −𝑣0) +𝑡𝑋 (cid:16)1+𝑡𝑍 𝑜 𝑝 𝑜 𝑝 𝑓 (cid:17) 𝑓 , (𝑣 −𝑣0) +𝑡𝑌 (cid:16)1+𝑡𝑍 𝑜 𝑝 𝑜 𝑝 𝑓 (cid:17) 𝑓 (cid:170) (cid:174) (cid:174) (cid:172) . (C.3) Next, we use the assumption that the ego vehicle moves in the 𝑧-direction as in [17], i.e., substitute 𝑡𝑋 = 𝑡𝑌 = 0 to get ℎ(𝑢−𝑢0, 𝑣 −𝑣0) = ℎ′ (cid:169) (cid:173) (cid:173) (cid:171) 𝑡𝑍 𝑓 𝑢 − 𝑢0 𝑛 𝑝 (𝑣 −𝑣0) + 𝑚 𝑝 (𝑢−𝑢0) + 𝑡𝑍 𝑓 , (cid:17) (cid:16)1+𝑡𝑍 𝑜 𝑝 𝑝 (𝑢−𝑢0) + 𝑡𝑍 The patch plane is 𝑚𝑥 + 𝑛𝑦 + 𝑜𝑧 + 𝑝 = 0. We consider the planes in the front of camera. Without 𝑡𝑍 𝑓 𝑜 𝑝 𝑚 𝑓 𝑣 − 𝑣0 𝑛 𝑝 (𝑣 −𝑣0) + (cid:16)1+𝑡𝑍 (cid:17) . (cid:170) (cid:174) (cid:174) (cid:172) (C.4) loss of generality, consider 𝑝 < 0 and 𝑜 > 0. We first write the denominator 𝐷 of RHS term in Eq. (C.4) as 𝐷 = 𝑡𝑍 𝑓 𝑚 𝑝 (𝑢−𝑢0) + 𝑡𝑍 𝑓 𝑛 𝑝 117 (𝑣 −𝑣0) + (cid:18) 1+𝑡𝑍 (cid:19) 𝑜 𝑝 = 1 + 𝑡𝑍 𝑝 (cid:18) 𝑚 𝑓 (𝑢−𝑢0) + (cid:19) (𝑣 −𝑣0) + 𝑜 𝑛 𝑓 Because we considered patch plane s in front of the camera, 𝑝 < 0. Also consider 𝑡𝑍 < 0, which implies 𝑡𝑍 /𝑝 > 0. 
Now, we bound the term in the parantheses of the above equation as (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (𝑢−𝑢0) + (𝑣 −𝑣0) + 𝑜 𝐷 ≤ 1 + ≤ 1 + ≤ 1 + ≤ 1 + ≤ 1 + 𝑡𝑍 𝑝 𝑡𝑍 𝑝 𝑡𝑍 𝑝 𝑡𝑍 𝑝 𝑡𝑍 𝑝 𝑛 𝑓 (cid:13) 𝑚 (cid:13) (cid:13) 𝑓 (cid:13) (cid:18)(cid:13) 𝑚 (cid:13) (cid:13) 𝑓 (cid:13) (cid:18) ∥𝑚∥ 𝑓 (cid:18) ∥𝑚∥ 𝑓 + (cid:13) (cid:13) 𝑛 (cid:13) (cid:13) (cid:13) (cid:13) 𝑓 (cid:13) (cid:13) 𝐻 ∥𝑛∥ 2 𝑓 𝑊 ∥𝑛∥ 2 𝑓 + 𝑊 2 𝑊 2 (cid:18) (∥𝑚∥ + ∥𝑛∥)𝑊 2 𝑓 + (cid:19) , + 𝑜 (𝑢−𝑢0) (𝑣 −𝑣0) (cid:19) + ∥𝑜∥ by Triangle inequality (cid:19) + 𝑜 (cid:19) + 𝑜 , (𝑢−𝑢0) ≤ 𝑊 2 , (𝑣 −𝑣0) ≤ 𝐻 2 , ∥𝑜∥ = 𝑜 , 𝐻 ≤ 𝑊 If the coefficients of the patch plane 𝑚, 𝑛, 𝑜, its width 𝑊 and focal length 𝑓 follow the relationship (∥𝑚∥+∥𝑛∥)𝑊 2 𝑓 << 𝑜, the patch plane is “approximately” parallel to the image plane. Then, a few quantities can be ignored in the denominator 𝐷 to get D ≈ 1 + 𝑡𝑍 𝑜 𝑝 Therefore, the RHS of Eq. (C.4) gets simplified and we obtain (cid:32) 𝑢 − 𝑢0 1+𝑡𝑍 T𝑠 : ℎ(𝑢 − 𝑢0, 𝑣 − 𝑣0) ≈ ℎ′ , 𝑜 𝑝 (C.5) (C.6) (cid:33) 𝑣 − 𝑣0 𝑜 1+𝑡𝑍 𝑝 An immediate benefit of using the approximation is Eq. (3.2) does not depend on the distance of the patch plane from the camera. This is different from wide-angle camera assumption, where the ego camera is assumed to be far from the patch plane. Moreover, patch plane s need not be perfectly aligned with the image plane for Eq. (3.2). Even small enough perturbed patch plane s work. We next show the approximation in the Fig. C.4 with 𝜃 denoting the deviation from the perfect parallel plane. The deviation 𝜃 is about 3 degrees for the KITTI dataset while it is 6 degrees for the Waymo dataset. e.g. The following are valid patch plane s for KITTI images whose focal length 𝑓 = 707 and width 𝑊 = 1242. −0.05𝑥 + 0.05𝑦 + 𝑧 = 30 118 𝑧 𝑦 𝜃 Figure C.4 Approximation of Corollary 1.1. Bold shows the patch plane parallel to the image plane. The dotted line shows the approximated patch plane. 0.05𝑥 − 0.05𝑦 + 𝑧 = 30 (C.7) The following are valid patch plane s for Waymo images whose focal length 𝑓 = 2059 and width 𝑊 = 1920. −0.1𝑥 + 0.1𝑦 + 𝑧 = 30 0.1𝑥 − 0.1𝑦 + 𝑧 = 30 (C.8) Although the assumption is slightly restrictive, we believe our method shows improvements on both KITTI and Waymo datasets because the car patches are approximately parallel to image planes and also because the depth remains the hardest parameter to estimate [166]. C.1.7 Scale Equivariance of SES Convolution for Images [227] derive the scale equivariance of SES convolution for a 1D signal. We simply follow on their footsteps to get the scale equivariance of SES convolution for a 2D image ℎ(𝑢, 𝑣) for the sake of completeness. Let the scaling of the image ℎ be 𝑠. Let ∗ denote the standard vanilla convolution and Ψ denote the convolution filter. Then, the convolution of the downscaled image T𝑠 (ℎ) with the filter Ψ is given by (cid:19) Ψ(𝑢′ − 𝑢, 𝑣′ − 𝑣)𝑑𝑢′𝑑𝑣′ , ℎ ∫ ∫ [T𝑠 (ℎ) ∗ Ψ] (𝑢, 𝑣) (cid:18) 𝑢′ 𝑣′ 𝑠 𝑠 (cid:18) 𝑢′ 𝑠 (cid:18) 𝑢′ 𝑠 (cid:18) 𝑢′ 𝑠 = = 𝑠2 ∫ ∫ = 𝑠2 ∫ ∫ = 𝑠2 ∫ ∫ ℎ ℎ ℎ , , , (cid:19) (cid:19) (cid:19) 𝑣′ 𝑠 𝑣′ 𝑠 𝑣′ 𝑠 (cid:18) 𝑠 Ψ T𝑠−1 T𝑠−1 (cid:19) 𝑑 , 𝑠 𝑢′ − 𝑢 𝑠 (cid:18) 𝑢′ − 𝑢 𝑠 𝑣′ − 𝑣 𝑠 𝑣′ − 𝑣 𝑠 Ψ , (cid:20) (cid:19) (cid:18) 𝑢′ 𝑠 𝑑 (cid:19)(cid:21) 𝑑 (cid:19) (cid:18) 𝑣′ 𝑠 (cid:19) 𝑑 (cid:18) 𝑢′ 𝑠 (cid:18) 𝑢′ 𝑠 𝑑 (cid:19) (cid:18) 𝑣′ 𝑠 (cid:18) 𝑣′ 𝑠 (cid:19) (cid:19) 𝑑 (cid:20) Ψ (cid:18) 𝑢′ 𝑠 − 𝑢 𝑠 , 𝑣′ 𝑠 − 𝑣 𝑠 (cid:19)(cid:21) 119 = 𝑠2 [ℎ ∗ T𝑠−1 (Ψ)] (cid:17) (cid:16) 𝑢 𝑠 , 𝑣 𝑠 = 𝑠2T𝑠 [ℎ ∗ T𝑠−1 (Ψ)] (𝑢, 𝑣). 
Next, [227] re-parametrize the SES filters by writing Ψ𝜎 (𝑢, 𝑣) = 1 𝜎2 Ψ (cid:0) 𝑢 𝜎 , 𝑣 𝜎 (C.9) (cid:1). Substituting in Eq. (C.9), we get [T𝑠 (ℎ) ∗ Ψ𝜎] (𝑢, 𝑣) = 𝑠2T𝑠 [ℎ ∗ T𝑠−1 (Ψ𝜎)] (𝑢, 𝑣) (C.10) Moreover, the re-parametrized filters are separable [227] by construction and so, one can write Ψ𝜎 (𝑢, 𝑣) = Ψ𝜎 (𝑢)Ψ𝜎 (𝑣). (C.11) The re-parametrization and separability leads to the important property that T𝑠−1 (Ψ𝜎 (𝑢, 𝑣)) = T𝑠−1 (Ψ𝜎 (𝑢)Ψ𝜎 (𝑣)) = T𝑠−1 (Ψ𝜎 (𝑢)) T𝑠−1 (Ψ𝜎 (𝑣)) = 𝑠−2Ψ𝑠−1𝜎 (𝑢)Ψ𝑠−1𝜎 (𝑣) = 𝑠−2Ψ𝑠−1𝜎 (𝑢, 𝑣). (C.12) Substituting above in the RHS of Eq. (C.10), we get [T𝑠 (ℎ) ∗ Ψ𝜎] (𝑢, 𝑣) = 𝑠2T𝑠 (cid:2)ℎ ∗ 𝑠−2Ψ𝑠−1𝜎 (cid:3) (𝑢, 𝑣) =⇒ [T𝑠 (ℎ) ∗ Ψ𝜎] (𝑢, 𝑣) = T𝑠 [ℎ ∗ Ψ𝑠−1𝜎] (𝑢, 𝑣), (C.13) which is a cleaner form of Eq. (C.9). Eq. (C.13) says that convolving the downscaled image with a filter is same as the downscaling the result of convolving the image with the upscaled filter [227]. This additional constraint regularizes the scale (depth) predictions for the image, leading to better generalization. C.1.8 Why does DEVIANT generalize better compared to CNN backbone? DEVIANT models the physics better compared to the CNN backbone. CNN generalizes better for 2D detection because of the 2D translation equivariance in the Euclidean manifold. However, 120 Table C.1 Comparison of Methods on the basis of inputs, convolution kernels, outputs and whether output are scale-constrained. Method Input #Conv Frame Kernel Output Vanilla CNN Depth-Aware [15] Dilated CNN [291] DEVIANT Depth-guided [52] 1 + Depth Kinematic3D [17] 1 1 1 1 > 1 1 > 1 > 1 > 1 1 1 4D 4D 5D 5D 4D 5D Output Constrained for Scales? ✕ ✕ Integer [267] Float Integer [267] ✕ monocular 3D detection does not belong to the Euclidean manifold but is a task of the projective manifold. Modeling translation equivariance in the correct manifold improves generalization. For monocular 3D detection, we take the first step towards the general 3D translation equivariance by embedding equivariance to depth translations. The 3D depth equivariance in DEVIANT uses Eq. (C.10) and thus imposes an additional constraint on the feature maps. This additional constraint results in consistent depth estimates from the current image and a virtual image (obtained by translating the ego camera), and therefore, better generalization than CNNs. On the other hand, CNNs, by design, do not constrain the depth estimates from the current image and a virtual image (obtained by translating the ego camera), and thus, their depth estimates are entirely data-driven. C.1.9 Why not Fixed Scale Assumption? We now answer the question of keeping the fixed scale assumption. If we assume fixed scale assumption, then vanilla convolutional layers have the right equivariance. However, we do not keep this assumption because the ego camera translates along the depth in driving scenes and also, because the depth is the hardest parameter to estimate [166] for monocular detection. So, zero depth translation or fixed scale assumption is always violated. C.1.10 Comparisons with Other Methods We now list out the differences between different convolutions and monocular detection methods in Tab. C.1. Kinematic3D [17] does not constrain the output at feature map level, but at system level using Kalman Filters. The closest to our method is the Dilated CNN (DCNN) [291]. We show in Tab. 3.9 that DEVIANT outperforms Dilated CNN. 121 Multi-scale Steerable Basis Scale Conv. 
Output ⊗w⊗w⊗w * Kernel * * Input Scale- Projection 4D Output Figure C.5 (a) SES convolution [68, 227] The non-trainable basis functions multiply with learnable weights w to get kernels. The input then convolves with these kernels to get multi-scale 5D output. (b) Scale- Projection [227] takes max over the scale dimension of the 5D output and converts it to 4D. [Key: ∗ = Vanilla convolution.] C.1.11 Why is Depth the hardest among all parameters? Images are the 2D projections of the 3D scene, and therefore, the depth is lost during projection. Recovering this depth is the most difficult to estimate, as shown in Tab. 1 of [166]. Monocular detection task involves estimating 3D center, 3D dimensions and the yaw angle. The right half of Tab. 1 in [166] shows that if the ground truth 3D center is replaced with the predicted center, the detection reaches a minimum. Hence, 3D center is the most difficult to estimate among center, dimensions and pose. Most monocular 3D detectors further decompose the 3D center into projected (2D) center and depth. Out of projected center and depth, Tab. 1 of [166] shows that replacing ground truth depth with the predicted depth leads to inferior detection compared to replacing ground truth projected center with the predicted projected center. Hence, we conclude that depth is the hardest parameter to estimate. C.2 Implementation Details We now provide some additional implementation details for facilitating reproduction of this work. C.2.1 Steerable Filters of SES Convolution We use the scale equivariant steerable blocks proposed by [226] for our DEVIANT backbone. We now share the implementation details of these steerable filters. Basis. Although steerable filters can use any linearly independent functions as their basis, we stick 122 Figure C.6 Steerable Basis [227] for 7×7 SES convolution filters. (Showing only 8 of the 49 members for each scale). with the Hermite polynomials as the basis [226]. Let (0, 0) denote the center of the function and (𝑢, 𝑣) denote the pixel coordinates. Then, the filter coefficients 𝜓𝜎𝑛𝑚 [226] are 𝜓𝜎𝑛𝑚 = 𝐴 𝜎2 𝐻𝑛 (cid:17) (cid:16) 𝑢 𝜎 𝐻𝑚 (cid:16) 𝑣 𝜎 (cid:17) 𝑒− 𝑢2 +𝑣2 𝜎2 (C.14) 𝐻𝑛 denotes the Probabilist’s Hermite polynomial of the 𝑛th order, and 𝐴 is the normalization constant. The first six Probabilist’s Hermite polynomials are 𝐻0(𝑥) = 1 𝐻1(𝑥) = 𝑥 𝐻2(𝑥) = 𝑥2 − 1 𝐻3(𝑥) = 𝑥3 − 3𝑥 𝐻4(𝑥) = 𝑥4 − 6𝑥2 + 3 (C.15) (C.16) (C.17) (C.18) (C.19) Fig. C.6 visualizes some of the SES filters and shows that the basis is indeed at different scales. C.2.2 Monocular 3D Detection Architecture. We use the DLA-34 [292] configuration, with the standard Feature Pyramid Network (FPN) [138], binning and ensemble of uncertainties. FPN is a bottom-up feed-forward CNN that computes feature maps with a downscaling factor of 2, and a top-down network that brings them back to the high-resolution ones. There are total six feature maps levels in this FPN. We use DLA-34 as the backbone for our baseline GUP Net [159], while we use SES-DLA-34 as the backbone for DEVIANT. We also replace the 2D pools by 3D pools with pool along the scale dimensions as 1 for DEVIANT. 123 We initialize the vanilla CNN from ImageNet weights. For DEVIANT, we use the regularized least squares [226] to initialize the trainable weights in all the Hermite scales from the ImageNet [48] weights. Compared to initializing one of the scales as proposed in [226], we observed more stable convergence in initializing all the Hermite scales. We output three foreground classes for KITTI dataset. 
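Returning to the steerable basis of Eq. (C.14), the Hermite-Gaussian basis members can be generated as in the NumPy sketch below. The normalization constant A and the set of scales are chosen illustratively here and are not the exact values of the released implementation.

import numpy as np
from numpy.polynomial.hermite_e import hermeval   # probabilist's Hermite polynomials

def ses_basis_filter(size, sigma, n, m):
    # 2D Hermite-Gaussian basis function of Eq. (C.14) on a size x size grid.
    r = np.arange(size) - (size - 1) / 2.0
    u, v = np.meshgrid(r, r, indexing="ij")
    Hn = hermeval(u / sigma, [0] * n + [1])        # H_n(u / sigma)
    Hm = hermeval(v / sigma, [0] * m + [1])        # H_m(v / sigma)
    psi = Hn * Hm * np.exp(-(u ** 2 + v ** 2) / sigma ** 2) / sigma ** 2
    return psi / np.abs(psi).sum()                 # illustrative choice of A

# The non-trainable basis stacks such filters over several scales; the trainable
# weights w then combine the basis members into the SES kernels of Fig. C.5.
basis = [ses_basis_filter(7, s, n, m) for s in (1.0, 1.26, 1.59)
         for n in range(3) for m in range(3)]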
We also output three foreground classes for Waymo dataset ignoring the Sign class [199]. Datasets. We use the publicly available KITTI,Waymo and nuScenes datasets for our experi- ments. KITTI is available at http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark= 3d under CC BY-NC-SA 3.0 License. Waymo is available at https://waymo.com/intl/en_us/ dataset-download-terms/ under the Apache License, Version 2.0. nuScenes is available at https: //www.nuscenes.org/nuscenes under CC BY-NC-SA 4.0 International Public License. Augmentation. Unless otherwise stated, we horizontal flip the training images with probability 0.5, and use scale augmentation as 0.4 as well for all the models [159] in training. Pre-processing. The only pre-processing step we use is image resizing. • KITTI. We resize the [370, 1242] sized KITTI images, and bring them to the [384, 1280] resolution [159]. • Waymo. We resize the [1280, 1920] sized Waymo images, and bring them to the [512, 768] resolution. This resolution preserves their aspect ratio. Box Filtering. We apply simple hand-crafted rules for filtering out the boxes. We ignore the box if it belongs to a class different from the detection class. • KITTI. We train with boxes which are atleast 2𝑚 distant from the ego camera, and with visibility > 0.5 [159]. • Waymo. We train with boxes which are atleast 2𝑚 distant from the ego camera. The Waymo dataset does not have any occlusion based labels. However, Waymo provides the number of LiDAR points inside each 3D box which serves as a proxy for the occlusion. We train the boxes which have more than 100 LiDAR points for the vehicle class and have more than 50 LiDAR points for the cyclist and pedestrian class. 124 Training. We use the training protocol of GUP Net [159] for all our experiments. Training uses the Adam optimizer [102] and weight-decay 1 × 10−5 . Training dynamically weighs the losses using Hierarchical Task Learning (HTL) [159] strategy keeping 𝐾 as 5 [159]. Training also uses a linear warmup strategy in the first 5 epochs to stabilize the training. We choose the model saved in the last epoch as our final model for all our experiments. • KITTI. We train with a batch size of 12 on single Nvidia A100 (40GB) GPU for 140 epochs. Training starts with a learning rate 1.25 × 10−3 with a step decay of 0.1 at the 90th and the 120th epoch. • Waymo. We train with a batch size of 40 on single Nvidia A100 (40GB) GPU for 30 epochs because of the large size of the Waymo dataset. Training starts with a learning rate 1.25 × 10−3 with a step decay of 0.1 at the 18th and the 26th epoch. Losses. We use the GUP Net [159] multi-task losses before the NMS for training. The total loss L is given by L = Lheatmap + L2D,offset + L2D,size + L3D2D,offset + L3D,𝑎𝑛𝑔𝑙𝑒 + L3D,𝑙 + L3D,𝑤 + L3D,ℎ + L3D,𝑑𝑒 𝑝𝑡ℎ. The individual terms are given by Lheatmap = Focal(𝑐𝑙𝑎𝑠𝑠𝑏, 𝑐𝑙𝑎𝑠𝑠𝑔), L2D,offset = L1(𝛿𝑏 L2D,size = L1(𝑤𝑏 L3D2D,offset = L1(𝛿𝑏 2D, 𝛿𝑔 2D), 2D, 𝑤𝑔 2D) + L1(ℎ𝑏 3D2D, 𝛿𝑔 3D2D) 2D, ℎ𝑔 2D), L3D,𝑎𝑛𝑔𝑙𝑒 = CE(𝛼𝑏, 𝛼𝑔) L3D,𝑙 = L1(𝜇𝑏 L3D,𝑤 = L1(𝜇𝑏 𝑙3D) 𝑙3D, 𝛿𝑔 𝑤3D, 𝛿𝑔 𝑤3D) L3D,ℎ = L3D,𝑑𝑒 𝑝𝑡ℎ = √ 2 𝜎ℎ3D √ 2 𝜎𝑑 L1(𝜇𝑏 ℎ3D, 𝛿𝑔 ℎ3D) + ln(𝜎ℎ3D) L1(𝜇𝑏 𝑑, 𝜇𝑔 𝑑) + ln(𝜎𝑑), 125 (C.20) (C.21) (C.22) (C.23) (C.24) (C.25) (C.26) (C.27) (C.28) (C.29) where, 𝜇𝑏 𝑑 = 𝑓 + 𝜇𝑑,𝑝𝑟𝑒𝑑 𝜇𝑏 ℎ3D ℎ𝑏 2D (cid:118)(cid:117)(cid:116)(cid:32) 𝜎𝑑 = 𝑓 (cid:33) 2 + 𝜎2 𝑑,𝑝𝑟𝑒𝑑. 𝜎ℎ3D ℎ𝑏 2D (C.30) (C.31) The superscripts 𝑏 and 𝑔 denote the predicted box and ground truth box respectively. CE and Focal denote the Cross Entropy and Focal loss respectively. 
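The depth terms of Eqs. (C.30) and (C.31) amount to the small in-network ensemble sketched below; the function and argument names are illustrative rather than those of the released GUP Net code.

def ensemble_depth(mu_h3d, sigma_h3d, h2d, mu_d_pred, sigma_d_pred, focal):
    # Geometric depth from the predicted 3D height and the 2D box height, plus a
    # directly regressed depth offset (Eq. C.30).
    mu_d = focal * mu_h3d / h2d + mu_d_pred
    # Uncertainty of the combined Laplacian estimate (Eq. C.31).
    sigma_d = ((focal * sigma_h3d / h2d) ** 2 + sigma_d_pred ** 2) ** 0.5
    return mu_d, sigma_d

# mu_d and sigma_d then enter the uncertainty-weighted depth loss L_3D,depth above.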
The number of heatmaps depends on the number of output classes. 𝛿2D denotes the deviation of the 2D center from the center of the heatmap. 𝛿3D2D,offset denotes the deviation of the projected 3D center from the center of the heatmap. The orientation loss is the cross entropy loss between the binned observation angle of the prediction and the ground truth. The observation angle 𝛼 is split into 12 bins covering 30◦ range. 𝛿𝑙3D, 𝛿𝑤3D and 𝛿ℎ3D denote the deviation of the 3D length, width and height of the box from the class dependent mean size respectively. The depth is the hardest parameter to estimate [166]. So, GUP Net uses in-network ensembles to predict the depth. It obtains a Laplacian estimate of depth from the 2D height, while it obtains another estimate of depth from the prediction of depth. It then adds these two depth estimates. Inference. Our testing resolution is same as the training resolution. We do not use any augmentation for test/validation. We keep the maximum number of objects to 50 in an image, and we multiply the class and predicted confidence to get the box’s overall score in inference as in [109]. We consider output boxes with scores greater than a threshold of 0.2 for KITTI [159] and 0.1 for Waymo [199]. C.3 Additional Experiments and Results We now provide additional details and results of the experiments evaluating DEVIANT’s performance. C.3.1 KITTI Val Split Monocular Detection has Huge Generalization Gap. As mentioned in Sec. 3.1, we now show that the monocular detection has huge generalization gap between training and inference. We report the object detection performance on the train and validation (val) set for the two models on KITTI Val split in Tab. C.2. Tab. C.2 shows that the performance of our baseline GUP Net [159] and our 126 Table C.2 Generalization gap (− between training and inference sets. [Key: Best] (cid:17) ) on KITTI Val cars. Monocular detection has huge generalization gap Method Scale Eqv GUP Net [159] DEVIANT ✓ IoU3D ≥ 0.7 IoU3D ≥ 0.5 (cid:17)) (cid:17)) Set [%](− [%](− [%](− AP 3D|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard Train 91.83 74.87 67.43 95.19 80.95 73.55 99.50 93.62 86.22 99.56 93.88 86.46 Val 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 Gap 70.73 59.39 54.55 66.61 60.03 55.72 40.55 49.63 48.15 34.96 46.12 43.49 Train 91.09 76.19 67.16 94.76 82.61 75.51 99.37 93.56 88.57 99.50 93.87 88.90 Val 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 Gap 66.46 59.65 52.64 62.16 59.57 55.52 38.37 47.56 48.39 34.22 44.24 45.40 AP BEV|𝑅40 (cid:17)) Table C.3 Comparison on multiple backbones on KITTI Val cars. [Key: Best] IoU3D ≥ 0.7 IoU3D ≥ 0.5 (cid:17)) [%](− Method BackBone AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard ResNet-18 GUP Net [159] 18.86 13.20 11.01 26.05 19.37 16.57 54.90 40.65 34.98 60.54 46.13 40.12 20.27 14.21 12.56 28.09 20.32 17.49 55.75 42.41 36.97 60.82 46.43 40.59 DLA-34 GUP Net [159] 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 AP BEV|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 DEVIANT DEVIANT [%](− [%](− (cid:17)) (cid:17)) DEVIANT is huge on the training set, while it is less than one-fourth of the train performance on the val set. We also report the generalization gap (in pink) metric [270] in Tab. C.2, which is the difference between training and validation performance. 
The generalization gap at both the thresholds of 0.7 and 0.5 is huge. Comparison on Multiple Backbones. A common trend in 2D object detection community is to show improvements on multiple backbones [253]. DD3D [181] follows this trend and also reports their numbers on multiple backbones. Therefore, we follow the same and compare with our baseline on multiple backbones on KITTI Val cars in Tab. C.3. Tab. C.3 shows that DEVIANT shows consistent improvements over GUP Net [159] in 3D object detection on multiple backbones, proving the effectiveness of our proposal. Comparison with Bigger CNN Backbones. Since the SES blocks increase the Flop counts signif- icantly compared to the vanilla convolution block, we next compare DEVIANT with bigger CNN backbones with comparable GFLOPs and FPS/ wall-clock time (instead of same configuration) in Tab. C.4. We compare DEVIANT with DLA-102 and DLA-169 - two biggest DLA networks 127 Table C.4 Results with bigger CNNs having similar flops on KITTI Val cars. [Key: Best] (cid:17) ) Disk Size (− (cid:17) ) Flops (− Method BackBone GUP Net [159] DLA-34 GUP Net [159] DLA-102 GUP Net [159] DLA-169 DEVIANT SES-DLA-34 Param (− (M) 16 34 54 16 (MB) 235 583 814 236 (cid:17) ) Infer (− (G) 30 70 114 235 (cid:17) ) AP3D IoU3D≥ 0.7 (− (ms) Easy Mod Hard 20 25 30 40 (cid:17)) AP3D IoU3D≥ 0.5 (− Easy Mod Hard 21.10 15.48 12.88 58.95 43.99 38.07 20.96 14.64 12.80 57.06 41.78 37.26 21.76 15.35 12.72 57.60 43.27 37.32 24.63 16.54 14.52 61.00 46.00 40.18 (cid:17)) Table C.5 Results on KITTI Val cyclists and pedestrians (Cyc/Ped) (IoU3D ≥ 0.5). [Key: Best, Second Best] Method Extra GrooMeD-NMS [109] MonoDIS [222] MonoDIS-M [220] GUP Net (Retrained) [159] DEVIANT (Ours) − − − − − (cid:17)) Cyc AP 3D|𝑅40 [%](− Easy Mod Hard 0.00 0.00 0.00 0.71 1.52 0.73 1.30 2.70 1.50 2.03 4.41 2.17 2.14 4.05 2.20 (cid:17)) Ped AP 3D|𝑅40 [%](− Easy Mod Hard 2.61 3.79 2.71 1.71 3.20 2.28 5.70 9.50 7.10 5.73 9.37 6.84 5.42 9.85 7.18 with ImageNet weights1 on KITTI Val split. We use the fvcore library2 to get the parameters and flops. Tab. C.4 shows that DEVIANT again outperforms the bigger CNN backbones, especially on nearby objects. We believe this happens because the bigger CNN backbones have more trainable parameters than DEVIANT, which leads to overfitting. Although DEVIANT takes more time compared to the CNN backbones, DEVIANT still keeps the inference almost real-time. Performance on Cyclists and Pedestrians. Tab. C.5 lists out the results of 3D object detection on KITTI Val Cyclist and Pedestrians. The results show that DEVIANT is competitive on challenging Cyclist and achieves SoTA results on Pedestrians on the KITTI Val split. Cross-Dataset Evaluation Details. For cross-dataset evaluation, we test on all 3,769 images of the KITTI Val split, as well as all frontal 6,019 images of the nuScenes Val split [22], as in [218]. We first convert the nuScenes Val images to the KITTI format using the export_kitti3 function in the nuscenes devkit. We keep KITTI Val images in the [384, 1280] resolution, while we keep the nuScenes Val images in the [384, 672] resolution to preserve the aspect ratio. For M3D-RPN [15], we bring the nuScenes Val images in the [512, 910] resolution. Monocular 3D object detection relies on the camera focal length to back-project the projected 1Available at http://dl.yf.io/dla/models/imagenet/ 2https://github.com/facebookresearch/fvcore 3https://github.com/nutonomy/nuscenes-devkit/blob/master/python-sdk/nuscenes/scripts/export_kitti.py 128 centers into the 3D space. 
Therefore, the 3D centers depends on the focal length of the camera used in the dataset. Hence, one should take the camera focal length into account while doing cross-dataset evaluation. We now calculate the camera focal length of a dataset as follows. We take 2 𝑓𝑦 the camera matrix K and calculate the normalized focal length ¯𝑓 = 𝐻 , where 𝐻 denotes the height of the image. The normalized focal length ¯𝑓 for the KITTI dataset is 3.82, while the normalized focal length ¯𝑓 for the nuScenes dataset is 2.82. Thus, the KITTI and the nuScenes images have a different focal length [255]. M3D-RPN [15] does not normalize w.r.t. the focal length. So, we explicitly correct and divide the depth predictions of nuScenes images from the KITTI model by 3.82/2.82 = 1.361 in the M3D- RPN [15] codebase. The GUP Net [159] and DEVIANT codebases use normalized coordinates i.e. they normalize w.r.t. the focal length. So, we do not explicitly correct the focal length for GUP Net and DEVIANT predictions. We match predictions to the ground truths using the IoU2D overlap threshold of 0.7 [218]. After this matching, we calculate the Mean Average Error (MAE) of the depths of the predicted and the ground truth boxes [218]. Stress Test with Rotational and/or xy-translation Ego Movement. Corollary 1.1 uses translation along the depth as the sole ego movement. This assumption might be valid for the current outdoor datasets and benchmarks, but is not the case in the real world. Therefore, we conduct stress tests on how tolerable DEVIANT and GUP Net [159] are when there is rotational and/or 𝑥𝑦-translation movement on the vehicle. First, note that KITTI and Waymo are already large-scale real-world datasets, and our own dataset might not be a good choice. So, we stick with KITTI and Waymo datasets. We manually choose 306 KITTI Val images with such ego movements and again compare performance of DEVIANT and GUP Net on this subset in Tab. C.6. The average distance of the car in this subset is 27.69 m (±16.59 m), which suggests a good variance and unbiasedness in the subset. Tab. C.6 shows that both the DEVIANT backbone and the CNN backbone show a drop in the detection performance by about 4 AP points on the Mod cars of ego-rotated subset compared to the all set. 129 Table C.6 Stress Test with rotational and 𝑥𝑦-translation ego movement on KITTI Val cars. [Key: Best] Set Method AP3D IoU3D≥ 0.7 (− Easy Mod Hard 9.91 (cid:17)) AP3D IoU3D≥ 0.5 (− Easy Mod Hard 47.47 35.02 32.63 Subset (306) 20.17 12.49 10.93 49.81 36.93 34.32 KITTI Val GUP Net [159] 21.10 15.48 12.88 58.95 43.99 38.07 (3769) 24.63 16.54 14.52 61.00 46.00 40.18 GUP Net [159] 17.22 11.43 DEVIANT DEVIANT (cid:17)) Table C.7 Comparison of Depth Estimates of monocular depth estimators and 3D object detectors on KITTI Val cars. Depth from a depth estimator BTS is not good for foreground objects (cars) beyond 20+ m range. [Key: Best, Second Best] Depth Ground Back+ Foreground Method at Truth GUP Net [159] 3D Center 3D Box 3D Center 3D Box DEVIANT BTS [118] Pixel LiDAR 0.48 Foreground (Cars) 0−20 20−40 40−∞ 0−20 20−40 40−∞ 1.85 1.80 2.16 1.10 1.09 1.22 − − 1.30 − − 1.83 0.45 0.40 0.30 − − This drop experimentally confirms the theory that both the DEVIANT backbone and the CNN backbone do not handle arbitrary 3D rotations. More importantly, the table shows that DEVIANT maintains the performance improvement over GUP Net [159] under such movements. Also, Waymo has many images in which the ego camera shakes. Improvements on Waymo (Tab. 
3.12) also confirms that DEVIANT outperforms GUP Net [159] even when there is rotational or 𝑥𝑦-translation ego movement. Comparison of Depth Estimates from Monocular Depth Estimators and 3D Object Detectors. We next compare the depth estimates from monocular depth estimators and depth estimates from monocular 3D object detectors on the foreground objects. We take a monocular depth estimator BTS [118] model trained on KITTI Eigen split. We next compare the depth error for all and foreground objects (cars) on KITTI Val split using MAE (− (cid:17) ) metric in Tab. C.7 as in Tab. 3.6. We use the MSeg [114] to segment out cars in the driving scenes for BTS. Tab. C.7 shows that the depth from BTS is not good for foreground objects (cars) beyond 20+ m range. Note that there is a data leakage issue between the KITTI Eigen train split and the KITTI Val split [221] and therefore, we expect more degradation in performance of monocular depth estimators after fixing the data leakage issue. Equivariance Error for KITTI Monocular Videos. A better way to compare the scale equiv- 130 Figure C.7 Equivariance error (Δ) comparison for DEVIANT and GUP Net on previous three frames of the KITTI monocular videos at block 3 in the backbone. ariance of the DEVIANT and GUP Net [159] compared to Fig. 3.4, is to compare equivariance error on real images with depth translations of the ego camera. The equivariance error Δ is the normalized difference between the scaled feature map and the feature map of the scaled image, and is given by Δ = 1 𝑁 𝑁 ∑︁ 𝑖=1 ||T𝑠𝑖 Φ(ℎ𝑖) − Φ(T𝑠𝑖 ℎ𝑖)||2 2 ||T𝑠𝑖 Φ(ℎ𝑖)||2 2 , (C.32) where Φ denotes the neural network, T𝑠𝑖 is the scaling transformation for the image 𝑖, and 𝑁 is the total number of images. Although we do evaluate this error in Fig. 3.4, the image scaling in Fig. 3.4 does not involve scene change because of the absence of the moving objects. Therefore, evaluating on actual depth translations of the ego camera makes the equivariance error evaluation more realistic. We next carry out this experiment and report the equivariance error on three previous frames of the val images of the KITTI Val split as in [17]. We plot this equivariance error in Fig. C.7 at block 3 of the backbones because the resolution at this block corresponds to the output feature map of size [96, 320]. Fig. C.7 is similar to Fig. 3.4b, and shows that DEVIANT achieves lower equivariance error. Therefore, DEVIANT has better equivariance to depth translations (scale transformation s) than GUP Net [159] in real scenarios. Model Size, Training, and Inference Times. Both DEVIANT and the baseline GUP Net have the same number of trainable parameters, and therefore, the same model size. GUP Net takes 4 hours to train on KITTI Val and 0.02 ms per image for inference on a single Ampere A100 (40 GB) GPU. DEVIANT takes 8.5 hours for training and 0.04 ms per image for inference on the same GPU. This 131 Method GUP Net [159] DEVIANT Table C.8 Five Different Runs on KITTI Val cars. 
[Key: Average] IoU3D ≥ 0.7 IoU3D ≥ 0.5 (cid:17)) (cid:17)) (cid:17)) Run [%](− [%](− [%](− AP 3D|𝑅40 AP BEV|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 1 21.67 14.75 12.68 28.72 20.88 17.79 58.27 43.53 37.62 63.67 47.37 42.55 2 21.26 14.94 12.49 28.39 20.40 17.43 59.20 43.55 37.63 64.06 47.46 42.67 3 20.87 15.03 12.61 28.66 20.56 17.48 60.19 44.08 39.36 65.26 49.44 43.17 4 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 5 22.52 15.92 13.31 30.77 22.40 19.36 59.91 44.00 39.30 64.94 48.01 43.08 Avg 21.48 15.22 12.79 29.02 21.03 17.98 59.30 43.83 38.40 64.51 48.01 42.89 23.19 15.84 14.11 29.82 21.93 19.16 60.19 45.52 39.86 66.32 49.39 43.38 1 23.33 16.12 13.54 31.22 22.64 19.64 61.59 46.33 40.35 67.49 50.26 43.98 2 24.12 16.37 14.48 31.58 22.52 19.65 62.51 46.47 40.65 67.33 50.24 44.16 3 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 4 5 25.82 17.69 15.07 33.63 23.84 20.60 62.39 46.46 40.61 67.55 50.51 45.80 Avg 24.22 16.51 14.34 31.77 22.79 19.81 61.54 46.16 40.33 66.79 50.01 44.16 Table C.9 Experiments Comparison. Venue Multi-Dataset Cross-Dataset Multi-Backbone Method GrooMeD-NMS [109] CVPR21 CVPR21 MonoFlex [301] CVPR21 CaDDN [199] ICCV21 MonoRCNN [218] ICCV21 GUP Net [159] ICCV21 DD3D [181] NeurIPS21 PCT [246] ICLR22 MonoDistill [39] TPAMI20 MonoDIS-M [220] TPAMI21 MonoEF [307] - DEVIANT − − ✓ − − ✓ ✓ − ✓ ✓ ✓ − − − ✓ − − − − − − ✓ − − − − − ✓ ✓ − − − ✓ is expected because SE models use more flops [227, 309] and, therefore, DEVIANT takes roughly twice the training and inference time as GUP Net. Reproducibility. As described in Sec. 3.5.2, we now list out the five runs of our baseline GUP Net [159] and DEVIANT in Tab. C.8. Tab. C.8 shows that DEVIANT outperforms GUP Net in all runs and in the average run. Experiment Comparison. We now compare the experiments of different chapters in Tab. C.9. To the best of our knowledge, the experimentation in DEVIANT is more than the experimentation of most monocular 3D object detection chapters. 132 (cid:17) ). (a) Depth equivariance error (− (b) Error (− (cid:17) ) on objects. Figure C.8 (a) Depth (scale) equivariance error of vanilla GUP Net [159] and proposed DEVIANT. (See Sec. 3.5.2 for details) (b) Error on objects. The proposed backbone has less depth equivariance error than vanilla CNN backbone. C.3.2 Qualitative Results KITTI. We next show some more qualitative results of models trained on KITTI Val split in Fig. C.9. We depict the predictions of DEVIANT in image view on the left and the predictions of DEVIANT and GUP Net [159], and ground truth in BEV on the right. In general, DEVIANT predictions are more closer to the ground truth than GUP Net [159]. nuScenes Cross-Dataset Evaluation. We then show some qualitative results of KITTI Val model evaluated on nuScenes frontal in Fig. C.10. We again observe that DEVIANT predictions are more closer to the ground truth than GUP Net [159]. Also, considerably less number of boxes are detected in the cross-dataset evaluation i.e. on nuScenes. We believe this happens because of the domain shift. Waymo. We now show some qualitative results of models trained on Waymo Val split in Fig. C.11. We again observe that DEVIANT predictions are more closer to the ground truth than GUP Net [159]. C.3.3 Demo Videos of DEVIANT Detection Demo. We next put a short demo video of our DEVIANT model trained on KITTI Val split at https://www.youtube.com/watch?v=2D73ZBrU-PA. 
We run our trained model indepen- 133 dently on each frame of 2011_09_26_drive_0009 KITTI raw [66]. The video belongs to the City category of the KITTI raw video. None of the frames from the raw video appear in the training set of KITTI Val split [109]. We use the camera matrices available with the video but do not use any temporal information. Overlaid on each frame of the raw input videos, we plot the projected 3D boxes of the predictions and also plot these 3D boxes in the BEV. We set the frame rate of this demo at 10 fps as in KITTI. The attached demo video demonstrates very stable and impressive results because of the additional equivariance to depth translations in DEVIANT which is absent in vanilla CNNs. Also, notice that the orientation of the boxes are stable despite not using any temporal information. Equivariance Error Demo. We next show the depth equivariance (scale equivariance) error demo of one of the channels from the vanilla GUP Net and our proposed method at https://www. youtube.com/watch?v=70DIjQkuZvw. As before, we report at block 3 of the backbones which corresponds to output feature map of the size [96, 320]. The equivariance error demo indicates more white spaces which confirms that DEVIANT achieves lower equivariance error compared to the baseline GUP Net [159]. Thus, this demo agrees with Fig. C.8a. This happens because depth (scale) equivariance is additionally hard-baked into DEVIANT, while the vanilla GUP Net is not equivariant to depth translations (scale transformation s). 134 Figure C.9 KITTI Qualitative Results. DEVIANT predictions in general are more accurate than GUP Net [159]. [Key: Cars (pink), Cyclists (orange) and Pedestrians (violet) of DEVIANT; all classes of GUP Net (cyan), and Ground Truth (green) in BEV]. 135 Figure C.10 nuScenes Cross-Dataset Qualitative Results. DEVIANT predictions in general are more accurate than GUP Net [159]. [Key: Cars of DEVIANT (pink); Cars of GUP Net (cyan), and Ground Truth (green) in BEV]. 136 Figure C.11 Waymo Qualitative Results. DEVIANT predictions in general are more accurate than GUP Net [159]. [Key: Cars (pink), Cyclists (orange) and Pedestrians (violet) of DEVIANT; all classes of GUP Net (cyan), and Ground Truth (green) in BEV]. 137 APPENDIX D SEABIRD APPENDIX D.1 Additional Explanations and Proofs We now add some explanations and proofs which we could not put in the main chapter because of the space constraints. D.1.1 Proof of Converged Value We first bound the converged value from the optimal value. These results are well-known in the literature [113, 214]. We reproduce the result from using our notations for completeness. Lw∞−w∗ (cid:17) (cid:13) (cid:13) 2 2 E (cid:16)(cid:13) (cid:13) = E (cid:16)(cid:13) (cid:13) = E Lw∞−L 𝝁 + L 𝝁−w∗ (cid:18) (cid:16)Lw∞−L 𝝁 + L 𝝁−w∗ (cid:17) 2 2 (cid:13) (cid:13) (cid:17)𝑇 (cid:16)Lw∞−L 𝝁 + L 𝝁−w∗ (cid:17)(cid:19) = E((Lw∞−L 𝝁)𝑇 (Lw∞−L 𝝁)) + E((L 𝝁−w∗)𝑇 (L 𝝁−w∗)) + 2E((Lw∞−L 𝝁)𝑇 (L 𝝁−w∗)) = Var(Lw∞) + E((L 𝝁−w∗)𝑇 (L 𝝁−w∗)) (D.1) where L 𝝁 = E(Lw∞) is the mean of the layer weight and Var(w) denotes the variance of (cid:205) 𝑗 𝑤2 𝑗 . SGD. We begin the proof by writing the value of Lw𝑡 at every step. The model uses SGD, and so, the weight Lw𝑡 after 𝑡 gradient updates is Lw𝑡 = w0 − 𝑠1 Lg1 − 𝑠2 Lg2 − · · · − 𝑠𝑡 Lg𝑡, (D.2) where Lg𝑡 denotes the gradient of w at every step 𝑡. Assume the loss function under consideration L is L = 𝑓 (w𝑡h − 𝑧) = 𝑓 (𝜂). 
Then, we have, Lg𝑡 = = = 𝜕L 𝜕w𝑡 𝜕L (w𝑡h − 𝑧) 𝜕w𝑡 𝜕L (w𝑡h − 𝑧) 𝜕 (w𝑡h − 𝑧) 138 𝜕 (w𝑡h − 𝑧) 𝜕w𝑡 𝜕L (𝜂) 𝜕𝜂 h 𝜕L (𝜂) 𝜕𝜂 = = h =⇒ Lg𝑡 = h𝜖, (D.3) with 𝜖 = 𝜕L (𝜂) 𝜕𝜂 is the gradient of the loss function wrt noise. Expectation and Variance of Gradient Lg𝑡 Since the image h and noise 𝜂 are statistically independent, the image and the noise gradient 𝜂 are also statistically independent. So, the expected gradients E(Lg𝑡) = E(h)E(𝜖) = 0. (D.4) Note that if the loss function is an even function (symmetric about zero), its gradient 𝜖 is an odd function (anti-symmetric about 0), and so its mean E(𝜖) = 0. Next, we write the gradient variance Var(Lg𝑡) as Var(Lg𝑡) = Var(h𝜖) = E(h𝑇 h)E(𝜖 2) − E2(h)E2(𝜖) = E(h𝑇 h) (cid:2)Var(𝜖) + E2(𝜖)(cid:3) − E2(h)E2(𝜖) =⇒ Var(Lg𝑡) = E(h𝑇 h)Var(𝜖) as E(𝜖) = 0 (D.5) Expectation and Variance of Converged Weight Lw𝑡 We first calculate the expected converged weight as E(Lw𝑡) = E(w0) + (cid:169) (cid:173) (cid:171) 𝑠 𝑗 E (cid:16)Lg 𝑗 (cid:17) 𝑡 ∑︁ 𝑗=1 (cid:170) (cid:174) (cid:172) = 0 using Eq. (D.4) , using Eq. (D.2) =⇒ E(Lw∞) = lim 𝑡→∞ E(Lw𝑡) =⇒ E(Lw∞) = L 𝝁 = 0 139 (D.6) We finally calculate the variance of the converged weight. Because the SGD step size is independent of the gradient, we write using Eq. (D.2), Var(Lw𝑡) = Var(w0) + 𝑠2 1Var (g1) + 𝑠2 (cid:17) 2Var (g2) + · · · + 𝑠2 𝑡 Var (cid:16)Lg𝑡 Assuming the gradients Lg𝑡 are drawn from an identical distribution, we have Var(Lw𝑡) = Var(w0) + (cid:169) (cid:173) (cid:171) 𝑡 ∑︁ 𝑗=1 𝑠2 𝑗 (cid:170) (cid:174) (cid:172) Var (cid:16)Lg𝑡 (cid:17) =⇒ Var(Lw∞) = lim 𝑡→∞ Var(Lw𝑡) lim = Var(w0) + (cid:169) (cid:173) 𝑡→∞ (cid:171) Var (cid:16)Lg𝑡 (cid:17) 𝑡 ∑︁ 𝑗=1 𝑠2 𝑗 (cid:170) (cid:174) (cid:172) (cid:17) =⇒ Var(Lw∞) = Var(w0) + 𝑠Var (cid:16)Lg𝑡 (D.7) (D.8) An example of square summable step-sizes of SGD is 𝑠 𝑗 = 1 𝑗 = 𝜋2 𝑠2 6 . This assumption is also satisfied by modern neural networks since their training steps are always 𝑗 , and then the constant 𝑠 = (cid:205) 𝑗=1 finite. Substituting Eq. (D.5) in Eq. (D.8), we have Var(Lw∞) = Var(w0) + 𝑠E(h𝑇 h)Var(𝜖) (D.9) Substituting mean and variances from Eqs. (D.6) and (D.9) in Eq. (D.1), we have E (cid:16)(cid:13) (cid:13) Lw∞−w∗ (cid:17) (cid:13) (cid:13) 2 2 = Var(w0) + 𝑠E(h𝑇 h)Var(𝜖) + E(||w∗||2) = 𝑠E(h𝑇 h)Var(𝜖) + Var(w0) + E(||w∗||2) =⇒ E (cid:16)(cid:13) (cid:13) Lw∞−w∗ (cid:17) (cid:13) (cid:13) 2 2 = 𝑐1Var(𝜖) + 𝑐2, (D.10) where 𝜖 = 𝜕L (𝜂) 𝜕𝜂 is the gradient of the loss function wrt noise, and 𝑐1 = 𝑠E(h𝑇 h) and 𝑐2 are terms independent of the loss function L. 140 D.1.2 Comparison of Loss Functions Eq. (4.1) shows that different losses L lead to different Var(𝜖). Hence, comparing this term for different losses asseses the quality of losses. D.1.2.1 Gradient Variance of MAE Loss The result on MAE (L1) is well-known in the literature [113, 214]. We reproduce the result from [113, 214] using our notations for completeness. The L1 loss is L1(𝜂) = | ˆ𝑧 − 𝑧|1 = |Lw𝑡h − 𝑧|1 = |𝜂|1 =⇒ 𝜖 = 𝜕L1(𝜂) 𝜕𝜂 = sgn(𝜂) (D.11) Thus, 𝜖 = sgn(𝜂) is a Bernoulli random variable with 𝑝(𝜖) = 1/2 for 𝜖 = ±1. So, mean E(𝜖) = 0 and variance Var(𝜖) = 1. D.1.2.2 Gradient Variance of MSE Loss The result on MSE (L2) is well-known in the literature [113, 214]. We reproduce the result from [113, 214] using our notations for completeness. The L2 loss is L2(𝜂) = 0.5| ˆ𝑧 − 𝑧|2 = 0.5|𝜂|2 = 0.5𝜂2 =⇒ 𝜖 = 𝜕L2(𝜂) 𝜕𝜂 = 𝜂 (D.12) Thus, 𝜖 = 𝜂 is a normal random variable [214]. So, mean E(𝜖) = 0 and variance Var(𝜖) = Var(𝜂) = 𝜎2. D.1.2.3 Gradient Variance of Dice Loss. (Proof of Lemma 2) Proof. 
We first write the gradient of dice loss as a function of noise (𝜂) as follows: 𝜖 = 𝜕L𝑑𝑖𝑐𝑒 (𝜂) 𝜕𝜂 = sgn(𝜂) ℓ , |𝜂| ≤ ℓ 0 , |𝜂| ≥ ℓ    (D.13) 141 The gradient of the loss 𝜖 is an odd function and so, its mean E(𝜖) = 0. Next, we write its variance Var(𝜖) as Var(𝜖) = Var(𝜂) = = = = = 1 ℓ2 2 ℓ2 2 ℓ2 2 ℓ2 2 ℓ2 ℓ ∫ −ℓ ∫ ℓ √ √ 0 ℓ/𝜎 ∫ ℓ/𝜎 ∫ 0      −∞  (cid:20) Φ 𝜂2 2𝜎2 𝑑𝜂 𝑒− 𝜂2 2𝜎2 𝑑𝜂 𝑒− 1 2𝜋𝜎 1 2𝜋𝜎 𝜂2 2 𝑑𝜂 𝑒− 1 √ 2𝜋 1 √ 2𝜋 𝑒− 𝜂2 2 𝑑𝜂 − 1 2       (cid:19) (cid:18) ℓ 𝜎 − (cid:21) 1 2 where, Φ is the normal CDF We write the CDF Φ(𝑥) in terms of error function Erf as: Φ(𝑥) = 1 2 + 1 2 Erf (cid:19) (cid:18) 𝑥 √ 2 for 𝑥 ≥ 0. Next, we put 𝑥 = ℓ 𝜎 to get (cid:19) (cid:18) ℓ 𝜎 1 2 + 1 2 = Φ (cid:18) Erf Substituting above in Eq. (D.14), we obtain ℓ (cid:19) √ 2𝜎 Var(𝜖) = =⇒ Var(𝜖) = (cid:20) 1 2 Erf 2 ℓ2 1 ℓ2 (cid:18) ℓ (cid:19) Erf √ 2𝜎 (cid:21) 1 2 − 1 2 ℓ + (cid:18) (cid:19) √ 2𝜎 D.1.3 Proof of Dice Model Being BetterLemma 3 Proof. It remains sufficient to show that E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13)2 ≤ E (∥𝑟w∞ − w∗∥2) 142 (D.14) (D.15) (D.16) (D.17) =⇒ E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13) 2 2 ≤ E (cid:16) ∥𝑟w∞ − w∗∥2 2 (cid:17) (D.18) Using Lemma 1, the above comparison is a comparison between the gradient variance of the loss wrt noise Var(𝜖). Hence, we compute the gradient variance of the loss L, i.e., Var(𝜖) of regression and dice losses to derive this lemma. Case 1 𝜎 ≤ 1: Given Tab. 4.1, if 𝜎 ≤ 1, the minimum deviation in converged regression model comes from the L2 loss. The difference in the estimates of regression loss and the dice loss E (cid:16) ∥𝑟w∞ − w∗∥2 2 (cid:17) − E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:18) 2 (cid:13) (cid:13) 2 ℓ (cid:17) (cid:19) Erf √ 1 ℓ2 2𝜎 ∝ 𝜎2 − (D.19) Let 𝜎𝑚 be the solution of the equation 𝜎2 = 1 ℓ2 Erf (cid:18) √ ℓ (cid:19) 2𝜎 . Note that the above equation has unique solution 𝜎𝑚 since 𝜎2 is a strictly increasing function wrt 𝜎 for 𝜎 > 0, while 1 ℓ2 Erf (cid:18) √ ℓ (cid:19) 2𝜎 is a strictly decreasing function wrt 𝜎 for 𝜎 > 0. If the noise has 𝜎 ≥ 𝜎𝑚, the RHS of the above equation ≥ 0, which means dice loss converges better than the regression loss. Case 2 𝜎 ≥ 1: Given Tab. 4.1, if 𝜎 ≥ 1, the minimum deviation in converged regression model comes from the L1 loss. The difference in the regression and dice loss estimates: E (cid:16) ∥𝑟w∞ − w∗∥2 2 (cid:17) − E (cid:16)(cid:13) (cid:13) ∝ 1 − (cid:17) (cid:19) 2 (cid:13) 𝑑w∞ − w∗ (cid:13) 2 1 ℓ ℓ2 Erf √ (cid:18) 2𝜎 (D.20) If the noise has 𝜎 ≥ √ 2 ℓ Erf−1(ℓ2), the RHS of the above equation ≥ 0, which means dice loss is better than the regression loss. For objects such as cars and trailers which have length ℓ > 4𝑚, this is trivially satisfied. Combining both cases, dice loss outperforms the L1 and L2 losses if the noise deviation 𝜎 exceeds the critical threshold 𝜎𝑐, i.e. (cid:32) 𝜎 > 𝜎𝑐 = max 𝜎𝑚, √ 2 ℓ Erf−1(ℓ2) (cid:33) . (D.21) 143 D.1.4 Proof of Convergence Analysis Th. 2 Proof. Continuing from Lemma 3, the advantage of the trained weight obtained from dice loss over the trained weight obtained from regression losses further results in Var(𝑑w∞) ≤ Var(𝑟w∞) =⇒ E(|𝑑w∞h − 𝑧|) ≤ E(|𝑟w∞h − 𝑧|) =⇒ E(| 𝑑 ˆ𝑧 − 𝑧|) ≤ E(| 𝑟 ˆ𝑧 − 𝑧|) =⇒ E(𝑑IoU3D) ≥ E(𝑟IoU3D), (D.22) assuming depth is the only source of error. Because AP3D is an non-decreasing function of IoU3D, the inequality remains preserved. Hence, we have 𝑑AP3D ≥ 𝑟AP3D. □ Thus, the average precision from the dice model is better than the regression model, which means a better detector. 
D.1.5 Properties of Dice Loss. We next explore the properties of model in Lemma 3 trained with dice loss. From Lemma 1, we write E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13) 2 2 = 𝑐1Var(𝜖) + 𝑐2 Substituting the result of Lemma 2, we have E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) 2 (cid:13) (cid:13) 2 = 𝑐1 ℓ2 Erf (cid:18) √ ℓ (cid:19) 2𝜎 + 𝑐2 (D.23) chapter [9] says that for a normal random variable 𝑋 with mean 0 and variance 1 and for any 𝑥 > 0, we have √ 4 + 𝑥2 − 𝑥 2 =⇒ =⇒ 𝑥 + 𝑥 + 1 √ 4 + 𝑥2 1 √ 4 + 𝑥2 √︂ 1 2𝜋 √︂ 2 𝜋 √︂ 2 𝜋 𝑒− 𝑥2 2 ≤ 𝑃 (𝑋 > 𝑥) 𝑒− 𝑥2 2 ≤ 𝑃 (𝑋 > 𝑥) 𝑒− 𝑥2 2 ≤ 1 − 𝑃 (𝑋 ≤ 𝑥) 144 Table D.1 Assumption comparison of Convergence Analysis of Th. 2 vs Mono3D models. Regression Noise 𝜂 PDF Noise & Image Object Categories Object Size ℓ Error Loss L Optimizers Global Optima Th. 2 Linear Normal Independent 1 Ideal Depth Mono3D Models Non-linear Arbitrary Dependent Multiple Non-ideal All 7 parameters L1, L2, dice Smooth L1, L2, dice, CE SGD Unique SGD, Adam, AdamW Multiple =⇒ =⇒ =⇒ =⇒ 𝑥 + 𝑥 + 𝑥 + 𝑥 + 1 √ 4 + 𝑥2 1 √ 4 + 𝑥2 1 √ 4 + 𝑥2 1 √ 4 + 𝑥2 √︂ 2 𝜋 √︂ 2 𝜋 √︂ 2 𝜋 √︂ 2 𝜋 (cid:18) 𝑥 √ 2 𝑒− 𝑥2 2 ≤ 𝑒− 𝑥2 2 ≤ 1 2 1 2 1 2 𝑒− 𝑥2 2 ≤ 1− 1 2 − ∫ 𝑥 0 1 √ 𝑒− 𝑥2 2 ≤ ∫ 𝑥 − 1 √ 𝑒− 𝑋2 2 𝑑𝑋 2𝜋 𝑒− 𝑋2 2 𝑑𝑋 𝑒−𝑋 2 𝑑𝑋 0 ∫ 𝑥 √ 2 0 1 2 Erf 2𝜋 1 √ 𝜋 (cid:18) 𝑥 √ 2 (cid:19) − − =⇒ Erf (cid:19) ≤ 1 − 2 √ 4 + 𝑥2 𝑥 + √︂ 2 𝜋 𝑒− 𝑥2 2 Substituting 𝑥 = ℓ 𝜎 above, we have, Erf (cid:18) √ ℓ (cid:19) 2𝜎 ≤ 1 − √ ℓ + 2𝜎 4𝜎2 + ℓ2 √︂ 2 𝜋 𝑒− ℓ2 2𝜎2 (D.24) Case 1: Upper bound. The RHS of Eq. (D.24) is clearly less than 1 since the term in the RHS after subtraction is positive. Hence, Erf (cid:18) √ ℓ (cid:19) 2𝜎 ≤ 1 Substituting above in Eq. (D.23), we have E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) 2 (cid:13) (cid:13) 2 ≤ 𝑐1 ℓ2 + 𝑐2 (D.25) Clearly, the deviation of the trained model with the dice loss is inversely proportional to the object length ℓ. The deviation from the optimal is less for large objects. 145 Case 2: Infinite Noise variance 𝜎2 → ∞. Then, one of the terms in the RHS of Eq. (D.24) → 0 =⇒ 𝑒− ℓ2 2𝜎2 ≈ (cid:19) (cid:18) 1 − ℓ2 2𝜎2 . So, RHS of Eq. (D.24) 2𝜎 √ 4𝜎2 + ℓ2 ℓ + becomes → 1. Moreover, ℓ 𝜎 Erf =⇒ Erf (cid:19) (cid:19) (cid:18) (cid:18) √ √ ℓ 2𝜎 ℓ 2𝜎 ≈ 1 − (cid:32) ≈ 1 + (cid:18) √︂ 2 𝜋 √︂ 2 𝜋 1 − (cid:19) ℓ2 2𝜎2 (cid:33) √︂ 2 𝜋 ℓ2 2𝜎2 + (D.26) Substituting above in Eq. (D.23), we have E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13) 2 2 ≈ (cid:32) 1 + √︂ 2 𝜋 (cid:33) √︂ 2 𝜋 ℓ2 2𝜎2 + 𝑐1 ℓ2 + 𝑐2 (D.27) Thus, the deviation from the optimal weight is inversely proportional to the noise deviation 𝜎2. Hence, the deviation from the optimal weight decreases as 𝜎2 increases for the dice loss. This property provides noise-robustness to the model trained with the dice loss. D.1.6 Notes on Theoretical Result Assumption Comparisons. The theoretical result of Th. 2 relies upon several assumptions. We present a comparison between the assumptions made by Th. 2 and those underlying Mono3D models, in Tab. D.1. While our analysis depends on these assumptions, it is noteworthy that the results are apparent even in scenarios where the assumptions do not hold true. Another advantage of having a linear regression setup is that this setup has a unique global minima (because of its convexity). Nature of Noise 𝜂. Th. 2 assumes that the noise 𝜂 is a normal random variable N (0, 𝜎2). To verify this assumption, we take the two SoTA released models GUP Net [159] and DEVIANT [108] on the KITTI [67] Val cars. We next plot the depth error histogram of both these models in Fig. D.1. 
This figure confirms that the depth error is close to the Gaussian random variable. Thus, this assumption is quite realistic. Th. 2 Requires Assumptions? We agree that Th. 2 requires assumptions for the proof. However, our theory does have empirical support; most Mono3D works have no theory. So, our theoretical 146 Figure D.1 Depth error histogram of released GUP Net and DEVIANT [108] on the KITTI Val cars. The histogram shows that depth error is close to the Gaussian random variable. attempt for Mono3D is a step forward! We leave the analysis after relaxing some or all of these assumptions for future avenues. Does Th. 2 Hold in Inference? Yes, Th. 2 holds even in inference. Th. 2 relies on the converged weight Lw∞, which in turn depends on the training data distribution. Now, as long as the training and testing data distribution remains the same (a fundamental assumption in ML), Th. 2 holds also during inference. D.1.7 More Discussions SeaBird improves because it removes depth estimation and integrates BEV segmentation. We clarify to remove this confusion. First, SeaBird also estimates depth. SeaBird depth estimates are better because of good segmentation, a form of depth (thanks to dice loss). Second, predicted BEV segmentation needs processing with the 3D head to output depth; so it can not replace depth estimation. Third, integrating segmentation over all categories degrades Mono3D performance ( [132] and our Tab. 4.5 Sem. Category). Why evaluation on outdoor datasets? We experiment with outdoor datasets in this chapter because indoor datasets rarely have large objects (mean length > 6𝑚). 147 D.2 Implementation Details Datasets. Our experiments use the publicly available KITTI-360, KITTI-360 PanopticBEV and nuScenes datasets. KITTI-360 is available at https://www.cvlibs.net/datasets/kitti-360/download. php under CCA-NonCommercial-ShareAlike (CC BY-NC-SA) 3.0 License. KITTI-360 Panop- ticBEV is available at http://panoptic-bev.cs.uni-freiburg.de/ under Robot Learning License Agree- ment. nuScenes is available at https://www.nuscenes.org/nuscenes under CC BY-NC-SA 4.0 Inter- national Public License. Data Splits. We detail out the detection data split construction of the KITTI-360 dataset. • KITTI-360 Test split: This detection benchmark [136] contains 300 training and 42 testing windows. These windows contain 61,056 training and 9,935 testing images. The calibration exists for each frame in training, while it exists for every 10th frame in testing. Therefore, our split consists of 61,056 training images, while we run monocular detectors on 910 test images (ignoring uncalibrated images). • KITTI-360 Val split: The KITTI-360 detection Val split partitions the official train into 239 train and 61 validation windows [136]. The original Val split [136] contains 49,003 training and 14,600 validation images. However, this original Val split has the following three issues: – Data leakage (common images) exists in the training and validation windows. – Every KITTI-360 image does not have the corresponding BEV semantic segmentation GT in the KITTI-360 PanopticBEV [72] dataset, making it harder to compare Mono3D and BEV segmentation performance. – The KITTI-360 validation set has higher sampling rate compared to the testing set. To fix the data leakage issue, we remove the common images from training set and keep them only in the validation set. 
Then, we take the intersection of KITTI-360 and KITTI-360 PanopticBEV datasets to ensure that every image has corresponding BEV segmentation segmentation GT. After these two steps, the training and validation set contain 48,648 and 12,408 images with calibration and semantic maps. Next, we subsample the validation images by a factor of 10 as in the testing set. Hence, our KITTI-360 Val split contains 48,648 training images and 1,294 148 Figure D.2 Skewness in datasets. The ratio of large (yellow) objects to other objects is approximately 1 : 2 in KITTI-360 [136], while the skewness is about 1 : 21 in nuScenes [22]. validation images. Augmentation. We keep the same augmentation strategy as our baselines for the respective models. Pre-processing. We resize images to preserve their aspect ratio. • KITTI-360. We resize the [376, 1408] sized KITTI-360 images, and bring them to the [384, 1438] resolution. • nuScenes. We resize the [900, 1600] sized nuScenes images, and bring them to the [256, 704], [512, 1408] and [640, 1600] resolutions as our baselines [303, 311]. Libraries. I2M and PBEV experiments use PyTorch [184], while BEVerse and HoP use MMDe- tection3D [44]. Architecture. • I2M+SeaBird.I2M [211] uses ResNet-18 as the backbone with the standard Feature Pyramid Network (FPN) [138] and a transformer to predict depth distribution. FPN is a bottom-up feed-forward CNN that computes feature maps with a downscaling factor of 2, and a top-down network that brings them back to the high-resolution ones. There are total four feature maps 149 levels in this FPN. We use the Box Net with ResNet-18 [77] as the detection head. • PBEV+SeaBird.PBEV [72] uses EfficientDet [231] as the backbone. We use Box Net with ResNet-18 [77] as the detection head. • BEVerse+SeaBird. BEVerse [303] uses Swin transformers [152] as the backbones. We use the original heads without any configuration change. • HoP+SeaBird. HoP [311] uses ResNet-50, ResNet-101 [77] and V2-99 [181] as the backbones. Since HoP does not have the segmentation head, we use the one in BEVerse as the segmentation head. We initialize the CNNs and transformers from ImageNet weights except for V2-99, which is pre- trained on 15 million LiDAR data.. We output two and ten foreground categories for KITTI-360 and nuScenes datasets respectively. Training. We use the training protocol as our baselines for all our experiments. We choose the model saved in the last epoch as our final model for all our experiments. • I2M+SeaBird. Training uses the Adam optimizer [102], a batch size of 30, an exponential decay of 0.98 [211] and gradient clipping of 10 on single Nvidia A100 (80GB) GPU. We train the BEV Net in the first stage with a learning rate 1.0×10−4 for 50 epochs [211] . We then add the detector in the second stage and finetune with the first stage weight with a learning rate 0.5×10−4 for 40 epochs. Training on KITTI-360 Val takes a total of 100 hours. For Test models, we finetune I2M Val stage 1 model with train+val data for 40 epochs. • PBEV+SeaBird. Training uses the Adam optimizer [102] with Nesterov, a batch size of 2 per GPU on eight Nvidia RTX A6000 (48GB) GPU. We train the PBEV with the dice loss in the first stage with a learning rate 2.5×10−3 for 20 epochs. We then add the Box Net in the second stage and finetune with the first stage weight with a learning rate 2.5×10−3 for 20 epochs. PBEV decays the learning rate by 0.5 and 0.2 at 10 and 15 epoch respectively. Training on KITTI-360 Val takes a total of 80 hours. 
For Test models, we finetune PBEV Val stage 1 model with train+val data for 10 epochs on four GPUs. • BEVerse+SeaBird. Training uses the AdamW optimizer [156], a sample size of 4 per GPU, 150 the one-cycle policy [303] and gradient clipping of 35 on eight Nvidia RTX A6000 (48GB) GPU [303]. We train the segmentation head in the first stage with a learning rate 2.0×10−3 for 4 epochs. We then add the detector in the second stage and finetune with the first stage weight with a learning rate 2.0×10−3 for 20 epochs [303]. Training on nuScenes takes a total of 400 hours. • HoP+SeaBird. Training uses the AdamW optimizer [156], a sample size of 2 per GPU, and gradient clipping of 35 on eight Nvidia A100 (80GB) GPUs [311]. We train the segmentation head in the first stage with a learning rate 1.0×10−4 for 4 epochs. We then add the detector in the second stage and finetune with the first stage weight with a learning rate 1.0×10−4 for 24 epochs [303]. nuScenes training takes a total of 180 hours. For Test models, we finetune val model with train+val data for 4 more epochs. Losses. We train the BEV Net of SeaBird in Stage 1 with the dice loss. We train the final SeaBird pipeline in Stage 2 with the following loss: L = L𝑑𝑒𝑡 + 𝜆𝑠𝑒𝑔L𝑠𝑒𝑔, (D.28) with L𝑠𝑒𝑔 being the dice loss and 𝜆𝑠𝑒𝑔 being the weight of the dice loss in the baseline. We keep the 𝜆𝑠𝑒𝑔 = 5. If the segmentation loss is itself scaled such as PBEV uses the L𝑠𝑒𝑔 as 7, we use 𝜆𝑠𝑒𝑔 = 35 with detection. Inference. We report the performance of all KITTI-360 and nuScenes models by inferring on single GPU card. Our testing resolution is same as the training resolution. We do not use any augmentation for test/validation. We keep the maximum number of objects is 50 per image for KITTI-360 models. We use score threshold of 0.1 for KITTI-360 models and class dependent threshold for nuScenes models as in [303]. KITTI-360 evaluates on windows and not on images. So, we use a 3D center-based NMS [109] to convert image-based predictions to window-based predictions for SeaBird and all our KITTI-360 baselines. This NMS uses a threshold of 4m for all categories, and keeps the highest score 3D box if multiple 3D boxes exist inside a window. 151 Table D.2 Error analysis on KITTI-360 Val. ✓ ✓ (cid:17)) Oracle AP 3D 25 [%](− (cid:17)) AP 3D 50 [%](− 𝑥 𝑦 𝑧 𝑙 𝑤 ℎ 𝜃 AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP 8.71 43.19 25.95 35.76 52.22 43.99 9.78 41.63 25.70 36.07 50.63 43.35 9.57 46.08 27.82 34.65 53.03 43.84 9.90 42.32 27.11 39.66 53.08 46.37 19.90 47.37 33.63 41.84 52.53 47.19 9.49 45.67 27.58 33.43 51.53 42.48 ✓ ✓ ✓ ✓ ✓ ✓ 37.09 46.27 41.68 44.58 51.15 47.87 ✓ ✓ ✓ ✓ ✓ ✓ ✓ 37.02 47.03 42.02 44.46 51.50 47.98 ✓ ✓ ✓ ✓ ✓ ✓ ✓ Table D.3 Complexity analysis on KITTI-360 Val. Method GUP Net [159] DEVIANT [108] I2M [211] I2M+SeaBird PBEV [72] PBEV+SeaBird Mono3D Inf. Time (s) Param (M) Flops (G) ✓ ✓ ✕ ✓ ✕ ✓ 0.02 0.04 0.01 0.02 0.14 0.15 16 16 40 53 24 37 30 235 80 130 229 279 D.3 Additional Experiments and Results We now provide additional details and results of the experiments evaluating SeaBird’s perfor- mance. D.3.1 KITTI-360 Val Results Error Analysis. We next report the error analysis of the SeaBird in Tab. D.2 by replacing the predicted box data with the oracle box data as in [166]. We consider the GT box to be an oracle box for predicted box if the euclidean distance is less than 4𝑚. In case of multiple GT being matched to one box, we consider the oracle with the minimum distance. Tab. D.2 shows that depth is the biggest source of error for Mono3D task as also observed in [166]. 
Moreover, the oracle does not lead to perfect results since the KITTI-360 PanopticBEV GT BEV semantic is only upto 50𝑚, while the KITTI-360 evaluates all objects (including objects beyond 50𝑚). Computational Complexity Analysis. We next compare the complexity analysis of SeaBird pipeline in Tab. D.3. For the flops analysis, we use the fvcore library as in [108]. Naive baseline for Large Objects. We next compare SeaBird against a naive baseline for large objects detection, such as by fine-tuning GUP Net only on larger objects. Tab. D.4 shows that 152 Table D.4 KITTI-360 Val results with naive baseline finetuned for large objects. SeaBird pipelines comfortably outperform this naive baseline on large objects. [Key: Best, Second Best, †= Retrained] Method Venue GUP Net † [159] GUP Net (Large FT) † [159] I2M+SeaBird PBEV+SeaBird (cid:17)) AP 3D 50 [%](− BEV Seg IoU [%](− AP 3D 25 [%](− AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor 0.54 0.56 8.71 ICCV21 ICCV21 CVPR24 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 CVPR24 13.22 42.46 27.84 37.15 52.53 44.84 24.30 48.04 36.17 45.11 22.83 0.28 50.52 25.75 1.28 0.98 2.56 − − − − − − (cid:17)) − − (cid:17)) Table D.5 Impact of denoising BEV segmentation maps with MIRNet-v2 [294] on KITTI-360 Val with I2M+SeaBird. Denoising does not help. [Key: Best] Denoiser ✓ ✕ (cid:17)) AP 3D 50 [%](− BEV Seg IoU [%](− AP 3D 25 [%](− AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor 43.77 23.25 14.34 51.23 32.79 21.42 39.72 30.57 2.73 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 8.71 (cid:17)) (cid:17)) Table D.6 Segmentation loss weight 𝜆𝑠𝑒𝑔 sensitivity on KITTI-360 Val with I2M+SeaBird. 𝜆𝑠𝑒𝑔 = 5 works the best. [Key: Best] 𝜆𝑠𝑒𝑔 0 1 3 5 10 (cid:17)) (cid:17)) AP 3D 50 [%](− BEV Seg IoU [%](− AP 3D 25 [%](− AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor 4.86 3.54 45.09 24.98 26.33 52.31 39.32 41.71 24.39 32.92 7.07 42.91 23.78 40.58 32.18 43.45 25.36 34.47 52.54 43.51 23.40 40.15 31.78 7.26 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 8.71 43.41 25.55 34.22 50.97 42.60 22.15 39.83 30.99 7.69 7.07 52.9 0 (cid:17)) SeaBird pipelines comfortably outperform this baseline as well. Does denoising BEV images help? Another potential addition to the SeaBird framework is using a denoiser between segmentation and detection heads. We use the MIRNet-v2 [294] as our denoiser and train the BEV segmentation head, denoiser and detection head in an end-to-end manner. Tab. D.5 shows that denoising does not increase performance but the inference time. Hence, we do not use any denoiser for SeaBird. Sensitivity to Segmentation Weight. We next study the impact of segmentation weight on I2M+SeaBird in Tab. D.6 as in Sec. 4.4.2. Tab. D.6 shows that 𝜆𝑠𝑒𝑔 = 5 works the best for the Mono3D of large objects. Reproducibility. We ensure reproducibility of our results by repeating our experiments for 3 random seeds. We choose the final epoch as our checkpoint in all our experiments as [108]. Tab. D.7 shows the results with these seeds. SeaBird outperforms SeaBird without dice loss in the 153 Table D.7 Reproducibility results on KITTI-360 Val with I2M+SeaBird. SeaBird outperforms SeaBird without dice loss in the median and average cases. 
[Key: Best, Second Best] Dice Seed ✕ ✓ 111 444 222 Avg 111 444 222 Avg (cid:17)) (cid:17)) (cid:17)) AP 3D 50 [%](− BEV Seg IoU [%](− AP 3D 25 [%](− AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor 3.00 3.81 44.63 24.22 24.96 53.15 39.06 3.54 4.86 45.09 24.98 26.33 52.31 39.32 2.66 5.79 46.71 26.25 24.32 54.06 39.19 3.06 4.82 45.58 25.15 25.20 53.17 39.19 44.03 25.95 33.55 53.93 43.74 22.64 40.64 31.64 7.87 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 8.71 42.87 25.79 34.71 51.72 43.22 22.74 40.01 31.38 8.71 43.36 25.90 34.67 52.62 43.65 22.87 40.09 31.48 8.43 5.99 7.07 5.32 6.13 0 0 0 0 Table D.8 Dice vs regression on methods with depth estimation. Dice model again outperforms regression loss models, particularly for large objects. [Key: Best, Second Best] Resolution Method BBone Venue Loss AP𝐿𝑟𝑔 (− 256×704 HoP+SeaBird R50 ICCV23 − L1 L2 CVPR24 Dice − − 27.4 27.0 28.2 (cid:17)) AP𝐶𝑎𝑟 (− 57.2 57.1 58.6 (cid:17)) AP𝑆𝑚𝑙 (− 46.4 46.5 (cid:17)) mAP (− 39.9 39.7 Did Not Converge 41.1 47.8 (cid:17)) (cid:17)) NDS (− 50.9 50.7 51.5 median and average cases. The biggest improvement shows up on larger objects. D.3.2 nuScenes Results Extended Val Results. Besides showing improvements upon existing detectors in Tab. 4.7 on the nuScenes Val split, we compare with more recent SoTA detectors with large backbones in Tab. D.9. Dice vs regression on depth estimation methods. We report HoP +R50 config, which uses depth estimation and compare losses in Tab. D.8. Tab. D.8 shows that Dice model again outperforms regression loss models. SeaBird Compatible Approaches. SeaBird conditions the detection outputs on segmented BEV features and so, requires foreground BEV segmentation. So, all approaches which produce latent BEV map in Tabs. 4.6 and 4.7 are compatible with SeaBird. However, approaches which do not produce BEV features such as SparseBEV [142] are incompatible with SeaBird. D.3.3 Qualitative Results KITTI-360. We now show some qualitative results of models trained on KITTI-360 Val split in Fig. D.3. We depict the predictions of PBEV+SeaBird in image view on the left, the predictions of PBEV+SeaBird, the baseline MonoDETR [300], predicted and GT boxes in BEV in the mid- 154 Table D.9 nuScenes Val Detection results. SeaBird pipelines outperform the baselines, particularly for large objects. 
[Key: Best, Second Best, B= Base, S= Small, T= Tiny, = Released, ∗= Reimplementation, §= CBGS] Resolution Method (cid:17)) AP𝐶𝑎𝑟 (− (cid:17)) AP𝑆𝑚𝑙 (− (cid:17)) BBone R50 R50 R50 Swin-T R50 R50 R101 R101 R101 R101 R101 Swin-S Venue CAPE [273] CVPR23 PETRv2 [150] ICCV23 SOLOFusion§ [182] ICLR23 BEVerse-T [303] ArXiv BEVerse-T+SeaBird Swin-T CVPR24 ICCV23 HoP [311] CVPR24 HoP+SeaBird ICCV23 3DPPE [219] AAAI23 STS [261] ICCV23 P2D [101] AAAI23 BEVDepth [127] ArXiv BEVDet4D [87] ArXiv BEVerse-S [303] BEVerse-S+SeaBird Swin-S CVPR24 R101 ICCV23 HoP ∗ [311] R101 CVPR24 HoP+SeaBird V2-99 ArXiv BEVDet [88] ICCV23 R101 PETRv2 [150] V2-99 CVPR23 CAPE [273] ArXiv Swin-B BEVDet4D § [87] V2-99 HoP ∗ [311] ICCV23 V2-99 CVPR24 HoP+SeaBird ICCVW21 R101 FCOS3D [250] CoRL21 R101 PGD [251] CoRL21 R101 DETR3D [257] ECCV22 R101 PETR [149] R101 BEVFormer [132] ECCV22 V2-99 AAAI23 PolarFormer [95] AP𝐿𝑟𝑔 (− 18.5 − 26.5 18.5 19.5 27.4 28.2 − − − − − 20.9 24.6 31.4 32.9 29.6 − 31.2 − 36.5 40.3 − − 22.4 − 27.7 − 256×704 512×1408 640×1600 900×1600 53.2 − 57.3 53.4 54.2 57.2 58.6 − − − − − 56.2 58.7 63.7 65.0 61.7 − 63.2 − 69.1 71.7 − − 60.3 − 48.5 − 38.1 − 48.5 38.8 41.1 46.4 47.8 − − − − − 42.2 45.0 52.5 53.1 48.2 − 51.9 − 56.1 58.8 − − 41.1 − 34.5 − (cid:17)) mAP (− 31.8 34.9 40.6 32.1 33.8 39.9 41.1 39.1 43.1 43.3 41.8 42.1 35.2 38.2 45.2 46.2 42.1 42.1 44.7 42.6 49.6 52.7 34.4 36.9 34.9 37.0 41.5 50.0 (cid:17)) NDS (− 44.2 45.6 49.7 46.6 48.1 50.9 51.5 45.8 52.5 52.8 53.8 54.5 49.5 51.3 55.0 54.7 48.2 52.4 54.4 55.2 58.3 60.2 41.5 42.8 43.4 44.2 51.7 56.2 dle and BEV semantic segmentation predictions from PBEV+SeaBird on the right. In general, PBEV+SeaBird detects more larger objects (buildings) than GUP Net [159]. nuScenes. We now show some qualitative results of models trained on nuScenes Val split in Fig. D.4. As before, we depict the predictions of BEVerse-S+SeaBird in image view from six cameras on the left and BEV semantic segmentation predictions from SeaBird on the right. KITTI-360 Demo Video. We next put a short demo video of PBEV+SeaBird model trained on KITTI-360 Val split compared with MonoDETR at https://www.youtube.com/watch?v=SmuRbMbsnZA. We run our trained model independently on each frame of KITTI-360. None of the frames from the raw video appear in the training set of KITTI-360 Val split. We use the camera matrices available with the video but do not use any 155 temporal information. Overlaid on each frame of the raw input videos, we plot the projected 3D boxes of the predictions, predicted and GT boxes in BEV in the middle and BEV semantic segmentation predictions from PBEV+SeaBird. We set the frame rate of this demo at 5 fps similar to [108]. The demo video demonstrates impressive results on larger objects. 156 Figure D.3 KITTI-360 Qualitative Results. PBEV+SeaBird detects more large objects (buildings, in blue) than MonoDETR [300] in orange. We depict the predictions of PBEV+SeaBird in the image view on the left, the predictions of PBEV+SeaBird, the baseline MonoDETR [300], and ground truth in BEV in the middle, and BEV semantic segmentation predictions from PBEV+SeaBird on the right. [Key: Buildings (in blue) and Cars (in yellow) of PBEV+SeaBird; all classes (pink) of MonoDETR [300], and Ground Truth (in green) in BEV]. 157 Figure D.4 nuScenes Qualitative Results. The first row shows the front_left, front, and front_right cameras, [Key: Cars (blue), Vehicles while the second row shows the back_left, back, and back_right cameras. 
(green), Pedestrian (violet), Cones (yellow) and Barrier (gray) of BEVerse-S+SeaBird at 200×200 resolution in BEV ]. 158 APPENDIX E CHARM3R APPENDIX E.1 Additional Details and Proof We now add more details and proofs which we could not put in the main paper because of the space constraints. E.1.1 Proof of Ground Depth Lemma 1 We reproduce the proof from [284] with our notations for the sake of completeness of this work. Proof. We first rewrite the pinhole projection Eq. (5.1) as: 𝑋 𝑌 𝑍                        = R−1(K−1 𝑧 − T). (E.1) 𝑣  𝑢          1           We now represent the ray shooting from the camera optical center through each pixel as −→𝑟 (𝑢, 𝑣, 𝑧). Using the matrix 𝑨 = (𝑎𝑖 𝑗 ) = R−1K−1, and the vector 𝑩 = (𝑏𝑖) = −R−1T, we define the parametric ray as: 𝑋 = (𝑎11𝑢 + 𝑎12𝑣 + 𝑎13)𝑧 + 𝑏1 −→𝑟 (𝑢, 𝑣, 𝑧) : 𝑌 = (𝑎21𝑢 + 𝑎22𝑣 + 𝑎23)𝑧 + 𝑏2 (E.2) 𝑍 = (𝑎31𝑢 + 𝑎32𝑣 + 𝑎33)𝑧 + 𝑏3 Moreover, the ground at a distance ℎ can be described by a plane, which is determined by the point (0, 𝐻, 0) in the plane and the normal vector −→𝑛 = (0, 1, 0): −→𝑟 · −→𝑛 = 𝐻. (E.3) Then, the ground depth is the intersection point between this ray and the ground plane. Combining Eqs. (E.2) and (E.3), the ground depth 𝑧 of the pixel (𝑢, 𝑣) is: (𝑎21𝑢 + 𝑎22𝑣 + 𝑎23)𝑧 + 𝑏2 = 𝐻 =⇒ 𝑧 = 𝐻 − 𝑏2 𝑎21𝑢 + 𝑎22𝑣 + 𝑎23 . (E.4) □ 159 E.1.2 Proof of Lemma 5 We next derive Lemma 5 from Lemma 4 as follows. Proof. 𝑨 = (𝑎𝑖 𝑗 ) = R−1K−1 = 𝑰−1 = 𝑰 1 𝑓 0 0 1 𝑓 −𝑢0 𝑓 −𝑣0 𝑓 0 0 1                               0 𝑢0 𝑓 𝑣0 1 −1           𝑓 0 0 0 , with rotation matrix R is identity 𝑰 for forward cameras. So, 𝑎21 = 0, 𝑎22 = 1 𝑓 , 𝑎23 = −𝑣0 𝑓 𝑎21, 𝑎22, 𝑎23 in Eq. (5.2), we get Eq. (5.3). . Substituting □ E.1.3 Extension to Camera not parallel to Ground Following Sec. 3.3 of GEDepth [284], we use the camera pitch 𝛿, and generalize Eq. (5.2) to obtain ground depth as 𝑧 = = 𝐻 −𝑏2cos 𝛿−𝑏3sin 𝛿 [𝑎21𝑢+𝑎22𝑣 +𝑎23] cos 𝛿+ [𝑎31𝑢+𝑎32𝑣 +𝑎33] sin 𝛿 𝐻 − 𝑏2 cos𝛿 − 𝑏3sin 𝛿 𝑣−𝑣0 𝑓 cos 𝛿 + sin 𝛿 (E.5) . Note that if camera pitch 𝛿 = 0, this reduces to the usual form of Eq. (5.2) and Eq. (5.3) respectively. Also, Th. 1 has a more general form with the pitch value, and remains valid for majority of the pitch angle ranges. E.1.4 Extension to Not-flat Roads For non-flat roads, we assume that the road is made of multiple flat ‘pieces‘ of roads each with its own slope and we predict the slope of each pixel as in GEDepth [284]. To predict slope ˆ𝛿 of each pixel, we first define a set of 𝑁 discrete slopes: {𝜏𝑖, 𝑖 = 1, ..., 𝑁 }. We compute each pixel slope by linearly combining the discrete slopes with the predicted probability distribution 160 (a) GUP Net (b) CHARM3R Figure E.1 CARLA Val AP3D at different depths and IoU3D thresholds with GUP Net. CHARM3R shows biggest gains on IoU3D > 0.3 for [0, 30]𝑚 boxes. (a) AP 3D 70 [%] comparison. (b) AP 3D 50 [%] comparison. (c) MDE comparison. Figure E.2 CARLA Val Results with GUP Net detector after augmentation of [104]. Training a detector with both Δ𝐻 = −0.70𝑚 and Δ𝐻 = 0𝑚 images produces better results at Δ𝐻 = −0.70𝑚 and Δ𝐻 = 0𝑚, but fails at unseen height images Δ𝐻 = +0.76𝑚. CHARM3R outperforms all baselines, especially at unseen bigger height changes. All methods except Oracle are trained on car height and tested on all heights. { ˆ𝑝𝑖 ∈ [0, 1], (cid:205)𝑖 ˆ𝑝𝑖 = 1} over 𝑁 slopes ˆ𝛿 = (cid:205)𝑖 ˆ𝑝𝑖𝜏𝑖. 
We train the network to minimize the total loss: 𝐿total = 𝐿det +𝜆slope𝐿slope(𝛿, ˆ𝛿), where 𝐿det are the detection losses, and 𝐿slope is the slope classification loss. We next substitute the predicted slope in [?]. We do not run this experiment since planar ground is reasonable assumption for most driving scenarios within some distance. E.1.5 Unrealistic assumptions of Th. 4 We partially agree. These assumptions reflect the observations of [51]. Also, our theory has empirical support; most Mono3D works have no theory. So, our theoretical attempt is a step forward! 161 (a) AP 3D 70 [%] comparison. (b) AP 3D 50 [%] comparison. (c) MDE comparison. Figure E.3 CARLA Val Results with DEVIANT detector. CHARM3R outperforms all baselines, espe- cially at bigger height changes. All methods except Oracle are trained on car height and tested on all heights. Results of inference on height changes of −0.70, 0 and 0.76 meters are in Tab. 5.2. E.2 Additional Experiments We now provide additional details and results of the experiments evaluating CHARM3R’s performance. E.2.1 CARLA Val Results We first analyze the results on the synthetic CARLA dataset further. AP at different distances and thresholds. We next compare the AP3D of the baseline GUP Net and CHARM3R in Fig. E.1 at different distances in meters and IoU3D matching criteria of 0.1 − 0.7 as in [108]. Fig. E.1 shows that CHARM3R is effective over GUP Net at all depths and higher IoU3D thresholds. CHARM3R shows biggest gains on IoU3D > 0.3 for [0, 30]𝑚 boxes. Comparison with Augmentation-Methods. Sec. 5.1 of the paper says that the augmentation strategy falls short when the target height is OOD. We show this in Fig. E.2. Since authors of [104] do not release the NVS code, we use the ground truth images from height change Δ𝐻 = −0.70𝑚 in training. Fig. E.2 confirms that augmentation also improves the performance on Δ𝐻 = −0.70𝑚 and Δ𝐻 = 0𝑚, but again falls short on unseen ego heights Δ𝐻 = +0.76𝑚. On the other hand, CHARM3R (even though trained on Δ𝐻 = −0.70𝑚) outperforms such augmentation strategy at unseen ego heights Δ𝐻 = +0.76𝑚. This shows the complementary nature of CHARM3R over augmentation strategies. Reproducibility. We ensure reproducibility of our results by repeating our experiments for 3 random seeds. We choose the final epoch as our checkpoint in all our experiments as [108, 110]. Tab. E.1 shows the results with these seeds. CHARM3R outperforms the baseline GUP Net in both 162 Table E.1 Reproducibility Results. CHARM3R outperforms all other baselines on CARLA Val split, especially at bigger unseen ego heights in both median (Seed=444) and average cases. All except Oracle are trained on car height Δ𝐻 = 0𝑚 and tested on bot to truck height data. [Key: Best] 3D Detector GUP Net [159] + CHARM3R Oracle Seed − (cid:17) / Δ𝐻 (𝑚)− (cid:17) 111 444 222 Average 111 444 222 Average − (cid:17)) (cid:17)) AP 3D 70 [%] (− 0 55.98 53.82 52.94 54.25 58.16 55.68 53.57 55.80 53.82 −0.70 12.24 9.46 10.35 10.68 19.99 19.45 17.41 18.95 70.96 +0.76 −0.70 44.14 7.53 41.66 7.23 41.67 10.79 42.49 8.52 54.15 29.96 53.40 27.33 54.30 27.77 53.95 28.35 83.88 62.25 AP 3D 50 [%] (− 0 76.37 76.47 75.80 76.21 74.10 74.47 74.83 74.47 76.47 +0.76 −0.70 41.32 40.97 46.45 43.58 64.27 61.98 64.42 63.56 83.96 MDE (𝑚) [≈ 0] +0.76 0 +0.48 +0.00 −0.64 +0.53 +0.03 −0.63 +0.53 +0.01 −0.57 +0.51 +0.01 −0.61 +0.09 +0.00 −0.03 +0.07 +0.05 −0.02 +0.12 +0.01 −0.09 +0.09 +0.02 −0.05 +0.03 +0.03 +0.03 Table E.2 nuScenes to CODa Val Results. 
CHARM3R outperforms all baselines, especially at unseen height changes. [Key: Best, Second Best, Ped= Pedestrians] 3D Detector Method GUP Net [159] Source UniDrive [129] UniDrive++[129] CHARM3R Oracle (cid:17)) Car AP 3D 50 [%] (− nuScenes CODa 0.02 18.42 0.02 18.42 18.42 0.03 14.80 0.30 18.42 28.56 (cid:17)) Ped [%] (− CODa nuScenes 0.01 0.01 0.02 0.05 30.31 2.93 2.93 2.93 1.26 2.93 (a) CODa Car (b) CODa Pedestrian Figure E.4 CODa Val AP3D at different depths and IoU3D thresholds with GUP Net trained on nuScenes. CHARM3R shows biggest gains on IoU3D < 0.3 for [0, 30]𝑚 boxes. median and average cases. Results with DEVIANT. We next additionally plot the robustness of CHARM3R with other methods on the DEVIANT detector [108] in Fig. E.3 The figure confirms that CHARM3R works even with DEVIANT and produces SoTA robustness to unseen ego heights. 163 E.2.2 nuScenes − CODa Val Results To test our claims further in real-life, we use two real datasets: the nuScenes dataset [22] and (cid:17) the recently released CODa [295] datasets. nuScenes has ego camera at height 1.51𝑚 above the ground, while the CODa is a robotics dataset with ego camera at a height of 0.75𝑚 above the ground. This experiment uses the following data split: • nuScenes Val Split. This split [22] contains 28,130 training and 6,019 validation images from the front camera as [108]. • CODa Val Split. This split [295] contains 19,511 training and 4,176 validation images. We only use this split for testing. We train the GUP Net detector with 10 nuScenes classes and report the results with the KITTI metrics on both nuScenes val and CODa Val splits. Main Results. We report the main results in Tab. E.2 paper. The results of Tab. E.2 shows gains on both Cars and Pedestrians classes of CODa val dataset. The performance is very low, which we believe is because of the domain gap between nuScenes and CODa datasets. These results further confirm our observations that unlike 2D detection, generalization across unseen datasets remains a big problem in the Mono3D task. AP at different distances and thresholds. To further analyze the performance, we next plot the AP3D of the baseline GUP Net and CHARM3R in Fig. E.4 at different distances in meters and IoU3D matching criteria of 0.1 − 0.5 as in [108]. Fig. E.4 shows that CHARM3R is effective over GUP Net at all depths and lower IoU3D thresholds. CHARM3R shows biggest gains on IoU3D < 0.3 for [0, 30]𝑚 boxes. The gains are more on the Pedestrian class on CODa since CODa captures UT Austin campus scenes, and therefore, has more pedestrians compared to cars. nuScenes captures outdoor driving scenes in Boston and Singapore, and therefore, has more cars compared to pedestrians. We describe the statistics of these two datasets in Tab. E.3. E.2.3 Qualitative Results. CARLA. We now show some qualitative results of models trained on CARLA Val split from car height (Δ𝐻 = 0𝑚) and tested on truck height (Δ𝐻 = +0.76𝑚) in Fig. E.5. We depict the 164 Table E.3 Dataset statistics. nuScenes Val has more Cars compared to Pedestrians, while CODa Val has more Pedestrians than Cars. Val nuScenes CODa Ego Ht (𝑚) #Images Car (𝑘) Ped (𝑘) 1.51 0.75 6,019 4,176 18 4 7 86 predictions of CHARM3R in image view on the left, the predictions of CHARM3R, the baseline GUP Net [159], and GT boxes in BEV on the right. In general, CHARM3R detects objects more accurately than GUP Net [159], making CHARM3R more robust to camera height changes. 
The regression-based baseline GUP Net mostly underestimates the depth of 3D boxes with positive ego height changes, which qualitatively justifies the claims of Th. 4. CODa. We now show some qualitative results of models trained on CODa Val split in Fig. E.6. As before, we depict the predictions of CHARM3R in image view image view on the left, the predictions of CHARM3R, the baseline GUP Net [159], and GT boxes in BEV on the right. In general, CHARM3R detects objects more accurately than the baseline GUP Net [159], making CHARM3R more robust to camera height changes. Also, considerably less number of boxes are detected in the cross-dataset evaluation i.e. on CODa Val. We believe this happens because of the domain shift. 165 Figure E.5 CARLA Val Qualitative Results. CHARM3R detects objects more accurately than GUP Net [159], making CHARM3R more robust to camera height changes. The regression-based baseline GUP Net mostly underestimates the depth which qualitatively justifies the claims of Th. 4. All methods are trained on CARLA images at car height Δ𝐻 = 0𝑚 and evaluated on Δ𝐻 = +0.76𝑚. [Key: Cars (pink) of CHARM3R. ; Cars (cyan) of GUP Net, and Ground Truth (green) in BEV. 166 Figure E.6 CODa Val Qualitative Results. CHARM3R detects objects more accurately than GUP Net [159], making CHARM3R more robust to camera height changes. All methods are trained on nuScenes dataset and evaluated on CODa dataset. [Key: Cars (pink) and Pedestrian (violet) of CHARM3R. ; all classes (cyan) of GUP Net, and Ground Truth (green) in BEV. 167