GENERALIZING MONOCULAR 3D OBJECT DETECTION By Abhinav Kumar A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2025 ABSTRACT Monocular 3D object detection (Mono3D) is a fundamental computer vision task that estimates an object’s class, 3D position, dimensions, and orientation from a single image. Its applications, including autonomous driving, augmented reality, and robotics, critically rely on accurate 3D en- vironmental understanding. This thesis addresses the challenge of generalizing Mono3D models to diverse scenarios, including occlusions, datasets, object sizes, and camera parameters. To en- hance occlusion robustness, we propose a mathematically differentiable NMS (GrooMeD-NMS). To improve generalization to new datasets, we explore depth equivariant (DEVIANT) backbones. We address the issue of large object detection, demonstrating that it’s not solely a data imbalance or receptive field problem but also a noise sensitivity issue. To mitigate this, we introduce a segmentation-based approach in bird’s-eye view with dice loss (SeaBird). Finally, we mathemati- cally analyze the extrapolation of Mono3D models to unseen camera heights and improve Mono3D generalization in such out-of-distribution settings. Copyright by ABHINAV KUMAR 2025 Dedicated to my country India. iv ACKNOWLEDGEMENTS My long PhD journey is the result of all my advisors, mentors, collaborators, friends, and family. First and foremost, I express my deepest gratitude to my advisor, Prof. Xiaoming Liu. Prof. Liu took a bet on me at a point when I was lost in the dark - having an intense desire to do the PhD, but bereft of any support. Over the years, his taste, rigor, work-ethics and guidance has instilled in me an awareness of what it takes to do great research. The faith he had on my capabilities, even when I did not have on myself, is what I am grateful to him for. I also thank my PhD committee – Prof. Daniel Morris (MSU), Prof. Georgia Gkioxari (Caltech, FAIR), Prof. Vishnu Boddetti (MSU), and Prof. Yu Kong (MSU) for agreeing to serve in my committee and supporting this journey. I acknowledge Prof. Daniel Morris for a three-year long collaboration on the Radar-Camera project, and sharing all his insights in developing radar-camera 3D detectors. I thank Prof. Georgia Gkioxari, who was also my internship manager at FAIR, Meta AI. Her vision expanded my horizons by giving me a taste of moonshot industry grade research, and what it takes to do one. I thank Prof. Vishnu Bodetti and Prof. Yu Kong for asking thought-provoking questions in this journey. I deeply acknowledge my mentors: Dr. Tim Marks, Dr. Michael Jones, Dr. Anoop Cherian, Dr. Ye Wang, Dr. Toshi Koike-Akino and Prof. Cheng Feng at MERL. They took a bet on me as a first year PhD student when I didn’t have any significant publications. The work done there culminated into my first CVPR paper. The paper opened doors to MSU to continue my PhD. If not for that internship, my aspirations for a PhD would have come to a crashing end five years back. When I joined MSU, Dr. Garrick Brazil took me under his wings for a very daunting area of 3D computer vision, and was almost a second advisor to me at MSU and FAIR. It was due to his strong belief in me that I applied to FAIR internship, which at that time, I believed was beyond my capacity. I acknowledge Dr. Yuliang Guo, Dr. Xinyu Huang and Dr. Liu Ren from Bosch AI Research. 
We had a long collaboration that spanned one internship and two CVPR submissions. Their continued guidance and support enabled us to tackle hard open problems.
Dr. SriGanesh Madhvanath was my manager at Xerox Research, Bangalore. His mentorship gave me an early realization that as much as the calibre of a candidate matters, the environment and support system matter too. His generous endorsement opened doors for doing a PhD in the US. As I grow in my career, I hope to pay it forward. I also thank Vladimir Kozitsky for his outstanding leadership on the LPRv2 project at Xerox Research.
Throughout my PhD journey in the US, I have been told that my Maths and Linear Algebra skills are decent. My Master's advisor, Prof. Animesh Kumar at IIT Bombay, is the stalwart who should get due credit. I stand on his shoulders and am deeply grateful to him for a rigorous foundation in mathematical thinking. I also thank my professors at IIT Patna: Prof. Ayash Kanto Mukherjee, Prof. Kailash Ray, Prof. Lokman Hakim Choudhury, Prof. Nutan Tomar, Prof. Somnath Sarangi, Prof. Sumanta Gupta, and Prof. Yatendra Singh for their rigorous undergrad training. I thank my Physics and Maths +2 teacher Jitendra Bharadwaj for his inspirational and thought-provoking teaching, and my teachers at MKDAV Public School, Daltonganj: Antariksh Roy, Asha Mishra, Ashok Verma, Ganga Agarwal, Kunal Kumar, and Rita Sinha for laying a solid foundation for my higher studies.
Additionally, I thank Vincent Mattison and Brenda Hodge, the program coordinator and secretary in the CSE department at MSU, for helping me with admin issues every single time. This research would not have been possible without the funding from Ford Motor Company and Bosch AI Research. I gratefully acknowledge their financial support.
Prof. Liu's lab gave me an open culture, access to like-minded peers, and exposure to a setup for doing high-quality research. I am thankful to all the amazing people in the lab: Prof. Feng Liu, Dr. Amin Jourabloo, Dr. Xi Yin, Dr. Garrick Brazil, Dr. Yaojie Liu, Shengjie Zhu, Andrew Hou, Vishal Asnani, Masa Hu, Yunfei Long, Xiao Guo, Minchul Kim, Yiyang Su, Jei Zhu and Zhiyuan Ren for reviewing my ideas, critiquing my papers, and open discussions. I also thank the newer members of the group: Girish Ganeshan, Dinqiang Ye, Zhihao Zhong, Zhizhong Huang, Hoang Le and Ziang Gu for sharing this journey with me. I am pretty sure each one of you has done and will continue to do great work in the future.
Next, I thank my friends in East Lansing - Bharat Basti Shenoy, Ankit Gupta, Rahul Dey, Ankit Kumar, Vishal Asnani, Hitesh Gakhar, Sachit Gaudi, Avrajit Ghosh and Ritam Guha - who made East Lansing feel like a second home. I am grateful to my friends Koushik Chattopadhyay, Saurabh Kumar, Ashay Jain, Manas Pratim Haloi, Vidit Singh, and Priyanka Sinha for being my loudest supporters despite staying thousands of kilometers away. All of them have been friends for more than eight years, and three of them for more than fifteen years. These were the people with whom I discussed all my PhD quitting plans. I am also thankful to my parents and my sister, Ayushi Raj, for their love, patience, support and encouragement, and for keeping me sane during this demanding PhD journey.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
  1.1 Thesis Contributions
  1.2 Thesis Organization
CHAPTER 2 GROOMED-NMS: GROUPED MATHEMATICALLY DIFFERENTIABLE NMS FOR MONOCULAR 3D OBJECT DETECTION
  2.1 Introduction
  2.2 Related Works
  2.3 Background
  2.4 GrooMeD-NMS
  2.5 Experiments
  2.6 Conclusions
CHAPTER 3 DEVIANT: DEPTH EQUIVARIANT NETWORK FOR MONOCULAR 3D OBJECT DETECTION
  3.1 Introduction
  3.2 Related Works
  3.3 Background
  3.4 Depth Equivariant Backbone
  3.5 Experiments
  3.6 Conclusions
CHAPTER 4 SEABIRD: SEGMENTATION IN BIRD'S VIEW WITH DICE LOSS IMPROVES MONOCULAR 3D DETECTION OF LARGE OBJECTS
  4.1 Introduction
  4.2 Related Works
  4.3 SeaBird
  4.4 Experiments
  4.5 Conclusions
CHAPTER 5 CHARM3R: TOWARDS CAMERA HEIGHT AGNOSTIC MONOCULAR 3D OBJECT DETECTOR
  5.1 Introduction
  5.2 Related Works
  5.3 Notations and Preliminaries
  5.4 CHARM3R
  5.5 Experiments
  5.6 Conclusions
CHAPTER 6 CONCLUSIONS AND FUTURE RESEARCH
BIBLIOGRAPHY
APPENDIX A PUBLICATIONS
APPENDIX B GROOMED-NMS APPENDIX
APPENDIX C DEVIANT APPENDIX
APPENDIX D SEABIRD APPENDIX
APPENDIX E CHARM3R APPENDIX

CHAPTER 1
INTRODUCTION

Monocular 3D object detection (Mono3D) is a fundamental computer vision problem that estimates an object's 3D position, dimensions, and orientation in a scene from a single image and its camera matrix. Its applications, including autonomous driving [108, 132, 181], robotics [213], and augmented reality [2, 172, 183, 293], critically rely on accurate 3D environmental understanding.
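To make the geometry of the task concrete, the following minimal NumPy sketch (not part of the original text; the KITTI-like intrinsics and box values are illustrative assumptions) shows the forward projection a Mono3D detector must invert: given a 3D center, dimensions, and yaw, the camera matrix maps the box corners to pixels, and the detector must recover those 3D quantities from the pixels alone.

```python
import numpy as np

def project_to_image(points_3d, K):
    """Project Nx3 camera-frame points to pixel coordinates with intrinsics K (3x3)."""
    uvw = points_3d @ K.T                 # (N, 3): [u*z, v*z, z]
    return uvw[:, :2] / uvw[:, 2:3]       # perspective divide -> (N, 2) pixels

def box3d_corners(center, dims, yaw):
    """8 corners of a 3D box given center (x, y, z), dims (h, w, l), and yaw about the y-axis."""
    h, w, l = dims
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])       # KITTI-style: origin at the box bottom
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])      # rotation about the camera y-axis
    return (R @ np.vstack([x, y, z])).T + np.asarray(center)

# Example: a car-sized box roughly 20 m ahead of a KITTI-like camera (assumed values).
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
corners = box3d_corners(center=(1.0, 1.65, 20.0), dims=(1.5, 1.6, 3.9), yaw=0.1)
print(project_to_image(corners, K))      # 8 projected corners in pixels
```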
To address these applications’ demands, Mono3D networks must generalize across occlusions, diverse datasets [108], object sizes [110], camera intrinsics [14], extrinsics [94, 104], rotations [177], weather and geographical conditions [54] and be robust to adversarial examples [310]. Although Mono3D popularity stems from its high accessibility from consumer vehicles com- pared to LiDAR/Radar-based detectors [155, 215, 290] and computational efficiency compared to stereo-based detectors [34], Mono3D methods suffer from classical scale-depth ambiguity making their generalization harder. This is why there are fewer works along the lines of generalizing Mono3D. This thesis aims to generalize Mono3D to these varying conditions. Most Mono3D networks benefit from end-to-end learning idea. However, they train without including NMS in the training pipeline making the final box after NMS outside the training paradigm. While there were attempts to include NMS in the training pipeline for tasks such as 2D object detection, they have been less widely adopted due to a non-mathematical expression of the NMS. We present and integrate GrooMeD-NMS– a novel Grouped Mathematically Differentiable NMS for Mono3D, such that the network is trained end-to-end with a loss on the boxes after NMS. We first formulate NMS as a matrix operation and then group and mask the boxes in an unsupervised manner to obtain a simple closed-form expression of the NMS. GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, GrooMeD-NMS achieves state- of-the-art monocular 3D object detection results on the KITTI benchmark dataset performing comparably to monocular video-based methods, and outperforming them on the hard occluded examples. 1 Generalizing to datasets requires features which are dataset-independent. One common way is to obtain such features is incorporating inductive bias or symmetries in the network. One such symmetry is translating the ego camera along depth should result in deterministic transformations of the feature maps. Modern neural networks use building blocks such as convolutions that are equivariant to arbitrary 2D translations in the Euclidean manifold. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. Even then, all Mono3D networks use vanilla blocks to obtain the 3D coordinates, a task for which the vanilla blocks are not designed for. This paper takes the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. Since the depth is the hardest to estimate for monocular detection, this paper proposes Depth Equivariant Network (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates, and therefore, DEVIANT works better than vanilla networks in cross-dataset evaluation. DEVIANT also achieves state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Mono3D networks achieve remarkable performance on cars and smaller objects. However, their performance drops on larger objects, leading to fatal accidents. 
Large objects like trailers, buses and trucks are harder to detect [268] in Mono3D, sometimes resulting in fatal accidents [23, 60]. Some attribute these failures to training data scarcity [308] or the receptive field requirements [268] of large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to noise of larger objects. To bridge this gap, we comprehensively investigate regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird’s View) as the first step towards generalizing to large 2 objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection, with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI-360 leaderboard and improves existing detectors on the nuScenes leaderboard, particularly for large objects. With all these generalizations, the networks do not generalize well to changing extrinsics or viewpoints in testing. Finally, we aim to extend Mono3D’s capabilities to varying camera extrinsics, such as camera heights. 1.1 Thesis Contributions The thesis focuses on generalizing Mono3D across occlusions, datasets, object sizes, and camera extrinsics. The scale-depth ambiguity in Mono3D task requires elegant handling of the depth error. • This thesis introduces the mathematically differentiable Non-Maximal Suppression, which attempts Mono3D generalization to occluded and hard objects. Most detectors use a post- processing algorithm called Non-Maximal Suppression (NMS) only during inference. While there were attempts to include NMS in the training pipeline for tasks such as 2D object detection, they have been less widely adopted due to a non-mathematical expression of the NMS. In this chapter, we present and integrate GrooMeD-NMS – a novel Grouped Mathematically Differen- tiable NMS for monocular 3D object detection, such that the network is trained end-to-end with a loss on the boxes after NMS. We first formulate NMS as a matrix operation and then group and mask the boxes in an unsupervised manner to obtain a simple closed-form expression of the NMS. GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, GrooMeD-NMS achieves state-of-the-art monocular 3D object detection results on the KITTI dataset. • We next propose the depth equivariant backbone in the projective manifold which attempts generalization to unseen datasets. Modern neural networks use building blocks such as convo- lutions that are equivariant to arbitrary 2D translations in the Euclidean manifold. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. 3 Even then, all monocular 3D detectors use vanilla blocks to obtain the 3D coordinates, a task for which the vanilla blocks are not designed for. This chapter takes the first step towards convolu- tions equivariant to arbitrary 3D translations in the projective manifold. 
Since the depth is the hardest to estimate for monocular detection, this chapter proposes Depth Equivariant Network (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates, and therefore, DEVIANT achieves state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. • We then investigate large object detection, demonstrating that it is not solely a data imbalance or receptive field issue but also a noise sensitivity problem. To generalize Mono3D to large objects, it introduces a segmentation-based approach in bird’s eye view with dice loss (SeaBird). Monocular 3D detectors achieve remarkable performance on cars and smaller objects. However, their performance drops on larger objects, leading to fatal accidents. Some attribute the failures to training data scarcity or their receptive field requirements of large objects. In this chapter, we highlight this understudied problem of generalization to large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to noise of larger objects. To bridge this gap, we comprehensively investigate regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird’s View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection, with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI- 4 360 leaderboard and improves existing detectors on the nuScenes leaderboard, particularly for large objects. • Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plucker embeddings, image transformations or data augmentation. This chapter takes a step towards this understudied problem by investigating the impact of camera height variations on state-of- the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R significantly improves generalization to unseen camera heights, achieving SoTA performance on the CARLA dataset. 1.2 Thesis Organization We organize the remaining chapters of the dissertation as follows. 
Chapter 2 introduces the mathematically differentiable Non-Maximal Suppression, which attempts generalization to occluded and hard objects. Chapter 3 describes the depth equivariant backbone which attempts generalization to unseen datasets. Chapter 4 investigates large object detection, demonstrating that it is not solely a data imbalance or receptive field issue but also a noise sensitivity problem. To improve large object detection, it introduces a segmentation-based approach in bird’s eye view with dice loss (SeaBird). Chapter 5 attempts solving the generalization of Mono3D trained on single camera height to multiple camera heights. Chapter 6 introduces the future research for monocular 3D detection. 5 CHAPTER 2 GROOMED-NMS: GROUPED MATHEMATICALLY DIFFERENTIABLE NMS FOR MONOCULAR 3D OBJECT DETECTION Modern 3D object detectors have immensely benefited from the end-to-end learning idea. However, most of them use a post-processing algorithm called Non-Maximal Suppression (NMS) only during inference. While there were attempts to include NMS in the training pipeline for tasks such as 2D object detection, they have been less widely adopted due to a non-mathematical expression of the NMS. In this chapter, we present and integrate GrooMeD-NMS– a novel Grouped Mathematically Differentiable NMS for monocular 3D object detection, such that the network is trained end-to-end with a loss on the boxes after NMS. We first formulate NMS as a matrix operation and then group and mask the boxes in an unsupervised manner to obtain a simple closed-form expression of the NMS. GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, GrooMeD-NMS achieves state-of-the-art monocular 3D object detection results on the KITTI benchmark dataset performing comparably to monocular video-based methods. 2.1 Introduction 3D object detection is one of the fundamental problems in computer vision, where the task is to infer 3D information of the object. Its applications include augmented reality [2, 201], robotics [120,213], medical surgery [203], and, more recently path planning and scene understand- ing in autonomous driving [33,90,123,220]. Most of the 3D object detectors [33,90,121,123,220] are extensions of the 2D object detector Faster R-CNN [202], which relies on the end-to-end learn- ing idea to achieve State-of-the-Art (SoTA) object detection. Some of these methods have proposed changing architectures [123, 216, 220] or losses [15, 35]. Others have tried incorporating confi- dence [17, 216, 220] or temporal cues [17]. Almost all of them output a massive number of boxes for each object and, thus, rely on post- processing with a greedy [192] clustering algorithm called Non-Maximal Suppression (NMS) during inference to reduce the number of false positives and increase performance. However, 6 Training Inference Inference Training e n o b k c a B B Score 2𝐷 3𝐷 s OIoU Predictions r NMS e n o b k c a B B Score 2𝐷 3𝐷 s OIoU GrooMeD NMS Predictions r L𝑏𝑒 𝑓 𝑜𝑟𝑒 L𝑏𝑒 𝑓 𝑜𝑟𝑒 L𝑏𝑒 𝑓 𝑜𝑟𝑒 L𝑎 𝑓 𝑡𝑒𝑟 (a) Conventional NMS Pipeline (b) GrooMeD-NMS Pipeline s O I M P Sort 𝑝 lower (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) d > 𝑣 s (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) ⌊.⌉ G Group r Forward Backward (c) GrooMeD-NMS layer Figure 2.1 Overview of GrooMeD-NMS. (a) Conventional object detection has a mismatch between training and inference as it uses NMS only in inference. 
(b) To address this, we propose a novel GrooMeD-NMS layer, such that the network is trained end-to-end with NMS applied. s and r denote the score of boxes B before and after the NMS respectively. O denotes the matrix containing IoU2D overlaps of B. L𝑏𝑒 𝑓 𝑜𝑟 𝑒 denotes the losses before the NMS, while L𝑎 𝑓 𝑡𝑒𝑟 denotes the loss after the NMS. (c) GrooMeD-NMS layer calculates r in a differentiable manner giving gradients from L𝑎 𝑓 𝑡𝑒𝑟 when the best-localized box corresponding to an object is not selected after NMS. these works have largely overlooked NMS’s inclusion in training leading to an apparent mismatch between training and inference pipelines as the losses are applied on all boxes before NMS but not on final boxes after NMS (see Fig. 2.1a). We also find that 3D object detection suffers a greater mismatch between classification and 3D localization compared to that of 2D localization, as discussed further in Sec. B.3.2 of the supplementary and observed in [17, 90, 216]. Hence, our focus is 3D object detection. Earlier attempts to include NMS in the training pipeline [80, 81, 192] have been made for 2D object detection where the improvements are less visible. Recent efforts to improve the correlation in 3D object detection involve calculating [220, 222] or predicting [17, 216] the scores via likelihood estimation [111] or enforcing the correlation explicitly [90]. Although this improves the 3D detection performance, improvements are limited as their training pipeline is not end to end 7 in the absence of a differentiable NMS. To address the mismatch between training and inference pipelines as well as the mismatch between classification and 3D localization, we propose including the NMS in the training pipeline, which gives a useful gradient to the network so that it figures out which boxes are the best-localized in 3D and, therefore, should be ranked higher (see Fig. 2.1b). An ideal NMS for inclusion in the training pipeline should be not only differentiable but also parallelizable. Unfortunately, the inference-based classical NMS and Soft-NMS [12] are greedy, set-based and, therefore, not parallelizable [192]. To make the NMS parallelizable, we first formulate the classical NMS as matrix operation and then obtain a closed-form mathematical expression using elementary matrix operations such as matrix multiplication, matrix inversion, and clipping. We then replace the threshold pruning in the classical NMS with its softer version [12] to get useful gradients. These two changes make the NMS GPU-friendly, and the gradients are backpropagated. We next group and mask the boxes in an unsupervised manner, which removes the matrix inversion and simplifies our proposed differentiable NMS expression further. We call this NMS as Grouped Mathematically Differentiable NMS (GrooMeD-NMS). In summary, the main contributions of this work include: • This is the first work to propose and integrate a closed-form mathematically differentiable NMS for object detection, such that the network is trained end-to-end with a loss on the boxes after NMS. • We propose an unsupervised grouping and masking on the boxes to remove the matrix inversion in the closed-form NMS expression. • We achieve SoTA monocular 3D object detection performance on the KITTI dataset performing comparably to monocular video-based methods. 2.2 Related Works 3D Object Detection. Recent success in 2D object detection [69, 70, 139, 200, 202] has inspired people to infer 3D information from a single 2D (monocular) image. 
However, the monocular problem is ill-posed due to the inherent scale/depth ambiguity [232]. Hence, approaches use 8 additional sensors such as LiDAR [90, 215, 269], stereo [122, 254] or radar [178, 242]. Although LiDAR depth estimations are accurate, LiDAR data is sparse [85] and computationally expensive to process [232]. Moreover, LiDAR s are expensive and do not work well in severe weather [232]. Hence, there have been several works on monocular 3D object detection. Earlier approaches [31, 61,186,187] use hand-crafted features, while the recent ones are all based on deep learning. Some of these methods have proposed changing architectures [123, 143, 232] or losses [15, 35]. Others have tried incorporating confidence [17,143,216,220], augmentation [223], depth in convolution [15,52] or temporal cues [17]. Our work proposes to incorporate NMS in the training pipeline of monocular 3D object detection. Non-Maximal Suppression. NMS has been used to reduce false positives in edge detection [206], feature point detection [75, 157, 174], face detection [243], human detection [16, 18, 47] as well as SoTA 2D [69, 139, 200, 202] and 3D detection [5, 17, 33, 216, 220, 232]. Modifications to NMS in 2D detection [12, 49, 80, 81, 192], 2D pedestrian detection [116, 145, 209], 2D salient object detection [298] and 3D detection [216] can be classified into three categories – inference NMS [12, 216], optimization-based NMS [3, 49, 116, 209, 244, 298] and neural network based NMS [78, 80, 81, 145, 192]. The inference NMS [12] changes the way the boxes are pruned in the final set of predic- tions. [216] uses weighted averaging to update the 𝑧-coordinate after NMS. [209] solves quadratic unconstrained binary optimization while [3,116,224] and [298] use point processes and MAP based inference respectively. [49] and [244] formulate NMS as a structured prediction task for isolated and all object instances respectively. The neural network NMS use a multi-layer network and message- passing to approximate NMS [80, 81, 192] or to predict the NMS threshold adaptively [145]. [78] approximates the sub-gradients of the network without modelling NMS via a transitive relationship. Our work proposes a grouped closed-form mathematical approximation of the classical NMS and does not require multiple layers or message-passing. We detail these differences in Sec. 2.4.2. 9 2.3 Background 2.3.1 Notations r = {𝑟𝑖}𝑛 Let B = {𝑏𝑖}𝑛 𝑖=1 denote the set of boxes or proposals 𝑏𝑖 from an image. Let s = {𝑠𝑖}𝑛 𝑖=1 and 𝑖=1 denote their scores (before NMS) and rescores (updated scores after NMS) respectively such that 𝑟𝑖, 𝑠𝑖 ≥ 0 ∀ 𝑖. D denotes the subset of B after the NMS. Let O = [𝑜𝑖 𝑗 ] denote the 𝑛 × 𝑛 matrix with 𝑜𝑖 𝑗 denoting the 2D Intersection over Union (IoU2D) of 𝑏𝑖 and 𝑏 𝑗 . The pruning function 𝑝 decides how to rescore a set of boxes B based on IoU2D overlaps of its neighbors, sometimes suppressing boxes entirely. In other words, 𝑝(𝑜𝑖) = 1 denotes the box 𝑏𝑖 is suppressed while 𝑝(𝑜𝑖) = 0 denotes 𝑏𝑖 is kept in D. The NMS threshold 𝑁𝑡 is the threshold for which two boxes need in order for the non-maximum to be suppressed. The temperature 𝜏 controls the shape of the exponential and sigmoidal pruning functions 𝑝. 𝑣 thresholds the rescores in GrooMeD and Soft-NMS [13] to decide if the box remains valid after NMS. B is partitioned into different groups G = {G𝑘 }. BG𝑘 denotes the subset of B belonging to group 𝑘. Thus, BG𝑘 = {𝑏𝑖} ∀ 𝑏𝑖 ∈ G𝑘 and BG𝑘 ∩ BG𝑙 = 𝜙 ∀ 𝑘 ≠ 𝑙. 
G_k in the subscript of a variable denotes its subset corresponding to B_Gk. Thus, s_Gk and r_Gk denote the scores and the rescores of B_Gk respectively. α denotes the maximum group size. ∨ denotes the logical OR while ⌊x⌉ denotes clipping of x in the range [0, 1]. Formally,

\lfloor x \rceil = \begin{cases} 1, & x > 1 \\ x, & 0 \le x \le 1 \\ 0, & x < 0 \end{cases}    (2.1)

|s| denotes the number of elements in s. A ⌞ in the subscript denotes the lower triangular version of the matrix without the principal diagonal. ⊙ denotes the element-wise multiplication. I denotes the identity matrix.

2.3.2 Classical and Soft-NMS
NMS is one of the building blocks in object detection whose high-level goal is to iteratively suppress boxes which have too much IoU with a nearby high-scoring box. We first give an overview of the classical and Soft-NMS [12], which are greedy and used in inference. Classical NMS uses the idea that the score of a box having a high IoU2D overlap with any of the selected boxes should be suppressed to zero. That is, it uses a hard pruning p without any temperature τ. Soft-NMS makes this pruning soft via the temperature τ. Thus, classical and Soft-NMS only differ in the choice of p. We reproduce them in Alg. 1 using our notations.

Algorithm 1: Classical/Soft-NMS [12]
Input: s: scores, O: IoU2D matrix, Nt: NMS threshold, p: pruning function, τ: temperature
Output: d: box index after NMS, r: scores after NMS
 1 begin
 2   d ← {}
 3   t ← {1, · · · , |s|}                 ⊲ All box indices
 4   r ← s
 5   while t ≠ empty do
 6     ν ← argmax r[t]                    ⊲ Top scored box
 7     d ← d ∪ ν                          ⊲ Add to valid box index
 8     t ← t − ν                          ⊲ Remove from t
 9     for i ← 1 : |t| do
10       r_i ← (1 − p_τ(O[ν, i])) r_i     ⊲ Rescore
11     end
12   end
13 end

2.4 GrooMeD-NMS
Classical NMS (Alg. 1) uses argmax and greedily calculates the rescore r_i of boxes B and is thus not parallelizable or differentiable [192]. We wish to find its smooth approximation in closed-form for including in the training pipeline.

2.4.1 Formulation
2.4.1.1 Sorting
Classical NMS uses the non-differentiable hard argmax operation (Line 6 of Alg. 1). We remove the argmax by hard sorting the scores s and O in decreasing order (lines 2-3 of Alg. 2). We also try making the sorting soft. Note that we require the permutation of s to sort O. Most soft sorting methods [8, 10, 185, 190] apply the soft permutation to the same vector. Only two other methods [46, 191] can apply the soft permutation to another vector. Both methods use O(n²) computations for soft sorting [10]. We implement [191] and find that [191] is overly dependent on the temperature τ to break out the ranks, and its gradients are too unreliable to train our model. Hence, we stick with the hard sorting of s and O.

Algorithm 2: GrooMeD-NMS
Input: s: scores, O: IoU2D matrix, Nt: NMS threshold, p: pruning function, v: valid box threshold, α: maximum group size
Output: d: box index after NMS, r: scores after NMS
 1 begin
 2   s, index ← sort(s, descending=True)        ⊲ Sort s
 3   O ← O[index][:, index]                     ⊲ Sort O
 4   O⌞ ← lower(O)                              ⊲ Lower triangular matrix
 5   P ← p(O⌞)                                  ⊲ Prune matrix
 6   I ← Identity(|s|)                          ⊲ Identity matrix
 7   G ← group(O, Nt, α)                        ⊲ Group boxes B
 8   for k ← 1 : |G| do
 9     M_Gk ← zeros(|G_k|, |G_k|)               ⊲ Prepare mask
10     M_Gk[:, G_k[1]] ← 1                      ⊲ First col of M_Gk
11     r_Gk ← ⌊(I_Gk − M_Gk ⊙ P_Gk) s_Gk⌉       ⊲ Rescore
12   end
13   d ← index[r >= v]                          ⊲ Valid box index
14 end

2.4.1.2 NMS as a Matrix Operation
The rescoring process of the classical NMS is greedy set-based [192] and only considers overlaps with unsuppressed boxes.
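For reference, here is a minimal Python sketch of that greedy rescoring in the spirit of Alg. 1 (an illustrative re-implementation, not the thesis code); the valid_thresh argument plays the role of the validity threshold v used with Soft-NMS and GrooMeD-NMS.

```python
import numpy as np

def greedy_nms(scores, iou, p, valid_thresh=0.3):
    """Greedy rescoring in the spirit of Alg. 1 (illustrative sketch).

    scores: (n,) initial scores s; iou: (n, n) IoU2D matrix O;
    p: pruning function mapping an overlap to a suppression weight in [0, 1].
    """
    r = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(r)))
    picked = []
    while remaining:
        top = max(remaining, key=lambda i: r[i])   # the argmax of Line 6 in Alg. 1
        picked.append(top)
        remaining.remove(top)
        for i in remaining:                        # rescore by overlap with the selected box
            r[i] *= 1.0 - p(iou[top, i])
    # keep boxes whose rescore survives; valid_thresh mirrors the threshold v
    return [i for i in picked if r[i] >= valid_thresh], r

hard_prune = lambda o, Nt=0.4: float(o > Nt)       # classical NMS; Soft-NMS uses, e.g., p(o) = o
scores = np.array([0.9, 0.8, 0.3])
iou = np.array([[1.0, 0.6, 0.1],
                [0.6, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(greedy_nms(scores, iou, hard_prune))         # box 1 is suppressed by box 0
```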
We first generalize this rescoring by accounting for the effect of all (suppressed and unsuppressed) boxes as

s_i - r_i \approx \max\Big( \sum_{j=1}^{i-1} p(o_{ij})\, r_j,\; 0 \Big)    (2.2)

using the relaxation of the logical OR operator ∨ as Σ [106, 124]. See Sec. B.1 of the supplementary material for an alternate explanation of Eq. (2.2). The presence of r_j on the RHS of Eq. (2.2) prevents suppressed boxes from influencing other boxes hugely. When p outputs discretely as {0, 1} as in classical NMS, scores s_i are guaranteed to be suppressed to r_i = 0 or left unchanged r_i = s_i, thereby implying r_i ≤ s_i ∀ i. We write the rescores r in a matrix formulation as

\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_n \end{bmatrix} \approx \max\left( \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ \vdots \\ s_n \end{bmatrix} - \begin{bmatrix} 0 & 0 & \cdots & 0 \\ p(o_{21}) & 0 & \cdots & 0 \\ p(o_{31}) & p(o_{32}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ p(o_{n1}) & p(o_{n2}) & \cdots & 0 \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_n \end{bmatrix},\; \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \right).    (2.3)

The above equation is written compactly as

\mathbf{r} \approx \max(\mathbf{s} - \mathbf{P}\mathbf{r},\, \mathbf{0}),    (2.4)

where P, called the Prune Matrix, is obtained when the pruning function p operates element-wise on O⌞. The maximum operation makes Eq. (2.4) non-linear [112] and, thus, difficult to solve. However, to avoid recursion, we use

\mathbf{r} \approx \big\lfloor (\mathbf{I} + \mathbf{P})^{-1} \mathbf{s} \big\rceil    (2.5)

as the solution to Eq. (2.4), with I being the identity matrix. Intuitively, if the matrix inversion is considered division in Eq. (2.5) and the boxes have overlaps, the rescores are the scores divided by a number greater than one and are, therefore, lesser than the scores. If the boxes do not overlap, the division is by one and the rescores equal the scores. Note that I + P in Eq. (2.5) is a lower triangular matrix with ones on the principal diagonal. Hence, I + P is always full rank and, therefore, always invertible.

2.4.1.3 Grouping
We next observe that the object detectors output multiple boxes for an object, and a good detector outputs boxes wherever it finds objects in the monocular image. Thus, we cluster the boxes in an image in an unsupervised manner based on IoU2D overlaps to obtain the groups G. Grouping thus mimics the grouping of the classical NMS, but does not rescore the boxes. As clustering limits interactions to intra-group interactions among the boxes, we write Eq. (2.5) as

\mathbf{r}_{\mathcal{G}_k} \approx \big\lfloor (\mathbf{I}_{\mathcal{G}_k} + \mathbf{P}_{\mathcal{G}_k})^{-1} \mathbf{s}_{\mathcal{G}_k} \big\rceil.    (2.6)

This results in taking smaller matrix inverses in Eq. (2.6) than in Eq. (2.5).
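A small NumPy sketch of the ungrouped matrix rescoring of Eq. (2.5) follows (illustrative, assuming the scores are already sorted; not the official implementation):

```python
import numpy as np

def matrix_rescore(scores, iou, p):
    """Ungrouped matrix rescoring of Eq. (2.5): r ≈ clip((I + P)^(-1) s) to [0, 1].

    Assumes the scores are sorted in decreasing order, so the prune matrix P
    comes from the strictly lower-triangular part of the IoU2D matrix O.
    """
    P = p(np.tril(iou, k=-1))                      # overlaps with higher-ranked boxes only
    r = np.linalg.inv(np.eye(len(scores)) + P) @ scores
    return np.clip(r, 0.0, 1.0)                    # the clipping ⌊·⌉ of Eq. (2.1)

linear = lambda o: o                               # linear pruning p(o) = o
scores = np.array([0.9, 0.8, 0.3])                 # sorted
iou = np.array([[1.0, 0.6, 0.1],
                [0.6, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(matrix_rescore(scores, iou, linear))         # -> approx [0.90, 0.26, 0.18]
```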
We use a simplistic grouping algorithm, i.e., we form a group G_k with boxes having high IoU2D overlap with the top-ranked box, given that we sorted the scores. As the group size is limited by α, we choose the minimum of α and the number of boxes in G_k. We next delete all the boxes of this group and iterate until we run out of boxes. Also, grouping uses IoU2D since we can achieve meaningful clustering in 2D. We detail this unsupervised grouping in Alg. 3.

Algorithm 3: Grouping of boxes
Input: O: sorted IoU2D matrix, Nt: NMS threshold, α: maximum group size
Output: G: Groups
 1 begin
 2   G ← {}
 3   t ← {1, · · · , O.shape[1]}          ⊲ All box indices
 4   while t ≠ empty do
 5     u ← O[:, 1] > Nt                   ⊲ High overlap indices
 6     v ← t[u]                           ⊲ New group
 7     n_Gk ← min(|v|, α)
 8     G.insert(v[: n_Gk])                ⊲ Insert new group
 9     w ← O[:, 1] ≤ Nt                   ⊲ Low overlap indices
10     t ← t[w]                           ⊲ Keep w indices in t
11     O ← O[w][:, w]                     ⊲ Keep w indices in O
12   end
13 end

2.4.1.4 Masking
Classical NMS considers the IoU2D of the top-scored box with other boxes. This consideration is equivalent to only keeping the column of O corresponding to the top box while assigning the rest of the columns to be zero. We implement this through masking of P_Gk. Let M_Gk denote the binary mask corresponding to group G_k. Then, the entries of the binary matrix M_Gk in the column corresponding to the top-scored box are 1 and the rest are 0. Hence, only one of the columns in M_Gk ⊙ P_Gk is non-zero. Now, I_Gk + M_Gk ⊙ P_Gk is a Frobenius matrix (Gaussian transformation), and we, therefore, invert this matrix by simply subtracting the second term [71]. In other words, (I_Gk + M_Gk ⊙ P_Gk)^{-1} = I_Gk − M_Gk ⊙ P_Gk. Hence, we simplify Eq. (2.6) further to get

\mathbf{r}_{\mathcal{G}_k} \approx \big\lfloor (\mathbf{I}_{\mathcal{G}_k} - \mathbf{M}_{\mathcal{G}_k} \odot \mathbf{P}_{\mathcal{G}_k})\, \mathbf{s}_{\mathcal{G}_k} \big\rceil.    (2.7)

Thus, masking allows us to bypass the computationally expensive matrix inverse operation altogether. We call the NMS based on Eq. (2.7) the Grouped Mathematically Differentiable Non-Maximal Suppression or GrooMeD-NMS. We summarize the complete GrooMeD-NMS in Alg. 2 and show its block diagram in Fig. 2.1c. GrooMeD-NMS in Fig. 2.1c provides two gradients - one through s and the other through O.

2.4.1.5 Pruning Function
As explained in Sec. 2.3.1, the pruning function p decides whether to keep a box in the final set of predictions D or not based on IoU2D overlaps, i.e., p(o_i) = 1 denotes that the box b_i is suppressed, while p(o_i) = 0 denotes that b_i is kept in D. Classical NMS uses the threshold as the pruning function, which does not give useful gradients. Therefore, we considered three different functions for p: Linear, a temperature (τ)-controlled Exponential, and a Sigmoidal function.
• Linear: the linear pruning function [12] is p(o) = o.
• Exponential: the exponential pruning function [12] is p(o) = 1 − exp(−o²/τ).
• Sigmoidal: the sigmoidal pruning function is p(o) = σ((o − Nt)/τ), with σ denoting the standard sigmoid. The sigmoidal function appears as the binary cross-entropy relaxation of the subset selection problem [185].
We show these pruning functions in Fig. 2.2. The ablation studies (Sec. 2.5.4) show that choosing p as Linear yields the simplest and the best GrooMeD-NMS.

Figure 2.2 Pruning functions p of the classical and GrooMeD-NMS. We use the Linear and Exponential pruning of the Soft-NMS [12] while training with the GrooMeD-NMS.
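Putting the pieces together, the sketch below (an illustrative NumPy version, not the released code) combines the grouping of Alg. 3, the masking of Eq. (2.7), and a choice of pruning function. Because every operation is a matrix product, an element-wise multiplication, and a clip, the same computation is differentiable when written in an autograd framework.

```python
import numpy as np

def group_boxes(iou, Nt=0.4, alpha=100):
    """Unsupervised grouping in the spirit of Alg. 3: cluster sorted boxes by IoU2D with the top box."""
    idx = np.arange(iou.shape[0])
    groups = []
    while idx.size:
        sub = iou[np.ix_(idx, idx)]
        high = sub[0] > Nt                    # overlap with the current top-ranked box
        high[0] = True                        # the top box starts its own group
        groups.append(idx[high][:alpha])      # group size capped at alpha
        idx = idx[~high]                      # delete the group and iterate
    return groups

def groomed_rescore(scores, iou, p=lambda o: o, Nt=0.4, alpha=100):
    """Grouped, masked rescoring of Eq. (2.7): r_Gk ≈ clip((I_Gk − M_Gk ⊙ P_Gk) s_Gk).

    Scores must be sorted in decreasing order; p is the (soft) pruning function.
    """
    r = np.asarray(scores, dtype=float).copy()
    for g in group_boxes(iou, Nt, alpha):
        P = p(np.tril(iou[np.ix_(g, g)], k=-1))    # prune matrix of the group
        M = np.zeros_like(P)
        M[:, 0] = 1.0                              # mask keeps only the top box's column
        r[g] = np.clip((np.eye(len(g)) - M * P) @ r[g], 0.0, 1.0)
    return r

# Linear pruning is used above; an exponential alternative would be
# p(o) = 1 - np.exp(-o**2 / tau), cf. Sec. 2.4.1.5.
scores = np.array([0.9, 0.8, 0.3])
iou = np.array([[1.0, 0.6, 0.1],
                [0.6, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(groomed_rescore(scores, iou))                # box 1 is down-weighted by box 0
```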
2.4.2 Differences from Existing NMS
Although no differentiable NMS has been proposed for monocular 3D object detection, we compare our GrooMeD-NMS with the NMS proposed for 2D object detection, 2D pedestrian detection, 2D salient object detection, and 3D object detection in Tab. 2.1. No method described in Tab. 2.1 has a matrix-based closed-form mathematical expression of the NMS. Classical, Soft [12] and Distance-NMS [216] are used at inference time, while GrooMeD-NMS is used during both training and inference. Distance-NMS [216] updates the z-coordinate of the box after NMS as the weighted average of the z-coordinates of the top-κ boxes. QUBO-NMS [209], Point-NMS [116, 224], and MAP-NMS [298] are not used in end-to-end training. [3] proposes a trainable Point-NMS. The Structured-SVM based NMS [49, 244] rely on structured SVM to obtain the rescores. Adaptive-NMS [145] uses a separate neural network to predict the classical NMS threshold Nt. The trainable neural network based NMS (NN-NMS) [80, 81, 192] use a separate neural network containing multiple layers and/or message-passing to approximate the NMS and do not use the pruning function. Unlike these methods, GrooMeD-NMS uses a single layer and does not require multiple layers or message passing. Our NMS is parallel up to a group (denoted by G). However, |G| is, in general, ≪ |B| in the NMS.

Table 2.1 Comparison of different NMS. [Key: Train= End-to-end Trainable, Prune= Pruning function, #Layers= Number of layers, Par= Parallelizable]
NMS                      | Train | Rescore        | Prune | #Layers | Par
Classical                |  ✕    | ✕              | Hard  |  -      | O(|G|)
Soft-NMS [12]            |  ✕    | ✕              | Soft  |  -      | O(|G|)
Distance-NMS [216]       |  ✕    | ✕              | Hard  |  -      | O(|G|)
QUBO-NMS [209]           |  ✕    | Optimization   | -     |  -      | ✕
Point-NMS [116, 224]     |  ✕    | Point Process  | -     |  -      | ✕
Trainable Point-NMS [3]  |  ✓    | Point Process  | -     |  -      | ✕
MAP-NMS [298]            |  ✕    | MAP            | -     |  -      | ✕
Structured-NMS [49, 244] |  ✕    | SSVM           | -     |  -      | ✕
Adaptive-NMS [145]       |  ✕    | ✕              | Hard  |  > 1    | O(|G|)
NN-NMS [80, 81, 192]     |  ✓    | Neural Network | ✕     |  > 1    | O(1)
GrooMeD-NMS (Ours)       |  ✓    | Soft Matrix    | Soft  |  1      | O(|G|)

2.4.3 Target Assignment and Loss Function
Target Assignment. Our method builds on M3D-RPN [15] and uses binning and self-balancing confidence [17]. The boxes' self-balancing confidences are used as scores s, which pass through the GrooMeD-NMS layer to obtain the rescores r. The rescores signal the network if the best box has not been selected for a particular object. We extend the notion of the best 2D box [192] to 3D. The best box has the highest product of IoU2D and gIoU3D [204] with the ground truth g_l. If the product is greater than a certain threshold β, it is assigned a positive label. Mathematically,

\mathrm{target}(b_i) = \begin{cases} 1, & \text{if } \exists\, g_l \text{ s.t. } i = \arg\max_j q(b_j, g_l) \text{ and } q(b_i, g_l) \ge \beta \\ 0, & \text{otherwise} \end{cases}    (2.8)

with q(b_j, g_l) = \mathrm{IoU}_{2D}(b_j, g_l) \left( \frac{1 + \mathrm{gIoU}_{3D}(b_j, g_l)}{2} \right). gIoU3D is known to provide a signal even for non-intersecting boxes [204], where the usual IoU3D is always zero. Therefore, we use gIoU3D instead of the regular IoU3D for figuring out the best box in 3D, as many 3D boxes have a zero IoU3D overlap with the ground truth. For calculating gIoU3D, we first calculate the volume V and the hull volume V_hull of the 3D boxes. V_hull is the product of gIoU2D in Bird's Eye View (BEV), removing the rotations, and the hull of the Y dimension. gIoU3D is then given by

\mathrm{gIoU}_{3D}(b_i, b_j) = \frac{V(b_i \cap b_j)}{V(b_i \cup b_j)} + \frac{V(b_i \cup b_j)}{V_{hull}(b_i, b_j)} - 1.    (2.9)
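The following sketch illustrates the target assignment of Eqs. (2.8)-(2.9) for a simplified, axis-aligned case (yaw and the BEV-based hull used in the thesis are ignored here; the function names are illustrative):

```python
import numpy as np

def giou_3d_axis_aligned(box_a, box_b):
    """Simplified gIoU3D of Eq. (2.9) for axis-aligned boxes (yaw ignored).

    Boxes are (min_corner, max_corner) pairs of 3D coordinates. This sketch only
    illustrates V(∩)/V(∪) + V(∪)/V_hull − 1; the thesis handles rotated boxes via BEV.
    """
    a_min, a_max = map(np.asarray, box_a)
    b_min, b_max = map(np.asarray, box_b)
    inter = np.prod(np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None))
    union = np.prod(a_max - a_min) + np.prod(b_max - b_min) - inter
    hull = np.prod(np.maximum(a_max, b_max) - np.minimum(a_min, b_min))  # smallest enclosing box
    return inter / union + union / hull - 1.0

def assign_target(q_values, beta=0.3):
    """Target assignment of Eq. (2.8): only the best-overlapping box gets a positive label."""
    target = np.zeros(len(q_values))
    best = int(np.argmax(q_values))
    if q_values[best] >= beta:
        target[best] = 1.0
    return target

# gIoU3D stays informative even for disjoint boxes (negative value instead of zero):
print(giou_3d_axis_aligned(((0, 0, 0), (1, 1, 1)), ((2, 0, 0), (3, 1, 1))))

# q(b, g) = IoU2D(b, g) * (1 + gIoU3D(b, g)) / 2 for each predicted box against one ground truth.
q = np.array([0.55, 0.20, 0.05])
print(assign_target(q))          # -> [1., 0., 0.]
```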
Loss Function. Generally, the number of best boxes is less than the number of ground truths in an image, as there could be some ground-truth boxes for which no box is predicted. The tiny number of best boxes introduces a far heavier skew than the foreground-background classification. Thus, we use a modified AP-Loss [30] as our loss after NMS, since AP-Loss does not suffer from class imbalance [30]. Vanilla AP-Loss treats boxes of all images in a mini-batch equally, and the gradients are back-propagated through all the boxes. We remove this condition and rank boxes in an image-wise manner. In other words, if the best boxes are correctly ranked in one image and are not in the second, then the gradients only affect the boxes of the second image. We call this modification of AP-Loss the Imagewise AP-Loss. In other words,

\mathcal{L}_{Imagewise} = \frac{1}{N} \sum_{m=1}^{N} \mathrm{AP}\big( \mathbf{r}^{(m)}, \mathrm{target}(\mathcal{B}^{(m)}) \big),    (2.10)

where r^(m) and B^(m) denote the rescores and the boxes of the m-th image in a mini-batch, respectively. This is different from previous NMS approaches [78, 80, 81, 192], which use classification losses. Our ablation studies (Sec. 2.5.4) show that the Imagewise AP-Loss is better suited to be used after NMS than the classification loss. Our overall loss function is thus given by L = L_before + λ L_after, where L_before denotes the losses before the NMS, including classification, 2D and 3D regression, as well as confidence losses, and L_after denotes the loss term after the NMS, which is the Imagewise AP-Loss, with λ being the weight. See Sec. B.2 of the supplementary material for more details of the loss function.

2.5 Experiments
Our experiments use the most widely used KITTI autonomous driving dataset [67]. We modify the publicly available PyTorch [184] code of Kinematic-3D [17]. [17] uses DenseNet-121 [86] trained on ImageNet as the backbone and n_h = 1,024 using the 3D-RPN settings of [15]. As [17] is a video-based method while GrooMeD-NMS is an image-based method, we use the best image model of [17], henceforth called Kinematic (Image), as our baseline for a fair comparison. Kinematic (Image) is built on M3D-RPN [15] and uses binning and self-balancing confidence.
Data Splits. There are three commonly used data splits of the KITTI dataset; we evaluate our method on all three. KITTI Test (Full) split: The official KITTI 3D benchmark [1] consists of 7,481 training and 7,518 testing images [67]. KITTI Val 1 split: It partitions the 7,481 training images into 3,712 training and 3,769 validation images [17, 32, 220]. KITTI Val 2 split: It partitions the 7,481 training images into 3,682 training and 3,799 validation images [271].
Training. Training is done in two phases - warmup and full [17]. We initialize the model with the confidence prediction branch from the warmup weights and finetune using the self-balancing loss [17] and the Imagewise AP-Loss [30] after our GrooMeD-NMS. See Sec. B.3.1 of the supplementary material for more training details. We keep the weight λ at 0.05. Unless otherwise stated, we use p as the Linear function (this does not require τ) with α = 100. Nt, v and β are set to 0.4 [15, 17], 0.3 and 0.3 respectively.

Table 2.2 KITTI Test cars AP3D|R40 and APBEV|R40 comparisons (IoU3D ≥ 0.7). Previous results are quoted from the official leaderboard or from papers. [Key: Best, Second Best]
Method                 | AP3D|R40 Easy / Mod / Hard | APBEV|R40 Easy / Mod / Hard
FQNet [143]            |  2.77 /  1.51 / 1.01       |  5.40 /  3.23 /  2.46
ROI-10D [169]          |  4.32 /  2.02 / 1.46       |  9.78 /  4.91 /  3.74
GS3D [121]             |  4.47 /  2.90 / 2.47       |  8.41 /  6.08 /  4.94
MonoGRNet [195]        |  9.61 /  5.74 / 4.25       | 18.19 / 11.17 /  8.73
MonoPSR [107]          | 10.76 /  7.25 / 5.85       | 18.33 / 12.58 /  9.91
MonoDIS [222]          | 10.37 /  7.94 / 6.40       | 17.23 / 13.19 / 11.12
UR3D [216]             | 15.58 /  8.61 / 6.00       | 21.85 / 12.51 /  9.20
M3D-RPN [15]           | 14.76 /  9.71 / 7.42       | 21.02 / 13.67 / 10.23
SMOKE [153]            | 14.03 /  9.76 / 7.84       | 20.83 / 14.49 / 12.75
MonoPair [35]          | 13.04 /  9.99 / 8.65       | 19.28 / 14.83 / 12.89
RTM3D [123]            | 14.41 / 10.34 / 8.77       | 19.17 / 14.20 / 11.99
AM3D [165]             | 16.50 / 10.74 / 9.52       | 25.03 / 17.32 / 14.91
MoVi-3D [223]          | 15.19 / 10.90 / 9.26       | 22.76 / 17.03 / 10.86
RAR-Net [144]          | 16.37 / 11.01 / 9.52       | 22.45 / 15.02 / 12.93
M3D-SSD [160]          | 17.51 / 11.46 / 8.98       | 24.15 / 15.93 / 12.11
DA-3Ddet [287]         | 16.77 / 11.50 / 8.93       |   -   /   -   /   -
D4LCN [52]             | 16.65 / 11.72 / 9.51       | 22.51 / 16.02 / 12.55
Kinematic (Video) [17] | 19.07 / 12.72 / 9.17       | 26.69 / 17.52 / 13.10
GrooMeD-NMS (Ours)     | 18.10 / 12.32 / 9.65       | 26.19 / 18.27 / 14.05

Inference.
We multiply the class and predicted confidence to get the box’s overall score in inference as in [99, 216, 241]. See Sec. 2.5.2 for training and inference times. Evaluation Metrics. KITTI uses AP 3D|𝑅40 metric to evaluate object detection following [220,222]. KITTI benchmark evaluates on three object categories: Easy, Moderate and Hard. It assigns each 19 Table 2.3 KITTI Val 1 cars AP 3D| 𝑅40 and AP BEV| 𝑅40 results. [Key: Best, Second Best]. IoU3D ≥ 0.7 (cid:17)) (− IoU3D ≥ 0.5 (cid:17)) (− (− (cid:17)) Method AP 3D|𝑅40 AP 3D|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 12.50 7.34 4.98 19.49 11.51 8.72 AP BEV|𝑅40 AP BEV|𝑅40 MonoDR [7] MonoGRNet [195] in [35] 11.90 7.56 5.76 19.72 12.81 10.15 47.59 32.28 25.50 52.13 35.99 28.72 MonoDIS [222] in [220] M3D-RPN [15] in [17] MoVi-3D [223] MonoPair [35] Kinematic (Image) [17] Kinematic (Video) [17] GrooMeD-NMS (Ours) 11.06 7.60 6.37 18.45 12.58 10.66 14.53 11.07 8.65 20.85 15.62 11.88 48.56 35.94 28.59 53.35 39.60 31.77 14.28 11.13 9.68 22.36 17.87 15.73 16.28 12.30 10.42 24.12 18.17 15.76 55.38 42.39 37.99 61.06 47.63 41.92 18.28 13.55 10.13 25.72 18.82 14.48 54.70 39.33 31.25 60.87 44.36 34.48 19.76 14.10 10.47 27.83 19.72 15.10 55.44 39.47 31.26 61.79 44.68 34.56 19.67 14.32 11.27 27.38 19.75 15.92 55.62 41.07 32.89 61.83 44.98 36.29 (− - - - - - - - - - - - - - - - - - - (a) Linear Scale (b) Log Scale Figure 2.3 AP3D Comparison at different depths and IoU3D matching thresholds on KITTI Val 1 Split. object to a category based on its occlusion, truncation, and height in the image space. The AP 3D|𝑅40 performance on the Moderate category compares different models in the benchmark [67]. We focus primarily on the Car class following [17]. 2.5.1 KITTI Test Mono3D Tab. 2.2 summarizes the results of 3D object detection and BEV evaluation on KITTI Test Split. The results in Tab. 2.2 show that GrooMeD-NMS outperforms the baseline M3D-RPN [15] by a significant margin and several other SoTA methods on both the tasks. GrooMeD-NMS also outper- forms augmentation based approach MoVi-3D [223] and depth-convolution based D4LCN [52]. Despite being an image-based method, GrooMeD-NMS performs competitively to the video-based method Kinematic (Video) [17], outperforming it on the most-challenging Hard set. 20 Table 2.4 Comparisons with other NMS on KITTI Val 1 cars (IoU3D ≥ 0.7). [Key: C= Classical, S= Soft-NMS [12], D= Distance-NMS [216], G= GrooMeD-NMS ] S (− (− (cid:17)) Method Infer NMS AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard Kinematic (Image) C 18.28 13.55 10.13 25.72 18.82 14.48 Kinematic (Image) 18.29 13.55 10.13 25.71 18.81 14.48 Kinematic (Image) D 18.25 13.53 10.11 25.71 18.82 14.48 Kinematic (Image) G 18.26 13.51 10.10 25.67 18.77 14.44 C 19.67 14.31 11.27 27.38 19.75 15.93 GrooMeD-NMS S GrooMeD-NMS 19.67 14.31 11.27 27.38 19.75 15.93 D 19.67 14.31 11.27 27.38 19.75 15.93 GrooMeD-NMS G 19.67 14.32 11.27 27.38 19.75 15.92 GrooMeD-NMS 2.5.2 KITTI Val 1 Mono3D Results. Tab. 2.3 summarizes the results of 3D object detection and BEV evaluation on KITTI Val 1 Split at two IoU3D thresholds of 0.7 and 0.5 [17, 35]. Tab. 2.3 results show that GrooMeD-NMS outperforms the baseline of M3D-RPN [15] and Kinematic (Image) [17] by a significant margin. Interestingly, GrooMeD-NMS (an image-based method) also outperforms the video-based method Kinematic (Video) [17] on most of the metrics. Thus, GrooMeD-NMS performs best on 6 out of the 12 cases (3 categories × 2 tasks × 2 thresholds) while second-best on all other cases. 
The performance is especially impressive since the biggest improvements are shown on the Moderate and Hard set, where objects are more distant and occluded. AP3D at different depths and IoU3D thresholds. We next compare the AP3D performance of GrooMeD-NMS and Kinematic (Image) on linear and log scale for objects at different depths of [15, 30, 45, 60] meters and IoU3D matching criteria of 0.3 − 0.7 in Fig. 2.3 as in [17]. Fig. 2.3 shows that GrooMeD-NMS outperforms the Kinematic (Image) [17] at all depths and all IoU3D thresholds. (cid:17) Comparisons with other NMS. We compare with the classical NMS, Soft-NMS [12] and Distance- NMS [216] in Tab. 2.4. More detailed results are in Tab. B.2 of the supplementary material. The results show that NMS inclusion in the training pipeline benefits the performance, unlike [12], which suggests otherwise. Training with GrooMeD-NMS helps because the network gets an additional signal through the GrooMeD-NMS layer whenever the best-localized box corresponding to an object is not selected. Interestingly, Tab. 2.4 also suggests that replacing GrooMeD-NMS 21 Figure 2.4 Score-IoU3D plot after the NMS. GrooMeD-NMS achieves the best correlation. Table 2.5 KITTI Val 2 cars AP 3D| 𝑅40 and AP BEV| 𝑅40 comparisons. [Key: Best, *= Released, †= Retrained]. IoU3D ≥ 0.7 (cid:17)) (− IoU3D ≥ 0.5 (cid:17)) (− Method AP 3D|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard M3D-RPN [15]* 14.57 10.07 7.51 21.36 15.22 11.28 49.14 34.43 26.39 53.44 37.79 29.36 Kinematic (Image) [17]† 13.54 10.21 7.24 20.60 15.14 11.30 51.53 36.55 28.26 56.20 40.02 31.25 GrooMeD-NMS (Ours) 14.72 10.87 7.67 22.03 16.05 11.93 51.91 36.78 28.40 56.29 40.31 31.39 AP BEV|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 (cid:17)) (− (− with the classical NMS in inference does not affect the performance. Score-IoU3D Plot. We further correlate the scores with IoU3D after NMS of our model with two baselines - M3D-RPN [15] and Kinematic (Image) [17] and also the Kinematic (Video) [17] in Fig. 2.4. We obtain the best correlation of 0.345 exceeding the correlations of M3D-RPN, Kinematic (Image) and, also Kinematic (Video). This proves that including NMS in the training pipeline is beneficial. Training and Inference Times. We now compare the training and inference times of includ- ing GrooMeD-NMS in the pipeline. Warmup training phase takes about 13 hours to train on a single 12 GB GeForce GTX Titan-X GPU. Full training phase of Kinematic (Image) and GrooMeD- NMS takes about 8 and 8.5 hours respectively. The inference time per image using classical and GrooMeD-NMS is 0.12 and 0.15 ms respectively. Tab. 2.4 suggests that changing the NMS from GrooMeD to classical during inference does not alter the performance. Then, the inference time of our method is the same as 0.12 ms. 22 Table 2.6 Ablation studies of GrooMeD-NMS on KITTI Val 1 cars. 
Change from GrooMeD-NMS model: IoU3D ≥ 0.7 (− (cid:17)) IoU3D ≥ 0.5 (− (cid:17)) Changed From −− (cid:17) Training Conf+NMS− Conf+NMS− Conf+NMS− Initialization No Warmup Pruning Function Group+Mask Loss (cid:17) (cid:17) (cid:17) (cid:17) Linear − Linear − Linear − Linear − Group+Mask− Group+Mask− (cid:17) Imagewise AP − (cid:17) Imagewise AP − Inference Class*Pred− NMS Scores Class*Pred− (− (− To (cid:17)) (cid:17) (cid:17) (cid:17) AP 3D|𝑅40 AP 3D|𝑅40 AP BEV|𝑅40 AP BEV|𝑅40 No Conf+No NMS Conf+No NMS No Conf+NMS (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 16.66 12.10 9.40 23.15 17.43 13.48 51.47 38.58 30.98 56.48 42.53 34.37 19.16 13.89 10.96 27.01 19.33 14.84 57.12 41.07 32.79 61.60 44.58 35.97 15.02 11.21 8.83 21.07 16.27 12.77 48.01 36.18 29.96 53.82 40.94 33.35 15.33 11.68 8.78 21.32 16.59 12.93 49.15 37.42 30.11 54.32 41.44 33.48 Exponential, 𝜏 = 1 12.81 9.26 7.10 17.07 12.17 9.25 29.58 20.42 15.88 32.06 22.16 17.20 Exponential, 𝜏 = 0.5 [12] 18.63 13.85 10.98 27.52 20.14 15.76 56.64 41.01 32.79 61.43 44.73 36.02 18.34 13.79 10.88 27.26 19.71 15.90 56.98 41.16 32.96 62.77 45.23 36.56 Exponential, 𝜏 = 0.1 17.40 13.21 9.80 26.77 19.26 14.76 55.15 40.77 32.63 60.56 44.23 35.74 Sigmoidal, 𝜏 = 0.1 18.43 13.91 11.08 26.53 19.46 15.83 55.93 40.98 32.78 61.02 44.77 36.09 18.99 13.74 10.24 26.71 19.21 14.77 55.21 40.69 32.55 61.74 44.67 36.00 18.23 13.73 10.28 26.42 19.31 14.76 54.47 40.35 32.20 60.90 44.08 35.47 16.34 12.74 9.73 22.40 17.46 13.70 52.46 39.40 31.68 58.22 43.60 35.27 18.26 13.36 10.49 25.39 18.64 15.12 52.44 38.99 31.3 57.37 42.89 34.68 17.51 12.84 9.55 24.55 17.85 13.63 52.78 37.48 29.37 58.30 41.26 32.66 19.67 14.32 11.27 27.38 19.75 15.92 55.62 41.07 32.89 61.83 44.98 36.29 No Group Group+No Mask Vanilla AP BCE (cid:17) Class (cid:17) Pred (cid:17) (cid:17) — GrooMeD-NMS (best model) 2.5.3 KITTI Val 2 Mono3D Tab. 2.5 summarizes the results of 3D object detection and BEV evaluation on KITTI Val 2 Split at two IoU3D thresholds of 0.7 and 0.5 [17, 35]. Again, we use M3D-RPN [15] and Kinematic (Image) [17] as our baselines. We evaluate the released model of M3D-RPN [15] using the KITTI metric. [17] does not report Val 2 results, so we retrain on Val 2 using their public code. The results in Tab. 2.5 show that GrooMeD-NMS performs best in all cases. This is again impressive because the improvements are shown on Moderate and Hard set, consistent with Tabs. 2.2 and 2.3. 2.5.4 Ablation Studies on KITTI Val 1 Tab. 2.6 compares the modifications of our approach on KITTI Val 1 cars. Unless stated otherwise, we stick with the experimental settings described in Sec. 2.5. Using a confidence head (Conf+No NMS) proves beneficial compared to the warmup model (No Conf+No NMS), which is consistent with the observations of [17, 216]. Further, GrooMeD-NMS on classification scores (denoted by No Conf + NMS) is detrimental as the classification scores are not suited for localization [17,90]. Training the warmup model and then finetuning also works better than training without warmup as in [17] since the warmup phase allows GrooMeD-NMS to carry meaningful grouping of the boxes. As described in Sec. 2.4.1.5, in addition to Linear, we compare two other functions for pruning function 𝑝: Exponential and Sigmoidal. Both of them do not perform as well as the Linear 𝑝 possibly 23 because they have vanishing gradients close to overlap of zero or one. Grouping and masking both help our model to reach a better minimum. As described in Sec. 
2.4.3, Imagewise AP loss is better than the Vanilla AP loss since it treats boxes of two images differently. Imagewise AP also performs better than the binary cross-entropy (BCE) loss proposed in [78, 80, 81, 192]. Using the product of self-balancing confidence and classification scores instead of using them individually as the scores to the NMS in inference is better, consistent with [99, 216, 241]. Class confidence performs worse since it does not have the localization information while the self-balancing confidence (Pred) gives the localization without considering whether the box belongs to foreground or background. 2.6 Conclusions In this chapter, we present and integrate GrooMeD-NMS– a novel Grouped Mathematically Differentiable NMS for monocular 3D object detection, such that the network is trained end-to-end with a loss on the boxes after NMS. We first formulate NMS as a matrix operation and then do unsupervised grouping and masking of the boxes to obtain a simple closed-form expression of the NMS. GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner. As a result, GrooMeD-NMS achieves state-of-the-art monocular 3D object detection results on the KITTI benchmark dataset. Although our implementation demonstrates monocular 3D object detection, GrooMeD-NMS is fairly generic for other object detection tasks. Future work includes applying this method to tasks such as LiDAR-based 3D object detection and pedestrian detection. Limitation. GrooMeD-NMS does not fully solve the generalization issue. 24 CHAPTER 3 DEVIANT: DEPTH EQUIVARIANT NETWORK FOR MONOCULAR 3D OBJECT DETECTION Modern neural networks use building blocks such as convolutions that are equivariant to arbitrary 2D translations in the Euclidean manifold. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. Even then, all monocular 3D detectors use vanilla blocks to obtain the 3D coordinates, a task for which the vanilla blocks are not designed for. This chapter takes the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. Since the depth is the hardest to estimate for monocular detection, this chapter proposes Depth Equivariant Network (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates, and therefore, DEVIANT achieves state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. 3.1 Introduction Monocular 3D object detection is a fundamental task in computer vision, where the task is to infer 3D information including depth from a single monocular image. It has applications in augmented reality [2], gaming [201], robotics [213], and more recently in autonomous driving [15, 220] as a fallback solution for LiDAR. Most of the monocular 3D methods attach extra heads to the 2D Faster-RCNN [202] or CenterNet [306] for 3D detections. Some change architectures [123, 143, 232] or losses [15, 35]. Others incorporate augmentation [223], or confidence [17, 143]. 
Recent ones use in-network ensembles [159, 301] for better depth estimation. Most of these methods use vanilla blocks such as convolutions that are equivariant to arbitrary 2D translations [19, 198]. In other words, whenever we shift the ego camera in 2D (see 𝑡𝑢 of Fig. 3.1), the new image (projection) is a translation of the original image, and therefore, these methods output a translated feature map. However, the camera generally moves in depth in driving scenes instead of in 2D (see 𝑡𝑍 of Fig. 3.1). So, the new image is not a translation of the original input image due to the projective transform. Thus, using vanilla blocks in monocular methods is a mismatch between the assumptions and the regime where these blocks operate.

Figure 3.1 (a) Idea. Vanilla CNN is equivariant to projected 2D translations 𝑡𝑢, 𝑡𝑣 (in red) of the ego camera. The ego camera moves in 3D in driving scenes, which breaks this assumption. We propose DEVIANT, which is additionally equivariant to depth translations 𝑡𝑍 (in green) in the projective manifold. (b) Depth Equivariance. DEVIANT enforces additional consistency among the feature maps of an image and its transformation caused by the ego depth translation. T𝑠 = scale transformation, ∗ = vanilla convolution.

Additionally, there is a huge generalization gap between training and validation for monocular 3D detection. Modeling translation equivariance in the correct manifold improves generalization for tasks in spherical [41] and hyperbolic [64] manifolds. Monocular detection involves processing pixels (projections of 3D points) to obtain the 3D information, and is thus a task in the projective manifold. Moreover, the depth in monocular detection is ill-defined [232], and thus the hardest to estimate [166]. Hence, using building blocks equivariant to depth translations in the projective manifold is a natural choice for improving generalization and is also at the core of this work (See Sec. C.1.8).

Recent monocular methods use flip [15], scale [159, 223], mosaic [11, 238] or copy-paste [135] augmentation, depth-aware convolution [15], or geometry [151, 159, 218, 302] to improve generalization. Although all these methods improve performance, a major issue is that their backbones are not designed for the projective world. This results in the depth estimation going haywire with a slight ego movement [307]. Moreover, data augmentation, e.g., flips, scales, mosaic, and copy-paste, is not only limited for projective tasks, but also does not guarantee the desired behavior [63].

Table 3.1 Equivariance comparisons. [Key: Proj.= Projected, ax= axis]
Method | Proj. 2D: 𝑢-ax (𝑡𝑢) / 𝑣-ax (𝑡𝑣) | 3D Translation: 𝑥-ax (𝑡𝑋) / 𝑦-ax (𝑡𝑌) / 𝑧-ax (𝑡𝑍)
Vanilla CNN | ✓ / ✓ | − / − / −
Log-polar [313] | − / − | − / − / ✓
DEVIANT | ✓ / ✓ | − / − / ✓
Ideal | − / − | ✓ / ✓ / ✓

To address the mismatch between the assumptions and the operating regime of the vanilla blocks and to improve generalization, we take the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. We propose Depth Equivariant Network (DEVIANT), which is additionally equivariant to depth translations in the projective manifold, as shown in Tab. 3.1. Building upon the classic result from [76], we simplify it under reasonable assumptions about the camera movement in autonomous driving to get scale transformations.
The scale equivariant blocks are well-known in the literature [68, 92, 227, 309], and consequently, we replace the vanilla blocks in the backbone with their scale equivariant steerable counterparts [227] to additionally embed equivariance to depth translations in the projective manifold. Hence, DEVIANT learns consistent depth estimates and improves monocular detection. In summary, the main contributions of this work include: • We study the modeling error in monocular 3D detection and propose depth equivariant networks built with scale equivariant steerable blocks as a solution. • We achieve state-of-the-art (SoTA) monocular 3D object detection results on the KITTI and Waymo datasets in the image-only category and perform competitively to methods which use extra information. • We experimentally show that DEVIANT works better in cross-dataset evaluation suggesting better generalization than vanilla CNN backbones. 27 Transformation − Manifold − (cid:17) (cid:17) Table 3.2 Equivariances known in the literature. Translation Rotation Scale Flips Learned Euclidean Spherical Hyperbolic Projective Vanilla CNN [115] Spherical CNN [41] Hyperbolic CNN [64] Monocular Detector Polar, Log-polar [79], Steerable [266] Steerable [68] ChiralNets [288] Transformers [55] − − − − − − − − − − − − 3.2 Related Works Equivariant Neural Networks. The success of convolutions in CNN has led people to look for their generalizations [43, 262]. Convolution is the unique solution to 2D translation equivariance in the Euclidean manifold [19, 20, 198]. Thus, convolution in CNN is a prior in the Euclidean manifold. Several works explore other group actions in the Euclidean manifold such as 2D rotations [42, 50, 171, 263], scale [98, 170], flips [288], or their combinations [247, 266]. Some consider 3D translations [265] and rotations [239]. Few [55, 264, 304] attempt learning the equivariance from the data, but such methods have significantly higher data requirements [265]. Others change the manifold to spherical [41], hyperbolic [64], graphs [173], or arbitrary manifolds [97]. Monocular 3D detection involves operations on pixels which are projections of 3D point and thus, works in a different manifold namely projective manifold. Tab. 3.2 summarizes all these equivariances known thus far. Scale Equivariant Networks. Scale equivariance in the Euclidean manifold is more challenging than the rotations because of its acyclic and unbounded nature [198]. There are two major lines of work for scale equivariant networks. The first [56, 79] infers the global scale using log-polar transform [313], while the other infers the scale locally by convolving with multiple scales of images [98] or filters [278]. Several works [68, 92, 227, 309] extend the local idea, using steerable filters [62]. Another work [267] constructs filters for integer scaling. We compare the two kinds of scale equivariant convolutions on the monocular 3D detection task and show that steerable convolutions are better suited to embed depth (scale) equivariance. Scale equivariant networks have been used for classification [56, 68, 227], 2D tracking [226] and 3D object classification [56]. We are the first to use scale equivariant networks for monocular 3D detection. 28 3D Object Detection. Accurate 3D object detection uses sparse data from LiDARs [215], which are expensive and do not work well in severe weather [232] and glassy environments. 
Hence, several works focus on monocular camera-based 3D object detection, which is a simpler setup but suffers from scale/depth ambiguity [232]. Earlier approaches [31, 61, 186, 187] use hand-crafted features, while the recent ones use deep learning. Some change architectures [123, 143, 146, 232] or losses [15, 35]. Some use scale [159, 223], mosaic [238] or copy-paste [135] augmentation. Others incorporate depth in convolution [15, 52], or confidence [17, 111, 143]. More recent ones use in-network ensembles to predict the depth deterministically [301] or probabilistically [159]. A few use temporal cues [17], NMS [109], or corrected camera extrinsics [307] in the training pipeline. Some also use CAD models [25, 154] or LiDAR [199] in training. Another line of work called Pseudo-LiDAR [162, 165, 181, 221, 254] estimates the depth first, and then uses a point cloud-based 3D object detector. We refer to [163] for a detailed survey. Our work is the first to use scale equivariant blocks in the backbone for monocular 3D detection.

3.3 Background

We first provide the necessary definitions which are used throughout this chapter. These are not our contributions and can be found in the literature [21, 76, 265].

Equivariance. Consider a group of transformations 𝐺, whose individual members are 𝑔. Let Φ denote the mapping of the inputs ℎ to the outputs 𝑦, and let the inputs and outputs undergo the transformations T^ℎ_𝑔 and T^𝑦_𝑔 respectively. Then, the mapping Φ is equivariant to the group 𝐺 [265] if Φ(T^ℎ_𝑔 ℎ) = T^𝑦_𝑔 (Φℎ), ∀ 𝑔 ∈ 𝐺. Thus, equivariance provides an explicit relationship between input transformations and feature-space transformations at each layer of the neural network [265], and intuitively makes the learning easier. The mapping Φ is the vanilla convolution when T^ℎ_𝑔 = T^𝑦_𝑔 = T_t, where T_t denotes the translation t on the discrete grid [19, 20, 198]. These vanilla convolutions introduce weight-tying [115] in fully connected neural networks, resulting in greater generalization. A special case of equivariance is invariance [265], which is given by Φ(T^ℎ_𝑔 ℎ) = Φℎ, ∀ 𝑔 ∈ 𝐺. We give a small numerical check of this property for the vanilla convolution below.

Projective Transformations. Our idea is to use equivariance to depth translations in the projective manifold since the monocular detection task belongs to this manifold. A natural question to ask is whether such equivariants exist in the projective manifold. [21] answers this question in the negative, and says that such equivariants do not exist in general. However, such equivariants exist for special classes, such as planes. An intuitive way to understand this is to try to infer the rotations and translations by looking at the two projections (images). For example, the result of [21] makes sense if we consider a car with very different front and back sides, as in Fig. C.2. A 180° ego rotation around the car means the projections (images) are its front and back sides, which are different. Thus, we cannot infer the translations and rotations from these two projections. Based on this result, we stick with locally planar objects, i.e., we assume that a 3D object is made of several patch planes (see the last row of Fig. 3.2b as an example). It is important to stress that we do NOT assume that the 3D object, such as a car, is planar. The local planarity also agrees with the property that manifolds locally resemble 𝑛-dimensional Euclidean space, and because the projective transform maps planes to planes, the patch planes in 3D are also locally planar. We show a sample planar patch and the 3D object in Fig. C.1 in the appendix.
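To make the equivariance definition above concrete, the following minimal PyTorch check (our illustration, not part of the original experiments) verifies that a vanilla convolution commutes with integer 2D translations; circular padding is assumed so that the cyclic shift is exact on a finite grid.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    # Vanilla convolution Phi; circular padding makes cyclic shifts exact on a finite grid.
    phi = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular")
    h = torch.randn(1, 1, 32, 32)          # toy input image h
    t = (5, -3)                            # an integer 2D translation t = (t_v, t_u)

    lhs = phi(torch.roll(h, shifts=t, dims=(2, 3)))   # Phi(T_t h)
    rhs = torch.roll(phi(h), shifts=t, dims=(2, 3))   # T_t (Phi h)
    print(torch.allclose(lhs, rhs, atol=1e-6))        # True: translation equivariance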
Planarity and Projective Transformation. Example 13.2 from [76] links planarity and projective transformations. Although their result is for stereo with two different cameras (K, K′), we substitute K = K′ to get Th. 1.

Theorem 1. [76] Consider a 3D point lying on a patch plane 𝑚𝑥 + 𝑛𝑦 + 𝑜𝑧 + 𝑝 = 0, observed by an ego camera in a pinhole setup to give an image ℎ. Let t = (𝑡𝑋, 𝑡𝑌, 𝑡𝑍) and R = [𝑟𝑖𝑗]3×3 denote a translation and rotation of the ego camera respectively. Observing the same 3D point from the new camera position leads to an image ℎ′. Then, the image ℎ is related to the image ℎ′ by the projective transformation T:

ℎ(𝑢−𝑢0, 𝑣−𝑣0) = ℎ′(
  [(𝑟11+𝑡𝑋 𝑚/𝑝)(𝑢−𝑢0) + (𝑟21+𝑡𝑋 𝑛/𝑝)(𝑣−𝑣0) + (𝑟31+𝑡𝑋 𝑜/𝑝) 𝑓] / [(𝑟13+𝑡𝑍 𝑚/𝑝)(𝑢−𝑢0) + (𝑟23+𝑡𝑍 𝑛/𝑝)(𝑣−𝑣0) + (𝑟33+𝑡𝑍 𝑜/𝑝) 𝑓],
  [(𝑟12+𝑡𝑌 𝑚/𝑝)(𝑢−𝑢0) + (𝑟22+𝑡𝑌 𝑛/𝑝)(𝑣−𝑣0) + (𝑟32+𝑡𝑌 𝑜/𝑝) 𝑓] / [(𝑟13+𝑡𝑍 𝑚/𝑝)(𝑢−𝑢0) + (𝑟23+𝑡𝑍 𝑛/𝑝)(𝑣−𝑣0) + (𝑟33+𝑡𝑍 𝑜/𝑝) 𝑓]
),   (3.1)

where 𝑓 and (𝑢0, 𝑣0) denote the focal length and principal point of the ego camera, and (𝑡𝑋, 𝑡𝑌, 𝑡𝑍) = Rᵀt.

3.4 Depth Equivariant Backbone

The projective transformation in Eq. (3.1) from [76] is complicated and also involves rotations, and we do not know which convolution obeys this projective transformation. Hence, we simplify Eq. (3.1) under reasonable assumptions to obtain a familiar transformation for which the convolution is known.

Corollary 1.1. When the ego camera translates in depth without rotations (R = I), and the patch plane is "approximately" parallel to the image plane, the image ℎ is locally a scaled version of the second image ℎ′, independent of the focal length, i.e.,

T𝑠 : ℎ(𝑢−𝑢0, 𝑣−𝑣0) ≈ ℎ′( (𝑢−𝑢0)/(1+𝑡𝑍 𝑜/𝑝), (𝑣−𝑣0)/(1+𝑡𝑍 𝑜/𝑝) ),   (3.2)

where 𝑓 and (𝑢0, 𝑣0) denote the focal length and principal point of the ego camera, and 𝑡𝑍 denotes the ego translation in depth. See Sec. C.1.6 for the detailed explanation of Corollary 1.1. Corollary 1.1 says

T𝑠 : ℎ(𝑢−𝑢0, 𝑣−𝑣0) ≈ ℎ′( (𝑢−𝑢0)/𝑠, (𝑣−𝑣0)/𝑠 ),   (3.3)

where 𝑠 = 1 + 𝑡𝑍 𝑜/𝑝 denotes the scale and T𝑠 denotes the scale transformation. The scale 𝑠 < 1 suggests downscaling, while 𝑠 > 1 suggests upscaling. Corollary 1.1 shows that the transformation T𝑠 is independent of the focal length and that the scale is a linear function of the depth translation. Hence, a depth translation in the projective manifold induces a scale transformation and thus, depth equivariance in the projective manifold is scale equivariance in the Euclidean manifold. Mathematically, the desired equivariance is [T𝑠(ℎ) ∗ Ψ] = T𝑠[ℎ ∗ Ψ𝑠⁻¹], where Ψ denotes the filter (See Sec. C.1.7). As the CNN is not a scale equivariant (SE) architecture [227], we aim for a SE backbone, which makes the architecture equivariant to depth translations in the projective manifold.

Figure 3.2 (a) Scale Equivariance. We apply SES convolution [227] with two scales on a single-channel toy image ℎ. (b) Receptive fields of convolutions in the Euclidean manifold. Colors represent different weights, while shades represent the same weight. (c) Impact of discretization on log-polar convolution. SSIM is very low at small resolutions and is not 1 even after upscaling by 4. [Key: Up= Upscaling]
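As a quick sanity check of Corollary 1.1 (our toy illustration with assumed numbers), the snippet below projects points of a fronto-parallel patch plane with a pinhole camera before and after an ego translation along the optical axis; the projections scale uniformly about the principal point, and the scale does not depend on the focal length. Sign conventions for 𝑡𝑍 differ across formulations; the point is only that a depth translation induces a pure scaling of the projection.

    import numpy as np

    f, u0 = 700.0, 320.0                      # assumed focal length and principal point (pixels)
    Z0, tZ = 30.0, 6.0                        # plane depth and ego depth translation (meters)
    X = np.array([-5.0, -3.0, -1.5, 1.0, 2.5, 4.0])   # points on the patch plane Z = Z0

    u_before = f * X / Z0 + u0                # projections from the original camera position
    u_after  = f * X / (Z0 - tZ) + u0         # projections after moving tZ along the optical axis

    scale = (u_after - u0) / (u_before - u0)  # scaling about the principal point
    print(np.allclose(scale, Z0 / (Z0 - tZ))) # True: a single global scale, independent of f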
The scale transformation is a familiar transformation and SE convolutions are well known [68, 92, 227, 309]. Scale Equivariant Steerable (SES) Blocks. We use the existing SES blocks [226,227] to construct our Depth Equivariant Network (DEVIANT) backbone. As [226] does not construct SE-DLA-34 backbones, we construct our DEVIANT backbone as follows. We replace the vanilla convolutions by the SES convolutions [226] with the basis as Hermite polynomials. SES convolutions result in multi-scale representation of an input tensor. As a result, their output is five-dimensional instead of four-dimensional. Thus, we replace the 2D pools and batch norm (BN) by 3D pools and 3D BN respectively. The Scale-Projection layer [227] carries a max over the extra (scale) dimension to project five-dimensional tensors to four dimensions (See Fig. C.5) in the supplementary). Ablation in Sec. 3.5.3 confirms that BN and Pool (BNP) should also be SE for the best performance. The SES convolutions [68, 227, 309] are based on steerable-filters [62]. Steerable approaches [68] first pre-calculate the non-trainable multi-scale basis in the Euclidean manifold and then build filters by the linear combinations of the trainable weights w. The number of trainable weights w equals the number of filters at one particular scale. The linear combination of multi-scale basis ensures that the filters are also multi-scale. Thus, SES blocks bypass grid conversion and do not suffer from sampling effects. We show the convolution of toy image ℎ with a SES convolution in Fig. 3.2a. Let Ψ𝑠 denote 32 the filter at scale 𝑠. The convolution between downscaled image and filter T0.5(ℎ) ∗ Ψ0.5 matches the downscaled version of original image convolved with upscaled filter T0.5(ℎ ∗ Ψ1.0). Fig. 3.2a (right column) shows that the output of a CNN exhibits aliasing in general and is therefore, not scale equivariant. Log-polar Convolution: Impact of Discretization. An alternate way to convert the depth transla- tion 𝑡𝑍 of Eq. (3.2) to shift is by converting the images to log-polar space [313] around the principal point (𝑢0, 𝑣0), as ℎ(ln 𝑟, 𝜃) ≈ ℎ′ (cid:18) ln 𝑟 − ln (cid:18) 1+𝑡𝑍 (cid:19) 𝑜 𝑝 (cid:19) , , 𝜃 (3.4) with 𝑟 = √︁(𝑢−𝑢0)2+ (𝑣 − 𝑣0)2, and 𝜃 = tan−1 (cid:16) 𝑣−𝑣0 𝑢−𝑢0 scale to translation, so using convolution in the log-polar space is equivariant to the logarithm of (cid:17). The log-polar transformation converts the the depth translation 𝑡𝑍 . We show the receptive field of log-polar convolution in Fig. 3.2b. The log-polar convolution uses a smaller receptive field for objects closer to the principal point, while a larger field away from the principal point. We implemented log-polar convolution and found that its performance (See Tab. 3.11) is not acceptable, consistent with [227]. We attribute this behavior to the discretization of pixels and loss of 2D translation equivariance. Eq. (3.4) is perfectly valid in the continuous world (Note the use of parentheses instead of square brackets in Eq. (3.4)). However, pixels reside on discrete grids, which gives rise to sampling errors [112]. We discuss the impact of discretization on log-polar convolution in Sec. 3.5.2 and show it in Fig. 3.2c. Hence, we do not use log-polar convolution for the DEVIANT backbone. Comparison of Equivariance s for Monocular 3D Detection. We now compare equivariances for monocular 3D detection task. An ideal monocular detector should be equivariant to arbitrary 3D translations (𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ). 
Comparison of Equivariances for Monocular 3D Detection. We now compare equivariances for the monocular 3D detection task. An ideal monocular detector should be equivariant to arbitrary 3D translations (𝑡𝑋, 𝑡𝑌, 𝑡𝑍). However, most monocular detectors [109, 159] estimate the 2D projections of the 3D centers and the depth, which they back-project into the 3D world via the known camera intrinsics. Thus, a good enough detector should be equivariant to 2D translations (𝑡𝑢, 𝑡𝑣) of the projected centers as well as to depth translations (𝑡𝑍). Existing detector backbones [109, 159] are only equivariant to 2D translations as they use vanilla convolutions that produce 4D feature maps. A log-polar backbone is equivariant to the logarithm of depth translations but not to 2D translations. DEVIANT uses SES convolutions to produce 5D feature maps. The extra dimension in the 5D feature map captures the changes in scale (for depth), while these feature maps are individually equivariant to 2D translations (for the projected centers). Hence, DEVIANT augments the 2D translation equivariance (𝑡𝑢, 𝑡𝑣) of the projected centers with depth translation equivariance. We emphasize that although DEVIANT is not equivariant to arbitrary 3D translations in the projective manifold, it does provide equivariance to depth translations (𝑡𝑍) and is thus a first step towards the ideal equivariance. Our experiments (Sec. 3.5) show that even this additional equivariance benefits the monocular 3D detection task. This is expected because depth is the hardest parameter to estimate [166]. Tab. 3.1 summarizes these equivariances. Moreover, Tab. 3.10 empirically shows that 2D detection does not suffer and, therefore, confirms that DEVIANT indeed augments the 2D equivariance with depth equivariance. An idea similar to DEVIANT is optical expansion [280], which augments optical flow with scale information and benefits depth estimation.

3.5 Experiments

Our experiments use the KITTI [67], Waymo [230] and nuScenes [22] datasets. We modify the publicly-available PyTorch [184] code of GUP Net [159] and use the GUP Net model as our baseline. For DEVIANT, we keep the number of scales at three [226]. DEVIANT takes 8.5 hours to train and 0.04 s per image for inference on a single A100 GPU.

Evaluation Metrics. KITTI evaluates on three object categories: Easy, Moderate and Hard. It assigns each object to a category based on its occlusion, truncation, and height in the image space. KITTI uses the AP3D|R40 percentage metric on the Moderate category to benchmark models [67], following [220, 222]. Waymo evaluates on two object levels: Level_1 and Level_2. It assigns each object to a level based on the number of LiDAR points included in its 3D box. Waymo uses the APH3D percentage metric, which incorporates heading information into AP3D, to benchmark models. It also provides evaluation at three distances: [0, 30), [30, 50) and [50, ∞) meters.

Data Splits. We use the following splits of KITTI, Waymo and nuScenes:
• KITTI Test (Full) split: The official KITTI 3D benchmark [1] consists of 7,481 training and 7,518 testing images [67].
• KITTI Val split: It partitions the 7,481 training images into 3,712 training and 3,769 validation images [32].
• Waymo Val split: This split [199, 246] contains 52,386 training and 39,848 validation images from the front camera. We construct its training set by sampling every third frame from the training sequences, as in [199, 246].
• nuScenes Val split: It consists of 28,130 training and 6,019 validation images from the front camera [22]. We use this split for evaluation [218].

3.5.1 KITTI Test Mono3D

Cars. Tab. 3.3 lists the results of monocular 3D detection and BEV evaluation on KITTI Test cars. The results show that DEVIANT outperforms GUP Net and several other SoTA methods on both tasks. Except for DD3D [181] and MonoDistill [39], DEVIANT, an image-based method, also outperforms the methods that use extra information. Cyclists and Pedestrians. Tab. 3.4 lists the results of monocular 3D detection on KITTI Test cyclists and pedestrians.

Table 3.3 Results on KITTI Test cars at IoU3D ≥ 0.7. Previous results are from the leaderboard or papers. We show 3 methods in each Extra category and 6 methods in the image-only category.
[Key: Best, Second Best] Method Extra AutoShape [154] PCT [246] DFR-Net [312] MonoDistill [39] PatchNet-C [221] CaDDN [199] DD3D [181] MonoEF [307] Kinematic [17] GrooMeD-NMS [109] MonoRCNN [218] MonoDIS-M [220] Ground-Aware [151] MonoFlex [301] GUP Net [159] DEVIANT (Ours) (cid:17)) − − (cid:17)) [%](− AP BEV|𝑅40 [%](− AP 3D|𝑅40 Easy Mod Hard Easy Mod Hard 22.47 14.17 11.36 30.66 20.08 15.59 21.00 13.37 11.31 29.65 19.03 15.92 19.40 13.63 10.35 28.17 19.17 14.84 22.97 16.03 13.60 31.87 22.59 19.72 CAD Depth Depth Depth LiDAR 22.40 12.53 10.60 LiDAR 19.17 13.41 11.46 27.94 18.91 17.19 LiDAR 23.22 16.34 14.20 30.98 22.56 20.03 Odometry 21.29 13.87 11.71 29.03 19.70 17.26 19.07 12.72 9.17 26.69 17.52 13.10 18.10 12.32 9.65 26.19 18.27 14.05 18.36 12.65 10.03 25.48 18.11 14.10 16.54 12.97 11.04 24.45 19.25 16.87 21.65 13.25 9.91 29.81 17.98 13.08 19.94 13.89 12.07 28.23 19.75 16.89 20.11 14.20 11.77 21.88 14.46 11.89 29.65 20.44 17.43 Video − − − − − − − − − − − • KITTI Test (Full) split: Official KITTI 3D benchmark [1] consists of 7,481 training and 7,518 testing images [67]. • KITTI Val split: It partitions the 7,481 training images into 3,712 training and 3,769 validation images [32]. • Waymo Val split: This split [199, 246] contains 52,386 training and 39,848 validation images from the front camera. We construct its training set by sampling every third frame from the training sequences as in [199, 246]. • nuScenes Val split: It consists of 28,130 training and 6,019 validation images from the front camera [22]. We use this split for evaluation [218]. 3.5.1 KITTI Test Mono3D Cars. Tab. 3.3 lists out the results of monocular 3D detection and BEV evaluation on KITTI Test cars. Tab. 3.3 results show that DEVIANT outperforms the GUP Net and several other SoTA methods on both tasks. Except DD3D [181] and MonoDistill [39], DEVIANT, an image-based method, also outperforms other methods that use extra information. Cyclists and Pedestrians. Tab. 3.4 lists out the results of monocular 3D detection on KITTI Test 35 Table 3.4 Results on KITTI Test cyclists and pedestrians (Cyc/Ped) at IoU3D ≥ 0.5. Previous results are from the leader-board or papers. [Key: Best, Second Best] Method DDMP-3D [245] DFR-Net [312] MonoDistill [39] CaDDN [199] DD3D [181] MonoEF [307] MonoDIS-M [220] MonoFlex [301] GUP Net [159] DEVIANT (Ours) (cid:17)) Ped AP 3D|𝑅40 (cid:17)) Extra Cyc AP 3D|𝑅40 [%](− Easy Mod Hard 2.32 4.18 2.50 Depth 3.10 5.69 3.58 Depth 2.40 Depth 5.53 2.81 3.30 LiDAR 7.00 3.41 1.31 LiDAR 2.39 1.52 0.71 Odometry 1.80 0.92 0.48 1.17 0.54 1.67 3.39 2.10 2.09 4.18 2.65 2.59 5.05 3.13 − − − − [%](− Easy Mod Hard 3.01 4.93 3.55 3.39 6.09 3.62 7.45 12.79 8.17 6.76 12.87 8.14 8.05 13.91 9.30 2.21 4.27 2.79 4.42 7.79 5.14 6.81 11.89 8.16 7.87 14.72 9.53 7.69 13.43 8.65 Table 3.5 Results on KITTI Val cars. Comparison with bigger CNN backbones in Tab. C.4. 
[Key: Best, Second Best, −= No pretrain ] Method Extra IoU3D ≥ 0.7 [%](− IoU3D ≥ 0.5 [%](− [%](− (cid:17)) AP BEV|𝑅40 AP 3D|𝑅40 [%](− Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 28.12 20.39 16.34 − − 38.39 27.53 24.44 47.16 34.65 28.47 − 24.31 18.47 15.76 33.09 25.40 22.16 65.69 49.35 43.49 71.45 53.11 46.94 (cid:17)) AP BEV|𝑅40 (cid:17)) AP 3D|𝑅40 − − − − − − − − − − − − (cid:17)) − − − − 33.5 26.0 22.6 26.8 20.2 16.7 − − − − − − − − − − − − − − − − − − − − − − − − Depth Depth Depth LiDAR 23.57 16.31 13.84 − LiDAR 24.51 17.03 13.25 − LiDAR LiDAR − − − − − − Odometry 18.26 16.30 15.24 26.07 25.21 21.61 57.98 51.80 49.34 63.40 61.13 53.22 19.76 14.10 10.47 27.83 19.72 15.10 55.44 39.47 31.26 61.79 44.68 34.56 16.61 13.19 10.65 25.29 19.22 15.30 − 17.45 13.66 11.68 24.97 19.33 17.01 55.41 43.42 37.81 60.73 46.87 41.89 19.67 14.32 11.27 27.38 19.75 15.92 55.62 41.07 32.89 61.83 44.98 36.29 23.63 16.16 12.06 − 23.64 17.51 14.83 − 22.76 16.46 13.72 31.07 22.94 19.75 57.62 42.33 37.59 61.78 47.06 40.88 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 Video − − − − − − − − − 60.92 42.18 32.02 − − − − − − − − − − − − − − − − − DDMP-3D [245] PCT [246] MonoDistill [39] CaDDN [199] PatchNet-C [221] DD3D (DLA34) [181] DD3D −(DLA34) [181] MonoEF [307] Kinematic [17] MonoRCNN [218] MonoDLE [166] GrooMeD-NMS [109] Ground-Aware [151] MonoFlex [301] GUP Net (Reported)[159] GUP Net (Retrained)[159] DEVIANT (Ours) Cyclist and Pedestrians. The results show that DEVIANT achieves SoTA results in the image-only category on the challenging Cyclists, and is competitive on Pedestrians. 3.5.2 KITTI Val Mono3D Cars. Tab. 3.5 summarizes the results of monocular 3D detection and BEV evaluation on KITTI Val split at two IoU3D thresholds of 0.7 and 0.5 [35, 109]. We report the median model over 5 runs. The results show that DEVIANT outperforms the GUP Net [159] baseline by a significant margin. The biggest improvements shows up on the Easy set. Significant improvements are also 36 (a) Linear Scale (b) Log Scale Figure 3.3 AP3D at different depths and IoU3D thresholds on KITTI Val Split. Table 3.6 Cross-dataset evaluation of the KITTI Val model on KITTI Val and nuScenes frontal Val cars with depth MAE (− (cid:17) ). [Key: Best, Second Best] KITTI Val 1 nuScenes frontal Val Method 0−20 20−40 40−∞ All 0−20 20−40 40−∞ All 10.36 2.67 M3D-RPN [15] 0.56 8.65 2.39 MonoRCNN [218] 0.46 GUP Net [159] 6.20 1.45 0.45 4.50 1.26 0.40 DEVIANT 2.73 1.26 0.94 2.59 1.14 0.94 1.85 0.89 0.82 1.80 0.87 0.76 1.33 1.27 1.10 1.09 3.06 2.84 1.70 1.60 on the Moderate and Hard sets. Interestingly, DEVIANT also outperforms DD3D [181] by a large margin when the large-dataset pretraining is not done (denoted by DD3D −). AP3D at different depths and IoU3D thresholds. We next compare the AP3D of DEVIANT and GUP Net in Fig. 3.3 at different distances in meters and IoU3D matching criteria of 0.3 − 0.7 as in [109]. Fig. 3.3 shows that DEVIANT is effective over GUP Net [159] at all depths and higher (cid:17) IoU3D thresholds. Cross-Dataset Evaluation. Tab. 3.6 shows the result of our KITTI Val model on the KITTI Val and nuScenes [22] frontal Val images, using mean absolute error (MAE) of the depth of the boxes [218]. More details are in Sec. C.3.1. DEVIANT outperforms GUP Net on most of the metrics on both the datasets, which confirms that DEVIANT generalizes better than CNNs. 
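For reference, the depth MAE reported in Tab. 3.6 can be computed in a few lines once predictions are matched to GT boxes; the snippet below is our hedged sketch, which assumes the matching (e.g., by 2D/3D overlap, following [218]) is already done.

    import numpy as np

    def depth_mae_by_range(gt_z, pred_z, bins=((0, 20), (20, 40), (40, np.inf))):
        """Mean absolute depth error of matched boxes, binned by GT depth (meters)."""
        gt_z, pred_z = np.asarray(gt_z, float), np.asarray(pred_z, float)
        err = np.abs(pred_z - gt_z)
        out = {}
        for lo, hi in bins:
            mask = (gt_z >= lo) & (gt_z < hi)
            out[f"{lo}-{hi}"] = err[mask].mean() if mask.any() else float("nan")
        out["All"] = err.mean()
        return out

    # Toy usage with made-up matched depths (meters)
    print(depth_mae_by_range([8.0, 25.0, 52.0], [8.4, 26.1, 54.5]))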
DEVIANT performs exceedingly well in the cross-dataset evaluation than [15, 159, 218]. We believe this happens because [15, 159, 218] rely on data or geometry to get the depth, while DEVIANT is equivariant to the depth translations, and therefore, outputs consistent depth. So, DEVIANT is more robust to data distribution changes. 37 Table 3.7 Scale Augmentation vs Scale Equivariance on KITTI Val cars. [Key: Best, Eqv= Equivariance, Aug= Augmentation] Method GUP Net [159] DEVIANT ✓ ✓ Scale Scale Eqv Aug AP 3D|𝑅40 IoU3D ≥ 0.7 IoU3D ≥ 0.5 [%](− [%](− (cid:17)) AP 3D|𝑅40 (cid:17)) AP BEV|𝑅40 [%](− Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 20.82 14.15 12.44 29.93 20.90 17.87 62.37 44.40 39.61 66.81 48.09 43.14 ✓ 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 21.33 14.77 12.57 28.79 20.28 17.59 59.31 43.25 37.64 63.94 47.02 41.12 ✓ 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 (cid:17)) AP BEV|𝑅40 [%](− (cid:17)) Table 3.8 Comparison of Equivariant Architectures on KITTI Val cars. [Key: Best, Eqv= Equivariance, †= Retrained] IoU3D ≥ 0.7 IoU3D ≥ 0.5 Method (cid:17)) Eqv [%](− AP BEV|𝑅40 AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 4.41 3.06 2.79 20.09 13.80 12.78 26.51 18.49 17.36 1.94 1.26 1.09 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 2D +Depth 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 AP BEV|𝑅40 AP 3D|𝑅40 [%](− [%](− 2D (cid:17)) (cid:17)) DETR3D† [257] Learned GUP Net [159] DEVIANT Alternatives to Equivariance. We now compare with alternatives to equivariance in the following paragraphs. (a) Scale Augmentation. A withstanding question in machine learning is the choice between equivariance and data augmentation [63]. Tab. 3.7 compares scale equivariance and scale augmen- tation. GUP Net [159] uses scale-augmentation and therefore, Tab. 3.7 shows that equivariance also benefits models which use scale-augmentation. This agrees with Tab. 2 of [227], where they observe that both augmentation and equivariance benefits classification on MNIST-scale dataset. (b) Other Equivariant Architectures. We now benchmark adding depth (scale) equivariance to a 2D translation equivariant CNN and a transformer which learns the equivariance. Therefore, we compare DEVIANT with GUP Net [159] (a CNN), and DETR3D [257] (a transformer) in Tab. 3.8. As DETR3D does not report KITTI results, we trained DETR3D on KITTI using their public code. DEVIANT outperforms GUP Net and also surpasses DETR3D by a large margin. This happens because learning equivariance requires more data [265] compared to architectures which hardcode equivariance like CNN or DEVIANT. (c) Dilated Convolution. DEVIANT adjusts the receptive field based on the object scale, and so, we compare with the dilated CNN (DCNN) [291] and D4LCN [52] in Tab. 3.9. The results 38 Table 3.9 Comparison with Dilated Convolution on KITTI Val cars. [Key: Best] IoU3D≥ 0.7 IoU3D≥ 0.5 Method Extra AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard AP BEV|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 [%](− [%](− [%](− (cid:17)) (cid:17)) (cid:17)) D4LCN [52] Depth 22.32 16.20 12.30 31.53 22.58 17.87 DCNN [291] DEVIANT 21.66 15.49 12.90 30.22 22.06 19.01 57.54 43.12 38.80 63.29 46.86 42.42 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 − − − − − − − − show that DCNN performs sub-par to DEVIANT. 
show that DCNN performs sub-par to DEVIANT. This is expected because dilation corresponds to integer scales [267], while the scaling is generally a float in monocular detection. D4LCN [52] uses monocular depth as input to adjust the receptive field. DEVIANT (without depth) also outperforms D4LCN on Hard cars, which are more distant.

(d) Other Convolutions. We now compare with other known convolutions in the literature, such as the log-polar convolution [313], dilated convolution [291] and DISCO [225], in Tab. 3.11. The results show that the log-polar convolution does not work well, and SES convolutions are better suited to embed depth (scale) equivariance. As described in Sec. 3.4, we investigate the behavior of the log-polar convolution through a small experiment. We calculate the SSIM [258] of the original image and the image obtained after the upscaling, log-polar, inverse log-polar, and downscaling blocks. We then average the SSIM over all KITTI Val images. We repeat this experiment for multiple image heights and scaling factors. The ideal SSIM should be one. However, Fig. 3.2c shows that the SSIM does not reach 1 even after upscaling by 4. This result confirms that the log-polar convolution loses information at low resolutions, resulting in inaccurate detection. Next, the results show that the dilated convolution [291] performs sub-par to DEVIANT. Moreover, DISCO [225] also does not outperform the SES convolution, which agrees with the 2D tracking results of [225].

(e) Feature Pyramid Network (FPN). Our baseline GUP Net [159] uses FPN [138], and Tab. 3.5 shows that DEVIANT outperforms GUP Net. Hence, we conclude that equivariance also benefits models which use FPN.

Comparison of Equivariance Error. We next quantitatively evaluate the scale equivariance of DEVIANT vs. GUP Net [159], using the equivariance error metric [227]. The equivariance error Δ is the normalized difference between the scaled feature map and the feature map of the scaled image, and is given by

Δ = (1/𝑁) Σ_{𝑖=1}^{𝑁} ||T𝑠𝑖 Φ(ℎ𝑖) − Φ(T𝑠𝑖 ℎ𝑖)||²2 / ||T𝑠𝑖 Φ(ℎ𝑖)||²2,

where Φ denotes the neural network, T𝑠𝑖 is the scaling transformation for the image 𝑖, and 𝑁 is the total number of images. The equivariance error is zero if the scale equivariance is perfect. We give a short numerical sketch of this metric below. We plot the log of this error at different blocks of the DEVIANT and GUP Net backbones, and also at different downscalings of KITTI Val images, in Fig. 3.4. The plots show that DEVIANT has lower equivariance error than GUP Net. This is expected since the feature maps of the proposed DEVIANT are additionally equivariant to scale transformations (depth translations). We also visualize the equivariance error for a validation image and for the objects of this image in Figs. C.8a and C.8b in the supplementary. The qualitative plots also show a lower error for the proposed DEVIANT, which agrees with Fig. 3.4. Fig. C.8b shows that the equivariance error is particularly low for nearby cars, which also justifies the good performance of DEVIANT on Easy (nearby) cars in Tabs. 3.3 and 3.5.

Figure 3.4 Log Equivariance Error (Δ) comparison for DEVIANT and GUP Net at (a) different blocks with random image scaling factors, and (b) different image scaling factors at depth 3. DEVIANT shows lower scale equivariance error than vanilla GUP Net [159].

Does 2D Detection Suffer? We now investigate whether 2D detection suffers from using the DEVIANT backbone in Tab. 3.10. The results show that DEVIANT introduces a minimal decrease in the 2D detection performance.
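A minimal sketch of the equivariance error Δ defined above follows (our illustration; the feature resolutions are aligned by a final resize since strided backbones rarely produce exactly matching sizes).

    import torch
    import torch.nn.functional as F

    def scale_T(x, s):
        """Scale transformation T_s: bilinear rescaling of an image or feature map by s."""
        return F.interpolate(x, scale_factor=s, mode="bilinear", align_corners=False)

    @torch.no_grad()
    def equivariance_error(phi, images, scales):
        """Delta = (1/N) sum_i ||T_{s_i} Phi(h_i) - Phi(T_{s_i} h_i)||^2 / ||T_{s_i} Phi(h_i)||^2."""
        total = 0.0
        for h, s in zip(images, scales):
            a = scale_T(phi(h), s)                 # scaled feature map
            b = phi(scale_T(h, s))                 # feature map of the scaled image
            b = F.interpolate(b, size=a.shape[-2:], mode="bilinear", align_corners=False)
            total += ((a - b).pow(2).sum() / a.pow(2).sum()).item()
        return total / len(images)

    # Toy usage: a random convolution stands in for the backbone Phi.
    phi = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
    imgs = [torch.randn(1, 3, 96, 320) for _ in range(4)]
    print(equivariance_error(phi, imgs, scales=[0.80, 0.85, 0.90, 0.95]))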
This is consistent with [226], who report that 2D tracking improves with the SE networks. 40 Table 3.10 3D and 2D detection on KITTI Val cars. IoU ≥ 0.7 IoU ≥ 0.5 Method AP 3D|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard GUP Net [159] 21.10 15.48 12.88 96.78 88.87 79.02 58.95 43.99 38.07 99.52 91.89 81.99 DEVIANT (Ours) 24.63 16.54 14.52 96.68 88.66 78.87 61.00 46.00 40.18 97.12 91.77 81.93 AP 2D|𝑅40 AP 3D|𝑅40 AP 2D|𝑅40 [%](− [%](− [%](− [%](− (cid:17)) (cid:17)) (cid:17)) Table 3.11 Ablation studies on KITTI Val cars. Change from DEVIANT: IoU3D ≥ 0.7 [%](− IoU3D ≥ 0.5 [%](− To [%](− (cid:17)) AP 3D|𝑅40 (cid:17)) AP BEV|𝑅40 (cid:17)) AP BEV|𝑅40 Changed From −− SES− Convolution SES− (cid:17) SES− (cid:17) SES− (cid:17) 5% Downscale 10% − (cid:17) 10% − 20% (cid:17) SE− Vanilla (cid:17) 3 − 1 (cid:17) 2 3 − (cid:17) DEVIANT (best) (cid:17) AP 3D|𝑅40 [%](− Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard (cid:17) 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 Vanilla Log-polar [313] 9.19 6.77 5.78 16.39 11.15 9.80 40.51 27.62 23.90 45.66 31.34 25.80 21.66 15.49 12.90 30.22 22.06 19.01 57.54 43.12 38.80 63.29 46.86 42.42 Dilated[291] 20.21 13.84 11.46 28.56 19.38 16.41 55.22 39.76 35.37 59.46 43.16 38.52 DISCO[225] 24.24 16.51 14.43 31.94 22.86 19.82 60.64 44.46 40.02 64.68 49.30 43.49 22.19 15.85 13.48 31.15 23.01 19.90 61.24 44.93 40.22 67.46 50.10 43.83 24.39 16.20 14.36 32.43 22.53 19.70 62.81 46.14 40.38 67.87 50.23 44.08 23.20 16.29 13.63 31.76 23.23 19.97 61.90 46.66 40.61 67.37 50.31 43.93 24.15 16.48 14.55 32.42 23.17 20.07 61.05 46.34 40.46 67.36 50.32 44.07 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 𝛼 BNP Scales — (cid:17)) 3.5.3 Ablation Studies on KITTI Val Tab. 3.11 compares the modifications of our approach on KITTI Val cars based on the experi- mental settings of Sec. 3.5. (a) Floating or Integer Downscaling? We next investigate the question that whether one should use floating or integer downscaling factors for DEVIANT. We vary the downscaling factors as 1+𝛼 , 1(cid:17). We find that 𝛼 of 10% works the best. We again bring up the dilated convolution (Dilated) results at this point because dilation (1+2𝛼, 1+𝛼, 1) and therefore, our scaling factor 𝑠 = 1+2𝛼 , 1 (cid:16) 1 is a scale equivariant operation for integer downscaling factors [267] (𝛼 = 100%, 𝑠 = 0.5). Tab. 3.11 results suggest that the downscaling factors should be floating numbers. (b) SE BNP. As described in Sec. 3.4, we ablate DEVIANT against the case when only convolutions are SE but BNP layers are not. So, we place Scale-Projection [227] immediately after every SES convolution. Tab. 3.11 shows that such a network performs slightly sub-optimal to our final model. (c) Number of Scales. We next ablate against the usage of Hermite scales. Using three scales performs better than using only one scale especially on Mod and Hard objects, and slightly better than using two scales. 41 Table 3.12 Waymo Val vehicles detection results. 
[Key: Best, Second Best] IoU3D Difficulty Method CaDDN [199] PatchNet [162] in [246] PCT [246] 0.7 Level_1 M3D-RPN [15] in [199] GUP Net (Retrained) [159] DEVIANT (Ours) CaDDN [199] PatchNet [162] in [246] PCT [246] 0.7 Level_2 M3D-RPN [15] in [199] GUP Net (Retrained) [159] DEVIANT (Ours) CaDDN [199] PatchNet [162] in [246] PCT [246] 0.5 Level_1 M3D-RPN [15] in [199] GUP Net (Retrained) [159] DEVIANT (Ours) CaDDN [199] PatchNet [162] in [246] PCT [246] 0.5 Level_2 M3D-RPN [15] in [199] GUP Net (Retrained) [159] DEVIANT (Ours) 3.5.4 Waymo Val Mono3D Extra APH3D [%](− (cid:17)) AP3D [%](− (cid:17)) 0-30 30-50 50-∞ All All 0.39 0.89 0.35 2.28 2.69 0.38 0.66 0.33 2.14 2.52 LiDAR 5.03 14.54 1.47 0.10 1.67 0.13 0.03 Depth 3.18 0.27 0.07 Depth 1.12 0.18 0.02 − 6.15 0.81 0.03 − 6.95 0.99 0.02 − LiDAR 4.49 14.50 1.42 0.09 1.67 0.13 0.03 Depth 3.18 0.27 0.07 Depth 0.18 0.02 1.12 − 6.13 0.78 0.02 − 6.93 0.95 0.02 − 0-30 30-50 50-∞ 4.99 14.43 1.45 0.10 1.63 0.12 0.03 0.39 3.15 0.27 0.07 0.88 1.10 0.18 0.02 0.34 2.27 6.11 0.80 0.03 6.90 0.98 0.02 2.67 4.45 14.38 1.41 0.09 1.63 0.11 0.03 0.36 3.15 0.26 0.07 0.66 1.10 0.17 0.02 0.33 6.08 0.77 0.02 2.12 6.87 0.94 0.02 2.50 LiDAR 17.54 45.00 9.24 0.64 17.31 44.46 9.11 0.62 9.75 0.96 0.18 2.74 Depth 2.92 10.03 1.09 0.23 4.15 14.54 1.75 0.39 Depth 4.20 14.70 1.78 0.39 3.63 10.70 2.09 0.21 3.79 11.14 2.16 0.26 − 10.02 24.78 4.84 0.22 9.94 24.59 4.78 0.22 − 10.98 26.85 5.13 0.18 10.89 26.64 5.08 0.18 − LiDAR 16.51 44.87 8.99 0.58 16.28 44.33 8.86 0.55 2.28 Depth 9.73 0.97 0.16 2.42 10.01 1.07 0.22 4.15 14.51 1.71 0.35 Depth 4.03 14.67 1.74 0.36 3.46 10.67 2.04 0.20 3.61 11.12 2.12 0.24 − 9.39 24.69 4.67 0.19 9.31 24.50 4.62 0.19 − 10.29 26.75 4.95 0.16 10.20 26.54 4.90 0.16 − We also benchmark our method on the Waymo dataset [230] which has more variability than KITTI. Tab. 3.12 shows the results on Waymo Val split. The results show that DEVIANT outperforms the baseline GUP Net [159] on multiple levels and multiple thresholds. The biggest gains are on the nearby objects which is consistent with Tabs. 3.3 and 3.5. Interestingly, DEVIANT also outperforms PatchNet [162] and PCT [246] without using depth. Although the performance of DEVIANT lags CaDDN [199], it is important to stress that CaDDN uses LiDAR data in training, while DEVIANT is an image-only method. 3.6 Conclusions This chapter studies the modeling error in monocular 3D detection in detail and takes the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. Since the depth is the hardest to estimate for this task, this chapter proposes Depth Equivariant 42 Network (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates and therefore, DEVIANT achieves SoTA detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. Future works include applying the idea to Pseudo-LiDAR [254], and monocular 3D tracking. Limitation. DEVIANT does not model 3D equivariance but only a special case of 3D equivariance. Considerably less number of boxes are detected in the cross-dataset evaluation. 
CHAPTER 4
SEABIRD: SEGMENTATION IN BIRD'S VIEW WITH DICE LOSS IMPROVES MONOCULAR 3D DETECTION OF LARGE OBJECTS

Monocular 3D detectors achieve remarkable performance on cars and smaller objects. However, their performance drops on larger objects, leading to fatal accidents. Some attribute the failures to training data scarcity or to the receptive field requirements of large objects. In this chapter, we highlight this understudied problem of generalization to large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to the noise of larger objects. To bridge this gap, we comprehensively investigate regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird's View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection, with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI-360 leaderboard and improves existing detectors on the nuScenes leaderboard, particularly for large objects.

4.1 Introduction

The monocular 3D object detection (Mono3D) task aims to estimate both the 3D position and dimensions of objects in a scene from a single image. Its applications span autonomous driving [108, 132, 181], robotics [213], and augmented reality [2, 172, 183, 293], where an accurate 3D understanding of the environment is crucial. Our study focuses explicitly on 3D object detectors applied to autonomous vehicles (AVs), since the challenges and motivations differ drastically across applications.

AVs demand object detectors that generalize to diverse intrinsics [14], camera rigs [94, 104], rotations [177], weather and geographical conditions [54], and that are also robust to adversarial examples [310]. Since each of these poses a significant challenge, recent works focus exclusively on the generalization of object detectors to all these out-of-distribution shifts. However, our focus is on a generalization of another type, which, thus far, has been understudied in the literature: Mono3D generalization to large objects. Large objects like trailers, buses and trucks are harder to detect [268] in Mono3D, sometimes resulting in fatal accidents [23, 60].

Figure 4.1 Teaser. (a) Improve KITTI-360 SoTA. SoTA frontal detectors struggle with large objects (low AP𝐿𝑟𝑔) even on the nearly balanced KITTI-360 dataset. Our proposed SeaBird achieves significant Mono3D improvements, particularly for large objects. (b) Improve nuScenes Val SoTA. SeaBird also improves two SoTA BEV detectors, BEVerse-S [303] and HoP [311], on the nuScenes dataset, particularly for large objects. (c) Theory Advancement. Plot of the convergence variance Var(𝜖) of dice and regression losses with the noise 𝜎 in depth prediction. The 𝑦-axis denotes the deviation from the optimal weight, so the lower the better. SeaBird leverages the dice loss, which we prove is more noise-robust than regression losses for large objects.
Some attribute these failures to training data scarcity [308] or the receptive field requirements [268] of large objects, but, to the best of our knowledge, no existing literature provides a comprehensive analytical explanation for this phenomenon. The goal of this chapter is, thus, to bring understanding and a first analytical approach to this real-world problem in the AV space – Mono3D generalization to large objects. We conjecture that the generalization issue stems not only from limited training data or larger receptive field but also from the noise sensitivity of depth regression losses in Mono3D. To substantiate our argument, we analyze the Mono3D performance of state-of-the-art (SoTA) frontal detectors on the KITTI-360 dataset [136], which includes almost equal number (1 : 2) of large objects and cars. We observe that SoTA detectors struggle with large objects on this dataset (Fig. 4.1a). Next, we carefully investigate the SGD convergence of losses used in Mono3D task and mathematically prove that the dice loss, widely used in BEV segmentation, exhibits superior noise-robustness than the regression losses, particularly for large objects (Fig. 4.1c). Thus, the dice loss facilitates better model convergence than regression losses, improving Mono3D of large 45 objects. Incorporating dice loss in detection introduces unique challenges. Firstly, the dice loss does not apply to sparse detection centers and only incorporates depth information when used in the BEV space. Secondly, naive joint training of Mono3D and BEV segmentation tasks with image inputs does not always benefit Mono3D task [132, 167] due to negative transfer [45], and the underlying reasons remain unclear. Fortunately, many Mono3D segmentors and detectors are in the BEV space, where the BEV segmentor can seamlessly apply dice loss and the BEV detector can readily benefit from the segmentor in the same space. To mitigate negative transfer, we find it effective to train the BEV segmentation head on the foreground detection categories. Building upon our theoretical findings about the dice loss, we propose a simple and effective pipeline called Segmentation in Bird’s View (SeaBird) for enhancing Mono3D of large objects. SeaBird employs a sequential approach for the BEV segmentation and Mono3D heads (Fig. 4.2). SeaBird first utilizes a BEV segmentation head to predict the segmentation of only foreground objects, supervised by the dice loss. The dice loss offers superior noise-robustness for large objects, ensuring stable convergence, while focusing on foreground objects in segmentation mitigates negative transfer. Subsequently, SeaBird concatenates the resulting BEV segmentation map with the original BEV features as an additional feature channel and feeds this concatenated feature to a Mono3D head supervised by Mono3D losses1. Building upon this, we adopt a two-stage training pipeline: the first stage exclusively focuses on training the BEV segmentation head with dice loss, which fully exploits its noise-robustness and superior convergence in localizing large objects. The second stage involves both the detection loss and dice loss to finetune the Mono3D head. In our experiments, we first comprehensively evaluate SeaBird and conduct ablations on the balanced single-camera KITTI-360 dataset [136]. SeaBird outperforms the SoTA baselines by a substantial margin. Subsequently, we integrate SeaBird as a plug-in-and-play module into two SoTA detectors on the multi-camera nuScenes dataset [22]. 
SeaBird again significantly improves the original detectors, particularly on large objects. Additionally, SeaBird consistently enhances Mono3D performance across backbones with those two SoTA detectors (Fig. 4.1b), demonstrating its utility in both edge and cloud deployments.

1 Only the Mono3D head predicts the additional 3D attributes, namely the object's height and elevation.

Figure 4.2 SeaBird Pipeline. SeaBird uses the predicted BEV foreground segmentation (For. Seg.) map to predict accurate 3D boxes for large objects. The SeaBird training protocol involves BEV segmentation pre-training with the noise-robust dice loss and Mono3D fine-tuning.

In summary, we make the following contributions:
• We highlight the understudied problem of generalization to large objects in Mono3D, showing that even on nearly balanced datasets, SoTA frontal models struggle to generalize due to the noise sensitivity of regression losses.
• We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case, and provide empirical support for more general settings.
• We propose SeaBird, which treats the BEV segmentation head on foreground objects and the Mono3D head sequentially, and trains them with a two-stage protocol to fully harness the noise-robustness of the dice loss.
• We empirically validate our theoretical findings and show significant improvements, particularly for large objects, on both the KITTI-360 and nuScenes leaderboards.

4.2 Related Works

Mono3D. Mono3D's popularity stems from its high accessibility from consumer vehicles compared to LiDAR/Radar-based detectors [155, 215, 290] and its computational efficiency compared to stereo-based detectors [34]. Earlier approaches [31, 186] leverage hand-crafted features, while recent ones use deep networks. Advancements include introducing new architectures [89, 217, 275], equivariance [29, 108], losses [15, 35], uncertainty [111, 159], and incorporating auxiliary tasks such as depth [175, 301], NMS [109, 147, 216], corrected extrinsics [307], CAD models [25, 117, 154] or LiDAR [199] in training. A particular line of work called Pseudo-LiDAR [165, 254] shows generalization by first estimating the depth, followed by a point cloud-based 3D detector. Another line of work encodes the image into latent BEV features [164] and attaches multiple heads for downstream tasks [303]. Some focus on pre-training [272] and rotation-equivariant convolutions [59]. Others introduce new coordinate systems [95], queries [128, 161], or positional encoding [219] in a transformer-based detection framework [24]. Some use pixel-wise depth [88], object-wise depth [38, 40, 141], or depth-aware queries [296], while many utilize temporal fusion [17, 150, 248, 261] to boost performance. A few use a longer frame history [182, 311], distillation [105, 260] or stereo [125, 261]. We refer to [163, 167] for detailed surveys. SeaBird also builds upon the BEV-based framework since it flexibly accepts single or multiple images as input and uses the dice loss. Different from the majority of other detectors, SeaBird improves Mono3D of large objects using the power of the dice loss. SeaBird is also the first work to mathematically prove and justify this loss choice for large objects.

BEV Segmentation. BEV segmentation typically utilizes BEV features transformed from 2D image features. Various methods encode single or multiple images into BEV features using MLPs [180] or transformers [205, 211].
Some employ a learned depth distribution [83, 188], while others use attention [211, 305] or attention fields [37]. Image2Maps [211] utilizes polar rays, while PanopticBEV [72] uses transformers. FIERY [83] introduces uncertainty modelling and temporal fusion, while Simple-BEV [74] uses radar aggregation. Since BEV segmentation lacks object height and elevation, one also needs a Mono3D head to predict 3D boxes.

Joint Mono3D and BEV Segmentation. Joint 3D detection and BEV segmentation using LiDAR data [58, 215] as input benefits both tasks [252, 281]. However, joint learning on image data often hinders detection performance [132, 167, 272, 303], while the BEV segmentation improvement is inconsistent across categories [167]. Unlike these works, which treat the two heads in parallel and decrease Mono3D performance [167], SeaBird treats the heads sequentially and increases Mono3D performance, particularly for large objects.

Figure 4.3 (a) Problem setup. The single-layer neural network takes an image h (or its features) and predicts the depth ẑ and the object length ℓ. The noise 𝜂 is the additive error in depth prediction and is a normal random variable. The GT depth 𝑧 supervises the predicted depth ẑ with a loss L in training. We assume the network predicts the GT length ℓ. Frontal detectors directly regress the depth with the L1, L2, or Smooth L1 loss, while SeaBird projects to the BEV plane and supervises through the dice loss L𝑑𝑖𝑐𝑒. (b) Shifting of the predictions (blue) in BEV along the ray due to the noise 𝜂. (c) Cross-section (CS) view along the ray with classification scores 𝑃(𝑍).

4.3 SeaBird

SeaBird is driven by a deep understanding of the distinctions between monocular regression and BEV segmentation losses. Thus, in this section, we delve into the problem and discuss existing results. We then present our theoretical findings and, subsequently, introduce our pipeline. We introduce the problem and refer to Lemma 1 from the literature [113, 214], which evaluates loss quality by measuring the deviation of the trained weight (after SGD updates) from the optimal weight. Fig. 4.3a illustrates the problem setup. Figs. 4.3b and 4.3c visualize the BEV and cross-section views, respectively. Since this deviation depends on the gradient variance of the loss, we next derive the gradient variance of the dice loss in Lemma 2. By comparing the distance between the trained weight and the optimal weight, we assess the effectiveness of the dice loss versus the MAE (L1) and MSE (L2) losses in Lemma 3, and choose the representation and loss combination. Combining these findings, we establish in Th. 2 that a model trained with the dice loss achieves better AP than a model trained with regression losses. Finally, we present our pipeline, SeaBird, which integrates BEV segmentation supervised by the dice loss for Mono3D.

4.3.1 Background and Problem Statement

Mono3D networks [108, 159] commonly employ regression losses, such as the L1 or L2 loss, to compare the predicted depth with the ground truth (GT) depth [108, 303]. In contrast, BEV segmentation utilizes the dice loss [211] or cross-entropy loss [83] at each BEV location, comparing it with the GT. Despite these distinct loss functions, we evaluate their effectiveness under an idealized model, where we measure the model quality by the expected deviation of the trained weight (after SGD updates) from the optimal weight [214].
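For concreteness, a minimal soft dice loss over BEV maps is sketched below (our illustration; the exact smoothing and reduction used by SeaBird may differ).

    import torch

    def soft_dice_loss(pred, target, eps=1e-6):
        """Soft dice loss for BEV segmentation.
        pred:   (B, C, H, W) predicted foreground probabilities in [0, 1].
        target: (B, C, H, W) binary GT BEV maps rendered from the GT 3D boxes."""
        dims = (0, 2, 3)                               # sum over batch and the BEV grid
        inter = (pred * target).sum(dims)
        denom = pred.sum(dims) + target.sum(dims)
        dice = (2.0 * inter + eps) / (denom + eps)     # per-class dice coefficient
        return (1.0 - dice).mean()                     # average loss over foreground classes

    # Toy usage
    pred = torch.rand(2, 3, 128, 128)
    target = (torch.rand(2, 3, 128, 128) > 0.8).float()
    print(soft_dice_loss(pred, target))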
Consider a linear regression model with trainable weight w for depth prediction ˆ𝑧 from an image h. Assume the noise 𝜂 is an additive error in depth prediction and is a normal random variable N (0, 𝜎2). Also, assume SGD optimizes the model parameters with loss function L during training with square summable steps 𝑠 𝑗 , i.e. 𝑠 = lim 𝑡→∞ 𝑡 (cid:205) 𝑗=1 𝑠2 𝑗 exists and 𝜂 is independent of the image. Then, the expected deviation of the trained weight Lw∞ from the optimal weight w∗ obeys E (cid:16)(cid:13) (cid:13) Lw∞−w∗ (cid:17) 2 (cid:13) (cid:13) 2 = 𝑐1Var(𝜖) + 𝑐2, (4.1) where 𝜖 = 𝜕L (𝜂) 𝜕𝜂 is the gradient of the loss L wrt noise, 𝑐1 = 𝑠E(h𝑇 h) and 𝑐2 are constants independent of the loss. We refer to Sec. D.1.1 for the proof. Eq. (4.1) demonstrates that training losses L exhibit varying gradient variances Var(𝜖). Hence, comparing this term for different losses allows us to evaluate their quality. 4.3.2 Loss Analysis: Dice vs. Regression Given that [214] provides the gradient variance Var(𝜖), for L1 and L2 losses, we derive the corresponding gradient variance for dice and IoU losses in this chapter to facilitate comparison. First, we express the dice loss, L𝑑𝑖𝑐𝑒, as a function of noise 𝜂 as per its definition from [211] for Fig. 4.3c as: L𝑑𝑖𝑐𝑒 (𝜂) = 1−2 Pred GT Pred + GT = 1−2 ℓ−|𝜂| 2ℓ , |𝜂| ≤ ℓ 1 , |𝜂| ≥ ℓ    50 Table 4.1 Convergence variance of training loss functions. Gradient variance of L𝑑𝑖𝑐𝑒 is more noise-robust for large objects, resulting in better detectors. We do not analyze cross-entropy loss theoretically since its Var(𝜖) is infinite, but empirically in Tab. 4.5. Loss L L1 [214] (App. D.1.2.1) L2 [214] (App. D.1.2.2) Dice (Lemma 2) Gradient 𝜖 sgn(𝜂) 𝜂 , |𝜂| ≤ ℓ , |𝜂| ≥ ℓ (cid:40) sgn(𝜂) ℓ 0 (cid:17) ) Var(𝜖) (− 1 𝜎2 (cid:18) ℓ Erf √ (cid:19) 1 ℓ2 2𝜎 (4.2) =⇒ L𝑑𝑖𝑐𝑒 (𝜂) = |𝜂| ℓ , |𝜂| ≤ ℓ 1 , |𝜂| ≥ ℓ ,    where ℓ denotes the object length. Eq. (4.2) shows that the dice loss L𝑑𝑖𝑐𝑒 depends on the object size ℓ. With the given dice loss L𝑑𝑖𝑐𝑒, we proceed to derive the following lemma: Lemma 2. Gradient variance of dice loss. Let 𝜂 = N (0, 𝜎2) be an additive normal random variable and ℓ be the object length. Let Erf be the error function. Then, the gradient variance of the dice loss Var𝑑𝑖𝑐𝑒 (𝜖) wrt noise 𝜂 is Var𝑑𝑖𝑐𝑒 (𝜖) = 1 ℓ2 Erf (cid:19) . (cid:18) √ ℓ 2𝜎 (4.3) We refer to Sec. D.1.2.3 for the proof. Eq. (4.3) shows that gradient variance of the dice loss Var𝑑𝑖𝑐𝑒 (𝜖) also varies inversely to the object size ℓ and the noise deviation 𝜎 (See Sec. D.1.5). These two properties of dice loss are particularly beneficial for large objects. Tab. 4.1 summarizes these losses, their gradients, and gradient variances. With Var𝑑𝑖𝑐𝑒 (𝜖) derived for the dice loss, we now compare the deviation of trained weight with the deviations from L1 or L2 losses, leading to our next lemma. Lemma 3. Dice model is closer to optimal weight than regression loss models. Based on Lemma 1 (cid:17) and assuming the object length ℓ is a constant, if 𝜎𝑚 is the solution of the equation 𝜎2 = 1 √ and the noise deviation 𝜎 ≥ 𝜎𝑐 = max (cid:16) 2 dice loss L𝑑𝑖𝑐𝑒 is better than the converged weight 𝑟w∞ with the L1 or L2 loss, i.e. (cid:16) ℓ √ 2𝜎 , then the converged weight 𝑑w∞ with the ℓ Erf−1(ℓ2) ℓ2 Erf 𝜎𝑚, (cid:17) E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13)2 ≤ E (∥𝑟w∞ − w∗∥2) . (4.4) 51 Figure 4.4 Plot of convergence variance Var(𝜖) of loss functions with the noise 𝜎. Dice loss has minimum convergence variance with large noise, resulting in better detectors for large objects. 
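To make the comparison in Tab. 4.1 and Fig. 4.4 concrete, the short NumPy sketch below estimates the three gradient variances by Monte Carlo, using the gradients exactly as listed in Tab. 4.1, and checks the dice estimate against the closed form of Eq. (4.3). It is an illustrative check under the assumed noise model η ∼ N(0, σ²); the object length and noise levels are example values, and the snippet is not part of our released training code.

# Monte-Carlo check of the gradient variances listed in Tab. 4.1 (and plotted in Fig. 4.4).
# Illustrative sketch under the stated noise model; not part of the SeaBird code release.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

def grad_l1(eta):              # gradient of the L1 loss w.r.t. the noise: sgn(eta)
    return np.sign(eta)

def grad_l2(eta):              # gradient of the L2 loss w.r.t. the noise: eta
    return eta

def grad_dice(eta, ell):       # dice gradient, Eq. (4.2): sgn(eta)/ell if |eta| <= ell, else 0
    return np.where(np.abs(eta) <= ell, np.sign(eta) / ell, 0.0)

def var_dice_closed_form(sigma, ell):   # Eq. (4.3): (1/ell^2) Erf(ell / (sqrt(2) sigma))
    return erf(ell / (np.sqrt(2.0) * sigma)) / ell**2

ell = 12.0                     # e.g. a trailer-sized object, roughly 12 m long
for sigma in (0.1, 0.5, 1.0, 2.0):
    eta = rng.normal(0.0, sigma, size=1_000_000)
    print(f"sigma={sigma:3.1f}  "
          f"Var_L1={grad_l1(eta).var():.3f} (expect 1)  "
          f"Var_L2={grad_l2(eta).var():.3f} (expect {sigma**2:.2f})  "
          f"Var_dice={grad_dice(eta, ell).var():.5f} "
          f"(Eq. 4.3: {var_dice_closed_form(sigma, ell):.5f})")

For a trailer-sized object (ℓ = 12 m), the dice gradient variance stays near 1/ℓ² ≈ 0.007, while the L1 and L2 gradient variances are 1 and σ² respectively, matching the ordering in Fig. 4.4 once σ exceeds the small threshold σ𝑐 of Lemma 3.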
We refer to Sec. D.1.3 for the proof. Beyond noise deviation threshold 𝜎𝑐 = max (cid:16) ℓ Erf−1(ℓ2) the convergence gap between dice and regression losses widens as the object size ℓ increases. 𝜎𝑚, √ 2 (cid:17), Fig. 4.4 depicts the superior convergence of dice loss compared to regression losses under increas- ing noise deviation 𝜎 pictorially. Taking the car category with ℓ = 4𝑚 and the trailer category with ℓ = 12𝑚 as examples, the noise threshold 𝜎𝑐, beyond which dice loss exhibits better convergence, are 𝜎𝑐 = 0.3𝑚 and 𝜎𝑐 = 0.1𝑚 respectively. Combining these lemmas, we finally derive: Theorem 2. Dice model has better AP3D. Assume the object length ℓ is a constant and depth is the only source of error for detection. Based on Lemma 1, if 𝜎𝑚 is the solution of the equation 𝜎2 = 1 ℓ2 Erf (cid:17) (cid:16) ℓ √ 2𝜎 and the noise deviation 𝜎 ≥ 𝜎𝑐 = max (cid:16) 𝜎𝑚, √ 2 ℓ Erf−1(ℓ2) (cid:17) , then the Average Precision (AP3D) of the dice model is better than AP3D from L1 or L2 model. We refer to Sec. D.1.4 and Tab. D.1 for the proof and assumption comparisons respectively. 4.3.3 Discussions Comparing classification and regression losses. We now explain how we compare classification (dice) and regression losses. Our analysis assumes one-class classification in BEV segmentation with perfect predicted foreground scores 𝑃(𝑍) = 1 (Fig. 4.3c). Hence, dice analysis focuses on object localization along the BEV ray (Fig. 4.3b) instead of classification probabilities thus allowing comparison of dice and regression losses. Lemma 1 links these losses by comparing the deviation 52 of learned and optimal weights. Regression losses work better than dice loss for regression tasks? Our key message is NOT always! We mathematically and empirically show that regression losses work better only when the noise 𝜎 is less in Fig. 4.4. 4.3.4 SeaBird Pipeline Architecture. Based on theoretical insights of Th. 2, we propose SeaBird, a novel pipeline, in Fig. 4.2. To effectively involve the dice loss which originally designed for segmentation task to assist Mono3D, SeaBird treats BEV segmentation of foreground objects and Mono3D head sequentially. Although BEV segmentation map provides depth information (hardest [108, 166] Mono3D parameter), it lacks elevation and height information for Mono3D task. To address this, SeaBird concatenates BEV features with predicted BEV segmentation (Fig. 4.2), and feeds them into the detection head to predict 3D boxes in a 7-DoF representation: BEV 2D position, elevation, 3D dimension, and yaw. Unlike most works [132, 303] that treat segmentation and detection branches in parallel, the sequential design directly utilizes refined BEV localization information to enhance Mono3D. Ablations in Sec. 4.4.2 validate this design choice. We defer the details of baselines to Sec. 4.4. Notably, our foreground BEV segmentation supervision with dice loss does not require dense BEV segmentation maps, as we efficiently prepare them from GT 3D boxes. Training Protocol. SeaBird trains the BEV segmentation head first, employing the dice loss between the predicted and the GT BEV semantic segmentation maps, which fully utilizes the dice loss’s noise-robustness and superior convergence in localizing large objects. In the second stage, we jointly fine-tune the BEV segmentation head and the Mono3D head. We validate the effectiveness of training protocol via the ablation in Sec. 4.4.2. 4.4 Experiments Datasets. 
Our experiments utilize two datasets with large objects: KITTI-360 [136] and nuScenes [22] encompassing both single-camera and multi-camera configurations. We opt for KITTI-360 instead of KITTI [67] for four reasons: 1) KITTI-360 includes large objects, while KITTI does not; 2) KITTI-360 exhibits a balanced distribution of large objects and cars; 3) an extended version, 53 Table 4.2 Datasets comparison. We use KITTI-360 and nuScenes datasets for our experiments. See Fig. D.2 for the skewness. KITTI [67] Waymo [230] KITTI-360 [136] nuScenes [22] Large objects Balanced BEV Seg. GT #images (k) ✕ ✕ ✕ 4 ✕ ✕ ✓ 52 [108] ✓ ✓ ✓ 49 ✓ ✕ ✓ 168 KITTI-360 PanopticBEV [72], includes BEV segmentation GT for ablation studies, while KITTI 3D detection and the Semantic KITTI dataset [6] do not overlap in sequences; 4) KITTI-360 contains about 10× more images than KITTI. We compare these datasets in Tab. 4.2 and show their skewness in Fig. D.2. Data Splits. We use the following splits of the two datasets: • KITTI-360 Test split: This benchmark [136] contains 300 training and 42 testing windows. These windows contain 61,056 training and 910 testing images. • KITTI-360 Val split: It partitions the official train into 239 train and 61 validation windows [136]. This split contains 48,648 training and 1,294 validation images. • nuScenes Test split: It has 34,149 training and 6,006 testing samples [22] from the six cameras. This split contains 204,894 training and 36,036 testing images. • nuScenes Val split: It has 28,130 training and 6,019 validation samples [22] from the six cameras. This split contains 168,780 training and 36,114 validation images. Evaluation Metrics. We use the following metrics: • Detection: KITTI-360 uses the mean AP 3D 50 percentage across categories to benchmark models [136]. nuScenes [22] uses the nuScenes Detection Score (NDS) as the metric. NDS is the weighted average of mean AP (mAP) and five TP metrics. We also report mAP over large categories (truck, bus, trailers and construction vehicles), cars, and small categories (pedestrians, motorcyle, bicycle, cone and barrier) as AP𝐿𝑟𝑔, AP𝐶𝑎𝑟 and AP𝑆𝑚𝑙 respectively. • Semantic Segmentation: We report mean IoU over foreground and all categories at 200×200 resolution [211, 303]. KITTI-360 Baselines and SeaBird Implementation. Our evaluation on the KITTI-360 focuses on 54 the detectors taking single-camera image as input. We evaluate SeaBird pipelines against six SoTA frontal detectors: GrooMeD-NMS [109], MonoDLE [166], GUP Net [159], DEVIANT [108], Cube R-CNN [14] and MonoDETR [300]. The choice of these models encompasses anchor [14,109] and anchor-free methods [108, 166], CNN [159, 166], group CNN [108] and transformer-based [300] architectures. Further, MonoDLE normalizes loss with GT box dimensions. Due to SeaBird’s BEV-based approach, we do not integrate it with these frontal view detectors. Instead, we extend two SoTA image-to-BEV segmentation methods, Image2Maps (I2M) [211] and PanopticBEV (PBEV) [72] with SeaBird. Since both BEV segmentors already include their own implementations of the image encoder, the image-to-BEV transform, and the segmentation head, implementing the SeaBird pipeline only involves adding a detection head, which we chose to be Box Net [289]. SeaBird extensions employ dice loss for BEV segmentation, Smooth L1 losses [69] in the BEV space to supervise the BEV 2D position, elevation, and 3D dimension, and cross entropy loss to supervise orientation. nuScenes Baselines and SeaBird Implementation. 
We integrate SeaBird into two prototypical BEV-based detectors, BEVerse [303] and HoP [311] to prove the effectiveness of SeaBird. Our choice of these models encompasses both transformer and convolutional backbones, multi-head and single-head architectures, shorter and longer frame history, and non-query and query-based detectors. This comprehensively allows us to assess SeaBird’s impact on large object detection. BEVerse employs a multi-head architecture with a transformer backbone and shorter frame history. HoP is single-head query-based SoTA model utilizing BEVDet4D [87] with CNN backbone, and longer frame history. BEVerse [303] includes its own implementation of detection head and BEV segmentation head in parallel. We reorganize the two heads to follow our sequential design and adhere to our training protocol for network training. Since HoP [311] lacks a BEV segmentation head, we incorporate the one from BEVerse into this HoP extension with SeaBird. 55 Table 4.3 KITTI-360 Test detection results. SeaBird pipelines outperform all monocular baselines, and also outperform old LiDAR baselines. Click for the KITTI-360 leaderboard as well as our PBEV+SeaBird and I2M+SeaBird entries. [Key: Best, Second Best, L= LiDAR, C= Camera, †= Retrained]. Modality L C ✓ ✓ Method Venue AP 3D 50 (− (cid:17)) AP 3D 25 (− mAP [%] mAP [%] (cid:17)) L-VoteNet [194] L-BoxNet [194] ICCV19 ICCV19 ✓ GrooMeD † [109] CVPR21 ✓ MonoDLE † [166] CVPR21 ✓ GUP Net † [159] ICCV21 ✓ DEVIANT † [108] ECCV22 ✓ Cube R-CNN † [14] CVPR23 ✓ MonoDETR † [300] ICCV23 CVPR24 ✓ I2M+SeaBird CVPR24 ✓ PBEV+SeaBird 3.40 4.08 0.17 0.85 0.87 0.88 0.80 0.79 3.14 4.64 30.61 23.59 16.12 28.99 27.25 26.96 15.57 27.13 35.04 37.12 4.4.1 KITTI-360 Mono3D KITTI-360 Test. Tab. 4.3 presents KITTI-360 leaderboard results, demonstrating the superior performance of both SeaBird pipelines compared to all monocular baselines across all metrics. Moreover, PBEV+SeaBird also outperforms both legacy LiDAR baselines on all metrics, while I2M+SeaBird surpasses them on the AP 3D 25 metric. KITTI-360 Val. Tab. 4.4 presents the results on KITTI-360 Val split, reporting the median model over three different seeds with the model being the final checkpoint as [108]. SeaBird pipelines outperform all monocular baselines on all but one metric, similar to Tab. 4.3 results. Due to the dice loss in SeaBird, the biggest improvement shows up on larger objects. Tab. 4.4 also includes the upper-bound oracle, where we train the Box Net with the GT BEV segmentation maps. Lengthwise AP Analysis. Th. 2 states that training a model with dice loss should lead to lower errors and, consequently, a better detector for large objects. To validate this claim, we analyze the detection performance with AP 3D 50 and AP 3D 25 metrics against the object’s lengths. For this analysis, we divide objects into four bins based on their GT object length (max of sizes): [0, 5), [5, 10), [10, 15), [15 + 𝑚. Fig. 4.5 shows that SeaBird pipelines excel for large objects, where the baselines’ performance drops significantly. BEV Semantic Segmentation. Tab. 4.4 also presents the BEV semantic segmentation results on the KITTI-360 Val split. SeaBird pipelines outperforms the baseline I2M [211], and achieve 56 (a) AP 3D 50 comparison. (b) AP 3D 25 comparison. Figure 4.5 Lengthwise AP Analysis of four SoTA detectors and two SeaBird pipelines on KITTI-360 Val split. SeaBird pipelines outperform all baselines on large objects with over 10m in length. Table 4.4 KITTI-360 Val detection and segmentation results. 
SeaBird pipelines outperform all frontal monocular baselines, particularly for large objects. Dice loss in SeaBird also improves the BEV only (w/o dice) version of SeaBird pipelines. I2M and PBEV are BEV segmentors. So, we do not report their Mono3D performance. [Key: Best, Second Best, †= Retrained] View Method BEV Seg Loss Frontal BEV GrooMeD-NMS † [109] MonoDLE † [166] GUP Net † [159] DEVIANT † [108] Cube R-CNN † [14] MonoDETR † [300] I2M † [211] I2M+SeaBird I2M+SeaBird PBEV † [72] PBEV+SeaBird PBEV+SeaBird Oracle (GT BEV) − Dice ✕ Dice CE ✕ Dice (cid:17)) (cid:17)) (cid:17)) Venue AP 3D 50 [%](− AP 3D 25 [%](− 33.04 16.52 44.81 22.88 45.11 22.83 44.25 22.39 22.52 11.63 43.24 22.02 38.21 19.11 50.52 27.58 50.52 25.75 48.57 24.79 27.12 16.34 48.69 26.60 AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large 0.00 0.00 CVPR21 4.64 0.94 CVPR21 0.98 0.54 ICCV21 1.01 0.53 ECCV22 5.55 0.75 CVPR23 4.50 0.81 ICCV23 ICRA22 − − 45.09 24.98 26.33 52.31 39.32 4.86 CVPR24 43.19 25.95 35.76 52.22 43.99 CVPR24 8.71 RAL22 − − CVPR24 45.37 26.51 29.72 53.86 41.79 7.64 CVPR24 13.22 42.46 27.84 37.15 52.53 44.84 BEV Seg IoU [%](− MFor Car − − − − − − − − − − − − 29.25 38.04 3.54 7.07 31.42 39.61 36.18 48.54 1.57 1.47 36.17 48.04 26.77 51.79 39.28 49.74 56.62 53.18 100.00 100.00 100.00 − − − − − − 20.46 0.00 23.23 23.83 2.07 24.30 − − − − − − − − − similar performance to PBEV [72] in BEV segmentation. We retrain all BEV segmentation models only on foreground detection categories for a fair comparison. 4.4.2 Ablation Studies on KITTI-360 Val Tab. 4.5 ablates I2M [211] +SeaBird on the KITTI-360 Val split, following the experimental settings of Sec. 4.4.1. Dice Loss. Tab. 4.5 shows that both dice loss and BEV representation are crucial to Mono3D of large objects. Replacing dice loss with MSE or Smooth L1 loss, or only BEV representation (w/o dice) reduces Mono3D performance. Mono3D and BEV Segmentation. Tab. 4.5 shows that removing the segmentation head hinders 57 Table 4.5 Ablation studies on KITTI-360 Val. [Key: Best, Second Best] Changed From − To Dice − Dice − Dice − Dice − (cid:17) No Loss Smooth L1 MSE CE Segmentation Loss Semantic Category (cid:17) (cid:17) (cid:17) Segmentation Head Yes− No (cid:17) Yes− Detection Head No (cid:17) For.− All (cid:17) For.− Car (cid:17) Sequential− (cid:17) Yes− S+J− S+J− − Multi-head Arch. BEV Shortcut Training Protocol I2M+SeaBird (cid:17) (cid:17) (cid:17) Parallel No (cid:17) J [303] D+J [281] (cid:17)) (cid:17)) 7.07 0.00 AP 3D 50 [%](− AP 3D 25 [%](− BEV Seg IoU [%](− 45.09 24.98 26.33 52.31 39.32 3.54 36.69 22.16 31.01 47.51 39.26 17.16 34.67 25.92 35.59 21.32 30.90 44.71 37.81 17.46 34.85 26.16 35.60 21.33 33.22 47.60 40.41 21.83 38.11 29.97 39.24 23.38 31.83 47.88 39.86 − (cid:17)) AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor MAll 4.86 − 7.63 − 7.04 − 7.06 − 7.52 − − − 1.61 4.17 9.12 6.53 7.42 6.07 8.71 − 44.12 22.87 15.36 51.76 33.56 19.26 34.46 26.86 24.34 43.01 23.59 22.68 51.58 37.13 40.28 20.14 40.27 24.69 32.45 51.55 42.00 22.19 40.37 31.28 38.12 22.33 32.05 52.62 42.34 23.00 40.39 31.70 42.73 25.08 31.94 49.88 40.91 22.91 39.66 31.29 43.43 24.75 29.24 52.96 41.10 20.71 35.68 28.20 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 − 20.46 38.04 29.25 − − − − − − − − − − − − Mono3D performance. Conversely, removing detection head also diminishes the BEV segmentation performance for the segmentation model. This confirms the mututal benefit of sequential BEV segmentation on foreground objects and Mono3D. Semantic Category in BEV Segmentation. 
We next analyze whether background categories play any role in Mono3D. Tab. 4.5 shows that changing the foreground (For.) categories to foreground + background (All) does not help Mono3D. This aligns with the observations of [167, 272, 303] that report lower performance on joint Mono3D and BEV segmentation with all categories. We believe this decrease happens because the network gets distracted while getting the background right. We also predict one foreground category (Car) instead of all in BEV segmentation. Tab. 4.5 shows that predicting all foreground categories in BEV segmentation is crucial for overall good Mono3D. Multi-head Architecture. SeaBird employs a sequential architecture (Arch.) of segmentation and detection heads instead of parallel architecture. Tab. 4.5 shows that the sequential architecture outperforms the parallel one. We attribute this Mono3D boost to the explicit object localization provided by segmentation in the BEV plane. BEV Shortcut. Sec. 4.3.4 mentions that SeaBird’s Mono3D head utilizes both the BEV segmen- tation map and BEV features. Tab. 4.5 demonstrates that providing BEV features to the detection head is crucial for good Mono3D. This is because the BEV map lacks elevation information, and incorporating BEV features helps estimate elevation. Training Protocol. SeaBird trains segmentor first and then jointly trains detector and segmentor 58 Table 4.6 nuScenes Test detection results. SeaBird pipelines achieve the best AP𝐿𝑟 𝑔 among methods without Class Balanced Guided Sampling (CBGS) [308] and future frames. Results are from the nuScenes leaderboard or corresponding chapters on V2-99 or R101 backbones. [Key: Best, Second Best, S= Small, ∗= Reimplementation, §= CBGS, = Future Frames.] (cid:35)(cid:32) Resolution Method (cid:17)) AP𝐶𝑎𝑟 (− (cid:17)) AP𝑆𝑚𝑙 (− (cid:17)) Venue BBone AAAI23 BEVDepth [127] in [101] R101 AAAI23 BEVStereo [125] in [101] R101 ICCV23 R101 P2D [101] ArXiv Swin-S BEVerse-S [303] ICCV23 R101 HoP ∗ [311] CVPR24 R101 HoP+SeaBird ECCV22 V2-99 SpatialDETR [53] ICCV23 V2-99 3DPPE [219] CVPR23 R101 X3KDall [105] V2-99 PETRv2 [150] ICCV23 V2-99 CVPR23 VEDet [29] V2-99 CVPR23 FrustumFormer [256] ICCV23 V2-99 MV2D [259] V2-99 HoP ∗ [311] ICCV23 V2-99 CVPR24 HoP+SeaBird V2-99 SA-BEV § [299] ICCV23 ICCV23 V2-99 FB-BEV § [134] V2-99 CVPR23 CAPE § [273] ICCV23 V2-99 SparseBEV R101 ParametricBEV [283] ICCV23 R101 NeurIPS22 UVTR [126] V2-99 BEVFormer [132] ECCV22 V2-99 AAAI23 PolarFormer [95] V2-99 NeurIPS23 STXD [91] [142] (cid:35)(cid:32) AP𝐿𝑟𝑔 (− − − − 24.4 36.0 36.6 30.2 − − 36.4 37.1 − − 37.1 38.4 40.5 39.3 41.3 45.6 − 35.1 34.4 36.8 − 512×1408 640×1600 900×1600 − − − 60.4 65.0 65.8 61.0 − − 66.7 68.5 − − 68.7 70.2 68.9 71.7 71.4 76.3 − 67.3 67.7 68.4 − − − − 47.0 53.9 54.7 48.5 − − 55.6 57.7 − − 55.6 57.4 60.5 61.6 63.3 68.8 − 52.9 55.2 55.5 − (cid:17)) mAP(− 39.6 40.4 43.6 39.3 47.9 48.6 42.5 46.0 45.6 49.0 50.5 51.6 51.1 49.4 51.1 53.3 53.7 55.3 60.3 46.8 47.2 48.9 49.3 49.7 (cid:17)) NDS(− 48.3 50.2 53.0 53.1 57.5 57.0 48.7 51.4 56.1 58.2 58.5 58.9 59.6 58.9 59.7 62.4 62.4 62.8 67.5 49.5 55.1 56.9 57.2 58.3 Table 4.7 nuScenes Val detection results. SeaBird pipelines outperform the two baselines BEVerse and HoP, particularly for large objects. We train all models without CBGS. See Tab. D.9 for a detailed comparison. 
[Key: S= Small, T= Tiny, = Released, ∗= Reimplementation] Resolution Method 256×704 512×1408 640×1600 BEVerse-T [303] +SeaBird HoP [311] +SeaBird BEVerse-S [303] +SeaBird HoP ∗ [311] +SeaBird HoP ∗ [311] +SeaBird (cid:17)) (cid:17)) (cid:17)) R50 46.6 57.2 Swin-T NDS (− mAP (− 32.1 AP𝐿𝑟𝑔 (− 18.5 (cid:17)) AP𝑆𝑚𝑙 (− 38.8 (cid:17)) AP𝐶𝑎𝑟 (− 53.4 BBone Venue ArXiv CVPR24 19.5 (+1.0) 54.2 (+0.8) 41.1 (+2.3) 33.8 (+1.5) 48.1 (+1.7) ICCV23 27.4 CVPR24 28.2 (+0.8) 58.6 (+1.4) 47.8 (+1.4) 41.1 (+1.2) 51.5 (+0.6) ArXiv CVPR24 24.6 (+3.7) 58.7 (+2.5) 45.0 (+2.8) 38.2 (+3.0) 51.3 (+1.8) ICCV23 31.4 CVPR24 32.9 (+1.5) 65.0 (+1.3) 53.1 (+0.6) 46.2 (+1.0) 54.7 (–0.3) ICCV23 36.5 CVPR24 40.3 (+3.8) 71.7 (+2.6) 58.8 (+2.7) 52.7 (+3.1) 60.2 (+1.9) V2-99 R101 Swin-S 56.2 63.7 69.1 39.9 35.2 45.2 49.6 46.4 50.9 42.2 49.5 52.5 55.0 56.1 58.3 20.9 (S+J). We compare with direct joint training (J) of [303] and training detection followed by joint training (D+J) of [281]. Tab. 4.5 shows that SeaBird training protocol works best. 59 4.4.3 nuScenes Mono3D We next benchmark SeaBird on nuScenes [22], which encompasses more diverse object cate- gories such as trailers, buses, cars and traffic cones, compared to KITTI-360 [136]. nuScenes Test. Tab. 4.6 presents the results of incorportaing SeaBird to the HoP models with the V2-99 and R101 backbones. SeaBird with both V2-99 and R101 backbones outperform several SoTA methods on the nuScenes leaderboard, as well as the baseline HoP, on nearly every metric. Interestingly, SeaBird pipelines also outperform several baselines which use higher resolution (900×1600) inputs. Most importantly, SeaBird pipelines achieve the highest AP𝐿𝑟𝑔 performance, providing empirical support for the claims of Th. 2. nuScenes Val. Tab. 4.7 showcases the results of integrating SeaBird with BEVerse [303] and HoP [311] at multiple resolutions, as described in [303,311]. Tab. 4.7 demonstrates that integrating SeaBird consistently improves these detectors on almost every metric at multiple resolutions. The improvements on AP𝐿𝑟𝑔 empirically support the claims of Th. 2 and validate the effectiveness of dice loss and BEV segmentation in localizing large objects. 4.5 Conclusions This chapter highlights the understudied problem of Mono3D generalization to large objects. Our findings reveal that modern frontal detectors struggle to generalize to large objects even when trained on balanced datasets. To bridge this gap, we investigate the regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss outperforms regression losses in noise-robustness and model convergence for large objects for a simplified case. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird’s View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation with the dice loss for Mono3D. SeaBird achieves SoTA results on the KITTI-360 leaderboard and consistently improves existing detectors on the nuScenes leaderboard, particularly for large objects. We hope that this initial step towards generalization will contribute to safer AVs. Limitation. SeaBird does not fully solve the problem of generalization to large objects. 60 CHAPTER 5 CHARM3R: TOWARDS CAMERA HEIGHT AGNOSTIC MONOCULAR 3D OBJECT DETECTOR To this end, we attempt generalizing Mono3D networks to occlusion, dataset and object sizes. 
Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plucker embeddings, image transformations or data augmentation. This chapter takes a step towards this understudied problem by investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height vari- ations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under cam- era height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R significantly improves generalization to unseen camera heights, achieving SoTA performance on the CARLA dataset. 5.1 Introduction Monocular 3D object detection (Mono3D) task uses a single image to determine both the 3D location and dimensions of objects. This technology is essential for augmented reality [2, 172, 183, 293], robotics [213], and self-driving cars [108, 132, 181], where accurate 3D understanding of the environment is crucial. Our research specifically focuses on using 3D object detectors applied to autonomous vehicles (AVs), as they have unique challenges and requirements. AVs necessitate detectors that are robust to a wide range of intrinsic and extrinsic factors, including intrinsics [14], domains [133], object size [110], rotations [177, 307], weather conditions [137, 179], and adversarial examples [310]. Existing research primarily focusses on generalizing object detectors to these failure modes. However, this work investigates the generalization of Mono3D to another type, which, thus far, has been relatively understudied in the literature – 61 Figure 5.1 Teaser. Changing ego height at inference quickly drops Mono3D performance of SoTA detectors. A height change Δ𝐻 of 0.76𝑚 in inference drops AP3D [%] by absolute 35 points. (a) AP 3D 70 [%] Results. (b) AP 3D 50 [%] Results. (c) Depth error trend on changing ego heights. Figure 5.2 Performance Comparison. The performance of SoTA detector GUP Net [159] drops significantly with changing ego heights in inference. Ground-based model shows contrasting depth error (extrapolation) trend compared to regression-based depth models. Our proposed CHARM3R exhibits greater robustness to such variations by averaging regression and ground-based depth estimates. All methods, except the Oracle, are trained on car-height data Δ𝐻 = 0𝑚 and tested on data from bot to truck heights. Mono3D generalization to unseen ego camera heights. The ego height of autonomous vehicles (AVs) varies significantly across different platforms and deployment scenarios. While almost all training data is collected from a specific ego height, such as that of a passenger car, AVs are now deployed with substantially different ego height such as small bots or trucks. Collecting, labeling datasets and retraining models for each possible height is not scalable [104], computationally expensive and impractical. Therefore, our work aims to address the challenge of generalizing Mono3D models to unseen ego heights. 
Generalizing Mono3D to unseen ego heights from single ego height data is challenging due to 62 Test EgoTrain EgoΔHGroundImage3D Detector3D Boxes the following five reasons. First, neural models excel at In-Domain (ID) generalization, but struggle with unseen Out-Of-Domain (OOD) generalization [237, 276]. Second, ego height changes induce projective transformations [76] that CNNs [43], DEVIANT [108] or ViT [55] backbones do not effectively handle [212]. Third, existing projective equivariant backbones [168, 176] are limited to single-transform-per-image scenarios, while every pixel in a driving image undergoes a different depth-dependent transform. Fourth, the non-linear nature [21, 76] of projective transformations makes interpolation difficult. Finally, disentangled learning does not work for this problem since such approaches need at least two height data, while the training data here is from single height. Note that the generalization from single height to multi heights is more practical since multi-height data is unavailable in almost all real datasets. We first systematically analyze and quantify the impact of ego height on the performance of Mono3D models trained on a single ego height. Leveraging the extended CARLA dataset [104], we evaluate the performance of state-of-the-art (SoTA) Mono3D models under multiple ego heights. Our analysis reveals that SoTA Mono3D models exhibit significant performance degradation when faced with large height changes in inference (Figs. 5.2a and 5.2b). Additionally, we empirically observe a consistent negative trend in the regressed object depth under height changes (Fig. 5.2c). Furthermore, we decompose the performance impact into individual sub-tasks and identify depth estimation as the primary contributor to this degradation. Recent papers address ego height changes by using Plucker embeddings [4], transforming target- height images to the original height, assuming constant depth [129], or by retraining with augmented data [104]. While these techniques do offer some effectiveness, image transformation fails (Fig. 5.6) under significant height changes due to real-world depth variations. The augmentation strategy requires complicated pipelines for data synthesis at target heights and also falls short when the target height is OOD or when the target height is unknown apriori during training. To effectively generalize Mono3D to unseen ego heights, a detector should first disentangle the depth representation from ego parameters in training and produce a new representation with new ego parameters in inference, while also canceling the trends. We propose using the projected 63 bottom 3D center and ground depth in addition to the regressed depth. While the ground depth is easily calculated from ego parameters and height, and can be changed based on the ego height, its direct application to Mono3D models is sub-optimal (a reason why ground plane is not used alone). However, we observe a consistent positive trend in ground depth, which contrasts with the negative trend in regressed depths. By averaging both depth estimates within the model, we effectively cancel these opposing trends and improve Mono3D generalization to unseen ego heights. In summary the main contributions of this work include: • We attempt the understudied problem of OOD ego height robustness in Mono3D models from single height data. 
• We mathematically prove systematic negative and positive trends in the regressed and ground- based object depths, respectively, under ego height changes under simplified assumptions (Th. 3 and 4). • We propose simple averaging of these depth estimates within the model to effectively counteract these opposing trends and generalize to unseen ego heights (Sec. 5.4.3). • We empirically demonstrate SoTA robustness to unseen ego height changes on the CARLA dataset (Tab. 5.2). 5.2 Related Works Extrapolation / OOD Generalization. Neural models excel at ID generalization, but struggle at OOD generalization [237, 276]. There are two major classes of methods for good OOD classifica- tion. The first does not use target data and relies on diversifying data [235], features [240, 286], predictions [119], gradients [207, 236] or losses [193, 208, 210]. Another class finetunes on small target data [103]. None of these papers attempt OOD generalization for regression tasks. Mono3D. Mono3D has gained significant popularity, offering a cost-effective and efficient solution for perceiving the 3D world. Unlike its more expensive LiDAR and radar counterparts [155, 215, 290], or its computationally intensive stereo-based cousins [34], Mono3D relies solely on a single camera or multiple cameras with little overlaps. Earlier approaches to this task [31, 186] relied on hand-crafted features, while the recent advancements use deep models. Researchers explored 64 a variety of approaches to improve performance, including architectural innovations [89, 275], equivariance [29, 108], losses [15, 35], uncertainty [111, 159] and depth estimation [175, 279, 301]. A few use NMS [109, 147], corrected extrinsics [307], CAD models [25, 117, 154] or LiDAR [199] in training. Other innovations include Pseudo-LiDAR [165,254], diffusion [196,274], BEV feature encoding [96, 131, 303] or transformer-based [24] methods with modified positional encoding [82, 219, 233], queries [36, 93, 128, 296] or query denoising [140]. Some use pixel-wise depth [88] or object-wise depth [38, 40, 141]. Many utilize temporal fusion with short [17, 150, 248, 261] or long frame history [28, 182, 311] to boost performance. A few use distillation [100, 260], stereo [125, 261] or loss [110, 148] to improve these results further. For a comprehensive overview, we redirect readers to the surveys [163, 167]. CHARM3R selects representative Mono3D models and improves their extrapolation to unseen camera heights. Camera Parameter Robustness. While several works aim for robust LiDAR-based detec- tors [27, 84, 255, 277, 282], planners [285] and map generators [197], fewer studies focus on generalizing image-based detectors. Existing image-based techniques, such as self-training [130], adversarial learning [249], perspective debiasing [158], and multi-view depth constraints [26], primarily address datasets with variations in camera intrinsics and minor height differences of 0.2𝑚. Some papers show robustness to other camera parameters such as intrinsics [14], and rota- tions [177, 307]. CHARM3R specifically tackles the challenge of generalizing to scenarios with significant camera height changes, exceeding 0.7𝑚. Height-Robustness. Image-based 3D detectors such as BEVHeight [94] and MonoUNI [94] train multiple detectors at different heights, but always do ID testing. 
Recent works address ego height changes by either using Plucker embeddings [4, 297] for video generation/pose estimation, by transforming target-height images to the original height, assuming constant depth [129] for Mono3D, or by retraining with augmented data [104] for BEV segmentation. In contrast, we investigate the contrasting extrapolation behavior of regressed and ground-based depth estimators and average them for generalizing Mono3D to unseen camera heights. Wide Baseline Setup. Wide baseline setups are challenging due to issues like large occlusions, 65 Figure 5.3 Problem Setup. Note that changing ego height does not change the object depth 𝑧 but only its position (𝑢𝑐, 𝑣𝑐) in the image plane. A regressed-depth model uses this pixel position to estimate the depth and therefore, fails when the ego height is changed. depth discontinuities [229] and intensity variations [228]. Unlike traditional wide-baseline setups with arbitrary baseline movements, generalization to unseen ego height requires handling baseline movements specifically along the vertical direction. 5.3 Notations and Preliminaries We first list out the necessary notations and preliminaries which are used throughout this chapter. These are not our contributions and can be found in the literature [65, 73, 76]. Notations. Let K ∈ R3×3 denote the camera intrinsic matrix, R ∈ R3×3 the rotation matrix and T ∈ R3×1 the translation vector of the extrinsic parameters. Also, 0 ∈ R3×1 denotes the zero vector in 3D. We denote the ego camera height on the car as 𝐻, and the height change relative to this car as Δ𝐻 meters. The camera intrinsics matrix K has focal length 𝑓 and principal point (𝑢0, 𝑣0). Let (𝑢, 𝑣) represent a pixel position in the camera coordinates, and (𝑢𝑐, 𝑣𝑐) and (𝑢𝑏, 𝑣𝑏) denotes the projected 3D center and bottom center respectively. ℎ denotes the height of the image plane. We show these notations pictorially in Fig. 5.3. Pinhole Point Projection [76]. The pinhole model relates a 3D point (𝑋, 𝑌 , 𝑍) in the world 66 ObjectTest EgoTrain EgoHΔHDepth (z)GroundZYff3D CenterBottom 3D CenterImage Plane0.5h2D0.5h2D(uc,vc)(uc,vc)(ub,vb)(ub,vb)hhX Figure 5.4 CHARM3R Overview. CHARM3R predicts the shift coefficient to obtain projected 3D bottom centers to query the ground depth and then averages the ground-depth and the regressed depth estimates within the model itself to output final depth estimate of a bounding box. CHARM3R uses the results of Th. 3 and 4 that demonstrate that the ground and the regressed depth models show contrasting extrapolation behaviors. coordinate system to its 2D projected pixel (𝑢, 𝑣) in camera coordinates as: 𝑣  𝑢          1 (cid:104) 𝑧 = (cid:105) K 0            R T       0𝑇 1        , 𝑋 𝑌 𝑍 1                           (5.1) where 𝑧 denotes the depth of pixel (𝑢, 𝑣). Ground Depth Estimation [65,73]. While depth estimation in Mono3D is ill-posed, ground depth can be precisely determined given the camera parameters and height relative to the ground in the world coordinate system [65, 73, 284]. Since all datasets provide camera mounting height from the ground, we obtain the depth of ground plane pixels in closed form. Lemma 4. Ground Depth of Pixel [65, 73, 284]. Consider a pinhole camera model with intrinsics K, rotation R and translation extrinsics T. Let matrix 𝑨 = (𝑎𝑖 𝑗 ) = R−1K−1 ∈ R3×3, and −R−1T as 67 the vector 𝑩 = (𝑏𝑖) ∈ R3×1. Then, the ground depth 𝑧 for a pixel (𝑢, 𝑣) is 𝑧 = 𝐻 − 𝑏2 𝑎21𝑢 + 𝑎22𝑣 + 𝑎23 . (5.2) We refer to Sec. 
E.1.1 in the appendix for the derivation. Lemma 5. Ground Depth of Pixel For datasets with the rotation extrinsics R an identity, the depth estimate 𝑧 from Lemma 4 becomes 𝑧 = . 𝐻 − 𝑏2 𝑣 −𝑣0 𝑓 (5.3) We refer to Sec. E.1.2 for the proof. 5.4 CHARM3R In this section, we first mathematically prove the contrasting extrapolation behavior of regressed and ground-based object depths under varying camera heights. To mitigate the impact of these opposing trends and improve generalization to unseen heights, we propose Camera Height Agnostic Monocular 3D Object Detector or CHARM3R. CHARM3R averages both these depth estimates within the model to mitigate these trends and improves generalization to unseen heights. Fig. 5.4 shows the overview of CHARM3R. 5.4.1 Ground-based Depth Model Outdoor driving scenes typically contain a ground region, unlike indoor scenes. The ground depth varies with ego height, providing a valuable reference and prior for generalizing Mono3D to unseen ego heights. Bottom Center Estimation. Lemma 4 utilizes the ground plane depth from Eq. (5.2) to estimate object depths. The numerator in Eq. (5.2) can be negative, while depth is positive for forward facing cameras. To ensure positive depth values, we apply the Rectified Linear Unit (ReLU) activation (max(𝑧, 0)) to the numerator of Eq. (5.2). This step promotes spatially continuous and meaningful ground depth representations, improving the training stability of CHARM3R. Ablation in Sec. 5.5.3 confirm the effectiveness. 68 In practice, CHARM3R leverages the projected 3D center (𝑢𝑐, 𝑣𝑐), 2D height information ℎ2𝐷 and the 2D center (𝑢𝑐,2𝐷, 𝑣𝑐,2𝐷) to compute the projected bottom 3D center (𝑢𝑏, 𝑣𝑏) as follows: 𝑢𝑏 = 𝑢𝑐 ; 𝑣𝑏 = 𝑣𝑐 + 1 2 ℎ2𝐷 + 𝛼(𝑣𝑐 − 𝑣𝑐,2𝐷). (5.4) With the projected bottom center (𝑢𝑏, 𝑣𝑏) estimated, we query the ground plane depth at this point, as derived in Lemma 4. Note that we do not use the 3D height to calculate the bottom center since projecting this point requires the box depth, which is the quantity we aim to estimate. We, now, analyze the extrapolation behavior of this ground-based depth model in the following theorem. Theorem 3. Ground-based bottom center model has positive slope (trend) in extrapolation. Consider a ground depth model that predicts ˆ𝑧 from the projected bottom 3D center (𝑢𝑏, 𝑣𝑏) image. Assuming the GT object depth 𝑧 is more than the ego height change Δ𝐻, the mean depth error of the ground model exhibits a positive trend w.r.t. the height change Δ𝐻: E(cid:16)𝑔ˆ𝑧Δ𝐻 − 𝑧 (cid:17) ≈ 𝑅𝑒𝐿𝑈 (cid:19) (cid:18) 1 𝑣𝑏 −𝑣0 𝑓 Δ𝐻, (5.5) where 𝑓 is the focal length and (𝑢0, 𝑣0) is the optical center. Th. 3 says that the ground model over-estimates and under-estimates depth as the ego height change Δ𝐻 increases and decreases respectively. Proof. When the ego camera shifts by Δ𝐻 𝑚, the 𝑦-coordinate of the projected 3D bottom center 𝑣𝑏 of a 3D box becomes 𝑣𝑏 + 𝑓 Δ𝐻 𝑧 . Using Eq. (5.3), the new depth 𝑔ˆ𝑧Δ𝐻 is 𝑔ˆ𝑧Δ𝐻 = 𝐻 + Δ𝐻 − 𝑏2 𝑓 Δ𝐻 𝑧 𝑓 − 𝑣0 𝑣𝑏 + = 𝐻 + Δ𝐻 − 𝑏2 Δ𝐻 𝑣𝑏 −𝑣0 𝑧 𝑓 + . (5.6) If the ego height change Δ𝐻 is small compared to the object depth 𝑧, Δ𝐻 𝑧 above equation as ≈ 0. So, we write the 𝑔ˆ𝑧Δ𝐻 ≈ 𝐻 + Δ𝐻 − 𝑏2 𝑣𝑏 −𝑣0 𝑓 = 𝑔ˆ𝑧0 + Δ𝐻 𝑣𝑏 −𝑣0 𝑓 69 ≈ 𝑧 + 𝜂 + 𝑓 Δ𝐻 𝑣𝑏 −𝑣0 =⇒ 𝑔ˆ𝑧Δ𝐻 − 𝑧 ≈ 𝜂 + 𝑓 Δ𝐻 𝑣𝑏 −𝑣0 , assuming the ground depth 𝑔ˆ𝑧0 at train height Δ𝐻 = 0 is the GT depth 𝑧 added by a normal random variable 𝜂 with mean 0 and variance 𝜎2 as in [110]. 
Taking expectation on both sides, the mean depth error is E(cid:16)𝑔ˆ𝑧Δ𝐻 − 𝑧 (cid:17) ≈ (cid:19) (cid:18) 1 𝑣𝑏 −𝑣0 𝑓 Δ𝐻, confirming the positive trend of the mean depth error of the ground model w.r.t. the height change Δ𝐻. The ground lies between the bottom part of the image plane/ image height (ℎ) and the optical center 𝑦-coordinate 𝑣0, and so 𝑣𝑏 − 𝑣0 > 0. However, in practice, it could get negative in early stage of training. To enforce non-negativity of this term, we pass 𝑣𝑏−𝑣0 through a ReLU non-linearity to enforce 𝑣𝑏 −𝑣0 is positive. Sec. 5.5.3 confirms that ReLU remains important for good results. □ 5.4.2 Regression-based Depth Model Most Mono3D models rely on regression losses, to compare the predicted depth with the GT depth [108, 303]. We, next, derive the extrapolation behavior of such regressed depth model in the following theorem. Theorem 4. Regressed model has negative slope (trend) in extrapolation. Consider a regressed depth model trained on data from single ego height, predicting depth ˆ𝑧 from the projected 3D center (𝑢𝑐, 𝑣𝑐). Assuming a linear relationship between predicted depth and pixel position, the mean depth error of a regressed model exhibits a negative trend w.r.t. the height change Δ𝐻: E(cid:16)𝑟ˆ𝑧Δ𝐻 − 𝑧 (cid:17) = − (cid:19) (cid:18) 𝛽 𝑧 𝑓 Δ𝐻, (5.7) where 𝛽 is a camera height independent positive constant. Th. 4 says that regressed depth model under-estimates and over-estimates depth as the ego height change Δ𝐻 increases and decreases respectively. 70 Figure 5.5 CARLA Val samples with both negative and positive ego height changes (Δ𝐻) covers AVs from bots to cars to trucks. Table 5.1 Error analysis of GUP Net [159] trained on Δ𝐻 = 0𝑚 on all height changes Δ𝐻 of CARLA Val split. Depth remains the biggest source of error in inference on unseen ego heights. 𝑧 𝑦 ℎ ✓ ✓ (cid:17)) (cid:17)) Oracle Params. − 𝑥 (cid:17) / Δ𝐻 (𝑚)− 𝜃 𝑤 𝑙 (cid:17) −0.70 9.46 15.95 13.56 34.82 65.44 10.32 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 75.86 ✓ ✓ ✓ ✓ ✓ ✓ ✓ 78.44 ✓ ✓ ✓ ✓ AP 3D 70 [%] (− 0 53.82 62.21 59.55 69.99 82.36 56.24 82.82 85.20 +0.76 −0.70 41.66 7.23 46.89 12.74 44.93 10.67 68.10 39.03 74.76 80.70 42.04 7.20 78.21 82.08 78.44 82.28 AP 3D 50 [%] (− 0 76.47 76.78 76.86 82.73 84.93 76.61 85.17 85.20 +0.76 −0.70 40.97 50.97 49.84 76.24 82.11 42.03 82.24 82.28 MDE (𝑚) [≈ 0] +0.76 0 +0.53 +0.03 −0.63 +0.53 +0.03 −0.63 +0.53 +0.03 −0.63 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.53 +0.03 −0.63 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 Proof. Neural nets often use the 𝑦-coordinate of their projected 3D center 𝑣𝑐 to predict depth [51]. Consider a simple linear regression model for predicting depth. Then, the regressed depth 𝑟ˆ𝑧0 is 𝑟ˆ𝑧0 = − (cid:19) (cid:18) 𝑧𝑚𝑎𝑥 −𝑧𝑚𝑖𝑛 ℎ−𝑣0 (𝑣𝑐 −𝑣0) + 𝑧𝑚𝑎𝑥 = −𝛽(𝑣𝑐 −𝑣0) + 𝑧𝑚𝑎𝑥, (5.8) This linear regression model has a negative slope, with a positive slope parameter 𝛽, and ℎ being the height of the image. This model predicts depth 𝑧𝑚𝑖𝑛 at pixel position 𝑣𝑐 = ℎ and 𝑧𝑚𝑎𝑥 at principal point 𝑣𝑐 = 𝑣0. When the ego camera shifts by Δ𝐻 𝑚, the projected center of the object 71 𝑓 Δ𝐻 𝑧 becomes 𝑣𝑐 + depth 𝑟ˆ𝑧Δ𝐻 as, . Substituting this into the regression model of Eq. (5.8), we obtain the new 𝑟ˆ𝑧Δ𝐻 = −𝛽 (cid:18) 𝑣𝑐 + 𝑓 Δ𝐻 𝑧 (cid:19) −𝑣0 + 𝑧𝑚𝑎𝑥 = −𝛽(𝑣𝑐 −𝑣0) + 𝑧𝑚𝑎𝑥 − (cid:19) (cid:18) 𝛽 𝑧 𝑓 Δ𝐻 = 𝑟ˆ𝑧0 − = 𝑧 + 𝜂 − (cid:19) (cid:18) 𝛽 𝑧 (cid:18) 𝛽 𝑧 𝑓 Δ𝐻 (cid:19) 𝑓 Δ𝐻 =⇒ 𝑟ˆ𝑧Δ𝐻 − 𝑧 = 𝜂 − (cid:19) (cid:18) 𝛽 𝑧 𝑓 Δ𝐻, assuming the regressed depth 𝑟ˆ𝑧0 at train height Δ𝐻 = 0 is the GT depth 𝑧 added by a normal random variable 𝜂 with mean 0 and variance 𝜎2 as in [110]. 
Taking expectation on both sides, the mean depth error is E(cid:16)𝑟ˆ𝑧Δ𝐻 − 𝑧 (cid:17) = − (cid:19) (cid:18) 𝛽 𝑧 𝑓 Δ𝐻, confirming the negative trend of the mean depth error of the regressed depth model w.r.t. the height change Δ𝐻. 5.4.3 Merging Depth Estimates. □ Th. 3 and 4 prove that the ground and the regressed depth models show contrasting extrap- olation behaviors. The former over-estimates the depth while the latter under-estimates depth as the ego height change Δ𝐻 increases. Fig. 5.4 shows how these two depth estimates are fused together. Overall, CHARM3R leverages depth information from these two source sources (with different extrapolation behaviors) to improve the Mono3D generalization to unseen camera heights. CHARM3R starts with an input image, and estimates the depth of the object using two methods: ground and regressed depth. CHARM3R outputs the projected bottom center of the object to query the ground depth (calculated from the ego camera parameters and its position and orientation relative to the ground plane as in Lemma 4). It also outputs another depth estimate based on regression. The final step combines the two estimated depths with a simple average to cancel the 72 Table 5.2 CARLA Val Results. CHARM3R outperforms all other baselines, especially at bigger unseen ego heights. All methods except Oracle are trained on car height Δ𝐻 = 0𝑚 and tested on bot to truck height data. [Key: Best] 3D Detector GUP Net [159] DEVIANT [108] Method − (cid:17) / Δ𝐻 (𝑚)− (cid:17) Source Plucker [189] UniDrive [129] UniDrive++ [129] CHARM3R Oracle Source Plucker [189] UniDrive [129] UniDrive++ [129] CHARM3R Oracle (cid:17)) (cid:17)) AP 3D 70 [%] (− 0 53.82 55.56 53.82 53.82 55.68 53.82 50.18 51.32 50.18 50.18 48.74 50.18 −0.70 9.46 8.43 10.73 10.83 19.45 70.96 8.63 8.43 8.33 6.73 17.11 71.97 +0.76 −0.70 41.66 7.23 37.10 10.13 42.30 5.54 47.81 12.27 53.40 27.33 83.88 62.25 40.24 6.25 38.24 9.52 41.40 6.56 42.91 12.03 49.28 26.24 84.56 62.56 AP 3D 50 [%] (− 0 76.47 76.57 76.46 76.46 74.47 76.47 73.78 73.91 73.78 73.78 70.21 73.78 +0.76 −0.70 40.97 43.22 39.33 53.08 61.98 83.96 41.74 44.22 41.27 52.36 63.60 83.94 MDE (𝑚) [≈ 0] +0.76 0 +0.53 +0.03 −0.63 +0.55 +0.03 −0.63 +0.51 +0.03 −0.67 +0.39 +0.03 −0.48 +0.07 +0.05 −0.02 +0.03 +0.03 +0.03 +0.46 +0.01 −0.65 +0.46 +0.01 −0.64 +0.46 +0.01 −0.64 +0.37 +0.01 −0.47 +0.01 +0.03 −0.02 +0.03 +0.01 −0.02 opposing trends and obtain the refined depth estimates, resulting in a set of accurate and localized 3D objects in the scene. 5.5 Experiments Datasets. Our experiments utilize the simulated CARLA dataset1 from [104], configured to mimic the nuScenes [22] dataset. We use this dataset for two reasons. First, this dataset reduces training and testing domain gaps, while existing public datasets lack data at multiple ego heights. Second, recent paper [104] also use this dataset for their experiments. The default CARLA dataset sweeps camera height changes Δ𝐻 from 0 to 0.76𝑚, rendering a dataset every 0.076𝑚 (car to trucks). To fully investigate the impact of camera height variations, we extend the original CARLA dataset by introducing negative height changes. The extended CARLA dataset sweeps height changes Δ𝐻 from −0.70𝑚 to 0.76𝑚 with settings from bots to cars to trucks. Fig. 5.5 illustrates sample images from this dataset. Note that we exclude Δ𝐻 = −0.76𝑚 setting due to visibility obstructions caused by the ego vehicle’s bonnet. Data Splits. Our experiments use the CARLA Val Split. 
This dataset split [104] contains 25,000 images (2,500 scenes) from town03 map for training and 5,000 images (500 scenes) from town05 map for inference on multiple ego height. Except for Oracle, we train all models on training images from the car height (Δ𝐻 = 0𝑚). 1The authors of [104] do not release their other Nvidia-Sim dataset. 73 (a) AP 3D 70 [%] comparison. (b) AP 3D 50 [%] comparison. (c) MDE comparison. Figure 5.6 CARLA Val Results on GUP Net. CHARM3R outperforms all baselines, especially at bigger unseen ego heights. All methods except Oracle are trained on car height and tested on all heights. Results of inference on height changes of −0.70, 0 and 0.76 meters are in Tab. 5.2. See Fig. 5.6 in the supplementary for another detector. Evaluation Metrics. We choose the KITTI AP 3D 70 percentage on the Moderate category [67] as our evaluation metric. We also report AP3D 50 percentage numbers following prior works [17,109]. Additionally, we report the mean depth error (MDE) over predicted boxes with IoU2D overlap greater than 0.7 with the GT boxes similar to [108]. Note that MDE is different from MAE metric of [108] that it does not take absolute value. Detectors. We use the GUP Net [159] and DEVIANT [108] as our base detectors. The choice of these models encompasses CNN [159] and group CNN-based [108] architectures. Baselines. We compare against the following baselines: • Source: This is the Mono3D model trained on the car height (Δ𝐻 = 0𝑚) data. • Plucker Embeddings [189,234]: Training a Mono3D model with Plucker embeddings to improve robustness as in 3D pose estimation and reconstruction tasks. Plucker embeddings generalize the intrinsic-focused CAM-Convs [57] embeddings to camera extrinsics. • UniDrive [129]: Transforming unseen ego height (target) images to car height (source) assuming objects at fixed distance parameter (50𝑚) and then passing to the Mono3D model. • UniDrive++ [129]: UniDrive with distance parameter optimized per dataset. • Oracle: We also report the Oracle Mono3D model, which is trained and tested on the same ego height. The Oracle serves as the upper bound of all baselines. 74 Table 5.3 CARLA Val Results with ResNet-18 backbone. CHARM3R outperforms all baselines, espe- cially at bigger unseen ego heights. All methods except Oracle are trained on car height Δ𝐻 = 0𝑚 and tested on bot to truck height data. [Key: Best] 3D Detector GUP Net [159] DEVIANT [108] Method − (cid:17) / Δ𝐻 (𝑚)− (cid:17) Source UniDrive [129] UniDrive++ [129] CHARM3R Oracle Source UniDrive [129] UniDrive++ [129] CHARM3R Oracle (cid:17)) (cid:17)) AP 3D 70 [%] (− 0 49.82 49.82 49.82 46.13 49.82 49.88 49.87 49.87 49.13 49.88 −0.70 10.13 10.05 9.37 16.62 70.25 8.83 8.21 6.01 14.96 68.35 +0.76 −0.70 47.15 5.28 47.15 6.15 52.95 13.00 57.00 24.50 83.49 62.93 42.10 4.43 42.21 3.75 43.99 12.03 52.68 23.66 84.03 58.49 AP 3D 50 [%] (− 0 73.49 73.49 73.49 67.83 73.49 72.79 72.79 72.79 72.95 72.79 MDE (𝑚) [≈ 0] +0.76 0 +0.76 −0.70 +0.40 +0.01 −0.65 42.70 +0.35 +0.01 −0.62 43.89 55.57 +0.31 +0.01 −0.46 60.86 −0.15 +0.00 +0.07 84.07 −0.01 +0.05 +0.07 +0.40 +0.01 −0.69 38.42 +0.40 +0.01 −0.70 38.38 +0.38 +0.01 −0.50 50.67 60.98 −0.07 +0.05 +0.02 −0.1 83.42 −0.04 +0.01 Table 5.4 Ablation Studies of GUP Net + CHARM3R on the CARLA Val split on unseen ego heights. 
[Key: Best] Change GUP Net [159] − To From − (cid:17) / Δ𝐻 (𝑚)− (cid:17) Merge Ground Formulation CHARM3R Oracle Regress Ground Regress+Ground − Regress+Ground − (cid:17) Within Model − Offline (cid:17) Simple Avg − Learned Avg (cid:17) No ReLU ReLU − (cid:17) Product − Sum (cid:17) − (cid:17) − 5.5.1 CARLA Error Analysis (cid:17)) AP 3D 70 [%] (− +0.76 −0.70 0 41.66 7.23 53.82 41.66 7.23 53.82 14.21 26.61 5.39 49.86 47.66 18.36 38.58 9.53 56.49 15.66 0.07 52.94 17.28 37.22 12.79 55.68 27.33 53.40 83.88 53.82 62.25 −0.70 9.46 9.46 0.98 12.86 8.25 0.60 3.28 19.45 70.96 AP 3D 50 [%] (− 0 (cid:17)) MDE (𝑚) [≈ 0] 0 +0.76 −0.70 +0.76 76.47 40.97 +0.53 +0.05 −0.63 76.47 40.97 +0.53 +0.05 −0.63 51.97 31.42 −0.80 −0.01 +0.55 76.30 54.38 +0.24 +0.02 −0.28 76.82 43.13 +0.56 −0.03 −0.62 −1.09 −0.01 +1.34 74.79 63.88 47.09 +0.56 +0.09 −0.22 74.47 61.98 −0.07 +0.05 +0.02 76.47 83.96 +0.03 +0.03 +0.03 4.50 We first report the error analysis of the baseline GUP Net [159] in Tab. 5.1 by replacing the predicted box data with the oracle parameters of the box as in [110, 166]. We consider the GT box to be an oracle box for predicted box if the euclidean distance is less than 4𝑚 [110]. In case of multiple GT being matched to one box, we consider the oracle with the minimum distance. Tab. 5.1 shows that depth is the biggest source of error for Mono3D task under ego height changes as also observed for single height settings in [108, 110, 166]. Note that the Oracle does not get 100% results since we only replace box parameters in the baseline and consequently, the missed boxes in the baseline are not added. 5.5.2 CARLA Height Robustness Results Tab. 5.2 presents the CARLA Val results, reporting the median model over three different seeds with the model being the final checkpoint as [108]. It compares baselines and our CHARM3R on all 75 Mono3D models - GUP Net [159], and DEVIANT [108]. Except for Oracle, all models are trained from car height data and tested on all ego heights. Tab. 5.2 confirms that CHARM3R outperforms other baselines on all the Mono3D models, and results in a better height robust detector. We also plot these AP3D numbers and depth errors visually in Fig. 5.6 for intermediate height changes to confirm our observations. The MDE comparison in Fig. 5.6c also shows the trend of baselines, while CHARM3R cancels the opposite trends in extrapolation. Oracle Biases. We further note biases in the Oracle models at big changes in ego height. This agrees the observations of [104] in the BEV segmentation task. While higher AP3D for a higher height could be explained by fewer occlusions due a higher height, higher AP3D at lower camera height is not explained by this hypothesis. We leave the analysis of higher Oracle numbers for a future work. Results on Other Backbone. We next investigate whether the extrapolation behavior holds for other backbones as well following DEVIANT [108]. So, we benchmark on the ResNet-18 backbone. Tab. 5.3 results show that extrapolation shows up in other backbones and CHARM3R again outperforms all baselines. The biggest gains are in big camera height changes, which is consistent with Tab. 5.2 results. 5.5.3 Ablation Studies Tab. 5.4 ablates the design choices of GUP Net + CHARM3R on CARLA Val split, with the experimental setup of Sec. 5.5. Depth Merge. We first analyze the impact of averaging the two depth estimates. Merging both regressed and ground-based depth estimates is crucial for optimal performance. 
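A minimal sketch of what this merge computes for a single box is given below; it chains Eq. (5.4) for the projected bottom center, the ReLU-clamped ground depth of Lemma 5, and the simple average of Sec. 5.4.3. The variable names, the toy camera parameters, and the per-box interface are illustrative assumptions rather than the released implementation.

# Per-box sketch of the CHARM3R depth fusion (Secs. 5.4.1-5.4.3).
# Names, the eps guard and the toy values are illustrative assumptions.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bottom_center_v(v_c, v_c2d, h_2d, alpha):
    # Eq. (5.4): v_b = v_c + 0.5 * h_2D + alpha * (v_c - v_c2D)
    return v_c + 0.5 * h_2d + alpha * (v_c - v_c2d)

def ground_depth(v_b, cam_height, b2, f, v0, eps=1e-6):
    # Lemma 5 (identity rotation): z = f * (H - b2) / (v_b - v0), with ReLU on the
    # numerator (Sec. 5.4.1) and on v_b - v0 (Th. 3) for training stability.
    # cam_height is the known test-time mounting height H + Delta_H.
    return f * relu(cam_height - b2) / (relu(v_b - v0) + eps)

def charm3r_depth(z_regressed, v_c, v_c2d, h_2d, alpha, cam_height, b2, f, v0):
    # Final depth = simple average of the regressed and ground-based estimates (Sec. 5.4.3).
    v_b = bottom_center_v(v_c, v_c2d, h_2d, alpha)
    return 0.5 * (z_regressed + ground_depth(v_b, cam_height, b2, f, v0))

# Toy usage with assumed values: f = 1000 px, principal point v0 = 450 px,
# camera mounted 1.6 m + 0.76 m above the ground, b2 = 0 for identity extrinsics.
z_fused = charm3r_depth(z_regressed=29.0, v_c=520.0, v_c2d=515.0, h_2d=60.0, alpha=0.1,
                        cam_height=1.6 + 0.76, b2=0.0, f=1000.0, v0=450.0)
print(f"fused depth estimate: {z_fused:.2f} m")

In CHARM3R itself, the regressed depth, the coefficient α, and the 2D box quantities are network outputs, and the average is taken inside the model so that both branches are trained end to end.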
Relying solely on the regressed depth gives good ID performance but bad OOD performance. Using only ground depth generalizes poorly in both ID and OOD settings, which is why it is not used in modern Mono3D models. However, it has a contrasting extrapolation MDE compared to regression models. While offline merging of depth estimates from regression-only and ground-only models also improves extrapolation, it is slower and lacks end-to-end training. We also experiment with changing the simple averaging of CHARM3R to learned averaging. Simple average of CHARM3R outperforms 76 learned one in OOD test because the learned average overfits to train distribution. ReLUed Ground. Sec. 5.4.1 says that ReLU activation applied to the ground depth ensures spatial continuity and improves model training stability. Removing the ReLU leads to training instability and suboptimal extrapolation to camera height. (The training also collapses in some cases). Formulation. CHARM3R estimates the projected 3D bottom center by using the projected 3D center and the 2D height prediction. Eq. (5.4) predicts a coefficient 𝛼 to determine the precise bottom center location. Product means predicting 𝛼 and then multiplying by (𝑣𝑐 −𝑣𝑐,2𝐷) to obtain the shift, while sum means directly predicting the shift 𝛼. Replace this product formulation by the sum formulation of 𝛼 confirms that the product is more effective than the sum. 5.6 Conclusions This chapter highlights the understudied problem of Mono3D generalization to unseen ego heights. We first systematically analyze the impact of camera height variations on state-of-the- art Mono3D models, identifying depth estimation as the primary factor affecting performance. We mathematically prove and also empirically observe consistent negative and positive trends in regressed and ground-based object depth estimates, respectively, under camera height changes. This chapter then takes a step towards generalization to unseen camera heights and proposes CHARM3R. CHARM3R averages both depth estimates within the model to mitigate these opposing trends. CHARM3R significantly enhances the generalization of Mono3D models to unseen camera heights, achieving SoTA performance on the CARLA dataset. We hope that this initial step towards generalization will contribute to safer AVs. Future work involves extending this idea to more Mono3D models. Limitation. CHARM3R does not fully solve the generalization issue to unseen camera heights. 77 CHAPTER 6 CONCLUSIONS AND FUTURE RESEARCH In this thesis, we attempt generalizing Mono3D networks to occlusion, dataset, object sizes and camera heights. The backbones of our models is in all cases a convolutional neural network or a transformer backbone. While the current Mono3D networks generalize fairly well across these shifts, they still suffer from the following issues: • They do not generalize to unseen datasets during training. • They do not multiple handle tasks like depth prediction, semantic scene completion and Mono3D. • They do not generalize to unknown or noisy camera extrinsics. • They do not handle multiple camera models. Generalizing to Unseen Datasets. Current multi-dataset trained baselines such as Cube R- CNN [14] generalize poorly to datasets unseen in training. In other words, these models do not generalize in cross-dataset settings. Generalizing Mono3D to unseen datasets remains unsolved till date. We conjecture that the cause of limited generalization is the limited training data and specialized backbones which handle projective geometry. 
5.6 Conclusions
This chapter highlights the understudied problem of Mono3D generalization to unseen ego heights. We first systematically analyze the impact of camera height variations on state-of-the-art Mono3D models, identifying depth estimation as the primary factor affecting performance. We mathematically prove and also empirically observe consistent negative and positive trends in regressed and ground-based object depth estimates, respectively, under camera height changes. This chapter then takes a step towards generalization to unseen camera heights and proposes CHARM3R. CHARM3R averages both depth estimates within the model to mitigate these opposing trends. CHARM3R significantly enhances the generalization of Mono3D models to unseen camera heights, achieving SoTA performance on the CARLA dataset. We hope that this initial step towards generalization will contribute to safer AVs. Future work involves extending this idea to more Mono3D models.
Limitation. CHARM3R does not fully solve the generalization issue to unseen camera heights.
CHAPTER 6
CONCLUSIONS AND FUTURE RESEARCH
In this thesis, we attempt to generalize Mono3D networks across occlusions, datasets, object sizes, and camera heights. The backbone of our models is, in all cases, a convolutional neural network or a transformer. While current Mono3D networks generalize fairly well across these shifts, they still suffer from the following issues:
• They do not generalize to datasets unseen during training.
• They do not handle multiple tasks such as depth prediction, semantic scene completion, and Mono3D.
• They do not generalize to unknown or noisy camera extrinsics.
• They do not handle multiple camera models.
Generalizing to Unseen Datasets. Current multi-dataset-trained baselines such as Cube R-CNN [14] generalize poorly to datasets unseen in training; in other words, these models do not generalize in cross-dataset settings. Generalizing Mono3D to unseen datasets remains unsolved to date. We conjecture that the causes of this limited generalization are the limited training data and the specialized backbones that handle projective geometry.
Generalizing to Multiple Tasks. Tasks like metric depth prediction, semantic scene completion, and Mono3D all represent 3D scene understanding at varying levels of granularity, from points to voxels to objects. While there are networks that specialize in each task, a single model that understands all these granularities, as well as intermediate ones, remains an exciting direction for solidifying 3D understanding and task generalization.
Generalizing to Unknown Extrinsics. Current Mono3D methods work well when trained and tested on the same extrinsics. However, such methods do not work well when the camera extrinsics are unknown during testing. Joint Mono3D and camera calibration remains an open problem.
Generalizing to Camera Models. Current methods handle only pinhole cameras, while the cameras available today also include fisheye and 360° camera models. Generalizing Mono3D networks to handle any camera model remains another open problem in this area.
Advances in the Mono3D task enable diverse applications such as autonomous driving, the Metaverse, and robotics. The goal of home robots is to assist humans in indoor activities, such as cooking or cleaning. Future work that generalizes Mono3D along these directions will make our currently limited 3D scene understanding even more powerful.
BIBLIOGRAPHY
[1] The KITTI Vision Benchmark Suite. http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d. Accessed: 2022-07-03. 18, 35
[2] Hassan Alhaija, Siva Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. IJCV, 2018. 1, 6, 25, 44, 61
[3] Samaneh Azadi, Jiashi Feng, and Trevor Darrell. Learning detection with diverse proposals. In CVPR, 2017. 9, 16, 17
[4] Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David Lindell, and Sergey Tulyakov. VD3D: Taming large video diffusion transformers for 3D camera control. arXiv preprint arXiv:2407.12781, 2024. 63, 65
[5] Wentao Bao, Bin Xu, and Zhenzhong Chen. MonoFENet: Monocular 3D object detection with feature enhancement networks. IEEE Transactions on Image Processing, 2019. 9
[6] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In ICCV, 2019. 54
[7] Deniz Beker, Hiroharu Kato, Mihai Adrian Morariu, Takahiro Ando, Toru Matsuoka, Wadim Kehl, and Adrien Gaidon. Monocular differentiable rendering for self-supervised 3D object detection. In ECCV, 2020. 20
[8] Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Francis Bach. Learning with differentiable perturbed optimizers. In NeurIPS, 2020. 11
[9] Zygmunt Birnbaum. An inequality for Mill's ratio. The Annals of Mathematical Statistics, 1942. 144
[10] Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differentiable sorting and ranking. In ICML, 2020. 11, 12
[11] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020. 26
[12] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry Davis. Soft-NMS–improving object detection with one line of code. In ICCV, 2017. 8, 9, 10, 11, 15, 16, 17, 21, 23, 108, 109, 110
[13] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry Davis.
Soft-NMS implementa- tion. https://github.com/bharatsingh430/soft-nms/blob/master/lib/nms/cpu_nms.pyx#L98, 2017. Accessed: 2021-01-18. 10 80 [14] Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3D: A large benchmark and model for 3D object detection in the wild. In CVPR, 2023. 1, 44, 55, 56, 57, 61, 65, 78 [15] Garrick Brazil and Xiaoming Liu. M3D-RPN: Monocular 3D region proposal network for object detection. In ICCV, 2019. 6, 9, 16, 18, 19, 20, 21, 22, 23, 25, 26, 29, 37, 42, 48, 65, 108, 110, 121, 128, 129 [16] Garrick Brazil and Xiaoming Liu. Pedestrian detection with autoregressive network phases. In CVPR, 2019. 9 [17] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3D object detection in monocular video. In ECCV, 2020. 6, 7, 9, 16, 18, 19, 20, 21, 22, 23, 25, 29, 35, 36, 48, 65, 74, 107, 108, 109, 110, 111, 112, 113, 117, 121, 131 [18] Garrick Brazil, Xi Yin, and Xiaoming Liu. detection & segmentation. In ICCV, 2017. 9 Illuminating pedestrians via simultaneous [19] Michael Bronstein. Convolution from first principles. https://towardsdatascience.com/ deriving-convolution-from-first-principles-4ff124888028. Accessed: 2021-08-13. 25, 28, 29, 114 [20] Michael Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021. 28, 29, 114 [21] Brian Burns, Richard Weiss, and Edward Riseman. The non-existence of general-case view-invariants. In Geometric invariance in computer vision. 1992. 29, 30, 63, 114, 115 [22] Holger Caesar, Varun Bankiti, Alex Lang, Sourabh Vora, Venice Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020. 34, 35, 37, 46, 53, 54, 60, 73, 128, 149, 164 [23] Brittany Caldwell. 2 die when tesla crashes into parked tractor-trailer in florida. https: //www.wftv.com/news/local/2-die-when-tesla-crashes-into-parked-tractor-trailer-florida/ KJGMHHYTQZA2HNAHWL2OFSVIPM/, 2022. Accessed: 2023-11-06. 2, 45 [24] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 48, 65 [25] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teuliere, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehi- cle analysis from monocular image. In CVPR, 2017. 29, 48, 65 [26] Gyusam Chang, Jiwon Lee, Donghyun Kim, Jinkyu Kim, Dongwook Lee, Daehyun Ji, Sujin Jang, and Sangpil Kim. Unified domain generalization and adaptation for multi-view 3D 81 object detection. In NeurIPS, 2024. 65 [27] Gyusam Chang, Wonseok Roh, Sujin Jang, Dongwook Lee, Daehyun Ji, Gyeongrok Oh, Jinsun Park, Jinkyu Kim, and Sangpil Kim. CMDA: Cross-modal and domain adversarial adaptation for LiDAR-based 3D object detection. In AAAI, 2024. 65 [28] Ming Chang, Xishan Zhang, Rui Zhang, Zhipeng Zhao, Guanhua He, and Shaoli Liu. RecurrentBEV: A long-term temporal fusion framework for multi-view 3D detection. In ECCV, 2024. 65 [29] Dian Chen, Jie Li, Vitor Guizilini, Rares Andrei Ambrus, and Adrien Gaidon. Viewpoint equivariance for multi-view 3D object detection. In CVPR, 2023. 48, 59, 65 [30] Kean Chen, Weiyao Lin, Jianguo Li, John See, Ji Wang, and Junni Zou. AP-Loss for accurate one-stage object detection. TPAMI, 2020. 
17, 19 [31] Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. Monocular 3D object detection for autonomous driving. In CVPR, 2016. 9, 29, 47, 64 [32] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. In NeurIPS, 2015. 18, 35 [33] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In CVPR, 2017. 6, 9 [34] Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. DSGN: Deep stereo geometry network for 3D object detection. In CVPR, 2020. 1, 47, 64 [35] Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li. MonoPair: Monocular 3D object detection using pairwise spatial relationships. In CVPR, 2020. 6, 9, 19, 20, 21, 23, 25, 29, 36, 48, 65 [36] Zhili Chen, Shuangjie Xu, Maosheng Ye, Zian Qian, Xiaoyi Zou, Dit-Yan Yeung, and Qifeng Chen. Learning high-resolution vector representation from multi-camera images for 3D object detection. In ECCV, 2024. 65 [37] Kashyap Chitta, Aditya Prakash, and Andreas Geiger. NEAT: Neural attention fields for end-to-end autonomous driving. In ICCV, 2021. 48 [38] Wonhyeok Choi, Mingyu Shin, and Sunghoon Im. Depth-discriminative metric learning for monocular 3D object detection. In NeurIPS, 2023. 48, 65 [39] Zhiyu Chong, Xinzhu Ma, Hong Zhang, Yuxin Yue, Haojie Li, Zhihui Wang, and Wanli Ouyang. MonoDistill: Learning spatial features for monocular 3D object detection. In ICLR, 2022. 35, 36, 132 [40] Xiaomeng Chu, Jiajun Deng, Yuan Zhao, Jianmin Ji, Yu Zhang, Houqiang Li, and Yanyong 82 Zhang. OA-BEV: Bringing object awareness to bird’s-eye-view representation for multi- camera 3D object detection. arXiv preprint arXiv:2301.05711, 2023. 48, 65 [41] Taco Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In ICLR, 2018. 26, 28 [42] Taco Cohen and Max Welling. Learning the irreducible representations of commutative lie groups. In ICML, 2014. 28 [43] Taco Cohen and Max Welling. Group equivariant convolutional networks. In ICML, 2016. 28, 63, 114 [44] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020. 149 [45] Michael Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796, 2020. 46 [46] Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable ranks and sorting using optimal transport. In NeurIPS, 2019. 11 [47] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 9 [48] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. large-scale hierarchical image database. In CVPR, 2009. 124 ImageNet: A [49] Chaitanya Desai, Deva Ramanan, and Charless Fowlkes. Discriminative models for multi- class object layout. IJCV, 2011. 9, 16, 17 [50] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In ICML, 2016. 28, 114 [51] Tom van Dijk and Guido de Croon. How do neural networks see depth in single images? In ICCV, 2019. 71, 161 [52] Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, and Ping In CVPR Luo. Learning depth-guided convolutions for monocular 3D object detection. Workshops, 2020. 9, 19, 20, 29, 38, 39, 121 [53] Simon Doll, Richard Schulz, Lukas Schneider, Viviane Benzin, Markus Enzweiler, and Hendrik Lensch. 
SpatialDETR: Robust scalable transformer-based 3D object detection from multi-view camera images with global cross-sensor attention. In ECCV, 2022. 59 [54] Yinpeng Dong, Caixin Kang, Jinlai Zhang, Zijian Zhu, Yikai Wang, Xiao Yang, Hang Su, Xingxing Wei, and Jun Zhu. Benchmarking robustness of 3D object detection to common corruptions. In CVPR, 2023. 1, 44 83 [55] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 28, 63 [56] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. In ICLR, 2018. 28 [57] Jose Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, and Javier Civera. CAM-Convs: Camera-aware multi-scale convolutions for single-view depth. In CVPR, 2019. 74 [58] Lue Fan, Feng Wang, Naiyan Wang, and Zhao Zhang. Fully sparse 3D object detection. In NeurIPS, 2022. 48 [59] Chengjian Feng, Zequn Jie, Yujie Zhong, Xiangxiang Chu, and Lin Ma. AEDet: Azimuth- invariant multi-view 3D object detection. arXiv preprint arXiv:2211.12501, 2022. 48 [60] Roshan Fernandez. A tesla firetruck on a tesla-driver-killed-california-firetruck-nhtsa, 2023. Accessed: 2023-11-06. 2, 45 california highway. driver was a https://www.npr.org/2023/02/20/1158367204/ smashing killed after into [61] Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NeurIPS, 2012. 9, 29 [62] William Freeman and Edward Adelson. The design and use of steerable filters. TPAMI, 1991. 28, 32 [63] Kanchana Gandikota, Jonas Geiping, Zorah Lähner, Adam Czapliński, and Michael Moeller. Training or architecture? how to incorporate invariance in neural networks. arXiv preprint arXiv:2106.10044, 2021. 27, 38, 114 [64] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural net- works. In NeurIPS, 2017. 26, 28 [65] Noa Garnett, Rafi Cohen, Tomer Pe’er, Roee Lahav, and Dan Levi. 3D-LaneNet: end-to-end 3D multiple lane detection. In ICCV, 2019. 66, 67 [66] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. IJRR, 2013. 112, 134 [67] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, 2012. 18, 20, 34, 35, 53, 54, 74, 146 [68] Rohan Ghosh and Anupam Gupta. Scale steerable filters for locally scale-invariant convolu- tional neural networks. In ICML Workshops, 2019. 27, 28, 32, 122 [69] Ross Girshick. Fast R-CNN. In ICCV, 2015. 8, 9, 55 84 [70] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 8 [71] Gene Golub and Charles Loan. Matrix computations. 2013. 14 [72] Nikhil Gosala and Abhinav Valada. Bird’s-eye-view panoptic segmentation using monocular frontal view images. RAL, 2022. 48, 54, 55, 57, 148, 150, 152 [73] Yuliang Guo, Guang Chen, Peitao Zhao, Weide Zhang, Jinghao Miao, Jingao Wang, and Tae Eun Choe. Gen-lanenet: A generalized and scalable approach for 3D lane detection. In ECCV, 2020. 66, 67 [74] Adam Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple- BEV: What really matters for multi-sensor BEV perception? In CoRL, 2022. 
48 [75] Christopher Harris and Mike Stephens. A combined corner and edge detector. In Alvey vision conference, 1988. 9 [76] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cam- bridge university press, 2003. 27, 29, 30, 31, 63, 66, 116, 117 [77] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 150 [78] Paul Henderson and Vittorio Ferrari. End-to-end training of object class detectors for mean average precision. In ACCV, 2016. 9, 18, 24 [79] [80] [81] [82] Joao Henriques and Andrea Vedaldi. Warped convolutions: Efficient invariance to spatial transformations. In ICML, 2017. 28 Jan Hosang, Rodrigo Benenson, and Bernt Schiele. A convnet for non-maximum suppres- sion. In GCPR, 2016. 7, 9, 16, 17, 18, 24 Jan Hosang, Rodrigo Benenson, and Bernt Schiele. Learning non-maximum suppression. In CVPR, 2017. 7, 9, 16, 17, 18, 24 Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Xiao Tan, Errui Ding, Jingdong Wang, and Xiang Bai. OPEN: Object-wise position embedding for multi-view 3D object detection. In ECCV, 2024. 65 [83] Anthony Hu, Zak Murez, Nikhil Mohan, Sofía Dudas, Jeffrey Hawke, Vijay Badrinarayanan, Roberto Cipolla, and Alex Kendall. FIERY: future instance prediction in bird’s-eye view from surround monocular cameras. In ICCV, 2021. 48, 50 [84] Hanjiang Hu, Zuxin Liu, Sharad Chitlangia, Akhil Agnihotri, and Ding Zhao. Investigating the impact of multi-LiDAR placement on object detection for autonomous driving. In CVPR, 2022. 65 85 [85] Peiyun Hu, Jason Ziglar, David Held, and Deva Ramanan. What you see is what you get: Exploiting visibility for 3D object detection. In CVPR, 2020. 9 [86] Gao Huang, Zhuang Liu, Laurens Maaten, and Kilian Weinberger. Densely connected convolutional networks. In CVPR, 2017. 18 [87] [88] Junjie Huang and Guan Huang. BEVDet4D: Exploit temporal cues in multi-camera 3D object detection. arXiv preprint arXiv:2203.17054, 2022. 55, 155 Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. High-performance multi-camera 3D object detection in bird-eye-view. arXiv:2112.11790, 2021. 48, 65, 155 BEVDet: arXiv preprint [89] Kuan-Chih Huang, Tsung-Han Wu, Hung-Ting Su, and Winston Hsu. MonoDTR: Monocular 3D object detection with depth-aware transformer. In CVPR, 2022. 47, 65 [90] Tengteng Huang, Zhe Liu, Xiwu Chen, and Xiang Bai. EPNet: Enhancing point features with image semantics for 3D object detection. In ECCV, 2020. 6, 7, 9, 23 [91] Sujin Jang, Dae Ung Jo, Sung Ju Hwang, Dongwook Lee, and Daehyun Ji. STXD: Structural and temporal cross-modal distillation for multi-view 3D object detection. In NeurIPS, 2023. 59 [92] Ylva Jansson and Tony Lindeberg. Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales. IJCV, 2021. 27, 28, 32 [93] Haoxuanye Ji, Pengpeng Liang, and Erkang Cheng. Enhancing 3D object detection with 2D detection-guided query anchors. In CVPR, 2024. 65 [94] Jinrang Jia, Zhenjia Li, and Yifeng Shi. MonoUNI: A unified vehicle and infrastructure-side monocular 3D object detection network with sufficient depth clues. In NeurIPS, 2023. 1, 44, 65 [95] Yanqin Jiang, Li Zhang, Zhenwei Miao, Xiatian Zhu, Jin Gao, Weiming Hu, and Yu-Gang Jiang. Polarformer: Multi-camera 3D object detection with polar transformers. In AAAI, 2023. 48, 59, 155 [96] Zheng Jiang, Jinqing Zhang, Yanan Zhang, Qingjie Liu, Zhenghui Hu, Baohui Wang, and Yunhong Wang. FSD-BEV: Foreground self-distillation for multi-view 3D object detection. In ECCV, 2024. 
65 [97] Li Jing. Physical symmetry enhanced neural networks. PhD thesis, Massachusetts Institute of Technology, 2020. 28 [98] Angjoo Kanazawa, Abhishek Sharma, and David Jacobs. Locally scale-invariant convolu- tional neural networks. In NeurIPS Workshops, 2014. 28 [99] Kang Kim and Hee Lee. Probabilistic anchor assignment with IoU prediction for object 86 detection. In ECCV, 2020. 19, 24 [100] Sanmin Kim, Youngseok Kim, Sihwan Hwang, Hyeonjun Jeong, and Dongsuk Kum. La- belDistill: Label-guided cross-modal knowledge distillation for camera-based 3D object detection. In ECCV, 2024. 65 [101] Sanmin Kim, Youngseok Kim, In-Jae Lee, and Dongsuk Kum. Predict to Detect: Prediction- guided 3D object detection using sequential images. In ICCV, 2023. 59, 155 [102] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 108, 125, 150 [103] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. In ICLR, 2022. 64 [104] Tzofi Klinghoffer, Jonah Philion, Wenzheng Chen, Or Litany, Zan Gojcic, Jungseock Joo, Ramesh Raskar, Sanja Fidler, and Jose Alvarez. Towards viewpoint robustness in Bird’s Eye View segmentation. In ICCV, 2023. 1, 44, 62, 63, 65, 73, 76, 161, 162 [105] Marvin Klingner, Shubhankar Borse, Varun Ravi Kumar, Behnaz Rezaei, Venkatraman Narayanan, Senthil Yogamani, and Fatih Porikli. X3KD: Knowledge distillation across modalities, tasks and stages for multi-camera 3D object detection. In CVPR, 2023. 48, 59 [106] Emile Krieken, Erman Acar, and Frank Harmelen. Analyzing differentiable fuzzy logic operators. arXiv preprint arXiv:2002.06100, 2020. 12, 106 [107] Jason Ku, Alex Pon, and Steven Waslander. Monocular 3D object detection leveraging accurate proposals and shape reconstruction. In CVPR, 2019. 19 [108] Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. DEVIANT: Depth Equivariant Network for monocular 3D object detection. In ECCV, 2022. 1, 44, 48, 50, 53, 54, 55, 56, 57, 61, 63, 65, 70, 73, 74, 75, 76, 146, 147, 152, 153, 156, 162, 163, 164 [109] Abhinav Kumar, Garrick Brazil, and Xiaoming Liu. GrooMeD-NMS: Grouped mathemat- ically differentiable NMS for monocular 3D object detection. In CVPR, 2021. 29, 33, 35, 36, 37, 48, 55, 56, 57, 65, 74, 126, 128, 132, 134, 151 [110] Abhinav Kumar, Yuliang Guo, Xinyu Huang, Liu Ren, and Xiaoming Liu. SeaBird: Seg- mentation in bird’s view with dice loss improves monocular 3D detection of large objects. In CVPR, 2024. 1, 61, 65, 70, 72, 75, 162 [111] Abhinav Kumar, Tim Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. LUVLi face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In CVPR, 2020. 7, 29, 48, 65 [112] Animesh Kumar and Vinod Prabhakaran. Estimation of bandlimited signals from the signs of noisy samples. In ICASSP, 2013. 13, 33, 107 87 [113] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O (1/𝑡) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012. 49, 138, 141 [114] John Lambert, Zhuang Liu, Ozan Sener, James Hays, and Vladlen Koltun. MSeg: A composite dataset for multi-domain semantic segmentation. In CVPR, 2020. 130 [115] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 
28, 29 [116] Donghoon Lee, Geonho Cha, Ming-Hsuan Yang, and Songhwai Oh. Individualness and determinantal point processes for pedestrian detection. In ECCV, 2016. 9, 15, 17 [117] Hyo-Jun Lee, Hanul Kim, Su-Min Choi, Seong-Gyun Jeong, and Yeong Koh. BAAM: Monocular 3D pose and shape reconstruction with bi-contextual attention module and attention-guided modeling. In CVPR, 2023. 48, 65 [118] Jin Lee, Myung Han, Dong Ko, and Il Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019. 130 [119] Yoonho Lee, Huaxiu Yao, and Chelsea Finn. Diversify and disambiguate: Learning from underspecified data. In ICLR, 2022. 64 [120] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR, 2018. 6 [121] Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. GS3D: An efficient 3D object detection framework for autonomous driving. In CVPR, 2019. 6, 19 [122] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo R-CNN based 3D object detection for autonomous driving. In CVPR, 2019. 9 [123] Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao. RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving. In ECCV, 2020. 6, 9, 19, 25, 29 [124] Tao Li and Vivek Srikumar. Augmenting neural networks with first-order logic. In ACL, 2019. 12, 106 [125] Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. BEVStereo: Enhancing depth estimation in multi-view 3D object detection with dynamic temporal stereo. In AAAI, 2023. 48, 59, 65 [126] Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, and Jiaya Jia. Unifying voxel- based representation with transformer for 3D object detection. In NeurIPS, 2022. 59 [127] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. BEVDepth: Acquisition of reliable depth for multi-view 3D object detection. 88 In AAAI, 2023. 59, 155 [128] Yangguang Li, Bin Huang, Zeren Chen, Yufeng Cui, Feng Liang, Mingzhu Shen, Fenggang Liu, Enze Xie, Lu Sheng, Wanli Ouyang, and Jing Shao. Fast-BEV: A fast and strong bird’s-eye view perception baseline. In NeurIPS Workshops, 2023. 48, 65 [129] Ye Li, Wenzhao Zheng, Xiaonan Huang, and Kurt Keutzer. UniDrive: Towards universal driving perception across camera configurations. arXiv preprint arXiv:2410.13864, 2024. 63, 65, 73, 74, 75, 163 [130] Zhenyu Li, Zehui Chen, Ang Li, Liangji Fang, Qinhong Jiang, Xianming Liu, and Junjun Jiang. Unsupervised domain adaptation for monocular 3D object detection via self-training. In ECCV, 2022. 65 [131] Zhenxin Li, Shiyi Lan, Jose Alvarez, and Zuxuan Wu. BEVNeXt: Reviving dense BEV frameworks for 3D object detection. In CVPR, 2024. 65 [132] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022. 1, 44, 46, 48, 53, 59, 61, 147, 155 [133] Zhuoling Li, Xiaogang Xu, SerNam Lim, and Hengshuang Zhao. Unimode: Unified monoc- ular 3D object detection. In CVPR, 2024. 61 [134] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose Alvarez. FB- BEV: BEV representation from forward-backward view transformations. In ICCV, 2023. 59 [135] Qing Lian, Botao Ye, Ruijia Xu, Weilong Yao, and Tong Zhang. 
Geometry-aware data augmentation for monocular 3D object detection. arXiv preprint arXiv:2104.05858, 2021. 26, 29 [136] Yiyi Liao, Jun Xie, and Andreas Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. TPAMI, 2022. 45, 46, 53, 54, 60, 148, 149 [137] Hongbin Lin, Yifan Zhang, Shuaicheng Niu, Shuguang Cui, and Zhen Li. MonoTTA: Fully test-time adaptation for monocular 3D object detection. In ECCV, 2024. 61 [138] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 39, 123, 149 [139] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. TPAMI, 2018. 8, 9 [140] Feng Liu, Tengteng Huang, Qianjing Zhang, Haotian Yao, Chi Zhang, Fang Wan, Qixiang Ye, and Yanzhao Zhou. Ray Denoising: Depth-aware hard negative sampling for multi-view 3D object detection. In ECCV, 2024. 65 89 [141] Feng Liu and Xiaoming Liu. Voxel-based 3D detection and reconstruction of multiple objects from a single image. In NeurIPS, 2021. 48, 65 [142] Haisong Liu Liu, Yao Teng Teng, Tao Lu, Haiguang Wang, and Limin Wang. SparseBEV: High-performance sparse 3D object detection from multi-camera videos. In ICCV, 2023. 59, 154 [143] Lijie Liu, Jiwen Lu, Chunjing Xu, Qi Tian, and Jie Zhou. Deep fitting degree scoring network for monocular 3D object detection. In CVPR, 2019. 9, 19, 25, 29 [144] Lijie Liu, Chufan Wu, Jiwen Lu, Lingxi Xie, Jie Zhou, and Qi Tian. Reinforced axial refinement network for monocular 3D object detection. In ECCV, 2020. 19 [145] Songtao Liu, Di Huang, and Yunhong Wang. Adaptive NMS: Refining pedestrian detection in a crowd. In CVPR, 2019. 9, 16, 17 [146] Xianpeng Liu, Nan Xue, and Tianfu Wu. Learning auxiliary monocular contexts helps monocular 3D object detection. In AAAI, 2022. 29 [147] Xianpeng Liu, Ce Zheng, Kelvin Cheng, Nan Xue, Guo-Jun Qi, and Tianfu Wu. Monocular 3D object detection with bounding box denoising in 3D by perceiver. In ICCV, 2023. 48, 65 [148] Xianpeng Liu, Ce Zheng, Ming Qian, Nan Xue, Chen Chen, Zhebin Zhang, Chen Li, and Tianfu Wu. Multi-view attentive contextualization for multi-view 3D object detection. In CVPR, 2024. 65 [149] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position embedding transformation for multi-view 3D object detection. In ECCV, 2022. 155 [150] Yingfei Liu, Junjie Yan, Fan Jia, Shuailin Li, Qi Gao, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETRv2: A unified framework for 3D perception from multi-camera images. In ICCV, 2023. 48, 59, 65, 155 [151] Yuxuan Liu, Yuan Yixuan, and Ming Liu. Ground-aware monocular 3D object detection for autonomous driving. Robotics and Automation Letters, 2021. 26, 35, 36 [152] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 150 [153] Zechen Liu, Zizhang Wu, and Roland Tóth. SMOKE: Single-stage monocular 3D object detection via keypoint estimation. In CVPR Workshops, 2020. 19 [154] Zongdai Liu, Dingfu Zhou, Feixiang Lu, Jin Fang, and Liangjun Zhang. AutoShape: Real- time shape-aware monocular 3D object detection. In ICCV, 2021. 29, 35, 48, 65 [155] Yunfei Long, Abhinav Kumar, Daniel Morris, Xiaoming Liu, Marcos Castro, and Punarjay Chakravarty. RADIANT: RADar Image Association Network for 3D object detection. In 90 AAAI, 2023. 1, 47, 64 [156] Ilya Loshchilov and Frank Hutter. 
Decoupled weight decay regularization. In ICLR, 2019. 150, 151 [157] David Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 9 [158] Hao Lu, Yunpeng Zhang, Qing Lian, Dalong Du, and Yingcong Chen. Towards gen- eralizable multi-camera 3D object detection via perspective debiasing. arXiv preprint arXiv:2310.11346, 2023. 65 [159] Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry uncertainty projection network for monocular 3D object detection. In ICCV, 2021. 25, 26, 29, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 48, 50, 55, 56, 57, 62, 65, 71, 73, 74, 75, 76, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 146, 152, 153, 155, 163, 165, 166, 167 [160] Shujie Luo, Hang Dai, Ling Shao, and Yong Ding. M3DSSD: Monocular 3D single stage object detector. In CVPR, 2021. 19 [161] Zhipeng Luo, Changqing Zhou, Gongjie Zhang, and Shijian Lu. DETR4D: Direct multi- view 3D object detection with sparse attention. arXiv preprint arXiv:2212.07849, 2022. 48 [162] Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, and Wanli Ouyang. Rethinking Pseudo-LiDAR representation. In ECCV, 2020. 29, 42 [163] Xinzhu Ma, Wanli Ouyang, Andrea Simonelli, and Elisa Ricci. 3D object detection from images for autonomous driving: A survey. TPAMI, 2023. 29, 48, 65 [164] Xinzhu Ma, Yongtao Wang, Yinmin Zhang, Zhiyi Xia, Yuan Meng, Zhihui Wang, Haojie Li, and Wanli Ouyang. Towards fair and comprehensive comparisons for image-based 3D object detection. In ICCV, 2023. 48 [165] Xinzhu Ma, Zhihui Wang, Haojie Li, Pengbo Zhang, Wanli Ouyang, and Xin Fan. Accu- rate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In ICCV, 2019. 19, 29, 48, 65 [166] Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, and Wanli In CVPR, Ouyang. Delving into localization errors for monocular 3D object detection. 2021. 26, 34, 36, 53, 55, 56, 57, 75, 119, 121, 122, 126, 152 [167] Yuexin Ma, Tai Wang, Xuyang Bai, Huitong Yang, Yuenan Hou, Yaming Wang, Yu Qiao, Ruigang Yang, Dinesh Manocha, and Xinge Zhu. Vision-centric BEV perception: A survey. arXiv preprint arXiv:2208.02797, 2022. 46, 48, 58, 65 [168] Lachlan MacDonald, Sameera Ramasinghe, and Simon Lucey. Enabling equivariance for arbitrary lie groups. In CVPR, 2022. 63 91 [169] Fabian Manhardt, Wadim Kehl, and Adrien Gaidon. Roi-10D: Monocular lifting of 2D detection to 6D pose and metric shape. In CVPR, 2019. 19 [170] Diego Marcos, Benjamin Kellenberger, Sylvain Lobry, and Devis Tuia. Scale equivariance in CNNs with vector fields. In ICML Workshops, 2018. 28 [171] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In ICCV, 2017. 28 [172] Nathaniel Merrill, Yuliang Guo, Xingxing Zuo, Xinyu Huang, Stefan Leutenegger, Xi Peng, Liu Ren, and Guoquan Huang. Symmetry and uncertainty-aware object SLAM for 6DoF object pose estimation. In CVPR, 2022. 1, 44, 61 [173] Alessio Micheli. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks, 2009. 28 [174] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. IJCV, 2004. 9 [175] Zhixiang Min, Bingbing Zhuang, Samuel Schulter, Buyu Liu, Enrique Dunn, and Manmohan Chandraker. NeurOCS: Neural NOCS supervision for monocular 3D object localization. In CVPR, 2023. 48, 65 [176] Mircea Mironenco and Patrick Forré. 
Lie group decompositions for equivariant neural networks. In ICLR, 2024. 63 [177] SungHo Moon, JinWoo Bae, and SungHoon Im. Rotation matters: Generalized monocular 3D object detection for various camera systems. arXiv preprint arXiv:2310.05366, 2023. 1, 44, 61, 65 [178] Frank Moosmann, Oliver Pink, and Christoph Stiller. Segmentation of 3D LiDAR data In Intelligent Vehicles in non-flat urban environments using a local convexity criterion. Symposium, 2009. 9 [179] Youngmin Oh, Hyung-Il Kim, Seong Tae Kim, and Jung Kim. MonoWAD: Weather-adaptive diffusion model for robust monocular 3D object detection. In ECCV, 2024. 61 [180] Bowen Pan, Jiankai Sun, Ho Leung, Alex Andonian, and Bolei Zhou. Cross-view semantic segmentation for sensing surroundings. RAL, 2020. 48 [181] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is Pseudo-LiDAR needed for monocular 3D object detection? In ICCV, 2021. 1, 29, 35, 36, 37, 44, 61, 127, 132, 150 [182] Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, and Wei Zhan. Time will tell: New outlooks and a baseline for temporal multi-view 3D object detection. In ICLR, 2023. 48, 65, 155 92 [183] Kiru Park, Timothy Patten, and Markus Vincze. Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In ICCV, 2019. 1, 44, 61 [184] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 18, 34, 149 [185] Max Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, and Chris Maddison. Gradient estimation with stochastic softmax tricks. In NeurIPS, 2020. 11, 15 [186] Nadia Payet and Sinisa Todorovic. From contours to 3D object detection and pose estimation. In ICCV, 2011. 9, 29, 47, 64 [187] Bojan Pepik, Michael Stark, Peter Gehler, and Bernt Schiele. Multi-view and 3D deformable part models. TPAMI, 2015. 9, 29 [188] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In ECCV, 2020. 48 [189] Julius Plücker. Analytisch-geometrische Entwicklungen. GD Baedeker, 1828. 73, 74 [190] Marin Pogančić, Anselm Paulus, Vit Musil, Georg Martius, and Michal Rolinek. Differen- tiation of blackbox combinatorial solvers. In ICLR, 2019. 11 [191] Sebastian Prillo and Julian Eisenschlos. Softsort: A continuous relaxation for the argsort operator. In ICML, 2020. 11, 12 [192] Sergey Prokudin, Daniel Kappler, Sebastian Nowozin, and Peter Gehler. Learning to filter object detections. In GCPR, 2017. 6, 7, 8, 9, 11, 12, 16, 17, 18, 24, 105 [193] Aahlad Manas Puli, Lily Zhang, Yoav Wald, and Rajesh Ranganath. Don’t blame dataset shift! shortcut learning due to gradients and cross entropy. In NeurIPS, 2023. 64 [194] Charles Qi, Or Litany, Kaiming He, and Leonidas Guibas. Deep hough voting for 3D object detection in point clouds. In ICCV, 2019. 56 [195] Zengyi Qin, Jinglu Wang, and Yan Lu. MonoGRNet: A geometric reasoning network for 3D object localization. In AAAI, 2019. 19, 20 [196] Yasiru Ranasinghe, Deepti Hegde, and Vishal M Patel. MonoDiff: Monocular 3D object detection and pose estimation with diffusion models. In CVPR, 2024. 
65 [197] Narayanan Elavathur Ranganatha, Hengyuan Zhang, Shashank Venkatramani, Jing-Yan Liao, and Henrik Christensen. SemVecNet: Generalizable vector map generation for arbitrary sensor configurations. 2024. 65 93 [198] Matthias Rath and Alexandru Condurache. Boosting deep neural networks with geometrical prior knowledge: A survey. arXiv preprint arXiv:2006.16867, 2020. 25, 28, 29, 114 [199] Cody Reading, Ali Harakeh, Julia Chae, and Steven Waslander. Categorical depth distribu- tion network for monocular 3D object detection. In CVPR, 2021. 29, 35, 36, 42, 48, 65, 124, 126, 132 [200] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. 8, 9 [201] Konstantinos Rematas, Ira Kemelmacher-Shlizerman, Brian Curless, and Steve Seitz. Soccer on your tabletop. In CVPR, 2018. 6, 25 [202] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015. 6, 8, 9, 25 [203] David Rey, Gérard Subsol, Hervé Delingette, and Nicholas Ayache. Automatic detection and segmentation of evolving processes in 3D medical images: Application to multiple sclerosis. Medical Image Analysis, 2002. 6 [204] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019. 16, 17 [205] Thomas Roddick and Roberto Cipolla. Predicting semantic map representations from images using pyramid occupancy networks. In CVPR, 2020. 48 [206] Azriel Rosenfeld and Mark Thurston. Edge and curve detection for visual scene analysis. IEEE Transactions on Computers, 1971. 9 [207] Andrew Slavin Ross, Weiwei Pan, and Finale Doshi-Velez. Learning qualitatively diverse and interpretable rules for classification. In ICML Workshops, 2018. 64 [208] Shouwei Ruan, Yinpeng Dong, Hang Su, Jianteng Peng, Ning Chen, and Xingxing Wei. Towards viewpoint-invariant visual recognition via adversarial training. In ICCV, 2023. 64 [209] Sitapa Rujikietgumjorn and Robert Collins. Optimized pedestrian detection for multiple and occluded people. In CVPR, 2013. 9, 15, 17 [210] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR, 2019. 64 [211] Avishkar Saha, Oscar Mendez, Chris Russell, and Richard Bowden. Translating images into maps. In ICRA, 2022. 48, 50, 54, 55, 56, 57, 149, 150, 152 [212] Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, David Forsyth, and Anand Bhattad. Shadows don’t lie and lines can’t bend! generative models don’t know 94 projective geometry... for now. In CVPR, 2024. 63 [213] Ashutosh Saxena, Justin Driemeyer, and Andrew Ng. Robotic grasping of novel objects using vision. IJRR, 2008. 1, 6, 25, 44, 61 [214] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub- gradient solver for SVM. In ICML, 2007. 49, 50, 51, 138, 141 [215] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In CVPR, 2019. 1, 9, 29, 47, 48, 64 [216] Xuepeng Shi, Zhixiang Chen, and Tae-Kyun Kim. Distance-normalized unified representa- tion for monocular 3D object detection. In ECCV, 2020. 6, 7, 9, 15, 17, 19, 21, 23, 24, 48, 108, 109 [217] Xuepeng Shi, Zhixiang Chen, and Tae-Kyun Kim. 
Multivariate probabilistic monocular 3D object detection. In WACV, 2023. 47 [218] Xuepeng Shi, Qi Ye, Xiaozhi Chen, Chuangrong Chen, Zhixiang Chen, and Tae-Kyun Kim. Geometry-based distance decomposition for monocular 3D object detection. In ICCV, 2021. 26, 35, 36, 37, 128, 129, 132 [219] Changyong Shu, Fisher Yu, and Yifan Liu. 3DPPE: 3D point positional encoding for multi-camera 3D object detection transformers. In ICCV, 2023. 48, 59, 65, 155 [220] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Manuel Antequera, and Peter Kontschieder. Disentangling monocular 3D object detection: From single to multi-class recognition. TPAMI, 2020. 6, 7, 9, 18, 19, 20, 25, 34, 35, 36, 128, 132 [221] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Peter Kontschieder, and Elisa Ricci. Are we missing confidence in Pseudo-LiDAR methods for monocular 3D object detection? In ICCV, 2021. 29, 35, 36, 109, 130 [222] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Manuel López-Antequera, and Peter Kontschieder. Disentangling monocular 3D object detection. In ICCV, 2019. 7, 19, 20, 34, 128 [223] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Elisa Ricci, and Peter Kontschieder. Towards generalization across depth for monocular 3D object detection. In ECCV, 2020. 9, 19, 20, 25, 26, 29 [224] Samik Some, Mithun Das Gupta, and Vinay Namboodiri. Determinantal point process as an alternative to NMS. In BMVC, 2020. 9, 15, 17 [225] Ivan Sosnovik, Artem Moskalev, and Arnold Smeulders. DISCO: accurate discrete scale convolutions. In BMVC, 2021. 39, 41 [226] Ivan Sosnovik, Artem Moskalev, and Arnold Smeulders. Scale equivariance improves 95 siamese tracking. In WACV, 2021. 28, 32, 34, 40, 122, 123, 124 [227] Ivan Sosnovik, Michał Szmaja, and Arnold Smeulders. Scale-equivariant steerable networks. In ICLR, 2020. 27, 28, 31, 32, 33, 38, 39, 41, 119, 120, 122, 123, 132 [228] Christoph Strecha, Rik Fransens, and Luc Van Gool. Wide-baseline stereo from multiple views: a probabilistic account. In CVPR, 2004. 66 [229] Christoph Strecha, Tinne Tuytelaars, and Luc Van Gool. Dense matching of multiple wide- baseline views. In ICCV, 2003. 66 [230] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020. 34, 42, 54 [231] Mingxing Tan, Ruoming Pang, and Quoc Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020. 150 [232] Yunlei Tang, Sebastian Dorn, and Chiragkumar Savani. Center3D: Center-based monocular 3D object detection with joint depth understanding. arXiv preprint arXiv:2005.13423, 2020. 8, 9, 25, 26, 29 [233] Yingqi Tang, Zhaotie Meng, Guoliang Chen, and Erkang Cheng. SimPB: A single model for 2D and 3D object detection from multiple cameras. In ECCV, 2024. 65 [234] Seth Teller and Michael Hohmeyer. Determining the lines through four lines. Journal of graphics tools, 1999. 74 [235] Damien Teney, Ehsan Abbasnejad, and Anton Hengel. Unshuffling data for improved gen- eralization in visual question answering. In ICCV, 2021. 64 [236] Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior OOD generalization. In CVPR, 2022. 
64 [237] Damien Teney, Yong Lin, Seong Joon Oh, and Ehsan Abbasnejad. ID and OOD performance are sometimes inversely correlated on real-world datasets. In NeurIPS, 2023. 63, 64 [238] Sugirtha Thayalan-Vaz, Sridevi M, Khailash Santhakumar, B Ravi Kiran, Thomas Gauthier, and Senthil Yogamani. Exploring 2D data augmentation for 3D monocular object detection. arXiv preprint arXiv:2104.10786, 2021. 26, 29 [239] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018. 28 96 [240] Rishabh Tiwari and Pradeep Shenoy. Overcoming simplicity bias in deep networks using a feature sieve. In ICML, 2023. 64 [241] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In CVPR, 2018. 19, 24 [242] Alexandru Vasile and Richard Marino. Pose-independent automatic target detection and recognition using 3D laser radar imagery. Lincoln laboratory journal, 2005. 9 [243] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. 9 [244] Li Wan, David Eigen, and Rob Fergus. End-to-end integration of a convolution network, deformable parts model and non-maximum suppression. In CVPR, 2015. 9, 16, 17 [245] Li Wang, Liang Du, Xiaoqing Ye, Yanwei Fu, Guodong Guo, Xiangyang Xue, Jianfeng Feng, and Li Zhang. Depth-conditioned dynamic message propagation for monocular 3D object detection. In CVPR, 2021. 36 [246] Li Wang, Li Zhang, Yi Zhu, Zhi Zhang, Tong He, Mu Li, and Xiangyang Xue. Progressive coordinate transforms for monocular 3D object detection. In NeurIPS, 2021. 35, 36, 42, 132 [247] Rui Wang, Robin Walters, and Rose Yu. Incorporating symmetry into deep dynamics models for improved generalization. In ICLR, 2021. 28 [248] Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. StreamPETR: Exploring object-centric temporal modeling for efficient multi-view 3D object detection. In ICCV, 2023. 48, 65 [249] Shuo Wang, Xinhai Zhao, Hai-Ming Xu, Zehui Chen, Dameng Yu, Jiahao Chang, Zhen Yang, and Feng Zhao. Towards domain generalization for multi-view 3D object detection in bird-eye-view. In CVPR, 2023. 65 [250] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. FCOS3D: Fully convolutional one-stage monocular 3D object detection. In ICCV Workshops, 2021. 155 [251] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In CoRL, 2021. 155 [252] Xueqing Wang, Diankun Zhang, Haoyu Niu, and Xiaojun Liu. Segmentation can aid detection: Segmentation-guided single stage detection for 3D point cloud. Electronics, 2023. 48 [253] Xinjiang Wang, Shilong Zhang, Zhuoran Yu, Litong Feng, and Wayne Zhang. Scale- equalizing pyramid convolution for object detection. In CVPR, 2020. 127 [254] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Weinberger. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object 97 detection for autonomous driving. In CVPR, 2019. 9, 29, 43, 48, 65 [255] Yan Wang, Xiangyu Chen, Yurong You, Li Li, Bharath Hariharan, Mark Campbell, Kilian Weinberger, and Wei-Lun Chao. Train in Germany, test in the USA: Making 3D object detectors generalize. In CVPR, 2020. 65, 129 [256] Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. FrustumFormer: Adaptive instance-aware resampling for multi-view 3D detection. In CVPR, 2023. 
59 [257] Yue Wang, Vitor Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In CoRL, 2021. 38, 155 [258] Zhou Wang, Alan Bovik, Hamid Sheikh, and Eero Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004. 39 [259] Zitian Wang, Zehao Huang, Jiahui Fu, Naiyan Wang, and Si Liu. Object as Query: Lifting any 2D object detector to 3D detection. In ICCV, 2023. 59 [260] Zeyu Wang, Dingwen Li, Chenxu Luo, Cihang Xie, and Xiaodong Yang. DistillBEV: In Boosting multi-camera 3D object detection with cross-modal knowledge distillation. ICCV, 2023. 48, 65 [261] Zengran Wang, Chen Min, Zheng Ge, Yinhao Li, Zeming Li, Hongyu Yang, and Di Huang. STS: Surround-view temporal stereo for multi-view 3D detection. In AAAI, 2023. 48, 65, 155 [262] Maurice Weiler, Patrick Forré, Erik Verlinde, and Max Welling. Coordinate independent convolutional networks–isometry and gauge equivariant convolutions on riemannian mani- folds. arXiv preprint arXiv:2106.06020, 2021. 28 [263] Maurice Weiler, Fred Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant CNNs. In CVPR, 2018. 28 [264] Mark van der Wilk, Matthias Bauer, ST John, and James Hensman. Learning invariances using the marginal likelihood. In NeurIPS, 2018. 28 [265] Daniel Worrall and Gabriel Brostow. Cubenet: Equivariance to 3D rotation and translation. In ECCV, 2018. 28, 29, 38 [266] Daniel Worrall, Stephan Garbin, Daniyar Turmukhambetov, and Gabriel Brostow. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2017. 28 [267] Daniel Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. In NeurIPS, 2019. 28, 39, 41, 121 [268] Chen Wu. Waymo keynote talk, CVPR workshop on autonomous driving at 17:20. https: //www.youtube.com/watch?v=fXsbI2VkHgc, 2023. Accessed: 2023-11-11. 2, 45 98 [269] Pengxiang Wu, Siheng Chen, and Dimitris Metaxas. MotionNet: Joint perception and motion prediction for autonomous driving based on bird’s eye view maps. In CVPR, 2020. 9 [270] Yuxin Wu and Justin Johnson. Rethinking “batch” in batchnorm. arXiv preprint arXiv:2105.07576, 2021. 127 [271] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Subcategory-aware convolu- tional neural networks for object proposals and detection. In WACV, 2017. 19 [272] Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, and Jose Alvarez. Mˆ2BEV: Multi-camera joint 3D detection and segmentation with unified birds-eye view representation. arXiv preprint arXiv:2204.05088, 2022. 48, 58 [273] Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, and Xiang Bai. CAPE: Camera view position embedding for multi-view 3D object detection. In CVPR, 2023. 59, 155 [274] Chenfeng Xu, Huan Ling, Sanja Fidler, and Or Litany. 3Difftection: 3D object detection with geometry-aware diffusion features. In CVPR, 2024. 65 [275] Junkai Xu, Liang Peng, Haoran Cheng, Hao Li, Wei Qian, Ke Li, Wenxiao Wang, and Deng Cai. MonoNeRD: NeRF-like representations for monocular 3D object detection. In ICCV, 2023. 47, 65 [276] Keyulu Xu, Mozhi Zhang, Jingling Li, Simon Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. In ICLR, 2021. 63, 64 [277] Qiangeng Xu, Yin Zhou, Weiyue Wang, Charles Qi, and Dragomir Anguelov. 
SPG: Unsu- pervised domain adaptation for 3D object detection via semantic point generation. In ICCV, 2021. 65 [278] Yichong Xu, Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, and Zheng Zhang. Scale-invariant convolutional neural networks. arXiv preprint arXiv:1411.6369, 2014. 28 [279] Longfei Yan, Pei Yan, Shengzhou Xiong, Xuanyu Xiang, and Yihua Tan. MonoCD: Monoc- ular 3D object detection with complementary depths. In CVPR, 2024. 65 [280] Gengshan Yang and Deva Ramanan. Upgrading optical flow to 3D scene flow through optical expansion. In CVPR, 2020. 34 [281] Haitao Yang, Zaiwei Zhang, Xiangru Huang, Min Bai, Chen Song, Bo Sun, Li Erran Li, and Qixing Huang. LiDAR-based 3D object detection via hybrid 2D semantic scene generation. arXiv preprint arXiv:2304.01519, 2023. 48, 58, 59 [282] Jihan Yang, Shaoshuai Shi, Zhe Wang, Hongsheng Li, and Xiaojuan Qi. ST3D: Self-training for unsupervised domain adaptation on 3D object detection. In CVPR, 2021. 65 99 [283] Jiayu Yang, Enze Xie, Miaomiao Liu, and Jose Alvarez. Parametric depth based feature representation learning for object detection and segmentation in bird’s-eye view. In ICCV, 2023. 59 [284] Xiaodong Yang, Zhuang Ma, Zhiyu Ji, and Zhe Ren. GEDepth: Ground embedding for monocular depth estimation. In ICCV, 2023. 67, 159, 160 [285] Yue Yao, Shengchao Yan, Daniel Goehring, Wolfram Burgard, and Joerg Reichardt. Im- proving out-of-distribution generalization of trajectory prediction for autonomous driving via polynomial representations. arXiv preprint arXiv:2407.13431, 2024. 65 [286] Shingo Yashima, Teppei Suzuki, Kohta Ishikawa, Ikuro Sato, and Rei Kawakami. Feature space particle inference for neural network ensembles. In ICML, 2022. 64 [287] Xiaoqing Ye, Liang Du, Yifeng Shi, Yingying Li, Xiao Tan, Jianfeng Feng, Errui Ding, and Shilei Wen. Monocular 3D object detection via feature domain adaptation. In ECCV, 2020. 19 [288] Raymond Yeh, Yuan-Ting Hu, and Alexander Schwing. Chirality nets for human pose regression. In NeurIPS, 2019. 28 [289] Jingru Yi, Pengxiang Wu, Bo Liu, Qiaoying Huang, Hui Qu, and Dimitris Metaxas. Oriented object detection in aerial images with box boundary-aware vectors. In WACV, 2021. 55 [290] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3D object detection and tracking. In CVPR, 2021. 1, 47, 64 [291] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2015. 38, 39, 41, 121 [292] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, 2018. 123 [293] Xiang Yu, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convo- lutional neural network for 6D object pose estimation in cluttered scenes. In RSS, 2018. 1, 44, 61 [294] Syed Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for fast image restoration and enhancement. TPAMI, 2022. 153 [295] Arthur Zhang, Chaitanya Eranki, Christina Zhang, Raymond Hong, Pranav Kalyani, Lochana Kalyanaraman, Arsh Gamare, Maria Esteva, and Joydeep Biswas. Towards robust 3D robot perception in urban environments: The UT Campus Object Dataset (CODa). In IROS, 2023. 164 [296] Hao Zhang, Hongyang Li, Xingyu Liao, Feng Li, Shilong Liu, Lionel Ni, and Lei Zhang. DA-BEV: Depth aware BEV transformer for 3D object detection. arXiv preprint 100 arXiv:2302.13002, 2023. 48, 65 [297] Jason Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, and Shubham Tulsiani. Cameras as rays: Pose estimation via ray diffusion. 
In ICLR, 2024. 65 [298] Jianming Zhang, Stan Sclaroff, Zhe Lin, Xiaohui Shen, Brian Price, and Radomir Mech. Unconstrained salient object detection via proposal subset optimization. In CVPR, 2016. 9, 16, 17 [299] Jinqing Zhang, Yanan Zhang, Qingjie Liu, and Yunhong Wang. SA-BEV: Generating semantic-aware bird’s-eye-view feature for multi-view 3D object detection. In ICCV, 2023. 59 [300] Renrui Zhang, Han Qiu, Tai Wang, Xuanzhuo Xu, Ziyu Guo, Yu Qiao, Peng Gao, and Hongsheng Li. MonoDETR: Depth-guided transformer for monocular 3D object detection. In ICCV, 2023. 55, 56, 57, 154, 157 [301] Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are different: Flexible monocular 3D object detection. In CVPR, 2021. 25, 29, 35, 36, 48, 65, 132 [302] Yinmin Zhang, Xinzhu Ma, Shuai Yi, Jun Hou, Zhihui Wang, Wanli Ouyang, and Dan Xu. Learning geometry-guided depth via projective modeling for monocular 3D object detection. arXiv preprint arXiv:2107.13931, 2021. 26 [303] Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie Zhou, and Jiwen Lu. BEVerse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022. 45, 48, 50, 53, 54, 55, 58, 59, 60, 65, 70, 149, 150, 151, 155 [304] Allan Zhou, Tom Knowles, and Chelsea Finn. Meta-learning symmetries by reparameteri- zation. In ICLR, 2021. 28 [305] Brady Zhou and Philipp Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. In CVPR, 2022. 48 [306] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019. 25 [307] Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, and Qinhong Jiang. MonoEF: Extrinsic parameter free monocular 3D object detection. TPAMI, 2021. 26, 29, 35, 36, 48, 61, 65, 132 [308] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced grouping and sampling for point cloud 3D object detection. In CVPR Workshop, 2019. 2, 45, 59 [309] Wei Zhu, Qiang Qiu, Robert Calderbank, Guillermo Sapiro, and Xiuyuan Cheng. Scale-equivariant neural networks with decomposed convolutional filters. arXiv preprint 101 arXiv:1909.11193, 2019. 27, 28, 32, 132 [310] Zijian Zhu, Yichi Zhang, Hai Chen, Yinpeng Dong, Shu Zhao, Wenbo Ding, Jiachen Zhong, and Shibao Zheng. Understanding the robustness of 3D object detection with bird’s-eye-view representations in autonomous driving. In CVPR, 2023. 1, 44, 61 [311] Zhuofan Zong, Dongzhi Jiang, Guanglu Song, Zeyue Xue, Jingyong Su, Hongsheng Li, and Yu Liu. Temporal enhanced training of multi-view 3D object detector via historical object prediction. In ICCV, 2023. 45, 48, 55, 59, 60, 65, 149, 150, 151, 155 [312] Zhikang Zou, Xiaoqing Ye, Liang Du, Xianhui Cheng, Xiao Tan, Li Zhang, Jianfeng Feng, Xiangyang Xue, and Errui Ding. The devil is in the task: Exploiting reciprocal appearance- localization features for monocular 3D object detection. In ICCV, 2021. 35, 36 [313] Philip Zwicke and Imre Kiss. A new implementation of the mellin transform and its application to radar classification of ships. TPAMI, 1983. 27, 28, 33, 39, 41 102 APPENDIX A PUBLICATIONS First-Author Publications. A list of all first-authored peer-reviewed publications during the Ph.D. program listed in reverse chronological order. • Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren and Xiaoming Liu. “CHARM3R: Towards Camera Height Agnostic Monocular 3D Object Detector ", ICCV, 2025 (under review). 
• Abhinav Kumar, Yuliang Guo, Xinyu Huang, Liu Ren and Xiaoming Liu. “SeaBird: Segmentation in Bird’s View with Dice Loss Improves 3D Detection of Large Objects", CVPR, 2024.
• Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami and Xiaoming Liu. “DEVIANT: Depth Equivariant Network for Monocular 3D Object Detection", ECCV, 2022.
• Abhinav Kumar, Garrick Brazil and Xiaoming Liu. “GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection", CVPR, 2021.
• Abhinav Kumar∗, Tim Marks∗, Wenxuan Mou∗, Ye Wang, Michael Jones, Anoop Cherian, Toshi Koike-Akino, Xiaoming Liu and Chen Feng. “LUVLi Face Alignment: Estimating Location, Uncertainty and Visibility Likelihood", CVPR, 2020.
Other Publications.
• Yunfei Long, Abhinav Kumar, Xiaoming Liu and Daniel Morris. “RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection", CVPR, 2025.
• Yuliang Guo, Abhinav Kumar, Chen Zhao, Ruoyu Wang, Xinyu Huang, and Liu Ren. “SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction", ECCV, 2024.
• Shengjie Zhu, Girish Ganesan, Abhinav Kumar and Xiaoming Liu. “RePLAy: Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry", ECCV, 2024.
• Shengjie Zhu, Abhinav Kumar, Masa Hu and Xiaoming Liu. “Tame a Wild Camera: In-the-Wild Monocular Camera Calibration", NeurIPS, 2023.
• Vishal Asnani, Abhinav Kumar, Suya You and Xiaoming Liu. “PrObeD: Proactive 2D Object Detection Wrapper", NeurIPS, 2023.
• Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson and Georgia Gkioxari. “Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild", CVPR, 2023.
• Yunfei Long, Abhinav Kumar, Daniel Morris, Xiaoming Liu, Marcos Castro and Punarjay Chakravarty. “RADIANT: Radar Image Association Network for 3D Object Detection", AAAI, 2023.
• Thiago Serra, Xin Yu, Abhinav Kumar, and Srikumar Ramalingam. “Scaling Up Exact Neural Network Compression by ReLU Stability", NeurIPS, 2021.
• Abhinav Kumar∗, Tim Marks∗, Wenxuan Mou∗, Chen Feng and Xiaoming Liu. “UGLLI Face Alignment: Estimating Uncertainty with Gaussian Log-Likelihood Loss". ICCV Workshops, 2019.
APPENDIX B
GROOMED-NMS APPENDIX
B.1 Detailed Explanation of NMS as a Matrix Operation
The rescoring process of the classical NMS is greedy and set-based [192], and calculates the rescore for a box i (Line 10 of Alg. 1) as

r_i = s_i \prod_{j \in d_{<i}} \left(1 - p(o_{ij})\right),   (B.1)

where d_{<i} is defined as the box indices sampled from d having higher scores than box i. For example, let us consider d = {1, 5, 7, 9}. Then, for i = 7, d_{<i} = {1, 5}, while for i = 1, d_{<i} = 𝜙, with 𝜙 denoting the empty set. This is possible since we had sorted the scores s and O in decreasing order (Lines 2-3 of Alg. 2) to remove the non-differentiable hard argmax operation of the classical NMS (Line 6 of Alg. 1). Classical NMS only takes the overlap with unsuppressed boxes into account. Therefore, we generalize Eq. (B.1) by accounting for the effect of all (suppressed and unsuppressed) boxes as

r_i = s_i \prod_{j=1}^{i-1} \left(1 - p(o_{ij})\, r_j\right).   (B.2)

The presence of r_j on the RHS of Eq. (B.2) prevents suppressed boxes (r_j ≈ 0) from influencing other boxes hugely. Let us say we have a box b_2 with a high overlap with an unsuppressed box b_1. The classical NMS with a threshold pruning function assigns r_2 = 0, while Eq. (B.2) assigns r_2 a small non-zero value with threshold pruning. Although Eq.
(B.2) keeps 𝑟𝑖 ≥ 0, getting a closed-form recursion in r is not easy because of the product operation. To get a closed-form recursion with addition/subtraction in r, we first carry out the polynomial multiplication and then ignore the higher-order terms as 𝑖−1 ∑︁ 𝑗=1 𝑖−1 ∑︁ 𝑗=1 1 − 𝑟𝑖 = 𝑠𝑖 (cid:169) (cid:173) (cid:171) 1 − ≈ 𝑠𝑖 (cid:169) (cid:173) (cid:171) 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 + O (𝑛2)(cid:170) (cid:174) (cid:172) 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 (cid:170) (cid:174) (cid:172) 105 Table B.1 Results on using Oracle NMS scores on KITTI Val 1 cars detection. [Key: Best] (− (cid:17)) AP 3D|𝑅40 NMS Scores (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Kinematic (Image) 18.29 13.55 10.13 25.72 18.82 14.48 93.69 84.07 67.14 9.36 9.93 6.40 12.27 10.43 8.72 99.18 95.66 85.77 Oracle IoU2D 87.93 73.10 60.91 93.47 83.61 71.31 80.99 78.38 67.66 Oracle IoU3D AP BEV|𝑅40 AP 2D|𝑅40 (cid:17)) (− (− ≈ 𝑠𝑖 − 𝑖−1 ∑︁ 𝑗=1 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 . (B.3) Dropping the 𝑠𝑖 in the second term of Eq. (B.3) helps us get a cleaner form of Eq. (B.7). Moreover, it does not change the nature of the NMS since the subtraction keeps the relation 𝑟𝑖 ≤ 𝑠𝑖 intact as 𝑝(𝑜𝑖 𝑗 ) and 𝑟 𝑗 are both between [0, 1]. We can also reach Eq. (B.3) directly as follows. Classical NMS suppresses a box which has a high IoU2D overlap with any of the unsuppressed boxes (𝑟 𝑗 ≈ 1) to zero. We consider any as a logical non-differentiable OR operation and use logical OR (cid:212) operator’s differentiable relaxation as (cid:205) [106, 124]. We next use this relaxation with the other expression r ≤ s. When a box shows overlap with more than two unsuppressed boxes, the term 𝑖−1 (cid:205) 𝑗=1 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 > 1 in Eq. (B.3) or when a box shows high overlap with one unsuppressed box, the term 𝑠𝑖 < 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 . In both of these cases, 𝑟𝑖 < 0. So, we lower bound Eq. (B.3) with a max operation to ensure that 𝑟𝑖 ≥ 0. Thus, 𝑠𝑖 − 𝑖−1 ∑︁ 𝑗=1 𝑟𝑖 ≈ max (cid:169) (cid:173) (cid:171) 𝑝(𝑜𝑖 𝑗 )𝑟 𝑗 , 0(cid:170) (cid:174) (cid:172) We write the rescores r in a matrix formulation as . (B.4) ≈ max 𝑟1     𝑟2    𝑟3   ...      𝑟𝑛                   (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) − 𝑠1 𝑠2 𝑠3 ... 𝑠𝑛                                 0 𝑝(𝑜21) 0 0 𝑝(𝑜31) 𝑝(𝑜32) ... ... 𝑝(𝑜𝑛1) 𝑝(𝑜𝑛2)                 . . . 0 . . . 0               . . . 0   . . . 0 ... ... We next write the above equation compactly as r ≈ max(s − Pr, 0), 106 , 𝑟1     𝑟2    𝑟3   ...      𝑟𝑛                   .                 0     0    0   ...      0   (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) (B.5) (B.6) where P, called the Prune Matrix, is obtained by element-wise operation of the pruning function 𝑝 on O (cid:108) (cid:108) . Maximum operation makes Eq. (B.6) non-linear [112] and, thus, difficult to solve. However, for a differentiable NMS layer, we need to avoid the recursion. Therefore, we first solve Eq. (B.6) assuming the max operation is not present which gives us the solution r ≈ (I + P)−1 s. In general, this solution is not necessarily bounded between 0 and 1. Hence, we clip it explicitly to obtain the approximation r ≈ (cid:4)(I + P)−1 s(cid:7) , (B.7) which we use as the solution to Eq. (B.6). B.2 Loss Functions We now detail out the loss functions used for training. 
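As a brief recap of Appendix B.1 before detailing the losses, the closed-form rescoring of Eqs. (B.5)-(B.7) can be sketched in a few lines of differentiable PyTorch. This is a minimal illustration rather than the released GrooMeD-NMS code: it assumes the boxes are already sorted by decreasing score, and it uses an illustrative linear-threshold pruning function p(o) = o * 1[o > N_t].

import torch

def groomed_rescore(scores, overlaps, nms_thresh=0.4):
    # scores:   (n,) class scores sorted in decreasing order.
    # overlaps: (n, n) IoU2D matrix O of the sorted boxes.
    # Prune matrix P (Eq. B.5): pruning function applied elementwise to O, kept
    # strictly lower-triangular so box i only sees higher-scored boxes j < i.
    P = ((overlaps > nms_thresh).float() * overlaps).tril(diagonal=-1)
    eye = torch.eye(scores.numel(), device=scores.device)
    # Closed-form solution of r = max(s - P r, 0): solve (I + P) r = s, then clip (Eq. B.7).
    r = torch.linalg.solve(eye + P, scores.unsqueeze(-1)).squeeze(-1)
    return r.clamp(0.0, 1.0)

Because every step (the matrix solve and the clipping) is differentiable, the rescored r can be supervised directly by the after-NMS loss introduced next.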
The losses on the boxes before NMS, L𝑏𝑒 𝑓 𝑜𝑟𝑒, is given by [17] where L𝑏𝑒 𝑓 𝑜𝑟𝑒 = L𝑐𝑙𝑎𝑠𝑠 + L2D + 𝑏𝑐𝑜𝑛 𝑓 L3D + 𝜆𝑐𝑜𝑛 𝑓 (1 − 𝑏𝑐𝑜𝑛 𝑓 ), L𝑐𝑙𝑎𝑠𝑠 = CE(𝑏𝑐𝑙𝑎𝑠𝑠, 𝑔𝑐𝑙𝑎𝑠𝑠), L2D = − log(IoU(𝑏2D, 𝑔2D)), L3D = Smooth-L1(𝑏3D, 𝑔3D) + 𝜆𝑎CE( [𝑏𝜃𝑎, 𝑏𝜃ℎ], [𝑔𝜃𝑎, 𝑔𝜃ℎ]). (B.8) (B.9) (B.10) (B.11) 𝑏𝑐𝑜𝑛 𝑓 is the predicted self-balancing confidence of each box 𝑏, while 𝑏𝜃𝑎 and 𝑏𝜃ℎ are its orientation bins [17]. 𝑔 denotes the ground-truth. 𝜆𝑐𝑜𝑛 𝑓 is the rolling mean of most recent L3D losses per mini- batch [17], while 𝜆𝑎 denotes the weight of the orientation bins loss. CE and Smoooth-L1 denote the Cross Entropy and Smooth L1 loss respectively. Note that we apply 2D and 3D regression losses as well as the confidence losses only on the foreground boxes. 107 Table B.2 Detailed comparisons with other NMS during inference on KITTI Val 1 cars. IoU3D ≥ 0.7 (cid:17)) (− Inference NMS (− (cid:17)) AP 3D| 𝑅40 AP BEV| 𝑅40 Classical Soft [12] (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 18.28 13.55 10.13 25.72 18.82 14.48 54.70 39.33 31.25 60.87 44.36 34.48 Kinematic (Image) [17] Kinematic (Image) [17] 18.29 13.55 10.13 25.71 18.81 14.48 54.70 39.33 31.26 60.87 44.36 34.48 Kinematic (Image) [17] Distance [216] 18.25 13.53 10.11 25.71 18.82 14.48 54.70 39.33 31.26 60.87 44.36 34.48 18.26 13.51 10.10 25.67 18.77 14.44 54.59 39.25 31.18 60.78 44.28 34.40 Kinematic (Image) [17] GrooMeD 19.67 14.31 11.27 27.38 19.75 15.93 55.64 41.08 32.91 61.85 44.98 36.31 Classical GrooMeD-NMS 19.67 14.31 11.27 27.38 19.75 15.93 55.64 41.08 32.91 61.85 44.98 36.31 Soft [12] GrooMeD-NMS Distance [216] 19.67 14.31 11.27 27.38 19.75 15.93 55.64 41.08 32.91 61.85 44.98 36.31 GrooMeD-NMS 19.67 14.32 11.27 27.38 19.75 15.92 55.62 41.07 32.89 61.83 44.98 36.29 GrooMeD-NMS AP BEV| 𝑅40 AP 3D| 𝑅40 GrooMeD (− IoU3D ≥ 0.5 (cid:17)) (− As explained in Sec. 2.4.3, the loss on the boxes after NMS, L𝑎 𝑓 𝑡𝑒𝑟, is the Imagewise AP-Loss, which is given by L𝑎 𝑓 𝑡𝑒𝑟 = L𝐼𝑚𝑎𝑔𝑒𝑤𝑖𝑠𝑒 = 1 𝑁 𝑁 ∑︁ AP(r(𝑚), target(B (𝑚))), 𝑚=1 Let 𝜆 be the weight of the L𝑎 𝑓 𝑡𝑒𝑟 term. Then, our overall loss function is given by L = L𝑏𝑒 𝑓 𝑜𝑟𝑒 + 𝜆L𝑎 𝑓 𝑡𝑒𝑟 = L𝑐𝑙𝑎𝑠𝑠 + L2D + 𝑏𝑐𝑜𝑛 𝑓 L3D + 𝜆𝑐𝑜𝑛 𝑓 (1 − 𝑏𝑐𝑜𝑛 𝑓 ) + 𝜆L𝐼𝑚𝑎𝑔𝑒𝑤𝑖𝑠𝑒 = CE(𝑏𝑐𝑙𝑎𝑠𝑠, 𝑔𝑐𝑙𝑎𝑠𝑠) − log(IoU(𝑏2D, 𝑔2D)) + 𝑏𝑐𝑜𝑛 𝑓 Smooth-L1(𝑏3D, 𝑔3D) + 𝜆𝑎 𝑏𝑐𝑜𝑛 𝑓 CE( [𝑏𝜃𝑎, 𝑏𝜃ℎ], [𝑔𝜃𝑎, 𝑔𝜃ℎ]) (B.12) (B.13) (B.14) + 𝜆𝑐𝑜𝑛 𝑓 (1 − 𝑏𝑐𝑜𝑛 𝑓 ) + 𝜆L𝐼𝑚𝑎𝑔𝑒𝑤𝑖𝑠𝑒. (B.15) We keep 𝜆𝑎 = 0.35 following [17] and 𝜆 = 0.05. Clearly, all our losses and their weights are identical to [17] except L𝐼𝑚𝑎𝑔𝑒𝑤𝑖𝑠𝑒. B.3 Additional Experiments and Results We now provide additional details and results evaluating our system’s performance. B.3.1 Training Training images are augmented using random flipping with probability 0.5 [17]. Adam opti- mizer [102] is used with batch size 2, weight-decay 5 × 10−4 and gradient clipping of 1 [15, 17]. 108 Warmup starts with a learning rate 4 × 10−3 following a poly learning policy with power 0.9 [17]. Warmup and full training phases take 80𝑘 and 50𝑘 mini-batches respectively for Val 1 and Val 2 Splits [17] while take 160𝑘 and 100𝑘 mini-batches for Test Split. B.3.2 KITTI Val 1 Oracle NMS Experiments As discussed in Sec. 2.1, to understand the effects of an inference-only NMS on 2D and 3D object detection, we conduct a series of oracle experiments. We create an oracle NMS by taking the Val Car boxes of KITTI Val 1 Split from the baseline Kinematic (Image) model before NMS and replace their scores with their true IoU2D or IoU3D with the ground-truth, respectively. 
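A minimal sketch of this oracle scoring is given below, assuming a generic overlap routine iou_fn (IoU2D or IoU3D, depending on the experiment) and the classical NMS used elsewhere in this chapter; both names are placeholders for illustration.

import numpy as np

def oracle_scores(pred_boxes, gt_boxes, iou_fn):
    # Each predicted box is rescored with its best overlap against any ground-truth box.
    scores = np.zeros(len(pred_boxes))
    for i, box in enumerate(pred_boxes):
        scores[i] = max((iou_fn(box, gt) for gt in gt_boxes), default=0.0)
    return scores

# classical_nms(pred_boxes, oracle_scores(pred_boxes, gt_boxes, iou_fn)) is then run
# exactly as in inference, with the oracle scores replacing the classification scores.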
Note that this corresponds to the oracle because we do not know the ground-truth boxes during inference. We then pass the boxes with the oracle scores through the classical NMS and report the results in Tab. B.1. The results show that the AP3D increases by a staggering > 60 AP on Mod cars when we use oracle IoU3D as the NMS score. On the other hand, we only see an increase in AP 2D by ≈ 11 AP on Mod cars when we use oracle IoU2D as the NMS score. Thus, the relative effect of using oracle IoU3D NMS scores on 3D detection is more significant than using oracle IoU2D NMS scores on 2D detection. In other words, the mismatch is greater between classification and 3D localization compared to the mismatch between classification and 2D localization. B.3.3 KITTI Val 1 3D Object Detection Comparisons with other NMS. We compare GrooMeD-NMS with the other NMS—classical, Soft [12] and Distance-NMS [216] and report the detailed results in Tab. B.2. We use the publicly released Soft-NMS code and Distance-NMS code from the respective authors. The Distance- NMS model uses the class confidence scores divided by the uncertainty in 𝑧 (the most erroneous dimension in 3D localization [221]) of a box as the Distance-NMS [216] input. Our model does not predict the uncertainty in 𝑧 of a box but predicts its self-balancing confidence (the 3D localization score). Therefore, we use the class confidence scores multiplied by the self-balancing confidence as the Distance-NMS input. The results in Tab. B.2 show that NMS inclusion in the training pipeline benefits the perfor- 109 Table B.3 Sensitivity to NMS threshold 𝑁𝑡 on KITTI Val 1 cars. [Key: Best] (− (cid:17)) AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard 𝑁𝑡 = 0.3 17.49 13.32 10.54 26.07 18.94 14.61 𝑁𝑡 =0.4 19.67 14.32 11.27 27.38 19.75 15.92 𝑁𝑡 = 0.5 19.65 13.93 11.09 26.15 19.15 14.71 (− Table B.4 Sensitivity to valid box threshold 𝑣 on KITTI Val 1 cars. [Key: Best] (− (− (cid:17)) AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard 𝑣 = 0.01 13.71 9.65 7.24 17.73 12.47 9.36 𝑣 = 0.1 19.37 13.99 10.92 26.95 19.84 15.40 𝑣 = 0.2 19.65 14.31 11.24 27.35 19.73 15.89 𝑣 = 0.3 19.67 14.32 11.27 27.38 19.75 15.92 𝑣 = 0.4 19.67 14.33 11.28 27.38 19.76 15.93 𝑣 = 0.5 19.67 14.33 11.28 27.38 19.76 15.93 𝑣 = 0.6 19.67 14.33 11.29 27.39 19.77 15.95 mance, unlike [12], which suggests otherwise. Training with GrooMeD-NMS helps because the network gets an additional signal through the GrooMeD-NMS layer whenever the best-localized box corresponding to an object is not selected. Moreover, Tab. B.2 suggests that we can replace GrooMeD-NMS with the classical NMS in inference as the performance is almost the same even at IoU3D = 0.5. How good is the classical NMS approximation? GrooMeD-NMS uses several approximations to arrive at the matrix solution Eq. (B.7). We now compare how good these approximations are with the classical NMS. Interestingly, Tab. B.2 shows that GrooMeD-NMS is an excellent approximation to the classical NMS as the performance does not degrade after changing the NMS in inference. B.3.4 KITTI Val 1 Sensitivity Analysis There are a few adjustable parameters for the GrooMeD-NMS, such as the NMS threshold 𝑁𝑡, valid box threshold 𝑣, the maximum group size 𝛼, the weight 𝜆 for the L𝑎 𝑓 𝑡𝑒𝑟, and 𝛽. We carry out a sensitivity analysis to understand how these parameters affect performance and speed, and how sensitive the algorithm is to these parameters. Sensitivity to NMS Threshold. We show the sensitivity to NMS threshold 𝑁𝑡 in Tab. B.3. The results in Tab. 
B.3 show that the optimal 𝑁𝑡 = 0.4. This is also the 𝑁𝑡 in [15, 17]. 110 Figure B.1 Sensitivity to group size 𝛼 on KITTI Val 1 Moderate cars. Sensitivity to Valid Box Threshold. We next show the sensitivity to valid box threshold 𝑣 in Tab. B.4. Our choice of 𝑣 = 0.3 performs close to the optimal choice. Sensitivity to Maximum Group Size. Grouping has a parameter group size (𝛼). We vary this parameter and report AP 3D|𝑅40 and AP BEV|𝑅40 at two different IoU3D thresholds on Moderate cars of KITTI Val 1 Split in Fig. B.1. We note that the best AP 3D|𝑅40 performance is obtained at 𝛼 = 100 and we, therefore, set 𝛼 = 100 in our experiments. Sensitivity to Loss Weight. We now show the sensitivity to loss weight 𝜆 in Tab. B.5. Our choice of 𝜆 = 0.05 is the optimal value. Sensitivity to Best Box Threshold. We now show the sensitivity to the best box threshold 𝛽 in Tab. B.6. Our choice of 𝛽 = 0.3 is the optimal value. Conclusion. GrooMeD-NMS has minor sensitivity to 𝑁𝑡, 𝛼, 𝜆 and 𝛽, which is common in object detection. GrooMeD-NMS is not as sensitive to 𝑣 since it only decides a box’s validity. Our parameter choice is either at or close to the optimal. The inference speed is only affected by 𝛼. Other parameters are used in training or do not affect inference speed. B.3.5 Qualitative Results We next show some qualitative results of models trained on KITTI Val 1 Split in Fig. B.2. We depict the predictions of GrooMeD-NMS in image view on the left and the predictions of GrooMeD-NMS, Kinematic (Image) [17], and ground truth in BEV on the right. In general, 111 Table B.5 Sensitivity to loss weight 𝜆 on KITTI Val 1 cars. [Key: Best] (− (− (cid:17)) AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard 19.16 13.89 10.96 27.01 19.33 14.84 𝜆 = 0 𝜆 = 0.05 19.67 14.32 11.27 27.38 19.75 15.92 𝜆 = 0.1 17.74 13.61 10.81 25.86 19.18 15.57 10.08 7.26 6.00 14.44 10.55 8.41 𝜆 = 1 Table B.6 Sensitivity to best box threshold 𝛽 on KITTI Val 1 cars. [Key: Best] (− (− (cid:17)) AP 3D|𝑅40 AP BEV|𝑅40 (cid:17)) Easy Mod Hard Easy Mod Hard 𝛽 = 0.1 18.09 13.64 10.21 26.52 19.50 15.74 𝛽 = 0.3 19.67 14.32 11.27 27.38 19.75 15.92 𝛽 = 0.4 18.91 14.02 11.15 27.11 19.64 15.90 𝛽 = 0.5 18.49 13.66 10.96 27.01 19.47 15.79 GrooMeD-NMS predictions are more closer to the ground truth than Kinematic (Image) [17]. B.3.6 Demo Video of GrooMeD-NMS We next include a short demo video of our GrooMeD-NMS model trained on KITTI Val 1 Split. We run our trained model independently on each frame of the three KITTI raw [66] sequences - 2011_10_03_drive_0047, 2011_09_29_drive_0026 and 2011_09_26_drive_0009. None of the frames from these three raw sequences appear in the training set of KITTI Val 1 Split. We use the camera matrices available with the raw sequences but do not use any temporal information. Overlaid on each frame of the raw input videos, we plot the projected 3D boxes of the predictions and also plot these 3D boxes in the BEV. We set the frame rate of this demo at 10 fps. The demo is also available in HD at https://www.youtube.com/watch?v=PWctKkyWrno. In the demo video, notice that the orientation of the boxes are stable despite not using any temporal information. 112 Figure B.2 Qualitative Results (Best viewed in color). We depict the predictions of GrooMeD-NMS (magenta) in image view on the left and the predictions of GrooMeD-NMS, Kinematic (Image) [17] (blue), and Ground Truth (green) in BEV on the right. In general, GrooMeD-NMS predictions are more closer to the ground truth than Kinematic (Image) [17]. 
113 APPENDIX C DEVIANT APPENDIX C.1 Supportive Explanations We now add some explanations which we could not put in the main chapter because of the space constraints. C.1.1 Equivariance vs Augmentation Equivariance adds suitable inductive bias to the backbone [43, 50] and is not learnt. Augmen- tation adds transformations to the input data during training or inference. Equivariance and data augmentation have their own pros and cons. Equivariance models the physics better, is mathematically principled and is so more agnostic to data distribution shift compared to the data augmentation. A downside of equivariance compared to the augmentation is equivariance requires mathematical modelling, may not always exist [21], is not so intuitive and generally requires more flops for inference. On the other hand, data augmentation is simple, intuitive and fast, but is not mathematically principled. The choice between equivariance and data augmentation is a withstanding question in machine learning [63]. C.1.2 Why do 2D CNN detectors generalize? We now try to understand why 2D CNN detectors generalize well. Consider an image ℎ(𝑢, 𝑣) and Φ be the CNN. Let Tt denote the translation in the (𝑢, 𝑣) space. The 2D translation equivariance [19, 20, 198] of the CNN means that Φ(Ttℎ(𝑢, 𝑣)) = TtΦ(ℎ(𝑢, 𝑣)) =⇒ Φ(ℎ(𝑢 + 𝑡𝑢, 𝑣 + 𝑡𝑣)) = Φ(ℎ(𝑢, 𝑣)) + (𝑡𝑢, 𝑡𝑣) (C.1) where (𝑡𝑢, 𝑡𝑣) is the translation in the (𝑢, 𝑣) space. Assume the CNN predicts the object position in the image as (𝑢′, 𝑣′). Then, we write Φ(ℎ(𝑢, 𝑣)) = ( ˆ𝑢, ˆ𝑣) (C.2) Now, we want the CNN to predict the output the position of the same object translated by (𝑡𝑢, 𝑡𝑣). The new image is thus ℎ(𝑢 + 𝑡𝑢, 𝑣 + 𝑡𝑣). The CNN easily predicts the translated position of 114 𝑧 𝑥 𝑦 (𝑥, 𝑦, 𝑧) Patch Plane 𝑚𝑋+𝑛𝑌 +𝑜𝑍+𝑝 = 0 𝑓 (𝑢0,𝑣0) ℎ(𝑢, 𝑣) 𝑡𝑍 𝑓 (𝑢0,𝑣0) ℎ′(𝑢′, 𝑣′) Figure C.1 Equivariance exists for the patch plane when there is depth translation of the ego camera. Downscaling converts image ℎ to image ℎ′. ℎ(𝑢, 𝑣) ℎ′(𝑢′, 𝑣′) Figure C.2 Example of non-existence of equivariance [21] when there is 180◦ rotation of the ego camera. No transformation can convert image ℎ to image ℎ′. the object because all CNN is to do is to invoke its 2D translation equivariance of Eq. (C.1), and translate the previous prediction by the same amount. In other words, Φ(ℎ(𝑢 + 𝑡𝑢, 𝑣 + 𝑡𝑣)) = Φ(ℎ(𝑢, 𝑣)) + (𝑡𝑢, 𝑡𝑣) = ( ˆ𝑢, ˆ𝑣) + (𝑡𝑢, 𝑡𝑣) = ( ˆ𝑢 + 𝑡𝑢, ˆ𝑣 + 𝑡𝑣) Intuitively, equivariance is a disentaglement method. The 2D translation equivariance disentangles the 2D translations (𝑡𝑢, 𝑡𝑣) from the original image ℎ and therefore, the network generalizes to unseen 2D translations. C.1.3 Existence and Non-existence of Equivariance The result from [21] says that generic projective equivariance does not exist in particular with rotation transformations. We now show an example of when the equivariance exists and does not exist in the projective manifold in Figs. C.1 and C.2 respectively. 115 C.1.4 Why do not Monocular 3D CNN detectors generalize? Monocular 3D CNN detectors do not generalize well because they are not equivariant to arbitrary 3D translations in the projective manifold. To show this, let 𝐻 (𝑋, 𝑌 , 𝑍) denote a 3D point cloud. The monocular detection network Φ operates on the projection ℎ(𝑢, 𝑣) of this point cloud 𝐻 to output the position ( ˆ𝑥, ˆ𝑦, ˆ𝑧) as Φ(K𝐻 (𝑋, 𝑌 , 𝑍)) = ( ˆ𝑥, ˆ𝑦, ˆ𝑧) =⇒ Φ(ℎ(𝑢, 𝑣)) = ( ˆ𝑥, ˆ𝑦, ˆ𝑧), where K denotes the projection operator. We translate this point cloud by an arbitrary 3D translation of (𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ) to obtain the new point cloud 𝐻 (𝑋 + 𝑡𝑋, 𝑌 + 𝑡𝑌 , 𝑍 + 𝑡𝑍 ). 
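This non-distributivity is easy to verify numerically with a toy pinhole camera; the intrinsics below are only illustrative (KITTI-like), and the example tracks a single 3D point rather than a full point cloud.

import numpy as np

f, u0, v0 = 707.0, 621.0, 187.0                  # illustrative pinhole intrinsics

def project(P):                                   # projection operator K
    X, Y, Z = P
    return np.array([f * X / Z + u0, f * Y / Z + v0])

t = np.array([1.0, 0.0, 0.0])                     # one fixed 3D translation
near = np.array([2.0, 1.0, 10.0])
far  = np.array([2.0, 1.0, 40.0])

print(project(near + t) - project(near))          # pixel shift ~ [70.7, 0.0]
print(project(far + t) - project(far))            # pixel shift ~ [17.7, 0.0]
# The same 3D translation induces different 2D shifts depending on depth, so K does
# not distribute over H and (t_X, t_Y, t_Z), and no single image translation
# (t_u, t_v) lets a vanilla CNN reuse its 2D translation equivariance.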
Then, we again ask the monocular detector Φ to do prediction over the translated point cloud. However, we find that Φ(K𝐻 (𝑋 + 𝑡𝑋, 𝑌 + 𝑡𝑌 , 𝑍 + 𝑡𝑍 )) ≠ Φ(ℎ(𝑢 + K(𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ), 𝑣 + K(𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ))) =⇒ Φ(K𝐻 (𝑋 + 𝑡𝑋, 𝑌 + 𝑡𝑌 , 𝑍 + 𝑡𝑍 )) ≠ Φ(K𝐻 (𝑋, 𝑌 , 𝑍)) + K(𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ) = Φ(ℎ(𝑢, 𝑣)) + K(𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ) In other words, the projection operator K does not distribute over the point cloud 𝐻 and arbitrary 3D translation of (𝑡𝑋, 𝑡𝑌 , 𝑡𝑍 ). Hence, if the network Φ is a vanilla CNN (existing monocular backbone), it can no longer invoke its 2D translation equivariance of Eq. (C.1) to get the new 3D coordinates ( ˆ𝑥 + 𝑡𝑋, ˆ𝑦 + 𝑡𝑌 , ˆ𝑧 + 𝑡𝑍 ). Note that the LiDAR based 3D detectors with 3D convolutions do not suffer from this problem because they do not involve any projection operator K. Thus, this problem exists only in monocular 3D detection. This makes monocular 3D detection different from 2D and LiDAR based 3D object detection. C.1.5 Overview of Planar Transformations: Th. 1 We now pictorially provide the overview of Th. 1 (Example 13.2 from [76]), which links the planarity and projective transformations in the continuous world in Fig. C.3. 116 Continuous WorldDiscrete World 3D point on plane Projective 2D point Sampling 2D pixel (R, t) Th. 1 3D point on plane Projective 2D point Sampling 2D pixel Figure C.3 Overview of Th. 1 (Example 13.2 from [76]), which links the planarity and projective transfor- mations in the continuous world. C.1.6 Approximation of Scale Transformations: Corollary 1.1 We now give the approximation under which Corollary 1.1 is valid. We assume that the ego camera does not undergo any rotation. Hence, we substitute R = I in Eq. (3.1) to get ℎ(𝑢 − 𝑢0, 𝑣 − 𝑣0) = ℎ′ (cid:169) (cid:173) (cid:173) (cid:171) 𝑓 (cid:16)1+𝑡𝑋 (cid:17) 𝑚 𝑝 (𝑢−𝑢0) +𝑡𝑋 𝑡𝑍 𝑚 𝑝 (𝑢−𝑢0) +𝑡𝑍 𝑡𝑌 𝑚 𝑝 (𝑢−𝑢0) + 𝑓 𝑛 𝑝 (𝑣 −𝑣0) + (cid:17) (cid:16)1+𝑡𝑌 𝑛 𝑝 𝑡𝑍 𝑚 𝑝 (𝑢−𝑢0) + 𝑡𝑍 𝑛 𝑝 (𝑣 −𝑣0) + 𝑛 𝑝 (𝑣 −𝑣0) +𝑡𝑋 (cid:16)1+𝑡𝑍 𝑜 𝑝 𝑜 𝑝 𝑓 (cid:17) 𝑓 , (𝑣 −𝑣0) +𝑡𝑌 (cid:16)1+𝑡𝑍 𝑜 𝑝 𝑜 𝑝 𝑓 (cid:17) 𝑓 (cid:170) (cid:174) (cid:174) (cid:172) . (C.3) Next, we use the assumption that the ego vehicle moves in the 𝑧-direction as in [17], i.e., substitute 𝑡𝑋 = 𝑡𝑌 = 0 to get ℎ(𝑢−𝑢0, 𝑣 −𝑣0) = ℎ′ (cid:169) (cid:173) (cid:173) (cid:171) 𝑡𝑍 𝑓 𝑢 − 𝑢0 𝑛 𝑝 (𝑣 −𝑣0) + 𝑚 𝑝 (𝑢−𝑢0) + 𝑡𝑍 𝑓 , (cid:17) (cid:16)1+𝑡𝑍 𝑜 𝑝 𝑝 (𝑢−𝑢0) + 𝑡𝑍 The patch plane is 𝑚𝑥 + 𝑛𝑦 + 𝑜𝑧 + 𝑝 = 0. We consider the planes in the front of camera. Without 𝑡𝑍 𝑓 𝑜 𝑝 𝑚 𝑓 𝑣 − 𝑣0 𝑛 𝑝 (𝑣 −𝑣0) + (cid:16)1+𝑡𝑍 (cid:17) . (cid:170) (cid:174) (cid:174) (cid:172) (C.4) loss of generality, consider 𝑝 < 0 and 𝑜 > 0. We first write the denominator 𝐷 of RHS term in Eq. (C.4) as 𝐷 = 𝑡𝑍 𝑓 𝑚 𝑝 (𝑢−𝑢0) + 𝑡𝑍 𝑓 𝑛 𝑝 117 (𝑣 −𝑣0) + (cid:18) 1+𝑡𝑍 (cid:19) 𝑜 𝑝 = 1 + 𝑡𝑍 𝑝 (cid:18) 𝑚 𝑓 (𝑢−𝑢0) + (cid:19) (𝑣 −𝑣0) + 𝑜 𝑛 𝑓 Because we considered patch plane s in front of the camera, 𝑝 < 0. Also consider 𝑡𝑍 < 0, which implies 𝑡𝑍 /𝑝 > 0. 
Now, we bound the term in the parantheses of the above equation as (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (𝑢−𝑢0) + (𝑣 −𝑣0) + 𝑜 𝐷 ≤ 1 + ≤ 1 + ≤ 1 + ≤ 1 + ≤ 1 + 𝑡𝑍 𝑝 𝑡𝑍 𝑝 𝑡𝑍 𝑝 𝑡𝑍 𝑝 𝑡𝑍 𝑝 𝑛 𝑓 (cid:13) 𝑚 (cid:13) (cid:13) 𝑓 (cid:13) (cid:18)(cid:13) 𝑚 (cid:13) (cid:13) 𝑓 (cid:13) (cid:18) ∥𝑚∥ 𝑓 (cid:18) ∥𝑚∥ 𝑓 + (cid:13) (cid:13) 𝑛 (cid:13) (cid:13) (cid:13) (cid:13) 𝑓 (cid:13) (cid:13) 𝐻 ∥𝑛∥ 2 𝑓 𝑊 ∥𝑛∥ 2 𝑓 + 𝑊 2 𝑊 2 (cid:18) (∥𝑚∥ + ∥𝑛∥)𝑊 2 𝑓 + (cid:19) , + 𝑜 (𝑢−𝑢0) (𝑣 −𝑣0) (cid:19) + ∥𝑜∥ by Triangle inequality (cid:19) + 𝑜 (cid:19) + 𝑜 , (𝑢−𝑢0) ≤ 𝑊 2 , (𝑣 −𝑣0) ≤ 𝐻 2 , ∥𝑜∥ = 𝑜 , 𝐻 ≤ 𝑊 If the coefficients of the patch plane 𝑚, 𝑛, 𝑜, its width 𝑊 and focal length 𝑓 follow the relationship (∥𝑚∥+∥𝑛∥)𝑊 2 𝑓 << 𝑜, the patch plane is “approximately” parallel to the image plane. Then, a few quantities can be ignored in the denominator 𝐷 to get D ≈ 1 + 𝑡𝑍 𝑜 𝑝 Therefore, the RHS of Eq. (C.4) gets simplified and we obtain (cid:32) 𝑢 − 𝑢0 1+𝑡𝑍 T𝑠 : ℎ(𝑢 − 𝑢0, 𝑣 − 𝑣0) ≈ ℎ′ , 𝑜 𝑝 (C.5) (C.6) (cid:33) 𝑣 − 𝑣0 𝑜 1+𝑡𝑍 𝑝 An immediate benefit of using the approximation is Eq. (3.2) does not depend on the distance of the patch plane from the camera. This is different from wide-angle camera assumption, where the ego camera is assumed to be far from the patch plane. Moreover, patch plane s need not be perfectly aligned with the image plane for Eq. (3.2). Even small enough perturbed patch plane s work. We next show the approximation in the Fig. C.4 with 𝜃 denoting the deviation from the perfect parallel plane. The deviation 𝜃 is about 3 degrees for the KITTI dataset while it is 6 degrees for the Waymo dataset. e.g. The following are valid patch plane s for KITTI images whose focal length 𝑓 = 707 and width 𝑊 = 1242. −0.05𝑥 + 0.05𝑦 + 𝑧 = 30 118 𝑧 𝑦 𝜃 Figure C.4 Approximation of Corollary 1.1. Bold shows the patch plane parallel to the image plane. The dotted line shows the approximated patch plane. 0.05𝑥 − 0.05𝑦 + 𝑧 = 30 (C.7) The following are valid patch plane s for Waymo images whose focal length 𝑓 = 2059 and width 𝑊 = 1920. −0.1𝑥 + 0.1𝑦 + 𝑧 = 30 0.1𝑥 − 0.1𝑦 + 𝑧 = 30 (C.8) Although the assumption is slightly restrictive, we believe our method shows improvements on both KITTI and Waymo datasets because the car patches are approximately parallel to image planes and also because the depth remains the hardest parameter to estimate [166]. C.1.7 Scale Equivariance of SES Convolution for Images [227] derive the scale equivariance of SES convolution for a 1D signal. We simply follow on their footsteps to get the scale equivariance of SES convolution for a 2D image ℎ(𝑢, 𝑣) for the sake of completeness. Let the scaling of the image ℎ be 𝑠. Let ∗ denote the standard vanilla convolution and Ψ denote the convolution filter. Then, the convolution of the downscaled image T𝑠 (ℎ) with the filter Ψ is given by (cid:19) Ψ(𝑢′ − 𝑢, 𝑣′ − 𝑣)𝑑𝑢′𝑑𝑣′ , ℎ ∫ ∫ [T𝑠 (ℎ) ∗ Ψ] (𝑢, 𝑣) (cid:18) 𝑢′ 𝑣′ 𝑠 𝑠 (cid:18) 𝑢′ 𝑠 (cid:18) 𝑢′ 𝑠 (cid:18) 𝑢′ 𝑠 = = 𝑠2 ∫ ∫ = 𝑠2 ∫ ∫ = 𝑠2 ∫ ∫ ℎ ℎ ℎ , , , (cid:19) (cid:19) (cid:19) 𝑣′ 𝑠 𝑣′ 𝑠 𝑣′ 𝑠 (cid:18) 𝑠 Ψ T𝑠−1 T𝑠−1 (cid:19) 𝑑 , 𝑠 𝑢′ − 𝑢 𝑠 (cid:18) 𝑢′ − 𝑢 𝑠 𝑣′ − 𝑣 𝑠 𝑣′ − 𝑣 𝑠 Ψ , (cid:20) (cid:19) (cid:18) 𝑢′ 𝑠 𝑑 (cid:19)(cid:21) 𝑑 (cid:19) (cid:18) 𝑣′ 𝑠 (cid:19) 𝑑 (cid:18) 𝑢′ 𝑠 (cid:18) 𝑢′ 𝑠 𝑑 (cid:19) (cid:18) 𝑣′ 𝑠 (cid:18) 𝑣′ 𝑠 (cid:19) (cid:19) 𝑑 (cid:20) Ψ (cid:18) 𝑢′ 𝑠 − 𝑢 𝑠 , 𝑣′ 𝑠 − 𝑣 𝑠 (cid:19)(cid:21) 119 = 𝑠2 [ℎ ∗ T𝑠−1 (Ψ)] (cid:17) (cid:16) 𝑢 𝑠 , 𝑣 𝑠 = 𝑠2T𝑠 [ℎ ∗ T𝑠−1 (Ψ)] (𝑢, 𝑣). 
Next, [227] re-parametrize the SES filters by writing Ψ𝜎 (𝑢, 𝑣) = 1 𝜎2 Ψ (cid:0) 𝑢 𝜎 , 𝑣 𝜎 (C.9) (cid:1). Substituting in Eq. (C.9), we get [T𝑠 (ℎ) ∗ Ψ𝜎] (𝑢, 𝑣) = 𝑠2T𝑠 [ℎ ∗ T𝑠−1 (Ψ𝜎)] (𝑢, 𝑣) (C.10) Moreover, the re-parametrized filters are separable [227] by construction and so, one can write Ψ𝜎 (𝑢, 𝑣) = Ψ𝜎 (𝑢)Ψ𝜎 (𝑣). (C.11) The re-parametrization and separability leads to the important property that T𝑠−1 (Ψ𝜎 (𝑢, 𝑣)) = T𝑠−1 (Ψ𝜎 (𝑢)Ψ𝜎 (𝑣)) = T𝑠−1 (Ψ𝜎 (𝑢)) T𝑠−1 (Ψ𝜎 (𝑣)) = 𝑠−2Ψ𝑠−1𝜎 (𝑢)Ψ𝑠−1𝜎 (𝑣) = 𝑠−2Ψ𝑠−1𝜎 (𝑢, 𝑣). (C.12) Substituting above in the RHS of Eq. (C.10), we get [T𝑠 (ℎ) ∗ Ψ𝜎] (𝑢, 𝑣) = 𝑠2T𝑠 (cid:2)ℎ ∗ 𝑠−2Ψ𝑠−1𝜎 (cid:3) (𝑢, 𝑣) =⇒ [T𝑠 (ℎ) ∗ Ψ𝜎] (𝑢, 𝑣) = T𝑠 [ℎ ∗ Ψ𝑠−1𝜎] (𝑢, 𝑣), (C.13) which is a cleaner form of Eq. (C.9). Eq. (C.13) says that convolving the downscaled image with a filter is same as the downscaling the result of convolving the image with the upscaled filter [227]. This additional constraint regularizes the scale (depth) predictions for the image, leading to better generalization. C.1.8 Why does DEVIANT generalize better compared to CNN backbone? DEVIANT models the physics better compared to the CNN backbone. CNN generalizes better for 2D detection because of the 2D translation equivariance in the Euclidean manifold. However, 120 Table C.1 Comparison of Methods on the basis of inputs, convolution kernels, outputs and whether output are scale-constrained. Method Input #Conv Frame Kernel Output Vanilla CNN Depth-Aware [15] Dilated CNN [291] DEVIANT Depth-guided [52] 1 + Depth Kinematic3D [17] 1 1 1 1 > 1 1 > 1 > 1 > 1 1 1 4D 4D 5D 5D 4D 5D Output Constrained for Scales? ✕ ✕ Integer [267] Float Integer [267] ✕ monocular 3D detection does not belong to the Euclidean manifold but is a task of the projective manifold. Modeling translation equivariance in the correct manifold improves generalization. For monocular 3D detection, we take the first step towards the general 3D translation equivariance by embedding equivariance to depth translations. The 3D depth equivariance in DEVIANT uses Eq. (C.10) and thus imposes an additional constraint on the feature maps. This additional constraint results in consistent depth estimates from the current image and a virtual image (obtained by translating the ego camera), and therefore, better generalization than CNNs. On the other hand, CNNs, by design, do not constrain the depth estimates from the current image and a virtual image (obtained by translating the ego camera), and thus, their depth estimates are entirely data-driven. C.1.9 Why not Fixed Scale Assumption? We now answer the question of keeping the fixed scale assumption. If we assume fixed scale assumption, then vanilla convolutional layers have the right equivariance. However, we do not keep this assumption because the ego camera translates along the depth in driving scenes and also, because the depth is the hardest parameter to estimate [166] for monocular detection. So, zero depth translation or fixed scale assumption is always violated. C.1.10 Comparisons with Other Methods We now list out the differences between different convolutions and monocular detection methods in Tab. C.1. Kinematic3D [17] does not constrain the output at feature map level, but at system level using Kalman Filters. The closest to our method is the Dilated CNN (DCNN) [291]. We show in Tab. 3.9 that DEVIANT outperforms Dilated CNN. 121 Multi-scale Steerable Basis Scale Conv. 
Output ⊗w⊗w⊗w * Kernel * * Input Scale- Projection 4D Output Figure C.5 (a) SES convolution [68, 227] The non-trainable basis functions multiply with learnable weights w to get kernels. The input then convolves with these kernels to get multi-scale 5D output. (b) Scale- Projection [227] takes max over the scale dimension of the 5D output and converts it to 4D. [Key: ∗ = Vanilla convolution.] C.1.11 Why is Depth the hardest among all parameters? Images are the 2D projections of the 3D scene, and therefore, the depth is lost during projection. Recovering this depth is the most difficult to estimate, as shown in Tab. 1 of [166]. Monocular detection task involves estimating 3D center, 3D dimensions and the yaw angle. The right half of Tab. 1 in [166] shows that if the ground truth 3D center is replaced with the predicted center, the detection reaches a minimum. Hence, 3D center is the most difficult to estimate among center, dimensions and pose. Most monocular 3D detectors further decompose the 3D center into projected (2D) center and depth. Out of projected center and depth, Tab. 1 of [166] shows that replacing ground truth depth with the predicted depth leads to inferior detection compared to replacing ground truth projected center with the predicted projected center. Hence, we conclude that depth is the hardest parameter to estimate. C.2 Implementation Details We now provide some additional implementation details for facilitating reproduction of this work. C.2.1 Steerable Filters of SES Convolution We use the scale equivariant steerable blocks proposed by [226] for our DEVIANT backbone. We now share the implementation details of these steerable filters. Basis. Although steerable filters can use any linearly independent functions as their basis, we stick 122 Figure C.6 Steerable Basis [227] for 7×7 SES convolution filters. (Showing only 8 of the 49 members for each scale). with the Hermite polynomials as the basis [226]. Let (0, 0) denote the center of the function and (𝑢, 𝑣) denote the pixel coordinates. Then, the filter coefficients 𝜓𝜎𝑛𝑚 [226] are 𝜓𝜎𝑛𝑚 = 𝐴 𝜎2 𝐻𝑛 (cid:17) (cid:16) 𝑢 𝜎 𝐻𝑚 (cid:16) 𝑣 𝜎 (cid:17) 𝑒− 𝑢2 +𝑣2 𝜎2 (C.14) 𝐻𝑛 denotes the Probabilist’s Hermite polynomial of the 𝑛th order, and 𝐴 is the normalization constant. The first six Probabilist’s Hermite polynomials are 𝐻0(𝑥) = 1 𝐻1(𝑥) = 𝑥 𝐻2(𝑥) = 𝑥2 − 1 𝐻3(𝑥) = 𝑥3 − 3𝑥 𝐻4(𝑥) = 𝑥4 − 6𝑥2 + 3 (C.15) (C.16) (C.17) (C.18) (C.19) Fig. C.6 visualizes some of the SES filters and shows that the basis is indeed at different scales. C.2.2 Monocular 3D Detection Architecture. We use the DLA-34 [292] configuration, with the standard Feature Pyramid Network (FPN) [138], binning and ensemble of uncertainties. FPN is a bottom-up feed-forward CNN that computes feature maps with a downscaling factor of 2, and a top-down network that brings them back to the high-resolution ones. There are total six feature maps levels in this FPN. We use DLA-34 as the backbone for our baseline GUP Net [159], while we use SES-DLA-34 as the backbone for DEVIANT. We also replace the 2D pools by 3D pools with pool along the scale dimensions as 1 for DEVIANT. 123 We initialize the vanilla CNN from ImageNet weights. For DEVIANT, we use the regularized least squares [226] to initialize the trainable weights in all the Hermite scales from the ImageNet [48] weights. Compared to initializing one of the scales as proposed in [226], we observed more stable convergence in initializing all the Hermite scales. We output three foreground classes for KITTI dataset. 
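Returning to the steerable basis of Eq. (C.14), the Hermite-Gaussian basis members can be generated as in the NumPy sketch below. The normalization constant A and the set of scales are chosen illustratively here and are not the exact values of the released implementation.

import numpy as np
from numpy.polynomial.hermite_e import hermeval   # probabilist's Hermite polynomials

def ses_basis_filter(size, sigma, n, m):
    # 2D Hermite-Gaussian basis function of Eq. (C.14) on a size x size grid.
    r = np.arange(size) - (size - 1) / 2.0
    u, v = np.meshgrid(r, r, indexing="ij")
    Hn = hermeval(u / sigma, [0] * n + [1])        # H_n(u / sigma)
    Hm = hermeval(v / sigma, [0] * m + [1])        # H_m(v / sigma)
    psi = Hn * Hm * np.exp(-(u ** 2 + v ** 2) / sigma ** 2) / sigma ** 2
    return psi / np.abs(psi).sum()                 # illustrative choice of A

# The non-trainable basis stacks such filters over several scales; the trainable
# weights w then combine the basis members into the SES kernels of Fig. C.5.
basis = [ses_basis_filter(7, s, n, m) for s in (1.0, 1.26, 1.59)
         for n in range(3) for m in range(3)]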
We also output three foreground classes for Waymo dataset ignoring the Sign class [199]. Datasets. We use the publicly available KITTI,Waymo and nuScenes datasets for our experi- ments. KITTI is available at http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark= 3d under CC BY-NC-SA 3.0 License. Waymo is available at https://waymo.com/intl/en_us/ dataset-download-terms/ under the Apache License, Version 2.0. nuScenes is available at https: //www.nuscenes.org/nuscenes under CC BY-NC-SA 4.0 International Public License. Augmentation. Unless otherwise stated, we horizontal flip the training images with probability 0.5, and use scale augmentation as 0.4 as well for all the models [159] in training. Pre-processing. The only pre-processing step we use is image resizing. • KITTI. We resize the [370, 1242] sized KITTI images, and bring them to the [384, 1280] resolution [159]. • Waymo. We resize the [1280, 1920] sized Waymo images, and bring them to the [512, 768] resolution. This resolution preserves their aspect ratio. Box Filtering. We apply simple hand-crafted rules for filtering out the boxes. We ignore the box if it belongs to a class different from the detection class. • KITTI. We train with boxes which are atleast 2𝑚 distant from the ego camera, and with visibility > 0.5 [159]. • Waymo. We train with boxes which are atleast 2𝑚 distant from the ego camera. The Waymo dataset does not have any occlusion based labels. However, Waymo provides the number of LiDAR points inside each 3D box which serves as a proxy for the occlusion. We train the boxes which have more than 100 LiDAR points for the vehicle class and have more than 50 LiDAR points for the cyclist and pedestrian class. 124 Training. We use the training protocol of GUP Net [159] for all our experiments. Training uses the Adam optimizer [102] and weight-decay 1 × 10−5 . Training dynamically weighs the losses using Hierarchical Task Learning (HTL) [159] strategy keeping 𝐾 as 5 [159]. Training also uses a linear warmup strategy in the first 5 epochs to stabilize the training. We choose the model saved in the last epoch as our final model for all our experiments. • KITTI. We train with a batch size of 12 on single Nvidia A100 (40GB) GPU for 140 epochs. Training starts with a learning rate 1.25 × 10−3 with a step decay of 0.1 at the 90th and the 120th epoch. • Waymo. We train with a batch size of 40 on single Nvidia A100 (40GB) GPU for 30 epochs because of the large size of the Waymo dataset. Training starts with a learning rate 1.25 × 10−3 with a step decay of 0.1 at the 18th and the 26th epoch. Losses. We use the GUP Net [159] multi-task losses before the NMS for training. The total loss L is given by L = Lheatmap + L2D,offset + L2D,size + L3D2D,offset + L3D,𝑎𝑛𝑔𝑙𝑒 + L3D,𝑙 + L3D,𝑤 + L3D,ℎ + L3D,𝑑𝑒 𝑝𝑡ℎ. The individual terms are given by Lheatmap = Focal(𝑐𝑙𝑎𝑠𝑠𝑏, 𝑐𝑙𝑎𝑠𝑠𝑔), L2D,offset = L1(𝛿𝑏 L2D,size = L1(𝑤𝑏 L3D2D,offset = L1(𝛿𝑏 2D, 𝛿𝑔 2D), 2D, 𝑤𝑔 2D) + L1(ℎ𝑏 3D2D, 𝛿𝑔 3D2D) 2D, ℎ𝑔 2D), L3D,𝑎𝑛𝑔𝑙𝑒 = CE(𝛼𝑏, 𝛼𝑔) L3D,𝑙 = L1(𝜇𝑏 L3D,𝑤 = L1(𝜇𝑏 𝑙3D) 𝑙3D, 𝛿𝑔 𝑤3D, 𝛿𝑔 𝑤3D) L3D,ℎ = L3D,𝑑𝑒 𝑝𝑡ℎ = √ 2 𝜎ℎ3D √ 2 𝜎𝑑 L1(𝜇𝑏 ℎ3D, 𝛿𝑔 ℎ3D) + ln(𝜎ℎ3D) L1(𝜇𝑏 𝑑, 𝜇𝑔 𝑑) + ln(𝜎𝑑), 125 (C.20) (C.21) (C.22) (C.23) (C.24) (C.25) (C.26) (C.27) (C.28) (C.29) where, 𝜇𝑏 𝑑 = 𝑓 + 𝜇𝑑,𝑝𝑟𝑒𝑑 𝜇𝑏 ℎ3D ℎ𝑏 2D (cid:118)(cid:117)(cid:116)(cid:32) 𝜎𝑑 = 𝑓 (cid:33) 2 + 𝜎2 𝑑,𝑝𝑟𝑒𝑑. 𝜎ℎ3D ℎ𝑏 2D (C.30) (C.31) The superscripts 𝑏 and 𝑔 denote the predicted box and ground truth box respectively. CE and Focal denote the Cross Entropy and Focal loss respectively. 
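The depth terms of Eqs. (C.30) and (C.31) amount to the small in-network ensemble sketched below; the function and argument names are illustrative rather than those of the released GUP Net code.

def ensemble_depth(mu_h3d, sigma_h3d, h2d, mu_d_pred, sigma_d_pred, focal):
    # Geometric depth from the predicted 3D height and the 2D box height, plus a
    # directly regressed depth offset (Eq. C.30).
    mu_d = focal * mu_h3d / h2d + mu_d_pred
    # Uncertainty of the combined Laplacian estimate (Eq. C.31).
    sigma_d = ((focal * sigma_h3d / h2d) ** 2 + sigma_d_pred ** 2) ** 0.5
    return mu_d, sigma_d

# mu_d and sigma_d then enter the uncertainty-weighted depth loss L_3D,depth above.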
The number of heatmaps depends on the number of output classes. 𝛿2D denotes the deviation of the 2D center from the center of the heatmap. 𝛿3D2D,offset denotes the deviation of the projected 3D center from the center of the heatmap. The orientation loss is the cross entropy loss between the binned observation angle of the prediction and the ground truth. The observation angle 𝛼 is split into 12 bins covering 30◦ range. 𝛿𝑙3D, 𝛿𝑤3D and 𝛿ℎ3D denote the deviation of the 3D length, width and height of the box from the class dependent mean size respectively. The depth is the hardest parameter to estimate [166]. So, GUP Net uses in-network ensembles to predict the depth. It obtains a Laplacian estimate of depth from the 2D height, while it obtains another estimate of depth from the prediction of depth. It then adds these two depth estimates. Inference. Our testing resolution is same as the training resolution. We do not use any augmentation for test/validation. We keep the maximum number of objects to 50 in an image, and we multiply the class and predicted confidence to get the box’s overall score in inference as in [109]. We consider output boxes with scores greater than a threshold of 0.2 for KITTI [159] and 0.1 for Waymo [199]. C.3 Additional Experiments and Results We now provide additional details and results of the experiments evaluating DEVIANT’s performance. C.3.1 KITTI Val Split Monocular Detection has Huge Generalization Gap. As mentioned in Sec. 3.1, we now show that the monocular detection has huge generalization gap between training and inference. We report the object detection performance on the train and validation (val) set for the two models on KITTI Val split in Tab. C.2. Tab. C.2 shows that the performance of our baseline GUP Net [159] and our 126 Table C.2 Generalization gap (− between training and inference sets. [Key: Best] (cid:17) ) on KITTI Val cars. Monocular detection has huge generalization gap Method Scale Eqv GUP Net [159] DEVIANT ✓ IoU3D ≥ 0.7 IoU3D ≥ 0.5 (cid:17)) (cid:17)) Set [%](− [%](− [%](− AP 3D|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard Train 91.83 74.87 67.43 95.19 80.95 73.55 99.50 93.62 86.22 99.56 93.88 86.46 Val 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 Gap 70.73 59.39 54.55 66.61 60.03 55.72 40.55 49.63 48.15 34.96 46.12 43.49 Train 91.09 76.19 67.16 94.76 82.61 75.51 99.37 93.56 88.57 99.50 93.87 88.90 Val 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 Gap 66.46 59.65 52.64 62.16 59.57 55.52 38.37 47.56 48.39 34.22 44.24 45.40 AP BEV|𝑅40 (cid:17)) Table C.3 Comparison on multiple backbones on KITTI Val cars. [Key: Best] IoU3D ≥ 0.7 IoU3D ≥ 0.5 (cid:17)) [%](− Method BackBone AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard ResNet-18 GUP Net [159] 18.86 13.20 11.01 26.05 19.37 16.57 54.90 40.65 34.98 60.54 46.13 40.12 20.27 14.21 12.56 28.09 20.32 17.49 55.75 42.41 36.97 60.82 46.43 40.59 DLA-34 GUP Net [159] 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 AP BEV|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 DEVIANT DEVIANT [%](− [%](− (cid:17)) (cid:17)) DEVIANT is huge on the training set, while it is less than one-fourth of the train performance on the val set. We also report the generalization gap (in pink) metric [270] in Tab. C.2, which is the difference between training and validation performance. 
The generalization gap at both the thresholds of 0.7 and 0.5 is huge. Comparison on Multiple Backbones. A common trend in 2D object detection community is to show improvements on multiple backbones [253]. DD3D [181] follows this trend and also reports their numbers on multiple backbones. Therefore, we follow the same and compare with our baseline on multiple backbones on KITTI Val cars in Tab. C.3. Tab. C.3 shows that DEVIANT shows consistent improvements over GUP Net [159] in 3D object detection on multiple backbones, proving the effectiveness of our proposal. Comparison with Bigger CNN Backbones. Since the SES blocks increase the Flop counts signif- icantly compared to the vanilla convolution block, we next compare DEVIANT with bigger CNN backbones with comparable GFLOPs and FPS/ wall-clock time (instead of same configuration) in Tab. C.4. We compare DEVIANT with DLA-102 and DLA-169 - two biggest DLA networks 127 Table C.4 Results with bigger CNNs having similar flops on KITTI Val cars. [Key: Best] (cid:17) ) Disk Size (− (cid:17) ) Flops (− Method BackBone GUP Net [159] DLA-34 GUP Net [159] DLA-102 GUP Net [159] DLA-169 DEVIANT SES-DLA-34 Param (− (M) 16 34 54 16 (MB) 235 583 814 236 (cid:17) ) Infer (− (G) 30 70 114 235 (cid:17) ) AP3D IoU3D≥ 0.7 (− (ms) Easy Mod Hard 20 25 30 40 (cid:17)) AP3D IoU3D≥ 0.5 (− Easy Mod Hard 21.10 15.48 12.88 58.95 43.99 38.07 20.96 14.64 12.80 57.06 41.78 37.26 21.76 15.35 12.72 57.60 43.27 37.32 24.63 16.54 14.52 61.00 46.00 40.18 (cid:17)) Table C.5 Results on KITTI Val cyclists and pedestrians (Cyc/Ped) (IoU3D ≥ 0.5). [Key: Best, Second Best] Method Extra GrooMeD-NMS [109] MonoDIS [222] MonoDIS-M [220] GUP Net (Retrained) [159] DEVIANT (Ours) − − − − − (cid:17)) Cyc AP 3D|𝑅40 [%](− Easy Mod Hard 0.00 0.00 0.00 0.71 1.52 0.73 1.30 2.70 1.50 2.03 4.41 2.17 2.14 4.05 2.20 (cid:17)) Ped AP 3D|𝑅40 [%](− Easy Mod Hard 2.61 3.79 2.71 1.71 3.20 2.28 5.70 9.50 7.10 5.73 9.37 6.84 5.42 9.85 7.18 with ImageNet weights1 on KITTI Val split. We use the fvcore library2 to get the parameters and flops. Tab. C.4 shows that DEVIANT again outperforms the bigger CNN backbones, especially on nearby objects. We believe this happens because the bigger CNN backbones have more trainable parameters than DEVIANT, which leads to overfitting. Although DEVIANT takes more time compared to the CNN backbones, DEVIANT still keeps the inference almost real-time. Performance on Cyclists and Pedestrians. Tab. C.5 lists out the results of 3D object detection on KITTI Val Cyclist and Pedestrians. The results show that DEVIANT is competitive on challenging Cyclist and achieves SoTA results on Pedestrians on the KITTI Val split. Cross-Dataset Evaluation Details. For cross-dataset evaluation, we test on all 3,769 images of the KITTI Val split, as well as all frontal 6,019 images of the nuScenes Val split [22], as in [218]. We first convert the nuScenes Val images to the KITTI format using the export_kitti3 function in the nuscenes devkit. We keep KITTI Val images in the [384, 1280] resolution, while we keep the nuScenes Val images in the [384, 672] resolution to preserve the aspect ratio. For M3D-RPN [15], we bring the nuScenes Val images in the [512, 910] resolution. Monocular 3D object detection relies on the camera focal length to back-project the projected 1Available at http://dl.yf.io/dla/models/imagenet/ 2https://github.com/facebookresearch/fvcore 3https://github.com/nutonomy/nuscenes-devkit/blob/master/python-sdk/nuscenes/scripts/export_kitti.py 128 centers into the 3D space. 
Therefore, the 3D centers depends on the focal length of the camera used in the dataset. Hence, one should take the camera focal length into account while doing cross-dataset evaluation. We now calculate the camera focal length of a dataset as follows. We take 2 𝑓𝑦 the camera matrix K and calculate the normalized focal length ¯𝑓 = 𝐻 , where 𝐻 denotes the height of the image. The normalized focal length ¯𝑓 for the KITTI dataset is 3.82, while the normalized focal length ¯𝑓 for the nuScenes dataset is 2.82. Thus, the KITTI and the nuScenes images have a different focal length [255]. M3D-RPN [15] does not normalize w.r.t. the focal length. So, we explicitly correct and divide the depth predictions of nuScenes images from the KITTI model by 3.82/2.82 = 1.361 in the M3D- RPN [15] codebase. The GUP Net [159] and DEVIANT codebases use normalized coordinates i.e. they normalize w.r.t. the focal length. So, we do not explicitly correct the focal length for GUP Net and DEVIANT predictions. We match predictions to the ground truths using the IoU2D overlap threshold of 0.7 [218]. After this matching, we calculate the Mean Average Error (MAE) of the depths of the predicted and the ground truth boxes [218]. Stress Test with Rotational and/or xy-translation Ego Movement. Corollary 1.1 uses translation along the depth as the sole ego movement. This assumption might be valid for the current outdoor datasets and benchmarks, but is not the case in the real world. Therefore, we conduct stress tests on how tolerable DEVIANT and GUP Net [159] are when there is rotational and/or 𝑥𝑦-translation movement on the vehicle. First, note that KITTI and Waymo are already large-scale real-world datasets, and our own dataset might not be a good choice. So, we stick with KITTI and Waymo datasets. We manually choose 306 KITTI Val images with such ego movements and again compare performance of DEVIANT and GUP Net on this subset in Tab. C.6. The average distance of the car in this subset is 27.69 m (±16.59 m), which suggests a good variance and unbiasedness in the subset. Tab. C.6 shows that both the DEVIANT backbone and the CNN backbone show a drop in the detection performance by about 4 AP points on the Mod cars of ego-rotated subset compared to the all set. 129 Table C.6 Stress Test with rotational and 𝑥𝑦-translation ego movement on KITTI Val cars. [Key: Best] Set Method AP3D IoU3D≥ 0.7 (− Easy Mod Hard 9.91 (cid:17)) AP3D IoU3D≥ 0.5 (− Easy Mod Hard 47.47 35.02 32.63 Subset (306) 20.17 12.49 10.93 49.81 36.93 34.32 KITTI Val GUP Net [159] 21.10 15.48 12.88 58.95 43.99 38.07 (3769) 24.63 16.54 14.52 61.00 46.00 40.18 GUP Net [159] 17.22 11.43 DEVIANT DEVIANT (cid:17)) Table C.7 Comparison of Depth Estimates of monocular depth estimators and 3D object detectors on KITTI Val cars. Depth from a depth estimator BTS is not good for foreground objects (cars) beyond 20+ m range. [Key: Best, Second Best] Depth Ground Back+ Foreground Method at Truth GUP Net [159] 3D Center 3D Box 3D Center 3D Box DEVIANT BTS [118] Pixel LiDAR 0.48 Foreground (Cars) 0−20 20−40 40−∞ 0−20 20−40 40−∞ 1.85 1.80 2.16 1.10 1.09 1.22 − − 1.30 − − 1.83 0.45 0.40 0.30 − − This drop experimentally confirms the theory that both the DEVIANT backbone and the CNN backbone do not handle arbitrary 3D rotations. More importantly, the table shows that DEVIANT maintains the performance improvement over GUP Net [159] under such movements. Also, Waymo has many images in which the ego camera shakes. Improvements on Waymo (Tab. 
3.12) also confirms that DEVIANT outperforms GUP Net [159] even when there is rotational or 𝑥𝑦-translation ego movement. Comparison of Depth Estimates from Monocular Depth Estimators and 3D Object Detectors. We next compare the depth estimates from monocular depth estimators and depth estimates from monocular 3D object detectors on the foreground objects. We take a monocular depth estimator BTS [118] model trained on KITTI Eigen split. We next compare the depth error for all and foreground objects (cars) on KITTI Val split using MAE (− (cid:17) ) metric in Tab. C.7 as in Tab. 3.6. We use the MSeg [114] to segment out cars in the driving scenes for BTS. Tab. C.7 shows that the depth from BTS is not good for foreground objects (cars) beyond 20+ m range. Note that there is a data leakage issue between the KITTI Eigen train split and the KITTI Val split [221] and therefore, we expect more degradation in performance of monocular depth estimators after fixing the data leakage issue. Equivariance Error for KITTI Monocular Videos. A better way to compare the scale equiv- 130 Figure C.7 Equivariance error (Δ) comparison for DEVIANT and GUP Net on previous three frames of the KITTI monocular videos at block 3 in the backbone. ariance of the DEVIANT and GUP Net [159] compared to Fig. 3.4, is to compare equivariance error on real images with depth translations of the ego camera. The equivariance error Δ is the normalized difference between the scaled feature map and the feature map of the scaled image, and is given by Δ = 1 𝑁 𝑁 ∑︁ 𝑖=1 ||T𝑠𝑖 Φ(ℎ𝑖) − Φ(T𝑠𝑖 ℎ𝑖)||2 2 ||T𝑠𝑖 Φ(ℎ𝑖)||2 2 , (C.32) where Φ denotes the neural network, T𝑠𝑖 is the scaling transformation for the image 𝑖, and 𝑁 is the total number of images. Although we do evaluate this error in Fig. 3.4, the image scaling in Fig. 3.4 does not involve scene change because of the absence of the moving objects. Therefore, evaluating on actual depth translations of the ego camera makes the equivariance error evaluation more realistic. We next carry out this experiment and report the equivariance error on three previous frames of the val images of the KITTI Val split as in [17]. We plot this equivariance error in Fig. C.7 at block 3 of the backbones because the resolution at this block corresponds to the output feature map of size [96, 320]. Fig. C.7 is similar to Fig. 3.4b, and shows that DEVIANT achieves lower equivariance error. Therefore, DEVIANT has better equivariance to depth translations (scale transformation s) than GUP Net [159] in real scenarios. Model Size, Training, and Inference Times. Both DEVIANT and the baseline GUP Net have the same number of trainable parameters, and therefore, the same model size. GUP Net takes 4 hours to train on KITTI Val and 0.02 ms per image for inference on a single Ampere A100 (40 GB) GPU. DEVIANT takes 8.5 hours for training and 0.04 ms per image for inference on the same GPU. This 131 Method GUP Net [159] DEVIANT Table C.8 Five Different Runs on KITTI Val cars. 
[Key: Average] IoU3D ≥ 0.7 IoU3D ≥ 0.5 (cid:17)) (cid:17)) (cid:17)) Run [%](− [%](− [%](− AP 3D|𝑅40 AP BEV|𝑅40 AP BEV|𝑅40 AP 3D|𝑅40 [%](− (cid:17)) Easy Mod Hard Easy Mod Hard Easy Mod Hard Easy Mod Hard 1 21.67 14.75 12.68 28.72 20.88 17.79 58.27 43.53 37.62 63.67 47.37 42.55 2 21.26 14.94 12.49 28.39 20.40 17.43 59.20 43.55 37.63 64.06 47.46 42.67 3 20.87 15.03 12.61 28.66 20.56 17.48 60.19 44.08 39.36 65.26 49.44 43.17 4 21.10 15.48 12.88 28.58 20.92 17.83 58.95 43.99 38.07 64.60 47.76 42.97 5 22.52 15.92 13.31 30.77 22.40 19.36 59.91 44.00 39.30 64.94 48.01 43.08 Avg 21.48 15.22 12.79 29.02 21.03 17.98 59.30 43.83 38.40 64.51 48.01 42.89 23.19 15.84 14.11 29.82 21.93 19.16 60.19 45.52 39.86 66.32 49.39 43.38 1 23.33 16.12 13.54 31.22 22.64 19.64 61.59 46.33 40.35 67.49 50.26 43.98 2 24.12 16.37 14.48 31.58 22.52 19.65 62.51 46.47 40.65 67.33 50.24 44.16 3 24.63 16.54 14.52 32.60 23.04 19.99 61.00 46.00 40.18 65.28 49.63 43.50 4 5 25.82 17.69 15.07 33.63 23.84 20.60 62.39 46.46 40.61 67.55 50.51 45.80 Avg 24.22 16.51 14.34 31.77 22.79 19.81 61.54 46.16 40.33 66.79 50.01 44.16 Table C.9 Experiments Comparison. Venue Multi-Dataset Cross-Dataset Multi-Backbone Method GrooMeD-NMS [109] CVPR21 CVPR21 MonoFlex [301] CVPR21 CaDDN [199] ICCV21 MonoRCNN [218] ICCV21 GUP Net [159] ICCV21 DD3D [181] NeurIPS21 PCT [246] ICLR22 MonoDistill [39] TPAMI20 MonoDIS-M [220] TPAMI21 MonoEF [307] - DEVIANT − − ✓ − − ✓ ✓ − ✓ ✓ ✓ − − − ✓ − − − − − − ✓ − − − − − ✓ ✓ − − − ✓ is expected because SE models use more flops [227, 309] and, therefore, DEVIANT takes roughly twice the training and inference time as GUP Net. Reproducibility. As described in Sec. 3.5.2, we now list out the five runs of our baseline GUP Net [159] and DEVIANT in Tab. C.8. Tab. C.8 shows that DEVIANT outperforms GUP Net in all runs and in the average run. Experiment Comparison. We now compare the experiments of different chapters in Tab. C.9. To the best of our knowledge, the experimentation in DEVIANT is more than the experimentation of most monocular 3D object detection chapters. 132 (cid:17) ). (a) Depth equivariance error (− (b) Error (− (cid:17) ) on objects. Figure C.8 (a) Depth (scale) equivariance error of vanilla GUP Net [159] and proposed DEVIANT. (See Sec. 3.5.2 for details) (b) Error on objects. The proposed backbone has less depth equivariance error than vanilla CNN backbone. C.3.2 Qualitative Results KITTI. We next show some more qualitative results of models trained on KITTI Val split in Fig. C.9. We depict the predictions of DEVIANT in image view on the left and the predictions of DEVIANT and GUP Net [159], and ground truth in BEV on the right. In general, DEVIANT predictions are more closer to the ground truth than GUP Net [159]. nuScenes Cross-Dataset Evaluation. We then show some qualitative results of KITTI Val model evaluated on nuScenes frontal in Fig. C.10. We again observe that DEVIANT predictions are more closer to the ground truth than GUP Net [159]. Also, considerably less number of boxes are detected in the cross-dataset evaluation i.e. on nuScenes. We believe this happens because of the domain shift. Waymo. We now show some qualitative results of models trained on Waymo Val split in Fig. C.11. We again observe that DEVIANT predictions are more closer to the ground truth than GUP Net [159]. C.3.3 Demo Videos of DEVIANT Detection Demo. We next put a short demo video of our DEVIANT model trained on KITTI Val split at https://www.youtube.com/watch?v=2D73ZBrU-PA. 
We run our trained model indepen- 133 dently on each frame of 2011_09_26_drive_0009 KITTI raw [66]. The video belongs to the City category of the KITTI raw video. None of the frames from the raw video appear in the training set of KITTI Val split [109]. We use the camera matrices available with the video but do not use any temporal information. Overlaid on each frame of the raw input videos, we plot the projected 3D boxes of the predictions and also plot these 3D boxes in the BEV. We set the frame rate of this demo at 10 fps as in KITTI. The attached demo video demonstrates very stable and impressive results because of the additional equivariance to depth translations in DEVIANT which is absent in vanilla CNNs. Also, notice that the orientation of the boxes are stable despite not using any temporal information. Equivariance Error Demo. We next show the depth equivariance (scale equivariance) error demo of one of the channels from the vanilla GUP Net and our proposed method at https://www. youtube.com/watch?v=70DIjQkuZvw. As before, we report at block 3 of the backbones which corresponds to output feature map of the size [96, 320]. The equivariance error demo indicates more white spaces which confirms that DEVIANT achieves lower equivariance error compared to the baseline GUP Net [159]. Thus, this demo agrees with Fig. C.8a. This happens because depth (scale) equivariance is additionally hard-baked into DEVIANT, while the vanilla GUP Net is not equivariant to depth translations (scale transformation s). 134 Figure C.9 KITTI Qualitative Results. DEVIANT predictions in general are more accurate than GUP Net [159]. [Key: Cars (pink), Cyclists (orange) and Pedestrians (violet) of DEVIANT; all classes of GUP Net (cyan), and Ground Truth (green) in BEV]. 135 Figure C.10 nuScenes Cross-Dataset Qualitative Results. DEVIANT predictions in general are more accurate than GUP Net [159]. [Key: Cars of DEVIANT (pink); Cars of GUP Net (cyan), and Ground Truth (green) in BEV]. 136 Figure C.11 Waymo Qualitative Results. DEVIANT predictions in general are more accurate than GUP Net [159]. [Key: Cars (pink), Cyclists (orange) and Pedestrians (violet) of DEVIANT; all classes of GUP Net (cyan), and Ground Truth (green) in BEV]. 137 APPENDIX D SEABIRD APPENDIX D.1 Additional Explanations and Proofs We now add some explanations and proofs which we could not put in the main chapter because of the space constraints. D.1.1 Proof of Converged Value We first bound the converged value from the optimal value. These results are well-known in the literature [113, 214]. We reproduce the result from using our notations for completeness. Lw∞−w∗ (cid:17) (cid:13) (cid:13) 2 2 E (cid:16)(cid:13) (cid:13) = E (cid:16)(cid:13) (cid:13) = E Lw∞−L 𝝁 + L 𝝁−w∗ (cid:18) (cid:16)Lw∞−L 𝝁 + L 𝝁−w∗ (cid:17) 2 2 (cid:13) (cid:13) (cid:17)𝑇 (cid:16)Lw∞−L 𝝁 + L 𝝁−w∗ (cid:17)(cid:19) = E((Lw∞−L 𝝁)𝑇 (Lw∞−L 𝝁)) + E((L 𝝁−w∗)𝑇 (L 𝝁−w∗)) + 2E((Lw∞−L 𝝁)𝑇 (L 𝝁−w∗)) = Var(Lw∞) + E((L 𝝁−w∗)𝑇 (L 𝝁−w∗)) (D.1) where L 𝝁 = E(Lw∞) is the mean of the layer weight and Var(w) denotes the variance of (cid:205) 𝑗 𝑤2 𝑗 . SGD. We begin the proof by writing the value of Lw𝑡 at every step. The model uses SGD, and so, the weight Lw𝑡 after 𝑡 gradient updates is Lw𝑡 = w0 − 𝑠1 Lg1 − 𝑠2 Lg2 − · · · − 𝑠𝑡 Lg𝑡, (D.2) where Lg𝑡 denotes the gradient of w at every step 𝑡. Assume the loss function under consideration L is L = 𝑓 (w𝑡h − 𝑧) = 𝑓 (𝜂). 
Then, we have, Lg𝑡 = = = 𝜕L 𝜕w𝑡 𝜕L (w𝑡h − 𝑧) 𝜕w𝑡 𝜕L (w𝑡h − 𝑧) 𝜕 (w𝑡h − 𝑧) 138 𝜕 (w𝑡h − 𝑧) 𝜕w𝑡 𝜕L (𝜂) 𝜕𝜂 h 𝜕L (𝜂) 𝜕𝜂 = = h =⇒ Lg𝑡 = h𝜖, (D.3) with 𝜖 = 𝜕L (𝜂) 𝜕𝜂 is the gradient of the loss function wrt noise. Expectation and Variance of Gradient Lg𝑡 Since the image h and noise 𝜂 are statistically independent, the image and the noise gradient 𝜂 are also statistically independent. So, the expected gradients E(Lg𝑡) = E(h)E(𝜖) = 0. (D.4) Note that if the loss function is an even function (symmetric about zero), its gradient 𝜖 is an odd function (anti-symmetric about 0), and so its mean E(𝜖) = 0. Next, we write the gradient variance Var(Lg𝑡) as Var(Lg𝑡) = Var(h𝜖) = E(h𝑇 h)E(𝜖 2) − E2(h)E2(𝜖) = E(h𝑇 h) (cid:2)Var(𝜖) + E2(𝜖)(cid:3) − E2(h)E2(𝜖) =⇒ Var(Lg𝑡) = E(h𝑇 h)Var(𝜖) as E(𝜖) = 0 (D.5) Expectation and Variance of Converged Weight Lw𝑡 We first calculate the expected converged weight as E(Lw𝑡) = E(w0) + (cid:169) (cid:173) (cid:171) 𝑠 𝑗 E (cid:16)Lg 𝑗 (cid:17) 𝑡 ∑︁ 𝑗=1 (cid:170) (cid:174) (cid:172) = 0 using Eq. (D.4) , using Eq. (D.2) =⇒ E(Lw∞) = lim 𝑡→∞ E(Lw𝑡) =⇒ E(Lw∞) = L 𝝁 = 0 139 (D.6) We finally calculate the variance of the converged weight. Because the SGD step size is independent of the gradient, we write using Eq. (D.2), Var(Lw𝑡) = Var(w0) + 𝑠2 1Var (g1) + 𝑠2 (cid:17) 2Var (g2) + · · · + 𝑠2 𝑡 Var (cid:16)Lg𝑡 Assuming the gradients Lg𝑡 are drawn from an identical distribution, we have Var(Lw𝑡) = Var(w0) + (cid:169) (cid:173) (cid:171) 𝑡 ∑︁ 𝑗=1 𝑠2 𝑗 (cid:170) (cid:174) (cid:172) Var (cid:16)Lg𝑡 (cid:17) =⇒ Var(Lw∞) = lim 𝑡→∞ Var(Lw𝑡) lim = Var(w0) + (cid:169) (cid:173) 𝑡→∞ (cid:171) Var (cid:16)Lg𝑡 (cid:17) 𝑡 ∑︁ 𝑗=1 𝑠2 𝑗 (cid:170) (cid:174) (cid:172) (cid:17) =⇒ Var(Lw∞) = Var(w0) + 𝑠Var (cid:16)Lg𝑡 (D.7) (D.8) An example of square summable step-sizes of SGD is 𝑠 𝑗 = 1 𝑗 = 𝜋2 𝑠2 6 . This assumption is also satisfied by modern neural networks since their training steps are always 𝑗 , and then the constant 𝑠 = (cid:205) 𝑗=1 finite. Substituting Eq. (D.5) in Eq. (D.8), we have Var(Lw∞) = Var(w0) + 𝑠E(h𝑇 h)Var(𝜖) (D.9) Substituting mean and variances from Eqs. (D.6) and (D.9) in Eq. (D.1), we have E (cid:16)(cid:13) (cid:13) Lw∞−w∗ (cid:17) (cid:13) (cid:13) 2 2 = Var(w0) + 𝑠E(h𝑇 h)Var(𝜖) + E(||w∗||2) = 𝑠E(h𝑇 h)Var(𝜖) + Var(w0) + E(||w∗||2) =⇒ E (cid:16)(cid:13) (cid:13) Lw∞−w∗ (cid:17) (cid:13) (cid:13) 2 2 = 𝑐1Var(𝜖) + 𝑐2, (D.10) where 𝜖 = 𝜕L (𝜂) 𝜕𝜂 is the gradient of the loss function wrt noise, and 𝑐1 = 𝑠E(h𝑇 h) and 𝑐2 are terms independent of the loss function L. 140 D.1.2 Comparison of Loss Functions Eq. (4.1) shows that different losses L lead to different Var(𝜖). Hence, comparing this term for different losses asseses the quality of losses. D.1.2.1 Gradient Variance of MAE Loss The result on MAE (L1) is well-known in the literature [113, 214]. We reproduce the result from [113, 214] using our notations for completeness. The L1 loss is L1(𝜂) = | ˆ𝑧 − 𝑧|1 = |Lw𝑡h − 𝑧|1 = |𝜂|1 =⇒ 𝜖 = 𝜕L1(𝜂) 𝜕𝜂 = sgn(𝜂) (D.11) Thus, 𝜖 = sgn(𝜂) is a Bernoulli random variable with 𝑝(𝜖) = 1/2 for 𝜖 = ±1. So, mean E(𝜖) = 0 and variance Var(𝜖) = 1. D.1.2.2 Gradient Variance of MSE Loss The result on MSE (L2) is well-known in the literature [113, 214]. We reproduce the result from [113, 214] using our notations for completeness. The L2 loss is L2(𝜂) = 0.5| ˆ𝑧 − 𝑧|2 = 0.5|𝜂|2 = 0.5𝜂2 =⇒ 𝜖 = 𝜕L2(𝜂) 𝜕𝜂 = 𝜂 (D.12) Thus, 𝜖 = 𝜂 is a normal random variable [214]. So, mean E(𝜖) = 0 and variance Var(𝜖) = Var(𝜂) = 𝜎2. D.1.2.3 Gradient Variance of Dice Loss. (Proof of Lemma 2) Proof. 
We first write the gradient of dice loss as a function of noise (𝜂) as follows: 𝜖 = 𝜕L𝑑𝑖𝑐𝑒 (𝜂) 𝜕𝜂 = sgn(𝜂) ℓ , |𝜂| ≤ ℓ 0 , |𝜂| ≥ ℓ    (D.13) 141 The gradient of the loss 𝜖 is an odd function and so, its mean E(𝜖) = 0. Next, we write its variance Var(𝜖) as Var(𝜖) = Var(𝜂) = = = = = 1 ℓ2 2 ℓ2 2 ℓ2 2 ℓ2 2 ℓ2 ℓ ∫ −ℓ ∫ ℓ √ √ 0 ℓ/𝜎 ∫ ℓ/𝜎 ∫ 0      −∞  (cid:20) Φ 𝜂2 2𝜎2 𝑑𝜂 𝑒− 𝜂2 2𝜎2 𝑑𝜂 𝑒− 1 2𝜋𝜎 1 2𝜋𝜎 𝜂2 2 𝑑𝜂 𝑒− 1 √ 2𝜋 1 √ 2𝜋 𝑒− 𝜂2 2 𝑑𝜂 − 1 2       (cid:19) (cid:18) ℓ 𝜎 − (cid:21) 1 2 where, Φ is the normal CDF We write the CDF Φ(𝑥) in terms of error function Erf as: Φ(𝑥) = 1 2 + 1 2 Erf (cid:19) (cid:18) 𝑥 √ 2 for 𝑥 ≥ 0. Next, we put 𝑥 = ℓ 𝜎 to get (cid:19) (cid:18) ℓ 𝜎 1 2 + 1 2 = Φ (cid:18) Erf Substituting above in Eq. (D.14), we obtain ℓ (cid:19) √ 2𝜎 Var(𝜖) = =⇒ Var(𝜖) = (cid:20) 1 2 Erf 2 ℓ2 1 ℓ2 (cid:18) ℓ (cid:19) Erf √ 2𝜎 (cid:21) 1 2 − 1 2 ℓ + (cid:18) (cid:19) √ 2𝜎 D.1.3 Proof of Dice Model Being BetterLemma 3 Proof. It remains sufficient to show that E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13)2 ≤ E (∥𝑟w∞ − w∗∥2) 142 (D.14) (D.15) (D.16) (D.17) =⇒ E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13) 2 2 ≤ E (cid:16) ∥𝑟w∞ − w∗∥2 2 (cid:17) (D.18) Using Lemma 1, the above comparison is a comparison between the gradient variance of the loss wrt noise Var(𝜖). Hence, we compute the gradient variance of the loss L, i.e., Var(𝜖) of regression and dice losses to derive this lemma. Case 1 𝜎 ≤ 1: Given Tab. 4.1, if 𝜎 ≤ 1, the minimum deviation in converged regression model comes from the L2 loss. The difference in the estimates of regression loss and the dice loss E (cid:16) ∥𝑟w∞ − w∗∥2 2 (cid:17) − E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:18) 2 (cid:13) (cid:13) 2 ℓ (cid:17) (cid:19) Erf √ 1 ℓ2 2𝜎 ∝ 𝜎2 − (D.19) Let 𝜎𝑚 be the solution of the equation 𝜎2 = 1 ℓ2 Erf (cid:18) √ ℓ (cid:19) 2𝜎 . Note that the above equation has unique solution 𝜎𝑚 since 𝜎2 is a strictly increasing function wrt 𝜎 for 𝜎 > 0, while 1 ℓ2 Erf (cid:18) √ ℓ (cid:19) 2𝜎 is a strictly decreasing function wrt 𝜎 for 𝜎 > 0. If the noise has 𝜎 ≥ 𝜎𝑚, the RHS of the above equation ≥ 0, which means dice loss converges better than the regression loss. Case 2 𝜎 ≥ 1: Given Tab. 4.1, if 𝜎 ≥ 1, the minimum deviation in converged regression model comes from the L1 loss. The difference in the regression and dice loss estimates: E (cid:16) ∥𝑟w∞ − w∗∥2 2 (cid:17) − E (cid:16)(cid:13) (cid:13) ∝ 1 − (cid:17) (cid:19) 2 (cid:13) 𝑑w∞ − w∗ (cid:13) 2 1 ℓ ℓ2 Erf √ (cid:18) 2𝜎 (D.20) If the noise has 𝜎 ≥ √ 2 ℓ Erf−1(ℓ2), the RHS of the above equation ≥ 0, which means dice loss is better than the regression loss. For objects such as cars and trailers which have length ℓ > 4𝑚, this is trivially satisfied. Combining both cases, dice loss outperforms the L1 and L2 losses if the noise deviation 𝜎 exceeds the critical threshold 𝜎𝑐, i.e. (cid:32) 𝜎 > 𝜎𝑐 = max 𝜎𝑚, √ 2 ℓ Erf−1(ℓ2) (cid:33) . (D.21) 143 D.1.4 Proof of Convergence Analysis Th. 2 Proof. Continuing from Lemma 3, the advantage of the trained weight obtained from dice loss over the trained weight obtained from regression losses further results in Var(𝑑w∞) ≤ Var(𝑟w∞) =⇒ E(|𝑑w∞h − 𝑧|) ≤ E(|𝑟w∞h − 𝑧|) =⇒ E(| 𝑑 ˆ𝑧 − 𝑧|) ≤ E(| 𝑟 ˆ𝑧 − 𝑧|) =⇒ E(𝑑IoU3D) ≥ E(𝑟IoU3D), (D.22) assuming depth is the only source of error. Because AP3D is an non-decreasing function of IoU3D, the inequality remains preserved. Hence, we have 𝑑AP3D ≥ 𝑟AP3D. □ Thus, the average precision from the dice model is better than the regression model, which means a better detector. 
D.1.5 Properties of Dice Loss. We next explore the properties of model in Lemma 3 trained with dice loss. From Lemma 1, we write E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13) 2 2 = 𝑐1Var(𝜖) + 𝑐2 Substituting the result of Lemma 2, we have E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) 2 (cid:13) (cid:13) 2 = 𝑐1 ℓ2 Erf (cid:18) √ ℓ (cid:19) 2𝜎 + 𝑐2 (D.23) chapter [9] says that for a normal random variable 𝑋 with mean 0 and variance 1 and for any 𝑥 > 0, we have √ 4 + 𝑥2 − 𝑥 2 =⇒ =⇒ 𝑥 + 𝑥 + 1 √ 4 + 𝑥2 1 √ 4 + 𝑥2 √︂ 1 2𝜋 √︂ 2 𝜋 √︂ 2 𝜋 𝑒− 𝑥2 2 ≤ 𝑃 (𝑋 > 𝑥) 𝑒− 𝑥2 2 ≤ 𝑃 (𝑋 > 𝑥) 𝑒− 𝑥2 2 ≤ 1 − 𝑃 (𝑋 ≤ 𝑥) 144 Table D.1 Assumption comparison of Convergence Analysis of Th. 2 vs Mono3D models. Regression Noise 𝜂 PDF Noise & Image Object Categories Object Size ℓ Error Loss L Optimizers Global Optima Th. 2 Linear Normal Independent 1 Ideal Depth Mono3D Models Non-linear Arbitrary Dependent Multiple Non-ideal All 7 parameters L1, L2, dice Smooth L1, L2, dice, CE SGD Unique SGD, Adam, AdamW Multiple =⇒ =⇒ =⇒ =⇒ 𝑥 + 𝑥 + 𝑥 + 𝑥 + 1 √ 4 + 𝑥2 1 √ 4 + 𝑥2 1 √ 4 + 𝑥2 1 √ 4 + 𝑥2 √︂ 2 𝜋 √︂ 2 𝜋 √︂ 2 𝜋 √︂ 2 𝜋 (cid:18) 𝑥 √ 2 𝑒− 𝑥2 2 ≤ 𝑒− 𝑥2 2 ≤ 1 2 1 2 1 2 𝑒− 𝑥2 2 ≤ 1− 1 2 − ∫ 𝑥 0 1 √ 𝑒− 𝑥2 2 ≤ ∫ 𝑥 − 1 √ 𝑒− 𝑋2 2 𝑑𝑋 2𝜋 𝑒− 𝑋2 2 𝑑𝑋 𝑒−𝑋 2 𝑑𝑋 0 ∫ 𝑥 √ 2 0 1 2 Erf 2𝜋 1 √ 𝜋 (cid:18) 𝑥 √ 2 (cid:19) − − =⇒ Erf (cid:19) ≤ 1 − 2 √ 4 + 𝑥2 𝑥 + √︂ 2 𝜋 𝑒− 𝑥2 2 Substituting 𝑥 = ℓ 𝜎 above, we have, Erf (cid:18) √ ℓ (cid:19) 2𝜎 ≤ 1 − √ ℓ + 2𝜎 4𝜎2 + ℓ2 √︂ 2 𝜋 𝑒− ℓ2 2𝜎2 (D.24) Case 1: Upper bound. The RHS of Eq. (D.24) is clearly less than 1 since the term in the RHS after subtraction is positive. Hence, Erf (cid:18) √ ℓ (cid:19) 2𝜎 ≤ 1 Substituting above in Eq. (D.23), we have E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) 2 (cid:13) (cid:13) 2 ≤ 𝑐1 ℓ2 + 𝑐2 (D.25) Clearly, the deviation of the trained model with the dice loss is inversely proportional to the object length ℓ. The deviation from the optimal is less for large objects. 145 Case 2: Infinite Noise variance 𝜎2 → ∞. Then, one of the terms in the RHS of Eq. (D.24) → 0 =⇒ 𝑒− ℓ2 2𝜎2 ≈ (cid:19) (cid:18) 1 − ℓ2 2𝜎2 . So, RHS of Eq. (D.24) 2𝜎 √ 4𝜎2 + ℓ2 ℓ + becomes → 1. Moreover, ℓ 𝜎 Erf =⇒ Erf (cid:19) (cid:19) (cid:18) (cid:18) √ √ ℓ 2𝜎 ℓ 2𝜎 ≈ 1 − (cid:32) ≈ 1 + (cid:18) √︂ 2 𝜋 √︂ 2 𝜋 1 − (cid:19) ℓ2 2𝜎2 (cid:33) √︂ 2 𝜋 ℓ2 2𝜎2 + (D.26) Substituting above in Eq. (D.23), we have E (cid:16)(cid:13) (cid:13) 𝑑w∞ − w∗ (cid:17) (cid:13) (cid:13) 2 2 ≈ (cid:32) 1 + √︂ 2 𝜋 (cid:33) √︂ 2 𝜋 ℓ2 2𝜎2 + 𝑐1 ℓ2 + 𝑐2 (D.27) Thus, the deviation from the optimal weight is inversely proportional to the noise deviation 𝜎2. Hence, the deviation from the optimal weight decreases as 𝜎2 increases for the dice loss. This property provides noise-robustness to the model trained with the dice loss. D.1.6 Notes on Theoretical Result Assumption Comparisons. The theoretical result of Th. 2 relies upon several assumptions. We present a comparison between the assumptions made by Th. 2 and those underlying Mono3D models, in Tab. D.1. While our analysis depends on these assumptions, it is noteworthy that the results are apparent even in scenarios where the assumptions do not hold true. Another advantage of having a linear regression setup is that this setup has a unique global minima (because of its convexity). Nature of Noise 𝜂. Th. 2 assumes that the noise 𝜂 is a normal random variable N (0, 𝜎2). To verify this assumption, we take the two SoTA released models GUP Net [159] and DEVIANT [108] on the KITTI [67] Val cars. We next plot the depth error histogram of both these models in Fig. D.1. 
This figure confirms that the depth error is close to the Gaussian random variable. Thus, this assumption is quite realistic. Th. 2 Requires Assumptions? We agree that Th. 2 requires assumptions for the proof. However, our theory does have empirical support; most Mono3D works have no theory. So, our theoretical 146 Figure D.1 Depth error histogram of released GUP Net and DEVIANT [108] on the KITTI Val cars. The histogram shows that depth error is close to the Gaussian random variable. attempt for Mono3D is a step forward! We leave the analysis after relaxing some or all of these assumptions for future avenues. Does Th. 2 Hold in Inference? Yes, Th. 2 holds even in inference. Th. 2 relies on the converged weight Lw∞, which in turn depends on the training data distribution. Now, as long as the training and testing data distribution remains the same (a fundamental assumption in ML), Th. 2 holds also during inference. D.1.7 More Discussions SeaBird improves because it removes depth estimation and integrates BEV segmentation. We clarify to remove this confusion. First, SeaBird also estimates depth. SeaBird depth estimates are better because of good segmentation, a form of depth (thanks to dice loss). Second, predicted BEV segmentation needs processing with the 3D head to output depth; so it can not replace depth estimation. Third, integrating segmentation over all categories degrades Mono3D performance ( [132] and our Tab. 4.5 Sem. Category). Why evaluation on outdoor datasets? We experiment with outdoor datasets in this chapter because indoor datasets rarely have large objects (mean length > 6𝑚). 147 D.2 Implementation Details Datasets. Our experiments use the publicly available KITTI-360, KITTI-360 PanopticBEV and nuScenes datasets. KITTI-360 is available at https://www.cvlibs.net/datasets/kitti-360/download. php under CCA-NonCommercial-ShareAlike (CC BY-NC-SA) 3.0 License. KITTI-360 Panop- ticBEV is available at http://panoptic-bev.cs.uni-freiburg.de/ under Robot Learning License Agree- ment. nuScenes is available at https://www.nuscenes.org/nuscenes under CC BY-NC-SA 4.0 Inter- national Public License. Data Splits. We detail out the detection data split construction of the KITTI-360 dataset. • KITTI-360 Test split: This detection benchmark [136] contains 300 training and 42 testing windows. These windows contain 61,056 training and 9,935 testing images. The calibration exists for each frame in training, while it exists for every 10th frame in testing. Therefore, our split consists of 61,056 training images, while we run monocular detectors on 910 test images (ignoring uncalibrated images). • KITTI-360 Val split: The KITTI-360 detection Val split partitions the official train into 239 train and 61 validation windows [136]. The original Val split [136] contains 49,003 training and 14,600 validation images. However, this original Val split has the following three issues: – Data leakage (common images) exists in the training and validation windows. – Every KITTI-360 image does not have the corresponding BEV semantic segmentation GT in the KITTI-360 PanopticBEV [72] dataset, making it harder to compare Mono3D and BEV segmentation performance. – The KITTI-360 validation set has higher sampling rate compared to the testing set. To fix the data leakage issue, we remove the common images from training set and keep them only in the validation set. 
Then, we take the intersection of KITTI-360 and KITTI-360 PanopticBEV datasets to ensure that every image has corresponding BEV segmentation segmentation GT. After these two steps, the training and validation set contain 48,648 and 12,408 images with calibration and semantic maps. Next, we subsample the validation images by a factor of 10 as in the testing set. Hence, our KITTI-360 Val split contains 48,648 training images and 1,294 148 Figure D.2 Skewness in datasets. The ratio of large (yellow) objects to other objects is approximately 1 : 2 in KITTI-360 [136], while the skewness is about 1 : 21 in nuScenes [22]. validation images. Augmentation. We keep the same augmentation strategy as our baselines for the respective models. Pre-processing. We resize images to preserve their aspect ratio. • KITTI-360. We resize the [376, 1408] sized KITTI-360 images, and bring them to the [384, 1438] resolution. • nuScenes. We resize the [900, 1600] sized nuScenes images, and bring them to the [256, 704], [512, 1408] and [640, 1600] resolutions as our baselines [303, 311]. Libraries. I2M and PBEV experiments use PyTorch [184], while BEVerse and HoP use MMDe- tection3D [44]. Architecture. • I2M+SeaBird.I2M [211] uses ResNet-18 as the backbone with the standard Feature Pyramid Network (FPN) [138] and a transformer to predict depth distribution. FPN is a bottom-up feed-forward CNN that computes feature maps with a downscaling factor of 2, and a top-down network that brings them back to the high-resolution ones. There are total four feature maps 149 levels in this FPN. We use the Box Net with ResNet-18 [77] as the detection head. • PBEV+SeaBird.PBEV [72] uses EfficientDet [231] as the backbone. We use Box Net with ResNet-18 [77] as the detection head. • BEVerse+SeaBird. BEVerse [303] uses Swin transformers [152] as the backbones. We use the original heads without any configuration change. • HoP+SeaBird. HoP [311] uses ResNet-50, ResNet-101 [77] and V2-99 [181] as the backbones. Since HoP does not have the segmentation head, we use the one in BEVerse as the segmentation head. We initialize the CNNs and transformers from ImageNet weights except for V2-99, which is pre- trained on 15 million LiDAR data.. We output two and ten foreground categories for KITTI-360 and nuScenes datasets respectively. Training. We use the training protocol as our baselines for all our experiments. We choose the model saved in the last epoch as our final model for all our experiments. • I2M+SeaBird. Training uses the Adam optimizer [102], a batch size of 30, an exponential decay of 0.98 [211] and gradient clipping of 10 on single Nvidia A100 (80GB) GPU. We train the BEV Net in the first stage with a learning rate 1.0×10−4 for 50 epochs [211] . We then add the detector in the second stage and finetune with the first stage weight with a learning rate 0.5×10−4 for 40 epochs. Training on KITTI-360 Val takes a total of 100 hours. For Test models, we finetune I2M Val stage 1 model with train+val data for 40 epochs. • PBEV+SeaBird. Training uses the Adam optimizer [102] with Nesterov, a batch size of 2 per GPU on eight Nvidia RTX A6000 (48GB) GPU. We train the PBEV with the dice loss in the first stage with a learning rate 2.5×10−3 for 20 epochs. We then add the Box Net in the second stage and finetune with the first stage weight with a learning rate 2.5×10−3 for 20 epochs. PBEV decays the learning rate by 0.5 and 0.2 at 10 and 15 epoch respectively. Training on KITTI-360 Val takes a total of 80 hours. 
For Test models, we finetune PBEV Val stage 1 model with train+val data for 10 epochs on four GPUs. • BEVerse+SeaBird. Training uses the AdamW optimizer [156], a sample size of 4 per GPU, 150 the one-cycle policy [303] and gradient clipping of 35 on eight Nvidia RTX A6000 (48GB) GPU [303]. We train the segmentation head in the first stage with a learning rate 2.0×10−3 for 4 epochs. We then add the detector in the second stage and finetune with the first stage weight with a learning rate 2.0×10−3 for 20 epochs [303]. Training on nuScenes takes a total of 400 hours. • HoP+SeaBird. Training uses the AdamW optimizer [156], a sample size of 2 per GPU, and gradient clipping of 35 on eight Nvidia A100 (80GB) GPUs [311]. We train the segmentation head in the first stage with a learning rate 1.0×10−4 for 4 epochs. We then add the detector in the second stage and finetune with the first stage weight with a learning rate 1.0×10−4 for 24 epochs [303]. nuScenes training takes a total of 180 hours. For Test models, we finetune val model with train+val data for 4 more epochs. Losses. We train the BEV Net of SeaBird in Stage 1 with the dice loss. We train the final SeaBird pipeline in Stage 2 with the following loss: L = L𝑑𝑒𝑡 + 𝜆𝑠𝑒𝑔L𝑠𝑒𝑔, (D.28) with L𝑠𝑒𝑔 being the dice loss and 𝜆𝑠𝑒𝑔 being the weight of the dice loss in the baseline. We keep the 𝜆𝑠𝑒𝑔 = 5. If the segmentation loss is itself scaled such as PBEV uses the L𝑠𝑒𝑔 as 7, we use 𝜆𝑠𝑒𝑔 = 35 with detection. Inference. We report the performance of all KITTI-360 and nuScenes models by inferring on single GPU card. Our testing resolution is same as the training resolution. We do not use any augmentation for test/validation. We keep the maximum number of objects is 50 per image for KITTI-360 models. We use score threshold of 0.1 for KITTI-360 models and class dependent threshold for nuScenes models as in [303]. KITTI-360 evaluates on windows and not on images. So, we use a 3D center-based NMS [109] to convert image-based predictions to window-based predictions for SeaBird and all our KITTI-360 baselines. This NMS uses a threshold of 4m for all categories, and keeps the highest score 3D box if multiple 3D boxes exist inside a window. 151 Table D.2 Error analysis on KITTI-360 Val. ✓ ✓ (cid:17)) Oracle AP 3D 25 [%](− (cid:17)) AP 3D 50 [%](− 𝑥 𝑦 𝑧 𝑙 𝑤 ℎ 𝜃 AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP 8.71 43.19 25.95 35.76 52.22 43.99 9.78 41.63 25.70 36.07 50.63 43.35 9.57 46.08 27.82 34.65 53.03 43.84 9.90 42.32 27.11 39.66 53.08 46.37 19.90 47.37 33.63 41.84 52.53 47.19 9.49 45.67 27.58 33.43 51.53 42.48 ✓ ✓ ✓ ✓ ✓ ✓ 37.09 46.27 41.68 44.58 51.15 47.87 ✓ ✓ ✓ ✓ ✓ ✓ ✓ 37.02 47.03 42.02 44.46 51.50 47.98 ✓ ✓ ✓ ✓ ✓ ✓ ✓ Table D.3 Complexity analysis on KITTI-360 Val. Method GUP Net [159] DEVIANT [108] I2M [211] I2M+SeaBird PBEV [72] PBEV+SeaBird Mono3D Inf. Time (s) Param (M) Flops (G) ✓ ✓ ✕ ✓ ✕ ✓ 0.02 0.04 0.01 0.02 0.14 0.15 16 16 40 53 24 37 30 235 80 130 229 279 D.3 Additional Experiments and Results We now provide additional details and results of the experiments evaluating SeaBird’s perfor- mance. D.3.1 KITTI-360 Val Results Error Analysis. We next report the error analysis of the SeaBird in Tab. D.2 by replacing the predicted box data with the oracle box data as in [166]. We consider the GT box to be an oracle box for predicted box if the euclidean distance is less than 4𝑚. In case of multiple GT being matched to one box, we consider the oracle with the minimum distance. Tab. D.2 shows that depth is the biggest source of error for Mono3D task as also observed in [166]. 
Moreover, the oracle does not lead to perfect results since the KITTI-360 PanopticBEV GT BEV semantic is only upto 50𝑚, while the KITTI-360 evaluates all objects (including objects beyond 50𝑚). Computational Complexity Analysis. We next compare the complexity analysis of SeaBird pipeline in Tab. D.3. For the flops analysis, we use the fvcore library as in [108]. Naive baseline for Large Objects. We next compare SeaBird against a naive baseline for large objects detection, such as by fine-tuning GUP Net only on larger objects. Tab. D.4 shows that 152 Table D.4 KITTI-360 Val results with naive baseline finetuned for large objects. SeaBird pipelines comfortably outperform this naive baseline on large objects. [Key: Best, Second Best, †= Retrained] Method Venue GUP Net † [159] GUP Net (Large FT) † [159] I2M+SeaBird PBEV+SeaBird (cid:17)) AP 3D 50 [%](− BEV Seg IoU [%](− AP 3D 25 [%](− AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor 0.54 0.56 8.71 ICCV21 ICCV21 CVPR24 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 CVPR24 13.22 42.46 27.84 37.15 52.53 44.84 24.30 48.04 36.17 45.11 22.83 0.28 50.52 25.75 1.28 0.98 2.56 − − − − − − (cid:17)) − − (cid:17)) Table D.5 Impact of denoising BEV segmentation maps with MIRNet-v2 [294] on KITTI-360 Val with I2M+SeaBird. Denoising does not help. [Key: Best] Denoiser ✓ ✕ (cid:17)) AP 3D 50 [%](− BEV Seg IoU [%](− AP 3D 25 [%](− AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor 43.77 23.25 14.34 51.23 32.79 21.42 39.72 30.57 2.73 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 8.71 (cid:17)) (cid:17)) Table D.6 Segmentation loss weight 𝜆𝑠𝑒𝑔 sensitivity on KITTI-360 Val with I2M+SeaBird. 𝜆𝑠𝑒𝑔 = 5 works the best. [Key: Best] 𝜆𝑠𝑒𝑔 0 1 3 5 10 (cid:17)) (cid:17)) AP 3D 50 [%](− BEV Seg IoU [%](− AP 3D 25 [%](− AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor 4.86 3.54 45.09 24.98 26.33 52.31 39.32 41.71 24.39 32.92 7.07 42.91 23.78 40.58 32.18 43.45 25.36 34.47 52.54 43.51 23.40 40.15 31.78 7.26 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 8.71 43.41 25.55 34.22 50.97 42.60 22.15 39.83 30.99 7.69 7.07 52.9 0 (cid:17)) SeaBird pipelines comfortably outperform this baseline as well. Does denoising BEV images help? Another potential addition to the SeaBird framework is using a denoiser between segmentation and detection heads. We use the MIRNet-v2 [294] as our denoiser and train the BEV segmentation head, denoiser and detection head in an end-to-end manner. Tab. D.5 shows that denoising does not increase performance but the inference time. Hence, we do not use any denoiser for SeaBird. Sensitivity to Segmentation Weight. We next study the impact of segmentation weight on I2M+SeaBird in Tab. D.6 as in Sec. 4.4.2. Tab. D.6 shows that 𝜆𝑠𝑒𝑔 = 5 works the best for the Mono3D of large objects. Reproducibility. We ensure reproducibility of our results by repeating our experiments for 3 random seeds. We choose the final epoch as our checkpoint in all our experiments as [108]. Tab. D.7 shows the results with these seeds. SeaBird outperforms SeaBird without dice loss in the 153 Table D.7 Reproducibility results on KITTI-360 Val with I2M+SeaBird. SeaBird outperforms SeaBird without dice loss in the median and average cases. 
[Key: Best, Second Best] Dice Seed ✕ ✓ 111 444 222 Avg 111 444 222 Avg (cid:17)) (cid:17)) (cid:17)) AP 3D 50 [%](− BEV Seg IoU [%](− AP 3D 25 [%](− AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP AP𝐿𝑟𝑔 AP𝐶𝑎𝑟 mAP Large Car MFor 3.00 3.81 44.63 24.22 24.96 53.15 39.06 3.54 4.86 45.09 24.98 26.33 52.31 39.32 2.66 5.79 46.71 26.25 24.32 54.06 39.19 3.06 4.82 45.58 25.15 25.20 53.17 39.19 44.03 25.95 33.55 53.93 43.74 22.64 40.64 31.64 7.87 43.19 25.95 35.76 52.22 43.99 23.23 39.61 31.42 8.71 42.87 25.79 34.71 51.72 43.22 22.74 40.01 31.38 8.71 43.36 25.90 34.67 52.62 43.65 22.87 40.09 31.48 8.43 5.99 7.07 5.32 6.13 0 0 0 0 Table D.8 Dice vs regression on methods with depth estimation. Dice model again outperforms regression loss models, particularly for large objects. [Key: Best, Second Best] Resolution Method BBone Venue Loss AP𝐿𝑟𝑔 (− 256×704 HoP+SeaBird R50 ICCV23 − L1 L2 CVPR24 Dice − − 27.4 27.0 28.2 (cid:17)) AP𝐶𝑎𝑟 (− 57.2 57.1 58.6 (cid:17)) AP𝑆𝑚𝑙 (− 46.4 46.5 (cid:17)) mAP (− 39.9 39.7 Did Not Converge 41.1 47.8 (cid:17)) (cid:17)) NDS (− 50.9 50.7 51.5 median and average cases. The biggest improvement shows up on larger objects. D.3.2 nuScenes Results Extended Val Results. Besides showing improvements upon existing detectors in Tab. 4.7 on the nuScenes Val split, we compare with more recent SoTA detectors with large backbones in Tab. D.9. Dice vs regression on depth estimation methods. We report HoP +R50 config, which uses depth estimation and compare losses in Tab. D.8. Tab. D.8 shows that Dice model again outperforms regression loss models. SeaBird Compatible Approaches. SeaBird conditions the detection outputs on segmented BEV features and so, requires foreground BEV segmentation. So, all approaches which produce latent BEV map in Tabs. 4.6 and 4.7 are compatible with SeaBird. However, approaches which do not produce BEV features such as SparseBEV [142] are incompatible with SeaBird. D.3.3 Qualitative Results KITTI-360. We now show some qualitative results of models trained on KITTI-360 Val split in Fig. D.3. We depict the predictions of PBEV+SeaBird in image view on the left, the predictions of PBEV+SeaBird, the baseline MonoDETR [300], predicted and GT boxes in BEV in the mid- 154 Table D.9 nuScenes Val Detection results. SeaBird pipelines outperform the baselines, particularly for large objects. 
[Key: Best, Second Best, B= Base, S= Small, T= Tiny, = Released, ∗= Reimplementation, §= CBGS] Resolution Method (cid:17)) AP𝐶𝑎𝑟 (− (cid:17)) AP𝑆𝑚𝑙 (− (cid:17)) BBone R50 R50 R50 Swin-T R50 R50 R101 R101 R101 R101 R101 Swin-S Venue CAPE [273] CVPR23 PETRv2 [150] ICCV23 SOLOFusion§ [182] ICLR23 BEVerse-T [303] ArXiv BEVerse-T+SeaBird Swin-T CVPR24 ICCV23 HoP [311] CVPR24 HoP+SeaBird ICCV23 3DPPE [219] AAAI23 STS [261] ICCV23 P2D [101] AAAI23 BEVDepth [127] ArXiv BEVDet4D [87] ArXiv BEVerse-S [303] BEVerse-S+SeaBird Swin-S CVPR24 R101 ICCV23 HoP ∗ [311] R101 CVPR24 HoP+SeaBird V2-99 ArXiv BEVDet [88] ICCV23 R101 PETRv2 [150] V2-99 CVPR23 CAPE [273] ArXiv Swin-B BEVDet4D § [87] V2-99 HoP ∗ [311] ICCV23 V2-99 CVPR24 HoP+SeaBird ICCVW21 R101 FCOS3D [250] CoRL21 R101 PGD [251] CoRL21 R101 DETR3D [257] ECCV22 R101 PETR [149] R101 BEVFormer [132] ECCV22 V2-99 AAAI23 PolarFormer [95] AP𝐿𝑟𝑔 (− 18.5 − 26.5 18.5 19.5 27.4 28.2 − − − − − 20.9 24.6 31.4 32.9 29.6 − 31.2 − 36.5 40.3 − − 22.4 − 27.7 − 256×704 512×1408 640×1600 900×1600 53.2 − 57.3 53.4 54.2 57.2 58.6 − − − − − 56.2 58.7 63.7 65.0 61.7 − 63.2 − 69.1 71.7 − − 60.3 − 48.5 − 38.1 − 48.5 38.8 41.1 46.4 47.8 − − − − − 42.2 45.0 52.5 53.1 48.2 − 51.9 − 56.1 58.8 − − 41.1 − 34.5 − (cid:17)) mAP (− 31.8 34.9 40.6 32.1 33.8 39.9 41.1 39.1 43.1 43.3 41.8 42.1 35.2 38.2 45.2 46.2 42.1 42.1 44.7 42.6 49.6 52.7 34.4 36.9 34.9 37.0 41.5 50.0 (cid:17)) NDS (− 44.2 45.6 49.7 46.6 48.1 50.9 51.5 45.8 52.5 52.8 53.8 54.5 49.5 51.3 55.0 54.7 48.2 52.4 54.4 55.2 58.3 60.2 41.5 42.8 43.4 44.2 51.7 56.2 dle and BEV semantic segmentation predictions from PBEV+SeaBird on the right. In general, PBEV+SeaBird detects more larger objects (buildings) than GUP Net [159]. nuScenes. We now show some qualitative results of models trained on nuScenes Val split in Fig. D.4. As before, we depict the predictions of BEVerse-S+SeaBird in image view from six cameras on the left and BEV semantic segmentation predictions from SeaBird on the right. KITTI-360 Demo Video. We next put a short demo video of PBEV+SeaBird model trained on KITTI-360 Val split compared with MonoDETR at https://www.youtube.com/watch?v=SmuRbMbsnZA. We run our trained model independently on each frame of KITTI-360. None of the frames from the raw video appear in the training set of KITTI-360 Val split. We use the camera matrices available with the video but do not use any 155 temporal information. Overlaid on each frame of the raw input videos, we plot the projected 3D boxes of the predictions, predicted and GT boxes in BEV in the middle and BEV semantic segmentation predictions from PBEV+SeaBird. We set the frame rate of this demo at 5 fps similar to [108]. The demo video demonstrates impressive results on larger objects. 156 Figure D.3 KITTI-360 Qualitative Results. PBEV+SeaBird detects more large objects (buildings, in blue) than MonoDETR [300] in orange. We depict the predictions of PBEV+SeaBird in the image view on the left, the predictions of PBEV+SeaBird, the baseline MonoDETR [300], and ground truth in BEV in the middle, and BEV semantic segmentation predictions from PBEV+SeaBird on the right. [Key: Buildings (in blue) and Cars (in yellow) of PBEV+SeaBird; all classes (pink) of MonoDETR [300], and Ground Truth (in green) in BEV]. 157 Figure D.4 nuScenes Qualitative Results. The first row shows the front_left, front, and front_right cameras, [Key: Cars (blue), Vehicles while the second row shows the back_left, back, and back_right cameras. 
(green), Pedestrian (violet), Cones (yellow) and Barrier (gray) of BEVerse-S+SeaBird at 200×200 resolution in BEV ]. 158 APPENDIX E CHARM3R APPENDIX E.1 Additional Details and Proof We now add more details and proofs which we could not put in the main paper because of the space constraints. E.1.1 Proof of Ground Depth Lemma 1 We reproduce the proof from [284] with our notations for the sake of completeness of this work. Proof. We first rewrite the pinhole projection Eq. (5.1) as: 𝑋 𝑌 𝑍                        = R−1(K−1 𝑧 − T). (E.1) 𝑣  𝑢          1           We now represent the ray shooting from the camera optical center through each pixel as −→𝑟 (𝑢, 𝑣, 𝑧). Using the matrix 𝑨 = (𝑎𝑖 𝑗 ) = R−1K−1, and the vector 𝑩 = (𝑏𝑖) = −R−1T, we define the parametric ray as: 𝑋 = (𝑎11𝑢 + 𝑎12𝑣 + 𝑎13)𝑧 + 𝑏1 −→𝑟 (𝑢, 𝑣, 𝑧) : 𝑌 = (𝑎21𝑢 + 𝑎22𝑣 + 𝑎23)𝑧 + 𝑏2 (E.2) 𝑍 = (𝑎31𝑢 + 𝑎32𝑣 + 𝑎33)𝑧 + 𝑏3 Moreover, the ground at a distance ℎ can be described by a plane, which is determined by the point (0, 𝐻, 0) in the plane and the normal vector −→𝑛 = (0, 1, 0): −→𝑟 · −→𝑛 = 𝐻. (E.3) Then, the ground depth is the intersection point between this ray and the ground plane. Combining Eqs. (E.2) and (E.3), the ground depth 𝑧 of the pixel (𝑢, 𝑣) is: (𝑎21𝑢 + 𝑎22𝑣 + 𝑎23)𝑧 + 𝑏2 = 𝐻 =⇒ 𝑧 = 𝐻 − 𝑏2 𝑎21𝑢 + 𝑎22𝑣 + 𝑎23 . (E.4) □ 159 E.1.2 Proof of Lemma 5 We next derive Lemma 5 from Lemma 4 as follows. Proof. 𝑨 = (𝑎𝑖 𝑗 ) = R−1K−1 = 𝑰−1 = 𝑰 1 𝑓 0 0 1 𝑓 −𝑢0 𝑓 −𝑣0 𝑓 0 0 1                               0 𝑢0 𝑓 𝑣0 1 −1           𝑓 0 0 0 , with rotation matrix R is identity 𝑰 for forward cameras. So, 𝑎21 = 0, 𝑎22 = 1 𝑓 , 𝑎23 = −𝑣0 𝑓 𝑎21, 𝑎22, 𝑎23 in Eq. (5.2), we get Eq. (5.3). . Substituting □ E.1.3 Extension to Camera not parallel to Ground Following Sec. 3.3 of GEDepth [284], we use the camera pitch 𝛿, and generalize Eq. (5.2) to obtain ground depth as 𝑧 = = 𝐻 −𝑏2cos 𝛿−𝑏3sin 𝛿 [𝑎21𝑢+𝑎22𝑣 +𝑎23] cos 𝛿+ [𝑎31𝑢+𝑎32𝑣 +𝑎33] sin 𝛿 𝐻 − 𝑏2 cos𝛿 − 𝑏3sin 𝛿 𝑣−𝑣0 𝑓 cos 𝛿 + sin 𝛿 (E.5) . Note that if camera pitch 𝛿 = 0, this reduces to the usual form of Eq. (5.2) and Eq. (5.3) respectively. Also, Th. 1 has a more general form with the pitch value, and remains valid for majority of the pitch angle ranges. E.1.4 Extension to Not-flat Roads For non-flat roads, we assume that the road is made of multiple flat ‘pieces‘ of roads each with its own slope and we predict the slope of each pixel as in GEDepth [284]. To predict slope ˆ𝛿 of each pixel, we first define a set of 𝑁 discrete slopes: {𝜏𝑖, 𝑖 = 1, ..., 𝑁 }. We compute each pixel slope by linearly combining the discrete slopes with the predicted probability distribution 160 (a) GUP Net (b) CHARM3R Figure E.1 CARLA Val AP3D at different depths and IoU3D thresholds with GUP Net. CHARM3R shows biggest gains on IoU3D > 0.3 for [0, 30]𝑚 boxes. (a) AP 3D 70 [%] comparison. (b) AP 3D 50 [%] comparison. (c) MDE comparison. Figure E.2 CARLA Val Results with GUP Net detector after augmentation of [104]. Training a detector with both Δ𝐻 = −0.70𝑚 and Δ𝐻 = 0𝑚 images produces better results at Δ𝐻 = −0.70𝑚 and Δ𝐻 = 0𝑚, but fails at unseen height images Δ𝐻 = +0.76𝑚. CHARM3R outperforms all baselines, especially at unseen bigger height changes. All methods except Oracle are trained on car height and tested on all heights. { ˆ𝑝𝑖 ∈ [0, 1], (cid:205)𝑖 ˆ𝑝𝑖 = 1} over 𝑁 slopes ˆ𝛿 = (cid:205)𝑖 ˆ𝑝𝑖𝜏𝑖. 
We train the network to minimize the total loss: 𝐿total = 𝐿det +𝜆slope𝐿slope(𝛿, ˆ𝛿), where 𝐿det are the detection losses, and 𝐿slope is the slope classification loss. We next substitute the predicted slope in [?]. We do not run this experiment since planar ground is reasonable assumption for most driving scenarios within some distance. E.1.5 Unrealistic assumptions of Th. 4 We partially agree. These assumptions reflect the observations of [51]. Also, our theory has empirical support; most Mono3D works have no theory. So, our theoretical attempt is a step forward! 161 (a) AP 3D 70 [%] comparison. (b) AP 3D 50 [%] comparison. (c) MDE comparison. Figure E.3 CARLA Val Results with DEVIANT detector. CHARM3R outperforms all baselines, espe- cially at bigger height changes. All methods except Oracle are trained on car height and tested on all heights. Results of inference on height changes of −0.70, 0 and 0.76 meters are in Tab. 5.2. E.2 Additional Experiments We now provide additional details and results of the experiments evaluating CHARM3R’s performance. E.2.1 CARLA Val Results We first analyze the results on the synthetic CARLA dataset further. AP at different distances and thresholds. We next compare the AP3D of the baseline GUP Net and CHARM3R in Fig. E.1 at different distances in meters and IoU3D matching criteria of 0.1 − 0.7 as in [108]. Fig. E.1 shows that CHARM3R is effective over GUP Net at all depths and higher IoU3D thresholds. CHARM3R shows biggest gains on IoU3D > 0.3 for [0, 30]𝑚 boxes. Comparison with Augmentation-Methods. Sec. 5.1 of the paper says that the augmentation strategy falls short when the target height is OOD. We show this in Fig. E.2. Since authors of [104] do not release the NVS code, we use the ground truth images from height change Δ𝐻 = −0.70𝑚 in training. Fig. E.2 confirms that augmentation also improves the performance on Δ𝐻 = −0.70𝑚 and Δ𝐻 = 0𝑚, but again falls short on unseen ego heights Δ𝐻 = +0.76𝑚. On the other hand, CHARM3R (even though trained on Δ𝐻 = −0.70𝑚) outperforms such augmentation strategy at unseen ego heights Δ𝐻 = +0.76𝑚. This shows the complementary nature of CHARM3R over augmentation strategies. Reproducibility. We ensure reproducibility of our results by repeating our experiments for 3 random seeds. We choose the final epoch as our checkpoint in all our experiments as [108, 110]. Tab. E.1 shows the results with these seeds. CHARM3R outperforms the baseline GUP Net in both 162 Table E.1 Reproducibility Results. CHARM3R outperforms all other baselines on CARLA Val split, especially at bigger unseen ego heights in both median (Seed=444) and average cases. All except Oracle are trained on car height Δ𝐻 = 0𝑚 and tested on bot to truck height data. [Key: Best] 3D Detector GUP Net [159] + CHARM3R Oracle Seed − (cid:17) / Δ𝐻 (𝑚)− (cid:17) 111 444 222 Average 111 444 222 Average − (cid:17)) (cid:17)) AP 3D 70 [%] (− 0 55.98 53.82 52.94 54.25 58.16 55.68 53.57 55.80 53.82 −0.70 12.24 9.46 10.35 10.68 19.99 19.45 17.41 18.95 70.96 +0.76 −0.70 44.14 7.53 41.66 7.23 41.67 10.79 42.49 8.52 54.15 29.96 53.40 27.33 54.30 27.77 53.95 28.35 83.88 62.25 AP 3D 50 [%] (− 0 76.37 76.47 75.80 76.21 74.10 74.47 74.83 74.47 76.47 +0.76 −0.70 41.32 40.97 46.45 43.58 64.27 61.98 64.42 63.56 83.96 MDE (𝑚) [≈ 0] +0.76 0 +0.48 +0.00 −0.64 +0.53 +0.03 −0.63 +0.53 +0.01 −0.57 +0.51 +0.01 −0.61 +0.09 +0.00 −0.03 +0.07 +0.05 −0.02 +0.12 +0.01 −0.09 +0.09 +0.02 −0.05 +0.03 +0.03 +0.03 Table E.2 nuScenes to CODa Val Results. 
CHARM3R outperforms all baselines, especially at unseen height changes. [Key: Best, Second Best, Ped= Pedestrians] 3D Detector Method GUP Net [159] Source UniDrive [129] UniDrive++[129] CHARM3R Oracle (cid:17)) Car AP 3D 50 [%] (− nuScenes CODa 0.02 18.42 0.02 18.42 18.42 0.03 14.80 0.30 18.42 28.56 (cid:17)) Ped [%] (− CODa nuScenes 0.01 0.01 0.02 0.05 30.31 2.93 2.93 2.93 1.26 2.93 (a) CODa Car (b) CODa Pedestrian Figure E.4 CODa Val AP3D at different depths and IoU3D thresholds with GUP Net trained on nuScenes. CHARM3R shows biggest gains on IoU3D < 0.3 for [0, 30]𝑚 boxes. median and average cases. Results with DEVIANT. We next additionally plot the robustness of CHARM3R with other methods on the DEVIANT detector [108] in Fig. E.3 The figure confirms that CHARM3R works even with DEVIANT and produces SoTA robustness to unseen ego heights. 163 E.2.2 nuScenes − CODa Val Results To test our claims further in real-life, we use two real datasets: the nuScenes dataset [22] and (cid:17) the recently released CODa [295] datasets. nuScenes has ego camera at height 1.51𝑚 above the ground, while the CODa is a robotics dataset with ego camera at a height of 0.75𝑚 above the ground. This experiment uses the following data split: • nuScenes Val Split. This split [22] contains 28,130 training and 6,019 validation images from the front camera as [108]. • CODa Val Split. This split [295] contains 19,511 training and 4,176 validation images. We only use this split for testing. We train the GUP Net detector with 10 nuScenes classes and report the results with the KITTI metrics on both nuScenes val and CODa Val splits. Main Results. We report the main results in Tab. E.2 paper. The results of Tab. E.2 shows gains on both Cars and Pedestrians classes of CODa val dataset. The performance is very low, which we believe is because of the domain gap between nuScenes and CODa datasets. These results further confirm our observations that unlike 2D detection, generalization across unseen datasets remains a big problem in the Mono3D task. AP at different distances and thresholds. To further analyze the performance, we next plot the AP3D of the baseline GUP Net and CHARM3R in Fig. E.4 at different distances in meters and IoU3D matching criteria of 0.1 − 0.5 as in [108]. Fig. E.4 shows that CHARM3R is effective over GUP Net at all depths and lower IoU3D thresholds. CHARM3R shows biggest gains on IoU3D < 0.3 for [0, 30]𝑚 boxes. The gains are more on the Pedestrian class on CODa since CODa captures UT Austin campus scenes, and therefore, has more pedestrians compared to cars. nuScenes captures outdoor driving scenes in Boston and Singapore, and therefore, has more cars compared to pedestrians. We describe the statistics of these two datasets in Tab. E.3. E.2.3 Qualitative Results. CARLA. We now show some qualitative results of models trained on CARLA Val split from car height (Δ𝐻 = 0𝑚) and tested on truck height (Δ𝐻 = +0.76𝑚) in Fig. E.5. We depict the 164 Table E.3 Dataset statistics. nuScenes Val has more Cars compared to Pedestrians, while CODa Val has more Pedestrians than Cars. Val nuScenes CODa Ego Ht (𝑚) #Images Car (𝑘) Ped (𝑘) 1.51 0.75 6,019 4,176 18 4 7 86 predictions of CHARM3R in image view on the left, the predictions of CHARM3R, the baseline GUP Net [159], and GT boxes in BEV on the right. In general, CHARM3R detects objects more accurately than GUP Net [159], making CHARM3R more robust to camera height changes. 
The regression-based baseline GUP Net mostly underestimates the depth of 3D boxes with positive ego height changes, which qualitatively justifies the claims of Th. 4. CODa. We now show some qualitative results of models trained on CODa Val split in Fig. E.6. As before, we depict the predictions of CHARM3R in image view image view on the left, the predictions of CHARM3R, the baseline GUP Net [159], and GT boxes in BEV on the right. In general, CHARM3R detects objects more accurately than the baseline GUP Net [159], making CHARM3R more robust to camera height changes. Also, considerably less number of boxes are detected in the cross-dataset evaluation i.e. on CODa Val. We believe this happens because of the domain shift. 165 Figure E.5 CARLA Val Qualitative Results. CHARM3R detects objects more accurately than GUP Net [159], making CHARM3R more robust to camera height changes. The regression-based baseline GUP Net mostly underestimates the depth which qualitatively justifies the claims of Th. 4. All methods are trained on CARLA images at car height Δ𝐻 = 0𝑚 and evaluated on Δ𝐻 = +0.76𝑚. [Key: Cars (pink) of CHARM3R. ; Cars (cyan) of GUP Net, and Ground Truth (green) in BEV. 166 Figure E.6 CODa Val Qualitative Results. CHARM3R detects objects more accurately than GUP Net [159], making CHARM3R more robust to camera height changes. All methods are trained on nuScenes dataset and evaluated on CODa dataset. [Key: Cars (pink) and Pedestrian (violet) of CHARM3R. ; all classes (cyan) of GUP Net, and Ground Truth (green) in BEV. 167