OBJECT DETECTION FROM 2D TO 3D

By

Garrick Brazil

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science — Doctor of Philosophy

2021

ABSTRACT

OBJECT DETECTION FROM 2D TO 3D

By Garrick Brazil

Monocular camera-based object detection plays a critical role in widespread applications including robotics, security, self-driving cars, augmented reality and many more. Increased relevancy is often given to the detection and tracking of safety-critical objects like pedestrians, cyclists, and cars, which are often in motion and in close association with people. Compared to other generic objects such as animals, tools, food — safety-critical objects in urban scenes tend to have unique challenges. Firstly, such objects usually have a wide range of detection scales such that they may appear anywhere from 5 − 50+ meters from the camera. Safety-critical objects also tend to have a high variety of textures and shapes, exemplified by the clothing of people and the variability of vehicle models. Moreover, the high density of objects in urban scenes leads to increased levels of self-occlusion compared to general objects in the wild. Off-the-shelf object detectors do not always work effectively due to these traits, and hence special attention is needed for accurate detection. Moreover, even successful detection of safety-critical objects is not inherently practical for applications designed to function in the real 3D world, without integration of expensive depth sensors.

To remedy this, in this thesis we aim to improve the performance of 2D object detection and extend boxes into 3D, while using only monocular camera-based sensors. We first explore how pedestrian detection can be augmented using an efficient simultaneous detection and segmentation technique, while notably requiring no additional data or annotations. We then propose a multi-phased autoregressive network which progressively improves pedestrian detection precision for difficult samples, while critically maintaining an efficient runtime. We additionally propose a single-stage region proposal network for 3D object detection in urban scenes, which is both more efficient and up to 3× more accurate than comparable state-of-the-art methods. We stabilize our 3D object detector using a highly tailored 3D Kalman filter, which both improves localization accuracy and provides useful byproducts such as ego-motion and per-object velocity. Lastly, we utilize differentiable rendering to discover the underlying 3D structure of objects beyond the cuboids used in detection, and without relying on expensive sensors or 3D supervision. For each method, we provide comprehensive experiments to demonstrate effectiveness, impact and runtime efficiency.

Copyright by
GARRICK BRAZIL
2021

This thesis is dedicated to my loving family and in particular to my late aunt Cynthia, who believed in me and inspires me always.

ACKNOWLEDGMENTS

The journey through graduate school and leading to this dissertation would not have been possible without the help of many mentors, friends, and family.

Firstly, I would like to thank my adviser Dr. Xiaoming Liu, whose encouragement and mentorship have led me to learn and accomplish goals I never anticipated possible for myself. I am forever grateful to have been given the chance to learn and grow in the lab with invaluable guidance. His knowledge, experience, and enthusiasm for research are a continued inspiration for me.
As a result, I feel I have grown as a person in critical thinking, writing, presenting, and conducting research as a whole. Besides academics, Dr. Liu's unceasing support and advice during difficult times were not only valued, but were paramount to my well-being and making it this far . . . I am truly thankful for the patience and counsel in those times especially.

I would like to express my gratitude to my committee members Dr. Arun Ross, Dr. Vishnu Boddeti and Dr. Daniel Morris for their invaluable insight, advice, and mentorship — pertaining to the development of this thesis and in countless other meetings throughout my time at MSU.

I would further like to thank Dr. Gerard Pons-Moll and Dr. Bernt Schiele for their mentorship during my visit abroad in Saarbrücken, Germany and afterwards. Their advice and feedback were vital to my growth as a researcher. Moreover, the warm welcome into the MPI group is a time in my life I will always reflect back fondly on.

I am thankful to my mentors Georgia Gkioxari, Justin Johnson and Nikhila Ravi for their help, patience, and substantial counseling during my internship. I am grateful for everything I learned as a researcher and person, while they helped me climb a rather intimidating learning curve. I cherish the kindness I was shown from all conceivable directions from the entire FAIR group. Despite coinciding with a turbulent year — 2020 — the mentorship and research experience resulted in one of the most delightful and intriguing periods of my professional career.

It is my pleasure to be labmates with a wonderful and smart group of individuals. I would like to thank all my labmates (past and present): Dr. Joseph Roth, Dr. Xi Yin, Dr. Amin Jourabloo, Dr. Morteza Safdarnejad, Dr. Yousef Atoum, Dr. Luan Tran, Dr. Feng Liu, Yaojie Liu, Adam Terwilliger, Joel Stehouwer, Bangjie Yin, Hieu Nguyen, Shengjie Zhu, Masa Hu, Andrew Hou, Abhinav Kumar, Ying Tai, Vishal Asnani, Xiao Guo, and Ziyuan Zhang. Thank you for fostering a kind working atmosphere. I appreciate the help, perspectives, impromptu discussions, and collaborations. The memories we have formed together in board game nights, movies, hiking, travels, etc., have made my PhD journey far more enjoyable, enriching, and . . . entertaining too.

Finally, I would like to thank all of my family and friends. In particular, to my loving mother Colleen, my cousin Richard and my uncle Werner — thank you for the support and care through an emotional and rather stressful peak of my life. I couldn't have made it without your help. To my support circle and close friends Andrew, Adam, Kristen, Laura, Kushal and Stella — thank you for the kind support, delightful memories, and helping me keep a healthy brain. To my canine friends, Sylar and Raelynn, thank you for your scientific advice and interesting insights.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction and Contributions
  1.1 Introduction
  1.2 Dissertation Contributions
  1.3 Dissertation Organization

Chapter 2  Illuminating Pedestrians via Simultaneous Detection & Segmentation
  2.1 Introduction
  2.2 Prior work
  2.3 Proposed method
    2.3.1 Region Proposal Network
    2.3.2 Binary Classification Network
    2.3.3 Simultaneous Detection & Segmentation
  2.4 Experiments
    2.4.1 Benchmark Comparison
    2.4.2 Efficiency
    2.4.3 Ablation Study
  2.5 Summary

Chapter 3  Pedestrian Detection with Autoregressive Network Phases
  3.1 Introduction
  3.2 Related Work
  3.3 Autoregressive Detector
    3.3.1 De-Encoder Module
    3.3.2 Autoregressive RPN
    3.3.3 R-CNN Detector
  3.4 Experiments
    3.4.1 Caltech
    3.4.2 KITTI
    3.4.3 Ablations
  3.5 Summary

Chapter 4  M3D-RPN: Monocular 3D Region Proposal Network for Object Detection
  4.1 Introduction
  4.2 Related Work
  4.3 M3D-RPN
    4.3.1 Formulation
    4.3.2 Depth-aware Convolution
    4.3.3 Network Architecture
    4.3.4 Post 3D→2D Optimization
    4.3.5 Implementation Details
  4.4 Experiments
    4.4.1 KITTI
    4.4.2 Ablations
  4.5 Summary

Chapter 5  Kinematic 3D Object Detection in Monocular Video
  5.1 Introduction
  5.2 Related Work
    5.2.1 Monocular 3D Object Detection
      5.2.1.1 Orientation Estimation
      5.2.1.2 Uncertainty Estimation
    5.2.2 Video-based Object Detection
  5.3 Methodology
    5.3.1 Region Proposal Network
      5.3.1.1 Anchors
      5.3.1.2 3D Box Outputs
      5.3.1.3 Orientation Estimation
      5.3.1.4 Self-Balancing Loss
    5.3.2 Ego-motion
    5.3.3 Kinematics
      5.3.3.1 Motion Model
      5.3.3.2 Forecasting
      5.3.3.3 Association
      5.3.3.4 Update
    5.3.4 Implementation Details
  5.4 Experiments
    5.4.1 KITTI Dataset
      5.4.1.1 Metric
    5.4.2 3D Object Detection
    5.4.3 Bird's Eye View
    5.4.4 Ablation Study
      5.4.4.1 Orientation Improvement
      5.4.4.2 Self-balancing Confidence
      5.4.4.3 Temporal Modeling
  5.5 Summary
  5.6 Supplementary Material
    5.6.1 Orientation Ablations
    5.6.2 Kalman Forecasting
    5.6.3 Qualitative Video

Chapter 6  3D Mesh Discovery in the Wild
  6.1 Introduction
  6.2 Related Work
  6.3 Method
    6.3.1 Template Learner
    6.3.2 Cluster Consistency
    6.3.3 Mesh Refinement
    6.3.4 Adaptive Learning
  6.4 Experiments
    6.4.1 Pascal3D+
    6.4.2 CUB-200
    6.4.3 COCO
  6.5 Summary
  6.6 Supplementary Materials
    6.6.1 Experimental Results
    6.6.2 Architecture Details

Chapter 7  Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work Suggestions
    7.2.1 Maps for 3D Monocular Object Detection
    7.2.2 Monocular 3D Object Detection with Mesh Discovery

APPENDIX
BIBLIOGRAPHY
LIST OF TABLES

Table 2.1: Comprehensive comparison of SDS-RCNN with other state-of-the-art methods showing the Caltech miss rate, KITTI mAP score, and runtime performance.

Table 2.2: Ablation experiments evaluated using the Caltech test set. Each ablation experiment reports the miss rate for the RPN, BCN, and fused score with one component disabled at a time.

Table 2.3: Stage-wise sharing experiments which demonstrate the trade-off of runtime efficiency and accuracy, using the Caltech dataset. As sharing is increased from RGB (no sharing) to conv5, both the BCN and Fused miss rate (MR) become less effective.

Table 3.1: Comprehensive comparison of our frameworks and the state-of-the-art on the Caltech and KITTI benchmarks, in both accuracy and runtime (RT). We show the Caltech miss rates at multiple challenging settings, with both the original (O) and new (N) annotations, and at occlusion settings with the original annotations and FPPI range MR_O^-2. Further, we evaluate the KITTI pedestrian class under easy, moderate, and hard settings, with mean Average Precision (mAP) [51]. Boldface/italic indicate the best/second best performance.

Table 3.2: The performance with different parameters and numbers of phases under the Caltech reasonable MR_O^-2 setting. We further detail the efficiency of each setting in terms of multiply-accumulate (MAC) and runtime on an NVIDIA 1080 Ti.

Table 3.3: The effects of labeling policies on the Caltech dataset under the reasonable MR_O^-2 setting.

Table 4.1: Bird's Eye View. Comparison of our method to image-only 3D localization frameworks on the Bird's Eye View task (AP_BEV).

Table 4.2: 3D Detection. Comparison of our method to image-only 3D localization frameworks on the 3D Detection task (AP_3D).

Table 4.3: Multi-class 3D Localization. The performance of our method when applied as a multi-class 3D detection system using a single shared model. We evaluate using the mod setting on KITTI.

Table 4.4: 2D Detection. The performance of our method evaluated on 2D detection using the car class on val1 and test datasets.

Table 4.5: Ablations. We ablate the effects of b for depth-aware convolution and the post-optimization 3D→2D algorithm with respect to performance on the moderate setting of cars and runtime (RT).

Table 4.6: Local and Global α weights. We detail the α weights learned to individually fuse each global and local output. Lower implies higher weight towards the local depth-aware convolution.

Table 5.1: KITTI Test. We compare with SOTA methods on the KITTI test dataset. We report performances using the AP40 [140] metric available on the official leaderboard. * the runtime is reported from the official leaderboard with slight variances in hardware. We indicate methods reported on CPU with †. Bold/italics indicate best/second AP.

Table 5.2: KITTI Validation. We compare with SOTA on the KITTI validation [24] split. Note that methods published prior to [140] are unable to report the AP40 metric.

Table 5.3: Ablation Experiments. We conduct a series of ablation experiments with the validation [24] split of KITTI, using diverse IoU matching criteria of ≥ 0.7/0.5.

Table 5.4: Orientation. We compare our orientation decomposition to bin-based orientation following the high-level concepts within [81, 90, 103, 114], using AP_3D and AP_BEV. We evaluate our performances on the KITTI validation set [24] using IoU ≥ 0.7/0.5.

Table 5.5: Forecasting - 3D Object Detection. We evaluate our forecasting performance on AP_3D within the KITTI validation [24] set and using IoU ≥ 0.7/0.5/0.3.

Table 5.6: Forecasting - Bird's Eye View. We evaluate our forecasting performance on AP_BEV within the KITTI validation [24] set and using IoU ≥ 0.7/0.5/0.3.

Table 6.1: Shape reconstruction performance on Pascal3D+ car. We compare our 3DMD with CSDM [73] a keypoint-based approach, CMR [72] which uses expert templates & 2D keypoints, DRC [156] a multi-view volumetric approach, U-CMR [58] which relies on expert templates and dataset-specific camera priors, and Li et al. [92] which uses off-the-shelf part segmentation. Unlike prior works, we report two voxel sizes for IoU_3D, and a new pose-agnostic metric IoU_3D^Can which factors in pose & shape. † indicates public code results.

Table 6.2: Shape reconstruction results for all Pascal3D+ categories.

Table 6.3: We show the 2D IoU performance for our posed template P and the final refined mesh M across respective validation splits.

Table 6.4: We detail our overall network architecture for estimating a binary mask, clusters map, ∆ΠI transformations made of translations & scales in XYZ and 6d rotations [190], and lastly refined mesh offsets. TConv denotes transposed convolution and BN denotes batch normalization [70]. We denote N as the number of cameras in our multiplexer while |V| denotes the number of vertices in the meshes P.
LIST OF FIGURES

Figure 2.1: Detection results on the Caltech test set (left), feature map visualization from the RPN of conventional Faster R-CNN (middle), and feature map visualization of SDS-RCNN (right). Notice that our feature map substantially illuminates the pedestrian shape while suppressing the background region, both of which make a positive impact to downstream pedestrian detection.

Figure 2.2: Overview of the proposed SDS-RCNN framework. The segmentation layer infuses semantic features into shared conv1-5 layers of each stage, thus illuminating pedestrians and easing downstream pedestrian detection (proposal layers in RPN, and FC1-2 in BCN).

Figure 2.3: Example proposal masks with and without padding. There is no discernible difference between the non-padded masks of well-localized (a) and poorly localized (b) proposals.

Figure 2.4: Feature map visualizations of conv5 and the proposal layer for the baseline RPN (left) and the RPN infused with weak segmentation supervision (right).

Figure 2.5: Visualization of the similarity between pixel-wise segmentation masks (from Cityscapes [31]) and weak box-based masks when downsampled in both the BCN (top) and RPN (bottom).

Figure 2.6: Comparison of SDS-RCNN with the state-of-the-art methods on the Caltech dataset using the reasonable setting.

Figure 2.7: Example error sources which are corrected by infusing semantic segmentation into shared layers. Row 1 shows the test images from Caltech1×. Row 2 shows a visualization of the RPN proposal layer using the baseline network which fails on these examples. Row 3 shows a visualization of the proposal layer from SDS-RCNN, which corrects the errors. Collectively, occlusion and unusual poses of pedestrians (sitting, cyclist, bent over) make up for 75% of the corrections, suggesting that the segmentation supervision naturally informs the shared features on robust pedestrian parts and shape information.

Figure 2.8: Visualization of the diversification between the RPN and BCN classification characteristics. We plot only boxes which the RPN and BCN of SDS-RCNN disagree on using a threshold of 0.5. The BCN drastically reduces false positives of the RPN, while the RPN corrects many missed detections by the BCN.

Figure 3.1: Illustration of our proposed autoregressive framework with sample phase (P1→3) classification prediction maps and box visualizations under the Caltech [38] dataset. Our method iteratively re-scores predictions under incrementally more precise label policies, using a series of de-encoder modules comprised of decoder and encoder pathways. Notice a heavy reduction in false positives (red) as phases progress, while true positives (green) are retained.

Figure 3.2: Predictions of our autoregressive network (a) are directly conditioned on past feature maps as recurrent network (c) and do not share weights between phases as ensemble network (b). Unlike either, our network is further conditioned on past predictions.

Figure 3.3: Overview of our proposed AR-RPN framework (left) and detailed illustration of our de-encoder module (right). The de-encoder module consists of top-down and bottom-up pathways with inner-lateral convolution between pathways to produce diversified features, as well as convolutional re-sampling layers (s denotes convolutional stride) ei and di for memory-efficient feature generation. We further condition predictions on the previous phase predictions through concatenation within fk(·).

Figure 3.4: We visualize the prediction maps P̃k of each phase by taking the maximum of foreground scores across all A anchors at each spatial location, i.e., denoting Pk = {Pk^bg, Pk^fg}, we define P̃k = max_A Pk^fg. We use scaled blue→yellow colors to visualize P̃k, where yellowness indicates high detection confidence. The detections of each phase become increasingly tighter and more adept to non-maximum suppression due to the incremental supervision for each phase (Sec. 3.3.2). We further analyze the prediction disagreements between phases ∆1→3, shown in the right column, where green represents the agreement of the foreground and magenta the regions suppressed.

Figure 3.5: We analyze the mean prediction score (P̃k) of 20 uniformly sampled points along the center lines of the X-direction (left) and Y-direction (right) averaged over all ground-truth pedestrians in the Caltech test dataset, using bilinear interpolation when necessary. We note that successive phase scores form more peaky inclines radiating from the center of the pedestrian.

Figure 4.1: M3D-RPN uses a single monocular 3D region proposal network with global convolution (orange) and local depth-aware convolution (blue) to predict multi-class 3D bounding boxes.

Figure 4.2: Comparison of Deep3DBox [118] and Multi-Fusion [172] with M3D-RPN. Notice that prior works are comprised of multiple internal stages (orange), and external networks (blue), whereas M3D-RPN is a single-shot network trained end-to-end.

Figure 4.3: Overview of M3D-RPN. The proposed method consists of parallel paths for global (orange) and local (blue) feature extraction. The global features use regular spatial-invariant convolution, while the local features denote depth-aware convolution, as detailed right. The depth-aware convolution uses non-shared kernels in the row-space ki for i = 1 . . . b, where b denotes the total number of distinct bins. To leverage both variants of features, we weightedly combine each output parameter from the parallel paths.

Figure 4.4: Anchor Formulation and Visualized 3D Anchors. We depict each parameter within the 2D / 3D anchor formulation (left). We visualize the precomputed 3D priors when 12 anchors are used after projection in the image view (middle) and Bird's Eye View (right). For visualization purposes only, we span anchors in specific x3D locations which best minimize overlap when viewed.

Figure 4.5: Qualitative Examples. We visualize qualitative examples of our method for multi-class 3D object detection. We use yellow to denote cars, green for pedestrians, and orange for cyclists. All illustrated images are from the val1 [24] split and not used for training.

Figure 5.1: Single-frame 3D detection [9] often has unstable estimation through time (a), while our video-based method (b) is more robust by leveraging kinematic motion via a 3D Kalman Filter to fuse forecasted tracks τ′_t and measurements b into τ_t.

Figure 5.2: Overview. Our framework uses an RPN to first estimate 3D boxes (Sec. 5.3.1). We forecast previous frame tracks τ_{t−1} into τ′_t using the estimated Kalman velocity. Self-motion is compensated for by applying a global ego-motion (Sec. 5.3.2) to tracks τ′_t. Lastly, we fuse τ′_t with measurements b using a kinematic 3D Kalman filter (Sec. 5.3.3).

Figure 5.3: Orientation. Our proposed orientation formulation decomposes an object orientation θ̂ (a) into an axis classification θ̂_a (b), a heading classification θ̂_h (c), and an offset θ̂_r (d). Our method disentangles the objectives of axis and heading classification while greatly reducing the offset region (red) by a factor of 1/4.

Figure 5.4: We compare AP_3D with [9] by varying 3D IoU criteria and depth.

Figure 5.5: We show the correlation of 3D IoU to classification c and 3D confidence µ.

Figure 5.6: Qualitative Examples. We depict the image view (left) and BEV (right). We show the velocity vector in green, speed and ego-motion in miles per hour (MPH) on top of detection boxes and at the top-left corner, and tracks as dots in BEV.

Figure 6.1: Our approach 3D Mesh Discovery (3DMD) takes a collection of images as input and jointly learns a shared 3D mesh template and refined meshes over a diverse set of object categories.

Figure 6.2: Framework overview of 3DMD. Our method learns a shared 3D mesh template T starting with a unit sphere and canonical to a reference sample. Then a camera multiplexer determines the best pose relative to network predictions of object partitions and mask. Finally, the best pose is used to refine the template mesh to match the image inputs (left) producing instance-level mesh shapes (right).

Figure 6.3: Our camera multiplexer computes an alignment score a (Eq. 6.1) which compares network outputs (top) to renderings from uniform cameras (bottom). The cluster maps are compared with rendered clusters using L_KL while the 2D mask is used for IoU_2D. Multiplexer output is computed as the argmin camera i∗ (bolded).

Figure 6.4: Qualitative 3D mesh reconstructions on the COCO dataset for a diverse set of rigid and non-rigid object categories. We visualize input image (row 1), posed template (row 2), and the refined mesh (rows 3−5) in view coordinates from two additional novel viewpoints.

Figure 6.5: Qualitative 3D mesh reconstructions on the CUB-200 bird (left) and numerous PASCAL3D+ categories (right). We visualize input image (row 1), posed template (row 2), and the refined mesh (rows 3−5) in view coordinates from two additional novel viewpoints.

Figure 6.6: We show our learned templates (bottom) relative to the canonical pose of their respective reference samples (top).

Figure 6.7: Qualitative 3D mesh reconstructions on additional Pascal3D+ cars. We visualize input image (row 1), posed template (row 2), and the refined mesh (rows 3−5) in view coordinates from two additional novel viewpoints.

Figure 6.8: Qualitative 3D mesh reconstructions on additional CUB-200 birds. We visualize input image (row 1), posed template (row 2), and the refined mesh (rows 3−5) in view coordinates from two additional novel viewpoints.

Figure 6.9: Qualitative 3D mesh reconstructions on additional COCO categories (right). We visualize input image (row 1), posed template (row 2), and the refined mesh (rows 3−5) in view coordinates from two additional novel viewpoints.

Figure 6.10: Qualitative 3D mesh reconstructions on additional Pascal3D+ categories. We visualize input image (row 1), posed template (row 2), and the refined mesh (rows 3−5) in view coordinates from two additional novel viewpoints.

Figure 6.11: We show examples of our camera multiplexer renderings (right) as compared to the network predicted cluster map (left middle) and predicted mask (left bottom) in order to select the best pose i∗ highlighted in blue for each example category.
LIST OF ALGORITHMS

Algorithm 1: Post 3D→2D Algorithm. The algorithm takes as input the 2D / 3D box b′_2D, [x, y, z]′_P, [w, h, l, θ]′_3D, step size σ, termination β, and decay γ parameters, then iteratively tunes θ via an L1 corner consistency loss.

Chapter 1

Introduction and Contributions

1.1 Introduction

Object detection is an important component which many applications depend on. Examples of such applications include robotics [42], security systems [85], and urban autonomous driving [51]. The task of object detection involves both the classification and localization of all objects in a visible scene, whether they be in a 2D image space in pixels or a 3D camera space in meters. Compared to other domains, the domain of urban autonomous driving scenarios tends to deal primarily with safety-critical classes including pedestrians, cyclists, and cars. The ability to reliably detect such objects is important to enabling the future of autonomous vehicles (AV). As such, the power and robustness of object detection has made significant progress in recent years [13, 98, 99, 130].

However, many unique challenges remain unsolved for urban objects compared to that of generic objects. A notable difference is in the range of scales of the relevant objects. It is common for objects, even among the same category, to move in a dynamic range between 5 − 50+ meters away from the capturing camera [51]. Therefore, the appearance of said objects has high intraclass variation and at far ranges becomes highly blurred and non-discriminative.
The shape and textures further tend to have very high variability compared to generic objects. This effect is exemplified by the variation of clothing on pedestrians and in the diverse shapes of vehicle models. Additionally, urban self-driving scenarios commonly involve significant dynamic motion which makes modeling in video more troublesome. For instance, such motion is derived from both the foreground objects and the capturing camera, typically mounted on the roof of a moving vehicle.

Prevalent methods for monocular object detection have focused primarily on object detection of 2D bounding boxes [76, 166, 184, 188]. We emphasize that although 2D bounding boxes may be beneficial for other applications (security, face, and generic object detection), within the AV domain it is critical to localize objects in 3D space for path planning and collision avoidance.

Many 3D sensors are available to aid in urban object detection. Notable sensor configurations for object detection include cameras, LiDAR, and RADAR. Cameras typically have the lowest cost and the highest availability compared to the other aforementioned sensors. They uniquely capture both the shape and texture information of a scene, the latter of which is not present in other sensors [16]. However, a critical drawback of camera systems is their inherent ambiguity in depth, which LiDAR and RADAR comparably excel at. Multi-camera systems partially address such limitations by using a known 3D geometry and calibration to approximate depth more accurately. We opt to study the problem of object detection using a single-view monocular camera due to its low cost and accessibility compared to all other configurations.

In this dissertation, we study and present methods to solve both monocular 2D and 3D object detection with emphasis on safety-critical objects in urban autonomous driving scenarios. Starting from the seminal work on Faster R-CNN [132], we analyze the effects of incorporating additional critical supervision signals on object detection accuracy. Specifically, we approach the problem from the perspective of incorporating weak shape supervision, intermediate network supervision, 3D bounding box supervision, video cues, and 3D mesh discovery. Despite studying the effects in isolation, each novel technique is compatible with a high-level base framework and presented as a modular extension of a theoretical shared system.

We begin with a novel method to utilize shape supervision via weak segmentation applied to the task of pedestrian detection. We identify the similarities of segmentation and detection and demonstrate how their joint supervision using weak box-based segmentation can greatly improve the localization of pedestrians. We find that the robustness of segmentation shape helps correct the detection of unusually shaped pedestrians (sitting, bent over, or occluded), which dominate the corrections. We emphasize that by using weak box-based segmentation masks no additional data is required. Further, since the segmentation is used only to improve training, there is no added overhead to the method at inference. Overall, our method improves the state-of-the-art on competitive datasets [38, 51] and executes ≈ 2× faster than competitive methods of the time. We then expand our framework to include intermediate network supervision designed to iteratively improve the precision of object detection.
Specifically, we build autoregressive phases into a region proposal network which is optimized with intermediate supervision of increasingly difficult labels. We emphasize that the phases are built within the network such that more advanced features can be learned via an efficient lightweight decoder-encoder structure. A competitive network runtime efficiency is maintained since the phases are designed inside of the network rather than adding additional separate networks as an ensemble. We provide detailed analysis to demonstrate how each phase becomes more precise and produces higher peaks at the centroid of unique pedestrians. Collectively, our autoregressive method strikes a favorable balance between accuracy and runtime, while improving the state-of-the-art for pedestrian detection even in notably difficult high-occlusion evaluation settings.

We next incorporate 3D supervision into the general 2D region proposal network. We note that perceiving objects in 3D is inherently more valuable for real-world applications and particularly so in the domain of AV. Compared to our technique, prior works on monocular 3D object detection utilize multiple separate networks, stages, datasets, and are typically not trained end-to-end. In contrast, we propose a single 3D region proposal network which takes a monocular image as input and is trained fully end-to-end. We further utilize the relation of 2D and 3D geometry in fixed camera scenarios by using a novel depth-aware convolutional operator. Our method significantly improves both Bird's Eye View and 3D object detection accuracy (≈ 3×) despite using only a single region proposal network. Moreover, our proposed approach is more future-proof since its core concepts can be naturally integrated into any monocular 3D object detection system which uses a 2D region proposal network, resulting in a 3D proposal which already performs well.

We note that although motion is important for human vision in detection, tracking and perceiving depth, it is not thoroughly utilized in monocular 3D object detection systems. We next extend our system to leverage monocular video streams using a novel 3D Kalman filter. Collectively, the integration of video helps stabilize noisy object localizations in the 3D world. We observe our kinematic method results in an appreciable improvement to localization accuracy and produces useful byproducts such as ego-motion and per-object velocity.

We have discussed detecting and localizing objects from 2D to 3D, but only coarsely in the representation of 3D cuboids. We next explore how to perceive such objects more finely as 3D meshes. We emphasize that 3D mesh ground truths are expensive to annotate and are generally not available for in the wild scenarios and datasets. We therefore propose a general end-to-end approach to discover 3D mesh shapes from only a collection of unassociated images and 2D mask supervision — a common and cost-effective label. We accomplish this goal by jointly learning a template mesh, object pose, and refined shapes from only 2D labels. We scale our method to a diverse set of in the wild object categories beyond cars due to the freedom from 3D supervision.

In summary, we provide novel solutions to localize and understand objects aided by unique techniques and signals including weak segmentation supervision, intermediate supervision, 3D supervision, temporal cues, and lastly mesh signals.
Our methods pay special attention to safety-critical objects including pedestrians, cyclists, and cars — commonly found in urban landscapes. Notably, each method is tailored to maintain or improve the runtime efficiency, while appreciably improving accuracy relative to comparable state-of-the-art methods. We emphasize that our methods help improve object detection from 2D to 3D, and thereby drastically increase the effective utility in practice — for as long as we live in a 3D world.

1.2 Dissertation Contributions

This dissertation studies the problem of object detection ranging from 2D detection of pedestrians to multi-class 3D object detection with additional novelties in both video and 3D mesh discovery. We summarize the collective contributions of this dissertation as follows:

• We propose a multi-task network which enhances monocular pedestrian detection using simultaneous detection and segmentation. We avoid requiring any additional data or inference time by only re-using the bounding box annotations as weak segmentation labels. We provide insights on the unique characteristics of weak segmentation compared to typical object detection paradigms, and its illuminating effects on convolutional feature maps.

• We simultaneously address the unique scale challenges suffered in prior pedestrian detection work [183] when adapting Faster R-CNN [132] by a novel fusion of predictions from a region proposal network and a binary classifier trained with a stricter supervision signal.

• We further improve our framework by developing a multi-phased autoregressive pedestrian detection system built into a region proposal network. Notably, each phase of our network is trained on progressively more strict supervision signals which iteratively improve precision.

• We propose a lightweight decoder-encoder module within the autoregressive pedestrian detection system to enable both feature map refinement and message passing between phases while using convolutional re-sampling for efficient feature map pathways.

• We formulate a novel 3D object detector using a monocular 3D region proposal network (M3D-RPN) which operates in a shared 2D and 3D object detection space. Prior statistics are precomputed as anchors used for strong initialization, which help ease the regression of 3D parameters. Our method utilizes a single network and is trained end-to-end, compared to prior art [23, 24, 118, 172] which require external networks, data, and representations.

• We propose a novel operation for depth-aware convolution in order to improve estimation of 3D parameters. In effect, the network is able to learn high-level and spatially aware features, which are able to vary for objects which are close or far from the camera.

• We propose a video-based approach for monocular 3D object detection which leverages realistic motion constraints using an integrated ego-motion estimator and a 3D Kalman filter.

• Our video-based system uses only a single model to more comprehensively understand the 3D dynamics of a scene, resulting in more accurate 3D localization and useful byproducts including ego-motion and per-object velocity.

• We form a novel method which goes beyond simple 3D cuboids to discover 3D mesh representations from unassociated images using only 2D mask supervision, which we comprehensively evaluate on both cars and 20+ additional in the wild categories.
• Our 3D mesh discovery framework is able to jointly learn a shared template shape, object pose and refined instance-level meshes in an end-to-end manner, powered by a novel template learner, camera multiplexer and cluster consistency constraint.

1.3 Dissertation Organization

We organize the remaining chapters of the dissertation as follows. Chapter 2 proposes the novel framework for simultaneous detection and segmentation as an augment to pedestrian detection. Chapter 3 develops a framework for autoregressive multi-phased pedestrian detection built into a region proposal network. Chapter 4 explores how 3D object detection can perform accurately with only a single region proposal network as well as the proposed depth-aware convolution. Chapter 5 extends our 3D object detection system to work with video using a tailored 3D Kalman filter for both better stabilization and a more comprehensive understanding of scene dynamics. Chapter 6 explores our method for discovering objects in 3D mesh representations rather than coarse cuboids, while only utilizing 2D signals for supervision. Chapter 7 concludes the dissertation.

Chapter 2

Illuminating Pedestrians via Simultaneous Detection & Segmentation

Pedestrian detection is a critical problem in computer vision with significant impact on safety in urban autonomous driving. In this work, we explore how semantic segmentation can be used to boost pedestrian detection accuracy while having little to no impact on network efficiency. We propose a segmentation infusion network to enable joint supervision on semantic segmentation and pedestrian detection. When placed properly, the additional supervision helps guide features in shared layers to become more sophisticated and helpful for the downstream pedestrian detector. Using this approach, we find weakly annotated boxes to be sufficient for considerable performance gains. We provide an in-depth analysis to demonstrate how shared layers are shaped by the segmentation supervision. In doing so, we show that the resulting feature maps become more semantically meaningful and robust to shape and occlusion. Overall, our simultaneous detection and segmentation framework achieves a considerable gain over the state-of-the-art on the Caltech pedestrian dataset, competitive performance on KITTI, and executes 2× faster than competitive methods.

Figure 2.1: Detection results on the Caltech test set (left), feature map visualization from the RPN of conventional Faster R-CNN (middle), and feature map visualization of SDS-RCNN (right). Notice that our feature map substantially illuminates the pedestrian shape while suppressing the background region, both of which make a positive impact to downstream pedestrian detection.

2.1 Introduction

Pedestrian detection from an image is a core capability of computer vision, due to its applications such as autonomous driving and robotics [51]. It is also a long-standing vision problem because of its distinct challenges including low resolution, occlusion, cloth variations, etc. [184]. There are two central approaches for detecting pedestrians: object detection [13, 183] and semantic segmentation [21, 31]. The two approaches are highly related by nature but have their own strengths and weaknesses. For instance, object detection is designed to perform well at localizing distinct objects but typically provides little information on object boundaries.
In contrast, semantic segmentation does well at distinguishing pixel-wise boundaries among classes but struggles to separate objects within the same class.

Intuitively, we expect that knowledge from either task will make the other substantially easier. This has been demonstrated for generic object detection, since having segmentation masks of objects would clearly facilitate detection. For example, Fidler et al. [47] utilize predicted segmentation masks to boost object detection performance via a deformable part-based model. Hariharan et al. [62] show how segmentation masks generated from MCG [2] can be used to mask background regions and thus simplify detection. Dai et al. [32] utilize the two tasks in a 3-stage cascaded network consisting of box regression, foreground segmentation, and classification. Their architecture allows each task to share features and feed into one another.

In contrast, the pairing of these two tasks is rarely studied in pedestrian detection, despite the recent advances [13, 91, 183]. This is due in part to the lack of pixel-wise annotations available in classic pedestrian datasets such as Caltech [38] and KITTI [51], unlike the detailed segmentation labels in the COCO [100] dataset for generic object detection. With the release of Cityscapes [31], a high quality dataset for urban semantic segmentation, it is expected that substantial research efforts will be on how to leverage semantic segmentation to boost the performance of pedestrian detection, which is the core problem to be studied in this paper.

Given this objective, we start by presenting a competitive two-stage baseline framework of pedestrian detection deriving from RPN+BF [183] and Faster R-CNN [132]. We contribute a number of key changes to enable the second-stage classifier to specialize in stricter supervision and additionally fuse the refined scores with the first stage RPN. These changes alone lead to state-of-the-art performance on the Caltech benchmark. We further present a simple, but surprisingly powerful, scheme to utilize multi-task learning on pedestrian detection and semantic segmentation. Specifically, we infuse the semantic segmentation mask into shared layers using a segmentation infusion layer in both stages of our network. We term our approach "simultaneous detection and segmentation R-CNN (SDS-RCNN)". We provide an in-depth analysis on the effects of joint training by examining the shared feature maps, e.g., Fig. 2.1. Through infusion, the shared feature maps begin to illuminate pedestrian regions. Further, since we infuse the semantic features during training only, the network efficiency at inference is unaffected. We demonstrate the effectiveness of SDS-RCNN by reporting considerable improvement (23% relative reduction of the error) over the published state-of-the-art on Caltech [38], competitive performance on KITTI [51], and a runtime roughly 2× faster than competitive methods. In summary, our contributions are as follows:

• An improved baseline derived from [132, 183] by enforcing stricter supervision in the second-stage classification network, and further fusing scores between stages.

• A multi-task infusion framework for joint supervision on pedestrian detection and semantic segmentation, with the goal of illuminating pedestrians in shared feature maps and easing downstream classification.

• We achieve the new state-of-the-art performance on the Caltech pedestrian dataset, competitive performance on KITTI, and obtain a 2× faster runtime.
2.2 Prior work

Object Detection: Deep convolutional neural networks have had extensive success in the domain of object detection. Notably, derivations of Fast [54] and Faster R-CNN [132] are widely used in both generic object detection [13, 53, 175] and pedestrian detection [91, 152, 183]. Faster R-CNN consists of two key components: a region proposal network (RPN) and a classification sub-network. The RPN works as a sliding window detector by determining the objectness across a set of predefined anchors (box shapes defined by aspect ratio and scale) at each spatial location of an image. After object proposals are generated, the second stage classifier determines the precise class each object belongs to. Faster R-CNN has been shown to reach state-of-the-art performance on the PASCAL VOC 2012 [44] dataset for generic object detection and continues to serve as a frequent baseline framework for a variety of related problems [53, 62, 65, 184].

Pedestrian Detection: Pedestrian detection is one of the most extensively studied problems in object detection due to its real-world significance. The most notable challenges are caused by small scale, pose variations, cyclists, and occlusion [184]. For instance, in the Caltech pedestrian dataset [38] 70% of pedestrians are occluded in at least one frame. The top performing approaches on the Caltech pedestrian benchmark are variations of Fast or Faster R-CNN. SA-FastRCNN [54] and MS-CNN [13] reach competitive performance by directly addressing the scale problem using specialized multi-scale networks integrated into Fast and Faster R-CNN respectively. Furthermore, RPN+BF [183] shows that the RPN of Faster R-CNN performs well as a standalone detector while the downstream classifier degrades performance due to collapsing bins of small-scale pedestrians. By using higher resolution features and replacing the downstream classifier with a boosted forest, RPN+BF is able to alleviate the problem and achieve 9.58% miss rate on the Caltech reasonable [39] setting. F-DNN [41] also uses a derivation of the Faster R-CNN framework. Rather than using a single downstream classifier, F-DNN fuses multiple parallel classifiers including ResNet [65] and GoogLeNet [150] using soft-reject and further incorporates multiple training datasets to achieve 8.65% miss rate on the Caltech reasonable setting. The majority of top performing approaches utilize some form of an RPN, whose scores are typically discarded after selecting the proposals. In contrast, our work shows that fusing the score with the second stage network can lead to substantial performance improvement.

Simultaneous Detection & Segmentation: There are two lines of research on simultaneous detection and segmentation. The first aims to improve the performance of both tasks, and formulates a problem commonly known as instance-aware semantic segmentation [31]. Hariharan et al. [62] predict segmentation masks using MCG [2] then get object instances using "slow" R-CNN [55] on masked image proposals. Dai et al. [32] achieve high performance on instance segmentation using an extension of Faster R-CNN in a 3-stage cascaded network including mask supervision. The second aims to explicitly improve object detection by using segmentation as a strong cue.

Figure 2.2: Overview of the proposed SDS-RCNN framework. The segmentation layer infuses semantic features into shared conv1-5 layers of each stage, thus illuminating pedestrians and easing downstream pedestrian detection (proposal layers in RPN, and FC1-2 in BCN).
Early work on the topic by Fidler et al. [47] demonstrates how semantic segmentation masks can be used to extract strong features for improved object detection via a deformable part-based model. Du et al. [41] use segmentation as a strong cue in their F-DNN+SS framework. Given the segmentation mask predicted by a third parallel network, their ensemble network uses the mask in a post-processing manner to suppress background proposals, and pushes performance on the Caltech pedestrian dataset from 8.65% to 8.18% miss rate. However, the segmentation network degrades the efficiency of F-DNN+SS from 0.30 to 2.48 seconds per image, and requires multiple GPUs at inference. In contrast, our novel framework infuses the semantic segmentation masks into shared feature maps and thus does not require a separate segmentation network, which outperforms [41] in both accuracy and network efficiency. Furthermore, our use of weak box-based segmentation masks addresses the issue of lacking pixel-wise segmentation annotations in [38, 51].

2.3 Proposed method

Our proposed architecture consists of two key stages: a region proposal network (RPN) to generate candidate bounding boxes and corresponding scores, and a binary classification network (BCN) to refine their scores. In both stages, we propose a semantic segmentation infusion layer with the objective of making downstream classification a substantially easier task. The infusion layer aims to encode semantic masks into shared feature maps which naturally serve as strong cues for pedestrian classification. Due to the impressive performance of the RPN as a standalone detector, we elect to fuse the scores between stages rather than discarding them as done in prior work [13, 41, 154, 183]. An overview of the SDS-RCNN framework is depicted in Fig. 2.2.

2.3.1 Region Proposal Network

The RPN aims to propose a set of bounding boxes with associated confidence scores around potential pedestrians. We adopt the RPN of Faster R-CNN [132] following the settings in [183]. We tailor the RPN for pedestrian detection by configuring Na = 9 anchors with a fixed aspect ratio of 0.41 and spanning a scale range from 25 – 350 pixels, corresponding to the pedestrian statistics of Caltech [38]. Since each anchor box acts as a sliding window detector across a pooled image space, there are Np = Na × W/fs × H/fs total pedestrian proposals, where fs corresponds to the feature stride of the network. Hence, each proposal box i corresponds to an anchor and a spatial location of image I.

The RPN architecture uses conv1-5 from VGG-16 [141] as the backbone. Following [132], we attach a proposal feature extraction layer to the end of the network with two sibling output layers for box classification (cls) and bounding box regression (bbox). We further add a segmentation infusion layer to conv5 as detailed in Sec. 2.3.3. For every proposal box i, the RPN aims to minimize the following joint loss function with three terms:

L = λc Σ_i Lc(ci, ĉi) + λr Σ_i Lr(ti, t̂i) + λs Ls.    (2.1)

The first term is the classification loss Lc, which is a softmax logistic loss over two classes (pedestrian vs. background). We use the standard labeling policy which considers a proposal box at location i to be pedestrian (ci = 1) if it has at least 0.5 Intersection over Union (IoU) with a ground truth pedestrian box, and otherwise background (ci = 0).
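To make the anchor grid and this labeling policy concrete, the following is a minimal NumPy sketch that tiles Na = 9 anchors of aspect ratio 0.41 over the stride-fs feature grid and assigns ci by the 0.5 IoU rule. It is only an illustration of the setup described above, not the original Caffe implementation; in particular, the geometric spacing of the nine anchor heights between 25 and 350 pixels and all function names are our assumptions.

```python
import numpy as np

def generate_anchors(im_h, im_w, fs=16, num_scales=9, aspect=0.41,
                     min_h=25, max_h=350):
    """Tile Na anchors (fixed aspect ratio w/h = 0.41) over every feature-map cell.
    Geometric spacing of the 9 heights is an assumption; only the 25-350 px range is stated."""
    heights = np.geomspace(min_h, max_h, num_scales)
    widths = aspect * heights
    ys, xs = np.meshgrid(np.arange(0, im_h, fs), np.arange(0, im_w, fs), indexing="ij")
    centers = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    anchors = []
    for w, h in zip(widths, heights):
        offsets = np.array([-w / 2, -h / 2, w / 2, h / 2])
        anchors.append(np.concatenate([centers, centers], axis=1) + offsets)
    return np.concatenate(anchors, axis=0)  # (Na * H/fs * W/fs, 4) as x1, y1, x2, y2

def iou(box, boxes):
    """IoU of one box against an array of boxes, all in x1, y1, x2, y2 format."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def label_proposals(anchors, gt_boxes, thresh=0.5):
    """Standard RPN labeling policy: c_i = 1 if IoU with any ground truth >= 0.5."""
    if len(gt_boxes) == 0:
        return np.zeros(len(anchors), dtype=int)
    ious = np.stack([iou(a, np.asarray(gt_boxes, dtype=float)) for a in anchors])
    return (ious.max(axis=1) >= thresh).astype(int)

# Toy usage on a 480x640 image: 9 anchors x 30 x 40 cells = 10800 proposals.
anchors = generate_anchors(480, 640)
labels = label_proposals(anchors, [(300.0, 200.0, 340.0, 300.0)])
print(anchors.shape, labels.sum())
```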
The second term seeks to improve localization via bounding box regression, which learns a transformation for each proposal box to the nearest pedestrian ground truth. Specifically, we use Lr(ti, t̂i) = R(ti − t̂i) where R is the robust (smooth L1) loss defined in [54]. The bounding box transformation is defined as a 4-tuple consisting of shifts in x, y and scales in w, h denoted as t = [tx, ty, tw, th]. The third term Ls is the segmentation loss presented in Sec. 2.3.3.

In order to reduce multiple detections of the same pedestrian, we apply non-maximum suppression (NMS) greedily to all pairs of proposals after the transformations have been applied. We use an IoU threshold of 0.5 for NMS.

We train the RPN in the Caffe [71] framework using SGD with a learning rate of 0.001, momentum of 0.9, and a mini-batch of 1 full image. During training, we randomly sample 120 proposals per image at a ratio of 1:5 for pedestrian and background proposals to help alleviate the class imbalance. All other proposals are treated as ignore. We initialize conv1-5 from a VGG-16 model pretrained on ImageNet [36], and all remaining layers randomly. Our network has four max-pooling layers (within conv1-5), hence fs = 16. In our experiments, we regularize our multi-task loss terms by setting λc = λs = 1, λr = 5.

Figure 2.3: Example proposal masks with and without padding. There is no discernible difference between the non-padded masks of well-localized (a) and poorly localized (b) proposals.

2.3.2 Binary Classification Network

The BCN aims to perform pedestrian classification over the proposals of the RPN. For generic object detection, the BCN usually uses the downstream classifier of Faster R-CNN by sharing conv1-5 with the RPN, but this was shown by [183] to degrade pedestrian detection accuracy. Thus, we choose to construct a separate network using VGG-16. The primary advantage of a separate network is to allow the BCN freedom to specialize in the types of “harder” samples left over from the RPN. While sharing computation is highly desirable for the sake of efficiency, the shared networks are more predisposed to predict similar scores, which are redundant when fused. Therefore, rather than cropping and warping a shared feature space, our BCN directly crops the top Nb proposals from the RGB input image.

For each proposal image i, the BCN aims to minimize the following joint loss function with two terms:

L = λc Σ_i wi Lc(ci, ĉi) + λs Ls.    (2.2)

Similar to the RPN, the first term is the classification loss Lc, where ci is the class label for the ith proposal. A cost-sensitive weight wi is used to give precedence to detecting large pedestrians over small pedestrians. There are two key motivations for this weighting policy. First, large pedestrians typically imply close proximity and are thus significantly more important to detect. Secondly, we presume that features of large pedestrians may be more helpful for detecting small pedestrians. We define the weighting function given the ith proposal with height hi and a pre-computed mean height h̄ as wi = 1 + hi/h̄. The second term is the segmentation loss presented in Sec. 2.3.3.

We make a number of significant contributions to the BCN. First, we change the labeling policy to encourage higher precision and further diversification from the RPN.
We enforce a stricter labeling policy, requiring a proposal to have IoU > 0.7 with a ground truth pedestrian box to be considered pedestrian (ci = 1), and otherwise background (ci = 0). This encourages the network to suppress poorly localized proposals and reduces false positives in the form of double detections. Secondly, we choose to fuse the scores of the BCN with the confidence scores of the RPN at test time. Since our design explicitly encourages the two stages to diversify, we expect the classification characteristics of each network to be complementary when fused. We fuse the scores at the feature level prior to softmax. Formally, the fused score for the ith proposal, given the predicted 2-class scores from the RPN ri = {r̂i^0, r̂i^1} and the BCN bi = {b̂i^0, b̂i^1}, is computed via a fused softmax function:

ĉi = exp(r̂i^1 + b̂i^1) / ( exp(r̂i^0 + b̂i^0) + exp(r̂i^1 + b̂i^1) ),    (2.3)

where the superscript denotes the class corresponding channel (0 = background, 1 = pedestrian). In effect, the fused scores become more confident when the stages agree, and otherwise lean towards the dominant score. Thus, it is ideal for each network to diversify in its classification capabilities such that at least one network may be very confident for each proposal.

For a modest improvement to efficiency, we remove the pool5 layer from the VGG-16 architecture then adjust the input size to 112 × 112 to keep the fully-connected layers intact. This is a fair trade-off since most pedestrian heights fall in the range of 30 − 80 pixels [38]. Hence, small pedestrian proposals are upscaled by a factor of ∼2×, allowing space for finer discrimination. We further propose to pad each proposal by 20% on all sides to provide background context and avoid partial detections, as shown in Fig. 2.3.

We train the BCN in the Caffe [71] framework using the same settings as the RPN. We initialize conv1-5 from the trained RPN model, and all remaining layers randomly. During training, we set Nb = 20. During inference, we set Nb = 15 for a moderate improvement to efficiency. We regularize the multi-task loss by setting λc = λs = 1.

2.3.3 Simultaneous Detection & Segmentation

We approach simultaneous detection and segmentation with the motivation to make our downstream pedestrian detection task easier. We propose a segmentation infusion layer trained on weakly annotated pedestrian boxes which illuminates pedestrians in the shared feature maps preceding the classification layers. We integrate the infusion layer into both stages of our SDS-RCNN framework.

Segmentation Infusion Layer: The segmentation infusion layer aims to output two masks indicating the likelihood of residing on pedestrian or background segments. We choose to use only a single layer and a 1 × 1 kernel so the impact on the shared layers will be as high as possible. This forces the network to directly infuse semantic features into shared feature maps, as visualized in Fig. 2.4. A deeper network could achieve higher segmentation accuracy but would infer less from the shared layers and diminish the overall impact on the downstream pedestrian classification. Further, we choose to attach the infusion layer to conv5 since it is the deepest layer which precedes both the proposal layers of the RPN and the fully connected layers of the BCN.

Figure 2.4: Feature map visualizations of conv5 and the proposal layer for the baseline RPN (left) and the RPN infused with weak segmentation supervision (right).
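As a concrete illustration, the infusion layer amounts to a single 1 × 1 convolution producing two-class logits on top of the shared conv5 feature map. The PyTorch-style sketch below is illustrative only (the thesis implementation is in Caffe), and the module name is hypothetical.

    import torch.nn as nn

    class SegmentationInfusion(nn.Module):
        # A single 1x1 convolution attached to the shared conv5 feature map,
        # predicting two-class (background vs. pedestrian) segmentation logits.
        def __init__(self, in_channels=512, num_classes=2):
            super().__init__()
            self.seg = nn.Conv2d(in_channels, num_classes, kernel_size=1)

        def forward(self, conv5):
            # conv5: (B, 512, H/16, W/16) shared feature map. The returned logits
            # are supervised against the weak box-based masks, which forces
            # semantic evidence back into the shared conv1-5 layers.
            return self.seg(conv5)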
Formally, the final loss term Ls of both the RPN and BCN is a softmax logistic loss over two classes (pedestrian vs. background), applied to each location i, where wi is the cost-sensitive weight introduced in Sec. 2.3.2:

λs Σ_i wi Ls(Si, Ŝi).    (2.4)

Figure 2.5: Visualization of the similarity between pixel-wise segmentation masks (from Cityscapes [31]) and weak box-based masks when downsampled in both the BCN (top) and RPN (bottom).

We choose to leverage the abundance of bounding box annotations available in popular pedestrian datasets (e.g., Caltech [38], KITTI [51]) by forming weak segmentation ground truth masks. Each mask S ∈ R^(W×H) is generated by labeling all pedestrian box regions as Si = 1, and otherwise background Si = 0. In most cases, box-based annotations would be considered too noisy for semantic segmentation. However, since we place the infusion layer at conv5, which has been pooled significantly, the differences between box-based annotations and pixel-wise annotations diminish rapidly w.r.t. the pedestrian height (Fig. 2.5). For example, in the Caltech dataset 68% of pedestrians are less than 80 pixels tall, which corresponds to 3 × 5 pixels at conv5 of the RPN. Further, each of the BCN proposals is pooled to 7 × 7 at conv5. Hence, pixel-wise annotations may not offer a significant advantage over boxes at the high levels of pooling our networks undertake.

Benefits Over Detection: A significant advantage of segmentation supervision over detection is its simplicity. For detection, sensitive hyperparameters must be set, such as anchor selection and the IoU thresholds used for labeling and NMS. If the chosen anchor scales are too sparse or the IoU threshold is too high, certain ground truths that fall near the midpoint of two anchors could be missed or receive low supervision. In contrast, semantic segmentation treats all ground truths indiscriminate of how well the pedestrian's shape or occlusion-level matches the chosen set of anchors. In theory, the incorporation of semantic segmentation infusion may help reduce the sensitivity of conv1-5 to such hyperparameters. Furthermore, the segmentation supervision is especially beneficial for the second stage BCN, which on its own would only know if a pedestrian is present. The infusion of semantic segmentation features informs the BCN where the pedestrian is, which is critical for differentiating poorly vs. well-localized proposals.

2.4 Experiments

We evaluate our proposed SDS-RCNN on popular datasets including Caltech [38] and KITTI [51]. We perform comprehensive analysis and ablation experiments using the Caltech dataset. We refer to our collective method as SDS-RCNN and our region proposal network as SDS-RPN. We show the performance curves compared to the state-of-the-art pedestrian detectors on Caltech in Fig. 2.6. We further report a comprehensive overview across datasets in Table 2.1.

Figure 2.6: Comparison of SDS-RCNN with the state-of-the-art methods on the Caltech dataset using the reasonable setting.

2.4.1 Benchmark Comparison

Caltech: The Caltech dataset [38] contains ∼350K pedestrian bounding box annotations across 10 hours of urban driving. The log average miss rate sampled against a false positive per image (FPPI) range of [10⁻², 10⁰] is used for measuring performance.
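For reference, this metric can be sketched as below, assuming the standard protocol of sampling the miss rate at nine FPPI points evenly spaced in log space over the given range and averaging in the log domain; this is an assumption about the common benchmark implementation, not code from the thesis.

    import numpy as np

    def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1e0, samples=9):
        # fppi, miss_rate: arrays tracing the detector's miss rate vs. FPPI curve,
        # with fppi sorted in increasing order. The curve is sampled at `samples`
        # FPPI points evenly spaced in log space over [lo, hi], and the geometric
        # mean of the sampled miss rates is returned.
        ref = np.logspace(np.log10(lo), np.log10(hi), samples)
        sampled = np.interp(np.log10(ref), np.log10(fppi), miss_rate)
        sampled = np.clip(sampled, 1e-10, 1.0)  # guard the log against zeros
        return np.exp(np.mean(np.log(sampled)))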
A minimum IoU threshold of 0.5 is required for a detected box to match with a ground truth box. For training, we sample from the standard training set according to Caltech10× [185], which contains 42,782 training images. We evaluate on the standard 4,024 images in the Caltech 1× test set using the reasonable [39] setting, which only considers pedestrians with at least 50 pixels in height and with less than 35% occlusion.

SDS-RCNN achieves an impressive 7.36% miss rate. The performance gain is a relative improvement of 23% compared to the best published method RPN+BF (9.58%). In Fig. 2.6, we show the ROC plot of miss rate against FPPI for the current top performing methods reported on Caltech. We further report our performance using just SDS-RPN (without cost-sensitive weighting, Sec. 2.4.3) on Caltech, as shown in Table 2.1. The RPN performs quite well by itself, reaching 9.63% miss rate while processing images at roughly 3× the speed of competitive methods. Our RPN is already on par with other top detectors, which themselves contain a RPN. Moreover, the network significantly outperforms other standalone RPNs such as in [183] (14.9%). Hence, the RPN can be leveraged by other researchers to build better detectors in the future.

KITTI: The KITTI dataset [51] contains ∼80K annotations of cars, pedestrians, and cyclists. Since our focus is on pedestrian detection, we continue to use only the pedestrian class for training and evaluation. The mean Average Precision (mAP) [43] sampled across a recall range of [0, 1] is used to measure performance. We use the standard training set of 7,481 images and evaluate on the designated test set of 7,518 images. Our method reaches a score of 63.05 mAP on the moderate setting for the pedestrian class. Surprisingly, we observe that many models which perform well on Caltech do not generalize well to KITTI, as detailed in Table 2.1. We expect this is due to both sensitivity to hyperparameters and the smaller training set of KITTI (∼6× smaller than Caltech10×). MS-CNN [13] is the current top performing method for pedestrian detection on KITTI. Aside from its novelty as a multi-scale object detector, MS-CNN augments the KITTI dataset by random cropping and scaling. Thus, incorporating data augmentation could alleviate the smaller training set and lead to better generalization across datasets. Furthermore, as described in the ablation study of Sec. 2.4.3, our weak segmentation supervision primarily improves the detection of unusual shapes and poses (e.g., cyclists, people sitting, bent over). However, in the KITTI evaluation, the person sitting class is ignored and cyclists are counted as false positives, hence such advantages are less helpful.

Method               Caltech (MR %)   KITTI (mAP %)   Runtime
DeepParts [152]          11.89            58.67          1s
CompACT-Deep [14]        11.75            58.74          1s
MS-CNN [13]               9.95            73.70          0.4s
SA-FastRCNN [91]          9.68            65.01          0.59s
RPN+BF [183]              9.58            61.29          0.60s
F-DNN [41]                8.65              -            0.30s
F-DNN+SS [41]             8.18              -            2.48s
SDS-RPN (ours)            9.63              -            0.13s
SDS-RCNN (ours)           7.36            63.05          0.21s

Table 2.1: Comprehensive comparison of SDS-RCNN with other state-of-the-art methods showing the Caltech miss rate, KITTI mAP score, and runtime performance.

2.4.2 Efficiency

The runtime performance of SDS-RCNN is ∼0.21s/image. We use images of size 720 × 960 pixels and a single Titan X GPU for computation. The efficiency of SDS-RCNN surpasses the current state-of-the-art methods for pedestrian detection, often by a factor of 2×.
Compared to F-DNN+SS [41], which also utilizes segmentation cues, our method executes ∼10× faster. The next fastest runtime is F-DNN, which takes 0.30s/image with the caveat of requiring multiple GPUs to process networks in parallel. Further, our SDS-RPN method achieves very competitive accuracy while only taking 0.13s/image (∼3× faster than competitive methods using a single GPU).

Moreover, we further study the trade-off of accuracy and efficiency by designing multiple models meant to be run in real-time while focusing on pedestrians in the “near” category of Caltech [38] (height of ≥ 80px). We first note that our full GPU model achieves < 1% MR on this evaluation setting but runs only at 5 frames per second even while using a high-powered GPU. We begin the efficiency study by replacing our backbone network with SqueezeNet [69], removing anchors with a scale below the 80px threshold, and finally reducing the input image scale to 211px. We observe that the performance of the aforementioned model is 4.0% MR on the “near” scale of pedestrians. The model runs at an impressive 48 frames per second on a high-powered CPU and 165 frames per second on a GPU. However, we find that a low-power embedded CPU runs such a model configuration at ≈ 1 frame per second. Therefore, we attempt to further push the runtime efficiency by removing alternating fire modules from SqueezeNet [69], changing the first convolution to have a stride of 16, removing all pooling layers, and reducing the width of all layers with kernels > 1 × 1 by half, which we refer to as SqueezeNet-Embed. After modifying, we pretrain SqueezeNet-Embed on ImageNet [79], where it achieves a respectable Top-1 and Top-5 accuracy of 44.9% and 69.2%, compared to the performance of the full SqueezeNet [69] which achieves 57.5% and 80.3% respectively. Collectively, this SqueezeNet-Embed model is able to perform at an incredibly fast 387 frames per second on a high-powered CPU at the cost of 27% MR for the “near” scale of pedestrians. On an embedded system with a low-power CPU the speed is approximately real-time, running at ≈ 10 frames per second. Hence, we have demonstrated that our proposed method can be tailored to various application needs, varying from real-time on a high-power GPU, high-power CPU, or even a low-power embedded CPU. However, in extreme cases, such as SqueezeNet-Embed, a high penalty is faced in accuracy.

2.4.3 Ablation Study

In this section, we evaluate how each significant component of our network contributes to performance using the reasonable set of Caltech [38]. First, we examine the impact of four components: weak segmentation supervision, proposal padding, cost-sensitive weighting, and stricter supervision. For each experiment, we start with SDS-RCNN and disable one component at a time, as summarized in Table 2.2.

Figure 2.7: Example error sources which are corrected by infusing semantic segmentation into shared layers. Row 1 shows the test images from Caltech1×. Row 2 shows a visualization of the RPN proposal layer using the baseline network which fails on these examples. Row 3 shows a visualization of the proposal layer from SDS-RCNN, which corrects the errors. Collectively, occlusion and unusual poses of pedestrians (sitting, cyclist, bent over) make up 75% of the corrections, suggesting that the segmentation supervision naturally informs the shared features of robust pedestrian parts and shape information.
Component Disabled     RPN     BCN    Fusion
proposal padding      10.67   13.09    7.69
cost-sensitive         9.63   14.87    7.89
strict supervision    10.67   17.41    8.71
weak segmentation     13.84   18.76   10.41
SDS-RCNN              10.67   10.98    7.36

Table 2.2: Ablation experiments evaluated using the Caltech test set. Each ablation experiment reports the miss rate for the RPN, BCN, and fused score with one component disabled at a time.

For simplicity, we disable components globally when applicable. Then we provide detailed discussion on the benefits of stage-wise fusion and comprehensively report the RPN, BCN, and fused performances for all experiments. Finally, since our BCN is designed to not share features with the RPN, we closely examine how sharing weights between stages impacts network diversification and efficiency.

Weak Segmentation: The infusion of semantic features into shared layers is the most critical component of SDS-RCNN. The fused miss rate degrades by a full 3.05% when the segmentation supervision is disabled, while both individual stages degrade similarly. To better understand the types of improvements gained by weak segmentation, we perform a failure analysis between SDS-RCNN and the “baseline” (non-weak segmentation) network. For analysis, we examine the 43 pedestrian cases which are missed when weak segmentation is disabled, but corrected otherwise. Example error corrections are shown in Fig. 2.7. We find that ∼48% of corrected pedestrians are at least partially occluded. Further, we find that ∼28% are pedestrians in unusual poses (e.g., sitting, cycling, or bent over). Hence, the feature maps infused with semantic features become more robust to atypical pedestrian shapes. These benefits are likely gained by semantic segmentation having indiscriminate coverage of all pedestrians, unlike object detection which requires specific alignment between pedestrians and anchor shapes. A similar advantage could be gained for object detection by expanding the coverage of anchors, but at the cost of computational complexity.

Proposal Padding: While padding proposals is an intuitive design choice to provide background context (Fig. 2.3), the benefit in practice is minor. Specifically, when proposal padding is disabled, the fused performance only worsens from 7.36% to 7.69% miss rate. Interestingly, proposal padding remains critical for the individual BCN performance, which degrades heavily from 10.98% to 13.09% without padding. The low sensitivity of the fused score to padding suggests that the RPN is already capable of localizing and differentiating between partial and full pedestrians, thus improving the BCN in this respect is less significant.

Cost-sensitive: The cost-sensitive weighting scheme used to regularize the importance of large pedestrians over small pedestrians has an interesting effect on SDS-RCNN. When the cost-sensitive weighting is disabled, the RPN performance actually improves to an impressive 9.63% miss rate. In contrast, without cost-sensitive weighting the BCN degrades heavily, while the fused score degrades mildly. A logical explanation is that imposing a precedence on a single scale is counterintuitive to the RPN achieving high recall across all scales. Further, the RPN has the freedom to learn scale-dependent features, unlike the BCN which warps every proposal to a fixed size. Hence, the BCN can gain a significant boost when encouraged to focus on large pedestrian features, which may be more scale-independent than features of small pedestrians.
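For concreteness, the height-based weighting of Sec. 2.3.2 reduces to the following small sketch (illustrative names, not the thesis' Caffe code):

    import numpy as np

    def cost_sensitive_weights(heights, mean_height):
        # w_i = 1 + h_i / h_bar: proposals containing taller (closer) pedestrians
        # contribute more to the BCN classification loss than small ones.
        return 1.0 + np.asarray(heights, dtype=np.float64) / mean_height

For example, with a mean height of 60 pixels, proposals of heights 30, 80, and 160 pixels would receive weights of 1.5, ≈2.33, and ≈3.67 respectively.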
Strict Supervision: Using a stricter labeling policy while training the BCN has a substantial impact on the performance of both the BCN and fused scores. Recall that the strict labeling policy requires a box to have IoU > 0.7 to be considered foreground, while the standard policy requires IoU > 0.5. When the stricter labeling policy is reduced to the standard policy, the fused performance degrades by 1.35%. Further, the individual BCN degrades by 6.43%, which is on par with the degradation observed when weak segmentation is disabled. We examine the failure cases of the strict versus non-strict BCN and observe that the false positives caused by double detections reduce by ∼22%. Hence, the stricter policy enables more aggressive suppression of poorly localized boxes and therefore reduces double detections produced as localization errors of the RPN.

Stage Fusion: The power of stage-wise fusion relies on the assumption that each network will diversify in its classification characteristics. Our design explicitly encourages this diversification by using separate labeling policies and training distributions for the RPN and BCN. Table 2.2 shows that although fusion is useful in every case, it is difficult to anticipate how well any two stages will perform when fused without examining their specific strengths and weaknesses. To better understand this effect, we visualize how fusion behaves when the RPN and BCN disagree (Fig. 2.8). We consider only boxes for which the RPN and BCN disagree using a decision threshold of 0.5. We notice that both networks agree on the majority of boxes (∼80K), but observe an interesting trend when they disagree. The visualization clearly shows that the RPN tends to predict a significant number of background proposals with high scores, which are corrected after being fused with the BCN scores. The inverse is true for disagreements among the foreground, where fusion is able to correct the majority of pedestrian boxes given low scores by the BCN. It is clear that whenever the two networks disagree, the fused result tends toward the true score for more than ∼80% of the conflicts.

Figure 2.8: Visualization of the diversification between the RPN and BCN classification characteristics. We plot only boxes which the RPN and BCN of SDS-RCNN disagree on using a threshold of 0.5. The BCN drastically reduces false positives of the RPN, while the RPN corrects many missed detections by the BCN.

Sharing Features: Since we choose to train a separate RPN and BCN, without sharing features, we conduct comprehensive experiments using different levels of stage-wise sharing in order to understand the value of diversification as a trade-off to efficiency.

Shared Layer    BCN MR    Fused MR    Runtime
conv5            16.24      10.87      0.15s
conv4            15.53      10.42      0.16s
conv3            14.28       8.66      0.18s
conv2            13.71       8.33      0.21s
conv1            14.02       8.28      0.25s
RGB              10.98       7.36      0.21s

Table 2.3: Stage-wise sharing experiments which demonstrate the trade-off of runtime efficiency and accuracy, using the Caltech dataset. As sharing is increased from RGB (no sharing) to conv5, both the BCN and fused miss rate (MR) become less effective.

We adopt the Faster R-CNN feature sharing scheme with five variations differing at the point of sharing (conv1-5), as detailed in Table 2.3. In each experiment, we keep all layers of the BCN except those before and including the shared layer. Doing so keeps the effective depth of the BCN unchanged.
For example, if the shared layer is conv4 then we replace conv1-4 of the BCN with a RoIPooling layer connected to conv4 of the RPN. We configure the RoIPooling layer to pool to the resolution of the BCN at the shared layer (e.g., conv4 → 14 × 14, conv5 → 7 × 7). We observe that as the amount of sharing is increased, the overall fused performance degrades quickly. Overall, the results suggest that forcing the networks to share feature maps lowers their freedom to diversify and complement each other in fusion. In other words, the more the networks share, the more susceptible they become to redundancies. Further, sharing features up to conv1 becomes slower than no stage-wise sharing (e.g., RGB). This is caused by the increased number of channels and higher resolution feature map of conv1 (e.g., 720 × 960 × 64), which need to be cropped and warped. Compared to sharing feature maps with conv3, using no sharing results in a very minor slowdown of 0.03 seconds while providing a 1.30% improvement to miss rate. Hence, our network design favors maximum precision for a reasonable trade-off in efficiency, and obtains speeds generally 2× faster than competitive methods.

2.5 Summary

We present a multi-task infusion framework for joint supervision on pedestrian detection and semantic segmentation. The segmentation infusion layer results in more sophisticated shared feature maps which tend to illuminate pedestrians and make downstream pedestrian detection easier. We analyze how infusing segmentation masks into feature maps helps correct pedestrian detection errors. In doing so, we observe that the network becomes more robust to pedestrian poses and occlusion compared to without. We further demonstrate the effectiveness of fusing stage-wise scores and encouraging network diversification between stages, such that the second stage classifier can learn a stricter filter to suppress background proposals and become more robust to poorly localized boxes. In our SDS-RCNN framework, we report new state-of-the-art performance on the Caltech pedestrian dataset (23% relative reduction in error), achieve competitive results on the KITTI dataset, and obtain an impressive runtime approximately 2× faster than competitive methods.

Chapter 3
Pedestrian Detection with Autoregressive Network Phases

We present an autoregressive pedestrian detection framework with cascaded phases designed to progressively improve precision. The proposed framework utilizes a novel lightweight stackable decoder-encoder module which uses convolutional re-sampling layers to improve features while maintaining efficient memory and runtime cost. Unlike previous cascaded detection systems, our proposed framework is designed within a region proposal network and thus retains greater context of nearby detections compared to independently processed RoI systems. We explicitly encourage increasing levels of precision by assigning strict labeling policies to each consecutive phase such that early phases develop features primarily focused on achieving high recall and later ones on accurate precision. In consequence, the final feature maps form more peaky radial gradients emanating from the centroids of unique pedestrians. Using our proposed autoregressive framework leads to new state-of-the-art performance on the reasonable and occlusion settings of the Caltech pedestrian dataset, and achieves competitive state-of-the-art performance on the KITTI dataset.
3.1 Introduction

Figure 3.1: Illustration of our proposed autoregressive framework with sample phase (P1→3) classification prediction maps and box visualizations on the Caltech [38] dataset. Our method iteratively re-scores predictions under incrementally more precise label policies, using a series of de-encoder modules comprised of decoder and encoder pathways. Notice a heavy reduction in false positives (red) as phases progress, while true positives (green) are retained.

Detecting pedestrians in urban scenes remains a challenge in computer vision despite recent rapid advances [11, 75, 108, 131, 147, 160, 186, 187, 189]. The use of ensemble [41, 150, 180] and recurrent [131, 148] networks has been successful in top-performing approaches of pedestrian / object detection. Recurrent networks refine upon their own features while ensemble networks gather features through separate deep classifiers. Both techniques offer a way to obtain stronger and more robust features, and thus better detection.

However, the characteristics of ensemble and recurrent networks are distinct. Ensemble networks assume that separate networks will learn diversified features which, when combined, become more robust. In contrast, recurrent networks inherit previous features as input while further sharing weights between successive networks. Hence, recurrent networks are more capable of refining than diversifying. Intuitively, we expect that both feature diversification and refinement are important components to pair together.

Therefore, we explore how to approximate an ensemble of networks using a stackable decoder-encoder module and incorporating an autoregressive¹ flow to connect them, as illustrated in Fig. 3.1. We formulate our framework as a series of phases where each is a function of the previous phase feature maps and classification predictions. Our decoder-encoder module is made of bottom-up and top-down pathways similar to [78, 98, 106, 119]. However, rather than using bilinear or nearest neighbor re-sampling followed by conventional convolution, we propose memory-efficient convolutional re-sampling layers to generate features and re-sample simultaneously in a single step.

In essence, our approach aims to take the best of both the ensemble and recurrent approaches. For instance, since past predictions and features are re-used, our network is able to refine features when necessary. Secondly, since our phases incorporate inner-lateral convolutions and do not share weights, they are also capable of learning new and diversified features. Furthermore, we are able to design the network with an efficient overhead due to the added flexibility of using non-shared network weights for each phase and by using memory-efficient convolutional re-sampling layers. As a consequence, we are able to choose optimal channel settings with respect to efficiency and accuracy. To take full advantage of the autoregressive nature of our network, we further assign each phase a distinct labeling policy which iteratively becomes more strict as phases progress.

¹We adopt the naming distinction of autoregressive (vs. recurrent) as a network conditioned on previous predictions without the constraint of repeated shared weights, inspired by terminology in WaveNet [157] which uses causal convolution instead of conventional recurrence.
In this way, we expect that the predictions of each consecutive phase will become less noisy and produce tighter and more clusterable prediction maps. Under the observation that our proposed autoregressive region proposal network (RPN) obtains a high recall in the final phase, we also incorporate a simple hard suppression policy into the training and testing of our second-stage R-CNN classifier. Such a policy dramatically narrows the subset of proposals processed in the second-stage pipeline (∼65%), and greatly improves the runtime efficiency accordingly.

We evaluate our framework on the Caltech [38] pedestrian detection dataset under challenging occlusion settings, using both the original and newly proposed [184] annotations, and further on the KITTI [51] benchmark. We achieve state-of-the-art performance under each test setting and report a marginal overhead cost in runtime efficiency. To summarize, our contributions are the following:

• We propose a multi-phase autoregressive pedestrian detection system inside a RPN, where each phase is trained using increasingly precise labeling policies.

• We propose a lightweight decoder-encoder module to facilitate feature map refinement and message passing using convolutional re-sampling layers for memory-efficient pathways.

• We achieve state-of-the-art performance on Caltech [38] under various challenging settings, and competitive performance on the KITTI [51] pedestrian benchmark.

Figure 3.2: Predictions of our autoregressive network (a) are directly conditioned on past feature maps, as in a recurrent network (c), and do not share weights between phases, as in an ensemble network (b). Unlike either, our network is further conditioned on past predictions.

3.2 Related Work

Ensemble Networks: Recent top-performing methods [41, 150, 180] on detection have employed ensemble-based techniques where predictions from multiple deep convolutional neural networks (CNNs) are fused. For instance, [41] proposes a soft-weighting scheme using an ensemble of independent detectors, which demonstrates high accuracy with fused scores. However, one drawback is having multiple CNNs in memory and processing each in parallel. Thus, both the scalability as networks become larger and the usability in memory-constrained systems are lessened. Further, [11] forms a small ensemble by fusing RPN scores with the scores of a R-CNN detector and demonstrates improved performance. Compared to these methods, our single RPN functions as an ensemble of inter-connected small networks, which can improve the precision without critically obstructing runtime or memory efficiency.

Cascaded Networks: A similar line to ensemble networks takes the form of cascaded detection systems [15, 123, 128], which build on a series of R-CNN detectors and function on cropped regions-of-interest (RoIs) generated by a static proposal network. In contrast, our work focuses on a fully convolutional cascade inside a proposal network. Therefore, our network is better equipped to utilize contextual cues of surrounding detections to inform suppression of duplicate detections, whereas cropped RoIs are processed independently of other proposals. Liu et al. [108] propose supervision using incremental labeling policies similar to our approach. However, rather than making immediate predictions based only on previous predictions, we develop new features through our decoder-encoder pathway.
Recurrent Networks: Recurrent networks are a powerful technique in many challenging procedural [59, 115] and temporal [17, 137, 146] computer vision problems. Recently, they have been further demonstrated in urban object detection [131] and person head detection [148]. For instance, [148] uses a recurrent LSTM to iteratively detect a single person at a time until reaching an end condition, thus side-stepping the need to perform non-maximum suppression (NMS) in post-processing. In contrast, [131] proposes a rolling recurrent convolution (RRC) model which refines feature maps and produces new detections at each step. In this respect, our proposed method is similar to RRC, but with two critical differences. Firstly, the networks of our phases are not shared. This enables us to learn specialized (ensemble-like) features in each phase and gives more freedom in the network design of a phase, which may aid runtime efficiency when using conservative designs. Secondly, we condition each phase on previous feature maps and predictions, which forms a more potent autoregressive foundation. We show a high-level comparison of our autoregressive network, ensemble networks, and recurrent networks in Fig. 3.2.

Encoder-Decoder Networks: Many recent works [78, 98, 106, 134] have explored multi-strided feature map re-use within computer vision. Each variant of these architectures utilizes a series of convolution, feature aggregation (concat, residual), and up-sampling / pooling layers in order to form an encoder-decoder structure. Similar to the network structure in [119] for human pose estimation, we incorporate stackable top-down and bottom-up pathways. However, in contrast to prior work, we design our de-encoder module without explicitly using bilinear or nearest neighbor re-sampling. Instead, we uniquely blend the feature generation and re-sampling into a single convolution layer using a fractional stride (↑) or strided convolution (↓), making the travel nodes in our streams as compact as possible. We show in ablation that a single convolutional re-sampling layer consumes low memory and performs better compared to the conventional two-step techniques previously used.

Figure 3.3: Overview of our proposed AR-RPN framework (left) and detailed illustration of our de-encoder module (right). The de-encoder module consists of top-down and bottom-up pathways with inner-lateral convolution between pathways to produce diversified features, as well as convolutional re-sampling layers (s denotes convolutional stride) ei and di for memory-efficient feature generation. We further condition predictions on the previous phase predictions through concatenation within fk(·).

3.3 Autoregressive Detector

Our proposed framework is made up of two stages: an autoregressive RPN, hence referred to as AR-RPN, and a second-stage R-CNN detector, each founded on [132, 183]. We collectively refer to both stages as AR-Ped. As shown in Fig. 3.3, AR-RPN consists of multiple phases, where each predicts classification scores and passes these predictions and their features into the next phase. Each phase is connected to the last through bottom-up and top-down pathways, which form a lightweight decoder-encoder module. This module is stackable onto the backbone RPN and onto itself repeatedly. We supervise each phase to jointly learn increasingly more precise predictions by imposing a stricter labeling policy on consecutive phases, thereby producing more peaky and clusterable classifications in the final phase.
We apply the box transformations, NMS, and a hard suppression policy to the final predictions, for which the remaining subset is used to train a specialized R-CNN detector.

3.3.1 De-Encoder Module

To perform autoregressive detection in a single model, we design a stackable decoder-encoder module, termed de-encoder, where its top-down pathway leverages past feature maps and its bottom-up pathway encodes stronger semantics. Following [98], we give each pathway the ability to learn from feature maps at multiple depths of the backbone network. Importantly, our design encourages the highest level features to remain at the lowest resolution where object detection functions most efficiently. Intuitively, the de-encoder enables the network to look back at previous features and learn more advanced features during re-encoding.

Let us recall that typical network architectures, e.g., VGG-16 [141] and ResNet-50 [65], function from low to high stride levels using a series of convolution and pooling layers. We denote the set of strides of a backbone network as S, where 2^(i−1) is the down-sampling factor of the ith stride level preceding a pooling operation. In pedestrian detection, it is common to have n = 5 unique stride levels such that S = {1, 2, 4, 8, 16}. The hyperparameters of the de-encoder module include a designated target stride t and a channel width ci specific to each stride, which respectively control how far up in resolution the phase should de-encode and how many channels should be learned at each stride. The primary goal of the de-encoder module is to produce finer features at each level from the target stride t to the final stride n of the network. Denoting C^k = {C^k_t, . . . , C^k_n} as the refined features of the kth phase at each stride, gk(·) as the set of convolutional and ReLU layers, Φk the respective weights, and tk the target stride of feature maps to de-encode and refine, the autoregressive nature of the feature generation can be expressed as:

C^k = gk(C^(k−1) | Φk, tk).    (3.1)

Hence, each phase of the network takes as input the previous phase feature maps and produces more advanced features. Initial features C^1 are given from the top-most layers at corresponding strides of the backbone (e.g., in VGG-16, C^1_4 = conv4_3, C^1_5 = conv5_3, and so forth).

Top-down pathway: We design our top-down decoder for phase k by attaching a convolutional layer with BN [70] to the {C^(k−1)_(t...n)} feature maps, which produce inner-lateral convolutions Li with corresponding channel widths ci. Rather than using a two-step process comprised of bilinear / nearest neighbor up-sampling followed by convolution as done in prior work, we denote di(·) as a convolutional up-sampling layer which simultaneously performs 2× up-sampling and feature reduction into channel width ci using fractionally strided convolution. The combined operation is more efficient in both memory and runtime. Starting with the highest feature stride n, we use di(·) to iteratively decode features, which are then fused with the lateral features at the decoded stride Li through element-wise addition, denoted:

Di = di(Di+1) + Li.    (3.2)

We begin with the base case of Dn = Ln, and repeat this procedure until the target stride feature map Dt is reached. In theory, the top-down pathway enables high-level semantics to be passed down through the decoded term di(Di+1) and low-level features to be re-examined using Li.

Bottom-up pathway: We design the bottom-up encoder in the opposite manner as the decoder.
We first attach a convolutional layer with BN to each {D^(k−1)_(t+1...n)}, which each produce new lateral features L′i with ci channels. Similar to the decoder pathway, we denote ei(·) as a single convolutional down-sampling layer which simultaneously performs 2× down-sampling and feature expansion into channel width ci using strided convolution, rather than the conventional two-step process used in previous work. We use ei(·) to iteratively encode the features at each stride, which are then fused with the lateral features of the encoded stride L′i via element-wise addition, denoted as:

Ei = ei(Ei−1) + L′i.    (3.3)

As the name suggests, the bottom-up encoder starts with the lowest stride t and repeats until the nth stride is reached, such that the lateral feature at t is Et = Dt. The bottom-up pathway enables the network to encode low-level features from the lowest stride through the ei(Ei−1) term and for higher-level features to be re-examined using L′i.

3.3.2 Autoregressive RPN

We utilize the standard RPN head and multi-task loss proposed in [54] following the practices in [183]. We predefine a set of anchor shapes which act as hyperparameters describing the target pedestrian scales. The RPN head is comprised of a proposal feature extraction (PFE) layer connected to two sibling layers which respectively predict anchor classification (cls) and bounding box regression (bbox) output maps, hence forming a multi-task learning problem.

Multi-phase Network: Our RPN is comprised of a total of Nk = 3 phases. The first phase is simply the backbone network starting with the modified VGG-16 [141] that has strides of S = {1, 2, 4, 8, 16}. The second phase is a de-encoder module which has a target stride t = 3 and channel widths of c3 = 128, c4 = 256, c5 = 512. The final phase is another stack of the de-encoder module following the same channel settings but uses a memory-conservative lower target stride of t = 4. The spatial resolution at the ith stride can be denoted as wi × hi = W/2^(i−1) × H/2^(i−1), where W × H is the input image resolution. Thus, the final proposal network architecture forms a stair-like shape as in Fig. 3.3.

Autoregressive Flow: To enable the autoregressive flow between phases, we place a PFE layer and classification layer at the end of each phase encoder. For all phases except the first, we concatenate the previous phase predictions into the input features for the corresponding phase PFE layer. In doing so, each phase is able to start with strong compact features by directly utilizing its previous phase predictions. Further, the PFE layer of the final phase Nk produces the bounding box regression output map, since these features are the most precise and peaky within the network. Formally, we denote functions fk(·) and pk(·) as the kth phase PFE layer and classification layer respectively. We build f(·) as a convolutional layer with a 3 × 3 kernel and 512 output channels followed by a ReLU layer, while p(·) is a convolutional layer with a 1 × 1 kernel whose output channels are 2× the number of anchors (A). Thus, pk(·) forms an autoregressive function of previous phase predictions with an output dimension of w5 × h5 × 2A, via:

Pk = pk(fk(Pk−1 ‖ C^k_n)),    (3.4)

where Pk−1 is the classification feature map of the previous phase, aka past predictions, ‖ is the concatenation operator, and C^k_n is the last encoded feature map of the kth phase.
As defined, the PFE fk(·) and classification layer pk(·) are conditioned autoregressively on past predictions, which logically act as compact but powerful semantic features. In this way, each phase is more free to learn new features C^k_n to directly complement the past predictions. In essence, the autoregressive flow can be seen as a running memory of the most compact and strong features within the network.

Classification Task: Each classification layer which follows a PFE layer is formulated as proposed in [132] following the experimental settings of [11]. Formally, given a PFE layer with dimensions w × h, the designated classification layer predicts a score for every spatial location of the image (x, y) ∈ R^(w×h) against every predefined anchor shape a ∈ A, and every target class. Every spatial location of the prediction map is therefore treated as a distinct box with its own corresponding classification score. To produce labels for each box, a labeling policy is adopted using a hyperparameter h that controls the Intersection over Union (IoU) criterion a box must reach with a ground truth in order to be considered foreground. After every box is assigned a label according to the labeling policy, each classification layer is supervised using a multinomial cross-entropy logistic loss as in [54].

Localization Task: The localization task is formed using the same set of anchor boxes described in the classification task. The localization task aims to perform bounding box regression, which predicts a bounding box transformation for each foreground box towards the nearest pedestrian. A proposal box is considered near a pedestrian ground truth if there is at least h Intersection over Union between the two boxes. The box transformation is defined by 4 variables consisting of translation (tx, ty) and scale factors (tw, th) such that, when applied, they will transform the source box into the target ground truth. We train the bounding box regression values using a Smooth L1 loss [54].

Incremental Supervision: In order to better leverage the autoregressive and de-encoder properties of AR-RPN, we choose to assign different classification labeling policies to each consecutive phase. We emphasize that the de-encoder modules enable the network to adapt and become a stronger classifier, which can be exploited to produce more accurate and tighter classification clusters when supervised with incrementally stricter labeling policies.

Let us briefly discuss the trade-offs regarding different labeling policies. Consider using a labeling policy of h = 1, which is approximately equivalent to requiring the network to output a single box for each pedestrian; the resulting class imbalance may make training difficult. In contrast, as a labeling policy becomes more lenient at h = 0.5, the classification becomes more balanced but produces many false positives as duplicate detections. In theory, bounding box regression will reduce the impact of double detections by transforming boxes into clusters which can be suppressed by NMS. Ideally, a network has high-performing bounding box regression and/or tight, clusterable classification maps, since both enable NMS to cluster duplicate detections. Therefore, rather than using a single discrete labeling policy of h = 0.5, we assign lenient-to-strict policies h1 = 0.4, h2 = 0.5, h3 = 0.6 to each phase classification layer respectively. In contrast to [108], we enforce incremental supervision between de-encoder modules rather than applying it immediately in quick succession.
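Combining the autoregressive flow of Eq. (3.4) with this lenient-to-strict supervision, a single phase head can be sketched as follows in PyTorch-style pseudocode; the class name and default sizes are illustrative assumptions, not the thesis' Caffe implementation.

    import torch
    import torch.nn as nn

    class PhaseHead(nn.Module):
        # One phase: a proposal feature extraction (PFE) 3x3 conv over the encoded
        # features concatenated with the previous phase's prediction map, followed
        # by a 1x1 classification conv producing 2 scores per anchor.
        def __init__(self, feat_channels=512, num_anchors=9, prev_channels=0):
            super().__init__()
            self.pfe = nn.Sequential(
                nn.Conv2d(feat_channels + prev_channels, 512, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            self.cls = nn.Conv2d(512, 2 * num_anchors, kernel_size=1)

        def forward(self, enc_feats, prev_preds=None):
            # prev_preds is None only for the first phase; later phases concatenate
            # the previous classification map (their running autoregressive memory).
            x = enc_feats if prev_preds is None else torch.cat([enc_feats, prev_preds], dim=1)
            return self.cls(self.pfe(x))

    # Each consecutive phase is then supervised with a stricter IoU labeling policy,
    # e.g. h1, h2, h3 = 0.4, 0.5, 0.6; only the final phase additionally predicts
    # the bounding box regression map.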
In consequence, our classification score maps are supervised to gradually become more peaky and clusterable.

Loss Formulation: In addition to the classification and bounding box regression losses, we further add auxiliary losses in the form of weak semantic segmentation as in [11]. Specifically, during training we add a binary semantic segmentation layer to each stride of the first top-down pathway to act as an auxiliary loss and accelerate training. We formally define the joint loss terms incorporating the phase classification softmax loss Lcls, the final phase localization Smooth L1 loss Lbbox, and each softmax auxiliary loss Lseg as:

L = Σ_{k=1}^{Nk} λk Lcls + λb Lbbox + λs Σ_{i=3}^{5} Lseg,    (3.5)

where k corresponds to phases 1 → Nk of the full network, and i represents the stride for each auxiliary segmentation layer of the backbone network. We use Caffe [71] with SGD following the settings in [183] in our training. We set λ1 = λ2 = 0.1, λ3 = 1, λb = 5, and λs = 1.

Method                MR O−2  MR O−4  MR N−2  MR N−4  Partial O  Heavy O  RT (ms)  KITTI Easy  Mod.   Hard
MS-CNN [13]            9.95    22.45    8.08   17.42    19.24     59.94      64       83.92    73.70  68.31
RRC [131]               −        −       −       −        −         −        75         −      75.33    −
RPN+BF [183]           9.58    18.60    7.28   16.76    24.23     74.36      88       75.58    61.29  56.08
F-DNN [41]             8.65    19.92    6.89   14.75    15.41     55.13       −         −        −      −
TLL(MRF)+LSTM [147]    7.40      −       −       −        −         −         −         −        −      −
ALFNet [108]             −       −      6.10     −        −         −         −         −        −      −
SDS-RCNN [11]          7.36    17.82    6.44   15.76    14.86     58.55      95         −      63.05    −
RepulsionLoss [160]      −       −      5.00     −        −         −         −         −        −      −
FRCNN+ATT-vbb [187]   10.33      −       −       −        −       45.18       −         −        −      −
PDOE+RPN [189]         7.60      −       −       −      13.30     44.40       −         −        −      −
GDFL [96]              7.85    19.86     −       −      16.74     43.18       −       84.61    68.62  66.86
DSSD [48]+Grid [75]   10.85    18.20     −       −      24.28     42.42       −         −        −      −
AR-RPN (ours)          8.01    21.62    5.78   15.86    16.30     58.06      86         −        −      −
AR-Ped (ours)          6.45    15.54    4.36   11.39    11.93     48.80      91       83.66    73.44  68.12

Table 3.1: Comprehensive comparison of our frameworks and the state-of-the-art on the Caltech and KITTI benchmarks, in both accuracy and runtime (RT). We show the Caltech miss rates at multiple challenging settings, with both the original (O) and new (N) annotations, and at occlusion settings with the original annotations and FPPI range MR O−2. Further, we evaluate the KITTI pedestrian class under easy, moderate, and hard settings, with mean Average Precision (mAP) [51].

3.3.3 R-CNN Detector

Most pedestrian detection frameworks are derivatives of Faster R-CNN [132], and hence incorporate a second-stage scale-invariant region classifier termed the R-CNN. Following [11], we utilize a modified VGG-16 as the R-CNN, which functions on cropped RGB regions proposed by AR-RPN, utilizes a strict labeling policy, and fuses its scores with the RPN. However, unlike past methods we impose a simple hard suppression policy that suppresses all box proposals with a score less than a hyperparameter z. This has two advantages. Firstly, it greatly improves runtime since only a subset of proposals need to be processed. Secondly, by focusing on only the hard samples left over from the RPN, the R-CNN learns specialized classification, similar to the motivation of AR-RPN.

Loss Formulation: As in AR-RPN, we also use a softmax loss to train the R-CNN. We use a strict labeling policy requiring h ≥ 0.7 IoU for foreground, a weak segmentation auxiliary loss Lseg, and the height-sensitive weighting scheme w as detailed in [11].
We set z = 0.005 to impose a score suppression of the RPN proposals and eliminate confident background proposals from being re-processed. In practice, the suppression dramatically reduces the search space for both efficiency and accuracy while critically keeping recall unaffected. Thus, we denote the R-CNN loss as:

L = Σ_j wj Lcls(cj, ĉj) + Lseg,  if cj ≥ z,    (3.6)

where j corresponds to each proposal of AR-RPN, c is the classification result of the R-CNN, and ĉ is the class label. We use Caffe to train the R-CNN following the settings of [11].

3.4 Experiments

We evaluate our proposed AR-Ped framework on two challenging datasets: Caltech [38, 39] and KITTI [51]. We perform experiments ablating our approach from the perspective of design choices and hyperparameters. We further examine the qualitative changes and analyze the quantitative peakiness of detections across phases.

3.4.1 Caltech

The Caltech [38, 39] dataset is a widely used benchmark for pedestrian detection that contains 10 hours of video taken from an urban driving environment with ∼350,000 bounding box annotations and 2,300 unique pedestrians. We use Caltech10× for training and the Caltech reasonable setting [39] for testing, unless otherwise specified. The evaluation uses a miss rate (MR) metric averaged over a false positive per image (FPPI) range of [10⁻², 10⁰] and also a more challenging metric over the range [10⁻⁴, 10⁰], respectively referred to as MR−2 and MR−4. Recently, new annotations were released [184] to correct the official annotations in terms of consistency and box alignment. For completeness, we evaluate on both the original and the new annotations, denoted respectively as MR_O and MR_N.

We compare our work to the state-of-the-art pedestrian detection methods on Caltech with respect to the core experimental configurations using each combination of original/FPPI setting, and partial/heavy occlusion within the original annotation space as defined in [38]. We limit our comparison to the top-2 methods of any sub-category trained using the Caltech10× dataset since these comprise the most highly competitive methods. We also emphasize that we are among the few methods to comprehensively evaluate and report each setting, and we intend to open-source our code to the community upon release.

Our method advances the state-of-the-art on all but one evaluation setting, as detailed in Table 3.1. Under the most common benchmark reasonable setting, we achieve a miss rate of 6.45% (↓ 0.91) and 4.36% (↓ 0.64) on the official annotations MR_O−2 and new annotations MR_N−2 respectively. Further, our approach has increased robustness to partial occlusion (↓ 1.37% miss rate). Compared to methods which do not explicitly address occlusion [11, 13, 41, 183], our method also improves w.r.t. heavy occlusion (↓ 6.33% miss rate). Yet, our method underperforms on heavy occlusion compared to work specially designed to target the occlusion problem [75, 96, 160, 189], which is orthogonal to our work.
We use the official training set of 7,481 images and evaluate on the standard 7,518 test images. We adopt the settings and core training code of [13] in order to initialize good starting hyperparameters. However, due to GPU memory constraints we set the input image scale to 576 pixels in height, and achieve competitive performance on the pedestrian class, as reported in Table 3.1. As described in [11], high performing pedestrian detectors [11, 91, 183] on Caltech and KITTI do not usually have high correlation. We emphasize that our AR-Ped is among the first to report high performance for both datasets, which suggests the generalization of our model to pedestrian detection rather than a specific dataset.

3.4.3 Ablations

All ablation experiments use our AR-RPN and the Caltech test set under the reasonable MR_O−2 FPPI setting, as this is the most widely tested setting on Caltech.

What are optimal de-encoder settings? In order to analyze the de-encoder module, we ablate its parameters in each phase concerning channel widths at each feature stride and target strides to de-encode.

Nk   c size   MR O−2   MAC (G)   Runtime (ms)
1      M       10.16     217.9        68
2      M        8.32     321.3        80
3      S        8.62     255.3        74
3      M        8.01     355.9        86
3      L        8.33     429.3       115
4      M        8.68     429.3        97

Table 3.2: The performance with different parameters and numbers of phases under the Caltech reasonable MR_O−2 setting. We further detail the efficiency of each setting in terms of multiply-accumulate (MAC) operations and runtime on an NVIDIA 1080 Ti.

Our primary method of AR-RPN uses what we refer to as the medium channel width settings of cM = {128, 256, 512}. We further denote small and large channel settings such that cS = {64, 128, 256} and cL = {256, 512, 512}, and then train our AR-RPN with all other settings kept consistent. Surprisingly, the small and large channel widths function similarly but neither as well as the medium, which roughly follows the rule-of-thumb channel settings outlined in VGG-16 [141]. For instance, cL and cS achieve 8.33% (↑ 0.32%) and 8.62% (↑ 0.61%) miss rate, as detailed in Table 3.2. This suggests a difficulty when over- or under-expanding channels compared to the channel widths of the source feature maps in C^1. We further analyze the runtime complexity of the de-encoder modules under each proposed setting in Table 3.2. Overall, we observe that channel width settings have a large effect on both the multiply-accumulate (MAC) and runtime efficiencies of AR-RPN. Specifically, channel width settings of cS, cM, and cL respectively slow down by 8%, 26%, and 69% compared to the Nk = 1 baseline.

What is the effect of convolutional re-sampling? Unlike previous decoder-encoder works [78, 98, 106, 119, 134], our module combines its re-sampling and feature generation into single convolutional re-sampling layers using either a stride of 2 or a fractional 1/2 stride. To better understand the importance of this combined operation, we split every convolutional re-sampling layer e(·) and d(·) into two separate layers: a bilinear re-sampling layer and a convolutional feature generation layer. We observe that this separation causes performance to degrade from 8.01% → 9.45% miss rate. This degradation suggests that providing the network with more freedom in re-sampling, as opposed to fixing the kernels to bilinear (or nearest neighbor), is beneficial for detection. Moreover, separating the operations into two steps is naturally less efficient concerning memory usage and runtime.
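To make the contrast explicit, the single-step convolutional re-sampling layers correspond to a strided convolution (e_i, down) and a fractionally strided / transposed convolution (d_i, up), versus the conventional two-step resize-then-convolve alternative ablated here. The PyTorch-style sketch below is illustrative only; the kernel sizes are assumptions, not the thesis' exact configuration.

    import torch.nn as nn

    def conv_down(c_in, c_out):
        # e_i: 2x down-sampling and feature expansion in one strided convolution.
        return nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def conv_up(c_in, c_out):
        # d_i: 2x up-sampling and feature reduction in one fractionally strided
        # (transposed) convolution.
        return nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1)

    def two_step_up(c_in, c_out):
        # Conventional alternative ablated in this section: a fixed bilinear
        # resize followed by a separate convolution (more memory, slower).
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        )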
How many autoregressive phases to stack? The use of autoregressive phases is clearly a critical component of our framework. Therefore, to understand its impact we ablate our framework by varying the number of phases while keeping all other settings constant. We report the performance of each setting in Table 3.2. Unsurprisingly, as fewer phases are used the performance is steeply reduced. For instance, recall that our 3-phase method achieves an 8.01% miss rate. By removing a single phase, the miss rate increases by ↑ 0.32% while only gaining 6 ms in runtime efficiency. When another phase is removed, an extreme degradation of ↑ 2.15% is observed. Hence, the effect of additional phases seems to diminish with Nk such that the first additional phase has the highest impact, as suggested by Fig. 3.4. We further add a 4th phase following the same trend in incremental labeling (h4 = 0.7) and observe that the performance begins to worsen. We suspect using more dense anchor sampling may help train the very high IoU threshold.

How to choose incremental labeling policies? Labeling policies are an important component of our autoregressive framework. We demonstrate the level of sensitivity and importance when using a variety of incremental labeling policies. Since high-value IoU labeling policies only admit very well localized boxes as foreground, we refer to the IoU labeling policy of h ≥ 0.4 as lenient, h ≥ 0.5 as moderate, and h ≥ 0.6 as strict.

Labeling Policy       | MRO−2
no autoregressive     | 9.06
strict → lenient      | 9.03
moderate → moderate   | 8.94
strict → strict       | 8.43
lenient → strict      | 8.01

Table 3.3: The effects of labeling policies on the Caltech dataset under the reasonable MRO−2 setting.

We train the AR-RPN using labeling techniques of strict-to-lenient, moderate-to-moderate, strict-to-strict, and our primary setting of lenient-to-strict, as shown in Table 3.3. The strict-to-lenient method performs the worst among all settings, degrading by 1.02% MR. The moderate-to-moderate setting performs similarly and degrades by 0.80% MR. As shown in Fig. 3.4, the primary labeling policy of lenient-to-strict enables the network to start with large clusters of pedestrian box detections and iteratively suppress them, resulting in tighter and peakier prediction maps. In contrast, strict-to-strict does not ease this transition as well, resulting in a degradation of 0.42% MR. We further validate the effect by analyzing the score distributions across all pedestrians in the X/Y directions for the Caltech test dataset, as shown in Fig. 3.5. We observe a consistent trend in both directions where each successive phase results in a sharper peak with respect to its mean score. The other labeling policies either encourage the opposite trend or encourage each phase to repeat essentially the same predictions rather than progressively sharpening them. On a related point, we further examine the disagreements between phases (∆P1→3, colored magenta in Fig. 3.4), which re-affirms that the phases logically agree on the centroids of pedestrians. This analysis further shows that most suppression appears to be due to poorly localized boxes, primarily in the Y-direction (e.g., offset from the legs or head of a pedestrian).
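As an illustration of incremental labeling, the sketch below assigns per-phase foreground labels from anchor IoUs using the lenient-to-strict thresholds h = {0.4, 0.5, 0.6}; the background threshold of 0.3 and the ignore handling are assumptions for the example, not the exact AR-Ped settings.

```python
import torch

def phase_labels(ious, h_fg=(0.4, 0.5, 0.6), h_bg=0.3):
    """Per-phase classification targets for an incremental (lenient -> strict) policy.
    `ious` holds each anchor's best ground-truth IoU.
    Returns one label tensor per phase: 1 = pedestrian, 0 = background,
    -1 = ignored (anchors between the background and foreground thresholds)."""
    labels = []
    for h in h_fg:
        lbl = torch.full_like(ious, -1.0)
        lbl[ious < h_bg] = 0.0   # confident background, shared across phases
        lbl[ious >= h] = 1.0     # foreground under this phase's stricter threshold
        labels.append(lbl)
    return labels

ious = torch.tensor([0.10, 0.35, 0.45, 0.55, 0.65])
for k, lbl in enumerate(phase_labels(ious), 1):
    print(f"phase {k}: {lbl.tolist()}")
# phase 1 keeps the most anchors as foreground; phase 3 keeps only well-localized ones.
```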
For completeness, we further evaluate the extreme case where there is no incremental supervision or autoregressive flow within the network, as included in Table 3.3. In this case, the core 3-phase network architecture is kept intact, except that the prediction layers and concatenation have been removed from phases 1 → 2 and 2 → 3; therefore there is no incremental labeling policy to be decided. In doing so, the detection performance degrades by a considerable 2.14% miss rate, which further emphasizes the importance of the autoregressive supervision between phases.

Figure 3.4: We visualize the prediction maps P̃k of each phase by taking the maximum of foreground scores across all A anchors at each spatial location, i.e., denoting Pk = {Pk^bg, Pk^fg}, we define P̃k = max_A Pk^fg. We use scaled blue → yellow colors to visualize P̃k, where yellowness indicates high detection confidence. The detections of each phase become increasingly tighter and more adept to non-maximum suppression due to the incremental supervision for each phase (Sec. 3.3.2). We further analyze the prediction disagreements between phases ∆1→3, shown in the right column, where green represents the agreement of the foreground and magenta the regions suppressed.

Figure 3.5: We analyze the mean prediction score (P̃k) of 20 uniformly sampled points along the center lines of the X-direction (left) and Y-direction (right), averaged over all ground-truth pedestrians in the Caltech test dataset, using bilinear interpolation when necessary. We note that successive phase scores form more peaky inclines radiating from the center of the pedestrian.

3.5 Summary

In this work, we present an autoregressive pedestrian detection framework which utilizes a novel stackable de-encoder module with convolutional re-sampling layers. The proposed AR-Ped framework is able to autoregressively produce and refine both features and classification predictions. In consequence, the collective phases approximate an ensemble of increasingly more precise classification decisions and result in an overall improved classifier for pedestrian detection. We specifically supervise each phase using increasingly stricter labeling policies such that each phase of the network has similar recall as the last but with tighter and more clusterable prediction maps. We provide comprehensive ablation experiments to better understand and support each proposed component of our framework. We attain new state-of-the-art results on the Caltech dataset throughout many challenging experimental settings and achieve a highly competitive accuracy on the KITTI benchmark.

Chapter 4

M3D-RPN: Monocular 3D Region Proposal Network for Object Detection

Understanding the world in 3D is a critical component of urban autonomous driving. Generally, the combination of expensive LiDAR sensors and stereo RGB imaging has been paramount for successful 3D object detection algorithms, whereas monocular image-only methods experience drastically reduced performance. We propose to reduce the gap by reformulating the monocular 3D detection problem as a standalone 3D region proposal network. We leverage the geometric relationship of 2D and 3D perspectives, allowing 3D boxes to utilize well-known and powerful convolutional features generated in the image-space. To help address the strenuous 3D parameter estimations, we further design depth-aware convolutional layers which enable location-specific feature development and in consequence improved 3D scene understanding. Compared to prior work in monocular 3D detection, our method consists of only the proposed 3D region proposal network rather than relying on external networks, data, or multiple stages.
M3D-RPN is able to significantly improve the performance of both the monocular 3D Object Detection and Bird's Eye View tasks within the KITTI urban autonomous driving dataset, while efficiently using a shared multi-class model.

4.1 Introduction

Scene understanding in 3D plays a principal role in designing effective real-world systems such as in urban autonomous driving [5, 27, 51] and robotics [60, 151]. Currently, the foremost methods [30, 95, 126, 138, 174] on 3D detection rely extensively on expensive LiDAR sensors to provide sparse depth data as input. In comparison, monocular image-only 3D detection [23, 24, 118, 172] is considerably more difficult due to an inherent lack of depth cues. As a consequence, the performance gap between LiDAR-based methods and monocular approaches remains substantial.

Prior works on monocular 3D detection have each relied heavily on external state-of-the-art (SOTA) sub-networks, which are individually responsible for performing point cloud generation [24], semantic segmentation [23], 2D detection [118], or depth estimation [172]. A downside to such approaches is an inherent disconnection in component learning as well as system complexity. Moreover, reliance on additional sub-networks can introduce persistent noise, contributing to a limited upper-bound for the framework.

In contrast, we propose a single end-to-end region proposal network for multi-class 3D object detection (Fig. 4.1). We observe that 2D object detection performs reasonably and continues to make rapid advances [13, 15, 28, 50, 76, 131]. The 2D and 3D detection tasks each aim to ultimately classify all instances of an object; whereas they differ in the dimensionality of their localization targets. Intuitively, we expect the power of 2D detection can be leveraged to guide and improve the performance of 3D detection, ideally within a unified framework rather than as separate components. Hence, we propose to reformulate the 3D detection problem such that both the 2D and 3D spaces utilize shared anchors and classification targets. In doing so, the 3D detector is naturally able to perform on par with the performance of its 2D counterpart, from the perspective of reliably classifying objects. Therefore, the remaining challenge is reduced to 3D localization within the camera coordinate space.

Figure 4.1: M3D-RPN uses a single monocular 3D region proposal network with global convolution (orange) and local depth-aware convolution (blue) to predict multi-class 3D bounding boxes.

To address the remaining difficulty, we propose three key designs tailored to improve 3D estimation. Firstly, we formulate 3D anchors to function primarily within the image-space and initialize all anchors with prior statistics for each of their 3D parameters. Hence, each discretized anchor inherently has a strong prior for reasoning in 3D, based on the consistency of a fixed camera viewpoint and the correlation between 2D scale and 3D depth. Secondly, we design a novel depth-aware convolutional layer which is able to learn spatially-aware features. Traditionally, convolutional operations are preferred to be spatially-invariant [79, 89] in order to detect objects at arbitrary image locations. However, while this is likely beneficial for low-level features, we show that high-level features improve when given increased awareness of their depth, while assuming a consistent camera scene geometry.
Lastly, we optimize the orientation estimation θ using a 3D → 2D projection consistency loss within a post-optimization algorithm, which helps correct anomalies within the θ estimation while assuming a reliable 2D bounding box. To summarize, our contributions are the following:

• We formulate a standalone monocular 3D region proposal network (M3D-RPN) with a shared 2D and 3D detection space, while using prior statistics to serve as strong initialization for each 3D parameter.

• We propose depth-aware convolution to improve the 3D parameter estimation, thereby enabling the network to learn more spatially-aware high-level features.

• We propose a simple orientation estimation post-optimization algorithm which uses 3D projections and 2D detections to improve the θ estimation.

• We achieve state-of-the-art performance on the urban KITTI [51] benchmark for monocular Bird's Eye View and 3D Detection using a single multi-class network.

4.2 Related Work

2D Detection: Many works have addressed 2D detection in both generic [78, 93, 107, 122, 130] and urban scenes [10, 11, 13, 15, 108, 109, 131, 175, 189]. Most recent frameworks are based on the seminal work of Faster R-CNN [132] due to the introduction of the region proposal network (RPN) as a highly effective method to efficiently generate object proposals. The RPN functions as a sliding window detector to check for the existence of objects at every spatial location of an image which match with a set of predefined template shapes, referred to as anchors. Although the RPN was conceived to be a preliminary stage within Faster R-CNN, it has often been demonstrated to be effective when extended to a single-shot standalone detector [107, 130, 166, 188]. Our framework builds upon the anchors of a RPN, specially designed to function in both the 2D and 3D spaces, and acting as a single-shot multi-class 3D detector.

Figure 4.2: Comparison of Deep3DBox [118] and Multi-Fusion [172] with M3D-RPN. Notice that prior works are comprised of multiple internal stages (orange) and external networks (blue), whereas M3D-RPN is a single-shot network trained end-to-end.

LiDAR 3D Detection: The use of LiDAR data has proven to be essential input for SOTA frameworks [25, 30, 40, 95, 126, 138, 174] for 3D object detection applied to urban scenes. Leading methods tend to process sparse point clouds from LiDAR points [126, 138, 174] or project the point clouds into sets of 2D planes [25, 30]. While the LiDAR-based methods are generally high performing for a variety of 3D tasks, each is contingent on the availability of depth information generated from the LiDAR points or directly processed through point clouds. Hence, the methods are not applicable to camera-only applications, which is the main purpose of our monocular 3D detection algorithm.

Figure 4.3: Overview of M3D-RPN. The proposed method consists of parallel paths for global (orange) and local (blue) feature extraction. The global features use regular spatially-invariant convolution, while the local features denote depth-aware convolution, as detailed on the right. The depth-aware convolution uses non-shared kernels in the row-space ki for i = 1 . . . b, where b denotes the total number of distinct bins. To leverage both variants of features, we combine each output parameter from the parallel paths using learned weights.
Image-only 3D Detection: 3D detection using only image data is inherently challenging due to an overall lack of reliable depth information. A common theme among SOTA image-based 3D detection methods [3, 23, 24, 118, 172] is to use a series of sub-networks to aid in detection. For instance, [24] uses SOTA depth prediction with stereo processing to estimate point clouds. Then 3D cuboids are exhaustively placed along the ground plane given a known camera projection matrix, and scored based upon the density of the cuboid region within the approximated point cloud. As a follow-up, [23] adjusts the design from stereo to monocular by replacing the point cloud density heuristic with a combination of estimated semantic segmentation, instance segmentation, location, spatial context and shape priors, used while exhaustively classifying proposals on the ground plane.

In recent work, [118] uses an external SOTA object detector to generate 2D proposals, then processes the cropped proposals within a deep neural network to estimate 3D dimensions and orientation. Similar to our work, the relationship between 2D boxes and 3D boxes projected onto the image plane is then exploited in post-processing to solve for the 3D parameters. However, our model directly predicts 3D parameters and thus only optimizes to improve θ, which converges in ∼8 iterations in practice compared with 64 iterations in [118]. Xu et al. [172] utilize an additional network to predict a depth map which is subsequently used to estimate a LiDAR-like point cloud. The point clouds are then sampled using 2D bounding boxes generated from a separate 2D RPN. Lastly, an R-CNN classifier receives an input vector consisting of the sampled point clouds and image features to estimate the 3D box parameters.

In contrast to prior work, we propose a single network trained only with 3D boxes, as opposed to using a set of external networks, data sources, and multiple stages. Each prior work [23, 24, 118, 172] uses external networks for at least one component of its framework, some of which have also been trained on external data. To the best of our knowledge, our method is the first to generate 2D and 3D object proposals simultaneously using a Monocular 3D Region Proposal Network (M3D-RPN). In theory, M3D-RPN is complementary to prior work and may be used to replace the proposal generation stage. A comparison between our method and prior work is further detailed in Fig. 4.2.

4.3 M3D-RPN

Our framework is comprised of three key components. First, we detail the overall formulation of our multi-class 3D region proposal network. We then outline the details of depth-aware convolution and our collective network architecture. Finally, we detail a simple, but effective, post-optimization algorithm for increased 3D→2D consistency. We refer to our method as Monocular 3D Region Proposal Network (M3D-RPN), as illustrated in Fig. 4.3.

Figure 4.4: Anchor Formulation and Visualized 3D Anchors. We depict each parameter within the 2D / 3D anchor formulation (left). We visualize the precomputed 3D priors when 12 anchors are used, after projection in the image view (middle) and Bird's Eye View (right). For visualization purposes only, we span anchors in specific x3D locations which best minimize overlap when viewed.

4.3.1 Formulation

The core foundation of our proposed framework is based upon the principles of the region proposal network (RPN) first proposed in Faster R-CNN [132], tailored for 3D.
From a high level, the region proposal network acts as a sliding window detector which scans every spatial location of an input image for objects matching a set of predefined anchor templates. Then matches are regressed from the discretized anchors into continuous parameters of the estimated object.

Anchor Definition: To simultaneously predict both the 2D and 3D boxes, each anchor template is defined using parameters of both spaces: [w, h]2D, zP, and [w, h, l, θ]3D. For placing an anchor and defining the full 2D / 3D box, a shared center pixel location [x, y]P must be specified. The parameters denoted as 2D are used as provided in pixel coordinates. We encode the depth parameter zP by projecting the 3D center location [x, y, z]3D in camera coordinates into the image, given a known projection matrix P ∈ R3×4, as

\begin{bmatrix} x \cdot z \\ y \cdot z \\ z \end{bmatrix}_P = P \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}_{3D}.    (4.1)

The θ3D represents the observation viewing angle [51]. Compared to the Y-axis rotation in the camera coordinate system, the observation angle accounts for the relative orientation of the object with respect to the camera viewing angle rather than the Bird's Eye View (BEV) of the ground plane. Therefore, the viewing angle is intuitively more meaningful to estimate when dealing with image features. We encode the remaining 3D dimensions [w, h, l]3D as given in the camera coordinate system.

The mean statistic for each zP and [w, h, l, θ]3D is pre-computed for each anchor individually, which acts as a strong prior to ease the difficulty in estimating 3D parameters. Specifically, for each anchor we use the statistics across all matching ground truths which have ≥ 0.5 intersection over union (IoU) with the bounding box of the corresponding [w, h]2D anchor. As a result, the anchors represent discretized templates where the 3D priors can be leveraged as a strong initial guess, thereby assuming a reasonably consistent scene geometry. We visualize the anchor formulation as well as the precomputed 3D priors in Fig. 4.4.

3D Detection: Our model predicts output feature maps per anchor for c, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D. Let us denote na the number of anchors, nc the number of classes, and h × w the feature map resolution. As such, the total number of box outputs is denoted nb = w × h × na, spanned at each pixel location [x, y]P ∈ Rw×h per anchor. The first output c represents the shared classification prediction of size na × nc × h × w, whereas each other output has size na × h × w. The outputs of [tx, ty, tw, th]2D represent the 2D bounding box transformation, which we collectively refer to as b2D. Following [132], the bounding box transformation is applied to an anchor with [w, h]2D as:

x'_{2D} = x_P + t_{x_{2D}} \cdot w_{2D}, \quad y'_{2D} = y_P + t_{y_{2D}} \cdot h_{2D}, \quad w'_{2D} = \exp(t_{w_{2D}}) \cdot w_{2D}, \quad h'_{2D} = \exp(t_{h_{2D}}) \cdot h_{2D},    (4.2)

where xP and yP denote the spatial center location of each box. The transformed box b'2D is thus defined as [x, y, w, h]'2D. The following 7 outputs represent transformations denoting the projected center [tx, ty, tz]P, dimensions [tw, th, tl]3D and orientation tθ3D, which we collectively refer to as b3D. Similar to 2D, the transformation is applied to an anchor with parameters [w, h]2D, zP, and [w, h, l, θ]3D as follows:

x'_P = x_P + t_{x_P} \cdot w_{2D}, \quad y'_P = y_P + t_{y_P} \cdot h_{2D}, \quad z'_P = t_{z_P} + z_P,
w'_{3D} = \exp(t_{w_{3D}}) \cdot w_{3D}, \quad h'_{3D} = \exp(t_{h_{3D}}) \cdot h_{3D}, \quad l'_{3D} = \exp(t_{l_{3D}}) \cdot l_{3D}, \quad \theta'_{3D} = t_{\theta_{3D}} + \theta_{3D}.    (4.3)

Hence, b'3D is then denoted as [x, y, z]'P and [w, h, l, θ]'3D. As described, we estimate the projected 3D center rather than camera coordinates to better cope with the convolutional features based exclusively in the image space. Therefore, during inference we back-project the projected 3D center location from the image space [x, y, z]'P to camera coordinates [x, y, z]'3D by using the inverse of Eqn. 4.1.
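A minimal NumPy sketch of decoding a single matched anchor with Eqns. 4.2–4.3 and back-projecting the 3D center with the padded inverse of Eqn. 4.1 is given below; the dictionary-based anchor layout is purely illustrative.

```python
import numpy as np

def decode_boxes(anchor, t2d, t3d, P):
    """Apply the 2D (Eqn. 4.2) and 3D (Eqn. 4.3) transformations of one anchor, then
    back-project the projected 3D center with the inverse of Eqn. 4.1.
    anchor: dict with x_p, y_p, w2d, h2d, z_p, w3d, h3d, l3d, theta
    t2d:    [tx, ty, tw, th]_2D, t3d: [tx, ty, tz]_P + [tw, th, tl, ttheta]_3D
    P:      3x4 camera projection matrix."""
    x_p, y_p, w2d, h2d = anchor["x_p"], anchor["y_p"], anchor["w2d"], anchor["h2d"]

    # 2D box (Eqn. 4.2)
    box2d = [x_p + t2d[0] * w2d, y_p + t2d[1] * h2d,
             np.exp(t2d[2]) * w2d, np.exp(t2d[3]) * h2d]

    # projected 3D center, dimensions and orientation (Eqn. 4.3)
    u = x_p + t3d[0] * w2d
    v = y_p + t3d[1] * h2d
    z = anchor["z_p"] + t3d[2]
    dims = [np.exp(t3d[3]) * anchor["w3d"], np.exp(t3d[4]) * anchor["h3d"],
            np.exp(t3d[5]) * anchor["l3d"]]
    theta = anchor["theta"] + t3d[6]

    # back-project [u*z, v*z, z] to camera coordinates via the padded inverse of P
    P44 = np.vstack([P, [0, 0, 0, 1]])
    xyz = np.linalg.inv(P44) @ np.array([u * z, v * z, z, 1.0])
    return box2d, xyz[:3], dims, theta
```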
Loss Definition: The network loss of our framework is formed as a multi-task learning problem composed of a classification loss Lc and box regression losses for 2D and 3D, respectively denoted as Lb2D and Lb3D. For each generated box, we check if there exists a ground truth with at least ≥ 0.5 IoU, as in [132]. If yes, then we use the best matched ground truth for each generated box to define a target with class index τ, 2D box b̂2D, and 3D box b̂3D. Otherwise, τ is assigned to the catch-all background class and bounding box regression is ignored. A softmax-based multinomial logistic loss is used to supervise Lc, defined as:

L_c = -\log\left( \frac{\exp(c_\tau)}{\sum_i^{n_c} \exp(c_i)} \right).    (4.4)

We use a negative logistic loss applied to the IoU between the matched ground truth box b̂2D and the transformed b'2D for Lb2D, similar to [160, 179], defined as:

L_{b_{2D}} = -\log\left( \mathrm{IoU}(b'_{2D}, \hat{b}_{2D}) \right).    (4.5)

The remaining 3D bounding box parameters are each optimized using a Smooth L1 [54] regression loss applied to the transformations b3D and the ground truth transformations ĝ3D (generated using b̂3D following the inverse of Eqn. 4.3):

L_{b_{3D}} = \mathrm{SmoothL1}(b_{3D}, \hat{g}_{3D}).    (4.6)

Hence, the overall multi-task network loss L, including regularization weights λ1 and λ2, is denoted as:

L = L_c + \lambda_1 L_{b_{2D}} + \lambda_2 L_{b_{3D}}.    (4.7)
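The sketch below assembles the multi-task loss of Eqns. 4.4–4.7 in PyTorch for boxes that are already matched to ground truths; background handling, anchor matching, and per-class weighting are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, tau, iou_2d, t3d, t3d_gt, lambda1=1.0, lambda2=1.0):
    """Multi-task loss of Eqns. 4.4-4.7 for already-matched foreground boxes.
    cls_logits: [N, n_c] class scores, tau: [N] target class indices (long),
    iou_2d:     [N] IoU between transformed 2D boxes and their matched ground truths,
    t3d/t3d_gt: [N, 7] predicted / target 3D transformations."""
    L_c   = F.cross_entropy(cls_logits, tau)              # Eqn. 4.4 (softmax + log loss)
    L_b2d = -torch.log(iou_2d.clamp(min=1e-6)).mean()     # Eqn. 4.5 (negative log IoU)
    L_b3d = F.smooth_l1_loss(t3d, t3d_gt)                 # Eqn. 4.6
    return L_c + lambda1 * L_b2d + lambda2 * L_b3d        # Eqn. 4.7
```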
4.3.2 Depth-aware Convolution

Spatially-invariant convolution has been a principal operation for deep neural networks in computer vision [79, 89]. We expect that low-level features in the early layers of a network can reasonably be shared and are otherwise invariant to depth or object scale. However, we intuitively expect that high-level features related to 3D scene understanding are dependent on depth when a fixed camera view is assumed. As such, we propose depth-aware convolution as a means to improve the spatial-awareness of high-level features within the region proposal network, as illustrated in Fig. 4.3.

The depth-aware convolution layer can be loosely summarized as regular 2D convolution where a set of discretized depths are able to learn non-shared weights and features. We introduce a hyperparameter b denoting the number of row-wise bins to separate a feature map into, where each learns a unique kernel k. In effect, depth-aware kernels enable the network to develop location-specific features and biases for each bin region, ideally to exploit the geometric consistency of a fixed viewpoint within urban scenes. For instance, high-level semantic features, such as encoding a feature for a large wheel to detect a car, are valuable at close depths but not generally at far depths. Similarly, we intuitively expect features related to 3D scene understanding are inherently related to their row-wise image position. An obvious drawback to using depth-aware convolution is the increase of the memory footprint for a given layer by ×b. However, the total theoretical FLOPS to perform convolution remains consistent regardless of whether kernels are shared. We implement the depth-aware convolution layer in PyTorch [125] by unfolding a layer L into b padded bins, then re-purposing the group convolution operation to perform efficient parallel operations on a GPU.¹

¹ In practice, we observe a 10 − 20% overhead for reshaping when implemented with parallel group convolution in PyTorch [125].

Method | Type | IoU ≥ 0.7 Easy [val1/val2/test] | Mod | Hard | IoU ≥ 0.5 Easy [val1/val2] | Mod | Hard
Mono3D [23] | Mono | 5.22 / − / − | 5.19 / − / − | 4.13 / − / − | 30.50 / − | 22.39 / − | 19.16 / −
3DOP [24] | Stereo | 12.63 / − / − | 9.49 / − / − | 7.59 / − / − | 55.04 / − | 41.25 / − | 34.55 / −
Deep3DBox [118] | Mono | − / 9.99 / − | − / 7.71 / − | − / 5.30 / − | − / 30.02 | − / 23.77 | − / 18.83
Multi-Fusion [172] | Mono | 22.03 / 19.20 / 13.73 | 13.63 / 12.17 / 9.62 | 11.60 / 10.89 / 8.22 | 55.02 / 54.18 | 36.73 / 38.06 | 31.27 / 31.46
M3D-RPN | Mono | 25.94 / 26.86 / 26.43 | 21.18 / 21.15 / 18.36 | 17.90 / 17.14 / 16.24 | 55.37 / 55.87 | 42.49 / 41.36 | 35.29 / 34.08

Table 4.1: Bird's Eye View. Comparison of our method to image-only 3D localization frameworks on the Bird's Eye View task (APBEV).

Method | Type | IoU ≥ 0.7 Easy [val1/val2/test] | Mod | Hard | IoU ≥ 0.5 Easy [val1/val2] | Mod | Hard
Mono3D [23] | Mono | 2.53 / − / − | 2.31 / − / − | 2.31 / − / − | 25.19 / − | 18.20 / − | 15.52 / −
3DOP [24] | Stereo | 6.55 / − / − | 5.07 / − / − | 4.10 / − / − | 46.04 / − | 34.63 / − | 30.09 / −
Deep3DBox [118] | Mono | − / 5.85 / − | − / 4.10 / − | − / 3.84 / − | − / 27.04 | − / 20.55 | − / 15.88
Multi-Fusion [172] | Mono | 10.53 / 7.85 / 7.08 | 5.69 / 5.39 / 5.18 | 5.39 / 4.73 / 4.68 | 47.88 / 45.57 | 29.48 / 30.03 | 26.44 / 23.95
M3D-RPN | Mono | 20.27 / 20.40 / 20.65 | 17.06 / 16.48 / 15.70 | 15.21 / 13.34 / 13.32 | 48.96 / 49.89 | 39.57 / 36.14 | 33.01 / 28.98

Table 4.2: 3D Detection. Comparison of our method to image-only 3D localization frameworks on the 3D Detection task (AP3D).

4.3.3 Network Architecture

The backbone of our network uses DenseNet-121 [67]. We remove the final pooling layer to keep the network stride at 16, then dilate each convolutional layer in the last DenseBlock by a factor of 2 to obtain a greater field-of-view.

We connect two parallel paths at the end of the backbone network. The first path uses regular convolution where kernels are shared spatially, which we refer to as global. The second path exclusively uses depth-aware convolution and is referred to as local. For each path, we append a proposal feature extraction layer using its respective convolution operation to generate Fglobal and Flocal. Each feature extraction layer generates 512 features using a 3 × 3 kernel with 1 padding and is followed by a ReLU non-linear activation. We then connect the 12 outputs to each F, corresponding to c, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D. Each output uses a 1 × 1 kernel, and the outputs are collectively denoted as Oglobal and Olocal. To leverage the depth-aware and spatially-invariant strengths, we fuse each output using a learned attention α (after sigmoid) applied for i = 1 . . . 12 as follows:

O_i = O_i^{global} \cdot \alpha_i + O_i^{local} \cdot (1 - \alpha_i).    (4.8)
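Below is a simplified sketch of the depth-aware (local) convolution described in Sec. 4.3.2, implemented by stacking the b row-wise bins into the channel axis and re-using grouped convolution; the bin-boundary padding and exact reshaping of the thesis implementation may differ.

```python
import torch
import torch.nn as nn

class DepthAwareConv2d(nn.Module):
    """Row-binned ("depth-aware") 3x3 convolution: the feature map is split into b
    horizontal bins and each bin learns its own, non-shared kernel. Bins are stacked
    into the channel axis so a single grouped convolution processes them in parallel."""
    def __init__(self, c_in, c_out, b=4):
        super().__init__()
        self.b = b
        self.conv = nn.Conv2d(c_in * b, c_out * b, kernel_size=3, padding=1, groups=b)

    def forward(self, x):
        n, c, h, w = x.shape
        assert h % self.b == 0, "feature height must be divisible by the bin count"
        rows = h // self.b
        # [n, c, h, w] -> [n, b, c, rows, w] -> [n, b*c, rows, w]; each bin is padded
        # independently by the convolution, matching the "padded bins" description.
        x = x.view(n, c, self.b, rows, w).permute(0, 2, 1, 3, 4).reshape(n, self.b * c, rows, w)
        x = self.conv(x)                       # group g applies the kernel of bin g
        c_out = x.shape[1] // self.b
        x = x.view(n, self.b, c_out, rows, w).permute(0, 2, 1, 3, 4).reshape(n, c_out, h, w)
        return x

x = torch.randn(2, 64, 32, 48)
print(DepthAwareConv2d(64, 128, b=4)(x).shape)   # torch.Size([2, 128, 32, 48])
```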
4.3.4 Post 3D→2D Optimization

We optimize the orientation parameter θ in a simple but effective post-processing algorithm (as detailed in Alg. 1). The proposed optimization algorithm takes as input both the 2D and 3D box estimations b'2D, [x, y, z]'P, [w, h, l, θ]'3D, as well as step size σ, termination β, and decay γ parameters. The algorithm then iteratively steps through θ and compares the projected 3D boxes with b'2D using an L1 loss.

Algorithm 1 Post 3D→2D Algorithm. The algorithm takes as input the 2D / 3D box b'2D, [x, y, z]'P, [w, h, l, θ]'3D, step size σ, termination β, and decay γ parameters, then iteratively tunes θ via an L1 corner consistency loss.

Input: b'2D, [x, y, z]'P, [w, h, l, θ]'3D, σ, β, γ
ρ ← box-project([x, y, z]'P, [w, h, l, θ]'3D)
η ← L1(b'2D, ρ)
while σ ≥ β do
    ρ− ← box-project([x, y, z]'P, [w, h, l, θ − σ]'3D)
    ρ+ ← box-project([x, y, z]'P, [w, h, l, θ + σ]'3D)
    loss− ← L1(b'2D, ρ−)
    loss+ ← L1(b'2D, ρ+)
    if min(loss−, loss+) > η then
        σ ← σ · γ
    else if loss− < loss+ then
        θ ← θ − σ;  η ← loss−
    else
        θ ← θ + σ;  η ← loss+
    end
end

The 3D→2D box-project function is defined as follows:

\Upsilon_0 = \frac{1}{2} \begin{bmatrix} l & l & l & l & -l & -l & -l & -l \\ -h & -h & h & h & -h & -h & h & h \\ -w & w & w & -w & -w & w & w & -w \end{bmatrix},
\Upsilon_{3D} = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \Upsilon_0 + P^{-1} \begin{bmatrix} x \cdot z \\ y \cdot z \\ z \\ 1 \end{bmatrix}_P,
\Upsilon_P = P \cdot \Upsilon_{3D}, \quad \Upsilon_{2D} = \Upsilon_P ./ \Upsilon_P[\phi_z],
x_{min} = \min(\Upsilon_{2D}[\phi_x]), \quad y_{min} = \min(\Upsilon_{2D}[\phi_y]), \quad x_{max} = \max(\Upsilon_{2D}[\phi_x]), \quad y_{max} = \max(\Upsilon_{2D}[\phi_y]),    (4.9)

where P−1 is the inverse projection after padding [0, 0, 0, 1], and φ denotes an index for axis [x, y, z]. We then use the projected box parameterized by ρ = [xmin, ymin, xmax, ymax] and the source b'2D to compute an L1 loss, which acts as the driving heuristic. When there is no improvement to the loss using θ ± σ, we decay the step by γ and repeat while σ ≥ β.

4.3.5 Implementation Details

We implement our framework using PyTorch [125] and release the code at http://cvlab.cse.msu.edu/project-m3d-rpn.html. To prevent local features from overfitting on a subset of the image regions, we initialize the local path with pretrained global weights. In this case, each stage is trained for 50k iterations. We expect higher degrees of data augmentation or an iterative binning schedule, e.g., b = 2^i for i = 0 . . . log2(b_final), could enable more ease of training at the cost of more complex hyperparameters.

We use a learning rate of 0.004 with a poly decay rate using power 0.9, a batch size of 2, and weight decay of 0.9. We set λ1 = λ2 = 1. All images are scaled to a height of 512 pixels. As such, we use b = 32 bins for all depth-aware convolution layers. We use 12 anchor scales ranging from 30 to 400 pixels following the power function of 30 · 1.265^i for i = 0 . . . 11 and aspect ratios of [0.5, 1.0, 1.5] to define a total of 36 anchors for multi-class detection. The 3D anchor priors are learned using these templates with the training dataset, as detailed in Sec. 4.3.1. We apply NMS on the box outputs in the 2D space using an IoU criterion of 0.4 and filter boxes with scores < 0.75. The 3D → 2D optimization uses settings of σ = 0.3π, β = 0.01, and γ = 0.5. Lastly, we perform random mirroring and online hard-negative mining by sampling the top 20% high loss boxes in each minibatch.

We note that M3D-RPN relies on 3D box annotations and a known projection matrix P per sequence. For extension to a dataset without these known, it may be necessary to predict the camera intrinsics and utilize weak supervision leveraging 3D-2D projection geometry as loss constraints.
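The 2D anchor templates described above can be generated as in the short sketch below; it assumes the scale denotes the anchor height in pixels and that the aspect ratio is width over height, which are interpretation choices rather than details stated explicitly here.

```python
import numpy as np

# 12 scales following 30 * 1.265**i for i = 0..11 (roughly 30 to 400 pixels),
# combined with aspect ratios [0.5, 1.0, 1.5] -> 36 anchor templates.
scales = 30.0 * 1.265 ** np.arange(12)     # assumed to be anchor heights in pixels
aspects = np.array([0.5, 1.0, 1.5])        # assumed width / height ratios

templates = np.array([(h * a, h) for h in scales for a in aspects])  # (width, height)
print(templates.shape)                     # (36, 2)
print(templates[0], templates[-1])         # smallest and largest templates
```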
4.4 Experiments

We evaluate our proposed framework on the challenging KITTI [51] dataset under two core 3D localization tasks: Bird's Eye View (BEV) and 3D Object Detection. We comprehensively compare our method on the official test dataset as well as two validation splits [24, 169], and perform analysis of the critical components which comprise our framework. We further visualize qualitative examples of M3D-RPN on multi-class 3D object detection in diverse scenes (Fig. 4.5).

                         | Car | Pedestrian | Cyclist
APBEV [val1 / val2 / test] | 21.18 / 21.15 / 18.36 | 11.60 / 11.44 / 11.35 | 10.13 / 9.09 / 1.29
AP3D [val1 / val2 / test]  | 17.06 / 16.48 / 15.70 | 11.28 / 11.30 / 10.54 | 10.01 / 9.09 / 1.03

Table 4.3: Multi-class 3D Localization. The performance of our method when applied as a multi-class 3D detection system using a single shared model. We evaluate using the mod setting on KITTI.

4.4.1 KITTI

The KITTI [51] dataset provides many widely used benchmarks for vision problems related to self-driving cars. Among them, the Bird's Eye View (BEV) and 3D Object Detection tasks are the most relevant to evaluate 3D localization performance. The official dataset consists of 7,481 training images and 7,518 testing images with 2D and 3D annotations for car, pedestrian, and cyclist. For each task we report the Average Precision (AP) under 3 difficulty settings: easy, moderate and hard, as detailed in [51]. Methods are further evaluated using different IoU criteria per class. We emphasize our results on the official settings of IoU ≥ 0.7 for cars and IoU ≥ 0.5 for pedestrians and cyclists.

We conduct experiments on three common data splits including val1 [24], val2 [169], and the official test split [51]. Each split contains data from non-overlapping sequences such that no data from an evaluated frame, or its neighbors, have been used for training. We focus our comparison on SOTA prior work which uses image-only input. We primarily compare our methods using the car class, as has been the focus of prior work [23, 24, 118, 172]. However, we emphasize that our models are trained as a shared multi-class detection system and therefore also report the multi-class capability for monocular 3D detection, as detailed in Tab. 4.3.

Bird's Eye View: The Bird's Eye View task aims to perform object detection from the overhead viewpoint of the ground plane. Hence, all 3D boxes are first projected onto the ground plane, then top-down 2D detection is applied. We evaluate M3D-RPN on each split as detailed in Tab. 4.1. M3D-RPN achieves a notable improvement over SOTA image-only detectors across all data splits and protocol settings. For instance, under the criteria of IoU ≥ 0.7 with val1, our method achieves 21.18% (↑ 7.55%) on moderate, and 17.90% (↑ 6.30%) on hard. We further emphasize our performance on test which achieves 18.36% (↑ 8.74%) and 16.24% (↑ 8.02%) respectively on the moderate and hard settings with IoU ≥ 0.7, which is the most challenging setting.

Method | Easy [val1 / test] | Mod [val1 / test] | Hard [val1 / test]
Mono3D [23] | 93.89 / 92.33 | 88.67 / 88.66 | 79.68 / 78.96
3DOP [24] | 93.08 / 93.04 | 88.07 / 88.64 | 79.39 / 79.10
Deep3DBox [118] | − / 92.98 | − / 89.04 | − / 77.17
Multi-Fusion [172] | − / 90.43 | − / 87.33 | − / 76.78
M3D-RPN | 90.24 / 84.34 | 83.67 / 83.78 | 67.69 / 67.85

Table 4.4: 2D Detection. The performance of our method evaluated on 2D detection using the car class on the val1 and test datasets.

3D Object Detection: The 3D object detection task aims to perform object detection directly in the camera coordinate system.
Therefore, an additional dimension is introduced to all IoU computations, which substantially increases the localization difficulty compared to the BEV task. We evaluate our method on 3D detection with each split under all commonly studied protocols as described in Tab. 4.2. Our method achieves a significant gain over state-of-the-art image-only methods throughout each protocol and split.

We emphasize that the current most difficult challenge to evaluate 3D localization is the 3D object detection task. Similarly, the moderate and hard settings with IoU ≥ 0.7 are the most difficult protocols to evaluate with. Using these settings with val1, our method notably achieves 17.06% (↑ 11.37%) and 15.21% (↑ 9.82%) respectively. We further observe similar gains on the other splits. For instance, when evaluated using the testing dataset, we achieve 15.70% (↑ 10.52%) and 13.32% (↑ 8.64%) on the moderate and hard settings despite being trained as a shared multi-class model and compared to single-model methods [23, 24, 118, 172]. When evaluated with less strict criteria such as IoU ≥ 0.5, our method demonstrates smaller but reasonable margins (∼3 − 6%), implying that M3D-RPN has similar recall to prior art but significantly higher precision overall.

Multi-Class 3D Detection: To demonstrate generalization beyond a single class, we evaluate our proposed 3D detection framework on the car, pedestrian and cyclist classes. We conduct experiments on both the Bird's Eye View and 3D Detection tasks using the KITTI test dataset, as detailed in Tab. 4.3. Although there are no monocular 3D detection methods to compare with for multi-class, it is noteworthy that the performance on pedestrian outperforms prior work performance on car, which usually has the opposite relationship, thereby suggesting a reasonable performance. However, M3D-RPN is noticeably less stable for cyclists, suggesting a need for advanced sampling or data augmentation to overcome the data bias towards car and pedestrian.

2D Detection: We evaluate our performance on 2D car detection (detailed in Tab. 4.4). We note that M3D-RPN performs worse than other 3D detection systems applied to the 2D task. However, we emphasize that prior work [23, 24, 118, 172] uses external networks, data sources, and multiple stages (e.g., Fast [54], Faster R-CNN [132]). In contrast, M3D-RPN performs all tasks simultaneously using only a single-shot 3D proposal network. Hence, the focus of our work is primarily to improve 3D detection proposals with an emphasis on the quality of 3D localization. Although M3D-RPN does not compete directly with SOTA methods for 2D detection, its performance is suitable to facilitate the tasks in focus such as BEV and 3D detection.

b  | Post-Optim | AP2D  | AP3D  | APBEV | RT (ms)
−  |            | 82.16 | 10.99 | 12.99 | 118
−  | ✓          | 82.16 | 15.08 | 17.47 | 128
1  | ✓          | 82.88 | 12.87 | 17.91 | 133
4  | ✓          | 84.15 | 14.46 | 19.14 | 134
8  | ✓          | 83.86 | 16.04 | 20.99 | 143
16 | ✓          | 83.02 | 15.97 | 18.48 | 153
32 | ✓          | 83.67 | 17.06 | 21.18 | 161

Table 4.5: Ablations. We ablate the effects of b for depth-aware convolution and the post-optimization 3D→2D algorithm with respect to performance on the moderate setting of cars and runtime (RT).

  | c  | x2D | y2D | w2D | h2D | xP | yP | zP | w3D | h3D | l3D | θ3D
% | 33 | 48  | 45  | 44  | 45  | 44 | 42 | 38 | 43  | 38  | 47  | 45

Table 4.6: Local and Global α weights. We detail the α weights learned to individually fuse each global and local output. Lower implies higher weight towards the local depth-aware convolution.
4.4.2 Ablations

For all ablations and experimental analysis we use the KITTI val1 dataset split and evaluate using the car class. Further, we use the moderate setting of each task, which includes 2D detection, 3D detection, and BEV (Tab. 4.5).

Depth-aware Convolution: We propose depth-aware convolution as a method to improve the spatial-awareness of high-level features. To better understand the effect of depth-aware convolution, we ablate it from the perspective of the hyperparameter b which denotes the number of discrete bins. Since our framework uses an image scale of 512 pixels with a network stride of 16, the output feature map can naturally be separated into 512/16 = 32 bins. We therefore ablate using bins of [4, 8, 16, 32] as described in Tab. 4.5. We additionally ablate the special case of b = 1, which is equivalent to utilizing two global streams. We observe that both b = 1 and b = 4 result in generally worse performance than the baseline without local features, suggesting that arbitrarily adding deeper layers is not inherently helpful for 3D localization. However, we observe consistent improvements when b = 32 is used, achieving a large gain of 3.71% in APBEV, 1.98% in AP3D, and 1.51% in AP2D.

Figure 4.5: Qualitative Examples. We visualize qualitative examples of our method for multi-class 3D object detection. We use yellow to denote cars, green for pedestrians, and orange for cyclists. All illustrated images are from the val1 [24] split and not used for training.

We break down the learned α weights after sigmoid which are used to fuse the global and local outputs (Tab. 4.6). Lower values favor the local branch and vice-versa for the global. Interestingly, the classification c output learns the highest bias toward local features, suggesting that semantic features in urban scenes have a moderate reliance on depth position.

Post 3D→2D Optimization: The post-optimization algorithm encourages consistency between 3D boxes projected into the image space and the predicted 2D boxes. We ablate the effectiveness of this optimization as detailed in Tab. 4.5. We observe that the post-optimization has a significant impact on both BEV and 3D detection performance. Specifically, we observe performance gains of 4.48% in APBEV and 4.09% in AP3D. We additionally observe that the algorithm converges in approximately 8 iterations on average and adds a minor 13 ms overhead (per image) to the runtime.

Efficiency: We emphasize that our approach uses only a single network for inference and hence involves overall more direct 3D predictions than the multiple networks and stages (RPN with R-CNN) used in prior works [23, 24, 118, 172]. We note that a direct efficiency comparison is difficult due to a lack of reporting in prior work. However, we comprehensively report the efficiency of M3D-RPN for each ablation experiment, where b and post-optimization are the critical factors, as detailed in Tab. 4.5. The runtime efficiency is computed using an NVIDIA 1080 Ti GPU averaged across the KITTI val1 dataset. We note that depth-aware convolution incurs a 2 − 20% overhead cost for b = 1 . . . 32, caused by unfolding and reshaping in PyTorch [125].

4.5 Summary

In this work, we present a reformulation of monocular image-only 3D object detection using a single-shot 3D RPN, in contrast to prior work which is comprised of external networks and data sources and involves multiple stages.
M3D-RPN is uniquely designed with shared 2D and 3D anchors which leverage strong priors closely linked to the correlation between 2D scale and 3D depth. To help improve 3D parameter estimation, we further propose depth-aware convolution layers which enable the network to develop spatially-aware features. Collectively, we are able to significantly improve the performance on the challenging KITTI dataset on both the Bird's Eye View and 3D object detection tasks for the car, pedestrian, and cyclist classes.

Acknowledgment: Research was partially sponsored by the Army Research Office under Grant Number W911NF-18-1-0330. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Chapter 5

Kinematic 3D Object Detection in Monocular Video

Although temporal motion is an invaluable resource to human vision for detection, tracking, and depth perception, such features have not been thoroughly utilized in modern 3D object detectors. In this work, we propose a novel method for monocular video-based 3D object detection which leverages kinematic motion to extract scene dynamics and improve localization accuracy. We first propose a novel decomposition of object orientation and a self-balancing 3D confidence. We show that both components are critical to enable our kinematic model to work effectively. Collectively, using only a single model, we efficiently leverage 3D kinematics from monocular videos to improve the overall localization precision in 3D object detection while also producing useful by-products of scene dynamics (ego-motion and per-object velocity). We achieve state-of-the-art performance on the monocular 3D object detection and Bird's Eye View tasks within the KITTI self-driving dataset.

Figure 5.1: Single-frame 3D detection [9] often has unstable estimation through time (a), while our video-based method (b) is more robust by leveraging kinematic motion via a 3D Kalman Filter to fuse forecasted tracks τ't and measurements b into τt.

5.1 Introduction

The detection of foreground objects is among the most critical requirements to facilitate self-driving applications [11, 39]. Recently, 3D object detection has made significant progress [26, 88, 94, 95, 139, 177], even while using only a monocular camera [9, 81, 90, 103, 113, 114, 140, 162]. Such works primarily look at the problem from the perspective of single frames, ignoring useful temporal cues and constraints.

Computer vision cherishes inverse problems, e.g., recovering the 3D physical motion of objects from monocular videos. Motion information such as object velocity in the metric space is highly desirable for the path planning of self-driving. However, single image-based 3D object detection cannot directly estimate physical motion without relying on additional tracking modules. Therefore, video-based 3D object detection would be a sensible choice to recover such motion information. Furthermore, without modeling the physical motion, image-based 3D object detectors are naturally more likely to suffer from erratic and unnatural changes through time in orientation and localization (as exemplified in Fig. 5.1(a)).
Therefore, we aim to build a novel video-based 3D object detector which is able to provide accurate and smooth 3D object detection with per-object velocity, while also prioritizing a compact and efficient model overall.

Yet, designing an effective video-based 3D object detector has challenges. Firstly, motion which occurs in real-world scenes can come from a variety of sources, such as the camera atop an autonomous vehicle or robot, and/or from the scene objects themselves, for which most of the safety-critical objects (car, pedestrian, cyclist [51]) are typically dynamic. Moreover, using video inherently involves an increase in data consumption, which introduces practical challenges for training and/or inference, including memory and redundant processing.

To address such challenges, we propose a novel framework to integrate a 3D Kalman filter into a 3D detection system. We find the Kalman filter is an ideal candidate for three critical reasons: (1) it allows for the use of real-world motion models to serve as a strong prior on object dynamics, (2) it is inherently efficient due to its recursive nature and general absence of parameters, (3) the resultant behavior is explainable and provides useful by-products such as the object velocity.

Furthermore, we observe that objects predominantly move in the direction indicated by their orientation. Fortunately, the flexibility of the Kalman filter allows us to integrate this real-world constraint into the motion model as a compact scalar velocity. Such a constraint helps maintain the consistency of velocity over time and enables the Kalman motion forecasting and fusion to perform accurately. However, a model restricted to only move in the direction of its orientation has an obvious flaw: what if the orientation itself is inaccurate? We therefore propose a novel reformulation of orientation in favor of accuracy and stability. We find that our orientation improves the 3D localization accuracy by a margin of 2.39% and reduces the orientation error by ≈ 20%, which collectively help enable the proposed Kalman filter to function more effectively.

A notorious challenge of using a Kalman filter comes in the form of uncertainty, which is conventionally [164] assumed to be known and static, e.g., from a sensor. However, 3D objects in video are intuitively dependent on more complex factors of image features and cannot necessarily be treated like a sensor measurement. For a better understanding of 3D uncertainty, we propose a 3D self-balancing confidence loss. We show that our proposed confidence has higher correlation with the 3D localization performance compared to the typical classification probability, which is commonly used in detection [130, 132]. To complete the full understanding of the scene motion, we elect to estimate the ego-motion of the capturing camera itself. Hence, we further narrow the role of the Kalman filter to account for only the object's motion. Collectively, our proposed framework is able to model important scene dynamics, both ego-motion and per-object velocity, and more precisely detect 3D objects in videos using a stabilized orientation and 3D confidence estimation. We demonstrate that our method achieves state-of-the-art (SOTA) performance on the monocular 3D Object Detection and Bird's Eye View (BEV) tasks in the KITTI dataset [51]. In summary, our contributions are as follows:

• We propose a monocular video-based 3D object detector, leveraging realistic motion constraints with an integrated ego-motion and a 3D Kalman filter.
• We propose to reformulate orientation into axis, heading and offset, along with a self-balancing 3D localization loss, to facilitate the stability necessary for the proposed Kalman filter to perform more effectively.

• Overall, using only a single model, our framework develops a comprehensive 3D scene understanding including object cuboids, orientation, velocity, object motion, uncertainty, and ego-motion, as detailed in Fig. 5.1 and 5.2.

• We achieve a new SOTA performance on the monocular 3D object detection and BEV tasks using comprehensive metrics within the KITTI dataset.

5.2 Related Work

We first provide context for our novelties from the perspective of monocular 3D object detection (Sec. 5.2.1) with attention to orientation and uncertainty estimation. We next discuss and contrast with video-based object detection (Sec. 5.2.2).

5.2.1 Monocular 3D Object Detection

Monocular 3D object detection has made significant progress [9, 23, 24, 81, 90, 103, 113, 114, 118, 140, 161, 172]. Early methods such as [24] began by generating 3D object proposals along a ground plane using object priors and estimated point clouds, culminating in an energy minimization approach. [23, 81, 172] utilize additional domains of semantic segmentation, object priors, and estimated depth to improve the localization. Similarly, [113, 162] create a pseudo-LiDAR map using a SOTA depth estimator [20, 46, 49], which is respectively passed into detection subnetworks or LiDAR-based 3D object detection works [80, 126, 127]. In [90, 103, 114, 118, 140], strong 2D detection systems are extended to add cues such as object orientation, then the remaining 3D box parameters are solved via 3D box geometry. [9] extends the region proposal network (RPN) of Faster R-CNN [132] with 3D box parameters.

5.2.1.1 Orientation Estimation:

Prior monocular 3D object detectors estimate orientation via two main techniques. The first method is to classify orientation via a series of discrete bins and then regress a relative offset [23, 24, 81, 90, 103, 114, 118, 172]. The bin technique requires a trade-off between the quantity/coverage of the discretized angles and an increase in the number of estimated parameters (proportional to the number of bins). Other methods directly regress the orientation angle using quaternion [140] or Euler [9, 113] angles. Direct regression is comparatively efficient, but may lead to degraded performance and periodicity challenges [190], as exemplified in Fig. 5.1.

In contrast, we propose a novel orientation decomposition which serves as an intuitive compromise between the bin and direct approaches. We decompose the orientation estimation into three components: axis and heading classification, followed by an angle offset. Thus, our technique increases the parameters by a static factor of 2 rather than by a bin hyperparameter, while drastically reducing the offset search space for each orientation estimation (discussed in Sec. 5.3.1).

5.2.1.2 Uncertainty Estimation:

Although it is common to utilize the classification score to rate boxes in 2D object detection [18, 130, 132, 175] or to explicitly model uncertainty as parametric estimation [84], prior works in monocular 3D object detection realize the need for a 3D box uncertainty/confidence [103, 140]. [103] defines confidence using the 3D IoU of a box and its ground truth after center alignment, thus capturing primarily the confidence of the 3D object dimensions.
[140] predicts a confidence by re-mapping the 3D box loss into a probability range, which intuitively represents the confidence of the overall 3D box accuracy. In contrast, our self-balancing confidence loss is generic and self-supervised, with two benefits. (1) It enables estimation of a 3D localization confidence using only the loss values, thus being more general than 3D IoU. (2) It enables the network to naturally re-balance extremely hard 3D boxes and focus on relatively achievable samples. Our ablation (Sec. 5.4.4) shows the importance of both effects.

5.2.2 Video-based Object Detection

Video-based object detection [6, 104, 171, 191, 192] is generally less studied than single-frame object detection [18, 130–132, 175, 177]. A common trend in video-based detection is to benefit the accuracy-efficiency trade-off by reducing the frame redundancy [6, 104, 171, 191, 192]. Such works are applied primarily to domains of ImageNet VID 2015 [135], which contain less ego-motion from the capturing camera than self-driving scenarios [31, 51]. As such, the methods are designed to use 2D transformations, which lack the consistency and realism of 3D motion modeling.

In comparison, to our knowledge this is the first work that utilizes video cues to improve the accuracy and robustness of monocular 3D object detection. In the domain of 2D/3D object tracking, [52] experiments with Kalman Filters, Particle Filters, and Gaussian Mixture Models, and observes Kalman to be the most effective aggregation method for tracking. An LSTM with depth ordering and IMU camera ego-motion is utilized in [66] to improve the tracking accuracy. In contrast, we explore how to naturally and effectively leverage a 3D Kalman filter to improve the accuracy and robustness of monocular 3D object detection. We propose novel enhancements including estimating ego-motion, orientation, and a 3D confidence, while efficiently using only a single model.

5.3 Methodology

Our proposed kinematic framework is composed of three primary components: a 3D region proposal network (RPN), ego-motion estimation, and a novel kinematic model to take advantage of temporal motion in videos. We first overview the foundations of a 3D RPN. Then we detail our contributions of orientation decomposition and self-balancing 3D confidence, which are integral to the kinematic method. Next we detail ego-motion estimation. Lastly, we present the complete kinematic framework (Fig. 5.2) which carefully employs a 3D Kalman filter [164] to model realistic motion using the aforementioned components, ultimately producing a more accurate and comprehensive 3D scene understanding.

Figure 5.2: Overview. Our framework uses a RPN to first estimate 3D boxes (Sec. 5.3.1). We forecast previous frame tracks τt−1 into τ't using the estimated Kalman velocity. Self-motion is compensated for by applying a global ego-motion (Sec. 5.3.2) to the tracks τ't. Lastly, we fuse τ't with measurements b using a kinematic 3D Kalman filter (Sec. 5.3.3).

5.3.1 Region Proposal Network

Our measurement model is founded on the 3D RPN [9], enhanced using novel orientation and confidence estimations. The RPN itself acts as a sliding window detector following the typical practices outlined in Faster R-CNN [132] and [9]. Specifically, the RPN consists of a backbone network and a detection head which predicts 3D box outputs relative to a set of predefined anchors.
5.3.1.1 Anchors:

We define our 2D-3D anchor Φ to consist of 2D dimensions [Φw, Φh]2D in pixels, a projected depth-buffer Φz in meters, 3D dimensions [Φw, Φh, Φl]3D in meters, and orientations with respect to two major axes [Φ0, Φ1]3D in radians. The term Φz is related to the camera coordinates [x, y, z]3D by the equation Φz · [u, v, 1]ᵀ_2D = Υ · [x, y, z, 1]ᵀ_3D, where Υ ∈ R3×4 is a known projection matrix. We compute the anchor values by taking the mean of each parameter after clustering all ground truth objects in 2D, following the process in [9].

5.3.1.2 3D Box Outputs:

Since our network is based on the principles of a RPN [9, 132], most of the estimations are defined as a transformation T relative to an anchor. Let us define na as the number of anchors, nc as the number of object classes, and w × h as the output resolution of the network. The RPN outputs a classification map C ∈ R(na·nc)×w×h, then 2D transformations [Tx, Ty, Tw, Th]2D, 3D transformations [Tu, Tv, Tz, Tw, Th, Tl, Tθr]3D, axis and heading [Θa, Θh], and lastly a 3D self-balancing confidence Ω. Each output has a size of Rna×w×h. The outputs can be unrolled into nb = (na · w · h) boxes with (nc + 14)-dim, with parameters of c, [tx, ty, tw, th]2D, [tu, tv, tz, tw, th, tl, tθr]3D, [θa, θh], and ω, which relate to the maps by c ∈ C, t2D ∈ T2D, t3D ∈ T3D, θ ∈ Θ and ω ∈ Ω.

The regression targets for the 2D ground truths (GTs) [x̂, ŷ, ŵ, ĥ]2D are defined as:

\hat{t}_{x_{2D}} = \frac{\hat{x}_{2D} - i}{\Phi_{w_{2D}}}, \quad \hat{t}_{y_{2D}} = \frac{\hat{y}_{2D} - j}{\Phi_{h_{2D}}}, \quad \hat{t}_{w_{2D}} = \log\frac{\hat{w}_{2D}}{\Phi_{w_{2D}}}, \quad \hat{t}_{h_{2D}} = \log\frac{\hat{h}_{2D}}{\Phi_{h_{2D}}},    (5.1)

where (i, j) ∈ Rw×h represent the pixel coordinates of the corresponding box. Similarly, following the equation ẑ · [û, v̂, 1]ᵀ_2D = Υ · [x, y, z, 1]ᵀ_3D, the regression targets for the projected 3D center GTs are defined as:

\hat{t}_u = \frac{\hat{u} - i}{\Phi_{w_{2D}}}, \quad \hat{t}_v = \frac{\hat{v} - j}{\Phi_{h_{2D}}}, \quad \hat{t}_z = \hat{z} - \Phi_z.    (5.2)

Lastly, the regression targets for the 3D dimension GTs [ŵ, ĥ, l̂]3D are defined as:

\hat{t}_{w_{3D}} = \log\frac{\hat{w}_{3D}}{\Phi_{w_{3D}}}, \quad \hat{t}_{h_{3D}} = \log\frac{\hat{h}_{3D}}{\Phi_{h_{3D}}}, \quad \hat{t}_{l_{3D}} = \log\frac{\hat{l}_{3D}}{\Phi_{l_{3D}}}.    (5.3)

The remaining targets for our novel orientation estimation tθr, [θa, θh], and the 3D self-balancing confidence ω are defined in subsequent sections.

Figure 5.3: Orientation. Our proposed orientation formulation decomposes an object orientation θ̂ (a) into an axis classification θ̂a (b), a heading classification θ̂h (c), and an offset θ̂r (d). Our method disentangles the objectives of axis and heading classification while greatly reducing the offset region (red) by a factor of 1/4.

5.3.1.3 Orientation Estimation:

We propose a novel object orientation formulation, with a decomposition into three components: axis, heading, and offset (Fig. 5.3). Intuitively, the axis estimation θa represents the probability that an object is oriented towards the vertical axis (θa = 0) or the horizontal axis (θa = 1), with its label formally defined as: θ̂a = |sin θ̂| < |cos θ̂|, where θ̂ is the ground truth object orientation in radians from a bird's eye view (BEV) with a [−π, π) bounded range. We then compute an orientation θ̂r with a restricted range relative to its axis, e.g., [−π, 0) when θ̂a = 0, and [−π/2, π/2) when θ̂a = 1. We start with θ̂r = θ̂, then add or subtract π from θ̂r until the desired range is satisfied. Intuitively, θ̂r loses its heading since the true rotation may be {θ̂r, θ̂r ± π}. We therefore preserve the heading using a separate θ̂h, which represents the probability of θ̂r being rotated by π, with its GT target defined as:
Lastly, we encode the orientation offset transformation, which is relative to the corresponding anchor, axis, and restricted orientation θ̂r, as t̂θr = θ̂r − Φθa. The reverse decomposition is θ = Φθa + ⌊θh⌉ · π + tθr, where ⌊·⌉ denotes rounding.

In designing our orientation, we first observed that the visual difference between objects at opposite headings of [θ, θ ± π] is low, especially for far objects. In contrast, classifying the axis of an object is intuitively more clear since the visual features correlate with the aspect ratio. Note that [θa, θh, θr] disentangle these two objectives. Hence, while the axis is being determined, the heading classifier can focus on subtle clues such as windshields, headlights and shape.

We further note that our 2 binary classifications have the same representational power as 4 bins following [81, 90, 103, 114], specifically bins of [0, π/2, π, 3π/2]. However, it is more common to use considerably more bins (such as 12 in [81]). An important distinction is that bin-based approaches require the network to decide axis and heading simultaneously, whereas our method disentangles the orientation into two distinct and explainable objectives. We provide ablations to compare our decomposition and the bin method using [2, 4, 10] bins in Sec. 5.4.4.

5.3.1.4 Self-Balancing Loss:

The novel 3D localization confidence ω follows a self-balancing formulation closely coupled to the network loss. We first define the 2D and 3D loss terms which comprise the general RPN loss. We unroll and match all nb box estimations to their respective ground truths. A box is matched as foreground when a sufficient (≥ k) 2D intersection over union (IoU) is met, otherwise it is considered background (ĉ = 0) and all loss terms except for classification are ignored. The 2D box loss is thus defined as:

  L2D = −log(IoU(b2D, b̂2D)) · [ĉ ≠ 0] + CE(c, ĉ),  (5.5)

where CE denotes a softmax activation followed by logistic cross-entropy loss over the ground truth class ĉ, and IoU uses the predicted b2D and ground truth b̂2D. Similarly, the 3D localization loss for only foreground (ĉ ≠ 0) is defined as:

  L3D = L1(t3D, t̂3D) + λa · BCE([θa, θh], [θ̂a, θ̂h]),  (5.6)

where BCE denotes a sigmoid activation followed by binary cross-entropy loss. Next we define the final self-balancing confidence loss with the ω estimation as:

  L = L2D + ω · L3D + λL · (1 − ω),  (5.7)

where λL is the rolling mean of the nL most recent L3D losses per mini-batch. Since ω is predicted per-box via a sigmoid, the network can intuitively balance whether to use the loss of L3D or incur a proportional penalty of λL · (1 − ω). Hence, when the confidence is high (ω ≈ 1) we infer that the network is confident in its 3D loss L3D. Conversely, when the confidence is low (ω ≈ 0), the network is uncertain in L3D, and thus incurring a flat penalty is preferred. At inference, we fuse the self-balancing confidence with the classification score as µ = c · ω.

The proposed self-balancing loss has two key benefits. Firstly, it produces a useful 3D localization confidence with an inherent correlation to 3D IoU (Sec. 5.4.4). Secondly, it enables the network to re-balance samples which are exceedingly challenging and re-focus on the more reasonable targets. Such a characteristic can be seen as the inverse of hard-negative mining, which is important while monocular 3D object detection remains highly difficult and unsaturated (Sec. 5.4.1).
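A minimal PyTorch sketch of the self-balancing term in Eq. 5.7 is given below; the tensor names, the deque-based rolling mean, and the averaging of per-box losses are assumptions made for illustration, not the thesis implementation.

```python
from collections import deque
import torch

class SelfBalancingLoss:
    """Sketch of L = L2D + w * L3D + lambda_L * (1 - w) with a rolling-mean lambda_L."""
    def __init__(self, n_recent=100):
        self.recent_l3d = deque(maxlen=n_recent)  # rolling buffer of recent L3D values

    def __call__(self, loss_2d, loss_3d_per_box, omega):
        # loss_2d: scalar; loss_3d_per_box, omega: (num_fg_boxes,) tensors, omega in (0, 1).
        self.recent_l3d.append(loss_3d_per_box.detach().mean().item())
        lam_l = sum(self.recent_l3d) / len(self.recent_l3d)  # rolling mean of recent L3D

        # Confident boxes (omega ~ 1) keep their 3D loss; uncertain ones pay a flat penalty.
        balanced = (omega * loss_3d_per_box).mean() + lam_l * (1 - omega).mean()
        return loss_2d + balanced

if __name__ == "__main__":
    crit = SelfBalancingLoss()
    l2d = torch.tensor(0.7)
    l3d = torch.rand(8) * 2.0
    omega = torch.sigmoid(torch.randn(8, requires_grad=True))
    total = crit(l2d, l3d, omega)
    total.backward()
    print(float(total))
```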
5.3.2 Ego-motion

A challenge with the dynamics of urban scenes is that not only are most foreground objects in motion, but the capturing camera itself is dynamic. Therefore, for a full understanding of the scene dynamics, we design our model to additionally predict the self-movement of the capturing camera, i.e., ego-motion. We define ego-motion in the conventional six degrees of freedom: translation [γx, γy, γz] in meters and rotation [ρx, ρy, ρz] in radians.

We attach an ego-motion feature layer E with ReLU to the concatenation of two temporally adjacent feature maps χ, defined as χt−1 ∥ χt, where χ is the final layer of our backbone architecture with size R^{nh×w×h}. We then attach predictions for translation [Γx, Γy, Γz] and rotation [Px, Py, Pz], which are of size R^{w×h}. Instead of using global pooling, we predict a spatial confidence map Ec ∈ R^{w×h} based on E. We then apply softmax over the spatial dimension of Ec such that Σ Ec = 1. Hence, the pooling of the prediction maps [Γ, P] into [γ, ρ] is defined as:

  γ = Σ_{(i,j)} Γ(i, j) · Ec(i, j),   ρ = Σ_{(i,j)} P(i, j) · Ec(i, j),  (5.8)

where (i, j) is the coordinate in R^{w×h}. We show an overview of the motion features E, spatial confidence Ec, and outputs [Γ, P] → [γ, ρ] within Fig. 5.2. We use an L1 loss against GTs γ̂ and ρ̂, defined as Lego = L1(γ, γ̂) + λr · L1(ρ, ρ̂).

5.3.3 Kinematics

In order to leverage temporal motion in video, we elect to integrate our RPN and ego-motion into a novel kinematic model. We adopt a 3D Kalman filter [164] due to its notable efficiency, effectiveness, and interpretability. We next detail our proposed motion model, and the procedures for forecasting, association, and update.

5.3.3.1 Motion Model:

The critical variables we opt to track are defined as the 3D center [τx, τy, τz], 3D dimensions [τw, τh, τl], orientation [τθ, τθh], and scalar velocity τv. We define τθ as an orientation constrained to the range [−π/2, π/2), and τθh as the heading θh of Sec. 5.3.1.¹ We constrain the motion model to only allow objects to move in the direction of their orientation. Hence, we define the state transition F ∈ R^{9×9} as:

  F = [ I_{9×8} | f ],   f = [cos(τθ + π⌊τθh⌉), 0, −sin(τθ + π⌊τθh⌉), 0, 0, 0, 0, 0, 1]^T,  (5.9)

where I denotes the identity matrix and the state variable order is [τx, τy, τz, τw, τh, τl, τθ, τθh, τv]. We constrain the velocity to only move along the object's orientation to simplify the Kalman filter and help it work more effectively. Recall that since our measurement model RPN processes a single frame, it does not measure velocity. Thus, to map between the tracked state space and the measurement space, we also define an observation model H as a truncated identity map of size R^{8×9}. We define the covariance P with the 3D confidence µ as P = I_{9×9} · (1 − µ) · λo, where λo is an uncertainty weighting factor. Hence, we avoid the need to manually tune the covariance, while being dynamic to diverse and changing image content.

¹We do not use the axis θa of Sec. 5.3.1, since we expect the orientation to change smoothly and do not wish the orientation to be relative to a (potentially) changing axis.

5.3.3.2 Forecasting:

The forecasting step aims to utilize the tracked state variables and covariances of time t − 1 to estimate the state at a future time t. The equation to forecast a state variable τt−1 into τ′t is τ′t = Ft−1 · τt−1, where Ft−1 is the state transition model at t − 1.
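To make Eq. 5.9 concrete, here is a small NumPy sketch of building the constrained state transition and forecasting a track one frame ahead; the state ordering follows the text, while the helper names and example numbers are illustrative assumptions.

```python
import numpy as np

def state_transition(tau):
    """Build F in R^{9x9} for state [x, y, z, w, h, l, theta, theta_h, v] (Eq. 5.9 sketch)."""
    theta, theta_h = tau[6], tau[7]
    heading = theta + np.pi * np.round(theta_h)  # resolve the heading via round(theta_h)

    F = np.eye(9)
    # Velocity only moves the 3D center along the object's heading direction.
    F[0, 8] = np.cos(heading)    # x += v * cos(heading)
    F[2, 8] = -np.sin(heading)   # z -= v * sin(heading)
    return F

def forecast(tau, P, mu):
    """One Kalman forecast step: tau' = F tau, P' = F P F^T + I * (1 - mu)."""
    F = state_transition(tau)
    tau_f = F @ tau
    P_f = F @ P @ F.T + np.eye(9) * (1.0 - mu)
    return tau_f, P_f

if __name__ == "__main__":
    tau = np.array([4.0, 1.5, 20.0, 1.6, 1.5, 3.9, 0.3, 0.0, 0.8])  # example track state
    P = np.eye(9) * 0.2
    tau_f, P_f = forecast(tau, P, mu=0.7)
    print(np.round(tau_f, 3))
```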
Note that both objects and the capturing camera may have independent motion between consecutive frames. Therefore, we lastly apply the estimated ego-motion to the 3D centers [τx, τy, τz] of all available tracks by:

  [τx, τy, τz, 1]′^T_t = [R, T; 0, 1]^t_{t−1} · [τx, τy, τz, 1]′^T_t,   τ′tθ = τ′tθ + ρy,  (5.10)

where R^t_{t−1} ∈ R^{3×3} denotes the estimated rotation matrix converted from Euler angles and T^t_{t−1} ∈ R^{3×1} the translation vector for ego-motion (as in Sec. 5.3.2). Finally, we forecast a tracked object's covariance P from t − 1 to t, defined as:

  P′t = Ft−1 · Pt−1 · F^T_{t−1} + I_{9×9} · (1 − µ̄t−1),  (5.11)

where µ̄t−1 denotes the average self-balancing confidence µ over a track's life. Hence, the resultant track states τ′t and track covariances P′t represent the Kalman filter's best forecasted estimation with respect to frame t.

5.3.3.3 Association:

After the tracks have been forecasted from t − 1 to t, the next step is to associate tracks to the corresponding 3D box measurements (Sec. 5.3.1). Let us denote boxes produced by the measurement RPN as b ∈ R^8, mimicking the tracked state as [bx, by, bz, bw, bh, bl, bθ, bθh].² Our matching strategy consists of two steps. We first compute the 3D center distance between the tracks τ′t and measurements b. The best matches with the lowest distance are iteratively paired and removed until no pairs remain with distance ≤ kd. Then we compute the projected 2D box IoU between any remaining tracks τ′t and measurements b. The best matches with the highest IoU are also iteratively paired and removed until no pairs remain with IoU ≥ ku. Measured boxes that were not matched are added as new tracks. Conversely, tracks that were not matched incur a penalty with hyperparameter kp, defined as µt−1 = µt−1 · kp. Lastly, any track whose confidence µt−1 ≤ km is removed from the valid tracks.

²We apply the estimated transformations of Sec. 5.3.1 to their respective anchors with the equations of [9], and back-project into 3D coordinates to match the track variables.

5.3.3.4 Update:

After making associations between tracks τ′t and measurements b, the next step is to utilize the track covariance P′t and measured confidence µ to update each track to its final state τt and covariance Pt. Firstly, we formally define the equation for computing the Kalman gain as:

  K = P′ H^T (H P′ H^T + I_{8×8} · (1 − µ) · λo)^{−1},  (5.12)

where I_{8×8} · (1 − µ) · λo represents the incoming measurement covariance matrix, and P′ the forecasted covariance of the track. Next, given the Kalman gain K, forecasted state τ′t, forecasted covariance P′t, and measured box b, the final track state τt and covariance Pt are defined as:

  τt = τ′t + K (b − H τ′t),   Pt = (I_{9×9} − K H) P′t.  (5.13)

We lastly aggregate each track's overall confidence µt over time as a running average µt = ½ · (µt−1 + µ), where µ is the measured confidence.

5.3.4 Implementation Details

Our framework is implemented in PyTorch [125], with the 3D RPN settings of [9]. We release source code at http://cvlab.cse.msu.edu/project-kinematic.html. We use a batch size of 2 and a learning rate of 0.004. We set k = 0.5, λo = 0.2, λr = 40, nL = 100, λa = ku = 0.35, kd = 0.5, kp = 0.75, and km = 0.05. To ease training, we implement three phases. We first train the 2D-3D RPN with L = L2D + L3D, then with the self-balancing loss of Eq. 5.7, for 80k and 50k iterations respectively. We freeze the RPN to train ego-motion using Lego for 80k iterations. Our backbone is DenseNet121 [67], where nh = 1,024.
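To summarize the fusion stage of Sec. 5.3.3.4 in code, below is a compact NumPy sketch of the update step in Eqs. 5.12-5.13 together with the running confidence average; the function signature and the way the measurement covariance is formed from µ follow the text, but everything else, including variable names and example values, is an illustrative assumption.

```python
import numpy as np

def kalman_update(tau_f, P_f, b, mu, lambda_o=0.2):
    """Fuse a forecasted track (tau_f, P_f) with a measured box b of confidence mu.

    State: [x, y, z, w, h, l, theta, theta_h, v]; the measurement b lacks velocity.
    Sketch of Eqs. 5.12-5.13, not the exact thesis implementation.
    """
    H = np.eye(8, 9)                       # truncated identity observation model
    R = np.eye(8) * (1.0 - mu) * lambda_o  # confidence-driven measurement covariance

    S = H @ P_f @ H.T + R                  # innovation covariance
    K = P_f @ H.T @ np.linalg.inv(S)       # Kalman gain (Eq. 5.12)

    tau = tau_f + K @ (b - H @ tau_f)      # state update (Eq. 5.13)
    P = (np.eye(9) - K @ H) @ P_f          # covariance update (Eq. 5.13)
    return tau, P

if __name__ == "__main__":
    tau_f = np.array([4.1, 1.5, 20.8, 1.6, 1.5, 3.9, 0.30, 0.0, 0.8])
    P_f = np.eye(9) * 0.3
    b = np.array([4.3, 1.5, 21.2, 1.6, 1.5, 3.9, 0.32, 0.0])
    tau, P = kalman_update(tau_f, P_f, b, mu=0.8)
    mu_t = 0.5 * (0.7 + 0.8)               # running average: mu_t = 0.5 * (mu_{t-1} + mu)
    print(np.round(tau, 3), mu_t)
```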
Inference uses 4 frames as provided by [51].

5.4 Experiments

We benchmark our kinematic framework on the KITTI [51] dataset. We comprehensively evaluate on the 3D Object Detection and Bird's Eye View (BEV) tasks. We then provide ablation experiments to better understand the effects and justification of our core methodology. We show qualitative examples in Fig. 5.6.

5.4.1 KITTI Dataset

The KITTI [51] dataset is a popular benchmark for self-driving tasks. The official dataset consists of 7,481 training and 7,518 testing images including annotations for 2D/3D objects, ego-motion, and 4 temporally adjacent frames. We evaluate on the most widely used validation split as proposed in [24], which consists of 3,712 training and 3,769 validation images. We focus primarily on the car class.

5.4.1.1 Metric:

Average precision (AP) is utilized for object detection in KITTI. Following [140], the KITTI metric has been updated to use 40 (up from 11) recall points while skipping the first. The AP40 metric is more stable and fair overall [140]. Due to the official adoption of AP40, it is not possible to compute AP11 on test. Hence, we elect to use the AP40 metric for all reported experiments.

Method         | AP3D (IoU ≥ 0.7) Easy / Mod / Hard | APBEV (IoU ≥ 0.7) Easy / Mod / Hard | s/im*
FQNet [103]    | 2.77 / 1.51 / 1.01                 | 5.40 / 3.23 / 2.46                  | 0.50†
ROI-10D [114]  | 4.32 / 2.02 / 1.46                 | 9.78 / 4.91 / 3.74                  | 0.20
GS3D [90]      | 4.47 / 2.90 / 2.47                 | 8.41 / 6.08 / 4.94                  | 2.00†
MonoPSR [81]   | 10.76 / 7.25 / 5.85                | 18.33 / 12.58 / 9.91                | 0.20
MonoDIS [140]  | 10.37 / 7.94 / 6.40                | 17.23 / 13.19 / 11.12               | −
M3D-RPN [9]    | 14.76 / 9.71 / 7.42                | 21.02 / 13.67 / 10.23               | 0.16
AM3D [113]     | 16.50 / 10.74 / 9.52               | 25.03 / 17.32 / 14.91               | 0.40
Ours           | 19.07 / 12.72 / 9.17               | 26.69 / 17.52 / 13.10               | 0.12

Table 5.1: KITTI Test. We compare with SOTA methods on the KITTI test dataset. We report performances using the AP40 [140] metric available on the official leaderboard. * the runtime is reported from the official leaderboard with slight variances in hardware. We indicate methods reported on CPU with †. Bold/italics indicate best/second AP.

Method         | AP3D (IoU ≥ 0.7/0.5) Easy / Mod / Hard                | APBEV (IoU ≥ 0.7/0.5) Easy / Mod / Hard
MonoDIS [140]  | 11.06/− , 7.60/− , 6.37/−                             | 18.45/− , 12.58/− , 10.66/−
M3D-RPN [9]    | 14.53/48.56 , 11.07/35.94 , 8.65/28.59                | 20.85/53.35 , 15.62/39.60 , 11.88/31.77
Ours           | 19.76/55.44 , 14.10/39.47 , 10.47/31.26               | 27.83/61.79 , 19.72/44.68 , 15.10/34.56

Table 5.2: KITTI Validation. We compare with SOTA on the KITTI validation [24] split. Note that methods published prior to [140] are unable to report the AP40 metric.

5.4.2 3D Object Detection

We evaluate our proposed framework on the task of 3D object detection, which requires objects to be localized in 3D camera coordinates as well as supplying the 3D dimensions and BEV orientation relative to the XZ plane. Due to the strict requirements of IoU in three dimensions, the task demands precise localization of an object to be considered a match (3D IoU ≥ 0.7). We evaluate our performance on the official test [51] dataset in Tab. 5.1 and the validation [24] split in Tab. 5.2.

We emphasize that our method improves the SOTA on KITTI test by a significant margin of ↑ 1.98% compared to [113] on the moderate configuration with IoU ≥ 0.7, which is the most common metric used for comparison. Further, we note that [113] requires multiple encoder-decoder networks, which add overhead compared to our single-network approach. Hence, their runtime is ≈ 3× ours (Tab. 5.1), self-reported on similar but not identical GPU hardware. Moreover, [9] is the most comparable method to ours as both utilize a single network and an RPN archetype.
We note that our method significantly outperforms [9] and many other recent works [81, 90, 103, 114, 140] by ≈ 3.01 − 11.21%. We further evaluate our approach on the KITTI validation [24] split using AP40 for the available approaches and observe similar overall trends, as shown in Tab. 5.2. For instance, compared to competitive approaches [9, 140] our method improves the performance by ↑ 3.03% for the challenging IoU criterion of ≥ 0.7. Similarly, our performance on the more relaxed criterion of IoU ≥ 0.5 increases by ↑ 3.53%. We additionally visualize detailed performance characteristics on AP3D at discrete depths of [15, 30, All] meters and IoU matching criteria 0.3 → 0.7 in Fig. 5.4.

5.4.3 Bird's Eye View

The Bird's Eye View (BEV) task is similar to 3D object detection, differing primarily in that the 3D boxes are first projected into the XZ plane and then 2D object detection is calculated. The projection collapses the Y-axis degree of freedom and intuitively results in a less precise but reasonable localization. We note that our method achieves SOTA performance on the BEV task regarding the moderate setting of the KITTI test dataset, as detailed in Tab. 5.1. Our method performs favorably compared with SOTA works [9, 81, 90, 103, 114, 140] (e.g., ≈ 3.85 − 14.29%), and similarly to [113] at a notably lower runtime cost. We suspect that our method, especially the self-balancing confidence (Eq. 5.7), prioritizes precise localization, which warrants more benefit in the full 3D object detection task compared to the Bird's Eye View task.

Our method performs similarly on the validation [24] split of KITTI (Tab. 5.2). Specifically, compared to [9, 140] our proposed method outperforms by a range of ≈ 4.10 − 7.14%, which is consistent with the same methods on test (≈ 3.85 − 4.33%).

Method            | AP3D (IoU ≥ 0.7/0.5) Easy / Mod / Hard      | APBEV (IoU ≥ 0.7/0.5) Easy / Mod / Hard
Baseline          | 13.81/47.10 , 9.71/34.14 , 7.44/26.90       | 20.08/52.57 , 13.98/38.45 , 11.10/29.88
+ θ decomposition | 16.66/51.47 , 12.10/38.58 , 9.40/30.98      | 23.15/56.48 , 17.43/42.53 , 13.48/34.37
+ self-confidence | 16.64/52.18 , 12.77/38.99 , 9.60/31.42      | 24.22/58.52 , 18.02/42.95 , 13.92/34.80
+ µ = c · ω       | 18.28/54.70 , 13.55/39.33 , 10.13/31.25     | 25.72/60.87 , 18.82/44.36 , 14.48/34.48
+ kinematics      | 19.76/55.44 , 14.10/39.47 , 10.47/31.26     | 27.83/61.79 , 19.72/44.68 , 15.10/34.56

Table 5.3: Ablation Experiments. We conduct a series of ablation experiments with the validation [24] split of KITTI, using diverse IoU matching criteria of ≥ 0.7/0.5.

5.4.4 Ablation Study

To better understand the characteristics of our proposed kinematic framework, we perform a series of ablation experiments and analysis, summarized in Tab. 5.3. We adopt [9] without hill-climbing or depth-aware layers as our baseline method. Unless otherwise specified we use the experimental settings outlined in Sec. 5.3.4.

5.4.4.1 Orientation Improvement:

The orientation of objects is intuitively a critical component when modeling motion. When the orientation is decomposed into axis, heading, and offset, the overall performance significantly improves, e.g., by ↑ 2.39% in AP3D and ↑ 3.45% in APBEV, as detailed within Tab. 5.3. We compute the mean angle error of our baseline, orientation decomposition, and kinematics methods, which respectively achieve 13.4°, 10.9°, and 6.1° (↓ 54.48%), suggesting our proposed methodology is significantly more stable.

We compare our orientation decomposition to bin-based methods following the general idea of [81, 90, 103, 114].
We specifically change our orientation definition into [θb, θo] which includes a bin 96 classification and an offset. We experiment with the number of bins set to [2, 4, 10] which are uniformly spread from [0, 2π). Note that 4 bins have the same representational power as using binary [θa, θh]. We observe that the ablated bin-based methods achieve [9.47%, 10.02%, 10.76%] in AP3D. In comparison, our decomposed orientation achieves 12.10% in AP3D. We provide additional detailed experiments in our supplemental material. Further, we find that our proposed kinematic motion model (as in Sec. 5.3.3) degrades in per- formance when a comparatively erratic baseline (Row 1. Tab. 5.3) orientation is utilized instead (14.10 → 11.47 on AP3D), reaffirming the importance of having a consistent/stable orientation when aggregating through time. 5.4.4.2 Self-balancing Confidence: We observe that the self-balancing confidence is important from two key respects. Firstly, its integration in Eq. 5.7 enables the network to re-weight box samples to focus more on reasonable samples and incur a flat penalty (e.g., λL · (1 − ω) of Eq. 5.7) on the difficult samples. In a sense, the self-balancing confidence loss is the inverse of hard-negative mining, allowing the network to focus on reasonable estimations. Hence, the loss on its own improves performance for AP3D by ↑ 0.67% and APBEV by ↑ 0.59%. The second benefit of self-balancing confidence is that by design ω has an inherent correlation with the 3D object detection performance. Recall that we fuse ω with the classification score c to produce a final box rating of µ = c · ω, which results in an additional gain of ↑ 0.78% in AP3D and ↑ 0.80% in APBEV. We further analyze the correlation of µ with 3D IoU, as is summarized in Fig. 5.5. The correlation coefficient with the classification score c is significantly lower than the correlation using µ instead (0.301 vs. 0.417). In summary, the use of the Eq. 5.7 and µ account for a gain of ↑ 1.45% in AP3D and ↑ 1.39% in APBEV. 97 Figure 5.4: We compare AP3D with [9] by varying 3D IoU criteria and depth. Figure 5.5: We show the correlation of 3D IoU to classification c and 3D confidence µ. 5.4.4.3 Temporal Modeling: The use of video and kinematics is a significant motivating factor for this work. We find that the use of kinematics (detailed in Sec. 5.3.3) results in a gain of ↑ 0.55% in AP3D and ↑ 0.90% in APBEV, as shown in Tab. 5.3. We emphasize that although the improvement is less dramatic versus orientation and self-confidence, the former are important to facilitate temporal modeling. We find that if orientation decomposition and uncertainty are removed, by using the baseline orientation and setting µ to be a static constant, then the kinematic performance drastically reduces from 14.10% → 10.64% in AP3D. We emphasize that kinematic framework not only helps 3D object detection, but also naturally produces useful by-products such as velocity and ego-motion. Thus, we evaluate the respective average errors of each motion after applying the camera capture rate to convert the motion into miles per hour (MPH). We find that the per-object velocity and ego-motion speed errors perform reasonably at 7.036 MPH and 6.482 MPH respectively. We depict visual examples of all dynamic by-products in Fig. 5.6 and additionally in supplemental video. 98 Figure 5.6: Qualitative Examples. We depict the image view (left) and BEV (right). 
We show the velocity vector in green, speed and ego-motion in miles per hour (MPH) on top of detection boxes and at the top-left corner, and tracks as dots in BEV.

5.5 Summary

We present a novel kinematic 3D object detection framework which is able to efficiently leverage temporal cues and constraints to improve 3D object detection. Our method naturally provides useful by-products regarding scene dynamics, e.g., reasonably accurate ego-motion and per-object velocity. We further propose novel designs of orientation estimation and a self-balancing 3D confidence loss in order to enable the proposed kinematic model to work effectively. We emphasize that our framework efficiently uses only a single network to comprehensively understand a highly dynamic 3D scene for urban autonomous driving. Moreover, we demonstrate our method's effectiveness through detailed experiments on the KITTI [51] dataset across the 3D object detection and BEV tasks.

Acknowledgments: Research was partially sponsored by the Army Research Office under Grant Number W911NF-18-1-0330. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work is further partly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans).

5.6 Supplementary Material

5.6.1 Orientation Ablations

We provide detailed experiments on the 3D object detection and Bird's Eye View tasks to compare our orientation decomposition with bin-based approaches such as [81, 90, 103, 114] within Tab. 5.4. Recall that bin-based orientation first classifies the best bin for the orientation, then predicts an offset with respect to the bin. In contrast, our method disentangles the bin classification into distinct, explainable objectives: an axis classification and a heading classification. For such experiments we change our formulation to use bins of [2, 4, 10], where 4 bins has a similar representational power as two binary classifications [θa, θh]. The bins are spread uniformly over [0, 2π] and an offset is predicted afterwards. We otherwise use the settings of Sec. 5.3.4. We emphasize that our method outperforms the bin-based approaches by ≈ 1.36 − 2.63% on AP3D and ≈ 2.06 − 2.71% on APBEV using the standard moderate setting and IoU ≥ 0.7.

Method      | AP3D (IoU ≥ 0.7/0.5) Easy / Mod / Hard    | APBEV (IoU ≥ 0.7/0.5) Easy / Mod / Hard
2 bins      | 12.83/46.46 , 9.47/33.78 , 7.93/26.85     | 19.17/52.06 , 14.72/37.54 , 11.38/31.16
4 bins      | 12.65/44.01 , 10.02/33.27 , 7.87/26.27    | 19.09/49.86 , 14.55/37.90 , 11.14/30.44
10 bins     | 14.27/49.71 , 10.74/36.12 , 8.29/28.62    | 21.12/54.70 , 15.37/39.72 , 11.60/31.75
Our decomp. | 16.66/51.47 , 12.10/38.58 , 9.40/30.98    | 23.15/56.48 , 17.43/42.53 , 13.48/34.37

Table 5.4: Orientation. We compare our orientation decomposition to bin-based orientation following the high-level concepts within [81, 90, 103, 114], using AP3D and APBEV. We evaluate our performances on the KITTI validation set [24] using IoU ≥ 0.7/0.5.

5.6.2 Kalman Forecasting

Since our method uses ego-motion and a 3D Kalman filter to aggregate temporal information, the approach can be modified to act as a box forecaster.
Although our method was not strictly designed for the tracking and forecasting task, we evaluate the 3D object detection and Bird's Eye View performance after forecasting nf = [1, 2, 3, 4] frames into the future. We assume a static ego-motion for unknown frames and otherwise use the Kalman equations described in Sec. 5.3.3 to forecast the tracked boxes. For all forecasting experiments we process 4 temporally adjacent frames before forecasting. Since KITTI only provides a current frame and 3 preceding frames, we carefully map images back to the raw dataset in order to forecast. For instance, when nf = 2 we infer using frames [−5, −4, −3, −2], then forecast the ego-motion and Kalman filter nf times. We then evaluate with respect to frame 0, which is the standard timestamp KITTI provides images and 3D labels for. We provide detailed performances on AP3D in Tab. 5.5 and APBEV in Tab. 5.6. We find that the forecasting performance degrades through time but performs reasonably 1 − 2 frames ahead, being competitive in magnitude to state-of-the-art methods on the KITTI test dataset as reported in Tab. 5.1. For instance, forecasting 1 and 2 frames results in 10.64% and 5.10% AP3D respectively, which are competitive to methods [9, 81, 90, 103, 113, 114, 140] on the test dataset.

AP3D (IoU ≥ 0.7/0.5/0.3) | Easy                   | Mod                    | Hard
Forecast → 4             | 1.16 / 18.47 / 47.26   | 0.84 / 11.21 / 29.22   | 0.62 / 8.97 / 23.40
Forecast → 3             | 3.72 / 28.97 / 58.46   | 2.32 / 18.05 / 37.82   | 1.75 / 13.88 / 29.80
Forecast → 2             | 7.84 / 39.40 / 68.87   | 5.10 / 25.48 / 48.30   | 4.14 / 20.20 / 37.84
Forecast → 1             | 16.09 / 49.66 / 75.88  | 10.64 / 34.18 / 55.26  | 8.14 / 26.62 / 44.01
No Forecast              | 19.76 / 55.44 / 79.81  | 14.10 / 39.47 / 60.57  | 10.47 / 31.26 / 48.95

Table 5.5: Forecasting - 3D Object Detection. We evaluate our forecasting performance on AP3D within the KITTI validation [24] set and using IoU ≥ 0.7/0.5/0.3.

APBEV (IoU ≥ 0.7/0.5/0.3) | Easy                   | Mod                    | Hard
Forecast → 4              | 5.48 / 29.40 / 54.52   | 3.54 / 18.13 / 36.13   | 2.90 / 14.71 / 28.49
Forecast → 3              | 11.03 / 39.08 / 64.87  | 6.89 / 24.01 / 43.52   | 5.67 / 18.85 / 34.91
Forecast → 2              | 17.02 / 47.07 / 72.33  | 10.76 / 31.62 / 51.67  | 8.37 / 25.47 / 40.79
Forecast → 1              | 23.58 / 55.99 / 77.48  | 15.79 / 39.33 / 58.05  | 12.54 / 31.22 / 46.59
No Forecast               | 27.83 / 61.79 / 81.20  | 19.72 / 44.68 / 63.44  | 15.10 / 34.56 / 49.84

Table 5.6: Forecasting - Bird's Eye View. We evaluate our forecasting performance on APBEV within the KITTI validation [24] set and using IoU ≥ 0.7/0.5/0.3.

5.6.3 Qualitative Video

We further provide a qualitative demonstration video at http://cvlab.cse.msu.edu/project-kinematic.html. The video demonstrates our framework's ability to determine a full scene understanding including 3D object cuboids, per-object velocity and ego-motion. We compare to a related monocular work, M3D-RPN [9], and plot ground truths, the image view, the Bird's Eye View, and the track history.

Chapter 6

3D Mesh Discovery in the Wild

Predicting 3D object shapes from a single image is a seminal task in computer vision with applications in embodied AI and content creation. Prior work either experiments on synthetic benchmarks, where 3D annotations are available, or avoids 3D supervision by carefully tuning for specific object categories with the help of 2D pose annotations and expert-designed 3D priors. While groundbreaking, these methods cannot scale to many object categories and complex datasets. In this work, we propose a general end-to-end approach to discover 3D shapes from real-world scenes.
Our approach, 3DMD, is trained on a collection of unassociated images using only 2D masks as supervision. At test time, 3DMD infers the 3D object shape from a single image in view coordinates. By avoiding object-specific priors and external 3D knowledge, 3DMD scales to 10× more categories than prior work. We test our approach on challenging real-world datasets, such as Pascal3D+ and CUB-200, where we outperform methods which use more supervision or hand-crafted priors. Most importantly, we show results on COCO, where we scale to dozens of object categories.

Figure 6.1: Our approach 3D Mesh Discovery (3DMD) takes a collection of images as input and jointly learns a shared 3D mesh template and refined meshes over a diverse set of object categories.

6.1 Introduction

Inferring 3D object shapes from monocular inputs is a seminal goal of visual intelligence, with real-world applications in robotics [86, 136, 176], content creation [7, 8, 34, 165], and self-driving cars [81, 113]. Equivalent 2D tasks such as instance [63, 64, 77] and semantic segmentation [111] have seen rapid progress due to the success of supervised learning [65, 79, 141] and the advent of large-scale annotated datasets [101]. In the case of 3D reconstruction, supervised methods [56, 87, 168] have shown promising results but cannot scale to hundreds of object categories due to the challenges in collecting 3D shape annotations at scale.

Without 3D supervision, 2D cues must be exploited to predict 3D shapes. Reasoning about 3D from 2D is well studied in computer vision, commonly with multi-view inputs or in predicting depth from motion [1, 33, 57, 182] or disparity [4, 145, 153, 163]. Some pioneering recent efforts [58, 72, 83, 121] have predicted 3D shapes from monocular inputs, training on image collections. Without 3D supervision or multi-view inputs, these methods rely heavily on object category templates, 2D keypoints and hand-crafted object priors. This external category-driven knowledge is another form of supervision which limits scalability.

In this work we propose an end-to-end framework for 3D mesh discovery from a single image, which we call 3DMD. Our approach does not require 3D supervision, multi-view inputs, hand-designed category templates or 2D pose annotations, and thus scales to many object categories and in-the-wild settings. We build on two critical observations: First, instances of the same object category have similar shapes. We relate object instances by associating them to a learned canonical template shape. Template learning is aided by selecting an image to serve as the reference sample for the category. We predict per-instance meshes by posing and deforming the learned template. While prior work uses expert-designed templates [58, 72], our template learning enables better scalability to diverse categories.

Second, an object's posed 3D shape and 2D appearance are consistent. Hence, our predicted 3D shape is taught to agree with the 2D appearance with the help of differentiable mesh rendering [22, 35, 74, 105, 112, 129]. 2D silhouettes are commonly used to encourage such agreement [58, 72, 83]. However, 2D silhouettes alone are not discriminative enough to ground the object pose, e.g., a front and a back facing car have similar 2D masks. Therefore, we partition the 3D shape into clusters and then force a greater consistency in a self-supervised manner.
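Since differentiable silhouette rendering against 2D masks underpins this agreement, the snippet below sketches one way to render a mesh silhouette with PyTorch3D and penalize its mismatch with a ground-truth mask via a soft negative IoU. The 64×64 resolution and 18 faces per pixel follow the training details given later; the camera placement, the placeholder mask, and the exact loss form are illustrative assumptions rather than the training setup itself.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.utils import ico_sphere
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRasterizer,
    MeshRenderer, SoftSilhouetteShader, look_at_view_transform,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy learnable "template": an ico-sphere whose vertices require gradients.
sphere = ico_sphere(level=3, device=device)
verts = sphere.verts_packed().clone().requires_grad_(True)
mesh = Meshes(verts=[verts], faces=[sphere.faces_packed()])

# One hypothetical camera pose looking at the object.
R, T = look_at_view_transform(dist=2.7, elev=10.0, azim=30.0, device=device)
cameras = FoVPerspectiveCameras(R=R, T=T, device=device)

renderer = MeshRenderer(
    rasterizer=MeshRasterizer(
        cameras=cameras,
        raster_settings=RasterizationSettings(image_size=64, faces_per_pixel=18),
    ),
    shader=SoftSilhouetteShader(),
)

def neg_soft_iou(pred, target, eps=1e-6):
    # Differentiable negative IoU between a soft silhouette and a binary mask.
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return -inter / (union + eps)

silhouette = renderer(mesh)[..., 3]             # alpha channel holds the soft silhouette
gt_mask = (silhouette.detach() > 0.5).float()   # placeholder mask, purely for illustration
loss = neg_soft_iou(silhouette, gt_mask)
loss.backward()                                 # gradients reach the template vertices
print(float(loss), verts.grad.abs().sum().item())
```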
Our approach, 3DMD, consists of three components: a template learner, a camera multi- plexer [58, 83] and a mesh refinement branch. The template learner estimates a shared 3D shape for a category from a collection of images, using the reference sample for a canonical representa- tion. The camera multiplexer predicts the pose of the object in the input image, where the identity pose is defined by the reference sample. Unlike [58], our camera multiplexer spreads over the unit sphere and remains active during training and inference. Lastly, the mesh refinement branch deforms the transformed 3D shape to match the appearance of the object in the input image, ad- dressing the intra-class variability. The result is a 3D object mesh in view coordinates predicted from a monocular image. An overview of our multiplexer, refined meshes, and learned templates is shown in Fig. 6.1. We test our method on three real-world datasets: Pascal3D+ [170], CUB-200 [158] and COCO [101]. We benchmark our approach on Pascal3D+, which contains ground truth mesh models, and demonstrate competitive 3D reconstruction from previous state of the art despite using less 3D priors or supervision. In addition, incorporating priors used by prior work into our method further improves performance. We expand our results to tens of categories, an important advancement from prior works. We show exciting quantitative and qualitative results for challenging in the wild objects within COCO [101] such as umbrellas, giraffes, elephants, skateboards and many more. 106 6.2 Related Work Our method predicts a 3D mesh in view coordinates from a single input image, only using 2D masks for supervision. We provide background on techniques for image-based supervised and weakly supervised 3D shape reconstruction. Supervised 3D Reconstruction Supervised methods can utilize a variety of 3D shape representa- tions including: implicit functions [22,102,116,124,142], which require post-processing to extract a 3D triangle mesh, 3D point clouds [45,97], voxels occupancy grids [29,167,168], and 3D meshes deformed from a template sphere [133,143,144,159]. Gkioxari et al. [56] utilize a combination of voxelization and 3D mesh deformation. The majority of these methods show results on synthetic datasets, such as ShapeNet [19], or small scale real-world benchmarks, such as Pix3D [149], for which 3D supervision is available in the form of CAD models. In this work, we focus on real-world large-scale image datasets, such as COCO [101] and Pascal3D+ [170]. Our approach uses 3D triangle meshes deformed relative to learned 3D template shapes. We find the triangle mesh structure to be efficient and effective at modeling shared category-specific features via a global template, and instance-level features using graph convolution [56, 159]. Weakly Supervised 3D Reconstruction Understanding 3D objects using only images and weak 2D supervision [58, 72, 73, 92, 156] is a highly ill-posed problem. 2D supervision commonly takes the form of an object segmentation mask or 2D keypoints. However, ambiguities arise when learning 3D shape from 2D supervision alone. To overcome these ambiguities, methods introduce other forms of supervision and priors to constrain the solution space. Some techniques [129, 155, 156, 173] require multiple views of the same instance of an object. This limits the scope to pairs of images and prevents in the wild scenarios. Others [72, 73] use 2D keypoints to predict the pose. 
However, keypoints are non-trivial to annotate, and are not 107 Figure 6.2: Framework overview of 3DMD. Our method learns a shared 3D mesh template T starting with a unit sphere and canonical to a reference sample. Then a camera multiplexer deter- mines the best pose relative to network predictions of object partitions and mask. Finally, the best pose is used to refine the template mesh to match the image inputs (left) producing instance-level mesh shapes (right). applicable to a large number of object categories. Goel et al. [58] eliminate the need for keypoints by using a camera-multiplex to make pose estimations over a constrained prior distribution. Our approach also leverages a multiplexer. In contrast to [58], ours is free of hand-crafted priors and uses the same camera distribution for all categories or datasets. Moreover, ours functions in an online fashion rather than being memorized independently per training sample in [58]. Moreover, [58, 72, 82] leverage expert 3D template meshes to initialize the shape prediction. The templates are most commonly designed by humans, and capture critical priors such as object structure, relative dimensions, as well as giving meaning to the canonical pose. Although powerful, we consider external templates to be strong 3D supervision which in turn limits scalability to many categories. Recent work by Li et al. [92] utilize a part segmentation network [68] to help constrain pose. In contrast to [92], we propose an end-to-end framework which does not require pre-training a separate part segmentation network. Our method uses a single model to jointly learn a shared 3D template, instance specific object pose, and refined shape. Concurrent work [178] use a voxel and mesh approach built on HoloGAN [120] to reconstruct objects with only masks. We demonstrate our generalization on many categories from Pascal3D+ [170], CUB-200 [158], 108 GraphConv .Camera Multiplexer + RendererImage InstanceImage Collection + ReferenceSampleInstance Shapes + ReferenceTemplateCanonical TemplatePosed TemplateRefined MeshMaskLearnedduring trainingCluster Map and COCO [101]. Compared to prior works [58, 72, 73, 92, 156], we evaluate thoroughly on all 12 Pascal3D+ in the wild categories. 6.3 Method Our goal is to reconstruct 3D objects from a collection of 2D images without 3D supervision. For each object category, we jointly learn a shared template shape and instance-level mesh refinement to match the unique appearance of the input image. At train time, only 2D instance-level masks are used. At test time, our model takes a single RGB image as input and reconstructs the object’s 3D shape in view coordinates. Our approach, 3DMD, scales to many object categories in real-world complex scenes with diverse lighting and pose. Fig. 6.2 shows an overview of our framework. We first detail our template learner powered by a uniform camera multiplexer, which comprise two enabling components of our approach. Our camera multiplexer covers the full range of poses, is free of manually set object priors and is driven by a self-supervised cluster consistency loss to regularize the object pose. We lastly explain our instance mesh refinement branch which refines the posed template shape. 6.3.1 Template Learner An important component of our approach is learning a shared 3D template shape, whose structure closely represents the training object shapes. The template learning process is end-to-end, and leverages differentiable rendering and a uniform camera multiplexer. 
The 3D template is optimized to fit the full dataset and thus is expected to resemble a category-specific mean shape, in contrast to [58, 72, 82] which rely on expert-designed 3D models. Hence, the template learner enables scaling to many more categories, without the need for expensive 3D prior knowledge.

Template Definition Let us define a deformable template mesh as T = (V, F), where V denotes a set of learnable vertices and F a set of faces, fixed to follow a sphere topology. We initialize T as a unit sphere and constrain vertex movement via an adaptive hyperparameter λr.

Camera Multiplexer We aim to learn a 3D template T to fit all training instances of a category. Given an image, we want to estimate a pose for the template that best explains the image. Rather than predicting pose directly, we follow [58, 83] and use a camera multiplexer which renders the template from a set of hypothetical poses and selects the pose that minimizes 2D re-projection error. We use a set of N shared cameras Π = {π1, . . . , πN} which are distributed uniformly around the template mesh. We constrain the poses in order to render templates tight to the image border, mimicking RoIs in object detection.

Starting from the above camera initializations, we use an image predictor f to estimate instance-level transformations for each camera, denoted as ∆ΠI = {∆πI_1, . . . , ∆πI_N}. Hence, this gives f(I) an opportunity to update each camera to a better pose according to the unique appearance of image I. The purpose of our camera multiplexer is to select among many different hypothetical poses and find the best pose. We approach camera selection by minimizing two losses: (i) the negative 2D IoU between the i-th rendered pose silhouette Si and the mask label Ŝ, i.e., LIoU = −IoU(Si, Ŝ). At inference, Ŝ is produced from a mask prediction branch. We use a differentiable silhouette renderer RS to compute per-camera silhouettes following Si = RS(T, πi + ∆πI_i). (ii) A self-supervised cluster consistency loss LKL, which measures the agreement between pose and image features; details are described in Sec. 6.3.2. The best pose i∗ is selected as the argmin among the camera alignment scores a as:

  a = (1 + LIoU + √LKL)  and  i∗ = argmin_i a_i.  (6.1)

The output of the camera multiplexer is our template T rotated by the i∗-th camera which best aligns to the object appearance. We denote the best posed template mesh as P. Intuitively, P best explains the 2D shape and cluster partitions.

Our camera multiplexer differs from that of [58] in two key ways. First, [58] uses hand-designed category-specific priors to arrange cameras; in contrast we spread cameras uniformly around the template, aiding scalability to many categories. Second, [58] learns a set of camera poses independently for each training image; this forces them to train a separate pose regressor for use during inference. We instead learn to regress multiplexer poses from a set of shared cameras, enabling end-to-end learning and allowing re-use of the multiplexer during inference as well as training.

Template Optimization We have defined our learnable template and our uniform camera multiplexer. Next we discuss the optimization with respect to the template vertices and the network-predicted camera transformations ∆ΠI. The vertex positions of the mesh template T are trained to minimize LIoU(Si∗, Ŝ), where i∗ denotes the best pose. We optimize the camera transformations ∆ΠI produced from the image predictor f with the loss Σ_i LIoU(Si, Ŝ).
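A minimal sketch of the camera selection in Eq. 6.1 is shown below, assuming the per-camera IoU and cluster-consistency losses have already been computed for a batch; the tensor shapes and names are assumptions for illustration.

```python
import torch

def select_best_cameras(loss_iou, loss_kl):
    """Pick the best multiplexer camera per image (sketch of Eq. 6.1).

    loss_iou, loss_kl: tensors of shape (batch, num_cameras) holding the negative
    silhouette IoU and the cluster-consistency term for every hypothetical pose.
    """
    align = 1.0 + loss_iou + torch.sqrt(torch.clamp(loss_kl, min=0.0))
    best = align.argmin(dim=1)            # i* = argmin_i a_i, per image
    return best, align

if __name__ == "__main__":
    B, N = 4, 48                          # 48 shared cameras, as in the experiments section
    loss_iou = -torch.rand(B, N)          # L_IoU = -IoU lies in [-1, 0]
    loss_kl = torch.rand(B, N)
    best, align = select_best_cameras(loss_iou, loss_kl)
    print(best.tolist())
```

Only the silhouette loss of the selected camera i∗ would then drive the template vertices, while the sum over all cameras drives the predicted transformations ∆ΠI, as described above.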
Intuitively, the network f is responsible for learning adjustments to the multiplexer poses, while the template vertices control the category-specific shared mesh shape. We further regularize the template T using edge statistics, mesh normal consistency, and Laplacian regularization. See the supplemental for additional loss optimization details.

Reference Sample Discovering 3D object shapes from images with only 2D supervision is challenging since infinite 3D shapes can explain a 2D mask. We build on the key insight that object instances from the same object type share similarities in shape. We select a reference image sample from the training set which serves as the canonical interpretation of the object shape. Instances of the same object category are inferred as deformations from the reference shape. To force our model to ground predictions to the reference sample, we include it in every batch. We optimize ∆πI_1 to be an identity transform for the reference sample, and force the multiplexer to output i∗ = 1 for the reference. Hence, π1 is tied to the reference's canonical viewpoint. Lastly, we use the reference sample to filter outlier samples whose loss deviates too far from the reference loss magnitude. Fig. 6.6 shows examples of the reference sample and corresponding learned templates across diverse categories.

6.3.2 Cluster Consistency

Jointly learning 3D template shapes and camera poses is challenging when only 2D masks serve as supervision. This is because 2D masks are not discriminative enough to signal a uniquely correct 3D shape. Prior works alleviate the ambiguity by relying on human-designed CAD models [58, 72, 82] or by predicting 2D semantic parts [72, 92]. In contrast, we propose a self-supervised cluster consistency loss based only on the structure of the posed template P.

Although silhouettes are often not discriminative enough, the image of an object reveals a lot about its pose, for example a right- vs. left-facing car. We encourage our camera multiplexer to agree with the image content by clustering the learned shared template T into K clusters, then comparing with a cluster map from the image predictor f, encouraging poses and network features to be in agreement.

Defining Clusters We automatically generate clusters on the template vertices V using K-means [110] with graph distance as the cost. The number of clusters K is set a priori and the clustering is updated as T changes. Note that our clusters are consistent in training but are not directly semantically interpretable. Fig. 6.3 shows example learned clusters.

Figure 6.3: Our camera multiplexer computes an alignment score a (Eq. 6.1) which compares network outputs (top) to renderings from uniform cameras (bottom). The cluster maps are compared with rendered clusters using LKL while the 2D mask is used for IoU2D. The multiplexer output is computed as the argmin camera i∗ (bolded).
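One simple way to realize clustering with a graph-distance cost is a K-medoids-style loop over hop distances on the mesh edge graph, sketched below with NumPy; the BFS hop distance, the medoid update rule, and all names are assumptions standing in for whatever the actual implementation uses.

```python
import numpy as np
from collections import deque

def hop_distances(num_verts, edges):
    """All-pairs hop (graph) distances on the mesh vertex graph via BFS."""
    adj = [[] for _ in range(num_verts)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = np.full((num_verts, num_verts), np.inf)
    for src in range(num_verts):
        dist[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src, v] == np.inf:
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist

def cluster_vertices(edges, num_verts, k=12, iters=20, seed=0):
    """K-medoids-style clustering of mesh vertices using graph distance as the cost."""
    rng = np.random.default_rng(seed)
    dist = hop_distances(num_verts, edges)
    medoids = rng.choice(num_verts, size=k, replace=False)
    for _ in range(iters):
        assign = dist[:, medoids].argmin(axis=1)      # nearest medoid per vertex
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(assign == c)[0]
            if len(members) > 0:                      # medoid = member minimizing total distance
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return assign, medoids

if __name__ == "__main__":
    edges = [(i, (i + 1) % 8) for i in range(8)]      # tiny toy graph: a ring of 8 vertices
    assign, medoids = cluster_vertices(edges, num_verts=8, k=2)
    print(assign.tolist(), medoids.tolist())
```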
KL Divergence We have defined how we derive the rendered cluster probabilities C from the cam- era multiplexer. Next we use our image predictor f to estimate cluster probabilities for each pixel resulting in a softmax cluster map CI. We compare these cluster maps with the rendered clusters inside the multiplexer using a bi-directional KL-divergence LKL to encourage the distributions to be consistent. For the loss to reach zero, the image-based predictions CI would need to be consistent with the best camera i∗. Similarly, the camera transformations ∆Π would need to morph such that Ci∗ matches CI. Fig. 6.3 visualizes both the predicted cluster maps and rendered camera clusters. 6.3.3 Mesh Refinement Instances of the same object category typically have varying types of shapes, for example a truck vs a hatchback. In addition to learning a shared template and a pose, we further learn to refine an object’s 3D shape to correspond to its appearance in the instance-level input image. Note that our mesh refinement is trained end-to-end with the template learner, camera multiplexer, and cluster consistency loss. Following [56, 159], we use graph convolutions to estimate offsets relative to the output of the camera multiplexer. The posed template P relative to the best camera i∗ is used to sample 114 features from our image predictor f via a vertex alignment sampling. We then define a mesh refiner g comprised of two graph convolutions, following the architecture of [56]. The output of g are offsets for each vertex in P, with a maximum offset constrained to an adaptive hyperparameter λo. This results in a uniquely posed and deformed 3D mesh for each instance of an object, denoted as M. At train time, the final refined meshes are rendered using RS and optimized against the 2D mask labels, using both the IoU loss and mesh regularizers defined in Sec. 6.3.1. 6.3.4 Adaptive Learning Our model is tasked to simultaneously learn a shared template shape, camera pose adjustments and instance-level shape without any 3D supervision. These are complex tasks making optimization challenging. To facilitate learning we follow a few simple adaptive learning techniques. To capture the complexity of varying object categories, our template T needs sufficient res- olution. However, training with too many vertices from random initialization tends to lead to difficulties in optimization. Instead, we start with a small mesh resolution of |V| = 42 then gradu- ally subdivide to |V| = 162, and |V| = 642. We note that we only subdivide if the average template IoU fit is less than 0.80, implying the current level of resolution may be insufficient. However, our mesh refiner g always uses a resolution of 642 vertices even when the template is more coarse. We gradually enable various features of our framework during training. We warmup by op- timizing only our template learner and camera multiplexer for 2k iterations. Then we enable the cluster consistency loss. This allows our framework to find a better optimum for the templates and cameras, before the self-supervised consistency loss can serve as a pose regularizer. To prevent a single camera from dominating, we gradually enable the LKL term of Eq. 6.1. After an additional 2k iterations, we enable the mesh refiner, allowing a reasonable template T to develop first. Finally, 115 we set the constraint on the template vertex movement λr and the mesh refiner vertex movement λo adaptively based on 2D mask heuristics. See Suppl. for details. 
6.4 Experiments We tackle the problem of 3D mesh reconstruction from images on three challenging real-world datasets, Pascal3D+ [170], CUB-200 [158] and COCO [101]. We show results on a variety of object categories, such as giraffes, umbrella, bicycle and many more. Note that our approach, 3DMD, scales to many object types and datasets since it only assumes access to 2D silhouettes for supervision and does not require any expert 3D templates, 2D keypoints or manually defined object-specific priors. Training details We use the PyTorch3D [129] mesh renderer with a resolution of 64 × 64 and 18 faces per pixel. We train using a batch size of 8. Our initial learning rate is set to 0.04 which we decay by 0.1 at 16k & 24k iterations. We train one model per category for 32k iterations via a SGD optimizer. We set the number of clusters to K = 12. We use a total of N = 48 cameras. Our input images are of size 128 × 128, and are filtered to contain unoccluded objects with < 5% of their border shared with other objects or the image edge. We randomly flip training images. We use the same hyperparameters for all categories and datasets. 6.4.1 Pascal3D+ We use the Pascal3D+ [170] dataset containing in the wild images to quantitatively evaluate the 3D mesh reconstruction performance of our method. The dataset provides high-resolution canonical CAD model and pose for ≈ 3,000 objects per category for 12 diverse objects. We summarize our overall performance on each object category in Tab. 6.2. 116 (cid:88) (cid:88) Expert template CSDM [73] CMR [72] DRC [156] U-CMR† [58] Li et al. [92] 3DMD (Ours) 3DMD+ 3DMD++ (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) 0.600 0.640 0.670 0.680 0.620 0.658 0.705 0.718 (cid:88) (cid:88) 2D Multiple keypoints viewpoints Camera Canonical IoU3D Image View IoU3D priors 32 64 − − − 0.657 − 0.622 0.671 0.678 32 − − − 0.295 − 0.481 0.443 0.522 64 − − − 0.265 − 0.450 0.414 0.493 Table 6.1: Shape reconstruction performance on Pacal3D+ car. We compare our 3DMD with CSDM [73] a keypoint-based approach, CMR [72] which uses expert templates & 2D keypoints, DRC [156] a multi-view volumetric approach, U-CMR [58] which relies on expert templates and dataset-specific camera priors, and Li et al. [92] which uses off-the-shelf part segmentation. Unlike prior works, we report two voxel sizes for a pose-agnostic metric IoUCan 3D , and a new metric IoUIm 3D which factors in pose & shape. † indicates public code results. Category Table Bicycle Chair Aeroplane Train Boat Motorbike TV Sofa Bus Car Bottle IoUCan 3D 64 32 0.200 0.208 0.248 0.318 0.392 0.406 0.453 0.455 0.532 0.533 0.658 0.673 0.122 0.129 0.216 0.263 0.373 0.362 0.400 0.438 0.512 0.513 0.622 0.656 IoUIm 3D 32 64 0.108 0.166 0.094 0.243 0.197 0.124 0.320 0.390 0.403 0.382 0.481 0.520 0.070 0.098 0.069 0.206 0.174 0.106 0.283 0.387 0.389 0.358 0.450 0.502 Table 6.2: Shape reconstruction results for all Pascal3D+ categories. Metrics To evaluate mesh reconstruction performance, prior works [58, 72, 73, 92, 156] propose a canonical 3D Intersection over Union (IoUCan 3D ) metric, which measures the voxelized 3D over- lap of the canonical ground truth and predicted shape. Yet, this metric does not account for the pose. To evaluate for both shape and pose, we propose an image-view 3D Intersection over Union (IoUIm 3D) metric, which measures the voxelized 3D overlap of the ground truth and predicted shape in view coordinates, factoring in pose and shape accuracy. 
Following prior work, both metrics 117 use a voxelized approximation for computing 3D IoU and briefly optimize a global scaling and translation factor to account for the general ambiguity of mesh scales. Comparison with state of the art We compare against a variety of competing methods, which predict 3D shapes without 3D supervision. CSDM [73] uses 2D silhouettes and keypoints to learn deformable basis shapes. DRC [156] is a volumetric approach which relies on multiple viewpoints 2D silhouettes. CMR [72] predicts object meshes from image inputs but requires expert object-specific mesh templates and 2D keypoints. U-CMR [58] also relies on expert object-specific templates but eliminates the need of 2D keypoints by using hand-coded camera hypothesis for each object category, which we refer to as camera priors. These priors work well for a specific object category, e.g. images of cars are expected to be taken from a constant height, but fail to generalize to other categories, e.g. pictures of airplanes may be captured from cameras below or above. Li et al. [92] use an off-the-shelf external part discovery method and learn a UV map to predict parts of the predicted mesh. The latter is the most comparable to ours, as both learn without 2D keypoints and do not require access to expert object-specific templates. Though, unlike Li et al. [92], our approach is end-to-end and does not depend on an external part segmentation method. Table 6.1 shows results for car in Pascal3D+, the category evaluated unanimously in prior work. Table 6.2 demonstrates comprehensive evaluation for our approach on all categories. We report IoUCan 3D & IoUIm 3D under two voxel resolutions of 323 & 643. We draw the following obser- vations. Despite not using camera priors or an expert template, our approach outperforms U-CMR for IoUIm 3D (0.481 vs 0.295) and performs on par under the pose-agnostic IoUCan 3D (0.658 vs 0.680). 3D ), which is directly comparable to ours since they do not use extra 3D priors. Our approach remains consistently superior for both Our method outperforms Li et al. [92] (0.658 vs 0.620 IoUCan voxelization thresholds. As expected, performance decreases with higher voxelization resolution since finer details are captured at 643. We observe that IoUIm 3D < IoUCan 3D which is expected since 118 IoUIm 3D captures errors in both pose and 3D shape. We also evaluate variants of 3DMD by gradually replacing our end-to-end learned components with expert priors. First, we substitute our template learner with an expert template (referred as 3DMD+ in Tab. 6.1), similar to in U-CMR. We find that 3DMD+ slightly improves 3DMD in IoUCan in IoUIm 3D (0.705 vs 0.658) and outperforms U-CMR (0.705 vs 0.680). We note a slight degradation 3D for 3DMD+ and 3DMD (0.481 vs 0.443). This effect suggests that the template learned in 3DMD, which is encouraged to fit silhouettes from image-view, is a better fit for an end-to-end learning approach than an external hand-designed CAD model. To this end, an analysis of the 2D IoU fit on the validation set for our learned template (0.831) and the expert template (0.820) support the above observation, namely that our learned template fits the 2D image-view best. In addition to the expert template, we also replace our general camera multiplexer with an object-specific camera distribution (referred as 3DMD++ in Tab. 6.1), as in U-CMR. We observe that 3DMD++ performs better compared to our vanilla approach 3DMD for IoUCan 3D (0.718 vs 3D (0.522 vs 0.443). 
3DMD++ signifies an upper-bound of our framework when using 3D object priors. We emphasize that 3DMD++, which uses the same 0.705) and more noticeably for IoUIm supervision as U-CMR, outperforms U-CMR by an appreciable margin (0.718 vs 0.680 IoUCan 3D ). Unlike prior works [58, 72, 73, 92, 156], we also evaluate our method on all 12 Pascal3D+ categories including aeroplane, bicycle, boat, bottle, bus, car, chair, table, motorbike, sofa, train and TV. We provide our results in Tab. 6.2. We note that objects which have large pose variation, such as aeroplane, under perform objects with less variation, e.g. car, bus, sofa, etc. We observe objects with high intra-class variation, including chair and table, perform worse than objects with low intra-class variation, e.g. bottle. We show qualitative results of learned category templates in Fig. 6.6, and refined meshes in Fig. 6.5. See Suppl. for more results. Ablations We conduct ablations on the car category, focusing on the clusters and our multiplexer. 119 Figure 6.4: Qualitative 3D mesh reconstructions on the COCO dataset for a diverse set of rigid and non-rigid object categories. We visualize input image (row 1), posed template (row 2), and the refined mesh (row 3 − 5) in view coordinates from two additional novel viewpoints. Figure 6.5: Qualitative 3D mesh reconstructions on the CUB-200 bird (left) and numerous PAS- CAL3D+ categories (right). We visualize input image (row 1), posed template (row 2), and the refined mesh (row 3 − 5) in view coordinates from two additional novel viewpoints. 120 PosedImageRefinedPosedImageRefined Figure 6.6: We show our learned templates (bottom) relative to the canonical pose of their respec- tive reference samples (top). We use IoUIm 3D at resolution of 643 in ablation for a finer level of detail. Cluster Consistency: The cluster consistency loss is designed to encourage the camera poses to be consistent with the image predictor. To understand its effects, we first turn off cluster loss and interaction with the multiplexer. Doing so causes the performance to degrade (0.450 vs 0.408), suggesting cluster consistency is crucial for optimization. We next allow the clusters to learn during training but disable their influence in the multiplexer at inference, e.g. removing the term LKL from Eq. 6.1. We observe that performance degrades similarly (0.450 vs 0.418), suggesting that cluster consistency is important for the multiplexer. Interestingly, turning off cluster consistency at inference is better than not using cluster consistency in both train and test time (0.418 vs 0.408). Hence, LKL may improve shared features for g, rather than only helping with the multiplexer selection. Cameras Multiplexer: The camera poses used to initialize our multiplexer logically have an impact on performance. We study the effects of using less granular distributions and various priors. First, we observe that utilizing less cameras generally degrades performance. Recall that 3DMD utilizes N = 48 cameras spread uniformly over the unit sphere. When we reduce the number of cameras to N = 32, we observe an appreciable drop (0.450 vs 0.432). When we reduce the number of 121 TemplatesReference cameras to N = 16 the degradation is catastrophic in comparison to the baseline (0.450 vs 0.287). Following [58], we next spread the camera poses over only a single band of azimuth and zero elevation, e.g. level to the XZ plane. After training the car category with this configuration we observe an a noticable improvement (0.450 vs 0.482). 
6.4.2 CUB-200

The CUB-200 [158] dataset consists of over 200 bird species with 6,033 images and 2D mask ground truths. Despite not providing 3D mesh ground truths, the dataset has been a popular choice for qualitative evaluation among weakly supervised 3D reconstruction methods [58, 72, 92]. We train our 3DMD on the CUB-200 dataset and show bird reconstructions on the validation set in Fig. 6.5. Since no ground truth shapes are provided, we evaluate the 2D IoU (IoU2D) between the ground truth bird silhouettes and our rendered predictions. On the validation set, we achieve state-of-the-art performance of 0.792 IoU2D compared to Li et al. [92], who achieve 0.734, and CMR [72], who achieve 0.706.

6.4.3 COCO

The COCO [101] dataset is widely used for 2D object detection and contains about 1.5M object instances across 80 categories. In comparison to Pascal3D+ and CUB-200, the COCO dataset is the most difficult and most in-the-wild from the perspective of 3D reconstruction. The annotations are frequently affected by occlusion and truncation, and the resulting 2D masks and boxes are modal (covering only the visible extent of each object). We select 20 diverse categories: aeroplane, car, bus, train, motorbike, traffic light, stop sign, giraffe, elephant, zebra, bear, skateboard, fire hydrant, tennis racket, baseball bat, umbrella, suitcase, spoon, fork, and knife. We train our approach on the training split and show results on the val set in Fig. 6.4. Additional analysis of our 2D IoU fitting on the COCO dataset is detailed in the Suppl.

6.5 Summary

We present a novel framework for weakly supervised 3D mesh discovery in the wild which is able to jointly optimize a 3D template, camera pose, and instance-level meshes. The only supervision used is in the form of 2D mask labels. No prior knowledge is imposed. We further present a novel self-supervised cluster consistency loss which encourages agreement between our camera pose and the image content. We demonstrate the effectiveness of our approach using three datasets: Pascal3D+, CUB-200, and COCO. Our method achieves competitive performance on Pascal3D+ while using less 3D expert knowledge compared to prior work (e.g., CAD models or keypoints). We also demonstrate new state-of-the-art results when similar 3D priors are assumed. Critically, we generalize to tens of categories by creating a method free of manually set hyperparameters.

6.6 Supplementary Materials

6.6.1 Experimental Results

We provide additional experiments and qualitative results from 3DMD, our 3D mesh reconstruction method, using the Pascal3D+ [170], CUB-200 [158], and COCO [101] datasets.

2D IoU Performance: Tab. 6.3 provides a comprehensive evaluation on the validation set for each category in COCO. We report the performance using our rendered posed template P and our final refined mesh M, compared to the ground truth segmentation mask. 2D IoU increases for our final mesh prediction compared to the posed template, which is expected as the mesh refinement module is trained to deform the posed template to fit the object instance in the input image. We observe that IoU is high for rigid categories, such as stop sign, car, and fire hydrant, and is lower for categories with thin structures such as spoon, fork, and knife.
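The IoU2D numbers reported above and in Tab. 6.3 compare a rendered silhouette against the ground-truth mask. A minimal sketch, assuming both are (H, W) tensors with the rendered silhouette in [0, 1], is given below; the function name and threshold are illustrative.

```python
import torch


def silhouette_iou(pred, gt, thresh=0.5):
    """2D IoU between a rendered (soft) silhouette and a binary ground-truth mask."""
    p = pred > thresh                       # binarize the rendered silhouette
    g = gt > 0.5                            # ground-truth mask as boolean
    inter = (p & g).sum().float()
    union = (p | g).sum().float().clamp(min=1.0)
    return inter / union
```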
Qualitative Examples: We provide additional examples of our estimated posed template P and final refined mesh M. We give examples for the car category of Pascal3D+ in Fig. 6.7 and the birds of CUB-200 in Fig. 6.8. Lastly, we provide many other diverse category examples for COCO in Fig. 6.9 and for more Pascal3D+ categories in Fig. 6.10.

Camera Multiplexer Visualizations: We provide comprehensive visualizations of our camera multiplexer renderings in Fig. 6.11. In order to help build intuition for the multiplexer process, we show all rendered camera views, the predicted clusters and mask, and highlight the resultant best pose i∗.

Category | P IoU2D | M IoU2D
Skateboard | 0.345 | 0.478
Spoon | 0.396 | 0.436
Fork | 0.400 | 0.433
Baseball bat | 0.490 | 0.606
Knife | 0.497 | 0.555
Aeroplane | 0.536 | 0.725
Umbrella | 0.560 | 0.717
Bicycle | 0.571 | 0.646
Giraffe | 0.603 | 0.740
Motorbike | 0.661 | 0.723
Zebra | 0.670 | 0.759
Elephant | 0.676 | 0.760
Tennis racket | 0.696 | 0.786
Train | 0.696 | 0.813
Suitcase | 0.728 | 0.793
Truck | 0.736 | 0.787
Fire hydrant | 0.742 | 0.820
Car | 0.744 | 0.831
Bus | 0.761 | 0.800
Traffic light | 0.767 | 0.859
Stop sign | 0.830 | 0.895
Bear | 0.858 | 0.860

Table 6.3: We show the 2D IoU performance for our posed template P and the final refined mesh M across the respective validation splits.

6.6.2 Architecture Details

We first detail our overall network architecture in Tab. 6.4, which takes a single input image and outputs a binary segmentation mask, a clusters map, camera transformations, and instance-level mesh offsets. 3DMD runs at ≈ 7 frames per second on an NVIDIA 1080 Ti GPU.

Index | Inputs | Operation | Notes | Output shape
(0) | − | Input | Image | 128 × 128 × 3
(1) | (0) | ResNet18 backbone | Spatial Features | 32 × 32 × 512
(2) | (1) | TConv(512 → 256, 4 × 4), ReLU | | 64 × 64 × 256
(3) | (2) | Conv(256 → 1, 3 × 3), Sigmoid | Mask | 64 × 64 × 1
(4) | (1) | TConv(512 → 256, 4 × 4), ReLU | | 64 × 64 × 256
(5) | (4) | Conv(256 → K, 3 × 3), Softmax | Clusters | 64 × 64 × K
(6) | (1) | Conv(512 → 1, 1 × 1), Vectorize, Softmax | Attention | 1024 × 1
(7) | (1) | Vectorize | | 1024 × 512
(8) | (6), (7) | ∗, + | Linear Features | 512
(9) | (8) | Linear(512 → 6N), Tanh | Rotations | N × 6
(10) | (8) | Linear(512 → 3N), Tanh | Translations | N × 3
(11) | (8) | Linear(512 → 3N), Tanh | Scales | N × 3
(12) | (1), P | Vert alignment | Vertex Features | |V| × 512
(13) | (12) | Linear(512 → 256), BN, ReLU | | |V| × 256
(14) | (13) | Graph conv, BN, ReLU | | |V| × 256
(15) | (14) | Graph conv, BN, ReLU | | |V| × 256
(16) | (15) | Linear(512 → 3), Tanh | Mesh offsets | |V| × 3

Table 6.4: We detail our overall network architecture for estimating a binary mask, a clusters map, ∆ΠI transformations made of translations and scales in XYZ and 6D rotations [190], and lastly refined mesh offsets. TConv denotes transposed convolution and BN denotes batch normalization [70]. We denote N as the number of cameras in our multiplexer, while |V| denotes the number of vertices in the meshes P.

Regularization: We regularize the template T and the final refined meshes M using four regularizers with respective loss weights: (i) 1/3 mean edge length, (ii) 1/3 standard deviation of edge lengths, (iii) 1/3 Laplacian regularizer [37], and (iv) 1/10 mesh normal consistency. Lastly, we triple the regularization loss weights for the final refined meshes.
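A sketch of how these four terms might be combined is given below, assuming PyTorch3D's Meshes structure [129]. We use mesh_edge_loss, mesh_laplacian_smoothing, and mesh_normal_consistency as stand-ins for the mean edge length, Laplacian, and normal consistency terms (their exact definitions may differ from ours), and compute the edge-length standard deviation directly.

```python
from pytorch3d.loss import (mesh_edge_loss, mesh_laplacian_smoothing,
                            mesh_normal_consistency)
from pytorch3d.structures import Meshes


def mesh_regularizer(meshes: Meshes, refined: bool = False):
    """Weighted sum of the four shape regularizers; weights are tripled for refined meshes."""
    verts, edges = meshes.verts_packed(), meshes.edges_packed()
    edge_len = (verts[edges[:, 0]] - verts[edges[:, 1]]).norm(dim=1)

    loss = (1.0 / 3.0) * mesh_edge_loss(meshes)                   # (i) edge length
    loss = loss + (1.0 / 3.0) * edge_len.std()                    # (ii) std of edge lengths
    loss = loss + (1.0 / 3.0) * mesh_laplacian_smoothing(meshes)  # (iii) Laplacian [37]
    loss = loss + (1.0 / 10.0) * mesh_normal_consistency(meshes)  # (iv) normal consistency
    return 3.0 * loss if refined else loss
```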
Hyperparameters: We define the vertex movement λr for the mesh refinement outputs (e.g., index 16 in Tab. 6.4) adaptively based on the average IoU2D of the template shape. Formally, we define λr = 0.75 · (1 − IoU2D)^1.5. Intuitively, the worse the template mesh fits the overall training dataset, the more freedom we give to the mesh refiner. We define the template vertex movement λo adaptively based on the 2D mask occupancy o of the object category. We set λo = 0.75 · (1 − o)^(5/3). Intuitively, we allow categories with a wider variance of 2D masks to deform the template more.
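As a small concrete example, these two adaptive limits reduce to the following, where the helper names are hypothetical and the inputs are the average template IoU2D on the training set and the average 2D mask occupancy o of the category:

```python
def lambda_refine(mean_iou2d: float) -> float:
    """Per-vertex movement allowed for the mesh refiner (index 16 in Tab. 6.4)."""
    # The worse the template fits the training set, the more freedom the refiner gets.
    return 0.75 * (1.0 - mean_iou2d) ** 1.5


def lambda_template(occupancy: float) -> float:
    """Per-vertex movement allowed when learning the category template."""
    # Categories with lower / more varied 2D mask occupancy may deform the template more.
    return 0.75 * (1.0 - occupancy) ** (5.0 / 3.0)
```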
Figure 6.7: Qualitative 3D mesh reconstructions on additional Pascal3D+ cars. We visualize the input image (row 1), the posed template (row 2), and the refined mesh (rows 3-5) in view coordinates and from two additional novel viewpoints.

Figure 6.8: Qualitative 3D mesh reconstructions on additional CUB-200 birds. We visualize the input image (row 1), the posed template (row 2), and the refined mesh (rows 3-5) in view coordinates and from two additional novel viewpoints.

Figure 6.9: Qualitative 3D mesh reconstructions on additional COCO categories. We visualize the input image (row 1), the posed template (row 2), and the refined mesh (rows 3-5) in view coordinates and from two additional novel viewpoints.

Figure 6.10: Qualitative 3D mesh reconstructions on additional Pascal3D+ categories. We visualize the input image (row 1), the posed template (row 2), and the refined mesh (rows 3-5) in view coordinates and from two additional novel viewpoints.

Figure 6.11: We show examples of our camera multiplexer renderings (right) as compared to the network-predicted cluster map (left middle) and predicted mask (left bottom), in order to select the best pose i∗, highlighted in blue for each example category.

Chapter 7 Conclusions and Future Work

7.1 Conclusions

Detecting safety-critical objects in urban scenes using only a single monocular image in both 2D and 3D space is exceedingly challenging due to scale variability, occlusion, and depth ambiguity. The need to detect such objects in an accurate and efficient manner is growing in importance as urban autonomous driving applications and technology continue to rapidly develop. Specifically, among the available sensors (LiDAR, RADAR, and stereo cameras), monocular camera-based object detection remains the most cost-effective methodology and is therefore the most widespread [16].

Throughout this dissertation, we have proposed and demonstrated the effectiveness of various monocular object detection techniques while keeping both accuracy and efficiency in mind. We apply this goal firstly to pedestrian detection (Ch. 2-3) and secondly to multi-class 3D object detection (Ch. 4-5). We observe from Ch. 2 that the pedestrian detection and semantic segmentation tasks are highly synergistic and can be efficiently paired together during training, without requiring additional data or adding overhead to the runtime speed. In Ch. 3, we further demonstrate how autoregressive phases can be built into a single network to further improve accuracy without sacrificing runtime efficiency. Meanwhile, in Ch. 4 we demonstrate the critical step of extending such object detection systems into 3D space by proposing an efficient single-network 3D region proposal network. We validate our 3D detection system not only on pedestrians but also on cars and cyclists while continuing to use a single model. In Ch. 5, we extend our method to video using an efficient 3D Kalman filter and observe more stable accuracy without sacrificing runtime. In Ch. 6, we go beyond the perception of 3D cuboids by discovering 3D mesh shapes for cars and 20+ other object categories, while using only a single end-to-end model and exclusively 2D supervision. Our proposed methods stand as modules compatible with a shared core framework derived from the principles of the seminal Faster R-CNN [132]. As a whole, each method explores the value and relative effect of pairing object detection with weak shape supervision, intermediate iterative supervision, 3D supervision, and weakly supervised 3D mesh reconstruction.

7.2 Future Work Suggestions

7.2.1 Maps for 3D Monocular Object Detection

Even before the rapid growth of autonomous driving, maps have been widespread and common in everyday life [117]. The usage of detailed and comprehensive maps is expected to remain a critical component of autonomous driving [12]. Yet, the synergy between relatively inexpensive camera-based 3D object detection and maps has not been thoroughly explored. Specifically, lane-level and elevation-based maps intuitively provide critical constraints on where objects may realistically appear in 3D scenes, which orientation they may be in, etc. Although maps are generally fairly expensive to annotate, it is likely less expensive to measure a single set of maps usable by all vehicles than to install a high-powered sensor array in every vehicle. Moreover, increasingly automatic methods for map generation are being studied in parallel [61]. Hence, careful and efficient integration of critical map data into monocular 3D object detection systems could provide substantial benefits to accuracy and robustness on known driving terrain.

7.2.2 Monocular 3D Object Detection with Mesh Discovery

A natural extension of monocular 3D object detection (Ch. 4-5) and weakly supervised mesh reconstruction (Ch. 6) is their joining for comprehensive full-scene understanding. The combination of these two problems would result in weakly supervised joint 3D object localization and reconstruction, an undoubtedly ambitious task. The most challenging aspects of this direction are technological. Full-scene rendering is both programmatically difficult and likely expensive in terms of GPU memory. A soft variant of this idea is partially explored for autolabeling 3D cuboids [181], with the added caveat of requiring LiDAR. It is worth emphasizing that object reconstructions (Ch. 6 and [58, 72, 92]) have significant room for improvement even with clean 2D detections. Naturally, the pairing of these two ideas has a lot of room for innovation and is attractive from the perspective of training a full 3D recognition system with modular amounts of 3D supervision, e.g., ± 3D localization, ± 3D cuboid shape, and ± object pose. The most extreme and exciting setting is full-scene 3D localization and reconstruction from 2D supervision alone.

APPENDIX

A.1 Publications

A list of all peer-reviewed publications during the MSU PhD program, listed chronologically.

• Garrick Brazil, Xi Yin, and Xiaoming Liu. "Illuminating pedestrians via simultaneous detection & segmentation." Proceedings of the IEEE International Conference on Computer Vision. 2017.

• Adam Terwilliger, Garrick Brazil, and Xiaoming Liu. "Recurrent flow-guided semantic forecasting." 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019.

• Garrick Brazil and Xiaoming Liu.
"Pedestrian detection with autoregressive network phases." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. • Garrick Brazil, and Xiaoming Liu. "M3D-RPN: Monocular 3D region proposal network for object detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. (Oral Top 4.3%) • Shengjie Zhu, Garrick Brazil, and Xiaoming Liu. "The edge of depth: Explicit constraints between segmentation and depth." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. 134 • Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, Bernt Schiele. "Kinematic 3d object de- tection in monocular video." European Conference on Computer Vision. Springer, Cham, 2020. • Abhinav Kumar, Garrick Brazil, Xiaoming Liu. "GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. • Garrick Brazil, Georgia Gkioxari, Nikhila Ravi, Justin Johnson, Xiaoming Liu. "3D Mesh Discovery in the Wild" Under review in the IEEE/CVF International Conference on Com- puter Vision. 2021. A.2 Video Demonstrations • Illuminating Pedestrians via Simultaneous Detection & Segmentation https://youtu. be/9WlC4Qff3mk • Pedestrian Detection with Autoregressive Network Phases https://youtu.be/FIgTXQVGUHQ • M3D-RPN: Monocular 3D Region Proposal Network for Object Detection https://youtu. be/qu7YMMjnUAk • Kinematic 3D Object Detection in Monocular Video https://youtu.be/PRmYzHtQ99M 135 BIBLIOGRAPHY 136 BIBLIOGRAPHY [1] Y. Almalioglu, M. R. U. Saputra, P. P. de Gusmao, A. Markham, and N. Trigoni. Ganvo: Unsupervised deep monocular visual odometry and depth estimation with generative adver- sarial networks. In International Conference on Robotics and Automation (ICRA), pages 5474–5480. IEEE, 2019. 105 [2] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2014. 9, 12 [3] Y. Atoum, J. Roth, M. Bliss, W. Zhang, and X. Liu. Monocular video-based trailer coupler detection using multiplexer convolutional neural network. In International Conference on Computer Vision (ICCV). IEEE, 2017. 59 [4] A. Badki, A. Troccoli, K. Kim, J. Kautz, P. Sen, and O. Gallo. Bi3d: Stereo depth estima- tion via binary classifications. In Computer Vision and Pattern Recognition (CVPR), pages 1600–1608, 2020. 105 [5] A. Behl, O. Hosseini Jafari, S. Karthik Mustikovela, H. Abu Alhaija, C. Rother, and A. Geiger. Bounding boxes, segmentations and object coordinates: How important is recog- In International nition for 3D scene flow estimation in autonomous driving scenarios? Conference on Computer Vision (ICCV). IEEE, 2017. 55 [6] G. Bertasius, L. Torresani, and J. Shi. Object detection in video with spatiotemporal sam- pling networks. In European Conference on Computer Vision (ECCV). Springer, 2018. 82 [7] H. Bertiche, M. Madadi, and S. Escalera. Cloth3d: Clothed 3D humans. Conference on Computer Vision (ECCV), pages 344–359. Springer, 2020. 104 In European [8] B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll. Multi-garment net: Learning to dress 3D people from images. In International Conference on Computer Vision (ICCV), pages 5420–5430, 2019. 104 [9] G. Brazil and X. Liu. M3D-RPN: Monocular 3D region proposal network for object detec- tion. In International Conference on Computer Vision (ICCV). IEEE, 2019. 
xvi, xvii, 77, 80, 83, 84, 91, 92, 94, 95, 96, 98, 102 [10] G. Brazil and X. Liu. Pedestrian detection with autoregressive network phases. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. 57 [11] G. Brazil, X. Yin, and X. Liu. Illuminating pedestrians via simultaneous detection segmen- tation. In International Conference on Computer Vision (ICCV). IEEE, 2017. 33, 36, 43, 44, 45, 46, 47, 48, 57, 77 137 [12] D. J. Burnette, A. H. Chatham, and M. P. McNaughton. Automatic collection of quality control statistics for maps used in autonomous driving, Sept. 3 2013. US Patent 8,527,199. 131 [13] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional In European Conference on Computer Vision neural network for fast object detection. (ECCV). Springer, 2016. 1, 9, 10, 11, 12, 14, 23, 24, 45, 47, 48, 55, 57 [14] Z. Cai, M. Saberian, and N. Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In International Conference on Computer Vision (ICCV). IEEE, 2015. 24 [15] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 36, 55, 57 [16] S. Campbell, N. O’Mahony, L. Krpalcova, D. Riordan, J. Walsh, A. Murphy, and C. Ryan. Sensor technology in autonomous vehicles: a review. In 2018 29th Irish Signals and Systems Conference (ISSC), pages 1–4. IEEE, 2018. 2, 130 [17] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 37 [18] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau. Deep MANTA: A coarse- to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 81, 82 [19] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015. 107 [20] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 80 [21] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. Pattern Analysis and Machine Intelligence (PAMI), 40(4):834–848, 2017. 9 [22] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. In Neural Information Processing Systems (NeurIPS), 2019. 105, 107 [23] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3D object detection for autonomous driving. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. 6, 55, 59, 60, 66, 70, 71, 72, 75, 80 138 [24] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3D object In Neural Information Processing Systems proposals for accurate object class detection. (NeurIPS), pages 424–432, 2015. xii, xvi, 6, 55, 59, 60, 66, 69, 70, 71, 72, 74, 75, 80, 93, 94, 95, 96, 100, 102 [25] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3D object detection network for autonomous driving. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 58 [26] Y. Chen, S. Liu, X. Shen, and J. Jia. Fast point R-CNN. In International Conference on Computer Vision (ICCV). IEEE, 2019. 77 [27] Y. Chen, J. Wang, J. Li, C. 
Lu, Z. Luo, H. Xue, and C. Wang. LiDAR-video driving dataset: Learning driving policies effectively. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 55 [28] Z. Chen, S. Huang, and D. Tao. Context refinement for object detection. Conference on Computer Vision (ECCV). Springer, 2018. 55 In European [29] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-r2n2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision (ECCV), 2016. 107 [30] H. Chu, W.-C. Ma, K. Kundu, R. Urtasun, and S. Fidler. Surfconv: Bridging 3D and 2D convolution for RGBD images. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 55, 58 [31] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. xiv, 9, 10, 12, 20, 82 [32] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. 10, 12 [33] Q. Dai, V. Patil, S. Hecker, D. Dai, L. Van Gool, and K. Schindler. Self-supervised ob- ject motion and depth estimation from video. In Computer Vision and Pattern Recognition Workshop (CVPRW), pages 1004–1005, 2020. 105 [34] A. Dame, V. A. Prisacariu, C. Y. Ren, and I. Reid. Dense reconstruction using 3D object shape priors. In Computer Vision and Pattern Recognition (CVPR), pages 1288–1295, 2013. 104 [35] M. de La Gorce, D. J. Fleet, and N. Paragios. Model-based 3D hand pose estimation from monocular video. Pattern Analysis and Machine Intelligence (PAMI), 2011. 105 [36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale 139 hierarchical image database. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2009. 15 [37] M. Desbrun, M. Meyer, P. Schröder, and A. H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. In Special Interest Group on Computer Graphics (SIG- GRAPH), pages 317–324, 1999. 125 [38] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2009. xv, 3, 10, 12, 13, 14, 18, 20, 21, 22, 24, 25, 33, 35, 46, 47 [39] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. Pattern Analysis and Machine Intelligence (PAMI), 34(4):743–761, 2012. 12, 22, 46, 47, 77 [40] X. Du, M. H. Ang, S. Karaman, and D. Rus. A general pipeline for 3D detection of vehicles. In International Conference on Robotics and Automation (ICRA). IEEE, 2018. 58 [41] X. Du, M. El-Khamy, J. Lee, and L. Davis. Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection. In Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017. 12, 13, 14, 24, 33, 36, 45, 47 [42] B. A. Erol, A. Majumdar, J. Lwowski, P. Benavidez, P. Rad, and M. Jamshidi. Improved deep neural network object tracking system for applications in home robotics. In Computa- tional Intelligence for Pattern Recognition, pages 369–395. Springer, 2018. 1 [43] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal- network.org/challenges/VOC/voc2011/workshop/index.html. 23 [44] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 
The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal- network.org/challenges/VOC/voc2012/workshop/index.html. 11 [45] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In Computer Vision and Pattern Recognition (CVPR), 2017. 107 [46] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence (PAMI), 32(9):1627–1645, 2010. 80 [47] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2013. 9, 13 [48] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot 140 detector. arXiv preprint arXiv:1701.06659, 2017. 45 [49] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network In Computer Vision and Pattern Recognition (CVPR). for monocular depth estimation. IEEE, 2018. 80 [50] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Dynamic zoom-in network for fast In Computer Vision and Pattern Recognition (CVPR). object detection in large images. IEEE, 2018. 55 [51] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2012. xi, 1, 3, 9, 10, 13, 20, 21, 23, 35, 45, 46, 55, 57, 62, 69, 70, 78, 79, 82, 93, 94, 100 [52] S. Giancola, J. Zarzar, and B. Ghanem. Leveraging shape completion for 3D siamese track- ing. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. 82 [53] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In International Conference on Computer Vision (ICCV). IEEE, 2015. 11 [54] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV). IEEE, 2015. 11, 12, 15, 41, 43, 64, 72 [55] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2014. 12 [56] G. Gkioxari, J. Malik, and J. Johnson. Mesh r-cnn. In International Conference on Computer Vision (ICCV), 2019. 105, 107, 114, 115 [57] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow. Digging into self-supervised In International Conference on Computer Vision (ICCV), monocular depth estimation. pages 3828–3838, 2019. 105 [58] S. Goel, A. Kanazawa, and J. Malik. Shape and viewpoint without keypoints. In European Conference on Computer Vision (ECCV), 2020. xiii, 105, 106, 107, 108, 109, 110, 111, 112, 117, 118, 119, 122, 132 [59] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional re- current neural networks. In Neural Information Processing Systems (NeurIPS), pages 545– 552, 2009. 37 [60] J. Guerry, A. Boulch, B. Le Saux, J. Moras, A. Plyer, and D. Filliat. SnapNet-R: Consistent In International Conference on Computer 3D multi-view semantic labeling for robotics. 141 Vision (ICCV). IEEE, 2017. 55 [61] C. Guo, K. Kidono, J. Meguro, Y. Kojima, M. Ogawa, and T. Naito. A low-cost solution for automatic lane-level map generation using conventional in-car sensors. Transactions on Intelligent Transportation Systems (T-ITS), 17(8):2355–2366, 2016. 131 [62] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmen- tation. In European Conference on Computer Vision (ECCV). Springer, 2014. 
9, 11, 12 [63] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Computer Vision and Pattern Recognition (CVPR), 2015. 105 [64] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In International Conference on Computer Vision (ICCV), 2017. 105 [65] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. 11, 12, 39, 105 In [66] H.-N. Hu, Q.-Z. Cai, D. Wang, J. Lin, M. Sun, P. Krahenbuhl, T. Darrell, and F. Yu. Joint In International Conference on Computer monocular 3D vehicle detection and tracking. Vision (ICCV). IEEE, 2019. 82 [67] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convo- lutional networks. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 66, 93 [68] W.-C. Hung, V. Jampani, S. Liu, P. Molchanov, M.-H. Yang, and J. Kautz. Scops: Self- In Computer Vision and Pattern Recognition (CVPR), supervised co-part segmentation. pages 869–878, 2019. 108 [69] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016. 25 [70] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by re- ducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015. xiii, 40, 126 [71] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 15, 18, 45 [72] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh re- construction from image collections. In European Conference on Computer Vision (ECCV), 142 2018. xiii, 105, 107, 108, 109, 112, 117, 118, 119, 122, 132 [73] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Computer Vision and Pattern Recognition (CVPR), pages 1966–1974, 2015. xiii, 107, 109, 117, 118, 119 [74] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. Pattern Recognition (CVPR), 2018. 105 In Computer Vision and [75] J. N. S. L. B. Kim and G. Kim. Improving occlusion and hard negative handling for single- In Computer Vision and Pattern Recognition (CVPR). IEEE, stage pedestrian detectors. 2018. 33, 45, 48 [76] S.-W. Kim, H.-K. Kook, J.-Y. Sun, M.-C. Kang, and S.-J. Ko. Parallel feature pyramid net- work for object detection. In European Conference on Computer Vision (ECCV). Springer, 2018. 2, 55 [77] A. Kirillov, Y. Wu, K. He, and R. Girshick. Pointrend: Image segmentation as rendering. In Computer Vision and Pattern Recognition (CVPR), 2020. 105 [78] T. Kong, F. Sun, W. Huang, and H. Liu. Deep feature pyramid reconfiguration for object detection. In European Conference on Computer Vision (ECCV). Springer, 2018. 34, 37, 49, 57 [79] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolu- tional neural networks. In Neural Information Processing Systems (NeurIPS), pages 1097– 1105, 2012. 25, 56, 64, 105 [80] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander. Joint 3D proposal generation In International Conference on Intelligent and object detection from view aggregation. Robots and Systems (IROS). IEEE, 2018. 80 [81] J. Ku, A. D. Pon, and S. L. Waslander. 
Monocular 3D object detection leveraging accurate proposals and shape reconstruction. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. xii, 77, 80, 87, 94, 95, 96, 100, 102, 105 [82] N. Kulkarni, A. Gupta, D. Fouhey, and S. Tulsiani. Articulation-aware canonical surface mapping. In Computer Vision and Pattern Recognition (CVPR), 2020. 108, 109, 112 [83] N. Kulkarni, A. Gupta, and S. Tulsiani. Canonical surface mapping via geometric cycle consistency. In International Conference on Computer Vision (ICCV), 2019. 105, 106, 110 [84] A. Kumar, T. K. Marks, W. Mou, Y. Wang, M. Jones, A. Cherian, T. Koike-Akino, X. Liu, and C. Feng. LUVLi face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2020. 81 143 [85] K. S. Kumar, S. Prasad, P. K. Saroj, and R. C. Tripathi. Multiple cameras using real time ob- ject tracking for surveillance and security system. In International Conference on Emerging Trends in Engineering and Technology (IConETech), pages 213–218. IEEE, 2010. 1 [86] S. Kumar, Y. Dai, and H. Li. Monocular dense 3D reconstruction of a complex dy- namic scene from two perspective frames. In International Conference on Computer Vision (ICCV), pages 4649–4657, 2017. 104 [87] W. Kuo, A. Angelova, T.-Y. Lin, and A. Dai. Mask2cad: 3D shape prediction by learning to segment and retrieve. In European Conference on Computer Vision (ECCV), 2020. 105 [88] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. PointPillars: Fast en- coders for object detection from point clouds. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. 77 [89] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based In Shape, contour and grouping in computer vision, pages 319–345. Springer, learning. 1999. 56, 64 [90] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang. GS3D: An efficient 3D object detection framework for autonomous driving. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. xii, 77, 80, 87, 94, 95, 96, 100, 102 [91] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan. Scale-aware fast R-CNN for pedestrian detection. arXiv preprint arXiv:1510.08160, 2015. 10, 11, 24, 48 [92] X. Li, S. Liu, K. Kim, S. De Mello, V. Jampani, M.-H. Yang, and J. Kautz. Self-supervised single-view 3D reconstruction via semantic consistency. In European Conference on Com- puter Vision (ECCV), 2020. xiii, 107, 108, 109, 112, 117, 118, 119, 122, 132 [93] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. DetNet: Design backbone for object detection. In European Conference on Computer Vision (ECCV). Springer, 2018. 57 [94] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun. Multi-task multi-sensor fusion for 3D object detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. 77 [95] M. Liang, B. Yang, S. Wang, and R. Urtasun. Deep continuous fusion for multi-sensor 3D object detection. In European Conference on Computer Vision (ECCV). Springer, 2018. 55, 58, 77 [96] C. Lin, J. Lu, G. Wang, and J. Zhou. Graininess-aware deep feature learning for pedestrian detection. In European Conference on Computer Vision (ECCV). Springer, 2018. 45, 48 [97] C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In Association for the Advancement of Artificial Intelligence Confer- 144 ence on Artificial Intelligence (AAAI), 2018. 107 [98] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. 
Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 1, 34, 37, 39, 49 [99] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 1 [100] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zit- nick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV). Springer, 2014. 10 [101] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014. 105, 106, 107, 109, 116, 123, 124 [102] F. Liu and X. Liu. Learning implicit functions for topology-varying dense 3D shape corre- spondence. In Neural Information Processing Systems (NeurIPS), 2020. 107 [103] L. Liu, J. Lu, C. Xu, Q. Tian, and J. Zhou. Deep fitting degree scoring network for monocular 3D object detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. xii, 77, 80, 81, 87, 94, 95, 96, 100, 102 [104] M. Liu and M. Zhu. Mobile video object detection with temporally-aware feature maps. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 82 [105] S. Liu, W. Chen, T. Li, and H. Li. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. In International Conference on Computer Vision (ICCV), 2019. 105 [106] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 34, 37, 49 [107] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single In European Conference on Computer Vision (ECCV). Springer, shot multibox detector. 2016. 57 [108] W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In European Conference on Computer Vision (ECCV). Springer, 2018. 33, 37, 44, 45, 57 [109] X. Liu and T. Yu. Gradient feature selection for online boosting. In International Conference on Computer Vision (ICCV). IEEE, 2007. 57 145 [110] S. Lloyd. Least squares quantization in pcm. Transactions of Information Theory, 1982. 113 [111] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmen- tation. In Computer Vision and Pattern Recognition (CVPR), 2015. 105 [112] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In Euro- pean Conference on Computer Vision (ECCV), 2014. 105 [113] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan. Accurate monocular 3D object detection via color-embedded 3d reconstruction for autonomous driving. In International Conference on Computer Vision (ICCV). IEEE, 2019. 77, 80, 94, 95, 102, 105 [114] F. Manhardt, W. Kehl, and A. Gaidon. ROI-10D: Monocular lifting of 2D detection to 6D pose and metric shape. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. xii, 77, 80, 87, 94, 95, 96, 100, 102 [115] B. McCann, J. Bradbury, C. Xiong, and R. Socher. Learned in translation: Contextualized word vectors. In Neural Information Processing Systems (NeurIPS), 2017. 37 [116] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Computer Vision and Pattern Recognition (CVPR), 2019. 107 [117] C. C. Miller. 
A beast in the field: The google maps mashup as gis/2. Cartographica: The International Journal for Geographic Information and Geovisualization, 41(3):187–199, 2006. 131 [118] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3D bounding box estimation using deep learning and geometry. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. xvi, 6, 55, 58, 59, 60, 66, 70, 71, 72, 75, 80 [119] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV). Springer, 2016. 34, 37, 49 [120] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang. Hologan: Unsupervised learning of 3D representations from natural images. In International Conference on Com- puter Vision (ICCV), Nov 2019. 108 [121] D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi. C3dpo: Canonical 3D pose networks for non-rigid structure from motion. In Computer Vision and Pattern Recognition (CVPR), 2019. 105 [122] K. Oksuz, B. Can Cam, E. Akbas, and S. Kalkan. Localization recall precision (LRP): A new performance metric for object detection. In European Conference on Computer Vision 146 (ECCV). Springer, 2018. 57 [123] W. Ouyang, K. Wang, X. Zhu, and X. Wang. Chained cascade network for object detection. In International Conference on Computer Vision (ICCV). IEEE, 2017. 36 [124] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning In Computer Vision and continuous signed distance functions for shape representation. Pattern Recognition (CVPR), 2019. 107 [125] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017. 65, 68, 75, 92 [126] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3D object detection from RGB-D data. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 55, 58, 80 [127] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Neural Information Processing Systems (NeurIPS), 2017. 80 [128] H. Qin, J. Yan, X. Li, and X. Hu. Joint training of cascaded CNN for face detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. 36 [129] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari. In Special Interest Group on Computer Accelerating 3D deep learning with pytorch3d. Graphics (SIGGRAPH), 2020. 105, 107, 116 [130] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. 1, 57, 79, 81, 82 [131] J. Ren, X. Chen, J. Liu, W. Sun, J. Pang, Q. Yan, Y.-W. Tai, and L. Xu. Accurate single stage detector using recurrent rolling convolution. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 33, 37, 45, 55, 57, 82 [132] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NeurIPS), 2015. 2, 5, 10, 11, 14, 38, 43, 45, 57, 61, 63, 64, 72, 79, 80, 81, 82, 83, 84, 131 [133] J. Rock, T. Gupta, J. Thorsen, J. Gwak, D. Shin, and D. Hoiem. Completing 3D object shape from one depth image. In Computer Vision and Pattern Recognition (CVPR), 2015. 107 [134] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Interventions (MICCAI). 
Springer, 2015. 37, 49 147 [135] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. Interna- tional Journal of Computer Vision (IJCV), 115(3):211–252, 2015. 82 [136] C. Russell, R. Yu, and L. Agapito. Video pop-up: Monocular 3D reconstruction of dynamic scenes. In European Conference on Computer Vision (ECCV), pages 583–598. Springer, 2014. 104 [137] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based se- quence recognition and its application to scene text recognition. Pattern Analysis and Ma- chine Intelligence (PAMI), 39(11):2298–2304, 2017. 37 [138] S. Shi, X. Wang, and H. Li. PointRCNN: 3D object proposal generation and detection from point cloud. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 55, 58 [139] S. Shi, X. Wang, and H. Li. PointRCNN: 3D object proposal generation and detection from point cloud. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. 77 [140] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder. Disentangling monocular 3D object detection. In International Conference on Computer Vision (ICCV). IEEE, 2019. xii, 77, 80, 81, 93, 94, 95, 96, 102 [141] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 14, 39, 42, 49, 105 [142] M. Slavcheva, M. Baust, and S. Ilic. Towards implicit correspondence in signed distance field evolution. In International Conference on Computer Vision Workshop (ICCVW), 2017. 107 [143] E. Smith, S. Fujimoto, and D. Meger. Multi-view silhouette and depth decomposition In Neural Information Processing Systems for high resolution 3D object representation. (NeurIPS), 2018. 107 [144] E. J. Smith, S. Fujimoto, A. Romero, and D. Meger. Geometrics: Exploiting geometric In International Conference on Machine Learning structure for graph-encoded objects. (ICML), 2019. 107 [145] N. Smolyanskiy, A. Kamenev, and S. Birchfield. On the importance of stereo for accurate depth estimation: An efficient semi-supervised deep neural network approach. In Computer Vision and Pattern Recognition Workshop (CVPRW), pages 1007–1015, 2018. 105 [146] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence (AAAI), 2017. 37 148 [147] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu. Small-scale pedestrian detection based on somatic topology localization and temporal feature aggregation. In European Conference on Computer Vision (ECCV). Springer, 2018. 33, 45 [148] R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. 33, 37 [149] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman. Pix3d: Dataset and methods for single-image 3D shape modeling. In Computer Vision and Pattern Recognition (CVPR), 2018. 107 [150] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, In Computer Vision and Pattern and A. Rabinovich. Going deeper with convolutions. Recognition (CVPR). IEEE, 2015. 12, 33, 36 [151] K. Tateno, F. Tombari, I. Laina, and N. Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. 
In Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. 55 [152] Y. Tian, P. Luo, X. Wang, and X. Tang. Deep learning strong parts for pedestrian detection. In International Conference on Computer Vision (ICCV). IEEE, 2015. 11, 24 [153] F. Tosi, F. Aleotti, M. Poggi, and S. Mattoccia. Learning monocular depth estimation in- fusing traditional stereo knowledge. In Computer Vision and Pattern Recognition (CVPR), pages 9799–9809, 2019. 105 [154] S. Tsogkas, I. Kokkinos, G. Papandreou, and A. Vedaldi. Deep learning for semantic part segmentation with high-level guidance. arXiv preprint arXiv:1505.02438, 2015. 14 [155] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Computer Vision and Pattern Recognition (CVPR), pages 2897–2905, 2018. 107 [156] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view re- construction via differentiable ray consistency. In Computer Vision and Pattern Recognition (CVPR), pages 2626–2634, 2017. xiii, 107, 109, 117, 118, 119 [157] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalch- brenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. In Speech Synthesis Workshop (SSW), 2016. 34 [158] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200- 2011 dataset. 2011. 106, 108, 116, 122, 124 [159] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3D mesh 149 models from single rgb images. pages 52–67, 2018. 107, 114 In European Conference on Computer Vision (ECCV), [160] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedes- trians in a crowd. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 33, 45, 48, 64 [161] X. Wang, W. Yin, T. Kong, Y. Jiang, L. Li, and C. Shen. Task-aware monocular depth esti- mation for 3D object detection. In Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence (AAAI), 2020. 80 [162] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger. Pseudo- LiDAR from visual depth estimation: Bridging the gap in 3D object detection for au- tonomous driving. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. 77, 80 [163] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu. Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Computer Vision and Pat- tern Recognition (CVPR), pages 8071–8081, 2019. 105 [164] G. Welch, G. Bishop, et al. An introduction to the kalman filter. 1995. 78, 83, 89 [165] C.-Y. Weng, B. Curless, and I. Kemelmacher-Shlizerman. Photo wake-up: 3D character animation from a single photo. In Computer Vision and Pattern Recognition (CVPR), June 2019. 104 [166] A. Womg, M. J. Shafiee, F. Li, and B. Chwyl. Tiny SSD: A tiny single-shot detection deep convolutional neural network for real-time embedded object detection. In Conference on Robots and Vision (CRV). IEEE, 2018. 2, 57 [167] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In Neural Information Processing Systems (NeurIPS), 2017. 107 [168] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Neural Information Pro- cessing Systems (NeurIPS), 2016. 105, 107 [169] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. 
Subcategory-aware convolutional neural net- works for object proposals and detection. In Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017. 69, 70 [170] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3D object detec- tion in the wild. In Winter Conference on Applications of Computer Vision (WACV), 2014. 106, 107, 108, 116, 124 150 [171] F. Xiao and Y. Jae Lee. Video object detection with an aligned spatial-temporal memory. In European Conference on Computer Vision (ECCV). Springer, 2018. 82 [172] B. Xu and Z. Chen. Multi-level fusion based 3D object detection from monocular images. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. xvi, 6, 55, 58, 59, 60, 66, 70, 71, 72, 75, 80 [173] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Neural Information Pro- cessing Systems (NeurIPS), 2016. 107 [174] B. Yang, W. Luo, and R. Urtasun. Pixor: Real-time 3D object detection from point clouds. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 55, 58 [175] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. 11, 57, 81, 82 [176] S. Yang and S. Scherer. CubeSLAM: Monocular 3D object SLAM. Transactions on Robotics (T-RO), 35(4):925–938, 2019. 104 [177] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia. STD: Sparse-to-dense 3D object detector for point cloud. In International Conference on Computer Vision (ICCV). IEEE, 2019. 77, 82 [178] Y. Ye, S. Tulsiani, and A. Gupta. Shelf-supervised mesh prediction in the wild. 2021. 108 [179] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang. Unitbox: An advanced object detection network. In Internet Measurement Conference (ICM). ACM, 2016. 64 [180] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A In British Machine Vision Conference (BMVC), multipath network for object detection. 2016. 33, 36 [181] S. Zakharov, W. Kehl, A. Bhargava, and A. Gaidon. Autolabeling 3d objects with differen- tiable rendering of sdf shape priors. In Computer Vision and Pattern Recognition (CVPR), pages 12224–12233, 2020. 132 [182] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Computer Vision and Pattern Recognition (CVPR), pages 340–349, 2018. 105 [183] L. Zhang, L. Lin, X. Liang, and K. He. Is faster R-CNN doing well for pedestrian detection? In European Conference on Computer Vision (ECCV). Springer, 2016. 5, 9, 10, 11, 12, 14, 16, 23, 24, 38, 41, 45, 47, 48 151 [184] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele. How far are we from solving pedestrian detection? In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016. 2, 9, 11, 12, 35, 47 [185] S. Zhang, R. Benenson, and B. Schiele. Filtered channel features for pedestrian detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2015. 22 [186] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Occlusion-aware R-CNN: Detecting In European Conference on Computer Vision (ECCV). Springer, pedestrians in a crowd. 2018. 33 [187] S. Zhang, J. Yang, and B. Schiele. Occluded pedestrian detection through guided attention in CNNs. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 33, 45 [188] Z. Zhang, S. 
Qiao, C. Xie, W. Shen, B. Wang, and A. L. Yuille. Single-shot object detection with enriched semantics. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 2, 57 [189] C. Zhou and J. Yuan. Bi-box regression for pedestrian detection and occlusion estimation. In European Conference on Computer Vision (ECCV). Springer, 2018. 33, 45, 48, 57 [190] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2019. xiii, 81, 126 [191] X. Zhu, J. Dai, L. Yuan, and Y. Wei. Towards high performance video object detection. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. 82 [192] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei. Flow-guided feature aggregation for video object detection. In International Conference on Computer Vision (ICCV). IEEE, 2017. 82 152