PERCEPTION VIA RADAR-CAMERA FUSION FOR AUTONOMOUS DRIVING

By Yunfei Long

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering—Doctor of Philosophy

2025

ABSTRACT

Reliable sensing of the environment is a key bottleneck in autonomous driving. Among the variety of available sensors, automotive radar stands out for its low cost, robustness to adverse weather, and ability to capture motion. Nevertheless, radar has not been widely embraced by the computer vision community, as its measurements are sparse, low-dimensional, and inaccurate. This work takes a different view of the traditional role of radar in perception, exploring how radar can be used to enhance monocular perception in multiple vision tasks, including depth completion, velocity estimation, and 3D object detection.

Depth completion with radar-camera fusion aims to predict dense depths for image pixels given sparse radar points and images. To handle ambiguous geometric associations between raw radar pixels and image pixels, we propose radar-camera pixel depth association (RC-PDA), which maps radar pixels to nearby image pixels with the same depths. We train a model to predict RC-PDA, which is used to enhance and densify radar returns for depth completion.

Full velocity estimation for radar points focuses on predicting the tangential velocity, which is absent from radar measurements. We present a closed-form solution that computes point-wise full velocity for radar returns by combining radar Doppler velocity with the corresponding optical flow on images.

3D object detection aims to estimate object categories and 3D bounding boxes. We focus on using radar to improve the position estimates of monocular detections. To address the discrepancy between radar hits and object centers, we build a model to predict point-wise 3D object centers, which are subsequently matched with monocular estimated centers for depth fusion. To deal with the complex distribution of possible radar hit locations on targets, we build a model to estimate radar hit distributions conditioned on object properties predicted by a monocular detector, and spatially match these distributions with actual radar hits in the neighborhood of monocular detections. This method reveals radar distributions under different conditions and achieves interpretable position estimation via radar-camera fusion.

Experiments show that the proposed methods achieve state-of-the-art performance on the individual vision tasks via radar-camera fusion. We believe this work will contribute new practical solutions for perception with radar for autonomous driving.

ACKNOWLEDGEMENTS

First, I am grateful to my advisor Daniel Morris for assiduously guiding me through various interesting projects in computer vision. Second, I would like to acknowledge the Ford-MSU Alliance for sponsoring my research on radar-camera fusion and thank all collaborators for fruitful discussions. Finally, I wish to express my gratitude to my parents, who always support me in pursuing advanced degrees.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION ........ 1
CHAPTER 2 RADAR-CAMERA PIXEL DEPTH ASSOCIATION FOR DEPTH COMPLETION ........ 9
CHAPTER 3 FULL-VELOCITY RADAR RETURNS BY RADAR-CAMERA FUSION ........ 30
CHAPTER 4 RADIANT: RADAR-IMAGE ASSOCIATION NETWORK FOR 3D OBJECT DETECTION ........ 49
CHAPTER 5 RICCARDO: RADAR HIT PREDICTION AND CONVOLUTION FOR CAMERA-RADAR 3D OBJECT DETECTION ........ 65
CHAPTER 6 CONCLUSIONS AND FUTURE WORK ........ 86
BIBLIOGRAPHY ........ 89

CHAPTER 1
INTRODUCTION

1.1 Autonomous Driving

Autonomous driving exemplifies the latest engineering efforts in artificial intelligence. The evolution of human civilization is characterized by consistent technical efforts to extend intrinsic human capabilities: our hands are extended by machines, and our intelligence by computers and algorithms. Autonomous driving [152] is the latest of these attempts regarding intelligence: autonomous cars sense their environments and drive themselves based on the sensed information and maps.

Autonomous cars have long been vigorously pursued by engineers and the general public for a number of obvious benefits. First, they free people from the wheel. Typically, people spend over 300 hours per year behind the wheel [85]; autonomous driving offers the freedom to invest that time elsewhere. Thus, we argue that this technology is revolutionary for modern people in the same sense that standing upright freed the hands of ancient apes [153]. Second, autonomous driving makes it easier for those who cannot drive, e.g., people without a driver's license, to move freely. Third, we believe that, compared with a human driver, self-driving cars, equipped with optimized algorithms and a perception ability better than human eyes, can drive more safely and use energy more efficiently.

Typically, the procedure of autonomous driving consists of perception, path planning, and maneuver control [152], where perception has long been the technical bottleneck to practical deployment. It is challenging to grasp a complete and accurate picture of complex environments, such as typical traffic scenes. Historically, only passenger airplanes and drones have successfully achieved autopilot [130] in our daily lives, as the environment is much simpler to perceive in the sky than on the road. A new wave of enthusiasm for autonomous driving is fueled by the progress of supervised learning with deep neural networks [50] and big data [162]. This progress pushes the performance of perception to a level where it outperforms humans on individual tasks. Industrial companies see the opportunities, invest heavily in research and development, and put on the agenda the day when self-driving cars will hit the road. Nevertheless, it must be admitted that current perception performance is still insufficient for autonomous driving to take over [152]. Thus, some companies focus on specialized autonomous vehicles working in simpler environments, e.g., vehicles delivering packages; self-driving trucks [1], which operate mostly on freeways; and advanced driver assistance systems (ADAS), where a human driver is still a necessity. In short, we are optimistic with reservations about the future of autonomous driving: we see its great potential, but when it comes to achieving reliability in complex environments, there is still a long way to go.

Autonomous vehicles are complex systems requiring interdisciplinary efforts from an engineering point of view, not to mention the dilemmas they face regarding law and ethics [4]. Technically, perception is a key factor that will decide the fate of autonomous driving.
In this dissertation, we focus on solving perception problems, which are introduced in the following section.

1.2 Perception Tasks Related to Autonomous Driving

To make decisions, self-driving cars need to obtain a number of attributes of the objects appearing in the surrounding environment. We ask where the objects are, what kind of objects they are, what their sizes and orientations are, and how they are moving. In other words, these attributes include class, position (e.g., depth), size and orientation (modeled by 2D/3D bounding boxes), and dynamic status (e.g., velocity). Object detection [157] is a comprehensive task that extracts these attributes for objects captured by sensors.

Although these attributes belong to objects, they can be defined and extracted at multiple granularities or levels depending on the sensor. Typically there are three levels from small to large, i.e., pixel level [136], object level, and image level [110]. Pixel-level attributes are defined for the objects intersecting the ray of a pixel or point for camera or depth sensors; object-level attributes are defined per object; and image-level attributes are defined for the objects in an image. Estimating attributes at lower levels is helpful for extracting the same attributes at higher levels, but is not always necessary. For example, image segmentation [135] predicts classes at the pixel level and may contribute to object classification in detection (an object-level task), but object classification does not always require pixel-wise object segmentation. In this dissertation, we estimate attributes including depth and velocity at the pixel level, which may lay the groundwork for better detection at the object level. All of these vision tasks depend on information collected by perception sensors, which are introduced in the following section.

1.3 Perception Sensors for Autonomous Driving

Perception sensors [132] are bridges linking the real physical world with the digital one where perception algorithms run. Characteristics of sensors, e.g., the physical properties they measure and their precision, underlie sensor selection and algorithm design for autonomous vehicles. Camera, radar, and lidar are the most widely used sensors for autonomous vehicles. Lidar and radar are depth sensors, which measure object depth directly, while a camera is not, in the sense that it only captures colors and intensities. Depth can still be acquired from a stereo camera through triangulation [32], but accuracy is low at long range. In summary, a camera captures high-resolution texture with poor or no depth, while depth sensors measure depth with poor texture. Thus, the combination of cameras and depth sensors is a natural fusion strategy to enhance monocular perception. On the other hand, the choice of depth sensors (one or both) is not so obvious, and is typically the result of a trade-off among a variety of factors such as cost and compactness.

Both radar [159] and lidar [59, 74] are time-of-flight sensors, which infer distance from the time electromagnetic waves take to travel to and back from objects, but they operate in different parts of the spectrum, lidar in the infrared and radar in the microwave band. Lower frequencies allow radar to measure longer ranges and make it more robust to adverse weather. However, lidar measurements have higher resolution in both azimuth and elevation, and typically achieve better positional estimates.
With regard to cost and compactness, radar is the winner: radar is more compact and more flexible in where it can be installed, while lidar, which is bulkier, scans 360 degrees and is typically installed on top of the vehicle. Automotive radar has been widely deployed in driving assistance systems, while lidar remains rare on consumer cars. Unlike lidar, radar can also measure the dynamic status of objects through Doppler velocity.

FMCW (frequency modulated continuous wave) radar [124] is widely used as automotive radar for its low cost; it typically includes a transmitter and receivers. The transmitter emits chirp signals, which are reflected by targets and received by the receivers. A mixer combines the transmitted and received signals, obtains their frequency difference (i.e., beat frequency) and phase difference, and produces intermediate signals. For further processing, the intermediate signals are arranged as 2D maps, with one dimension representing samples within a chirp period and the second representing different chirps. As the frequency difference is proportional to the time of flight during a single chirp period, a Fast Fourier Transform (FFT) is performed on the intermediate signal within a single chirp period to extract the beat frequency, from which a range can be computed. When a target is moving, the radial motion results in small changes in the object range and leads to phase shifts between neighboring received chirps. Doppler velocity can be inferred from these phase shifts over chirps. Therefore, a second FFT is applied along the chirp index dimension to generate a range-Doppler map. To measure the azimuths of targets, an array of horizontally aligned receivers receives echoed chirps from the same targets. There is a small difference in the target range observed by neighboring receivers, and this range difference is a known function of the target azimuth. Similarly, the range differences lead to phase shifts in chirps received by neighboring receivers. Thus, a final FFT is applied to frames of range-Doppler maps along the receiver index dimension to compute a range-Doppler-azimuth map.
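To make this FFT processing chain concrete, the following minimal sketch simulates the intermediate signal of a single target and recovers its range and radial velocity with two FFTs. The chirp parameters here are invented for illustration, and windowing, detection (e.g., CFAR), and the azimuth FFT over the receiver array are omitted.

```python
import numpy as np

# Hypothetical 77 GHz FMCW parameters (illustrative only).
c   = 3e8           # speed of light (m/s)
fc  = 77e9          # carrier frequency (Hz)
B   = 300e6         # chirp bandwidth (Hz)
Tc  = 50e-6         # chirp duration (s)
S   = B / Tc        # chirp slope (Hz/s)
Ns  = 256           # fast-time samples per chirp
Nc  = 128           # chirps per frame
fs  = Ns / Tc       # fast-time sampling rate
lam = c / fc        # wavelength

# Single target with range R and radial velocity v (toy values).
R, v = 30.0, 8.0
t = np.arange(Ns) / fs              # fast time within one chirp
m = np.arange(Nc)[:, None]          # chirp (slow-time) index

f_beat = 2 * R * S / c              # beat frequency from range
f_dopp = 2 * v / lam                # Doppler frequency from radial velocity
# Intermediate (mixed) signal: beat tone within each chirp, Doppler phase across chirps.
sig = np.exp(1j * 2 * np.pi * (f_beat * t[None, :] + f_dopp * m * Tc))

# FFT 1: fast time -> range;  FFT 2: slow time (chirps) -> Doppler.
range_dopp = np.fft.fftshift(np.fft.fft2(sig), axes=0)   # shift the Doppler axis only
dopp_bin, range_bin = np.unravel_index(np.argmax(np.abs(range_dopp)),
                                       range_dopp.shape)

range_est = range_bin * fs / Ns * c / (2 * S)
vel_est   = (dopp_bin - Nc // 2) / (Nc * Tc) * lam / 2
print(f"estimated range ~ {range_est:.1f} m, radial velocity ~ {vel_est:.1f} m/s")
```

Running this recovers approximately 30 m and 8 m/s, i.e., the simulated target's range and radial velocity.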
In this dissertation, we select radar to fuse with camera in order to enhance monocular perception as a cost-friendly solution. There are other fusion options that are also promising and worth further investigation as future work, e.g., radar-lidar and radar-lidar-camera combinations, as radar is complementary to lidar in scenarios involving long range, adverse weather, and velocity estimation. A variety of sensor data are publicly available in autonomous driving datasets, which are introduced in the following section.

1.4 Perception Datasets for Autonomous Driving

The emergence of large-scale datasets for perception tasks in autonomous driving fuels research on perception algorithms, providing an abundance of sensor data and ground truth (GT) for supervised machine learning and evaluation. The sensor data are collected by driving a vehicle with perception sensors in real traffic, and the GT data are typically annotated manually at a significant cost in time and labor. In addition, a dataset usually provides a benchmark, i.e., a criterion for evaluating predictions against GT data. This makes it convenient to fairly compare the performance of different algorithms and guides improvements to the methods. The release of datasets frees researchers from the tedious work of collecting and labeling data, allowing them to focus on the design of perception algorithms. Admittedly, a dataset may not cover all driving scenarios, and more data and testing are required for real-world deployment. Nevertheless, such datasets are valuable for the first steps in algorithm design and give us a glimpse of autonomous driving in the real world.

Among the few mainstream large-scale datasets, e.g., KITTI [27] and nuScenes [10], nuScenes stands out for its radar data and GT object velocities. Using multiple commercial radars and a lidar, nuScenes allows the ego vehicle to perform a 360-degree scan of the surrounding environment. The dataset includes measurements collected in city traffic under a variety of weather and lighting conditions. As a 3D object detection dataset, nuScenes provides GT bounding boxes (i.e., center location, size, yaw, velocity, and class) for objects in ten classes related to city traffic, e.g., cars, motorcycles, pedestrians, and traffic cones. There are two representative formats of radar data, i.e., the azimuth-range-Doppler format and point clouds. The azimuth-range-Doppler format [95] is a raw format obtained directly from the Fourier transform of received signals, while point clouds are obtained with additional processing such as clustering. We focus on the point cloud format, as it is the format of the radar data provided by the nuScenes dataset [10]. Besides datasets, another engine pushing autonomous driving forward is neural networks, which are introduced in the following section.

1.5 Neural Networks as Mathematical Models in Perception Tasks

Big data and big computational power are two major factors driving the successful application of deep neural networks in different fields, including vision, audio [101], and natural language processing [94]. It is the flexibility of neural networks that makes them outperform and finally replace traditional hand-crafted features in perception tasks: from large training data, neural networks automatically learn optimized features that are largely superior to hand-crafted features in quality and quantity.

Convolutional neural networks (ConvNets) [60] trained with supervised learning are among the most widely used and successful models for visual tasks. A ConvNet can model a complex nonlinear function with tens of millions of parameters, while it consists only of a cascade of basic operations, i.e., convolutions and activation functions. Although ConvNets, as black-box models, achieve great performance in practice, our understanding of what happens inside the black box is still far from satisfactory [49]. Nevertheless, some properties of ConvNets are clear and may guide us in building a proper model. First, a ConvNet processes grid data formats and thus is a good choice for images or quantized data in bird's-eye view (BEV). Second, a ConvNet keeps the correspondence between inputs and outputs at the same locations, because the output at a pixel is a function of the inputs at the same pixel and surrounding ones, i.e., its receptive field; correspondences can still be obtained via upsampling or downsampling if the output resolution differs from the input. Third, convolution is location invariant; in other words, the processing is the same at every location of the input.

Transformers have emerged as an alternative network architecture in vision tasks [67]. Featuring self-attention modules, they enable long-distance interactions while giving up the translation equivariance and locality built into ConvNets.
As a result, transformers require large training datasets to achieve good performance: typically they are pre-trained on a large external dataset before being fine-tuned on a smaller task-specific dataset.

The key to designing a good ConvNet model is building a good loss function that converges to small values via gradient descent. The geometric shape of a loss is typically not convex, and thus convergence to a global minimum is not guaranteed [156]; sometimes local minima are satisfactory for a given application. The geometry of a loss is primarily determined by three factors, i.e., the network structure, the definition of predictions and labels, and the loss criterion. Other factors influencing convergence are the initial parameters and the learning steps. Thus, the focus of model design is on adjusting the three components through trial and error so that the resultant loss converges easily to small values. For example, L2 loss and cross-entropy loss are two loss criteria: L2 loss is more suitable for regression, while cross-entropy works better for classification. We typically use existing successful network structures and only need to control the model size to balance efficiency and capacity, since a good model is flexible and can adapt well to new tasks with new training data. On the other hand, deciding what to predict is another important degree of freedom in design that determines the geometry of the loss. For example, there are typically two output options for the task of depth estimation: one is directly predicting depths, and the other is indirectly predicting the coefficients of a number of pre-defined depths [34]. It turns out that the second option leads to a better loss, which prevents the smearing of different depths among neighboring pixels.

1.6 Dissertation Contributions

The contributions of this dissertation are listed as follows:

• We propose radar-camera pixel depth association, which upgrades the projection of radar onto images and prepares a densified depth layer. Using the enhanced radar depth improves radar-camera depth completion.

• We identify the problem of estimating point-wise full velocity of radar returns by fusing radar and camera. We propose a novel closed-form solution to infer full radar-return velocity by leveraging the radial velocity of radar points, the optical flow of images, and the learned association between radar points and image pixels.

• We enhance radar returns to obtain a 3D object center detection from each radar return. We achieve camera-radar association at the detection level using the enhanced radar locations, and improve monocular object depth estimates by fusing the enhanced radar depths.

• We build a model to predict radar hit distributions relative to reflecting objects in BEV. We propose position estimation by convolving the predicted radar distributions with actual radar measurements.

1.7 Dissertation Organization

In light of the advantages of radar over lidar (as mentioned in Section 1.3), e.g., low cost, compactness, and the capability of Doppler measurement, in this work we use radar to enhance camera-based perception in vision tasks including the estimation of depth and velocity.

In Chapter 2, we fuse radar and camera for depth estimation, or depth completion. Our method improves monocular depth with radar as a low-cost depth sensor. The difficulty of fusing radar and camera for depth completion lies in the inaccurate projection of radar points onto images and the resulting misalignment between radar and image features.
To solve this problem, we estimate radar-camera pixel association to generate augmented radar depth features, which are compatible with image features.

In Chapter 3, we estimate the full velocity of radar returns with the help of video. This is significant because estimating full velocity enables us to capture the actual dynamic status of objects. It is known that radar only measures radial velocity, which by itself does not provide complete information from which to infer full velocity. We derive a closed-form solution for full velocity using Doppler measurements and corresponding optical flows. A model is built to estimate the radar-flow association. Our method resolves the ambiguity from radial to full velocity and paves the way for more accurate perception of object motion and accumulation of moving radar points over time.

In Chapters 4 and 5, we use radar-camera fusion to improve 3D object detection over monocular approaches, particularly for better position estimation. The key is to infer object centers from sparse radar hits on object surfaces. In Chapter 4, we explicitly predict pixel and depth offsets from radar hits to their corresponding object centers in image space. In Chapter 5, we study radar hit distributions relative to object centers in BEV and use them to match actual radar measurements for more accurate center prediction. We list conclusions and future work in Chapter 6.

CHAPTER 2
RADAR-CAMERA PIXEL DEPTH ASSOCIATION FOR DEPTH COMPLETION

While radar and video data can be readily fused at the detection level, fusing them at the pixel level is potentially more beneficial. This is also more challenging, in part due to the sparsity of radar, but also because automotive radar beams are much wider than a typical pixel and there is a large baseline between camera and radar, which results in poor association between radar pixels and color pixels. A consequence is that depth completion methods designed for lidar and video fare poorly for radar and video. Here we propose a radar-to-pixel association stage which learns a mapping from radar returns to pixels. This mapping also serves to densify radar returns. Using this as a first stage, followed by a more traditional depth completion method, we are able to achieve image-guided depth completion with radar and video. We demonstrate performance superior to camera and radar alone on the nuScenes dataset. This chapter was previously published as [76].

2.1 Introduction

We seek to incorporate automotive radar as a contributing sensor to 3D scene estimation. While recent work fuses radar with video for the objective of achieving improved object detection [12, 62, 87, 90, 93], here we aim for pixel-level fusion of depth estimates, and ask if fusing video with radar can lead to improved dense depth estimation of a scene.

Up to the present, outdoor depth estimation has been dominated by lidar, stereo, and monocular techniques. The fusion of lidar and video has led to increasingly accurate dense depth completion [34]. At the same time, radar has been relegated to the task of object detection in a vehicle's ADAS [84]. However, phased-array automotive radar technologies have been advancing in accuracy and discrimination [30]. Here we investigate the suitability of using radar instead of lidar for the task of dense depth estimation. Unlike lidar, automotive radars are already ubiquitous, being integrated in most vehicles for collision warning and similar tasks.
If successfully fused with video, radar could provide an inexpensive alternative to lidar for 3D scene modeling and perception. However, to achieve this, attentive algorithm design is required in order to overcome some of the limitations of radar, including coarser, lower-resolution, and sparser depth measurements than typical lidars.

Figure 2.1 Radar-camera depth completion: (a) an image with 0.3 seconds (5 sweeps) of radar hits projected onto it, (b) enhanced radar depths at confidence level 0.9 eliminate occluded pixels and expand visible hits, and (c) final predicted depth through depth completion.

This chapter proposes a method to fuse radar returns with image data and achieve depth completion, namely a dense depth map over the pixels in a camera. We develop a two-stage algorithm. The first stage builds an association between radar returns and image pixels, during which we resolve some of the uncertainty in projecting radar returns into a camera. In addition, this stage is able to filter occluded radar returns and "densify" the projected radar depth map along with a confidence measure for these associations (see Fig. 2.1 (a,b)). Once a faithful association between radar hits and camera pixels is achieved, the second stage uses a more standard depth completion approach to combine radar and image data and estimate a dense depth map, as in Fig. 2.1(c).

A practical challenge to our fusion goal is the lack of public datasets with radar. KITTI [28], the dataset used most extensively for lidar depth completion, does not include radar, nor do the Waymo [126] or ArgoVerse [13] datasets. The main exceptions are nuScenes [10] and the small Astyx [86] dataset, which have radar but unfortunately do not include a dense, pixel-aligned depth map as created by Uhrig et al. [131]. Similarly, the Oxford Radar RobotCar dataset [3] includes camera, lidar, and raw radar data, but no annotations are available for scene understanding. As a result, all experiments in this work use the nuScenes dataset along with its annotations. However, we find single lidar scans insufficient to train depth completion, and so accumulate scans to build semi-dense depth maps for training and evaluating depth completion.

The main contributions of this work include:

• Radar-camera pixel depth association that upgrades the projection of radar onto images and prepares a densified depth layer.

• Enhanced radar depth that improves radar-camera depth completion over raw radar depth.

• Lidar ground truth accumulation that leverages optical flow for occluded pixel elimination, leading to higher quality dense depth images.

2.2 Related Work

Radar for ADAS. Frequency Modulated Continuous Wave radars are inexpensive and all-weather, and have served as the key sensor for modern ADAS. Ongoing advances are improving radar resolution and target discrimination [30], while convolutional networks have been used to add discriminative power to radar data, moving beyond target detection and tracking to include classifying road environments [52, 119] and seeing beyond-line-of-sight targets [113]. Nevertheless, the low spatial resolution of radar means that the 3D environment, including object shape and classification, is only coarsely obtained. A key path to upgrading the capabilities of radar is through integration with additional sensor modalities [84].

Radar-camera fusion.
Early fusion of video with radar, such as [38], relied on radar for cueing image regions for object detection or road boundary estimation [36], or used optical flow to improve radar tracks [26]. With the advent of deep learning, much more extensive multi-modal fusion has become possible [23]. However, to the best of our knowledge, no prior work has conducted pixel-level dense depth fusion between radar and video.

Radar-camera object detection. Object detection is a key task in 3D perception [8]. There has been significant recent interest in combining radar with video for improved object detection. In [12], ResNet blocks [31] are used to combine color images and image-projected radar returns to improve longer-range vehicle detection. In [62], an FFT applied to raw radar data generates a polar detection array which is merged with a bird's-eye projection of the camera image, and targets are estimated with a single shot detector [69]. In [87], features from both images and a bird's-eye representation of radar enter a region proposal network that outputs bounding boxes [122]. An alternative model for radar hits is a 3 m vertical line on the ground plane, which is projected into the image plane by [93] and combined with VGG blocks to classify vehicle detections at multiple scales. Our work differs fundamentally from these methods in that our goal is dense depth estimation rather than object classification. But we do share a similarity in radar representation: we project radar hits into an image plane. However, the key novelty in our work is that we learn a neighborhood pixel association model for radar hits, rather than relying on projected circles [12] or lines [93].

Lidar-camera depth completion. Our task of depth estimation has the same goal as lidar-camera depth completion [33, 34, 37, 103, 147]. However, radar is far sparser than lidar and has lower accuracy, which makes these methods unsuitable for this task. Our radar enhancement stage densifies the projected radar depths, followed by a more traditional depth completion architecture.

Monocular depth estimation. Monocular depth inference may be supervised by lidar or self-supervised. Self-supervised methods learn depth by minimizing photometric error between images captured by cameras with known relative positions. Additional constraints such as semantic segmentation [104, 163], optical flow [35, 133], surface normals [150], and proxy disparity labels [142] improve performance. Recently, self-supervised PackNet [29] has achieved competitive results. Supervised methods [19, 25] include continuous depth regression and discrete depth classification. BTS [51] achieves the state of the art (SOTA) by improving upsampling via additional plane constraints, and more recently [107] combines supervised and self-supervised methods. Our goal is not monocular depth estimation, but rather to improve what is achievable from monocular depth estimation through fusion with radar.

2.3 Method

While there are a variety of data spaces in which radar can be fused with video, the most natural, given our objective of estimating a high-resolution depth map, is the image space. But this immediately presents a problem: to which pixel in an image does a radar pixel belong? By radar pixel we mean a simple point projection of the estimated 3D radar hit into the camera.
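Concretely, such a point projection can be sketched as follows. This is a generic pinhole-camera sketch in NumPy; the extrinsic and intrinsic matrices and the axis convention below are placeholders for illustration, not the dataset's actual calibration values.

```python
import numpy as np

def project_radar_to_image(pts_radar, T_cam_from_radar, K):
    """Project Nx3 radar points (radar frame, meters) to pixel coords + depth.

    T_cam_from_radar: 4x4 rigid transform taking radar-frame points into the
                      camera frame; K: 3x3 camera intrinsic matrix.
    """
    n = pts_radar.shape[0]
    pts_h = np.hstack([pts_radar, np.ones((n, 1))])       # homogeneous, Nx4
    pts_cam = (T_cam_from_radar @ pts_h.T).T[:, :3]       # Nx3 in camera frame
    depth = pts_cam[:, 2]
    in_front = depth > 0.1                                # keep points ahead of the camera
    uvw = (K @ pts_cam[in_front].T).T                     # perspective projection
    uv = uvw[:, :2] / uvw[:, 2:3]                         # pixel coordinates
    return uv, depth[in_front]

# Toy example with placeholder calibration (identity rotation, small offset).
K = np.array([[1250., 0., 800.], [0., 1250., 450.], [0., 0., 1.]])
T = np.eye(4); T[:3, 3] = [0.0, -1.5, 0.3]                # hypothetical radar->camera offset
radar_pts = np.array([[2.0, 0.5, 25.0], [-3.0, 0.2, 40.0]])   # assumed x right, y down, z forward
uv, d = project_radar_to_image(radar_pts, T, K)
print(uv, d)   # "radar pixels": image locations carrying the measured radar depths
```

The output image locations, carrying the measured radar depths, are what we refer to as radar pixels below.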
The nuScenes dataset [10] provides the extrinsic and intrinsic calibration parameters needed to map the radar point clouds from the radar coordinate system to the egocentric and camera coordinate systems. Assuming that the actual depth of the image pixel is the same as the radar pixel depth turns out to be fairly inaccurate. We describe some of the problems with this model, and then propose a new pixel association model. We present a method for building this new model and show its benefit by incorporating it into radar-camera depth completion. Fig. 2.2 shows the diagram of the proposed method.

Figure 2.2 Our two-stage architecture. Network 1 learns the 𝑁-channel radar-camera pixel depth association (RC-PDA), here illustrated for two radar pixels (marked with white squares) on their neighboring pixels (white boxes). The RC-PDA is converted into a multi-channel enhanced radar (MER) image, and input to Network 2, which performs image-guided depth completion.

Figure 2.3 Examples of radar hits projected into a camera. While the hits project into the vicinity of the target that they hit, their image position can be quite different from their actual location. For example, radar depths in the yellow/red box are larger/smaller than the corresponding image depths (meters).

2.3.1 Radar Hit Projection Model

Single-row scanning automotive radars can be modeled as measuring points in a plane extending usually horizontally (relative to the vehicle platform) in front of the vehicle, as in [12]. While radars can measure accurate depth, the depth they give when projected into a camera is often incorrect, as can be seen in the examples in Fig. 2.3. An important source of this error is the large width of radar beams, which means that the hits extend well beyond the assumed horizontal plane. In other words, the height of measured radar hits is inaccurate [91]. In addition to beam width, another source of projected point depth difference is occlusion, caused by the significant baseline between radars on the grille and cameras on the roof or driver mirror. Further, these depth differences only increase when radar hits are accumulated over a short interval, which creates more opportunities for occlusions.

In addition to pixel association errors, we are faced with the problem that automotive radars generate far sparser depth scans than lidar. There is typically a single row of returns, rather than anywhere up to 128 rows in lidar, and the azimuth spacing of radar returns can be an order of magnitude greater than lidar. This sparsity significantly increases the difficulty of depth completion. One solution is to accumulate radar pixels over a short time interval, and to account for their 3D position using both ego-motion and radial velocity. Nevertheless, this accumulation introduces additional pixel association errors (in part from not having tangential velocity) and more opportunities for occlusions.

2.3.2 Radar-Camera Pixel Depth Association

In using radar to aid depth estimation we face the problem of determining which point in the image, if any, a radar return corresponds to. This radar pixel to camera pixel association is a difficult problem, and we do not have ground truth to determine it. Thus we reformulate the problem slightly to make it more tractable.
The new question we ask is: “Which pixels in the vicinity of the projected radar pixel have the same depth as that radar return?” We call this Radar-Camera Pixel Depth Association: RC-PDA, or simply PDA. It is a one-to-many mapping, rather than a one-to-one mapping, and has four key advantages. First, we do not need to distinguish between many good but ambiguous matches and instead can return many pixels with the same depth; this simplifies the problem. Second, by associating the radar return with multiple pixels, our method explicitly densifies the radar depth map, which facilitates the second stage of full-image depth estimation. Third, our question simultaneously addresses the occlusion problem: if there are no nearby pixels with that depth, then the radar pixel is automatically inferred to be occluded. Fourth, we are able to leverage a lidar-based ground-truth depth map as the supervision, rather than a difficult-to-define “ground truth” pixel association.

Fig. 2.4 illustrates image depths obtained from raw radar projections and RC-PDA around each radar pixel. It shows the height errors of measured radar points and how some hits visible to the radar are occluded from the camera.

Figure 2.4 Illustration of depth differences between camera and radar, and how our proposed association method (Pixel Depth Association: RC-PDA) can address this. (a) Radar hits are modeled on a ground-parallel plane (dashed black line). Actual returns may be outside this plane, as illustrated with orange stars 𝐴, 𝐵, 𝐶 at depths 𝐷𝐴, 𝐷𝐵, 𝐷𝐶 respectively. We project the corresponding in-plane points 𝐴𝑝, 𝐵𝑝, 𝐶𝑝 (green diamonds) into the camera, and call these the radar pixels. (b) The camera view showing the radar pixels 𝐴𝑝, 𝐵𝑝, 𝐶𝑝. Now the true image depth of these pixels is 𝐷𝐴, the front of the truck, which agrees only with 𝐴𝑝, which is visible, and not with 𝐵𝑝 and 𝐶𝑝, which are occluded. This illustrates why radar pixel depths are often incorrect from the camera perspective. Finding associations from radar pixels to the projected true points 𝐴, 𝐵, 𝐶 would solve this, but is difficult. Rather, we seek a neighborhood depth association for each radar pixel that specifies which pixels within a neighborhood (dashed blue regions) have the same depth as the radar pixel, shown here by the orange regions. For example, the orange pixels in the neighborhood of 𝐴𝑝 have an RC-PDA of 1 while the remaining neighborhood pixels have an RC-PDA of 0, all relative to 𝐴𝑝. See Sec. 2.3.2 for details.

2.3.2.1 RC-PDA Model

We model RC-PDA over a neighborhood around the projected radar pixel in the color image. At each radar pixel we define a patch around the radar location and seek to classify each pixel in this patch as having the same depth or not as the radar pixel, within a predetermined threshold. A similar connectivity model has been used for image segmentation [39]. Radar pixels and the patches around them are illustrated in Figs. 2.5(a) and (b), respectively. The connection to each pixel in an ℎ × 𝑤 neighborhood has 𝑁 = 𝑤ℎ elements, and can be encoded as an 𝑁-channel RC-PDA which we label A(𝑖, 𝑗, 𝑘), where 𝑘 = 1, · · · , 𝑁. Here (𝑖, 𝑗) is the radar pixel coordinate, and the 𝑘’th neighbor has offset (𝑖𝑘, 𝑗𝑘) from (𝑖, 𝑗).

Figure 2.5 Overview of our radar depth representation. (a) Radar pixels indicate sparse depth in image space.
(b) For each radar pixel, a Pixel Depth Association (RC-PDA) probability over neighboring pixels is calculated, indicated with shaded contours. (c) Radar pixel depths are propagated to neighboring pixels to create a Multi-Channel Enhanced Radar (MER) image. Each channel is a densified depth at a given confidence level.

The label A(𝑖, 𝑗, 𝑘) is 1 if the neighboring pixel has the same depth as the radar pixel and 0 otherwise. More precisely, if 𝐸𝑖𝑗𝑘 = 𝑑(𝑖, 𝑗) − 𝑑𝑇(𝑖 + 𝑖𝑘, 𝑗 + 𝑗𝑘) is the difference between the radar pixel depth, 𝑑(𝑖, 𝑗), and the neighboring lidar pixel depth, 𝑑𝑇(𝑖 + 𝑖𝑘, 𝑗 + 𝑗𝑘), and ˜𝐸𝑖𝑗𝑘 = 𝐸𝑖𝑗𝑘/𝑑(𝑖, 𝑗) is the relative depth difference, then:

\[
A(i,j,k) =
\begin{cases}
1, & \text{if } (|E_{ijk}| < T_a) \wedge (|\tilde{E}_{ijk}| < T_r) \\
0, & \text{otherwise.}
\end{cases}
\tag{2.1}
\]

We note that labels A(𝑖, 𝑗, 𝑘) are only defined when there is both a radar pixel at (𝑖, 𝑗) and a lidar depth 𝑑𝑇(𝑖 + 𝑖𝑘, 𝑗 + 𝑗𝑘). We define a binary weight 𝑤(𝑖, 𝑗, 𝑘) ∈ {0, 1} to be 1 when both conditions are satisfied and 0 otherwise. During training we minimize the weighted binary cross-entropy loss [39] between the labels A(𝑖, 𝑗, 𝑘) and the predicted RC-PDA:

\[
\mathcal{L}_{CE} = \sum_{i,j,k} w(i,j,k)\,\big[ -A(i,j,k)\, z(i,j,k) + \log\big(1 + \exp(z(i,j,k))\big) \big].
\tag{2.2}
\]

The network output, 𝑧(𝑖, 𝑗, 𝑘), is passed through a sigmoid to obtain ˆA(𝑖, 𝑗, 𝑘), the estimated RC-PDA. Our network thus predicts an RC-PDA confidence in the range 0 to 1 representing the probability that each pixel in this patch has the same depth as the radar pixel. This prediction also applies to the image pixel at the same coordinates as the radar pixel, i.e., (𝑖, 𝑗), since, like other pixels, the depth at this image pixel may differ from the radar depth for a variety of reasons, including those illustrated in Fig. 2.4.

2.3.3 From RC-PDA to MER

The RC-PDA gives the probability that neighboring pixels have the same depth as the measured radar pixel. We can convert the radar depths along with the predicted RC-PDA into a partially filled depth image plus a corresponding confidence as follows. Each of the 𝑁 neighbors of a given radar pixel is given depth 𝑑(𝑖, 𝑗) and confidence ˆA(𝑖, 𝑗, 𝑘). If more than one radar depth is expanded to the same pixel, the radar depth with the maximum RC-PDA is kept. The expanded depth is represented as D(𝑖, 𝑗) with confidence ˆA(𝑖, 𝑗). Now many of the low-confidence pixels will have incorrect depth. Instead of eliminating low-confidence depths, we convert this expanded depth image into a multi-channel image where each channel 𝑙 is given depth D(𝑖, 𝑗) if its confidence ˆA(𝑖, 𝑗) is greater than a channel threshold 𝑇𝑙, where 𝑙 = 1, · · · , 𝑁𝑒 and 𝑁𝑒 is the total number of channels of the enhanced depth. The result is a Multi-Channel Enhanced Radar (MER) image with each channel representing radar-derived depth at a particular confidence level (see Fig. 2.5(c)).

Our MER representation for depth can correctly encode many complex cases of radar-camera projection, a few of which are illustrated in Fig. 2.4. These cases include when radar hits are occluded and no nearby pixels have similar depth. They also include cases where the radar pixel is just inside or just outside the boundary of a target. In each case, those nearby pixels with the same depth as the radar can be given the radar depth with high confidence, while the remaining neighborhood pixels are given low confidence, and their depths are specified on separate channels of the MER. The purpose of using multiple channels for depths with different confidences in MER is to facilitate the task of Network 2 in Fig. 2.2 in performing the dense depth completion. High confidence channels give the greatest benefit, but low confidence channels may also provide useful data. In all cases they densify the depth beyond single radar pixels, easing the depth completion task.
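The following is a minimal sketch of how the MER could be assembled from the predicted RC-PDA, assuming the network output has already been reshaped into one score per (radar pixel, neighborhood offset) pair; the array shapes, the offset list, and the helper name build_mer are illustrative rather than the released implementation.

```python
import numpy as np

def build_mer(radar_uv, radar_depth, pda, offsets, H, W,
              thresholds=(0.5, 0.6, 0.7, 0.8, 0.9, 0.95)):
    """Expand sparse radar depths into a Multi-Channel Enhanced Radar (MER) image.

    radar_uv:    (M, 2) integer pixel coords of projected radar points.
    radar_depth: (M,)   depth of each radar point (m).
    pda:         (M, N) predicted RC-PDA score for each of N neighborhood offsets.
    offsets:     (N, 2) (du, dv) offsets defining the neighborhood.
    Returns MER of shape (len(thresholds), H, W), with 0 marking "no depth".
    """
    best_conf = np.zeros((H, W), dtype=np.float32)
    best_depth = np.zeros((H, W), dtype=np.float32)

    for (u, v), d, scores in zip(radar_uv, radar_depth, pda):
        for (du, dv), s in zip(offsets, scores):
            uu, vv = u + du, v + dv
            if 0 <= uu < W and 0 <= vv < H and s > best_conf[vv, uu]:
                # If several radar points expand to the same pixel,
                # keep the depth with the highest RC-PDA.
                best_conf[vv, uu] = s
                best_depth[vv, uu] = d

    mer = np.zeros((len(thresholds), H, W), dtype=np.float32)
    for l, t in enumerate(thresholds):
        mer[l][best_conf > t] = best_depth[best_conf > t]
    return mer

# Toy usage: one radar point, a 3x3 neighborhood, a tiny image.
offsets = np.array([(du, dv) for dv in (-1, 0, 1) for du in (-1, 0, 1)])
mer = build_mer(np.array([[4, 4]]), np.array([12.0]),
                np.random.rand(1, 9), offsets, H=8, W=8)
print(mer.shape)   # (6, 8, 8): one densified depth map per confidence threshold
```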
Figure 2.6 An example of how radar scene flow and optical flow differences are used to infer occlusions of radar pixels: (a) radar depth, (b) optical flow, (c) radar flow. The radar flows are plotted in yellow if the 𝐿2 norm of the radar/optical flow difference is larger than a threshold. Note that we do not explicitly filter radar, but rather provide flow to Network 1 in Fig. 2.2 so that it can implicitly filter radar while estimating RC-PDA.

2.3.4 Estimating RC-PDA

We next select the inputs to Network 1 in Fig. 2.2 from which it can learn to infer the RC-PDA. These are the image, the radar pixels with their depths, as well as the image flow and the radar flow from the current to a neighboring frame. Here we briefly explain the intuition for each of these. The image provides scene context for each radar pixel, as well as object boundary information. The radar pixels provide depth for interpreting the context and a basis for predicting the depth of nearby pixels. As radar is very sparse, we accumulate radar from a short time history, 0.3 seconds, and transform it into the current frame using both ego-motion and the radial velocity, similar to what is done in [93]. Now a pairing of image optical flow and radar scene flow provides an occlusion and depth difference cue. For static objects, the optical flow should exactly equal the radar scene flow when the pixel depth is the same as the radar pixel depth. Conversely, radar pixels that are occluded from the camera view will have a scene flow different from the optical flow of the static object occluding them (Fig. 2.6). Similarly, objects moving radially will have consistent flow. By providing flow, we expect that Network 1 will learn to leverage flow similarity in predicting RC-PDA for each radar pixel.

Figure 2.7 We noticed that when lidar with a regular scan pattern, as in (b) for image (a), is used to train depth completion, our network learns to predict the lidar points well, but not the remaining pixels. This leaves large artifacts, as in (c), and motivates us to create a semi-dense depth lidar training set.

2.3.5 Lidar-based Supervision

To train both the RC-PDA and the final dense depth estimate, we use a dense ground truth depth. This is because, as illustrated in Fig. 2.7, training with sparse lidar leads to significant artifacts. We now describe how we build a semi-dense depth image from lidar scans.

2.3.5.1 Lidar Accumulation

To our knowledge, there is no existing public dataset specifically designed for depth completion with radar. Thus we create a semi-dense ground truth depth from the nuScenes dataset, a public dataset with radar data that is designed for object detection and segmentation. We use the 32-beam lidar as the depth label and notice that a sparse depth label generated from a single frame leads to a biased model that predicts depth with artifacts, i.e., only predictions for pixels with ground truth are reasonable. Thus, we use semi-dense lidar depth as the label, which is created by accumulating multiple lidar frames. With ego motion and calibration parameters, all static points can be transformed to the destination image frame. Moving points are compensated by bounding box poses at each frame, which are estimated by interpolating the bounding boxes provided by nuScenes in key frames.
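A minimal sketch of the accumulation step for static points is shown below, assuming each sweep comes with a 4×4 pose of the sensor in a global frame (as nuScenes provides via ego pose and sensor calibration). The function names are illustrative, and the correction of moving points by interpolated box poses is omitted.

```python
import numpy as np

def transform(T, pts):
    """Apply a 4x4 rigid transform T to an Nx3 point array."""
    return (T[:3, :3] @ pts.T).T + T[:3, 3]

def accumulate_static_lidar(sweeps, T_global_from_dst):
    """Merge several lidar sweeps into the destination (key-frame) sensor frame.

    sweeps: list of (points_Nx3_in_sensor_frame, T_global_from_sensor) pairs.
    T_global_from_dst: pose of the destination sensor in the global frame.
    """
    T_dst_from_global = np.linalg.inv(T_global_from_dst)
    merged = []
    for pts, T_global_from_src in sweeps:
        # source sensor frame -> global frame -> destination sensor frame
        merged.append(transform(T_dst_from_global @ T_global_from_src, pts))
    return np.vstack(merged)

# Toy usage: two sweeps related by a 1 m forward ego motion along x.
T0, T1 = np.eye(4), np.eye(4)
T1[0, 3] = 1.0
pts = np.array([[10.0, 0.0, 0.0]])
print(accumulate_static_lidar([(pts, T0), (pts, T1)], T_global_from_dst=T1))
```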
2.3.5.2 Occlusion Removal via Flow Consistency

When a foreground object occludes some of the accumulated lidar points, the resulting dense depth may include depth artifacts, as the occluded pixels appear in gaps in the foreground object. KITTI [131] takes advantage of the depth from stereo images to filter out such occluded points. As no stereo images are available in nuScenes, we propose detecting and removing occluded lidar points based on optical-scene flow consistency. The scene flow of lidar points, termed lidar flow, is computed by projecting lidar points into two neighboring images and measuring the change in their coordinates. On moving objects, the point positions are corrected with the object motion. On static visible objects, lidar flow will equal optical flow, while on occluded surfaces lidar flow is usually different from the optical flow at the same pixel, see Fig. 2.8. We calculate optical flow with [129] pretrained on KITTI, and measure the difference between the two flows at the same pixel via the 𝐿2 norm of their difference. Points with a flow difference larger than a threshold 𝑇𝑓 are discarded as occluded points. Fig. 2.9 shows an example of using flow consistency to filter out occluded lidar depths.

Figure 2.8 An example of how lidar scene flow and optical flow differences are used to infer occlusions of lidar pixels: (a) lidar depth, (b) optical flow, (c) lidar flow. Lidar flows are plotted in yellow if the 𝐿2 norm of the lidar/optical flow difference is larger than a threshold. This is used in the accumulation of lidar for building ground-truth depth maps, see Fig. 2.9.

Figure 2.9 An example of using lidar flow and optical flow consistency to filter occluded pixels: (a) accumulated lidar depth, and (b) accumulated lidar depth with flow consistency filtering.

2.3.5.3 Occlusion Removal via Segmentation

Flow-based occluded pixel removal may fail in two cases. When there is little to no parallax, both optical and scene flow will be small, and their difference becomes unmeasurable. This occurs mostly at long range or along the motion direction. Further, lidar flow on moving objects can in some cases be identical to the occluded lidar flow behind it. In both of these cases flow consistency is insufficient to remove occluded pixels from the final depth estimate. To solve this problem, we use a combination of 3D bounding boxes and semantic segmentation to remove occluded points appearing on top of objects. First, the accurate pixel region of an instance is determined by the intersection of the 3D bounding box projection and the semantic segmentation. The maximum depth of the bounding box corners is used to decide whether lidar points falling on the object are on it or behind it. Points within the semantic segmentation and closer than this maximum distance are kept, while points in the segmentation and behind the bounding box are filtered out as occluded lidar points. Fig. 2.10 shows an example of removing occluded points appearing on vehicle instances. We use a semantic segmentation model [15] pre-trained on Cityscapes [17] to segment vehicle pixels.

2.3.6 Algorithm Summary

We propose a two-stage depth estimation process, as in Fig. 2.2. Stage 1 estimates the RC-PDA for each radar pixel, which is transformed into our MER representation as detailed in Sec. 2.3.3 and fed into Stage 2, which performs conventional depth completion. Both stages are supervised by the accumulated dense lidar, with pixels not having a lidar depth given zero weight.
Network 1 uses an encoder-decoder network with skip connections similar to U-Net [108] and [78], with details in the supplementary material.

2.4 Experimental Results

Dataset. We train and test on a subset of images from the nuScenes dataset [10], including 12,610, 1,628, and 1,623 samples for training, validation, and testing, respectively. The data are collected with a moving ego vehicle so that the flow calculation described in Sec. 2.3.5.2 can be applied. The depth range for training and testing is 0-50 meters. The resolution of inputs and outputs is 400 × 192. As described in Sec. 2.3.5, we build semi-dense depth images by accumulating lidar pixels from 21 subsequent frames and 4 previous frames (sampled every other frame), and use these for supervision.

Figure 2.10 For small-flow instances and some movers, flow consistency is insufficient to remove accumulated but occluded lidar pixels, see (a,c). To remove these occluded pixels, we first find vehicle pixels as the intersection between semantic segmentation and the 2D bounding box, see (b). From the 3D bounding box we know the maximum depth of the vehicle, and so can filter out all accumulated depths greater than this that are actually occluded, see (d). Panels: (a) car image, (b) semantic segmentation and bounding box, (c) depth before filtering, (d) depth after filtering.

Implementation details. For parameters, we use 𝑇𝑎 = 1 m and 𝑇𝑟 = 0.05 for Eq. 2.1. In Sec. 2.3.5.2 we use 𝑇𝑓 = 3 to decide flow consistency. MER has 6 channels with 𝑇1 to 𝑇6 set to 0.5, 0.6, 0.7, 0.8, 0.9, and 0.95, respectively. At Stage 1, we use a U-Net with 5 levels of resolution and 180 output channels, corresponding to 180 pixels in a rectangular neighborhood of size 𝑤 = 5 and ℎ = 36. As the radar points are typically on the lower part of the image, to fully leverage the neighborhood, the neighborhood center is placed below the rectangle center, with 30 pixels above, 5 pixels below, and 2 pixels on the left and right, providing more space for radar points to extend upwards. At Stage 2, we employ two existing depth completion architectures, [79] and [56], originally designed for lidar-camera pairs.

Network details. Fig. 2.11 shows the typical U-Net structure [108] that we use for both the Stage 1 network and the Stage 2 architecture [79]. It has five levels of resolution, where downsampling and upsampling are achieved by max-pooling and nearest-neighbor interpolation. Except for the input and output, all tensors in the network have the same number of channels 𝑁𝑐. All convolutions use 3 × 3 filters. 𝐵0 represents the input block, a convolutional layer that increases the number of channels to 𝑁𝑐. 𝐵1 denotes residual blocks connected in series, and 𝐵2 is a sequence of convolutional blocks where each block consists of batch normalization, ReLU, and convolution. 𝐵3 is the output block, which includes batch normalization, ReLU, and a convolutional layer changing the number of channels from 𝑁𝑐 to the number of output channels. In our experiments, for the Stage 1 network, we set 𝑁𝑐 to 80; there are two residual blocks in 𝐵1 and four convolutional blocks in 𝐵2. For the Stage 2 network, 𝑁𝑐 is set to 64; 𝐵1 has four residual blocks, and 𝐵2 has 8 convolutional blocks. We use RMSProp as the optimizer with a learning rate of 5 × 10⁻⁵.

Figure 2.11 (a) U-Net used in our experiments. (b) Diagram of block 𝐵1, a series of residual blocks (shown as blue boxes). (c) Diagram of block 𝐵2, a sequence of convolutional blocks (shown as blue boxes). 𝐵𝑁 and 𝐶𝑜𝑛𝑣 denote batch normalization and a convolutional layer, respectively.
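For concreteness, here is a PyTorch-style sketch of the 𝐵1 and 𝐵2 blocks described above. The internal layout of each residual block is an assumption on our part (the figure only shows them as blue boxes), so treat this as an illustrative reconstruction rather than the released code.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One assumed residual block inside B1: Conv-BN-ReLU-Conv-BN plus a skip."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

class ConvBlock(nn.Module):
    """One block of B2: BatchNorm -> ReLU -> 3x3 convolution."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return self.body(x)

def make_b1(ch, n_res):    # B1: residual blocks connected in series
    return nn.Sequential(*[ResidualBlock(ch) for _ in range(n_res)])

def make_b2(ch, n_conv):   # B2: convolutional blocks connected in series
    return nn.Sequential(*[ConvBlock(ch) for _ in range(n_conv)])

# Stage-1 settings from the text: 80 channels, 2 residual + 4 conv blocks.
b1, b2 = make_b1(80, 2), make_b2(80, 4)
x = torch.randn(1, 80, 48, 100)
print(b2(b1(x)).shape)     # torch.Size([1, 80, 48, 100])
```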
Figure 2.12 (a) Raw radar depths. (b) Each color pixel with a maximum RC-PDA > 0.6 is marked with a color indicating which radar pixel it is associated with. (c) The RC-PDA score, with values > 0.6, for each pixel. (d) The MER channel with RC-PDA > 0.6. (e) Our final predicted depth. (f) Depth from monocular input to Stage 2. (g) Depth from monocular and raw radar input to Stage 2.

Inference time. Using Intel Core i7-8700 CPUs and an NVIDIA GeForce RTX 2080 Ti GPU, the inference times of Network 1 and Network 2 are 4.5 ms and 8.2 ms per frame, respectively.

2.4.1 Visualization of Predicted RC-PDA

The predicted RC-PDA and the estimated depths from Stage 2 [79] are visualized in Fig. 2.12. Column (a) shows the raw radar pixels plotted on images, which often include occluded radar pixels. Column (b) shows how image pixels are associated with different radar pixels according to their maximum RC-PDA. Radar pixels and their associated neighboring pixels are marked with the same color. Notice in column (c) that RC-PDA is high within objects and decreases after crossing boundaries. Occluded radar depths are mostly discarded as their predicted RC-PDA is low. In column (e), the dense depths predicted from MER are improved over predictions from (g) raw radar and/or (f) monocular. For example, in Row 2 of Fig. 2.12, our predicted pole depth in (e) has better boundaries than monocular-only in (f) and monocular plus raw radar in (g). How we achieve this can be intuitively understood by comparing raw radar in (a) with our MER in (d), the output of Stage 1. While raw radar has many incorrect depths, MER selects correct radar depths and extends these depths along the pole and background, enabling improved final depth inference.

Network 2        Input              MAE    Abs Rel  RMSE    RMSE log
Ma et al. [79]   Image              2.385  0.110    3.505   0.150
Ma et al. [79]   Image, radar       1.609  0.078    2.865   0.126
Ma et al. [79]   Image, radar, MER  1.229  0.058    2.651   0.114
Li et al. [56]   Image, radar       1.759  0.084    3.039   0.133
Li et al. [56]   Image, radar, MER  1.274  0.061    2.670   0.116
None             MER                1.251  0.059    2.701   0.117
None             Radar              7.369  0.475    10.900  0.448

Table 2.1 Depth error (m) in image regions around non-occluded radar returns, defined as regions with RC-PDA > 0.9.

2.4.2 Accuracy of MER

To be useful in improving radar-camera depth completion, the enhanced radar depth in the vicinity of radar points should be better than the alternatives. We compare the depth error of the MER in regions where RC-PDA is > 0.9 with a few baseline methods, and the results are shown in Tab. 2.1. The enhanced radar depth from Stage 1 improves over not only the raw radar depth but also the depth estimates from Stage 2 using monocular as well as monocular plus radar inputs. In comparison, Stage 2 keeps the accuracy of the enhanced radar depth when using it for depth completion. The depth error for raw radar depth is very large since many radar returns are occluded and lie far behind the foreground. About 35% of radar points in the test frames have a maximum RC-PDA smaller than 𝑇1 in their neighborhood and are discarded as occluded points. Further, Fig. 2.13 shows the depth error and per-image average area of expanded depth for the 6 MER channels, respectively. It shows that, as confidence increases, higher RC-PDA corresponds to higher accuracy and smaller expanded areas.

2.4.3 Comparison of Depth Completion

To evaluate the effectiveness of the enhanced radar depth in depth completion, we compare the depth error with and without using MER as input for Network 2, and show the performance in Tabs. 2.2 and 2.3.
The results show that including radar improves depth completion over monocular input, while using our proposed MER further improves the accuracy of depth completion for the same network. Qualitative comparisons between depth completion [79] with and without using MER are shown in Fig. 2.14. They show improvement from MER in estimating object depth boundaries, including close objects (such as the traffic sign in the bottom image) and far objects.

Figure 2.13 Image area and depth error of enhanced radar in MER for regions with minimal RC-PDA at 6 confidence levels.

Network 2        Input              MAE    Abs Rel  RMSE   RMSE log
Ma et al. [79]   Image              1.808  0.102    3.552  0.160
Ma et al. [79]   Image, radar       1.569  0.090    3.327  0.152
Ma et al. [79]   Image, radar, MER  1.472  0.085    3.179  0.144
Li et al. [56]   Image, radar       1.821  0.107    3.650  0.170
Li et al. [56]   Image, radar, MER  1.655  0.094    3.463  0.159

Table 2.2 Full-image depth estimation/completion errors (m).

Network 2        Input              MAE    Abs Rel  RMSE   RMSE log
Ma et al. [79]   Image              2.673  0.153    4.259  0.202
Ma et al. [79]   Image, radar       2.263  0.134    4.028  0.194
Ma et al. [79]   Image, radar, MER  2.078  0.124    3.864  0.183
Li et al. [56]   Image, radar       2.515  0.154    4.266  0.211
Li et al. [56]   Image, radar, MER  2.189  0.132    3.943  0.193

Table 2.3 Depth estimation/completion errors (m) in the low-height region (0.3-2 meters above ground).

Figure 2.14 Qualitative depth completion comparison showing gains from using MER over raw radar. (a) Raw radar on top of the image versus (b) a MER channel with RC-PDA > 0.8 on top of the image. Depth completion (c) without and (d) with using MER.

2.4.4 Ablation Studies

2.4.4.1 Neighborhood Sizes

The neighborhood is a rectangular region around a radar pixel, which is defined as the neighborhood center. The neighborhood is designed to let the radar pixel have more space to expand upwards, and thus, typically, the neighborhood center does not correspond to the geometric center. As shown in Fig. 2.15, we use (𝑤1, 𝑤2, ℎ1, ℎ2) to represent the shape of a neighborhood, where 𝑤1 and 𝑤2 are the numbers of horizontal neighboring pixels to the left and right of the neighborhood center, and ℎ1 and ℎ2 denote the numbers of vertical neighboring pixels above and below the center. As shown in Table 2.4, we test different neighborhood sizes for the Stage 1 network and compute the resultant depth error after using them in the Stage 2 network [79]. The 𝑇𝑙 used to create MER in Stage 2 are (0.6, 0.7, 0.8, 0.9, 0.95).

Figure 2.15 A neighborhood with shape (𝑤1, 𝑤2, ℎ1, ℎ2). The small square represents a radar pixel.

Neighborhood   MAE    Abs Rel  RMSE   RMSE log
(2, 2, 30, 5)  1.487  0.085    3.210  0.145
(3, 3, 20, 5)  1.497  0.086    3.210  0.146
(1, 1, 8, 1)   1.536  0.088    3.274  0.148

Table 2.4 Full-image depth estimation/completion errors (m).

2.4.4.2 PDA Thresholds for MER

MER can be generated with different PDA thresholds 𝑇𝑙, where 𝑙 = 1, 2, ..., 𝑁𝑒. With a fixed neighborhood size of (2, 2, 30, 5), we test different PDA thresholds and compute the resultant depth error. Results are shown in Table 2.5.

PDA Thresholds            MAE    Abs Rel  RMSE   RMSE log
(0.6 : 0.1 : 0.9, 0.95)   1.487  0.085    3.210  0.145
(0.5 : 0.1 : 0.9, 0.95)   1.472  0.085    3.179  0.144
(0.6 : 0.05 : 0.9, 0.95)  1.494  0.084    3.229  0.146

Table 2.5 Full-image depth estimation/completion errors (m) with MER created by different thresholds. In the first column, “𝑎:𝑠:𝑏” indicates a set of thresholds 𝑎, 𝑎 + 𝑠, 𝑎 + 2𝑠, · · · , 𝑏.

2.4.5 Visualization of MER

Fig. 2.16 shows MER created with PDA thresholds of 0.6, 0.7, 0.8, 0.9, and 0.95, respectively. We can see that the expanded region shrinks as the PDA threshold increases.
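For reference, the error metrics reported in Tables 2.1-2.5 can be computed as in the following sketch, assuming the standard depth-completion definitions of MAE, Abs Rel, RMSE, and RMSE log and evaluation only at pixels with lidar ground truth.

```python
import numpy as np

def depth_metrics(pred, gt, valid=None):
    """MAE, Abs Rel, RMSE, and RMSE log between predicted and GT depth (m)."""
    if valid is None:
        valid = gt > 0                  # evaluate only where GT lidar depth exists
    p, g = pred[valid], gt[valid]
    err = p - g
    return {
        "MAE":      np.mean(np.abs(err)),
        "Abs Rel":  np.mean(np.abs(err) / g),
        "RMSE":     np.sqrt(np.mean(err ** 2)),
        "RMSE log": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
    }

# Toy check on random depths within the 0-50 m evaluation range.
gt   = np.random.uniform(1.0, 50.0, size=(192, 400))
pred = np.clip(gt + np.random.normal(0.0, 1.5, gt.shape), 0.5, None)
print(depth_metrics(pred, gt))
```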
We can see the expanded region decreases as PDA threshold increases. 2.5 Summary Radar-based depth completion introduces additional challenges and complexities beyond lidar- based depth completion. A significant difficulty is the large ambiguity in associating radar pixels with image pixels. We address this with RC-PDA, a learned measure that associates radar hits with nearby image pixels at the same depth. From RC-PDA we create an enhanced and densified radar image called MER. Our experiments show that depth completion using MER achieves improved accuracy over depth completion with raw radar. As part of this work we also create a semi-dense accumulated lidar depth dataset for training depth completion on nuScenes. 28 (a) (b) (c) (d) (e) (f) Figure 2.16 (a) Raw radar depth. MER with PDA thresholds (b) 0.6, (c) 0.7, (d) 0.8, (e) 0.9, and (f) 0.95. Occluded radar points are filtered out, and corrected depth are expanded in MER. 29 CHAPTER 3 FULL-VELOCITY RADAR RETURNS BY RADAR-CAMERA FUSION A distinctive feature of Doppler radar is the measurement of velocity in the radial direction for radar points. However, the missing tangential velocity component hampers object velocity estimation as well as temporal integration of radar sweeps in dynamic scenes. Recognizing that fusing camera with radar provides complementary information to radar, in this chapter we present a closed-form solution for the point-wise, full-velocity estimate of Doppler returns using the corresponding optical flow from camera images. Additionally, we address the association problem between radar returns and camera images with a neural network that is trained to estimate radar-camera correspondences. Experimental results on the nuScenes dataset verify the validity of the method and show significant improvements over the SOTA in velocity estimation and accumulation of radar points. This chapter was previously published as [75]. 3.1 Introduction Radar is a mainstream automotive 3D sensor, and along with lidar and camera, is used in perception systems for driving assistance and autonomous driving [6, 59, 125]. Unlike lidar, radar has been widely installed on existing vehicles due to its relatively low cost and small sensor size, which makes it an easy fit into various vehicles without changing their appearance. Thus, advances in radar vision systems have potential to make immediate impact on vehicle safety. Recently, with the release of a couple of autonomous driving datasets with radar data included, e.g., Oxford Radar RobotCar [3] and nuScenes [10], there is great interest in the community to explore how to leverage radar data in various vision tasks such as object detection [91, 149]. In addition to measuring 3D positions, radar has the special capability of obtaining radial velocity of returned points based on the Doppler effect. This extra capability is a significant advantage over other 3D sensors like lidar, enabling, for instance, instantaneous moving object detection. However, due to the inherently ambiguous mapping from radial velocity to full velocity, using radial velocity directly to account for the real movement of radar points is inadequate and sometimes misleading. Here, the full velocity denotes the actual velocity of radar points in 2D or 30 3D space. While radial velocity can well approximate full velocity when a point is moving away from or towards the radar, these two can be very different when the point is moving in the non-radial directions. 
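A small numerical illustration of this difference, with arbitrarily chosen velocities and a target straight ahead of the radar, shows how the Doppler measurement can miss the motion entirely:

```python
import numpy as np

# Radial direction to a target straight ahead of the radar (BEV: x forward, y left).
r_hat = np.array([1.0, 0.0])

v_tangential = np.array([0.0, 10.0])   # 10 m/s purely tangential motion
v_radial = np.array([10.0, 0.0])       # 10 m/s purely radial motion

# The Doppler radial speed is the projection of the full velocity onto r_hat.
print(r_hat @ v_tangential)  # 0.0  -> radar reports a stationary target
print(r_hat @ v_radial)      # 10.0 -> radial speed equals the full speed
```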
An extreme case occurs for objects moving tangentially as these will have zero radial velocity regardless of target speed. Therefore, acquiring point-wise full velocity instead of radial velocity is crucial to reliably sense the motion of surrounding objects. Apart from measuring the velocity of objects, another important application of point-wise velocity is the accumulation of radar points. Radar returns from a single frame are much sparser than lidar in both azimuth and elevation, e.g., typically lidar has an azimuth resolution 10× higher than radar [149]. Thus, it is often essential to accumulate multiple prior radar frames to acquire sufficiently dense point clouds for downstream tasks, e.g., object detection [12, 14, 93]. To align radar frames, in addition to compensating egomotion, we shall consider the motion of moving points in consecutive frames, which can be estimated by point-wise velocity and time of movement. As the radial velocity does not reflect the true motion, it is desirable to have point-wise full velocity for point accumulation. To solve the aforementioned dilemma of radial velocity, we propose to estimate point-wise full velocity of radar returns by fusing radar with a RGB camera. Specifically, we derive a closed-form solution to infer point-wise full velocity from radial velocity as well as associated projected image motion obtained from optical flow. As shown in Fig. 3.1, constraints imposed by optical flow resolve the ambiguities of radial-full velocity mapping and lead to a unique and closed-form solution for full velocity. Our method can be considered as a way to enhance raw radar measurement by upgrading point-wise radial velocity to full velocity, laying the groundwork for improving radar-related tasks, e.g., velocity estimation, point accumulation, and object detection. Moreover, a prerequisite for our closed-form solution is the association between moving radar points and image pixels. To enable a reliable association, we train a neural network to predict radar-camera correspondences as well as discerning occluded radar points. Experimental results demonstrate that the proposed method improves point-wise velocity estimates and their use for object velocity estimation and radar point accumulation. 31 (a) (b) (c) Figure 3.1 (a) Full motion cannot be determined with a single sensor: all motions ending on the blue dashed line (i.e., blue dashed arrows) map to the same optical flow and all motions terminated on the red dashed line (i.e., red dashed arrows) fit the same radial motion. However, with a radar- camera pair, the full motion can be uniquely decided: only the motion drawn in black satisfies both optical flow and radial motion. (b) Optical flow in the camera-image and (c) a BEV of the observed vehicle. This shows measured radar points with radial velocity (red), our predicted point-wise, full velocity (black), and GT full velocity of the vehicle (green). In summary, the main contributions of this work are: • We define a novel research task for radar-camera perception systems, i.e., estimating point- wise full velocity of radar returns by fusing radar and camera. • We propose a novel closed-form solution to infer full radar-return velocity by leveraging the radial velocity of radar points, optical flow of images, and the learned association between radar points and image pixels. • We demonstrate SOTA performance in object velocity estimation, radar point accumulation, and 3D object localization. 
32 RadarRadar PointRadial MotionCameraPossibleMotionsFlow12.515.017.520.022.525.027.530.032.5x42024y 3.2 Related Works Application of Radar in Vision. Radar data differs from lidar data in various aspects [9]. In addition to the popular point representation (also named radar target [97]), an analogy to lidar points, there are other radar data representations containing more raw measurements, e.g., range-azimuth image and spectrograms, which have been applied in tasks such as activity classification [115], detection [62], and pose estimation [109]. Our method is based on radar points, with the format available in the nuScenes dataset [10]. The characteristics of radar have been explored to complement other sensors. The Doppler velocity of radar points is used to distinguish moving targets. For example, RSS-Net [40] uses radial velocity as a motion cue for image semantic segmentation. Chadwick et al. [12] use radial velocity to detect distant moving vehicles—difficult to detect with only images. Fritsche et al. [24] combine radar with lidar for measurement under poor visibility. With a longer detection range than lidar, radar is also deployed with lidar to better detect far objects [149]. The sparsity of radar makes it difficult to directly apply well-developed techniques for lidar on radar [62, 91]. For example, Danzer et al. [18] adopt PointNets [102] on radar points for 2D car detection, while sparsity limits it to large objects like cars. Similar to lidar-camera depth comple- tion [33, 34], Long et al. [76] develop radar-camera depth completion by learning a probabilistic mapping from radar returns to images. To obtain denser radar points, Lombacher et al. [71] use occupancy grid [22] to accumulate radar frames. Yet, the method assumes a static scene and cannot cope with moving objects. Radar points are projected on images and represented as regions near projected points, such as vertical bars [93] and circles [12, 14], to account for uncertainty of projection due to measurement error. While accumulating radar frames is desirable, without reliably compensating object motion, these methods need to carefully decide the number of frames to trade off between the gain in accumulation and loss in accuracy due to delay [93]. Our estimated point-wise velocity can compensate object motion and realize more accurate accumulation. Velocity Estimation in Perception Systems. Researchers have used monocular videos [8] or radial velocity of radar points to estimate object-wise velocity. With only radar data of a single 33 frame, Kellner et al. [41, 42] compute full velocity of moving vehicles from radial velocities and azimuth angles of at least two radar hits. However, for a robust solution, the method requires that 1) radar captures more radar hits on each object; 2) radar points have significantly different azimuth angles; and 3) object points are clustered before velocity estimation [41, 112, 114]. Obviously due to sparsity of radar in a single frame, it is difficult to obtain at least two radar hits on distant vehicles, let alone objects of smaller sizes. Also, it is common that radar points on the same object, e.g., a distant or small object, have similar azimuth. Recognizing the density and accuracy limitation of radar, researchers fuse radar with other sensors, e.g., lidar and camera, for object-wise velocity estimation. Specifically, existing tech- niques [58, 144, 157] for images or lidar are employed to obtain preliminary detections. 
Radar data, including radial velocity, once associated with the initial detections, are used as additional cues to predict full velocities of objects. For instance, in RadarNet [149] temporal point clouds of radar and lidar, modeled as voxels, are used to acquire initial detections and their motions. Object motion direction is used to resolve the ambiguities in radar-point association by back-projecting their radial velocities on the motion direction. Yet, a sequence of lidar frames is required to obtain the initial detection and motion estimation. CenterFusion [91] integrates radar with camera for object-wise velocity estimation. Well- developed image-based detector is applied to extract preliminary boxes. After associating radar points with detections, the method combines radar data, radial velocity, and depth, with image features within detected regions to regress a full velocity per detection. However, without a closed- form solution, the mapping from radial to full velocity needs to be learned from a great number of labeled data. In contrast, we present a point-wise closed-form solution for full-velocity estimation of radar points, without performing object detection. To our knowledge, there is no prior method able to perform point-wise full-velocity estimation for radar returns. 3.3 Proposed Method We consider the case of a camera and radar rigidly attached to a moving platform, e.g., a vehicle, observing moving objects in the environment. In this section we develop equations relating optical 34 Figure 3.2 Full velocity estimation and learning to associate radar points to camera pixels. (a) A 3D point, 𝒑, is observed by a camera at 𝐵. A short interval, Δ𝑡, later, the point has moved by (cid:164)𝒎Δ𝑡 to 𝒒 while the camera has moved by (cid:164)𝒄Δ𝑡 to 𝐴. At the same time, the radar measures both the position of 𝒒 and the radial speed (cid:164)𝑟, which is the radial component of (cid:164)𝒎. Using radial speed (cid:164)𝑟 and the associated optical flow of 𝒒 in images, we derive a closed-form equation (denoted as 𝒇 ()) to estimate 𝒒’s full velocity (cid:164)𝒎. (b) As the closed-form solution requires point-wise association of two sensors, we train a Radar-2-Pixel (R2P) network to take a multi-channel input and predict the association probabilities for pixels within a neighborhood of the raw projection (white dot) obtained via known pose 𝐴 𝑅𝑻. A pixel with the highest probability (yellow arrow) is deemed as the associated pixel of a radar point. To obtain labels for training R2P, our label generation module uses 𝒇 () to compute velocities of all neighboring pixels, then calculates velocity error 𝐸𝑚 by using the GT velocity (cid:164)𝒎𝐺𝑇 , and finally obtains association probabilities of these neighbors based on 𝐸𝑚. flow measurements in the camera to position and velocity measurements made by the radar. 3.3.1 Physical Configuration and Notation The physical configuration of our camera and radar measurements is illustrated in Fig. 3.2(a). Three coordinate systems are shown: 𝐴 and 𝐵 specifying camera poses and 𝑅 specifying a radar pose. The camera at 𝐵 observes a 3D point 𝒑. A short interval later, Δ𝑡, the point has moved to 𝒒, the camera to 𝐴 and the radar to 𝑅, and both the camera and radar observe the target point 𝒒. These 3D points are specified by 4-dim homogeneous vectors, and when needed, a left-superscript specifies the coordinate system in which it is specified, e.g., 𝐴𝒒 indicates a point relative to a coordinate system 𝐴. 
The target velocity, $\dot{\boldsymbol{m}}$, and camera velocity, $\dot{\boldsymbol{c}}$, are specified by 3-dim vectors, again optionally with a left superscript to specify a coordinate system. Coordinate transformations, containing both a rotation and translation, are specified by 4 × 4 matrices, such as ${}^{B}_{A}\boldsymbol{T}$, which transforms points from the left-subscript coordinate system to the left-superscript coordinate system. In this case we transform a point from $A$ to $B$ with:
\[
{}^{B}\boldsymbol{q} = {}^{B}_{A}\boldsymbol{T}\,{}^{A}\boldsymbol{q}. \tag{3.1}
\]
Only the rotational component of these transformations is needed to transform velocities. For example, ${}^{A}\dot{\boldsymbol{m}}$ is transformed to ${}^{B}\dot{\boldsymbol{m}}$ by the 3 × 3 rotation matrix ${}^{B}_{A}\boldsymbol{R}$:
\[
{}^{B}\dot{\boldsymbol{m}} = {}^{B}_{A}\boldsymbol{R}\,{}^{A}\dot{\boldsymbol{m}}. \tag{3.2}
\]
A vector with a right subscript, e.g., $\boldsymbol{p}_i$, indicates the $i$-th element of $\boldsymbol{p}$, while a right subscript of "1:3" puts the first 3 elements in a 3-dim vector. For a matrix, the right subscript indicates the row. Thus ${}^{B}_{A}\boldsymbol{R}_i$ is a 1 × 3 row vector containing its $i$-th row. A right superscript "T" is a matrix transpose. The projections of points $\boldsymbol{p}$ and $\boldsymbol{q}$ are specified in either undistorted raw pixel coordinates, e.g., $(x_q, y_q)$, or their normalized image coordinates $(u_q, v_q)$ given by:
\[
u_q = (x_q - c_x)/f_x, \qquad v_q = (y_q - c_y)/f_y. \tag{3.3}
\]
Here $c_x$, $c_y$, $f_x$, $f_y$ are intrinsic camera parameters, while the right subscript of the pixel refers to the point being projected. Vectors for 3D points can be expressed in terms of the normalized image coordinates:
\[
{}^{A}\boldsymbol{q} =
\begin{pmatrix} u_q d_q \\ v_q d_q \\ d_q \\ 1 \end{pmatrix}
\quad\text{and}\quad
{}^{B}\boldsymbol{p} =
\begin{pmatrix} u_p d_p \\ v_p d_p \\ d_p \\ 1 \end{pmatrix}. \tag{3.4}
\]
Here $d_q$ and $d_p$ are the depths of points ${}^{A}\boldsymbol{q}$ and ${}^{B}\boldsymbol{p}$, respectively. We assume dense optical flow is available that maps target pixel coordinates observed in $A$ to $B$ as follows:
\[
\mathrm{Flow}\big((u_q, v_q)\big) \rightarrow (u_p, v_p). \tag{3.5}
\]
Further, we assume the following are known: the camera motion, ${}^{B}_{A}\boldsymbol{T}$, the relative radar pose, ${}^{A}_{R}\boldsymbol{T}$, and the intrinsic parameters.

3.3.2 Full-Velocity Radar Returns

The Doppler velocity measured by a radar is just one component of the three-component, full-velocity vector of an object point. Here our goal is to leverage optical flow from a synchronized camera to augment radar and estimate this full-velocity vector for each radar return.

3.3.2.1 Relationship of Full Velocity to Radial Velocity

The target motion from $\boldsymbol{p}$ to $\boldsymbol{q}$ is modeled as constant velocity, $\dot{\boldsymbol{m}}$, over time $\Delta t$, such that
\[
\dot{\boldsymbol{m}} = \frac{\boldsymbol{q}_{1:3} - \boldsymbol{p}_{1:3}}{\Delta t}. \tag{3.6}
\]
Our goal is to estimate the full target velocity, $\dot{\boldsymbol{m}}$. Radar provides an estimate of the target position, $\boldsymbol{q}$, but not the previous target location $\boldsymbol{p}$. Radar also provides the signed radial speed, $\dot r$, which is one component of $\dot{\boldsymbol{m}}$. In the nuScenes dataset, $\dot r$ is given by:
\[
\dot r = \hat{\boldsymbol{r}}^{\mathsf T} \dot{\boldsymbol{m}}. \tag{3.7}
\]
Here $\hat{\boldsymbol{r}}$ is the unit-norm vector along the direction to the target ${}^{R}\boldsymbol{q}$. Note that this equation is coordinate-invariant, and could be equally written in $A$ using ${}^{A}\hat{\boldsymbol{r}}$ and ${}^{A}\dot{\boldsymbol{m}}$. Now Eq. (3.7) is actually the egomotion-corrected Doppler speed.
The raw Doppler speed, $\dot r_{\mathrm{raw}}$, is the radial component of the relative velocity between target and sensor, $\dot{\boldsymbol{m}} - \dot{\boldsymbol{c}}$, and this constraint is given by:
\[
\dot r_{\mathrm{raw}} = \hat{\boldsymbol{r}}^{\mathsf T} (\dot{\boldsymbol{m}} - \dot{\boldsymbol{c}}), \tag{3.8}
\]
where $\dot{\boldsymbol{c}}$ is the known ego-velocity. Either Eq. (3.7) or (3.8) can be used in our formulation, depending on whether $\dot r$ or $\dot r_{\mathrm{raw}}$ is available from the radar.

3.3.2.2 Relationship of Full Velocity to Optical Flow

In solving the velocity constraints, we first identify the known variables. The radar measures ${}^{R}\boldsymbol{q}$, and transforming this we obtain ${}^{A}\boldsymbol{q} = {}^{A}_{R}\boldsymbol{T}\,{}^{R}\boldsymbol{q}$, which contains $d_q$ as the third component. Image coordinates $(u_q, v_q)$ are obtained by projection, and using optical flow in Eq. (3.5), we can also obtain the $(u_p, v_p)$ components of ${}^{B}\boldsymbol{p}$. The key parameter we do not know from this is the depth, $d_p$, in $B$.

Next we eliminate this unknown depth from our constraints. Eq. (3.6) can be rearranged and each component expressed in frame $B$:
\[
{}^{B}\boldsymbol{p}_{1:3} = {}^{B}\boldsymbol{q}_{1:3} - {}^{B}_{A}\boldsymbol{R}\,{}^{A}\dot{\boldsymbol{m}}\,\Delta t, \tag{3.9}
\]
where the second term on the right is the transformation of the target motion into $B$ coordinates. The third row of this equation is an expression for $d_p$:
\[
d_p = {}^{B}\boldsymbol{q}_{3} - {}^{B}_{A}\boldsymbol{R}_{3}\,{}^{A}\dot{\boldsymbol{m}}\,\Delta t. \tag{3.10}
\]
Substituting this for $d_p$, and the components of ${}^{B}\boldsymbol{p}$ from Eq. (3.4), into the first two rows of Eq. (3.9), we obtain
\[
\begin{pmatrix}
u_p \,\big({}^{B}\boldsymbol{q}_{3} - {}^{B}_{A}\boldsymbol{R}_{3}\,{}^{A}\dot{\boldsymbol{m}}\,\Delta t\big) \\
v_p \,\big({}^{B}\boldsymbol{q}_{3} - {}^{B}_{A}\boldsymbol{R}_{3}\,{}^{A}\dot{\boldsymbol{m}}\,\Delta t\big)
\end{pmatrix}
=
\begin{pmatrix}
{}^{B}\boldsymbol{q}_{1} - {}^{B}_{A}\boldsymbol{R}_{1}\,{}^{A}\dot{\boldsymbol{m}}\,\Delta t \\
{}^{B}\boldsymbol{q}_{2} - {}^{B}_{A}\boldsymbol{R}_{2}\,{}^{A}\dot{\boldsymbol{m}}\,\Delta t
\end{pmatrix}, \tag{3.11}
\]
and rearrange to give two constraints on the full velocity:
\[
\begin{pmatrix}
{}^{B}_{A}\boldsymbol{R}_{1} - u_p\,{}^{B}_{A}\boldsymbol{R}_{3} \\
{}^{B}_{A}\boldsymbol{R}_{2} - v_p\,{}^{B}_{A}\boldsymbol{R}_{3}
\end{pmatrix}
{}^{A}\dot{\boldsymbol{m}}
=
\begin{pmatrix}
\big({}^{B}\boldsymbol{q}_{1} - u_p\,{}^{B}\boldsymbol{q}_{3}\big)/\Delta t \\
\big({}^{B}\boldsymbol{q}_{2} - v_p\,{}^{B}\boldsymbol{q}_{3}\big)/\Delta t
\end{pmatrix}. \tag{3.12}
\]

3.3.2.3 Full-Velocity Solution

We obtain three constraints on the full velocity, ${}^{A}\dot{\boldsymbol{m}}$, from Eq. (3.12) and by converting Eq. (3.7) to $A$ coordinates. Combining these we obtain:
\[
\begin{pmatrix}
{}^{B}_{A}\boldsymbol{R}_{1} - u_p\,{}^{B}_{A}\boldsymbol{R}_{3} \\
{}^{B}_{A}\boldsymbol{R}_{2} - v_p\,{}^{B}_{A}\boldsymbol{R}_{3} \\
{}^{A}\hat{\boldsymbol{r}}^{\mathsf T}
\end{pmatrix}
{}^{A}\dot{\boldsymbol{m}}
=
\begin{pmatrix}
\big({}^{B}\boldsymbol{q}_{1} - u_p\,{}^{B}\boldsymbol{q}_{3}\big)/\Delta t \\
\big({}^{B}\boldsymbol{q}_{2} - v_p\,{}^{B}\boldsymbol{q}_{3}\big)/\Delta t \\
\dot r
\end{pmatrix}. \tag{3.13}
\]
Then inverting the 3 × 3 coefficient of ${}^{A}\dot{\boldsymbol{m}}$ gives a closed-form solution for the full velocity:
\[
{}^{A}\dot{\boldsymbol{m}}
=
\begin{pmatrix}
{}^{B}_{A}\boldsymbol{R}_{1} - u_p\,{}^{B}_{A}\boldsymbol{R}_{3} \\
{}^{B}_{A}\boldsymbol{R}_{2} - v_p\,{}^{B}_{A}\boldsymbol{R}_{3} \\
{}^{A}\hat{\boldsymbol{r}}^{\mathsf T}
\end{pmatrix}^{-1}
\begin{pmatrix}
\big({}^{B}\boldsymbol{q}_{1} - u_p\,{}^{B}\boldsymbol{q}_{3}\big)/\Delta t \\
\big({}^{B}\boldsymbol{q}_{2} - v_p\,{}^{B}\boldsymbol{q}_{3}\big)/\Delta t \\
\dot r
\end{pmatrix}. \tag{3.14}
\]
Recall in Fig. 3.1(a) the red/blue dashed lines show the velocity constraints from radar/flow. The solution of Eq. (3.14) is the full velocity that is consistent with both constraints. We note that this can handle moving sensors, although Fig. 3.1(a) shows the case of a stationary camera for simplicity. Further, if we set $\Delta t < 0$, Eq. (3.14) also applies to the case where the point shifts from $\boldsymbol{q}$ to $\boldsymbol{p}$ as the camera moves from $A$ to $B$. One limitation is that Eq. (3.14) cannot estimate full velocity for radar points occluded in the camera view, although we can typically identify those occlusions.

3.3.3 Image Pixels and Radar Points Association

Our solution for point-wise velocity in Eq. (3.14) assumes that we know the pixel coordinates $(u_q, v_q)$ of the radar-detected point, ${}^{R}\boldsymbol{q}$. It appears straightforward to obtain this pixel correspondence by projecting a radar point onto the image using the known radar-image coordinate transformation, ${}^{A}_{R}\boldsymbol{T}$. We refer to this corresponding pixel as the "raw projection".
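Given an associated pixel, Eq. (3.14) amounts to a single 3 × 3 linear solve per radar return. The following sketch mirrors that computation; the helper name and all inputs are placeholders rather than the actual implementation.

```python
import numpy as np

def full_velocity(u_p, v_p, q_B, r_hat_A, r_dot, R_BA, dt):
    """Closed-form full velocity in camera frame A (cf. Eqs. 3.13-3.14).

    u_p, v_p : normalized image coordinates of the point in frame B (from optical flow).
    q_B      : (3,) target position expressed in camera frame B.
    r_hat_A  : (3,) unit radial direction from the radar to the target, expressed in frame A.
    r_dot    : egomotion-corrected Doppler (radial) speed.
    R_BA     : (3, 3) rotation taking vectors from frame A to frame B.
    dt       : time between the two camera frames.
    """
    # Stack the two flow constraints and the radial (Doppler) constraint.
    M = np.vstack([
        R_BA[0] - u_p * R_BA[2],
        R_BA[1] - v_p * R_BA[2],
        r_hat_A,
    ])
    b = np.array([
        (q_B[0] - u_p * q_B[2]) / dt,
        (q_B[1] - v_p * q_B[2]) / dt,
        r_dot,
    ])
    return np.linalg.solve(M, b)  # full velocity in frame A (m/s)
```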
However, there are a number of reasons why raw projection of radar points into an image is inaccurate. Radar beam- width typically subtends a few degrees and is large relative to a pixel, resulting in low resolution target location in both azimuth and elevation. Also, a radar displaced from a camera can often see behind an object, as viewed by the camera, and when these returns are projected onto an image they incorrectly appear to correspond to the foreground occluding object. Using flow from an occluder or an incorrectly associated object pixel may result in incorrect full-velocity estimation. To address these issues with raw projection, we train a neural network model, termed Radar-2-Pixel (R2P) network, to estimate associated radar pixels in the neighborhood of raw projection and identify occluded radar points. Similar models have been applied to image segmentation [39] and radar depth enhancement [76]. 3.3.3.1 Model Structure Our method estimates association probabilities (ranging from 0 to 1) between a moving radar point and a set of pixels in the neighborhood of its raw projection. The R2P network is an encoder- decoder structure with inputs and outputs of image resolution. Stored in 8 channels, the input data include image, radar depth map (with depth on raw projections), and optical flow. The output has 39 𝑁 channels, representing predicted association probability for 𝑁 pixel neighbors. The association between the radar point, 𝐴𝒒, and the 𝑘-th neighbor of raw projection (𝑥, 𝑦) is stored in 𝐴(𝑥, 𝑦, 𝑘), where 𝑘 = 1, 2, ..., 𝑁. 3.3.3.2 Ground Truth Velocity of Moving Radar Points The nuScenes [10] provides the GT velocity of object bounding boxes. We associate radar hits on an object to its labeled bounding box, and assign the velocity of the box to its associated radar points. The association is determined based on two criteria: 1) in radar coordinates, the distance between radar points and associated box is smaller than a threshold 𝑇𝑑; and 2) the percentage error between the radial velocity of a radar point and the radial component of the velocity of associated box is smaller than a threshold 𝑇𝑝. 3.3.3.3 Generating Association Labels We can project a radar point expressed in corresponding camera coordinates, 𝐴𝒒, to pixel coordinates (𝑢𝑞, 𝑣𝑞), but as mentioned before, often this image pixel does not correspond to the radar return. Our proposed solution is to search in a neighboring region around (𝑢𝑞, 𝑣𝑞) for a pixel whose motion is consistent with the radar return. This neighborhood search is shown in Fig. 3.2. If a pixel is found, then we correct the 3D radar location 𝐴𝒒 to be consistent with this pixel, otherwise we mark this radar return as occluded. We learn this radar-to-pixel association and correction by training the R2P network. We generate true association score between a radar point and a pixel according to the compatibility between the true velocity and the optical flow at that pixel: high compatibility indicates high association. To quantify the compatibility, assuming a pixel is associated with a radar point, we compute a hypothetical full velocity for the radar point by using the optical flow of that pixel according to Eq. (3.14). The flow is considered compatible if the hypothetical velocity is close to the GT velocity. Specifically, the hypothetical velocity can be computed as 𝐴 (cid:164)𝒎𝑒𝑠𝑡 (𝑥, 𝑦, 𝑘) = 𝒇 (cid:0) ˘𝑢𝑞, ˘𝑣𝑞, ˘𝑢 𝑝, ˘𝑣 𝑝, 𝑑𝑞, (cid:164)𝑟, 𝐵 𝐴𝑻, 𝐴 𝑅𝑻(cid:1) , (3.15) where 𝑘 = 1, · · · , 𝑁, 𝒇 (·) is the function to solve full velocity via Eq. 
(3.14), and (𝑥, 𝑦) is the raw 40 (a) (c) (b) (d) Figure 3.3 (a) Optical flow; (b) BEV of GT bounding box, radial velocity (red), and GT velocity (green); (c) and (d) show 𝐸𝑚, computed by using Eq. (3.16), for two radar projections (white square) over 41 × 41 pixel regions, respectively. For radar hits reflected from the vehicle, 𝐸𝑚 is small for neighboring pixels on the car and large on the background. projection of the radar point. Note that ˘𝑢𝑞 = 𝑢𝑞 [𝑥 + Δ𝑥(𝑘), 𝑦 + Δ𝑦(𝑘)], ˘𝑣𝑞 is defined similarly, and [Δ𝑥(𝑘), Δ𝑦(𝑘)] is the coordinate offset from raw projection to the 𝑘-th neighbor. Using flow, Eq. (3.5), we obtain ( ˘𝑢 𝑝, ˘𝑣 𝑝) from ( ˘𝑢𝑞, ˘𝑣𝑞). Second, we calculate the 𝐿2 norm of errors between 𝐴 (cid:164)𝒎𝑒𝑠𝑡 (𝑥, 𝑦, 𝑘) and GT velocity 𝐴 (cid:164)𝒎𝐺𝑇 (𝑥, 𝑦) by 𝐸𝑚 (𝑥, 𝑦, 𝑘) = ∥ 𝐴 (cid:164)𝒎𝑒𝑠𝑡 (𝑥, 𝑦, 𝑘) − 𝐴 (cid:164)𝒎𝐺𝑇 (𝑥, 𝑦) ∥2. (3.16) Fig. 3.3 shows examples of 𝐸𝑚 for two radar hits on a car. Finally, we transform 𝐸𝑚 to an association score with 𝐿 (𝑥, 𝑦, 𝑘) = 𝑒− 𝐸2𝑚 ( 𝑥,𝑦,𝑘 ) 𝑐 , (3.17) where 𝐿 is used as a label for association probability between a radar and its 𝑘-th neighbor. Note that 𝐿 increases with decreasing 𝐸𝑣, and 𝑐 is a parameter adjusting the tolerance of velocity errors 41 when converting errors to association. We use the cross entropy loss to train the model. 3.3.3.4 Estimate Association and Identify Occlusion With a trained model, we can estimate association probability between radar points and 𝑁 pixels around their raw projections (𝑥, 𝑦), i.e., 𝐴(𝑥, 𝑦, 𝑘). Among the 𝑁 neighbors, the radar return velocity may be compatible with a number of pixels, and we select the pixel with the maximum association, 𝐴𝑚𝑎𝑥, as the neighbor ID 𝑘𝑚𝑎𝑥: 𝑘𝑚𝑎𝑥 = arg max 𝑘 [ 𝐴(𝑥, 𝑦, 𝑘)]. (3.18) If 𝐴𝑚𝑎𝑥 is equal or larger than a threshold 𝑇𝑎, we estimate the associated pixel as [𝑥 + Δ𝑥(𝑘𝑚𝑎𝑥), 𝑦 + Δ𝑦(𝑘𝑚𝑎𝑥)]. Otherwise there is no associated pixels in the neighborhood, and an occlusion is identified. 3.4 Experimental Results 3.4.1 Comparison of Point-wise Full Velocity To the best of our knowledge, there is no existing method estimating point-wise full velocity for radar returns. Thus, we use point-wise radial velocity from raw radar returns as the baseline to compare with our estimation. We extract data from the nuScenes Object Detection Dataset [10], with 6432, 632, and 2041 samples in training, validation, and test set, respectively. Each sample consists of a radar scan and two images for optical flow computation, i.e., one image synchronizing with the radar and the other is a neighboring image frame. The optical flow is computed by the RAFT model [129] pre-trained on KITTI [27]. The R2P network is an U-Net [88, 108] with five levels of resolutions and 64 channels for intermediate filters. The neighborhood skips every other pixel, and its size (in pixels) is (left: 4, right: 4, top: 10, bottom: 4) and an example of the neighborhood is illustrated in Fig. 3.2(b). The threshold of association scores 𝑇𝑎 is 0.3. Parameters associating radar points with GT bounding box are set as 𝑇𝑑 = 0.5m and 𝑇𝑝 = 20%. Parameter 𝑐 in Eq. (3.17) is 0.36. To obtain GT point-wise velocity, based on the criteria in Sec. 3.3.3.2, we first associate moving radar points to GT detection boxes, whose GT velocity is assigned to associated points as their GT velocity. The GT velocity of bounding boxes is estimated from GT 42 Mean Error (STD) (m/s) Full Velocity Tangential Comp. Radial Comp. 
Ours (R2P Network) 0.433 (0.608) 0.322 (0.610) 0.205 (0.196) Ours (Raw Projection) 0.577 (1.010) 0.472 (1.024) 0.205 (0.196) Baseline 1.599 (2.054) 1.536 (2.083) 0.205 (0.196) Table 3.1 Comparison of point-wise velocity error of our methods and the baseline (raw radial velocity). center positions in neighboring frames with timestamps. Tab. 3.1 shows the average velocity error for moving points. The proposed method achieves substantially more accurate velocity estimation than the baseline. For instance, the error of our tangential component is only 21% of that of the baseline. We also have much smaller standard deviation, indicating more stable estimates. In addition, we list in Tab. 3.1 velocity error of our method using raw radar projection for radar-camera association. Results show that, compared with using raw projection, using R2P network achieves higher estimation accuracy. Fig. 3.4 illustrates qualitative results of our point-wise velocity estimation. 3.4.2 Comparison of Object-wise Velocity Although there are no existing methods for point-wise velocity estimation for radar, a related work, CenterFusion [91], estimates object-wise full velocity via object detection with image and radar inputs. To fairly compare with CenterFusion, we convert our point-wise velocity to object- wise velocity. Specifically, we use the average velocity of radar points associated with the same detected box as our estimate of object velocity. Points are associated with detected boxes according to distance. Note the point-wise velocity to object-wise velocity conversion is straightforward for comparison purposes, and there would be more advanced approaches to integrate point-wise full velocities in a detection network, which is beyond the scope of this work. Tab. 3.2 shows that with our estimated full velocity, the velocity estimation for objects is significantly improved. 3.4.3 Velocity Estimation Error for Different Depths and 𝛼 This experiment extends the evaluation of point-wise velocity estimation discussed in Sec- tion 3.4.1. In Fig. 3.5, each heat map shows point-wise velocity error under different depth ranges, 43 (a) (b) Figure 3.4 Visualization of point-wise velocity estimation: (a) depth of all measured radar returns as well as flow, (b) optical flow in the white box region, (c) association scores around the selected radar projections as well as predicted mapping from raw radar projections to image pixels (yellow arrow), and (d) radial velocity (red), estimated full velocity (black), and GT velocity (green) in BEV. (d) (c) Methods Ours CenterFusion [91] Error (m/s) 0.451 0.826 Table 3.2 Comparison of object-wise velocity errors. For a fair comparison we inherit the same set of detected objects from [91]. i.e., [0, 25), [25, 50), and [50, ∞) meters as well as various 𝛼 ranges, i.e., [0, 30), [30, 60), and [60, 90] degrees, where 𝛼 is the angle between actual moving direction and radial direction of a radar point and ranges from 0 to 90 degrees. Results of the proposed method and baseline are show in the first row and second row, respectively. The baseline (second row), with only radial measurement, suffers from large 𝛼 since the the actual moving direction is very different from radial direction under large 𝛼. The proposed method outperforms the baseline in all depth and 𝛼 ranges for full velocity estimation. 
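The decomposition behind Tab. 3.1 and Fig. 3.5 splits each velocity error into components along and perpendicular to the radial direction, with 𝛼 measured between the GT full velocity and the radial direction. A small BEV sketch of one plausible way to compute these quantities follows; the exact evaluation code may differ in detail.

```python
import numpy as np

def velocity_error_components(v_est, v_gt, r_hat):
    """Split a BEV velocity error into radial and tangential parts, and compute alpha.

    v_est, v_gt : (2,) estimated and ground-truth full velocities.
    r_hat       : (2,) unit radial direction from the radar to the point.
    """
    err = v_est - v_gt
    radial_err = abs(err @ r_hat)                              # error along the radial direction
    tangential_err = np.linalg.norm(err - (err @ r_hat) * r_hat)
    full_err = np.linalg.norm(err)
    # Alpha: angle between the GT moving direction and the radial direction (0-90 degrees).
    cos_a = abs(v_gt @ r_hat) / (np.linalg.norm(v_gt) + 1e-9)
    alpha = np.degrees(np.arccos(np.clip(cos_a, 0.0, 1.0)))
    return full_err, tangential_err, radial_err, alpha
```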
44 (a) Full Velocity Error (b) Tangential Component Error (c) Radial Component Error Figure 3.5 Comparison of average error (in meters) of point-wise velocity estimates by the proposed method (first row) and baseline (second row). Columns 1, 2, and 3 are error of full velocity, tangential component, and radial component, respectively. Each heat map shows the error for radar points in different depth and 𝛼 ranges, where 𝛼 is the angle between full velocity and radial direction. 3.4.4 Radar Point Accumulation Accumulating radar points over time can overcome the sparsity of radar hits acquired in a single sweep, achieving dense point cloud for objects and thus allowing techniques designed for processing lidar points to be applicable for radar. The point-wise velocity estimate makes it possible to compensate the motion of dynamic objects appearing in a temporal sequence of measurements for accumulation. Specifically, for a moving radar point (with estimated velocity (cid:164)𝒎) in a previous frame 𝑖 captured at time 𝑡𝑖, its motion from 𝑡𝑖 to the time at the current frame, 𝑡0, can be compensated 45 by, 𝒑0 = 𝒑𝒊 + (cid:164)𝒎(𝑡0 − 𝑡𝑖), (3.19) where 𝒑𝒊 and 𝒑0 are the radar point coordinates at 𝑡𝑖 and 𝑡0 in radar coordinates of 𝑡𝑖. Then 𝒑0 is transformed to current radar coordinates by known egomotion from 𝑡𝑖 to 𝑡0. Qualitative results. Fig. 3.6 shows accumulated points of moving vehicles in radar coordinates. For comparison, we show accumulated radar points compensated by our estimated full velocity, compensated with radial velocity (baseline) and without motion compensation. Compared with the baseline and no motion compensation, our accumulated points are more consistent with the GT bounding boxes. Quantitative results. To quantitatively evaluate the accuracy of radar point accumulation, we use the mean distance from accumulated points (of up to 25 frames) to their corresponding GT boxes as the accumulation error. This distance for points inside the box is zero, and outside it is the distance from the radar point to the closest point on the box’s boundary. In Fig. 3.7, we compare the accumulation for our method, the baseline, and accumulation without motion compensation. While error increases with the number of frames for all methods, our method has the lowest rate of error escalation. Application of pose estimation. To demonstrate the utility of accumulated radar points for downstream applications, we apply a pose estimation method, i.e., BoxNet [92], on the accumulated 2D radar points via our full velocity and radial velocity (baseline), respectively. BoxNet takes pre- segmented 2D point clouds of an object as input and predicts a 2D bounding box with parameters as center position, length, width, and orientation. We use accumulated radar points of 5702, 559, and 2001 moving vehicles with corresponding GT bounding boxes as training, validation, and test data, respectively. Tab. 3.3 shows our accumulated radar achieves higher accuracy than the baseline. 3.4.5 Inference Time The pipeline of our full velocity estimation includes three major components, optical flow computation, radar-camera association estimation, and closed-form solution of full velocity. 
The 46 (a) (b) (c) (d) (e) Figure 3.6 Moving radar points are plotted with point-wise radial (red) and full (black) velocity, including image with bounding box (a), single-frame radar points in BEV (b), accumulated radar points from 20 frames without motion compensation (c), with radial velocity based compensation (d), and with our full-velocity based compensation (e). Our accumulated points are tightly sur- rounding the bounding box, which will benefit downstream tasks such as pose estimation and object detection. time used by each component per frame is listed in Table 3.4. Our computational platform includes Intel Core i7-8700 CPUs and a NVIDIA GeForce RTX 2080 Ti GPU. The proposed closed-form solution achieves highly efficient computation. Note the computational cost of optical flow can be improved by limiting the region of flow computation to areas with radar projections. 3.5 Summary A drawback of Doppler radar has been that it provides only the radial component of velocity, which limits its utility in object velocity estimation, motion prediction, and radar return accumula- tion. This chapter addresses this drawback by presenting a closed-form solution to the full velocity 47 Figure 3.7 Error comparison when accumulating radar points from increasing number of frames. The lines represent mean error and shaded area ±0.1× STD. Our full velocity based accumulation outperforms the ones with radial velocity, or no compensation. Metric Center Error (m) ↓ Orientation Error (degree) ↓ IoU ↑ Ours Baseline 0.997 0.834 7.517 6.873 0.462 0.546 Table 3.3 Comparison of pose estimation performance: average error in center and orientation as well as Intersection over Union (IoU), by using BoxNet [92] on radar points accumulated using our velocity and the radial velocity as a baseline. Components Optical Flow [129] Radar-camera Association Closed-form Velocity Computation Time per Frame (s) 4.47 × 10−1 2.58 × 10−3 2.49 × 10−4 Table 3.4 Computational time of three components in the method pipeline. of radar returns. It leverages optical flow constraints to upgrade radial velocity into full velocity. As part of this work, we use GT bounding-box velocities to supervise a network that predicts association corrections for the raw radar projections. We experimentally verify the effectiveness of our method and demonstrate its application on motion compensation for integrating radar sweeps over time. This method developed here may apply to additional modalities such as full-velocity estimation from Doppler lidar and cameras. 48 CHAPTER 4 RADIANT: RADAR-IMAGE ASSOCIATION NETWORK FOR 3D OBJECT DETECTION As a direct depth sensor, radar holds promise as a tool to improve monocular 3D object detection, which suffers from depth errors, due in part to the depth-scale ambiguity. On the other hand, leveraging radar depths is hampered by difficulties in precisely associating radar returns with 3D estimates from monocular methods, effectively erasing its benefits. This chapter proposes a fusion network that addresses this radar-camera association challenge. We train our network to predict the 3D offsets between radar returns and object centers, enabling radar depths to enhance the accuracy of 3D monocular detection. By using parallel radar and camera backbones, our network fuses information at both the feature level and detection level, while at the same time leveraging a SOTA monocular detection technique without retraining it. 
Experimental results show significant improvement in mean average precision and translation error on the nuScenes dataset over monocular counterparts. This chapter was previously published as [73]. 4.1 Introduction Three-dimensional object detection is a core vision problem where the task is to infer 3D position, orientation, and classification of objects in a scene. Applications that rely on this include robotics [111], gaming [106], and automotive safety [120]. In the latter application, ADAS can move beyond automated braking and use precise location and orientation of nearby vehicles and other objects to perform collision avoidance maneuvers. However, a key limiting factor is the relatively poor accuracy of 3D object detection, both in current systems and in SOTA methods that rely on widely used sensors, namely cameras [77, 99] and radars [149]. Thus, there is a significant need for improved 3D object detection, which is the goal of this chapter. Image-based object detection achieves high 2D accuracy [7]; for instance, DD3D [99] achieves 94% 2D mean average precision (mAP) for cars on KITTI [28]. However, the performance of these same SOTA methods drops precipitously on 3D object detection with DD3D [99] only achieving 16.9% 3D mAP of cars on KITTI. The lower 3D performance of monocular detection comes 49 largely from poor depth estimation [46, 82]. This is expected as image projection removes depth, and recovering object depth from an image suffers from the depth scale ambiguity [128], and is therefore error-prone. Indeed, when [146] uses precise depth from lidar, mAP of cars increases to over 90%. Typically lidar is a specialized sensor that is relatively expensive [47] and only available on a small fraction of vehicles. On the other hand, radar is small [61], inexpensive, and widely available on existing vehicles so we choose to use radar for the broader impact. Thus, this chapter explores the fusion of direct depth measurements available from the radar with a 3D monocular detector. The choice to use automotive radar rather than lidar presents several challenges. At first glance, one might consider using a similar 3D object-detector on radar points as for lidar [139]. However, that does not work because the radar data has completely different characteristics from the lidar point cloud [139]. First, each radar-point sweep is much sparser than a typical lidar measurement in azimuth [149] and has a single row in elevation, leaving radar coordinates without height information. Thus, differing from lidar, radar cannot acquire accurate shapes with dense point clouds. Additionally, the radar point positions have significant azimuth errors [149] and are much less accurate [139] as measurements of object surfaces. Next, for each radar scan there is often only a single radar return at longer ranges and sometimes no radar returns for small objects. This sparsity makes 3D object detection solely based on radar points difficult. Thus, rather than detection-level fusion, our approach uses feature-level fusion to augment radar points and these augmented radar points to refine monocular object depths. Finally, since the widely-used KITTI dataset [28] does not include radar data, we use the nuScenes dataset [10] for experiments. The association and fusion of image and radar modalities is possible at input-level, feature-level, or even at detection-level. 
However, any such association must address the missing height, imprecise angular location of radar returns, and the observation that occluded portions of objects also return radar points [76]. Rather than using inaccurate radar projections on image for association, this chapter uses a neural network to explicitly predict point-wise 3D object centers (see Fig. 4.1). Predicting these point-wise object centers from a trained network allows us to use radar to correct 50 (a) Pixel Offset (b) BEV Offset Figure 4.1 Predicted radar-offsets (cyan arrows) from radar points (dots) to object centers (orange plus) in (a) pixel space and (b) BEV in meters. RADIANT trains a network to predict these offsets and improves monocular 3D detection. Orange boxes represent the GT bounding boxes, and dashed lines denote borders of the camera field of view. depth errors of monocular detection, which results in a new SOTA for radar-camera 3D object detection. This chapter presents a radar-camera fusion method named RADar-Image Association NeTwork (RADIANT) for 3D object detection, including the following contributions: • Our method enhances radar returns to obtain 3D object center detections from each radar return. • We achieve camera-radar association at the detection level using the enhanced radar locations. • Our architecture can leverage multiple different pre-trained SOTA monocular methods. • We improve monocular object depth estimates by fusing enhanced radar depths and achieve new improved SOTA performance on nuScenes. 4.2 Related Work Monocular Detection Differing from lidar-based detection [117], monocular 3D detection is widely applied for its low cost and simple configuration [6]. Researchers have been improving detection performance via upgrading detection frameworks [68, 141], losses [120], and non- maximum suppression [47] as well as performing joint detection and 3D reconstruction [66]. To reduce 2D to 3D ambiguities, various strategies have been developed, e.g., pseudo-lidar [80, 81, 51 −16−12−8−40412162024 99, 121, 138], novel convolutions [20] and backbones [46], considering camera geometry [161], using shape models [11, 70], and leveraging videos [8]. Radar-Camera Fusion Radar has been fused with lidar and camera. Radar-lidar fusion has been utilized in 3D object detection [149] and object tracking [116] as radar complements lidar in long range and motion (i.e., Doppler velocity) measurements. Nevertheless, most works combine radar with camera for advantages listed in the Introduction section. There has been a surge in research of radar-camera fusion recently since the release of new autonomous driving datasets [3, 10, 21, 95, 118, 140] with images and radar data collected. The radar data are in raw data formats [3, 21, 95, 140] (such as Range-Azimuth-Doppler format [83]) or processed formats (i.e., radar point clouds [10, 118]). Raw radar formats contain denser measurements while point clouds are sparser but have less noise. Algorithms are designed according to specific radar formats used. In this chapter, we use the radar point cloud format from the nuScenes dataset [10]. The goals for radar-camera fusion include depth completion [53, 76], full-velocity estimation [75], object tracking [89], and object detection [90, 91]. A focus of radar-camera fusion is 2D object detection on images [57, 90, 118, 148], where projected radar points are used as extra features or candidates for potential objects. 
For example, method [90] maps radar detections to the image coordinate system and generates anchor boxes for each mapped radar detection point. Here radar plays an important role when the image is not clear because of darkness or long distances. However, the research on 3D object detection via radar-camera fusion is still at an early stage with few publications. One top-performing work is CenterFusion [91], which directly combines monocular detections with raw radar points in the neighborhood followed by a regression head to refine the depth estimate. In light of the significant gap between radar points and object centers, we believe this apples and oranges combination limits the utility of radar depths in CenterFusion. Our method, on the other hand, estimates the 3D object center for each radar return. This geometric correction performed by the radar branch then enables our fusion module to effectively combine multiple estimates of 3D center points from radar and camera for better 3D detection. 52 4.3 Background and Definitions Radar-Positives. A key component of training a detection network involves the choice of candidates and the rule to construct positives. FCOS3D [136] treats each pixel as an object candidate, and positive pixels for training detections are defined as small regions in the neighborhood of projected centers of 3D objects. To train the network to predict radar offsets, RADIANT follows the same strategy but now treats each radar return as an object candidate and considers the associated projected radar pixel as positive. Association of radar returns with an object is easy if the radar projection always falls on the object. However, radar returns returned by an object are sometimes outside the object bounding box due to measurement errors [149]. Moreover, the nuScenes dataset also does not provide GT object labels for radar points. Thus, we carry out the radar-object association for training according to the position and velocity consistency. If the distance between an outside radar point and the object GT bounding box is smaller than a threshold and the projection of GT object velocity on radial direction is close to the Doppler velocity from the radar point, we associate the radar pixel with the object and consider that radar pixel as positive. Radar Depth Offset. Radar depth offset Δ ˆ𝑧𝑟 is the depth difference between the 3D center of its associated object 𝑧 and a positive radar point 𝑧𝑟, and is thus given by Δ ˆ𝑧𝑟 = 𝑧 − 𝑧𝑟 . (4.1) As mentioned, this residual depth is a result of radar measurement error and relative position of radar hit on object surfaces. We infer the residual depth from both radar and image information around. 4.4 RADIANT Monocular 3D detection without depth as input suffers from inaccurate depth estimation [82, 134], especially for far objects. Our goal is to upgrade 3D camera detections with more accurate depth from radar with minimal changes to the image detection pipeline. To achieve this we address two difficulties. (1) While radar returns typically provide more precise depth than camera-based detections, which part of an object they measure can be difficult to determine as it may be the front 53 Figure 4.2 Overview of RADIANT. RADIANT architecture has two parallel input branches, an image branch and a radar branch that both operate in the image space. More details of these branches are shown in Fig. 4.3. The depth fusion module combines depth estimates from both the camera and radar heads to obtain a refined overall depth for each detection. 
surface or an internal or occluded point on the object. This unknown offset can add significant error to a radar depth estimate when combined with an image-based detection. (2) The radar return point can sometimes be outside the true object bounding box, and this complicates the association between radar points and objects. Our solution is to train a radar-focused network to predict the unknown offsets between radar points and object centers. These offsets include both an image-plane offset to the projection of the 3D center, and a depth offset to the object center. Assuming these offsets are correctly estimated, the association between radar points and objects becomes much easier. Furthermore, since they predict the object center, we use the offsets to correct the object center depth. Our architecture consists of two branches and a fusion block as shown in Fig. 4.2. The upper branch is a monocular 3D detection network, such as FCOS3D [136], which remains unchanged, while the lower branch is the radar detection branch. The image branch predicts object centers and pixel offsets to these centers in the image plane. Now radar points are projected into the image plane, providing coarse alignment with image detections, and processed with the radar backbone network. Since radar points are sparse and lack contextual information, we bring features from the image backbone into the radar network providing image-plane spatial information to the radar pipeline. Then the radar neck and head portions perform the radar-based detection in the image space, but only at radar pixels, i.e., the projection of radar points onto the image plane, at five resolutions to maintain consistency with FCOS3D. We then fuse these two sets of detections through a depth fusion module that updates the predictions’ depths using a confidence score. 
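At a schematic level, the two-stage fusion can be sketched as follows. This is PyTorch-style pseudocode with placeholder module names and signatures, not the actual implementation; the real network fuses backbone features at three levels (C3–C5) and keeps the monocular branch frozen, as detailed in Secs. 4.4.1–4.4.3.

```python
import torch
import torch.nn as nn

class RadarCameraFusion(nn.Module):
    """Schematic two-branch fusion: feature-level concat + detection-level depth fusion."""

    def __init__(self, cam_backbone, radar_backbone, cam_neck, cam_head,
                 radar_neck, radar_head, depth_fusion):
        super().__init__()
        self.cam_backbone = cam_backbone      # frozen monocular backbone (e.g., FCOS3D's)
        self.radar_backbone = radar_backbone  # lightweight backbone on image-plane radar channels
        self.cam_neck, self.cam_head = cam_neck, cam_head
        self.radar_neck, self.radar_head = radar_neck, radar_head
        self.depth_fusion = depth_fusion      # detection-level depth fusion (Sec. 4.4.3)

    def forward(self, image, radar_channels):
        # Image branch: unchanged monocular detections.
        cam_feats = self.cam_backbone(image)            # list of multi-scale feature maps
        cam_dets = self.cam_head(self.cam_neck(cam_feats))

        # Radar branch: radar features concatenated with image features (feature-level fusion).
        radar_feats = self.radar_backbone(radar_channels)
        fused_feats = [torch.cat([c, r], dim=1)
                       for c, r in zip(cam_feats, radar_feats)]
        radar_dets = self.radar_head(self.radar_neck(fused_feats))

        # Detection-level fusion refines monocular depths with radar-predicted centers.
        return self.depth_fusion(cam_dets, radar_dets)
```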
Figure 4.3 RADIANT Architecture Details. RADIANT architecture includes two parallel branches. On the left (in magenta) is the unchanged monocular detection pipeline. On the right (in blue), the radar network processes image-projected radar points and borrows features from the monocular network to predict offsets to the radar pixels and to their depths in the shared radar head. C3 to C5 denote feature maps from level 3 to 5 in the backbone, and P3 to P7 represent feature maps from level 3 to 7.

4.4.1 Radar Branch

One of the goals in designing RADIANT is to make minimal changes to the existing monocular architectures. Therefore, RADIANT builds seamlessly on top of an existing SOTA monocular network, such as FCOS3D [136], which can be separately trained. While it would be natural to simply augment color images with additional radar channels and retrain the image network, we found this ineffective, with the resulting network unable to benefit from the radar data. Instead, we use a separate backbone, ResNet-18 [31], for the radar processing and freeze the image branch while training the radar branch (see Fig. 4.3). The inputs to the radar branch are in image coordinates with values on radar projections, consisting of radar depth, radar BEV coordinates, Doppler velocity, and a mask for radar pixels. Except for the backbone and losses, the majority of the radar branch is similar to the image branch, with a radar backbone processing inputs and generating radar features, which are concatenated with image features at three resolution levels, then go through an independent neck consisting of a Feature Pyramid Network (FPN) [64] and the radar heads. The radar branch outputs data in the same space, i.e., five levels of image resolutions, and uses the same classification and regression losses. This makes the radar branch outputs compatible with the image branch outputs. RADIANT performs image and radar fusion at two stages: feature-level fusion from the backbone and detection-level fusion after the heads.

4.4.2 Radar Heads

Radar heads take in fused radar and image features of five resolutions and predict class scores as well as relative positions to the object center, namely the depth offset and the pixel offset. The radar head ignores prediction of object sizes as radar points are too sparse to reveal shape information.
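At inference, each positive radar pixel is therefore decoded into an object-center hypothesis by adding its predicted pixel and depth offsets to the raw projection and measured depth (cf. Eq. (4.3) below). A minimal decoding sketch, with assumed array layouts and an assumed score threshold, is:

```python
import numpy as np

def decode_radar_detections(radar_pixels, radar_depths, pixel_offsets,
                            depth_offsets, scores, t_r=0.05):
    """Turn radar-head outputs into per-return object-center candidates.

    radar_pixels  : (N, 2) projected radar pixel coordinates (u, v).
    radar_depths  : (N,) measured radar depths z_r.
    pixel_offsets : (N, 2) predicted offsets to the projected object center.
    depth_offsets : (N,) predicted depth offsets to the object center.
    scores        : (N,) radar-head classification scores.
    t_r           : score threshold for keeping a radar candidate (assumed value).
    """
    keep = scores > t_r
    centers_uv = radar_pixels[keep] + pixel_offsets[keep]       # predicted projected centers
    center_depths = radar_depths[keep] + depth_offsets[keep]    # predicted center depths
    return centers_uv, center_depths, scores[keep]
```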
We build radar heads independently from monocular camera heads because of the differences in what they predict as well as the definition and positions of positive pixels: (1) camera heads estimate the full depth of objects while radar heads estimate the offset depth with respect to the radar measured depth; (2) positive camera pixels are only a 3 × 3 region at the object center while positive radar pixels may be further away from the center. Nevertheless, we use the same strategy as FCOS3D [136] to assign positive pixels to different resolution levels so that larger objects on images are detected at lower resolutions. In summary, the radar head is complementary to the camera head, with more accurate position estimation for those objects with radar hits on them.

4.4.3 Depth Fusion Module

Detection heads generate detection candidates for pixels classified as positive. As the depths from the radar and camera heads are predicted independently, to take advantage of both depths, they are fused in the Depth Fusion module as follows. First, radar pixels are associated with monocular detection candidates. This is straightforward as the radar head outputs depth and pixel offsets, and these can be matched to the 3D centers of detection candidates. Second, a confidence-based weight is predicted for each radar pixel, enabling its depth to be combined with the monocular-based depth. Note that we focus on using radar to enhance only depth prediction, since depth accuracy is a primary merit of radar compared to camera, while other aspects of detection such as object size and image position are unlikely to be improved by radar. This fusion process is described in detail as follows.

Radar-Camera Association. RADIANT outputs two sets of detections from each head: {b^c_i}_i from the camera head and {b^r_j}_j from the radar head, where i and j represent indices of detections from the camera and radar head, respectively. Typical outputs from the camera head are box proposals for detection b^c_i, consisting of the projected center (û^c_i, v̂^c_i), depth ẑ^c_i, classification index Ŷ^c_i, detection score σ̂^c_i, dimensions of the 3D box, orientation, and deltas for 2D detections. The outputs from the radar head are box deltas for detection, which contain the pixel offsets (Δû^r_j, Δv̂^r_j) of the object center from the projected radar point (u^r_j, v^r_j), the depth offset Δẑ^r_j of the object center from the radar depth z^r_j, classification index Ŷ^r_j, and detection score σ̂^r_j.

Since the radar head does not affect 2D detection, 3D dimensions and orientations, we omit these variables in the camera outputs b^c_i in the subsequent paragraphs for brevity. In other words, we only specify the relevant portion of the b^c_i vector in the following text. We now write these two sets of detections as

b^c_i = (û^c_i, v̂^c_i, ẑ^c_i, Ŷ^c_i, σ̂^c_i),
b^r_j = (Δû^r_j, Δv̂^r_j, Δẑ^r_j, Ŷ^r_j, σ̂^r_j).   (4.2)

We then filter the box proposals from both camera and radar as follows. The camera head outputs a maximum of 1,000 boxes with the highest scores on each level and with σ̂^c_i > T_c, where T_c denotes the minimum threshold for the box to be valid. We employ a similar procedure for radar detection candidates and consider radar projections with σ̂^r_j > T_r. After the filtering step, we have good box proposals from the two modalities.
From the radar pixel (u^r_j, v^r_j) and the predicted depth/pixel offsets, we calculate the projected centers (û^r_j, v̂^r_j) and depths ẑ^r_j of the boxes, which we use for the radar-camera association, as

(û^r_j, v̂^r_j) = (u^r_j, v^r_j) + (Δû^r_j, Δv̂^r_j),   ẑ^r_j = z^r_j + Δẑ^r_j.   (4.3)

We consider a camera proposal to be associated with a radar proposal if the predicted class labels of the two modalities match, and the projected centers and the depths are close to each other. In other words, we take a camera proposal b^c_i and iterate through the radar proposals b^r_j, and a match is found if the following conditions are satisfied:

Ŷ^c_i = Ŷ^r_j,   (4.4)
||(û^c_i, v̂^c_i) − (û^r_j, v̂^r_j)||_2 < T_p,   (4.5)
|ẑ^c_i − ẑ^r_j| < T_d,   (4.6)

where T_p and T_d denote the distance thresholds on pixels and depth, respectively. We use different thresholds for projected centers and depth as the errors in the two spaces are different. Thus, we obtain a set of potential corresponding radar detections {b^r_j} for each b^c_i. The complexity of matching is O(MN), where M and N are the numbers of camera and radar proposals. The score-based thresholding limits the running time of the downstream matching algorithm.

Depth Weighting Network. The high-level idea of RADIANT is to update the monocular depth of the boxes with better depth from radar. Although the radar depth is generally more accurate than the camera depth, the camera depth may be better for nearby objects because of the richer semantic information. Thus, always preferring radar depth over camera is not beneficial. In other words, there should be a better weighting mechanism between the two depths. Hence, to better determine association and depth weights for a potential camera-radar detection pair extracted with Eqs. (4.4) to (4.6), we train another depth weighting network (DWN) to output the relative confidence in radar and camera depths. The DWN is a 4-layer multilayer perceptron (MLP) that outputs a classification score α between 0 and 1, where 1 indicates radar is more accurate and 0 that monocular is more accurate. Its input is a vector comprised of head output features, raw depths, distance, and Doppler/predicted velocity consistency.

The training labels for this network are binary. We assign the GT label for training as follows: if the GT depth z of a box is closer to the radar estimated depth, the GT label is α = 1; otherwise it is zero. In other words,

α = 1 if |ẑ^r_j − z| < |ẑ^c_i − z|, and α = 0 otherwise.   (4.7)

Fused Depth Calculation. Fusion of camera and radar depths occurs as follows. Assume there are N radar associations b^r_j, for j ∈ set(i), potentially associated with a given camera detection b^c_i. We run inference over the DWN for all these N pairs to obtain a sequence of confidence scores α_j. We then calculate the fused depth ẑ_fuse from the radar depths, the weights α_j, and the monocular depth as

ẑ_fuse = (Σ_j α_j ẑ^r_j) / (Σ_j α_j)   if ∃ j, α_j > T_α,
ẑ_fuse = ẑ^c_i   if ∀ j, α_j ≤ T_α,   (4.8)

where T_α denotes the depth weighting threshold.

4.5 Experiments

We apply the proposed method on the detection task of the nuScenes dataset [10], a widely used dataset with both images and radar points collected in urban driving environments. The nuScenes detection dataset consists of 28,130 training samples, 6,019 validation samples, and 6,008 test samples.
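Before turning to the experiments, a minimal sketch ties together the association and depth-fusion steps of Section 4.4.3 (Eqs. 4.3 to 4.8). The data layout, the dwn callable standing in for the depth weighting network, and the threshold values are illustrative assumptions, not the dissertation's code.

```python
import numpy as np

def associate_and_fuse(cam_det, radar_dets, dwn, T_p=12.0, T_d=1.5, T_alpha=0.5):
    """Associate one camera detection with radar proposals and fuse depths.

    cam_det    : dict with 'uv' (2,), 'depth', 'label' (plus whatever the DWN needs)
    radar_dets : list of dicts with 'uv' (projected radar pixel), 'duv' (pixel offset),
                 'depth' (measured radar depth), 'dz' (predicted depth offset), 'label'
    dwn        : callable returning a confidence alpha in [0, 1] for a (camera, radar) pair
    The thresholds T_p (pixels), T_d (meters), and T_alpha are illustrative values only.
    """
    alphas, radar_depths = [], []
    for r in radar_dets:
        center_uv = np.asarray(r['uv']) + np.asarray(r['duv'])        # Eq. (4.3), pixel part
        z_r = r['depth'] + r['dz']                                    # Eq. (4.3), depth part
        same_class = (r['label'] == cam_det['label'])                 # Eq. (4.4)
        close_px = np.linalg.norm(center_uv - np.asarray(cam_det['uv'])) < T_p  # Eq. (4.5)
        close_z = abs(z_r - cam_det['depth']) < T_d                   # Eq. (4.6)
        if same_class and close_px and close_z:
            alphas.append(dwn(cam_det, r))                            # relative radar confidence
            radar_depths.append(z_r)

    alphas, radar_depths = np.array(alphas), np.array(radar_depths)
    if len(alphas) and (alphas > T_alpha).any():                      # Eq. (4.8), radar-weighted depth
        return float((alphas * radar_depths).sum() / alphas.sum())
    return float(cam_det['depth'])                                    # otherwise keep the monocular depth
```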
We experimentally show improvements in depth estimation accuracy and overall detection performance after enhancing monocular methods with the proposed strategy. The proposed method also achieves SOTA performance in object detection via radar-camera fusion.

On Using Pre-trained Weights. We use pre-trained weights for the image-based detector as this greatly reduces computation in training, and also because it preserves the performance of the monocular detector. SOTA monocular detectors are difficult to train, with various hyperparameter adjustments needed to achieve published performance. Furthermore, it is not necessarily the case that jointly training the image branch and radar branch will lead to improved performance. For example, when starting from pre-trained weights without freezing them and jointly training the radar/camera heads for 100 iterations, the monocular performance (mAP) on 900 validation images decreased from 0.33 to 0.25. Another advantage of not re-training is that the method can be easily plugged into any monocular detector without hampering its performance.

Method            ≤ 10    10−30   ≥ 30    All
Monocular Heads   0.563   1.442   6.042   3.415
Radar Heads       0.413   0.649   1.017   0.791
Raw Radar Depth   1.056   1.082   1.361   1.204

Table 4.1 Depth prediction error (in meters) on the nuScenes validation subset using monocular heads and radar heads on image/radar pixels labeled as object, over close, medium, and long range objects.

4.5.1 Depth Errors for Camera and Radar Heads

To show the advantage of the proposed radar head over the monocular head in object depth estimation, we quantitatively compare the depth estimation accuracy of the radar head with the monocular head FCOS3D [136] on 900 random images from the nuScenes validation set in Table 4.1. We compute the mean absolute error (MAE) for camera estimated depth, radar estimated depth, and raw radar depth for monocular/radar pixels where GT depths are known. For a fair comparison, the depths are compared for objects having both positive camera and radar pixels as labels; the error for each object is averaged over all pixels associated with it, and the final error is averaged over all objects. In addition, we also show the error if we directly use radar depth as object depth without compensating with the estimated residual depth.

Table 4.1 shows that the radar heads achieve better depth estimation compared with the camera heads, especially for far objects. We also plot the distribution of estimated and GT offset depths in Fig. 4.4. The estimated residual depth follows the GT distribution. It can be seen that, typically, the object center is a little farther than the measured depth of radar points. This demonstrates the usefulness of the offset depth in compensating for the error of the direct radar measurement.

4.5.2 nuScenes Quantitative Results

The nuScenes [10] leaderboard evaluates detection with metrics including mAP, mean average translation error (mATE), mean average size error (mASE), mean average orientation error (mAOE), and mean average velocity error (mAVE). As this chapter focuses on using radar to improve the monocular depth estimation of objects, mAP and mATE are the most relevant metrics and are reported. Other metrics are not reported since we did not update them with radar.

Figure 4.4 Histograms of predicted and GT depth offsets between radar returns and object centers. Both these distributions are in close agreement with each other.
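The evaluation protocol behind Table 4.1 averages absolute depth errors per object first and then over objects, split into range buckets. A small sketch of that computation, with an assumed per-object data layout, is given below.

```python
import numpy as np

def depth_mae_by_range(objects, buckets=((0, 10), (10, 30), (30, np.inf))):
    """Mean absolute depth error, averaged per object first, then over objects.

    objects : list of dicts, each with
              'gt_depth'    : ground-truth object depth (meters)
              'pred_depths' : predicted depths at the pixels labeled for that object
    Returns a dict mapping each range bucket to its MAE, plus an 'all' entry.
    """
    per_obj_err = np.array([np.abs(np.asarray(o['pred_depths']) - o['gt_depth']).mean()
                            for o in objects])
    gt_depth = np.array([o['gt_depth'] for o in objects])
    out = {'all': float(per_obj_err.mean())}
    for lo, hi in buckets:
        sel = (gt_depth >= lo) & (gt_depth < hi)
        out[f'{lo}-{hi}'] = float(per_obj_err[sel].mean()) if sel.any() else float('nan')
    return out
```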
To show the effectiveness of the proposed camera-radar fusion on both the test set (Table 4.2) and the validation set (Table 4.3), we compare the performance, i.e., mATE and mean/classwise average precision (AP), of the monocular methods FCOS3D [136] and PGD [137] before and after being fused with the proposed radar head outputs, and we see significant improvements in mAP and mATE after combining with the proposed radar heads. Note that, for fairness, we compare each monocular method and the corresponding RADIANT with the same monocular weights, because the performance of RADIANT is partly determined by the performance of the underlying monocular detection.

R  C  Method             mATE(↓)  AP(↑): Mean  Car    Truck  Bus    Trailer  CV     Ped.   Motor.  Bicycle  TC     Barrier
   ✓  MonoDIS-M [120]    0.738    0.304        0.478  0.220  0.188  0.176    0.074  0.370  0.290   0.245    0.487  0.511
   ✓  CenterNet [160]    0.658    0.338        0.536  0.270  0.248  0.251    0.086  0.375  0.291   0.207    0.583  0.533
   ✓  FCOS3D [136]       0.690    0.358        0.524  0.270  0.277  0.255    0.117  0.397  0.345   0.298    0.557  0.538
   ✓  PGD [137]          0.646    0.360        0.547  0.268  0.253  0.243    0.087  0.422  0.379   0.300    0.584  0.525
✓  ✓  CenterFusion [91]  0.631    0.326        0.509  0.258  0.234  0.235    0.077  0.370  0.314   0.201    0.575  0.484
✓  ✓  FCOS3D + RADIANT   0.622    0.374        0.582  0.301  0.257  0.248    0.145  0.439  0.386   0.302    0.579  0.500
✓  ✓  PGD + RADIANT      0.609    0.380        0.602  0.302  0.267  0.242    0.107  0.444  0.416   0.312    0.604  0.503

Table 4.2 Performance comparison on nuScenes test set. R, C, CV, TC, Ped., and Motor. stand for radar, camera, construction vehicle, traffic cone, pedestrian, and motorcycle, respectively.

R  C  Method             mATE(↓)  AP(↑): Mean  Car    Truck  Bus    Trailer  CV     Ped.   Motor.  Bicycle  TC     Barrier
   ✓  FCOS3D [136]       0.739    0.326        0.494  0.236  0.316  0.115    0.057  0.416  0.306   0.303    0.549  0.465
   ✓  PGD [137]          0.658    0.368        0.546  0.290  0.378  0.148    0.063  0.374  0.441   0.343    0.595  0.504
✓  ✓  CenterFusion [91]  0.649    0.332        0.524  0.265  0.362  0.154    0.055  0.389  0.305   0.229    0.563  0.470
✓  ✓  FCOS3D + RADIANT   0.653    0.363        0.587  0.291  0.371  0.120    0.073  0.447  0.364   0.333    0.581  0.467
✓  ✓  PGD + RADIANT      0.617    0.384        0.616  0.310  0.382  0.141    0.068  0.462  0.395   0.374    0.604  0.487

Table 4.3 Performance comparison on nuScenes validation set.

Next, we compare with CenterFusion [91], the best published radar-camera fusion method on the nuScenes test set. Our method outperforms CenterFusion in both mAP and mATE, which indicates that our method acquires more correct detections and smaller localization error for those positive detections. For classwise results, RADIANT shows a gain of over 18% and 20% in AP over CenterFusion [91] on Cars and Pedestrians, respectively, in Table 4.2. This improvement is significant as cars and pedestrians are common participants in traffic, with Cars accounting for about 50% of the total objects in the nuScenes detection dataset.

4.5.3 nuScenes Qualitative Results

Fig. 4.5 shows the detections of the monocular detector FCOS3D [136] (in magenta), detections with fusion from our proposed RADIANT (in cyan), and GT bounding boxes (in dashed orange) on the image and in BEV, respectively. In addition, we plot estimated position offsets for radar pixels with scores larger than 0.3. It is clear that the estimated residual depths are able to compensate for the depth gap between radar measurements and actual object positions. As a result, the proposed RADIANT corrects the localization error of the FCOS3D [136] detections and achieves accurate 3D position estimates, leading to better 3D detection performance. The detections from monocular and fusion have the same orientations because only depths are updated during fusion.
Specific examples show the effectiveness of depth correction in both near (Fig. 4.5 (b)) and long (Fig. 4.5 (h)) ranges.

4.5.4 Ablation Studies

Our proposed RADIANT model uses the DWN to predict the confidence of radar depths for depth fusion and therefore carries out an intelligent merging of the depths. We therefore carry out the ablation of this component on the nuScenes validation set in Table 4.4 for both monocular methods, FCOS3D [136] and PGD [137]. In addition, we also consider an alternative strategy of averaging the depths of camera box proposals and neighboring radar proposals, which we call Average Fusion in the table.

The results in Table 4.4 show that the fusion methods outperform their monocular (non-fusion) counterparts. This is expected because depth remains the hardest parameter to estimate for the monocular methods [82]. More importantly, the fusion with DWN outperforms the Average fusion by a significant amount on both metrics, mATE and mAP, for both monocular methods. This suggests that the DWN carries out an intelligent fusion of radar and camera depths instead of blindly averaging them, proving the effectiveness of the DWN.

Monocular Method   Fusion    mATE (↓)  mAP (↑)
FCOS3D [136]       None      0.739     0.326
FCOS3D [136]       Average   0.711     0.342
FCOS3D [136]       DWN       0.653     0.363
PGD [137]          None      0.658     0.368
PGD [137]          Average   0.647     0.371
PGD [137]          DWN       0.617     0.384

Table 4.4 Ablation on fusion strategies, i.e., average fusion and the depth weighting network (DWN). The best results under the same monocular component are in bold.

Figure 4.5 Qualitative Results. Visualization of predicted radar offsets (cyan arrows) to object centers and detections on image and BEV. The radar association of RADIANT corrects the localization error of FCOS3D [136], improving detection performance. Orange, magenta, and cyan boxes are GT bounding boxes, monocular detections, and RADIANT detections, respectively. The vertical axis in BEV indicates object distance in meters.

4.6 Summary

Fusing radar with camera-based detectors has proven challenging due in part to radar's low spatial resolution. Our network, RADIANT, provides a new way to fuse radar with 3D monocular image detectors. Using a radar branch in parallel to an image branch, RADIANT fuses both mid-level features and final detections. The mid-level features provide context to radar returns for predicting the object center offsets. RADIANT uses these offsets for association with the image detector and to obtain accurate depth estimates for object detections, which are fused with the camera detections. We show that the parallel-branch fusion approach in RADIANT works with two monocular detectors, FCOS3D and PGD. Finally, RADIANT achieves SOTA fused radar-camera detection on the nuScenes dataset.

Limitations. Although RADIANT improves the depth estimation of monocular methods, it does not enhance other detection parameters such as object sizes, and we leave velocity improvements to future work.

CHAPTER 5
RICCARDO: RADAR HIT PREDICTION AND CONVOLUTION FOR CAMERA-RADAR 3D OBJECT DETECTION

Radar hits reflect from points both on the boundary of and internal to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size, and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network.
In this chapter, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves the SOTA radar-camera detection performance on nuScenes. This chapter has been accepted for publication as [72]. 5.1 Introduction 3D object detection [10, 28] is a key component of scene understanding in autonomous vehicles. It predicts nearby objects and their attributes including 3D location, size, orientation, and category, setting the stage for navigation tasks such as path planning. The primary sensors used for 3D object detection are cameras, lidars [154], and radars, with the focus here on the two-sensor combination that is the least expensive and already ubiquitous on vehicles, namely cameras and radars. This chapter asks how to combine camera and radar data in order to achieve the best performance improvement over a single modality detection. Detection can be performed on camera and radar individually, with each sensor having its strengths and drawbacks. Cameras are inexpensive and capture high-resolution details and texture with SOTA methods [46, 77, 99] achieving accurate object classification, as well as estimating size and orientation. One of the primary limitations is the depth-scale ambiguity, resulting in relatively inaccurate object depth estimation [46, 77]. In contrast, current automotive radar is another inexpensive sensor that directly measures range to target, as well as Doppler velocity, and 65 is robust to adverse weather such as rain, snow, and darkness [91]. The drawbacks of radar are its very sparse scene sampling and lack of texture, making it challenging to perform tasks such as object categorization, orientation, and size estimation. For example, radar point clouds collected in nuScenes dataset [10] are 2D points on radar BEV plane without height measurements (the default height is zero). This comparison shows that the strengths of radar and camera are complementary, and indeed this chapter explores how we can combine information from these two modalities to improve 3D object detection. While radar and lidar are both widely used depth sensors for autonomous vehicles, they differ significantly in their target sampling characteristics. Lidar points align densely with the edges of objects, and for vehicles typical form a distinct “L" or “I" distribution. These regular distributions are aligned with the object pose and enable precise shape and pose estimation from lidar scans as evidenced by top-performing lidar methods [155] on nuScenes achieving mAP of 0.702 and nuScenes detection score (NDS) of 0.736. In contrast, radars have wide beam-width with low angular resolution and often penetrate objects or reflect from their undersides. This leads to a much sparser and dispersed distribution of hits on objects. From this distribution it is much more challenging to estimate target shape, category, and pose, and the top-performing radar-only method, RadarDistill [2], achieves a far lower mAP of 0.205 and NDS of 0.437 on nuScenes. 
The combination of camera with radar has potential to significantly enhance radar-only methods, with camera data providing strong results on object category, shape, and pose. Radar, with its direct range measurements, can contribute range and velocity to a combined detector. Existing radar-camera models combine camera and radar via concatenation [91], (weighted) sum [43], or attention-based operations [44, 45]. But obtaining measurable gains by fusing sparse radar and camera is difficult, with the top radar-camera method [45] having lower performance than camera-only methods [67]. To address the difficulty in combining disparate modalities, we conjecture that directly modeling the radar distribution’s dependency on target properties will enable radar returns to be more effectively aligned and leveraged in object detection. As shown in Fig. 5.1, we introduce a model 66 (a) (c) (b) (d) Figure 5.1 Given a (a) monocular detection, we estimate (b) radar point distribution relative to its bounding box in BEV; then we shift the distribution and convolve it with (c) actual radar measurement in the neighborhood to compute (d) similarity scores and estimate an updated position, where the matching score is maximum. In (c) the monocular bounding box (in magenta) is misaligned with radar points; the updated position (in orange) with peak matching score shifts the box to a farther range so that relative positions of radar points match the predicted distribution (radar hits concentrated at the head of vehicle instead of in the middle). that generates a radar hit prediction (RIC) in BEV from a given monocular detection priors, i.e., bounding box size, relative pose, and object class. By moving a distribution kernel in the neighborhood of monocular estimated center and computing the similarity between the kernel and actual radar measurements, we identify potential object centers with high similarity. Those position candidates are passed through a refinement stage for final position estimation. In nutshell, this chapter makes the following contributions: • Builds a model to predict radar hit distributions relative to reflecting objects on BEV. • Proposes position estimation by convolving predicted radar distribution with actual radar measurements. • Achieves SOTA performance on the nuScenes dataset. 67 5.2 Related Works Monocular 3D Detection. Monocular 3D detection is known for its low cost and simple setup, attracting extensive research optimizing every component in the detection pipeline, e.g., architec- tures [5, 6, 141], losses [8, 48, 120], and NMS [47]. Efforts to alleviate the intrinsic ambiguities from camera image to 3D include incorporating estimated monocular depth [105, 121, 138], de- signing special convolutions [46], considering camera pose [161], and taking advantage of CAD models [70]. However, depth ambiguities still remain a bottleneck to performance [48, 82] and hence, fusion with radar is a promising strategy for enhancing depth estimation while keeping low computation cost. Camera-Radar Fusion for Detection. Camera-radar fusion has been widely applied to different vision tasks, e.g., depth estimation [63, 76, 123], semantic segmentation [151], target velocity estimation [75, 98], detection [73, 151], and tracking [127]. For detection, various camera-radar methods have been proposed which differ in representations [44, 91], fusion level [43, 91], space [43, 45, 158], and strategies [145]. There are different radar point representations. 
As 2D radar measures BEV XY locations without height, it is natural to represent radar points in binned BEV space [45]. To associate radar points with camera source, radar points are also modeled as pillars [91, 145] with fixed height in 3D space. In addition, radar points are represented as point feature [44] and each point as a multi-dimensional vector with elements of radar locations and other properties from measurement such as radar cross-section. Radar and image sensors can interact at either the feature-level [43] or the detection-level [91]. While intermediate image features offer a wealth of raw information, such as textures, detection- level outputs provide directly interpretable information with clear physical meanings, making them well-suited for fusion tasks. With the rapid advancement of monocular 3D detectors, leveraging detection-level image information allows us to capitalize on their accurate estimates of object category, size, orientation, and focus on refining position estimation, particularly range estimation, where radar excels as a range sensor. Therefore, in this chapter, we utilize camera information at the detection level. 68 The fusion is conducted in image view [73, 91] , BEV [45] or a mix of the two [43]. Image features are naturally in image view, while radar is in BEV. To combine them in one space, fusion methods either project radar points to image space [73, 91] or lift image features to 3D space [45, 158]. Image view suffers from overlappings and occlusions while it is imprecise to transform from image view to BEV without reliable depths. In this work, we adopt BEV space for fusion as we use image source monocular detections, which are already in 3D space. The radar-camera association is conducted by associating the radar pillars with monocular boxes in 3D space [91] or projecting the pillars on image to extract corresponding image fea- tures [145]. However, none of these methods explicitly leverage radar point distributions to address the misalignment problem in radar-camera fusion. 5.3 RICCARDO Our goal is to enhance object position estimation using radar returns, surpassing the capabilities of monocular vision. The challenge lies in the sparse and non-obvious alignment of radar returns with object boundaries and features, unlike the dense and consistent lidar returns. To address this, we propose a method that explicitly models the statistics of radar returns on objects, taking its category, size, orientation, range, and azimuth into account. These statistics enable radar returns to improve monocular detections. Our approach, called Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection (RICCARDO), is illustrated in Fig. 5.2 and involves three stages. The first stage predicts the radar distribution returns on an object based on monocular detector outputs. The second stage convolves the predicted distribution with accumulated and binned radar measurements to obtain a range-based score. The third and final stage refines the range-based score to obtain a final range estimate. We describe the details of each stage below. 5.3.1 Stage 1: Radar Hits Prediction (RIC) Model The RIC model aims to predict radar hit position distributions on objects in BEV, conditioned on object category, size, orientation, range, and azimuth. This model leverages monocular detection data to predict radar returns as a distribution, enabling comparison with actual measured returns 69 Figure 5.2 RICCARDO inference. 
RICCARDO leverages a monocular detector to identify objects and estimate their attributes (category, size, orientation, and approximate range) and involves three stages. Its Stage 1 then predicts the radar hit distribution (RIC) for each object. Stage 2 bins and convolves the observed accumulated radar returns with the RIC to generate a matching score over range. A final Stage 3 fusion refines these scores to yield a precise target range estimate.

(Section 5.3.2). This section details the construction and learning of this predictive model. We model radar hit distributions as a probability of radar return over a set of grid cells.

Coordinate System. A key choice is the coordinate system in which to predict this distribution. Possibilities include object-aligned, sensor-aligned, or ego-vehicle-aligned systems. We choose an object-aligned system for modeling radar hit position distributions. This choice allows for more gradual changes in probabilities as a function of relative sensor location compared to ego-vehicle or sensor-aligned systems, which exhibit significant variations with changes in relative object pose. Consequently, the object-aligned system facilitates learning from limited data.

Architecture. Fig. 5.3 shows the overview of Stage 1 in RICCARDO. Stage 1 employs a neural network to predict the distribution of radar points in object-centric BEV coordinates, conditioned on object category, size, orientation, range, and azimuth. The output is a 2D quantized BEV probability map centered on the object, with X and Y axes aligned to the object's length and width dimensions. The RIC pixel value represents the density of radar hits at those locations. The network architecture comprises an MLP model, with parallel preprocessing branches for input parameters and a main branch that fuses these features and predicts the RIC.

Figure 5.3 Stage-1 RIC Model Training. The radial ray from ego to target is plotted as a dashed line for reference.

Ground Truth Distribution. To construct a GT distribution for a given target, we define a grid relative to the target's known position and accumulate radar hits over a short time interval. The normalized density of returns for a grid cell i, P̄_i, is calculated as

P̄_i = c_i / Σ_{i=1}^{N} c_i,   (5.1)

where c_i is the number of radar returns in grid cell i, and N is the total number of cells on the RIC map. This model only includes the points within the object boundary.

Targets may move during an accumulation period in a driving scene, and directly accumulating radar hits causes smearing. Thus, we offset each radar hit position by the object motion between the radar return and the final target position. A pair of sequential annotated locations can provide the GT object velocity, v_O, for calculating this. Yet, radar Doppler velocity is more accurate for the radial component; thus, we combine the Doppler velocity, v_D, with the tangential component of v_O. Given a unit vector perpendicular to the radar ray, n_T, the offset m of a radar point is

m = ((v_O · n_T) n_T + v_D) Δt,   (5.2)

where Δt is the time between the radar measurement and the current sweep. We apply these offsets before calculating the probabilities in Eq. (5.1).

Figure 5.4 Stage 2 takes the RIC predictions and neighboring radar points.
It matches the predicted radar distribution with radar points in the radial direction and computes binned, range-dependent matching scores. Peak positions indicate range estimates.

Loss Function. The loss function incorporates a cross-entropy loss L_CE between the predicted radar distribution P and the GT distribution from the accumulated radar returns, P̄. In addition, we include a smoothness regularization term L_S on P to encourage spatial smoothness:

L_S = 1/(N_c(N_r − 1)) Σ_{j=1}^{N_c} Σ_{i=1}^{N_r−1} |P(i, j) − P(i+1, j)| + 1/(N_r(N_c − 1)) Σ_{i=1}^{N_r} Σ_{j=1}^{N_c−1} |P(i, j) − P(i, j+1)|,   (5.3)

where i and j are pixel indices within the N_r × N_c grid of rows and columns. Thus, the total loss for training is L_CE + L_S.

5.3.2 Stage 2: Convolving RIC with Radar

Stage 2 uses the predicted radar density from the RIC model to score the consistency of object positions with the measured radar hits. By binning measured radar returns, we can use a convolution of the RIC to obtain similarity scores as a function of range, and so locate potential target positions. Fig. 5.4 shows the overview of Stage 2. Both the predicted and measured radar data are resampled into a BEV space with the X-axis along the radial direction (from ego to target, approximately parallel to radar rays) and the Y-axis along the tangential direction. We align the central row of the RIC map with the central row of the radar measurement map and restrict the search to be along the X-axis (i.e., the radial axis) near to the monocular detected center. Our motivation is that monocular detections are more accurate tangentially (i.e., in image space) and less so radially. To compute the similarity between the predicted and measured densities, we slide the RIC kernel along the radial axis and, at each position, compute the sum of element-wise products (a convolution) between the RIC and the actual point counts. This results in a 1D matching score profile along the radial dimension, indicating potential target locations in the vicinity of the monocular centers. Specifically, given a binned radar distribution P with center (L_P, L_P) and resolution (2L_P + 1, 2L_P + 1), and a radar binned count C with center (L_C, L_C), i.e., the object center from the monocular prediction, and size (2L_C + 1, 2L_C + 1), the row convolution (or cross-correlation) is

S_STG2(n) = Σ_{i=−L_P}^{L_P} Σ_{j=−L_P}^{L_P} P(L_P + i + 1, L_P + j + 1) C(L_C + i + 1, L_C + j + n + 1),   (5.4)

where n is the offset to the object center along the row. To create the measured densities, we first accumulate multiple radar sweeps over a short interval prior to the detection. Object motions of radar points are compensated by Doppler velocity and object velocity from monocular detections, similar to Eq. (5.2). The difference is that the object velocity, v_O, is obtained from the monocular detector.

5.3.3 Stage 3: Camera-Radar Candidate Selector

High-scoring positions in the Stage-2 matching profile provide candidate object positions. The purpose of Stage 3 is to select the best range predictions among the Stage-2 candidates by considering the combined evidence from monocular detection and radar measurements. Stage 3 trains a neural network to rescore the candidate positions using additional evidence. This model should consider multiple factors that indicate the confidence of each choice.
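To make the Stage-1 accumulation (Eqs. 5.1 and 5.2) and the Stage-2 radial matching (Eq. 5.4) described above concrete, here is a minimal NumPy sketch. The grid conventions, helper names, default bin sizes, and the vector form of the Doppler velocity are assumptions for illustration rather than the chapter's implementation.

```python
import numpy as np

def compensate_and_bin(points_xy, v_obj, v_doppler, dt, center_xy, radial_dir,
                       half_size=64, pix=0.1):
    """Motion-compensate accumulated radar hits (Eq. 5.2) and bin them on a
    radial/tangential grid centered on the target.

    points_xy  : (N, 2) BEV positions of radar hits
    v_obj      : (2,)   object velocity (GT for training, monocular at inference)
    v_doppler  : (N, 2) per-hit Doppler velocity expressed as a radial vector
    dt         : (N,)   time from each hit to the current sweep
    center_xy  : (2,)   target center the grid is built around
    radial_dir : (2,)   unit vector from ego to target (grid column axis)
    Returns the map of binned point counts C; dividing by C.sum() gives the
    normalized density of Eq. (5.1).
    """
    n_t = np.array([-radial_dir[1], radial_dir[0]])         # unit tangential direction n_T
    v_tan = (v_obj @ n_t) * n_t                             # tangential part of object velocity
    offsets = (v_tan[None, :] + v_doppler) * dt[:, None]    # Eq. (5.2)
    rel = points_xy + offsets - center_xy                   # hits moved to the current time
    cols = np.round((rel @ radial_dir) / pix).astype(int) + half_size
    rows = np.round((rel @ n_t) / pix).astype(int) + half_size
    n = 2 * half_size + 1
    counts = np.zeros((n, n), dtype=np.float32)
    ok = (rows >= 0) & (rows < n) & (cols >= 0) & (cols < n)
    np.add.at(counts, (rows[ok], cols[ok]), 1.0)
    return counts

def radial_matching_scores(ric, counts, search_bins):
    """Slide the predicted RIC kernel along the radial (column) axis of the
    measured count map and return the Eq. (5.4) matching score per offset."""
    kh, kw = ric.shape
    ch, cw = counts.shape
    r0 = (ch - kh) // 2                                     # align the central rows
    scores = np.zeros(2 * search_bins + 1, dtype=np.float32)
    for k, n in enumerate(range(-search_bins, search_bins + 1)):
        c0 = (cw - kw) // 2 + n
        if 0 <= c0 and c0 + kw <= cw:
            scores[k] = (ric * counts[r0:r0 + kh, c0:c0 + kw]).sum()
    return scores
```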
Among such factors, high detection scores imply high confidence in the monocular detection; large matching scores indicate more accurate matched positions from Stage 2; and the monocular detector excels at short ranges but suffers at long ranges, where Stage-2 candidates may be a better choice thanks to the radar measurements.

Architecture and Loss. The Stage-3 network takes two types of inputs: (a) monocular detection parameters (class, range, and size); and (b) Stage-2 matching scores at binned ranges. The network preprocesses the inputs via separate linear layers, concatenates the results, feeds them to an MLP to extract features, and predicts a confidence score per range candidate. A cross-entropy loss is used for training the network. Training data are generated from true positive monocular detections with non-zero Stage-2 matching scores.

Inference. Denoting the predicted Stage-3 scores as S_STG3(i), where i is the quantized offset to the monocular predicted range, we estimate the range offset by finding the peak position as

Δn = argmax_i S_STG3(i),   (5.5)

with corresponding Stage-3 score S_STG3(Δn). The final predicted range R_F can be expressed as

R_F = R_CAM + Δn · b_p,   (5.6)

where R_CAM is the monocular predicted range, and b_p is the bin size. Typically, the estimated range is approximately at one of the peak positions of the Stage-2 score or at the monocular estimated range; thus Stage 3 implicitly learns to select the best position candidates from the previous stages. We also update the detection score by combining the monocular detection score S_CAM and the Stage-3 score as

S_F = S_CAM + α S_STG3(Δn),   (5.7)

where α is a Stage-3 weighting parameter. Note that S_STG3 has been processed by the softmax function, and its value ranges from 0 to 1.

5.4 Experiments

Dataset. Our experiments are based on the widely-used nuScenes dataset [10], with both images and radar points collected in urban driving scenarios. Equipped with six cameras and five radars, the ego-vehicle scans traffic environments in 360 degrees. There are 700 training scenes, 150 validation scenes, and 150 test scenes, each with 10 classes of objects specified with bounding boxes.

Data Splits. We follow the standard splits of the nuScenes detection benchmark: the test results are obtained from a model trained on the nuScenes training plus validation set (34K frames) and evaluated on the test set (6K frames); the validation results are trained on the nuScenes training set (28K frames) and evaluated on the validation set (6K frames).

Models                  Params (M)
SparseBEV (V2-99)       94.0
SparseBEV (ResNet101)   63.6
Stage 1                 19.2
Stage 3                 0.3

Table 5.1 Number of parameters in SparseBEV with different backbones as well as in RICCARDO Stage 1 and Stage 3.

Implementation Details. Our code uses PyTorch [100] and the detection package MMDetection3D [16]. Our experiments use SparseBEV [67] as the monocular detector. Note that RICCARDO is flexible and easily adaptable to other monocular methods. To preserve the premium detection performance of the monocular component and focus on training the Stage-3 model, we use pretrained weights for the monocular branch, which are frozen in training and inference. We train the Stage-1 and Stage-3 models with the RMSProp optimizer for 120 epochs with an initial learning rate of 1 × 10−6, which is reduced by half at the 60th epoch. We list the number of parameters in the RICCARDO Stage-1 and Stage-3 models as well as in the underlying monocular models with two backbones in Table 5.1. RICCARDO Stage 1 and Stage 3 are relatively lightweight compared to the monocular models.
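As a concrete reading of the Stage-3 inference rules in Eqs. (5.5) to (5.7), the short sketch below picks the peak range bin and updates the range and detection score. The assumption that the score vector is centered on the monocular range bin, and the default α, are illustrative.

```python
import numpy as np

def stage3_inference(s_stg3, r_cam, s_cam, bin_size, alpha=0.5):
    """Pick the best range candidate and update range and score (Eqs. 5.5-5.7).

    s_stg3   : (K,) softmax scores over quantized range offsets, assumed centered
               on the monocular range (the middle bin corresponds to offset 0)
    r_cam    : monocular predicted range (meters)
    s_cam    : monocular detection score
    bin_size : size of one range bin in meters
    alpha    : Stage-3 score weight (an illustrative value; see Table 5.6)
    """
    k = int(np.argmax(s_stg3))                  # Eq. (5.5): peak position
    dn = k - len(s_stg3) // 2                   # convert array index to signed bin offset
    r_fused = r_cam + dn * bin_size             # Eq. (5.6): refined range
    s_fused = s_cam + alpha * float(s_stg3[k])  # Eq. (5.7): updated detection score
    return r_fused, s_fused
```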
Stage 1: We implement the Stage-1 network with a lightweight MLP-like network shown in Fig. 5.5. An input sample consists of a series of vectors representing different properties of the object, e.g., size, orientation, and range. They are first processed by separate linear projection layers before being concatenated and fed into an MLP of 3 hidden layers. The network output (i.e., the binned distribution map) is defined in object local coordinates with the X-axis parallel to the object length, the Y-axis to the width, and the map center at the object center. It has a resolution of 129 × 129 with a pixel size of 0.1 × 0.1 meters for small and medium-sized object categories and a pixel size of 0.2 × 0.2 meters for large-sized categories such as buses and trailers. The GT distribution is generated by accumulating 13 neighboring radar sweeps (6 previous sweeps, 1 current, and 6 future ones). We assume radar points distribute within the GT bounding boxes on BEV, and thus points outside the bounding boxes are ignored and not used for training. We train the Stage-1 and Stage-3 models separately since the Stage-1 model is invariant to its underlying monocular model while the Stage-3 network depends on the monocular model.

Figure 5.5 Stage-1 Network Structure. The class input is in one-hot encoding; z represents heights of bounding box bottom faces; θ_AZ stands for azimuths of objects in ego coordinates; θ_Y and θ'_Y are object yaws in ego coordinates and relative yaws (i.e., θ_Y − θ_AZ), respectively. "C" represents concatenation, and "Linear" denotes a linear transformation layer. Feature sizes are marked beside network layers.

Figure 5.6 Stage-3 Network Structure. The inputs v_x and v_y are monocular estimated object velocities in ego coordinates; v_R and v_T are monocular velocities in radial and tangential directions, respectively; S_CAM and S_STG2 represent monocular detection scores and Stage-2 matching scores, respectively.

Radar  Camera  Method                  NDS (↑)  mAP (↑)  mATE (↓)  mASE (↓)  mAOE (↓)  mAVE (↓)  mAAE (↓)
       ✓       PGD [137]               0.448    0.386    0.626     0.245     0.451     1.509     0.127
       ✓       SparseBEV [67]          0.675    0.603    0.425     0.239     0.311     0.172     0.116
✓      ✓       MVFusion [145]          0.517    0.453    0.569     0.246     0.379     0.781     0.128
✓      ✓       CRN [45]                0.624    0.575    0.416     0.264     0.456     0.365     0.130
✓      ✓       RCBEVDet [65]           0.639    0.550    0.390     0.234     0.362     0.259     0.113
✓      ✓       HyDRa [143]             0.642    0.574    0.398     0.251     0.423     0.249     0.122
✓      ✓       HVDetFusion [55]        0.674    0.609    0.379     0.243     0.382     0.172     0.132
✓      ✓       SparseBEV + RICCARDO    0.695    0.630    0.363     0.240     0.311     0.167     0.118

Table 5.2 Detection Performance on nuScenes Test Set. RICCARDO achieves SOTA performance for camera-radar fusion.
Radar  Camera  Method                  NDS (↑)  mAP (↑)  mATE (↓)  mASE (↓)  mAOE (↓)  mAVE (↓)  mAAE (↓)
       ✓       PGD [137]               0.428    0.369    0.683     0.260     0.439     1.268     0.185
       ✓       SparseBEV [67]          0.592    0.501    0.562     0.265     0.320     0.243     0.195
✓      ✓       MVFusion [145]          0.455    0.380    0.675     0.258     0.372     0.833     0.196
✓      ✓       CRN [45]                0.607    0.545    0.445     0.268     0.425     0.332     0.180
✓      ✓       RCBEVDet [65]           0.568    0.453    0.486     0.285     0.404     0.220     0.192
✓      ✓       HyDRa [143]             0.617    0.536    0.416     0.264     0.407     0.231     0.186
✓      ✓       HVDetFusion [55]        0.557    0.451    0.527     0.270     0.473     0.212     0.204
✓      ✓       SparseBEV + RICCARDO    0.622    0.544    0.481     0.266     0.325     0.237     0.189

Table 5.3 nuScenes Validation Results.

Stage 2: To perform the Stage-2 convolution along the radial direction, the measured radar positions are binned in radial-tangential coordinates to generate a radar measurement map, with the X-axis parallel to the ray from ego vehicle to target center and the map center at the target center detected by the monocular method. It has a resolution of 193 × 193 with the two pixel sizes mentioned above. Before performing the convolution, the predicted radar distribution map of Stage 1 is rotated from object local coordinates to radial-tangential coordinates. The search range of the convolution is −3.2 m to 3.2 m relative to the range of the target center estimated by the monocular method.

Stage 3: As shown in Fig. 5.6, we implement the Stage-3 network via a lightweight MLP similar to Stage 1. To create labels for GT range, we associate monocular detections with GT bounding boxes under the conditions that the GT bounding boxes fall in the search range of associated monocular detections (with radial distance ≤ 3.2 m) and on the ray from ego to monocular detections (with tangential distance ≤ min(0.5 m, L), where L is the object length).

5.4.1 Quantitative Results on nuScenes

Tables 5.2 and 5.3 show the performance of RICCARDO on the test and validation sets, respectively. The proposed fusion of radar points with monocular detection proposals improves the performance of position estimation, which is evaluated with mAP and mATE for true positive detections. We compare the performance of SOTA monocular and radar-camera fusion methods. Note that, for a fair comparison, the monocular component in RICCARDO, i.e., SparseBEV, uses exactly the same weights within Table 5.2 and likewise uses the same weights within Table 5.3. Specifically, SparseBEV uses V2-99 [54] and ResNet101 [31] as its backbones in Tables 5.2 and 5.3, respectively.

By comparing RICCARDO with its monocular counterpart (i.e., 8th row vs. 2nd row in Tables 5.2 and 5.3), we see a significant improvement in mAP and a reduction in mATE when using RICCARDO with radar data as inputs. Meanwhile, it preserves the good performance of its monocular component in other aspects, e.g., size and orientation estimation.

RICCARDO also achieves SOTA performance among published radar-camera fusion methods: in the test set results shown in Table 5.2 and the validation set results shown in Table 5.3, RICCARDO achieves the best overall performance measured by NDS and comparable performance in other metrics. As a detection-level fusion, the final performance of RICCARDO depends on the quality of its underlying monocular model. We observe a stable and significant improvement in overall performance over its monocular models in both Tables 5.2 and 5.3, although the monocular components adopt different backbones. The ease of plugging in different monocular components in our fusion architecture allows RICCARDO to capitalize on SOTA monocular models and achieve better fusion performance.

5.4.2 Evaluating Stages 2 and 3

To evaluate Stages 2 and 3 in object range estimation, we compare the range error from the monocular detection, Stage-2, and Stage-3 estimations.
The Stage-1 model is trained with the nuScenes training set, and the Stage-3 model is trained with detections by the monocular method and corresponding Stage-2 matching scores from the nuScenes training set. To generate the Stage-2 estimation, the trained Stage-1 model is applied to monocular detections (SparseBEV) in the nuScenes validation set to generate radar distributions, which are subsequently convolved with radar measurements (in Stage 2) to obtain matching scores at binned ranges along the ray. The range with the maximum matching score is used as the Stage-2 estimation for this evaluation. Finally, we estimate the Stage-3 output by feeding the Stage-2 outputs and monocular detections to the Stage-3 model.

From Table 5.4, we can see that simple extraction from Stage 2 can improve over monocular methods with lower median error but suffers from outliers with larger mean error. Stage 3 achieves the best range estimation accuracy across all 10 categories, by fusing the monocular estimation and the Stage-2 outputs.

Method      Class-Mean  Car        Truck      Bus        Trailer    CV         Ped.       Motor.     Bicycle    TC         Barrier
Monocular   0.83/0.57   0.65/0.38  0.86/0.56  1.04/0.80  1.23/1.08  1.13/0.88  0.85/0.55  0.81/0.50  0.65/0.40  0.51/0.24  0.59/0.31
Stage 2     0.94/0.52   0.50/0.21  0.75/0.39  0.77/0.46  1.53/1.05  1.11/0.75  1.13/0.56  0.82/0.35  0.86/0.44  0.93/0.41  1.01/0.55
Stage 3     0.65/0.36   0.38/0.18  0.59/0.34  0.70/0.46  1.13/0.76  0.87/0.63  0.71/0.28  0.57/0.29  0.54/0.24  0.46/0.17  0.57/0.27

Table 5.4 Comparison of range estimation accuracy of monocular, Stage-2, and Stage-3 estimates. We use mean/median of absolute range error (meters) as metrics. [Key: Ped. = Pedestrian, Motor. = Motorcycle, CV = Construction Vehicle, TC = Traffic Cone.]

5.4.3 Qualitative Results

In Fig. 5.7, we show examples of RICCARDO being applied to monocular detections. The radar BEV map in (b) and the predicted radar distribution in (c) are both centered at the monocular object center, with the X-axis being the radial direction and the Y-axis the tangential direction. We can observe the complexity of the predicted radar distributions in (c), which vary according to object size and orientation relative to radar rays. The edges of objects facing the radar tend to have higher densities, and large vehicles have a wider spread (see the 4th column), with reflections by parts under the vehicle near the tires. Fig. 5.7(d) shows the Stage-2 matching scores as a function of radial offset (X-axis), with the monocular estimated range (magenta) and GT range (gray). Multiple peaks in the 4th column indicate ambiguities in matching radar hits. (e) shows the predicted Stage-3 scores in blue, which typically have a sharper peak than Stage 2, illustrating the Stage-3 candidate refinement. In the fourth column, Stage 3 resolves the ambiguity by enhancing one peak. Row (f) shows that RICCARDO improves the monocular method in radial position prediction, as the predicted bounding boxes are closer to GT compared to the monocular detections. The radial directions are plotted as dotted lines for reference.

5.4.4 Ablation Study

5.4.4.1 Ablation on Stage-1 Radar Distribution Models

To assess the benefit of using our trained RIC model to represent radar hits on objects, we compare it with two baseline distributions, i.e., a uniform radar distribution within object boundaries and an "L-shaped" distribution on the reflecting sides of bounding boxes.

Distribution  MAE (↓)  MMS (↑)
L-Shaped      0.77     0.059
Uniform       0.67     0.078
RIC           0.47     0.105

Table 5.5 Comparison of RICCARDO using two baseline radar hit distributions for Stage 1 versus RIC. The metrics are MAE of range in meters and mean matching score (MMS) between the distributions and actual measurements.
The "L-shaped" distribution is a simulation of a lidar point distribution. Each model is passed through Stage 2; we estimate the object range at the maximum matching score and compute the range errors. We also use the matching score (i.e., the dot product of the predicted distribution and the pixel-wise radar point counts from the accumulated measurement) as an additional metric for evaluating distribution accuracy. We compute the mean matching score (MMS) over all classes. We train RIC with radar measurements and GT bounding boxes in the nuScenes training set and evaluate on the nuScenes validation set. Note that, in this experiment, GT bounding boxes are used to generate the radar distribution, and no monocular boxes are involved. Table 5.5 shows that, with the smallest range estimation error and the largest matching score, the RIC distribution captures the real radar distribution more accurately than the two baseline distributions.

5.4.4.2 Ablation on Range and Score Updating

In Stage-3 inference we update both the range and the detection score. Detection scores indicate confidence in prediction and have an impact on mAP computation, where predictions with higher scores have priority as true positives to be associated with GT. We update detection scores by adding Stage-3 scores weighted by α to the monocular scores. We test different range and score updating options with different α and list the resultant detection performance in Tab. 5.6. We can see that both range and score updating improve detection performance, while range updating has a significantly bigger impact on performance.

Update Range  Update Score  α    NDS (↑)  mAP (↑)
                            -    0.590    0.501
              ✓             0.5  0.593    0.503
✓                           -    0.617    0.543
✓             ✓             0.2  0.620    0.545
✓             ✓             0.5  0.621    0.545
✓             ✓             0.8  0.621    0.543
✓             ✓             1.0  0.620    0.541

Table 5.6 Ablation on updating range and detection score with fusion weight α. We use SparseBEV [67] with backbone ResNet101 for the monocular components in RICCARDO. For efficiency, the data used for evaluation are a subset of the nuScenes validation set with 600 random samples.

5.4.4.3 Ablation on Number of Radar Sweeps

Within a 0.5 s time window, there are about 7 sweeps of radar points (i.e., 1 current plus 6 past ones) from radars running at 13 Hz in the nuScenes dataset [10]. We accumulate multiple radar sweeps during inference, and the number of radar sweeps may impact detection performance, as more sweeps provide denser radar measurements for Stage 2. To verify this, we run RICCARDO multiple times with radar input from 0, 1, 3, 5, and 7 sweeps, respectively, and record the detection performance. Note that using 0 radar sweeps refers to applying only the monocular detector without fusion. As shown in Tab. 5.7, more radar sweeps lead to better detection performance, as expected.

Num. of Sweeps  NDS (↑)  mAP (↑)
0               0.590    0.501
1               0.597    0.512
3               0.612    0.531
5               0.618    0.541
7               0.621    0.545

Table 5.7 Ablation on Number of Radar Sweeps. More radar sweeps result in better detection performance. [Key: Num. = Number.]

5.4.5 Visualization of Radar Distributions

To visualize how the predicted distribution varies with viewing angle, we simulate object parameters with different orientations and apply the Stage-1 model to generate corresponding radar hit distributions. Figs. 5.8 and 5.9 show predicted distributions for car, bus, bicycle, and barrier with different orientations and distances. We can see the distributions vary with category, orientation, and distance.
For example, radar distributions are less concentrated spatially at longer range be- cause of larger beam width. We can also notice that distributions of radar points reflected by the tail and head of cars (as shown in the 1st and last row of Fig. 5.8) are different because of their 81 different surface shapes. 5.5 Summary This chapter presents a novel radar-camera fusion strategy that utilizes BEV radar distributions to improve object range estimation over monocular methods. Evaluation on the nuScenes dataset shows that RICCARDO realizes stable and significant improvements in object position estimation over its underlying monocular detector, and achieves the SOTA performance in radar-camera-based 3D object detection. We believe this effective method, that is simple to implement, will broadly benefit existing and future camera-radar fusion methods. Limitations. First, as a detection level fusion, RICCARDO only uses high-level monocular detection parameters and does not directly utilize low-level image features. Information loss is inevitable in this low-level to high-level feature transition (e.g., false negative detections), and it is difficult to make use of radar points for further improvement if an object is missed by the monocular component in the first place. Second, RICCARDO adopts BEV representation for radar points distribution. However, BEV representation has an intrinsic limitation to represent radar hits from two vertically positioned targets at the same BEV location (e.g., a person riding on a bicycle), since their reflected radar hits are mapped to the same BEV pixel. 82 (a) (b) (c) (d) (e) (f) Figure 5.7 Qualitative Results. Visualizations of (a) objects in images, (b) binned radar points, (c) predicted radar hits distribution, (d) Stage-2 matching scores, (e) predicted Stage-3 scores, and (f) detections in ego BEV coordinates. Monocular, RICCARDO, and GT detections are plotted in magenta, blue, and gray, respectively. 83 0◦ 45◦ 90◦ 135◦ 180◦ (a) Car (Short Range) (b) Car (Long Range) (c) Bus (Short Range) (d) Bus (Long Range) Figure 5.8 Visualization of predicted radar distributions of (a)(b) Car and (c)(d) Bus viewed from different angles and distances of 10 and 40 meters. X-axis represents radial positions, and Y-axis denotes tangential offsets to object centers. Radial rays are plotted as horizontal dotted lines. Target bounding boxes are shown on top of distributions, and dashed lines represent object head. 84 0◦ 45◦ 90◦ 135◦ 180◦ (a) Bicycle (Short Range)(b) Bicycle (Long Range)(c) Barrier (Short Range)(d) Barrier (Long Range) Figure 5.9 Visualization of predicted radar distributions of (a)(b) bicycle and (c)(d) barrier viewed from five different angles and from distances of 10 and 40 meters. 85 CHAPTER 6 CONCLUSIONS AND FUTURE WORK 6.1 Conclusions This dissertation addresses challenges in radar-camera fusion for depth completion, velocity estimation, and 3D object detection. Hampered by the sparsity of radar returns and the ambiguous association between radar and image pixels, radar-based depth completion is more challenging than lidar-based depth completion. To solve this problem, in Chapter 2, we propose RC-PDA, a novel radar-camera association based on equal depth. We are able to generate enhanced radar depth called MER by densifying raw radar depth with RC-PDA. We demonstrate experimentally that MER leads to more accurate depth completion than raw radar depth. 
In addition, to train depth completion on nuScenes, we create GT depth by accumulating multiple frames of lidar depth over time. The inability of Doppler radar to directly measure tangential velocity limits its utility in object velocity estimation and radar point accumulation. In chapter 3, we achieve full velocity estimation of radar returns by deriving a closed-form solution combining Doppler velocity from radar with corresponding optical flows from camera. A model supervised with GT bounding box velocities is used to associate radar points to their corresponding optical flows. Experiments show the validity of our method and its application to object velocity estimation and accumulation of moving radar points. Difficulty in fusing radar with monocular detectors includes sparsity of radar points and positional inconsistency between radar hits and corresponding object centers. In Chapter 4, we present RADIANT, a novel scheme to fuse radar with monocular detectors. RADIANT adopts a radar branch, in parallel to an image branch and using mid-level features from both camera and radar as context, to predict 3D offsets from radar hits to object centers. Radar hits updated by the predicted offsets are fused with monocular detected centers to improve object depth estimation. In experiments we apply RADIANT to two monocular detectors FCOS3D and PGD, and improve their detection performance. In chapter 5, to address the problem of inferring object positions from complex radar hits and monocular detections, we propose RICCARDO, an interpretable radar-camera fusion strategy with 3 stages, from the perspective of modeling radar hit distributions 86 in BEV. In stage 1, we train a model to predict BEV radar distributions relative to monocular predicted bounding boxes. In stage 2, we convolve the distributions with actual radar points in the neighborhood to obtain matching scores at binned ranges. In stage 3, we train a model to fuse stage-2 and monocular range predictions. Experiments show RICCARDO improves detection performance over its underlying monocular detector and SOTA radar-camera fusion methods. 6.2 Future Work 6.2.1 Full Velocity Estimation for Objects with Monocular Detector and Radar In Chapter 3, without using a monocular detector, we estimate point-wise full velocity for radar hits from Doppler velocity and corresponding optical flows. With a monocular detector, we can extend the full velocity estimation from points to objects, by replacing optical flows with motions in detected object centers on neighboring images. In addition, using monocular detectors offers other options to compute full velocity, e.g., velocity from 3D motion of detected objects, Doppler velocity back-projected on object heading, and object velocity directly inferred by monocular models. In future work, it is worthwhile to compare those options and combine them for an optimized estimation. Typically we assume object velocity is constant within a short period of time around a detection, which is not precise for object turning, accelerating, or decelerating. Thus, in future work, we will consider more sophisticated velocity models, e.g., angular velocity plus linear velocity with acceleration. 6.2.2 Expanding Search Space for Radar Distribution Matching in RICCARDO RICCARDO in Chapter 5 assumes the underlying monocular detector has accurate estimation in tangential positions, sizes, and orientations and focus on improving range estimation, and also assumes the distribution remain fixed in the search space. 
To achieve more precise radar distribution matching, future work could expand the search space from one dimension (i.e., radial offset) to a higher-dimensional space (e.g., radial and tangential offsets, size, and orientation) and adopt a distribution that varies as a function of location in the search space during matching.

6.2.3 3D Object Detection with 4D Radar

In this dissertation, we use 2D radar, which suffers from sparsity and a lack of height measurements, so radar-camera fusion is necessary for detection. In future work, we plan to design a radar-only detection method based on the 4D radar format. Some recent autonomous driving datasets adopt 4D radar [96], which provides 4D radar tensors, i.e., range-azimuth-elevation-Doppler, as the data format. 4D radar makes radar-only detection possible while remaining robust to adverse weather. In addition, compared with 2D radar point clouds, 4D radar tensors include one additional dimension, elevation, and are denser, retaining more complete raw information; a toy sketch of this tensor format is given below.
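As a rough illustration of this data format, the toy snippet below builds a range-azimuth-elevation-Doppler tensor and collapses it into a BEV power map and a per-cell dominant Doppler bin. The bin counts are hypothetical placeholders rather than the dimensions of any particular dataset such as K-Radar [96].

```python
import numpy as np

# Toy 4D radar tensor with hypothetical bin counts:
# (range, azimuth, elevation, Doppler) power measurements.
rng = np.random.default_rng(0)
tensor = rng.random((256, 128, 32, 64), dtype=np.float32)

# Collapse Doppler, then elevation, to obtain a BEV power map; simple
# reductions like these are one possible starting point before feeding
# the data to a radar-only detection network.
power_3d = tensor.max(axis=3)      # (range, azimuth, elevation)
bev_power = power_3d.max(axis=2)   # (range, azimuth)

# Dominant Doppler bin per BEV cell, a coarse proxy for radial velocity.
doppler_bin = tensor.max(axis=2).argmax(axis=2)   # (range, azimuth)
```

Reducing the tensor to BEV maps is only one option; a radar-only detector could also consume the full 4D tensor directly.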
BIBLIOGRAPHY

[1] Evan Ackerman. Robot trucks overtake robot cars: This year, trucks will drive themselves on public roads with no one on board. IEEE Spectrum, 58(1):42–43, 2020.
[2] Geonho Bang, Kwangjin Choi, Jisong Kim, Dongsuk Kum, and Jun Won Choi. RadarDistill: Boosting radar-based object detection performance via knowledge distillation from lidar features. In CVPR, 2024.
[3] Dan Barnes, Matthew Gadd, Paul Murcutt, Paul Newman, and Ingmar Posner. The Oxford Radar RobotCar Dataset: A radar extension to the Oxford RobotCar Dataset. In ICRA, 2020.
[4] Jean-François Bonnefon, Azim Shariff, and Iyad Rahwan. The social dilemma of autonomous vehicles. Science, 352(6293):1573–1576, 2016.
[5] Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3D: A large benchmark and model for 3D object detection in the wild. In CVPR, 2023.
[6] Garrick Brazil and Xiaoming Liu. M3D-RPN: Monocular 3D region proposal network for object detection. In ICCV, 2019.
[7] Garrick Brazil and Xiaoming Liu. Pedestrian detection with autoregressive network phases. In CVPR, 2019.
[8] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3D object detection in monocular video. In ECCV, 2020.
[9] Daniel Brodeski, Igal Bilik, and Raja Giryes. Deep radar detector. In IEEE Radar Conference, 2019.
[10] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
[11] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teuliere, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In CVPR, 2017.
[12] Simon Chadwick, Will Maddern, and Paul Newman. Distant vehicle detection using radar and vision. In ICRA, 2019.
[13] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In CVPR, 2019.
[14] Shuo Chang, Yifan Zhang, Fan Zhang, Xiaotong Zhao, Sai Huang, Zhiyong Feng, and Zhiqing Wei. Spatial attention fusion for obstacle detection using mmWave radar and vision sensor. Sensors, 20(4):956, 2020.
[15] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.
[16] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020.
[17] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[18] Andreas Danzer, Thomas Griebel, Martin Bach, and Klaus Dietmayer. 2D car detection in radar data with PointNets. In IEEE Intelligent Transportation Systems Conference, 2019.
[19] Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In CVPR, 2019.
[20] Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, and Ping Luo. Learning depth-guided convolutions for monocular 3D object detection. In CVPR Workshops, 2020.
[21] Xu Dong, Pengluo Wang, Pengyue Zhang, and Langechuan Liu. Probabilistic oriented object detection in automotive radar. In CVPR Workshops, 2020.
[22] Alberto Elfes. Using occupancy grids for mobile robot perception and navigation. Computer, 22(6):46–57, 1989.
[23] Di Feng, Christian Haase-Schütz, Lars Rosenbaum, Heinz Hertlein, Claudius Glaeser, Fabian Timm, Werner Wiesbeck, and Klaus Dietmayer. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems, 22(3):1341–1360, 2020.
[24] Paul Fritsche, Björn Zeise, Patrick Hemme, and Bernardo Wagner. Fusion of radar, lidar and thermal information for hazard detection in low visibility environments. In IEEE International Symposium on Safety, Security and Rescue Robotics, 2017.
[25] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018.
[26] Fernando Garcia, Pietro Cerri, Alberto Broggi, Arturo de la Escalera, and José María Armingol. Data fusion for overtaking vehicle detection based on radar and optical flow. In IEEE Intelligent Vehicles Symposium, 2012.
[27] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[28] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[29] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. In CVPR, 2020.
[30] Jurgen Hasch. Driving towards 2020: Automotive radar technology trends. In IEEE MTT-S International Conference on Microwaves for Intelligent Mobility, 2015.
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[32] Yaoyu Hu, Weikun Zhen, and Sebastian Scherer. Deep-learning assisted high-resolution binocular stereo depth reconstruction. In ICRA, 2020.
[33] Saif Imran, Xiaoming Liu, and Daniel Morris. Depth completion with twin surface extrapolation at occlusion boundaries. In CVPR, 2021.
[34] Saif Imran, Yunfei Long, Xiaoming Liu, and Daniel Morris. Depth coefficients for depth completion. In CVPR, 2019.
[35] Joel Janai, Fatma Guney, Anurag Ranjan, Michael Black, and Andreas Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In ECCV, 2018.
[36] Florian Janda, Sebastian Pangerl, Eva Lang, and Erich Fuchs. Road boundary detection for run-off road prevention based on the fusion of video and radar. In IEEE Intelligent Vehicles Symposium, 2013.
[37] Maximilian Jaritz, Raoul De Charette, Emilie Wirbel, Xavier Perrotton, and Fawzi Nashashibi. Sparse and dense data with CNNs: Depth completion and semantic segmentation. In 3DV, 2018.
[38] Zhengping Ji and Danil Prokhorov. Radar-vision fusion for object classification. In International Conference on Information Fusion, 2008.
[39] Michael Kampffmeyer, Nanqing Dong, Xiaodan Liang, Yujia Zhang, and Eric P Xing. ConnNet: A long-range relation-aware pixel-connectivity network for salient segmentation. IEEE Transactions on Image Processing, 28(5):2518–2529, 2018.
[40] Prannay Kaul, Daniele De Martini, Matthew Gadd, and Paul Newman. RSS-Net: Weakly-supervised multi-class semantic segmentation with FMCW radar. In IEEE Intelligent Vehicles Symposium, 2020.
[41] Dominik Kellner, Michael Barjenbruch, Klaus Dietmayer, Jens Klappstein, and Jürgen Dickmann. Instantaneous lateral velocity estimation of a vehicle using Doppler radar. In International Conference on Information Fusion, 2013.
[42] Dominik Kellner, Michael Barjenbruch, Jens Klappstein, Jürgen Dickmann, and Klaus Dietmayer. Instantaneous full-motion estimation of arbitrary objects using dual Doppler radar. In IEEE Intelligent Vehicles Symposium, 2014.
[43] Youngseok Kim, Jun Won Choi, and Dongsuk Kum. GRIF Net: Gated region of interest fusion network for robust 3D object detection from radar point cloud and monocular image. In IROS, 2020.
[44] Youngseok Kim, Sanmin Kim, Jun Won Choi, and Dongsuk Kum. CRAFT: Camera-radar 3D object detection with spatio-contextual fusion transformer. In AAAI, 2023.
[45] Youngseok Kim, Juyeb Shin, Sanmin Kim, In-Jae Lee, Jun Won Choi, and Dongsuk Kum. CRN: Camera radar net for accurate, robust, efficient 3D perception. In ICCV, 2023.
[46] Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. DEVIANT: Depth equivariant network for monocular 3D object detection. In ECCV, 2022.
[47] Abhinav Kumar, Garrick Brazil, and Xiaoming Liu. GrooMeD-NMS: Grouped mathematically differentiable NMS for monocular 3D object detection. In CVPR, 2021.
[48] Abhinav Kumar, Yuliang Guo, Xinyu Huang, Liu Ren, and Xiaoming Liu. SeaBird: Segmentation in bird’s view with dice loss improves monocular 3D detection of large objects. In CVPR, 2024.
[49] C-C Jay Kuo. Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation, 41:406–413, 2016.
[50] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[51] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.
[52] Seongwook Lee, Byeong-Ho Lee, Jae-Eun Lee, and Seong-Cheol Kim. Statistical characteristic-based road structure recognition in automotive FMCW radar systems. IEEE Transactions on Intelligent Transportation Systems, 20(7):2418–2429, 2018.
[53] Wei Lee, Ljubomir Jovanov, and Wilfried Philips. Semantic-guided radar-vision fusion for depth estimation and object detection. In ICCV, 2021.
[54] Youngwan Lee and Jongyoul Park. CenterMask: Real-time anchor-free instance segmentation. In CVPR, 2020.
[55] Kai Lei, Zhan Chen, Shuman Jia, and Xiaoteng Zhang. HVDetFusion: A simple and robust camera-radar fusion framework. arXiv preprint arXiv:2307.11323, 2023.
[56] Ang Li, Zejian Yuan, Yonggen Ling, Wanchao Chi, Chong Zhang, et al. A multi-scale guided cascade hourglass network for depth completion. In WACV, 2020.
[57] Liang Li and Yuan Xie. A feature pyramid fusion detection algorithm based on radar and camera sensor. In ICSP, 2020.
[58] Ying Li, Lingfei Ma, Zilong Zhong, Fei Liu, Michael A Chapman, Dongpu Cao, and Jonathan Li. Deep learning for lidar point clouds in autonomous driving: A review. IEEE Transactions on Neural Networks and Learning Systems, 32(8):3412–3432, 2020.
[59] You Li and Javier Ibanez-Guzman. Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems. IEEE Signal Processing Magazine, 37(4):50–61, 2020.
[60] Zewen Li, Fan Liu, Wenjie Yang, Shouheng Peng, and Jun Zhou. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems, 33(12):6999–7019, 2021.
[61] Jaime Lien, Nicholas Gillian, Emre Karagozler, Patrick Amihood, Carsten Schwesig, Erik Olson, Hakim Raja, and Ivan Poupyrev. Soli: Ubiquitous gesture sensing with millimeter wave radar. ACM Transactions on Graphics, 35(4):1–19, 2016.
[62] Teck-Yian Lim, Amin Ansari, Bence Major, Daniel Fontijne, Michael Hamilton, Radhika Gowaikar, and Sundar Subramanian. Radar and camera early fusion for vehicle detection in advanced driver assistance systems. In NeurIPS Workshops, 2019.
[63] Juan-Ting Lin, Dengxin Dai, and Luc Van Gool. Depth estimation from monocular images and sparse radar data. In IROS, 2020.
[64] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[65] Zhiwei Lin, Zhe Liu, Zhongyu Xia, Xinhao Wang, Yongtao Wang, Shengxiang Qi, Yang Dong, Nan Dong, Le Zhang, and Ce Zhu. RCBEVDet: Radar-camera fusion in bird’s eye view for 3D object detection. In CVPR, 2024.
[66] Feng Liu and Xiaoming Liu. Voxel-based 3D detection and reconstruction of multiple objects from a single image. In NeurIPS, 2021.
[67] Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. SparseBEV: High-performance sparse 3D object detection from multi-camera videos. In ICCV, 2023.
[68] Lijie Liu, Jiwen Lu, Chunjing Xu, Qi Tian, and Jie Zhou. Deep fitting degree scoring network for monocular 3D object detection. In CVPR, 2019.
[69] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[70] Zongdai Liu, Dingfu Zhou, Feixiang Lu, Jin Fang, and Liangjun Zhang. AutoShape: Real-time shape-aware monocular 3D object detection. In ICCV, 2021.
[71] Jakob Lombacher, Markus Hahn, Jürgen Dickmann, and Christian Wöhler. Potential of radar for static object classification using deep learning methods. In IEEE MTT-S International Conference on Microwaves for Intelligent Mobility, 2016.
[72] Yunfei Long, Abhinav Kumar, Xiaoming Liu, and Daniel Morris. RICCARDO: Radar hit prediction and convolution for camera-radar 3D object detection. In CVPR, 2025.
[73] Yunfei Long, Abhinav Kumar, Daniel Morris, Xiaoming Liu, Marcos Castro, and Punarjay Chakravarty. RADIANT: Radar-image association network for 3D object detection. In AAAI, 2023.
[74] Yunfei Long and Daniel Morris. Lidar essential beam model for accurate width estimation of thin poles. In IROS, 2020.
[75] Yunfei Long, Daniel Morris, Xiaoming Liu, Marcos Castro, Punarjay Chakravarty, and Praveen Narayanan. Full-velocity radar returns by radar-camera fusion. In ICCV, 2021.
[76] Yunfei Long, Daniel Morris, Xiaoming Liu, Marcos Castro, Punarjay Chakravarty, and Praveen Narayanan. Radar-camera pixel depth association for depth completion. In CVPR, 2021.
[77] Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry uncertainty projection network for monocular 3D object detection. In ICCV, 2021.
[78] Fangchang Ma, Guilherme Venturelli Cavalheiro, and Sertac Karaman. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. In ICRA, 2019.
[79] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In ICRA, 2018.
[80] Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, and Wanli Ouyang. Rethinking pseudo-lidar representation. In ECCV, 2020.
[81] Xinzhu Ma, Zhihui Wang, Haojie Li, Pengbo Zhang, Wanli Ouyang, and Xin Fan. Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In ICCV, 2019.
[82] Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, and Wanli Ouyang. Delving into localization errors for monocular 3D object detection. In CVPR, 2021.
[83] Bence Major, Daniel Fontijne, Amin Ansari, Ravi Teja Sukhavasi, Radhika Gowaikar, Michael Hamilton, Sean Lee, Slawomir Grzechnik, and Sundar Subramanian. Vehicle detection with automotive radar using deep learning on range-azimuth-doppler tensors. In ICCV Workshops, 2019.
[84] Enrique Marti, Miguel Angel de Miguel, Fernando Garcia, and Joshue Perez. A review of sensor technologies for perception in automated driving. IEEE Intelligent Transportation Systems Magazine, 11(4):94–108, 2019.
[85] David Metz. The myth of travel time saving. Transport Reviews, 28(3):321–336, 2008.
[86] Michael Meyer and Georg Kuschk. Automotive radar dataset for deep learning based 3D object detection. In European Radar Conference, 2019.
[87] Michael Meyer and Georg Kuschk. Deep learning based 3D object detection for automotive radar and camera. In European Radar Conference, 2019.
[88] Daniel Morris. A pyramid CNN for dense-leaves segmentation. In Conference on Computer and Robot Vision, 2018.
[89] Ramin Nabati, Landon Harris, and Hairong Qi. CFTrack: Center-based radar and camera fusion for 3D multi-object tracking. In Intelligent Vehicles Symposium Workshops, 2021.
[90] Ramin Nabati and Hairong Qi. RRPN: Radar region proposal network for object detection in autonomous vehicles. In ICIP, 2019.
[91] Ramin Nabati and Hairong Qi. CenterFusion: Center-based radar and camera fusion for 3D object detection. In WACV, 2021.
[92] Ehsan Nezhadarya, Yang Liu, and Bingbing Liu. BoxNet: A deep learning method for 2D bounding box estimation from bird’s-eye view point cloud. In IEEE Intelligent Vehicles Symposium, 2019.
[93] Felix Nobis, Maximilian Geisslinger, Markus Weber, Johannes Betz, and Markus Lienkamp. A deep learning-based radar and camera sensor fusion architecture for object detection. In Sensor Data Fusion: Trends, Solutions, Applications, 2019.
[94] Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 32(2):604–624, 2020.
[95] Arthur Ouaknine, Alasdair Newson, Julien Rebut, Florence Tupin, and Patrick Perez. CARRADA dataset: Camera and automotive radar with range-angle-doppler annotations. In ICPR, 2021.
[96] Dong-Hee Paek, Seung-Hyun Kong, and Kevin Tirta Wijaya. K-Radar: 4D radar object detection dataset and benchmark for autonomous driving in various weather conditions. In NeurIPS, 2022.
[97] Andras Palffy, Jiaao Dong, Julian FP Kooij, and Dariu M Gavrila. CNN based road user detection using the 3D radar cube. IEEE Robotics and Automation Letters, 5(2):1263–1270, 2020.
[98] Aarav Pandya, Ajit Jha, and Linga Reddy Cenkeramaddi. A velocity estimation technique for a monocular camera using mmWave FMCW radars. Electronics, 10(19):2397, 2021.
[99] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3D object detection? In ICCV, 2021.
[100] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[101] Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-Yiin Chang, and Tara Sainath. Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2):206–219, 2019.
[102] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[103] Jiaxiong Qiu, Zhaopeng Cui, Yinda Zhang, Xingdi Zhang, Shuaicheng Liu, Bing Zeng, and Marc Pollefeys. DeepLiDAR: Deep surface normal guided depth prediction for outdoor scene from sparse LiDAR data and single color image. In CVPR, 2019.
[104] Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Geometry meets semantics for semi-supervised monocular depth estimation. In ACCV, 2018.
[105] Cody Reading, Ali Harakeh, Julia Chae, and Steven Waslander. Categorical depth distribution network for monocular 3D object detection. In CVPR, 2021.
[106] Konstantinos Rematas, Ira Kemelmacher-Shlizerman, Brian Curless, and Steve Seitz. Soccer on your tabletop. In CVPR, 2018.
[107] Haoyu Ren, Aman Raj, Mostafa El-Khamy, and Jungwon Lee. SUW-Learn: Joint supervised, unsupervised, weakly supervised deep learning for monocular depth estimation. In CVPR Workshops, 2020.
[108] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[109] Fabian Roos, Dominik Kellner, Jürgen Dickmann, and Christian Waldschmidt. Reliable orientation estimation of vehicles in high-resolution radar images. IEEE Transactions on Microwave Theory and Techniques, 64(9):2986–2993, 2016.
[110] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[111] Ashutosh Saxena, Justin Driemeyer, and Andrew Ng. Robotic grasping of novel objects using vision. The International Journal of Robotics Research, 27(2):157–173, 2008.
[112] Nicolas Scheiner, Nils Appenrodt, Jürgen Dickmann, and Bernhard Sick. A multi-stage clustering framework for automotive radar data. In IEEE Intelligent Transportation Systems Conference, 2019.
[113] Nicolas Scheiner, Florian Kraus, Fangyin Wei, Buu Phan, Fahim Mannan, Nils Appenrodt, Werner Ritter, Jurgen Dickmann, Klaus Dietmayer, Bernhard Sick, et al. Seeing around street corners: Non-line-of-sight detection and tracking in-the-wild using Doppler radar. In CVPR, 2020.
[114] Johannes Schlichenmaier, Fabian Roos, Philipp Hügler, and Christian Waldschmidt. Clustering of closely adjacent extended objects in radar images using velocity profile analysis. In IEEE MTT-S International Conference on Microwaves for Intelligent Mobility, 2019.
[115] Mehmet Saygın Seyfioğlu, Ahmet Murat Özbayoğlu, and Sevgi Zubeyde Gürbüz. Deep convolutional autoencoder for radar-based classification of similar aided and unaided human activities. IEEE Transactions on Aerospace and Electronic Systems, 54(4):1709–1723, 2018.
[116] Meet Shah, Zhiling Huang, Ankit Laddha, Matthew Langford, Blake Barber, Sidney Zhang, Carlos Vallespi-Gonzalez, and Raquel Urtasun. LiRaNet: End-to-end trajectory prediction using spatio-temporal radar fusion. In CoRL, 2020.
[117] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In CVPR, 2019.
[118] Xian Shuai, Yulin Shen, Yi Tang, Shuyao Shi, Luping Ji, and Guoliang Xing. milliEye: A lightweight mmWave radar and camera fusion system for robust object detection. In International Conference on Internet-of-Things Design and Implementation, 2021.
[119] Heonkyo Sim, The-Duong Do, Seongwook Lee, Yong-Hwa Kim, and Seong-Cheol Kim. Road environment recognition for automotive FMCW radar systems through convolutional neural network. IEEE Access, 8:141648–141656, 2020.
[120] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Manuel Antequera, and Peter Kontschieder. Disentangling monocular 3D object detection: From single to multi-class recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1219–1231, 2020.
[121] Andrea Simonelli, Samuel Bulò, Lorenzo Porzi, Peter Kontschieder, and Elisa Ricci. Are we missing confidence in pseudo-lidar methods for monocular 3D object detection? In ICCV, 2021.
[122] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[123] Akash Deep Singh, Yunhao Ba, Ankur Sarker, Howard Zhang, Achuta Kadambi, Stefano Soatto, Mani Srivastava, and Alex Wong. Depth estimation from camera image and mmWave radar point cloud. In CVPR, 2023.
[124] Arvind Srivastav and Soumyajit Mandal. Radars for autonomous driving: A review of deep learning methods and challenges. IEEE Access, 11:97147–97168, 2023.
[125] Leo Stanislas and Thierry Peynot. Characterisation of the Delphi electronically scanning radar for robotics applications. In Australasian Conference on Robotics and Automation, 2015.
[126] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
[127] Xiaolin Tang, Zhiqiang Zhang, and Yechen Qin. On-road object detection and tracking based on radar and vision fusion: A review. IEEE Intelligent Transportation Systems Magazine, 14(5):103–128, 2021.
[128] Yunlei Tang, Sebastian Dorn, and Chiragkumar Savani. Center3D: Center-based monocular 3D object detection with joint depth understanding. In DAGM German Conference on Pattern Recognition, 2020.
[129] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.
[130] Julian Theis, Daniel Ossmann, Frank Thielecke, and Harald Pfifer. Robust autopilot design for landing a large civil aircraft in crosswind. Control Engineering Practice, 76:54–64, 2018.
[131] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In 3DV, 2017.
[132] Jorge Vargas, Suleiman Alsweiss, Onur Toker, Rahul Razdan, and Joshua Santos. An overview of autonomous vehicles sensors and their vulnerability to weather conditions. Sensors, 21(16):5397, 2021.
[133] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
[134] Li Wang, Li Zhang, Yi Zhu, Zhi Zhang, Tong He, Mu Li, and Xiangyang Xue. Progressive coordinate transforms for monocular 3D object detection. In NeurIPS, 2021.
[135] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding convolution for semantic segmentation. In WACV, 2018.
[136] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. FCOS3D: Fully convolutional one-stage monocular 3D object detection. In ICCV Workshops, 2021.
[137] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In CoRL, 2021.
[138] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In CVPR, 2019.
[139] Yingjie Wang, Qiuyu Mao, Hanqi Zhu, Yu Zhang, Jianmin Ji, and Yanyong Zhang. Multi-modal 3D object detection in autonomous driving: A survey. arXiv preprint arXiv:2106.12735, 2021.
[140] Yizhou Wang, Zhongyu Jiang, Yudong Li, Jenq-Neng Hwang, Guanbin Xing, and Hui Liu. RODNet: A real-time radar object detection network cross-supervised by camera-radar fused object 3D localization. IEEE Journal of Selected Topics in Signal Processing, 15(4):954–967, 2021.
[141] Yue Wang, Vitor Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In CoRL, 2021.
[142] Jamie Watson, Michael Firman, Gabriel J Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In ICCV, 2019.
[143] Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Anouar Laouichi, Martin Hofmann, and Gerhard Rigoll. Unleashing HyDRa: Hybrid fusion, depth consistency and radar for unified 3D perception. arXiv preprint arXiv:2403.07746, 2024.
[144] Yutian Wu, Yueyu Wang, Shuwei Zhang, and Harutoshi Ogai. Deep 3D object detection networks using lidar data: A review. IEEE Sensors Journal, 21(2):1152–1171, 2020.
[145] Zizhang Wu, Guilian Chen, Yuanzhu Gan, Lei Wang, and Jian Pu. MVFusion: Multi-view 3D object detection with semantic-aligned radar and camera fusion. In ICRA, 2023.
[146] Qiangeng Xu, Yin Zhou, Weiyue Wang, Charles Qi, and Dragomir Anguelov. SPG: Unsupervised domain adaptation for 3D object detection via semantic point generation. In ICCV, 2021.
[147] Yan Xu, Xinge Zhu, Jianping Shi, Guofeng Zhang, Hujun Bao, and Hongsheng Li. Depth completion from sparse lidar data with depth-normal constraints. In ICCV, 2019.
[148] Ritu Yadav, Axel Vierling, and Karsten Berns. Radar+RGB fusion for robust object detection in autonomous vehicle. In ICIP, 2020.
[149] Bin Yang, Runsheng Guo, Ming Liang, Sergio Casas, and Raquel Urtasun. RadarNet: Exploiting radar for robust perception of dynamic objects. In ECCV, 2020.
[150] Zhenheng Yang, Peng Wang, Wei Xu, Liang Zhao, and Ramakant Nevatia. Unsupervised learning of geometry with edge-aware depth-normal consistency. arXiv preprint arXiv:1711.03665, 2017.
[151] Shanliang Yao, Runwei Guan, Xiaoyu Huang, Zhuoxiao Li, Xiangyu Sha, Yong Yue, Eng Gee Lim, Hyungjoon Seo, Ka Lok Man, Xiaohui Zhu, et al. Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Transactions on Intelligent Vehicles, 9(1):2094–2128, 2023.
[152] De Jong Yeong, Gustavo Velasco-Hernandez, John Barry, and Joseph Walsh. Sensor and sensor fusion technology in autonomous vehicles: A review. Sensors, 21(6):2140, 2021.
[153] Richard W Young. Evolution of the human hand: the role of throwing and clubbing. Journal of Anatomy, 202(1):165–174, 2003.
[154] Georgios Zamanakos, Lazaros Tsochatzidis, Angelos Amanatiadis, and Ioannis Pratikakis. A comprehensive survey of lidar-based 3D object detection methods with deep learning for autonomous driving. Computers & Graphics, 99:153–181, 2021.
[155] Diankun Zhang, Zhijie Zheng, Haoyu Niu, Xueqing Wang, and Xiaojun Liu. Fully sparse transformer 3-D detector for lidar point cloud. IEEE Transactions on Geoscience and Remote Sensing, 61:1–12, 2023.
[156] Hongyang Zhang, Junru Shao, and Ruslan Salakhutdinov. Deep neural networks with multi-branch architectures are intrinsically less non-convex. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019.
[157] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11):3212–3232, 2019.
[158] Taohua Zhou, Junjie Chen, Yining Shi, Kun Jiang, Mengmeng Yang, and Diange Yang. Bridging the view disparity between radar and camera features for multi-modal fusion 3D object detection. IEEE Transactions on Intelligent Vehicles, 8(2):1523–1535, 2023.
[159] Taohua Zhou, Mengmeng Yang, Kun Jiang, Henry Wong, and Diange Yang. MMW radar-based technologies in autonomous driving: A review. Sensors, 20(24):7283, 2020.
[160] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[161] Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, and Qinhong Jiang. MonoEF: Extrinsic parameter free monocular 3D object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):10114–10128, 2021.
[162] Li Zhu, Fei Richard Yu, Yige Wang, Bin Ning, and Tao Tang. Big data analytics in intelligent transportation systems: A survey. IEEE Transactions on Intelligent Transportation Systems, 20(1):383–398, 2018.
[163] Shengjie Zhu, Garrick Brazil, and Xiaoming Liu. The edge of depth: Explicit constraints between segmentation and depth. In CVPR, 2020.