OLISTER: OBSERVING LIDAR-INDUCED SOURCES FOR TRANSFERABILITY, ESTIMATION AND ROBUSTNESS IN 3D OBJECT DETECTION By Onur Can Yücedağ A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Master of Science 2025 ABSTRACT Unsupervised Domain Adaptation (UDA) for 3D object detection in autonomous driving faces chal- lenges due to various sources of domain shift, such as differences in LiDAR resolution, the use of synthetic versus real-world data, scenery variations, and sensor configurations (e.g., sensor place- ment and number). This thesis systematically investigates these domain shifts through controlled experiments using synthetic datasets generated via the CARLA simulator, enabling precise isolation and quantification of each factor. To facilitate these experiments, two software tools are introduced: carlaSceneCollector, designed for efficient synthetic data generation, and rosbag2nuScenes, which converts ROSBag data into the widely adopted nuScenes format. The study emphasizes two critical sources of domain shift: LiDAR resolution and the synthetic-to-real data shift. It identifies saturation effects at intermediate LiDAR resolutions (32–64 channels) and analyzes how varying resolution shifts impact detection performance, particularly noting the disproportionate effects on smaller objects. It evaluates various performance metrics, highlighting the robustness of the NuScenes Detection Score (NDS) compared to traditional metrics like mean Average Precision (mAP). Simultaneously, the synthetic-to-real domain shift is analyzed through systematic compar- isons across the nuScenes, adaScenes, and carlaScenes datasets. This reveals that synthetic-to-real differences significantly surpass the impact of LiDAR resolution shifts, underscoring profound discrepancies between simulated and real-world LiDAR point clouds. The thesis further addresses limitations in the default voxelization settings of the CenterPoint model by proposing adaptive vox- elization techniques and structural enhancements, enhancing model adaptability across resolutions. Finally, it examines real-world datasets like nuScenes, highlighting their complexity and diversity as key factors in achieving robust model performance and improved generalization. Copyright by ONUR CAN YÜCEDAĞ 2025 ACKNOWLEDGMENTS I would like to express my deepest thanks to my supervisor, Prof. Joshua Siegel, for being constantly supportive and constructive towards my work on this thesis. He has been a true inspiration and a great teacher to me during my time in the graduate program. Special thanks to my committee members, Prof. Philip McKinley and Dr. Yu Kong, for their enthusiasm towards my project and their invaluable expertise and time in evaluating this work. In addition, I would like to acknowledge Dr. Ali Ufuk Peker, Dr. Kerem Par and ADASTEC Corp. for encouraging me to pursue graduate studies and for their financial support throughout this endeavor. I sincerely thank my friends and colleagues, who have been by my side whenever I needed them and helped me through times of stress. Finally, I am incredibly grateful for my loving wife. She is the bright light who guided me throughout this journey, kept my priorities straight and always there whenever I needed her support. iv TABLE OF CONTENTS LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 2 BACKGROUND . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
CHAPTER 3 METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
CHAPTER 4 EVALUATION AND RESULTS . . . . . . . . . . . . . . . . . . . . . . 32
CHAPTER 5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

LIST OF ABBREVIATIONS

UDA  Unsupervised Domain Adaptation
LiDAR  Light Detection and Ranging
ROS  Robot Operating System
$R_{LR}$  LiDAR Resolution
$R_{SR}$  Synthetic vs. Real Data
$R_{SC}$  Scenery Disparities
$R_{LP}$  Differences in LiDAR Placement
$R_{LC}$  Variations in the Number of LiDAR Sensors
mAP  mean Average Precision
NDS  NuScenes Detection Score
AP  Average Precision
IoU  Intersection over Union

CHAPTER 1 INTRODUCTION

Autonomous driving depends on a vehicle's ability to accurately perceive its surroundings. This perception is achieved through a combination of sensors—LiDAR, radar, and cameras—that work together to build a detailed understanding of the environment. 3D object detection, a core component of this perception, involves identifying and localizing objects in three-dimensional space. This is crucial for tasks such as path planning, collision avoidance, and decision-making.

LiDAR sensors are indispensable for 3D object detection. They provide accurate and dense 3D measurements in the form of point clouds. Unlike cameras, which primarily capture 2D color and texture information, LiDAR sensors use laser beams to measure distances, generating a 3D representation of the environment. This geometric information is essential for accurately localizing objects, estimating their size and shape, and determining their distance from the vehicle. LiDAR's superior spatial resolution is critical for precise 3D object detection, especially in complex and dynamic environments.

However, each LiDAR sensor has unique characteristics based on its provider, firmware, and measurement mechanism. For instance, mechanical LiDAR products, a common type, exhibit significant variations in their ray patterns—the specific arrangement and angles at which laser beams are emitted by a LiDAR sensor. These ray patterns determine how comprehensively and densely the environment is scanned. Such variations directly impact the density and distribution of point clouds, which are composite data structures. These structures consist of two main types of data. First, there are positional fields—X, Y, and Z coordinates—that indicate each point's location in space, with errors bounded by the sensor's resolution. Second, there are signal noise-related fields that capture additional information about the reflected laser signal. These include intensity (the strength of the reflected signal), elongation (the stretching of the return pulse), ambient (background light levels), and reflectivity (how well a surface reflects the laser). The noise fields are highly provider-dependent, leading to distributional differences in these values. These noise fields are commonly uncalibrated, meaning that the same surface can activate different noise signals depending on environmental conditions such as ambient lighting or weather situations. This lack of standardization poses challenges for models trained on data from one LiDAR sensor, as they may not generalize well to data from another, even when capturing the same scene. Even when noise fields are calibrated within a provider, other providers do not follow the same mechanisms.
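To make the point cloud field layout concrete, the following sketch defines one possible structured array for such points. It is illustrative only: which noise-related channels exist, their data types, and their scaling all vary by vendor, and the field names here are assumptions rather than a standard.

```python
import numpy as np

# Positional fields plus a vendor-dependent set of signal/noise-related channels.
# Which of the latter are present, and how they are scaled, differs between providers.
point_dtype = np.dtype([
    ("x", np.float32), ("y", np.float32), ("z", np.float32),  # geometry (meters)
    ("intensity", np.float32),    # strength of the reflected signal (uncalibrated scale)
    ("ambient", np.uint16),       # background light level (sensor-specific units)
    ("reflectivity", np.uint8),   # surface reflectivity estimate (sensor-specific units)
])

points = np.zeros(4, dtype=point_dtype)   # a tiny placeholder cloud
points["x"] = [1.0, 2.0, 3.0, 4.0]
print(points.dtype.names)
```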
Consequently, noise channels exhibit different scales and characteristics across different providers. Finally, variations in how different providers handle outliers further contribute to performance differences, particularly in adverse weather conditions.

A significant challenge in 3D object detection is domain shift, which occurs when a model's performance degrades on data differing from the training set. This is particularly true for LiDAR data. The sensor-specific characteristics described above, along with variations in sensor configuration and environmental conditions, make LiDAR data especially susceptible to domain shift. For example, a model trained on LiDAR data from an urban setting might perform poorly when applied to data from a highway setting. In an urban environment, objects like pedestrians, cyclists, and vehicles are often close together and moving at relatively low speeds. On a highway, however, objects are spaced farther apart, travel at higher speeds, and include different types of road users, such as trucks and motorcycles. These differences alter the density and distribution of the point cloud, making it difficult for the model to accurately detect objects. Another example is a model trained in clear weather that fails in fog, where the scattering of laser beams changes the point cloud's structure, demonstrating the practical impact of domain shift.

This thesis systematically investigates the impact of domain shift on 3D object detection, with a particular focus on the effects of LiDAR resolution and the challenges of transferring models trained on synthetic data to real-world scenarios. To address this, we employ a controlled experimental framework using synthetic data generated with the CARLA simulator [1]. This framework leverages two software packages developed as part of this research: carlaSceneCollector, for efficient synthetic data generation, and rosbag2nuscenes, for conversion into the nuScenes dataset format. These tools allow us to isolate and quantify the impact of specific domain shift factors.

Our findings indicate that LiDAR resolution has a notable effect on detection performance, especially for smaller objects. We also observe saturation effects at intermediate resolutions (32–64 channels) and suggest strategies to mitigate these effects. Furthermore, we quantify the challenges associated with transferring models trained on synthetic data to real-world scenarios, noting the effectiveness of the NuScenes Detection Score (NDS) [2] in capturing this impact.

The remainder of this thesis is structured as follows: Chapter 2 contextualizes our work within the existing literature, providing a review of related work in 3D object detection and domain adaptation. Chapter 3 details the experimental setup and the methodology used for data generation and experimentation. Chapter 4 presents and analyzes the results of our experiments. Finally, Chapter 5 summarizes the thesis's contributions and discusses directions for future research.

CHAPTER 2 BACKGROUND

The advent of autonomous vehicles necessitates robust and reliable perception systems capable of accurately interpreting the surrounding environment. Among the various perception tasks, 3D object detection from point clouds stands out as a crucial component for ensuring safe navigation and preventing collisions.
This capability allows autonomous vehicles to classify and precisely locate objects within their three-dimensional surroundings, forming the bedrock for subsequent tasks like motion planning and decision-making. Consequently, the field of 3D object detection has witnessed a surge of research interest and significant advancements in recent years. However, a persistent challenge that hinders the widespread deployment of these systems is the issue of domain shift. Domain shift occurs when a model trained on a specific dataset or environment experiences a significant drop in performance when applied to a different dataset or environment. This discrepancy often arises due to variations in data characteristics between the training (source) domain and the operational (target) domain. Understanding and effectively quantifying this domain shift is paramount for developing adaptable and generalizable 3D object detection systems.

Furthermore, the scarcity and high cost associated with acquiring labeled data in diverse real-world scenarios underscore the importance of Unsupervised Domain Adaptation (UDA). UDA offers a promising avenue to bridge the performance gap by adapting models trained on abundant labeled data from a source domain to an unlabeled target domain. This background section provides a comprehensive overview of the current research landscape concerning domain shift quantification in 3D object detection for autonomous driving. It delves into the fundamental concepts, the various types of domain shift encountered, existing quantification methods, and the challenges of applying UDA to 3D point clouds. The section also highlights the pivotal role of datasets like nuScenes, recent advancements in UDA techniques, the impact of different sensor modalities, the distinction between the semantic gap and feature distribution shift, and the importance of the structured nature of the nuScenes dataset in facilitating this research.

2.1 Fundamentals of 3D Object Detection from Point Clouds

The task of 3D object detection from point clouds has seen the development of various methodologies, broadly categorized based on how the unstructured point cloud data is processed and represented. A general overview of point cloud processing approaches is shown in Table 2.1. Here, "unstructured" means that although a point cloud provides 3D Cartesian coordinates, spatial relationships between points, such as neighborhood or similarity, are not explicitly defined and require distance calculations or methods like KD-trees or octrees to be established.

2.1.1 Point-based Methods

Point-based methods operate directly on raw, unprocessed point cloud data. The pioneering works in this family, PointNet [3] and its successor PointNet++ [4], employ point-wise operations such as Multi-Layer Perceptrons (MLPs) together with symmetric functions such as sum and max pooling to generate a geometric set of features for each point. The symmetry ensures that the model is invariant to the order of points in the cloud. PointNet++ [4] builds upon this foundation by introducing a hierarchical network structure that enables the model to capture local spatial patterns at different scales. Operating at different scales is paramount for point clouds, since range drastically affects the density of point patches. This hierarchy significantly enhances PointNet++'s ability to comprehend complex scenes by aggregating features from neighboring points.
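To make the symmetric-function idea concrete, the following minimal PyTorch sketch applies a shared point-wise MLP followed by max pooling, so the resulting global feature is unchanged by any permutation of the input points. It is an illustrative simplification, not the published PointNet architecture (which additionally uses input/feature transform networks and task-specific heads).

```python
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    """Shared point-wise MLP + max pooling: a permutation-invariant feature extractor."""
    def __init__(self, in_dim: int = 3, feat_dim: int = 256):
        super().__init__()
        # The same MLP is applied independently to every point (shared weights).
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, in_dim)
        per_point = self.point_mlp(points)      # (batch, num_points, feat_dim)
        global_feat, _ = per_point.max(dim=1)   # symmetric aggregation over points
        return global_feat                      # (batch, feat_dim)

if __name__ == "__main__":
    cloud = torch.randn(2, 1024, 3)
    enc = TinyPointEncoder()
    f1 = enc(cloud)
    f2 = enc(cloud[:, torch.randperm(1024)])    # shuffle point order
    print(torch.allclose(f1, f2, atol=1e-5))    # True: order-invariant
```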
Subsequent research has further refined point-based methods to achieve state-of-the-art performance. PointRCNN [5] incorporated a region proposal network(RPN) to generate candidate 3D bounding boxes directly from the point cloud, which are then refined for final detection. 3DSSD [6] focused on improving efficiency by employing sophisticated sampling strategies to select representative points, reducing the computational burden while maintaining accuracy. A key advantage of point-based methods lies in their ability to handle the inherent unstructured nature of point cloud data without requiring any intermediate representation. However, a notable drawback is their potential computational intensity, as each point in the cloud often needs to be processed individually. 5 2.1.2 Voxel-based Methods In contrast to point-based approaches, voxel-based methods adopt a strategy of discretizing the continuous 3D space into a grid of regular voxels. VoxelNet[7] was among the first to demonstrate the effectiveness of this representation by applying 3D Convolutional Neural Networks (CNNs) to the voxelized point clouds. This allowed for leveraging the power of CNNs, which have proven highly successful in 2D image analysis, for the task of 3D object detection. SECOND[8] further advanced the efficiency of voxel-based methods through the introduction of sparse convolution. Sparse convolution techniques are designed to operate only on the occupied voxels, significantly reducing the computational overhead, especially in scenarios with sparse point clouds common in autonomous driving. While voxel-based methods benefit from the structured representation that is well-suited for CNNs, they may suffer from information loss due to the inherent discretization process. Voxelization strategies are also important for reliable feature extraction from point clouds with varying densities. These density variations can arise from differences in the range of specific regions or the use of different LiDAR sensors. This loss can be particularly pronounced when dealing with sparse point clouds, where fine-grained details might be smoothed out or lost during voxelization. 2.1.3 Hybrid Methods Hybrid methods seek to capitalize on the complementary strengths of both point-based and voxel-based approaches. PV-RCNN[9] exemplifies this strategy by employing a voxel-based net- work to efficiently generate high-quality 3D proposals, which are subsequently refined by a point- based network. This allows the model to exploit the computational efficiency of voxelization for initial proposal generation while retaining the fine-grained geometric information from the raw point cloud data during the refinement stage. Similarly, CenterPoint[10] utilizes a Bird’s Eye View (BEV) representation, which is obtained by projecting the 3D point cloud onto a 2D plane. This BEV representation offers a compact and efficient way to detect objects, proving particularly effective for tasks like vehicle detection in autonomous driving scenarios. By strategically combining different representations and processing techniques, hybrid methods 6 often achieve superior performance, balancing computational efficiency and representational power. Voxel-based methods are more sensitive to changes in point cloud density, as the voxelization process is affected by the number of points within each voxel. Conversely, point-based methods are more robust to variations in overall point density but more susceptible to noise or outliers. 
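Because several of these trade-offs hinge on how points are binned, the short NumPy sketch below illustrates a basic voxel assignment and why per-voxel point counts (and therefore features) change with point cloud density. It is a simplified illustration with assumed grid parameters, not the voxelization code of SECOND, VoxelNet, or CenterPoint.

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size=(0.2, 0.2, 0.2), max_points_per_voxel=32):
    """Group points (N, 3+) into a dict keyed by integer voxel coordinates."""
    coords = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    voxels = {}
    for idx, key in enumerate(map(tuple, coords)):
        bucket = voxels.setdefault(key, [])
        if len(bucket) < max_points_per_voxel:   # sparse clouds leave many buckets near-empty
            bucket.append(points[idx])
    return {k: np.stack(v) for k, v in voxels.items()}

# A denser scan fills the same grid cells with more points, changing per-voxel statistics
# (mean height, intensity, occupancy) even though the underlying scene is identical.
dense = np.random.rand(20000, 4) * [50.0, 50.0, 3.0, 1.0]
sparse = dense[::8]
print(len(voxelize(dense)), len(voxelize(sparse)))
```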
Therefore, selecting the appropriate method and domain adaptation strategy requires understanding the expected domain shifts and the inherent vulnerabilities of each detection approach.

Table 2.1 Comparison of Point Cloud Processing Approaches for 3D Detection

Data Representation. Point-based: raw, unstructured point cloud data. Voxel-based: discretized into a 3D grid of voxels. Hybrid: combination of voxelized and raw point cloud data.
Feature Extraction. Point-based: point-wise operations (e.g., MLPs, symmetric functions like max pooling). Voxel-based: 3D Convolutional Neural Networks (CNNs) on voxelized data. Hybrid: voxel-based networks for initial proposals, point-based networks for refinement.
Computational Efficiency. Point-based: can be computationally intensive due to processing each point individually. Voxel-based: more efficient with sparse convolution techniques. Hybrid: balances efficiency and detail by using voxelization for proposals and points for refinement.
Sensitivity to Density Variations. Point-based: more robust to overall density variations but sensitive to noise and outliers. Voxel-based: more sensitive to density changes due to the voxelization process. Hybrid: moderately sensitive, depending on the specific hybrid approach.
Handling of Unstructured Data. Point-based: directly handles unstructured data without intermediate representations. Voxel-based: requires conversion to a structured voxel grid. Hybrid: uses both structured and unstructured representations.
Performance on Sparse Data. Point-based: may struggle with very sparse data due to lack of local context. Voxel-based: can lose fine details in sparse regions due to discretization. Hybrid: better at retaining details in sparse regions through point-based refinement.
Key Advantages. Point-based: handles unstructured data directly; can capture fine details. Voxel-based: leverages powerful CNNs; efficient with sparse convolution. Hybrid: combines the efficiency of voxelization with the detail preservation of point-based methods.
Key Disadvantages. Point-based: computationally intensive; may overfit to specific point distributions. Voxel-based: information loss due to discretization; sensitive to density variations. Hybrid: more complex to implement; may still suffer from some limitations of both methods.

2.2 Unsupervised Domain Adaptation for 3D Object Detection

Unsupervised Domain Adaptation (UDA) is a critical area of research that aims to adapt machine learning models trained on a source domain, where abundant labeled data is available, to a target domain, where only unlabeled data exists [11]. This is particularly relevant for 3D object detection in autonomous driving because acquiring labeled 3D point cloud data in diverse real-world environments is often a laborious, time-consuming, and expensive endeavor. Therefore, the ability to effectively transfer knowledge learned from a well-annotated source domain (e.g., a synthetic dataset or data collected in a specific geographical location under favorable conditions) to an unlabeled target domain (e.g., real-world data from a new city or collected under adverse weather) is of paramount importance for the practical deployment of autonomous vehicles [12][13][14][15]. The success of UDA methods often hinges on the assumption that the underlying feature space between the source and target domains exhibits some degree of similarity [16]. If the fundamental features representing objects and scenes differ drastically, simple distribution alignment might not suffice for effective adaptation.
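To make the notion of "distribution alignment" and its underlying similarity assumption concrete, a simple discrepancy measure such as Maximum Mean Discrepancy (MMD) can be computed between feature sets extracted from the source and target domains. The sketch below is illustrative only; MMD with an RBF kernel is one common choice in the UDA literature, not the quantification approach used in this thesis.

```python
import torch

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between feature sets x (n, d) and y (m, d) with an RBF kernel."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2               # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))  # RBF kernel values
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Toy example: target features shifted relative to source features.
source_feats = torch.randn(512, 64)
target_feats = torch.randn(512, 64) + 0.5
print(float(rbf_mmd(source_feats, source_feats.flip(0))))  # ~0: same samples, different order
print(float(rbf_mmd(source_feats, target_feats)))          # larger: distributions differ
```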
While the primary focus of this section is on UDA, it is worth noting that other forms of domain adaptation exist, including Semi-Supervised Domain Adaptation (SSDA) [17], where a small fraction of target domain samples are labeled, Weakly Supervised Domain Adaptation (WSDA) [18], where only weak labels (e.g., image-level tags) are available in the target domain, and Supervised Domain Adaptation, where labeled data is available in both the source and target domains. It is important to note that many of these methods were initially developed for 2D image data.

2.2.1 Traditional Domain Adaptation

Early UDA methods primarily focused on aligning the feature distributions between the source and target domains. One prominent example is Domain Adversarial Neural Networks (DANN) [19], which employed an adversarial training paradigm. In this approach, a feature extractor is trained to produce features that are not only discriminative for the main task (e.g., object classification) but also indistinguishable with respect to the domain they originate from (source or target). A domain classifier is simultaneously trained to distinguish between source and target domain features, and the gradients from this domain classifier are reversed when updating the feature extractor. This forces the feature extractor to learn domain-invariant features that can confuse the domain classifier. However, these traditional methods were primarily designed for 2D image data and often do not effectively capture the unique characteristics and challenges associated with 3D point cloud data, such as its sparsity, irregularity, and lack of inherent order.

2.2.2 Adaptation for 3D Object Detection

Recent advancements in Unsupervised Domain Adaptation (UDA) for 3D object detection have shifted from basic feature alignment to sophisticated techniques that generate robust pseudo-labels for unlabeled target domains. These approaches leverage temporal, spatial, and synthetic data to bridge domain disparities, such as variations in LiDAR resolution or differences between synthetic and real-world environments. A prominent strategy involves self-training, where a detector pretrained on a labeled source domain produces bounding box predictions for the target domain. These predictions are refined and filtered into pseudo-labels, iteratively retraining the model to enhance its adaptability.

2.2.2.1 Tracking-Based Methods

A notable group of UDA techniques utilizes multi-object tracking (MOT) to exploit motion consistency across frames, improving pseudo-label reliability. MS3D++ [20] exemplifies this approach by combining outputs from an ensemble of pretrained detectors—each trained on distinct source datasets with varying architectures—using a kernel density estimation (KDE) algorithm. Using an ensemble of models with different network architectures and source domains reduces pitfalls common to any single detection set; examples include producing a false-positive object where adverse weather has left a gap in the point cloud, or falsely detecting a small pedestrian around traffic signs, whose appearance differs across regions. These fused detections initialize a 3D MOT tracker, built on SimpleTrack [21], yielding consistent pseudo-labels derived from trajectories, classification scores, and motion cues, with iterative refinement until performance converges.
MS3D++ enhances precision through temporal tactics: retroactive object labeling propagates dependable labels from later frames to correct earlier, ambiguous detections impacted by sparse points or occlusions, while static vehicle refinement ensures uniform bounding boxes for stationary objects, improving shape accuracy and detection coherence.

In contrast, CTRL [18] employs a track-centric backtracking technique, atypical for real-time applications. Following an initial forward pass, it revisits earlier frames to recover missed detections, enhancing track continuity and label completeness through bidirectional sequence refinement. Other methods focus on quantifying track reliability: SF-UDA 3D [22] employs sophisticated track equations for score labeling to reduce false positives and enforce temporal consistency throughout the timeline, while ST3D++ [23] employs a novel voting mechanism powered by a hybrid quality-aware triplet memory (HQTM) to ensure that tracklets are consistently explained by their detections.

2.2.2.2 Recent Advancements in UDA Techniques

Beyond tracking-based methods, recent UDA innovations focus on shape preservation, clustering, and extended sequence processing. Auto4D [24] preserves rigid object shapes by collecting point clouds in the object's reference frame, mitigating distortions from shifting centers. A convolutional neural network (CNN) derives shape estimates from these dense clouds, polished via closest-corner alignment. For static objects, aggregating points in world coordinates—enabled by precise ego-vehicle localization—refines size estimates, minimizing noise from erroneous detections. Tracking is supported by AB3DMOT [25].

Once Detected, Never Lost [26] adapts the Fully Sparse Detector (FSD) for offline analysis by incorporating both past and future frames. It uses bidirectional MOT: a forward pass constructs tracklets, followed by a backward pass that retrieves overlooked detections prior to tracklet initiation. A specialized module, integrating UNet for sparse feature extraction and PointNet for bounding box refinement, enhances proposals, with multi-way registration ensuring track consistency as a final step.

An unsupervised method in [27] applies augmentations like ray dropping to bolster generalization, particularly for distant objects. It employs L-shape fitting for box estimation and clustering to detect objects without labels, providing a straightforward response to domain shifts, though it omits temporal refinement.

Offboard 3D Object Detection from Point Cloud Sequences [28] enhances detectors like PointRCNN for multi-frame analysis, compensating for vehicle motion. Using AB3DMOT [25] for tracking, it aggregates point clouds for static objects to form comprehensive shape priors and aligns trajectories for dynamic objects, refining accuracy with lightweight PointNet-based regression networks across sequences.

DetZero [29] integrates an offline tracker with a multi-frame detector to ensure trajectory integrity. An attention-based module sharpens contextual details across extended point cloud sequences, addressing incomplete trajectories and diverse motion states. Decomposed regression further hones detections, delivering outstanding performance on the Waymo Open Dataset (85.15 mAPH, L2).

2.3 The nuScenes Dataset Format

The nuScenes dataset [2] uses a structured relational database format to organize its sensor data, annotations, and metadata.
It is composed of multiple interlinked tables that describe different aspects of the dataset. For example, the category table defines a hierarchical taxonomy of object classes (e.g., a top-level class “vehicle” with sub-classes like “vehicle.car” or “vehicle.truck”), and the attribute table specifies mutable properties of objects (for instance, whether a vehicle is parked or moving, or whether a bicycle has a rider). The sensor table enumerates all sensors employed (such as the LiDAR and each camera), while the calibrated_sensor table provides each sensor’s calibration parameters (intrinsic settings and extrinsic pose relative to the vehicle), ensuring that data from different sensors can be accurately aligned in a common reference frame. Additionally, the visibility table offers a measure of how well an object is observed in the camera views, binned into ranges (e.g., 0–40%, 40–80%, etc.), which gives annotators’ assessment of partial occlusions. The map table stores environmental context in the form of precomputed semantic maps (such as drivable area masks) associated with each location or log in the dataset. Several tables capture the dynamic, time-indexed elements of nuScenes. The log table contains metadata for each recording session (each “log” corresponds to a route driven by the data collection vehicle, with information such as the location, date, and the vehicle used). Each log is subdivided into scenes, and the scene table defines these distinct 20-second sequences (each scene is a continuous clip within a log). The sample table represents the key frames sampled at 2,Hz in each scene; each sample acts as a synchronized snapshot containing one LiDAR sweep and the set of camera images closest in time, along with all associated annotations. For each sample, the actual recorded sensor readings are listed in the sample_data table: for example, a LiDAR point cloud file and several camera image files would be separate entries in sample_data, each linked to a specific sensor and accompanied by the relevant calibration and the vehicle pose. The vehicle’s pose (position and orientation) at any timestamp is recorded in the ego_pose table, which gives the location of the 11 “ego” vehicle in a global coordinate frame for each sensor reading or sample. The annotations for objects are stored in the sample_annotation table, which contains the 3D bounding boxes for all objects present in each sample (key frame), along with pointers linking each box to a particular object instance and the object’s category and attributes. nuScenes tracks individual object instances within a scene using the instance table, which lists unique instance identifiers for objects (each physical object, such as a specific car, gets an instance ID within a scene). It should be noted that instances are not tracked across different scenes; if the same physical car appears in two separate scenes, it will be treated as two distinct instances in the dataset. Together, these tables provide a comprehensive and well-organized structure for the nuScenes data, enabling efficient lookup of sensor information and annotations needed for training and evaluating 3D object detection models. 12 CHAPTER 3 METHODOLOGY In this thesis, we address the challenge of Unsupervised Domain Adaptation (UDA) within the realm of 3D Object Detection. Our approach begins by systematically identifying and quantifying potential sources of domain shift, leveraging a carefully curated suite of both tailored and generic datasets. 
We also present a novel pipeline for generating domain-specific datasets using the CARLA simulator, designed to capture and analyze domain shift characteristics. Subsequently, we train our models on this dataset suite and perform cross-dataset evaluations to uncover key axes of domain shift. These findings underscore the necessity of an auto-labeling pipeline to effectively mitigate UDA challenges. In the following sections, we detail a foundational auto-training pipeline, critique its limitations, and propose targeted enhancements, including an innovative re-detection mechanism driven by tracking priors.

3.1 Sources of Domain Shift

Feature sets extracted from source datasets encapsulate the distinct intrinsic properties inherent to each dataset [12]. Within the domain of 3D object detection, these properties may originate from variations in environmental conditions, scene composition, LiDAR sensor specifications (including type, placement, and channel count), or the fidelity of sensor data [20], particularly when datasets are synthetically generated. A depiction of potential domain shift sources related to environmental conditions and scene composition between real-life datasets is shown in Figure 3.1. This study systematically categorizes the recurring patterns that define these intrinsic attributes, establishing a comprehensive framework for analyzing domain shift in 3D object detection.

In the UDA framework, we designate the source dataset, denoted $S_i$, as the fully annotated dataset employed for initial model training, and the target dataset, $S_j$, as the unlabeled dataset targeted for adaptation, where domain discrepancies must be minimized. The transition $S_i \to S_j$ represents the process of training a model on $S_i$ and evaluating its performance on $S_j$. To quantify domain shift, we first identify and define potential sources of domain shift. Let $R$ represent the set of domain shift sources, where $R = \{R_{LR}, R_{SR}, R_{SC}, R_{LP}, R_{LC}\}$. Here, $R_{LR}$ denotes variations in LiDAR resolution (e.g., 16Ch vs. 32Ch), $R_{SR}$ indicates the use of synthetic versus real data in source or target datasets, $R_{SC}$ refers to scenery disparities (e.g., urban vs. highway settings), $R_{LP}$ signifies differences in LiDAR sensor placement, and $R_{LC}$ reflects variations in the number of LiDAR sensors between source and target datasets.

Let $D$ denote any performance metric commonly utilized in 3D object detection for autonomous driving, such as mean Average Precision (mAP), NuScenes Detection Score (NDS), or class-specific Average Precision at a fixed threshold (e.g., $CAR\_AP_{0.5}$). The value $D_{S_i \to S_j}$ represents the performance of a model trained on $S_i$ and tested on $S_j$. In the context of UDA, the domain shift between datasets $S_i$ and $S_j$ is quantified as the difference between the baseline performance, $D_{S_i \to S_i}$ (when trained and tested on the source), and the adapted performance, $D_{S_i \to S_j}$ (when tested on the target), expressed as:

$$\Delta D_{S_i \to S_j} = D_{S_i \to S_j} - D_{S_i \to S_i}$$

This difference captures the domain shift between source and target datasets as the amount of variation in the chosen metric, whether positive (indicating improvement) or negative (indicating performance degradation). This $\Delta D_{S_i \to S_j}$ quantifies the aggregate domain shift projected onto metric $D$, encompassing contributions from all potential sources in $R$.
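As a concrete illustration of this bookkeeping, the short sketch below computes $\Delta D_{S_i \to S_j}$ from a table of cross-evaluation results; the dataset names and metric values are placeholders for illustration, not results reported in this thesis.

```python
# Hypothetical cross-evaluation results: metric[(train_set, test_set)] = NDS score.
cross_eval_nds = {
    ("carlaScenes64", "carlaScenes64"): 0.62,  # baseline D_{Si->Si} (illustrative value)
    ("carlaScenes64", "carlaScenes16"): 0.41,  # adapted  D_{Si->Sj} (illustrative value)
}

def domain_shift(metric: dict, source: str, target: str) -> float:
    """Delta D_{Si->Sj} = D_{Si->Sj} - D_{Si->Si}; negative means degradation on the target."""
    return metric[(source, target)] - metric[(source, source)]

print(domain_shift(cross_eval_nds, "carlaScenes64", "carlaScenes16"))  # -0.21
```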
Recognizing $\Delta D_{S_i \to S_j}$ as a composite variable influenced by multiple factors, we decompose it into contributions from individual sources:

$$\Delta D_{S_i \to S_j} = \sum_{k \in R} \Delta D^{R_k}_{S_i \to S_j}$$

where $\Delta D^{R_k}_{S_i \to S_j}$ represents the domain shift attributed to source $R_k$ between datasets $S_i$ and $S_j$. To isolate and quantify the impact of each $R_k$, this work employs a strategy of meticulously tailoring datasets such that only a single domain shift source varies between $S_i$ and $S_j$, while other sources remain controlled. For instance, to assess $R_{LR}$, we generate datasets differing solely in LiDAR resolution, holding factors like scenery and sensor count constant. This controlled approach enables precise measurement of $\Delta D^{R_k}_{S_i \to S_j}$ for each source, facilitating a detailed understanding of their individual contributions to the total domain shift.

Figure 3.1 Various driving scenarios that contribute to domain shift between real-life datasets: (a) Industrial Site, (b) Passenger Car, (c) University, (d) Highway, (e) Forest, (f) Rainy Weather, (g) Urban, (h) Tunnel.

3.2 Data Collection

To isolate the impact of a single domain shift source, $R_k$, it is imperative to create tailored datasets where extraneous domain shift sources do not contribute to the overall domain shift effect. The design and generation of such datasets hinge on several key considerations and requirements:

1. The datasets must be tailored to the autonomous driving domain, incorporating sensor configurations relevant to this context.
2. They should be straightforward to generate and distribute efficiently.
3. They must be compatible with prevalent 3D object detection frameworks to facilitate seamless training and evaluation.
4. Ground truth annotations for 3D object detection must be provided to ensure reliable assessment.

After careful evaluation, the nuScenes dataset format was selected as it satisfies all these criteria. The nuScenes format is widely adopted in the autonomous driving research community, offering a standardized structure that supports diverse sensor data and comprehensive ground truth annotations, thereby aligning with the needs of this study. However, a challenge remains: generating the requisite raw data to populate this format. To address this, we opted for the CARLA simulator as the primary data generation source. CARLA was chosen due to its extensive community support, rich ecosystem of libraries, and flexibility in simulating a wide range of autonomous driving scenarios, making it an ideal tool for producing controlled, high-fidelity sensor data.

While CARLA effectively generates raw data in this work, real-world applications might draw data from diverse sources, such as physical sensor deployments or other simulators. To address this heterogeneity and ensure versatility across projects, we advocate for a generalized approach to raw data storage. We adopt the ROSBag format, a widely recognized standard in robotics and autonomous systems. ROSBag supports the storage of raw sensor data, including LiDAR point clouds, camera images, and vehicle pose estimates, alongside metadata like 3D rigid body transformations and camera intrinsic calibration data (e.g., focal length, distortion coefficients). Compatible with both real-world and simulated data from CARLA, ROSBag offers a flexible, interoperable solution that maintains spatial and temporal relationships critical for 3D object detection and localization.
To convert this data into the nuScenes format, we introduce two new packages: carlaSceneCollector, which generates and records the raw data, and rosbag2nuscenes, a set of modules that processes and transforms ROSBag data, including sensor streams, transformations, and localization, to create tailored datasets for 3D object detection experiments.

3.3 Details on the carlaSceneCollector Package

The CARLA simulator, designed specifically for the autonomous driving domain, benefits from active maintenance and a robust community of contributors and users, ensuring its reliability for research purposes. Built on Unreal Engine [30], a high-fidelity game engine widely utilized across industries, CARLA provides powerful APIs that enable users to interact with its physics-based environment seamlessly. CARLA supports an extensive array of road agents, configurable as either the ego vehicle or other road users, and includes an autonomous traffic management system (see the CARLA Traffic Manager documentation: https://carla.readthedocs.io/en/latest/tuto_G_traffic_manager). This system simplifies the automation of both the ego vehicle and the surrounding traffic, enhancing scenario realism. Additionally, CARLA offers a bridge module for integration with the ROS middleware [31], facilitating data exchange. Spawning agents is straightforward, as CARLA provides predefined safe spawning points to ensure reliable agent placement.

The carlaSceneCollector package, developed as part of this work, leverages these capabilities by accepting a configuration file that defines the data collection schema. This file specifies parameters such as the target ego vehicle, map selection, sensor setup, asset choice for the ego vehicle, and the number of scenes to collect, where each scene comprises 20 seconds of data formatted according to the nuScenes standard.

The carlaSceneCollector package integrates a suite of modules to orchestrate data generation and collection from a running CARLA instance. The setAutopilot module enables or disables autopilot mode for the ego vehicle, ensuring a safe reset of any residual velocity or acceleration inputs. The generateTraffic module populates the scene with a user-specified number of pedestrians, bicycles, motorcycles, buses, cars, and trucks. The removeAllActors module clears all non-ego actors from the scene, allowing a fresh start for each scenario without carryover from prior configurations. The setEgoVehicleRandomPose module queries the map for safe spawning locations and randomly repositions the ego vehicle to one of these points. The collector module functions as a ROSBag recorder, capturing all sensor data—including LiDAR point clouds, camera images, and localization ground truth—along with frame transformations such as sensor calibration and pose information. A runner script within carlaSceneCollector coordinates these modules to execute the pipeline, achieving the desired number of scenes efficiently. A depiction of the carlaSceneCollector package's pipeline is shown in Figure 3.2.

Figure 3.2 Depiction of the carlaSceneCollector package's pipeline
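As an illustration of what such a collection schema might contain, the snippet below expresses a hypothetical configuration as a Python dictionary. The keys, asset identifiers, and blueprint names are assumptions for illustration; the actual file format and parameter names used by carlaSceneCollector may differ.

```python
# Hypothetical carlaSceneCollector collection schema (illustrative only).
collection_config = {
    "map": "Town10HD",                                # CARLA map to load
    "ego_vehicle_asset": "vehicle.lincoln.mkz_2020",  # asset choice for the ego vehicle
    "num_scenes": 1000,                               # each scene is 20 s (nuScenes standard)
    "traffic": {"cars": 40, "trucks": 5, "pedestrians": 30, "bicycles": 10},
    "sensors": [
        {"type": "sensor.camera.rgb", "name": "CAM_FRONT"},
        {"type": "sensor.lidar.ray_cast", "name": "LIDAR_TOP", "channels": 64},
    ],
}
```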
3.4 Details on the rosbag2nuScenes Package

After collecting a set of ROSBags, the rosbag2nuscenes package is responsible for converting the raw data to the nuScenes format. The package consists of standalone components as well as components that work interactively with one another to maintain a single, consistent view of the entire dataset. In this section, we first explain the configuration step of the pipeline. Subsequently, we elaborate on the generation of metadata tables that establish the dataset's temporal structure, namely log, scene, and sample. We then explore the data-related tables, encompassing sensor, sample_data, calibrated_sensor, and additional tables such as category, ego_pose, instance, map and sample_annotation, which collectively define the nuScenes dataset format.

3.4.1 Configuration

The rosbag2nuscenes package maintains a global parameter set to configure a single dataset conversion session, designed to process any ROSBag data, not solely those collected from CARLA. The parameter rosbag_paths specifies a set of ROSBag file paths to be considered for dataset generation. The annotation_type parameter defines the ROS message type for the annotation topic, which provides ground truth data including bounding boxes, velocities, and object classifications. To enhance compatibility, we developed conversion functions that transform various 3D object detection message types commonly used in the community into a unified derived_object_array format. This format is widely adopted and straightforward for ROS developers to utilize.

The parameters global_frame_id and ego_frame_id designate the frame names for the global reference frame (to which localization messages refer) and the ego vehicle frame, respectively. The ego frame is defined as the ground projection of the midpoint between the two rear wheels of the ego vehicle. To prevent unintended domain shifts, we ensured that the ego vehicle frame remains consistently positioned relative to the vehicle body across all datasets in this study. This consistency, for instance, maintains ground plane points at a uniform z-coordinate regardless of sensor configuration.

The sensors_of_interest parameter identifies the set of sensors that rosbag2nuscenes processes. Additional sensor-specific parameters include modality, topic_name, is_anchor, and an optional sensor_info_topic. The is_anchor boolean indicates whether a sensor's timestamps serve as the reference for defining a sample. When a sensor is designated as an anchor, rosbag2nuscenes synchronizes all other sensor data to its timestamps, discarding any sample where a match cannot be found. Furthermore, we define sample_duration as the maximum allowable time difference between a message and its nearest anchor timestamp for inclusion in a sample, and scene_duration as the total duration of a scene, set to 20 seconds in accordance with the nuScenes standard.

The rosbag2nuscenes package incorporates several post-processing modules to refine samples, annotations, and point clouds. The annotation_filters module includes a collection of filters tailored for annotation data, while sample_filters targets the generation of sample data, and pointcloud_filters focuses on processing point cloud data. These filters are applied sequentially according to user-defined specifications within the rosbag2nuscenes pipeline.
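A hypothetical conversion configuration, written here as a Python dictionary purely for illustration, ties these parameters together. The concrete values, topic names, and on-disk configuration format are assumptions; only the parameter names mirror those described above.

```python
# Illustrative rosbag2nuscenes session configuration (values and format are assumed).
conversion_config = {
    "rosbag_paths": ["/data/bags/run_0001.bag", "/data/bags/run_0002.bag"],
    "annotation_type": "derived_object_msgs/ObjectArray",  # converted to derived_object_array
    "global_frame_id": "map",
    "ego_frame_id": "base_link",
    "scene_duration": 20.0,   # seconds, nuScenes standard
    "sample_duration": 0.05,  # max |t_msg - t_anchor| for 20 Hz data
    "sensors_of_interest": [
        {"name": "LIDAR_TOP", "modality": "lidar",
         "topic_name": "/lidar_top/points", "is_anchor": True},
        {"name": "CAM_FRONT", "modality": "camera",
         "topic_name": "/cam_front/image", "is_anchor": False},
    ],
}
```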
Figure 3.3 Comparison of different object classes across LiDAR channel configurations in the CARLA simulator: (a)–(d) general scene, (e)–(h) car, (i)–(l) pedestrian, and (m)–(p) bicycle, each captured at 16, 32, 64, and 128 channels.

3.4.2 Pipeline

The pipeline developed in this work is organized into three distinct class categories to facilitate nuScenes dataset generation. The first category encompasses classes that manage data storage and mapping for the nuScenes format, the second includes classes that facilitate coordination among other classes, and the third comprises utility classes that assist in various pipeline operations. For example, the ContextManager class, part of the second category, oversees the execution flow and relays critical data between classes to support subsequent tasks. At initialization, ContextManager parses the sensors_of_interest parameter to create Sensor objects for each designated sensor. The Sensor class, an instance of the first category, holds sensor-specific attributes (token, modality, and channel) and directly aligns them with the nuScenes format without complex processing. Subsequently, ContextManager instructs the storage of this data to disk, generating the sensor.json file. This sensor information, retained in memory by ContextManager, is shared with later stages, such as the creation of the calibrated_sensor and sample_data tables, ensuring cohesive dataset assembly. The initial sensor set definition is crucial, as it remains constant across ROSBags and establishes the dataset's sensor framework. In scenarios requiring a heterogeneous sensor configuration across scenes, rosbag2nuscenes must receive the superset of sensors during the configuration phase. While autonomous driving datasets typically feature homogeneous sensor setups, rosbag2nuscenes is fully equipped to handle heterogeneous configurations when necessary.

After creating the sensor set, ContextManager creates a Log object for each of the ROSBag files. The Log class is one of the most complex classes in the rosbag2nuscenes package, since it holds the most generic and interconnected information for the nuScenes dataset. A depiction of the rosbag2nuscenes package's pipeline is shown in Figure 3.4.

Figure 3.4 Depiction of the rosbag2nuscenes package's pipeline

3.4.2.1 Creation of Logs

The Log class handles a ROSBag file by utilizing its file path and a scene_start_index parameter, an offset that differentiates scenes across various logs to ensure unique name fields in the nuScenes format, calculated and supplied by the ContextManager to each Log instance. It segments the ROSBag into uniform portions according to the scene_duration parameter, keeping any remaining data if the total length is not perfectly divisible, thereby retaining all data rather than omitting leftovers, despite scenes typically lasting 20 seconds for standardization. Next, it generates EgoPose objects for each piece of localization data in the ROSBag, which support the Scene objects in producing SampleData objects representing raw sensor outputs. The EgoPose class contains the necessary details to populate the ego_pose table in the nuScenes format, capturing the ego vehicle's position relative to a global frame.
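As a small illustration of what such an ego_pose record contains, the snippet below builds a nuScenes-style entry from a localization reading. The field names follow the nuScenes schema (token, timestamp, translation, rotation), while the helper function and message layout are hypothetical.

```python
import uuid

def make_ego_pose_record(timestamp_us: int, position_xyz, orientation_wxyz) -> dict:
    """Build a nuScenes-style ego_pose entry from one localization reading."""
    return {
        "token": uuid.uuid4().hex,          # unique record identifier
        "timestamp": timestamp_us,          # microseconds, as in nuScenes
        "translation": list(position_xyz),  # ego position in the global frame [m]
        "rotation": list(orientation_wxyz), # quaternion (w, x, y, z) in the global frame
    }

# Hypothetical reading: ego vehicle 10 m along x, aligned with the global x-axis.
record = make_ego_pose_record(1533151603512404, (10.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0))
print(record["translation"], record["rotation"])
```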
Following this, the Log class creates CalibratedSensor objects that define a sensor’s specific state, including intrinsic and extrinsic details, corresponding to the calibrated_sensor table in nuScenes. In the nuScenes format, log data represents a continuous data collection session within a global timeline, encompassing a single interval of recorded activity. While the log spans the entire session, scenes represent smaller portions within it, meaning a log consists of multiple scenes, so the Log class manages all ego positions and calibrated sensor information, distributing these details to individual Scene objects created for each segment, along with the ROSBag object and its specific start and end times. The Log class also initiates the AnnotationManager object, which oversees the generation of the instance and sample_annotation tables that store ground truth object information. Using the pre-existing list of Scene objects, the Log class activates the AnnotationManager with all sample data from the current log to connect frame-specific objects in the sample_annotation table to the comprehensive timeline of road agents in the instance table. 3.4.2.2 Scene, Sample and SampleData In the nuScenes format, the scene table describes a 20-second portion of a data collection session, tied to a specific log entry. Each scene identifies a start and end sample, where a sample captures a single frame in the scene’s timeline and connects to synchronized sensor data stored in sample_data. For easy navigation, sample includes links to the previous and next samples within the scene. The sample_data table ties this sensor data to ego_pose and calibrated_- sensor entries for accurate positioning and calibration. It also contains timestamp (when the data was captured), filename, and fileformat (indicating the sensor type and data location), with timestamps that may vary across a sample’s sensors. Additionally, sample_data has an is_key_frame flag to show if a frame is labeled. While nuScenes collects data at 20 Hz, only 2 Hz frames are labeled, leaving most with a false is_key_frame value. These unlabeled frames remain 24 useful for multi-frame detection models like CenterPoint, which we explore in later experiments. To create these tables, the Scene class gathers all sensor messages from the ROSBag between the given start and end times. It then picks out anchor topic messages based on the is_anchor setting in the sensor setup, typically the highest-resolution LiDAR in multi-LiDAR cases, as it’s key for localization when using one LiDAR. Samples are formed by setting a time window around each anchor message using sample_duration, matching messages from different sensors listed in sensors_of_interest if their timestamps fit within this window. For 20 Hz data, sample_- duration is about 0.05 seconds; for 10 Hz, it’s around 0.1 seconds. We opted for 20 Hz data collection in CARLA to match the nuScenes standard, maintaining timestamp intervals of 0.05 seconds, since we observed that differing frequencies introduce unintended domain shifts in multi- frame detectors like CenterPoint, which rely on multi-frame data paired with relative timestamp differences calculated from the earliest point cloud’s timestamp. Adjusting these time gaps isn’t helpful since CenterPoint also predicts per-object velocity, tied to point feature shifts over specific time intervals, affecting performance consistency across datasets. 
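The anchor-based grouping described above can be sketched as follows; the data structures are simplified stand-ins (plain lists of timestamped messages), not the actual Scene class internals.

```python
def build_samples(anchor_times, other_sensors, sample_duration=0.05):
    """Group sensor messages around each anchor timestamp.

    anchor_times: sorted list of anchor LiDAR timestamps (seconds).
    other_sensors: dict mapping sensor name -> sorted list of (timestamp, message).
    Per sensor, a sample keeps the message closest to the anchor if it falls within
    +/- sample_duration; samples missing a required sensor can later be discarded.
    """
    samples = []
    for t_anchor in anchor_times:
        sample = {"anchor_time": t_anchor, "data": {}}
        for name, messages in other_sensors.items():
            best = min(messages, key=lambda m: abs(m[0] - t_anchor), default=None)
            if best is not None and abs(best[0] - t_anchor) <= sample_duration:
                sample["data"][name] = best[1]
        samples.append(sample)
    return samples

# Toy usage: a 20 Hz anchor LiDAR and a camera stream that dropped one frame.
anchors = [0.00, 0.05, 0.10]
cams = {"CAM_FRONT": [(0.01, "img0"), (0.11, "img2")]}
print(build_samples(anchors, cams))
```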
During the pairing of sensor data with the anchor topic, the Scene class creates Sample and SampleData objects, linking their next and prev fields in a two-way queue structure. The Scene class also attaches each SampleData object to its corresponding ego_pose entry from the EgoPose object during this process, while its dual-queue and reference-based design ensures clear connections between Sample and SampleData objects, making it a key module that manages both these relationships and its own environmental data for the nuScenes format. After generating Sample objects, each Sample creates individual SampleData objects for every piece of synchronized sensor data, linking each one to the corresponding entry in the calibrated_- sensor table by utilizing the CalibratedSensor object to ensure proper calibration details are attached. Unlike other components in the rosbag2nuscenes package, the SampleData class is unique because it directly writes the sensor data to disk instead of holding it in memory, a choice driven by the num_features configuration parameter that determines the target dimensions of the point cloud data to be saved, such as deciding whether to include all five fields—x, y, z, intensity, 25 and timestamp—or just a subset like the first three if we exclude intensity. We require this parameter to be specified because 3D detection frameworks, like mmdetection3d, depend on consistent point cloud parsing rules for both training and evaluation, and mismatched dimensions can disrupt these processes. For example, if we don’t need the intensity field, we can set num_features to use only x, y, and z, tailoring the data to our needs, but this flexibility demands that all input point clouds share the same field structure across the dataset. To achieve this uniformity, the SampleData class first reads the incoming point cloud data, then adjusts it by either adding padding or trimming columns as necessary to match the specified num_features, ensuring every saved point cloud has the same format. In our work, we set num_features to 5, covering the full set of {x, y, z, intensity, timestamp}, since this is a widely used configuration for multi-frame detection models like CenterPoint, which we explore later, and saving directly to disk after processing helps manage memory efficiently by avoiding the need to retain the already-processed data in memory. 3.4.2.3 AnnotationManager The AnnotationManager class is tasked with managing the ground truth data, ensuring it is properly structured and stored within the sample_annotation and instance tables of the nuScenes dataset format. It begins this process by collecting all messages published to the topic specified by the annotation_type parameter, which defines the type of ROS message carrying annotation information, and then converts these messages into a standardized derived_object_- array format for consistency across the pipeline. Following this conversion, the class carefully processes each object message by extracting and storing its id—a unique identifier assigned to each frame-specific object that corresponds to a specific CARLA agent—into a comprehensive list; this list serves as the foundation for generating Instance objects, where each Instance object represents a distinct agent in the dataset, while the sample_annotation messages provide snapshots of that agent’s state at particular points in time. 
In the nuScenes framework, every sample_annotation must be associated with a specific sample entry, which represents a single frame in the timeline, so the AnnotationManager systematically works through the full list of instances, gathering all annotations tied to each instance and then pairing these annotations with 26 the appropriate sample entries; it does this by applying a time window defined by the sample_- duration parameter, matching annotations to samples if their timestamps fall within this window, a method akin to how samples are initially created from sensor data. This pairing approach, although thorough, demands significant computational effort because it requires iterating over the entire set of annotations for each instance and then aligning them with every sample entry in the dataset, leading to a time complexity of O (𝑛2), where 𝑛 denotes the total number of annotations or samples, making it one of the more resource-intensive operations in the pipeline. 3.4.2.4 Post-processing Filters The rosbag2nuscenes package incorporates three distinct post-processing modules to enhance the dataset’s quality after it has been saved to disk —AnnotationFilter, SampleFilter, and PointCloudFilter. These modules are configured using a global parameter list, consistent with the setup of other components within this package suite, and they operate on the dataset by leveraging the tools and context provided by the nuScenes-devkit, allowing for additional refinement of the data entities stored on disk. The AnnotationFilter module is composed of four specialized submodules designed for filtering: BoxElevationShiftFilter, RangeFilter, AnnotationRelationCorrector and PointsFilter. The BoxElevationShiftFilter adjusts the height of ground truth bounding boxes for specific object classes when necessary, addressing a quirk in CARLA where a bounding box, defined as 𝑏𝑏𝑜𝑥 = {𝑥, 𝑦, 𝑧, 𝑙, 𝑤, ℎ, 𝑜}—with 𝑥, 𝑦, 𝑧 as the center coordinates, 𝑙, 𝑤, ℎ as the length, width, and height, and 𝑜 as the yaw orientation—positions 𝑧 at ground level rather than the box’s center; to align with nuScenes’ center-based standard, it adds ℎ/2 to the 𝑧-coordinate, ensuring the 𝑏𝑏𝑜𝑥𝑥𝑦𝑧 accurately reflects the box’s midpoint. The RangeFilter is a straightforward tool that takes min_range and max_range values along with a channel input (defaulting to LiDAR, though it must match a sensor entry name) to exclude annotations falling outside these specified distance boundaries, helping to focus on relevant objects within a sensor’s effective range. The PointsFilter removes annotations that lack sufficient points within their bounding box, determined by a min_points threshold, and it supports multiple LiDAR inputs via channel_- 27 list, counting the total points inside the box across all listed sensors to ensure meaningful data density. The AnnotationRelationCorrector is a more intricate submodule that addresses the ripple effects of prior filters deleting sample_annotation entries; in nuScenes, each instance entry points to a starting and ending sample_annotation, so if one is removed, the entire instance can be lost, and consecutive sample_annotation entries rely on next and prev pointers for easy traversal, which can break when entries disappear; this filter meticulously scans the full sample_- annotation table to mend these gaps, either by finding the next valid annotation or adjusting pointers if no further entries exist, marking the current one as the last if needed. 
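The pointer-mending step can be illustrated with a simplified sketch that relinks the next and prev fields of one instance's surviving, time-ordered annotations after some entries have been filtered out. This is illustrative only; the real AnnotationRelationCorrector also updates the instance table and performs the velocity checks described below.

```python
def relink_annotations(annotations):
    """Repair next/prev pointers for one instance's surviving annotations.

    annotations: list of dicts with 'token', 'timestamp', 'next', 'prev' (nuScenes-style,
    where an empty string marks the absence of a neighbor).
    Returns the first and last tokens so the owning instance record can be updated.
    """
    annotations = sorted(annotations, key=lambda a: a["timestamp"])
    for i, ann in enumerate(annotations):
        ann["prev"] = annotations[i - 1]["token"] if i > 0 else ""
        ann["next"] = annotations[i + 1]["token"] if i < len(annotations) - 1 else ""
    return (annotations[0]["token"], annotations[-1]["token"]) if annotations else ("", "")
```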
The AnnotationRelationCorrector is a more intricate submodule that addresses the ripple effects of prior filters deleting sample_annotation entries. In nuScenes, each instance entry points to a starting and an ending sample_annotation, so if one of these is removed, the entire instance can be lost; in addition, consecutive sample_annotation entries rely on next and prev pointers for traversal, which break when entries disappear. This filter scans the full sample_annotation table to mend these gaps, either by finding the next valid annotation or by adjusting the pointers if no further entries exist, marking the current one as the last if needed. This correction process is both critical and time-intensive because CARLA's object messages, which form the initial basis for our ground truth data, list all agents in a scene—whether they are visible or not—without checking whether sensors can actually detect them due to occlusions or limited range, requiring us to refine the dataset after collection. Additionally, the AnnotationRelationCorrector evaluates the velocities attached to sample_annotation entries, which the nuScenes-devkit calculates by interpolating the three nearest states of an instance to estimate movement; when an instance has too few annotations, this can lead to geometrically implausible velocity values, and in such cases the filter removes the entire instance from the dataset to preserve accuracy and reliability, ensuring the ground truth reflects observable and feasible object behavior. The SampleFilter module includes a single submodule, UnsyncedSamplesFilter, which examines all sample entries to detect those lacking annotations while their previous and next samples both contain related annotation data. This filter is essential because, in rare instances, all sample_annotation entries tied to a sample might be removed—often due to CARLA occasionally repeating an object message for the same timestamp, which the AnnotationManager then discards as a duplicate—leaving an empty sample that does not carry meaningful information and is more likely an outlier than valuable data; the SampleFilter identifies such cases, reconnects the previous and next samples by updating their pointers, and removes the empty sample from the dataset to maintain its integrity. The PointCloudFilter module features a single submodule called SelfCropBoxFilter, which uses min and max vectors to define a bounding box (𝑏𝑏𝑜𝑥) outlining the ego vehicle's boundaries, along with a channel_list parameter specifying which LiDAR sensor or sensors' data should be processed, and it removes any points falling within this defined 𝑏𝑏𝑜𝑥. This filtering step is necessary because we found that when a model trained on a dataset without visible ego vehicle parts in the sensor data is tested on a dataset where the ego vehicle is detectable by the LiDAR sensors, it often generates persistent false-positive detections around the ego vehicle's location; this unwanted behavior biases the performance metrics, which we prevent by ensuring the point cloud data reflects only the external environment and not the vehicle itself.
3.5 Quantification of Domain Shift
To explore domain shift, we selected a subset of potential sources—specifically 𝑅𝐿𝑅 (LiDAR resolution) and 𝑅𝑆𝑅 (synthetic versus real data)—and designed our datasets to isolate their effects. When building datasets to examine 𝑅𝐿𝑅, we equipped the ego vehicle with a sensor setup that includes one RGB camera and four LiDAR sensors, all fixed at the same position relative to the vehicle's frame to maintain consistency. Although these LiDAR sensors share the same location, they differ in resolution, operating at 16, 32, 64, and 128 channels, allowing us to test how resolution impacts detection performance. Point clouds collected from the CARLA simulator for the estimation of 𝑅𝐿𝑅 are shown in Figure 3.3 for comparison. We crafted this synthetic sensor arrangement within a uniform scenario to remove influences from other domain shift factors, such as 𝑅𝑆𝑅 (synthetic vs. real data),
𝑅𝑆𝐶 (variations in scenery), 𝑅𝐿𝑃 (differences in LiDAR placement), and 𝑅𝐿𝐶 (number of LiDAR units), ensuring that only 𝑅𝐿𝑅 drives any observed domain shift. We created four distinct datasets, each tailored to a specific LiDAR resolution (16, 32, 64, and 128 channels), resulting in the carlaScenes datasets named accordingly—carlaScenes 16, carlaScenes 32, carlaScenes 64, and carlaScenes 128—to assess the individual effect of each LiDAR's resolution. This deliberate and controlled approach lets us measure how 𝑅𝐿𝑅 affects key performance metrics in 3D object detection and provides clear insight into its role. To maintain consistency with the nuScenes dataset, we positioned the LiDAR sensors at the same location and orientation relative to the ego vehicle frame, which is defined as the ground projection of the midpoint between the two rear wheels, and mirrored this placement accordingly. Overall, we gathered 1000 scenes and sampled them to obtain approximately 28,000 samples, aligning with the sample count of the nuScenes dataset. We further emphasize maintaining a similar or identical number of training samples across all datasets; although sample size alone does not guarantee model success without considering other hyperparameters, we intentionally standardized this aspect to ensure more reliable and comparable training sessions. Each scene features a random assortment of agents—including the ego vehicle—placed and acting unpredictably across the map, which increases the dataset's variety and makes it more robust for analysis. To investigate 𝑅𝑆𝑅 (synthetic vs. real data), we utilized the nuScenes dataset and created a custom dataset by labeling real-world data from ADASTEC Corp. using Segments.AI. The datasets employed in this work are detailed in Table 3.1.

Table 3.1 Datasets Used in This Study

Dataset Name      Num LiDARs   LiDAR Resolution   Synthetic   Number of Samples
nuScenes          1            32                 No          28130
adaScenes         5            128+32             No          19727
carlaScenes 16    1            16                 Yes         27902
carlaScenes 32    1            32                 Yes         27902
carlaScenes 64    1            64                 Yes         27902
carlaScenes 128   1            128                Yes         27902

Although the adaScenes dataset has fewer samples than the others, we addressed this difference by randomly selecting an equal number of samples from the nuScenes and carlaScenes datasets to match the size of adaScenes, ensuring a fair comparison without the influence of dataset length. We also used only the single top LiDAR of adaScenes so as not to introduce additional potential domain shift sources such as 𝑅𝐿𝑃 and 𝑅𝐿𝐶. Once the datasets were prepared, we chose the 3D detection framework and neural network model for our experiments: the CenterPoint model within the mmdetection3d library [32], a 3D detection framework built on PyTorch [33] that simplifies working with pre-trained models across various architectures and datasets, especially since all of our datasets follow the nuScenes format. Notably, CenterPoint already has a pre-trained version for nuScenes, though it relies on the dataset's inclusion of point cloud intensity data. The intensity value in a point cloud is a measure of how strongly a surface reflects the LiDAR signal; it is influenced by distance, because signal strength weakens over range, and it varies between LiDAR manufacturers due to differences in their hardware and calibration methods, making it inconsistent across devices.
A key challenge arises with CARLA's LiDAR simulator, which assigns intensity using the basic formula 𝐼 = 𝑒^(−𝛼𝑑), where 𝛼 is a fixed attenuation rate and 𝑑 is the point's distance; this oversimplified approach produces intensity values that do not match real-world conditions [34] and lack the complexity of actual sensor behavior [35]. Because of this limitation and the variability of real LiDAR intensity, we chose to retrain the CenterPoint model without the intensity channel, ensuring our results depend on more reliable features such as position and avoiding potential inaccuracies introduced by this noisy and simulator-specific signal. For the training sessions, we used a total of 20 epochs with learning-rate auto-scaling and kept only the [𝑐𝑎𝑟, 𝑚𝑜𝑡𝑜𝑟𝑐𝑦𝑐𝑙𝑒, 𝑝𝑒𝑑𝑒𝑠𝑡𝑟𝑖𝑎𝑛] heads during training. This class truncation is necessary because CARLA does not provide class labels that distinguish between four-wheeled objects such as 𝑡𝑟𝑢𝑐𝑘 and 𝑏𝑢𝑠, or between two-wheeled objects such as 𝑏𝑖𝑐𝑦𝑐𝑙𝑒 and 𝑠𝑐𝑜𝑜𝑡𝑒𝑟. For the concrete model configuration, we used centerpoint_pillar02_second_secfpn_8xb4-cyclic, which corresponds to the CenterPoint model with pillar encoding at a 0.2 m voxel resolution, a SECOND backbone, a SECONDFPN neck, batch normalization applied throughout the model, and a cyclic learning rate schedule over 20 epochs. The remaining identifier, 8xb4, refers to having 8 samples per GPU across 4 GPUs, which matches our training sessions. We trained and tested all of our models on a machine equipped with 4 × V100 GPUs. Execution times are shown in Table 3.2.

Table 3.2 Training Execution Times Across Datasets

Dataset Name      Number of Samples   Number of Epochs   LiDAR Resolution   Time (hours)
nuScenes          28130               20                 32                 36
adaScenes         19727               20                 128                42
carlaScenes 16    27902               20                 16                 11
carlaScenes 32    27902               20                 32                 14
carlaScenes 64    27902               20                 64                 22
carlaScenes 128   27902               20                 128                33

CHAPTER 4
EVALUATION AND RESULTS
We evaluate model performance across datasets by presenting key metrics, including mean Average Precision (mAP), NuScenes Detection Score (NDS), and class-specific Average Precision at a 0.5 IoU threshold for cars (Car AP 0.5), pedestrians (Pedestrian AP 0.5), and motorcycles (Motorcycle AP 0.5), as detailed in Tables 4.1, 4.2, 4.3, 4.4, and 4.5, respectively. To further assess domain shift impacts, we report the performance differences, Δ𝐷_{𝑆𝑖→𝑆𝑗}, in Tables 4.6, 4.7, 4.8, and 4.9, highlighting how these variations influence detection accuracy across datasets.
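The differences reported in Tables 4.6–4.9 are consistent with Δ𝐷_{𝑆𝑖→𝑆𝑗} = 𝐷(𝑆𝑖→𝑆𝑗) − 𝐷(𝑆𝑖→𝑆𝑖), i.e. a model's cross-domain score minus its score on its own training dataset. The short sketch below reproduces one such entry; the function name is illustrative, and the two mAP values are taken from Table 4.1.

```python
def delta_table(metric: dict) -> dict:
    """Delta D_{Si->Sj} = D(Si->Sj) - D(Si->Si): cross-domain score minus the
    same model's score on its own training dataset (cf. Tables 4.6-4.9)."""
    return {train: {test: round(score - row[train], 4) for test, score in row.items()}
            for train, row in metric.items()}

# Two mAP entries from Table 4.1; the full tables follow the same pattern.
mAP = {"carlaScenes 64": {"carlaScenes 64": 0.9636, "carlaScenes 16": 0.6618},
       "carlaScenes 16": {"carlaScenes 64": 0.7761, "carlaScenes 16": 0.8398}}
print(delta_table(mAP)["carlaScenes 64"]["carlaScenes 16"])  # -0.3018, as in Table 4.6
```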
Table 4.1 Trained/Tested mAP Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0.8398           0.8265           0.7761           0.6684            0.0973     0.0998
carlaScenes 32    0.7714           0.9405           0.9474           0.9337            0.1148     0.1922
carlaScenes 64    0.6618           0.9089           0.9636           0.9727            0.1254     0.2227
carlaScenes 128   0.5541           0.8547           0.9493           0.9721            0.1014     0.22
nuScenes          0.5574           0.695            0.7134           0.6771            0.5011     0.4957
adaScenes         0.3898           0.4849           0.5201           0.5262            0.1619     0.542

Table 4.2 Trained/Tested NDS Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0.7518           0.7319           0.6987           0.6282            0.2694     0.2147
carlaScenes 32    0.7178           0.8135           0.8147           0.7961            0.2976     0.2767
carlaScenes 64    0.6641           0.8017           0.8368           0.8386            0.3187     0.2996
carlaScenes 128   0.6009           0.7639           0.8249           0.8404            0.3077     0.3031
nuScenes          0.4939           0.5691           0.573            0.5554            0.5743     0.4884
adaScenes         0.3798           0.4314           0.4566           0.4649            0.3243     0.5339

Table 4.3 Trained/Tested Car AP 0.5 Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0.863            0.8203           0.8364           0.809             0.134      0.0444
carlaScenes 32    0.7635           0.9075           0.9336           0.9071            0.1259     0.0647
carlaScenes 64    0.6127           0.8548           0.9511           0.9624            0.1178     0.065
carlaScenes 128   0.4639           0.7669           0.9353           0.963             0.0854     0.046
nuScenes          0.5169           0.6036           0.6219           0.5475            0.5220     0.4795
adaScenes         0.2934           0.3828           0.4388           0.4394            0.1942     0.543

Table 4.4 Trained/Tested Pedestrian AP 0.5 Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0.6678           0.6936           0.6275           0.517             0.0155     0.1586
carlaScenes 32    0.6181           0.899            0.9218           0.925             0.1002     0.4053
carlaScenes 64    0.6017           0.8947           0.9332           0.9543            0.1464     0.4951
carlaScenes 128   0.5184           0.8471           0.9184           0.9519            0.1193     0.5055
nuScenes          0.4167           0.6549           0.704            0.7481            0.5800     0.6942
adaScenes         0.3331           0.5816           0.6288           0.7015            0.1547     0.7688

Table 4.5 Trained/Tested Motorcycle AP 0.5 Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0.8898           0.8442           0.7598           0.576             0.0041     0.0012
carlaScenes 32    0.8289           0.9493           0.9394           0.9081            0.0033     0.0077
carlaScenes 64    0.6607           0.8959           0.9745           0.9782            0.0046     0.0171
carlaScenes 128   0.5589           0.8355           0.9499           0.9787            0.0012     0.0276
nuScenes          0.4752           0.6075           0.5954           0.5459            0.1579     0.0706
adaScenes         0.3014           0.2588           0.2897           0.2716            0.0052     0.1249

Table 4.6 mAP Differences (±) Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0                -0.0133          -0.0637          -0.1714           -0.7425    -0.74
carlaScenes 32    -0.1691          0                0.0069           -0.0068           -0.8257    -0.7483
carlaScenes 64    -0.3018          -0.0547          0                0.0091            -0.8382    -0.7409
carlaScenes 128   -0.418           -0.1174          -0.0228          0                 -0.8707    -0.7521
nuScenes          0.0563           0.1939           0.2123           0.176             0          -0.0054
adaScenes         -0.1522          -0.0571          -0.0219          -0.0158           -0.3801    0

Table 4.7 NDS Differences (±) Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0                -0.0199          -0.0531          -0.1236           -0.4824    -0.5371
carlaScenes 32    -0.0957          0                0.0012           -0.0174           -0.5159    -0.5368
carlaScenes 64    -0.1727          -0.0351          0                0.0018            -0.5181    -0.5372
carlaScenes 128   -0.2395          -0.0765          -0.0155          0                 -0.5327    -0.5373
nuScenes          -0.0804          -0.0052          -0.0013          -0.0189           0          -0.0859
adaScenes         -0.1541          -0.1025          -0.0773          -0.069            -0.2096    0

Table 4.8 Car AP 0.5 Differences (±) Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0                -0.0427          -0.0266          -0.054            -0.729     -0.8186
carlaScenes 32    -0.144           0                0.0261           -0.0004           -0.7816    -0.8428
carlaScenes 64    -0.3384          -0.0963          0                0.0113            -0.8333    -0.8861
carlaScenes 128   -0.4991          -0.1961          -0.0277          0                 -0.8776    -0.917
nuScenes          -0.0051          0.0816           0.0999           0.0255            0.0000     -0.0425
adaScenes         -0.2496          -0.1602          -0.1042          -0.1036           -0.3488    0

Table 4.9 Pedestrian AP 0.5 Differences (±) Across Datasets

Train/Test        carlaScenes 16   carlaScenes 32   carlaScenes 64   carlaScenes 128   nuScenes   adaScenes
carlaScenes 16    0                0.0258           -0.0403          -0.1508           -0.6523    -0.5092
carlaScenes 32    -0.2809          0                0.0228           0.026             -0.7988    -0.4937
carlaScenes 64    -0.3315          -0.0385          0                0.0211            -0.7868    -0.4381
carlaScenes 128   -0.4335          -0.1048          -0.0335          0                 -0.8326    -0.4464
nuScenes          -0.1633          0.0749           0.1240           0.1681            0.0000     0.1142
adaScenes         -0.4357          -0.1872          -0.14            -0.0673           -0.6141    0

The following figures explore how LiDAR resolution influences different performance metrics within the carlaScenes datasets. In these visualizations, the x-axis represents the LiDAR resolution, which includes 16, 32, 64, and 128 channels, while the y-axis displays the corresponding metric values. To emphasize specific results, a marker and an arrow are used to highlight test outcomes when the training and testing datasets share the same LiDAR resolution. This highlighting is applied to same-dataset evaluations, such as carlaScenes 16 tested on carlaScenes 16, as well as cross-dataset evaluations, like nuScenes (with 32 channels) tested on carlaScenes 32, and adaScenes (with 128 channels) tested on carlaScenes 128, even though these datasets are not identical. More specifically, Figure 4.3 provides a detailed view for models both trained and tested on carlaScenes datasets. On the left side, it features a heatmap showing the performance differences across various carlaScenes train-test pairs, with colors indicating the magnitude of these differences. On the right side, a line chart illustrates the class-specific Average Precision at 0.5 IoU for cars (Car AP 0.5), pedestrians (Pedestrian AP 0.5), and motorcycles (Motorcycle AP 0.5). Similarly, Figures 4.1 and 4.2 present the mean Average Precision (mAP) and NuScenes Detection Score (NDS), respectively, for models trained and tested on carlaScenes datasets. In both figures, the left side shows a heatmap representing the performance differences as listed in Tables 4.6 and 4.7, while the right side displays a line plot that tracks how these metrics change across different LiDAR resolutions. Together, these figures demonstrate the effect of LiDAR resolution on model performance within the carlaScenes dataset family. On the other hand, Figure 4.4 examines the same set of metrics—mAP, NDS, Car AP 0.5, Pedestrian AP 0.5, and Motorcycle AP 0.5—but for models trained on nuScenes and adaScenes and then tested on carlaScenes datasets. This figure also highlights points where the LiDAR resolutions match between the training and testing datasets, making it easier to spot these specific comparisons. By doing so, it reveals how performance varies due to domain shift when models are applied across different datasets, offering insights into the challenges of cross-dataset generalization.
Figure 4.1 Heatmap (from Table 4.6) and line plot illustrating the effect of LiDAR resolution (𝑅𝐿𝑅) on mean Average Precision (mAP) across carlaScenes datasets, with highlighted points for same-resolution train-test pairs

Figure 4.2 Heatmap (from Table 4.7) and line plot depicting the influence of LiDAR resolution (𝑅𝐿𝑅) on NuScenes Detection Score (NDS) across carlaScenes datasets, with highlighted points for same-resolution train-test pairs

Figure 4.3 Heatmaps and line plots showing the impact of LiDAR resolution (𝑅𝐿𝑅) on class-specific metrics (Car AP 0.5, Pedestrian AP 0.5, Motorcycle AP 0.5) across carlaScenes datasets, with highlighted points indicating same-resolution train-test pairs

Figure 4.4 Heatmaps and line plots showing the performance of models trained on nuScenes and adaScenes and tested on carlaScenes datasets for mAP, NDS, Car AP 0.5, Pedestrian AP 0.5, and Motorcycle AP 0.5, with highlighted points for matching LiDAR resolutions

4.1 Effect of LiDAR Resolution, 𝑅𝐿𝑅
To elucidate the influence of LiDAR resolution (𝑅𝐿𝑅) on detection performance, a detailed examination of Figures 4.1, 4.2, and 4.3 is warranted. These figures illustrate the performance of the CenterPoint model across carlaScenes datasets, which share identical environmental settings but differ in LiDAR resolution (16, 32, 64, and 128 channels). For aggregate performance metrics such as mean Average Precision (mAP) and NuScenes Detection Score (NDS)—which combine detailed measures of detection accuracy across object classes and attributes like translation, scale, and orientation—a clear trend of logarithmic-like saturation is observed when comparing models trained and evaluated on their source datasets for carlaScenes 32, 64, and 128. This saturation indicates that beyond a certain resolution threshold, additional increases in LiDAR channels yield diminishing improvements in performance. For instance, a model trained on carlaScenes 64 achieves an mAP of 0.9636 on carlaScenes 64 (Table 4.1), with only marginal gains to 0.9727 on carlaScenes 128, despite the doubled resolution. When assessing the resilience of these models to resolution shifts in cross-dataset evaluations, single-step changes—such as from 32 to 64 channels or 64 to 128 channels—demonstrate minimal impact on performance metrics. This is evident in the results for models trained on carlaScenes 32, which achieve an mAP of 0.9474 on carlaScenes 64 (a single-step increase), compared to 0.9405 on their source dataset (Table 4.1). Similarly, a carlaScenes 64-trained model maintains an NDS of 0.8017 on carlaScenes 32 (a single-step decrease), close to its source NDS of 0.8368 (Table 4.2). In contrast, larger resolution changes, such as two-step shifts (e.g., from 64 to 16 channels or 128 to 32 channels), result in substantial performance degradation. For example, a carlaScenes 64-trained model, which achieves an mAP of 0.9636 on its source dataset, drops to an mAP of 0.6618 on carlaScenes 16 (Table 4.1), a two-step decrease, highlighting a significant loss in detection capability of approximately 0.3018 in mAP. Similarly, a carlaScenes 128-trained model, which achieves an mAP of 0.9721 on its source dataset, drops to an mAP of 0.8547 on carlaScenes 32 (Table 4.1), a two-step decrease, reflecting a notable decline of 0.1174 in mAP, though less extreme than the drop observed with carlaScenes 16.
This pattern implies that single-step transitions between typical LiDAR resolutions (e.g., 16 to 32, 32 to 64, or 64 to 128 channels) do not severely compromise model efficacy, whereas larger shifts do. A practical illustration is a model trained on 64-channel LiDAR, which performs robustly on both 32-channel (mAP of 0.9089) and 128-channel (mAP of 0.9727) point clouds, yet falters significantly on 16-channel data (mAP of 0.6618), as shown in Table 4.1. Consequently, when designing a training dataset for the CenterPoint model to ensure robust performance across a spectrum of LiDAR resolutions, these findings suggest that intermediate resolutions, such as 32 or 64 channels, may offer a balanced trade-off between performance and adaptability, with further analysis to follow. Nevertheless, an exception to this trend is observed with carlaScenes 16, where the model's behavior diverges markedly from the patterns seen in higher-resolution datasets. We hypothesize that this anomaly stems from the CenterPoint model's configuration, which struggles to extract generalizable features from the sparse 16-channel LiDAR data across diverse object classes, leading to overfitting to the specific point cloud distributions of carlaScenes 16. This overfitting is particularly pronounced in class-specific metrics for smaller objects, such as pedestrians and motorcycles, as depicted in Figures 4.3(d) and 4.3(f). For instance, a model trained on carlaScenes 16 achieves a Pedestrian AP 0.5 of 0.6678 on its source dataset, but this plummets to 0.518 when tested on carlaScenes 128—a three-step resolution increase (Table 4.4). Similarly, Motorcycle AP 0.5 drops from 0.8898 on carlaScenes 16 to 0.558 on carlaScenes 128 (Table 4.5). Conversely, the Car AP 0.5 exhibits greater stability, saturating around 0.82 across resolutions; for example, it reaches 0.8364 on carlaScenes 64 and 0.809 on carlaScenes 128 for a carlaScenes 16-trained model (Table 4.3). This resilience likely arises from the larger physical size of car objects, which ensures their shapes remain discernible even in sparser point clouds, unlike smaller objects that demand denser data for accurate detection. The saturation of Car AP 0.5, rather than a steep decline, also sheds light on the limitations of the model's complexity and its default configuration. The CenterPoint model employed here mirrors one of the default training setups for the nuScenes dataset (32 channels) from mmdetection3d, utilizing pillar-based voxelization with fixed parameters: voxel_size = [0.2, 0.2, 10] and max_voxels = [30000, 40000]. For carlaScenes 16, the sparsity of the 16-channel point clouds results in fewer points per voxel, potentially causing the model to overfit by memorizing resolution-specific patterns rather than learning broadly applicable features. In higher-resolution datasets like carlaScenes 64 and 128, the denser point clouds overwhelm these fixed parameters. The voxel_size, optimized for 32-channel data, becomes too coarse for 64- and 128-channel inputs, failing to capture the finer details available in these denser clouds. Additionally, the max_voxels limit triggers random truncation of points in approximately 30% of object-related voxels in these higher-resolution datasets, discarding valuable information. This truncation skews the model toward learning localized relationships within truncated point cloud patches, rather than fostering a holistic, resolution-agnostic understanding of object shapes within the environment. As a result, the model's generalization across resolutions is impaired, particularly for smaller objects like pedestrians and motorcycles. These findings indicate that modifying the voxelization parameters—such as adopting a resolution-dependent voxel_size or implementing a dynamic max_voxels threshold—could improve the model's capacity to learn robust and transferable features across diverse LiDAR resolutions, thereby alleviating the observed domain shift effects.
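For reference, these fixed pillar-voxelization settings appear in an mmdetection3d-style CenterPoint configuration roughly as sketched below. The fragment follows the 1.0-era config layout and is illustrative only: key names and defaults can differ between mmdetection3d versions, and the max_num_points value is an assumption rather than a quantity reported in this work.

```python
# Illustrative fragment of an mmdetection3d (1.0-style) CenterPoint pillar config.
voxel_size = [0.2, 0.2, 10]            # default pillar size used in this work (metres)

model = dict(
    type='CenterPoint',
    pts_voxel_layer=dict(
        max_num_points=20,             # points retained per pillar (assumed default)
        voxel_size=voxel_size,
        max_voxels=(30000, 40000),     # train/test caps; exceeding them triggers random truncation
    ),
)
```

The adjustments proposed next relax exactly these two settings.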
However, for a more straightforward solution, we suggest increasing max_voxels to [100000, 100000] from the default [30000, 40000] to reduce truncation in dense point clouds like carlaScenes 64 and 128, allowing the model to retain more information from high-resolution data. Additionally, we propose adjusting voxel_size to [0.1, 0.1, 10] from [0.2, 0.2, 10], keeping the z-dimension unchanged. This finer horizontal resolution in x and y enables better capture of small objects like pedestrians and motorcycles, which benefit from increased cell occupation per object rather than from vertical detail. Since CenterPoint employs pillar-based encoding, where the z-axis is collapsed into a single pillar, refining the z-resolution offers no advantage and is consistent with the pillar feature encoder's design, unlike the alternative variant of the model with a voxel feature encoder, where z-resolution might matter. Furthermore, to complement the increased complexity of the feature extraction process, we propose enhancing the depth of the class-specific SeparateHead components within the CenterHead of the CenterPoint model. Specifically, we recommend increasing the number of convolutional layers—each consisting of Conv + BatchNorm + ReLU—from the original 2 to at least 4. This adjustment provides the model with greater capacity to process and distill the more intricate feature sets generated by the proposed adaptive feature extraction, thereby improving its ability to generalize across varying LiDAR resolutions. To find the 𝑅𝐿𝑅 distribution, we used carlaScenes 32 and carlaScenes 128 as test datasets—renamed carlasc32 and carlasc128 in this section to avoid clutter—because the real-life datasets in this work use the same resolutions. We aim to approximate a Gaussian distribution for 𝑅𝐿𝑅 for each performance metric; to achieve this, we calculate the difference between the minimum and maximum performance metrics for each trained model across the test datasets carlasc32 and carlasc128. Here, 𝑅𝐿𝑅 is an abstract quantity representing the domain shift due to LiDAR resolution, and we assume that any performance metric (e.g., NDS, mAP, Car AP 0.5) provides an equally acceptable approximation of 𝑅𝐿𝑅. Specifically, for each trained model 𝑆𝑖 in the set {carlasc32, carlasc64, carlasc128}, we compute the difference min_{𝑆𝑗} 𝐷_{𝑆𝑖→𝑆𝑗} − max_{𝑆𝑗} 𝐷_{𝑆𝑖→𝑆𝑗}, where 𝐷 is treated as a variable representing the performance metric and 𝑆𝑗 ∈ {carlasc32, carlasc128}, excluding carlasc16 because that model does not learn properly, as discussed in the previous paragraphs.
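A minimal sketch of this sampling procedure, using the NDS values from Table 4.2 for the three trained models and the two test resolutions (the variable names are illustrative):

```python
import numpy as np

# NDS values from Table 4.2, restricted to the test sets {carlasc32, carlasc128}.
nds = {
    "carlasc32":  {"carlasc32": 0.8135, "carlasc128": 0.7961},
    "carlasc64":  {"carlasc32": 0.8017, "carlasc128": 0.8386},
    "carlasc128": {"carlasc32": 0.7639, "carlasc128": 0.8404},
}

# One sample per trained model: minimum minus maximum over the two test sets.
samples = [round(min(row.values()) - max(row.values()), 4) for row in nds.values()]
sigma = float(np.std(samples))
print(samples)           # [-0.0174, -0.0369, -0.0765]
print(round(sigma, 3))   # 0.025, consistent with the sigma reported for NDS below
```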
These differences form a set of samples used to approximate the 𝑅𝐿𝑅 distribution as a Gaussian for each metric; ultimately, 𝑅𝐿𝑅 relates to these approximated distributions, providing insight into the domain shift caused by LiDAR resolution:

R_{LR} \sim \text{Gaussian}\left(\left\{\, \min_{S_j} D_{S_i \to S_j} - \max_{S_j} D_{S_i \to S_j} \;\middle|\; S_i \in \{\text{carlasc32}, \text{carlasc64}, \text{carlasc128}\},\; S_j \in \{\text{carlasc32}, \text{carlasc128}\} \,\right\}\right)    (4.1)

The Gaussian distributions in Figure 4.6 illustrate the combined effects of LiDAR-resolution shift (𝑅𝐿𝑅) and synthetic vs. real data shift (𝑅𝑆𝑅) on various performance metrics across datasets. In the context of 𝑅𝐿𝑅, we observe that NDS exhibits the smallest standard deviation (𝜎 = 0.025) among the metrics, indicating that it is the most consistent in capturing the effect of LiDAR resolution on performance. This suggests that NDS, despite being a composite metric derived from multiple true-positive metrics, effectively reflects the performance loss due to domain shift and is a more reliable choice for comparing domain shifts between dataset pairs when a single metric is needed. Following NDS, Pedestrian AP 0.5 (𝜎 = 0.032) and Motorcycle AP 0.5 (𝜎 = 0.042) also show relatively low deviations, meaning they distinguish the domain shift between the two LiDAR resolutions with more confidence. Although their mean differences (𝜇 = −0.063 for Pedestrian AP 0.5 and 𝜇 = −0.089 for Motorcycle AP 0.5) are closer to zero compared to Car AP 0.5, their smaller standard deviations indicate that these metrics, which focus on smaller objects, are less variable and thus provide a clearer signal of the domain shift caused by LiDAR resolution. In contrast, Car AP 0.5 has the largest standard deviation (𝜎 = 0.080) and the mean furthest from zero (𝜇 = −0.101), suggesting greater variability in its estimation of domain shift. We attribute this higher variability to the two-step resolution jump between carlaScenes 32 and carlaScenes 128 (from 32 to 128 channels). As discussed in previous sections, while car detection performance tends to saturate between carlaScenes 32 and carlaScenes 64, this larger resolution jump significantly impacts models trained on high-resolution datasets (e.g., carlaScenes 128) when tested on lower-resolution datasets (e.g., carlaScenes 32), leading to more pronounced and variable performance drops for larger objects like cars.
4.2 Thoughts on Performance Metrics
Among the evaluation metrics we consider, the NuScenes Detection Score (NDS) proves substantially more robust to outliers than mean Average Precision (mAP). To demonstrate this, we performed a sensitivity analysis by fitting Gaussian curves to each test column in the performance difference tables (Δ𝐷_{𝑆𝑖→𝑆𝑗}) for mAP (Table 4.6) and NDS (Table 4.7), and plotting the resulting distributions (Figure 4.5). The noticeably narrower distributions for NDS confirm its lower variability and greater resilience to large performance deviations. Consequently, when comparing overall 3D detection performance without focusing on a particular object class, NDS is the preferred metric because it aggregates multiple true-positive sub-metrics into a single, stable score.
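The single σ values quoted in the Figure 4.5 caption can be reproduced, to within rounding, by pooling the carlaScenes train/test blocks of Tables 4.6 and 4.7, as sketched below; the figure itself fits a Gaussian per test column, so this is a simplified view of the same comparison.

```python
import numpy as np

# carlaScenes-only blocks (rows: trained on 16/32/64/128; columns: tested on 16/32/64/128)
# taken from Tables 4.6 (mAP differences) and 4.7 (NDS differences).
d_map = np.array([[ 0.0000, -0.0133, -0.0637, -0.1714],
                  [-0.1691,  0.0000,  0.0069, -0.0068],
                  [-0.3018, -0.0547,  0.0000,  0.0091],
                  [-0.4180, -0.1174, -0.0228,  0.0000]])
d_nds = np.array([[ 0.0000, -0.0199, -0.0531, -0.1236],
                  [-0.0957,  0.0000,  0.0012, -0.0174],
                  [-0.1727, -0.0351,  0.0000,  0.0018],
                  [-0.2395, -0.0765, -0.0155,  0.0000]])

print(f"mAP sigma: {d_map.std():.3f}, NDS sigma: {d_nds.std():.3f}")  # 0.121 vs 0.070
```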
The NuScenes Detection Score (NDS) is defined as a weighted sum of mean Average Precision (mAP) and five true-positive error metrics—mean Average Translation Error (mATE), mean Average Scale Error (mASE), mean Average Orientation Error (mAOE), mean Average Velocity Error (mAVE), and mean Average Attribute Error (mAAE)—collectively denoted by 𝑇𝑃 = {mATE, mASE, mAOE, mAVE, mAAE}:

\mathrm{NDS} = \frac{1}{10}\left[ 5\,\mathrm{mAP} + \sum_{\mathrm{mTP} \in TP} \left( 1 - \min(1, \mathrm{mTP}) \right) \right]    (4.2)

Figure 4.5 Gaussian distributions comparing the variability of performance differences (Δ𝐷_{𝑆𝑖→𝑆𝑗}) for mAP and NDS across carlaScenes datasets, illustrating the robustness of NDS to LiDAR-resolution shift (𝑅𝐿𝑅) with standard deviations 𝜎 = 0.121 for mAP and 𝜎 = 0.070 for NDS (see Tables 4.6 and 4.7)

4.3 Effect of Synthetic vs Real Life, 𝑅𝑆𝑅
In addition to the carlaScenes datasets, we incorporated the adaScenes and nuScenes datasets, which were collected using sensors mounted on ego vehicles operating in real-world environments. The adaScenes dataset is a custom dataset generated from real-life sensors mounted on a bus measuring 8 meters in length and 2.3 meters in width, equipped with a top-mounted 128-channel LiDAR sensor, which we used exclusively to avoid introducing 𝑅𝐿𝐶 (number of LiDAR units) as an additional domain shift factor. However, the difference in ego vehicle dimensions between adaScenes and the other datasets, such as nuScenes or carlaScenes, inevitably introduces 𝑅𝐿𝑃 (differences in LiDAR placement) as a domain shift source.

Figure 4.6 Gaussian sensitivity analysis of performance differences (Δ𝐷_{𝑆𝑖→𝑆𝑗}) for multiple metrics across carlaScenes test resolutions, highlighting the variability in LiDAR-resolution shift (see Tables 4.6 and 4.7)

Comparing these real-life datasets (adaScenes and nuScenes) with the synthetic carlaScenes datasets introduces multiple potential sources of domain shift, complicating the analysis of performance differences, as detailed in Table 4.10, which lists the domain shift factors between each pair of datasets. For instance, while the nuScenes dataset employs a single 32-channel LiDAR, similar to the carlaScenes 32 dataset, a direct comparison reveals at least two distinct domain shift factors. First, 𝑅𝑆𝑅 (synthetic vs. real data) arises because the carlaScenes 32 dataset is synthetically generated, whereas the nuScenes dataset is derived from real-world data. Second, 𝑅𝑆𝐶 (variations in scenery) emerges due to differences in the environments: the nuScenes dataset was collected in urban settings across Singapore and Boston, while carlaScenes 32 was generated using Town10, a pre-existing map provided by the CARLA simulator, which mimics a different urban landscape. Similarly, when comparing adaScenes and carlaScenes 128, both equipped with a 128-channel LiDAR, at least two domain shift factors are present: 𝑅𝑆𝑅 (synthetic vs. real data), because carlaScenes 128 is synthetically generated while adaScenes is real-world data, and 𝑅𝑆𝐶 (variations in scenery), because adaScenes was collected in real-world environments while carlaScenes 128 uses the Town10 map in the CARLA simulator, representing a different urban setting. Beyond these two domain shift sources, additional factors could further influence the results. For example, 𝑅𝐿𝑃 may play a role; although we positioned the LiDAR sensor in carlaScenes 32 to match the placement in the nuScenes dataset, the ego vehicles in the simulator and real-world settings differ.
These differences affect the relative positioning of the LiDAR sensor with respect to the ego vehicle's surface, potentially altering the point cloud data for nearby objects (e.g., the closest objects and their corresponding point clouds relative to the ego vehicle frame may vary if the ego vehicle's boundaries differ). Specifically, the carlaScenes datasets were collected using an Audi vehicle model within the CARLA simulator, whereas the real-world datasets involve distinct vehicle models: nuScenes uses a different, smaller Renault vehicle, and adaScenes employs a bus, adding another layer of complexity to the domain shift analysis.

Table 4.10 Domain Shift Factors Between Dataset Pairs

Train/Test        nuScenes                adaScenes                      carlaScenes 32                 carlaScenes 128
nuScenes          -                       𝑅𝐿𝑅 + 𝑅𝑆𝐶 + 𝑅𝐿𝑃                𝑅𝑆𝑅 + 𝑅𝑆𝐶                      𝑅𝐿𝑅 + 𝑅𝑆𝑅 + 𝑅𝑆𝐶
adaScenes         𝑅𝐿𝑅 + 𝑅𝑆𝐶 + 𝑅𝐿𝑃         -                              𝑅𝑆𝑅 + 𝑅𝑆𝐶 + 𝑅𝐿𝑅 + 𝑅𝐿𝑃          𝑅𝑆𝑅 + 𝑅𝑆𝐶 + 𝑅𝐿𝑃
carlaScenes 32    𝑅𝑆𝑅 + 𝑅𝑆𝐶               𝑅𝑆𝑅 + 𝑅𝑆𝐶 + 𝑅𝐿𝑅 + 𝑅𝐿𝑃          -                              𝑅𝐿𝑅
carlaScenes 128   𝑅𝐿𝑅 + 𝑅𝑆𝑅 + 𝑅𝑆𝐶         𝑅𝑆𝑅 + 𝑅𝑆𝐶 + 𝑅𝐿𝑃                𝑅𝐿𝑅                            -

To estimate the impact of 𝑅𝑆𝑅, we utilize the known domain shift sources between datasets to isolate its effect. For instance, to eliminate the influence of 𝑅𝑆𝐶 from Δ𝐷_{nusc→carlasc32}, we subtract Δ𝐷_{nusc→adasc}, which introduces additional 𝑅𝐿𝑅 and 𝑅𝐿𝑃 terms into the equation. To address the 𝑅𝐿𝑅 component, we incorporate Δ𝐷_{carlasc32→carlasc128}, leveraging the carlaScenes datasets with 32 and 128 channels. Furthermore, to mitigate the 𝑅𝐿𝑃 term, we include Δ𝐷_{adasc→carlasc32}. This method serves as an approximation for quantifying the relative contributions of different domain shift sources, aiming to understand the extent of the challenges each source poses during training and testing. For a more precise calculation of 𝑅𝐿𝑃 in the context of 𝑅𝑆𝑅, we propose the creation of a carlaScenes 128 Bus dataset, which would allow a focused analysis of 𝑅𝐿𝑃. However, integrating a new bus asset into the CARLA simulator presents challenges, as it requires modeling expertise and familiarity with Unreal Engine. While CARLA provides a pre-existing bus asset, the Fuso Rosa from Mitsubishi Motors, it differs significantly from the bus in adaScenes: the Fuso Rosa measures 6.9 meters in length and 2.7 meters in height, whereas the adaScenes bus is 8.3 meters long, with its LiDAR mounted 3.1 meters above the ground. These discrepancies suggest that using the Fuso Rosa as a substitute for the adaScenes bus would introduce further approximations, potentially increasing the ambiguity in estimating 𝑅𝐿𝑃. Therefore, a more accurate representation of the adaScenes bus is necessary to minimize such uncertainties in the analysis.

R_{SR} \sim \underbrace{\Delta D_{\mathrm{nusc} \to \mathrm{carlasc32}}}_{+R_{SR} + R_{SC}} - \underbrace{\Delta D_{\mathrm{nusc} \to \mathrm{adasc}}}_{-R_{SC} - R_{LR} - R_{LP}} - \underbrace{\Delta D_{\mathrm{nusc} \to \mathrm{carlasc128}}}_{-R_{SR} - R_{SC} - R_{LR}} + \underbrace{\Delta D_{\mathrm{adasc} \to \mathrm{carlasc32}}}_{+R_{SR} + R_{SC} + R_{LR} + R_{LP}} + \underbrace{\Delta D_{\mathrm{carlasc32} \to \mathrm{carlasc128}}}_{+R_{LR}}

The same estimation of 𝑅𝑆𝑅 also works with the dataset order reversed:

R_{SR} \sim \underbrace{\Delta D_{\mathrm{carlasc32} \to \mathrm{nusc}}}_{-R_{SR} - R_{SC}} - \underbrace{\Delta D_{\mathrm{adasc} \to \mathrm{nusc}}}_{+R_{SC} + R_{LR} + R_{LP}} - \underbrace{\Delta D_{\mathrm{carlasc128} \to \mathrm{nusc}}}_{+R_{SR} + R_{SC} + R_{LR}} + \underbrace{\Delta D_{\mathrm{carlasc32} \to \mathrm{adasc}}}_{-R_{SR} - R_{SC} - R_{LR} - R_{LP}} + \underbrace{\Delta D_{\mathrm{carlasc128} \to \mathrm{carlasc32}}}_{-R_{LR}}

In Figure 4.6, we observe that the effect of 𝑅𝑆𝑅 on the amount of metric loss is far greater than the effect of 𝑅𝐿𝑅, which suggests that LiDAR point clouds from the CARLA simulator differ profoundly from real-life point clouds. This difference alone shows that for Unsupervised Domain Adaptation problems, it is better and more reliable to use real-life data as the training source in order to distill information to the target datasets.
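As a quick sanity check of the cancellation behind this estimate, the sketch below models each Δ𝐷 term by the domain shift factors it carries according to Table 4.10 and confirms that only 𝑅𝑆𝑅 survives the signed combination; the symbol and variable names are illustrative.

```python
from sympy import symbols, simplify

R_SR, R_SC, R_LR, R_LP = symbols("R_SR R_SC R_LR R_LP")

# Each Delta D term is represented by the shift factors it carries (Table 4.10).
d_nusc_carlasc32  = R_SR + R_SC
d_nusc_adasc      = R_LR + R_SC + R_LP
d_nusc_carlasc128 = R_LR + R_SR + R_SC
d_adasc_carlasc32 = R_SR + R_SC + R_LR + R_LP
d_c32_c128        = R_LR

estimate = (d_nusc_carlasc32 - d_nusc_adasc - d_nusc_carlasc128
            + d_adasc_carlasc32 + d_c32_c128)
print(simplify(estimate))  # R_SR: the scenery, resolution and placement terms cancel
```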
4.4 Generalization of a Dataset
In this study, we explore how well real-world datasets, such as nuScenes, generalize compared to synthetic datasets in the context of domain shift for autonomous driving applications. The nuScenes dataset is widely valued within the autonomous driving community for several key reasons. Firstly, it captures data from diverse locations, featuring a variety of traffic patterns and complex decision-making scenarios that challenge autonomous vehicles. Secondly, with approximately 28,000 samples, nuScenes occupies a middle ground in terms of dataset size. For comparison, the Waymo dataset [36] is much larger, with 390,000 samples, the KITTI dataset [37] is smaller with 15,000 samples, and the Lyft dataset [38] is closer to nuScenes with 55,000 samples. This moderate size makes nuScenes a practical choice for research purposes. Additionally, while the nuScenes dataset is collected at a high frequency of 20 Hz, its annotations are provided at a lower rate of 2 Hz. This difference creates an interesting opportunity for multi-sweep models, which use temporal information from multiple point cloud sweeps to improve densification. Even though the extra sweeps between the labeled key frames do not come with their own annotations, this setup can actually be a strength: it encourages models to rely on the key frames' annotations to infer object states in the unlabeled intermediate sweeps, helping them learn more robust temporal patterns. As a result, this feature of nuScenes could make models better at generalizing to real-world autonomous driving situations, where not every frame has full labels—a common scenario that tests a model's adaptability. Evidence from Tables 4.1 and 4.2 demonstrates that models trained on nuScenes perform robustly not only on their own dataset but also when evaluated on synthetic datasets, such as carlaScenes. This strong cross-dataset performance indicates that nuScenes enables models to learn features that are not overly specific to its own characteristics, suggesting a high capacity for generalization across different domains. Another compelling indicator of nuScenes' generalization is its performance on adaScenes, a distinct real-world dataset. As shown in Tables 4.6 and 4.7, the performance of nuScenes-trained models on adaScenes remains close to that of models both trained and tested on adaScenes. For instance, the NDS score for a nuScenes-trained model tested on adaScenes is 0.4884, which is notably close to the 0.5339 achieved by an adaScenes-trained model on its own dataset. This relatively small performance gap highlights the ability of nuScenes-trained models to adapt effectively to other real-world environments. On the other hand, models trained on synthetic datasets like carlaScenes struggle significantly when evaluated on real-world datasets such as nuScenes or adaScenes. For instance, a model trained on carlaScenes 128, which uses the same 128-channel LiDAR as adaScenes, only achieves an NDS of 0.3031 when tested on adaScenes. This is much lower than the 0.5339 scored by a model trained and tested on adaScenes itself. This notable performance drop emphasizes the challenges synthetic data faces in matching real-world conditions, even when the LiDAR resolution is identical. However, it is important to note that while models trained on nuScenes perform well when tested on synthetic datasets, their scores do not match the results of models trained directly on those synthetic datasets.
For example, a nuScenes-trained model tested on carlaScenes 32 earns an NDS of 0.5691, which is solid but still below the 0.8135 achieved by a model trained and tested on carlaScenes 32. This difference matters: the real strength of nuScenes is not in beating synthetic models on their own data, but in equipping models to handle a broad variety of scenarios—both synthetic and real-world—far better than synthetic-only training can. Likewise, adaScenes, another real-world dataset, shows some ability to generalize, though not as strongly as nuScenes. Unlike models trained on synthetic data, which suffer large performance drops when tested on real-world sets, adaScenes-trained models hold up better. For example, when tested on carlaScenes 128—which matches its 128-channel LiDAR—an adaScenes-trained model scores an NDS of 0.4649. This is respectable but well below the 0.8404 of a carlaScenes 128-trained model, showing that adaScenes offers moderate generalization to synthetic data, though less effectively than nuScenes. Conversely, when adaScenes-trained models are tested on nuScenes, they struggle, achieving an NDS of just 0.3243 compared to 0.5743 for a nuScenes-trained model. This large gap suggests that nuScenes may have greater complexity—diverse road users, varied environments, and distinctive ego-vehicle movements—that adaScenes lacks. A dataset's ability to generalize may therefore hinge on its complexity: richer datasets like nuScenes train models that adapt well across domains, while less complex ones like adaScenes leave models less prepared for more varied test conditions. Both nuScenes and adaScenes were collected from multiple cities, exposing their models to a broad spectrum of environmental conditions and urban layouts. In contrast, carlaScenes is derived from a single simulated city, potentially limiting the variety of scenarios it represents. This broader real-world exposure in nuScenes and adaScenes likely aids in training models that learn more robust and transferable features, better equipping them to handle domain shifts across diverse test datasets.

CHAPTER 5
CONCLUSION
This thesis addressed the significant challenge of Unsupervised Domain Adaptation (UDA) within the domain of 3D object detection, specifically focusing on the systematic quantification and analysis of domain shifts between datasets. We began by identifying key sources of domain shift relevant to 3D object detection in autonomous driving, such as LiDAR resolution variations, differences between synthetic and real-world data, sensor placement, and scenery discrepancies. To systematically investigate these domain shift sources, we developed a comprehensive methodological framework. Central to this framework was the generation of carefully curated datasets using the CARLA simulator, enabling precise control over domain shift factors. We introduced two novel packages, carlaSceneCollector and rosbag2nuScenes, specifically designed for this research. The carlaSceneCollector package streamlines the process of data generation in CARLA by automating sensor data collection, scenario configuration, and ROSBag recording, thus facilitating the creation of controlled, synthetic raw data. The rosbag2nuScenes package provides a unique, generic solution for converting ROSBag data into the widely used nuScenes format, accommodating various sensor setups and ensuring compatibility with prevalent 3D detection frameworks.
This represents a significant contribution, as no other publicly available tool currently offers such comprehensive and adaptable functionality, making these packages invaluable for synthetic data-driven research and development in autonomous systems. Through rigorous experimentation, we revealed crucial insights into the relationship between LiDAR sensor resolution (𝑅𝐿𝑅) and detection performance. Specifically, we demonstrated a clear performance saturation effect beyond certain LiDAR resolutions, highlighting that intermediate resolutions (such as 32 or 64 channels) provide an optimal trade-off between accuracy and generalizability. However, it is essential to emphasize that this saturation effect is closely related to the underlying model and its hyperparameters. As such, these findings should not be generalized universally across all 3D detection models. We also laid out potential sources for this observed saturation effect and proposed methods to mitigate it, including adaptive voxelization parameters and increased model complexity. Our investigation also highlighted pronounced differences between synthetic and real-world datasets (𝑅𝑆𝑅). Models trained on synthetic CARLA-generated datasets showed substantial performance drops when evaluated on real-world datasets (nuScenes and adaScenes). Conversely, models trained on real-world datasets exhibited considerably better generalization capabilities, reinforcing the importance of real-world training data for robust adaptation. Additionally, we identified the NuScenes Detection Score (NDS) as a particularly robust and reliable metric for capturing the aggregate impact of domain shift. Compared to other metrics such as mean Average Precision (mAP), NDS proved less susceptible to variability and outliers, making it well suited for comparative studies across datasets. Lastly, the broader generalization capabilities of real-world datasets, particularly nuScenes, were underlined. This dataset's diverse real-world scenarios and moderate complexity provided the foundation for training models with robust adaptability across varied domains, both synthetic and real. In contrast, simpler datasets like adaScenes demonstrated limited adaptability, emphasizing that dataset complexity and scenario diversity are critical for fostering model generalization. Notably, the synthetic datasets generated as part of this research (the carlaScenes datasets) demonstrated even lower adaptability than adaScenes, suggesting a significant gap in realism and scenario complexity. This limited adaptability of synthetic datasets can be attributed to the random sampling approaches used for agent motion, agent count, and ego vehicle movements, as well as the inherent limitations of the LiDAR simulator, which employs a simplified ray-casting method. Addressing these limitations and enhancing the realism and complexity of synthetic datasets remain important areas for future investigation. In conclusion, this thesis contributes to a deeper understanding of domain shift phenomena in 3D object detection and provides a clear methodology for its quantification. Our findings underscore the critical importance of careful dataset selection, thoughtful sensor configuration, and robust metric choice when addressing the challenges posed by Unsupervised Domain Adaptation.
The novel software packages developed in this research, carlaSceneCollector and rosbag2nuScenes, significantly enhance the process of dataset generation and conversion, laying a strong foundation for future synthetic data-driven research and development in autonomous systems. Future research may explore further refinement of adaptive methodologies, leveraging real-world data more effectively, improving synthetic dataset realism, and extending these insights to broader contexts within autonomous systems and robotics.

Future Work
Given the observed performance degradation when using synthetic datasets, future work should prioritize the development and refinement of self-labeling techniques for real-world, unlabeled datasets. These approaches could leverage the superior generalization capabilities of real-world data to generate high-quality pseudo-labels, thereby reducing reliance on synthetic data and improving model adaptability across domains. Furthermore, the superior generalization observed with the nuScenes dataset compared to both synthetic datasets and other real-world datasets like adaScenes raises important questions about the factors that contribute to a dataset's generalizability. Future research should aim to identify and quantify these factors—such as scenario diversity, data complexity, and sensor fidelity—and develop a framework for evaluating and comparing the generalization potential of different datasets. Such a framework would be invaluable for selecting optimal training datasets and establishing criteria for the collection of new datasets tailored to specific autonomous driving applications. Additionally, the logarithmic saturation effect observed in synthetic training performance underscores the need for more sophisticated approaches to dataset creation and model training. Future work should explore adaptive dataset generation techniques that dynamically adjust to the model's learning progress, as well as augmentation strategies that introduce targeted variability to counteract saturation and enhance model robustness across a wider range of conditions. To further enhance the applicability of these findings, future studies should incorporate a diverse array of LiDAR sensors, including solid-state LiDAR, which is becoming increasingly prevalent in autonomous systems. Moreover, efforts should be directed toward improving the fidelity of simulated LiDAR data by developing methods to accurately match the ray patterns and noise characteristics of real-world LiDAR sensors, thereby closing the gap between synthetic and real-world data. By pursuing these avenues, future research can build upon the insights gained in this thesis, advancing the field of 3D object detection and contributing to the development of more robust and adaptable autonomous systems.

BIBLIOGRAPHY
[1] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Conference on Robot Learning. PMLR, 2017, pp. 1–16.
[2] H. Caesar, V. Bankiti, A. Lang, S. Vora, V. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” arXiv preprint, 2019.
[3] C. Qi, H. Su, K. Mo, and L. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” arXiv preprint arXiv:1612.00593, 2016.
[4] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[5] S. Shi, X. Wang, and H. Li, “PointRCNN: 3D object proposal generation and detection from point cloud,” arXiv preprint arXiv:1812.04244, 2018.
[6] Z. Yang, Y. Sun, S. Liu, and J. Jia, “3DSSD: Point-based 3D single stage object detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11040–11048.
[7] Y. Zhou and O. Tuzel, “VoxelNet: End-to-end learning for point cloud based 3D object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
[8] Y. Yan, Y. Mao, and B. Li, “SECOND: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[9] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “PV-RCNN: Point-voxel feature set abstraction for 3D object detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10526–10535.
[10] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3D object detection and tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11784–11793.
[11] J. Yang, H. Qian, Y. Xu, K. Wang, and L. Xie, “Can we evaluate domain adaptation models without target-domain labels?” arXiv preprint arXiv:2305.18712, 2023.
[12] Y. Wang, X. Chen, Y. You, L. E. Li, B. Hariharan, M. Campbell, K. Q. Weinberger, and W.-L. Chao, “Train in Germany, test in the USA: Making 3D object detectors generalize,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11713–11723.
[13] Y. You, K. Luo, C. P. Phoo, W.-L. Chao, W. Sun, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Learning to detect mobile objects from lidar scans without labels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1130–1140.
[14] Z. Ding, Y. Hu, R. Ge, L. Huang, S. Chen, Y. Wang, and J. Liao, “1st place solution for Waymo Open Dataset Challenge – 3D detection and domain adaptation,” arXiv preprint arXiv:2006.15505, 2020.
[15] Q. Xie, E. Hovy, M. Luong et al., “Self-training with noisy student improves ImageNet classification,” arXiv preprint, 2019.
[16] J. Li, R. Xu, X. Liu, J. Ma, B. Li, Q. Zou, J. Ma, and H. Yu, “Domain adaptation based object detection for autonomous driving in foggy and rainy weather,” arXiv preprint arXiv:2307.09676, 2023.
[17] S. Ahmed, A. Al Arafat, M. N. Rizve, R. Hossain, Z. Guo, and A. S. Rakin, “SSDA: Secure source-free domain adaptation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19180–19190.
[18] N. Hanselmann, N. Schneider, B. Ortelt, and A. Geiger, “Learning cascaded detection tasks with weakly-supervised domain adaptation,” in 2021 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2021, pp. 532–539.
[19] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” https://arxiv.org/abs/1505.07818, 2015.
[20] D. Tsai, J. S. Berrio, M. Shan, E. Nebot, and S. Worrall, “MS3D++: Ensemble of experts for multi-source unsupervised domain adaptation in 3D object detection,” IEEE Transactions on Intelligent Vehicles, 2024.
[21] Z. Pang, Z. Li, and N. Wang, “SimpleTrack: Understanding and rethinking 3D multi-object tracking,” arXiv preprint arXiv:2111.09621, 2021.
[22] C. Saltori, S. Lathuilière, N. Sebe, E. Ricci, and F. Galasso, “SF-UDA3D: Source-free unsupervised domain adaptation for LiDAR-based 3D object detection,” in 2020 International Conference on 3D Vision (3DV). IEEE, 2020, pp. 771–780.
[23] J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi, “ST3D++: Denoised self-training for unsupervised domain adaptation on 3D object detection,” arXiv preprint arXiv:2108.06682, 2021.
[24] B. Yang, M. Bai, M. Liang, W. Zeng, and R. Urtasun, “Auto4D: Learning to label 4D objects from sequential point clouds,” arXiv preprint arXiv:2101.06586, 2021.
[25] X. Weng, J. Wang, D. Held, and K. Kitani, “3D multi-object tracking: A baseline and new evaluation metrics,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10359–10366.
[26] L. Fan, Y. Yang, Y. Mao, F. Wang, Y. Chen, N. Wang, and Z. Zhang, “Once detected, never lost: Surpassing human performance in offline LiDAR based 3D object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19820–19829.
[27] L. Zhang, A. J. Yang, Y. Xiong, S. Casas, B. Yang, M. Ren, and R. Urtasun, “Towards unsupervised object detection from lidar point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9317–9328.
[28] C. R. Qi, Y. Zhou, M. Najibi, P. Sun, K. Vo, B. Deng, and D. Anguelov, “Offboard 3D object detection from point cloud sequences,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[29] T. Ma, X. Yang, H. Zhou, X. Li, B. Shi, J. Liu, Y. Yang, Z. Liu, L. He, Y. Qiao et al., “DetZero: Rethinking offboard 3D object detection with long-term sequential point clouds,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6736–6747.
[30] Epic Games, “Unreal Engine.” [Online]. Available: https://www.unrealengine.com
[31] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, A. Y. Ng et al., “ROS: An open-source Robot Operating System,” in ICRA Workshop on Open Source Software, vol. 3, no. 3.2. Kobe, 2009, p. 5.
[32] MMDetection3D Contributors, “MMDetection3D: OpenMMLab next-generation platform for general 3D object detection,” https://github.com/open-mmlab/mmdetection3d, 2020.
[33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS-W, 2017.
[34] D. Yang, X. Cai, Z. Liu, W. Jiang, B. Zhang, G. Yan, X. Gao, S. Liu, and B. Shi, “Realistic rainy weather simulation for LiDARs in CARLA simulator,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 951–957.
[35] F. Goudreault, D. Scheuble, M. Bijelic, N. Robidoux, and F. Heide, “LiDAR-in-the-loop hyperparameter optimization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13404–13414.
[36] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo Open Dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[37] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” International Journal of Robotics Research (IJRR), 2013.
[38] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, L. Chen, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska, “One thousand and one hours: Self-driving motion prediction dataset,” in Conference on Robot Learning.
PMLR, 2021, pp. 409–418.