OBJECT DETECTION FOR AUTONOMOUS SYSTEMS OPERATING UNDER CHALLENGING CONDITIONS

By

Mazin Hnewa

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering - Doctor of Philosophy

2023

ABSTRACT

Advanced Driver-Assistance Systems (ADAS) and autonomous systems in general, such as emerging autonomous vehicles, rely heavily on visual data and state-of-the-art deep learning approaches to classify and localize objects such as pedestrians, traffic signs and lights, and other nearby cars, to help the corresponding vehicles maneuver safely in their environments. However, due to the well-known domain shift problem, the performance of object detection methods could degrade rather significantly under challenging scenarios such as low light and adverse weather conditions. The domain shift problem arises due to the difference between the distribution of the source data used for training and that of the target data encountered during realistic testing scenarios. The area of domain adaptation has been instrumental in addressing the domain shift problem encountered by many applications. In fact, domain adaptation frameworks for object detection methods have been providing powerful tools for handling a variety of underlying changes in probability distribution between training and testing data. In this dissertation, we first propose a novel integrated Generative-model based unsupervised training and Domain Adaptation (GDA) framework that improves the performance of a region-proposal based object detector under challenging scenarios. In particular, we exploit unsupervised image-to-image translation to generate annotated visuals that are representative of a challenging target domain. Then, we use these generated annotated visuals, in addition to unlabeled target domain data, to train a domain adaptive region-proposal based object detector. We show that this integrated approach outperforms each of its constituent methods, unsupervised image translation and domain adaptation, when they are used separately. Despite the popularity of region-proposal based object detectors, such as Faster R-CNN and many of its variants, these detectors suffer from long inference times. Therefore, such approaches are not the optimal choice for time-critical, real-time applications such as autonomous driving. As a result, in the second part of this dissertation, we propose a novel MultiScale Domain Adaptive YOLO (MS-DAYOLO) framework for the popular state-of-the-art real-time object detector YOLO. MS-DAYOLO employs multiple domain adaptation paths and corresponding domain classifiers at different scales of the recently introduced YOLOv4 object detector. Building on our baseline MS-DAYOLO architecture, we introduce three novel deep learning architectures for a Domain Adaptation Network (DAN) that generates domain-invariant features. In particular, we propose a Progressive Feature Reduction, a Unified Domain Classifier, and an Integrated architecture. While RGB cameras are the most popular imaging sensors used by ADAS and autonomous vehicles due to cost and related practical reasons, employing other modalities such as thermal and gated imaging sensors can significantly improve detection performance under challenging conditions. However, these other types of sensors are expensive, and incorporating them into ADAS and autonomous vehicle platforms may cause design and manufacturing challenges.
As a result, in the third part of this dissertation, we propose a new framework that utilizes Cross Modality Knowledge Distillation (CMKD) to improve the performance of RGB-only pedestrian detection in low light and adverse weather conditions without increasing computational complexity during inference. Specifically, we develop two CMKD methods that rely on feature-based knowledge distillation and adversarial training to transfer knowledge from a detector (teacher) that is trained using multiple modalities to a single-modality detector (student) that is trained using RGB images only. To validate the proposed approaches, we train and test them using popular datasets captured by vehicles driving under different conditions, including challenging scenarios. Our experiments with the proposed approaches show significant improvements in object detection performance in comparison with state-of-the-art methods.

Copyright by
MAZIN HNEWA
2023

ACKNOWLEDGMENTS

First of all, I would like to express my deep gratitude to my advisor Professor Hayder Radha for offering me the great opportunity to join the Connected and Autonomous Networked Vehicle for Active Safety (CANVAS) research program at Michigan State University. I would like to thank him for his guidance, knowledge, and experience during my Ph.D. study. His support and encouragement made this work possible. Moreover, I would like to express my appreciation to my committee members: Professor Arun Ross, Professor Daniel Morris, and Professor Gary Bente for their comments and feedback. In addition, I would like to acknowledge the funding support from Ford Motor Company that made this work possible. Specifically, I would like to thank Jon Diedrich, Alireza Rahimpour, and Justin Miller from Ford Motor Company for their support and encouragement. Furthermore, I am grateful to my colleagues in the WAVES lab: Daniel Kent, Su Pang, and Xiaohu Lu for their help, and for the informative discussions we had in the lab. Finally, a special thanks to my family for their patience, encouragement, and support during the challenging times I had in my study.

TABLE OF CONTENTS

Chapter 1  Introduction
    1.1  Background
    1.2  Contribution Summary
    1.3  Dissertation Organization

Chapter 2  Object Detection under Rainy Conditions
    2.1  Object detection for autonomous vehicles under clear and rainy conditions
    2.2  Deraining in conjunction with object detection
    2.3  Alternative training approaches for deep learning based object detection
    2.4  Integrated Generative-Model Domain-Adaptation

Chapter 3  Multiscale Domain Adaptive YOLO for Cross-Domain Object Detection
    3.1  Proposed MultiScale Domain Adaptive YOLO
    3.2  Experiments

Chapter 4  Cross Modality Knowledge Distillation for Robust Pedestrian Detection
    4.1  Proposed Framework
    4.2  Experiments

Chapter 5  Conclusion and Future Work
    5.1  Conclusion
    5.2  Future Work

BIBLIOGRAPHY

Chapter 1

Introduction

1.1 Background

Visual data plays a critical role in enabling automotive Advanced Driver Assistance Systems (ADAS) and autonomous vehicles to achieve high levels of safety while maneuvering in their environments. Hence, emerging autonomous systems are employing cameras and deep learning based methods for object detection [1, 2, 3]. These methods predict bounding boxes that surround detected objects as well as class probabilities associated with each bounding box. In particular, Convolutional Neural Network (CNN) based approaches have shown very promising results and achieved tremendous success in the detection of pedestrians, vehicles, and other objects [4, 5, 6, 7, 8, 9, 10].

In general, state-of-the-art CNN-based object detection models can be classified into two groups: one-stage and two-stage methods. One-stage object detectors predict bounding boxes of objects and class probabilities associated with these objects directly from a full image in a single computation via a unified neural network. Examples of well-known one-stage object detectors include YOLO [8, 11, 12, 13], SSD [9], and RetinaNet [7]. On the other hand, two-stage object detectors generate proposal bounding boxes that potentially contain an object using a Region Proposal Network (RPN) in the first stage. Then, the proposals are fed to a second stage where cropped features are used to classify objects and fine-tune the bounding boxes. The most well-known two-stage object detectors are the R-CNN [4] series, including Fast R-CNN [5], Faster R-CNN [6], R-FCN [10], and Mask R-CNN [14]. These models have been achieving exceedingly improved performance in terms of classifying and localizing a variety of objects in a scene [4, 5, 6, 9, 8, 7, 10].

However, under a domain shift, when the testing data has a different distribution from the training data distribution, the performance of state-of-the-art object detection methods drops noticeably and sometimes significantly. Such a domain shift could occur due to capturing the data under different lighting or weather conditions, or due to viewing the same objects from different viewpoints, leading to changes in object appearance and background. For example, an object detection method trained using a large amount of visual data captured by a vehicle driving in clear and favorable lighting conditions may degrade significantly if the same trained method is tested under realistic challenging conditions such as adverse weather or low light conditions. This is because the quality of the captured visual signals is impaired and distorted. In that context, the domain under which training is done is known as the source domain, while the new domain under which testing is conducted is referred to as the target domain.

A straightforward solution to the domain shift problem is training with target domain data. However, annotated target domain data is typically unavailable, and it is expensive and time-consuming to annotate data. An alternative solution is applying preprocessing methods to enhance the quality of the captured images [15, 16, 17, 18, 19].
However, the enhanced images may not lead to improved detection performance, and sometimes even make it worse [20, 21]. Another solution is to use other modalities (e.g., radar, lidar, thermal, and gated imaging) that improve the overall perception capabilities of the system [22, 23, 24, 25, 26, 27, 28]. For example, Liao et al. [25] proposed a cross-collaboration enhancement strategy to learn robust object detection by fusing RGB and thermal images. Furthermore, Chaturvedi et al. [28] fused data of lidar, RGB, and gated cameras at early and late stages by proposing a global-local attention framework to improve detection performance in adverse weather. In addition, Bijelic et al. [27] proposed deep feature exchange and adaptive fusion of four modalities (radar, lidar, RGB, and gated cameras) to develop robust multi-modal object detection. Although these approaches achieved some improvement, they are not feasible for many applications with constrained sensing modalities (e.g., systems that rely on RGB cameras only).

The lack of target domain data, and especially annotated data, led to the emergence of the area of domain adaptation [29, 30, 31, 32, 33, 34], which has been widely studied to solve the problem of domain shift without the need to annotate data for new target domains. In general, domain adaptation solutions have relied on an adversarial network and other strategies that are designed to generate domain-invariant features. In fact, they attempt to learn a robust object detector using labeled data from the source domain and unlabeled data from the target domain. Domain adaptation approaches for object detection can mainly be classified into reconstruction-based and adversarial-based solutions [35]. Reconstruction-based domain adaptation attempts to improve the performance of an object detector in the target domain by using image-to-image translation models [36, 37, 38, 39, 40]. In particular, it utilizes image-to-image translation methods to generate artificial (fake) samples of the target domain from the corresponding labeled source samples. Consequently, translating labeled source data into corresponding target data helps in the training of an object detector for the target domain, and this should improve the performance of object detection in that domain.

In adversarial-based approaches, a domain discriminator is trained to classify whether a data point is from the source or target domain. In contrast, the feature extractor of the object detector is trained to confuse the domain discriminator [41]. Consequently, the feature extractor generates domain-invariant features as a result of this training strategy. Many adversarial-based domain adaptation methods have been proposed for the Faster R-CNN object detector [42, 43, 44, 45, 46, 47, 48, 49, 50]. The state-of-the-art approach of adversarial-based domain adaptation is Domain Adaptive Faster R-CNN [42]. Subsequently, many other approaches were proposed. For example, Zhu et al. [43] proposed region mining and adjusted region-level alignment to develop a region-level adaptation. Wang et al. [44] proposed Few-shot Adaptive Faster R-CNN, which requires only a few target domain images with limited annotation. Saito et al. [45] combined strong local alignment with weak global alignment to develop an adaptive object detector. He and Zhang [46] proposed multiple adversarial submodules for both domain and proposal feature alignment.
Furthermore, Zhao et al. [49] proposed a collaborative self-training method that can propagate the loss gradient through the whole detection network, and mutually enhance the region proposal network and the region proposal classifier. In addition, Xu et al. [50] utilized elaborate prototype representations to achieve category-level domain alignment. In that context, we observe that domain adaptation has been studied rather extensively for the Faster R-CNN object detector and its variants. However, other popular object detection schemes, and in particular YOLO-based architectures, have received little or no attention.

Furthermore, many previous works used Knowledge Distillation (KD) to compress an object detection model [51, 52, 53, 54, 55, 56]. Specifically, they attempted to transfer the learned knowledge of a complex model to a simpler and faster model. Chen et al. [51] proposed a weighted classification loss and a bounded regression loss to distill the knowledge between detection models. Additionally, an adaptation layer is used to allow the simple student model to learn the distribution of features extracted from the more complex teacher model. Wang et al. [53] developed a fine-grained feature imitation method that transfers knowledge via imitating features of local regions near objects. Moreover, Guo et al. [54] proposed a decoupled method that separates the features that contain objects from features of background regions, and then applies KD to each of them individually to learn a better student detector. In summary, the main objective of these approaches is to speed up detectors during inference without a significant drop in performance by using KD techniques. Instead, in this dissertation, we exploit KD in a different manner to improve an RGB-based detector in low light and adverse weather conditions. To accomplish this, we distill and capture the learned knowledge of a multi-modal object detector into an RGB-based detector.

1.2 Contribution Summary

The main contributions of this dissertation can be summarized as follows:

• We propose an integrated Generative Domain Adaptation (GDA) framework that combines adversarial domain adaptation and image-to-image translation approaches for detecting objects under realistic rainy conditions. We present the performance of our proposed GDA for the Faster R-CNN object detector as compared to other mitigation techniques, including deraining, domain adaptation, and image-to-image translation, using real driving data.

• We introduce a MultiScale Domain Adaptive YOLO (MS-DAYOLO) architecture that supports domain adaptation at different layers of the feature extraction stage within the YOLOv4 backbone network. To the best of our knowledge, this is the first proposed work that employs domain adaptation to improve the performance of YOLO for cross-domain object detection. The MS-DAYOLO architecture, including a Domain Adaptive Network (DAN) with multiscale feature inputs and multiple domain classifiers, represents our baseline architecture. Moreover, we propose three novel domain adaptation architectures that further improve YOLOv4 object detection performance when tested on challenging target data. These architectures are Progressive Feature Reduction, Unified Domain Classifier, and an Integrated architecture that combines the benefits of the other two.
• We propose a novel framework that is based on Cross Modality Knowledge Distillation (CMKD) to improve the performance of RGB-based pedestrian detection in low light and adverse weather conditions. We achieve this by transferring the knowledge of a teacher detector that is trained using both RGB and gated images to a student detector, which is trained using RGB images only. The proposed CMKD framework makes the student model generate features that are similar to the features of the teacher model. To accomplish this, we develop two methods within the proposed CMKD framework. The first one is based on using a KD loss, while the second one incorporates adversarial training with knowledge distillation.

• We conduct extensive experiments to train and test the proposed approaches using many popular datasets such as the Cityscapes, KITTI, BDD100K, Waymo, and Seeing Through Fog datasets. These experiments show that the proposed frameworks provide significant improvements relative to the state-of-the-art approaches when tested on the target domain.

1.3 Dissertation Organization

The remainder of this dissertation is organized as follows:

• In Chapter 2, we first provide an insight into the impact of rain on two major classes of object detection frameworks. Then, we present the proposed Generative Domain Adaptation (GDA) framework for improving the performance of the Faster R-CNN object detector under rainy conditions. Furthermore, we compare its performance with the state-of-the-art techniques that represent leading candidates for mitigating the influence of rainy conditions on an autonomous system's ability to detect objects.

• In Chapter 3, we explain a novel MultiScale Domain Adaptive YOLO (MS-DAYOLO) framework that employs multiple domain adaptation paths and corresponding domain classifiers at different scales of the YOLOv4 object detector.

• In Chapter 4, we introduce a novel Cross Modality Knowledge Distillation (CMKD) framework that transfers the knowledge of a teacher detector that is trained using both RGB and gated images to a student detector, which is trained using RGB images only. We demonstrate that CMKD improves the performance of RGB-based pedestrian detection under challenging conditions by making the student model generate features that are similar to the features of the teacher model.

• Finally, in Chapter 5, we state the conclusions of this dissertation and some potential future work.

Chapter 2

Object Detection under Rainy Conditions

The quality of visual signals captured by autonomous vehicles can be impaired and distorted in adverse weather conditions, most notably under rain, snow, and fog. Such conditions reduce scene contrast and visibility, and this could lead to a significant degradation in the ability of the vehicle to detect critical objects in the environment. Depending on the visual effect, adverse weather conditions can be classified into steady conditions (such as fog, mist, and haze) and dynamic conditions, which have more complex effects (such as rain and snow) [57]. In this chapter, we focus on rain because it is the most common dynamic challenging weather condition, impacting virtually every populated region of the globe. Furthermore, there has been a great deal of recent effort that attempts to mitigate the effect of rain in the context of visual processing.
While addressing the effect of other weather conditions has been receiving some, yet minimal attention, the volume of work regarding the mitigation of rain is by far more prevalent and salient within different research communities. It is worth noting that rain comprises of countless drops that have a wide range of sizes and complex shapes; and rain spreads quite randomly with varying speeds when falling on roadways, pavements, vehicles, pedestrians, and other objects in the scene. Moreover, raindrops naturally cause intensity variations in images and video frames. In particular, every raindrop blocks some of the light that is reflected by objects in a scene. In addition, rain streaks lead to low contrast and elevated levels of whiteness in visual data. Consequently, mitigating the effect of rain on visual data is arguably one of the most challenging tasks that 8 autonomous vehicles will have to perform due to the fact that it is quite challenging to detect and isolate raindrops, and it is equally difficult to restore the information that is lost or occluded by rain. Meanwhile, there has been noticeable progress in the development of advanced visual deraining algorithms [58, 59, 60, 61, 62, 63]. Thus, one natural and intuitive solution for mitigating the effect of rain on active safety systems and autonomous vehicles is to employ robust deraining algorithms and then apply the desired object detection approach on the resulting derained signal. State-of- the-art deraining algorithms, however, are designed to remove the visual impairments caused by rain while attempting to restore the original signal with minimal distortion. Hence, the primary objective of these algorithms, in general, is to preserve the visual quality as measured by popular performance metrics, such as Peak-Signal-to-Noise-Ratio (PSNR) and structure similarity index (SSIM)[64]. However, these metrics do not reflect a viable measure for analyzing the system’s performance for more complex tasks such as object detection. The first objective of this chapter is to present a tutorial on state-of-the-art and emerging techniques that are leading candidates for mitigating the influence of rainy conditions on an au- tonomous vehicle’s ability to detect objects. In that context, our goal includes analyzing the performance of object detection methods that are representatives of state-of-the-art frameworks, which are being considered for integration into autonomous vehicles’ artificial intelligence (AI) platforms. Furthermore, we highlight the inherent limitations of leading deraining algorithms, deep learning based domain adaptation, and image translation frameworks in the context of rainy conditions, which are summarized in Figure 2.1. We present experimental results for applying the leading candidates of the used techniques with the objective of highlighting the urgent need for developing new paradigms for addressing the challenges of autonomous driving under severe 9 Figure 2.1: The frameworks highlighting the first three sections of this chapter. weather conditions. Because generative model based image translation and domain adaptation ap- proaches show some promising results, we propose a novel Generative Domain Adaptation (GDA) framework that combines generative model based image translation and domain adaptation [65]. 
2.1 Object detection for autonomous vehicles under clear and rainy conditions

The level of degradation in the performance of an object detection method, trained under certain conditions, is influenced heavily by: (a) how different the training and testing domains are; and (b) the type of deep learning based architecture used for object detection. Most recent object detectors are CNN based networks such as SSD [9], R-FCN [10], YOLO [8], RetinaNet [7], and Faster R-CNN [6]. To that end, we review two major classes of object detection frameworks that are both popular and representative of deep learning based approaches. As we will see later in this section, these two classes of architectures exhibit different levels and forms of degradation in response to challenging rainy conditions, and they also perform rather differently in conjunction with potential rain mitigation frameworks. In particular, we briefly describe the underlying architectures of Faster R-CNN and YOLO as representatives of two major classes of object detection algorithms. Faster R-CNN is arguably the most popular among object detection algorithms that are based on a two-stage deep learning architecture: one stage identifies region proposals, and the second stage refines and assigns class probabilities to the corresponding regions. On the other hand, YOLO is a representative of detection frameworks that operate on the whole image directly.

2.1.1 Deep learning based methods for object detection

The utility of Convolutional Neural Networks (CNNs) for object detection was well established prior to the introduction of the notion of region proposals, commonly known as R-CNN [4], where the "R" stands for regions or region proposals. A fast version of R-CNN was later introduced [5], and then Ren et al. [6] introduced the idea of a Region Proposal Network (RPN) that shares convolutional layers with Fast R-CNN [5]. The RPN is merged with Fast R-CNN into one unified network, known as Faster R-CNN, to achieve more computationally efficient detection. Under Faster R-CNN, an input image is fed to a feature extractor such as the ZF model [66] or VGG-16 [67] to produce a feature map. Then, the RPN utilizes this feature map to predict region proposals (regions in the image that could potentially contain objects of interest). In that context, many region proposals overlap considerably with each other, with significant numbers of pixels common among multiple region proposals. To filter out the substantial redundancy that might occur with such a framework, Non-Maximum Suppression (NMS) [68] is used to remove redundant regions while keeping the ones that have the highest prediction scores.

Figure 2.2: The high-level architectures of the detection methods that are used in this chapter. The domain adaptation of Faster R-CNN is explained in Section 2.3.2.

Subsequently, each region proposal that survives the NMS process is used by a Region of Interest (RoI) pooling layer to crop the corresponding features from the feature map. This cropping process produces a feature vector that is fed to two fully connected layers: one layer predicts offset values of a bounding box of an object with respect to the region proposal, and the other layer predicts class probabilities for the predicted bounding box. Figure 2.2 shows a high-level architecture of Faster R-CNN. On the other hand, Redmon et al.
[8] proposed to treat object detection as a regression problem, and they developed a unified neural network that is called YOLO (stands for You Only Look Once) to predict bounding boxes and class probabilities directly from a full image in one evaluation. Under YOLO, an input image is divided into a specific set of grid cells. Each cell is responsible 12 for detecting objects where their centers are located within that cell. To that end, each cell predicts a certain number of bounding boxes, and it also predicts the confidence scores for these boxes in terms of the likelihood that they contain an object. Furthermore, it predicts conditional class probabilities given it has an object of a particular class. In this case, there are potentially many wrongly predicted bounding boxes. To filter them out and provide the final detection result, a threshold is used on the confidence scores of the predicted bounding boxes. Figure 2.2 shows the general architecture of YOLO. 2.1.2 Object detection performance for neural network architectures under clear and rainy conditions Here, we provide an insight into the level of degradation caused by rainy conditions on the per- formance of the two major deep learning architectures described above. In particular, we focus on the following fundamental question: how much degradation a deep neural network, which is trained under clear conditions, will suffer when tested under rainy weather. In that context, we first describe the dataset that we used for training and testing; and this is followed by presenting some visual and numerical results. As a result, we needed a rich dataset that is captured under diverse weather conditions. Despite the fact that there are few notable datasets [69, 70, 71], which are quite popular among the computer vision and AI research communities in terms of training deep neural networks, there is only one (arguably two [72][73]) that is properly labeled and annotated for our purpose, and hence, it could be used for training and testing for different weather condi- tions. In particular, we use the Berkeley Deep Drive dataset (BDD100K) [72] because it contains image tagging for the weather (i.e., each image in the dataset is labeled with its weather condition such as clear, rainy, foggy, etc). Meanwhile, although some other datasets, such as nuScenes [73], 13 might contain some visuals captured under rainy conditions, these datasets do not have weather tagging. Hence, choosing the BDD100k dataset was influenced by the fact that we can select im- ages under a specific weather condition. Moreover, BDD100K has 100,000 video clips captured under diverse geographic, environmental, and weather conditions. It is worth noting that only one selected frame from each video is annotated with object bounding boxes as well as image-level tagging. Examples of annotated frames in clear and rainy weathers are shown in Figure 2.3. In this chapter, we consider the four classes (vehicle, pedestrian, traffic light, and traffic sign) that are labeled and provided as ground truth objects within the BDD100K dataset. Naturally, these four classes are among the most critical objects for an autonomous vehicle. In this chapter, we use 12454 images that are captured in clear weather from the designated training set of BDD100K to form our underlying training dataset. We refer to this training data as the train clear set, which we used consistently to train the detection methods for the different scenarios covered in this chapter. 
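As a concrete illustration of how such weather-specific subsets can be assembled, the short sketch below filters BDD100K annotations by their image-level weather tag. It is a minimal example, assuming the official BDD100K JSON label format in which each entry carries an "attributes" field with a "weather" entry; the file names in the usage comment are illustrative placeholders, and any manual verification of the tags (discussed below for the rainy images) would still be applied on top of this automatic filtering.

```python
import json

def select_by_weather(label_file, weather="clear"):
    """Return the image names whose BDD100K weather tag matches `weather`.

    Assumes the BDD100K detection label format: a JSON list in which each
    entry has the image "name", image-level "attributes" (including
    "weather"), and the object "labels".
    """
    with open(label_file) as f:
        entries = json.load(f)
    return [e["name"] for e in entries
            if e.get("attributes", {}).get("weather") == weather]

# Example usage (file names are placeholders for the BDD100K label files):
# train_clear = select_by_weather("bdd100k_labels_images_train.json", "clear")
# test_rainy  = select_by_weather("bdd100k_labels_images_val.json", "rainy")
```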
For testing, we use a set of clear weather images from the testing set of BDD100K. We refer to this latter set as the test clear set. Table 2.1 shows the number of annotated objects in the train clear and test clear datasets.

One approach to demonstrate the impact of rain on object detection methods, which are trained under clear conditions, is by rendering synthetic rain [74, 75, 76] within the images of the test clear set. Then, the synthetic rainy data can be used to test the already trained object detection methods. The benefit of this approach is that one would have the exact same underlying content in both testing datasets in terms of the objects within the scene, but with one set representing the original clear weather content when the data was captured, and another set with the synthetic rain. This would clearly show the impact of rain as the visual objects are the same in both tested sets (the test clear set and a test synthetic rain set). However, from our extensive experience in this area, we noticed that most well-known rain simulation methods do not render realistic rain that viably captures actual and true rainy weather conditions, especially for a driving vehicle. Thus, when comparing the two scenarios, this discrepancy between synthetic and natural (real) rainy conditions will lead to domain mismatch. As a result, we do not test the detection methods using synthetic rain in our study because this would not demonstrate the impact of true natural rain on a driving vehicle.

Figure 2.3: Examples of annotated images in the BDD100K dataset [72]. Images in the top row are tagged as clear weather, and images in the middle and bottom rows are tagged as being captured in rainy weather. However, images in the bottom row are wrongly tagged as rainy weather; they were actually captured in clear or cloudy weather.

Alternatively, we use images captured under real rainy conditions from the training and testing sets of the BDD100K dataset to test the object detection methods. It is worth noting that several images in the dataset are wrongly tagged as "rainy weather" while they are actually in clear or cloudy weather, such as the examples shown in the bottom row of Figure 2.3. To solve this problem, we manually select the images that are truly captured in rainy weather to form what we refer to as the test rainy set. Equally important, we elected to have both the test clear and test rainy sets contain approximately the same number of annotated objects, as shown in Table 2.1, in order to provide statistically comparable results.

Table 2.1: Number of annotated objects in the training and testing sets that are used in our study.

    Set           Vehicles    Pedestrians    Traffic signs    Traffic lights
    Train clear   149,548     16,777         43,866           26,002
    Test clear    13,721      2,397          3,548            4,239
    Test rainy    13,724      2,347          3,551            4,246

It is important to make one final critical note regarding the currently available datasets for training neural networks designed for object detection. The lack of datasets captured under diverse conditions including rain, snow, fog, and other weather scenarios represents one of the most challenging aspects of achieving a viable level of training for autonomous vehicles. Even for the BDD100K dataset, which is one of very few publicly available datasets with properly annotated objects captured under different weather conditions, there is not sufficient annotated visual content within BDD100K that is truly viable for training under rainy weather.
This fundamental issue with the lack of real training data for rainy and other conditions has clearly become a major obstacle, to the extent that leading high-tech companies working in the area have begun a focused effort designated specifically for collecting data under rainy conditions.

2.1.3 Performance Metric

To evaluate the performance of detection, we compute the mean Average Precision (mAP). This metric has been the most popular performance measure since it was originally defined in the PASCAL Visual Object Classes Challenge 2012 for evaluating detection methods [77]. To determine mAP, the precision/recall curve is first computed based on the prediction results against the ground truth. A prediction is considered a true positive if its bounding box: (a) has an Intersection-over-Union (IoU) value greater than 0.5 relative to the corresponding ground truth bounding box, and (b) has the same class label as the ground truth. Then, the curve is updated by making precision monotonically decreasing. This is achieved by setting the precision for recall r to the maximum precision obtained for any recall r′ ≥ r. Average Precision (AP) is the area under the updated precision/recall curve, and it is computed by numerical integration. Finally, mAP is the mean of AP among all classes.
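The following is a minimal sketch of this AP computation for a single class, assuming that detections have already been sorted by descending confidence and matched against the ground truth (IoU greater than 0.5 and matching class label). It illustrates the procedure described above and is not the exact evaluation code used in our experiments.

```python
import numpy as np

def average_precision(is_true_positive, num_ground_truth):
    """AP for one class.

    is_true_positive: boolean array over detections sorted by descending
        confidence (True if matched to an unused ground-truth box with
        IoU > 0.5 and the correct class label).
    num_ground_truth: number of annotated objects of this class.
    """
    is_true_positive = np.asarray(is_true_positive, dtype=bool)
    tp = np.cumsum(is_true_positive)
    fp = np.cumsum(~is_true_positive)
    recall = tp / max(num_ground_truth, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)

    # Make precision monotonically decreasing:
    # p(r) <- max precision obtained at any recall r' >= r.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])

    # Area under the updated precision/recall curve (numerical integration).
    r = np.concatenate(([0.0], recall))
    p = np.concatenate(([precision[0] if len(precision) else 0.0], precision))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# mAP is then the mean of the per-class AP values, e.g.:
# mAP = np.mean([ap_vehicle, ap_pedestrian, ap_traffic_light, ap_traffic_sign])
```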
2.1.4 Results and Discussion

We trained the detection methods (Faster R-CNN and YOLO) using the train clear set, which is described in Section 2.1.2. We used the same training settings and hyper-parameters that were used in the original papers [6, 8]. Then, we tested the trained models using the test clear and test rainy sets to illustrate the impact of rain. Table 2.2 shows the average precision (AP) for each class as well as the mean average precision (mAP) evaluated based on the AP values of the classes.

Table 2.2: Quantitative results of the proposed GDA framework and other mitigation techniques for comparison, based on adaptation from clear to rainy weather of the BDD100K dataset. Average precision (AP) for each class, and mean average precision (mAP) evaluated based on the AP values of the classes. V: vehicle, P: pedestrian, TL: traffic light, and TS: traffic sign. *The top row shows the performance under clear conditions (i.e., using the test clear set), while all other rows show the performance under rainy conditions (i.e., using the test rainy set). **Significant degradation in performance can be observed due to rainy conditions relative to the performance under clear conditions (top row). Improvements in performance by mitigating the effect of rain can be observed using generative model-based image translation and/or domain adaptation, and significant improvements can be achieved under the proposed GDA framework. Meanwhile, deraining algorithms do not improve, and most of the time further degrade, the performance.

                                              Faster R-CNN                          YOLOv3
    Mitigating technique           V      P      TL     TS     mAP      V      P      TL     TS     mAP
    None (clear*)                  72.61  40.99  26.07  38.12  44.45    76.57  37.12  46.22  50.56  52.62
    None (rainy**)                 67.84  32.58  20.52  35.04  39.00    74.15  32.07  41.07  50.27  49.39
    Deraining DDN [60]             67.00  28.55  20.02  35.55  37.78    73.07  29.89  40.05  48.74  47.94
    Deraining DeRaindrop [61]      64.37  29.27  18.32  33.33  36.32    70.77  30.16  37.70  48.03  46.66
    Deraining PReNet [62]          63.69  24.39  17.40  31.68  34.29    70.83  27.36  35.49  43.78  44.36
    Image translation UNIT [78]    68.47  32.76  18.85  36.20  39.07    74.14  34.19  41.18  48.41  49.48
    Domain adaptation [42]         67.36  34.89  19.24  35.49  39.24    Not applicable
    The proposed GDA               68.04  36.16  20.29  38.24  40.68    Not applicable

From the table, we observe that mAP clearly declines in rainy weather as compared to clear weather using both Faster R-CNN and YOLO. Consequently, these results undoubtedly illustrate that the performance of an object detection framework, which is trained using clear visuals, could significantly degrade under rainy weather conditions. The performance decreases due to the fact that rain covers and distorts important details of the underlying visual features, which are used by detection methods to classify and localize objects. Figure 2.4 shows examples in which the detection methods fail to detect most objects under rainy conditions.

Moreover, one can notice that under rainy conditions, the average precision for the pedestrian and traffic light classes declines more significantly than the decline in performance for the vehicle and traffic sign classes. This discrepancy in performance degradation for different objects is due to a variety of factors. For example, vehicles usually occupy larger regions within an image frame than other types of objects; and hence, even when raindrops or rain streaks cover a visual of a vehicle, there are still sufficient features that can be extracted by the detection method. Furthermore, traffic signs are typically made from materials that have high reflectivity, which makes it easier for an object detection method to achieve higher accuracy even when a traffic sign visual is distorted with some rain. Overall, in both cases, the important features needed for reliable detection are still salient within the underlying deep neural networks of the detection algorithms. Nevertheless, rain could still impact the detection of vehicles and traffic signs, as shown in the bottom three rows of Figure 2.4.

Figure 2.4: Examples of detection results using Faster R-CNN and YOLO for different visual scenes from the test rainy set.

2.2 Deraining in conjunction with object detection

Deraining methods attempt to remove the effect of rain and restore an image of a scene that has been distorted by raindrops or rain streaks while preserving important visual details. In this chapter, we review three recently developed deraining algorithms [60, 61, 62] that employ deep learning frameworks for the removal of rain from a scene. The high-level architectures of these methods are shown in Figure 2.5. Below, we briefly describe these three deraining methods and highlight their limitations when employing them in conjunction with object detection methods.

Figure 2.5: General architectures of the used deraining methods (DDN [60], DeRaindrop [61], and PReNet [62]).

2.2.1 Deep Detail Network

Fu et al. [60] proposed a Deep Detail Network (DDN) to remove rain from a single image. They employed a convolutional neural network (CNN), namely ResNet [79], to predict the difference between clear and rainy images, and used this difference to remove rain from a scene. In particular, DDN exploits the rainy image's high-frequency details only, and it uses such details as input to ResNet while ignoring the low-frequency background (interference) of the same underlying scene.

2.2.2 Attentive Generative Adversarial Network

Qian et al. [61] proposed an attentive generative adversarial network called "DeRaindrop" to remove raindrops from images.
In this method, a generative adversarial network (GAN) [80] with visual attention is employed to learn raindrop areas and their surroundings. The first part of the generative network, known as the Attentive-Recurrent Network (ARN), produces an attention map to guide the next stage of the DeRaindrop framework. The ARN includes ResNet, a Long Short-Term Memory (LSTM) [81], and CNN layers. The second stage of DeRaindrop, known as the Contextual Autoencoder, operates on the attention map, and hence it focuses on (or "pays more attention" to) the raindrop areas. The overall two-stage process is expected to produce clean images that are free of raindrops. The architecture also includes a discriminative network, which assesses the generated rain-free images to verify that they are similar to the real ones that have been used during the training process.

2.2.3 Progressive Image Deraining Network

Ren et al. [62] proposed a Progressive Recurrent Network (PReNet) to recursively remove rain from a single image. At each iteration, some rain is removed, and the remaining rain can be progressively removed in subsequent iterations. Consequently, after a certain number of iterations, most of the rain should be removed, leading to a good-quality rain-free image. In addition to several residual blocks of ResNet, PReNet includes a CNN layer that operates on both the original rainy image and the current output image. PReNet also includes another CNN layer to generate the output image. Furthermore, a recurrent layer is appended to exploit dependencies of deep features across iterations via a convolutional LSTM. To train PReNet, a single negative SSIM [64] or MSE loss is used.

2.2.4 Results and Discussion

To demonstrate the performance of the deraining methods outlined above, we apply the pretrained deraining models provided by the corresponding authors to the test rainy set as a preprocessing step. After applying the deraining algorithms, which are anticipated to remove the rain from the visual input data and generate rain-free "clear" visuals, we feed the derained images into the object detection methods. Table 2.2 shows the performance of the detection methods after applying the deraining approaches. It can be seen that the deraining algorithms actually degrade the detection performance when compared to directly using the rainy images as input into the corresponding detection frameworks. This is true for both Faster R-CNN and YOLO. One important factor for this degradation in performance is that the deraining process tends to smooth out the input image, and hence, it distorts meaningful information and distinctive features of a scene while attempting to remove the effect of rain from the same scene. In particular, it is rather easy to observe that state-of-the-art deraining algorithms smooth out the edges of objects in an image, which leads to a loss of critical information and features. Such information and features are essential for enabling detection algorithms to classify and localize objects. The images in the top two rows of Figure 2.6, which represent outputs of Faster R-CNN and YOLO, show some of the objects that are not detected after using the deraining methods, while they are successfully detected if rainy images are directly used as input into the detection algorithms.

A related critical issue to highlight about current deraining algorithms is their inability to remove natural raindrops found in realistic scenes captured by moving vehicles.
The root cause of this issue is the fact that deraining algorithms have been largely designed and tested using syn- thetic rain visuals superimposed on the underlying scenes. What aggravates this issue is that, at least in some cases, the background environments used to design and test deraining algorithms are 22 Figure 2.6: Examples of detection results for different visual scenes without employing any derain- ing methods, and with employing deraining methods (DDN [60], DeRaindrop [61], and PReNet [62]) in conjunction with the detection methods. Objects were detected using Faster R-CNN [6] in the first group of images, and YOLO [12] in the second group. 23 predominantly static scenes with minimal moving objects. Consequently, the salient differences between such synthetic scenarios and the realistic environment encountered by a vehicle that is moving under natural rain represent a domain mismatch that is too much to handle by current deraining algorithms, and this leads to their failure under realistic conditions for autonomous ve- hicles. Hence, overall, we believe that relying purely on state-of-the-art deraining solutions does not represent a viable approach for mitigating the impact of rain on object detection. The images shown in Figure 2.6, especially some of the cases in the bottom two rows, illustrate examples of the failure of deraining methods to improve the performance of object detection. 2.3 Alternative training approaches for deep learning based object detection The requirement that autonomous driving systems are expected to work reliably under different weather conditions, is at odds with the fact that the training data is usually collected in dry weather with good visibility. Thus, the performance of object detection algorithms degrades under chal- lenging weather conditions as we showed in Section 2.1.4. A straightforward approach to address this problem is to train a given CNN to detect objects using images captured in real rainy weather. As we highlighted earlier, sufficient annotated datasets captured by moving vehicles in realistic urban environments under natural rainy conditions are not readily available. To that end, and in spite of the fact that some datasets are available, the very few datasets captured under real rainy conditions are not properly annotated [72]. Having such small datasets inherently makes them inadequate to reliably train deep learning architectures for objection detection. Furthermore, annotating whatever available data captured under real rainy 24 conditions with accurate bounding boxes is an expensive and time-consuming process. An alternative approach for addressing the lack of real data is training detection methods using synthetic rain data. However, as we highlighted earlier, the trained methods generalize poorly on real data due to the domain shift between synthetic and natural rain. To solve this issue, we review approaches that can be employed for training the detection methods using annotated clear data in conjunction with unannotated rainy data. In particular, we review two emerging frameworks for addressing this critical issue: image translation and domain adaptation. 2.3.1 Unsupervised Image-to-Image Translation Image-to-Image Translation (I2IT) is a well-known computer vision framework that translates im- ages from one domain (e.g., captured under clear weather) to another domain (e.g., rainy condi- tions) while preserving underlying and critical visual contents of the input images[82, 83, 84, 78, 85, 86, 87]. 
In short, I2IT attempts to learn a joint distribution of images in different domains. The learning process can be classified into: supervised setting where the training dataset consists of paired examples of the same exact scene captured in both domains (e.g., clear and rainy condi- tions), and unsupervised setting where different examples of both domains are used for training; hence, these examples do not have to be taken from the same corresponding scene. The unsuper- vised case is inherently more challenging than supervised learning. More importantly, to address the main issue we are facing in the context of the lack of data needed for training object detection architectures under realistic conditions, we consequently need an unsupervised setting. In particu- lar, the requirement of having a very large set of image pairs, where each pair of images must be of the exact same scene captured under different domains, render supervised I2IT solutions virtually useless for our purpose. In fact, this requirement imposes more constraints than the lack of data 25 issue that we are already trying to address. Hence, and despite the availability of well-known su- pervised learning based techniques in this area [82, 88], we have to resort to unsupervised solutions to address the problem at hand. Recently, Generative Adversarial Networks (GANs) [80] have been accomplishing promising results in the field of unsupervised I2IT [83, 84, 78, 85]. In general, a GAN consists of a gen- erator and a discriminator. The generator is trained to fool the discriminator, while the latter is attempting to distinguish (or discriminate) between real natural images on one hand and fake im- ages, which are generated by the trained generator, on the other hand. By doing this, GANs align the distribution of translated images with real images in the target domain. As mentioned above, datasets that have paired clear-rainy images in a driving environment are not publicly available and difficult to collect. As a result, we use UNsupervised Image-to-image Translation (UNIT) [78] to translate clear images to rainy ones since the training process for the UNIT framework does not require paired images of the same scene. In other words, training UNIT requires two independent sets of images where one consists of images in one domain, and another consists of images in another domain. Moreover, the UNIT model has an excellent performance in the translation of images, which are captured by driving vehicles, from one domain to another. The high-level architecture of the UNIT model is shown in Figure 2.7. The UNIT model consists of three networks: encoder, generator, and discriminator. First, the encoder maps an input image to a shared latent code (a shared compact representation of an image in both domains). Then, the generator uses the shared latent code to generate an image in the desired domain. Both encoder and generator are trained to fool the discriminator, while the latter aims to discriminate between real images and generated ones. By doing this, UNIT aligns the distribution of generated images with real ones in the target domain. 26 Figure 2.7: The high-level architecture of the UNIT model [78] to generate images. Figure 2.8: Examples of generated images by the trained UNIT model, left: original clear images, right: generated rainy images, which have less domain shift with respect to real rainy images than the domain shift they exhibit with respect to original clear images. 
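To make the shared-latent-space idea above more concrete, the following is a heavily simplified sketch of the clear-to-rainy translation path. It is not the actual UNIT implementation, which uses VAE-style encoders, weight sharing between domains, and cycle-consistency terms during training; the layer sizes, class names, and input shapes here are illustrative placeholders only.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an image to a latent code z that is shared across domains."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """Decodes a shared latent code into an image of one target domain."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(channels * 2, channels, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Distinguishes real target-domain images from generated ones."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, 1, 4, stride=2, padding=1),
        )
    def forward(self, x):
        return self.net(x)

# Clear -> rainy translation: encode into the shared latent space, then decode
# with the rainy-domain generator. The discriminator is only used during
# training to align generated rainy images with real ones.
enc_clear, gen_rainy, disc_rainy = Encoder(), Generator(), Discriminator()
clear_batch = torch.randn(2, 3, 256, 256)        # placeholder input images
fake_rainy = gen_rainy(enc_clear(clear_batch))   # generated rainy images
realism_score = disc_rainy(fake_rainy)           # feeds the adversarial loss
```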
To train the UNIT model that learns the mapping from clear images to rainy ones, we use the train clear set that consists of clear-weather annotated images as the source domain. For the target rainy domain, we extract a sufficiently large number of images from the rainy videos of the BDD100K dataset. Subsequently, we apply images in the train clear set to the trained UNIT model to generate rainy images. We refer to the images that are generated by the UNIT model as the train gen rainy set. Examples of generated rainy images are shown in Figure 2.8. Eventually, we use the train gen rainy dataset to train the detection methods. This is followed 27 by using the test rain dataset to evaluate the average precision performance of the detection meth- ods, which are now trained using the generated rainy set. We also calculate the mean average precision (mAP) as we have done for other approaches. Table 2.2 shows the performance of detec- tion methods that are trained using generated rainy images by the UNIT model. 2.3.2 Domain Adaptation Domain adaptation is another potentially viable framework that could be considered to address the major challenges that we have been highlighting in this chapter regarding: (a) the salient mismatch between the two domains, clear and rainy weather conditions, and (b) the lack of annotated training data captured under rainy conditions. The area of domain adaptation (DA) [31, 89, 32, 33, 90, 34] addresses a fundamental and practical problem that is related to the availability of annotated training data. In particular, there is usually sufficient annotated training data captured under the source domain, but a significantly smaller amount of annotated data for the target domain. For example, there is an abundance of annotated data available for vehicles driving under clear weather. However, and comparatively speaking, there is much less data, and especially annotated data, captured by vehicles driving under severe weather conditions. DA training is designed to compensate for the shift in probability distribution between the two domains while utilizing labeled source domain data and unlabeled target domain data. In general, the convolutional neural network (CNN) stage of a domain adaptive object detector is trained to generate domain-invariant features that can be used by the detection stage regardless of whether the test is being conducted under the source or target domains. Domain Adaptation (DA) has been used in deep learning-based object detectors to adapt them to a new desired target domain that is dissimilar from the original training domain, without the 28 need to annotate visual data of the target domain [42, 44, 45, 46, 91]. Most domain adaptation approaches in literature utilize the adversarial training approach [80] to produce robust domain- invariant features. In particular, a domain classifier is optimized to identify whether a data point is from the source or target domain. Meanwhile, the feature extractor of the object detector is optimized to generate feature maps that are indistinguishable, regardless of the domain, in order to confuse the domain classifier. This strategy makes the feature extractor learn to produce domain invariant features. In other words, the distribution of features extracted from images becomes indistinguishable in both domains. In particular, a domain adaptation framework [42] has been designed and developed specifically for Faster R-CNN due to the fact that it is among the most popular object detection approaches. 
The framework developed in [42] adapts deep learning based object detection to a target domain that is different from the training domain without requiring any annotations in the target domain. In particular, it employs the adversarial training strategy [80] to learn robust features that are domain-invariant. In other words, it makes the distribution of features extracted from images in the two domains indistinguishable. The architecture of the Domain Adaptive Faster R-CNN model [42] is shown in Figure 2.2.

There are two levels of domain adaptation. First, an image-level domain classifier is used. At this level, the global attributes (such as the image style and illumination) of the input image are used to distinguish between the source and target domains. Thus, the (global) feature map resulting from the common CNN feature extractor of the Faster R-CNN detector is used as input to the image-level domain classifier. Second, an instance-level domain classifier is employed to reduce the local domain shift occurring at the object level (such as changes in appearance, viewpoint, and size). This classifier uses the specific features associated with a particular region to distinguish between the two domains. Hence, the instance-level domain classifier uses the feature vector resulting from the fully connected layers (FCs) at the output of the RoI pooling layer of the Faster R-CNN detector. The two classifiers, the image- and instance-level ones, should naturally agree in terms of their binary classification decision regarding whether the input image belongs to the source or target domain. Consequently, a consistency regularization stage combines the outputs of the two classifiers to promote consistency between the two classifier outcomes.

A Domain Adaptation Network (DAN) is attached to the original object detection network during training only. The DAN has three components: image-level adaptation, instance-level adaptation, and consistency regularization. This DAN includes several layers for predicting the domain class (either source or target) at both levels. Then, domain classification losses are computed via cross-entropy at the image level and the instance level as follows:

L_{img} = -\sum_{i,x,y} \Big[ t_i \ln p_i^{(x,y)} + (1 - t_i) \ln\big(1 - p_i^{(x,y)}\big) \Big]  \qquad (2.1)

where t_i is the ground truth domain label for the i-th training image, with t_i = 1 for the source domain and t_i = 0 for the target domain, and p_i^{(x,y)} is the predicted image-level domain class probability for the i-th training image at location (x, y) of the feature map.

L_{ins} = -\sum_{i,j} \Big[ t_i \ln p_{i,j} + (1 - t_i) \ln\big(1 - p_{i,j}\big) \Big]  \qquad (2.2)

Here, p_{i,j} is the predicted instance-level domain class probability for the j-th region proposal that may contain an object in the i-th training image.

The domain classifiers of these two adaptation levels need to produce the same classification prediction about the domain of the input image, which may be from the source or target domain. Therefore, to impose consistency between the outcomes of the two domain classifiers, a consistency regularization stage is used to join the outputs of the two classifiers as follows:

L_{cst} = \sum_{i,j} \Big\| \frac{1}{N} \sum_{x,y} p_i^{(x,y)} - p_{i,j} \Big\|_2  \qquad (2.3)

where N is the total number of activations in the feature map. Therefore, the total loss of the DAN can be written as follows:

L_{DAN} = L_{img} + L_{ins} + L_{cst}  \qquad (2.4)

While the two domain adaptation classifiers are optimized to differentiate between the source and target domains, the Faster R-CNN detector must be optimized such that it becomes domain-independent or domain-invariant; a simplified sketch of these loss terms follows.
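The sketch below is a minimal PyTorch-style rendering of Eqs. (2.1)-(2.4) for a single training image; the tensor shapes and function names are illustrative placeholders rather than the implementation of [42]. During training, the detector's feature extractor is optimized to make these domain classifiers fail, which is what drives it toward domain-invariant features.

```python
import torch
import torch.nn.functional as F

def dan_losses(img_domain_logits, ins_domain_logits, domain_label):
    """Compute L_img (2.1), L_ins (2.2), L_cst (2.3), and L_DAN (2.4) for one image.

    img_domain_logits: (H, W) map of image-level domain logits (pre-sigmoid p_i^{(x,y)}).
    ins_domain_logits: (R,) instance-level domain logits, one per region proposal.
    domain_label: 1.0 for a source-domain image, 0.0 for a target-domain image.
    """
    # Image-level cross-entropy, summed over all spatial locations (x, y).
    img_target = torch.full_like(img_domain_logits, domain_label)
    l_img = F.binary_cross_entropy_with_logits(img_domain_logits, img_target,
                                               reduction="sum")

    # Instance-level cross-entropy, summed over region proposals j.
    ins_target = torch.full_like(ins_domain_logits, domain_label)
    l_ins = F.binary_cross_entropy_with_logits(ins_domain_logits, ins_target,
                                               reduction="sum")

    # Consistency (Eq. 2.3): the mean image-level probability should match each
    # instance-level probability; for scalar probabilities the L2 norm of each
    # term reduces to an absolute difference, summed over proposals j.
    img_prob_mean = torch.sigmoid(img_domain_logits).mean()   # (1/N) sum_{x,y} p_i^{(x,y)}
    ins_prob = torch.sigmoid(ins_domain_logits)
    l_cst = (img_prob_mean - ins_prob).abs().sum()

    # A gradient reversal layer placed before the classifiers (discussed next)
    # flips the sign of the gradient flowing back into the detector backbone.
    return l_img + l_ins + l_cst                                # L_DAN, Eq. (2.4)

# Example with random placeholder predictions for one source-domain image:
loss = dan_losses(torch.randn(32, 57), torch.randn(300), domain_label=1.0)
```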
In other words, the Faster R-CNN detector must detect objects regardless of the input image domain (clear or rainy). Hence, the feature map resulting from the Faster R-CNN feature extractor must be domain-invariant. To that end, this feature extrac- tor should be trained and optimized to maximize the domain classification error achieved by the domain adaptation stage. Thus, while both the image- and instance-level domain classifiers are designed to minimize the binary-classification error (between the source and target domains), the Faster R-CNN feature extractor is designed to maximize the same binary-classification error. To achieve these contradictory objectives, a Gradient Reversal Layer (GRL) [34, 92] is employed. Thus, GRL is a bidirectional operator that is used to realize two different optimization objectives. In the feed-forward direction, the GRL acts as an identity operator. This leads to the standard objective of minimizing the classification error when performing local backpropagation within the domain adaptation network. On the other hand, for backpropagation toward the Faster R- CNN network, the GRL becomes a negative scalar. Hence, in this case, it leads to maximizing 31 the binary-classification error; and this maximization promotes the generation of domain-invariant feature map by the Faster R-CNN feature extractor. Consequently, we developed and employed a domain adaptive faster R-CNN [42] under rainy conditions. To train this model, we prepare the training data to include two sets: source data, which consists of images captured in clear weather (and this set includes data annotations in terms of bounding boxes coordinates and object categories), and target data, which only consists of images captured under rainy conditions without any annotations. To validate the trained model using domain adaptation, we tested it using the test rainy set. The performance of the detection method (Faster R-CNN) that is trained by the domain adaptation approach is shown in Table 2.2. 2.3.3 Results and Discussion Based on the results in Table 2.2, we observe that while deraining algorithms degrade the aver- age precision performance when tested on scenes distorted by natural rain, improvements can be achieved when employing image-to-image translation and domain adaptation as mitigating tech- niques. Different cases are shown in Figure 2.9. In terms of average precision, and as an example, rainy conditions degrade the pedestrian detection capabilities for YOLO by more than five percent (from around 37 to around 32); but by using image translation, the performance is improved to an average precision of more than 34, and consequently, narrowing the gap between clear and rainy conditions’ performances. Similarly, both image translation and domain adaptation improve the traffic-signs detection performance for Faster R-CNN. Furthermore, image translation seems to improve the vehicle detection performance under Faster R-CNN. In other cases, for example, traffic-light detection performance under Faster R-CNN, domain 32 Figure 2.9: Examples of detection results using alternative training approaches for Faster R-CNN and YOLO are shown in the rightmost column. The top-right image shows improvement in vehicle and traffic sign detection when generated images by I2IT (the UNIT model [78]) are used to train Faster R-CNN. The middle-right image shows improvement in pedestrian detection when domain adaptation [42] is used to train Faster R-CNN. 
The bottom-right image shows improvement in pedestrian detection when generated images by I2IT (the UNIT model [78]) are used to train YOLO. adaptation, and image translation do not seem to perform well when tested on natural rainy images (even when using natural rainy images as the target domain for training these techniques). One potential factor for this poor performance in some of these cases is the fact that small objects such as traffic lights are quite challenging to detect to start with. This can be seen from the very low numbers of average precision, even under clear conditions, which is a mere 26%. Naturally, the impact of raindrops or rain streaks on such small objects in the scene could be quite severe to the extent that a mitigating technique might not be able to recover the salient features of these objects. In summary, employing domain adaptation or generating rainy-weather visuals using unsuper- 33 vised image-to-image translation, and then using these visuals for training seems to narrow the gap in performance due to the domain mismatch between clear and rainy weather conditions. This promising observation becomes especially clear when considering the disappointing performance of deraining algorithms. Nevertheless, it is also clear that there is still much room for improvements toward reaching the performance under clear conditions. There are key challenges that need to be addressed, though, when designing any new mitigating techniques for closing the aforementioned gap. These challenges include the broad and diverse scenarios for “rainy conditions”, especially in driving environments. These diverse cases and scenarios cannot be learned in a viable way by us- ing state-of-the-art approaches. For example, raindrops have a wide range of possible appearances, and they come in various sizes and shapes, especially when falling on the windshield of a vehicle. Another factor is the influence of windshield wipers on altering the amount, and even shapes and sizes of raindrops, in-between wipe cycles. Other external factors include reflections from the sur- rounding wet pavement, mist in the air, and splash effects. Hence, current state-of-the-art image translation techniques and domain adaptation are not robust enough to capture this wide variety of rain effects. Figure 2.10 shows images from the test rainy set, where these examples illustrate several scenarios and effects of rainy weather for driving vehicles. 2.4 Integrated Generative-Model Domain-Adaptation In this section, we address the problem of object detection of objects in a driving environment under rainy conditions using a novel Generative Domain Adaptation (GDA) framework that combines: (a) Generative-model based image translation and (b) Domain Adaptation. 34 Figure 2.10: Images from test-rainy set that illustrate several scenarios and effects of rainy weather for driving vehicles. 2.4.1 Proposed GDA framework We exploit unsupervised image-to-image translation to generate visuals that are representatives of a challenging target domain, and then we use these generated visuals in addition to unlabeled target domain data to train a domain adaptive object detection method. A high-level architecture of the proposed (GDA) framework is shown in Figure. 2.11. We show that using this novel integrated approach outperforms both methods, unsupervised image translation, and domain adaptation, when they are used separately. 
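A practical detail of this data preparation is that each translated image inherits the annotations of the clear image it was generated from; the translation is assumed to leave object locations unchanged, which is why the clear-weather bounding boxes can be reused. The sketch below illustrates this bookkeeping step assuming a trained translation network; the function and directory names, and the load_image/save_image helpers, are hypothetical placeholders rather than the actual code used in this work.

import shutil
from pathlib import Path

import torch

def build_train_gen_rainy(unit_model: torch.nn.Module,
                          clear_images: Path, clear_labels: Path,
                          out_dir: Path, load_image, save_image) -> None:
    """Translate each labeled clear image to the rainy domain and reuse its annotation file."""
    (out_dir / "images").mkdir(parents=True, exist_ok=True)
    (out_dir / "labels").mkdir(parents=True, exist_ok=True)
    unit_model.eval()
    with torch.no_grad():
        for img_path in sorted(clear_images.glob("*.jpg")):
            x = load_image(img_path)          # 1x3xHxW tensor in the clear domain
            y = unit_model(x)                 # translated rainy-looking image
            save_image(y, out_dir / "images" / img_path.name)
            # Bounding boxes are unchanged by the translation, so the label file is copied as-is.
            label_name = img_path.with_suffix(".txt").name
            shutil.copy(clear_labels / label_name, out_dir / "labels" / label_name)

The resulting generated rainy set then serves as the annotated source data, while unlabeled real rainy images serve as the target data for domain-adaptive training, as described next.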
In Section 2.3.1, we show that the trained UNIT model generates visuals that reduce the domain shift between the source and target domains. This will help to increase the performance of domain adaptive object detection for the target domain. Therefore, in this section, we employ the UNIT model as a GAN-based unsupervised I2IT method that we integrate with our GDA framework. On the other hand, if the domain shift between source and target domain is significant, espe- cially under different weather conditions, a domain adaptive model fails to improve detection in the target domain. Because the generated visuals of the developed I2IT model in section 2.3.1 reduce 35 Figure 2.11: The proposed Generative Domain Adaptation (GDA) framework. the domain shift between source and target domains, we propose the GDA framework that uses these generated visuals as source domain during training. In other words, the generative visuals in conjunction with unlabeled target data are used to train the domain adaptive object detector. This will help to improve the detection performance. For inference, and during testing, trained weights are loaded into the original object detector architecture (without the DAN network). Therefore, our proposed framework will not increase the underlying detector complexity during inference, which is an essential factor for many real-time applications such as autonomous driving. In fact, the proposed GDA framework is targeted to improve detection of objects (such as pedestrians and traffic signs) for autonomous vehicles driving under diverse weather conditions, and in particular rainy weather. To evaluate the performance of the proposed GDA framework, we used the trained UNIT model in Section 2.3.1 to translate clear-weather annotated images into a corresponding set of rainy- 36 weather annotated images. Examples of the generated images are shown in Figure 2.8. It is important to note that these generated images have less domain shift with respect to real rainy images than the domain shift that they exhibit with respect to original images captured under clear weather. Afterward, we trained Domain adaptive Faster R-CNN using: (a) the generated images as representative of the source domain in conjunction with (b) unlabeled rainy images as the target domain. 2.4.2 Results and Discussion We compare the GDA framework with other mitigation techniques in the previous sections includ- ing: • Three state-of-the-art deraining algorithms [60, 61, 62]. • Training original Faster R-CNN object detection using generated images by the UNIT image- translation method instead of clear images. • Domain adaptation without using generative model based training. In other words, we trained domain adaptive Faster R-CNN using the original annotated clear data as the source domain and the rainy data without annotation as the target domain. The final results for these performance evaluation tests are summarized in Table 2.2. The top row of the table includes the performance results for applying Faster R-CNN on clear weather data as a reference. As can be seen, even under clear weather, object detection perfor- mance is quite low for pedestrian, traffic light, and traffic sign. The second row shows the results of applying Faster R-CNN on the real rainy test data without any mitigation technique. It is clear that rainy conditions degrade the (already low) performance significantly. For example, AP for 37 pedestrian detection drops by 8. 
It is also clear that deraining algorithms fail miserably in these realistic scenarios. The table also shows that domain adaptation or image translation-based training provides some improvements. Meanwhile, significant improvements can be achieved under the GDA framework. When comparing the pedestrian detection performance of GDA with no mitigation (second row), GDA improves the performance by around 50% toward the "ideal" clear-weather performance. Hence, instead of a drop of 8 without any mitigation, under GDA we have a drop of around four relative to clear conditions. For traffic sign detection, GDA performance is even more impressive: it reaches levels similar to the clear-weather AP. For the other classes (vehicle and traffic light), GDA achieves almost the same performance as the other approaches. The overall mAP is also closing the gap toward the ideal clear-weather performance. Examples of visual detection results illustrating the GDA improvements are shown in Figure 2.12 and Figure 2.13.

Figure 2.12: Visual detection results using (Left) the state-of-the-art (SOTA) Domain Adaptive Faster R-CNN [42], which failed to detect the pedestrians in these examples, and (Right) the proposed GDA framework, which successfully detected the same pedestrians (red bounding boxes).

Figure 2.13: Visual detection results using (Left) the original Faster R-CNN object detector trained using images generated by the image-to-image translation (I2IT) UNIT model [78], which failed to detect some pedestrians, and (Right) the proposed GDA framework, which successfully detected them (red bounding boxes).

Chapter 3
Multiscale Domain Adaptive YOLO for Cross-Domain Object Detection

Most previous approaches for domain adaptive object detection used Faster R-CNN as the base detector. Despite its popularity, Faster R-CNN suffers from a long inference time. As a result, it is arguably not the optimal choice for time-critical, real-time applications such as autonomous driving. On the other hand, one-stage object detectors, and in particular YOLO, can operate quite fast, even much faster than real time, and this makes them invaluable for autonomous driving and similar time-critical applications. Furthermore, domain adaptation for the family of YOLO architectures has received virtually no attention. Besides the computational advantage of YOLO, the latest version, YOLOv4, has many salient improvements, and its object detection performance has improved rather significantly relative to prior YOLO architectures and, more importantly, in comparison to Faster R-CNN. All of these factors motivated our focus on the development of a new domain adaptation framework for YOLOv4.

In this chapter, we propose novel domain adaptation architectures for the YOLOv4 object detector. In particular, we introduce four new MultiScale Domain Adaptive YOLO (MS-DAYOLO) architectures that promote multiscale domain adaptation for the feature extraction stage and progressive channel reduction strategies for one or more domain classifiers. The proposed MS-DAYOLO framework achieves significant improvements over YOLOv4, as shown in the examples of Figure 3.1 (c). To the best of our knowledge, this is the first proposed work that improves the performance of YOLO for cross-domain object detection [93].

3.1 Proposed MultiScale Domain Adaptive YOLO

YOLOv4 [13] has been released recently as the latest version of the family of YOLO object detectors.
Relative to its predecessor, YOLOv4 has incorporated many new revisions and novel techniques to improve the overall detection accuracy. YOLOv4 has three main parts: backbone, neck, and head, as shown in Figure 3.2. The backbone is responsible for extracting multiple layers of features at different scales. The neck collects these features from three different scales of the backbone using upsampling layers and feeds them to the head. Finally, the head predicts bounding boxes surrounding objects as well as class probabilities associated with each bounding box.

The backbone (i.e., the feature extractor) represents a major module of the YOLOv4 architecture, and we believe that it makes a significant impact on the overall performance of the detector. In addition to many convolutional layers, it has 23 residual blocks [79] and five downsampling layers to extract critical layers of features that are used by the subsequent detection stages. Here, we concentrate on the features F1, F2, and F3 in Figure 3.2 because they are fed to the next stage (the neck module). In particular, our goal is to apply domain adaptation to these three features to make them robust against domain shift, and hence have them converge toward domain invariance during domain-adaptation-based training. Equally important, these three feature stages have different dimensions due to the successive downsampling layers that progressively reduce the width and height of the features by half while doubling the number of channels. If d is the width of the feature at the first scale (F1), then the dimensions of the three feature scales are: F1: d x d x 256, F2: d/2 x d/2 x 512, and F3: d/4 x d/4 x 1024.

Figure 3.1: Visual detection examples using the original YOLOv4 method on: (a) clear images and (b) foggy images. (c) Our proposed MS-DAYOLO applied to foggy images. The images are from the Cityscapes [70] and Foggy Cityscapes [94] datasets.

Figure 3.2: Architecture of YOLOv4 with the domain adaptation network (DAN) used to develop domain adaptive YOLO. The detailed architectures of the DAN are shown in Figure 3.3.

3.1.1 Domain Adaptive Network for YOLO

The proposed Domain Adaptive Network (DAN) is attached to the YOLOv4 object detector only during training in order to learn domain invariant features. Indeed, YOLOv4 and the DAN are trained in an end-to-end fashion. For inference, and during testing, the domain-adaptive trained weights are used in the original YOLOv4 architecture (without the DAN network). Therefore, our proposed framework does not increase the underlying detector complexity during inference, which is an essential factor for many real-time applications such as autonomous driving.

The DAN uses the three distinct scale features of the backbone that are fed to the neck as inputs. It has several convolutional layers to predict the domain class (either source or target). Then, the domain classification loss (L_dc) is computed via binary cross entropy as follows:

L_{dc} = -\frac{1}{N} \sum_{i,x,y} \left[ t_i \ln p_i^{(x,y)} + (1 - t_i) \ln\left(1 - p_i^{(x,y)}\right) \right]    (3.1)

Here, t_i is the ground-truth domain label for the i-th training image, with t_i = 1 for the source domain and t_i = 0 for the target domain, and p_i^{(x,y)} is the predicted domain class probability for the i-th training image at location (x, y) of the feature map. N represents the total number of images in a batch multiplied by the total number of elements in the feature map. The DAN is optimized to differentiate between the source and target domains by minimizing this loss.
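As an illustration of Equation (3.1), a per-scale domain classifier of the baseline DAN can be sketched as two convolutional layers followed by a per-location binary cross entropy loss. This is a simplified PyTorch sketch with assumed layer sizes and names, not the released MS-DAYOLO code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleDomainClassifier(nn.Module):
    """Baseline DAN branch for one feature scale: halve the channels, then predict one domain logit per location."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, in_channels // 2, kernel_size=1)
        self.logits = nn.Conv2d(in_channels // 2, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.logits(F.relu(self.reduce(feat)))   # N x 1 x H x W domain logits

def domain_loss(logits: torch.Tensor, is_source: torch.Tensor) -> torch.Tensor:
    """Equation (3.1): binary cross entropy averaged over images and feature-map locations."""
    # is_source holds t_i (1 for source, 0 for target), one entry per image in the batch.
    target = is_source.view(-1, 1, 1, 1).expand_as(logits).float()
    return F.binary_cross_entropy_with_logits(logits, target)

A classifier of this form would be instantiated with 256, 512, and 1024 input channels for F1, F2, and F3, respectively, and attached to the backbone through a gradient reversal layer, which yields the adversarial objective described next.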
On the other hand, the backbone is optimized to maximize the loss to learn domain invariant features. Thus, features of the backbone should be indistinguishable for the two domains. Conse- quently, this should improve the performance of object detection for the target domain. To solve the joint minimization and maximization problem, we employ the adversarial learning strategy [41]. In particular, we achieve this contradictory objective by using a Gradient Reversal Layer (GRL) [34, 31] between the backbone and the DAN network. To compute the detection loss (Ldet ) [13], only source images are used because they are an- notated with ground-truth objects. Consequently, all three parts of YOLOv4 (i.e. backbone, neck and head) are optimized via minimizing Ldet . On the other hand, both source labeled images and target unlabeled images are used to compute the domain classification loss (Ldc ) which is used to optimize DAN via minimizing it, and the backbone via maximizing it. As a result, both Ldet and Ldc are used to optimize the backbone. In other words, the backbone is optimized by minimizing the following total loss: Lt = Ldet + λLdc (3.2) where λ is a negative scalar of GRL that balances a trade-off between the detection loss and domain classification loss. In fact, λ controls the impact of DAN on the backbone. 45 3.1.2 DAN Architectures We developed various architectures for the Domain Adaptive Network (DAN) as shown in Figure 3.3 to explore and gain insight into the impact of different components on achieving improved performance for the target domain. Under all of our architectures, we employ a multiscale strategy that connects the three features F1, F2, and F3 of the backbone to the DAN through three cor- responding GRLs. Other than this common multiscale strategy, the proposed DAN architectures differ from each other as explained below. a- Multiscale Baseline : Instead of applying domain adaptation for only the final scale of the feature extractor as done in the Domain Adaptive Faster R-CNN architecture [42], we develop domain adaptation for three scales separately. In other words, applying domain adaptation only to the final scale (F3) does not make a significant impact on the previous scales (F1 and F2). As a result, we apply domain adaptation to all scales as shown in Figure 3.3 (a). For each scale, there are two convolutional layers after GRL, the first one reduces the feature channels by half, and the second one predicts the domain class probabilities. Finally, a domain classifier layer is used to compute the domain classification loss. b- Progressive Feature Reduction (PFR): As shown in Figure 3.3(a), the baseline architecture reduces the feature vector size resulting from the YOLOv4 backbone into a single-value feature (scalar) rather abruptly through two stages of neural networks. This simple two-stage DAN aims at generating a single feature value that serves as an input into the domain classifier. The fact that the domain classifier requires a single feature value is inherent in the binary nature of the classifier that simply needs to classify the image data into either source domain or target domain. Meanwhile, and due to the adversarial strategy used here, the above Baseline domain adaptation network is competing with the significantly more complex network of the backbone as shown in Figure 3.2. 
46 Figure 3.3: Proposed architectures for the Domain Adaptive Network (DAN): (a) Baseline, (b) Progressive Feature Reduction (PFR), (c) Unified Domain Classifier (UDC), (d) Integrated. F1, F2, and F3 are features of the backbone network of YOLO that are fed to the neck. 47 We observed that this mismatch between the simplistic baseline DAN architecture and the complex backbone network could compromise the domain adaptation performance. Thus, the DAN network may not be sufficiently powerful to distinguish between the source and target domains since the complex backbone network can easily confuse (and trick) the DAN network. To mitigate this mismatch, we increase the number of convolutional layers for each scale by progressively reducing the feature channels as shown in Figure 3.3(b). This progressive reduction of feature channels helps the DAN network to compete more efficiently against the more complex backbone. As a result, the extracted features by the backbone network will be more domain invariant. Therefore, while the baseline architecture reduces the number of the feature channels using two stages of neural networks, our proposed progressive-feature-reduction employs four or five stages depending on the original feature size. In particular, for feature vectors F1 and F2, we employ four stages of neural networks that progressively reduce the feature vector size from 128 and 256, respectively, toward a single-feature scalar value, which is all that is required as an input for the binary domain classifier. For feature vector F3, we employ a five-stage neural network DAN to progressively reduce the backbone feature vector toward the scalar feature value. It is important to highlight the following regarding the proposed progressive-feature-reduction architecture. It is possible to employ a larger number of stages of progressive reduction than the number of stages we employed in our architecture shown in Figure 3.3(b). However, based on our experience, increasing the number of stages beyond four or five stages does not necessarily improve the overall performance. c- Unified Domain Classifier (UDC): Under the multiscale baseline and Progressive Feature Re- duction architectures, each scale has its own distinct domain classifier. This multi-classifier strat- egy may lead to inconsistency among scales. For example, a domain classifier at one scale may 48 classify an image patch as a source data, while a domain classifier at another scale may classify the same image patch as originating from the target domain. (Examples of this inconsistency are shown later in the Experiments section.) To address this potential inconsistency, we propose to use a single (unified) domain classifier that combines the feature vectors from all scales as shown in Figure 3.3(c). It is important to highlight the following about the proposed Unified Domain Classifier (UDC) domain adaptive network: 1. We use downsampling convolutional layers to match the size of features at different scales. For example, in order to combine the feature vectors resulting from the F1 and F2 scales of the backbone, we add a downsampling stage to the F1 scale and concatenate the resulting vector with the feature vector from F2. This strategy maintains the multiscale attribute of our domain adaptive network while targeting a unified domain classifier architecture. 2. Furthermore, we concatenate features at different scales in a way to make each scale con- tributes equally in terms of the number of feature channels. 
In other words, each feature scale equally contributes to the prediction of the domain class probabilities. d- Integrated: It is important to note that the above two improvements, progressive feature reduc- tion (PFR) and unified domain classifier (UDC), have been applied directly and separately to the multiscale baseline architecture. Consequently, to gain the benefits of both the progressive feature reduction and unified domain classifier strategies, we integrate them in one network as shown in Figure 3.3(d). In principle, we have developed the network by complementing the unified-classifier architecture (Figure 3.3(c)) with additional stages of convolutional layers to achieve a more pro- gressive reduction in feature channel sizes. This is evident by comparing the two architectures shown in Figures 3.3(c) and 3.3(d). 49 3.2 Experiments In this section, we evaluate our proposed domain adaptive YOLO framework and the proposed MS-DAYOLO architectures. We modified the official source code of YOLOv4 that is based on the darknet platform1 , and developed a new code to implement our proposed methods2 . 3.2.1 Setup For training, we used the default settings and hyper-parameters that were used in the original YOLOv4[13]. The network is initialized using the pre-trained weights file. The training data includes two sets: source data that has images and their annotations (bounding boxes and object classes), and target data without annotation. Each batch has 64 images, 32 from the source domain and 32 from the target domain. Based on prior works [42, 43, 46] and our experience based on trial and error technique, we set λ = 0.1 for all experiments. For evaluation, we report Average Precision (AP) for each class as well as mean average pre- cision (mAP) with a threshold of 0.5 [77] using testing data that has labeled images of the target domain. We have followed other prior domain adaptive object-detection works that use the same threshold value of 0.5. We compare our proposed method with the original YOLOv4 and other state-of-the-art domain adaptation approaches that are based on Faster R-CNN object detector [6], all applied to the same target domain validation set. 1 https://github.com/AlexeyAB/darknet 2 https://github.com/Mazin-Hnewa/MS-DAYOLO 50 3.2.2 Results 3.2.2.1 Adverse Weather Adaptation Domain shift due to changes in weather conditions is one of the most prominent reasons for the discrepancy between the source and target domains. Reliable object detection systems in different weather conditions are essential for many critical applications such as autonomous driving. As a result, we focus on presenting the evaluation results of our proposed MS-DAYOLO framework by studying domain shifts under changing weather conditions for autonomous driving. To achieve this, we use three different driving datasets: Cityscapes [70], Foggy Cityscapes [94], Waymo [95]. Clear → Foggy: We discuss the ability of our proposed method to adapt from clear to foggy weather using driving datasets: Cityscapes [70] and Foggy Cityscapes [94] as has been done by many recent works in this area [42, 47, 44, 46, 96, 97, 43, 45, 98]. The Cityscapes training set has 2975 labeled images that are used as source domain. Similarly, the Foggy Cityscapes training set also has 2975 images, but without annotations, and is used as the target domain. Original YOLOv4 is trained using the source domain data only. In contrast, MS-DAYOLO is trained using both source and target domain data. 
The Foggy Cityscapes validation set has 500 labeled images that are used for testing and evaluation. Because the Foggy Cityscapes training set is annotated, we are also able to train the original YOLOv4 with this set to show the ideal performance (oracle).

Table 3.1 summarizes the performance results. Based on these results, all of the architectures of our proposed framework outperform the original YOLOv4 approach by a significant margin. Moreover, the proposed integrated architecture achieves the best overall performance in terms of mAP. Although the GPA method has slightly better results than the proposed integrated architecture in terms of AP for some classes, MS-DAYOLO achieves the best overall performance in terms of mAP, and it is significantly faster than GPA in inference time. It is worth noting that the proposed integrated architecture achieves significant improvements relative to the original YOLOv4, and it almost reaches the performance of the ideal (oracle) scenario, especially for some object classes in terms of average precision. Figure 3.1 shows examples of detection results of the proposed method as compared to the original YOLOv4.

Table 3.1: Quantitative results of domain adaptation for the clear -> foggy experiment of the Cityscapes dataset. MS-DAYOLO uses the YOLOv4 object detector [13], while the other methods use the Faster R-CNN object detector [6]. *Results reported from [97] and [99]. The inference time is measured in Frames Per Second (FPS) using an NVIDIA GeForce GTX 1080 Ti GPU.
Method | Backbone | Person | Rider | Car | Truck | Bus | Train | Mcycle | Bicycle | mAP | FPS
DAF [42] | VGG16 | 25.0 | 31.0 | 40.5 | 22.1 | 35.3 | 20.2 | 20.0 | 27.1 | 27.6 | 6.2
MAF [46] | VGG16 | 28.2 | 39.5 | 43.9 | 23.8 | 39.9 | 33.3 | 29.2 | 33.9 | 34.0 | 6.2
iFAN [48] | VGG16 | 32.6 | 40.0 | 48.5 | 27.9 | 45.5 | 31.7 | 22.8 | 33.0 | 35.3 | 6.2
CT [49] | VGG16 | 32.7 | 44.4 | 50.1 | 21.7 | 45.6 | 25.4 | 30.1 | 36.8 | 35.9 | 6.2
PDA [50] | VGG16 | 36.0 | 45.5 | 54.4 | 24.3 | 44.1 | 25.8 | 29.1 | 35.9 | 36.9 | 6.2
DAF [42]* | ResNet50 | 29.2 | 40.4 | 43.4 | 19.7 | 38.3 | 28.5 | 23.7 | 32.7 | 32.0 | 3.7
MTOR [97] | ResNet50 | 30.6 | 41.4 | 44.0 | 21.9 | 38.6 | 40.6 | 28.3 | 35.6 | 35.1 | 3.7
GPA [99] | ResNet50 | 32.9 | 46.7 | 54.1 | 24.7 | 45.7 | 41.1 | 32.4 | 38.7 | 39.5 | 3.7
YOLOv4 | | 31.6 | 38.3 | 46.9 | 23.9 | 39.9 | 20.1 | 16.8 | 30.3 | 31.0 | 48.2
MS-DAYOLO Baseline | | 38.6 | 45.5 | 55.9 | 22.8 | 45.6 | 32.5 | 28.8 | 36.5 | 38.3 | 48.2
MS-DAYOLO PFR | | 38.5 | 46.5 | 56.5 | 27.6 | 48.7 | 38.5 | 26.4 | 38.4 | 40.1 | 48.2
MS-DAYOLO UC | | 39.3 | 45.0 | 57.0 | 29.9 | 48.0 | 36.6 | 30.2 | 36.4 | 40.3 | 48.2
MS-DAYOLO Integrated | | 39.6 | 46.5 | 56.5 | 28.9 | 51.0 | 45.9 | 27.5 | 36.0 | 41.5 | 48.2
Oracle | | 42.4 | 49.5 | 63.6 | 37.6 | 59.8 | 47.1 | 31.1 | 39.9 | 46.3 | 48.2

Sunny -> Rainy: We present results for applying YOLOv4 and our MS-DAYOLO framework on the Waymo dataset [95], which includes two sets of visual data that are designated as "sunny" and "rainy". We extracted 14319 "sunny weather" labeled images for the source data, and 13004 "rainy weather" unlabeled images to represent the target data. As before, the original YOLOv4 is trained using only source data (i.e., labeled sunny images). Meanwhile, our proposed MS-DAYOLO is trained using both source and target data (i.e., labeled sunny images and unlabeled rainy images). In addition, we extracted 1676 labeled images from the rainy-weather data for testing and evaluation.

Figure 3.4: Examples of training images from the Waymo dataset [95]. Images in the top row are tagged as being captured in sunny weather, while images in the bottom row are tagged as being captured in rainy weather. It is obvious that the domain shift between sunny and rainy images is not very significant.

It is important to note the following key observations regarding the Waymo dataset: (a) The "sunny" and "rainy" designations have been determined by the providers of the Waymo dataset.
(b) From our extensive experience in working with this data, the distinction between sunny and rainy image samples is quite subjective, and in many cases one can argue that a "rainy" image sample should be labeled as "sunny" or vice versa. Consequently, the domain shift between the two domains designated as "sunny" and "rainy" is not very significant, as shown in the examples of Figure 3.4. This is crucial since training the original YOLOv4 using the Waymo "sunny" dataset effectively covers a large number of "rainy" testing samples that fall within the source "sunny" domain. Nevertheless, we opted to follow the dataset designations with the aim of evaluating any potential improvements that the proposed MS-DAYOLO framework may provide.

The results are summarized in Table 3.2. It is clear that the MS-DAYOLO framework still provided good improvements despite the fact that, in this case, the two domains, sunny and rainy, have a significant overlap. This could explain why the improvements are not as salient as the improvements achieved when applying MS-DAYOLO on the Cityscapes data, which consisted of two clearly distinct domains as shown in the examples of Figure 3.1. Moreover, and similar to the clear -> foggy experiment, we observe that the proposed Progressive Feature Reduction (PFR), Unified Domain Classifier (UDC), and Integrated architectures improve the detection performance relative to the baseline architecture when applied to the Waymo dataset. For this experiment, we do not report the performance of other domain adaptive object detection methods that are based on Faster R-CNN because none of these methods reported or used the Waymo dataset for the sunny -> rainy domain-shift scenario. Figure 3.5 shows examples of detection results of the proposed Integrated MS-DAYOLO framework as compared to the original YOLOv4. In addition, Figure 3.6 shows examples of detection results where the integrated architecture succeeds in detecting objects while the baseline architecture fails to detect the same objects. Furthermore, Figure 3.7 shows examples where the baseline architecture suffers from false positive cases, while the integrated one eliminates these false positives, which contributes to its improved performance.

Table 3.2: Quantitative results of domain adaptation for the sunny -> rainy experiment of the Waymo dataset.
Method | Person | Vehicle | mAP
YOLOv4 | 38.56 | 55.41 | 46.99
MS-DAYOLO Baseline | 39.97 | 55.37 | 47.67
MS-DAYOLO PFR | 39.75 | 56.34 | 48.05
MS-DAYOLO UC | 39.42 | 56.66 | 48.04
MS-DAYOLO Integrated | 40.03 | 56.97 | 48.50

Figure 3.5: Visual detection examples of the sunny -> rainy experiment using (a) the original YOLOv4, and (b) our proposed Integrated MS-DAYOLO applied to labeled rainy images extracted from the Waymo dataset [95].

3.2.2.2 Cross Camera Adaptation

Domain shift can occur between different real visual datasets captured by different driving vehicles equipped with different cameras, even if these visuals are taken under similar weather conditions.

Figure 3.6: Visual detection examples of the sunny -> rainy experiment using (a) the baseline architecture, and (b) the integrated architecture of MS-DAYOLO applied to the rainy images extracted from the Waymo dataset [95]. The baseline MS-DAYOLO fails to detect people crossing the street in the top two images, and cars in the bottom two images, while the integrated MS-DAYOLO successfully detects them.
Figure 3.7: Visual detection examples of the sunny -> rainy experiment using (a) the baseline architecture, and (b) the integrated architecture of MS-DAYOLO applied to the rainy images extracted from the Waymo dataset [95]. In these examples, the baseline MS-DAYOLO suffers from false positives, while the integrated MS-DAYOLO eliminates them. FP: false positive.

Such domain shift is usually driven by different camera setups leading to a shift in image quality and resolution. Moreover, such datasets are usually captured in various locations, which have different views and driving environments. All these factors lead to domain disparity between datasets. Under this experiment, we evaluate the performance of our MS-DAYOLO framework for domain adaptation between two real driving datasets, KITTI [69] and Cityscapes, as has been done by many recent works in this area [42, 46, 96, 43, 50]. In particular, the KITTI training set, which has 6000 labeled images, is utilized as source data, while the Cityscapes training set, which has 2975 images but no labels, is utilized as target data. The Cityscapes validation set, which has 500 labeled images, is used for testing and evaluation.

Table 3.3 presents the performance results based on the car AP, as reported by prior works [96, 42, 43, 46, 50], because it is the only common object class between the two datasets. A clear performance improvement is achieved by our method over the original YOLOv4. We also observe that the proposed Progressive Feature Reduction (PFR), Unified Domain Classifier (UDC), and Integrated architectures improve the detection performance relative to the baseline architecture. Although the GPA method outperforms Integrated MS-DAYOLO by a small margin (0.3%), our MS-DAYOLO runs in real time and is significantly faster than GPA in terms of frames per second (FPS), which is essential for time-critical applications. Figure 3.8 shows visual examples for a qualitative comparison of our method with the original YOLOv4. It is obvious from these examples that our approach successfully detects the vehicles in the scenes while the original YOLOv4 fails to detect the same vehicles.

Figure 3.8: Visual detection examples of the KITTI -> Cityscapes experiment for the car class using (a) the original YOLOv4, and (b) our proposed Integrated MS-DAYOLO applied to the Cityscapes validation set. These examples show that the integrated MS-DAYOLO successfully detects the vehicles in the scenes while the original YOLOv4 fails to detect the same vehicles.

Table 3.3: Quantitative results of cross camera adaptation from KITTI to Cityscapes based on the AP of the car class, which is shared between the two datasets. MS-DAYOLO uses the YOLOv4 object detector [13], while the other methods use the Faster R-CNN object detector [6]. *Results reported from [99]. The inference time is measured in Frames Per Second (FPS) using an NVIDIA GeForce GTX 1080 Ti GPU.
Method | Backbone | Car AP | FPS
DAF [42] | VGG16 | 38.5 | 6.2
MAF [46] | VGG16 | 41.0 | 6.2
CT [49] | VGG16 | 43.6 | 6.2
PDA [50] | VGG16 | 43.9 | 6.2
DAF [42]* | ResNet50 | 41.8 | 3.6
GPA [99] | ResNet50 | 47.9 | 3.6
YOLOv4 | | 44.5 | 48.2
MS-DAYOLO Baseline | | 45.5 | 48.2
MS-DAYOLO PFR | | 46.8 | 48.2
MS-DAYOLO UC | | 47.3 | 48.2
MS-DAYOLO Integrated | | 47.6 | 48.2

3.2.3 Ablation Study

To show the importance of applying domain adaptation to three distinct scales of the backbone network, we conducted an ablation study for the clear -> foggy experiment. First, we applied domain adaptation, separately, to each of the three scales of features that are fed into the neck of the YOLOv4 architecture.
Also, we applied domain adaptation to different combinations of two scales at a time. Finally, we compared the performance of these configurations with that of our baseline MS-DAYOLO applied to all three scales, as explained in Section 3.1.2. Another important aspect of this ablation study is that we wanted to consider objects that have statistically significant numbers of samples. In that context, because the number of ground-truth objects for some classes (truck, bus, and train) is small (i.e., fewer than 500 in the training set and 100 in the testing set), the performance measure would be inaccurate for these classes. As a result, we exclude them from this ablation study and compute mAP based on the remaining classes.

Table 3.4: Ablation study for the clear -> foggy experiment. A check mark means that domain adaptation is applied to the corresponding feature scale(s) using our baseline MS-DAYOLO.
F1 | F2 | F3 | Person | Rider | Car | Mcycle | Bicycle | mAP
 | | | 31.57 | 38.27 | 46.93 | 16.75 | 30.32 | 32.77
x | | | 36.84 | 42.84 | 53.69 | 24.77 | 32.35 | 38.09
 | x | | 37.08 | 41.49 | 54.49 | 26.22 | 32.43 | 38.34
 | | x | 36.28 | 44.22 | 53.10 | 25.81 | 35.87 | 39.06
x | x | | 36.62 | 42.68 | 55.70 | 26.09 | 33.52 | 38.92
x | | x | 37.50 | 42.48 | 54.53 | 27.84 | 34.75 | 39.43
 | x | x | 36.41 | 46.06 | 52.19 | 22.48 | 34.99 | 38.43
x | x | x | 38.62 | 45.52 | 55.85 | 28.82 | 36.46 | 41.05

Table 3.4 summarizes the results of the ablation study. Based on these results, we can conclude that applying domain adaptation to all three feature scales improves the detection performance on the target domain and achieves the best result.

3.2.4 Analysis

In order to show the benefit of using a unified domain classifier instead of three different domain classifiers, we recorded the domain classifier loss of Equation 3.1 over training iterations for the KITTI -> Cityscapes experiment. Figure 3.9 shows the losses of the three domain classifiers, corresponding to features F1, F2, and F3 of the baseline architecture, over the first 2500 iterations of training. We can see that the losses become dissimilar after 1K iterations. This implies inconsistency among the classifiers, which leads to a drop in performance. This motivated our objective of employing a unified domain classifier for all three scales. In turn, this led to the UC architecture for improving the detection performance when applied to the target domain data, as shown in Tables 3.1, 3.2, and 3.3.

Figure 3.9: Losses of the three domain classifiers of the baseline architecture over the first 2500 iterations of training for the KITTI -> Cityscapes experiment.

Moreover, to study the relationship between the domain classification loss of the DAN and the detection performance, we conducted a time analysis during training. We plot in Figure 3.10 the domain classifier (DC) loss of the integrated architecture and the detection performance in terms of mAP for the KITTI -> Cityscapes experiment. We normalize mAP by 100 to plot it at the same scale as the DC loss. At the beginning of training, the DC loss starts at its highest values, around 0.745. Then, as training progresses, the DAN is optimized by minimizing the loss while the YOLO backbone is optimized by maximizing it. In other words, the DAN and the YOLO backbone compete against each other. From the figure, we observe that the detection performance continues to improve until the loss reaches approximately 0.6. After that, the performance remains almost the same because the impact of the DAN on the backbone is no longer significant once the DC loss becomes small.
Figure 3.10: Domain classifier (DC) loss of the integrated architecture and the detection performance in terms of normalized mAP over the first 2500 iterations of training for the KITTI -> Cityscapes experiment.

Chapter 4
Cross Modality Knowledge Distillation for Robust Pedestrian Detection

Detecting pedestrians in low light and adverse weather conditions is challenging if only RGB imaging is used as input. This is because the quality of images captured under poor lighting and/or adverse weather conditions can degrade rather significantly: low lighting adversely impacts the dynamic range of RGB images while decreasing signal-to-noise ratio (SNR) levels, thereby affecting the performance of deep learning and related computer vision algorithms. Consequently, the performance of object detection methods can drop significantly. One solution is to incorporate other sensing modalities such as thermal and gated imaging [100, 27, 22]. However, such sensor modalities are expensive, and incorporating them into ADAS and autonomous vehicle platforms can significantly increase production cost as well. Moreover, additional sensors add complexity to the design and manufacturing processes. For instance, extra laser illuminators need to be installed on the car for gated imaging [101]. In addition, using several modalities increases the inference time of a detection model due to sensing time and the computational overhead of a fusion technique for combining different modalities. Naturally, increasing inference time is undesirable for many real-time applications such as ADAS and autonomous driving.

Knowledge Distillation (KD) is one of the most effective techniques to transfer knowledge between different models. It was initially proposed for model compression, transferring the information from a large, complex teacher model to a smaller and simpler student model without a substantial drop in accuracy [102]. Typically, KD methods rely on a teacher model that supervises a student model during training to improve the performance of the student model [103, 104, 105]. KD methods were originally proposed for classification tasks [104, 106, 107, 108, 109]. Subsequently, KD has been adapted to object detection, which is inherently more challenging than classification [51, 52, 53, 54, 55, 56]. These previous KD approaches for object detection transfer the knowledge from a large model to a smaller one (i.e., model compression). However, we propose that KD can also be used in an alternate way, where we transfer multi-modal information. Specifically, we refer to the case where a single-modality student model learns from a multi-modal teacher model.

In this chapter, we propose a novel framework based on Cross Modality Knowledge Distillation (CMKD) to improve the performance of RGB-based pedestrian detection in low light and adverse weather conditions. We achieve this by transferring the knowledge of a teacher detector that is trained using both RGB and gated images to a student detector, which is trained using RGB images only, as shown in Figure 4.1. The proposed CMKD framework makes the student model generate features that are similar to the features of the teacher model. To accomplish this, we develop two methods within the proposed CMKD framework. The first one is based on using a KD loss, while the second one incorporates adversarial training with knowledge distillation.
Based on experimental results, we show that both of our proposed CMKD methods significantly improve the detection performance relative to a baseline RGB detector, and they reduce the performance gap between the teacher and baseline models by a considerable margin. To the best of our knowledge, this is the first proposed work that uses CMKD to improve the performance of pedestrian detection in low light and adverse weather conditions.

Figure 4.1: The proposed Cross Modality Knowledge Distillation (CMKD) framework to improve RGB-based pedestrian detection in low light and adverse weather conditions.

4.1 Proposed Framework

Most object detectors have two primary parts: a backbone to extract meaningful features, and a head that uses these features to detect objects. These features have a significant impact on the overall performance of the object detector. As a result, we focus on these features, and our goal is to make the student model generate features that are similar to the features of the teacher model. To achieve this objective, we first train the teacher detector using multiple modalities. Then, we freeze the teacher detector and begin training the student detector using a KD loss in addition to the ground-truth loss, as shown in Figure 4.2. The KD loss makes the student backbone generate features that are similar to the features generated by the teacher backbone.

To compute the KD loss (L_KD), we experimented with different forms of losses such as Mean Squared Error (MSE), Mean Absolute Error (MAE), Smooth L1 loss [5], and Kullback-Leibler (KL) divergence. Based on our experiments, we found that using MSE as L_KD achieves the best results. Specifically, we compute L_KD using MSE in (4.1) to measure the Euclidean distance between the features of the teacher and student models. Therefore, we refer to this proposed method as CMKD-MSE henceforth.

Figure 4.2: An overview of the first proposed method (CMKD-MSE), which utilizes a Knowledge Distillation (KD) loss to make the student backbone produce features that are similar to the features of the teacher model. First, the teacher detector is trained using multiple modalities. Then, the trained teacher detector is frozen and the student detector is trained using the KD loss in addition to the Ground Truth (GT) loss.

L_{KD} = \frac{1}{NA} \sum_{i}^{N} \sum_{j}^{A} \left( F_i^S(j) - F_i^T(j) \right)^2    (4.1)

where F_i^S and F_i^T are the features of the i-th input image in the current batch for the student and teacher models, respectively, N is the batch size, and A is the total number of activations in the feature tensor. During training, the student backbone is optimized to reduce the KD loss. Therefore, the total training loss for the student detector can be written as follows:

L = L_{GT} + \alpha L_{KD}    (4.2)

where alpha is a weight parameter that balances the trade-off between the ground-truth detection loss (L_GT) and the KD loss (L_KD). Hence, the weight parameter controls the impact of knowledge distillation on the student backbone.

4.1.1 Adversarial Training

To further improve the performance, we propose another CMKD method that is based on adversarial training [41]. We build a Binary Classifier Network (BCN) to classify features into teacher features and student features. For its architecture, we use two convolutional layers to reduce the feature channels to a single channel, followed by two fully connected layers to predict a final binary class probability (i.e., 0 for student and 1 for teacher). During training, we feed the extracted features from the backbones of both the teacher and student models to the BCN.
Then, the BCN is optimized to differentiate between teacher features and student features by minimizing the binary cross entropy loss (L_BCE) in (4.3). On the other hand, the student backbone is optimized to maximize the loss in (4.3) in order to confuse the BCN. This makes the student backbone generate features that are similar to the teacher features. Consequently, the performance of the student detector is significantly improved.

L_{BCE} = -\frac{1}{N} \sum_{i}^{N} \left[ t_i \ln p_i + (1 - t_i) \ln(1 - p_i) \right]    (4.3)

Here, t_i is the ground-truth label of the binary classifier for the i-th input image in the current batch, with t_i = 1 for teacher features and t_i = 0 for student features, p_i is the predicted class probability of the BCN for the features of the i-th training image, and N is the batch size.

To simultaneously solve the minimization and maximization problems, we adopt adversarial learning. In particular, we use a Gradient Reversal Layer (GRL) [34, 31] between the student backbone and the BCN to achieve this contradictory objective. It should be noted that the GRL is a bidirectional operator that is used to realize two different optimization objectives. In the feed-forward direction, the GRL acts as an identity operator. This leads to the standard objective of minimizing the classification error when performing local backpropagation within the BCN. In contrast, the GRL becomes a negative scalar (lambda) for backpropagation toward the student backbone. Consequently, it leads to maximizing the classification error, and this maximization makes the student backbone generate features that are similar to the teacher features. The total training loss for the student detector can be written as follows:

L = L_{GT} + \lambda L_{BCE}    (4.4)

where lambda is the negative scalar of the GRL that balances the trade-off between the ground-truth detection loss (L_GT) and the binary cross entropy loss (L_BCE). In that context, lambda controls the impact of the BCN on the student backbone. An overview of the proposed adversarial learning method is shown in Figure 4.3. Since this proposed method depends on adversarial training, we refer to it as CMKD-Adv henceforth.

For inference, the trained weights of the student detector in both proposed methods are used in the original detector without the need for the teacher detector or the binary classifier. Hence, our proposed framework does not increase the detector complexity during testing, which is an essential factor for many real-time applications such as ADAS and autonomous driving.

4.2 Experiments

4.2.1 Setup

We use the state-of-the-art Faster R-CNN [6] and SSD [9] as base detectors in our experiments. For backbone networks, we use ResNet50 [6] with Faster R-CNN and VGG16 [67] with SSD.

Figure 4.3: An overview of the second proposed method (CMKD-Adv), which utilizes adversarial learning to make the student backbone produce features that are similar to the teacher features. First, the teacher detector is trained using multiple modalities. Then, the trained teacher detector is frozen and the student detector is trained using adversarial learning in addition to the Ground Truth (GT) loss. The binary classifier is optimized to differentiate between teacher features and student features by minimizing the Binary Cross Entropy (BCE) loss, while the student backbone is optimized to increase the BCE loss, due to the Gradient Reversal Layer (GRL), so that it generates features that are similar to the teacher features. Conv: convolutional layer, FC: fully connected layer.
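For concreteness, the two training objectives described in Sections 4.1 and 4.1.1 can be sketched as follows. This is a simplified PyTorch illustration with assumed shapes and module names (the BCN is reduced here to a small convolutional classifier), not the exact implementation used in our experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Reverse (and scale) the gradient flowing back into the student backbone.
        return -ctx.lam * grad_out, None

class BinaryClassifierNetwork(nn.Module):
    """Distinguishes teacher features (label 1) from student features (label 0)."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels // 2, 1, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1),
        )

    def forward(self, feat):
        return self.net(feat)            # one domain logit per image

def cmkd_losses(student_feat, teacher_feat, bcn, lam=0.1):
    # CMKD-MSE, Equation (4.1): Euclidean distance between student and teacher features.
    l_kd = F.mse_loss(student_feat, teacher_feat.detach())
    # CMKD-Adv, Equation (4.3): the BCN is trained to separate the two feature sources,
    # while the reversed gradient pushes the student backbone to confuse it.
    logits = torch.cat([bcn(GradReverse.apply(student_feat, lam)),
                        bcn(teacher_feat.detach())])
    labels = torch.cat([torch.zeros(student_feat.size(0), 1, device=student_feat.device),
                        torch.ones(teacher_feat.size(0), 1, device=teacher_feat.device)])
    l_bce = F.binary_cross_entropy_with_logits(logits, labels)
    return l_kd, l_bce

Either loss would then be added to the ground-truth detection loss, weighted by alpha or lambda as in (4.2) and (4.4), with the teacher backbone kept frozen throughout.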
It is worth noting that these backbones generate several scales of features that are used by the detector head to predict the location and class of objects. Therefore, we apply our framework to all scales of features. Specifically, we compute the KD loss in (4.1) and the BCE loss in (4.3) for each scale individually, and then take the average loss among the scales. This average loss is used to optimize the student backbone during training to acquire knowledge from the teacher detector.

There are some limitations in the availability of public datasets that provide matched pairs of cross-modality data with pedestrian annotations for driving scenarios under challenging conditions. Only the Seeing Through Fog dataset [27] publicly provides, in addition to RGB images, corresponding images of other modalities such as thermal and gated images. In fact, these modalities substantially help to improve the detection performance in low light and adverse weather conditions. Therefore, we only use this dataset in the experiments. Since only the gated images in the dataset contain both projections to the RGB frame and object annotations, we only add the gated images to the RGB images to train the teacher detector. To do so, we first project three slices of gated images from the dataset onto the image plane of the RGB camera. Examples of projected slices of gated images and corresponding RGB ones are shown in Figure 4.4. Then, we perform early fusion of the projected gated images with the corresponding RGB images by concatenating them in the channel dimension. Hence, the teacher detector uses a 6-channel tensor as input. By comparison, the student detector uses only the 3-channel RGB images as input.

Figure 4.4: Examples of projected slices of gated images and corresponding RGB images that are used in our experiments, with ground-truth annotations for the pedestrian class. The original images are from the Seeing Through Fog dataset [27].

In this work, we focus on detecting pedestrians in low light and adverse weather conditions. As a result, we only consider the pedestrian class and select the images captured at night in various weather conditions. We split the selected images into three sets: train, validation, and test. The number of images and annotated pedestrians in each set is provided in Table 4.1.

Table 4.1: Number of images and annotated pedestrians in each set used in our experiments.
Set | Images | Annotated pedestrians
Train | 3500 | 10377
Val | 1167 | 3290
Test | 1167 | 3211
Total | 5834 | 16878

We use the train set to optimize models during training. After each epoch of training, we evaluate the current performance of the model using the validation set. Once the performance does not increase for 5 consecutive epochs, we stop the training. Eventually, we calculate the final performance of the trained model using the test set images, which are unseen during training. This yields generalized performance results. As an evaluation metric, we adopt the popular COCO Average Precision (AP) [110].

We implement the proposed framework using PyTorch [111]. All models are initialized using pre-trained weights from ImageNet. We use the SGD optimizer to fine-tune the model for our detection task. All images are resized to have a height of 500 pixels. The initial learning rate is set to 0.005 and is scaled by 0.1 every 5 epochs. Furthermore, we set the batch size to 4, the momentum to 0.9, and the weight decay to 0.0005.
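The optimization settings listed above can be expressed directly in PyTorch; the snippet below is a minimal sketch of that configuration, where the student detector is only a stand-in model, not our full training script:

import torch
import torchvision

# Stand-in student detector; in our setup this would be the RGB-only Faster R-CNN or SSD model.
student_detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

params = [p for p in student_detector.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)

# Scale the learning rate by 0.1 every 5 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

# Typical epoch skeleton (batch size 4, images resized to a height of 500 pixels upstream):
# for images, targets in data_loader:
#     loss_dict = student_detector(images, targets)
#     loss = sum(loss_dict.values())        # ground-truth detection loss
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
# scheduler.step()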
For the hyper-parameters alpha and lambda in our proposed methods, we select the values based on trial and error.

4.2.2 Results and Discussion

The performance results of the proposed methods as compared to the baseline and teacher models are summarized in Table 4.2. Only the RGB images in the test set are used to calculate the performance, except for the teacher model, which requires both RGB and gated images for training and testing. The baseline model is the original Faster R-CNN or SSD detector trained using only RGB images without any KD technique, included for comparison. Based on the results, both of our methods significantly improve the detection performance relative to the baseline model. Equally important, they reduce the performance gap between the teacher and baseline models. For example, with the Faster R-CNN detector, CMKD-MSE reduces the gap in AP by 36%, while CMKD-Adv reduces it by 55%, as shown in Figure 4.5. For all metrics, CMKD-Adv achieves the best results relative to the baseline.

Table 4.2: Performance results of the proposed framework based on the metrics in the COCO dataset evaluator [110]. Both of our methods (CMKD-MSE and CMKD-Adv) improve the detection performance relative to the baseline model. For the AP75 metric, they achieve better performance than the teacher model. For all metrics, CMKD-Adv obtains the best results.
Faster R-CNN - ResNet50:
Model | AP | AP50 | AP75 | APS | APM | APL
Teacher | 27.5 | 63.0 | 18.6 | 8.5 | 27.5 | 43.6
Baseline | 24.2 | 55.5 | 17.3 | 4.6 | 23.0 | 45.0
CMKD-MSE | 25.4 | 56.4 | 19.0 | 5.1 | 24.9 | 45.2
CMKD-Adv | 26.0 | 56.6 | 19.7 | 5.5 | 25.6 | 45.8
SSD - VGG16:
Model | AP | AP50 | AP75 | APS | APM | APL
Teacher | 16.3 | 44.2 | 7.4 | 3.7 | 16.8 | 28.2
Baseline | 12.5 | 34.7 | 6.8 | 2.5 | 11.2 | 27.0
CMKD-MSE | 13.8 | 36.4 | 6.7 | 2.5 | 12.9 | 28.0
CMKD-Adv | 14.5 | 39.2 | 7.8 | 2.9 | 13.8 | 28.9

Figure 4.5: The performance of the proposed framework based on COCO AP [110] using only the RGB images in the test set, except for the teacher model, which is trained and tested using both RGB and gated images. CMKD-MSE is the first proposed method, which uses MSE in (4.1) to transfer knowledge, while CMKD-Adv is the second proposed method, which transfers knowledge using adversarial training. The baseline is the original Faster R-CNN detector trained using only RGB images.

It is worth noting that we do not report the performance of previous methods [51, 52, 53, 54, 55, 56] that used KD for object detection. This is because none of these methods used CMKD to improve object detection in low light and adverse weather conditions. In fact, all of these methods use KD for model compression (i.e., they transfer knowledge from a large, complex model to a small, simple one), which is different from the objective of this work.

To further demonstrate the improvements from our proposed framework, we show in Figure 4.6 visual detection examples of our framework as compared to the baseline. These examples show that our methods (CMKD-MSE and CMKD-Adv) successfully detect pedestrians that the baseline fails to detect in snowy weather and low light conditions. Moreover, Figure 4.7 shows examples where the baseline suffers from false positive cases, while our methods eliminate these false positives, which contributes to the improved performance.

It is worth noting that fundamental limitations to CMKD can exist if the information is simply missing from the single-modality data. Our framework does not reach the full performance of the teacher model because many annotated pedestrians in the dataset are totally dark in the RGB images. In other words, there are not enough pixels in the RGB images to represent many pedestrians.
For this reason, our framework, which uses only RGB images as inputs, cannot detect these specific pedestrians, as in the examples shown in Figure 4.8. By comparison, these pedestrians are clear and easy to detect in the gated images. Therefore, the teacher model successfully detects them because it uses both RGB and gated images as inputs.

Figure 4.6: Visual detection examples of our framework as compared to the baseline. The examples show that CMKD-MSE (top row) and CMKD-Adv (bottom rows) successfully detect pedestrians that the baseline fails to detect in snowy weather and low light conditions.

Figure 4.7: Visual detection examples of our framework as compared to the baseline. In these examples, the baseline suffers from instances of false positives, while our methods, CMKD-MSE (top middle image) and CMKD-Adv (bottom middle image), eliminate these false positives.

Figure 4.8: Visual detection examples of our framework as compared to the baseline and teacher models. The examples show that the pedestrians are totally dark in the RGB images, and there are not enough pixels to represent them. Therefore, it is challenging for the baseline and our framework, which use only RGB images as inputs, to detect them, while the teacher model successfully detects them since it uses both RGB and gated images as inputs.

Chapter 5 Conclusion and Future Work

5.1 Conclusion

In this dissertation, we first outlined state-of-the-art frameworks for object detection, deraining, image-to-image translation, and domain adaptation. Moreover, we highlighted crucial results regarding current methods in terms of their performance under rainy weather conditions. In particular, there is an overarching, consistent message regarding the limitations of these techniques in handling and mitigating the impact of rain on visuals captured by moving vehicles. However, we believe that generative models and domain adaptation could still play a crucial role in training object detection methods to be more robust and resilient under challenging conditions. As a result, we proposed the GDA framework, which integrates generative model-based image translation with domain adaptive object detection to improve the detection of objects in a driving environment under realistic rainy conditions. In particular, we generated visuals that are representative of a challenging target domain by utilizing unsupervised Image-to-Image Translation (I2IT). Then, these generated visuals and unlabeled target domain data were used to train a domain adaptive object detection method. Our results showed that the proposed GDA achieved significant improvements when tested under real rainy weather conditions in comparison with the state-of-the-art Domain Adaptive Faster R-CNN and the baseline generative model-based I2IT training method.

In addition, we proposed a multiscale domain adaptation framework for the popular state-of-the-art real-time object detector YOLO. Specifically, under our MS-DAYOLO architecture, we applied domain adaptation to features at three different scales within the YOLO feature extractor that are fed to the next stage. In addition to the baseline architecture of a multiscale domain adaptive network, we developed three deep learning architectures to produce more robust domain-invariant features that reduce the impact of domain shift.
The proposed architectures include progressive feature reduction, a unified domain classifier, and an integrated architecture that combines the benefits of the progressive feature reduction and unified domain classifier strategies to improve the overall detection performance in the target domain. Based on various experimental results, our proposed MS-DAYOLO framework can successfully adapt YOLO to target domains without the need to annotate the objects found in the data. Furthermore, the proposed MS-DAYOLO architectures outperformed the state-of-the-art YOLOv4 and other existing approaches that are based on the Faster R-CNN object detector under diverse testing scenarios for autonomous driving applications.

Furthermore, we proposed a novel framework that improved the performance of RGB-only pedestrian detection in low light and adverse weather conditions by employing cross modality knowledge distillation. In particular, we transferred the knowledge of a teacher detector that uses both RGB and gated images to an RGB-only student detector. Specifically, we developed two methods within the proposed framework that force the student model to generate features similar to those of the teacher model. Our experiments show that incorporating adversarial training with knowledge distillation improves the detection performance by a significant margin and reduces the performance gap between the teacher and baseline models by 55%. Equally important, the proposed framework can be applied to other state-of-the-art detectors.

Overall, we believe that the proposed frameworks in this dissertation will increase the efficiency and safety of autonomous systems by improving the detection performance of objects under challenging conditions.

5.2 Future Work

Object-Based Domain Adaptation: Domain adaptation in MS-DAYOLO in Chapter 3 is an image-level adaptation because it is applied to all features of the input image. This image-level strategy helps to reduce the domain shift due to global image differences such as image scale, image style, and illumination. However, local instance differences such as object appearance, size, and viewpoint are not addressed in MS-DAYOLO. Therefore, we believe applying domain adaptation to local features of objects in the YOLO framework will further improve the detection performance. Specifically, domain adaptation could be applied to the features that are associated with each object in an input image. This instance-level adaptation should help to reduce the domain shift that is due to the local instance differences. After the input image features are generated by the YOLO backbone network, we plan to extract the features that correspond to each object in the image, as shown in Figure 5.1. Then, we can apply domain adaptation to the extracted features by feeding them to an Object-Based Domain Adaptive Network (OB-DAN). In domain adaptation, ground-truth bounding boxes of target data are generally not available, and hence it is not clear which features from within the image should be used for instance-level domain adaptation. To solve this problem, an already trained YOLO can be used to detect objects, and these detections can be used to extract the instance-level features. However, these detections do not represent the ground truth, and may therefore be highly inaccurate, which could lead to a degradation in performance.
Alternatively, we propose to perform the following steps:

• Train MS-DAYOLO, which represents the global image-level domain adaptation for the YOLO object detector.

• Use the trained MS-DAYOLO to predict bounding boxes for target data.

• Filter the predicted bounding boxes based on their confidence scores. In particular, bounding boxes with confidence scores larger than a specific threshold can be selected. This technique is usually called pseudo labeling.

• Train the object-based domain adaptive YOLO, which is initialized with the trained weights of MS-DAYOLO, using ground-truth bounding boxes for source data and the filtered predicted bounding boxes for target data.

Figure 5.1: Example of extracting the features that correspond to the car in the middle of the image; the extracted features are fed to the Object-Based Domain Adaptive Network (OB-DAN).

The proposed framework is summarized in Figure 5.2. We plan to conduct several experiments for this framework with different datasets and scenarios to validate its performance.

Figure 5.2: The proposed framework for Object-Based Domain Adaptive YOLO (OB-DAYOLO). GT: Ground Truth, BB: Bounding Boxes.

Class-Wise Domain Adaptation: Instead of applying domain adaptation to all objects independent of their classes, as explained under Object-Based Domain Adaptation, we believe that applying domain adaptation separately to each class can provide further improvements. Specifically, we intend to develop an individual object-based Domain Adaptive Network (DAN) that is optimized for each class, as shown in Figure 5.3. This will prevent jointly aligning the distributions of object features from different classes. For example, aligning the distribution of features for the car class with the distribution of features for the pedestrian class in a different domain may degrade the detection performance. Therefore, aligning the distributions of local instance features within each class can improve the detection performance, especially for a dataset that has many classes.

Figure 5.3: The proposed Class-Wise Domain Adaptation for the YOLO object detector. An Object-Based Domain Adaptive Network (OB-DAN) is developed for each class in the used dataset.

Extension of Cross Modality Knowledge Distillation (CMKD): The CMKD framework in Chapter 4 can be extended in several directions. In addition to the RGB camera, other sensor modalities (e.g., thermal imaging, lidar, and radar) that improve the detection performance in low light and adverse weather conditions can be used in the teacher model. In the CMKD framework, we used early (data-level) fusion to fuse RGB and gated images. On the other hand, there are other techniques, such as late (decision-level) fusion and intermediate (feature-level) fusion, that can be employed to fuse multiple modalities in the teacher model. These techniques can be used to improve the performance of the teacher model, which subsequently should improve the student model.

New performance metric: It is important to emphasize that the current metrics for detection performance (Pascal mAP and COCO AP) may not reflect the true effectiveness of the proposed frameworks. In driving environments, we observe that detecting certain objects is more crucial than detecting others. For instance, detecting pedestrians that are crossing the road or walking near the vehicle is more critical than detecting pedestrians that are walking at safe distances on well-protected sidewalks.
To illustrate this observation, the scene in the bottom row of Figure 4.6 contains two ground-truth annotated pedestrians that are used to compute COCO AP. However, for this particular frame, the pedestrian crossing the road is more important to detect than the pedestrian standing back on the sidewalk. To address this issue, we believe it is essential for future research to develop a new performance metric, rather than COCO mAP or Pascal mAP, that evaluates the effectiveness of detecting critical objects instead of all objects in a scene.

BIBLIOGRAPHY

[1] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in autonomous driving,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2722–2730.
[2] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 129–137.
[3] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, “Multinet: Real-time joint semantic reasoning for autonomous driving,” in IEEE Intelligent Vehicles Symposium (IV), 2018, pp. 1013–1020.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[5] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, June 2017.
[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 779–788.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[10] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, 2016, pp. 379–387.
[11] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
[12] ——, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[13] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[15] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 12, pp. 2341–2353, 2010.
[16] X. Liu, Y. Ma, Z. Shi, and J.
Chen, “Griddehazenet: Attention-based multi-scale network for image dehazing,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 7314–7323. [17] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong, “Zero-reference deep curve estimation for low-light image enhancement,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2020, pp. 1780–1789. [18] H. Dong, J. Pan, L. Xiang, Z. Hu, X. Zhang, F. Wang, and M.-H. Yang, “Multi-scale boosted dehazing network with dense feature fusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2157–2167. [19] X. Qin, Z. Wang, Y. Bai, X. Xie, and H. Jia, “Ffa-net: Feature fusion attention network for single image dehazing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 908–11 915. [20] S. Li, I. B. Araujo, W. Ren, Z. Wang, E. K. Tokuda, R. H. Junior, R. Cesar-Junior, J. Zhang, X. Guo, and X. Cao, “Single image deraining: A comprehensive benchmark analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3838–3847. [21] M. Hnewa and H. Radha, “Object detection under rainy conditions for autonomous vehicles: A review of state-of-the-art and emerging techniques,” IEEE Signal Processing Magazine, vol. 38, no. 1, pp. 53–67, 2021. [22] M. Krišto, M. Ivasic-Kos, and M. Pobar, “Thermal object detection in difficult weather conditions using yolo,” IEEE access, vol. 8, pp. 125 459–125 476, 2020. 87 [23] R. Blin, S. Ainouz, S. Canu, and F. Meriaudeau, “Road scenes analysis in adverse weather conditions by polarization-encoded images and adapted deep learning,” in 2019 IEEE Intel- ligent Transportation Systems Conference (ITSC). IEEE, 2019, pp. 27–32. [24] P. Tumas, A. Nowosielski, and A. Serackis, “Pedestrian detection in severe weather condi- tions,” IEEE Access, vol. 8, pp. 62 775–62 784, 2020. [25] G. Liao, W. Gao, G. Li, J. Wang, and S. Kwong, “Cross-collaborative fusion-encoder net- work for robust rgb-thermal salient object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2022. [26] F. Julca-Aguilar, J. Taylor, M. Bijelic, F. Mannan, E. Tseng, and F. Heide, “Gated3d: Monocular 3d object detection from temporal illumination cues,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2938–2948. [27] M. Bijelic, T. Gruber, F. Mannan, F. Kraus, W. Ritter, K. Dietmayer, and F. Heide, “Seeing through fog without seeing fog: Deep multimodal sensor fusion in unseen adverse weather,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 679–11 689. [28] S. S. Chaturvedi, L. Zhang, and X. Yuan, “Pay ”attention” to adverse weather: Weather- aware attention-based object detection,” arXiv preprint arXiv:2204.10803, 2022. [29] L. Duan, I. W. Tsang, and D. Xu, “Domain transfer multiple kernel learning,” IEEE Trans- actions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 465–479, 2012. [30] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” in CVPR 2011. IEEE, 2011, pp. 1785–1792. [31] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Ma- chine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016. [32] E. Tzeng, J. Hoffman, K. Saenko, and T. 
Darrell, “Adversarial discriminative domain adap- tation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7167–7176. [33] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Unsupervised domain adaptation with residual transfer networks,” in Advances in neural information processing systems, 2016, pp. 136– 144. 88 [34] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Pro- ceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37. JMLR. org, 2015, pp. 1180–1189. [35] W. Li, F. Li, Y. Luo, P. Wang et al., “Deep domain adaptive object detection: A survey,” in 2020 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2020, pp. 1808–1813. [36] V. F. Arruda, T. M. Paixão, R. F. Berriel, A. F. De Souza, C. Badue, N. Sebe, and T. Oliveira- Santos, “Cross-domain car detection using unsupervised image-to-image translation: From day to night,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8. [37] C.-T. Lin, “Cross domain adaptation for on-road object detection using multimodal structure-consistent image-to-image translation,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3029–3030. [38] T. Guo, C. P. Huynh, and M. Solh, “Domain-adaptive pedestrian detection in thermal im- ages,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 1660–1664. [39] C. Devaguptapu, N. Akolekar, M. M Sharma, and V. N Balasubramanian, “Borrow from anywhere: Pseudo multi-modal object detection in thermal imagery,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019. [40] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa, “Cross-domain weakly-supervised object detection through progressive domain adaptation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5001–5009. [41] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680. [42] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain adaptive faster r-cnn for object detection in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3339–3348. [43] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin, “Adapting object detectors via selective cross- domain alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 687–696. 89 [44] T. Wang, X. Zhang, L. Yuan, and J. Feng, “Few-shot adaptive faster r-cnn,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7173–7182. [45] K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Strong-weak distribution alignment for adaptive object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6956–6965. [46] Z. He and L. Zhang, “Multi-adversarial faster-rcnn for unrestricted object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6668– 6677. [47] V. A. Sindagi, P. Oza, R. Yasarla, and V. M. Patel, “Prior-based domain adaptive object detection for hazy and rainy conditions,” in European Conference on Computer Vision. Springer, 2020, pp. 763–780. [48] C. Zhuang, X. Han, W. Huang, and M. 
Scott, “ifan: Image-instance full alignment net- works for adaptive object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 13 122–13 129. [49] G. Zhao, G. Li, R. Xu, and L. Lin, “Collaborative training between region proposal local- ization and classification for domain adaptive object detection,” in European Conference on Computer Vision. Springer, 2020, pp. 86–102. [50] H.-K. Hsu, C.-H. Yao, Y.-H. Tsai, W.-C. Hung, H.-Y. Tseng, M. Singh, and M.-H. Yang, “Progressive domain adaptation for object detection,” in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 749–757. [51] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” Advances in neural information processing systems, vol. 30, 2017. [52] Q. Li, S. Jin, and J. Yan, “Mimicking very efficient network for object detection,” in Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6356–6364. [53] T. Wang, L. Yuan, X. Zhang, and J. Feng, “Distilling object detectors with fine-grained feature imitation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4933–4942. 90 [54] J. Guo, K. Han, Y. Wang, H. Wu, X. Chen, C. Xu, and C. Xu, “Distilling object detectors via decoupled features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2154–2164. [55] X. Dai, Z. Jiang, Z. Wu, Y. Bao, Z. Wang, S. Liu, and E. Zhou, “General instance distillation for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7842–7851. [56] G. Li, X. Li, Y. Wang, S. Zhang, Y. Wu, and D. Liang, “Knowledge distillation for object detection via rank mimicking and prediction-guided feature imitation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1306–1313. [57] K. Garg and S. K. Nayar, “Detection and removal of rain from videos,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2004, pp. I–I. [58] L. Kang, C. Lin, and Y. Fu, “Automatic single-image-based rain streaks removal via image decomposition,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1742–1755, April 2012. [59] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, “Deep joint rain detection and removal from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1357–1366. [60] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley, “Removing rain from single images via a deep detail network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3855–3863. [61] R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu, “Attentive generative adversarial network for raindrop removal from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2482–2491. [62] D. Ren, W. Zuo, Q. Hu, P. Zhu, and D. Meng, “Progressive image deraining networks: a better and simpler baseline,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3937–3946. [63] T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. Lau, “Spatial attentive single-image deraining with a high quality real rain dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 270–12 279. 
91 [64] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004. [65] M. Hnewa and H. Radha, “Integrated generative-model domain-adaptation for ob- ject detection under challenging conditions,” in 2022 IEEE 95th Vehicular Technology Conference:(VTC2022-Spring), pp. 1–5. [66] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833. [67] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations, ICLR, 2015. [68] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and ma- chine intelligence, vol. 32, no. 9, pp. 1627–1645, 2009. [69] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vi- sion benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [70] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [71] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder, “The mapillary vistas dataset for semantic understanding of street scenes,” in International Conference on Computer Vision (ICCV), 2017. [Online]. Available: https://www.mapillary.com/dataset/vistas [72] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving video database with scalable annotation tooling,” arXiv preprint arXiv:1805.04687, 2018. [73] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Bal- dan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019. [74] P. Rousseau, V. Jolivet, and D. Ghazanfarpour, “Realistic real-time rain rendering,” Com- puters & Graphics, vol. 30, no. 4, pp. 507–518, 2006. 92 [75] K. Garg and S. K. Nayar, “Photorealistic rendering of rain streaks,” ACM Transactions on Graphics (TOG), vol. 25, no. 3, pp. 996–1002, 2006. [76] Cycore Rainfall simulation, Adobe After Effects CC 2019, Adobe Inc., Sab Jose, CA, USA, 2019. [Online]. Available: https://www.adobe.com/products/aftereffects.html [77] M. Everingham and J. Winn, “The pascal visual object classes challenge 2012 development kit,” Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep, 2011. [78] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in neural information processing systems, 2017, pp. 700–708. [79] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [80] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680. [81] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. 
Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in neural information processing systems, 2015, pp. 802–810. [82] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pat- tern recognition, 2017, pp. 1125–1134. [83] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232. [84] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “To- ward multimodal image-to-image translation,” in Advances in Neural Information Process- ing Systems, 2017, pp. 465–476. [85] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 172–189. 93 [86] Z. Yi, H. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual learning for image-to- image translation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2849–2857. [87] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8789–8797. [88] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Te- jani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. [89] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014. [90] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adap- tation networks,” in International Conference on Machine Learning, 2015, pp. 97–105. [91] R. Xie, F. Yu, J. Wang, Y. Wang, and L. Zhang, “Multi-level domain adaptive learning for cross-domain detection,” in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019. [92] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Ma- chine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016. [93] M. Hnewa and H. Radha, “Multiscale domain adaptive yolo for cross-domain object detec- tion,” in IEEE International Conference on Image Processing (ICIP), 2021, pp. 3323–3327. [94] C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy scene understanding with synthetic data,” International Journal of Computer Vision, vol. 126, no. 9, pp. 973–992, 2018. [95] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454. [96] M. Khodabandeh, A. Vahdat, M. Ranjbar, and W. G. Macready, “A robust learning approach to domain adaptive object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 480–490. 94 [97] Q. Cai, Y. Pan, C.-W. Ngo, X. Tian, L. Duan, and T. 
Yao, “Exploring object relation in mean teacher for cross-domain detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 457–11 466. [98] T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim, “Diversify and match: A domain adaptive representation learning paradigm for object detection,” in Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition, 2019, pp. 12 456–12 465. [99] M. Xu, H. Wang, B. Ni, Q. Tian, and W. Zhang, “Cross-domain detection via graph-induced prototype alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 355–12 364. [100] M. Bijelic, T. Gruber, and W. Ritter, “Benchmarking image sensors under adverse weather conditions for autonomous driving,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1773–1779. [101] T. Gruber, F. Julca-Aguilar, M. Bijelic, and F. Heide, “Gated2depth: Real-time dense lidar from gated images,” in Proceedings of the IEEE/CVF International Conference on Com- puter Vision, 2019, pp. 1506–1516. [102] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535–541. [103] J. Ba and R. Caruana, “Do deep nets really need to be deep?” Advances in neural informa- tion processing systems, vol. 27, 2014. [104] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015. [105] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Phili- pose, and M. Richardson, “Do deep convolutional nets really need to be deep and convolu- tional?” in ICLR workshop, 2016. [106] J. Kim, S. Park, and N. Kwak, “Paraphrasing complex network: Network compression via factor transfer,” Advances in neural information processing systems, vol. 31, 2018. [107] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, “Im- proved knowledge distillation via teacher assistant,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 5191–5198. 95 [108] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai, “Variational information dis- tillation for knowledge transfer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9163–9171. [109] B. Heo, M. Lee, S. Yun, and J. Y. Choi, “Knowledge transfer via distillation of activation boundaries formed by hidden neurons,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 3779–3787. [110] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zit- nick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755. [111] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017. 96