DETECTING OBJECTS UNDER CHALLENGING ILLUMINATION CONDITIONS

By

Yousef Atoum

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering — Doctor of Philosophy

2018

ABSTRACT

DETECTING OBJECTS UNDER CHALLENGING ILLUMINATION CONDITIONS

By Yousef Atoum

Object detection is considered one of the most critical components of any Computer Vision (CV) system. For many real-world CV systems, such as object tracking, recognition, and alignment, a good localization of the targeted object is necessary as an initialization step. In this thesis, we explore object detection as used in real-life scenarios and study the effect of challenging illumination conditions on overall performance. More specifically, we study two main challenges: underexposed dark images captured during nighttime, and overexposed bright images with sunlight projected onto the object.

We initially study two detection applications used in real-life scenarios. The first application used a Correlation Filter (CF) detector for detecting small objects. CFs were robust in handling object detection with minimal appearance variations, but suffered under illumination challenges, causing many false alarms. To overcome this issue, we propose a post-refinement method to eliminate false detections caused by specular light. The second detection application used Convolutional Neural Networks (CNNs). We learned a total of five detection CNNs, which constitute the inputs of a Multiplexer-based method that controls the flow of the CV system and selects the appropriate CNN based on estimated physical measurements. The five CNNs include a global frame object detector, three local object detectors used for tracking, and finally an object contour detector used to enhance the 2D detection and infer the 3D localization of the target. With sufficient training data, all five CNNs were shown to generalize well over a wide range of illumination variations introduced by weather changes, along with many other visual challenges. However, the challenging underexposed images collected during nighttime led to system failure.

In this thesis, we propose two CNN models to handle both illumination challenges: (1) a temporal delighting Guided-CNN scheme for recovering overexposed video frames caused by sunlight, based on the analysis of directional light, and (2) an iterative CNN-based technique to synthesize well-lit images from images suffering from underexposed, low-intensity lighting. We demonstrate that our approach allows the recovery of plausible illumination conditions and enables improved object detection capability. An extensive evaluation on several CV systems was carried out, including pedestrian detection and trailer coupler detection.

Copyright by YOUSEF ATOUM 2018

This dissertation is dedicated to my beautiful wife, Baraa, and my children, Adam and Ryan. I wouldn't have made it this far without your continued support and encouragement.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Xiaoming Liu, for the patient guidance, encouragement, and advice he has provided throughout my time as his student. I joined Dr. Liu's lab at a time when I had almost lost interest in pursuing my graduate degree due to a variety of reasons beyond my control.
Dr. Liu helped me in a variety of aspects along the way, such as becoming a well-rounded researcher, refining my writing and presentation skills, attention to detail, professional advice, and maturing my problem-solving skills. I have been privileged to have Dr. Liu as my advisor, and I will always be proud of it. I would also like to thank all of the CV lab members for the many years we have spent together as lab-mates and as friends. They were always there for me whenever I needed help or advice, whether it was research related or any other personal matter. Amin Jourabloo, Xi Yin, Luan Tran, Garrick Brazil, Yaojie Liu, Tony Zhang, Joel Stehouwer, Adam Terwilliger, Bangjie Yin, Morteza Safdarnejad, Joseph Roth, and Jamal Afridi, thank you guys for all of the great times we had.

I also thank my parents, Adnan and Jaine, who provided me with the stepping stones to become the person I am today. Since I was a kid, they have guided me in gaining the skills needed to excel in my future journey and encouraged me towards attaining my future goals. And last but not least, I want to thank my wife Baraa, who has stood by me since day one of my graduate studies, who has tolerated my absences working late at school and my fits of exhaustion and impatience. She gave me endless support and help, and supported the family during much of my graduate studies. More importantly, she sacrificed her career for me to finish my studies. Along with her, I want to acknowledge my two sons, Adam and Ryan. They have never known their dad as anything but a student always working on the laptop. They have been a great source of motivation, love, and relief whenever I felt down.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction on object detection challenges and contributions
  1.1 Introduction
  1.2 Contributions
  1.3 Organization

Chapter 2  Detecting small objects using correlation filters
  2.1 Problem statement
  2.2 Proposed method on feed detection
    2.2.1 Correlation Filter for Feed Detection
    2.2.2 SVM Classifier for Refinement of False Alarms
    2.2.3 Locating the Optimum Local Region
    2.2.4 Automatic Control of the Feeding Process
  2.3 Experimental results
    2.3.1 Feed Detection
    2.3.2 Local Region Estimation
    2.3.3 Computational Efficiency
  2.4 Conclusion

Chapter 3  Detecting 2D trailer couplers using convolutional neural networks
  3.1 Problem statement
  3.2 Convolutional neural networks on object detection
  3.3 Proposed method on 2D coupler detection
    3.3.1 Preprocessing for Geometric Interpolation
    3.3.2 Coupler Detection
    3.3.3 Coupler Tracking
  3.4 Trailer Coupler Database
  3.5 Experimental results
    3.5.1 Multiplexer-CNN Setup
      3.5.1.1 Network Implementation Details
    3.5.2 Results
      3.5.2.1 Coupler detection
      3.5.2.2 Coupler tracking
  3.6 Conclusion

Chapter 4  3D trailer coupler localization using convolutional neural networks
  4.1 Problem statement
  4.2 Height Estimation and 3D Localization
    4.2.1 Coupler contour network
    4.2.2 Contour estimation
    4.2.3 Height estimation
    4.2.4 3D coupler localization
  4.3 Experimental results
    4.3.1 Parameter setting
    4.3.2 Height estimation
    4.3.3 Overall system test
    4.3.4 System efficiency
    4.3.5 Qualitative results
  4.4 A generalized approach for 3D coupler localization
    4.4.1 Height estimation algorithm: A new method
    4.4.2 Experimental results
  4.5 Conclusion

Chapter 5  Detecting objects in overexposed bright illumination conditions
  5.1 Problem statement
  5.2 Introduction to the overexposed light challenge
  5.3 Prior work
    5.3.1 Light direction estimation
    5.3.2 Enhancing lighting in images
      5.3.2.1 Shadow removal
      5.3.2.2 Image dehazing
      5.3.2.3 Image relighting
    5.3.3 Joint filtering methods
  5.4 Dataset collection
    5.4.1 Light direction dataset
    5.4.2 BOSCH vehicle backseat dataset
  5.5 Estimating light direction using a CNN model
  5.6 Classifying the light quality using the light direction CNN
  5.7 Learning the DelightCNN model
  5.8 Experimental results
    5.8.1 DelightCNN setup
    5.8.2 Results on the light direction dataset
    5.8.3 Results on the BOSCH dataset
    5.8.4 Limitation and failure cases
  5.9 Conclusion

Chapter 6  Detecting objects in underexposed low-light illumination conditions
  6.1 Introduction to underexposed light challenge
  6.2 Prior work
    6.2.1 Nighttime object detection
    6.2.2 Dark image enhancement and synthesizing
  6.3 Learning the iterative CNN model
  6.4 Underexposed light datasets
  6.5 Experimental results
    6.5.1 Iterative CNN setup
    6.5.2 Results on the MEF dataset using the iterative method
    6.5.3 Comparison with state of the art
    6.5.4 Pedestrian detection at night analysis
    6.5.5 Trailer coupler detection at night analysis
  6.6 Conclusion

Chapter 7  Conclusions and Future Work
  7.1 Conclusion
  7.2 Future work

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: A list of real-world problems and challenges covered in this thesis.
Table 2.1: Guideline for controlling the feeding process.
Table 3.1: CNN architectures.
Table 4.1: System efficiency.
Table 6.1: Average SSIM gain of the iterative CNN model using the MEF dataset.
Table 6.2: Average SSIM and PSNR using the EMPA HDR dataset compared with SOTA methods.
Table 6.3: Average SSIM and PSNR using the Waterloo MEF dataset compared with SOTA method.

LIST OF FIGURES

Figure 2.1: Given the video input, our system performs real-time monitoring and feeding decision for a highly dense fish tank.
Figure 2.2: The architecture of our feeding control system.
Figure 2.3: Feed detection procedures with each column being one local region of 150 × 150 pixels.
Figure 2.4: Comparison of feed detection.
Figure 2.5: Local region optimization.
Figure 2.6: Comparison of normalized feed.
Figure 3.1: Automatic trailer hitching by detecting and tracking the coupler using a Multiplexer-CNN.
Figure 3.2: An automated computer vision system for coupler detection, tracking, and 3D localization.
Figure 3.3: 3D distance estimation of the coupler.
Figure 3.4: Iterative coupler detection using DCNN.
Figure 3.5: Statistics of the trailer coupler database.
Figure 3.6: Detection errors vs. distance in meters.
Figure 3.7: Confidence scores vs. Euclidean estimation errors.
Figure 3.8: Tracking comparison with errors in meters.
Figure 3.9: Tracking comparison with errors in pixels.
Figure 3.10: Tracking comparison with precision plot.
Figure 3.11: Number of TCNN networks used for tracking.
Figure 4.1: Geometric features of a contour: distances along straight lines, and slopes of red lines.
Figure 4.2: Estimation of x_t and y_t on D_h elevated by the estimated height z_t.
Figure 4.3: 3D coupler localization.
Figure 4.4: Contour estimation comparison.
Figure 4.5: Overall system accuracy on the height label set.
Figure 4.6: Qualitative results of five videos.
Figure 4.7: Boat trailer couplers.
Figure 4.8: SIFT matching to find the amount of shift on the ground plane.
Figure 4.9: Finding the best distance of the two key frames for applying SIFT and estimating z_t.
Figure 4.10: 3D analysis on the 72 height dataset using the new method.
Figure 4.11: Comparing the height estimation of the CCNN approach versus the new method using SIFT matching points.
Figure 4.12: Estimating z_t using the generalized approach.
Figure 5.1: Challenging illumination examples.
Figure 5.2: We present an approach for synthesizing good lighting frames from videos taken in extreme light conditions.
Figure 5.3: Examples of the light direction dataset.
Figure 5.4: Examples of the BOSCH database.
Figure 5.5: Dividing the sphere into smaller regions, such that each region represents a specific direction of a light source.
Figure 5.6: Distribution of video frames after PCA on the 18-dimensional data, K-means clustering, and SVM classification of lighting quality.
Figure 5.7: Proposed method overview and pipeline.
Figure 5.8: Evaluation of the light direction CNN.
Figure 5.9: ROC curve of light direction bin classification.
Figure 5.10: Evaluation of DelightCNN on the BOSCH backseat dataset.
Figure 5.11: CF responses of the ID card detector.
Figure 5.12: CF detection performance comparison when applied to the original video and the delighted video.
Figure 5.13: Failure cases of DelightCNN.
Figure 6.1: Proposed idea for delighting dark images, motivated by multi-exposure sequences along with the HDR image.
Figure 6.2: Iterative CNN architecture for image delighting.
Figure 6.3: Iterative CNN example.
Figure 6.4: Examples of the EMPA HDR database [96].
Figure 6.5: Iterative CNN evaluation on the MEF dataset.
Figure 6.6: Iterative CNN evaluation on several low-EV scenes from the MEF dataset.
Figure 6.7: Iterative CNN comparison with SOTA methods on the EMPA HDR dataset.
Figure 6.8: Iterative CNN comparison with SOTA methods on the Waterloo MEF dataset.
Figure 6.9: Iterative CNN evaluation on the pedestrian detection application.
Figure 6.10: Precision-recall and miss-rate curves of the pedestrian detection algorithm.
Figure 6.11: Iterative CNN evaluation on the trailer coupler detection system at nighttime for two videos.
Chapter 1
Introduction on object detection challenges and contributions

1.1 Introduction

With the advancement of autonomous vehicles, face recognition, surveillance, biometric systems, and many other computer vision (CV) applications, accurate and efficient object detection systems are rising in demand. A wide range of algorithms have been proposed to detect objects in still images or videos, but a solution that clearly outperforms the human vision system is still missing [92]. The core challenge is that each object in the world can produce an infinite number of different 2D images, as the object's position, scale, pose, lighting, and background vary relative to the camera. Yet as humans, our brains can remedy this challenge effortlessly. Whether the object is very far away and small in size, captured at nighttime or in very bright lighting conditions, highly occluded by other objects, or seen in any possible pose, our visual system can simply spot the target. With the advancement of CV algorithms, researchers are taking small steps towards human-level detection performance. In this thesis, we highlight several object detection challenges that slow the growth of this field, and propose solutions to overcome these problems.

Great object detection success has been achieved in controlled environments. One of the first object detectors to gain a lot of attention was the use of templates for localizing the position of an object. Correlation filters (CFs) are 2D templates and are considered among the most powerful and efficient detectors, especially when the targeted object has limited appearance changes. CFs are applied to every location in an image using 2D convolution. Since the correlation is processed in the Fourier domain as element-wise multiplication, these filters have attracted many researchers' attention for visual detection and tracking due to their remarkable computational efficiency, i.e., 600 FPS on a CPU in [10]. Another interesting property of CFs is graceful degradation. This is remarkably useful when objects undergo partial occlusion or illumination changes, where the CF detector will have a positive but degraded response that is still indicative of the target's existence. CFs have been widely used in many object detection applications, such as detecting military vehicles and planes [52], pedestrians and vehicles [33], and vehicle parts [8], and even tracking by detection of objects [42]. In Chapter 2, we utilize CFs in detecting several small objects simultaneously in real time, given the challenging illumination conditions of the problem, as seen in Table 1.1.

Convolutional Neural Networks (CNNs) have also been used extensively for object detection over the past few years, witnessing remarkable progress. Several challenges are associated with CNN methods, as follows:

• One of the biggest limitations when attempting to design CNNs for handling specific challenges is the availability of training data for learning the detector. Usually, data augmentation, fine-tuning, or training shallow architectures are common practices when samples are limited. Overfitting to the training data is one of the main concerns in such cases.
Throughout this thesis, we have proposed a total of eight CNN models for different purposes, each of which had very limited, if any, publicly available data. We will demonstrate how we collect data and utilize limited data resources to pursue high generalization ability in CNN learning.

Ch. 2 — Feed detection. Challenges: small object size, pose variation, and specular illumination; real-time detection needed. Example: fish-tank image showing light reflection and multiple feed.
Ch. 3 — Coupler 2D detection. Challenges: scale, pose, illumination, and appearance variations; real-time detection needed. Example: trailer coupler image.
Ch. 4 — Coupler 3D detection. Challenges: scale, pose, illumination, and coupler style variations; real-time detection needed. Example: couplers at heights of 56 cm, 35 cm, and 60 cm.
Ch. 5 — Light normalization (overexposed challenge). Challenges: object appearance visibility is low; pixels all have high intensity values; real-time light normalization needed. Example: pedestrian in bright sunlight.
Ch. 6 — Light normalization (underexposed challenge). Challenges: object appearance visibility is low; pixels all have low intensity values; real-time light normalization needed. Example: trailer at night.

Table 1.1: A list of real-world problems and challenges covered in this thesis.

• Another challenge in this framework is the design complexity. Recently, the object detection task has been divided into several smaller components [11, 44, 72], such as a region proposal network (RPN), a region CNN classifier, and a proposal suppression component such as non-maximum suppression (NMS). Hence, any new component proposed for the object detection pipeline should not create a computational bottleneck; otherwise it will be conveniently ignored in practical implementations [9]. Moreover, many target applications nowadays have restricted memory and computational power, run on CPUs only, and need to operate in real time. In Chapters 3 and 4, we propose a real-time, lightweight multiplexer-CNN method composed of five CNN components to solve the trailer coupler detection problem, as seen in Table 1.1.

• Most CNN object detectors are formulated as a regression problem to localize object bounding boxes. This helps in generating detection confidence scores, which are needed in real-world scenarios due to the many challenges associated with them. The scores are usually computed by two main components, the RPN and the CNN classifier [72]. The RPN generates a set of rectangular object proposals, each with an objectness score measuring membership to an object class. The CNN classifier also outputs a classification confidence. Both of these scores are thresholded to produce a binary detection decision. Given a specific test sample, selecting a fixed threshold on the generated scores can have a huge impact on the performance. For example, the pedestrian detector in [11] achieves state-of-the-art performance on the Caltech benchmark [24]. When evaluating the same system on nighttime sequences, i.e., in Chapter 6, the system performance degraded significantly because of the high miss rate. It was observed that nearly a 15% performance gain was achieved by readjusting the classification score threshold, i.e., lowering the threshold from 0.44 to 0.1. In Chapter 3, we introduce several CNN detectors, one of which is trained using our novel confidence loss function along with a CNN-based object detector, which estimates object coordinates associated with confidence scores. These scores reflect the existence of the target object within the spatial enclosure of the training patches, as well as how accurate the 2D regression detection is.
• Very few works in the literature utilize object detection for understanding the surrounding real world, even though it was intended for such problems: for example, estimating the distance of the detected object in meters, determining whether the path between the object and the camera is clear, checking whether the detected object's size is normal, and so on. In Chapter 4, we propose to transfer 2D object detections into 3D using a single monocular camera, which helps in inferring real-world distances of objects.

After completing all of the works mentioned above, i.e., Chapters 2, 3, and 4, several observations were made based on the failure cases of the CV systems. In any realistic real-world outdoor application, there are a number of additional challenges associated with object detection, such as efficiency, unstable background and foreground, obscurity of objects, and extreme illumination challenges [88], which are more problematic than the typical visual object detection challenges, i.e., pose, scale, partial occlusion, and illumination variations. Such challenges are not well considered in most of the object detection literature. In Chapters 5 and 6, we specifically focus on the illumination challenges that arise when object detection systems operate in outdoor scenarios. More specifically, Chapter 5 introduces the overexposed bright sunlight problem, where sunlight projected onto objects leaves most pixel values white, i.e., a washed-out appearance. In Chapter 6, we tackle the opposite problem of underexposed low-light illumination, captured in the absence of light sources. One common solution in the literature is to utilize infrared (IR) sensors to replace the RGB cameras. On the downside, there are limitations in terms of the illumination angle and distance when such sensors are used. Moreover, the illuminator power must be adjusted depending on whether the object is close or far away. The cost of such sensors is another concern to the user, as is the added system complexity of the additional sensors. Another solution would be to fine-tune the system with training data collected under such illumination challenges. However, the detection cues for objects in such cases are fundamentally different from those when objects are in good lighting conditions, and the lack of data in such scenarios makes this solution infeasible.

1.2 Contributions

In this section, we list the contributions made towards completing this thesis:

• Utilized CFs in detecting an immense number of small objects simultaneously, such that the detection is done in real time and takes place in an environment with challenging overexposed specular reflections. CFs alone struggle when false objects have features and properties similar to the target, causing severe false alarms. We propose a post-refinement component attached to the CF output to enhance the overall detection performance.

• Detected trailer couplers in real-world scenarios with several challenges, including scale, pose, illumination, partial occlusion, and a wide range of appearance variations. This is made possible through our novel design of a distance-driven Multiplexer-CNN that achieves both generalization and real-time efficiency. Even though our system is composed of five CNN components, we managed to run the system on a standard computer using only a single CPU. Scale variation is among the most challenging cases in this problem, since the object appearance changes according to the distance between the object and the camera.
Therefore, we proposed three CNN detectors to handle this challenge, such that the output maintains high accuracy regardless of the object's distance.

• Developed a novel loss function along with a CNN-based object detector, which estimates object coordinates associated with confidence scores. These scores reflect the existence of the target object within the spatial enclosure of a local region, as well as how accurate the 2D regression detection is.

• Developed two methods to transfer 2D object detections into 3D using a single monocular camera. One of the methods is based on estimating the geometric shape of the target to infer a 3D localization of the object. The other method, which has better generalization capability, is based on detecting and tracking points on both the foreground object and the background to infer the 3D coordinates of the object.

• Developed a novel DelightCNN model to handle overexposed bright illumination challenges, aiming at enhancing object detection performance. The main objective of this model is to take video frames suffering from bright sunlight and synthesize normalized frames with ambient-like lighting conditions.

• Proposed a novel iterative CNN model for light normalization in underexposed dark images. The objective of this model is to take any test image and improve the underexposed dark regions while keeping the well-lit regions untouched. The main goal of this model is also to improve object detection in such scenarios, including nighttime detection with RGB cameras.

1.3 Organization

This thesis is outlined as follows. In Chapter 2, we describe how CFs were used in detecting multiple small objects in real time. We also explain how we handled the specular reflections, which were causing a high false-alarm rate, by proposing a post-refinement process on the CF outputs. In Chapter 3, we present a multiplexer-based CNN method which controls the flow of the CV system of a real-world problem based on estimated physical measurements. Five different detection CNNs are proposed, each with a specific task in the system. In Chapter 4, we introduce the fifth detection CNN used in the multiplexer-based CNN method from Chapter 3. The goal of this CNN is to perform 3D detection of the object, which yields valuable information for the system in understanding the object's localization in 3D space. Finally, in Chapters 5 and 6, we propose two methods for normalizing light in outdoor images undergoing illumination challenges. The first method, in Chapter 5, helps recover overexposed light patterns in images through our proposed joint-filter CNN method named DelightCNN. The second method, in Chapter 6, helps recover underexposed dark images through our proposed iterative CNN method. We finally evaluate our proposed methods on real-world problems such as detecting pedestrians and trailer couplers at nighttime, as well as delighting the backseat of a vehicle while driving.

Chapter 2
Detecting small objects using correlation filters

2.1 Problem statement

Based on statistics from the Fisheries and Aquaculture Department [1], aquaculture is growing at a very high rate internationally, and its contribution to the world's total fish production reached 42.2% in 2012, up from 25.7% in 2000. The fish feeding process is one of the most important aspects of managing aquaculture tanks, where the cost of fish feeding is around 40% of the total production costs [14].
Monitoring several aquaculture tanks with highly populated fish is a challenging task. Many researchers adopt a telemetry-based approach to study fish behavior [12, 16]. In addition, some scientists prefer a computer vision (CV)-based approach for fish monitoring [17, 27, 79, 87, 95]. Unfortunately, all these studies are conducted at a small scale, i.e., a small number of fish in small tanks. Compared to fish behavior, excess feed detection is rarely addressed except in [54], where feeding control is achieved by estimating fish appetite. However, the tank in [54] is also small and the fish are easily segmented.

By collaborating with an active aquaculture fish farm, we have developed a CV-based automated feeding control system. A video camera is placed above the water surface of a highly dense fish tank with ∼10,000 fish, as shown in Fig. 2.1. The camera captures only part of the water surface due to the large tank size. Videos are directly transferred to a host computer that performs immediate analysis on the state of fish behavior. Moreover, the system is also programmed to take immediate action in stopping the feeding process when needed.

Figure 2.1: Given the video input, our system performs real-time monitoring and feeding decision for a highly dense fish tank.

In this work, we present an efficient CV system to continuously monitor fish eating activity, detect excess feed, and automatically control the feeding process. We will skip monitoring fish eating activity, since it is outside the scope of this dissertation; interested readers can refer to [5] for a detailed explanation. To detect the amount of feed floating on the water surface, we propose a novel two-stage approach. First, a correlation filter learned in a supervised manner is applied to the test frame in order to detect every individual feed. Second, a Support Vector Machine (SVM) classifier is deployed as a refinement step on the correlation filter output, which attempts to suppress falsely detected feed while preserving true feed. Furthermore, we propose to detect feed in an optimum local region only, rather than the entire frame, whose accuracy and efficiency are both less than ideal. Using the particle filter technique, the local region is estimated by maximizing the correlation between the number of locally detected feed and that of true feed in the entire frame. Finally, based on continuous measurements from fish activity and feed detection, various actions take place to control the feeding process.

This work makes the following contributions: 1) a fully automated aquaculture monitoring system that controls feeding for a highly dense fish tank, 2) an accurate measure of fish activity and a continuous detection of excess feed from an optimum local region utilizing CFs, and 3) the video dataset and labels that are publicly available for future research.

Figure 2.2: The architecture of our feeding control system.

2.2 Proposed method on feed detection

Monitoring the behavior of fish eating activity, along with making sure the fish are provided the correct amount of feed, is the main goal of this work. The process is illustrated in Figure 2.2. The blue branch of the figure can be ignored since it is not related to object detection; for more information about the activity classification, please refer to [5]. Accurately estimating the amount of excess feed floating on the water is a critical component for any intelligent aquaculture system. However, detecting individual feed is very challenging due to the tiny feed size, the partial submersion of feed in the water, and overexposed specular light reflections. Further, feed detection should be conducted in real time for immediate feeding control. The efficiency challenge is attributed to the contrast between the large frame size (1080 × 1960 pixels) and the tiny feed size (∼30 pixels), i.e., a huge number of local candidates to be classified as feed vs. non-feed. These challenges motivate us to develop a carefully designed feed detector with three components: 1) a correlation filter is used to detect all possible feed, 2) a classifier built on handcrafted features suppresses non-feed from the first component, and 3) a local region is searched to maximize computational efficiency and accuracy. We now discuss each component in detail.
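Before the individual components are detailed, the following minimal sketch illustrates how they are intended to fit together for a single frame: CF candidate detection, SVM refinement, counting feed inside the optimum local region, and the rule-based control of Table 2.1 (Section 2.2.4). The function signatures, the injected callables, and the feed-count threshold are illustrative assumptions, not the actual interface of the thesis implementation (which was written in Matlab).

```python
# Illustrative per-frame flow of the feed detector; names and thresholds are placeholders.
from typing import Callable, List, Tuple
import numpy as np

Point = Tuple[int, int]

def feeding_action(frame: np.ndarray,
                   local_region: Tuple[int, int, int, int],      # (row, col, height, width)
                   cf_detect: Callable[[np.ndarray], List[Point]],
                   is_true_feed: Callable[[np.ndarray, Point], bool],
                   fish_active: bool,
                   machine_on: bool,
                   high_feed_count: int = 50) -> str:
    r, c, h, w = local_region
    region = frame[r:r + h, c:c + w]                              # optimum local region (Sec. 2.2.3)
    candidates = cf_detect(region)                                # correlation-filter peaks (Sec. 2.2.1)
    feed = [p for p in candidates if is_true_feed(region, p)]     # SVM refinement (Sec. 2.2.2)
    feed_is_high = len(feed) >= high_feed_count
    # Rules of Table 2.1 (Sec. 2.2.4): stop on excess feed, resume for hungry, active fish.
    if feed_is_high and machine_on:
        return "turn_off"
    if not feed_is_high and fish_active and not machine_on:
        return "turn_on"
    if not feed_is_high and not fish_active and machine_on:
        return "turn_off"
    return "no_change"
```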
2.2.1 Correlation Filter for Feed Detection

Correlation filters have been widely used in many applications such as object detection, recognition, tracking, and alignment [8, 10, 33, 42, 52]. Since the correlation is processed in the Fourier domain as element-wise multiplication, these filters have attracted many researchers' attention for visual detection and tracking due to their remarkable computational efficiency. The main challenges of this work are: (a) the small size of the feed, which is partially submerged in the water; (b) specular light reflections off the water surface, which confuse the system with false alarms; and (c) the real-time requirement for controlling the feeding process and avoiding the dispensing of excess feed. To address these challenges, we would like to efficiently and accurately rule out the majority of non-feed candidates while preserving most true feed. The correlation filter (CF) is chosen for this purpose due to its proven success in object detection [10, 33].

In the context of object detection, correlation filters are 2D templates that are applied to every location in an image using 2D convolution. The goal of the template is to produce a strong response only in regions of the image that correspond to the template. The most important part lies in making the right template. Most contemporary filters are designed to minimize the difference between the response of the filter on the training set and a desired output. In a sense, the peak of the desired output marks a positive training sample, and the rest of the training image provides a large set of negative samples. This is different from discriminative classifiers, which require explicitly defined positive and negative training samples. Over the past few decades, many variations of the filter design have been proposed, but for the sake of brevity we only explain the method used for feed detection. Specifically, we adopt the unconstrained scalar feature approach [8], which is learned by minimizing the average mean square error between the cross-correlation output and the desired correlation output over all training images. This is accomplished by controlling the shape of the correlation output between the entire image and the filter. Thus, the correlation filter design is formulated as the following optimization problem,

$$\min_{\mathbf{h}} \; \frac{1}{N}\sum_{i=1}^{N} \big\| \mathbf{x}_i \oplus \mathbf{h} - \mathbf{g}_i \big\|_2^2 + \lambda \|\mathbf{h}\|_2^2, \qquad (2.1)$$

where $\mathbf{h}$, $\mathbf{x}_i$, $\mathbf{g}_i$, $\lambda$, and $\oplus$ are the CF, the visual features of the $i$-th image, the desired output, the regularization weight, and convolution, respectively. Here a boldface character denotes that the matrix is represented as a column vector. By converting to the frequency domain, Eqn. 2.1 has the following closed-form solution,

$$\hat{\mathbf{h}} = \left[ \lambda \mathbf{I} + \frac{1}{N}\sum_{i=1}^{N} \hat{\mathbf{X}}_i^{\dagger} \hat{\mathbf{X}}_i \right]^{-1} \left[ \frac{1}{N}\sum_{i=1}^{N} \hat{\mathbf{X}}_i^{\dagger} \hat{\mathbf{g}}_i \right], \qquad (2.2)$$

where $\hat{\cdot}$ is the FFT operation, $\hat{\mathbf{X}}$ is the diagonal matrix with $\hat{\mathbf{x}}$ on its diagonal, $\dagger$ is the conjugate transpose, and $\mathbf{I}$ is the identity matrix. A set of $N$ $L \times L$ local patches with true feed in the center is used as the training images. For efficiency, the raw intensity is used as $\mathbf{x}$. Given a test image $\mathbf{x}_t$, locations where the convolution output $\mathbf{x}_t \oplus \mathbf{h}$ contains peaks larger than a threshold $\tau$ are detected as candidate feed. We choose $\tau$ such that maximum true detection and minimal false alarms are achieved.
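To make the frequency-domain computation concrete, the following is a minimal NumPy sketch of the closed-form filter design in Eqn. 2.2 and the thresholded detection step. It is a simplified illustration under the raw-intensity setting above; the thesis implementation was in Matlab, and preprocessing details are omitted.

```python
# Minimal sketch of CF training (Eqn. 2.2) and candidate detection; illustrative only.
import numpy as np

def train_cf(patches, desired_outputs, lam=0.01):
    """patches: (N, L, L) raw-intensity training patches with feed centered;
    desired_outputs: (N, L, L) 2D Gaussians peaked at the feed location."""
    X_hat = np.fft.fft2(patches)                  # FFT of each training patch
    G_hat = np.fft.fft2(desired_outputs)
    # Because each X_hat_i is diagonal, the closed form reduces to element-wise division.
    numerator = np.mean(np.conj(X_hat) * G_hat, axis=0)
    denominator = lam + np.mean(np.conj(X_hat) * X_hat, axis=0)
    return numerator / denominator                # \hat{h}, the filter in the frequency domain

def detect_candidates(region, h_hat, tau):
    """Convolve a test region with the learned filter via the FFT and
    return the pixel locations whose response exceeds the threshold tau."""
    h = np.real(np.fft.ifft2(h_hat))              # filter back in the spatial domain
    response = np.real(np.fft.ifft2(np.fft.fft2(region) * np.fft.fft2(h, s=region.shape)))
    return np.argwhere(response > tau)
```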
2.2.2 SVM Classifier for Refinement of False Alarms

The SVM classifier is focused on distinguishing between feed and non-feed. Therefore, the same group of training images used to build the correlation filter is used again as positive training images. For negative samples, false-alarm patches are obtained. Both positive and negative samples are used to extract features, represented by $\{v_i\}_{i\in[1,N]}$, where $i$ is the training sample index and $N$ is the total number of positive and negative samples.

Feed is easily distinguishable by identifying its color. Thus, a group of features is extracted from the training images based on color properties in the RGB color space, after which K-means clustering is performed on their Cartesian representation. The resultant $d_c = 20$ code words, $[s_1, s_2, \ldots, s_{d_c}]$, represent clusters that cover a wide range of colors for both feed and false alarms. For any given correlation output $C$, we extract patches $P_i$ centered at each high peak, representing either feed or a false alarm, where $i$ is the peak index. For every pixel in $P_i$, we locate the nearest color code that represents it. Thus, for every $P_i$, we generate a $d_c$-dimensional Bag-of-Words (BoW) histogram

$$f_c(d) = \sum_{(u,v)\in P_i} \delta\big(d = \arg\min_{d} \|P_i(u,v) - s_d\|_2\big),$$

where $\delta$ is the indicator function. This histogram is normalized as $\frac{f_c - \min(f_c)}{\max(f_c) - \min(f_c)}$. Another group of features is extracted from the Histogram of Oriented Gradients (HOG) descriptor, which captures the orientation information of the object located in the image. We follow the approach of [19] in obtaining the HOG descriptor using 9 orientations. Due to the normalization in every window, a 36-dimensional feature vector $f_h$ can be obtained by summing the results of all similar orientations. In some cases, the HOG features alone will not be able to distinguish feed from false alarms, because a false alarm might appear to have the shape of feed. On the other hand, the BoW features based on RGB colors alone will also encounter difficulties in some cases, where the feed might appear with color features similar to the negative samples due to lighting conditions. By combining both features, $f = [f_c, f_h]$, the resulting feature vector can overcome these cases. Thus, a group of 56 features is extracted from the positive and negative patches, which are used for learning the SVM classifier.
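As an illustration of the 56-dimensional feature described above — a 20-bin color Bag-of-Words histogram concatenated with a 36-dimensional summed HOG descriptor — a possible sketch is shown below. The library choices (scikit-learn KMeans/SVC, scikit-image hog), the HOG cell size, and the patch handling are assumptions standing in for the thesis's Matlab implementation; scikit-learn's SVC wraps LibSVM, matching the classifier used here.

```python
# Sketch of the colour BoW + HOG feature and the SVM refiner; parameters are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from skimage.feature import hog

def learn_color_codebook(train_patches_rgb, d_c=20):
    pixels = np.concatenate([p.reshape(-1, 3) for p in train_patches_rgb])
    return KMeans(n_clusters=d_c, n_init=10).fit(pixels).cluster_centers_   # code words s_1..s_dc

def bow_histogram(patch_rgb, codebook):
    pixels = patch_rgb.reshape(-1, 3).astype(float)
    dists = np.linalg.norm(pixels[:, None, :] - codebook[None, :, :], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(codebook)).astype(float)
    return (hist - hist.min()) / (hist.max() - hist.min() + 1e-8)           # min-max normalization

def patch_feature(patch_rgb, patch_gray, codebook):
    f_c = bow_histogram(patch_rgb, codebook)                                # 20-D colour BoW
    blocks = hog(patch_gray, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), feature_vector=False)
    f_h = blocks.sum(axis=(0, 1)).reshape(-1)                               # 2x2 cells x 9 bins = 36-D
    return np.concatenate([f_c, f_h])                                       # 56-D feature f = [f_c, f_h]

def train_refiner(features, labels):
    return SVC().fit(features, labels)                                      # LibSVM-backed SVM
```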
2.2.3 Locating the Optimum Local Region

The search for an optimum representative local region relies on two observations: (1) detecting feed in a sub-region is far more efficient than searching the entire frame, and (2) the feed tends to gather in some portions of the image more than others; hence, we need to find a local region where the feed is well represented throughout all frames of a video. In our work, we address this problem using the concept of particle filters.

A group of $K$ windows is represented by their locations, sizes, and weights as $\{c_i^k, w_i^k\}_{k\in[1,K]}$, where $c_i^k$ denotes the location and size of window $k$ for frame $i$ and $w_i^k$ is the weight of that window. All $K$ windows are initially distributed uniformly throughout the frame for the process of finding the best location and size of the window. A group of $n$ labeled frames is used to determine the optimum location. The weight of every window is computed using Pearson's correlation coefficient,

$$w_i^k = \frac{\sum_{i=1}^{n} (T_i^k - \mu_T)(G_i - \mu_G)}{\sqrt{\sum_{i=1}^{n} (T_i^k - \mu_T)^2}\,\sqrt{\sum_{i=1}^{n} (G_i - \mu_G)^2}}, \qquad (2.3)$$

where $T_i^k$ is the true detection rate within the local region, $\mu_T$ is the mean true detection over the $n$ labeled frames, $G_i$ is the ground-truth number of feed in the entire global frame, and $\mu_G$ is the mean ground truth over the labeled frames. At every iteration, the Pearson's correlation coefficients are computed to assign each window a weight. Windows with lower weights are less likely to be selected in the following iteration, while windows with higher weights have a higher chance of having their regions revisited in the next iteration. After a number of iterations, we expect all $K$ windows to have converged to a specific region where the weights are highly correlated. The particle-filter update is performed by resampling based on the weights achieved in the previous iteration. This results in a new set of windows of the same size as in the previous iteration, located in regions with higher potential of yielding a larger Pearson's correlation coefficient.
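The following small sketch shows one iteration of this search: each window's weight is the Pearson correlation of Eqn. 2.3 between its per-frame local feed counts and the global ground-truth counts, after which windows are resampled in proportion to their weights. The clipping of negative weights and the perturbation of resampled locations are illustrative assumptions rather than the exact procedure used in the thesis.

```python
# Sketch of the Pearson-weighted particle-filter update for the local-region search.
import numpy as np

def window_weight(local_counts, global_counts):
    """Pearson correlation (Eqn. 2.3) between local detections and global ground truth."""
    t = np.asarray(local_counts, dtype=float) - np.mean(local_counts)
    g = np.asarray(global_counts, dtype=float) - np.mean(global_counts)
    return float((t * g).sum() / (np.sqrt((t ** 2).sum() * (g ** 2).sum()) + 1e-12))

def resample_windows(centers, weights, rng, jitter_std=10.0):
    """Resample window centers in proportion to their (non-negative) weights,
    then perturb them slightly so regions near well-correlated windows are explored."""
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)
    idx = rng.choice(len(centers), size=len(centers), p=w / w.sum())
    jitter = rng.normal(scale=jitter_std, size=(len(centers), 2))
    return np.asarray(centers, dtype=float)[idx] + jitter
```

Here `rng` would be a `numpy.random.Generator` (e.g., `np.random.default_rng(0)`), and the window size is held fixed across iterations, as in the text.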
2.2.4 Automatic Control of the Feeding Process

The main purpose of classifying the fish behavior at every frame, as well as detecting the amount of excess feed on the surface of the tank, is to be able to automatically adjust the feeding process without the need for any human interference. The goal is to maintain a stable and accurate amount of feed provided to the fish tank. The system is designed to automatically control when to switch the feeding machine on and when to turn it off. Based on the results obtained from both the two-class classification of fish activity and the feed detection system at every frame, a continuous decision needs to be made on whether to stop or continue the feeding process. We arrive at the rules listed in Table 2.1, which represent critical instances that require feeding to stop or continue. If any of the cases listed in the table takes place, an immediate action must be taken.

# of feed | Fish Active | Feeding Machine | Action
High      | Yes         | On              | Off
High      | No          | On              | Off
Low       | Yes         | Off             | On
Low       | No          | On              | Off

Table 2.1: Guideline for controlling the feeding process.

Figure 2.3: Feed detection procedures with each column being one local region of 150 × 150 pixels: (a) the original image with green circles indicating labeled ground-truth feed, (b) the CF output (green and red), where the red squares are false alarms, and (c) the results of the SVM classifier in a binary image where the white regions are the final detected feed. Note the reduced false alarms from (b) to (c).

2.3 Experimental results

Our dataset consists of 21 videos of a top-view aquaculture fish tank. These videos were captured at 10 FPS, with 1080 × 1960 pixels and an average length of 5,684 frames. The first 20 videos are captured under normal circumstances. The last video exhibits a huge amount of excess feed, since the feeding machine is intentionally switched on for a longer period of time. To evaluate the feed detection, we manually label feed in n = 12 frames randomly taken from the video at different stages of the feeding process. We conduct the labeling twice, and only the feed labeled in both trials is claimed as true feed. The number of true feed ranges from 22 to 856 per frame, with a total of 4,485 feed.

2.3.1 Feed Detection

We set the parameters as N = 2,000, L = 25, τ_l = 229, τ = 0.53, and g_i is a 2D Gaussian centered at the target's location with a variance of 2 and peak amplitude of 1. The default parameters in LibSVM are used for SVM learning.

Figure 2.4: Comparison of feed detection (true detection rate vs. normalized false alarm for the local and global CF, with and without the SVM refinement).

Figure 2.4 compares the results of the CF in the local region alone vs. having an SVM refinement classifier follow the CF. The Normalized False Alarm (NFA) is the number of falsely detected feed divided by the number of true feed. Remarkably, the refinement classifier reduces the number of false alarms by nearly 50%, while maintaining a similar true detection rate. For example, one good operating point on the ROC has a detection rate of 90.8% at an NFA of 0.3. Further, the results of operating on the entire frame are much worse than on the local region. Finally, we also employ the SVM classifier for feed detection without first applying the CF. It can detect 85.3% of feed, but the NFA is considerably high at 4.5, not to mention the much lower efficiency. The superiority over this baseline demonstrates the excellent accuracy and efficiency of our two-step approach. An illustration of the feed detection procedure is shown in Fig. 2.3. Columns 1-2 are successful at detecting all feed with no false alarms. Columns 3-4 have missed detections, but no false alarms. Columns 5-8 illustrate variations of false alarms.
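For clarity, the two quantities plotted in Fig. 2.4 can be computed along the lines of the sketch below, which counts a detection as true if it falls within a small radius of an unmatched labeled feed. The matching radius and the greedy matching rule are assumptions, since the exact matching criterion is not spelled out here.

```python
# Sketch of the evaluation metrics behind Fig. 2.4; the matching rule is assumed.
import numpy as np

def detection_rate_and_nfa(detections, ground_truth, match_radius=5.0):
    """Return (true detection rate, Normalized False Alarm) for one labeled frame."""
    gt = np.asarray(ground_truth, dtype=float)
    matched = np.zeros(len(gt), dtype=bool)
    false_alarms = 0
    for det in np.asarray(detections, dtype=float):
        if len(gt) > 0:
            d = np.linalg.norm(gt - det, axis=1)
            j = int(np.argmin(d))
            if d[j] <= match_radius and not matched[j]:
                matched[j] = True
                continue
        false_alarms += 1
    n_true = max(len(gt), 1)
    return matched.sum() / n_true, false_alarms / n_true   # NFA = false detections / true feed
```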
2.3.2 Local Region Estimation

The number of particles for localizing the optimum local region is 100. Since the results of the particle filter depend on the initialization, we repeat this experiment three times with different initial sizes of the local regions.

Figure 2.5: Local region optimization (maximum w^k vs. iteration number for initial window sizes of 200 × 200, 300 × 300, and 400 × 400).

The maximum w^k for the three runs is shown in Fig. 2.5. Note that the final iterations of all runs achieve a similar weight of 0.97, due to the huge overlap in the final optimum local region. The optimum local region is found to be centered at (224 ± 3, 256 ± 4), with a size of 258 × 258. The fact that all three runs converge to the same local region is a strong indication that the global optimum of this optimization is reached.

Figure 2.6: Comparison of normalized feed (local region at initialization, local region at the final iteration, ground truth, and global detection) across the labeled frame indices.

To illustrate the effectiveness of the particle filter, we plot four signals: the ground-truth feed in the entire frame G_i, T_i when feed detection is applied to the entire frame, and T_i^k with the maximum w^k at the initialization and at the final iteration. To compensate for different data ranges, we plot the normalized feed, $\frac{T_i^k - \mu(T^k)}{\mu(T^k)}$, in Fig. 2.6. Compared to the initialization and the global feed detection, the feed estimate at the optimum local region has the highest correlation with the ground-truth feed. Therefore, feeding control based on the local region is almost the same as feeding control based on the true feed of the entire frame.

2.3.3 Computational Efficiency

The computational efficiency is an important metric for any computer vision system. We evaluate the efficiency using a Matlab implementation on a conventional Windows 8 desktop computer with an Intel i5 CPU at 3.0 GHz and 8 GB RAM. The efficiency of feed detection depends on several factors, such as the size of the local region and the number of candidate feed passed to the SVM classifier. The total time for the CF step in the optimum local region is 0.006 seconds. The refinement classifier requires 0.004 seconds to extract features and classify a single candidate feed resulting from the CF. The average total time to detect feed in the local region is 0.085 seconds. In summary, our entire system operates at 5+ FPS, which also includes fish activity classification. With a future C++ implementation, we believe that our system can operate in real time on a conventional PC.

2.4 Conclusion

A fully automatic system is developed to understand fish eating behavior in a highly dense aquaculture tank. The ability to classify whether the fish are actively consuming feed, along with the continuous detection of excess feed, provides valuable information for feeding control in the tank. Correlation filters are an excellent candidate for detecting small feed-like particles, since they were able to detect objects with very limited appearance (∼30 pixels in area) at very high accuracy. However, many false alarms can appear due to specular reflections on the water surface, which can have features similar to feed particles.

Chapter 3
Detecting 2D trailer couplers using convolutional neural networks

3.1 Problem statement

Trailers range in size from small utility and boat trailers to large box trailers or recreational vehicle (RV) trailers. RV trailers alone had a revenue of five billion dollars in 2015 [32] and are estimated to exceed 381 thousand units sold in the United States in 2016 [65]. To hitch with a vehicle, trailers have a coupler at the front that is placed over a ball connected to the vehicle at the rear. Small, lightweight trailers may be manually moved into place. However, for heavy trailers, the vehicle must be driven backwards to connect to the stationary trailer. Traditionally, a second person, called a spotter, stands outside the vehicle to instruct the driver. Even with rear-view cameras on modern vehicles, the task is difficult and tedious due to the small size of the coupler on the screen. An automated system can take the place of the spotter and allow a single person to connect to a trailer with ease.

Engineers have developed numerous advanced vehicle systems providing security and comfort for drivers, e.g., emergency braking, blind spot assist, lane recognition, and active park assist. Current systems for trailers only provide assistance after they are hitched to a vehicle, such as maneuvering in narrow spaces [55], preventing large articulation angles [64], and avoiding oscillation at high speeds [22]. To the best of our knowledge, no other automated system exists for backing up towards a trailer.

Figure 3.1: Automatic trailer hitching by detecting and tracking the coupler using a Multiplexer-CNN.
This chapter presents an automated computer vision system using a single rear-view camera to hitch the vehicle to a trailer, as in Fig. 4.9. There are three main challenges, or sources of visual variation: position, trailer, and environment. Position refers to the pose and scale of the trailer. The trailer varies in type, shape, color, and size of both the trailer and the coupler. The environment affects the background of the scene through asphalt, dirt, grass, or snow, and the lighting from sun, clouds, or nighttime. Furthermore, there are strong performance requirements needed to successfully hitch the trailer. When hitching, the 2D coupler estimate should have an error less than the radius of the ball, e.g., 2.2 cm for a standard ball, and the efficiency should exceed 10 FPS. Assuming the driver parks the vehicle with the trailer in the field of view of the camera, this work presents a Multiplexer-CNN based system to automatically detect and track the trailer's coupler and continuously provide 3D coordinates to control a vehicle, while relying solely on a monocular rear-view fish-eye camera. Specifically, the goal of our system is to estimate both the 2D location in pixels and the 3D location in meters. An automatic control system may consume the 3D estimate to mechanically control the vehicle's movement. However, the mechanical system is outside the scope of this work. For convenience, we have dedicated Chapter 3 to 2D detection of the coupler only, and in Chapter 4 we will discuss how we localize the trailer's coupler in 3D.

The Multiplexer-CNN system selects from five CNN architectures to perform detection/tracking operations. The current estimate of the coupler position drives the CNN selection. Each CNN is invoked independently based on the estimated distance between the vehicle and the coupler. When no estimate is known, the DCNN (Detection network) detects the coupler by estimating potential locations along with confidence scores. We develop a novel loss function to enable DCNN to learn accurate confidence measures, along with regression estimates of the coupler position. With a confident estimate of the trailer position, the multiplexer selects among three networks, TCNN1, TCNN2, and TCNN3 (Tracking-via-detection networks), to perform 2D tracking by detection until the coupler is centered over the ball. During this time, the 3D location is inferred using a calibrated distance map from a fixed coupler height. As the coupler approaches the ball, it is crucial to estimate the height to avoid collision. The fifth network, CCNN (Contour network), estimates the coupler's contour, from which the height is regressed and the distance map is adjusted to provide accurate 3D estimation. All details regarding CCNN and height estimation can be found in Chapter 4.

Data is key to learning an accurate Multiplexer-CNN. We introduce the first large-scale dataset for trailers, with 899 videos containing ∼712,000 frames. We demonstrate the accuracy and efficiency of the system using a regular PC. Our system fulfills the performance requirements for successful hitching by achieving an estimation error of 1.4 cm when the ball reaches the coupler, while running at 18.9 FPS. Qualitatively, we show the ability of our system to detect and track unseen trailers in the field, generalizing to handle a large variety of challenges. In summary, our main contributions are: 1) Develop a novel loss function along with a CNN-based object detector, which estimates coupler coordinates associated with confidence scores.
2) Design a distance-driven Multiplexer-CNN that achieves both generalization across large variations and real-time efficiency. 3) Develop a method to estimate the 3D coupler coordinate with a monocular camera. 4) Present a large dataset for trailer coupler detection and tracking.

3.2 Convolutional neural networks on object detection

The deformable part model (DPM) [31] was the state-of-the-art object detector for many years, before the rapid emergence of CNNs. Object detection has since witnessed remarkable progress with CNNs, where a regression problem is often formulated to localize object bounding boxes with high accuracy and efficiency. Sermanet et al. [75] propose a regression network for detection where classification confidence helps aggregate proposed bounding boxes. However, it exhaustively searches the image and is not suitable for real-time applications. Girshick et al. [35] propose R-CNN, using object proposals generated by selective search. This model has been extended to Fast R-CNN [34] and Faster R-CNN [72]. In Faster R-CNN, the region proposal network (RPN) generates a set of rectangular object proposals, each with an objectness score measuring membership to an object class, after which the proposal is assigned a class-specific confidence score. Note that both the objectness score of the RPN and the classification confidence are trained separately from localization and are thresholded to produce a binary detection decision. In contrast, we incorporate the confidence score into the learning process and use it as a scalar without thresholding. Our confidence score reflects two major cues: (1) the existence of the target object in a region, and (2) how accurate the 2D target detection is.

Figure 3.2: An automated computer vision system for coupler detection, tracking, and 3D localization (a video frame is processed by DCNN with N1 patches for coupler detection, by the selected TCNNi with N2 patches for coupler tracking, and by CCNN with N3 patches for height and contour estimation; the distance map then yields the 3D estimate (xt, yt, zt) in meters).

3.3 Proposed method on 2D coupler detection

Our Multiplexer-CNN system has five CNN inputs: DCNN, TCNN1, TCNN2, TCNN3, and CCNN. As shown in Fig. 3.2, our system consists of three stages: (1) 2D coupler detection, (2) 2D coupler tracking, and (3) 3D coupler localization for vehicle automation. Stage 1 initializes the 2D coordinate of the coupler for Stage 2. Stages 2 and 3 collaborate in estimating both the 2D and 3D coupler positions. Stage 3, along with CCNN, will be explained in Chapter 4.

3.3.1 Preprocessing for Geometric Interpolation

Our rear-view camera has a fish-eye lens with a wide field-of-view (FOV). Using a checkerboard, standard camera calibration is performed to estimate the camera intrinsics, extrinsics, and lens distortion parameters, based on which we unwarp the input frame to correct the fish-eye effect. In our Multiplexer system, the coupler-to-vehicle distance (CVD) is crucial in serving as the selector to choose one CNN among the five. Thus, it is important to estimate the CVD accurately and efficiently. Instead of employing SfM for distance estimation [78], we propose to rely on a distance map covering the entire frame. This distance map can be obtained as follows.
Given an unwarped frame with a checkerboard placed on flat ground, we estimate the camera rotation and translation matrices, which can convert a single pixel (u, v) to a 3D world coordinate and measure the distance between the camera and the target pixel (the green point in Fig. 3.3). This step is repeated for all pixels in the unwarped frame. Given the known (fixed) camera height h0, solving a simple triangle problem converts the camera-to-pixel distance to the origin-to-pixel distance, where the origin of the 3D world coordinate is the projection of the camera onto the ground plane ((0, 0, 0) in Fig. 3.3). The origin-to-pixel distances of all pixels constitute the distance map on the ground plane, D0. Finally, given that the coupler is at a fixed but unknown height hc, we elevate D0 to obtain the coupler distance map Dh by simply applying Dh = (1 − hc/h0) D0. In our system, we assume hc = 50 cm when the CVD is over 1 meter, and otherwise estimate it using the method in Chapter 4. For different vehicles with different camera heights h0, the camera should be recalibrated to generate a new ground distance map D0. We can do this for all possible heights offline, and assign one to a vehicle during manufacture.

Figure 3.3: 3D distance estimation of the coupler (the camera at height h0 above the origin point (0, 0, 0), the coupler at height hc with coordinate (x, y, z), and the ground and coupler distance maps D0 and Dh).

Our distance map method has a few advantages. First, we only use a single camera, without additional sensors/cameras, to retrieve 3D depth information. Second, Dh can be updated efficiently to accommodate changes in the coupler height. Finally, the CVD is obtained by a simple look-up of Dh(u, v). However, our flat-ground assumption might affect CVD estimation if the ground is not flat. Fortunately, this has the most impact when the trailer is far away, and much less influence while approaching it, since close regions become locally flat. Note that the critical distance is the last CVD meter, when the coupler is near the vehicle.

3.3.2 Coupler Detection

When the system starts, we assume the trailer is positioned between three and seven meters from the rear of the vehicle. This is a reasonable assumption for our application. If the trailer starts too close, the vehicle may not have enough room to align with the hitch. If it is too far away, the coupler cannot be detected. We train a dedicated network, DCNN, to detect the coupler in the frame along with providing a confidence score for the estimation.

DCNN network The architecture of DCNN is in Tab. 3.1. Each convolution layer is followed by a ReLU and a maxpool layer. The key to DCNN is the novel confidence loss that allows the detector to balance the regression accuracy with the detection confidence accuracy. These scores reflect the existence of the target object within the spatial enclosure of the training patches, as well as how accurate the 2D regression is. The confidence loss is

L = Σ s̄ ||∆ū − ∆u||² + Σ λ1 ||s − 2s̄ (1 − sigm(λ2 ||∆ū − ∆u||²))||²,   (3.1)

where ∆u and s are the estimated 2D offset and confidence score, and ∆ū and s̄ are the ground truth. The loss function has two parts. The first part is the Euclidean loss, i.e., the regression error for a positive coupler patch, enabled by s̄. For negative patches, where s̄ = 0, the regression loss is ignored so that learning focuses only on the confidence score. The second part is the confidence score loss.
For negative patches, it penalizes nonzero scores, since the system should have no confidence in them. For positive patches, the confidence should be negatively correlated with the regression error, i.e., more accurate predictions of ∆u should have higher confidence. We pass the regression error through a sigmoid function and subtract it from 1, restricting the confidence to between 0 and 1.

Table 3.1: CNN architectures.
DCNN: input (200 × 200) → conv (7 × 7 × 20) → maxpool (2) → conv (7 × 7 × 30) → maxpool (2) → conv (5 × 5 × 40) → maxpool (2) → FC-100 → FC-3 → Confidence loss.
TCNN1,2,3: input (224 × 224) → conv (3 × 3 × 20) → conv (3 × 3 × 20) → maxpool (3) → conv (3 × 3 × 40) → conv (3 × 3 × 40) → maxpool (3) → conv (3 × 3 × 60) → maxpool (2) → conv (3 × 3 × 80) → maxpool (2) → conv (3 × 3 × 100) → FC-100 → FC-2 → Euclidean loss.
CCNN: input (200 × 200) → conv (7 × 7 × 20) → maxpool (3) → conv (5 × 5 × 40) → maxpool (3) → FC-100 → FC-76 → Euclidean loss.

We obtain training image patches to learn DCNN. The positive patches are obtained from the videos when the trailer is in the range of d1 to d2, and are assigned confidence scores s̄ = 1, along with the coupler offset ∆ū. Random perturbation is applied to the patches such that the center of the coupler is less than |∆ū| < w/4 away from the patch center, where w is the patch size. The negative patches are randomly selected in the surrounding area of the coupler such that the coupler center is far away from the patch center, |∆ū| > w/4, and are assigned scores of zero.

Coupler detection algorithm An illustration of the coupler detection algorithm is in Fig. 3.2. Given the initial trailer location within [d1, d2] meters, we place a grid G containing N1 evenly distributed points covering the region, as shown in the first column of Fig. 3.4. These points represent the center locations of N1 test patches, which are processed by DCNN. These patches have different sizes based on their estimated CVD dt, such that the patches in the top row of G have the smallest sizes, whereas those in the bottom row have the largest. This is motivated by the desire to feed the CNN training data with smaller scale variations. Otherwise, if we were to select a fixed patch size, e.g., 200 × 200, the patches would contain a lot of background when the trailer is far away, and nearly no background when it is nearby. We follow a simple formula to decide the patch size via the CVD,

w = −(45/2) Dh(ut−1, vt−1) + 300.   (3.2)

Here the maximum patch size is 300 × 300 when the trailer is zero meters away, and 75 × 75 when it is 10 meters away. Once the scale is determined, the patches are resized to 200 × 200. Since many patches overlap with the target coupler, it is likely that a number of patches will have high confidence scores. We update each grid point with its estimated location and repeat the process iteratively on the same initial frame, until the points cluster around potential couplers. In this particle-filter-like approach, we adopt a weighted sum of Gaussians for the final detection estimate. Every grid point is replaced with a 2D Gaussian (100 × 100 kernel size with σ = 30) weighted by its confidence score. The final detection is the maximum of the summation of the weighted Gaussians from all grid points. The estimated confidence scores of correct points increase as more iterations are performed. In general, we observe that four iterations are sufficient for satisfactory accuracy, as seen in Fig. 3.4.
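To make the CVD-driven patch sizing (Eqn. 3.2) and the weighted-Gaussian aggregation concrete, a minimal NumPy/SciPy sketch follows. The function names are our own, the distance maps are assumed to be stored as per-pixel arrays indexed as [row, column], and the Gaussian smoothing approximates the 100 × 100, σ = 30 kernel described above.

```python
# A minimal sketch (illustrative names, not the released implementation) of the
# distance-map elevation, CVD-driven patch sizing, and weighted-Gaussian aggregation.
import numpy as np
from scipy.ndimage import gaussian_filter

def elevate_distance_map(D0, h_coupler, h_camera):
    """Elevate the ground distance map D0 to the coupler height: Dh = (1 - hc/h0) * D0."""
    return (1.0 - h_coupler / h_camera) * D0

def patch_size(Dh, u, v):
    """Eqn. 3.2: w = -(45/2) * Dh(u, v) + 300, i.e., 300 px at 0 m and 75 px at 10 m."""
    return int(round(-22.5 * Dh[v, u] + 300))

def aggregate_detections(points, scores, frame_shape, sigma=30):
    """Replace each grid point with a confidence-weighted 2D Gaussian and return
    the location of the maximum of their sum (the final detection)."""
    heat = np.zeros(frame_shape[:2], dtype=float)
    for (u, v), s in zip(points, scores):
        heat[int(v), int(u)] += s          # impulse weighted by the confidence score
    heat = gaussian_filter(heat, sigma)    # spread each impulse into a Gaussian
    v_best, u_best = np.unravel_index(np.argmax(heat), heat.shape)
    return u_best, v_best
```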
3.3.3 Coupler Tracking

In coupler tracking, the coupler appearance changes throughout the backing-up process, which can be described in three stages: (a) Initially, the coupler is far away and often hard to discern due to its small size. (b) As the trailer gets closer, the coupler appears increasingly larger. (c) Within the last meter of backing up, the coupler appearance changes dramatically due to the increasing downward viewing angle of the camera. Therefore, we propose to use three networks, TCNN1, TCNN2, and TCNN3, to perform tracking by detection, where each operates in a predefined range, defined by d1 and d2.

Figure 3.4: Iterative coupler detection using DCNN (initialization, iterations 1, 2, and 4). The top row is the input image with the results of (∆u, s); green means s = 1 and red means s = 0. The bottom row shows the results of the weighted sum of Gaussians. Best viewed in color.

Tracking CNN networks The network architecture of TCNN is in Tab. 3.1. The first 10 layers are similar to the VGG network [68], with minor changes in the number of filters and maxpool layers. The full network is optimized using training images of trailer couplers. Unlike most object tracking works, which estimate a bounding box, we are only interested in finding the center of the coupler. Therefore, we define the Euclidean loss on the 2D coupler center position. A large number of training images are used to learn all three TCNN networks. Similar to DCNN, we apply random perturbation to the training patches such that the coupler center is less than w/4 away from the patch center. We also follow Eqn. 3.2 to crop the local region with the CVD-dependent size into a training patch.

Coupler tracking algorithm We adopt a tracking-by-detection method where TCNN serves as the detector. We use the tracking result of the previous frame to initialize the current frame. For stable tracking, we apply N2 randomly perturbed patches surrounding the initialization. The tracking result is obtained by averaging the estimations of all N2 patches. Two thresholds, τ1 and τ2, define the three ranges in which each of the three TCNNs operates.

Figure 3.5: Statistics of the trailer coupler database.

3.4 Trailer Coupler Database

With a rear-view camera on a vehicle, the videos capture the process of a vehicle backing up towards a trailer, from a variable distance to the point where the hitch ball is aligned with the trailer coupler. The database contains 899 videos consisting of ∼712K frames, with an average length of 19.3 seconds. The videos are collected at 40 FPS with a resolution of 1,920 × 1,200 using M-JPEG compression. A Point Grey camera (Model: BFLY-PGE-23S6C-C) with a fish-eye lens is used, with a wide FOV of 190°. The database contains the three types of challenges explained in Sec. 3.1. Fig. 3.5 shows some statistics of the database. Nearly 3/4 of the database is obtained from RV trailers, as they are the most popular and available. Even if the trailer type introduces differences in shape and size, they all have similarities in the coupler itself. Typically, people back up to the trailer in a straight line with a centered pose; however, we collect videos with various poses to make the system robust to any pose situation. The database was collected over a period of one year, which covered different weather conditions and ground types. Some examples can be seen in Fig. 4.6. There are three types of labels in the dataset.
1) Coupler label set: the 2D coordinate of the coupler center is labeled at every 10th frame of all 899 videos. 2) Contour label set: the 2D coordinates of 37 coupler contour points are labeled for a set of 810 frames, collected from 162 videos, when the CVD is 1.0, 0.8, 0.6, 0.4, and 0.2 meters. 3) Height label set: the coupler height is physically measured at the site when 72 videos are captured. These 72 videos are a subset of the 162 videos of the contour label set, which means we also have contour labels along with the height labels.

3.5 Experimental results

In this section, we will discuss the experimental setup, report the quantitative results of all three main stages separately and then jointly as a whole system, and finally show qualitative results of the entire system.

3.5.1 Multiplexer-CNN Setup

3.5.1.1 Network Implementation Details

We train and test DCNN and TCNN using the first 827 videos, using only the coupler label set. The videos are divided into 413 training videos and 414 testing videos. Given that most trailers have 2∼3 videos captured at different poses, we make sure that each unique trailer does not exist in both the training and testing sets. To train and test CCNN, we use all videos which contain contour-labeled frames. The videos are divided into 90 training videos with 450 labeled coupler contour frames, and 72 testing videos with 360 labeled coupler contour frames. The networks are trained with a learning rate of 0.001 and mini-batch sizes of 100, 20, and 20 for DCNN, TCNN, and CCNN, respectively. Our system uses the following parameters: τ1 = τ2 = 3, τ3 = 1, N1 = 100, N2 = N3 = 10, λ1 = λ2 = 1.

Figure 3.6: Detection errors vs. distance in meters (DCNN w/o score vs. DCNN).

3.5.2 Results

3.5.2.1 Coupler detection

For coupler detection, the baseline is similar to DCNN with the exact same network structure except for the loss function, which is a normal Euclidean loss, i.e., it only estimates ∆u without the confidence score s. Hence, the coupler detection algorithm remains the same, except that the sum of weighted Gaussians is replaced with a sum of Gaussians. Fig. 3.6 reports the results of both methods at various initializations of the CVD in the range of 3∼7 meters. This experiment shows the advantage of learning a confidence score over typical regression detectors. A clear margin of 20∼45 pixels in error is due to incorporating confidence score learning into DCNN. To visualize the effectiveness of our confidence scores in DCNN, we demonstrate the correlation between the estimated scores and the 2D offset estimation errors. Given six randomly selected testing videos, we collect a total of 600 pairs of (u, s) after running DCNN at a CVD of seven meters for four iterations. As seen in Fig. 3.7, the scores have an inverse correlation with the Euclidean estimation errors, i.e., detections with lower errors have higher scores. The correlation coefficient is found to be −0.65, capturing the strong correlation between the two.

Figure 3.7: Confidence scores vs. Euclidean estimation errors.

Note that some large offset estimation errors might still have a high score. This false alarm is caused by the patch drifting away from the coupler, where it detects some background object and mistakes it for the target coupler. An example is found in iteration 1 of Fig. 3.4, where some points illustrate this noisy behavior.
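The inverse correlation above is precisely what the second term of Eqn. 3.1 encourages during training. As a concrete illustration, a minimal PyTorch sketch of the confidence loss is given below; the thesis implementation uses MatConvNet, so the function name and tensor layout here are illustrative assumptions.

```python
# Minimal sketch of the confidence loss in Eqn. 3.1. Per batch of patches:
# pred_offset, gt_offset are (N, 2) 2D offsets; pred_score, gt_score are (N,) with
# gt_score = 1 for positive patches and 0 for negative ones.
import torch

def confidence_loss(pred_offset, pred_score, gt_offset, gt_score, lam1=1.0, lam2=1.0):
    reg_err = ((pred_offset - gt_offset) ** 2).sum(dim=1)   # ||du_bar - du||^2 per patch
    reg_term = (gt_score * reg_err).sum()                   # regression loss, positives only
    # Target confidence: 0 for negatives, 2*(1 - sigm(lam2 * error)) in [0, 1] for positives.
    target_score = 2.0 * gt_score * (1.0 - torch.sigmoid(lam2 * reg_err))
    conf_term = lam1 * ((pred_score - target_score) ** 2).sum()
    return reg_term + conf_term
```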
3.5.2.2 Coupler tracking

We compare our TCNN with two baseline object tracking methods, KCF [43] and C-COT [21]. Both methods have excelled on many tracking benchmarks, e.g., C-COT won the 2016 VOT challenge. To compare with the baselines, we provide three initializations to both baselines, but only one initialization to our method. For the baseline tracking initialization, we define a 100 × 100 bounding box centered around the coupler, at three CVD locations d0, d1 = 7, and d2 = 3 meters. Here d0 is the CVD of the first video frame, with an average of 10 meters in our database. We report the resulting x-y plane error in meters at specific CVDs in Fig. 3.8, 2D pixel errors at specific CVDs in Fig. 3.9, and precision plots in Fig. 3.10. We observe that both KCF and C-COT perform well right after initialization. However, both baseline methods suffer from drifting problems due to the extreme scale variations, which induce substantial appearance variation caused by changes in perspective.

Figure 3.8: Tracking comparison with errors in meters (TCNN, KCF, C-COT).
Figure 3.9: Tracking comparison with errors in pixels (TCNN, KCF, C-COT).
Figure 3.10: Tracking comparison with precision plots (TCNN, KCF, C-COT).

To the best of our knowledge, this challenge is rarely addressed in any of the tracking benchmark studies. Using three TCNN networks is specifically justified by the large appearance variation of the trailer coupler as the CVD converges to zero meters. We have experimentally studied the effect of using different numbers of TCNN networks, as seen in Fig. 3.11, with the final error at zero CVD meters in parentheses.

Figure 3.11: Number of TCNN networks used for tracking (final errors at zero CVD: 1 TCNN: 19.9 cm, 2 TCNNs: 4.2 cm, 3 TCNNs: 2.4 cm, 4 TCNNs: 2.3 cm).

With an increasing number of TCNN networks, higher tracking accuracy can be achieved, yet the system complexity increases. Thus, we choose three TCNN networks to balance accuracy and complexity.

3.6 Conclusion

In this chapter, we present a computer vision system which is capable of detecting and tracking the coupler at various distances. One of the key contributions of this system is the ability to detect the trailer using DCNN through a new loss function producing confidence measures associated with regression estimates. The TCNN networks have demonstrated superior detection accuracy as well as efficiency, especially in the close CVD range. The next chapter is a continuation of this chapter, where we will present a complete automated computer vision system for backing up a vehicle towards a trailer by providing accurate 3D coordinates of the coupler.

Chapter 4 3D trailer coupler localization using convolutional neural networks

4.1 Problem statement

The problem statement is the same as in Chapter 3. In this chapter, we propose to complete the Multiplexer-CNN based system for detecting and tracking the trailer's coupler and continuously providing 3D coordinates to control a vehicle, while relying solely on a monocular rear-view fish-eye camera, as seen in Fig. 4.9. Specifically, the goal of this chapter is to estimate the 3D location of the coupler in meters. Our motivation for finding the 3D location of the coupler is three-fold.
(1) An automatic control system will consume the 3D estimate to mechanically control the vehicle's movement. (2) To avoid collisions between the vehicle and the trailer. (3) By dedicating a separate method to height estimation via contour estimation, we can further improve the detection rates from Chapter 3. This is because knowledge of the general contour of the target coupler makes it easy to estimate the center of the coupler, i.e., the center of the contour. In the following sections, we will first explain the algorithm for estimating the coupler height and 3D localization based on coupler contour estimation. After that, we will explore some limitations introduced by the coupler shape, which directly cause the CCNN network to fail. We will further explore a generalized second approach for estimating the 3D coupler location regardless of the coupler shape. Finally, we will report results based on both approaches separately.

4.2 Height Estimation and 3D Localization

The motivation for estimating the coupler height is two-fold. 1) To control the vehicle mechanically, the vehicle control system requires a precise 3D location of the coupler, (x, y, z), which is the range, offset, and height in meters. This demands the distance map at the true height of the coupler. Hence, we need to estimate the coupler height rather than using the assumed height. 2) The vehicle has a hitch ball set at a fixed height. We need to ensure that the coupler is high enough to avoid colliding with the hitch ball. It is challenging to estimate the coupler height from a monocular camera. For fixed-size objects like license plates, the depth can be estimated from the plate's pixel size [15]. However, couplers vary significantly in their shape. To address this problem, we discover that the geometric shape of the coupler contour is indicative of the height, e.g., at a fixed CVD, increasing the height of a coupler will spread out the contour points. Therefore, we propose to estimate the coupler contour using CCNN. This allows us to extract contour geometric features and feed them to regressors to estimate coupler heights, as detailed in Alg. 1.

4.2.1 Coupler contour network

The network architecture of CCNN is in Tab. 3.1. Given the small number of contour labels, we learn a shallow network of 2 convolution layers and 2 FC layers, with a ReLU and a maxpool layer after each convolution layer. CCNN defines the Euclidean loss on the coupler contour, represented by a 76-dim vector c, i.e., 38 points, where 37 points are on the contour and the last one, (c(75), c(76)), is the coupler center.

Figure 4.1: Geometric features of a contour include distances along straight lines and slopes of the red lines.
Figure 4.2: Estimation of xt and yt on Dh elevated by the estimated height zt. The red dot is the estimated coupler; the green dot has zero offset from the origin point.

4.2.2 Contour estimation

Similar to tracking, to improve estimation stability, our contour is estimated using candidate contours of N3 patches extracted with random perturbations, each obtained by CCNN. Due to the high-dimensional output, there are normally a few outliers among the candidate contours. Therefore, we propose to use a shape model [63] to remove the outliers, i.e., contours with unusual coupler shapes. Based on labeled contour training images c̄, we compute a mean shape c̄0 and five basis shapes P, such that any candidate contour c can be represented as a coefficient vector b = P^T (c − c̄0).
If any candidate contour's b does not conform to the normal coefficient distribution learned from the training data, the contour is ignored when making the final mean estimation.

4.2.3 Height estimation

Given a stable contour estimation, we extract geometric features to capture the 2D shape of the coupler, as shown in Fig. 4.1. Specifically, we uniformly sample five points along the contour, including the two end points. We compute the Euclidean distance between every two points, resulting in 10 features, i.e., the red and green lines. We further compute the slopes of the red lines, resulting in four features. Thus, a 14-dim feature vector is extracted from the contour estimation of CCNN. Given five sets of training images, each having couplers at a specific CVD, we utilize their feature vectors to learn five height estimators {Ri}, i = 1, ..., 5, via the bagging M5P regressor [3, 86]. Our analysis shows bagging M5P to be superior to other well-known regression paradigms.

4.2.4 3D coupler localization

A detailed algorithm for 3D coupler localization is in Fig. 4.3. Given the current 2D coupler location (ut, vt) estimated via TCNN, we find the CVD dt utilizing the distance map Dh at the assumed height of 50 cm. However, when dt is less than τ3 = 1 meter, CCNN is activated to estimate the contour, followed by the height estimation, for each video frame. Then a refined CVD dt can be retrieved after two updates: first, the distance map Dh is elevated to the estimated coupler height hc; second, the 2D coupler location (ut, vt) is refined by averaging the TCNN result with the coupler center estimation from CCNN. Independent of whether dt is less than τ3, we need to convert the CVD dt to the 3D coupler location (xt, yt, zt) for the purpose of vehicle control. To find xt and yt, we solve a simple triangle problem on the distance map at the coupler height, where xt and yt are the two sides forming the 90° angle, as seen in Fig. 4.2, and the third side is the CVD dt. As for zt, it is either the assumed height of 50 cm if dt > τ3, or otherwise the estimated height hc.

Figure 4.3: 3D coupler localization. (a) An illustration of the 3D localization process (results of TCNNi → extract N3 patches → CCNN → ASM elimination constraints → mean contour result → 14-dim feature vector → height zt). (b) The exact algorithm followed in 3D coupler localization.

4.3 Experimental results

In this section, we will evaluate the contour estimation method first, and then analyze the height estimation using CCNN. We will further evaluate the overall system accuracy and efficiency.

4.3.1 Parameter setting

ValidShapeTest(b) in Alg. 1 returns true if all elements of b are within three standard deviations of the coefficient distributions. The five height estimators {Ri}, i = 1, ..., 5, are trained from contours at distance ranges of (0.9, 1.1], (0.7, 0.9], (0.5, 0.7], (0.3, 0.5], and (0.1, 0.3] meters, respectively.

Figure 4.4: Contour estimation comparison (precision vs. threshold in Euclidean distance error for CCNN, CCNNP, and ASM).

Table 4.1: System efficiency. Time (s): DCNN 0.088, TCNN 0.023, CCNN 0.010, Alg 1 0.471, Alg 2 0.053, Alg 3 0.035.

4.3.2 Height estimation

Contour estimation is crucial to our height estimation.
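As a concrete illustration of the contour-to-feature step in Sec. 4.2.3, a minimal sketch is given below. It assumes the 37 contour points are ordered along the coupler, and it takes the four segments between consecutive sampled points as the lines whose slopes are used; the exact choice of red lines in Fig. 4.1 may differ.

```python
# Minimal sketch: sample five points along the 37-point contour, take all 10
# pairwise distances, and add four segment slopes to form the 14-dim feature.
import numpy as np
from itertools import combinations

def contour_features(contour):
    """contour: (37, 2) array of (u, v) contour points ordered along the coupler."""
    idx = np.linspace(0, len(contour) - 1, 5).astype(int)    # 5 samples incl. both ends
    pts = contour[idx]
    dists = [np.linalg.norm(pts[i] - pts[j]) for i, j in combinations(range(5), 2)]
    slopes = [(pts[i + 1, 1] - pts[i, 1]) / (pts[i + 1, 0] - pts[i, 0] + 1e-6)
              for i in range(4)]
    return np.array(dists + slopes)                          # 10 + 4 = 14-dim feature
```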
We first compare our CCNN-based contour estimation with two baselines: ASM contour fitting [63], and CNN-based polynomial coefficient fitting inspired by [58], which used curve functions to describe the facial contour. For the classic ASM method, we learn a 2D ASM model to iteratively fit a shape to the coupler contour. For the second baseline, we learn a CNN named CCNNP, similar to CCNN with the same structure, except that instead of producing a regression output of 38 contour points, CCNNP estimates an 8-dim vector (αi, βi) for i = 1, ..., 4. Here αi and βi are the third-degree polynomial coefficients of the x-coordinates and y-coordinates of the contour points, respectively. We report the Euclidean distance error by measuring the shortest path from the estimated point to the ground-truth contour in Fig. 4.4. At a fixed threshold of 20 pixels, our CCNN has a precision rate of 92%, compared to 75% and 63% for CCNNP and ASM, respectively.

Given the coupler contour and its geometric features, we can estimate the height. We analyze the performance of the five height estimators through a 5-fold cross validation on the 72 videos of the height label set. The mean absolute errors are 0.85, 0.75, 0.60, 0.54, and 0.45 cm for R1 through R5, respectively. We observe higher performance for both contour and height estimation the closer we get to the coupler. Given the estimated height, we adjust the distance map, based on which we retrieve the CVD using the coupler center estimated by CCNN, i.e., (c(75), c(76)). Then we perform the 3D coupler localization using the same triangulation as in Fig. 4.2. The blue curve in Fig. 4.5 shows the height estimation accuracy using the height label set of 72 videos.

4.3.3 Overall system test

We report the results of the entire system, using components from all three stages, in Fig. 4.5. We use the height label set, because the labeled height (i.e., its elevated distance map) and the labeled coupler center can provide the ground-truth 3D coupler locations in each video frame. Note that the 2D coupler center is estimated by fusing TCNN3 and CCNN, as in Line 17 of Alg. 1. The minimum error is found in the offset estimation, followed by the height and range estimations. The small offset error is due to the fact that most backups are along the frontal angle, and therefore the error on the x-y plane is almost the same as the range error. The final estimation error on the x-y plane is merely 1.4 cm. While not directly comparable due to different datasets, it is substantially smaller than the TCNN3 error of 2.4 cm in Fig. 3.8. This x-y plane error is the most important accuracy metric for our system. At 1.4 cm, the vehicle control system would drive the vehicle to park and stop at a point where the hitch ball is only 1.4 cm away from the coupler center. The fact that most couplers have a radius of 2.2 cm means that users can easily lower the coupler and hook up with the hitch.

Figure 4.5: Overall system accuracy on the height label set (range error ∆x, offset error ∆y, height error ∆z, 3D error, and x-y plane error vs. CVD).

4.3.4 System efficiency

Our system is implemented in MATLAB using MatConvNet [83] on an Intel Core i7-4770 CPU at 3.40 GHz and a single NVIDIA TITAN X GPU. The system can run in real time, obtaining frames from the rear-view fish-eye camera, or offline, using the trailer coupler database for analysis. Table 4.1
provides a detailed efficiency analysis per CNN network and per system stage, where Alg 1, Alg 2, and Alg 3 are the detection, tracking, and localization of the coupler when tested with N1 (for one iteration), N2, and N3 patches, respectively. Note that Alg 1 requires nearly 1.88 seconds to perform the 4 iterations during initialization of the system. While tracking the coupler, we can achieve 18.9 FPS. However, when the CVD passes τ3, the system drops in speed to 11.4 FPS due to running both TCNN and CCNN.

Figure 4.6: Qualitative results of five videos. The first column shows the DCNN results, the three middle columns the TCNN results, and the last column the CCNN result. A red + is the estimated result, a yellow ◦ is the ground truth, and a green × is the estimated contour. Above each frame is (CVD, meter error, pixel error); the last column also shows the estimated height in meters. The red rectangles indicate the failure cases.

4.3.5 Qualitative results

Figure 4.6 shows qualitative results of detecting and tracking the coupler for full video sequences. We illustrate five different video examples, representing typical challenging cases in the database. We show a few failure cases of the system obtained from different stages of the videos. Note that the fourth and fifth columns use the same frame to illustrate the results of TCNN and CCNN, respectively. One key observation is that our Multiplexer-CNN approach has the ability to overcome failure cases at various stages of the system. For example, if TCNN fails within the last few frames, as in the last row of Fig. 4.6, CCNN has a good chance of correcting the problem. The same observation is made when the Multiplexer-CNN switches between any two possible networks.

Figure 4.7: Boat trailer couplers. The bottom row shows examples of cases where the contour estimation will fail, and hence the height and 3D localization will fail as well.

4.4 A generalized approach for 3D coupler localization

Using CCNN to estimate the contour points was proven to work well based on Fig. 4.5. However, all of the coupler types used in this evaluation generally have the same round shape. This type of coupler is commonly found in more than 95% of trailers, such as the ones in the first row of Fig. 4.7, while the remaining 5% have other shapes, such as those in the bottom row. These couplers are usually found on lightweight trailers, such as boat and utility trailers. A simple solution is to retrain CCNN with more coupler training images, such as the ones in Fig. 4.7. However, since these types of couplers are rare, it would be very hard to collect more data, and retraining CCNN with a limited amount of data would lead to over-fitting. Therefore, in order to come up with a generalized solution which is capable of functioning across all types of trailers, we propose to estimate the height without the need for contour estimation via CCNN. More specifically, we propose a new method to detect and track points on the coupler as well as points on the ground.
By detecting and tracking points on the ground, we can estimate the amount of shift on the ground in meters using the distance map. We propose to use scale-invariant feature transform (SIFT) matching [59] to find matching points on the ground, as seen in Fig. 4.8. Note that in this figure we also show some matching points on the coupler in the red rectangle, which are ignored when computing the ground shift. Technically, the amount of shift on the ground should be equivalent to the amount of shift of the coupler.

Figure 4.8: SIFT matching to find the amount of shift on the ground plane.

4.4.1 Height estimation algorithm: A new method

During tracking with TCNN3 in the last-meter range, we expect to continuously estimate and update the coupler height using the new method. Our new method follows these steps:

• Obtain frames It and It−1 when the CVD is less than 1 meter. We experimentally discovered that selecting adjacent frames produces very poor results, due to the small changes, which make it hard to find a clear distance measurement for both the ground and the coupler. To find the best distance between the two key frames, we test several scenarios with different real measured distances between the frames, as seen in Fig. 4.9. We observe that a very small or very large gap always gives bad results. On the other hand, a fixed gap in the range of 20 ∼ 30 cm gives the best performance. Note that this is made possible by keeping track of the real distance estimated for the coupler at every frame over the last meter. This is needed to find the best second frame candidate to perform SIFT matching and ultimately estimate the height zt.

Figure 4.9: Finding the best distance between the two key frames for applying SIFT and estimating zt.

• Detect and track points on the ground. To do this, we first extract a fixed patch of size 300 × 150, located in a fixed region in the lower left of the frame, i.e., the region to the left of the hitch ball of the truck. The local region and the SIFT parameters were manually selected by trial and error during a live demo session at an RV trailer site.

• Detect SIFT points and apply a SIFT matcher across the two extracted local regions from It and It−1.

• Find the average shift (∆û, ∆v̂) of the ground points, and estimate this shift in meters, dg, using the ground distance map D0.

• Find the average shift (∆u, ∆v) of the coupler 2D estimation, and find dc accordingly.

• Our motivation in this step is to find how much we need to elevate D0 such that dc − dg = 0. We were able to find a closed-form solution (a code sketch of this update is given below):

zt = h0 (1 − (D0(ût−1, v̂t−1) − D0(ût, v̂t)) / (D0(ut−1, vt−1) − D0(ut, vt))),   (4.1)

where D0 is the ground distance map, h0 is the camera height from the ground, (û, v̂) is the 2D detection of a ground point via SIFT, and (u, v) is the 2D detection of the coupler from TCNN3.

• To ensure a smooth estimation of the height, we average the estimates over a sliding window of 25 frames.

4.4.2 Experimental results

For all of the following reported results on height estimation, we still utilize the entire system, using components from all three stages as described in Chapter 3. Note that the errors in the x-y plane, ∆x and ∆y, should not change from the previous method in Fig. 4.5; however, since the previous algorithm updates the x-y plane coordinate based on the height, some minor changes do occur. Evaluating the system on the same 72 height-labeled videos, we obtain the results in Fig. 4.10.
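For reference, a minimal sketch of the closed-form update in Eqn. 4.1 follows; the distance-map indexing and argument names are our own illustrative conventions, and in the full system the per-frame estimates are further averaged over the 25-frame sliding window described above.

```python
# Minimal sketch of Eqn. 4.1. D0 is the ground distance map indexed as [row, column];
# ground_* are matched SIFT ground points (u, v) and coupler_* the TCNN3 coupler
# estimates (u, v) in two key frames roughly 20-30 cm apart; h0 is the camera height.
def estimate_coupler_height(D0, ground_prev, ground_curr, coupler_prev, coupler_curr, h0):
    """Return z_t = h0 * (1 - ground_shift / coupler_shift_on_ground_map)."""
    ground_shift = D0[ground_prev[1], ground_prev[0]] - D0[ground_curr[1], ground_curr[0]]
    coupler_shift = D0[coupler_prev[1], coupler_prev[0]] - D0[coupler_curr[1], coupler_curr[0]]
    return h0 * (1.0 - ground_shift / coupler_shift)
```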
Even though the results are slightly worse than those of the old method, i.e., contour estimation with CCNN followed by height regression, we argue that this trade-off is worthwhile: the new height estimator works across any type of coupler, whereas the old method only works on round couplers. To compare the old and new methods side by side and highlight the generalization capability of the new method, we create a new validation set containing 40 videos. This new set contains various types of couplers, including different shapes. The results of this experiment can be seen in Fig. 4.11. In this figure, we also show the results of height estimation when tracking is replaced with the ground-truth coupler center label of every frame. In this experiment, the new method has the lead, since the testing set is more challenging for the old method.

Figure 4.10: 3D analysis on the 72-video height dataset using the new method.
Figure 4.11: Comparing the height estimation of the CCNN approach versus the new method using SIFT matching points.

We also evaluate the height estimation on site in a real-time demo, where we report only the result of ∆z in Fig. 4.12. A total of 30 different trailers were tested, with coupler heights ranging from as low as 38 cm up to 55 cm. As seen in Fig. 4.12, we evaluate the height estimation throughout the last CVD meter. Based on our observations, the height estimation has large errors at the beginning of the process (mainly in the range of 0.5 ∼ 1.0 meter). After that, the height estimation becomes more reliable (especially in the last 0.25 meters). The final mean absolute error at zero distance to the coupler was found to be 2 ± 1.2 cm.

Figure 4.12: Estimating zt using the generalized approach.

4.5 Conclusion

We present an automated computer vision system for backing up a vehicle towards a trailer, using a distance-driven Multiplexer-CNN. While relying solely on a monocular rear-view fish-eye camera, we are able to provide accurate 3D coordinates of the coupler, which are needed for vehicle control. One of the key contributions of this system is the ability to detect the trailer using DCNN through a new loss function producing confidence measures associated with regression estimates. Our quantitative and qualitative results on the collected large-scale trailer database demonstrate our system's ability to be integrated into any vehicle with a rear-view camera. This work also illustrates how successful vision systems are built to meet real-world needs. From a technical perspective, other applications, such as autonomous driving, may benefit from our components, e.g., the confidence loss function, the distance map, and the Multiplexer-CNN.

Chapter 5 Detecting objects in overexposed bright illumination conditions

5.1 Problem statement

Because intelligent CV systems have undergone rapid growth over the past couple of decades, research on accurately detecting objects in outdoor scenes is growing in importance. Existing research using RGB cameras has mainly focused on methods for detecting objects in controlled environments or during normal ambient light conditions. Even though we, as humans, encounter challenging illumination conditions in our everyday lives, this aspect has been largely neglected when designing CV systems.
The illumination challenges can be categorized into two classes: overexposed bright images, with light projected onto the target object, and underexposed dark images, captured in the absence of light sources. Examples of the first class include detecting any object with overexposed illumination properties, such as the feed particles on the water surface of an aquaculture fish tank, as mentioned in Chapter 2. Examples of the underexposed dark class include driving at night with very limited light in the scene, or surveillance systems attempting to detect faces or pedestrians during nighttime. Additional examples of both challenging illumination classes can be seen in Fig. 5.1.

Figure 5.1: Challenging illumination examples (feed particle, coupler, vehicle, trailer, pedestrian, ID card). All examples in the left column represent light overexposure in several applications. The right column represents examples of underexposed images in dark light conditions.

Note that the two illumination challenges have different impacts on the images in terms of appearance. In the case of overexposed light, the light source intensity makes the target objects very bright, and in some cases important bright parts of an image are washed out, or effectively all white, making the object indistinguishable. Examples of this case can be seen in the left column of Fig. 5.1. On the other hand, the appearance of objects in underexposed dark images is generally hard to discern compared to good-lighting images. In this case, the intensities of the object's pixels and edges are degraded, making it difficult to separate the object from the background of the image. Given that the underexposed and overexposed illumination challenges have completely opposite effects on the images, it is nearly impossible to design a single CV system which can handle both objectives at the same time [29]. Therefore, we propose two separate methods, one for each case. The remainder of this chapter considers only the overexposed challenges and is organized as follows: an introduction to the overexposed bright light problem, along with the several datasets used in this chapter, including their details and the motivation for collecting them; then our proposed method, including estimating the light direction and using joint-filter CNNs to normalize the light in overexposed images; and finally, quantitative and qualitative results on a real-world problem. In the following chapter, we will propose a method to handle the underexposed low-light illumination challenge separately.

5.2 Introduction to the overexposed light challenge

Over the past decade, engineers have developed numerous advanced vehicle systems for autonomous driving, providing security and safety for drivers. These systems operate on the outside world, such as detecting vehicles on the streets and pedestrians on sidewalks. Recently, researchers have been developing CV systems for drivers that operate inside the vehicle, such as recognizing driver behavior [89], monitoring gaze and tracking hands [50], and monitoring passengers [91]. CV systems for in-vehicle perception are just as important as those for outside perception in providing comfort, safety, and security to drivers. In this chapter, we are interested in detecting objects in the backseat of a vehicle while driving.
More specifically, given frames captured at a top-view angle of the backseat of a vehicle, we would like to make object detection feasible regardless of the lighting conditions, as seen in Fig. 5.2. This problem is accompanied by many challenges, among which the illumination challenges are the most difficult. First, when sunlight is projected onto the backseat, the intensities of the objects become washed out, prohibiting the ability to localize and even recognize objects.

Figure 5.2: We present an approach for synthesizing good-lighting frames (bottom row) from videos taken in extreme light conditions (top row).

Second, since the vehicle is in motion, the illumination challenges vary accordingly, and the objects in the vehicle may also move, which makes object tracking and detection very difficult. Finally, since the bright sunlight projected onto the backseat has high pixel intensity values, cameras tend to balance the global frame intensities, resulting in very dark regions where no sunlight exists. In this section, we propose to continuously delight the frames, producing visually appealing frames and potentially enhancing object detection performance, as seen in the second row of Fig. 5.2.

The main motivation of this work is to make object detection possible under this challenging illumination scenario. From an application standpoint, several systems can benefit from having a delighted frame of the backseat of a vehicle. For example, the driver can be notified of objects he or she might have forgotten or lost. Another example relates to driver safety for transportation services, such as Uber or taxi drivers; in such cases, detecting knives or guns would alert the driver or even the police. Another application is monitoring children, e.g., infants, in the backseat.

Given the success of deep networks at related tasks such as image super resolution, depth upsampling, and noise reduction, we propose to use CNNs to synthesize appealing images by normalizing the light in the scene. Joint-filter CNN methods [56] will be utilized, where our hope is that an appropriately designed guided CNN can learn image delighting. However, training such a CNN requires a very large dataset of outdoor sunlight images, as seen in the top row of Fig. 5.2, with corresponding good-lighting images to serve as labels. Unfortunately, such a dataset does not currently exist: capturing the light image pairs, i.e., target and label light images, requires significant time and effort, so acquiring one is not feasible. Our insight is to exploit videos captured of the scenario in Fig. 5.2 by driving the vehicle for long periods of time, capturing various light conditions. Our hypothesis is that, given a bad-lighting frame in a video, a good-lighting or semi-good-lighting pair can be found temporally, such that it has better illumination properties. Note that when driving, the light condition changes dramatically from one frame to another. To fully understand the illumination properties of a scene, such as the existence of light and its direction, we design a CNN-based method which can identify the light situation for any given frame. More specifically, this CNN can identify the direction of all light sources within an image and, more importantly, help in detecting good ambient-light images to be paired with neighboring bad-light images.
Therefore, every single frame of a video will have a paired image, found temporally, such that the light directions of the two are quite different. After overcoming the data collection challenge, we propose DelightCNN, which uses the paired frames to supervise a CNN that filters good-lighting images from bad ones. We adopt the joint-filter based CNN method [56] in learning the network architecture, which consists of three sub-networks, as shown in Fig. 5.7. The first two sub-networks, DelightCNNT and DelightCNNG, act as feature extractors to determine informative features from both the target and guidance images. These feature responses are then concatenated as inputs for the network DelightCNNF, which selectively transfers common structures and reconstructs the filtered good-lighting output. Here, DelightCNNT takes the target bad-lighting image paired with a second, temporally selected image, chosen via the light direction estimation CNN to have complementary lighting conditions. On the other hand, DelightCNNG takes an image to help guide the network in recovering the bad-light regions of the target image. Since the camera is static, we propose to use a background image of the top-view scene with no illumination defects and no objects residing in it. This provides guidance to the CNN in learning what the original background looks like under ambient lighting. Therefore, each training sample is composed of four images, i.e., two paired target images, one guide, and the ground-truth label image. As a result, the learned network allows recovery of plausible ambient-light frames. We will also demonstrate the impact using a CF detector and compare the results before and after delighting the scene.

In the following sections, we will first briefly introduce some of the prior work related to our study. After that, we will introduce the datasets collected. Then we will explain how we pair bad-lighting frames with good (or semi-good) lighting frames using the light direction CNN. Finally, we will introduce the proposed DelightCNN model.

5.3 Prior work

We review prior work based on the two main components of our method. First, we review methods for light direction estimation from images. Second, we review papers which attempt to enhance and filter images suffering from bad light conditions and other relevant scenarios. Since we use a guided-filter CNN approach, we also review joint filtering methods separately.

5.3.1 Light direction estimation

Light direction estimation from a single image is a topic well studied over the past decades. Many of the previous works handle synthetic objects only, such as [85], which presents a method for the detection and estimation of multiple directional illuminants, using a single image of any object with known geometry and Lambertian reflectance. However, we need to be able to estimate the light direction without prior knowledge of geometry and reflectance properties. Many other works use real-life images to estimate the light direction, such as [53], which proposes a method for estimating the likely illumination conditions of the scene. In particular, they compute the probability distribution over the sun position and visibility. The method relies on a combination of weak cues that can be extracted from different portions of the image: the sky, the vertical surfaces, and the ground. In our work, we learn a simple CNN-based method for estimating the light direction of a single image.
The CNN is forced to learn the direction of light by analyzing the shadows and sunlight projections in any image. A similar approach in [73] introduces a method for recovering the illumination distribution of a scene from the image brightness inside shadows cast by an object of known shape in the scene. The method combines illumination analysis with an estimation of the reflectance properties of a shadow surface. In our method, the features indicative of the light direction are learned by the CNN model, regardless of object shape and reflectance properties.

5.3.2 Enhancing lighting in images

The following works are examples of methods which aim at improving image lighting, either by removing unwanted intensity regions from an image or by improving the image as a whole.

5.3.2.1 Shadow removal

Shadow removal has been a widely studied topic in the literature [7, 38, 48], and generally includes two steps: shadow localization and shadow removal. However, in our problem the shadow region might cover most of the input image, whereas the shadows in the shadow removal literature correspond only to those generated by smaller objects; this makes our shadow regions very hard to localize. In [70], they propose to remove the shadow detection step by learning a CNN to estimate the shadow matte directly from the input image. This method would be more suitable to consider in the delighting process, but the regions neighboring the shadows are of extreme importance in generating a shadow matte and removing the shadow. The overexposed sunlight washes out these regions, making it hard to recover the shadowed areas.

5.3.2.2 Image dehazing

In this field, researchers attempt to estimate the optical transmission in hazy scenes given a single input image. The goal is to eliminate scattered light to increase scene visibility and recover haze-free scene contrast. Existing methods in image dehazing use various constraints and priors to obtain plausible dehazing solutions [30]. In [66], assuming the scene depth is available, atmospheric effects are removed from terrain images taken by a forward-looking airborne camera. More recently, CNN-based methods such as [13] have proposed a system named DehazeNet, which estimates the medium transmission of an image. This system takes a hazy image as input and outputs its medium transmission map, which is subsequently used to recover a haze-free image via the atmospheric scattering model.

5.3.2.3 Image relighting

Image relighting changes the illumination of an image to a target illumination effect without knowing the original scene geometry, object information, or illumination conditions. The authors in [46] perform outdoor scene relighting, which needs only a single reference image and is based on material-constrained layer decomposition. In [57], they take advantage of multiple color-plus-depth images to model the lighting and reflected light field in terms of spherical harmonic coefficients, recovering both illumination and albedo to produce realistic relighting. Such works, along with many other works in this field [47, 80], focus on changing the illumination at the pixel level such that the resulting images appear brighter. All relighting methods assume typical illumination scenarios; applying such methods to overexposed scenes cannot recover from the visual artifacts, which require synthesis to overcome.
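Before reviewing joint filtering methods, we note that the three-branch design outlined in Sec. 5.2 can be summarized by the following minimal PyTorch sketch; the layer counts and channel widths here are illustrative assumptions, not the actual DelightCNN configuration shown in Fig. 5.7.

```python
# Illustrative three-branch joint-filter sketch: two feature extractors whose
# responses are concatenated and fused. Layer sizes are assumptions only.
import torch
import torch.nn as nn

def branch(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 5, padding=2), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())

class JointFilterSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.target_net = branch(6)   # bad-lighting frame + its temporal pair (two RGB images)
        self.guide_net = branch(3)    # static background image under ambient light
        self.fusion_net = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))   # reconstructed good-lighting frame

    def forward(self, target_pair, guide):
        feats = torch.cat([self.target_net(target_pair), self.guide_net(guide)], dim=1)
        return self.fusion_net(feats)
```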
5.3.3 Joint filtering methods

One of the earliest families of joint image filters is the bilateral filter [51, 81], which computes the filtered output as a weighted average of neighboring pixels in the target image. The filter weights depend on the local structure of the guidance image and neglect the target. In contrast, our DelightCNN considers the contents of both images by extracting feature maps from each. Another class of joint image filtering is based on global optimization, where the objective function consists of a data fidelity term and a regularization term. The first term ensures that the filtered output is similar to the target image. Most approaches in this area differ in the regularization term, which is responsible for maintaining the structure between the guide and output images. Some works define the regularization term according to texture derivatives [23], or mutual structures shared by the target and guidance image [76], among many others. However, these methods rely on hand-designed objective functions that may not generalize across all real-life images. In contrast, our method learns how to selectively transfer details directly from real-life datasets.

In the past few years, researchers have explored joint image filtering with CNNs, where the network architecture contains two main branches, one serving as a guide branch and the other for the target image, as in our proposed DelightCNN in Fig. 5.7. A similar architecture in [26] performs optical flow estimation, where two frames are used for the target and guide branches. Similarly, [93] takes stereo images as input and generates a disparity map; the first branch computes the cost-volume and the other jointly filters the volume. In [56], this approach was used for several low-level vision tasks, such as image noise reduction, depth map upsampling, and super resolution. Their network architecture resembles that of [26], with the only difference being that the merging layer in [26] uses a correlation operator, while the model in [56] merges the inputs by stacking the feature responses. Note that our DelightCNN network also uses stacking. Moreover, in [56] the guide image is selected from a different modality, e.g., an RGB image guides a depth target. However, in most real-world applications a single modality is preferred, since not all systems have access to several sensors. In contrast, we leverage the light direction CNN model to select the guide and target images, ensuring that the guide branch will indeed guide the target branch in delighting.

5.4 Dataset collection

Our proposed solution to the overexposed lighting challenge utilizes two different datasets: the light direction dataset and the BOSCH vehicle backseat dataset. The following two sections explain the details of each dataset.

5.4.1 Light direction dataset

The motivation for collecting this dataset was to have sufficient training data covering various lighting scenarios for objects placed on a table surface. The camera was placed at a top-view angle to match the setup of the BOSCH dataset, which is explained in the following section. Sample images of this dataset can be found in Fig. 5.3. We collect data from 11 different objects that could be found in the backseat of any vehicle.
The objects were selected such that they all have different properties in terms of reflectance, color, size, and shape. We use four light sources with different lumen values; the lumen is the SI unit of luminous flux, equal to the amount of light emitted through a solid angle by a source of one candela radiating equally in all directions. In other words, lumens represent the brightness of the light source. We use one very bright light source of 720 lumens, two 300-lumen light sources, and the ambient lighting provided by the normal indoor office environment. For every data collection sample, we adjust four variables: (a) the number of light sources used, ranging from 1 to 3; (b) the polar angle θ, which ranges from 0 at the North Pole to π/2 at the Equator; (c) the longitude φ, also known as the azimuth, which ranges over 0 ≤ φ < 2π; and (d) the pose of the object. Note that θ cannot exceed π/2, since we assume the light source is always above the horizon. The physical distance between each light source and the center of the object in the image is always fixed, and equals the radius of the circle shown in Fig. 5.5, i.e., 53 cm. Therefore, the location of the light source can be represented on the top half of a spherical surface. We have collected a total of 264 images at a resolution of 1000 × 1000, as seen in Fig. 5.3, using a Logitech C920 webcam. Every object has a total of 24 images with different lighting setups. Note that the image in the bottom far right corner represents the ambient lighting case.

Figure 5.3: Examples of the light direction dataset.

5.4.2 BOSCH vehicle backseat dataset

This dataset was collected by BOSCH for the purpose of detecting objects undergoing overexposed sunlight challenges. The key motivation is to provide the user with a list of objects that might have been forgotten by the driver in the backseat. A total of 12 videos were collected, of which three had no objects on the backseat while the remaining nine had several objects. The dataset contains 20,747 frames in total, with an average of 1,729 frames per video. The number of objects in every video ranges from 6 to 8, including an ID card, wallet, gloves, car keys, bottle, an empty can, a phone charger, a tool, a dollar bill, and sunglasses. The videos were recorded while the car was moving, and the driver intentionally made several left and right turns to cause changes in light direction. Moreover, the videos were recorded in a city area containing buildings, bridges, and trees; therefore, even when the driver drives in a straight line, many variations in the overexposed sunlight occur. Several examples of this dataset can be found in Fig. 5.4. Every row in the figure was obtained from a specific video: the first row is a video with no objects in the backseat, while the remaining videos have objects. The left images are examples of good lighting frames, whereas the right column shows frames selected temporally after the good frame which have bad lighting.

Figure 5.4: Examples of the BOSCH database. The left column represents one of the good lighting frames in a video sequence. The right column represents an overexposed sunlight example from the same sequence.

When there is a very bright region within an image, cameras tend to automatically balance the distribution of intensities.
This ultimately makes the shadow regions much darker, as seen in the first two rows of Fig. 5.4.

5.5 Estimating light direction using a CNN model

To estimate the light direction using a CNN model, we had to collect a dataset representative of light direction, as seen in Fig. 5.3. We use only 10 objects for training the CNN; the 11th object is left for validation, giving a total of 240 training images. For training, we extract small local patches from the object by randomly perturbing the center of the image. We extract a total of 7,965 training patches of size 128 × 128 × 3 pixels, using data augmentation by flipping and scaling the original images.

The light direction dataset has labeled light source directions in terms of (θ, φ) and light source brightness. We utilize these labels to create an 18-dimensional feature vector representing the lighting setup, by dividing the surface of the 3D sphere surrounding the object into 18 regions, as seen in Fig. 5.5. Note that when a light source is placed on the surface of the sphere, we do not represent it as a single point; instead, it is represented as a circular region whose radius is defined by the glare shield surrounding the real light bulb, i.e., a 10 cm radius. If only one light source exists in the image, and this light source resides entirely inside the boundaries of one bin on the sphere, then the light direction feature vector of the image contains 17 zeros and a value of one in the corresponding bin, representing the origin of the light. All training patches extracted from this image share the same feature vector, which serves as the training label. If the light source location overlaps multiple bin regions, the value of 1 is distributed among those bins such that the summation of the vector remains equal to 1. Similarly, if n light sources are used, the summation of the vector equals n.

The network architecture of the light direction CNN is similar to the one used in the trailer coupler tracker [4], i.e., the TCNN network, as seen in Tab. ??, with the only difference being the final fully connected layer, which has an 18-dimensional output. Note that the first 10 layers are similar to the VGG network [68], with minor changes in the number of filters and maxpool layers. The full network is optimized using the augmented training patches of the light direction dataset. We define a Euclidean loss for learning the 18-dimensional light direction feature vector.

Figure 5.5: Dividing the sphere into smaller regions, such that each region represents a specific direction of a light source.
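To make the label construction above concrete, the following is a minimal sketch, not the thesis implementation, of how an 18-dimensional light-direction label might be formed. The partition of the hemisphere into 2 polar rings × 9 azimuth sectors, the sampling-based approximation of the circular footprint, and the function name `light_label` are all assumptions for illustration; the actual bin layout is defined by Fig. 5.5.

```python
import numpy as np

# Hypothetical bin layout: 2 polar rings x 9 azimuth sectors = 18 bins.
N_THETA, N_PHI = 2, 9

def light_label(sources, radius_frac=0.19, n_samples=2000, rng=None):
    """Build the 18-D light-direction label for one image.

    sources : list of (theta, phi) light positions in radians,
              theta in [0, pi/2], phi in [0, 2*pi).
    radius_frac : angular radius approximating the 10 cm glare shield on the
                  53 cm sphere (10 / 53 ~= 0.19 rad); an assumption.
    """
    rng = rng or np.random.default_rng(0)
    label = np.zeros(N_THETA * N_PHI)
    for theta, phi in sources:
        # Sample points inside the circular footprint of the source and
        # histogram them into bins; each source contributes a total mass of 1.
        dt = rng.uniform(-radius_frac, radius_frac, n_samples)
        dp = rng.uniform(-radius_frac, radius_frac, n_samples)
        keep = dt**2 + dp**2 <= radius_frac**2
        t = np.clip(theta + dt[keep], 0, np.pi / 2 - 1e-6)
        p = np.mod(phi + dp[keep], 2 * np.pi)
        ti = (t / (np.pi / 2) * N_THETA).astype(int)
        pj = (p / (2 * np.pi) * N_PHI).astype(int)
        np.add.at(label, ti * N_PHI + pj, 1.0 / keep.sum())
    return label

# A source straddling a bin boundary splits its mass between bins,
# but the vector still sums to 1 (or to n for n sources):
print(light_label([(0.8, 1.4)]).sum())  # ~1.0
```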
5.6 Classifying the light quality using the light direction CNN

Knowing the exact light direction is not of great importance for the task of image delighting. However, knowledge of overexposed light patterns in the scene can be used to identify badly lit frames. Therefore, we leverage the output of the light direction CNN to identify which frames have good lighting, in support of the main objective of image delighting. Our goal is to learn a classification system that produces a continuous score representing the lighting quality of a frame. We use one of the BOSCH sequences to conduct the following experiment and learn the classifier.

For every frame, we extract 80 patches selected randomly from the global frame and compute the 18-dimensional feature vector representing the light direction in every patch. The final global light direction of the frame is computed by averaging the feature vectors of all local patches. We repeat the same process for all frames in the video sequence. After that, we reduce the dimensionality of all feature vectors to two dimensions using PCA for visualization purposes. In this lower-dimensional space, we perform K-means clustering to divide the data samples into 10 clusters, as seen in Fig. 5.6; K = 10 was determined experimentally. By visualizing the frames residing in each cluster, we observe a high correlation of light patterns among them; that is, the light directions of all frames in the same cluster are very similar. Moreover, we were able to identify three clusters that had no sunlight projected in the frame. We consider these three clusters as good lighting quality, with only ambient light illuminating the scene; the remaining seven clusters had varying overexposed light patterns. A simple SVM classifier can therefore be learned, where good lighting quality images are assigned one label and bad quality images the other. Moreover, by averaging the 18-dimensional feature vectors of the good lighting frames, we can apply a cosine similarity metric to any probe frame to identify how close it is to being good quality. In the following section, we explain how the cosine similarity is used in detail.

Figure 5.6: After applying dimensionality reduction using PCA on the 18-dimensional data for all frames extracted from a video, we obtain the distribution of points seen in the plot. For better visualization, we perform K-means clustering on the data points with K set to 10 clusters. By labeling the lighting quality of the frames as good or bad, we learn an SVM classifier.

Figure 5.7: Proposed method overview and pipeline.
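As a concrete illustration of the light-quality scoring described above, the sketch below clusters per-frame light-direction vectors and scores a probe frame by cosine similarity to the mean good-lighting vector. It is a minimal sketch under assumed inputs: `frame_feats` (one averaged 18-D vector per frame) and `good_cluster_ids` (the clusters judged, by inspection, to be ambient-only) are assumptions, and the SVM step of Fig. 5.6 is omitted for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def fit_light_quality(frame_feats, n_clusters=10):
    """frame_feats: (n_frames, 18) averaged light-direction vectors."""
    feats_2d = PCA(n_components=2).fit_transform(frame_feats)           # for visualization
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(feats_2d)               # K = 10 clusters
    return feats_2d, labels

def quality_score(probe_feat, good_mean):
    """Cosine similarity between a probe frame's 18-D vector and the mean
    vector of the frames judged to have good (ambient-only) lighting."""
    num = float(probe_feat @ good_mean)
    den = np.linalg.norm(probe_feat) * np.linalg.norm(good_mean) + 1e-12
    return num / den                                                    # in [0, 1], non-negative vectors

# Usage, with good_cluster_ids chosen by visual inspection as in Sec. 5.6:
# feats_2d, labels = fit_light_quality(frame_feats)
# good_mean = frame_feats[np.isin(labels, good_cluster_ids)].mean(axis=0)
# score = quality_score(frame_feats[t], good_mean)
```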
5.7 Learning the DelightCNN model

Since we can identify the light quality of any probe frame, we now direct our focus to learning a CV system for image delighting. Recently, researchers have been using joint filter CNNs, also known as guided CNNs, whose goal is to filter unwanted regions, artifacts, and defects from a target image and produce outputs with desired properties. The joint filter CNN has been used in several related tasks such as image super resolution [56, 90], depth upsampling [37, 45], and noise reduction [41, 56]. Our hope is that an appropriately designed guided CNN can learn image delighting. In the remainder of this section, we explain how we designed the DelightCNN network, including the sub-networks DelightCNNT and DelightCNNG.

The main concept behind a guided CNN is: given a target image, how can a guide image provide additional information to enhance the target or remove undesirable light patterns? In previous literature, the guide image can be obtained from a different modality than the target image, as in the depth image enhancement work [56], where the guide is the RGB image of the same scene and the goal is to improve the depth map. In our delighting work, however, no cross-modality information is available. Therefore, we propose to use RGB images in both the target and guide branches to help synthesize good lighting images. Since the overexposed target images contain bright light patterns, and the backseat might be occupied with passengers or objects, we propose to use an empty backseat with good lighting as the input guide image to the DelightCNNG branch, as seen in Fig. 5.7. This helps the DelightCNN model understand what a good lighting image looks like, and also helps in recovering any parts of the backseat distorted by overexposed light patterns.

Given a frame x_t with overexposed light patterns at time t, we use the light direction CNN model to extract the light direction feature vector f_t. We then search for a neighboring frame to form an image pair, such that the paired images have different light direction properties. To do this, we compute the light direction feature vectors of the previous n frames. For each of the n vectors f_{t−j}, with j ≤ n, we compute the cosine similarity

    c_{t−j} = (f_t · f_{t−j}) / (‖f_t‖_2 ‖f_{t−j}‖_2),    (5.1)

and seek its minimum. This similarity naturally ranges from 0 to 1, since the feature vectors are non-negative. After obtaining c for all n frames, we pair frame x_t with the frame producing the lowest similarity score, i.e., the index of min(c). Our general assumption is that n is small enough that the two paired images have minimal object motion. This pair is used as input to the target branch DelightCNNT of the DelightCNN model, as seen in Fig. 5.7.

All three sub-models, DelightCNNT, DelightCNNG, and DelightCNNF, have CNN architectures similar to the super-resolution CNN in [25]. Our architecture is a fully convolutional network (FCN), which means we can normalize the light of input images of arbitrary size, as long as the target and guide have the same dimensionality. Each sub-model contains three convolutional layers, each followed by a ReLU layer. The filter sizes of the three layers are 9 × 9, 1 × 1, and 5 × 5, respectively; the 1 × 1 filters in the middle of the network provide a nonlinear mapping of the responses from the previous layer. The number of filters and the general layout of all sub-models are illustrated in Fig. 5.7. Based on several experimental observations, we found that operating in the YCbCr color space is more robust than the RGB space. Moreover, since the YCbCr color space has one luminance channel Y representing pixel intensities and two chrominance channels Cb and Cr representing color, we propose to use only the Y channel as input for both branches when learning the model. The output responses of DelightCNNT and DelightCNNG are concatenated to form the input of DelightCNNF. The output of DelightCNN is also a single Y channel; we take the average of the Cb and Cr channels of the two target pair images to reform the colored YCbCr image. The use of the Y channel alone in learning a CNN model has been adopted in several other works [69].
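To make the three-branch design concrete, the following PyTorch sketch mirrors the description above. It is illustrative only: the filter counts (96, 48, 1) are read off Fig. 5.7, 'same' padding is an assumption of this sketch, the ReLU placement follows the text, and the class names are not from the thesis.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One DelightCNN sub-network: three conv layers (9x9, 1x1, 5x5), each
    followed by a ReLU. Filter counts (96, 48, out_ch) are read from Fig. 5.7
    and should be treated as approximate."""
    def __init__(self, in_ch, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 96, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(96, 48, kernel_size=1),               nn.ReLU(inplace=True),
            nn.Conv2d(48, out_ch, kernel_size=5, padding=2), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class DelightCNN(nn.Module):
    """Target branch takes the two temporally paired Y channels (2 channels),
    guide branch takes the empty-backseat Y channel (1 channel); their
    responses are concatenated and fused into a single delighted Y channel."""
    def __init__(self):
        super().__init__()
        self.target = Branch(in_ch=2)   # DelightCNN_T
        self.guide = Branch(in_ch=1)    # DelightCNN_G
        self.fuse = Branch(in_ch=2)     # DelightCNN_F
    def forward(self, target_pair_y, guide_y):
        t = self.target(target_pair_y)
        g = self.guide(guide_y)
        return self.fuse(torch.cat([t, g], dim=1))

# The model is fully convolutional, so arbitrary input sizes work:
# out = DelightCNN()(torch.randn(1, 2, 240, 320), torch.randn(1, 1, 240, 320))
```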
5.8 Experimental results

In this section, we discuss the model and experimental setup, and show quantitative and qualitative results of the entire system.

5.8.1 DelightCNN setup

Due to the large number of frames in the videos, we utilize only the first two videos with objects in the BOSCH dataset for training the DelightCNN, along with one video without objects used to extract an empty backseat for guiding the training process. To obtain the groundtruth image for supervising the learning process, we again utilize the light direction CNN, as follows. Since at test time the target branch receives two images paired based on the lighting of the previous n frames, during training we modify this step to select triplets of frames whose mutual cosine distances are maximal. One image of the triplet serves as the groundtruth, while the remaining two form the input training pair. We therefore expect the groundtruth image to have the best image quality among the triplet, which can be guaranteed by applying the SVM classifier learned on the light direction data. A total of 50,000 data triplets were generated from the two video sequences. Every training sample had a dimensionality of 128 × 128 × 4, where two channels were obtained from the target image pair, one from the guide image, and one from the groundtruth; the patches were extracted from the same location in the original four full frames. The network is trained with a learning rate of 0.001, a mini-batch size of 16, and 80 epochs. During testing, n = 30, since the camera operates at 30 fps, which means we search temporally for an image pair within the last second. During training, however, n = 90, to increase the chance of finding a much better light quality frame to serve as groundtruth.

5.8.2 Results on the light direction dataset

We evaluated the light direction CNN model using the 11th object, i.e., a coffee mug, which was excluded from the training set. For this object, the light settings are completely different from those in the training data. Given this evaluation setup, we can observe whether the trained CNN model generalizes to unseen objects and light positions. A total of 24 evaluation images were used, two of which are illustrated in Fig. 5.8. For every test image, we extract 80 patches from the center of the image with a random offset in both x and y directions, along with a random scale, similar to the training setup. Please refer to the bin numbers on the sphere in Fig. 5.5 to interpret the correspondence of light direction with the bins in Fig. 5.8.

Figure 5.8: Evaluation of the light direction CNN (estimated scores over the 18 feature vector bins for the 80 test patches of two test images).

For the first image, the light source is located such that bins 5, 6, and 12 sum up to one and all other bins are zero. As seen from the results of all 80 patches, the estimates are very consistent with the groundtruth label, with only small noise in neighboring bins. A similar analysis holds for the second image, whose groundtruth is bins 2 and 10 summing up to one with the rest zero. We collect all results obtained from the test images, i.e., 24 images × 80 patches, to compute the ROC curve in Fig. 5.9. At a false positive bin estimation rate of 0.1, our light direction CNN correctly classifies the true bin direction at a rate of 89.0%.
When we examine the failure cases in the results, the errors can be attributed to two causes: (a) poor patch extraction with large offsets from the center; in a few images the object is also not perfectly centered, causing some extracted patches to have very little overlap with the object. (b) Our label for each patch is an 18-dimensional vector, with ones assigned to the bins where the light source originates; in some cases the CNN estimates neighboring bins, causing false negatives and false positives in the performance.

Figure 5.9: ROC curve of light direction bin classification (true positive rate versus false positive rate).

We have also evaluated this system on the real-world outdoor BOSCH sequences, when we performed K-means clustering on the frames; the results are shown in Fig. 5.6. Each cluster had over 100 frames on average, and all frames within a cluster exhibited high illumination similarity, which is a direct indication that the CNN model is operating as required.

5.8.3 Results on the BOSCH dataset

We evaluated the system performance using the BOSCH backseat sequences, as seen in Fig. 5.10. Using the proposed DelightCNN method, we were able to synthesize good lighting images from frames with overexposed sunlight patterns. Note that in this experiment the guide image was an empty backseat, such as the one presented in Fig. 5.7. This image remains the same for all sequences collected in this vehicle; if the vehicle or backseat changes, the guide image needs to be updated accordingly by collecting an additional video with no objects in the scene. From the first two columns in Fig. 5.10, we can see that all the target images contain overexposed sunlight patterns, whereas the output frames are free from this challenge.

Figure 5.10: Evaluation of DelightCNN on the BOSCH backseat dataset. All frames belong to the first test video in the BOSCH dataset. The first column is the target frame obtained at time t. The second column is the temporally paired frame located at time t − j. The last column represents the delighted output of the proposed method.

The guided-CNN approach learns how to filter the overexposed areas in the frame to restore an image with normal light conditions. Looking closely at the results, a light residual with soft edges from the light patterns still exists; however, this residual is much easier to handle than the illumination challenge in the target images. The first three rows in Fig. 5.10 represent good results with minimal artifacts. In the fourth row, on the other hand, both paired target images suffer from extreme overexposure, which resulted in artifacts in the ID card, with some gray patches matching the backseat. This is expected, since the ID card is hardly visible in both target images.

We evaluate the results using an ID card object detector. Two of the sequences in the BOSCH dataset have the ID card on the backseat. We designed a CF-based approach trained on a total of 30 good lighting frames obtained from one of the sequences, while the other sequence was kept for evaluation. We utilized the same CF method used for feed detection in Chapter 2. From the 30 good lighting frames, we define 30 positive and 30 negative training samples.
For the positive samples, we manually crop the ID card to a size of 100 × 100, while the negative patches are selected randomly from the frame at the same size, such that the ID card does not appear in them. The goal of this CF is to produce a strong response only in regions of the frame that correspond to the ID card. In Fig. 5.11, we show four examples of the CF responses when testing on frames undergoing challenging illumination conditions, along with the delighted versions of the same frames. The examples are sorted by the severity of the illumination on the card: in the first row the card has no light patterns, while in the last example it suffers from extreme overexposed sunlight.

Figure 5.11: CF responses of the ID card detector. The images on the left correspond to the original input frames with challenging illumination conditions. The images on the right are the proposed delighted outputs of the frames on the left. Beside each frame is an illustration of the correlation output.

Based on the correlation output of the CF detector, a clear pattern emerges: the more light exposure on the ID card, the more the peak sharpness and amplitude degrade. On the other hand, if DelightCNN is used to normalize the light challenge, the CF detector is capable of detecting the ID card. We estimate the peak-to-sidelobe ratio (PSR) of each test frame as the final detection score, and place a fixed threshold of 10 on the PSR value to determine the final result, as is commonly done in other CF works [74]. The average precision on the original test set reached a low 42.6%, while on the delighted test set it reached a remarkable 91.1%, as seen in Fig. 5.12. The CF detection miss rate at a false positive per image (FPPI) rate of 0.1 was 57.6% on the original video; this average miss rate was reduced to 33.5% on the delighted sequences.

Figure 5.12: CF detection performance comparison when applied to the original video and the delighted video.

For lower PSR threshold values, the performance on the DelightCNN video achieves nearly perfect detection, which means correlation peaks exist in all test cases. With higher thresholds, however, the detection performance drops quickly, which means the peak amplitudes are not very high. This is expected, since the output frames have a small blurring effect introduced while synthesizing the image in DelightCNN. When evaluated on the original sequence, more than 90% of the tested frames that have very sharp peaks with a high PSR correspond to good lighting images. The reason DelightCNN does not synthesize high-resolution, good-quality output when tested on these good frames is the pairing algorithm, which attempts to find a frame with the opposite lighting condition; for a good lighting input, this leads to pairing with a bad lighting frame.
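For reference, the PSR score used for the ID-card detector above can be computed as in the following sketch. It follows common CF practice [74]; the size of the window excluded around the peak is an assumption of this sketch.

```python
import numpy as np

def psr(correlation_map, exclude=11):
    """Peak-to-sidelobe ratio of a CF correlation output.

    The peak is the maximum response; the sidelobe is everything outside a
    small (exclude x exclude, an assumed size) window centered on the peak.
    """
    corr = np.asarray(correlation_map, dtype=np.float64)
    r, c = np.unravel_index(np.argmax(corr), corr.shape)
    peak = corr[r, c]

    mask = np.ones_like(corr, dtype=bool)
    h = exclude // 2
    mask[max(0, r - h):r + h + 1, max(0, c - h):c + h + 1] = False
    sidelobe = corr[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)

# Detection rule used in the ID-card experiment: detect if psr(response) > 10.
```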
5.8.4 Limitation and failure cases

One of the biggest limitations of this work resides in the temporal pairing of target images, especially in the presence of sudden object motion. The BOSCH sequences have very limited object motion, where the objects stay in the same location for long periods of time; however, a few motion cases show up when the driver makes sharp turns, causing an object to move on the seat or even fall off it. The problem is that the two paired images can then have different object layouts, which causes ghosting effects in the output image, as seen in Fig. 5.13.

Figure 5.13: Failure cases of DelightCNN. The top row shows the target frames of a video, and the bottom row the DelightCNN output. The red rectangles mark the ghosting effect when sudden object motion occurs in the scene.

The DelightCNNT sub-model is trained such that the two paired images are complementary to each other and can provide enough information about the objects in the scene to synthesize a light-normalized image of a given object. With object motion, it appears to the system as if two objects exist in the scene instead of one.

5.9 Conclusion

In this chapter, we presented a method to normalize light in videos suffering from overexposed illumination conditions. We were able to synthesize good lighting conditions even when the target objects undergo bright illumination. One of the key contributions of this system is the use of estimated light direction in designing the DelightCNN system. Even though we trained the light direction CNN in a controlled environment, it generalizes well to the BOSCH videos, which contain direct and indirect illumination distributed in a complex way. Our quantitative and qualitative results demonstrate the system's ability to improve detection capability under overexposed light challenges.

Chapter 6

Detecting objects in underexposed low-light illumination conditions

6.1 Introduction to underexposed light challenge

The camera lens takes the beams of light reflected off an object and redirects them so that they come together to form a real image of the object. In the absence of a light source, the camera ultimately digitizes a black image. Likewise, if we move the light source far away, or reduce its intensity, the object will appear darker in the generated image, because it was less exposed to light. In this section, we focus on the case where objects appear in dark images, or in dark local parts of an image, and therefore have degraded appearance features. Our goal is to synthesize good lighting images from such inputs using a monocular RGB camera, for the purpose of improved object detection performance.

The main aim of any night vision detection system is to recognize the existence of objects, possibly to avoid collision, whether vehicle-vehicle, vehicle-pedestrian, or vehicle-trailer. This is a very important issue for any real-world outdoor detection application, and night vision capability is essential for guaranteeing the security and safety of users. Another important use of night vision capability is face recognition in surveillance scenarios [62]. Recently, Apple introduced a form of biometric authentication named "Face ID" in their smartphones, which is capable of recognizing the person regardless of the light condition, relying on a combination of depth and infrared sensors. Most previous work on object detection at night utilizes sensors such as infrared, thermal, and depth cameras [6, 49, 62, 82] to overcome underexposed illumination challenges. In this chapter, we propose to learn a CNN-based method for normalizing the light conditions of any given object undergoing low-light illumination challenges.
The main limitation is the lack of training data for such problems with corresponding labeled good lighting groundtruth. Over the past couple of decades, many researchers in the image processing field have worked on multiple exposure fusion (MEF) of low dynamic range (LDR) images to produce a single high dynamic range (HDR) image. MEF algorithms produce an HDR image by computing weights for each image, either locally or pixel-wise [69]; the fused HDR image is then the weighted sum of the images in the input sequence. It is important to note that the input LDR stack is collected by changing the shutter speed, usually by a factor of 2, while the aperture, focal length, ISO sensitivity, and other parameters remain the same [36]. We therefore propose to use all LDR images from MEF datasets that have a negative exposure value (EV), including the normal exposure, i.e., EV = 0, which we refer to as LDRDark. These LDRDark images serve as training data, where the groundtruth HDR image is obtained using the state-of-the-art MEF algorithm [60]. This groundtruth HDR image is used to supervise the CNN learning.

Our proposed method is motivated by the small illumination changes between neighboring underexposed LDRDark images, as seen in Fig. 6.1. One light normalization approach is to use every LDRDark image as a training sample, with the HDR image as the improved groundtruth. However, we experimentally found that learning such a mapping is complicated by the large illumination differences between the data and groundtruth images. Another approach is to take every two neighboring LDRDark images as a training pair, e.g., a training sample with EV = −4 and a groundtruth image with EV = −2, and so on. Unfortunately, a model trained this way, given an LDRDark test image, will only attempt to estimate another LDR image with a higher EV. In the illustration of Fig. 6.1, this corresponds to learning to predict images along the multi-exposure path shown by the solid black line. Instead, we need a solution that relights LDRDark images such that the illumination difference from the HDR image is reduced; in other words, given the analogy in Fig. 6.1, we would like to predict images along the dotted lines connecting every LDRDark image with the HDR image.

Figure 6.1: Proposed idea for delighting dark images, motivated by multi-exposure sequences along with the HDR image. Our goal is to synthesize better lighting images of HDR-like quality (green star), starting from an underexposed low dynamic range image (red circle), in an iterative scheme (orange and yellow circles).

We propose to train a simple yet effective CNN model that predicts images with HDR-like illumination conditions via an iterative technique. Note that our ultimate goal is not to produce HDR images themselves; rather, we want to learn the illumination properties of such images for the purpose of image light normalization. Extensive experiments have been carried out to evaluate the delighted images through object detection in several applications, such as pedestrian and trailer coupler detection.

6.2 Prior work

In this section, we will review two types of works.
First, we review prior work that performs object detection on night images without improving the illumination state of the scene. Second, we review prior work that attempts to normalize the dark lighting of the scene, regardless of whether object detection was part of the work.

6.2.1 Nighttime object detection

As a common solution, researchers employ additional sensors such as near-infrared (NIR) [62] and far-infrared (FIR) [82] cameras or thermal cameras [6]. Unfortunately, infrared cameras have major limitations in terms of the illumination angle and distance at which they can be used [49]. Moreover, the illuminator power must be adjusted adaptively depending on whether the object is close or far away, which makes it hard to generalize to real-world problems. For thermal cameras, two major concerns arise: the cost is very high, and they only work for objects emitting heat, which limits their possible CV applications. Even though thermal cameras have been used in some applications, such as pedestrian detection [6], using them for collision avoidance is not a good idea, because objects with heat signatures similar to the background cannot be avoided. For the above reasons, the optimal solution is to use standard RGB cameras and synthesize good lighting images from dark images using image processing (IP) and CV techniques.

Recent research has been conducted on nighttime detection using RGB cameras, but it has focused on objects at short distance [18], or on video-based methods that capture and process multiple images [84]. In other cases, researchers use side information to detect objects at night, such as detecting the light beams of incoming cars as an indication of a vehicle [2]. In contrast to these methods, we propose to improve the illumination conditions of the input images prior to applying the object detector.

6.2.2 Dark image enhancement and synthesizing

As mentioned earlier, several researchers have worked on HDR image reconstruction and tone mapping [60]. However, our work is not related to this line of research, since our goal is not to produce high quality images with a bit depth greater than 8 bits per color channel; we only utilize the datasets from these works to accomplish our goal. Many works have addressed light normalization of dark images in the literature. The work in [71] proposes a simple histogram equalization algorithm; however, this method generates several artifacts, especially on darker nighttime images. The authors in [20] present an image restoration method that leverages a large database of images gathered from the web: given a test image, they first search for a similar image in the database and then apply a local color transfer to perform restoration. However, this approach will not work well when the test image is very dark with limited object visibility. The work in [67] presents an algorithm that generates virtual image exposures by altering the illumination and reflectance components of the image, which are then recombined to generate an HDR image. Even though they use MEF to solve the problem, and start from a test image similar to ours, they can only handle images in which the dark regions cover local parts of the image, and will struggle on nighttime images. In [77], they propose a model for synthesizing a plausible image at a different time of day from an input image, utilizing time-lapse videos of various outdoor scenes of buildings and cities.
The biggest limitation of this method is that it transfers color information from the matched frame to the target frame, which introduces artifacts, and in some cases objects such as pedestrians or vehicles can be lost during the transformation. More recently, the authors in [39] proposed a state-of-the-art low-light image enhancement (LIME) method. They construct an illumination map that transforms all pixels in the image such that the output image is brighter: the individual illumination of each pixel is first estimated by finding the maximum value over the RGB channels, and the initial map is then refined by imposing a structure prior on it, resulting in the final illumination map. Another recent deep learning approach was proposed in [28], where the authors predict information that has been lost in saturated dark images in order to enable HDR reconstruction from a single exposure, using the HDRCNN autoencoder network. Due to the impressive results of both methods, i.e., LIME and HDRCNN, on underexposed dark challenges, they will be used as baselines in our evaluation on the MEF dataset.

6.3 Learning the iterative CNN model

Given any LDRDark image along with the HDR image of the same scene, we first construct the path between this pair of images through weighted averaging. The number of generated images n is based on the structural similarity index measure (SSIM), a method for predicting the perceived quality between two images based on three comparison measurements: luminance, contrast, and structure. SSIM can range from −1 to 1, but for an LDRDark and HDR pair the SSIM is always above zero, since the structure and texture are not as degraded as the luminance. If the SSIM between LDRDark and HDR is very high, meaning that LDRDark is already very close in appearance to HDR, then few images are interpolated along the path between the two, and vice versa. Therefore, the total number of images on the path is defined as

    n = ⌊m · (1 − SSIM(LDRDark, HDR))⌋,

where m is the maximum number of images to be interpolated on the path between LDRDark and HDR. The interpolation is performed such that the i-th interpolated image is

    I_i = LDRDark · (n − i)/n + HDR · (i/n),    where i ∈ [1, n].

Note that every training sample I_i uses I_{i+1} as its groundtruth when training the iterative CNN model.

The proposed fully convolutional network (FCN) used to learn the iterative light normalization technique is illustrated in Fig. 6.2. An FCN is chosen for its ability to normalize the light of an input image of arbitrary size. It is composed of six convolutional layers, each followed by a ReLU layer. All layers use small 3 × 3 filters, except for the first conv layer, which uses 5 × 5 filters, and the third conv layer, which uses 1 × 1 filters providing a nonlinear mapping. The numbers of filters for the layers are 3, 96, 96, 48, 48, and 3, respectively. A skip connection is introduced in the network, adding the input image to the responses of the fifth conv layer. This enforces residual learning, such that the output of the network is the amount of change made to the input image.

Figure 6.2: Iterative CNN architecture for image delighting.
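For concreteness, a minimal PyTorch sketch of this FCN and its iterative application is given below. It is a sketch, not the exact implementation: kernel sizes and filter counts are copied from the description above and should be treated as approximate, 'same' padding is assumed, and the skip connection is applied at the network output to realize the residual-learning idea (the text attaches it to the fifth conv layer's responses).

```python
import torch
import torch.nn as nn

class IterativeDelightFCN(nn.Module):
    """Sketch of the six-layer FCN of Fig. 6.2 for iterative image delighting."""
    def __init__(self):
        super().__init__()
        chans = [3, 3, 96, 96, 48, 48, 3]   # input channels followed by per-layer filter counts
        ksize = [5, 3, 1, 3, 3, 3]          # kernel sizes per layer
        layers = []
        for i, k in enumerate(ksize):
            layers += [nn.Conv2d(chans[i], chans[i + 1], k, padding=k // 2),
                       nn.ReLU(inplace=True)]          # each conv is followed by a ReLU
        self.body = nn.Sequential(*layers)

    def forward(self, x):                   # x: 3-channel YCbCr image in [0, 1]
        return x + self.body(x)             # the network predicts the change to the input

def delight(model, image, n_iters=4):
    """Apply the trained model iteratively; the best n_iters is image-dependent (Sec. 6.3)."""
    model.eval()
    out = image
    with torch.no_grad():
        for _ in range(n_iters):
            out = model(out).clamp(0.0, 1.0)
    return out
```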
We have also noticed that this skip connection improves the quality of the output image, generating images with sharper edges compared to the case without it. Because the paired training sample and groundtruth images differ only by a small underexposure enhancement, the model can be applied iteratively. An example obtained from the BOSCH dataset, where the backseat of the vehicle was captured at a very low exposure value, is shown in Fig. 6.3; note how the intensities of the image change across the first three iterations.

Figure 6.3: Iterative CNN example (the input image and the outputs of the first three iterations).

Ideally, the number of iterations could grow without bound, with the output eventually saturating at the optimal illumination of an HDR image. However, this is not the case in real-world scenarios: if an error occurs in one iteration, it propagates through the following iterations and causes artifacts. Therefore, the optimal number of iterations differs depending on the input image.

6.4 Underexposed light datasets

We used the images from the EMPA HDR database [96], which provides a total of 33 scenes containing multi-exposure image stacks acquired using different exposure times. This dataset was collected using two different cameras; we only use the images captured by the Canon EOS-1D Mark IV, corresponding to 19 scenes. The camera has a resolution of about 16 MP (4896 × 3264 pixels). Fig. 6.4 depicts a few examples of the scenes used in our proposed method to recover from underexposed light challenges.

Figure 6.4: Examples of the EMPA HDR database [96] (scenes: Cafe, Flowers, Knossos6). The top row shows good lighting HDR images generated by MEF of the LDR images using [60]. The second and third rows are examples of underexposed images with an EV of −4 and −2, respectively.

Each selected scene was captured with 7 exposure values: one picture with "normal" exposure settings, three underexposed pictures, and three overexposed pictures. Between the different pictures, the shutter speed was changed by a factor of 2, while the aperture, focal length, ISO sensitivity, and other parameters remained the same. In our work, we are only interested in underexposed images, and therefore we utilize only the three underexposed images per scene, along with the normal exposure setting, i.e., four images per scene representing LDRDark. Each image of the multi-exposure stack was converted from uncompressed RAW format to raster graphics using dcraw (version 9.23), as performed in [40]. Each image was stored at 8-bit depth, i.e., 24 bits per pixel, which is the most common representation used to store LDR images and is suitable for display on LDR monitors.

For evaluation purposes only, we also utilize a second MEF dataset [61] collected at the University of Waterloo, which we refer to as the Waterloo MEF dataset to distinguish it from the dataset above. This set contains a total of 17 scenes, with a mixture of indoor and outdoor images. Each scene has a different number of LDR images, and we remove all images above normal exposure, keeping only LDRDark. In total, 60 LDRDark images are used for testing.
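Before moving to the experiments, the following sketch shows how the interpolation path of Sec. 6.3 can be turned into training pairs from an (LDRDark, HDR) couple of either dataset. It is a minimal sketch: function names are illustrative, images are assumed to be float arrays in [0, 1], and patch extraction and augmentation are omitted.

```python
import numpy as np
from skimage.metrics import structural_similarity

def interpolation_path(ldr_dark, hdr, m=10):
    """Build the interpolated images I_1..I_n between an LDR_Dark image and its
    HDR groundtruth (both float arrays in [0, 1], same shape), per Sec. 6.3."""
    ssim = structural_similarity(ldr_dark, hdr, channel_axis=-1, data_range=1.0)
    n = int(np.floor(m * (1.0 - ssim)))
    return [ldr_dark * (n - i) / n + hdr * (i / n) for i in range(1, n + 1)]

def training_pairs(ldr_dark, hdr, m=10):
    """(sample, groundtruth) pairs: each I_i is supervised by the next image I_{i+1}."""
    path = interpolation_path(ldr_dark, hdr, m)
    return list(zip(path[:-1], path[1:]))
```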
6.5 Experimental results

In this section, we discuss the model and experimental setup. After that, we analyze the results on the MEF dataset. Finally, we apply the model to real-world problems including trailer coupler and pedestrian detection.

6.5.1 Iterative CNN setup

We used all 19 scenes from the EMPA HDR dataset for training the FCN model. Even though the total number of images is small, their resolution is large, i.e., 4896 × 3264 pixels. Moreover, each image has 4 different exposure settings, and we also interpolate several intermediate images along the path between LDRDark and the HDR image. For interpolation, we set m = 10, so that the maximum number of interpolated images is 10 in the case of very low SSIM. For training the model, we further extract 100 smaller patches of size 128 × 128 × 3 selected randomly from each image. The total number of training patches used was 190,000, where each patch has a corresponding groundtruth image of the same size extracted from the same location of the original image. Similar to the DelightCNN work, we found that operating in the YCbCr color space is more robust than the RGB space; instead of learning only the Y channel, we let the FCN learn all YCbCr channels. The network is trained with a learning rate of 0.001, a mini-batch size of 16, and 150 epochs.

6.5.2 Results on the MEF dataset using the iterative method

We initially evaluate the system on the EMPA HDR dataset to observe how it performs. Note that this set was used for training, by extracting 100 small patches selected randomly from each very large frame; here, however, we evaluate the system using the global LDRDark image and compare the resulting image at different iterations with the HDR image by computing the SSIM. Note also that after the first iteration, the input of the CNN in the second iteration is an image synthesized by the CNN, and therefore disjoint from the training set. In other words, given the analogy in Fig. 6.1, our iterative approach applied to a single image generates a new dotted path between the LDRDark image and some point near the HDR image; this dotted path differs from the paths used in the training phase.

Iteration 1    0.10 ± 0.22
Iteration 2    0.22 ± 0.19
Iteration 3    0.32 ± 0.16
Iteration 4    0.40 ± 0.12
Iteration 5    0.44 ± 0.08
Iteration 6    0.45 ± 0.06
Table 6.1: Average SSIM gain of the iterative CNN model using the MEF dataset.

In Fig. 6.5, we illustrate the system behavior given 4 images of the same scene, taken at exposure values EV = −6, −4, −2, 0, respectively. When a very underexposed dark image is the input to the system, i.e., the first row of Fig. 6.5, each iteration gradually improves the illumination quality of the image. However, when the input image already has a good starting point, i.e., the last row of Fig. 6.5, the proposed network has minimal effect, and each iteration's output is similar to its input. This property is very important to avoid unwanted artifacts or over-brightening of the image.

Figure 6.5: Iterative CNN evaluation on the MEF dataset. The first column is the input LDRDark, the following six columns are the outputs of six iterations of the proposed method, and the final column is the HDR groundtruth image. All input images are of the same scene, taken at EV = −6, −4, −2, 0, respectively.

In Fig. 6.6, we visualize the system behavior given 9 images of various scenes, taken at low exposure values.
One key observation is that high intensity pixels in the input image are not affected by the iterative CNN model. For example, the sun in rows 2, 3, and 5 casts bright light in the input images; this light remains the same throughout all six iterations, while the dark underexposed regions are enhanced. For the last four rows in Fig. 6.6, some artifacts begin to show up in the results of iterations 5 and 6. As mentioned earlier, once an error occurs in one of the iterations, it propagates to all later iterations. Given that the images were taken with the same settings and camera, it appears that certain colors are causing the issue, mainly a specific red and a specific blue, since the artifacts appear in those colors.

Figure 6.6: Iterative CNN evaluation on the MEF dataset. The first column is the input LDRDark, the following columns are the outputs of six iterations of the proposed method, and the final column is the HDR groundtruth image. Each row is an example scene from the MEF dataset, all with a very low EV.

The SSIM of each image increases continuously with the number of iterations, which indicates that the system is truly approaching the illumination condition of an HDR image. The average SSIM gain over the first six iterations is shown in Table 6.1; the gain is smallest in the first iteration and largely saturates by the sixth. Note that all prior work in the MEF field [69] utilizes multiple images with different exposure values to produce an HDR image, whereas we utilize a single image and get very close to the HDR image.

6.5.3 Comparison with state of the art

We compare our proposed iterative CNN model with two baseline light enhancing methods, HDRCNN [28] and LIME [39], both of which have excelled at enhancing and normalizing dark input images. For a complete evaluation, we also compare our method with simple histogram equalization. The qualitative results on the EMPA HDR dataset are shown in Fig. 6.7. Unexpectedly, the HDRCNN technique seemed to synthesize the least natural results, even less natural than the LDR input images. This can be explained by the fact that HDRCNN mainly emphasizes predicting information that has been lost in saturated image areas in order to enable HDR reconstruction from a single exposure; improving the global image regardless of its saturation does not seem to work well. On the other hand, the LIME method produces illumination quality comparable to our proposed method.

We also compare our method with the baselines by computing the SSIM and PSNR values on the EMPA dataset, as seen in Table 6.2. The highest average SSIM was produced by the iterative CNN with 6 (and even 5 and 4) iterations, followed by the LIME method [39]. For PSNR, the highest values were obtained by the 6th and 5th iterations of our proposed method, followed by LIME.

Method                    Average SSIM    Average PSNR
Target image              0.36 ± 0.23     11.12 ± 3.10
Histogram Equalization    0.47 ± 0.19     13.61 ± 2.56
LIME [39]                 0.75 ± 0.16     16.67 ± 3.57
HDRCNN [28]               0.1 ± 0.1       8.59 ± 2.49
Iteration 1               0.46 ± 0.22     12.10 ± 3.18
Iteration 2               0.58 ± 0.19     13.34 ± 3.27
Iteration 3               0.68 ± 0.16     14.66 ± 3.33
Iteration 4               0.76 ± 0.12     16.02 ± 3.29
Iteration 5               0.80 ± 0.08     17.32 ± 3.04
Iteration 6               0.81 ± 0.06     17.84 ± 2.67
Table 6.2: Average SSIM and PSNR using the EMPA HDR dataset compared with SOTA methods.
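The quantitative comparison above can be reproduced, in outline, with standard metric implementations. The sketch below shows the per-image computation assumed for Tables 6.2 and 6.3; images are assumed to be float RGB arrays in [0, 1], and the variable names are illustrative.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(outputs, hdr_refs):
    """outputs, hdr_refs: lists of float RGB arrays in [0, 1] of equal shape.
    Returns (mean, std) of SSIM and PSNR over the test set."""
    ssims, psnrs = [], []
    for out, ref in zip(outputs, hdr_refs):
        ssims.append(structural_similarity(ref, out, channel_axis=-1, data_range=1.0))
        psnrs.append(peak_signal_noise_ratio(ref, out, data_range=1.0))
    return (np.mean(ssims), np.std(ssims)), (np.mean(psnrs), np.std(psnrs))
```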
Figure 6.7: Iterative CNN comparison with SOTA methods on the EMPA HDR dataset. The first column is the input LDRDark; the following columns are histogram equalization, LIME, HDRCNN, proposed iteration 4, proposed iteration 6, and finally the HDR groundtruth image. All input images have different negative EV values.

We also compare our results on the Waterloo MEF dataset, as seen in Fig. 6.8. Note that we excluded the HDRCNN method from this experiment, due to the poor performance observed in the previous result. One of the main challenges of this dataset is the small resolution of the input images, which makes it harder to recover the illumination details of smaller objects compared to the previous dataset. Also, much lower EVs were used when capturing this dataset, making some images very hard to recover. We further compute the average SSIM and PSNR on the Waterloo dataset, as seen in Table 6.3. Similar to the observation made for Table 6.2, the last two iterations give our method the lead in both PSNR and SSIM measures.

Figure 6.8: Iterative CNN comparison with SOTA methods on the Waterloo MEF dataset. The first column is the input LDRDark; the following columns are LIME, histogram equalization, proposed iterations 1 − 6, and finally the HDR groundtruth image in the last column. All input images have different negative EV values.

Method                    Average SSIM    Average PSNR
Target image              0.42 ± 0.20     15.49 ± 2.66
Histogram Equalization    0.42 ± 0.17     13.07 ± 1.75
LIME [39]                 0.58 ± 0.26     14.83 ± 3.13
Iteration 1               0.44 ± 0.20     15.74 ± 2.69
Iteration 2               0.47 ± 0.19     15.85 ± 3.14
Iteration 3               0.52 ± 0.19     15.94 ± 4.42
Iteration 4               0.56 ± 0.19     16.23 ± 5.18
Iteration 5               0.65 ± 0.08     16.80 ± 4.31
Iteration 6               0.70 ± 0.17     17.16 ± 2.86
Table 6.3: Average SSIM and PSNR using the Waterloo MEF dataset compared with SOTA methods.

In all MEF experiments, we also examined the SSIM and PSNR for iterations beyond 6. We found that performance degrades for higher iterations, even though most of the image indeed has better illumination quality than at the 6th iteration; this is due to the artifacts that begin to propagate in the iterative method. In some cases, artifacts appear at an earlier iteration stage, as in the last two rows of Fig. 6.7, where certain colors in the scene, such as the red and blue color artifacts, begin to develop and spread. If we apply more iterations in these two cases, the global image will have better lighting, but the artifacts will grow in size, reducing the SSIM and PSNR.

6.5.4 Pedestrian detection at night analysis

Most pedestrian detection datasets in the literature were collected during daytime, where underexposure challenges do not exist. For this reason, we collected several videos of pedestrians while driving, by attaching a camera to the windshield of a vehicle. Two sets of videos were collected: (1) the first set was collected starting at sunset, when natural light still exists in the scene; in this set we also focus on pedestrians crossing the street. (2) The second set was collected late at night, where the only light sources were the vehicle headlights, streetlight posts, building lights, or even the moon. These videos have a total length of 15 minutes. The number of pedestrians per frame ranges from 1 to 6, with an average of 2 pedestrians per frame.
For evaluating the detection performance, we label 50 frames from each video set, i.e., a total of 100 labeled frames. We utilize the state-of-the-art pedestrian detection algorithm [11] to evaluate detection performance on the original night sequences compared with the generated enhanced night sequences. For completeness, we also apply the detector to the original sequences after preprocessing them with histogram equalization. Several results can be found in Fig. 6.9.

Figure 6.9: Iterative CNN evaluation on the pedestrian detection application. The yellow dashed bounding boxes are groundtruth labels, and the red bounding boxes are the detected persons given the input frames. The first column is the original frame, the second is the histogram equalized frame, and the third and fourth columns are the 2nd and 4th iteration outputs of the proposed method.

The precision-recall curve and the average miss rate sampled against false positives per image (FPPI) are used for measuring performance, as seen in Fig. 6.10. The precision-recall curve shows the tradeoff between precision and recall for different thresholds: a high area under the curve represents both high recall and high precision of the pedestrian detector, where high precision relates to a low false positive rate and high recall relates to a low false negative rate. A minimum IoU threshold of 0.5 is required for a detected box to match a groundtruth box.

Figure 6.10: Precision-recall curve (left) and miss-rate curve (right) of the pedestrian detection algorithm when applied to the original video sequences, the histogram equalized videos, and the 2nd and 4th iteration outputs of the proposed method.

On the original video sequences, the pedestrian detector has an average miss rate of 31.5% and an average precision of 77.2%. When using the 2nd iteration output of the proposed method, we achieve an impressive 16.5% average miss rate, with an average precision of 87.7%. This is a large performance gain considering that the same detection algorithm is used, which is further validation that the proposed method indeed reduces the illumination challenges of an image. Note that the optimal detection performance was found for the 2nd iteration of our proposed method; all following iterations had lower detection rates.

6.5.5 Trailer coupler detection at night analysis

Our trailer coupler system from Chapter 3 was trained using videos collected during daytime. We evaluated this system under the nighttime challenge and found that it fails to detect the coupler, especially at closer ranges. Even though the test samples are obtained at nighttime, the red brake lights and the white reverse light illuminate the scene while backing up towards the coupler.

Figure 6.11: Iterative CNN evaluation on the trailer coupler detection system at nighttime for two videos. The red plus signs are the estimated 2D coupler centers. We show three frames from each video, illustrating the results at far, middle, and close range. For each video, the left column shows the original frames, the middle column the histogram equalization of the frames, and the right column the output of the proposed method.

Note that, for most of each video, we apply the brakes the majority of the time to avoid backing up quickly.
6.5.5 Trailer coupler detection at night analysis

Our trailer coupler system from Chapter 3 was trained using videos collected during daytime. We evaluate this system on the nighttime challenge and find that it fails at detecting the coupler, especially at closer ranges. Even though the test samples are obtained at nighttime, the red brake lights and the white reverse lights illuminate the scene while backing up towards the coupler. Note that, for most of the video, we tend to use the brakes the majority of the time to avoid backing up quickly.

Figure 6.11: Iterative CNN evaluation on the trailer coupler detection system at nighttime for two videos. The red plus signs are the estimated 2D coupler centers. We obtain three frames from each video, illustrating the results at far, middle, and close range. For each video, the left column shows the original frames, the middle column the histogram-equalized frames, and the right column the output of the proposed method.

We apply the iterative CNN model to recover dark regions and brighten them while keeping the bright areas the same. We experimentally found that four iterations were sufficient to make the video bright enough for the coupler system to operate successfully. Note that the trailer system utilizes grayscale images. Therefore, we first remove the fish-eye effect in RGB, then apply the iterative CNN method. After lighting up the scene with the iterative method, we convert to grayscale, and finally apply the trailer system to the video sequences.

For every test video, we compare the results of the trailer coupler system on three versions of the video: the original dark sequence, the histogram-equalized sequence, and the proposed relighted sequence, as seen in Fig. 6.11. Some of the original frames may seem too dark; this is because the driver was not using the brakes at that time. As a general observation, we find that detecting the trailer coupler at far distances works very well. On the other hand, the trailer detection begins to fail at around 3 ∼ 4 meters CVD distance; in this range, our proposed TCNN2 appears to lose track of the coupler. One clear advantage of our proposed system can be seen in the synthesized relighted frames, where the illumination is very consistent regardless of whether the driver was using the brakes or not. This consistency helps stabilize the system's detection and tracking performance.

6.6 Conclusion

In this chapter, we present a method to normalize light in images suffering from underexposed illumination conditions. Motivated by the multi-exposure datasets, we propose a novel iterative CNN model that recovers good illumination from dark images, such that the CNN's goal is to match HDR lighting quality. All previous MEF work had to use several images at test time to recover the HDR image, whereas we attempt to do this using only a single LDR image. One of the key contributions of this system is formulating the underexposed lighting problem as an iterative solution, which is shown to help other real-world detection applications such as pedestrian detection and trailer coupler detection at nighttime. Our quantitative and qualitative results demonstrate our system's ability to improve detection capability under underexposed light challenges.

Chapter 7 Conclusions and Future Work

7.1 Conclusion

In this thesis, we presented several real-world object detection applications designed to overcome many different detection challenges. The common object detection challenges, such as pose, scale, illumination, partial occlusion, and appearance variations, have been well studied in the literature. When object detection is applied to the real world, additional challenges arise, such as efficiency under limited computational resources, unstable background and foreground, obscurity of objects, and extreme illumination conditions, which are more problematic than the typical visual object detection challenges. When developing a real application, any proposed system should be able to handle these challenges without creating a computational bottleneck; otherwise it will be ignored in practice. For example, if an autonomous vehicle cannot operate at night, then no one will buy the car. Or if a system keeps producing false detections, users will walk away from it.
Looking back at all of the proposed methods in this thesis, we followed three approaches to overcoming real-world detection challenges: (a) solving the challenge prior to the detection problem. This was illustrated in Chapters 5 and 6, where the light in the input images was normalized to guarantee enhanced detection performance. (b) Modifying the detector itself so that it can handle the challenges. This was accomplished in Chapters 3 and 4 when designing the multiplexer-CNN method for detecting the trailer coupler; more specifically, designing the DCNN network to detect couplers using confidence measures, tracking through detection in an efficient manner, and inferring 3D localization of objects from 2D coordinates. (c) Refining the detection results to eliminate mistakes made by the system due to real-world challenges. This was accomplished in Chapter 2, where we proposed a post-refinement component to remove false detections produced by the detector. There is no preferred method among the three approaches; instead, the type of challenge itself governs how to approach the problem.

7.2 Future work

Our proposed work opens up many possibilities for future research and the development of new applications.

• Regarding the trailer coupler detection work in Chapters 3 and 4, we have presented a solid system motivated by real-world needs that provides an example of how successful vision systems are built. Moreover, from a technical perspective, other applications may benefit from our components. For example, the novel confidence loss function transfers to particle tracking, and other autonomous driving problems such as vehicle, curb, or parking-lot detection may benefit from the multiplexer-CNN. Estimating the distance of detected pedestrians can be learned using our proposed method of inferring 3D localization from 2D detections. Moreover, our system was designed to work during daytime, and with the iterative CNN method we can also operate at nighttime. Fusing both works such that the application still operates in real time is important; the iterative CNN method is somewhat slow when processing frames on a CPU, and a solution to this problem is needed.

• Regarding the overexposed illumination challenges in Chapter 5, we proposed the DelightCNN method to recover images affected by bright sunlight. This can be further used in several other autonomous driving scenarios, such as driving toward the sunset or sunrise, or driving into or out of a tunnel; in both scenarios, extreme illumination conditions are observed. Moreover, making DelightCNN operate on single images alone is very challenging, since we are no longer able to temporally pair bad lighting frames with good or semi-good lighting frames. However, we still think this is possible if the illumination does not completely wash out the appearance of the object. Another solution would be to adopt a generative adversarial network (GAN) to complete and synthesize the affected overexposed regions with sunlight patterns, for example using the cycle-GAN framework [94].

• Regarding the underexposed illumination challenges in Chapter 6, we proposed the iterative CNN method to synthesize well-lit images from dark inputs. We experimentally found that, whenever we apply the iterative CNN method to a different application or dataset, the optimal number of iterations also differs.
This number depends on the initial quality and resolution of the input image; higher-quality, higher-resolution inputs can usually withstand more iterations, yielding more HDR-like illumination. Therefore, the optimal number needs to be found experimentally for any new application. Alternatively, this process could be controlled automatically by learning an image quality assessment CNN that produces a score after each iteration, indicating whether we should stop or proceed to another iteration.

• Also regarding the underexposed illumination challenges in Chapter 6, we have already demonstrated the effectiveness of the system in improving nighttime detection performance for the pedestrian and trailer coupler systems. We plan on collecting more testing samples so that the evaluation can be done on a larger scale. Another interesting idea worth exploring is that our iterative CNN model was trained on the MEF dataset to make dark images brighter; would learning the system in the opposite direction help with the overexposed sunlight challenges? Moreover, can a single system learn both objectives? That is, can a single CNN learn how to delight an image regardless of whether it is too dark or too bright, such that it always produces a well-lit image? Having opposite objectives when learning the CNN will most likely confuse the parameters being learned; therefore, a carefully designed architecture will be needed.

BIBLIOGRAPHY

[1] FAO global aquaculture production volume and value statistics database updated to 2012. Technical report, FAO Fisheries and Aquaculture Department, 2014.
[2] P. F. Alcantarilla, L. M. Bergasa, P. Jiménez, M. Sotelo, I. Parra, D. Fernandez, and S. Mayoral. Night time vehicle detection for driving assistance lightbeam controller. In Proc. Intelligent Vehicles Symposium (IV), pages 291–296. IEEE, 2008.
[3] Y. Atoum, M. J. Afridi, X. Liu, J. M. McGrath, and L. E. Hanson. On developing and enhancing plant-level disease rating systems in real fields. Pattern Recognition, 53:287–299, 2016.
[4] Y. Atoum, J. Roth, M. Bliss, W. Zhang, and X. Liu. Monocular video-based trailer coupler detection using multiplexer convolutional neural network. In Proc. IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.
[5] Y. Atoum, S. Srivastava, and X. Liu. Automatic feeding control for dense aquaculture fish tanks. IEEE Signal Processing Letters, 22(8):1089–1093, 2015.
[6] J. Baek, S. Hong, J. Kim, and E. Kim. Efficient pedestrian detection at nighttime using a thermal camera. Sensors, 17(8):1850, 2017.
[7] A. Baghel and V. Jain. Shadow removal using YCbCr and k-means clustering. International Journal of Computer Applications, 134(7):21–26, 2016.
[8] V. N. Boddeti, T. Kanade, and B. Kumar. Correlation filters for object alignment. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2291–2298. IEEE, 2013.
[9] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: improving object detection with one line of code. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 5562–5570. IEEE, 2017.
[10] D. S. Bolme, J. R. Beveridge, B. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2544–2550. IEEE, 2010.
[11] G. Brazil, X. Yin, and X. Liu. Illuminating pedestrians via simultaneous detection and segmentation. In Proc. IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.
[12] C. J. Bridger and R. K. Booth. The effects of biotelemetry transmitter presence and attachment procedures on fish physiology and behavior. Reviews in Fisheries Science, 11(1):13–34, 2003.
[13] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing (IP), 25(11):5187–5198, 2016.
[14] C. Chang, W. Fang, R.-C. Jao, C. Shyu, and I. Liao. Development of an intelligent feeding controller for indoor intensive culturing of eel. Aquacultural Engineering, 32(2):343–353, 2005.
[15] S.-H. Chen and R.-S. Chen. Vision-based distance estimation for multiple vehicles using single optical camera. In Proc. Int. Conf. Innovations in Bio-inspired Computing and Applications (IBICA), pages 9–12. IEEE, 2011.
[16] S. G. Conti, P. Roux, C. Fauvel, B. D. Maurer, and D. A. Demer. Acoustical monitoring of fish density, behavior, and growth rate in a tank. Aquaculture Engineering, 251(2):314–323, 2006.
[17] C. Costa, A. Loy, S. Cataudella, D. Davis, and M. Scardi. Extracting fish size using dual underwater cameras. Aquacultural Engineering, 35(3):218–227, 2006.
[18] R. Cucchiara and M. Piccardi. Vehicle detection under day and night illumination. In IIA/SOCO, 1999.
[19] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
[20] K. Dale, M. K. Johnson, K. Sunkavalli, W. Matusik, and H. Pfister. Image restoration using online photo collections. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 2217–2224. IEEE, 2009.
[21] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: learning continuous convolution operators for visual tracking. In Proc. European Conference on Computer Vision (ECCV), pages 472–488. Springer, 2016.
[22] C. de Saxe and D. Cebon. A visual template-matching method for articulation angle measurement. In Proc. Int. Conf. Intelligent Transportation Systems (ITS), pages 626–631. IEEE, 2015.
[23] J. Diebel and S. Thrun. An application of Markov random fields to range sensing. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 291–298, 2006.
[24] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 304–311. IEEE, 2009.
[25] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(2):295–307, 2016.
[26] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015.
[27] S. Duarte, L. Reig, and J. Oca. Measurement of sole activity by digital image analysis. Aquacultural Engineering, 41(1):22–27, 2009.
[28] G. Eilertsen, J. Kronander, G. Denes, R. Mantiuk, and J. Unger. HDR image reconstruction from a single exposure using deep CNNs. ACM Transactions on Graphics (TOG), 36(6), 2017.
[29] Y. Endo, Y. Kanamori, and J. Mitani. Deep reverse tone mapping. ACM Transactions on Graphics (TOG), 36(6), 2017.
[30] R. Fattal. Single image dehazing. ACM Transactions on Graphics (TOG), 27(3):72, 2008.
[31] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2241–2248. IEEE, 2010.
[32] C. for Economic Vitality. n.d. Revenue of the U.S. RV park industry from 2009 to 2015. Statista, 2016.
[33] H. K. Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 3072–3079. IEEE, 2013.
[34] R. Girshick. Fast R-CNN. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
[35] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
[36] A. A. Goshtasby. Fusion of multi-exposure images. Image and Vision Computing, 23(6):611–618, 2005.
[37] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang. Learning dynamic guidance for depth image enhancement. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017.
[38] R. Guo, Q. Dai, and D. Hoiem. Paired regions for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(12):2956–2967, 2013.
[39] X. Guo, Y. Li, and H. Ling. LIME: Low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing (IP), 26(2):982–993, 2017.
[40] P. Hanhart and T. Ebrahimi. Evaluation of JPEG XT for high dynamic range cameras. Signal Processing: Image Communication, 50:9–20, 2017.
[41] K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(6):1397–1409, 2013.
[42] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Proc. European Conference on Computer Vision (ECCV), pages 702–715. Springer, 2012.
[43] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 37(3):583–596, 2015.
[44] P. Hu and D. Ramanan. Finding tiny faces. arXiv preprint arXiv:1612.04402, 2016.
[45] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In Proc. European Conference on Computer Vision (ECCV), pages 353–369. Springer, 2016.
[46] X. Jin, Y. Li, N. Liu, X. Li, Q. Zhou, Y. Tian, and S. Ge. Scene relighting using a single reference image through material constrained layer decomposition. In Artificial Intelligence and Robotics (AIR), pages 37–44. Springer, 2018.
[47] X. Jin, Y. Tian, N. Liu, C. Ye, J. Chi, X. Li, and G. Zhao. Object image relighting through patch match warping and color transfer. In Proc. International Conference on Virtual Reality and Visualization (ICVRV), pages 235–241. IEEE, 2016.
[48] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Automatic shadow detection and removal from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(3):431–446, 2016.
[49] J. H. Kim, H. G. Hong, and K. R. Park. Convolutional neural network-based human detection in nighttime images using visible light camera sensors. Sensors, 17(5):1065, 2017.
[50] A. Koesdwiady, S. M. Bedawi, C. Ou, and F. Karray. End-to-end deep learning for driver distraction recognition. In Proc. International Conference Image Analysis and Recognition (ICIAR), pages 11–18. Springer, 2017.
[51] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. In ACM Transactions on Graphics (TOG), volume 26, page 96. ACM, 2007.
[52] B. V. Kumar, A. Mahalanobis, and R. D. Juday. Correlation pattern recognition. Cambridge University Press, 2005.
[53] J.-F. Lalonde, A. A. Efros, and S. G. Narasimhan. Estimating natural illumination from a single outdoor image. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 183–190. IEEE, 2009.
[54] J.-V. Lee, J.-L. Loo, Y.-D. Chuah, P.-Y. Tang, Y.-C. Tan, and W.-J. Goh. The use of vision in a sustainable aquaculture feeding system. Research Journal of Applied Sciences, Engineering and Technology, 6(19):3658–3669, 2013.
[55] B. Li and Z. Shao. Precise trajectory optimization for articulated wheeled vehicles in cluttered environments. Advances in Engineering Software, 92:40–47, 2016.
[56] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. European Conference on Computer Vision (ECCV), pages 154–169. Springer, 2016.
[57] S. Liu and M. N. Do. Inverse rendering and relighting from multiple color plus depth images. IEEE Transactions on Image Processing (IP), 26(10):4951–4961, 2017.
[58] X. Liu, Y. Tong, F. W. Wheeler, and P. H. Tu. Facial contour labeling via congealing. In Proc. European Conference on Computer Vision (ECCV), pages 354–368. Springer, 2010.
[59] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157. IEEE, 1999.
[60] K. Ma, H. Li, H. Yong, Z. Wang, D. Meng, and L. Zhang. Robust multi-exposure image fusion: A structural patch decomposition approach. IEEE Transactions on Image Processing (IP), 26(5):2519–2532, 2017.
[61] K. Ma, K. Zeng, and Z. Wang. Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing (IP), 24(11):3345–3356, 2015.
[62] H. Maeng, S. Liao, D. Kang, S.-W. Lee, and A. K. Jain. Nighttime face recognition at long distance: Cross-distance and cross-spectral matching. In Proc. Asian Conference on Computer Vision (ACCV), pages 708–721. Springer, 2012.
[63] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision (IJCV), 60(2):135–164, 2004.
[64] J. Morales, A. Mandow, J. L. Martínez, and A. J. García-Cerezo. Driver assistance system for backward maneuvers in passive multi-trailer vehicles. In Proc. International Conference Intelligent Robots and Systems (ICIRS), pages 4853–4858. IEEE, 2012.
[65] R. n.d. Number of wholesale shipments of RV in the United States from 2000-2016. Statista, 2016.
[66] J. P. Oakley and H. Bu. Correction of simple contrast loss in color images. IEEE Transactions on Image Processing (IP), 16(2):511–522, 2007.
[67] J. S. Park and N. I. Cho. Generation of high dynamic range illumination from a single image for the enhancement of undesirably illuminated images. arXiv preprint arXiv:1708.00636, 2017.
[68] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proc. British Machine Vision Conference (BMVC), volume 1, page 6, 2015.
[69] K. R. Prabhakar, V. S. Srikar, and R. V. Babu. DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 4724–4732. IEEE, 2017.
[70] L. Qu, J. Tian, S. He, and Y. Tang. DeshadowNet: A multi-context embedding deep network for shadow removal. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017.
[71] M. A. H. Radhi, B. Sabah, and A. M. O. Al-Hsniue. Enhancement of the captured image under different lighting conditions using histogram equalization method. International Journal of Latest Research in Science and Technology, 3(3):25–28, 2014.
[72] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
[73] I. Sato, Y. Sato, and K. Ikeuchi. Illumination from shadows. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25(3):290–300, 2003.
[74] M. Savvides, B. V. Kumar, and P. Khosla. Face verification using correlation filters. In Proc. IEEE Automatic Identification Advanced Technologies (AIAT), pages 56–61, 2002.
[75] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[76] X. Shen, C. Zhou, L. Xu, and J. Jia. Mutual-structure for joint filtering. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 3406–3414, 2015.
[77] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
[78] S. Song and M. Chandraker. Joint SFM and detection cues for monocular 3D localization in road scenes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 3734–3742, 2015.
[79] L. H. Stien, S. Bratland, I. Austevoll, F. Oppedal, and T. S. Kristiansen. A video analysis procedure for assessing vertical fish distribution in aquaculture tanks. Aquacultural Engineering, 37(2):115–124, 2007.
[80] D. Tian, J. W. Mauchly, and J. T. Friel. Real-time automatic scene relighting in video conference sessions, Oct. 7 2014. US Patent 8,854,412.
[81] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 839–846. IEEE, 1998.
[82] O. Tsimhoni, J. Bärgman, and M. J. Flannagan. Pedestrian detection with near and far infrared night vision enhancement. Leukos, 4(2):113–128, 2007.
[83] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proc. ACM International Conference on Multimedia (ICM), pages 689–692. ACM, 2015.
[84] L. Wang, K. Huang, Y. Huang, and T. Tan. Object detection and tracking for night surveillance based on salient contrast analysis. In Proc. IEEE International Conference on Image Processing (ICIP), pages 1113–1116. IEEE, 2009.
[85] Y. Wang and D. Samaras. Estimation of multiple directional light sources for synthesis of augmented reality images. Graphical Models, 65(4):185–205, 2003.
[86] Y. Wang and I. H. Witten. Induction of model trees for predicting continuous classes. In Proc. European Conference on Machine Learning (ECML), 1996.
[87] J. Xu, Y. Liu, S. Cui, and X. Miao. Behavioral responses of tilapia (Oreochromis niloticus) to acute fluctuations in dissolved oxygen levels as monitored by computer vision. Aquacultural Engineering, 35(3):207–217, 2006.
[88] L.-Q. Xu, J.-L. Landabaso, and B. Lei. Segmentation and tracking of multiple moving objects for intelligent video analysis. BT Technology Journal, 22(3):140–150, 2004.
[89] S. Yan, Y. Teng, J. S. Smith, and B. Zhang. Driver behavior recognition based on deep convolutional neural networks. In Proc. IEEE International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pages 636–641. IEEE, 2016.
[90] Q. Yang, R. Yang, J. Davis, and D. Nistér. Spatial-depth super resolution for range images. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2007.
[91] J. J. Yebes, P. F. Alcantarilla, and L. M. Bergasa. Occupant monitoring system for traffic control based on visual categorization. In Proc. IEEE Intelligent Vehicles Symposium (IV), pages 212–217. IEEE, 2011.
[92] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele. Towards reaching human performance in pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), PP(99):1–1, 2017.
[93] C. Zhou, H. Zhang, X. Shen, and J. Jia. Unsupervised learning of stereo matching. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1567–1575, 2017.
[94] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
[95] B. Zion, V. Alchanatis, V. Ostrovsky, A. Barki, and I. Karplus. Real-time underwater sorting of edible fish species. Computers and Electronics in Agriculture, 56(1):34–45, 2007.
[96] P. Zolliker, Z. Baranczuk, D. Kupper, I. Sprow, and T. Stamm. Creating HDR video content for visual quality assessment using stop-motion. In Proc. IEEE European Signal Processing Conference (EUSIPCO), pages 1–5. IEEE, 2013.