DETECTING OBJECTS UNDER CHALLENGING ILLUMINATION CONDITIONS

By

Yousef Atoum

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering — Doctor of Philosophy

2018

ABSTRACT

DETECTING OBJECTS UNDER CHALLENGING ILLUMINATION CONDITIONS

By Yousef Atoum

Object detection is considered one of the most critical components of any Computer Vision (CV) system. For many real-world CV systems, such as object tracking, recognition, and alignment, a good localization of the targeted object is necessary as an initialization step. In this thesis, we explore object detection as used in real-life scenarios and study the effect of challenging illumination conditions on overall performance. More specifically, we study two main challenges: underexposed dark images captured during nighttime, and overexposed bright images with sunlight projected onto the object.

We initially study two detection applications used in real-life scenarios. The first application used a Correlation Filter (CF) detector for detecting small objects. CFs were robust in handling object detection with minimal appearance variations, but suffered under illumination challenges, causing many false alarms. To overcome this issue, we propose a post-refinement method to eliminate false detections caused by specular light. The second detection application used Convolutional Neural Networks (CNNs). We learned a total of five detection CNNs, which constitute the inputs of a Multiplexer-based method that controls the flow of the CV system and selects the appropriate CNN based on estimated physical measurements. The five CNNs include a global frame object detector, three local object detectors used for tracking, and finally an object contour detector used to enhance the 2D detection and infer the 3D localization of the target. With sufficient training data, all five CNNs were shown to generalize well over a wide range of illumination variations introduced by weather changes, along with many other visual challenges. However, the challenging underexposed images collected during nighttime led to system failure.

In this thesis, we propose two CNN models to handle both illumination challenges: (1) a temporal delighting Guided-CNN scheme for recovering overexposed video frames caused by sunlight, based on the analysis of directional light, and (2) an iterative CNN-based technique to synthesize well-lit images from images suffering from underexposed, low-intensity lighting. We demonstrate that our approach allows the recovery of plausible illumination conditions and enables improved object detection capability. An extensive evaluation on several CV systems was carried out, including pedestrian detection and trailer coupler detection.

Copyright by YOUSEF ATOUM 2018

This dissertation is dedicated to my beautiful wife, Baraa, and my children, Adam and Ryan. I wouldn't have made it this far without your continued support and encouragement.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Xiaoming Liu, for the patient guidance, encouragement, and advice he has provided throughout my time as his student. I joined Dr. Liu's lab at a time when I had almost lost interest in pursuing my graduate degree due to a variety of reasons beyond my control.
Dr. Liu helped me in a variety of aspects along the way, such as becoming a well-rounded researcher, refining my writing and presentation skills, attention to detail, professional advice, and maturing my problem-solving skills. I have been privileged to have Dr. Liu as my advisor, and I will always be proud of it. I would also like to thank all of the CV lab members for the many years we have spent together as lab-mates and as friends. They were always there for me whenever I needed help or advice, whether it was research related or any other personal matter. Amin Jourabloo, Xi Yin, Luan Tran, Garrick Brazil, Yaojie Liu, Tony Zhang, Joel Stehouwer, Adam Terwilliger, Bangjie Yin, Morteza Safdarnejad, Joseph Roth, and Jamal Afridi, thank you guys for all of the great times we had.

I also thank my parents, Adnan and Jaine, who provided me with the stepping stones to become the person I am today. Since I was a kid, they have guided me in gaining the skills needed to excel in my future journey and encouraged me towards attaining my future goals. And last but not least, I want to thank my wife Baraa, who has stood by me since day one of my graduate studies, who has tolerated my absences working late at school and my fits of exhaustion and impatience. She gave me endless support and help, and supported the family during much of my graduate studies. More importantly, she sacrificed her career for me to finish my studies. Along with her, I want to acknowledge my two sons, Adam and Ryan. They have never known their dad as anything but a student always working on the laptop. They have been a great source of motivation, love, and relief whenever I felt down.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction on object detection challenges and contributions
  1.1 Introduction
  1.2 Contributions
  1.3 Organization

Chapter 2  Detecting small objects using correlation filters
  2.1 Problem statement
  2.2 Proposed method on feed detection
    2.2.1 Correlation Filter for Feed Detection
    2.2.2 SVM Classifier for Refinement of False Alarms
    2.2.3 Locating the Optimum Local Region
    2.2.4 Automatic Control of the Feeding Process
  2.3 Experimental results
    2.3.1 Feed Detection
    2.3.2 Local Region Estimation
    2.3.3 Computational Efficiency
  2.4 Conclusion

Chapter 3  Detecting 2D trailer couplers using convolutional neural networks
  3.1 Problem statement
  3.2 Convolutional neural networks on object detection
  3.3 Proposed method on 2D coupler detection
    3.3.1 Preprocessing for Geometric Interpolation
    3.3.2 Coupler Detection
    3.3.3 Coupler Tracking
  3.4 Trailer Coupler Database
  3.5 Experimental results
    3.5.1 Multiplexer-CNN Setup
      3.5.1.1 Network Implementation Details
    3.5.2 Results
      3.5.2.1 Coupler detection
      3.5.2.2 Coupler tracking
  3.6 Conclusion

Chapter 4  3D trailer coupler localization using convolutional neural networks
  4.1 Problem statement
  4.2 Height Estimation and 3D Localization
    4.2.1 Coupler contour network
    4.2.2 Contour estimation
    4.2.3 Height estimation
    4.2.4 3D coupler localization
  4.3 Experimental results
    4.3.1 Parameter setting
    4.3.2 Height estimation
    4.3.3 Overall system test
    4.3.4 System efficiency
    4.3.5 Qualitative results
  4.4 A generalized approach for 3D coupler localization
    4.4.1 Height estimation algorithm: A new method
    4.4.2 Experimental results
  4.5 Conclusion

Chapter 5  Detecting objects in overexposed bright illumination conditions
  5.1 Problem statement
  5.2 Introduction to the overexposed light challenge
  5.3 Prior work
    5.3.1 Light direction estimation
    5.3.2 Enhancing lighting in images
      5.3.2.1 Shadow removal
      5.3.2.2 Image dehazing
      5.3.2.3 Image relighting
    5.3.3 Joint filtering methods
  5.4 Dataset collection
    5.4.1 Light direction dataset
    5.4.2 BOSCH vehicle backseat dataset
  5.5 Estimating light direction using a CNN model
  5.6 Classifying the light quality using the light direction CNN
  5.7 Learning the DelightCNN model
  5.8 Experimental results
    5.8.1 DelightCNN setup
    5.8.2 Results on the light direction dataset
    5.8.3 Results on the BOSCH dataset
    5.8.4 Limitation and failure cases
  5.9 Conclusion

Chapter 6  Detecting objects in underexposed low-light illumination conditions
  6.1 Introduction to underexposed light challenge
  6.2 Prior work
    6.2.1 Nighttime object detection
    6.2.2 Dark image enhancement and synthesizing
  6.3 Learning the iterative CNN model
  6.4 Underexposed light datasets
  6.5 Experimental results
    6.5.1 Iterative CNN setup
    6.5.2 Results on the MEF dataset using the iterative method
    6.5.3 Comparison with state of the art
    6.5.4 Pedestrian detection at night analysis
    6.5.5 Trailer coupler detection at night analysis
  6.6 Conclusion

Chapter 7  Conclusions and Future Work
  7.1 Conclusion
  7.2 Future work

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: A list of real-world problems and challenges covered in this thesis.
Table 2.1: Guideline for controlling the feeding process.
Table 3.1: CNN architectures.
Table 4.1: System efficiency.
Table 6.1: Average SSIM gain of the iterative CNN model using the MEF dataset.
Table 6.2: Average SSIM and PSNR using the EMPA HDR dataset compared with SOTA methods.
Table 6.3: Average SSIM and PSNR using the Waterloo MEF dataset compared with SOTA method.

LIST OF FIGURES

Figure 2.1: Given the video input, our system performs real-time monitoring and feeding decision for a highly dense fish tank.
Figure 2.2: The architecture of our feeding control system.
Figure 2.3: Feed detection procedures with each column being one local region of 150 × 150 pixels.
Figure 2.4: Comparison of feed detection.
Figure 2.5: Local region optimization.
Figure 2.6: Comparison of normalized feed.
Figure 3.1: Automatic trailer hitching by detecting and tracking the coupler using a Multiplexer-CNN.
Figure 3.2: An automated computer vision system for coupler detection, tracking, and 3D localization.
Figure 3.3: 3D distance estimation of the coupler.
Figure 3.4: Iterative coupler detection using DCNN.
Figure 3.5: Statistics of the trailer coupler database.
Figure 3.6: Detection errors vs. distance in meters.
Figure 3.7: Confidence scores vs. Euclidean estimation errors.
Figure 3.8: Tracking comparison with errors in meters.
Figure 3.9: Tracking comparison with errors in pixels.
Figure 3.10: Tracking comparison with precision plot.
Figure 3.11: Number of TCNN networks used for tracking.
Figure 4.1: Geometric features of a contour: distances along straight lines, and slopes of red lines.
Figure 4.2: Estimation of x_t and y_t on D_h elevated by the estimated height z_t.
Figure 4.3: 3D coupler localization.
Figure 4.4: Contour estimation comparison.
Figure 4.5: Overall system accuracy on the height label set.
Figure 4.6: Qualitative results of five videos.
Figure 4.7: Boat trailer couplers.
Figure 4.8: SIFT matching to find the amount of shift on the ground plane.
Figure 4.9: Finding the best distance of the two key frames for applying SIFT and estimating z_t.
Figure 4.10: 3D analysis on the 72 height dataset using the new method.
Figure 4.11: Comparing the height estimation of the CCNN approach versus the new method using SIFT matching points.
Figure 4.12: Estimating z_t using the generalized approach.
Figure 5.1: Challenging illumination examples.
Figure 5.2: We present an approach for synthesizing good lighting frames from videos taken in extreme light conditions.
Figure 5.3: Examples of the light direction dataset.
Figure 5.4: Examples of the BOSCH database.
Figure 5.5: Dividing the sphere into smaller regions, such that each region represents a specific direction of a light source.
Figure 5.6: Distribution of video frames after PCA on the 18-dimensional data, K-means clustering, and SVM classification of lighting quality.
Figure 5.7: Proposed method overview and pipeline.
Figure 5.8: Evaluation of the light direction CNN.
Figure 5.9: ROC curve of light direction bin classification.
Figure 5.10: Evaluation of DelightCNN on the BOSCH backseat dataset.
Figure 5.11: CF responses of the ID card detector.
Figure 5.12: CF detection performance comparison when applied to the original video and the delighted video.
Figure 5.13: Failure cases of DelightCNN.
Figure 6.1: Proposed idea for delighting dark images, motivated by multi-exposure sequences along with the HDR image.
Figure 6.2: Iterative CNN architecture for image delighting.
Figure 6.3: Iterative CNN example.
Figure 6.4: Examples of the EMPA HDR database [96].
Figure 6.5: Iterative CNN evaluation on the MEF dataset.
Figure 6.6: Iterative CNN evaluation on several low-EV scenes from the MEF dataset.
Figure 6.7: Iterative CNN comparison with SOTA methods on the EMPA HDR dataset.
Figure 6.8: Iterative CNN comparison with SOTA methods on the Waterloo MEF dataset.
Figure 6.9: Iterative CNN evaluation on the pedestrian detection application.
Figure 6.10: Precision-recall and miss-rate curves of the pedestrian detection algorithm.
Figure 6.11: Iterative CNN evaluation on the trailer coupler detection system at nighttime for two videos.
Chapter 1
Introduction on object detection challenges and contributions

1.1 Introduction

With the advancement of autonomous vehicles, face recognition, surveillance, biometric systems, and many other computer vision (CV) applications, accurate and efficient object detection systems are rising in demand. A wide range of algorithms have been proposed to detect objects in still images or videos, but a solution that clearly outperforms the human vision system is still missing [92]. The core challenge is that each object in the world can produce an infinite number of different 2D images, as the object's position, scale, pose, lighting, and background vary relative to the camera. Yet as humans, our brains can remedy this challenge effortlessly. Whether the object is very far away and small in size, captured at nighttime or in very bright lighting conditions, highly occluded by other objects, or seen in any possible pose, our visual system can simply spot the target. With the advancement of CV algorithms, researchers are taking small steps towards human-level detection performance. In this thesis, we highlight several object detection challenges that slow the growth of this field, and propose solutions to overcome these problems.

Great object detection success has been achieved in controlled environments. One of the first object detectors to gain a lot of attention was the use of templates for localizing the position of an object. Correlation filters (CFs) are 2D templates and are considered among the most powerful and efficient detectors, especially when the targeted object has limited appearance changes. CFs are applied to every location in an image using 2D convolution. Since the correlation is processed in the Fourier domain as element-wise multiplication, these filters have attracted many researchers' attention for visual detection and tracking due to their remarkable computational efficiency, i.e., 600 FPS on a CPU in [10]. Another interesting property of CFs is graceful degradation. This is remarkably useful when objects undergo partial occlusion or illumination changes, where the CF detector will have a positive but degraded response that is still indicative of the target's existence. CFs have been widely used in many object detection applications, such as detecting military vehicles and planes [52], pedestrians and vehicles [33], and vehicle parts [8], and even tracking by detection of objects [42]. In Chapter 2, we utilize CFs in detecting several small objects simultaneously in real time, given the challenging illumination conditions of the problem, as seen in Table 1.1.

Convolutional Neural Networks (CNNs) have also been used extensively for object detection over the past few years, witnessing remarkable progress. Several challenges are associated with CNN methods, as follows:

• One of the biggest limitations when attempting to design CNNs for handling specific challenges is the availability of training data for learning the detector. Usually, data augmentation, fine-tuning, or training shallow architectures are common practices when samples are limited. Overfitting to the training data is one of the main concerns in such cases.
Throughout this thesis, we have proposed a total of eight CNN models for different purposes, each of which had very limited, if any, publicly available data. We will demonstrate how we collect data and utilize limited data resources to pursue high generalization ability in CNN learning.

Ch. 2 — Feed detection. Challenges: small object size, pose variation, and specular illumination; real-time detection needed. Example: fish-tank image showing light reflection and multiple feed.
Ch. 3 — Coupler 2D detection. Challenges: scale, pose, illumination, and appearance variations; real-time detection needed. Example: trailer coupler image.
Ch. 4 — Coupler 3D detection. Challenges: scale, pose, illumination, and coupler style variations; real-time detection needed. Example: couplers at heights of 56 cm, 35 cm, and 60 cm.
Ch. 5 — Light normalization (overexposed challenge). Challenges: object appearance visibility is low; pixels all have high intensity values; real-time light normalization needed. Example: pedestrian in bright sunlight.
Ch. 6 — Light normalization (underexposed challenge). Challenges: object appearance visibility is low; pixels all have low intensity values; real-time light normalization needed. Example: trailer at night.

Table 1.1: A list of real-world problems and challenges covered in this thesis.

• Another challenge in this framework is the design complexity. Recently, the object detection task has been divided into several smaller components [11, 44, 72], such as a region proposal network (RPN), a region CNN classifier, and a proposal suppression component such as non-maximum suppression (NMS). Hence, any new component proposed for the object detection pipeline should not create a computational bottleneck; otherwise it will be conveniently ignored in practical implementations [9]. Moreover, many target applications nowadays have restricted memory and computational power, run on CPUs only, and need to operate in real time. In Chapters 3 and 4, we propose a real-time, lightweight multiplexer-CNN method composed of five CNN components to solve the trailer coupler detection problem, as seen in Table 1.1.

• Most CNN object detectors are formulated as a regression problem to localize object bounding boxes. This helps in generating detection confidence scores, which are needed in real-world scenarios due to the many challenges associated with them. The scores are usually computed by two main components, the RPN and the CNN classifier [72]. The RPN generates a set of rectangular object proposals, each with an objectness score measuring membership to an object class. The CNN classifier also outputs a classification confidence. Both of these scores are thresholded to produce a binary detection decision. Given a specific test sample, selecting a fixed threshold on the generated scores can have a huge impact on the performance. For example, the pedestrian detector in [11] achieves state-of-the-art performance on the Caltech benchmark [24]. When evaluating the same system on nighttime sequences, i.e., in Chapter 6, the system performance degraded significantly because of the high miss rate. It was observed that nearly a 15% performance gain was achieved by readjusting the classification score threshold, i.e., lowering the threshold from 0.44 to 0.1. In Chapter 3, we introduce several CNN detectors, one of which is trained using our novel confidence loss function along with a CNN-based object detector, which estimates object coordinates associated with confidence scores. These scores reflect the existence of the target object within the spatial enclosure of the training patches, as well as how accurate the 2D regression detection is.
• Very few works in the literature utilize object detection for understanding the surrounding real world, even though it was intended for such problems: for example, estimating the distance of the detected object in meters, determining whether the path between the object and the camera is clear, checking whether the detected object's size is normal, and so on. In Chapter 4, we propose to transfer 2D object detections into 3D using a single monocular camera, which helps in inferring real-world distances of objects.

After completing all of the works mentioned above, i.e., Chapters 2, 3, and 4, several observations were made based on the failure cases of the CV systems. In any realistic real-world outdoor application, there are a number of additional challenges associated with object detection, such as efficiency, unstable background and foreground, obscurity of objects, and extreme illumination challenges [88], which are more problematic than the typical visual object detection challenges, i.e., pose, scale, partial occlusion, and illumination variations. Such challenges are not well considered in most of the object detection literature. In Chapters 5 and 6, we specifically focus on the illumination challenges that arise when object detection systems operate in outdoor scenarios. More specifically, Chapter 5 introduces the overexposed bright sunlight problem, where sunlight projected onto objects leaves most pixel values white, i.e., a washed-out appearance. In Chapter 6, we tackle the opposite problem of underexposed low-light illumination, captured in the absence of light sources. One common solution in the literature is to utilize infrared (IR) sensors to replace the RGB cameras. On the downside, there are limitations in terms of the illumination angle and distance when such sensors are used. Moreover, the illuminator power must be adjusted depending on whether the object is close or far away. The cost of such sensors is another concern to the user, as is the added system complexity of the additional sensors. Another solution would be to fine-tune the system with training data collected under such illumination challenges. However, the detection cues for objects in such cases are fundamentally different from those when objects are in good lighting conditions, and the lack of data in such scenarios makes this solution infeasible.

1.2 Contributions

In this section, we list the contributions made towards completing this thesis:

• Utilized CFs in detecting an immense number of small objects simultaneously, such that the detection is done in real time and takes place in an environment with challenging overexposed specular reflections. CFs alone struggle when false objects have features and properties similar to the target, causing severe false alarms. We propose a post-refinement component attached to the CF output to enhance the overall detection performance.

• Detected trailer couplers in real-world scenarios with several challenges, including scale, pose, illumination, partial occlusion, and a wide range of appearance variations. This is made possible through our novel design of a distance-driven Multiplexer-CNN that achieves both generalization and real-time efficiency. Even though our system is composed of five CNN components, we managed to run the system on a standard computer using only a single CPU. Scale variation is among the most challenging cases in this problem, since the object appearance changes according to the distance between the object and the camera.
Therefore, we proposed three CNN detectors to handle this challenge, such that the output maintains high accuracy regardless of the object's distance.

• Developed a novel loss function along with a CNN-based object detector, which estimates object coordinates associated with confidence scores. These scores reflect the existence of the target object within the spatial enclosure of a local region, as well as how accurate the 2D regression detection is.

• Developed two methods to transfer 2D object detections into 3D using a single monocular camera. One of the methods is based on estimating the geometric shape of the target to infer a 3D localization of the object. The other method, which has better generalization capability, is based on detecting and tracking points on both the foreground object and the background to infer the 3D coordinates of the object.

• Developed a novel DelightCNN model to handle overexposed bright illumination challenges, aiming at enhancing object detection performance. The main objective of this model is to take video frames suffering from bright sunlight and synthesize normalized frames with ambient-like lighting conditions.

• Proposed a novel iterative CNN model for light normalization in underexposed dark images. The objective of this model is to take any test image and improve the underexposed dark regions while keeping the well-lit regions untouched. The main goal of this model is also to improve object detection in such scenarios, including nighttime detection with RGB cameras.

1.3 Organization

This thesis is outlined as follows. In Chapter 2, we describe how CFs were used in detecting multiple small objects in real time. We also explain how we handled the specular reflections, which were causing a high false-alarm rate, by proposing a post-refinement process on the CF outputs. In Chapter 3, we present a multiplexer-based CNN method which controls the flow of the CV system of a real-world problem based on estimated physical measurements. Five different detection CNNs are proposed, each with a specific task in the system. In Chapter 4, we introduce the fifth detection CNN used in the multiplexer-based CNN method from Chapter 3. The goal of this CNN is to perform 3D detection of the object, which yields valuable information for the system in understanding the object's localization in 3D space. Finally, in Chapters 5 and 6, we propose two methods for normalizing light in outdoor images undergoing illumination challenges. The first method, in Chapter 5, helps recover overexposed light patterns in images through our proposed joint-filter CNN method named DelightCNN. The second method, in Chapter 6, helps recover underexposed dark images through our proposed iterative CNN method. We finally evaluate our proposed methods on real-world problems such as detecting pedestrians and trailer couplers at nighttime, as well as delighting the backseat of a vehicle while driving.

Chapter 2
Detecting small objects using correlation filters

2.1 Problem statement

Based on statistics from the Fisheries and Aquaculture Department [1], aquaculture is growing at a very high rate internationally, and its contribution to the world's total fish production reached 42.2% in 2012, up from 25.7% in 2000. The fish feeding process is one of the most important aspects of managing aquaculture tanks, where the cost of fish feeding is around 40% of the total production costs [14].
Monitoring several aquaculture tanks with highly populated fish is a challenging task. Many researchers adopt a telemetry-based approach to study fish behavior [12, 16]. In addition, some scientists prefer a computer vision (CV)-based approach for fish monitoring [17, 27, 79, 87, 95]. Unfortunately, all these studies are conducted at a small scale, i.e., a small number of fish in small tanks. Compared to fish behavior, excess feed detection is rarely addressed except in [54], where feeding control is achieved by estimating fish appetite. However, the tank in [54] is also small and the fish are easily segmented.

By collaborating with an active aquaculture fish farm, we have developed a CV-based automated feeding control system. A video camera is placed above the water surface of a highly dense fish tank with ∼10,000 fish, as shown in Fig. 2.1. The camera captures only part of the water surface due to the large tank size. Videos are directly transferred to a host computer that performs immediate analysis on the state of fish behavior. Moreover, the system is also programmed to take immediate action in stopping the feeding process when needed.

Figure 2.1: Given the video input, our system performs real-time monitoring and feeding decision for a highly dense fish tank.

In this work, we present an efficient CV system to continuously monitor fish eating activity, detect excess feed, and automatically control the feeding process. We will skip monitoring fish eating activity, since it is outside the scope of this dissertation; interested readers can refer to [5] for a detailed explanation. To detect the amount of feed floating on the water surface, we propose a novel two-stage approach. First, a correlation filter learned in a supervised manner is applied to the test frame in order to detect every individual feed. Second, a Support Vector Machine (SVM) classifier is deployed as a refinement step on the correlation filter output, which attempts to suppress falsely detected feed while preserving true feed. Furthermore, we propose to detect feed in an optimum local region only, rather than the entire frame, whose accuracy and efficiency are both less than ideal. Using the particle filter technique, the local region is estimated by maximizing the correlation between the number of locally detected feed and that of true feed in the entire frame. Finally, based on continuous measurements from fish activity and feed detection, various actions take place to control the feeding process.

This work makes the following contributions: 1) a fully automated aquaculture monitoring system that controls feeding for a highly dense fish tank, 2) an accurate measure of fish activity and a continuous detection of excess feed from an optimum local region utilizing CFs, and 3) the video dataset and labels that are publicly available for future research.

Figure 2.2: The architecture of our feeding control system.

2.2 Proposed method on feed detection

Monitoring the behavior of fish eating activity, along with making sure the fish are provided the correct amount of feed, is the main goal of this work. The process is illustrated in Figure 2.2. The blue branch of the figure can be ignored since it is not related to object detection; for more information about the activity classification, please refer to [5]. Accurately estimating the amount of excess feed floating on the water is a critical component for any intelligent aquaculture system. However, detecting individual feed is very challenging due to the tiny feed size, the partial submersion of feed in the water, and overexposed specular light reflections. Further, feed detection should be conducted in real time for immediate feeding control. The efficiency challenge is attributed to the contrast between the large frame size (1080 × 1960 pixels) and the tiny feed size (∼30 pixels), i.e., a huge number of local candidates to be classified as feed vs. non-feed. These challenges motivate us to develop a carefully designed feed detector with three components: 1) a correlation filter is used to detect all possible feed, 2) a classifier built on handcrafted features suppresses non-feed from the first component, and 3) a local region is searched to maximize computational efficiency and accuracy. We now discuss each component in detail.
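Before the individual components are detailed, the following minimal sketch illustrates how they are intended to fit together for a single frame: CF candidate detection, SVM refinement, counting feed inside the optimum local region, and the rule-based control of Table 2.1 (Section 2.2.4). The function signatures, the injected callables, and the feed-count threshold are illustrative assumptions, not the actual interface of the thesis implementation (which was written in Matlab).

```python
# Illustrative per-frame flow of the feed detector; names and thresholds are placeholders.
from typing import Callable, List, Tuple
import numpy as np

Point = Tuple[int, int]

def feeding_action(frame: np.ndarray,
                   local_region: Tuple[int, int, int, int],      # (row, col, height, width)
                   cf_detect: Callable[[np.ndarray], List[Point]],
                   is_true_feed: Callable[[np.ndarray, Point], bool],
                   fish_active: bool,
                   machine_on: bool,
                   high_feed_count: int = 50) -> str:
    r, c, h, w = local_region
    region = frame[r:r + h, c:c + w]                              # optimum local region (Sec. 2.2.3)
    candidates = cf_detect(region)                                # correlation-filter peaks (Sec. 2.2.1)
    feed = [p for p in candidates if is_true_feed(region, p)]     # SVM refinement (Sec. 2.2.2)
    feed_is_high = len(feed) >= high_feed_count
    # Rules of Table 2.1 (Sec. 2.2.4): stop on excess feed, resume for hungry, active fish.
    if feed_is_high and machine_on:
        return "turn_off"
    if not feed_is_high and fish_active and not machine_on:
        return "turn_on"
    if not feed_is_high and not fish_active and machine_on:
        return "turn_off"
    return "no_change"
```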
2.2.1 Correlation Filter for Feed Detection

Correlation filters have been widely used in many applications such as object detection, recognition, tracking, and alignment [8, 10, 33, 42, 52]. Since the correlation is processed in the Fourier domain as element-wise multiplication, these filters have attracted many researchers' attention for visual detection and tracking due to their remarkable computational efficiency. The main challenges of this work are: (a) the small size of the feed, which is partially submerged in the water; (b) specular light reflections off the water surface, which confuse the system with false alarms; and (c) the real-time requirement for controlling the feeding process and avoiding the dispensing of excess feed. To address these challenges, we would like to efficiently and accurately rule out the majority of non-feed candidates while preserving most true feed. The correlation filter (CF) is chosen for this purpose due to its proven success in object detection [10, 33].

In the context of object detection, correlation filters are 2D templates that are applied to every location in an image using 2D convolution. The goal of the template is to produce a strong response only in regions of the image that correspond to the template. The most important part lies in making the right template. Most contemporary filters are designed to minimize the difference between the response of the filter on the training set and a desired output. In a sense, the peak of the desired output marks a positive training sample, and the rest of the training image provides a large set of negative samples. This is different from discriminative classifiers, which require explicitly defined positive and negative training samples. Over the past few decades, many variations of the filter design have been proposed, but for the sake of brevity we only explain the method used for feed detection. Specifically, we adopt the unconstrained scalar feature approach [8], which is learned by minimizing the average mean square error between the cross-correlation output and the desired correlation output over all training images. This is accomplished by controlling the shape of the correlation output between the entire image and the filter. Thus, the correlation filter design is formulated as the following optimization problem,

$$\min_{\mathbf{h}} \; \frac{1}{N}\sum_{i=1}^{N} \big\| \mathbf{x}_i \oplus \mathbf{h} - \mathbf{g}_i \big\|_2^2 + \lambda \|\mathbf{h}\|_2^2, \qquad (2.1)$$

where $\mathbf{h}$, $\mathbf{x}_i$, $\mathbf{g}_i$, $\lambda$, and $\oplus$ are the CF, the visual features of the $i$-th image, the desired output, the regularization weight, and convolution, respectively. Here a boldface character denotes that the matrix is represented as a column vector. By converting to the frequency domain, Eqn. 2.1 has the following closed-form solution,

$$\hat{\mathbf{h}} = \left[ \lambda \mathbf{I} + \frac{1}{N}\sum_{i=1}^{N} \hat{\mathbf{X}}_i^{\dagger} \hat{\mathbf{X}}_i \right]^{-1} \left[ \frac{1}{N}\sum_{i=1}^{N} \hat{\mathbf{X}}_i^{\dagger} \hat{\mathbf{g}}_i \right], \qquad (2.2)$$

where $\hat{\cdot}$ is the FFT operation, $\hat{\mathbf{X}}$ is the diagonal matrix with $\hat{\mathbf{x}}$ on its diagonal, $\dagger$ is the conjugate transpose, and $\mathbf{I}$ is the identity matrix. A set of $N$ $L \times L$ local patches with true feed in the center is used as the training images. For efficiency, the raw intensity is used as $\mathbf{x}$. Given a test image $\mathbf{x}_t$, locations where the convolution output $\mathbf{x}_t \oplus \mathbf{h}$ contains peaks larger than a threshold $\tau$ are detected as candidate feed. We choose $\tau$ such that maximum true detection and minimal false alarms are achieved.
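To make the frequency-domain computation concrete, the following is a minimal NumPy sketch of the closed-form filter design in Eqn. 2.2 and the thresholded detection step. It is a simplified illustration under the raw-intensity setting above; the thesis implementation was in Matlab, and preprocessing details are omitted.

```python
# Minimal sketch of CF training (Eqn. 2.2) and candidate detection; illustrative only.
import numpy as np

def train_cf(patches, desired_outputs, lam=0.01):
    """patches: (N, L, L) raw-intensity training patches with feed centered;
    desired_outputs: (N, L, L) 2D Gaussians peaked at the feed location."""
    X_hat = np.fft.fft2(patches)                  # FFT of each training patch
    G_hat = np.fft.fft2(desired_outputs)
    # Because each X_hat_i is diagonal, the closed form reduces to element-wise division.
    numerator = np.mean(np.conj(X_hat) * G_hat, axis=0)
    denominator = lam + np.mean(np.conj(X_hat) * X_hat, axis=0)
    return numerator / denominator                # \hat{h}, the filter in the frequency domain

def detect_candidates(region, h_hat, tau):
    """Convolve a test region with the learned filter via the FFT and
    return the pixel locations whose response exceeds the threshold tau."""
    h = np.real(np.fft.ifft2(h_hat))              # filter back in the spatial domain
    response = np.real(np.fft.ifft2(np.fft.fft2(region) * np.fft.fft2(h, s=region.shape)))
    return np.argwhere(response > tau)
```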
2.2.2 SVM Classifier for Refinement of False Alarms

The SVM classifier is focused on distinguishing between feed and non-feed. Therefore, the same group of training images used to build the correlation filter is used again as positive training images. For negative samples, false-alarm patches are obtained. Both positive and negative samples are used to extract features, represented by $\{v_i\}_{i\in[1,N]}$, where $i$ is the training sample index and $N$ is the total number of positive and negative samples.

Feed is easily distinguishable by identifying its color. Thus, a group of features is extracted from the training images based on color properties in the RGB color space, after which K-means clustering is performed on their Cartesian representation. The resultant $d_c = 20$ code words, $[s_1, s_2, \ldots, s_{d_c}]$, represent clusters that cover a wide range of colors for both feed and false alarms. For any given correlation output $C$, we extract patches $P_i$ centered at each high peak, representing either feed or a false alarm, where $i$ is the peak index. For every pixel in $P_i$, we locate the nearest color code that represents it. Thus, for every $P_i$, we generate a $d_c$-dimensional Bag-of-Words (BoW) histogram

$$f_c(d) = \sum_{(u,v)\in P_i} \delta\big(d = \arg\min_{d} \|P_i(u,v) - s_d\|_2\big),$$

where $\delta$ is the indicator function. This histogram is normalized as $\frac{f_c - \min(f_c)}{\max(f_c) - \min(f_c)}$. Another group of features is extracted from the Histogram of Oriented Gradients (HOG) descriptor, which captures the orientation information of the object located in the image. We follow the approach of [19] in obtaining the HOG descriptor using 9 orientations. Due to the normalization in every window, a 36-dimensional feature vector $f_h$ can be obtained by summing the results of all similar orientations. In some cases, the HOG features alone will not be able to distinguish feed from false alarms, because a false alarm might appear to have the shape of feed. On the other hand, the BoW features based on RGB colors alone will also encounter difficulties in some cases, where the feed might appear with color features similar to the negative samples due to lighting conditions. By combining both features, $f = [f_c, f_h]$, the resulting feature vector can overcome these cases. Thus, a group of 56 features is extracted from the positive and negative patches, which are used for learning the SVM classifier.
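As an illustration of the 56-dimensional feature described above — a 20-bin color Bag-of-Words histogram concatenated with a 36-dimensional summed HOG descriptor — a possible sketch is shown below. The library choices (scikit-learn KMeans/SVC, scikit-image hog), the HOG cell size, and the patch handling are assumptions standing in for the thesis's Matlab implementation; scikit-learn's SVC wraps LibSVM, matching the classifier used here.

```python
# Sketch of the colour BoW + HOG feature and the SVM refiner; parameters are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from skimage.feature import hog

def learn_color_codebook(train_patches_rgb, d_c=20):
    pixels = np.concatenate([p.reshape(-1, 3) for p in train_patches_rgb])
    return KMeans(n_clusters=d_c, n_init=10).fit(pixels).cluster_centers_   # code words s_1..s_dc

def bow_histogram(patch_rgb, codebook):
    pixels = patch_rgb.reshape(-1, 3).astype(float)
    dists = np.linalg.norm(pixels[:, None, :] - codebook[None, :, :], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(codebook)).astype(float)
    return (hist - hist.min()) / (hist.max() - hist.min() + 1e-8)           # min-max normalization

def patch_feature(patch_rgb, patch_gray, codebook):
    f_c = bow_histogram(patch_rgb, codebook)                                # 20-D colour BoW
    blocks = hog(patch_gray, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), feature_vector=False)
    f_h = blocks.sum(axis=(0, 1)).reshape(-1)                               # 2x2 cells x 9 bins = 36-D
    return np.concatenate([f_c, f_h])                                       # 56-D feature f = [f_c, f_h]

def train_refiner(features, labels):
    return SVC().fit(features, labels)                                      # LibSVM-backed SVM
```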
2.2.3 Locating the Optimum Local Region

The search for an optimum representative local region relies on two observations: (1) detecting feed in a sub-region is far more efficient than searching the entire frame, and (2) the feed tends to gather in some portions of the image more than others; hence, we need to find a local region where the feed is well represented throughout all frames of a video. In our work, we address this problem using the concept of particle filters.

A group of $K$ windows is represented by their locations, sizes, and weights as $\{c_i^k, w_i^k\}_{k\in[1,K]}$, where $c_i^k$ denotes the location and size of window $k$ for frame $i$ and $w_i^k$ is the weight of that window. All $K$ windows are initially distributed uniformly throughout the frame for the process of finding the best location and size of the window. A group of $n$ labeled frames is used to determine the optimum location. The weight of every window is computed using Pearson's correlation coefficient,

$$w_i^k = \frac{\sum_{i=1}^{n} (T_i^k - \mu_T)(G_i - \mu_G)}{\sqrt{\sum_{i=1}^{n} (T_i^k - \mu_T)^2}\,\sqrt{\sum_{i=1}^{n} (G_i - \mu_G)^2}}, \qquad (2.3)$$

where $T_i^k$ is the true detection rate within the local region, $\mu_T$ is the mean true detection over the $n$ labeled frames, $G_i$ is the ground-truth number of feed in the entire global frame, and $\mu_G$ is the mean ground truth over the labeled frames. At every iteration, the Pearson's correlation coefficients are computed to assign each window a weight. Windows with lower weights are less likely to be selected in the following iteration, while windows with higher weights have a higher chance of having their regions revisited in the next iteration. After a number of iterations, we expect all $K$ windows to have converged to a specific region where the weights are highly correlated. The particle-filter update is performed by resampling based on the weights achieved in the previous iteration. This results in a new set of windows of the same size as in the previous iteration, located in regions with higher potential of yielding a larger Pearson's correlation coefficient.
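The following small sketch shows one iteration of this search: each window's weight is the Pearson correlation of Eqn. 2.3 between its per-frame local feed counts and the global ground-truth counts, after which windows are resampled in proportion to their weights. The clipping of negative weights and the perturbation of resampled locations are illustrative assumptions rather than the exact procedure used in the thesis.

```python
# Sketch of the Pearson-weighted particle-filter update for the local-region search.
import numpy as np

def window_weight(local_counts, global_counts):
    """Pearson correlation (Eqn. 2.3) between local detections and global ground truth."""
    t = np.asarray(local_counts, dtype=float) - np.mean(local_counts)
    g = np.asarray(global_counts, dtype=float) - np.mean(global_counts)
    return float((t * g).sum() / (np.sqrt((t ** 2).sum() * (g ** 2).sum()) + 1e-12))

def resample_windows(centers, weights, rng, jitter_std=10.0):
    """Resample window centers in proportion to their (non-negative) weights,
    then perturb them slightly so regions near well-correlated windows are explored."""
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)
    idx = rng.choice(len(centers), size=len(centers), p=w / w.sum())
    jitter = rng.normal(scale=jitter_std, size=(len(centers), 2))
    return np.asarray(centers, dtype=float)[idx] + jitter
```

Here `rng` would be a `numpy.random.Generator` (e.g., `np.random.default_rng(0)`), and the window size is held fixed across iterations, as in the text.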
2.2.4 Automatic Control of the Feeding Process

The main purpose of classifying the fish behavior at every frame, as well as detecting the amount of excess feed on the surface of the tank, is to be able to automatically adjust the feeding process without the need for any human interference. The goal is to maintain a stable and accurate amount of feed provided to the fish tank. The system is designed to automatically control when to switch the feeding machine on and when to turn it off. Based on the results obtained from both the two-class classification of fish activity and the feed detection system at every frame, a continuous decision needs to be made on whether to stop or continue the feeding process. We arrive at the rules listed in Table 2.1, which represent critical instances that require feeding to stop or continue. If any of the cases listed in the table takes place, an immediate action must be taken.

# of feed | Fish Active | Feeding Machine | Action
High      | Yes         | On              | Off
High      | No          | On              | Off
Low       | Yes         | Off             | On
Low       | No          | On              | Off

Table 2.1: Guideline for controlling the feeding process.

Figure 2.3: Feed detection procedures with each column being one local region of 150 × 150 pixels: (a) the original image with green circles indicating labeled ground-truth feed, (b) the CF output (green and red), where the red squares are false alarms, and (c) the results of the SVM classifier in a binary image where the white regions are the final detected feed. Note the reduced false alarms from (b) to (c).

2.3 Experimental results

Our dataset consists of 21 videos of a top-view aquaculture fish tank. These videos were captured at 10 FPS, with 1080 × 1960 pixels and an average length of 5,684 frames. The first 20 videos are captured under normal circumstances. The last video exhibits a huge amount of excess feed, since the feeding machine is intentionally switched on for a longer period of time. To evaluate the feed detection, we manually label feed in n = 12 frames randomly taken from the video at different stages of the feeding process. We conduct the labeling twice, and only the feed labeled in both trials is claimed as true feed. The number of true feed ranges from 22 to 856 per frame, with a total of 4,485 feed.

2.3.1 Feed Detection

We set the parameters as N = 2,000, L = 25, τ_l = 229, τ = 0.53, and g_i is a 2D Gaussian centered at the target's location with a variance of 2 and peak amplitude of 1. The default parameters in LibSVM are used for SVM learning.

Figure 2.4: Comparison of feed detection (true detection rate vs. normalized false alarm for the local and global CF, with and without the SVM refinement).

Figure 2.4 compares the results of the CF in the local region alone vs. having an SVM refinement classifier follow the CF. The Normalized False Alarm (NFA) is the number of falsely detected feed divided by the number of true feed. Remarkably, the refinement classifier reduces the number of false alarms by nearly 50%, while maintaining a similar true detection rate. For example, one good operating point on the ROC has a detection rate of 90.8% at an NFA of 0.3. Further, the results of operating on the entire frame are much worse than on the local region. Finally, we also employ the SVM classifier for feed detection without first applying the CF. It can detect 85.3% of feed, but the NFA is considerably high at 4.5, not to mention the much lower efficiency. The superiority over this baseline demonstrates the excellent accuracy and efficiency of our two-step approach. An illustration of the feed detection procedure is shown in Fig. 2.3. Columns 1-2 are successful at detecting all feed with no false alarms. Columns 3-4 have missed detections, but no false alarms. Columns 5-8 illustrate variations of false alarms.
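For clarity, the two quantities plotted in Fig. 2.4 can be computed along the lines of the sketch below, which counts a detection as true if it falls within a small radius of an unmatched labeled feed. The matching radius and the greedy matching rule are assumptions, since the exact matching criterion is not spelled out here.

```python
# Sketch of the evaluation metrics behind Fig. 2.4; the matching rule is assumed.
import numpy as np

def detection_rate_and_nfa(detections, ground_truth, match_radius=5.0):
    """Return (true detection rate, Normalized False Alarm) for one labeled frame."""
    gt = np.asarray(ground_truth, dtype=float)
    matched = np.zeros(len(gt), dtype=bool)
    false_alarms = 0
    for det in np.asarray(detections, dtype=float):
        if len(gt) > 0:
            d = np.linalg.norm(gt - det, axis=1)
            j = int(np.argmin(d))
            if d[j] <= match_radius and not matched[j]:
                matched[j] = True
                continue
        false_alarms += 1
    n_true = max(len(gt), 1)
    return matched.sum() / n_true, false_alarms / n_true   # NFA = false detections / true feed
```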
2.3.2 Local Region Estimation

The number of particles for localizing the optimum local region is 100. Since the results of the particle filter depend on the initialization, we repeat this experiment three times with different initial sizes of the local regions.

Figure 2.5: Local region optimization (maximum w^k vs. iteration number for initial window sizes of 200 × 200, 300 × 300, and 400 × 400).

The maximum w^k for the three runs is shown in Fig. 2.5. Note that the final iterations of all runs achieve a similar weight of 0.97, due to the huge overlap in the final optimum local region. The optimum local region is found to be centered at (224 ± 3, 256 ± 4), with a size of 258 × 258. The fact that all three runs converge to the same local region is a strong indication that the global optimum of this optimization is reached.

Figure 2.6: Comparison of normalized feed (local region at initialization, local region at the final iteration, ground truth, and global detection) across the labeled frame indices.

To illustrate the effectiveness of the particle filter, we plot four signals: the ground-truth feed in the entire frame G_i, T_i when feed detection is applied to the entire frame, and T_i^k with the maximum w^k at the initialization and at the final iteration. To compensate for different data ranges, we plot the normalized feed, $\frac{T_i^k - \mu(T^k)}{\mu(T^k)}$, in Fig. 2.6. Compared to the initialization and the global feed detection, the feed estimate at the optimum local region has the highest correlation with the ground-truth feed. Therefore, feeding control based on the local region is almost the same as feeding control based on the true feed of the entire frame.

2.3.3 Computational Efficiency

The computational efficiency is an important metric for any computer vision system. We evaluate the efficiency using a Matlab implementation on a conventional Windows 8 desktop computer with an Intel i5 CPU at 3.0 GHz and 8 GB RAM. The efficiency of feed detection depends on several factors, such as the size of the local region and the number of candidate feed passed to the SVM classifier. The total time for the CF step in the optimum local region is 0.006 seconds. The refinement classifier requires 0.004 seconds to extract features and classify a single candidate feed resulting from the CF. The average total time to detect feed in the local region is 0.085 seconds. In summary, our entire system operates at 5+ FPS, which also includes fish activity classification. With a future C++ implementation, we believe that our system can operate in real time on a conventional PC.

2.4 Conclusion

A fully automatic system is developed to understand fish eating behavior in a highly dense aquaculture tank. The ability to classify whether the fish are actively consuming feed, along with the continuous detection of excess feed, provides valuable information for feeding control in the tank. Correlation filters are an excellent candidate for detecting small feed-like particles, since they were able to detect objects with very limited appearance (∼30 pixels in area) at very high accuracy. However, many false alarms can appear due to specular reflections on the water surface, which can have features similar to feed particles.

Chapter 3
Detecting 2D trailer couplers using convolutional neural networks

3.1 Problem statement

Trailers range in size from small utility and boat trailers to large box trailers or recreational vehicle (RV) trailers. RV trailers alone had a revenue of five billion dollars in 2015 [32] and are estimated to exceed 381 thousand units sold in the United States in 2016 [65]. To hitch with a vehicle, trailers have a coupler at the front that is placed over a ball connected to the vehicle at the rear. Small, lightweight trailers may be manually moved into place. However, for heavy trailers, the vehicle must be driven backwards to connect to the stationary trailer. Traditionally, a second person, called a spotter, stands outside the vehicle to instruct the driver. Even with rear-view cameras on modern vehicles, the task is difficult and tedious due to the small size of the coupler on the screen. An automated system can take the place of the spotter and allow a single person to connect to a trailer with ease.

Engineers have developed numerous advanced vehicle systems providing security and comfort for drivers, e.g., emergency braking, blind spot assist, lane recognition, and active park assist. Current systems for trailers only provide assistance after they are hitched to a vehicle, such as maneuvering in narrow spaces [55], preventing large articulation angles [64], and avoiding oscillation at high speeds [22]. To the best of our knowledge, no other automated system exists for backing up towards a trailer.

Figure 3.1: Automatic trailer hitching by detecting and tracking the coupler using a Multiplexer-CNN.
This chapter presents an automated computer vision system using a single rear-view camera to hitch the vehicle to a trailer, as in Fig. 4.9. There are three main challenges, or sources of visual variation: position, trailer, and environment. Position refers to the pose and scale of the trailer. The trailer varies in type, shape, color, and size of both the trailer and the coupler. The environment affects the background of the scene through asphalt, dirt, grass, or snow, and the lighting from sun, clouds, or nighttime. Furthermore, there are strong performance requirements needed to successfully hitch the trailer. When hitching, the 2D coupler estimate should have an error less than the radius of the ball, e.g., 2.2 cm for a standard ball, and the efficiency should exceed 10 FPS. Assuming the driver parks the vehicle with the trailer in the field of view of the camera, this work presents a Multiplexer-CNN based system to automatically detect and track the trailer's coupler and continuously provide 3D coordinates to control a vehicle, while relying solely on a monocular rear-view fish-eye camera. Specifically, the goal of our system is to estimate both the 2D location in pixels and the 3D location in meters. An automatic control system may consume the 3D estimate to mechanically control the vehicle's movement. However, the mechanical system is outside the scope of this work. For convenience, we have dedicated Chapter 3 to 2D detection of the coupler only, and in Chapter 4 we will discuss how we localize the trailer's coupler in 3D.

The Multiplexer-CNN system selects from five CNN architectures to perform detection/tracking operations. The current estimate of the coupler position drives the CNN selection. Each CNN is invoked independently based on the estimated distance between the vehicle and the coupler. When no estimate is known, the DCNN (Detection network) detects the coupler by estimating potential locations along with confidence scores. We develop a novel loss function to enable DCNN to learn accurate confidence measures, along with regression estimates of the coupler position. With a confident estimate of the trailer position, the multiplexer selects among three networks, TCNN1, TCNN2, and TCNN3 (Tracking-via-detection networks), to perform 2D tracking by detection until the coupler is centered over the ball. During this time, the 3D location is inferred using a calibrated distance map from a fixed coupler height. As the coupler approaches the ball, it is crucial to estimate the height to avoid collision. The fifth network, CCNN (Contour network), estimates the coupler's contour, from which the height is regressed and the distance map is adjusted to provide accurate 3D estimation. All details regarding CCNN and height estimation can be found in Chapter 4.

Data is key to learning an accurate Multiplexer-CNN. We introduce the first large-scale dataset for trailers, with 899 videos containing ∼712,000 frames. We demonstrate the accuracy and efficiency of the system using a regular PC. Our system fulfills the performance requirements for successful hitching by achieving an estimation error of 1.4 cm when the ball reaches the coupler, while running at 18.9 FPS. Qualitatively, we show the ability of our system to detect and track unseen trailers in the field, generalizing to handle a large variety of challenges. In summary, our main contributions are: 1) Develop a novel loss function along with a CNN-based object detector, which estimates coupler coordinates associated with confidence scores.
2) Design a distance-driven Multiplexer-CNN that achieves both generalization across large variations and real-time efficiency. 3) Develop a method to estimate the 3D coupler coordinate with a monocular camera. 4) Present a large dataset for trailer coupler detection and tracking.

3.2 Convolutional neural networks on object detection

The deformable part model (DPM) [31] was the state-of-the-art object detector for many years, before the rapid emergence of CNNs. Object detection has since witnessed remarkable progress with CNNs, where a regression problem is often formulated to localize object bounding boxes with high accuracy and efficiency. Sermanet et al. [75] propose a regression network for detection where classification confidence helps aggregate proposed bounding boxes. However, it exhaustively searches the image and is not suitable for real-time applications. Girshick et al. [35] propose R-CNN, using object proposals generated by selective search. This model has been extended to Fast R-CNN [34] and Faster R-CNN [72]. In Faster R-CNN, the region proposal network (RPN) generates a set of rectangular object proposals, each with an objectness score measuring membership to an object class, after which the proposal is assigned a class-specific confidence score. Note that both the objectness score of the RPN and the classification confidence are trained separately from localization and are thresholded to produce a binary detection decision. In contrast, we incorporate the confidence score into the learning process and use it as a scalar without thresholding. Our confidence score reflects two major cues: (1) the existence of the target object in a region, and (2) how accurate the 2D target detection is.

Figure 3.2: An automated computer vision system for coupler detection, tracking, and 3D localization (a video frame is processed by DCNN with N1 patches for coupler detection, by the selected TCNNi with N2 patches for coupler tracking, and by CCNN with N3 patches for height and contour estimation; the distance map then yields the 3D estimate (xt, yt, zt) in meters).

3.3 Proposed method on 2D coupler detection

Our Multiplexer-CNN system has five CNN inputs: DCNN, TCNN1, TCNN2, TCNN3, and CCNN. As shown in Fig. 3.2, our system consists of three stages: (1) 2D coupler detection, (2) 2D coupler tracking, and (3) 3D coupler localization for vehicle automation. Stage 1 initializes the 2D coordinate of the coupler for Stage 2. Stages 2 and 3 collaborate in estimating both the 2D and 3D coupler positions. Stage 3, along with CCNN, will be explained in Chapter 4.

3.3.1 Preprocessing for Geometric Interpolation

Our rear-view camera has a fish-eye lens with a wide field-of-view (FOV). Using a checkerboard, standard camera calibration is performed to estimate the camera intrinsics, extrinsics, and lens distortion parameters, based on which we unwarp the input frame to correct the fish-eye effect. In our Multiplexer system, the coupler-to-vehicle distance (CVD) is crucial in serving as the selector to choose one CNN among the five. Thus, it is important to estimate the CVD accurately and efficiently. Instead of employing SfM for distance estimation [78], we propose to rely on a distance map covering the entire frame. This distance map can be obtained as follows.
Given an unwarped frame with a checkerboard placed on flat ground, we estimate the camera rotation and translation matrices, which can convert a single pixel (u, v) to a 3D world coordinate and measure the distance between the camera and the target pixel (the green point in Fig. 3.3). This step is repeated for all pixels in the unwarped frame. Given the known (fixed) camera height h0, solving a simple triangle problem converts the camera-to-pixel distance to the origin-to-pixel distance, where the origin of the 3D world coordinate is the projection of the camera onto the ground plane ((0, 0, 0) in Fig. 3.3). The origin-to-pixel distances of all pixels constitute the distance map on the ground plane, D0. Finally, given that the coupler is at a fixed but unknown height hc, we elevate D0 to obtain the coupler distance map Dh by simply applying Dh = (1 − hc/h0) D0. In our system, we assume hc = 50 cm when the CVD is over 1 meter, and otherwise estimate it using the method in Chapter 4. For different vehicles with different camera heights h0, the camera should be recalibrated to generate a new ground distance map D0. We can do this for all possible heights offline, and assign one to a vehicle during manufacture.

Figure 3.3: 3D distance estimation of the coupler (the camera at height h0 above the origin point (0, 0, 0), the coupler at height hc with coordinate (x, y, z), and the ground and coupler distance maps D0 and Dh).

Our distance map method has a few advantages. First, we only use a single camera, without additional sensors/cameras, to retrieve 3D depth information. Second, Dh can be updated efficiently to accommodate changes in the coupler height. Finally, the CVD is obtained by a simple look-up of Dh(u, v). However, our flat-ground assumption might affect CVD estimation if the ground is not flat. Fortunately, this has the most impact when the trailer is far away, and much less influence while approaching it, since close regions become locally flat. Note that the critical distance is the last CVD meter, when the coupler is near the vehicle.

3.3.2 Coupler Detection

When the system starts, we assume the trailer is positioned between three and seven meters from the rear of the vehicle. This is a reasonable assumption for our application. If the trailer starts too close, the vehicle may not have enough room to align with the hitch. If it is too far away, the coupler cannot be detected. We train a dedicated network, DCNN, to detect the coupler in the frame along with providing a confidence score for the estimation.

DCNN network The architecture of DCNN is in Tab. 3.1. Each convolution layer is followed by a ReLU and a maxpool layer. The key to DCNN is the novel confidence loss that allows the detector to balance the regression accuracy with the detection confidence accuracy. These scores reflect the existence of the target object within the spatial enclosure of the training patches, as well as how accurate the 2D regression is. The confidence loss is

L = Σ s̄ ||∆ū − ∆u||² + Σ λ1 ||s − 2s̄ (1 − sigm(λ2 ||∆ū − ∆u||²))||²,   (3.1)

where ∆u and s are the estimated 2D offset and confidence score, and ∆ū and s̄ are the ground truth. The loss function has two parts. The first part is the Euclidean loss, i.e., the regression error for a positive coupler patch, enabled by s̄. For negative patches, where s̄ = 0, the regression loss is ignored so that learning focuses only on the confidence score. The second part is the confidence score loss.
For negative patches, it penalizes nonzero scores, since the system should have no confidence in them. For positive patches, the confidence should be negatively correlated with the regression error, i.e., more accurate predictions of ∆u should have higher confidence. We pass the regression error through a sigmoid function and subtract it from 1, restricting the confidence to between 0 and 1.

Table 3.1: CNN architectures.
DCNN: input (200 × 200) → conv (7 × 7 × 20) → maxpool (2) → conv (7 × 7 × 30) → maxpool (2) → conv (5 × 5 × 40) → maxpool (2) → FC-100 → FC-3 → Confidence loss.
TCNN1,2,3: input (224 × 224) → conv (3 × 3 × 20) → conv (3 × 3 × 20) → maxpool (3) → conv (3 × 3 × 40) → conv (3 × 3 × 40) → maxpool (3) → conv (3 × 3 × 60) → maxpool (2) → conv (3 × 3 × 80) → maxpool (2) → conv (3 × 3 × 100) → FC-100 → FC-2 → Euclidean loss.
CCNN: input (200 × 200) → conv (7 × 7 × 20) → maxpool (3) → conv (5 × 5 × 40) → maxpool (3) → FC-100 → FC-76 → Euclidean loss.

We obtain training image patches to learn DCNN. The positive patches are obtained from the videos when the trailer is in the range of d1 to d2, and are assigned confidence scores s̄ = 1, along with the coupler offset ∆ū. Random perturbation is applied to the patches such that the center of the coupler is less than |∆ū| < w/4 away from the patch center, where w is the patch size. The negative patches are randomly selected in the surrounding area of the coupler such that the coupler center is far away from the patch center, |∆ū| > w/4, and are assigned scores of zero.

Coupler detection algorithm An illustration of the coupler detection algorithm is in Fig. 3.2. Given the initial trailer location within [d1, d2] meters, we place a grid G containing N1 evenly distributed points covering the region, as shown in the first column of Fig. 3.4. These points represent the center locations of N1 test patches, which are processed by DCNN. These patches have different sizes based on their estimated CVD dt, such that the patches in the top row of G have the smallest sizes, whereas those in the bottom row have the largest. This is motivated by the desire to feed the CNN training data with smaller scale variations. Otherwise, if we were to select a fixed patch size, e.g., 200 × 200, the patches would contain a lot of background when the trailer is far away, and nearly no background when it is nearby. We follow a simple formula to decide the patch size via the CVD,

w = −(45/2) Dh(ut−1, vt−1) + 300.   (3.2)

Here the maximum patch size is 300 × 300 when the trailer is zero meters away, and 75 × 75 when it is 10 meters away. Once the scale is determined, the patches are resized to 200 × 200. Since many patches overlap with the target coupler, it is likely that a number of patches will have high confidence scores. We update each grid point with its estimated location and repeat the process iteratively on the same initial frame, until the points cluster around potential couplers. In this particle-filter-like approach, we adopt a weighted sum of Gaussians for the final detection estimate. Every grid point is replaced with a 2D Gaussian (100 × 100 kernel size with σ = 30) weighted by its confidence score. The final detection is the maximum of the summation of the weighted Gaussians from all grid points. The estimated confidence scores of correct points increase as more iterations are performed. In general, we observe that four iterations are sufficient for satisfactory accuracy, as seen in Fig. 3.4.
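To make the CVD-driven patch sizing (Eqn. 3.2) and the weighted-Gaussian aggregation concrete, a minimal NumPy/SciPy sketch follows. The function names are our own, the distance maps are assumed to be stored as per-pixel arrays indexed as [row, column], and the Gaussian smoothing approximates the 100 × 100, σ = 30 kernel described above.

```python
# A minimal sketch (illustrative names, not the released implementation) of the
# distance-map elevation, CVD-driven patch sizing, and weighted-Gaussian aggregation.
import numpy as np
from scipy.ndimage import gaussian_filter

def elevate_distance_map(D0, h_coupler, h_camera):
    """Elevate the ground distance map D0 to the coupler height: Dh = (1 - hc/h0) * D0."""
    return (1.0 - h_coupler / h_camera) * D0

def patch_size(Dh, u, v):
    """Eqn. 3.2: w = -(45/2) * Dh(u, v) + 300, i.e., 300 px at 0 m and 75 px at 10 m."""
    return int(round(-22.5 * Dh[v, u] + 300))

def aggregate_detections(points, scores, frame_shape, sigma=30):
    """Replace each grid point with a confidence-weighted 2D Gaussian and return
    the location of the maximum of their sum (the final detection)."""
    heat = np.zeros(frame_shape[:2], dtype=float)
    for (u, v), s in zip(points, scores):
        heat[int(v), int(u)] += s          # impulse weighted by the confidence score
    heat = gaussian_filter(heat, sigma)    # spread each impulse into a Gaussian
    v_best, u_best = np.unravel_index(np.argmax(heat), heat.shape)
    return u_best, v_best
```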
3.3.3 Coupler Tracking

In coupler tracking, the coupler appearance changes throughout the backing-up process, which can be described in three stages: (a) Initially, the coupler is far away and often hard to discern due to its small size. (b) As the trailer gets closer, the coupler appears increasingly larger. (c) Within the last meter of backing up, the coupler appearance changes dramatically due to the increasing downward viewing angle of the camera. Therefore, we propose to use three networks, TCNN1, TCNN2, and TCNN3, to perform tracking by detection, where each operates in a predefined range, defined by d1 and d2.

Figure 3.4: Iterative coupler detection using DCNN (initialization, iterations 1, 2, and 4). The top row is the input image with the results of (∆u, s); green means s = 1 and red means s = 0. The bottom row shows the results of the weighted sum of Gaussians. Best viewed in color.

Tracking CNN networks The network architecture of TCNN is in Tab. 3.1. The first 10 layers are similar to the VGG network [68], with minor changes in the number of filters and maxpool layers. The full network is optimized using training images of trailer couplers. Unlike most object tracking works, which estimate a bounding box, we are only interested in finding the center of the coupler. Therefore, we define the Euclidean loss on the 2D coupler center position. A large number of training images are used to learn all three TCNN networks. Similar to DCNN, we apply random perturbation to the training patches such that the coupler center is less than w/4 away from the patch center. We also follow Eqn. 3.2 to crop the local region with the CVD-dependent size into a training patch.

Coupler tracking algorithm We adopt a tracking-by-detection method where TCNN serves as the detector. We use the tracking result of the previous frame to initialize the current frame. For stable tracking, we apply N2 randomly perturbed patches surrounding the initialization. The tracking result is obtained by averaging the estimations of all N2 patches. Two thresholds, τ1 and τ2, define the three ranges in which each of the three TCNNs operates.

Figure 3.5: Statistics of the trailer coupler database.

3.4 Trailer Coupler Database

With a rear-view camera on a vehicle, the videos capture the process of a vehicle backing up towards a trailer, from a variable distance to the point where the hitch ball is aligned with the trailer coupler. The database contains 899 videos consisting of ∼712K frames, with an average length of 19.3 seconds. The videos are collected at 40 FPS with a resolution of 1,920 × 1,200 using M-JPEG compression. A Point Grey camera (Model: BFLY-PGE-23S6C-C) with a fish-eye lens is used, with a wide FOV of 190°. The database contains the three types of challenges explained in Sec. 3.1. Fig. 3.5 shows some statistics of the database. Nearly 3/4 of the database is obtained from RV trailers, as they are the most popular and available. Even if the trailer type introduces differences in shape and size, they all have similarities in the coupler itself. Typically, people back up to the trailer in a straight line with a centered pose; however, we collect videos with various poses to make the system robust to any pose situation. The database was collected over a period of one year, which covered different weather conditions and ground types. Some examples can be seen in Fig. 4.6. There are three types of labels in the dataset.
1) Coupler label set: the 2D coordinate of the coupler center is labeled at every 10th frame of all 899 videos. 2) Contour label set: the 2D coordinates of 37 coupler contour points are labeled for a set of 810 frames, collected from 162 videos, when the CVD is 1.0, 0.8, 0.6, 0.4, and 0.2 meters. 3) Height label set: the coupler height is physically measured at the site when 72 videos are captured. These 72 videos are a subset of the 162 videos of the contour label set, which means we also have contour labels along with the height labels.

3.5 Experimental results

In this section, we will discuss the experimental setup, report the quantitative results of all three main stages separately and then jointly as a whole system, and finally show qualitative results of the entire system.

3.5.1 Multiplexer-CNN Setup

3.5.1.1 Network Implementation Details

We train and test DCNN and TCNN using the first 827 videos, using only the coupler label set. The videos are divided into 413 training videos and 414 testing videos. Given that most trailers have 2∼3 videos captured at different poses, we make sure that each unique trailer does not exist in both the training and testing sets. To train and test CCNN, we use all videos which contain contour-labeled frames. The videos are divided into 90 training videos with 450 labeled coupler contour frames, and 72 testing videos with 360 labeled coupler contour frames. The networks are trained with a learning rate of 0.001 and mini-batch sizes of 100, 20, and 20 for DCNN, TCNN, and CCNN, respectively. Our system uses the following parameters: τ1 = τ2 = 3, τ3 = 1, N1 = 100, N2 = N3 = 10, λ1 = λ2 = 1.

Figure 3.6: Detection errors vs. distance in meters (DCNN w/o score vs. DCNN).

3.5.2 Results

3.5.2.1 Coupler detection

For coupler detection, the baseline is similar to DCNN with the exact same network structure except for the loss function, which is a normal Euclidean loss, i.e., it only estimates ∆u without the confidence score s. Hence, the coupler detection algorithm remains the same, except that the sum of weighted Gaussians is replaced with a sum of Gaussians. Fig. 3.6 reports the results of both methods at various initializations of the CVD in the range of 3∼7 meters. This experiment shows the advantage of learning a confidence score over typical regression detectors. A clear margin of 20∼45 pixels in error is due to incorporating confidence score learning into DCNN. To visualize the effectiveness of our confidence scores in DCNN, we demonstrate the correlation between the estimated scores and the 2D offset estimation errors. Given six randomly selected testing videos, we collect a total of 600 pairs of (u, s) after running DCNN at a CVD of seven meters for four iterations. As seen in Fig. 3.7, the scores have an inverse correlation with the Euclidean estimation errors, i.e., detections with lower errors have higher scores. The correlation coefficient is found to be −0.65, capturing the strong correlation between the two.

Figure 3.7: Confidence scores vs. Euclidean estimation errors.

Note that some large offset estimation errors might still have a high score. This false alarm is caused by the patch drifting away from the coupler, where it detects some background object and mistakes it for the target coupler. An example is found in iteration 1 of Fig. 3.4, where some points illustrate this noisy behavior.
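The inverse correlation above is precisely what the second term of Eqn. 3.1 encourages during training. As a concrete illustration, a minimal PyTorch sketch of the confidence loss is given below; the thesis implementation uses MatConvNet, so the function name and tensor layout here are illustrative assumptions.

```python
# Minimal sketch of the confidence loss in Eqn. 3.1. Per batch of patches:
# pred_offset, gt_offset are (N, 2) 2D offsets; pred_score, gt_score are (N,) with
# gt_score = 1 for positive patches and 0 for negative ones.
import torch

def confidence_loss(pred_offset, pred_score, gt_offset, gt_score, lam1=1.0, lam2=1.0):
    reg_err = ((pred_offset - gt_offset) ** 2).sum(dim=1)   # ||du_bar - du||^2 per patch
    reg_term = (gt_score * reg_err).sum()                   # regression loss, positives only
    # Target confidence: 0 for negatives, 2*(1 - sigm(lam2 * error)) in [0, 1] for positives.
    target_score = 2.0 * gt_score * (1.0 - torch.sigmoid(lam2 * reg_err))
    conf_term = lam1 * ((pred_score - target_score) ** 2).sum()
    return reg_term + conf_term
```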
3.5.2.2 Coupler tracking

We compare our TCNN with two baseline object tracking methods, KCF [43] and C-COT [21]. Both methods have excelled on many tracking benchmarks, e.g., C-COT won the 2016 VOT challenge. To compare with the baselines, we provide three initializations to both baselines, but only one initialization to our method. For the baseline tracking initialization, we define a 100 × 100 bounding box centered around the coupler, at three CVD locations d0, d1 = 7, and d2 = 3 meters. Here d0 is the CVD of the first video frame, with an average of 10 meters in our database. We report the resulting x-y plane error in meters at specific CVDs in Fig. 3.8, 2D pixel errors at specific CVDs in Fig. 3.9, and precision plots in Fig. 3.10. We observe that both KCF and C-COT perform well right after initialization. However, both baseline methods suffer from drifting problems due to the extreme scale variations, which induce substantial appearance variation caused by changes in perspective.

Figure 3.8: Tracking comparison with errors in meters (TCNN, KCF, C-COT).
Figure 3.9: Tracking comparison with errors in pixels (TCNN, KCF, C-COT).
Figure 3.10: Tracking comparison with precision plots (TCNN, KCF, C-COT).

To the best of our knowledge, this challenge is rarely addressed in any of the tracking benchmark studies. Using three TCNN networks is specifically justified by the large appearance variation of the trailer coupler as the CVD converges to zero meters. We have experimentally studied the effect of using different numbers of TCNN networks, as seen in Fig. 3.11, with the final error at zero CVD meters in parentheses.

Figure 3.11: Number of TCNN networks used for tracking (final errors at zero CVD: 1 TCNN: 19.9 cm, 2 TCNNs: 4.2 cm, 3 TCNNs: 2.4 cm, 4 TCNNs: 2.3 cm).

With an increasing number of TCNN networks, higher tracking accuracy can be achieved, yet the system complexity increases. Thus, we choose three TCNN networks to balance accuracy and complexity.

3.6 Conclusion

In this chapter, we present a computer vision system which is capable of detecting and tracking the coupler at various distances. One of the key contributions of this system is the ability to detect the trailer using DCNN through a new loss function producing confidence measures associated with regression estimates. The TCNN networks have demonstrated superior detection accuracy as well as efficiency, especially in the close CVD range. The next chapter is a continuation of this chapter, where we will present a complete automated computer vision system for backing up a vehicle towards a trailer by providing accurate 3D coordinates of the coupler.

Chapter 4 3D trailer coupler localization using convolutional neural networks

4.1 Problem statement

The problem statement is the same as in Chapter 3. In this chapter, we propose to complete the Multiplexer-CNN based system for detecting and tracking the trailer's coupler and continuously providing 3D coordinates to control a vehicle, while relying solely on a monocular rear-view fish-eye camera, as seen in Fig. 4.9. Specifically, the goal of this chapter is to estimate the 3D location of the coupler in meters. Our motivation for finding the 3D location of the coupler is three-fold.
(1) An automatic control system will consume the 3D estimate to mechanically control the vehicle's movement. (2) To avoid collisions between the vehicle and the trailer. (3) By dedicating a separate method to height estimation via contour estimation, we can further improve the detection rates from Chapter 3. This is because knowledge of the general contour of the target coupler makes it easy to estimate the center of the coupler, i.e., the center of the contour. In the following sections, we will first explain the algorithm for estimating the coupler height and 3D localization based on coupler contour estimation. After that, we will explore some limitations introduced by the coupler shape, which directly cause the CCNN network to fail. We will further explore a generalized second approach for estimating the 3D coupler location regardless of the coupler shape. Finally, we will report results based on both approaches separately.

4.2 Height Estimation and 3D Localization

The motivation for estimating the coupler height is two-fold. 1) To control the vehicle mechanically, the vehicle control system requires a precise 3D location of the coupler, (x, y, z), which is the range, offset, and height in meters. This demands the distance map at the true height of the coupler. Hence, we need to estimate the coupler height rather than using the assumed height. 2) The vehicle has a hitch ball set at a fixed height. We need to ensure that the coupler is high enough to avoid colliding with the hitch ball. It is challenging to estimate the coupler height from a monocular camera. For fixed-size objects like license plates, the depth can be estimated from the plate's pixel size [15]. However, couplers vary significantly in their shape. To address this problem, we discover that the geometric shape of the coupler contour is indicative of the height, e.g., at a fixed CVD, increasing the height of a coupler will spread out the contour points. Therefore, we propose to estimate the coupler contour using CCNN. This allows us to extract contour geometric features and feed them to regressors to estimate coupler heights, as detailed in Alg. 1.

4.2.1 Coupler contour network

The network architecture of CCNN is in Tab. 3.1. Given the small number of contour labels, we learn a shallow network of 2 convolution layers and 2 FC layers, with a ReLU and a maxpool layer after each convolution layer. CCNN defines the Euclidean loss on the coupler contour, represented by a 76-dim vector c, i.e., 38 points, where 37 points are on the contour and the last one, (c(75), c(76)), is the coupler center.

Figure 4.1: Geometric features of a contour include distances along straight lines and slopes of the red lines.
Figure 4.2: Estimation of xt and yt on Dh elevated by the estimated height zt. The red dot is the estimated coupler; the green dot has zero offset from the origin point.

4.2.2 Contour estimation

Similar to tracking, to improve estimation stability, our contour is estimated using candidate contours of N3 patches extracted with random perturbations, each obtained by CCNN. Due to the high-dimensional output, there are normally a few outliers among the candidate contours. Therefore, we propose to use a shape model [63] to remove the outliers, i.e., contours with unusual coupler shapes. Based on labeled contour training images c̄, we compute a mean shape c̄0 and five basis shapes P, such that any candidate contour c can be represented as a coefficient vector b = P^T (c − c̄0).
If any candidate contour's b does not conform to the normal coefficient distribution learned from the training data, the contour is ignored when making the final mean estimation.

4.2.3 Height estimation

Given a stable contour estimation, we extract geometric features to capture the 2D shape of the coupler, as shown in Fig. 4.1. Specifically, we uniformly sample five points along the contour, including the two end points. We compute the Euclidean distance between every two points, resulting in 10 features, i.e., the red and green lines. We further compute the slopes of the red lines, resulting in four features. Thus, a 14-dim feature vector is extracted from the contour estimation of CCNN. Given five sets of training images, each having couplers at a specific CVD, we utilize their feature vectors to learn five height estimators {Ri}, i = 1, ..., 5, via the bagging M5P regressor [3, 86]. Our analysis shows bagging M5P to be superior to other well-known regression paradigms.

4.2.4 3D coupler localization

A detailed algorithm for 3D coupler localization is in Fig. 4.3. Given the current 2D coupler location (ut, vt) estimated via TCNN, we find the CVD dt utilizing the distance map Dh at the assumed height of 50 cm. However, when dt is less than τ3 = 1 meter, CCNN is activated to estimate the contour, followed by the height estimation, for each video frame. Then a refined CVD dt can be retrieved after two updates: first, the distance map Dh is elevated to the estimated coupler height hc; second, the 2D coupler location (ut, vt) is refined by averaging the TCNN result with the coupler center estimation from CCNN. Independent of whether dt is less than τ3, we need to convert the CVD dt to the 3D coupler location (xt, yt, zt) for the purpose of vehicle control. To find xt and yt, we solve a simple triangle problem on the distance map at the coupler height, where xt and yt are the two sides forming the 90° angle, as seen in Fig. 4.2, and the third side is the CVD dt. As for zt, it is either the assumed height of 50 cm if dt > τ3, or otherwise the estimated height hc.

Figure 4.3: 3D coupler localization. (a) An illustration of the 3D localization process (results of TCNNi → extract N3 patches → CCNN → ASM elimination constraints → mean contour result → 14-dim feature vector → height zt). (b) The exact algorithm followed in 3D coupler localization.

4.3 Experimental results

In this section, we will evaluate the contour estimation method first, and then analyze the height estimation using CCNN. We will further evaluate the overall system accuracy and efficiency.

4.3.1 Parameter setting

ValidShapeTest(b) in Alg. 1 returns true if all elements of b are within three standard deviations of the coefficient distributions. The five height estimators {Ri}, i = 1, ..., 5, are trained from contours at distance ranges of (0.9, 1.1], (0.7, 0.9], (0.5, 0.7], (0.3, 0.5], and (0.1, 0.3] meters, respectively.

Figure 4.4: Contour estimation comparison (precision vs. threshold in Euclidean distance error for CCNN, CCNNP, and ASM).

Table 4.1: System efficiency. Time (s): DCNN 0.088, TCNN 0.023, CCNN 0.010, Alg 1 0.471, Alg 2 0.053, Alg 3 0.035.

4.3.2 Height estimation

Contour estimation is crucial to our height estimation.
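As a concrete illustration of the contour-to-feature step in Sec. 4.2.3, a minimal sketch is given below. It assumes the 37 contour points are ordered along the coupler, and it takes the four segments between consecutive sampled points as the lines whose slopes are used; the exact choice of red lines in Fig. 4.1 may differ.

```python
# Minimal sketch: sample five points along the 37-point contour, take all 10
# pairwise distances, and add four segment slopes to form the 14-dim feature.
import numpy as np
from itertools import combinations

def contour_features(contour):
    """contour: (37, 2) array of (u, v) contour points ordered along the coupler."""
    idx = np.linspace(0, len(contour) - 1, 5).astype(int)    # 5 samples incl. both ends
    pts = contour[idx]
    dists = [np.linalg.norm(pts[i] - pts[j]) for i, j in combinations(range(5), 2)]
    slopes = [(pts[i + 1, 1] - pts[i, 1]) / (pts[i + 1, 0] - pts[i, 0] + 1e-6)
              for i in range(4)]
    return np.array(dists + slopes)                          # 10 + 4 = 14-dim feature
```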
We first compare our CCNN-based contour estimation with two baselines: ASM contour fitting [63], and CNN-based polynomial coefficient fitting inspired by [58], which used curve functions to describe the facial contour. For the classic ASM method, we learn a 2D ASM model to iteratively fit a shape to the coupler contour. For the second baseline, we learn a CNN named CCNNP, similar to CCNN with the same structure, except that instead of producing a regression output of 38 contour points, CCNNP estimates an 8-dim vector (αi, βi) for i = 1, ..., 4. Here αi and βi are the third-degree polynomial coefficients of the x-coordinates and y-coordinates of the contour points, respectively. We report the Euclidean distance error by measuring the shortest path from the estimated point to the ground-truth contour in Fig. 4.4. At a fixed threshold of 20 pixels, our CCNN has a precision rate of 92%, compared to 75% and 63% for CCNNP and ASM, respectively.

Given the coupler contour and its geometric features, we can estimate the height. We analyze the performance of the five height estimators through a 5-fold cross validation on the 72 videos of the height label set. The mean absolute errors are 0.85, 0.75, 0.60, 0.54, and 0.45 cm for R1 through R5, respectively. We observe higher performance for both contour and height estimation the closer we get to the coupler. Given the estimated height, we adjust the distance map, based on which we retrieve the CVD using the coupler center estimated by CCNN, i.e., (c(75), c(76)). Then we perform the 3D coupler localization using the same triangulation as in Fig. 4.2. The blue curve in Fig. 4.5 shows the height estimation accuracy using the height label set of 72 videos.

4.3.3 Overall system test

We report the results of the entire system, using components from all three stages, in Fig. 4.5. We use the height label set, because the labeled height (i.e., its elevated distance map) and the labeled coupler center can provide the ground-truth 3D coupler locations in each video frame. Note that the 2D coupler center is estimated by fusing TCNN3 and CCNN, as in Line 17 of Alg. 1. The minimum error is found in the offset estimation, followed by the height and range estimations. The small offset error is due to the fact that most backups are along the frontal angle, and therefore the error on the x-y plane is almost the same as the range error. The final estimation error on the x-y plane is merely 1.4 cm. While not directly comparable due to different datasets, it is substantially smaller than the TCNN3 error of 2.4 cm in Fig. 3.8. This x-y plane error is the most important accuracy metric for our system. At 1.4 cm, the vehicle control system would drive the vehicle to park and stop at a point where the hitch ball is only 1.4 cm away from the coupler center. The fact that most couplers have a radius of 2.2 cm means that users can easily lower the coupler and hook up with the hitch.

Figure 4.5: Overall system accuracy on the height label set (range error ∆x, offset error ∆y, height error ∆z, 3D error, and x-y plane error vs. CVD).

4.3.4 System efficiency

Our system is implemented in MATLAB using MatConvNet [83] on an Intel Core i7-4770 CPU at 3.40 GHz and a single NVIDIA TITAN X GPU. The system can run in real time, obtaining frames from the rear-view fish-eye camera, or offline, using the trailer coupler database for analysis. Table 4.1
provides a detailed efficiency analysis per CNN network and per system stage, where Alg 1, Alg 2, and Alg 3 are the detection, tracking, and localization of the coupler when tested with N1 (for one iteration), N2, and N3 patches, respectively. Note that Alg 1 requires nearly 1.88 seconds to perform the 4 iterations during initialization of the system. While tracking the coupler, we can achieve 18.9 FPS. However, when the CVD passes τ3, the system drops in speed to 11.4 FPS due to running both TCNN and CCNN.

Figure 4.6: Qualitative results of five videos. The first column shows the DCNN results, the three middle columns the TCNN results, and the last column the CCNN result. A red + is the estimated result, a yellow ◦ is the ground truth, and a green × is the estimated contour. Above each frame is (CVD, meter error, pixel error); the last column also shows the estimated height in meters. The red rectangles indicate the failure cases.

4.3.5 Qualitative results

Figure 4.6 shows qualitative results of detecting and tracking the coupler for full video sequences. We illustrate five different video examples, representing typical challenging cases in the database. We show a few failure cases of the system obtained from different stages of the videos. Note that the fourth and fifth columns use the same frame to illustrate the results of TCNN and CCNN, respectively. One key observation is that our Multiplexer-CNN approach has the ability to overcome failure cases at various stages of the system. For example, if TCNN fails within the last few frames, as in the last row of Fig. 4.6, CCNN has a good chance of correcting the problem. The same observation is made when the Multiplexer-CNN switches between any two possible networks.

Figure 4.7: Boat trailer couplers. The bottom row shows examples of cases where the contour estimation will fail, and hence the height and 3D localization will fail as well.

4.4 A generalized approach for 3D coupler localization

Using CCNN to estimate the contour points was proven to work well based on Fig. 4.5. However, all of the coupler types used in this evaluation generally have the same round shape. This type of coupler is commonly found in more than 95% of trailers, such as the ones in the first row of Fig. 4.7, while the remaining 5% have other shapes, such as those in the bottom row. These couplers are usually found on lightweight trailers, such as boat and utility trailers. A simple solution is to retrain CCNN with more coupler training images, such as the ones in Fig. 4.7. However, since these types of couplers are rare, it would be very hard to collect more data, and retraining CCNN with a limited amount of data would lead to over-fitting. Therefore, in order to come up with a generalized solution which is capable of functioning across all types of trailers, we propose to estimate the height without the need for contour estimation via CCNN. More specifically, we propose a new method to detect and track points on the coupler as well as points on the ground.
By detecting and tracking points on the ground, we can estimate the amount of shift on the ground in meters using the distance map. We propose to use scale-invariant feature transform (SIFT) matching [59] to find matching points on the ground, as seen in Fig. 4.8. Note that in this figure we also show some matching points on the coupler in the red rectangle, which are ignored when computing the ground shift. Technically, the amount of shift on the ground should be equivalent to the amount of shift of the coupler.

Figure 4.8: SIFT matching to find the amount of shift on the ground plane.

4.4.1 Height estimation algorithm: A new method

During tracking with TCNN3 in the last-meter range, we expect to continuously estimate and update the coupler height using the new method. Our new method follows these steps:

• Obtain frames It and It−1 when the CVD is less than 1 meter. We experimentally discovered that selecting adjacent frames produces very poor results, due to the small changes, which make it hard to find a clear distance measurement for both the ground and the coupler. To find the best distance between the two key frames, we test several scenarios with different real measured distances between the frames, as seen in Fig. 4.9. We observe that a very small or very large gap always gives bad results. On the other hand, a fixed gap in the range of 20 ∼ 30 cm gives the best performance. Note that this is made possible by keeping track of the real distance estimated for the coupler at every frame over the last meter. This is needed to find the best second frame candidate to perform SIFT matching and ultimately estimate the height zt.

Figure 4.9: Finding the best distance between the two key frames for applying SIFT and estimating zt.

• Detect and track points on the ground. To do this, we first extract a fixed patch of size 300 × 150, located in a fixed region in the lower left of the frame, i.e., the region to the left of the hitch ball of the truck. The local region and the SIFT parameters were manually selected by trial and error during a live demo session at an RV trailer site.

• Detect SIFT points and apply a SIFT matcher across the two extracted local regions from It and It−1.

• Find the average shift (∆û, ∆v̂) of the ground points, and estimate this shift in meters, dg, using the ground distance map D0.

• Find the average shift (∆u, ∆v) of the coupler 2D estimation, and find dc accordingly.

• Our motivation in this step is to find how much we need to elevate D0 such that dc − dg = 0. We were able to find a closed-form solution (a code sketch of this update is given below):

zt = h0 (1 − (D0(ût−1, v̂t−1) − D0(ût, v̂t)) / (D0(ut−1, vt−1) − D0(ut, vt))),   (4.1)

where D0 is the ground distance map, h0 is the camera height from the ground, (û, v̂) is the 2D detection of a ground point via SIFT, and (u, v) is the 2D detection of the coupler from TCNN3.

• To ensure a smooth estimation of the height, we average the estimates over a sliding window of 25 frames.

4.4.2 Experimental results

For all of the following reported results on height estimation, we still utilize the entire system, using components from all three stages as described in Chapter 3. Note that the errors in the x-y plane, ∆x and ∆y, should not change from the previous method in Fig. 4.5; however, since the previous algorithm updates the x-y plane coordinate based on the height, some minor changes do occur. Evaluating the system on the same 72 height-labeled videos, we obtain the results in Fig. 4.10.
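For reference, a minimal sketch of the closed-form update in Eqn. 4.1 follows; the distance-map indexing and argument names are our own illustrative conventions, and in the full system the per-frame estimates are further averaged over the 25-frame sliding window described above.

```python
# Minimal sketch of Eqn. 4.1. D0 is the ground distance map indexed as [row, column];
# ground_* are matched SIFT ground points (u, v) and coupler_* the TCNN3 coupler
# estimates (u, v) in two key frames roughly 20-30 cm apart; h0 is the camera height.
def estimate_coupler_height(D0, ground_prev, ground_curr, coupler_prev, coupler_curr, h0):
    """Return z_t = h0 * (1 - ground_shift / coupler_shift_on_ground_map)."""
    ground_shift = D0[ground_prev[1], ground_prev[0]] - D0[ground_curr[1], ground_curr[0]]
    coupler_shift = D0[coupler_prev[1], coupler_prev[0]] - D0[coupler_curr[1], coupler_curr[0]]
    return h0 * (1.0 - ground_shift / coupler_shift)
```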
Even though the results are slightly worse than those of the old method, i.e., contour estimation with CCNN followed by height regression, we argue that this trade-off is worthwhile: the new height estimator works across any type of coupler, whereas the old method only works on round couplers. To compare the old and new methods side by side and highlight the generalization capability of the new method, we create a new validation set containing 40 videos. This new set contains various types of couplers, including different shapes. The results of this experiment can be seen in Fig. 4.11. In this figure, we also show the results of height estimation when tracking is replaced with the ground-truth coupler center label of every frame. In this experiment, the new method has the lead, since the testing set is more challenging for the old method.

Figure 4.10: 3D analysis on the 72-video height dataset using the new method.
Figure 4.11: Comparing the height estimation of the CCNN approach versus the new method using SIFT matching points.

We also evaluate the height estimation on site in a real-time demo, where we report only the result of ∆z in Fig. 4.12. A total of 30 different trailers were tested, with coupler heights ranging from as low as 38 cm up to 55 cm. As seen in Fig. 4.12, we evaluate the height estimation throughout the last CVD meter. Based on our observations, the height estimation has large errors at the beginning of the process (mainly in the range of 0.5 ∼ 1.0 meter). After that, the height estimation becomes more reliable (especially in the last 0.25 meters). The final mean absolute error at zero distance to the coupler was found to be 2 ± 1.2 cm.

Figure 4.12: Estimating zt using the generalized approach.

4.5 Conclusion

We present an automated computer vision system for backing up a vehicle towards a trailer, using a distance-driven Multiplexer-CNN. While relying solely on a monocular rear-view fish-eye camera, we are able to provide accurate 3D coordinates of the coupler, which are needed for vehicle control. One of the key contributions of this system is the ability to detect the trailer using DCNN through a new loss function producing confidence measures associated with regression estimates. Our quantitative and qualitative results on the collected large-scale trailer database demonstrate our system's ability to be integrated into any vehicle with a rear-view camera. This work also illustrates how successful vision systems are built to meet real-world needs. From a technical perspective, other applications, such as autonomous driving, may benefit from our components, e.g., the confidence loss function, the distance map, and the Multiplexer-CNN.

Chapter 5 Detecting objects in overexposed bright illumination conditions

5.1 Problem statement

Because intelligent CV systems have undergone rapid growth over the past couple of decades, research on accurately detecting objects in outdoor scenes is growing in importance. Existing research using RGB cameras has mainly focused on methods for detecting objects in controlled environments or during normal ambient light conditions. Even though we, as humans, encounter challenging illumination conditions in our everyday lives, this aspect has been largely neglected when designing CV systems.
The illumination challenges can be categorized into two classes: overexposed bright images, with light projected onto the target object, and underexposed dark images, captured in the absence of light sources. Examples of the first class include detecting any object with overexposed illumination properties, such as the feed particles on the water surface of an aquaculture fish tank, as mentioned in Chapter 2. Examples of the underexposed dark class include driving at night with very limited light in the scene, or surveillance systems attempting to detect faces or pedestrians during nighttime. Additional examples of both challenging illumination classes can be seen in Fig. 5.1.

Figure 5.1: Challenging illumination examples (feed particle, coupler, vehicle, trailer, pedestrian, ID card). All examples in the left column represent light overexposure in several applications. The right column represents examples of underexposed images in dark light conditions.

Note that the two illumination challenges have different impacts on the images in terms of appearance. In the case of overexposed light, the light source intensity makes the target objects very bright, and in some cases important bright parts of an image are washed out, or effectively all white, making the object indistinguishable. Examples of this case can be seen in the left column of Fig. 5.1. On the other hand, the appearance of objects in underexposed dark images is generally hard to discern compared to good-lighting images. In this case, the intensities of the object's pixels and edges are degraded, making it difficult to separate the object from the background of the image. Given that the underexposed and overexposed illumination challenges have completely opposite effects on the images, it is nearly impossible to design a single CV system which can handle both objectives at the same time [29]. Therefore, we propose two separate methods, one for each case. The remainder of this chapter considers only the overexposed challenges and is organized as follows: an introduction to the overexposed bright light problem, along with the several datasets used in this chapter, including their details and the motivation for collecting them; then our proposed method, including estimating the light direction and using joint-filter CNNs to normalize the light in overexposed images; and finally, quantitative and qualitative results on a real-world problem. In the following chapter, we will propose a method to handle the underexposed low-light illumination challenge separately.

5.2 Introduction to the overexposed light challenge

Over the past decade, engineers have developed numerous advanced vehicle systems for autonomous driving, providing security and safety for drivers. These systems operate on the outside world, such as detecting vehicles on the streets and pedestrians on sidewalks. Recently, researchers have been developing CV systems for drivers that operate inside the vehicle, such as recognizing driver behavior [89], monitoring gaze and tracking hands [50], and monitoring passengers [91]. CV systems for in-vehicle perception are just as important as those for outside perception in providing comfort, safety, and security to drivers. In this chapter, we are interested in detecting objects in the backseat of a vehicle while driving.
More specifically, given frames captured at a top-view angle of the backseat of a vehicle, we would like to make object detection feasible regardless of the lighting conditions, as seen in Fig. 5.2. This problem is accompanied by many challenges, among which the illumination challenges are the most difficult. First, when sunlight is projected onto the backseat, the intensities of the objects become washed out, prohibiting the ability to localize and even recognize objects.

Figure 5.2: We present an approach for synthesizing good-lighting frames (bottom row) from videos taken in extreme light conditions (top row).

Second, since the vehicle is in motion, the illumination challenges vary accordingly, and the objects in the vehicle may also move, which makes object tracking and detection very difficult. Finally, since the bright sunlight projected onto the backseat has high pixel intensity values, cameras tend to balance the global frame intensities, resulting in very dark regions where no sunlight exists. In this section, we propose to continuously delight the frames, producing visually appealing frames and potentially enhancing object detection performance, as seen in the second row of Fig. 5.2.

The main motivation of this work is to make object detection possible under this challenging illumination scenario. From an application standpoint, several systems can benefit from having a delighted frame of the backseat of a vehicle. For example, the driver can be notified of objects he or she might have forgotten or lost. Another example relates to driver safety for transportation services, such as Uber or taxi drivers; in such cases, detecting knives or guns would alert the driver or even the police. Another application is monitoring children, e.g., infants, in the backseat.

Given the success of deep networks at related tasks such as image super resolution, depth upsampling, and noise reduction, we propose to use CNNs to synthesize appealing images by normalizing the light in the scene. Joint-filter CNN methods [56] will be utilized, where our hope is that an appropriately designed guided CNN can learn image delighting. However, training such a CNN requires a very large dataset of outdoor sunlight images, as seen in the top row of Fig. 5.2, with corresponding good-lighting images to serve as labels. Unfortunately, such a dataset does not currently exist: capturing the light image pairs, i.e., target and label light images, requires significant time and effort, so acquiring one is not feasible. Our insight is to exploit videos captured of the scenario in Fig. 5.2 by driving the vehicle for long periods of time, capturing various light conditions. Our hypothesis is that, given a bad-lighting frame in a video, a good-lighting or semi-good-lighting pair can be found temporally, such that it has better illumination properties. Note that when driving, the light condition changes dramatically from one frame to another. To fully understand the illumination properties of a scene, such as the existence of light and its direction, we design a CNN-based method which can identify the light situation for any given frame. More specifically, this CNN can identify the direction of all light sources within an image and, more importantly, help in detecting good ambient-light images to be paired with neighboring bad-light images.
Therefore, every single frame of a video will have a paired image, found temporally, such that the light directions of the two are quite different. After overcoming the data collection challenge, we propose DelightCNN, which uses the paired frames to supervise a CNN that filters good-lighting images from bad ones. We adopt the joint-filter based CNN method [56] in learning the network architecture, which consists of three sub-networks, as shown in Fig. 5.7. The first two sub-networks, DelightCNNT and DelightCNNG, act as feature extractors to determine informative features from both the target and guidance images. These feature responses are then concatenated as inputs for the network DelightCNNF, which selectively transfers common structures and reconstructs the filtered good-lighting output. Here, DelightCNNT takes the target bad-lighting image paired with a second, temporally selected image, chosen via the light direction estimation CNN to have complementary lighting conditions. On the other hand, DelightCNNG takes an image to help guide the network in recovering the bad-light regions of the target image. Since the camera is static, we propose to use a background image of the top-view scene with no illumination defects and no objects residing in it. This provides guidance to the CNN in learning what the original background looks like under ambient lighting. Therefore, each training sample is composed of four images, i.e., two paired target images, one guide, and the ground-truth label image. As a result, the learned network allows recovery of plausible ambient-light frames. We will also demonstrate the impact using a CF detector and compare the results before and after delighting the scene.

In the following sections, we will first briefly introduce some of the prior work related to our study. After that, we will introduce the datasets collected. Then we will explain how we pair bad-lighting frames with good (or semi-good) lighting frames using the light direction CNN. Finally, we will introduce the proposed DelightCNN model.

5.3 Prior work

We review prior work based on the two main components of our method. First, we review methods for light direction estimation from images. Second, we review papers which attempt to enhance and filter images suffering from bad light conditions and other relevant scenarios. Since we use a guided-filter CNN approach, we also review joint filtering methods separately.

5.3.1 Light direction estimation

Light direction estimation from a single image is a topic well studied over the past decades. Many of the previous works handle synthetic objects only, such as [85], which presents a method for the detection and estimation of multiple directional illuminants, using a single image of any object with known geometry and Lambertian reflectance. However, we need to be able to estimate the light direction without prior knowledge of geometry and reflectance properties. Many other works use real-life images to estimate the light direction, such as [53], which proposes a method for estimating the likely illumination conditions of the scene. In particular, they compute the probability distribution over the sun position and visibility. The method relies on a combination of weak cues that can be extracted from different portions of the image: the sky, the vertical surfaces, and the ground. In our work, we learn a simple CNN-based method for estimating the light direction of a single image.
The CNN is forced to learn the direction of light by analyzing the shadows and sunlight projections in any image. A similar approach in [73] introduces a method for recovering the illumination distribution of a scene from the image brightness inside shadows cast by an object of known shape in the scene. The method combines illumination analysis with an estimation of the reflectance properties of a shadow surface. In our method, the features indicative of the light direction are learned by the CNN model, regardless of object shape and reflectance properties.

5.3.2 Enhancing lighting in images

The following works are examples of methods which aim at improving image lighting, either by removing unwanted intensity regions from an image or by improving the image as a whole.

5.3.2.1 Shadow removal

Shadow removal has been a widely studied topic in the literature [7, 38, 48], and generally includes two steps: shadow localization and shadow removal. However, in our problem the shadow region might cover most of the input image, whereas the shadows in the shadow removal literature correspond only to those generated by smaller objects; this makes our shadow regions very hard to localize. In [70], they propose to remove the shadow detection step by learning a CNN to estimate the shadow matte directly from the input image. This method would be more suitable to consider in the delighting process, but the regions neighboring the shadows are of extreme importance in generating a shadow matte and removing the shadow. The overexposed sunlight washes out these regions, making it hard to recover the shadowed areas.

5.3.2.2 Image dehazing

In this field, researchers attempt to estimate the optical transmission in hazy scenes given a single input image. The goal is to eliminate scattered light to increase scene visibility and recover haze-free scene contrast. Existing methods in image dehazing use various constraints and priors to obtain plausible dehazing solutions [30]. In [66], assuming the scene depth is available, atmospheric effects are removed from terrain images taken by a forward-looking airborne camera. More recently, CNN-based methods such as [13] have proposed a system named DehazeNet, which estimates the medium transmission of an image. This system takes a hazy image as input and outputs its medium transmission map, which is subsequently used to recover a haze-free image via the atmospheric scattering model.

5.3.2.3 Image relighting

Image relighting changes the illumination of an image to a target illumination effect without knowing the original scene geometry, object information, or illumination conditions. The authors in [46] perform outdoor scene relighting, which needs only a single reference image and is based on material-constrained layer decomposition. In [57], they take advantage of multiple color-plus-depth images to model the lighting and reflected light field in terms of spherical harmonic coefficients, recovering both illumination and albedo to produce realistic relighting. Such works, along with many other works in this field [47, 80], focus on changing the illumination at the pixel level such that the resulting images appear brighter. All relighting methods assume typical illumination scenarios; applying such methods to overexposed scenes cannot recover from the visual artifacts, which require synthesis to overcome.
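Before reviewing joint filtering methods, we note that the three-branch design outlined in Sec. 5.2 can be summarized by the following minimal PyTorch sketch; the layer counts and channel widths here are illustrative assumptions, not the actual DelightCNN configuration shown in Fig. 5.7.

```python
# Illustrative three-branch joint-filter sketch: two feature extractors whose
# responses are concatenated and fused. Layer sizes are assumptions only.
import torch
import torch.nn as nn

def branch(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 5, padding=2), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())

class JointFilterSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.target_net = branch(6)   # bad-lighting frame + its temporal pair (two RGB images)
        self.guide_net = branch(3)    # static background image under ambient light
        self.fusion_net = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))   # reconstructed good-lighting frame

    def forward(self, target_pair, guide):
        feats = torch.cat([self.target_net(target_pair), self.guide_net(guide)], dim=1)
        return self.fusion_net(feats)
```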
5.3.3 Joint filtering methods

One of the earliest families of joint image filters is the bilateral filter [51, 81], which computes the filtered output as a weighted average of neighboring pixels in the target image. The filter weights depend on the local structure of the guidance image and neglect the target. In contrast, our DelightCNN considers the contents of both images by extracting feature maps from each. Another class of joint image filtering is based on global optimization, where the objective function consists of a data fidelity term and a regularization term. The first term ensures that the filtered output is similar to the target image. Most approaches in this area differ in the regularization term, which is responsible for maintaining the structure between the guide and output images. Some works define the regularization term according to texture derivatives [23], or mutual structures shared by the target and guidance image [76], among many others. However, these methods rely on hand-designed objective functions that may not generalize across all real-life images. In contrast, our method learns how to selectively transfer details directly from real-life datasets.

In the past few years, researchers have explored joint image filtering with CNNs, where the network architecture contains two main branches, one serving as a guide branch and the other for the target image, as in our proposed DelightCNN in Fig. 5.7. A similar architecture in [26] performs optical flow estimation, where two frames are used for the target and guide branches. Similarly, [93] takes stereo images as input and generates a disparity map; the first branch computes the cost-volume and the other jointly filters the volume. In [56], this approach was used for several low-level vision tasks, such as image noise reduction, depth map upsampling, and super resolution. Their network architecture resembles that of [26], with the only difference being that the merging layer in [26] uses a correlation operator, while the model in [56] merges the inputs by stacking the feature responses. Note that our DelightCNN network also uses stacking. Moreover, in [56] the guide image is selected from a different modality, e.g., an RGB image guides a depth target. However, in most real-world applications a single modality is preferred, since not all systems have access to several sensors. In contrast, we leverage the light direction CNN model to select the guide and target images, ensuring that the guide branch will indeed guide the target branch in delighting.

5.4 Dataset collection

Our proposed solution to the overexposed lighting challenge utilizes two different datasets: the light direction dataset and the BOSCH vehicle backseat dataset. The following two sections explain the details of each dataset.

5.4.1 Light direction dataset

The motivation for collecting this dataset was to have sufficient training data covering various lighting scenarios for objects placed on a table surface. The camera was placed at a top-view angle to match the setup of the BOSCH dataset, which is explained in the following section. Sample images of this dataset can be found in Fig. 5.3. We collect data from 11 different objects that could be found in the backseat of any vehicle.
The objects were selected such that they all have different properties in terms of reflectance, color, size, and shape. We use four light sources with different lumen values; the lumen is the SI unit of luminous flux, equal to the amount of light emitted through a solid angle by a source of one candela radiating equally in all directions. In other words, lumens represent the brightness of the light source. We use one very bright light source of 720 lumens, two 300-lumen light sources, and the ambient lighting provided by the normal indoor office environment. For every data collection sample, we adjust four variables: (a) the number of light sources used, ranging from 1 to 3; (b) the polar angle θ, which ranges from 0 at the North Pole to π/2 at the Equator; (c) the longitude φ, also known as the azimuth, which ranges over 0 ≤ φ < 2π; and (d) the pose of the object. Note that θ cannot exceed π/2, since we assume the light source is always above the horizon. The physical distance between each light source and the center of the object in the image is always fixed, and equals the radius of the circle shown in Fig. 5.5, i.e., 53 cm. Therefore, the location of the light source can be represented on the top half of a spherical surface. We have collected a total of 264 images at a resolution of 1000 × 1000, as seen in Fig. 5.3, using a Logitech C920 webcam. Every object has a total of 24 images with different lighting setups. Note that the image in the bottom far right corner represents the ambient lighting case.

Figure 5.3: Examples of the light direction dataset.

5.4.2 BOSCH vehicle backseat dataset

This dataset was collected by BOSCH for the purpose of detecting objects undergoing overexposed sunlight challenges. The key motivation is to provide the user with a list of objects that might have been forgotten by the driver in the backseat. A total of 12 videos were collected, of which three had no objects on the backseat while the remaining nine had several objects. The dataset contains 20,747 frames in total, with an average of 1,729 frames per video. The number of objects in every video ranges from 6 to 8, including an ID card, wallet, gloves, car keys, bottle, an empty can, a phone charger, a tool, a dollar bill, and sunglasses. The videos were recorded while the car was moving, and the driver intentionally made several left and right turns to cause changes in light direction. Moreover, the videos were recorded in a city area containing buildings, bridges, and trees; therefore, even when the driver drives in a straight line, many variations in the overexposed sunlight occur. Several examples of this dataset can be found in Fig. 5.4. Every row in the figure was obtained from a specific video: the first row is a video with no objects in the backseat, while the remaining videos have objects. The left images are examples of good lighting frames, whereas the right column shows frames selected temporally after the good frame which have bad lighting.

Figure 5.4: Examples of the BOSCH database. The left column represents one of the good lighting frames in a video sequence. The right column represents an overexposed sunlight example from the same sequence.

When there is a very bright region within an image, cameras tend to automatically balance the distribution of intensities.
This ultimately makes the shadow regions much darker, as seen in the first two rows of Fig. 5.4.

5.5 Estimating light direction using a CNN model

To estimate the light direction using a CNN model, we had to collect a dataset representative of light direction, as seen in Fig. 5.3. We use only 10 objects for training the CNN; the 11th object is left for validation, giving a total of 240 training images. For training, we extract small local patches from the object by randomly perturbing the center of the image. We extract a total of 7,965 training patches of size 128 × 128 × 3 pixels, using data augmentation by flipping and scaling the original images.

The light direction dataset has labeled light source directions in terms of (θ, φ) and light source brightness. We utilize these labels to create an 18-dimensional feature vector representing the lighting setup, by dividing the surface of the 3D sphere surrounding the object into 18 regions, as seen in Fig. 5.5. Note that when a light source is placed on the surface of the sphere, we do not represent it as a single point; instead, it is represented as a circular region whose radius is defined by the glare shield surrounding the real light bulb, i.e., a 10 cm radius. If only one light source exists in the image, and this light source resides entirely inside the boundaries of one bin on the sphere, then the light direction feature vector of the image contains 17 zeros and a value of one in the corresponding bin, representing the origin of the light. All training patches extracted from this image share the same feature vector, which serves as the training label. If the light source location overlaps multiple bin regions, the value of 1 is distributed among those bins such that the summation of the vector remains equal to 1. Similarly, if n light sources are used, the summation of the vector equals n.

The network architecture of the light direction CNN is similar to the one used in the trailer coupler tracker [4], i.e., the TCNN network, as seen in Tab. ??, with the only difference being the final fully connected layer, which has an 18-dimensional output. Note that the first 10 layers are similar to the VGG network [68], with minor changes in the number of filters and maxpool layers. The full network is optimized using the augmented training patches of the light direction dataset. We define a Euclidean loss for learning the 18-dimensional light direction feature vector.

Figure 5.5: Dividing the sphere into smaller regions, such that each region represents a specific direction of a light source.
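To make the label construction above concrete, the following is a minimal sketch, not the thesis implementation, of how an 18-dimensional light-direction label might be formed. The partition of the hemisphere into 2 polar rings × 9 azimuth sectors, the sampling-based approximation of the circular footprint, and the function name `light_label` are all assumptions for illustration; the actual bin layout is defined by Fig. 5.5.

```python
import numpy as np

# Hypothetical bin layout: 2 polar rings x 9 azimuth sectors = 18 bins.
N_THETA, N_PHI = 2, 9

def light_label(sources, radius_frac=0.19, n_samples=2000, rng=None):
    """Build the 18-D light-direction label for one image.

    sources : list of (theta, phi) light positions in radians,
              theta in [0, pi/2], phi in [0, 2*pi).
    radius_frac : angular radius approximating the 10 cm glare shield on the
                  53 cm sphere (10 / 53 ~= 0.19 rad); an assumption.
    """
    rng = rng or np.random.default_rng(0)
    label = np.zeros(N_THETA * N_PHI)
    for theta, phi in sources:
        # Sample points inside the circular footprint of the source and
        # histogram them into bins; each source contributes a total mass of 1.
        dt = rng.uniform(-radius_frac, radius_frac, n_samples)
        dp = rng.uniform(-radius_frac, radius_frac, n_samples)
        keep = dt**2 + dp**2 <= radius_frac**2
        t = np.clip(theta + dt[keep], 0, np.pi / 2 - 1e-6)
        p = np.mod(phi + dp[keep], 2 * np.pi)
        ti = (t / (np.pi / 2) * N_THETA).astype(int)
        pj = (p / (2 * np.pi) * N_PHI).astype(int)
        np.add.at(label, ti * N_PHI + pj, 1.0 / keep.sum())
    return label

# A source straddling a bin boundary splits its mass between bins,
# but the vector still sums to 1 (or to n for n sources):
print(light_label([(0.8, 1.4)]).sum())  # ~1.0
```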
5.6 Classifying the light quality using the light direction CNN

Knowing the exact light direction is not of great importance for the task of image delighting. However, knowledge of overexposed light patterns in the scene can be used to identify badly lit frames. Therefore, we leverage the output of the light direction CNN to identify which frames have good lighting, in support of the main objective of image delighting. Our goal is to learn a classification system that produces a continuous score representing the lighting quality of a frame. We use one of the BOSCH sequences to conduct the following experiment and learn the classifier.

For every frame, we extract 80 patches selected randomly from the global frame and compute the 18-dimensional feature vector representing the light direction in every patch. The final global light direction of the frame is computed by averaging the feature vectors of all local patches. We repeat the same process for all frames in the video sequence. After that, we reduce the dimensionality of all feature vectors to two dimensions using PCA for visualization purposes. In this lower-dimensional space, we perform K-means clustering to divide the data samples into 10 clusters, as seen in Fig. 5.6; K = 10 was determined experimentally. By visualizing the frames residing in each cluster, we observe a high correlation of light patterns among them; that is, the light directions of all frames in the same cluster are very similar. Moreover, we were able to identify three clusters that had no sunlight projected in the frame. We consider these three clusters as good lighting quality, with only ambient light illuminating the scene; the remaining seven clusters had varying overexposed light patterns. A simple SVM classifier can therefore be learned, where good lighting quality images are assigned one label and bad quality images the other. Moreover, by averaging the 18-dimensional feature vectors of the good lighting frames, we can apply a cosine similarity metric to any probe frame to identify how close it is to being good quality. In the following section, we explain how the cosine similarity is used in detail.

Figure 5.6: After applying dimensionality reduction using PCA on the 18-dimensional data for all frames extracted from a video, we obtain the distribution of points seen in the plot. For better visualization, we perform K-means clustering on the data points with K set to 10 clusters. By labeling the lighting quality of the frames as good or bad, we learn an SVM classifier.

Figure 5.7: Proposed method overview and pipeline.
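As a concrete illustration of the light-quality scoring described above, the sketch below clusters per-frame light-direction vectors and scores a probe frame by cosine similarity to the mean good-lighting vector. It is a minimal sketch under assumed inputs: `frame_feats` (one averaged 18-D vector per frame) and `good_cluster_ids` (the clusters judged, by inspection, to be ambient-only) are assumptions, and the SVM step of Fig. 5.6 is omitted for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def fit_light_quality(frame_feats, n_clusters=10):
    """frame_feats: (n_frames, 18) averaged light-direction vectors."""
    feats_2d = PCA(n_components=2).fit_transform(frame_feats)           # for visualization
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(feats_2d)               # K = 10 clusters
    return feats_2d, labels

def quality_score(probe_feat, good_mean):
    """Cosine similarity between a probe frame's 18-D vector and the mean
    vector of the frames judged to have good (ambient-only) lighting."""
    num = float(probe_feat @ good_mean)
    den = np.linalg.norm(probe_feat) * np.linalg.norm(good_mean) + 1e-12
    return num / den                                                    # in [0, 1], non-negative vectors

# Usage, with good_cluster_ids chosen by visual inspection as in Sec. 5.6:
# feats_2d, labels = fit_light_quality(frame_feats)
# good_mean = frame_feats[np.isin(labels, good_cluster_ids)].mean(axis=0)
# score = quality_score(frame_feats[t], good_mean)
```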
5.7 Learning the DelightCNN model

Since we can identify the light quality of any probe frame, we now direct our focus to learning a CV system for image delighting. Recently, researchers have been using joint filter CNNs, also known as guided CNNs, whose goal is to filter unwanted regions, artifacts, and defects from a target image and produce outputs with desired properties. The joint filter CNN has been used in several related tasks such as image super resolution [56, 90], depth upsampling [37, 45], and noise reduction [41, 56]. Our hope is that an appropriately designed guided CNN can learn image delighting. In the remainder of this section, we explain how we designed the DelightCNN network, including the sub-networks DelightCNNT and DelightCNNG.

The main concept behind a guided CNN is: given a target image, how can a guide image provide additional information to enhance the target or remove undesirable light patterns? In previous literature, the guide image can be obtained from a different modality than the target image, as in the depth image enhancement work [56], where the guide is the RGB image of the same scene and the goal is to improve the depth map. In our delighting work, however, no cross-modality information is available. Therefore, we propose to use RGB images in both the target and guide branches to help synthesize good lighting images. Since the overexposed target images contain bright light patterns, and the backseat might be occupied with passengers or objects, we propose to use an empty backseat with good lighting as the input guide image to the DelightCNNG branch, as seen in Fig. 5.7. This helps the DelightCNN model understand what a good lighting image looks like, and also helps in recovering any parts of the backseat distorted by overexposed light patterns.

Given a frame x_t with overexposed light patterns at time t, we use the light direction CNN model to extract the light direction feature vector f_t. We then search for a neighboring frame to form an image pair, such that the paired images have different light direction properties. To do this, we compute the light direction feature vectors of the previous n frames. For each of the n vectors f_{t−j}, with j ≤ n, we compute the cosine similarity

    c_{t−j} = (f_t · f_{t−j}) / (‖f_t‖_2 ‖f_{t−j}‖_2),    (5.1)

and seek its minimum. This similarity naturally ranges from 0 to 1, since the feature vectors are non-negative. After obtaining c for all n frames, we pair frame x_t with the frame producing the lowest similarity score, i.e., the index of min(c). Our general assumption is that n is small enough that the two paired images have minimal object motion. This pair is used as input to the target branch DelightCNNT of the DelightCNN model, as seen in Fig. 5.7.

All three sub-models, DelightCNNT, DelightCNNG, and DelightCNNF, have CNN architectures similar to the super-resolution CNN in [25]. Our architecture is a fully convolutional network (FCN), which means we can normalize the light of input images of arbitrary size, as long as the target and guide have the same dimensionality. Each sub-model contains three convolutional layers, each followed by a ReLU layer. The filter sizes of the three layers are 9 × 9, 1 × 1, and 5 × 5, respectively; the 1 × 1 filters in the middle of the network provide a nonlinear mapping of the responses from the previous layer. The number of filters and the general layout of all sub-models are illustrated in Fig. 5.7. Based on several experimental observations, we found that operating in the YCbCr color space is more robust than the RGB space. Moreover, since the YCbCr color space has one luminance channel Y representing pixel intensities and two chrominance channels Cb and Cr representing color, we propose to use only the Y channel as input for both branches when learning the model. The output responses of DelightCNNT and DelightCNNG are concatenated to form the input of DelightCNNF. The output of DelightCNN is also a single Y channel; we take the average of the Cb and Cr channels of the two target pair images to reform the colored YCbCr image. The use of the Y channel alone in learning a CNN model has been adopted in several other works [69].
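To make the three-branch design concrete, the following PyTorch sketch mirrors the description above. It is illustrative only: the filter counts (96, 48, 1) are read off Fig. 5.7, 'same' padding is an assumption of this sketch, the ReLU placement follows the text, and the class names are not from the thesis.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One DelightCNN sub-network: three conv layers (9x9, 1x1, 5x5), each
    followed by a ReLU. Filter counts (96, 48, out_ch) are read from Fig. 5.7
    and should be treated as approximate."""
    def __init__(self, in_ch, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 96, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(96, 48, kernel_size=1),               nn.ReLU(inplace=True),
            nn.Conv2d(48, out_ch, kernel_size=5, padding=2), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class DelightCNN(nn.Module):
    """Target branch takes the two temporally paired Y channels (2 channels),
    guide branch takes the empty-backseat Y channel (1 channel); their
    responses are concatenated and fused into a single delighted Y channel."""
    def __init__(self):
        super().__init__()
        self.target = Branch(in_ch=2)   # DelightCNN_T
        self.guide = Branch(in_ch=1)    # DelightCNN_G
        self.fuse = Branch(in_ch=2)     # DelightCNN_F
    def forward(self, target_pair_y, guide_y):
        t = self.target(target_pair_y)
        g = self.guide(guide_y)
        return self.fuse(torch.cat([t, g], dim=1))

# The model is fully convolutional, so arbitrary input sizes work:
# out = DelightCNN()(torch.randn(1, 2, 240, 320), torch.randn(1, 1, 240, 320))
```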
5.8 Experimental results

In this section, we discuss the model and experimental setup, and show quantitative and qualitative results of the entire system.

5.8.1 DelightCNN setup

Due to the large number of frames in the videos, we utilize only the first two videos with objects in the BOSCH dataset for training the DelightCNN, along with one video without objects used to extract an empty backseat for guiding the training process. To obtain the groundtruth image for supervising the learning process, we again utilize the light direction CNN, as follows. Since at test time the target branch receives two images paired based on the lighting of the previous n frames, during training we modify this step to select triplets of frames whose mutual cosine distances are maximal. One image of the triplet serves as the groundtruth, while the remaining two form the input training pair. We therefore expect the groundtruth image to have the best image quality among the triplet, which can be guaranteed by applying the SVM classifier learned on the light direction data. A total of 50,000 data triplets were generated from the two video sequences. Every training sample had a dimensionality of 128 × 128 × 4, where two channels were obtained from the target image pair, one from the guide image, and one from the groundtruth; the patches were extracted from the same location in the original four full frames. The network is trained with a learning rate of 0.001, a mini-batch size of 16, and 80 epochs. During testing, n = 30, since the camera operates at 30 fps, which means we search temporally for an image pair within the last second. During training, however, n = 90, to increase the chance of finding a much better light quality frame to serve as groundtruth.

5.8.2 Results on the light direction dataset

We evaluated the light direction CNN model using the 11th object, i.e., a coffee mug, which was excluded from the training set. For this object, the light settings are completely different from those in the training data. Given this evaluation setup, we can observe whether the trained CNN model generalizes to unseen objects and light positions. A total of 24 evaluation images were used, two of which are illustrated in Fig. 5.8. For every test image, we extract 80 patches from the center of the image with a random offset in both x and y directions, along with a random scale, similar to the training setup. Please refer to the bin numbers on the sphere in Fig. 5.5 to interpret the correspondence of light direction with the bins in Fig. 5.8.

Figure 5.8: Evaluation of the light direction CNN (estimated scores over the 18 feature vector bins for the 80 test patches of two test images).

For the first image, the light source is located such that bins 5, 6, and 12 sum up to one and all other bins are zero. As seen from the results of all 80 patches, the estimates are very consistent with the groundtruth label, with only small noise in neighboring bins. A similar analysis holds for the second image, whose groundtruth is bins 2 and 10 summing up to one with the rest zero. We collect all results obtained from the test images, i.e., 24 images × 80 patches, to compute the ROC curve in Fig. 5.9. At a false positive bin estimation rate of 0.1, our light direction CNN correctly classifies the true bin direction at a rate of 89.0%.
When we examine the failure cases in the results, the errors can be attributed to two causes: (a) poor patch extraction with large offsets from the center; in a few images the object is also not perfectly centered, causing some extracted patches to have very little overlap with the object. (b) Our label for each patch is an 18-dimensional vector, with ones assigned to the bins where the light source originates; in some cases the CNN estimates neighboring bins, causing false negatives and false positives in the performance.

Figure 5.9: ROC curve of light direction bin classification (true positive rate versus false positive rate).

We have also evaluated this system on the real-world outdoor BOSCH sequences, when we performed K-means clustering on the frames; the results are shown in Fig. 5.6. Each cluster had over 100 frames on average, and all frames within a cluster exhibited high illumination similarity, which is a direct indication that the CNN model is operating as required.

5.8.3 Results on the BOSCH dataset

We evaluated the system performance using the BOSCH backseat sequences, as seen in Fig. 5.10. Using the proposed DelightCNN method, we were able to synthesize good lighting images from frames with overexposed sunlight patterns. Note that in this experiment the guide image was an empty backseat, such as the one presented in Fig. 5.7. This image remains the same for all sequences collected in this vehicle; if the vehicle or backseat changes, the guide image needs to be updated accordingly by collecting an additional video with no objects in the scene. From the first two columns in Fig. 5.10, we can see that all the target images contain overexposed sunlight patterns, whereas the output frames are free from this challenge.

Figure 5.10: Evaluation of DelightCNN on the BOSCH backseat dataset. All frames belong to the first test video in the BOSCH dataset. The first column is the target frame obtained at time t. The second column is the temporally paired frame located at time t − j. The last column represents the delighted output of the proposed method.

The guided-CNN approach learns how to filter the overexposed areas in the frame to restore an image with normal light conditions. Looking closely at the results, a light residual with soft edges from the light patterns still exists; however, this residual is much easier to handle than the illumination challenge in the target images. The first three rows in Fig. 5.10 represent good results with minimal artifacts. In the fourth row, on the other hand, both paired target images suffer from extreme overexposure, which resulted in artifacts in the ID card, with some gray patches matching the backseat. This is expected, since the ID card is hardly visible in both target images.

We evaluate the results using an ID card object detector. Two of the sequences in the BOSCH dataset have the ID card on the backseat. We designed a CF-based approach trained on a total of 30 good lighting frames obtained from one of the sequences, while the other sequence was kept for evaluation. We utilized the same CF method used for feed detection in Chapter 2. From the 30 good lighting frames, we define 30 positive and 30 negative training samples.
For the positive samples, we manually crop the ID card to a size of 100 × 100, while the negative patches are selected randomly from the frame at the same size, such that the ID card does not appear in them. The goal of this CF is to produce a strong response only in regions of the frame that correspond to the ID card. In Fig. 5.11, we show four examples of the CF responses when testing on frames undergoing challenging illumination conditions, along with the delighted versions of the same frames. The examples are sorted by the severity of the illumination on the card: in the first row the card has no light patterns, while in the last example it suffers from extreme overexposed sunlight.

Figure 5.11: CF responses of the ID card detector. The images on the left correspond to the original input frames with challenging illumination conditions. The images on the right are the proposed delighted outputs of the frames on the left. Beside each frame is an illustration of the correlation output.

Based on the correlation output of the CF detector, a clear pattern emerges: the more light exposure on the ID card, the more the peak sharpness and amplitude degrade. On the other hand, if DelightCNN is used to normalize the light challenge, the CF detector is capable of detecting the ID card. We estimate the peak-to-sidelobe ratio (PSR) of each test frame as the final detection score, and place a fixed threshold of 10 on the PSR value to determine the final result, as is commonly done in other CF works [74]. The average precision on the original test set reached a low 42.6%, while on the delighted test set it reached a remarkable 91.1%, as seen in Fig. 5.12. The CF detection miss rate at a false positive per image (FPPI) rate of 0.1 was 57.6% on the original video; this average miss rate was reduced to 33.5% on the delighted sequences.

Figure 5.12: CF detection performance comparison when applied to the original video and the delighted video.

For lower PSR threshold values, the performance on the DelightCNN video achieves nearly perfect detection, which means correlation peaks exist in all test cases. With higher thresholds, however, the detection performance drops quickly, which means the peak amplitudes are not very high. This is expected, since the output frames have a small blurring effect introduced while synthesizing the image in DelightCNN. When evaluated on the original sequence, more than 90% of the tested frames that have very sharp peaks with a high PSR correspond to good lighting images. The reason DelightCNN does not synthesize high-resolution, good-quality output when tested on these good frames is the pairing algorithm, which attempts to find a frame with the opposite lighting condition; for a good lighting input, this leads to pairing with a bad lighting frame.
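For reference, the PSR score used for the ID-card detector above can be computed as in the following sketch. It follows common CF practice [74]; the size of the window excluded around the peak is an assumption of this sketch.

```python
import numpy as np

def psr(correlation_map, exclude=11):
    """Peak-to-sidelobe ratio of a CF correlation output.

    The peak is the maximum response; the sidelobe is everything outside a
    small (exclude x exclude, an assumed size) window centered on the peak.
    """
    corr = np.asarray(correlation_map, dtype=np.float64)
    r, c = np.unravel_index(np.argmax(corr), corr.shape)
    peak = corr[r, c]

    mask = np.ones_like(corr, dtype=bool)
    h = exclude // 2
    mask[max(0, r - h):r + h + 1, max(0, c - h):c + h + 1] = False
    sidelobe = corr[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)

# Detection rule used in the ID-card experiment: detect if psr(response) > 10.
```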
5.8.4 Limitation and failure cases

One of the biggest limitations of this work resides in the temporal pairing of target images, especially in the presence of sudden object motion. The BOSCH sequences have very limited object motion, where the objects stay in the same location for long periods of time; however, a few motion cases show up when the driver makes sharp turns, causing an object to move on the seat or even fall off it. The problem is that the two paired images can then have different object layouts, which causes ghosting effects in the output image, as seen in Fig. 5.13.

Figure 5.13: Failure cases of DelightCNN. The top row shows the target frames of a video, and the bottom row the DelightCNN output. The red rectangles mark the ghosting effect when sudden object motion occurs in the scene.

The DelightCNNT sub-model is trained such that the two paired images are complementary to each other and can provide enough information about the objects in the scene to synthesize a light-normalized image of a given object. With object motion, it appears to the system as if two objects exist in the scene instead of one.

5.9 Conclusion

In this chapter, we presented a method to normalize light in videos suffering from overexposed illumination conditions. We were able to synthesize good lighting conditions even when the target objects undergo bright illumination. One of the key contributions of this system is the use of estimated light direction in designing the DelightCNN system. Even though we trained the light direction CNN in a controlled environment, it generalizes well to the BOSCH videos, which contain direct and indirect illumination distributed in a complex way. Our quantitative and qualitative results demonstrate the system's ability to improve detection capability under overexposed light challenges.

Chapter 6

Detecting objects in underexposed low-light illumination conditions

6.1 Introduction to underexposed light challenge

The camera lens takes the beams of light reflected off an object and redirects them so that they come together to form a real image of the object. In the absence of a light source, the camera ultimately digitizes a black image. Likewise, if we move the light source far away, or reduce its intensity, the object will appear darker in the generated image, because it was less exposed to light. In this section, we focus on the case where objects appear in dark images, or in dark local parts of an image, and therefore have degraded appearance features. Our goal is to synthesize good lighting images from such inputs using a monocular RGB camera, for the purpose of improved object detection performance.

The main aim of any night vision detection system is to recognize the existence of objects, possibly to avoid collision, whether vehicle-vehicle, vehicle-pedestrian, or vehicle-trailer. This is a very important issue for any real-world outdoor detection application, and night vision capability is essential for guaranteeing the security and safety of users. Another important use of night vision capability is face recognition in surveillance scenarios [62]. Recently, Apple introduced a form of biometric authentication named "Face ID" in their smartphones, which is capable of recognizing the person regardless of the light condition, relying on a combination of depth and infrared sensors. Most previous work on object detection at night utilizes sensors such as infrared, thermal, and depth cameras [6, 49, 62, 82] to overcome underexposed illumination challenges. In this chapter, we propose to learn a CNN-based method for normalizing the light conditions of any given object undergoing low-light illumination challenges.
The main limitation is the lack of training data for such problems with corresponding labeled good lighting groundtruth. Over the past couple of decades, many researchers in the image processing field have worked on multiple exposure fusion (MEF) of low dynamic range (LDR) images to produce a single high dynamic range (HDR) image. MEF algorithms produce an HDR image by computing weights for each image, either locally or pixel-wise [69]; the fused HDR image is then the weighted sum of the images in the input sequence. It is important to note that the input LDR stack is collected by changing the shutter speed, usually by a factor of 2, while the aperture, focal length, ISO sensitivity, and other parameters remain the same [36]. We therefore propose to use all LDR images from MEF datasets that have a negative exposure value (EV), including the normal exposure, i.e., EV = 0, which we refer to as LDRDark. These LDRDark images serve as training data, where the groundtruth HDR image is obtained using the state-of-the-art MEF algorithm [60]. This groundtruth HDR image is used to supervise the CNN learning.

Our proposed method is motivated by the small illumination changes between neighboring underexposed LDRDark images, as seen in Fig. 6.1. One light normalization approach is to use every LDRDark image as a training sample, with the HDR image as the improved groundtruth. However, we experimentally found that learning such a mapping is complicated by the large illumination differences between the data and groundtruth images. Another approach is to take every two neighboring LDRDark images as a training pair, e.g., a training sample with EV = −4 and a groundtruth image with EV = −2, and so on. Unfortunately, a model trained this way, given an LDRDark test image, will only attempt to estimate another LDR image with a higher EV. In the illustration of Fig. 6.1, this corresponds to learning to predict images along the multi-exposure path shown by the solid black line. Instead, we need a solution that relights LDRDark images such that the illumination difference from the HDR image is reduced; in other words, given the analogy in Fig. 6.1, we would like to predict images along the dotted lines connecting every LDRDark image with the HDR image.

Figure 6.1: Proposed idea for delighting dark images, motivated by multi-exposure sequences along with the HDR image. Our goal is to synthesize better lighting images of HDR-like quality (green star), starting from an underexposed low dynamic range image (red circle), in an iterative scheme (orange and yellow circles).

We propose to train a simple yet effective CNN model that predicts images with HDR-like illumination conditions via an iterative technique. Note that our ultimate goal is not to produce HDR images themselves; rather, we want to learn the illumination properties of such images for the purpose of image light normalization. Extensive experiments have been carried out to evaluate the delighted images through object detection in several applications, such as pedestrian and trailer coupler detection.

6.2 Prior work

In this section, we will review two types of works.
First, we review prior work that performs object detection on night images without improving the illumination state of the scene. Second, we review prior work that attempts to normalize the dark lighting of the scene, regardless of whether object detection was part of the work.

6.2.1 Nighttime object detection

As a common solution, researchers employ additional sensors such as near-infrared (NIR) [62] and far-infrared (FIR) [82] cameras or thermal cameras [6]. Unfortunately, infrared cameras have major limitations in terms of the illumination angle and distance at which they can be used [49]. Moreover, the illuminator power must be adjusted adaptively depending on whether the object is close or far away, which makes it hard to generalize to real-world problems. For thermal cameras, two major concerns arise: the cost is very high, and they only work for objects emitting heat, which limits their possible CV applications. Even though thermal cameras have been used in some applications, such as pedestrian detection [6], using them for collision avoidance is not a good idea, because objects with heat signatures similar to the background cannot be avoided. For the above reasons, the optimal solution is to use standard RGB cameras and synthesize good lighting images from dark images using image processing (IP) and CV techniques.

Recent research has been conducted on nighttime detection using RGB cameras, but it has focused on objects at short distance [18], or on video-based methods that capture and process multiple images [84]. In other cases, researchers use side information to detect objects at night, such as detecting the light beams of incoming cars as an indication of a vehicle [2]. In contrast to these methods, we propose to improve the illumination conditions of the input images prior to applying the object detector.

6.2.2 Dark image enhancement and synthesizing

As mentioned earlier, several researchers have worked on HDR image reconstruction and tone mapping [60]. However, our work is not related to this line of research, since our goal is not to produce high quality images with a bit depth greater than 8 bits per color channel; we only utilize the datasets from these works to accomplish our goal. Many works have addressed light normalization of dark images in the literature. The work in [71] proposes a simple histogram equalization algorithm; however, this method generates several artifacts, especially on darker nighttime images. The authors in [20] present an image restoration method that leverages a large database of images gathered from the web: given a test image, they first search for a similar image in the database and then apply a local color transfer to perform restoration. However, this approach will not work well when the test image is very dark with limited object visibility. The work in [67] presents an algorithm that generates virtual image exposures by altering the illumination and reflectance components of the image, which are then recombined to generate an HDR image. Even though they use MEF to solve the problem, and start from a test image similar to ours, they can only handle images in which the dark regions cover local parts of the image, and will struggle on nighttime images. In [77], they propose a model for synthesizing a plausible image at a different time of day from an input image, utilizing time-lapse videos of various outdoor scenes of buildings and cities.
The biggest limitation of this method is that it transfers color information from the matched frame to the target frame, which introduces artifacts, and in some cases objects such as pedestrians or vehicles can be lost during the transformation. More recently, the authors in [39] proposed a state-of-the-art low-light image enhancement (LIME) method. They construct an illumination map that transforms all pixels in the image such that the output image is brighter: the individual illumination of each pixel is first estimated by finding the maximum value over the RGB channels, and the initial map is then refined by imposing a structure prior on it, resulting in the final illumination map. Another recent deep learning approach was proposed in [28], where the authors predict information that has been lost in saturated dark images in order to enable HDR reconstruction from a single exposure, using the HDRCNN autoencoder network. Due to the impressive results of both methods, i.e., LIME and HDRCNN, on underexposed dark challenges, they will be used as baselines in our evaluation on the MEF dataset.

6.3 Learning the iterative CNN model

Given any LDRDark image along with the HDR image of the same scene, we first construct the path between this pair of images through weighted averaging. The number of generated images n is based on the structural similarity index measure (SSIM), a method for predicting the perceived quality between two images based on three comparison measurements: luminance, contrast, and structure. SSIM can range from −1 to 1, but for an LDRDark and HDR pair the SSIM is always above zero, since the structure and texture are not as degraded as the luminance. If the SSIM between LDRDark and HDR is very high, meaning that LDRDark is already very close in appearance to HDR, then few images are interpolated along the path between the two, and vice versa. Therefore, the total number of images on the path is defined as

    n = ⌊m · (1 − SSIM(LDRDark, HDR))⌋,

where m is the maximum number of images to be interpolated on the path between LDRDark and HDR. The interpolation is performed such that the i-th interpolated image is

    I_i = LDRDark · (n − i)/n + HDR · (i/n),    where i ∈ [1, n].

Note that every training sample I_i uses I_{i+1} as its groundtruth when training the iterative CNN model.

The proposed fully convolutional network (FCN) used to learn the iterative light normalization technique is illustrated in Fig. 6.2. An FCN is chosen for its ability to normalize the light of an input image of arbitrary size. It is composed of six convolutional layers, each followed by a ReLU layer. All layers use small 3 × 3 filters, except for the first conv layer, which uses 5 × 5 filters, and the third conv layer, which uses 1 × 1 filters providing a nonlinear mapping. The numbers of filters for the layers are 3, 96, 96, 48, 48, and 3, respectively. A skip connection is introduced in the network, adding the input image to the responses of the fifth conv layer. This enforces residual learning, such that the output of the network is the amount of change made to the input image.

Figure 6.2: Iterative CNN architecture for image delighting.
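For concreteness, a minimal PyTorch sketch of this FCN and its iterative application is given below. It is a sketch, not the exact implementation: kernel sizes and filter counts are copied from the description above and should be treated as approximate, 'same' padding is assumed, and the skip connection is applied at the network output to realize the residual-learning idea (the text attaches it to the fifth conv layer's responses).

```python
import torch
import torch.nn as nn

class IterativeDelightFCN(nn.Module):
    """Sketch of the six-layer FCN of Fig. 6.2 for iterative image delighting."""
    def __init__(self):
        super().__init__()
        chans = [3, 3, 96, 96, 48, 48, 3]   # input channels followed by per-layer filter counts
        ksize = [5, 3, 1, 3, 3, 3]          # kernel sizes per layer
        layers = []
        for i, k in enumerate(ksize):
            layers += [nn.Conv2d(chans[i], chans[i + 1], k, padding=k // 2),
                       nn.ReLU(inplace=True)]          # each conv is followed by a ReLU
        self.body = nn.Sequential(*layers)

    def forward(self, x):                   # x: 3-channel YCbCr image in [0, 1]
        return x + self.body(x)             # the network predicts the change to the input

def delight(model, image, n_iters=4):
    """Apply the trained model iteratively; the best n_iters is image-dependent (Sec. 6.3)."""
    model.eval()
    out = image
    with torch.no_grad():
        for _ in range(n_iters):
            out = model(out).clamp(0.0, 1.0)
    return out
```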
We have also noticed that this skip connection improves the quality of the output image, generating images with sharper edges compared to the case without it. Because the paired training sample and groundtruth images differ only by a small underexposure enhancement, the model can be applied iteratively. An example obtained from the BOSCH dataset, where the backseat of the vehicle was captured at a very low exposure value, is shown in Fig. 6.3; note how the intensities of the image change across the first three iterations.

Figure 6.3: Iterative CNN example (the input image and the outputs of the first three iterations).

Ideally, the number of iterations could grow without bound, with the output eventually saturating at the optimal illumination of an HDR image. However, this is not the case in real-world scenarios: if an error occurs in one iteration, it propagates through the following iterations and causes artifacts. Therefore, the optimal number of iterations differs depending on the input image.

6.4 Underexposed light datasets

We used the images from the EMPA HDR database [96], which provides a total of 33 scenes containing multi-exposure image stacks acquired using different exposure times. This dataset was collected using two different cameras; we only use the images captured by the Canon EOS-1D Mark IV, corresponding to 19 scenes. The camera has a resolution of about 16 MP (4896 × 3264 pixels). Fig. 6.4 depicts a few examples of the scenes used in our proposed method to recover from underexposed light challenges.

Figure 6.4: Examples of the EMPA HDR database [96] (scenes: Cafe, Flowers, Knossos6). The top row shows good lighting HDR images generated by MEF of the LDR images using [60]. The second and third rows are examples of underexposed images with an EV of −4 and −2, respectively.

Each selected scene was captured with 7 exposure values: one picture with "normal" exposure settings, three underexposed pictures, and three overexposed pictures. Between the different pictures, the shutter speed was changed by a factor of 2, while the aperture, focal length, ISO sensitivity, and other parameters remained the same. In our work, we are only interested in underexposed images, and therefore we utilize only the three underexposed images per scene, along with the normal exposure setting, i.e., four images per scene representing LDRDark. Each image of the multi-exposure stack was converted from uncompressed RAW format to raster graphics using dcraw (version 9.23), as performed in [40]. Each image was stored at 8-bit depth, i.e., 24 bits per pixel, which is the most common representation used to store LDR images and is suitable for display on LDR monitors.

For evaluation purposes only, we also utilize a second MEF dataset [61] collected at the University of Waterloo, which we refer to as the Waterloo MEF dataset to distinguish it from the dataset above. This set contains a total of 17 scenes, with a mixture of indoor and outdoor images. Each scene has a different number of LDR images, and we remove all images above normal exposure, keeping only LDRDark. In total, 60 LDRDark images are used for testing.
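Before moving to the experiments, the following sketch shows how the interpolation path of Sec. 6.3 can be turned into training pairs from an (LDRDark, HDR) couple of either dataset. It is a minimal sketch: function names are illustrative, images are assumed to be float arrays in [0, 1], and patch extraction and augmentation are omitted.

```python
import numpy as np
from skimage.metrics import structural_similarity

def interpolation_path(ldr_dark, hdr, m=10):
    """Build the interpolated images I_1..I_n between an LDR_Dark image and its
    HDR groundtruth (both float arrays in [0, 1], same shape), per Sec. 6.3."""
    ssim = structural_similarity(ldr_dark, hdr, channel_axis=-1, data_range=1.0)
    n = int(np.floor(m * (1.0 - ssim)))
    return [ldr_dark * (n - i) / n + hdr * (i / n) for i in range(1, n + 1)]

def training_pairs(ldr_dark, hdr, m=10):
    """(sample, groundtruth) pairs: each I_i is supervised by the next image I_{i+1}."""
    path = interpolation_path(ldr_dark, hdr, m)
    return list(zip(path[:-1], path[1:]))
```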
6.5 Experimental results

In this section, we discuss the model and experimental setup. After that, we analyze the results on the MEF dataset. Finally, we apply the model to real-world problems including trailer coupler and pedestrian detection.

6.5.1 Iterative CNN setup

We used all 19 scenes from the EMPA HDR dataset for training the FCN model. Even though the total number of images is small, their resolution is large, i.e., 4896 × 3264 pixels. Moreover, each image has 4 different exposure settings, and we also interpolate several intermediate images along the path between LDRDark and the HDR image. For interpolation, we set m = 10, so that the maximum number of interpolated images is 10 in the case of very low SSIM. For training the model, we further extract 100 smaller patches of size 128 × 128 × 3 selected randomly from each image. The total number of training patches used was 190,000, where each patch has a corresponding groundtruth image of the same size extracted from the same location of the original image. Similar to the DelightCNN work, we found that operating in the YCbCr color space is more robust than the RGB space; instead of learning only the Y channel, we let the FCN learn all YCbCr channels. The network is trained with a learning rate of 0.001, a mini-batch size of 16, and 150 epochs.

6.5.2 Results on the MEF dataset using the iterative method

We initially evaluate the system on the EMPA HDR dataset to observe how it performs. Note that this set was used for training, by extracting 100 small patches selected randomly from each very large frame; here, however, we evaluate the system using the global LDRDark image and compare the resulting image at different iterations with the HDR image by computing the SSIM. Note also that after the first iteration, the input of the CNN in the second iteration is an image synthesized by the CNN, and therefore disjoint from the training set. In other words, given the analogy in Fig. 6.1, our iterative approach applied to a single image generates a new dotted path between the LDRDark image and some point near the HDR image; this dotted path differs from the paths used in the training phase.

Iteration 1    0.10 ± 0.22
Iteration 2    0.22 ± 0.19
Iteration 3    0.32 ± 0.16
Iteration 4    0.40 ± 0.12
Iteration 5    0.44 ± 0.08
Iteration 6    0.45 ± 0.06
Table 6.1: Average SSIM gain of the iterative CNN model using the MEF dataset.

In Fig. 6.5, we illustrate the system behavior given 4 images of the same scene, taken at exposure values EV = −6, −4, −2, 0, respectively. When a very underexposed dark image is the input to the system, i.e., the first row of Fig. 6.5, each iteration gradually improves the illumination quality of the image. However, when the input image already has a good starting point, i.e., the last row of Fig. 6.5, the proposed network has minimal effect, and each iteration's output is similar to its input. This property is very important to avoid unwanted artifacts or over-brightening of the image.

Figure 6.5: Iterative CNN evaluation on the MEF dataset. The first column is the input LDRDark, the following six columns are the outputs of six iterations of the proposed method, and the final column is the HDR groundtruth image. All input images are of the same scene, taken at EV = −6, −4, −2, 0, respectively.

In Fig. 6.6, we visualize the system behavior given 9 images of various scenes, taken at low exposure values.
One key observation is that high intensity pixels in the input image are not affected by the iterative CNN model. For example, the sun in rows 2, 3, and 5 casts bright light in the input images; this light remains the same throughout all six iterations, while the dark underexposed regions are enhanced. For the last four rows in Fig. 6.6, some artifacts begin to show up in the results of iterations 5 and 6. As mentioned earlier, once an error occurs in one of the iterations, it propagates to all later iterations. Given that the images were taken with the same settings and camera, it appears that certain colors are causing the issue, mainly a specific red and a specific blue, since the artifacts appear in those colors.

Figure 6.6: Iterative CNN evaluation on the MEF dataset. The first column is the input LDRDark, the following columns are the outputs of six iterations of the proposed method, and the final column is the HDR groundtruth image. Each row is an example scene from the MEF dataset, all with a very low EV.

The SSIM of each image increases continuously with the number of iterations, which indicates that the system is truly approaching the illumination condition of an HDR image. The average SSIM gain over the first six iterations is shown in Table 6.1; the gain is smallest in the first iteration and largely saturates by the sixth. Note that all prior work in the MEF field [69] utilizes multiple images with different exposure values to produce an HDR image, whereas we utilize a single image and get very close to the HDR image.

6.5.3 Comparison with state of the art

We compare our proposed iterative CNN model with two baseline light enhancing methods, HDRCNN [28] and LIME [39], both of which have excelled at enhancing and normalizing dark input images. For a complete evaluation, we also compare our method with simple histogram equalization. The qualitative results on the EMPA HDR dataset are shown in Fig. 6.7. Unexpectedly, the HDRCNN technique seemed to synthesize the least natural results, even less natural than the LDR input images. This can be explained by the fact that HDRCNN mainly emphasizes predicting information that has been lost in saturated image areas in order to enable HDR reconstruction from a single exposure; improving the global image regardless of its saturation does not seem to work well. On the other hand, the LIME method produces illumination quality comparable to our proposed method.

We also compare our method with the baselines by computing the SSIM and PSNR values on the EMPA dataset, as seen in Table 6.2. The highest average SSIM was produced by the iterative CNN with 6 (and even 5 and 4) iterations, followed by the LIME method [39]. For PSNR, the highest values were obtained by the 6th and 5th iterations of our proposed method, followed by LIME.

Method                    Average SSIM    Average PSNR
Target image              0.36 ± 0.23     11.12 ± 3.10
Histogram Equalization    0.47 ± 0.19     13.61 ± 2.56
LIME [39]                 0.75 ± 0.16     16.67 ± 3.57
HDRCNN [28]               0.1 ± 0.1       8.59 ± 2.49
Iteration 1               0.46 ± 0.22     12.10 ± 3.18
Iteration 2               0.58 ± 0.19     13.34 ± 3.27
Iteration 3               0.68 ± 0.16     14.66 ± 3.33
Iteration 4               0.76 ± 0.12     16.02 ± 3.29
Iteration 5               0.80 ± 0.08     17.32 ± 3.04
Iteration 6               0.81 ± 0.06     17.84 ± 2.67
Table 6.2: Average SSIM and PSNR using the EMPA HDR dataset compared with SOTA methods.
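The quantitative comparison above can be reproduced, in outline, with standard metric implementations. The sketch below shows the per-image computation assumed for Tables 6.2 and 6.3; images are assumed to be float RGB arrays in [0, 1], and the variable names are illustrative.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(outputs, hdr_refs):
    """outputs, hdr_refs: lists of float RGB arrays in [0, 1] of equal shape.
    Returns (mean, std) of SSIM and PSNR over the test set."""
    ssims, psnrs = [], []
    for out, ref in zip(outputs, hdr_refs):
        ssims.append(structural_similarity(ref, out, channel_axis=-1, data_range=1.0))
        psnrs.append(peak_signal_noise_ratio(ref, out, data_range=1.0))
    return (np.mean(ssims), np.std(ssims)), (np.mean(psnrs), np.std(psnrs))
```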
Figure 6.7: Iterative CNN comparison with SOTA methods on the EMPA HDR dataset. The first column is the input LDRDark; the following columns are histogram equalization, LIME, HDRCNN, proposed iteration 4, proposed iteration 6, and finally the HDR groundtruth image. All input images have different negative EV values.

We also compare our results on the Waterloo MEF dataset, as seen in Fig. 6.8. Note that we excluded the HDRCNN method from this experiment, due to the poor performance observed in the previous result. One of the main challenges of this dataset is the small resolution of the input images, which makes it harder to recover the illumination details of smaller objects compared to the previous dataset. Also, much lower EVs were used when capturing this dataset, making some images very hard to recover. We further compute the average SSIM and PSNR on the Waterloo dataset, as seen in Table 6.3. Similar to the observation made for Table 6.2, the last two iterations give our method the lead in both PSNR and SSIM measures.

Figure 6.8: Iterative CNN comparison with SOTA methods on the Waterloo MEF dataset. The first column is the input LDRDark; the following columns are LIME, histogram equalization, proposed iterations 1 − 6, and finally the HDR groundtruth image in the last column. All input images have different negative EV values.

Method                    Average SSIM    Average PSNR
Target image              0.42 ± 0.20     15.49 ± 2.66
Histogram Equalization    0.42 ± 0.17     13.07 ± 1.75
LIME [39]                 0.58 ± 0.26     14.83 ± 3.13
Iteration 1               0.44 ± 0.20     15.74 ± 2.69
Iteration 2               0.47 ± 0.19     15.85 ± 3.14
Iteration 3               0.52 ± 0.19     15.94 ± 4.42
Iteration 4               0.56 ± 0.19     16.23 ± 5.18
Iteration 5               0.65 ± 0.08     16.80 ± 4.31
Iteration 6               0.70 ± 0.17     17.16 ± 2.86
Table 6.3: Average SSIM and PSNR using the Waterloo MEF dataset compared with SOTA methods.

In all MEF experiments, we also examined the SSIM and PSNR for iterations beyond 6. We found that performance degrades for higher iterations, even though most of the image indeed has better illumination quality than at the 6th iteration; this is due to the artifacts that begin to propagate in the iterative method. In some cases, artifacts appear at an earlier iteration stage, as in the last two rows of Fig. 6.7, where certain colors in the scene, such as the red and blue color artifacts, begin to develop and spread. If we apply more iterations in these two cases, the global image will have better lighting, but the artifacts will grow in size, reducing the SSIM and PSNR.

6.5.4 Pedestrian detection at night analysis

Most pedestrian detection datasets in the literature were collected during daytime, where underexposure challenges do not exist. For this reason, we collected several videos of pedestrians while driving, by attaching a camera to the windshield of a vehicle. Two sets of videos were collected: (1) the first set was collected starting at sunset, when natural light still exists in the scene; in this set we also focus on pedestrians crossing the street. (2) The second set was collected late at night, where the only light sources were the vehicle headlights, streetlight posts, building lights, or even the moon. These videos have a total length of 15 minutes. The number of pedestrians per frame ranges from 1 to 6, with an average of 2 pedestrians per frame.
For evaluating the detection performance, we label 50 frames from each video set, i.e., a total of 100 labeled frames. We utilize the state-of-the-art pedestrian detection algorithm [11] to evaluate detection performance on the original night sequences compared with the generated enhanced night sequences. For completeness, we also apply the detector to the original sequences after preprocessing them with histogram equalization. Several results can be found in Fig. 6.9.

Figure 6.9: Iterative CNN evaluation on the pedestrian detection application. The yellow dashed bounding boxes are groundtruth labels, and the red bounding boxes are the detected persons given the input frames. The first column is the original frame, the second is the histogram equalized frame, and the third and fourth columns are the 2nd and 4th iteration outputs of the proposed method.

The precision-recall curve and the average miss rate sampled against false positives per image (FPPI) are used for measuring performance, as seen in Fig. 6.10. The precision-recall curve shows the tradeoff between precision and recall for different thresholds: a high area under the curve represents both high recall and high precision of the pedestrian detector, where high precision relates to a low false positive rate and high recall relates to a low false negative rate. A minimum IoU threshold of 0.5 is required for a detected box to match a groundtruth box.

Figure 6.10: Precision-recall curve (left) and miss-rate curve (right) of the pedestrian detection algorithm when applied to the original video sequences, the histogram equalized videos, and the 2nd and 4th iteration outputs of the proposed method.

On the original video sequences, the pedestrian detector has an average miss rate of 31.5% and an average precision of 77.2%. When using the 2nd iteration output of the proposed method, we achieve an impressive 16.5% average miss rate, with an average precision of 87.7%. This is a large performance gain considering that the same detection algorithm is used, which is further validation that the proposed method indeed reduces the illumination challenges of an image. Note that the optimal detection performance was found for the 2nd iteration of our proposed method; all following iterations had lower detection rates.

6.5.5 Trailer coupler detection at night analysis

Our trailer coupler system from Chapter 3 was trained using videos collected during daytime. We evaluated this system under the nighttime challenge and found that it fails to detect the coupler, especially at closer ranges. Even though the test samples are obtained at nighttime, the red brake lights and the white reverse light illuminate the scene while backing up towards the coupler.

Figure 6.11: Iterative CNN evaluation on the trailer coupler detection system at nighttime for two videos. The red plus signs are the estimated 2D coupler centers. We show three frames from each video, illustrating the results at far, middle, and close range. For each video, the left column shows the original frames, the middle column the histogram equalization of the frames, and the right column the output of the proposed method.

Note that, for most of each video, we apply the brakes the majority of the time to avoid backing up quickly.
6.5.5 Trailer coupler detection at night analysis

Our trailer coupler system from Chapter 3 was trained using videos collected during daytime. We evaluate this system on the nighttime challenge and find that it fails at detecting the coupler, especially at closer ranges. Even though the test samples are obtained at nighttime, the red brake lights and the white reverse lights illuminate the scene while backing up towards the coupler. Note that, for most of the video, we tend to use the brakes the majority of the time to avoid backing up quickly.

Figure 6.11: Iterative CNN evaluation on the trailer coupler detection system at nighttime for two videos. The red plus signs are the estimated 2D coupler centers. We obtain three frames from each video, illustrating the results at far, middle, and close range. For each video, the left column shows the original frames, the middle column the histogram-equalized frames, and the right column the output of the proposed method.

We apply the iterative CNN model to recover dark regions and brighten them while keeping the bright areas the same. We experimentally found that four iterations were sufficient to make the video bright enough for the coupler system to operate successfully. Note that the trailer system utilizes grayscale images. Therefore, we first remove the fish-eye effect in RGB, then apply the iterative CNN method. After lighting up the scene with the iterative method, we convert to grayscale, and finally apply the trailer system to the video sequences.

For every test video, we compare the results of the trailer coupler system on three versions of the video: the original dark sequence, the histogram-equalized sequence, and the proposed relighted sequence, as seen in Fig. 6.11. Some of the original frames may seem too dark; this is because the driver was not using the brakes at that time. As a general observation, we find that detecting the trailer coupler at far distances works very well. On the other hand, the trailer detection begins to fail at around 3 ∼ 4 meters CVD distance; in this range, our proposed TCNN2 appears to lose track of the coupler. One clear advantage of our proposed system can be seen in the synthesized relighted frames, where the illumination is very consistent regardless of whether the driver was using the brakes or not. This consistency helps stabilize the system's detection and tracking performance.

6.6 Conclusion

In this chapter, we present a method to normalize light in images suffering from underexposed illumination conditions. Motivated by the multi-exposure datasets, we propose a novel iterative CNN model that recovers good illumination from dark images, such that the CNN's goal is to match HDR lighting quality. All previous MEF work had to use several images at test time to recover the HDR image, whereas we attempt to do this using only a single LDR image. One of the key contributions of this system is formulating the underexposed lighting problem as an iterative solution, which is shown to help other real-world detection applications such as pedestrian detection and trailer coupler detection at nighttime. Our quantitative and qualitative results demonstrate our system's ability to improve detection capability under underexposed light challenges.

Chapter 7 Conclusions and Future Work

7.1 Conclusion

In this thesis, we presented several real-world object detection applications designed to overcome many different detection challenges. The common object detection challenges, such as pose, scale, illumination, partial occlusion, and appearance variations, have been well studied in the literature. When object detection is applied to the real world, additional challenges arise, such as efficiency under limited computational resources, unstable background and foreground, obscurity of objects, and extreme illumination conditions, which are more problematic than the typical visual object detection challenges. When developing a real application, any proposed system should be able to handle these challenges without creating a computational bottleneck; otherwise it will be ignored in practice. For example, if an autonomous vehicle cannot operate at night, then no one will buy the car. Or if a system keeps producing false detections, users will walk away from it.
Looking back at all of the proposed methods in this thesis, we followed three approaches to overcoming real-world detection challenges: (a) solving the challenge prior to the detection problem. This was illustrated in Chapters 5 and 6, where the light in the input images was normalized to guarantee enhanced detection performance. (b) Modifying the detector itself so that it can handle the challenges. This was accomplished in Chapters 3 and 4 when designing the multiplexer-CNN method for detecting the trailer coupler; more specifically, designing the DCNN network to detect couplers using confidence measures, tracking through detection in an efficient manner, and inferring 3D localization of objects from 2D coordinates. (c) Refining the detection results to eliminate mistakes made by the system due to real-world challenges. This was accomplished in Chapter 2, where we proposed a post-refinement component to remove false detections produced by the detector. There is no preferred method among the three approaches; instead, the type of challenge itself governs how to approach the problem.

7.2 Future work

Our proposed work opens up many possibilities for future research and the development of new applications.

• Regarding the trailer coupler detection work in Chapters 3 and 4, we have presented a solid system motivated by real-world needs that provides an example of how successful vision systems are built. Moreover, from a technical perspective, other applications may benefit from our components. For example, the novel confidence loss function transfers to particle tracking, and other autonomous driving problems such as vehicle, curb, or parking-lot detection may benefit from the multiplexer-CNN. Estimating the distance of detected pedestrians can be learned using our proposed method of inferring 3D localization from 2D detections. Moreover, our system was designed to work during daytime, and with the iterative CNN method we can also operate at nighttime. Fusing both works such that the application still operates in real time is important; the iterative CNN method is somewhat slow when processing frames on a CPU, and a solution to this problem is needed.

• Regarding the overexposed illumination challenges in Chapter 5, we proposed the DelightCNN method to recover images affected by bright sunlight. This can be further used in several other autonomous driving scenarios, such as driving toward the sunset or sunrise, or driving into or out of a tunnel; in both scenarios, extreme illumination conditions are observed. Moreover, making DelightCNN operate on single images alone is very challenging, since we are no longer able to temporally pair bad lighting frames with good or semi-good lighting frames. However, we still think this is possible if the illumination does not completely wash out the appearance of the object. Another solution would be to adopt a generative adversarial network (GAN) to complete and synthesize the affected overexposed regions with sunlight patterns, for example using the cycle-GAN framework [94].

• Regarding the underexposed illumination challenges in Chapter 6, we proposed the iterative CNN method to synthesize well-lit images from dark inputs. We experimentally found that, whenever we apply the iterative CNN method to a different application or dataset, the optimal number of iterations also differs.
This number depends on the initial quality and resolution of the input image; higher-quality, higher-resolution inputs can usually withstand more iterations, yielding more HDR-like illumination. Therefore, the optimal number needs to be found experimentally for any new application. Alternatively, this process could be controlled automatically by learning an image quality assessment CNN that produces a score after each iteration, indicating whether we should stop or proceed to another iteration.

• Also regarding the underexposed illumination challenges in Chapter 6, we have already demonstrated the effectiveness of the system in improving nighttime detection performance for the pedestrian and trailer coupler systems. We plan on collecting more testing samples so that the evaluation can be done on a larger scale. Another interesting idea worth exploring is that our iterative CNN model was trained on the MEF dataset to make dark images brighter; would learning the system in the opposite direction help with the overexposed sunlight challenges? Moreover, can a single system learn both objectives? That is, can a single CNN learn how to delight an image regardless of whether it is too dark or too bright, such that it always produces a well-lit image? Having opposite objectives when learning the CNN will most likely confuse the parameters being learned; therefore, a carefully designed architecture will be needed.

BIBLIOGRAPHY

[1] FAO global aquaculture production volume and value statistics database updated to 2012. Technical report, FAO Fisheries and Aquaculture Department, 2014.
[2] P. F. Alcantarilla, L. M. Bergasa, P. Jiménez, M. Sotelo, I. Parra, D. Fernandez, and S. Mayoral. Night time vehicle detection for driving assistance lightbeam controller. In Proc. Intelligent Vehicles Symposium (IV), pages 291–296. IEEE, 2008.
[3] Y. Atoum, M. J. Afridi, X. Liu, J. M. McGrath, and L. E. Hanson. On developing and enhancing plant-level disease rating systems in real fields. Pattern Recognition, 53:287–299, 2016.
[4] Y. Atoum, J. Roth, M. Bliss, W. Zhang, and X. Liu. Monocular video-based trailer coupler detection using multiplexer convolutional neural network. In Proc. IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.
[5] Y. Atoum, S. Srivastava, and X. Liu. Automatic feeding control for dense aquaculture fish tanks. IEEE Signal Processing Letters, 22(8):1089–1093, 2015.
[6] J. Baek, S. Hong, J. Kim, and E. Kim. Efficient pedestrian detection at nighttime using a thermal camera. Sensors, 17(8):1850, 2017.
[7] A. Baghel and V. Jain. Shadow removal using YCbCr and k-means clustering. International Journal of Computer Applications, 134(7):21–26, 2016.
[8] V. N. Boddeti, T. Kanade, and B. Kumar. Correlation filters for object alignment. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2291–2298. IEEE, 2013.
[9] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: improving object detection with one line of code. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 5562–5570. IEEE, 2017.
[10] D. S. Bolme, J. R. Beveridge, B. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2544–2550. IEEE, 2010.
[11] G. Brazil, X. Yin, and X. Liu. Illuminating pedestrians via simultaneous detection and segmentation. In Proc. IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.
[12] C. J. Bridger and R. K. Booth. The effects of biotelemetry transmitter presence and attachment procedures on fish physiology and behavior. Reviews in Fisheries Science, 11(1):13–34, 2003.
[13] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing (IP), 25(11):5187–5198, 2016.
[14] C. Chang, W. Fang, R.-C. Jao, C. Shyu, and I. Liao. Development of an intelligent feeding controller for indoor intensive culturing of eel. Aquacultural Engineering, 32(2):343–353, 2005.
[15] S.-H. Chen and R.-S. Chen. Vision-based distance estimation for multiple vehicles using single optical camera. In Proc. Int. Conf. Innovations in Bio-inspired Computing and Applications (IBICA), pages 9–12. IEEE, 2011.
[16] S. G. Conti, P. Roux, C. Fauvel, B. D. Maurer, and D. A. Demer. Acoustical monitoring of fish density, behavior, and growth rate in a tank. Aquaculture Engineering, 251(2):314–323, 2006.
[17] C. Costa, A. Loy, S. Cataudella, D. Davis, and M. Scardi. Extracting fish size using dual underwater cameras. Aquacultural Engineering, 35(3):218–227, 2006.
[18] R. Cucchiara and M. Piccardi. Vehicle detection under day and night illumination. In IIA/SOCO, 1999.
[19] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
[20] K. Dale, M. K. Johnson, K. Sunkavalli, W. Matusik, and H. Pfister. Image restoration using online photo collections. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 2217–2224. IEEE, 2009.
[21] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: learning continuous convolution operators for visual tracking. In Proc. European Conference on Computer Vision (ECCV), pages 472–488. Springer, 2016.
[22] C. de Saxe and D. Cebon. A visual template-matching method for articulation angle measurement. In Proc. Int. Conf. Intelligent Transportation Systems (ITS), pages 626–631. IEEE, 2015.
[23] J. Diebel and S. Thrun. An application of Markov random fields to range sensing. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 291–298, 2006.
[24] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 304–311. IEEE, 2009.
[25] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(2):295–307, 2016.
[26] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015.
[27] S. Duarte, L. Reig, and J. Oca. Measurement of sole activity by digital image analysis. Aquacultural Engineering, 41(1):22–27, 2009.
[28] G. Eilertsen, J. Kronander, G. Denes, R. Mantiuk, and J. Unger. HDR image reconstruction from a single exposure using deep CNNs. ACM Transactions on Graphics (TOG), 36(6), 2017.
[29] Y. Endo, Y. Kanamori, and J. Mitani. Deep reverse tone mapping. ACM Transactions on Graphics (TOG), 36(6), 2017.
[30] R. Fattal. Single image dehazing. ACM Transactions on Graphics (TOG), 27(3):72, 2008.
[31] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 2241–2248. IEEE, 2010.
[32] C. for Economic Vitality. n.d. Revenue of the U.S. RV park industry from 2009 to 2015. Statista, 2016.
[33] H. K. Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 3072–3079. IEEE, 2013.
[34] R. Girshick. Fast R-CNN. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
[35] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
[36] A. A. Goshtasby. Fusion of multi-exposure images. Image and Vision Computing, 23(6):611–618, 2005.
[37] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang. Learning dynamic guidance for depth image enhancement. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017.
[38] R. Guo, Q. Dai, and D. Hoiem. Paired regions for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(12):2956–2967, 2013.
[39] X. Guo, Y. Li, and H. Ling. LIME: Low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing (IP), 26(2):982–993, 2017.
[40] P. Hanhart and T. Ebrahimi. Evaluation of JPEG XT for high dynamic range cameras. Signal Processing: Image Communication, 50:9–20, 2017.
[41] K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(6):1397–1409, 2013.
[42] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Proc. European Conference on Computer Vision (ECCV), pages 702–715. Springer, 2012.
[43] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 37(3):583–596, 2015.
[44] P. Hu and D. Ramanan. Finding tiny faces. arXiv preprint arXiv:1612.04402, 2016.
[45] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In Proc. European Conference on Computer Vision (ECCV), pages 353–369. Springer, 2016.
[46] X. Jin, Y. Li, N. Liu, X. Li, Q. Zhou, Y. Tian, and S. Ge. Scene relighting using a single reference image through material constrained layer decomposition. In Artificial Intelligence and Robotics (AIR), pages 37–44. Springer, 2018.
[47] X. Jin, Y. Tian, N. Liu, C. Ye, J. Chi, X. Li, and G. Zhao. Object image relighting through patch match warping and color transfer. In Proc. International Conference on Virtual Reality and Visualization (ICVRV), pages 235–241. IEEE, 2016.
[48] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Automatic shadow detection and removal from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(3):431–446, 2016.
[49] J. H. Kim, H. G. Hong, and K. R. Park. Convolutional neural network-based human detection in nighttime images using visible light camera sensors. Sensors, 17(5):1065, 2017.
[50] A. Koesdwiady, S. M. Bedawi, C. Ou, and F. Karray. End-to-end deep learning for driver distraction recognition. In Proc. International Conference Image Analysis and Recognition (ICIAR), pages 11–18. Springer, 2017.
[51] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. In ACM Transactions on Graphics (TOG), volume 26, page 96. ACM, 2007.
[52] B. V. Kumar, A. Mahalanobis, and R. D. Juday. Correlation pattern recognition. Cambridge University Press, 2005.
[53] J.-F. Lalonde, A. A. Efros, and S. G. Narasimhan. Estimating natural illumination from a single outdoor image. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 183–190. IEEE, 2009.
[54] J.-V. Lee, J.-L. Loo, Y.-D. Chuah, P.-Y. Tang, Y.-C. Tan, and W.-J. Goh. The use of vision in a sustainable aquaculture feeding system. Research Journal of Applied Sciences, Engineering and Technology, 6(19):3658–3669, 2013.
[55] B. Li and Z. Shao. Precise trajectory optimization for articulated wheeled vehicles in cluttered environments. Advances in Engineering Software, 92:40–47, 2016.
[56] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. European Conference on Computer Vision (ECCV), pages 154–169. Springer, 2016.
[57] S. Liu and M. N. Do. Inverse rendering and relighting from multiple color plus depth images. IEEE Transactions on Image Processing (IP), 26(10):4951–4961, 2017.
[58] X. Liu, Y. Tong, F. W. Wheeler, and P. H. Tu. Facial contour labeling via congealing. In Proc. European Conference on Computer Vision (ECCV), pages 354–368. Springer, 2010.
[59] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157. IEEE, 1999.
[60] K. Ma, H. Li, H. Yong, Z. Wang, D. Meng, and L. Zhang. Robust multi-exposure image fusion: A structural patch decomposition approach. IEEE Transactions on Image Processing (IP), 26(5):2519–2532, 2017.
[61] K. Ma, K. Zeng, and Z. Wang. Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing (IP), 24(11):3345–3356, 2015.
[62] H. Maeng, S. Liao, D. Kang, S.-W. Lee, and A. K. Jain. Nighttime face recognition at long distance: Cross-distance and cross-spectral matching. In Proc. Asian Conference on Computer Vision (ACCV), pages 708–721. Springer, 2012.
[63] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision (IJCV), 60(2):135–164, 2004.
[64] J. Morales, A. Mandow, J. L. Martínez, and A. J. García-Cerezo. Driver assistance system for backward maneuvers in passive multi-trailer vehicles. In Proc. International Conference Intelligent Robots and Systems (ICIRS), pages 4853–4858. IEEE, 2012.
[65] R. n.d. Number of wholesale shipments of RV in the United States from 2000-2016. Statista, 2016.
[66] J. P. Oakley and H. Bu. Correction of simple contrast loss in color images. IEEE Transactions on Image Processing (IP), 16(2):511–522, 2007.
[67] J. S. Park and N. I. Cho. Generation of high dynamic range illumination from a single image for the enhancement of undesirably illuminated images. arXiv preprint arXiv:1708.00636, 2017.
[68] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proc. British Machine Vision Conference (BMVC), volume 1, page 6, 2015.
[69] K. R. Prabhakar, V. S. Srikar, and R. V. Babu. DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 4724–4732. IEEE, 2017.
[70] L. Qu, J. Tian, S. He, and Y. Tang. DeshadowNet: A multi-context embedding deep network for shadow removal. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017.
[71] M. A. H. Radhi, B. Sabah, and A. M. O. Al-Hsniue. Enhancement of the captured image under different lighting conditions using histogram equalization method. International Journal of Latest Research in Science and Technology, 3(3):25–28, 2014.
[72] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
[73] I. Sato, Y. Sato, and K. Ikeuchi. Illumination from shadows. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25(3):290–300, 2003.
[74] M. Savvides, B. V. Kumar, and P. Khosla. Face verification using correlation filters. In Proc. IEEE Automatic Identification Advanced Technologies (AIAT), pages 56–61, 2002.
[75] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[76] X. Shen, C. Zhou, L. Xu, and J. Jia. Mutual-structure for joint filtering. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 3406–3414, 2015.
[77] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
[78] S. Song and M. Chandraker. Joint SFM and detection cues for monocular 3D localization in road scenes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 3734–3742, 2015.
[79] L. H. Stien, S. Bratland, I. Austevoll, F. Oppedal, and T. S. Kristiansen. A video analysis procedure for assessing vertical fish distribution in aquaculture tanks. Aquacultural Engineering, 37(2):115–124, 2007.
[80] D. Tian, J. W. Mauchly, and J. T. Friel. Real-time automatic scene relighting in video conference sessions, Oct. 7 2014. US Patent 8,854,412.
[81] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 839–846. IEEE, 1998.
[82] O. Tsimhoni, J. Bärgman, and M. J. Flannagan. Pedestrian detection with near and far infrared night vision enhancement. Leukos, 4(2):113–128, 2007.
[83] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proc. ACM International Conference on Multimedia (ICM), pages 689–692. ACM, 2015.
[84] L. Wang, K. Huang, Y. Huang, and T. Tan. Object detection and tracking for night surveillance based on salient contrast analysis. In Proc. IEEE International Conference on Image Processing (ICIP), pages 1113–1116. IEEE, 2009.
[85] Y. Wang and D. Samaras. Estimation of multiple directional light sources for synthesis of augmented reality images. Graphical Models, 65(4):185–205, 2003.
[86] Y. Wang and I. H. Witten. Induction of model trees for predicting continuous classes. In Proc. European Conference on Machine Learning (ECML), 1996.
[87] J. Xu, Y. Liu, S. Cui, and X. Miao. Behavioral responses of tilapia (Oreochromis niloticus) to acute fluctuations in dissolved oxygen levels as monitored by computer vision. Aquacultural Engineering, 35(3):207–217, 2006.
[88] L.-Q. Xu, J.-L. Landabaso, and B. Lei. Segmentation and tracking of multiple moving objects for intelligent video analysis. BT Technology Journal, 22(3):140–150, 2004.
[89] S. Yan, Y. Teng, J. S. Smith, and B. Zhang. Driver behavior recognition based on deep convolutional neural networks. In Proc. IEEE International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pages 636–641. IEEE, 2016.
[90] Q. Yang, R. Yang, J. Davis, and D. Nistér. Spatial-depth super resolution for range images. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2007.
[91] J. J. Yebes, P. F. Alcantarilla, and L. M. Bergasa. Occupant monitoring system for traffic control based on visual categorization. In Proc. IEEE Intelligent Vehicles Symposium (IV), pages 212–217. IEEE, 2011.
[92] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele. Towards reaching human performance in pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), PP(99):1–1, 2017.
[93] C. Zhou, H. Zhang, X. Shen, and J. Jia. Unsupervised learning of stereo matching. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 1567–1575, 2017.
[94] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
[95] B. Zion, V. Alchanatis, V. Ostrovsky, A. Barki, and I. Karplus. Real-time underwater sorting of edible fish species. Computers and Electronics in Agriculture, 56(1):34–45, 2007.
[96] P. Zolliker, Z. Baranczuk, D. Kupper, I. Sprow, and T. Stamm. Creating HDR video content for visual quality assessment using stop-motion. In Proc. IEEE European Signal Processing Conference (EUSIPCO), pages 1–5. IEEE, 2013.