This is to certify that the thesis entitled DROWSY DRIVER DETECTION USING FACE TRACKING ALGORITHMS presented by VENKATA RAVI KIRAN DAYANA has been accepted towards fulfillment of the requirements for the Master of Science degree in Electrical and Computer Engineering, Michigan State University.

DROWSY DRIVER DETECTION USING FACE TRACKING ALGORITHMS

By

Venkata Ravi Kiran Dayana

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical and Computer Engineering

2007

ABSTRACT

DROWSY DRIVER DETECTION USING FACE TRACKING ALGORITHMS

By Venkata Ravi Kiran Dayana

Drowsy driving is a major, though elusive, cause of vehicle crashes. The National Highway Traffic Safety Administration (NHTSA) conservatively estimates that 100,000 police-reported crashes are a direct result of driver fatigue each year in the United States. A detection system that can predict drowsiness and alert the driver by monitoring eye closure (PERCLOS) could reduce the number of fatigue-related crashes. A video sequence containing the driver's face can be analyzed to evaluate whether the eyes are open or closed. Ambient illumination may vary during the video sequence, resulting in changes in perceived skin color. Therefore, a drowsy driver detection system demands robust, yet efficient, algorithms to monitor a driver in real time. In the present thesis, we evaluate different skin detection models to initialize the face of the driver, followed by face tracking algorithms. Template-based and level-set-based face tracking algorithms are implemented and optimized for real-time operation. Results of the eye closure (PERCLOS) estimate are provided on a video database with multiple subjects.

Table of Contents

Chapter 1 Introduction
1.1 Accidents related to Drowsiness
1.2 Drowsiness and Driver Vigilance
1.3 Drowsiness Features
1.3.1 Physical Features
1.4 PERCLOS based detection
1.5 Tracking Face and Eyes
1.6 Outline of Thesis

Chapter 2 Face and Eye Detection
2.1 Color Spaces
2.1.1 RGB color space
2.1.2 HSI color space
2.1.3 YCbCr color space
2.2 Skin Modeling
2.2.1 Explicit skin model
2.2.2 Non-Parametric models
2.2.3 Parametric models
2.2.4 Non-adaptive vs. Adaptive Models
2.3 Illumination Effects
2.3.1 Skin models under varying illumination
2.3.2 Compensation Methods
2.4 Adaptive Skin Segmentation
2.4.1 Motion Detection Filter
2.4.2 Face Detection
2.4.3 Eye Detection

Chapter 3 Face and Eye Tracking
3.1 Template Image based Tracking
3.1.1 Template Features
3.1.2 Template-Matching using Cross-Correlation
3.1.3 Results
3.2 Level Set based Tracking
3.2.1 Level Set Representation of Curves
3.2.2 Level Set Curve Evolution
3.2.3 Level Set method implementation
3.2.4 Results

Chapter 4 PERCLOS feature computation
4.1 Pre-Processing
4.2 Eye Closure Classification
4.3 Results

Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

Appendix A - Connected components algorithm
Appendix B - Fast Level Set Method

Bibliography

List of Figures

Figure 1.1 Effects of drowsiness on reaction time [6]
Figure 1.2 PERCLOS feature using infrared camera [38] (a) Infrared image captured using an 850 nm illumination source (b) Infrared image captured using a 950 nm illumination source (c) Difference image with bright eye pupils
Figure 1.3 Schematic diagram of PERCLOS based detection module
Figure 1.4 Images with two different positions of the face
Figure 1.5 Images captured under different ambient illumination [9]
Figure 2.1 RGB color space representation [42]
Figure 2.2 Chromaticity diagram with normalized r on the x-axis and normalized g on the y-axis. The outer curved boundary is the spectral (or monochromatic) locus with wavelengths shown in nanometers [35]
Figure 2.3 HSI color space representation. Colors on the surface of the solid are fully saturated, i.e. pure colors, and the grey scale spectrum is on the axis of the solid [42]
Figure 2.4 Three images from the VidTIMIT dataset on the left and their corresponding skin segmentation results on the right
Figure 2.5 Improper skin segmentation results on three images from the VidTIMIT dataset
Figure 2.6 Skin probability estimation using a 2D histogram. The right image shows the probability of skin at each pixel. Note the darker regions of eyes and face with less probability
Figure 2.7 2D histogram of skin with 32 x 32 bins on the Hue-Saturation chrominance space. Darker regions indicate higher frequency counts
Figure 2.8 Skin segmentation results on three images from the VidTIMIT dataset using the non-parametric probability distribution model
Figure 2.9 Incorrect skin segmentation results on two images from the VidTIMIT dataset using the non-parametric probability distribution model
Figure 2.10 Images captured from the same subject under different illumination temperatures.
Left image has a correlated color temperature (CCT) of 2600 K and the right image has a CCT of 6200 K [9]
Figure 2.11 Skin segmentation results of explicit model on an image captured at normal illumination of 3800 K [15]
Figure 2.12 Skin segmentation results of explicit model on an image captured at an illumination temperature of 6200 K [15]
Figure 2.13 Skin segmentation results of non-parametric model on an image captured at an illumination temperature of 3800 K [15]
Figure 2.14 Skin segmentation results of non-parametric model on an image captured at an illumination temperature of 6200 K [15]
Figure 2.15 Pixels of uniform surface lie on a parallelogram in RGB color space [39]
Figure 2.16 Schematic diagram of adaptive skin segmentation
Figure 2.17 Motion Filtering and skin region extraction. (a) and (b) Two frames in the video with the eyes open and closed. (c) Output of the motion filter. (d) Extracted skin regions in white boxes. (e) Extracted skin regions for skin modeling
Figure 2.18 Adaptive skin segmentation results of non-parametric model on an image captured at an illumination temperature of 6200 K
Figure 2.19 Skin segmentation and face detection outputs. (a) and (e) Two frames in a video sequence with the eyes open and closed. (b) and (f) Segmented output using the skin model. (c) and (g) Output after applying morphological operation and connected components algorithm. (d) and (h) Outline of the detected face
Figure 2.20 Schematic Diagram of Eye Detection
Figure 2.21 Results of eye detection. (a) Image from a video sequence (b) Motion filter output (c) Face detection output (d) Eye detection output
Figure 3.1 Tracking an eye region in a video sequence
Figure 3.2 Block diagram of a real-time tracking system
Figure 3.3 Template Image based tracking. (a) Eye template (b) Eye region matched in a video frame
Figure 3.4 Block diagram of template-matching algorithm
Figure 3.5 Examples of template eye images enclosed in the white rectangles
Figure 3.6 Eye template based on the pixel intensity
Figure 3.7 Eye template created using the skin probability feature
Figure 3.8 Eye template created using the combined feature
Figure 3.9 Cross-correlation of the dotted template on a larger image. The arrows indicate shifting of the template window
Figure 3.10 Cross-correlation based face tracking.
(a) Template Eye Image (b) Face Image (c) Correlation Matrix (d) Surface plot of correlation matrix. (e) Surface plot of left half of the correlation matrix
Figure 3.11 Cross-correlation based face tracking. (a) Template Eye Image (b) Face Image with eyes closed (c) Correlation Matrix (d) Surface plot of correlation matrix. (e) Surface plot of left half of the correlation matrix
Figure 3.12 Face tracking results with pixel intensity feature. White rectangle shows the tracking of the left eye
Figure 3.13 Face tracking results with pixel intensity feature on quarter resolution frames. White rectangle shows the tracking of the left eye
Figure 3.14 Face tracking results with skin probability feature on quarter resolution frames. White rectangle shows the tracking of the left eye
Figure 3.15 Face tracking results with combined skin feature on quarter resolution frames. White rectangle shows the tracking of the left eye
Figure 3.16 Face tracking results with pixel intensity feature on quarter resolution frames. White rectangle shows the tracking of the left eye. Tracking fails at the last frame
Figure 3.17 Differences between template and level set tracking. (a) Template tracking of the left eye region. (b) Level set tracking of the left eye region
Figure 3.18 Level set representation of the curve x^2 + y^2 = 1. Exterior and interior regions are represented by Ω+ and Ω- respectively. ∂Ω represents the interface φ(x) = 0
Figure 3.19 (a) Implicit function φ(x) of a unit circle. The circle at φ(x) = 0 is highlighted in white. (b) Surface plot of φ(x) in the domain [-2, -2] to [2, 2]
Figure 3.20 (a) Discretized implicit function φ(x) of a unit circle. The circle at φ(x) = 0 is highlighted in white. (b) Surface plot of discretized φ(x) in the domain [-2, -2] to [2, 2]. Only the region around the unit circle is discretized
Figure 3.21 Evolution of curve C under speed F
Figure 3.22 Topology changes in level set methods. (a) Curve with one interior region. (b) Interior region just before splitting (c) Curve with two interior regions. (d)-(g) Level set representation of the curve with an implicit surface and changing plane of interest [43]
Figure 3.23 Level set surface representation using boundary lists
Figure 3.24 Speed field F in level set tracking of eyes. (a) Pixel intensity (b) Speed field F - thresholded pixel intensity
Figure 3.25 Level set tracking results of the left eye pupil
Figure 3.26 Level set tracking results of the left eye pupil (close-up)
Figure 4.1 Block diagram of PERCLOS estimation
Figure 4.2 Preprocessing output of template tracking. (a) Template output shown in white rectangle (b) Eye pupil area highlighted in white
Figure 4.3 Pre-processing stages for generating the eye pupil area from the eye template
Figure 4.4 Pre-processing at various stages (a) Eye template (b) Skin removal (c) Intensity threshold (d) Eye pupil area (e) Eye pupil area overlapped on the eye template
Figure 4.5 Eye pupil area of open and closed eyes. (a) Eye pupil area of an open eye (b) Eye pupil area of a closed eye
Figure 4.6 Plot of height vs. width of the eye pupil region
Figure 4.7 Snapshots of a video sequence with an eye class tag for each frame - 1. The only misclassification is on the frame located at the 2nd row and 4th column
Figure 4.8 Snapshots of a video sequence with an eye class tag for each frame - 2. All the frames were correctly classified
Figure 4.9 Snapshots of a video sequence with an eye class tag for each frame - 3. The only misclassification is on the frame located at the 1st row and 1st column
Figure 4.10 Snapshots of a video sequence with an eye class tag for each frame - 4. All the frames were correctly classified
Figure A.1 A binary image with five connected components of the value 1. (a) Binary image matrix (b) Connected components labeling (c) Binary image (d) Labeled image [41]
Figure B.1 Implicit representation of the curve C1 and the two lists Lin and Lout in the neighborhood of C1 [46]
Figure B.2 Illustration of the motion of the curve C1 by switching pixels between Lin and Lout [46]

Chapter 1 Introduction

Drowsy driving is a class of problems related to operating an automotive vehicle while experiencing sleepiness or fatigue. Most drivers might have experienced drowsy driving at some time or other. According to the National Sleep Foundation's 2005 poll, 60% of adult drivers (about 168 million people) say they have driven a vehicle while feeling drowsy in the past year. Drowsiness slows reaction time, impairs judgment and increases the driver's risk of crashing. The National Highway Traffic Safety Administration (NHTSA) conservatively estimates that 100,000 police-reported crashes are a direct result of driver fatigue each year in the United States. This results in an estimated 1,550 deaths and 71,000 injuries. Often, the cause of a crash is reported as driver inattentiveness, which may be attributed to drowsiness or fatigue. Therefore, it is believed that drowsy driving is hugely underreported in the crash reports [4]. The above factors motivated vehicle manufacturers to focus on the development of vehicle-based drowsy driver detection systems.
The basic idea behind a vehicle-based drowsy driver detection system is to monitor the driver unobtrusively for drowsiness. The detection system may sense both driver-related features (physiological data) and vehicle-related features (driving performance data), and compute relevant measures to predict the onset of drowsiness. After the detection of drowsiness, the detection system can either alert the driver or take appropriate preventive action to avoid a crash. The system must have a very high detection rate and generate very few false alarms. The following sections discuss in detail the need for on-board detection systems, compare the relevant features, and describe the practical issues in implementation.

1.1 Accidents related to Drowsiness

Driver fatigue or drowsiness is estimated by the National Highway Traffic Safety Administration (NHTSA) to cause roughly 3.2% of traffic accidents [20]. Other researchers report that fatigue is responsible for 10% of all traffic accidents [1] and 25% of all single-vehicle accidents [2]. A single-vehicle accident involves only one moving vehicle, including collisions with animals, collisions with fixed objects and swerving off the road. In the United States, fatigue-related crashes cost $12.5 billion per year in property loss and also claim 1,550 lives [20]. Drowsiness plays a much larger role in truck crashes. According to the NHTSA 2003 report "Large Truck Crash Facts" [21], drowsy or fatigued drivers were responsible for 7.8% of the single-vehicle accidents.

Statistics related to overall drowsy driving reveal a widespread problem. About 60% of the adults in the USA say that they have driven a vehicle while drowsy. Overall, 37% of the driving population say they have nodded off or fallen asleep while driving at some time in their life [5]. Approximately one third of these drivers report that they last experienced this problem within the past year alone. To summarize, drowsy driving is a common problem and is responsible for a significant number of vehicle crashes each year.

1.2 Drowsiness and Driver Vigilance

Studies in the previous section show anecdotal evidence for drowsiness causing a significant number of vehicle crashes. A thorough understanding of drowsiness can help us evaluate its effects on a driver's performance. Drowsiness, also referred to as sleepiness, is defined as the need to fall asleep. This process is a result of both the circadian rhythm (the 24-hour cycle in the physiological processes of animals) and the need to sleep. Sleep can be irresistible, and neurologically based sleepiness contributes to human error in a variety of settings; driving is no exception. Drowsiness leads to crashes because it impairs elements of human performance that are critical to safe driving. Relevant impairments identified in laboratory and in-vehicle studies include:

- Slower reaction time: Sleepiness reduces optimum reaction times, as seen in Figure 1.1. Moderately sleepy people can have an increase in their reaction time that will hinder stopping in time to avoid a collision. Even small decrements in reaction time can have a profound effect on crash risk, particularly at high speeds.
- Reduced vigilance: Performance on attention-based tasks declines with sleepiness, including increased periods of non-responsiveness or delayed responding.
- Increased time for information processing: Processing and integrating information takes longer, the accuracy of short-term memory decreases, and performance declines.
Figure 1.1 Effects of drowsiness on reaction time [6]

1.3 Drowsiness Features

Psycho-physiological and performance changes precede the onset of drowsiness. These changes form the basis for drowsiness features. Real-time monitoring of these features enables us to detect or predict the onset of drowsiness associated with the loss of driver alertness. These features can be broadly categorized as:

- Physical changes of drivers, e.g., yawning, closed eyes.
- Physiological changes of drivers as measured by the electroencephalogram (EEG), electrooculogram (EOG), and heart rate variability (HRV).
- Driving performance changes, for example lateral acceleration of the vehicle or steering variability.
- Subsidiary information, e.g., time of day, as drivers are more prone to drowsiness-related crashes during the night [7].

All the above features contain discriminatory information to detect the drowsiness of a driver. Physiological changes have been frequently used to detect drowsiness. Brain waves measured with electroencephalograms (EEG) are especially good indicators of sleepiness and thus of potential lapses in attention [8]. However, monitoring physiological changes requires the drivers to wear special sensors every time they set out to drive. Physical changes of drivers, such as head orientation, can be closely monitored using an on-board camera. Hence, most detection systems use physical changes as the primary feature along with other subsidiary features.

1.3.1 Physical Features

Physical changes are the observable features a driver exhibits before or during the onset of drowsiness. These features capture changes in the eyes and face of a person due to drowsiness. Multiple eye-based features [9] like blink frequency, eye gaze direction and PERCLOS (PERcentage eyelid CLOSure) have discriminatory information for detecting drowsiness. Overall, the PERCLOS feature was identified as the most reliable feature for measuring driver alertness by the U.S. Federal Highway Administration [10]. The PERCLOS feature measures the percentage of time in a minute when the eyelids are at least 80% closed. Various other authors also refer to PERCLOS as a standard measure for drowsiness detection [22]. PERCLOS features can be computed in real time using image-processing techniques on the video data acquired from an on-board camera.

The most common technique to compute the PERCLOS feature was developed by researchers using an infrared camera [36, 37, 38]. Human eyes reflect infrared radiation centered on a wavelength of 850 nm [36, 38]. Two consecutive images are acquired at 850 nm and 950 nm wavelengths, and their difference highlights the bright pupil region. Figure 1.2 shows the acquired infrared images and their image difference [38]. A thresholding operation on the difference image is sufficient to decide whether the eyes are open or closed. The main drawback of this method is the need for two infrared illumination sources at 850 nm and 950 nm wavelengths. During the day, the light from the sun contains infrared radiation at various wavelengths. This radiation prevents the acquisition of infrared images at the precise wavelengths of 850 nm and 950 nm. Overall, the infrared camera based PERCLOS feature performs well only during the night.

Figure 1.2 PERCLOS feature using infrared camera [38] (a) Infrared image captured using an 850 nm illumination source (b) Infrared image captured using a 950 nm illumination source (c) Difference image with bright eye pupils.
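As a rough illustration of the bright-pupil technique just described, the sketch below thresholds the difference of two infrared frames and decides whether the eyes are open. It is a minimal example under stated assumptions: the two frames are already registered grayscale arrays, and the threshold and minimum pupil area are illustrative values, not parameters from the thesis or the cited systems.

```python
import numpy as np

def eyes_open_from_ir(frame_850, frame_950, diff_thresh=40, min_pupil_pixels=15):
    """Decide whether the eyes are open from a pair of infrared frames.

    frame_850 and frame_950 are grayscale images (uint8 arrays) captured under
    850 nm and 950 nm illumination; open eyes appear as bright blobs in the
    difference image because the retina reflects the 850 nm light.
    """
    diff = frame_850.astype(np.int16) - frame_950.astype(np.int16)
    bright = diff > diff_thresh  # candidate pupil pixels
    # If enough bright pixels survive the threshold, treat the eyes as open.
    return int(np.count_nonzero(bright)) >= min_pupil_pixels

# Synthetic example: a dark scene with two bright pupils in the 850 nm frame.
f950 = np.full((120, 160), 30, dtype=np.uint8)
f850 = f950.copy()
f850[60:64, 50:54] = 200    # left pupil
f850[60:64, 100:104] = 200  # right pupil
print(eyes_open_from_ir(f850, f950))  # True
```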
An ordinary digital camera can capture color images during the day. This camera can be combined with the infrared camera to compute the PERCLOS feature during both day and night. But the image processing techniques to compute the PERCLOS feature on normal color images are not straightforward. Video data must be processed to detect the face and eyes of the driver in various conditions. The image processing algorithms must be efficient enough to track the eyes in real time. Researchers have attempted to compute the PERCLOS feature using an ordinary camera [40], without addressing the variations in illumination and skin color. The following sections describe an overall PERCLOS detection system using color images and the problem of continuously tracking the face and eyes through multiple variations.

1.4 PERCLOS based detection

The schematic diagram of a drowsiness detection system is shown in Figure 1.3. An on-board camera captures the driver's face at a high resolution to identify whether his or her eyes are closed. The images are processed to compensate for different variables such as illumination conditions, skin color and pixel noise. The face and eyes of the driver are continuously tracked. The tracked eye regions are classified at each frame into "open" and "closed" classes. Using training data generated under known controlled conditions, reliable PERCLOS features can be computed. PERCLOS features can be coupled with other subsidiary features to detect drowsiness in the driver. The system may then take remedial action or alert the driver of impending danger.

Figure 1.3 Schematic diagram of PERCLOS based detection module.

1.5 Tracking Face and Eyes

Tracking an object in a dynamic environment is a challenging problem [11]. In the PERCLOS based drowsy driver detection system, numerous variables can pose difficulties to the tracking algorithms. The driver's face can assume multiple positions as shown in Figure 1.4 (Images are presented in color). The color of the ambient illumination can cause color variations in the captured images as shown in Figure 1.5 (Images are presented in color). The above variations must be addressed by the face and eye tracking algorithms. The objective of this thesis can be described as: "Track the face and eyes of a person using efficient image-processing algorithms and compute a reliable PERCLOS measure." Though the current objective of face tracking is to compute PERCLOS features, there are other applications for tracking such as face recognition and gesture recognition [12].

Figure 1.4 Images with two different positions of the face

Figure 1.5 Images captured under different ambient illumination [9].

1.6 Outline of Thesis

The remainder of the thesis discusses various tracking algorithms and the features used in detecting the face regions. Chapter 2 discusses skin color detection methods for segmenting the face region. Chapter 3 presents a literature review of a variety of face tracking algorithms that were implemented and evaluated to monitor the face and eyes of the driver. Chapter 4 describes a pattern recognition classifier that differentiates between an open and a closed eye, and documents the results of combining all the above modules. Conclusions and future work are presented in Chapter 5.

Chapter 2 Face and Eye Detection

Skin color is an important feature for face detection and tracking.
Color is robust to geometric variations and allows fast pixel-based processing. Also, human skin has a characteristic color and can be easily recognized by humans. Skin color based methods can be broadly classified as pixel-based and region-based methods. Pixel-based methods categorize each pixel as skin or non-skin individually. In contrast, region-based methods analyze a group of pixels and find the spatial relation between them. Both methods require computation of skin color features based on a color space (RGB, HSI, etc.). There is no clear consensus among researchers [13] about the superiority of one color space over another. The subsequent face detection can be accomplished using parametric techniques, such as modeling skin color with a Gaussian distribution, or non-parametric techniques, such as a color histogram or a Bayes classifier.

2.1 Color Spaces

Color spaces are mathematical representations of colors by 3-tuples of numbers. The most common color space is tri-chromatic RGB (Red-Green-Blue). The other frequently used color spaces are the subtractive CMY (Cyan-Magenta-Yellow) and the HSI (Hue-Saturation-Intensity) color spaces.

2.1.1 RGB color space

The RGB color space representation expresses any color as a combination of the primary color components (Red, Green, and Blue).

Figure 2.1 RGB color space representation [42]

Figure 2.2 Chromaticity diagram with normalized r on the x-axis and normalized g on the y-axis. The outer curved boundary is the spectral (or monochromatic) locus with wavelengths shown in nanometers [35]

It is one of the most widely used color spaces for storing and processing digital image data. Cathode Ray Tube (CRT) and LCD (Liquid Crystal Display) screens utilize this representation for displaying images. The advantages of this representation are its additive properties and simplicity. However, luminance (grey intensity) and chrominance information are not independent, and there is high correlation between the three channels. Figure 2.1 shows the geometry of the RGB color model for specifying colors using a Cartesian coordinate system.

The face is a highly curved surface. Therefore, the observed intensity values (R, G, and B) exhibit strong variations. These variations may be minimized by normalizing each component with the overall intensity. This gives an intensity-normalized color vector with two components (r, g) as shown in equation 2.1:

r = R / (R + G + B),  g = G / (R + G + B)  (2.1)

The two dimensional plot of the r and g components creates a chromaticity diagram depicting the colors corresponding to each point. Figure 2.2 shows the location of different colors on the rg chromaticity diagram [35].

2.1.2 HSI color space

The HSI color space expresses any color in terms of easily understandable properties: Hue, Saturation and Intensity. Hue defines the dominant color (such as red or yellow) of an area. Saturation is the amount of colorfulness of an area in proportion to its brightness. The intensity value is related to the overall luminance of an area. Due to the separate encoding of the chromaticity components, Hue and Saturation, from the intensity I, the HSI color
The HSI triangle representing a constant intensity with different colors is formed by taking a horizontal slice through the HSI solid. Hue is measured as an angle starting from the red comer. Saturation is given by the distance from the central axis. White Blue Green Intensity Black Figure 2.3 HSI color space representation. Colors on the surface of the solid are fully saturated, i.e. pure colors, and the grey scale spectrum is on the axis of the solid. [42] Conversion between HSI and RGB can be performed using the below equations 2.2- 2.4 %((R-G)+(R-B)) H = arccos r (2.2) Jet-G)2 +(R -B>(G—B» s =1—3 mi“(R’G’B) (2.3) R+G+B I=%(R+G+B) (2.4) 2.1.3 YCer color space YCer is an encoded RGB representation for color television. Unlike RGB color model, there is clear distinction between luminance and chrominance components. The luminance (Y) component contains all the information required for black and white television, and captures our perception of the relative brightness of colors. Humans perceive green as much lighter than red, and red, lighter than blue. These relative intensities are encoded by their respective weights of 0.587, 0.299 and 0.114 in the RGB conversion equation 2.5. The two Chromaticity components indicate the red (Cr) and blue (Cb) components. Y =O.299R+ 0.587G +0.144B Cr: R—Y (2.5) Cb=B—Y 14 2.2 Skin Modeling Skin modeling is the development of a mathematical representation for the quantitative description and detection of skin regions in an image. The goal of Skin detection is to build a decision rule that will discriminate between skin and non-skin pixels. This goal is usually accomplished by introducing a distance metric, which measures the distance between the pixel color and Skin tone. Distance metric and the choice of color Space are defined by the skin color model. A pixel-based skin classifier can be built in one of the three ways — - Explicitly defined Skin model 0 Non-parametric Skin distribution model 0 Parametric skin distribution model 2.2.1 Explicit skin model This model segments a color space in to skin and non-skin clusters. The segmented Skin cluster is defined explicitly using a number of straightforward rules. Essentially, an explicit Skin model is a rule-base classifier with fixed thresholds. For example an explicit RGB Skin model [23] can be characterized by the below rules. A pixel (R,G,B) is classified as skin if all rules are satisfied: 0 R>95andG>40andB>20 15 o max{R,G,B} —— min{R,G,B} > 15 and o lR-G|>15ANDR>GANDR>B The simplicity of the detection rules leads to the construction of a fast classifier. In order to evaluate the explicit model, we need a large dataset containing faces. VidTIMIT dataset [24] comprises videos of 43 volunteers (19 female and 24 male), reciting small sentences. This dataset was recorded in 3 sessions, with an average delay of 7 days between Session 1 and Session 2, and 6 days between Session 2 and Session 3. The delay between sessions allows for changes in hair style and clothing. All the volunteers recited ten sentences with an average duration of 4.25 seconds per sentence. In addition, each person performed a head rotation sequence. The sequence consists of a person moving his / her head to the left, right, up and down. All the videos are recorded at a resolution of 384 x 512 pixels (rows x columns) and 25 frames per second. For evaluating the explicit RGB skin model, images are extracted from the videos in VidTIMIT dataset. 
The performance of the explicit RGB skin model can be seen in the results shown in Figure 2.4 and Figure 2.5. The skin regions in most of the images are properly segmented, as shown in Figure 2.4. But in a few images where the hair or clothing is similar to skin colors (e.g. brown, yellow and orange), the segmentation results were erroneous, as seen in Figure 2.5.

Figure 2.4 Three images from the VidTIMIT dataset on the left and their corresponding skin segmentation results on the right.

Figure 2.5 Improper skin segmentation results on three images from the VidTIMIT dataset.

The quality of skin segmentation using the explicit skin model is evaluated manually. A decision is made as to whether the segmented skin regions contain more than 80% of the skin pixels and less than 20% of the non-skin pixels. Using this decision metric, the explicit RGB skin model segmented skin from 63% of the faces in the VidTIMIT dataset. Unlike other models, there is no training involved and the decision rules are created empirically. The explicit skin model captures a rough estimate of the skin cluster in a particular color space. It is a useful model when skin segmentation has to be done without any training examples or prior information.

2.2.2 Non-Parametric models

The main idea underlying non-parametric skin modeling is to estimate the skin color distribution from the training data. The color space is partitioned into discrete regions and a skin probability value is assigned to each region. The most common way to estimate the discrete probability distribution is by generating histograms of skin and non-skin regions [25, 26]. The color space is quantized into a number of bins, each corresponding to a particular range of color components. For example, if the choice of color space is HSI, then a two dimensional histogram of Hue and Saturation is generated. Each bin of the 2D histogram stores the number of times that particular color component occurred as skin in the training images. After accumulating the histogram bins from all the training images, the histogram counts are normalized, converting the histogram values into a discrete probability distribution. Therefore, the probability estimate of a color belonging to skin can be generated using equation 2.6:

P(c | skin) = skin[c] / Norm  (2.6)

where skin[c] gives the value of the histogram bin corresponding to the color vector c, and Norm is the normalization coefficient. These normalized bins represent the likelihood of a particular color c corresponding to skin. The estimate of the probability density is continuous and has more information than the discrete results of the explicit model. Figure 2.6 shows a face image and its corresponding skin probability estimate.

Figure 2.6 Skin probability estimation using a 2D histogram. The right image shows the probability of skin at each pixel. Note the darker regions of eyes and face with less probability.

The two dimensional histograms of skin and non-skin pixel colors make it possible to apply the Bayes rule. The Bayes rule for computing the probability of skin given an observed pixel color c is:

P(skin | c) = P(c | skin) P(skin) / [ P(c | skin) P(skin) + P(c | non-skin) P(non-skin) ]  (2.7)

Both the conditional probabilities P(c | skin) and P(c | non-skin) can be estimated directly using the skin and non-skin histograms. The prior probabilities can be estimated from the overall number of skin and non-skin pixels expected in the image. The Bayes rule is then applied to the test data, provided the scene illumination remains the same as in the training images.
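As a sketch of the histogram model just described, the snippet below builds a 32 x 32 Hue-Saturation histogram from training pixels and applies the Bayes rule of equation 2.7 to score a test color. The bin count matches Figure 2.7, but the prior value, the helper names, and the treatment of empty bins are illustrative assumptions rather than the exact implementation used in the thesis.

```python
import numpy as np

BINS = 32  # 32 x 32 Hue-Saturation histogram, as in Figure 2.7

def hs_histogram(hue, sat):
    """Accumulate a normalized 2D histogram over Hue in [0, 360) and Saturation in [0, 1]."""
    h_idx = np.clip((hue / 360.0 * BINS).astype(int), 0, BINS - 1)
    s_idx = np.clip((sat * BINS).astype(int), 0, BINS - 1)
    hist = np.zeros((BINS, BINS))
    np.add.at(hist, (h_idx, s_idx), 1)
    return hist / hist.sum()  # discrete distribution P(c | class)

def p_skin_given_c(hue, sat, skin_hist, nonskin_hist, p_skin=0.3):
    """Bayes rule (equation 2.7) for a single (hue, saturation) observation."""
    h = min(int(hue / 360.0 * BINS), BINS - 1)
    s = min(int(sat * BINS), BINS - 1)
    num = skin_hist[h, s] * p_skin
    den = num + nonskin_hist[h, s] * (1.0 - p_skin)
    return num / den if den > 0 else 0.0

# Toy training data: skin tones near hue 20 degrees, background near hue 220 degrees.
skin_h = np.random.normal(20, 5, 1000) % 360
skin_s = np.clip(np.random.normal(0.5, 0.1, 1000), 0, 1)
bg_h = np.random.normal(220, 30, 1000) % 360
bg_s = np.clip(np.random.normal(0.4, 0.2, 1000), 0, 1)
skin_hist, nonskin_hist = hs_histogram(skin_h, skin_s), hs_histogram(bg_h, bg_s)
print(p_skin_given_c(22, 0.5, skin_hist, nonskin_hist))  # close to 1 for a skin-like color
```

A pixel would then be labeled skin when this posterior exceeds the decision threshold, as discussed next.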
After the estimation of the probability distribution, the skin segmentation is straightforward using the maximum a posteriori probability (MAP) rule:

pixel class = argmax_{x_i in {skin, non-skin}} P(x_i | c)  (2.8)

A pixel in a test image is classified as skin if the probability P(skin | c) crosses a threshold θ (generally 0.5). The advantages of this model over an explicit model come from the quantization of the color space with histogram bins. Quantization increases the generalization ability of the model. In addition, the absence of luminance information in the model makes it robust to illumination changes. The important parameters in this model are the histogram bin size and the threshold θ. The value of the threshold θ is set at 0.5 using the MAP rule, and the histogram bin size is determined using a 10-fold cross validation method [44].

The color component c can be represented in any of the chrominance pairs, such as normalized r and g, Hue and Saturation, or red and blue chromaticity. The effectiveness of a skin detection algorithm depends on the chrominance space in which the skin color is modeled. Recent studies have compared the performance of different color spaces for human skin detection [13]. These studies support the Hue-Saturation chrominance space as exhibiting the smallest overlap between the skin and non-skin distributions. The skin probability distribution generated using 12 images from the VidTIMIT data set is shown in Figure 2.7. The skin regions are manually selected from these training images.

Figure 2.7 2D histogram of skin with 32 x 32 bins on the Hue-Saturation chrominance space. Darker regions indicate higher frequency counts.

Using the above histogram, the test images from the VidTIMIT data are segmented. The performance of the non-parametric probability distribution method is shown in Figure 2.8 and Figure 2.9. Overall, 81% of the images in VidTIMIT are correctly segmented.

Figure 2.8 Skin segmentation results on three images from the VidTIMIT dataset using the non-parametric probability distribution model

Figure 2.9 Incorrect skin segmentation results on two images from the VidTIMIT dataset using the non-parametric probability distribution model

The skin regions in most of the images are accurately segmented, as shown in Figure 2.8. But a few images showed inaccurate segmentation of hair and face regions, as seen in Figure 2.9. False segmentation of hair is due to the size of the histogram bin. A small histogram bin size requires a large training set to estimate the skin probability, whereas a large histogram bin size falsely segments neighboring colors as skin. Missing skin segmentation of the face is mostly due to gray tones. The Hue estimate in the HSI model is inaccurate for gray tones, leading to an overlap of skin and non-skin regions. In general, some false calls must be expected, due to the non-separability of skin from non-skin regions in color space.

2.2.3 Parametric models

The probability distribution of skin color can be estimated in a more compact form using parameters like the mean and variance of the skin color. An elliptical Gaussian probability distribution of the skin color vector can be defined as in equation 2.9:

p(c | skin) = 1 / (2π |Σ_s|^(1/2)) · exp( -(1/2) (c - μ_s)^T Σ_s^(-1) (c - μ_s) )  (2.9)

where c is a color vector, and μ_s and Σ_s are respectively the mean vector and covariance matrix of the distribution. These model parameters can be estimated by equation 2.10:
’l‘sXCj—llslT J: where n is the number of skin color training pixels. The probability measure gives an estimate of how close a given color is to the training skin samples or the mean skin color vector ,us. The skin segmentation decision can be made if the probability of a color vector exceeds a preset threshold. Parametric estimation of skin probability can generalize and interpolate the training data better than what is possible in non-parametric estimation. Also, the skin model is more compact with just ,uS and ZS describing the entire model compared to the 32x32 matrix of non-parametric model. But, the choice of color space (RGB, HSI) plays a crucial role in Gaussian estimation. The elliptic Gaussian assumption may not hold in different color spaces. If the actual Skin color probability distribution has multiple peaks, then a mixture of Gaussians must be used [27]. In general, for skin color estimation either an explicit RGB skin model or non-parametric estimation models provide adequate results. 2.2.4 Non-adaptive Vs Adaptive Models Skin models can be either non-adaptive or adaptive. A non-adaptive model uses skin samples from training data to generate a static set of rules. If the test skin tone is representative of the training data, then a static model exhibits good performance. But, in a dynamic environment like a moving car, the scene illumination affects the skin 25 Chromaticity [19]. The above drawback can be corrected using an adaptive skin model. In an adaptive skin model, skin tone from the current illumination is used to train the model. Any skin region from the current image can be used for training the skin model. For example in a video sequence containing a face, the skin model may be initialized using the skin tone of the regions around the eyes [18]. 2.3 Illumination Effects The skin color captured by a camera depends on the skin pigmentation, brightness and the color of illumination. The same skin area can appear as two different colors under different illuminations. This is a problem commonly referred as color constancy. Human color perception system ensures the color of objects remain relatively constant under varying illumination conditions. But, the computer vision algorithms capture two distinct hues in different illumination conditions as seen in Figure 2.10. The illumination effects on the skin color have to be compensated for, in order to achieve reliable skin color segmentation. Figure 2.10 Images captured from the same subject under different illumination temperatures. Left image has a correlated color temperature (CCT) of 2600 K and the right image has a CCT of 6200 K [9] 26 2.3.1 Skin models under varying illumination The skin models designed in Section 2.2 fail to segment the skin regions under varying illumination temperatures. The explicit model segments all the skin regions under normal illumination conditions in Figure 2.11 (Images are presented in color). But, the skin regions are not segmented when the illumination is changed to a temperature of 6200 K as seen in Figure 2.12 (Images are presented in color). Figure 2.11 Skin segmentation results of explicit model on an image captured at nomal illumination of 3800 K. [15] 27 Figure 2.12 Skin segmentation results of explicit model on an image captured at an illumination temperature of 6200 K. 
Figure 2.13 Skin segmentation results of non-parametric model on an image captured at an illumination temperature of 3800 K [15]

Figure 2.14 Skin segmentation results of non-parametric model on an image captured at an illumination temperature of 6200 K [15]

The segmentation results with the non-parametric distribution model are also similar. All the skin regions are segmented, along with some background, under normal illumination, as seen in Figure 2.13 (Images are presented in color). But the skin regions are not segmented correctly when the illumination conditions are changed, as seen in Figure 2.14 (Images are presented in color).

2.3.2 Compensation Methods

Illumination effects can be partially compensated by either of the following adaptive methods:

- Estimate the illumination vector from the image and remove its effect on the skin chromaticity.
- Train the skin model adaptively using only the skin regions under the current illumination.

Estimating the illuminant color to segment skin is a promising method. Recently, many researchers have reported some success using this method. A dichromatic reflection model is used to understand the reflected light due to the skin pigment and the illuminant color. The chromaticity captured by a color camera is due to the superposition of body reflections and surface reflections. The body reflections show chromaticity due to skin pigmentation, and the surface reflections show the illumination chromaticity. Using the dichromatic model, the illuminant color can be estimated from the human skin color as described by Storring et al. [16].

A dichromatic reflection model describes the reflected light from dielectric objects like skin as an additive mixture of light Ls reflected from the material's surface and light Lb reflected from the material's body. The observed color CL at a pixel can be written as a linear combination of the interface color Ci and the body reflection color Cb of the material:

CL = mi Ci + mb Cb  (2.11)

On a uniformly colored surface like skin, the colors Ci and Cb of the interface and body do not change. But the linear combination of both colors results in a cluster in the color space, as shown in Figure 2.15 [39].

Figure 2.15 Pixels of uniform surface lie on a parallelogram in RGB color space [39]

Using the pixels belonging to the face region, both Ci and Cb can be estimated from the color cluster. The illuminant color is simply the interface reflection color Ci. Though this method gives an accurate illuminant color, it is computationally intensive. Another commonly used method is to adaptively train the skin model with the current illumination information. The mean vector of the Gaussian skin model can be modified adaptively to track the face in varying illuminations [17]. The same technique can also be applied to train the non-parametric skin model.

2.4 Adaptive Skin Segmentation

Adaptive skin segmentation combines the various modules that are discussed in the above sections. The schematic diagram of an adaptive skin segmentation algorithm is shown in Figure 2.16. The previous few frames (3-4 seconds) are passed through a motion detection filter. The output of this filter masks out regions that are slowly varying. An eye blink is commonly detected by the motion detection filter. Assuming that the background does not change and the face stays predominantly in the center of the frame, potential skin regions are located. These regions are used to update the probability estimate of the non-parametric model.
The updated skin model segments the skin regions from the latest video frame. The largest connected region is determined using a connected components algorithm. The output is the boundary of the largest connected region, outlining the face of the person.

Figure 2.16 Schematic diagram of adaptive skin segmentation

2.4.1 Motion Detection Filter

There are different methods for identifying regions of motion and segmenting moving objects in image sequences: image differencing, adaptive background subtraction and optical flow. The simplest approach is image differencing. In this method, corresponding pixel locations in two images are compared, and locations where significant changes occur are marked as regions of motion. The algorithm to detect changes between two images is described below:

Input: I_t[r,c] and I_{t-Δ}[r,c], two monochrome input images taken Δ seconds apart. Input θ is an intensity threshold. I_out[r,c] is a binary output image; B is a set of bounding boxes.

1. For all pixels [r,c] in the input images, set I_out[r,c] = 1 if |I_t[r,c] - I_{t-Δ}[r,c]| > θ, and set I_out[r,c] = 0 otherwise.
2. Perform a closing of I_out using a small disk to fuse neighboring regions.
3. Perform connected components extraction on I_out.
4. Remove small regions, assuming they are noise.
5. Compute the bounding boxes of all remaining regions of changed pixels.
6. Return I_out[r,c] and the bounding boxes B of the regions of changed pixels.

The above described motion filter is applied to the data from a video clip with the subject closing his eyelids. This video is used with permission from the "Face Video Database of the Max Planck Institute for Biological Cybernetics", Tuebingen, Germany. Two frames are shown in Figures 2.17(a) and (b) with the eyes open and closed. Image differencing is well suited for detecting blinking, because humans have a mean blink duration of 51.9 ms [28]. The rapid change in the eyelids creates motion in successive frames. The output of the motion filter is shown in Figure 2.17(c). The output here contains only the eyelid regions, as these are the only significant changes between these two frames. Based on the location of the eye regions, four skin regions around the eyes, indicated in white boxes, are extracted for skin modeling, as seen in Figure 2.17(d). Figure 2.17(e) shows the extracted skin regions. The adaptive skin model is trained on these extracted skin samples and used for detecting the face region in the future video frames.

Figure 2.17 Motion Filtering and skin region extraction. (a) and (b) Two frames in the video with the eyes open and closed. (c) Output of the motion filter. (d) Extracted skin regions in white boxes. (e) Extracted skin regions for skin modeling.

Figure 2.18 Adaptive skin segmentation results of non-parametric model on an image captured at an illumination temperature of 6200 K.

Adaptive skin segmentation in varying lighting conditions using the non-parametric histogram model is as shown in Figure 2.18 (Images are presented in color).
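To make the adaptive pipeline of Figure 2.16 concrete, the sketch below chains a simple image-differencing motion filter to a running Hue-Saturation histogram update. It is illustrative only: the threshold, structuring element and minimum region size are assumed values, the morphological and connected-components steps use SciPy rather than the thesis code, and for brevity the motion mask itself is used as the skin sample mask, whereas the thesis extracts boxes around the detected eye regions.

```python
import numpy as np
from scipy import ndimage

def motion_mask(frame_t, frame_prev, theta=25, min_area=30):
    """Image-differencing motion filter (steps 1-4 of the algorithm above)."""
    diff = np.abs(frame_t.astype(np.int16) - frame_prev.astype(np.int16)) > theta
    closed = ndimage.binary_closing(diff, structure=np.ones((5, 5)))
    labels, n = ndimage.label(closed)
    sizes = ndimage.sum(closed, labels, range(1, n + 1))
    keep_labels = np.nonzero(sizes >= min_area)[0] + 1
    return np.isin(labels, keep_labels)   # changed regions large enough to keep

def update_skin_histogram(hist, hue, sat, mask, bins=32):
    """Accumulate Hue-Saturation counts from the masked (assumed skin) pixels
    into hist (in place) and return the normalized skin distribution."""
    h_idx = np.clip((hue[mask] / 360.0 * bins).astype(int), 0, bins - 1)
    s_idx = np.clip((sat[mask] * bins).astype(int), 0, bins - 1)
    np.add.at(hist, (h_idx, s_idx), 1)
    return hist / max(hist.sum(), 1)
```

The normalized histogram returned here plays the role of the updated P(c | skin) estimate used to segment the latest frame.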
This operation fuses neighboring skin regions. After the fusing operation, small regions are removed by applying connected components algorithm elaborated in the Appendix A. Face detection results are shown in Figures 2.19 (a-h). 35 Figure 2.19 Skin segmentation and face detection outputs. (a) and (e) Two frames in a video sequence with the eyes open and closed. (b) and (f) Segmented output using the skin model. (c) and (g) Output after applying morphological operation and connected components algorithm. (d) and (h) Outline of the detected face 36 Figures 2.19 (d) and (h) are the final outputs of the face detection. They contain the boundary of the face region. The next step is to detect the eyes inside the boundary of the face. 2.4.3 Eye Detection The problem of detecting the eyes from a video sequence can be described as eye detection. Eye detection is necessary, as computing the PERCLOS features calls for continuous tracking of the eyes of the driver. Eyes may be detected using a combination of the motion detection filter output and the face detection output. Motion detection filter provides a very good segmentation of the eye regions as seen in Figure 2.17 (c), when the subject blinks. As the average human blink rate is approximately 12 blinks per minute [14], the motion detection filter can initialize the eye regions fairly often. The output of the face detection can be used to limit motion processing to the face region. The schematic diagram of the eye detection method is shown in Figure 2.20. 37 Face Detection -————1 Eye Detection Video I Sequence Motion Filtering Figure 2.20 Schematic Diagram of Eye Detection Figure 2.21 Results of eye detection. (a) Image from a video sequence (b) Motion filter output (c) Face detection output (d) Eye detection output 38 Figure 2.21 shows the results of eye detection algorithm. Output of motion detection filter is shown in Figure 2.21 (b). Face detection and eye detection outputs are shown in Figure 2.21 (c) and ((1) respectively. Excessive motion of the face due to turning or bending may cause the algorithm to fail. Motion detection filter might detect multiple regions on the face as potential eye regions. This problem can be partially solved using information such as the location and size of the eyes. After successful detection of face and eyes, the next step is to track them both continuously. Tracking the face and eyes of a person is elaborated next in Chapter 3. 39 Chapter 3 Face and Eye Tracking “Tracking an object” is the ability to follow a specific region or object in a video. Figure 3.1 shows an example of tracking an eye. Face tracking algorithms track the movements of the face or eyes in a video sequence. These algorithms can serve as a pre-processing stage for other facial image analysis tasks such as PERCLOS estimation, face recognition, gesture recognition and gaze tracking. Ideally, a face detection algorithm can be repeatedly applied on each frame of a video to track the face. But, a face detection algorithm is computationally intensive and cannot keep track of the face in real-time. On the contrary, face tracking algorithms use temporal correlation to track a face in real- time. Figure 3.1 Tracking an eye region in a video sequence. Face tracking algorithms must track a face through multiple sources of variations such as noise, scene illumination, occlusion effects etc. If the algorithm loses track of the face, 40 there must be an initialization process that allows the algorithm to restart the tracking. 
After successful detection of the face and eyes, the next step is to track both of them continuously. Tracking the face and eyes of a person is elaborated next in Chapter 3.

Chapter 3 Face and Eye Tracking

"Tracking an object" is the ability to follow a specific region or object in a video. Figure 3.1 shows an example of tracking an eye. Face tracking algorithms track the movements of the face or eyes in a video sequence. These algorithms can serve as a pre-processing stage for other facial image analysis tasks such as PERCLOS estimation, face recognition, gesture recognition and gaze tracking. Ideally, a face detection algorithm could be applied repeatedly to each frame of a video to track the face, but a face detection algorithm is computationally intensive and cannot keep track of the face in real time. In contrast, face tracking algorithms use temporal correlation to track a face in real time.

Figure 3.1 Tracking an eye region in a video sequence.

Face tracking algorithms must track a face through multiple sources of variation such as noise, scene illumination and occlusion effects. If the algorithm loses track of the face, there must be an initialization process that allows the algorithm to restart the tracking. Therefore, the block diagram of a continuous tracking system contains both the detection and tracking modules, as shown in Figure 3.2.

Figure 3.2 Block diagram of a real-time tracking system (detection hands over to tracking once the face is detected; tracking hands back to detection when tracking fails)

Tracking can be performed using either templates or motion trajectories. Commonly used methods for tracking an object are:
- Template image based tracking
- Level-set based tracking

3.1 Template Image based Tracking

A template image which is representative of the object of interest is first generated by the detection algorithm. This template is matched against future video frames to estimate the location of the object. For example, a template image of an eye region can be matched with the subsequent video frames, as shown in Figure 3.3.

Figure 3.3 Template image based tracking. (a) Eye template (b) Eye region matched in a video frame

The template matching method is based on the following assumptions:
- the object in the template is either stationary or moves slowly compared to the video acquisition rate;
- the objects in the scene are rigid and cannot change their three-dimensional orientation;
- the object in the template does not repeat in the scene, since repeated objects result in multiple template matches.

Depending on the problem, additional assumptions can be introduced. All of the above assumptions are satisfied to some extent by face image sequences. Movement of the face or eyes is common, but it is slow compared to the video frame rate of 30 frames per second. The face is bilaterally symmetric, causing the eye region to repeat; to overcome this problem, the search area must be restricted to one half of the face region.

The block diagram of the template-matching algorithm is shown in Figure 3.4. The detection stage makes use of the skin detection model to isolate either the eyes or the face, as discussed in Chapter 2. A template of the required region (eyes or face) is then extracted from the detection results, as shown in Figure 3.5. The template-matching algorithm finds the location in the image data which best matches the template image. If the matching location is successfully found, then template matching is repeated for successive images. Otherwise, the tracking algorithm restarts with a new template image. A skeleton of this detect-and-track loop is sketched below.

Figure 3.4 Block diagram of the template-matching algorithm (detection process extracts a template image; the template-matching algorithm updates the location or triggers re-detection)

Figure 3.5 Examples of template eye images enclosed in the white rectangles

In general, any template-matching algorithm can be characterized by the following stages:
- choosing a template region in the detection process;
- template features for generating a template image;
- a similarity or distance measure for comparing a template with the image.

The first stage of choosing a template region was discussed in detail in the previous chapter. The second and third stages are discussed in the subsequent sections of this chapter.
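The detect-and-track loop of Figures 3.2 and 3.4 can be summarized as the following skeleton. The helper functions detect_eye, extract_template and match_template are hypothetical stand-ins for the skin-based detector of Chapter 2, the template extraction step, and the cross-correlation matcher of Section 3.1.2; this is a sketch of the control flow only, not the thesis code.

```python
def track_video(frames, detect_eye, extract_template, match_template):
    """Alternate between detection and tracking; re-detect whenever matching fails."""
    template, location = None, None
    locations = []
    for frame in frames:
        if template is None:                          # (re-)initialize from the detection stage
            location = detect_eye(frame)
            template = extract_template(frame, location) if location is not None else None
        else:
            location = match_template(frame, template, location)
            if location is None:                      # tracking failed: force re-detection
                template = None
        locations.append(location)
    return locations
```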
3.1.1 Template Features

A template image can be composed of any feature that is characteristic of the object being tracked. The choice of template feature is critical in a tracking algorithm. The feature must be insensitive to noise and to other variations such as illumination. The simplest feature is the intensity of a pixel. Figure 3.6 shows an eye template based on the intensity feature. Other local features such as edge elements or chromaticity components may be used depending on the application.

Figure 3.6 Eye template based on the pixel intensity

The template features commonly used for tracking an eye are:
- the intensity feature;
- the skin probability feature;
- a combination of the intensity and skin probability features.

1. The intensity feature creates a sub-image from the original image. Template matching with a sub-image is the well-known image registration problem. If the template area has distinct edges, as in the eye region, the intensity feature performs well. The intensity feature, however, is sensitive to variations due to pixel noise and illumination effects.

2. An estimate of the skin probability can be used as an alternative template feature, as expressed in equation 3.1. The two-dimensional histogram of color components provides a distinct skin probability profile around the eyes. These skin color variations in the eye area are unaffected by small intensity changes and pixel noise. Figure 3.7 shows a template generated using the skin probability feature. The extracted template shows the skin probability in the eye area.

T(x,y) = P(skin | x,y)    (3.1)

where T(x,y) is the template feature at location (x,y) and P(skin | x,y) is the probability of skin given the (x,y) location.

Figure 3.7 Eye template created using the skin probability feature

3. A new feature can be derived by combining both the intensity and skin probability features. The skin probability estimate is multiplied by the pixel intensity, as depicted in equation 3.2. The resulting feature emphasizes the intensity variations without the disadvantages of illumination effects. An eye template created using the new feature is shown in Figure 3.8.

T(x,y) = I(x,y) · P(skin | x,y)    (3.2)

where T(x,y) is the template feature at location (x,y), I(x,y) is the intensity of a pixel, and P(skin | x,y) is the probability of skin given the (x,y) location.

Figure 3.8 Eye template created using the combined feature

3.1.2 Template-Matching using Cross-Correlation

Cross-correlation is a standard method of estimating the degree to which two signals or images are correlated. In order to identify a 2D pattern in an image, the template is correlated across the image. The locations with high correlation are the regions where the template matches the image.

Template-matching algorithm: Let T(x,y) represent a two-dimensional template that must be found in a larger image I(x,y). Position the template (or mask) sequentially throughout the image and compute the linear cross-correlation at each position, as shown in Figure 3.9. The correlation coefficient at a position (m,n) is given by equation 3.3:

ρ(m,n) = Σ_{x,y} (I(x,y) − Ī(x,y)) (T(x−m, y−n) − T̄) / [ Σ_{x,y} (I(x,y) − Ī(x,y))² · Σ_{x,y} (T(x−m, y−n) − T̄)² ]^{1/2}    (3.3)

The numerator estimates the correlation between the template and the image at a specified position (m,n). The correlation is normalized by the image energy in the denominator. The ratio is the cross-correlation coefficient ρ at location (m,n), where −1 ≤ ρ ≤ +1. Equation 3.3 creates a correlation coefficient matrix ρ over the entire image I.

Figure 3.9 Cross-correlation of the dotted template on a larger image. The arrows indicate shifting of the template window.
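A direct, if slow, way to compute the correlation matrix of equation 3.3 is the brute-force sketch below (assuming NumPy; function and variable names are illustrative, not from the thesis). Each image window is zero-meaned and compared against the zero-meaned template, giving coefficients between −1 and +1.

```python
import numpy as np

def zncc_map(image, template):
    """Zero-mean normalized cross-correlation over all template positions (equation 3.3)."""
    H, W = image.shape
    h, w = template.shape
    t = template - template.mean()
    t_energy = np.sqrt((t ** 2).sum())
    rho = np.zeros((H - h + 1, W - w + 1))
    for m in range(H - h + 1):
        for n in range(W - w + 1):
            window = image[m:m + h, n:n + w].astype(float)
            wz = window - window.mean()                       # zero-mean the image window
            denom = np.sqrt((wz ** 2).sum()) * t_energy       # normalize by the window and template energy
            rho[m, n] = (wz * t).sum() / denom if denom > 0 else 0.0
    return rho   # the peak of rho marks the best template match
```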
The position of the maximum correlation coefficient in the ρ matrix is then found. This coefficient must satisfy a set of conditions in order to be declared a template match:
- The peak value of the correlation coefficient matrix should reside in an ε-neighborhood of the center of the matrix, where ε is a 5 × 3 window.
- The maximum correlation coefficient must be greater than a threshold θ, where 0 < θ < 1.
- There cannot be any other local maximum in the correlation matrix within p·σ of the absolute maximum coefficient, where σ is the standard deviation of the correlation coefficient matrix and p is a fraction ranging from 0 to 1. This condition ensures that the global maximum is dominant compared to the local maxima.

The above conditions ensure that the template location shifts slowly and that there is only one dominant maximum coefficient in the correlation matrix ρ. If the above conditions are satisfied, then template matching is a success, and the tracking window is updated to the new location of the template match.

An example of cross-correlation based template matching is shown in Figure 3.10. Figure 3.10(a) is an open-eye template T that must be tracked in the face image I. Figure 3.10(b) is the face image representing the tracking search area. The correlation matrix ρ is plotted as an intensity image in Figure 3.10(c); the two areas with bright intensities are the eye regions. Figure 3.10(d) shows a surface plot of the correlation matrix. Notice that the maximum correlation coefficient is located at the center of the left eye. In a face tracking algorithm, the search area is restricted to only one half of the face. The surface plot corresponding to the left half of the correlation matrix is shown in Figure 3.10(e). The left correlation matrix satisfies all the conditions for a template match.

Figure 3.10 Cross-correlation based face tracking. (a) Template eye image (b) Face image (c) Correlation matrix (d) Surface plot of the correlation matrix (e) Surface plot of the left half of the correlation matrix

Another example of template matching, with the eye closed, is shown in Figure 3.11. Figure 3.11(a) is an open-eye template T that must be tracked in the face image I with the eyes closed. A maximum correlation coefficient is centered at the left eye in spite of the eyes in the face image being closed.

Figure 3.11 Cross-correlation based face tracking. (a) Template eye image (b) Face image with eyes closed (c) Correlation matrix (d) Surface plot of the correlation matrix
(e) Surface plot of the left half of the correlation matrix

The above results show that tracking an eye is possible in a realistic situation with eye blinks and face rotation. The results of eye tracking on a 15-second video sequence are presented in the next section.

3.1.3 Results

Template based tracking was applied to various videos of the VidTIMIT database. Video sequences with faces rotating sideways and upwards were chosen to test the robustness of the tracking algorithm. Figure 3.12 shows eye tracking results on a video sequence. Pixel intensity is used as the template feature, and the cross-correlation parameters θ and p are 0.7 and 0.5 respectively. After the initialization of the eye template, tracking did not fail for the entire 15-second video sequence. Tracking between each frame (384 × 512) takes an average of 0.0757 seconds on a standard computer with 2 GB RAM and a 1.6 GHz Pentium processor. At this speed, a maximum of only 13 frames can be tracked per second. This tracking speed must be improved to accommodate other computations such as eye closure classification and PERCLOS estimation. Tracking speed can be improved in one of two ways:
- Skip alternate frames for tracking analysis. This improves the algorithm efficiency roughly by a factor of two, but it leads to more tracking misses due to abrupt changes, resulting in extra initializations of the detection stage.
- Reduce the resolution of the video data for tracking analysis. This can improve the algorithm efficiency depending on the reduced resolution. It was found that tracking is largely unaffected by reducing the resolution by a factor of 2 or even 4. Tracking quarter-resolution frames (96 × 128) takes an average of 0.0192 seconds, an improvement factor of 4 over the full-resolution frames; a maximum of 52 frames per second can be tracked at the improved speed.

Figure 3.13 shows the eye tracking results on frames with quarter resolution. The tracking results remained unaltered with the quarter-resolution frames.

Figure 3.12 Face tracking results with the pixel intensity feature. The white rectangle shows the tracking of the left eye.

Figure 3.13 Face tracking results with the pixel intensity feature on quarter-resolution frames. The white rectangle shows the tracking of the left eye.

Tracking results are fairly similar with the other two features discussed earlier, the skin probability feature and the combined feature. Figure 3.14 and Figure 3.15 show the tracking results with the skin probability feature and the combined skin feature respectively on quarter-resolution frames. The drawback of both these features is algorithmic efficiency: even with quarter-resolution video data, both methods take 0.1102 seconds to track each frame, and the maximum number of frames that can be tracked per second falls to 9. Therefore, the pixel intensity feature is preferred over the other features.

Figure 3.14 Face tracking results with the skin probability feature on quarter-resolution frames. The white rectangle shows the tracking of the left eye.

Figure 3.15 Face tracking results with the combined skin feature on quarter-resolution frames. The white rectangle shows the tracking of the left eye.

Tracking can fail for various reasons such as occlusion and sudden changes in the object of interest. In eye tracking, if the face rotates sideways and the eye is completely occluded, then tracking fails: the cross-correlation method cannot find a maximum correlation coefficient, and the eye template has to be reinitialized.
Figure 3.16 Face tracking results with the pixel intensity feature on quarter-resolution frames. The white rectangle shows the tracking of the left eye. Tracking fails at the last frame.

Figure 3.16 shows an example of tracking failure due to occlusion of the eye. In the last frame, the eye is partially occluded by the nose, resulting in a tracking failure.

3.2 Level Set based Tracking

The level set method is a numerical technique for tracking curves. Unlike template tracking, the level set method tracks the exact region of interest using curves.

Figure 3.17 Differences between template and level set tracking. (a) Template tracking of the left eye region. (b) Level set tracking of the left eye region.

Figure 3.17 shows the difference between template tracking and level set tracking. In template tracking, the shape of the template remains unchanged while its location moves from frame to frame. In level set tracking, the region is tracked continuously with small changes to the curve in each frame. The advantage of the level set method is that the motion of the curve can be controlled with numerical computations on a fixed Cartesian grid without parameterizing the curve [31]. In addition, the level set method permits changes of topology, as encountered in the merging of two objects, without parameterization [31].

3.2.1 Level Set Representation of Curves

A curve in two-dimensional space can be defined as an interface that separates the ℝ² domain into two subdomains with non-zero areas. This definition is valid only for closed curves, with clearly defined interior and exterior regions. For example, consider φ(x̄) = x² + y² − 1, where the interface (curve) defined by φ(x̄) = 0 is the unit circle ∂Ω = {x̄ : |x̄| = 1}. The interior region is the unit disk Ω⁻ = {x̄ : |x̄| < 1}, and the exterior region is Ω⁺ = {x̄ : |x̄| > 1}. These regions are shown in Figure 3.18.

Figure 3.18 Level set representation of the curve x² + y² = 1. The exterior and interior regions are represented by Ω⁺ and Ω⁻ respectively; ∂Ω represents the interface φ(x̄) = 0.

The explicit representation of this curve is simply the unit circle defined by ∂Ω = {x̄ : |x̄| = 1}. In contrast, the level set offers an implicit representation of the curve C, using a surface function φ(x̄), as depicted in equation 3.4:

C = {x̄ ∈ ℝ² : φ(x̄) = 0}    (3.4(a))
φ(x̄) = x² + y² − 1          (3.4(b))

Explicit representation of curves

In two spatial dimensions, the explicit curve definition needs to specify all the points on the curve C. While this is easy for a unit circle, it can be more difficult for general arbitrary curves. In general, one needs to parameterize the curve with a two-dimensional vector function x̄(s), where the parameter s belongs to [s₀, s_f]. The condition that the curve be closed implies that x̄(s₀) = x̄(s_f). The parametric representation of the unit circle in Figure 3.18 is x̄(s) = [cos(s), sin(s)], where s goes from 0 to 2π. A convenient way of approximating an explicit representation is to discretize the parameter s into a finite set of points s₀ < ... < s_{i−1} < s_i < s_{i+1} < ... < s_f, where the subintervals [s_i, s_{i+1}] are not necessarily of equal size. For each point s_i in the parameter space, we then store the corresponding two-dimensional location of the curve, denoted x̄(s_i). As the number of points in the discretized parameter space is increased, so is the resolution of the two-dimensional curve.
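As a small illustration of the two representations (not taken from the thesis), the unit circle of equation 3.4 can be stored either explicitly, by sampling the parameter s, or implicitly, by evaluating φ on a grid over a bounded domain D. The grid size and tolerance below are arbitrary choices.

```python
import numpy as np

# Implicit representation: evaluate phi(x) = x^2 + y^2 - 1 over D = [-2, 2] x [-2, 2].
x, y = np.meshgrid(np.linspace(-2, 2, 401), np.linspace(-2, 2, 401))
phi = x ** 2 + y ** 2 - 1

interior = phi < 0                    # the unit disk, Omega-
exterior = phi > 0                    # Omega+
near_curve = np.abs(phi) < 0.02       # grid points close to the zero level set approximate C

# Explicit (parametric) representation of the same curve, for comparison.
s = np.linspace(0, 2 * np.pi, 100)
curve_points = np.stack([np.cos(s), np.sin(s)], axis=1)   # samples of x(s) on the unit circle
```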
Implicit representation of curves

An implicit representation of the curve C could be stored with a discretization of the entire ℝ² domain, which is impractical. Instead, discretizing a bounded subdomain D ⊂ ℝ² is sufficient. Within this domain, we choose a finite set of points (x_i, y_i) for i = 1, ..., N to discretely approximate the implicit function φ, as shown in Figure 3.19.

Figure 3.19 (a) Implicit function φ(x̄) of a unit circle. The circle at φ(x̄) = 0 is highlighted in white. (b) Surface plot of φ(x̄) in the domain [−2, −2] to [2, 2].

Figure 3.20 (a) Discretized implicit function φ(x̄) of a unit circle. The circle at φ(x̄) = 0 is highlighted in white. (b) Surface plot of the discretized φ(x̄) in the domain [−2, −2] to [2, 2]; only the region around the unit circle is discretized. (c) Unit circle.

This is a drawback of the implicit surface representation: instead of resolving a one-dimensional interval [s₀, s_f], one needs to resolve a two-dimensional region D. This can be avoided in part by placing all the points x̄ very close to the interface, leaving the rest of D unresolved. Since only the curve φ(x̄) = 0 is important, only points x̄ near this curve are needed to represent the curve accurately; the rest of the domain D is unimportant. Clustering points near the curve is a common approach to discretizing an implicit representation. Once we have chosen the set of points that make up our discretization, we store the values of the implicit function φ(x̄) at each of these points. For example, a unit circle in implicit representation can be discretized as shown in Figure 3.20. Figure 3.20(a) shows the small cluster around the unit circle that is discretized. A surface plot of φ(x̄) with the implicit surface representation is shown in Figure 3.20(b), and Figure 3.20(c) shows the unit circle curve φ(x̄) = 0.

3.2.2 Level Set Curve Evolution

Tracking requires not just the ability to represent a curve C, but also the ability to modify it continuously in each frame. The changes in the curve C in each frame can be modeled using the velocity in the normal direction.
The curve evolution differential equation for an initial implicit surface φ and a speed field F is given by

∂φ/∂t + F|∇φ| = 0    (3.5)

where F is the speed of the curve in the outward normal direction N̄. Figure 3.21 shows the normal direction on a unit circle.

Figure 3.21 Evolution of the curve C under speed F N̄

The speed field F can be either positive or negative. When F > 0 the curve moves in the normal direction, and when F < 0 the curve moves opposite to the normal direction. The speed F is specific to a problem and controls the evolution of the curve C. It is generally composed of two parts: an external speed derived from the image data and an internal speed that depends on the geometric properties of the curve C. The speed F can be expressed as a combination of external and internal speeds as in equation 3.6:

F = a − bκ    (3.6)

where a is the external speed and bκ is the internal regularization term. Using equations 3.5 and 3.6, we can convert any image tracking problem into a level set curve evolution problem. Equation 3.5 can be solved by numerical techniques after recognizing that it is a Hamilton-Jacobi equation [31]. The advantages of level set evolution come from the implicit function representation. Instead of the parameterized curve, discretizing the implicit surface φ(x̄) helps in handling sharp corners and topological changes. For example, a curve representing one interior region splitting into two interior regions is shown in Figure 3.22. From left to right in Figures 3.22 (a)-(c), the curve representing the gray interior region splits into two. It is very hard to describe this topology change using a parameterization, whereas the level set technique uses a well-defined implicit surface, Figures 3.22 (d)-(g), and the topology changes naturally.

Figure 3.22 Topology changes in level set methods. (a) Curve with one interior region. (b) Interior region just before splitting. (c) Curve with two interior regions. (d)-(g) Level set representation of the curve with an implicit surface and a changing plane of interest. [43]

3.2.3 Level Set method implementation

The implementation of level set evolution requires discretization of an implicit surface φ on a Cartesian grid of size (Δx, Δy, Δz). Written componentwise, equation 3.5 becomes

φ_t + (F φ_x/|∇φ|) φ_x + (F φ_y/|∇φ|) φ_y + (F φ_z/|∇φ|) φ_z = 0    (3.7)

The derivatives in equation 3.5 are approximated using finite difference techniques, as shown in equation 3.8 [31]:

∂φ/∂x ≈ (φ_{i+1} − φ_{i−1}) / (2Δx),    ∂²φ/∂x² ≈ (φ_{i+1} − 2φ_i + φ_{i−1}) / (Δx)²    (3.8)

Level set evolution starts with a well-defined implicit surface of any standard shape. Signed distance functions are a preferred choice due to the existence of first and second derivatives at all points on the grid. At each iteration, φ is updated using equation 3.5. The level set evolution stops when φ_t reaches a small value, indicating that it is close to the solution.
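A minimal sketch of this update loop is given below, assuming NumPy, a grid spacing of one pixel and a fixed number of iterations; the time step and iteration count are placeholder values. It uses the central differences of equation 3.8 for simplicity; practical implementations typically use upwind differencing and periodically re-initialize φ to a signed distance function.

```python
import numpy as np

def evolve_level_set(phi, F, dt=0.1, iterations=100):
    """Explicit time-stepping of equation 3.5: phi <- phi - dt * F * |grad(phi)|."""
    for _ in range(iterations):
        # central differences for the spatial gradient (np.roll gives periodic boundaries)
        phi_x = (np.roll(phi, -1, axis=1) - np.roll(phi, 1, axis=1)) / 2.0
        phi_y = (np.roll(phi, -1, axis=0) - np.roll(phi, 1, axis=0)) / 2.0
        grad_mag = np.sqrt(phi_x ** 2 + phi_y ** 2)
        phi = phi - dt * F * grad_mag
    return phi   # the zero level set of the returned phi is the evolved curve
```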
While the level set method gives a good solution, the number of iterations results in a significant computational burden [32]. The evolution process can be accelerated by updating equation 3.5 only in a local neighborhood of the curve [33]; such narrow-band level set methods reduce the computation time significantly. Furthermore, in an image tracking problem the goal is to extract the object boundary after each frame, and the evolution process of the level set function itself is not of interest. Taking advantage of this, a fast level set method that does not solve partial differential equations was proposed by Shi et al., 2005 [34].

The aim of the fast level set algorithm is to evolve just the curve C, instead of the surface φ, until it converges to the optimal solution. Instead of defining the surface φ everywhere, the curve is represented by two lists of boundary points, L_in and L_out, as shown in Figure 3.23. These list points are updated continuously until the optimal solution is reached.

Figure 3.23 Level set surface representation using boundary lists. For faster computation, integer values are chosen for the surface φ.

φ(x) = 3, if x is outside C and x ∉ L_out;
       1, if x ∈ L_out;
      −3, if x is inside C and x ∉ L_in;
      −1, if x ∈ L_in.    (3.9)

The above method is two orders of magnitude faster than ordinary level set evolution for image tracking problems. This accelerated evolution enables us to use level set tracking on video at 25 frames per second.

3.2.3 Results

Level set tracking requires an initialization for the first curve and the speed field F for the subsequent frames. The initialization must provide a starting curve with speed F < 0. The speed field F can be a binary mask representing any feature of interest. For example, eyes can be tracked by using the darkness of the pupil region: the speed field F for the eyes can be created by thresholding the pixel intensity, as shown in Figure 3.24.

Figure 3.24 Speed field F in level set tracking of eyes. (a) Pixel intensity (b) Speed field F: thresholded pixel intensity.

Figure 3.25 Level set tracking results of the left eye pupil.

Figure 3.26 Level set tracking results of the left eye pupil (close-up)

The results of level set tracking with the initial curve set to the left eye are shown in Figure 3.25. The left eye pupil being tracked is shaded white. Figure 3.26 shows close-up images of the left eye. The level set tracking algorithm never lost track of the left eye for the entire duration of the video sequence. Tracking between each frame (384 × 512) takes an average of 0.0719 seconds on a standard PC with 2 GB RAM and a 1.6 GHz Pentium processor. With this computational speed, a maximum of only 14 frames can be tracked per second. Reducing the resolution of the video images does not affect tracking in most cases. Level set tracking of half-resolution images (192 × 256) takes an average of 0.0221 seconds, and the maximum number of frames that can be tracked per second improves to 45.

To summarize this chapter, both the template based methods and the level set methods track the eyes of different people successfully. In processing data from a 90-second video stream (i.e. 2250 frames), the tracking algorithm had to re-initialize the face detection on only two instances. The tracking algorithms primarily failed when the face of the subject was rotated sideways. Tracking one eye is sufficient for the computation of the PERCLOS feature, as both eyes close simultaneously. A critical measure of the usefulness of a tracking algorithm is its tracking speed per frame. Table 3.1 shows the average tracking speeds of the different methods.
Table 3.1 Speed comparison of various tracking methods

  Method               Feature            Resolution   Tracking speed per frame (s)   Maximum frames tracked per second
  Template matching    Pixel intensity    384 x 512    0.0757                         13
  Template matching    Pixel intensity    96 x 128     0.0192                         52
  Template matching    Skin probability   96 x 128     0.1102                         9
  Level set tracking   Pixel intensity    384 x 512    0.0719                         14
  Level set tracking   Pixel intensity    192 x 256    0.0221                         50

Chapter 4 PERCLOS feature computation

The PERCLOS feature is a reliable measure of drowsiness [10] [22]. It measures the percentage of time in a minute during which the eyes are at least 80% closed. An eye closure classifier combined with the tracking results from Chapter 3 can provide a reliable PERCLOS feature. The block diagram of PERCLOS feature estimation is shown in Figure 4.1.

Figure 4.1 Block diagram of PERCLOS estimation (video data, eye tracking, eye closure classifier, accumulation of previous results, PERCLOS feature)

The eye tracking module tracks the eye region continuously in the incoming video data, as discussed in Chapter 3. The eye closure classifier takes the output of the tracking module and classifies the eye region as open or closed. The classifier results from the frames obtained in the previous one minute are accumulated to estimate the PERCLOS feature. The template based and level-set based tracking methods give different tracking outputs of the eye region. The level set tracking output follows the eye pupil region exactly, and an eye closure classifier can make use of this output directly to differentiate between open and closed eyes. The tracking output from the template method can be pre-processed to generate an output similar to that of level-set tracking.

4.1 Pre-Processing

The main objective of the pre-processing stage is to identify the eye pupil area from the template tracking output, as shown in Figure 4.2. The eye pupil area has dark intensity and a non-skin color, usually black, brown or blue. Therefore the block diagram of the pre-processing stage contains a skin removal step and an intensity threshold step, as depicted in the schematic in Figure 4.3.

Figure 4.2 Pre-processing output of template tracking. (a) Template output shown in the white rectangle (b) Eye pupil area highlighted in white

Figure 4.3 Pre-processing stages for generating the eye pupil area from the eye template (eye template, skin removal, intensity threshold, largest region, eye pupil area)

For the skin removal step, any of the skin detection models from Chapter 2 can be used. Skin pixels are detected using the explicit RGB skin model and are removed from the eye template, as shown in Figure 4.4(b). The output of the skin removal stage is then thresholded for dark intensities. The eye pupil area constitutes around 10% of the total template window area; therefore, the 10th percentile of intensity is chosen as the intensity threshold. Figure 4.4(c) shows the output after intensity thresholding. The largest 4-neighborhood region is selected as the eye pupil area. Figures 4.4(d) and (e) show the segmented eye pupil area by itself and overlaid on the eye template, respectively. A sketch of this pipeline follows.

Figure 4.4 Pre-processing at various stages. (a) Eye template (b) Skin removal (c) Intensity threshold (d) Eye pupil area (e) Eye pupil area overlaid on the eye template

The eye pupil area, which is the final output of the pre-processing stage, is used as the primary feature in eye closure classification.
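The pre-processing chain of Figure 4.3 can be sketched as follows. This is illustrative only: the skin mask is assumed to come from one of the Chapter 2 models, SciPy is used for the connected-component step, and the function name is my own.

```python
import numpy as np
from scipy import ndimage

def extract_pupil_area(template_rgb, skin_mask):
    """Skin removal, 10th-percentile intensity threshold, largest 4-connected region (Section 4.1)."""
    gray = template_rgb.astype(float).mean(axis=2)    # simple intensity image of the eye template
    gray = np.where(skin_mask, 255.0, gray)           # skin pixels are pushed bright so they fail the dark test
    thresh = np.percentile(gray, 10)                  # pupil is roughly 10% of the template window area
    dark = gray <= thresh

    # keep only the largest 4-connected component as the eye pupil area
    four_conn = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
    labels, n = ndimage.label(dark, structure=four_conn)
    if n == 0:
        return np.zeros_like(dark, dtype=bool)
    sizes = ndimage.sum(dark, labels, index=range(1, n + 1))
    return labels == (int(np.argmax(sizes)) + 1)
```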
4.2 Eye Closure Classification

The eye closure classifier must detect whether the eyes are open or closed. The eye pupil area is a reliable feature containing information about eye closure. If the eye pupil area is short and elongated in the horizontal direction, then the eyes are probably closed. Figure 4.5 shows the eye pupil area for open and closed eyes.

Figure 4.5 Eye pupil area of open and closed eyes. (a) Eye pupil area of an open eye (b) Eye pupil area of a closed eye.

A training data set containing about 200 eye images was created manually from the VidTIMIT video dataset [24] to test the eye closure classifier. Eyes from multiple people were chosen at different stages of eye closure. Overall, 30 closed and 130 open eyes were extracted. There are also about 40 eye images in which the eyes are partially closed and a part of the eye pupil is visible. All these eye regions were pre-processed to extract the eye pupil region for template based tracking; level set tracking already provides the eye pupil region. The height and width of the eye pupil region were computed and plotted in Figure 4.6.

Figure 4.6 Plot of height vs. width of the eye pupil region (open, closed and partially closed eyes).

It is seen in Figure 4.6 that the height of the eye pupil discriminates well between the open and closed eye classes. A threshold has to be chosen based on the training eye image data. For a video resolution of 192 × 256 (length × width), a threshold of 3 pixels is chosen: an eye pupil area is classified as open if its height is greater than 3 pixels.

PERCLOS feature computation is straightforward after the eye closure classification. As PERCLOS is the percentage of time in 60 seconds during which the eyes are at least 80% closed, the eye closure classification results on the frames of the last 60 seconds (at 25 frames per second) are stored and used for computing the PERCLOS feature as described below:

PERCLOS = (1 / (60 × 25)) Σ_{n=1}^{60×25} EyeClosure{frame(n)}    (4.1)

A PERCLOS based drowsy driver detection system uses the following rule to detect and alert drivers:

  if PERCLOS > α:
      Drowsy driver: alert action
  else:
      Attentive driver: no action; reset PERCLOS and continue monitoring

The threshold α typically ranges from 0.7 to 1 and can be determined more accurately using a large database containing ground truth of drowsiness and the PERCLOS feature [45].
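Putting the classifier and equation 4.1 together, a per-frame update of the PERCLOS estimate might look like the following sketch. The 25 frames-per-second rate and the 3-pixel height threshold follow the values quoted above; the default α of 0.8 and all function and variable names are my own assumptions.

```python
import numpy as np
from collections import deque

FRAME_RATE = 25                 # frames per second assumed in equation 4.1
WINDOW = 60 * FRAME_RATE        # one-minute window of frames

def eye_closed(pupil_mask, height_threshold=3):
    """Eye closure classifier of Section 4.2: closed if the pupil region is at most ~3 pixels tall."""
    rows = np.where(pupil_mask.any(axis=1))[0]
    height = 0 if rows.size == 0 else int(rows.max() - rows.min() + 1)
    return height <= height_threshold

closure_history = deque(maxlen=WINDOW)   # EyeClosure{frame(n)} for the last 60 seconds

def update_perclos(pupil_mask, alpha=0.8):
    """Accumulate per-frame closure decisions (equation 4.1) and apply the alert rule."""
    closure_history.append(1 if eye_closed(pupil_mask) else 0)
    perclos = sum(closure_history) / len(closure_history)
    return perclos, perclos > alpha       # True means drowsy driver, trigger the alert action
```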
4.3 Results

The eye closure classifier was applied to the tracking output of multiple video sequences chosen from the VidTIMIT dataset. The entire sequence of face detection, eye tracking and eye closure classification was applied to each frame automatically. Using a threshold value of 3 pixels for the height of the pupil region, the algorithm was evaluated on a test dataset of 100 eye images. The results of eye closure classification on video sequences with different subjects are shown in Figures 4.7 - 4.10.

Figure 4.7 Snapshots of a video sequence with an eye class tag (OPEN / CLOSED) for each frame - 1. The only misclassification is the frame located at the 2nd row and 4th column.

Figure 4.8 Snapshots of a video sequence with an eye class tag for each frame - 2. All the frames were correctly classified.

Figure 4.9 Snapshots of a video sequence with an eye class tag for each frame - 3. The only misclassification is the frame located at the 1st row and 1st column.

Figure 4.10 Snapshots of a video sequence with an eye class tag for each frame - 4. All the frames were correctly classified.

The confusion matrix of the eye closure classifier on 64 manually verified images is shown in Table 4.1.

Table 4.1 Confusion matrix of the eye closure classifier on 64 test images

                                 Eyes Closed (Classifier)   Eyes Open (Classifier)
  Eyes Closed (Ground Truth)     11                         2
  Eyes Open (Ground Truth)       0                          51

On the test data, all 11 closed eyes are classified correctly by the eye closure classifier, while 51 of the 53 open eyes are correctly classified. This translates into a correct classification performance of 97%. The performance of the tracking algorithms was also evaluated in Chapter 3, where it was seen that the algorithm rarely lost track of the eyes: in processing data from a 90-second video stream (i.e. 2250 frames), the tracking algorithm had to re-initialize the face detection on two instances. Furthermore, the algorithm is computationally efficient and provides real-time tracking. The tracking algorithms primarily failed only when the face of the subject was rotated sideways; in this case the eye pupil region is not visible. In a practical setup, this momentary loss of tracking does not affect PERCLOS estimation, because when the face is rotated sideways the subject is physically active and attentive to the surroundings. In other words, we can classify the eye as open when the face is rotated sideways without affecting the PERCLOS feature.

Chapter 5 Conclusion and Future Work

5.1 Conclusion

About one third of U.S. drivers had fallen asleep while driving during the last year alone [8]. Drowsy driving results in 1,550 fatalities and 100,000 vehicle crashes annually [3]. These crashes could be preventable with a drowsy driver detection system that predicts drowsiness and alerts the driver. Most drowsy driver detection systems rely on the PERCLOS feature, which measures the amount of time the driver's eyes are closed in the last 60 seconds. The proposed system features an on-board color camera monitoring the driver's face in real time. Image processing algorithms track the eyes of the driver and determine whether his or her eyes are open or closed. Aggregating the eye closure information over 60 seconds provides a reliable PERCLOS estimate, and hence a drowsy driver detection system. This thesis proposes a PERCLOS based detection system, which involves three major components:
1. Skin detection modeling
2. Face / eye tracking
3. Eye closure classification

Three skin detection models, the explicit RGB model, the non-parametric histogram model and the parametric Gaussian model, were implemented. An adaptive training methodology was proposed for the non-parametric and parametric models to compensate for illumination variations. It was found that the histogram based non-parametric skin model gives optimum performance in terms of face detection and is robust to illumination variations.
Template-based and level-set based tracking algorithms using intensity and skin-probability features were implemented. A new feature, a combination of the intensity and skin probability features, was proposed. Both tracking algorithms provided comparable results on reduced-resolution data. Real-time efficiency on a standard computer is achieved on quarter-resolution (96 pixels x 128 pixels) image data. A new eye closure classifier based on eye pupil shape was tested for the determination of eye closure and was seen to perform consistently on multiple video sequences. The combination of face and eye tracking algorithms and the eye closure classifier is shown to give a reliable PERCLOS estimate on a large face video database (VidTIMIT [24]). The above algorithms provide a prototype for a drowsy driver detection system using PERCLOS with an on-board video camera.

5.2 Future Work

The detection system must be tested on a more exhaustive data set. Extensive testing must be done by systematically varying illumination, pixel noise, driver ethnicity, etc. The real-world performance of the detection system could be validated on drivers operating driving simulators in a drowsy state. In the future, it would be of great research value to create a benchmark video database for drowsiness testing. This video database could be used to compare the performance of various drowsy detection systems developed by different researchers with respect to both time and accuracy. Performance should be quantified in terms of the probability of detecting drowsiness as well as the probability of false alarm. The tracking results could be further improved by the fusion of tracking outputs from the template and level set methods. Though data fusion is computationally intensive, a hardware implementation of the system using a digital signal processor (DSP) board could achieve real-time performance. A complete prototype system using a camera and a DSP module should be assembled and tested.

Appendix A - Connected components algorithm

The connected components algorithm [41] finds and labels the objects in a binary image. Suppose that B is a binary image and that B[r,c] = B[r',c'] = v, where either v = 0 or v = 1. The pixel [r,c] is connected to the pixel [r',c'] with respect to the value v if there is a sequence of pixels [r,c] = [r0,c0], [r1,c1], ..., [rn,cn] = [r',c'] in which B[ri,ci] = v for i = 0, ..., n and [ri,ci] neighbors [ri-1,ci-1] for each i = 1, ..., n. The sequence of pixels [r0,c0], ..., [rn,cn] forms a connected path from [r,c] to [r',c']. A connected component of value v is a set of pixels C, each having value v, such that every pair of pixels in the set is connected with respect to v. Figure A.1(a) shows a binary image with five such connected components of 1s; these components are connected with respect to either the 8-neighborhood or the 4-neighborhood definition.

Figure A.1 A binary image with five connected components of the value 1. (a) Binary image matrix (b) Connected components labeling (c) Binary image (d) Labeled image [41]
Definition. A connected components labeling of a binary image B is a labeled image LB in which the value of each pixel is the label of its connected component. A label is a symbol that uniquely names an entity. While character labels are possible, the positive integers are more convenient and are most often used to label the connected components. Figure A.1(b) shows the connected components labeling of the binary image of Figure A.1(a).

Recursive Labeling Algorithm: Suppose that B is a binary image with MaxRow + 1 rows and MaxCol + 1 columns. We wish to find the connected components of 1-pixels and produce a labeled output image LB in which every pixel is assigned the label of its connected component. The strategy is to first negate the binary image, so that all the 1-pixels become -1s. This is needed to distinguish unprocessed pixels (-1) from those of component label 1. We accomplish this with a function called negate that inputs the binary image B and outputs the negated image LB, which will become the labeled image. The process of finding the connected components then becomes one of finding a pixel whose value is -1 in LB, assigning it a new label, and calling the procedure search to find its neighbors that have value -1, recursively repeating the process for these neighbors. The utility function neighbors(L, P) is given a pixel position defined by L and P. It returns the set of pixel positions of all of its neighbors, using either the 4-neighborhood or the 8-neighborhood definition. Only neighbors that represent legal positions on the binary image are returned.

Recursive algorithm pseudo code (B is the original binary image; LB will be the labeled connected component image):

  procedure recursive_connected_components(B, LB);
  {
      LB := negate(B);
      label := 0;
      find_components(LB, label);
      print(LB);
  }

  procedure find_components(LB, label);
  {
      for L := 0 to MaxRow
          for P := 0 to MaxCol
              if LB[L,P] == -1 then
              {
                  label := label + 1;
                  search(LB, label, L, P);
              }
  }

  procedure search(LB, label, L, P);
  {
      LB[L,P] := label;
      Nset := neighbors(L, P);
      for each [L', P'] in Nset
      {
          if LB[L', P'] == -1 then
              search(LB, label, L', P');
      }
  }
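The recursive search can exceed the call-stack depth on large components. A queue-based variant (not part of the thesis, shown only for illustration) produces the same labeling without recursion:

```python
from collections import deque

def connected_components(B, use_8_neighborhood=False):
    """Iterative connected components labeling of a binary image B (list of lists of 0s and 1s)."""
    rows, cols = len(B), len(B[0])
    LB = [[0] * cols for _ in range(rows)]
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if use_8_neighborhood:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    label = 0
    for r in range(rows):
        for c in range(cols):
            if B[r][c] == 1 and LB[r][c] == 0:
                label += 1                              # start a new component
                queue = deque([(r, c)])
                LB[r][c] = label
                while queue:                            # breadth-first growth of the component
                    y, x = queue.popleft()
                    for dy, dx in offsets:
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols and B[ny][nx] == 1 and LB[ny][nx] == 0:
                            LB[ny][nx] = label
                            queue.append((ny, nx))
    return LB, label
```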
Appendix B - Fast Level Set Method

The fast level set method [46] [34] uses a region-based tracking model based on the region competition idea, but the implementation is also applicable to edge-based models. We assume each scene of the video sequence is composed of a background region Ω₀ and M object regions Ω₁, Ω₂, ..., Ω_M. The boundaries of these M object regions are denoted C₁, C₂, ..., C_M. We model each region with a feature distribution p(y | Ω_m), m = 0, 1, ..., M, where y is the feature vector defined at each pixel. For example, the features can be the color, the output of a filter bank designed to model textures, or other visual cues. Assuming that the feature distribution at each pixel is independent, the tracking result for each frame is the minimum of the following region competition energy:

E = − Σ_{m=0}^{M} ∫_{Ω_m} log p(y(x) | Ω_m) dx + λ Σ_{m=1}^{M} ∫_{C_m} ds    (B.1)

where the first term E_d is the data fidelity term that represents the likelihood of the current scene, the second term E_s enforces smoothness regularization and is proportional to the length of all curves, and λ is the non-negative regularization parameter. By computing the first variation of this energy, the curve evolution equation for the minimization of this energy is:

dC_m/dt = (F_d + F_s) N_{C_m},  m = 1, 2, ..., M    (B.2)

where N_{C_m} is the outward normal of C_m, and F_d and F_s are the speeds resulting from E_d and E_s, respectively. The speed F_d represents the competition between two regions and is F_d = log [ p(y(x) | Ω_m) / p(y(x) | Ω_out) ], where Ω_out denotes the region outside C_m at x ∈ C_m. The speed F_s makes the curve smooth and is F_s = λκ, where κ is the curvature.

Figure B.1 Implicit representation of the curve C₁ and the two lists L_in and L_out in the neighborhood of C₁ [46]

Figure B.2 Illustration of the motion of the curve C₁ by switching pixels between L_in and L_out [46]

Since the focus here is on a real-time level set implementation for video tracking, we use a simple tracking strategy as follows. For each frame, we use the tracking results from the last frame as the initial curves, and then evolve each curve according to (B.2) to locate the object boundaries in the current frame. Once the evolution stops, we move on to track the next frame of the video sequence.

Fast Level Set Implementation

In this section, we present a fast level set implementation of the curve evolution process in (B.2) when the scene is composed of only the background Ω₀ and a single object region Ω₁. Extensions to track multiple objects with different feature distributions are then made in the next section.

To represent the background Ω₀ and the object region Ω₁, we use a level set function φ which is negative in Ω₁ and positive in Ω₀. Based on this representation, we define two lists of neighboring pixels, L_in and L_out, of C₁ as shown in Figure B.1. Formally they are defined as:

L_out = { x : φ(x) > 0 and ∃ y ∈ N₄(x) such that φ(y) < 0 },
L_in  = { x : φ(x) < 0 and ∃ y ∈ N₄(x) such that φ(y) > 0 }    (B.3)

where N₄(x) is the 4-connected discrete neighborhood of a pixel x with x itself removed. In conventional implementations of the level set method, an evolution PDE is solved either globally on the whole domain or locally in a narrow band to evolve the curve according to (B.2). Our fast level set implementation is based on the key observation that the implicitly represented curve C₁ can be evolved at pixel resolution by simply switching the neighboring pixels between the two lists L_in and L_out. For example, as shown in Figure B.2, the curve C₁ moves outward at pixel A and shrinks and splits at pixel B compared with the curve shown in Figure B.1. This motion can be realized by simply switching pixel A from L_out to L_in, and pixel B from L_in to L_out. By doing this for all the pixels in L_in and L_out, the curve can be moved inward or outward by one pixel everywhere in one scan. Since the curve is still represented implicitly, topological changes can be handled automatically. With this idea, we can achieve level-set based curve evolution at pixel resolution, and this is usually enough for many imaging applications. Next we present the details of the fast algorithm.

Basic Algorithm

The data structure used in our implementation is as follows:
- an array for the level set function φ;
- an array for the evolution speed F;
- two bi-directionally linked lists of neighboring pixels: L_in and L_out.

Besides the inside and outside neighboring pixels contained in L_in and L_out, we call those pixels inside C₁ but not in L_in interior pixels, and those pixels outside C₁ but not in L_out exterior pixels. For faster computation, we define φ as follows:

φ(x) = 3, if x is outside C₁ and x ∉ L_out;
       1, if x ∈ L_out;
      −3, if x is inside C₁ and x ∉ L_in;
      −1, if x ∈ L_in.    (B.4)

To switch pixels between L_in and L_out, we define two basic procedures on our data structure.
The procedure switch_in() for a pixel x ∈ L_out moves the curve outward one pixel at x by switching x from L_out to L_in and adding all of its neighboring exterior pixels to L_out. Formally this procedure is defined as follows:

switch_in(x):
- Step 1: Delete x from L_out and add it to L_in. Set φ(x) = −1.
- Step 2: For all y ∈ N₄(x) satisfying φ(y) = 3, add y to L_out and set φ(y) = 1.

Similarly, the switch_out() procedure that moves the curve inward one pixel at x ∈ L_in is defined as follows:

switch_out(x):
- Step 1: Delete x from L_in and add it to L_out. Set φ(x) = 1.
- Step 2: For all y ∈ N₄(x) satisfying φ(y) = −3, add y to L_in and set φ(y) = −1.

To track the object boundary, we compute the speed at all pixels in L_in and L_out and store its sign in an array F. We scan through the list L_out and apply a switch_in() procedure at a pixel if F = +1. After this scan, some of the pixels in L_in become interior pixels, and they are deleted. We then scan through the list L_in and apply a switch_out() procedure at a pixel if F = −1. Similarly, exterior pixels in L_out are deleted after this scan. At the end of each iteration, a stopping condition is checked. If it is satisfied, we stop the evolution; otherwise, we continue this iterative process. In our implementation, the following stopping condition is used:

Stopping Condition. The curve evolution algorithm stops if either of the following conditions is satisfied:
(a) The speed at each neighboring pixel satisfies:

F(x) ≤ 0 for all x ∈ L_out,
F(x) ≥ 0 for all x ∈ L_in    (B.5)

(b) A pre-specified maximum number of iterations is reached.

The condition in equation (B.5) is very intuitive in the sense of region competition. When the curve lies on the object boundary, all the pixels in L_out are in the background and all the pixels in L_in are in the object region. When equation (B.5) is satisfied, the two lists disagree with each other on which direction to move the curve, and convergence is reached. When the data is noisy or there is clutter, regularization is necessary and (B.5) may not always be satisfied by the final curve; thus part (b) of the condition is also necessary to stop the evolution.

The above algorithm can be applied to arbitrary speed fields and speeds up the evolution process in (B.2) dramatically compared with previous narrow-band techniques based on solving the level set PDE. For the curve evolution equation in (B.2), we can achieve a further speedup by introducing a scheme that separates the evolution driven by the data dependent speed F_d and the smoothing speed F_s into two different cycles. In spirit, this idea is similar to the work in [10], which proposed a fast method to implement the Chan-Vese model [5] over the whole domain, but the two-cycle algorithm presented here is still based on updating the two linked lists L_in and L_out to evolve the implicitly represented curve. Further details on applying the internal regularization by 2D filtering and on tracking multiple objects can be found in [32] and [46].
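For reference, the two switching procedures and one evolution scan might be sketched as follows. This is an illustration of the idea, not the implementation of [34] or [46]: the neighbors() argument is a hypothetical helper returning the in-image 4-neighbors of a pixel, φ and F are assumed to be arrays indexed by pixel tuples, and L_in / L_out are Python sets.

```python
def switch_in(x, phi, L_in, L_out, neighbors):
    """Move the curve outward one pixel at x in L_out (values follow equation B.4)."""
    L_out.discard(x); L_in.add(x); phi[x] = -1
    for y in neighbors(x):
        if phi[y] == 3:                 # exterior pixel becomes an outside neighbor
            L_out.add(y); phi[y] = 1

def switch_out(x, phi, L_in, L_out, neighbors):
    """Move the curve inward one pixel at x in L_in."""
    L_in.discard(x); L_out.add(x); phi[x] = 1
    for y in neighbors(x):
        if phi[y] == -3:                # interior pixel becomes an inside neighbor
            L_in.add(y); phi[y] = -1

def evolve_one_scan(F, phi, L_in, L_out, neighbors):
    """One scan of the two-list evolution: grow where F > 0, shrink where F < 0."""
    for x in list(L_out):
        if F[x] > 0:
            switch_in(x, phi, L_in, L_out, neighbors)
    for x in list(L_in):                # pixels of L_in that became interior are removed
        if all(phi[y] < 0 for y in neighbors(x)):
            L_in.discard(x); phi[x] = -3
    for x in list(L_in):
        if F[x] < 0:
            switch_out(x, phi, L_in, L_out, neighbors)
    for x in list(L_out):               # pixels of L_out that became exterior are removed
        if all(phi[y] > 0 for y in neighbors(x)):
            L_out.discard(x); phi[x] = 3
```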
Bibliography

1. Storie, V.J., Involvement of goods vehicles and public service vehicles in motorway accidents. Report 1113, UK Department of Transport, Transport and Road Research Laboratory, 1984.
2. O'Hanlon, J.F., What is the extent of the driving fatigue problem? EUR 6065 EN, Commission of the European Communities, 1978.
3. Rau, P.S., NHTSA's drowsy driver research program fact sheet. Washington, DC: National Highway Traffic Safety Administration, 1996.
4. Wang, J.S., Knipling, R.R., and Blincoe, L.J., Motor vehicle crash involvements: a multi-dimensional problem size assessment. Sixth Annual Meeting of the American Intelligent Traffic System, Houston, TX, April 14-18, 1996. Washington, DC: National Highway Traffic Safety Administration.
5. National Survey of Distracted and Drowsy Driving Attitudes and Behaviors: 2002.
6. Kribbs, N.B., Dinges, D.F., Vigilance decrement and sleepiness. In: Sleep Onset Mechanisms. Washington, DC: American ..., 1994.
7. Hartley, L., Fatigue and Driving: Driver Impairment, Driver Fatigue and Driving Simulation. The role of fatigue research in setting driving hours regulations. Taylor & Francis, January 1995, pp. 41-47. ISBN 0748402624.
8. Dinges, D.F., An overview of sleepiness and accidents. Journal of Sleep Research, 1995; 4(2): 4-14.
9. Stern, J., Eye activity measures of fatigue and napping as a counter-measure. USDOT Technical Report FHWA-MC-99-028, January 1999.
10. Federal Highway Administration, Office of Motor Carriers, PERCLOS: A Valid Psychophysiological Measure of Alertness as Assessed by Psychomotor Vigilance.
11. Wunsch, P., Hirzinger, G., Real-time visual tracking of 3D objects with dynamic handling of occlusion. Robotics and Automation, 1997, Proceedings.
12. Liang, R.H., Ouhyoung, M., A real-time continuous gesture recognition system for sign language. Face and Gesture Recognition, 1998.
13. Vezhnevets, V., Sazonov, V., Andreeva, A., A survey on pixel-based skin color detection techniques. Faculty of Computational Mathematics and Cybernetics, Moscow State University, Moscow, Russia.
14. King, D.C., Michels, K.M., Muscular tension and the human blink rate. Journal of Experimental Psychology (1957), 53, 113-116.
15. Stoerring, M., Computer Vision and Human Skin Colour. Ph.D. Dissertation, Computer Vision & Media Technology Laboratory, Aalborg University, Denmark.
16. Stoerring, M., Andersen, H.J., Granum, E., Estimation of the illuminant colour from human skin colour. Computer Vision and Media Technology Laboratory, Aalborg University, Niels Jernes Vej 14, DK-9220 Aalborg, Denmark.
17. Yang, J., Lu, W., and Waibel, A., Skin-color modeling and adaptation. In Chin, R. and Pong, T.C., editors, 3rd Asian Conference on Computer Vision, volume 1352 of LNCS, pages 687-694, Hong Kong, China, January 1998.
18. Schwerdt, K., and Crowley, J.L., Robust face tracking using color. In International Conference on Automatic Face & Gesture Recognition, pages 90-95.
19. Stoerring, M., Computer Vision and Human Skin Colour. Ph.D. Dissertation, Computer Vision & Media Technology Laboratory, Aalborg University, Denmark.
20. U.S. Department of Transportation, Traffic Safety Facts 1999, National Highway Traffic Safety Administration. http://www.nhtsa.dot.gov/people/ncsa/809-100.pdf
21. http://ai.volpe.dot.gov/CarrierResearchResults/HTML/2003Crashfacts/2003LargeTruckCrashFacts.htm
22. Wierwille, W.W., Ellsworth, L.A., Wreggit, S.S., Fairbanks, R.J., Kim, C.L., Research on Vehicle-Based Driver Status/Performance Monitoring: Development, Validation and Refinement of Algorithms for Detection of Driver Drowsiness. National Highway Traffic Safety Administration Final Report DOT HS 808 247, 1994. (October 1998. Publication No. FHWA-MCRT-98-006)
23. Peer, P., Kovac, J., Solina, F., Human skin colour clustering for face detection. Submitted to EUROCON 2003, International Conference on Computer as a Tool.
24. http://users.rsise.anu.edu.au/~conrad/vidtimit/zips/vidtimit_documentation.pdf
25. Sigal, L., Sclaroff, S., Athitsos, V., Estimation and prediction of evolving color distributions for skin segmentation under varying illumination. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (2000), vol. 2, 152-159.
26. Soriano, M., Huovinen, S., Martinkauppi, B., Laaksonen, M., Skin detection in video under changing illumination conditions. In Proc. 15th International Conference on Pattern Recognition (2000), vol. 1, 839-842.
27. Yang, M., Ahuja, N., Gaussian mixture model for human skin color and its application in image and video databases. In Proc. of the SPIE Conference on Storage and Retrieval for Image and Video Databases (SPIE 99), vol. 3656, 458-466.
28. Hakkanen, H., Summala, H., Partinen, M., Tiihonen, M., Silvo, J., Blink duration as an indicator of driver sleepiness in professional bus drivers. Sleep (1999), 22(6), 798-802.
29. Gomez, G., Morales, E., Automatic feature construction and a simple rule induction algorithm for skin detection. In Proc. of the ICML Workshop on Machine Learning in Computer Vision (2002), 31-38.
30. Stern, H., Efros, B., Adaptive color space switching for face tracking in multi-colored lighting environments. In Proc. of the International Conference on Automatic Face and Gesture Recognition (2002), 249-255.
31. Osher, S., Fedkiw, R., Level Set Methods and Dynamic Implicit Surfaces. Springer.
32. Peng, D., Merriman, B., Osher, S., Zhao, H., Kang, M., A PDE-based fast local level set method. Journal of Computational Physics, vol. 155, pp. 410-438, 1999.
33. Sethian, J., A fast marching level set method for monotonically advancing fronts. Proc. Nat. Acad. Sci., vol. 93, no. 4, pp. 1591-1595, 1996.
34. Shi, Y., Karl, W., A fast level set method without solving PDEs. Proc. ICASSP'05.
35. CIE 1931 color space chromaticity diagram. www.en.wikipedia.org/wiki/CIE_1931_color_space
36. Grace, R., Byrne, V.E., Bierman, D.M., Legrand, J.M., Gricourt, D., Davis, B.K., Staszewski, J.J., Carnahan, B., A drowsy driver detection system for heavy vehicles. Digital Avionics Systems Conference, 1998, Proceedings, 17th DASC, AIAA/IEEE/SAE.
37. Mallis, M.M., Evaluation of In-flight Alertness Management Technology. NASA report, 2002.
38. Grace, R., Drowsy Driver Monitor and Warning System. Robotics Institute, Carnegie Mellon University.
39. Shafer, S., Using color to separate reflection components. Technical report for DARPA, 1984, Computer Science Department, University of Rochester, NY.
40. Ji, Q., Bebis, G., Visual cues extraction for monitoring driver vigilance. Procs. Honda Symposium, 1999.
41. Stockman, G., Shapiro, L., Computer Vision. Prentice-Hall Inc., 2001.
42. http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/OWENS/LECT14/
43. http://en.wikipedia.org/wiki/Level_set_method
44. Duda, R.O., Hart, P.E., and Stork, D.G., Pattern Classification (2nd ed.). Wiley Interscience.
45. Knipling, R., PERCLOS: A Valid Psychophysiological Measure of Alertness As Assessed by Psychomotor Vigilance. Federal Highway Administration, Office of Motor Carriers.
46. Shi, Y., Karl, W., Real-time tracking using level sets. CVPR'05, June 2005.