IRIS RECOGNITION: ENHANCING SECURITY AND IMPROVING PERFORMANCE By Renu Sharma A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy 2022 ABSTRACT IRIS RECOGNITION: ENHANCING SECURITY AND IMPROVING PERFORMANCE By Renu Sharma Biometric systems recognize individuals based on their physical or behavioral traits, viz., face, iris, and voice. Iris (the colored annular region around the pupil) is one of the most popular biometric traits due to its uniqueness, accuracy, and stability. However, its widespread usage raises security concerns against various adversarial attacks. Another challenge is to match iris images with other compatible biometric modalities (i.e., face) to increase the scope of human identification. Therefore, the focus of this thesis is two-fold: firstly, enhance the security of the iris recognition system by detecting adversarial attacks, and secondly, accentuate its performance in iris-face matching. To enhance the security of the iris biometric system, we work over two types of adversarial attacks - presentation and morph attacks. A presentation attack (PA) occurs when an adversary presents a fake or altered biometric sample (plastic eye, cosmetic contact lens, etc.) to a biometric system to obfuscate their own identity or impersonate another identity. We propose three deep learning-based iris PA detection frameworks corresponding to three different imaging modalities, namely NIR spectrum, visible spectrum, and Optical Coherence Tomography (OCT) imaging inputting a NIR image, visible-spectrum video, and cross-sectional OCT image, respectively. The techniques perform effectively to detect known iris PAs as well as generalize well across unseen attacks, unseen sensors, and multiple datasets. We also presented the explainability and interpretability of the results from the techniques. Our other focuses are robustness analysis and continuous update (retraining) of the trained iris PA detection models. Another burgeoning security threat to biometric systems is morph attacks. A morph attack entails the generation of an image (morphed image) that embodies multiple different identities. Typically, a biometric image is associated with a single identity. In this work, we first demonstrate the vulnerability of iris recognition techniques to morph attacks and then develop techniques to detect the morphed iris images. The second focus of the thesis is to improve the performance of a cross-modal system where iris images are matched against face images. Cross-modality matching involves various challenges, such as cross-spectral, cross-resolution, cross-pose, and cross-temporal. To address these challenges, we extract common features present in both images using a multi-channel convolutional network and also generate synthetic data to augment insufficient training data using a dual-variational autoencoder framework. The two focus areas of this thesis improve the acceptance and widespread usage of the iris biometric system. Dedicated to Mummy, Daddy, Maa and Bapa iv ACKNOWLEDGMENTS The journey of my Ph.D. research wouldn’t be possible without the support of a number of people, and let me take this opportunity to thank them. First, I would like to bestow my sincere thanks and gratitude to my Ph.D. advisor, Dr. Arun Ross, whose acceptance email began this journey. He motivates me to imbibe various research qualities—from critical and deep thinking to efficient communication. 
His guidance, feedback, support, and motivation give directions to my Ph.D. research at every stage. He also provides me an opportunity to exhibit my research outside the lab through commercial projects, summer school, conferences, journals, and workshops. Next, I would like to thank my doctorate committee—Dr. Xiaoming Liu, Dr. Vishnu Boddeti, and Dr. Selin Aviyente—for their continuous support and valuable feedback. I want to convey my special thanks to all the CSE faculty members, especially my course instructors (Dr. Eric Torng, Dr. Xiaoming Liu, Dr. Jiayu Zhou, Dr. Arun Ross, Dr. Vishnu Boddeti, Dr. Sandeep Kulkarni, and Dr. Ashoke Sinha) for enhancing my knowledge in various subjects. I am also thankful to all the CSE and MSU administrative staff for helping me out with administrative affairs, want to mention Brenda Hodge, Steve Smith, Amy King, Erin Dunlop, and Vincent Mattison. I am fortunate to have wonderful labmates who made my Ph.D. an enjoyable journey, whether there was a technical hurdle or a research plateau. Their eagerness to help and reliable persona made everything easier for me. I loved those planned outings and sudden lunches with them. Thank you all-Thomas, Sudipta, Anurag, Melissa, Steven, Denny, Yaohui, Aaron, Vahid, Achsah, Ishita, Shivangi, Darshika, Cunjian, Austin, Ryan, Parisa, Sushanta, Debasmita, Raul, Morgan, Redwan, Pegah, Katie, Sai, Madison, Protichi, Ana. On the personal front, how could I express my gratitude through words to my parents (mummy and daddy)? Their unconditional love and belief hold me at every stage of my life. I am also grateful to my parents-in-law (maa and bapa), whose proud eyes push my limits further. The inspiration and love from my sister and brother keep this journey going. I am also thankful to my sisters-in-laws, v brothers-in-laws, and cheery kids for sprinkling various colors in my life. I am thankful to the second family I had here in Michigan as my friends—Apoorva, Sneha, Gauri, Aditya, Affan, Shalin, and Tanvi. I cherish the numerous unforgettable memories we made together. Thanks for providing me with a helping hand whenever required. And lastly, I would like to thank my better half, Sushanta, for always being with me. His strong belief, support, and love made this thesis possible. Thanks to my kiddo, Siddhant, for having such an infectious smile :). I am blessed to be starting my Ph.D. journey in the beautiful green and white landscape of the MSU campus. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Anatomy of Iris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3 Automated Iris Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4 Security of Iris Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4.1 Iris Presentation Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4.2 Morph Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.5 Cross-modal Biometrics . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . 15 1.6 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.7 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 CHAPTER 2 IRIS PRESENTATION ATTACK DETECTION USING A SINGLE NIR IMAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 D-NetPAD: Description and Rationale . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.4.1 Combined Dataset: Description and Results . . . . . . . . . . . . . . . . . 27 2.4.2 LivDet-2017 Dataset: Description and Results . . . . . . . . . . . . . . . . 29 2.4.3 LivDet-2020 Dataset: Description and Results . . . . . . . . . . . . . . . . 34 2.4.4 GCT5 and GCT6 Datasets: Description and Results . . . . . . . . . . . . . 35 2.4.4.1 Failure Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5 Explainability analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5.1 Visualization Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5.2 Spatial Frequency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.6 Deployment of D-NetPAD on Desktop and Mobile . . . . . . . . . . . . . . . . . 46 2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 CHAPTER 3 IRIS PRESENTATION ATTACK DETECTION USING VISIBLE SPEC- TRUM VIDEO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.1 MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.2 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 vii 3.3.3 LRCN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.4 C3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3.5 3D ResNeXt-101 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3.6 Two-stream CNN Network . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3.6.1 Spatial ConvNet . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.3.6.2 Temporal ConvNet . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4.1 IPV Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.4.2 SiW Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.4.3 SiW-M Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.4.4 OULU-NPU dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5.1 Iris Modality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.1.1 Intra-session . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.1.2 Cross-session . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.1.3 Cross-attack . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . 65 3.5.1.4 Baseline Experiments . . . . . . . . . . . . . . . . . . . . . . . 66 3.5.2 Face Modality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.5.2.1 Results on SiW dataset . . . . . . . . . . . . . . . . . . . . . . . 68 3.5.2.2 Results on SiW-M dataset . . . . . . . . . . . . . . . . . . . . . 69 3.5.2.3 Results on OULU-NPU dataset . . . . . . . . . . . . . . . . . . 70 3.5.3 Cross-modality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.6 Analysis Using Heatmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.7 Conclusion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 CHAPTER 4 IRIS PRESENTATION ATTACK DETECTION USING A OCT IMAGE . . 75 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.3 Background of Iris Imaging Modalities . . . . . . . . . . . . . . . . . . . . . . . . 77 4.4 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.5 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.6 Experimental Setup and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.6.1 Intra-attack Setup and Results . . . . . . . . . . . . . . . . . . . . . . . . 84 4.6.2 Cross-attack Setup and Results . . . . . . . . . . . . . . . . . . . . . . . . 85 4.7 CNN Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.8 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 CHAPTER 5 ROBUSTNESS OF DEEP NEURAL NETWORKS . . . . . . . . . . . . . 92 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.3 Parameter Perturbations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.4 Application Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.5 Datasets and Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.6 Robustness Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.6.1 Gaussian Noise Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 viii 5.6.2 Weight Zeroing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.6.3 Weight Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.6.4 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.7 Performance Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.7.1 Single Perturbed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.7.2 Ensemble of models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.7.3 Performance validation on other dataset . . . . . . . . . . . . . . . . . . . 110 5.8 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 CHAPTER 6 RETRAINING OF DEEP NEURAL NETWORKS . . . . . . . . . . . . . . 113 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
118 6.4 Experimental Setup and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.4.1 LivDet-Iris-2017 Setup and Results . . . . . . . . . . . . . . . . . . . . . 122 6.4.2 LivDet-Iris-2020 Setup and Results . . . . . . . . . . . . . . . . . . . . . 125 6.4.3 Split MNIST Setup and Results . . . . . . . . . . . . . . . . . . . . . . . . 129 6.4.4 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.5 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 CHAPTER 7 IRIS MORPHING ATTACK: CREATION AND DETECTION . . . . . . . 135 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 7.3 Algorithmic Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 7.5 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 7.5.1 Baseline Recognition Performance . . . . . . . . . . . . . . . . . . . . . . 139 7.5.2 Morph Attack Setup and Results . . . . . . . . . . . . . . . . . . . . . . . 140 7.5.3 Analysis of Textural Similarity . . . . . . . . . . . . . . . . . . . . . . . . 141 7.5.4 Morph Attack Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 CHAPTER 8 MATCHING IRIS IMAGES WITH FACE IMAGES . . . . . . . . . . . . . 146 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 8.2 Proposed Approaches and Rationale . . . . . . . . . . . . . . . . . . . . . . . . . 149 8.2.1 Feature-level Approach: Multi-channel CNN (MT-CNN) . . . . . . . . . . 151 8.2.2 Image-level Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 8.2.2.1 Pix2Pix GAN with Identification Loss (Pix2Pix GAN ID) . . . . 152 8.2.3 Training-level: Dual Variational Generation . . . . . . . . . . . . . . . . . 155 8.3 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 8.3.1 BioCop-2008 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 8.3.2 BioCop-2009 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 8.3.3 PolyU Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 8.3.4 WVU Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 8.4 Experimental Setup and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 ix 8.4.1 BioCop-2008 and BioCop-2009 Dataset . . . . . . . . . . . . . . . . . . . 163 8.4.2 PolyU Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 8.4.3 WVU Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 8.5 Impact of Eye Color on Cross-model Matching . . . . . . . . . . . . . . . . . . . 174 8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 CHAPTER 9 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 9.1 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
181 x LIST OF TABLES Table 2.1: Description of different components of the Combined Dataset. Details of the train and test set of the Combined and NDCLD 2015 datasets are also provided in terms of the number of bonafide and PA images. Here, MSU stands for Michigan State University, CU stands for Clarkson University, and JHU-APL stands for Johns Hopkins University-Applied Physics Laboratory. . . . . . . . . 28 Table 2.2: The results of D-NetPAD in term of TDR (%) at 0.2% FDR on the Combined dataset. The method is compared with four other algorithms. . . . . . . . . . . . 29 Table 2.3: Description of the train and test sets of all four subsets of the LivDet-2017 dataset along with the number of bonafide and PA images present in the datasets. The information about the sensors is also provided. Each subset represents different testing scenarios. The Clarkson and Notre Dame test sets correspond to the cross-PA scenario, whereas the Warsaw data corresponds to the cross-sensor scenario. The IIITD-WVU represents a cross-dataset sce- nario. Here, “K. Test" means a known test set of the dataset, and “U. Test" means an unknown test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Table 2.4: D-NetPAD performance reported in terms of APCER and BPCER on all subsets of the LivDet-2017 dataset. The method is compared with three state-of-the-art algorithms in [304], which are the winners of the LivDet-2017 competition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Table 2.5: D-NetPAD performance reported in terms of the TDR (%) @ 0.2% FDR on different subsets of the LivDet-2017 dataset. Three models of D-NetPAD are generated by varying their training data. . . . . . . . . . . . . . . . . . . . . . . 33 Table 2.6: Description of the test set of the LivDet-iris-2020 dataset. It includes the number of images in each category and the sensor used to capture them. . . . . . 35 Table 2.7: D-NetPAD performance reported in terms of APCER and BPCER on the LivDet-2020 dataset. The results also include APCER on the individual type of PAs. The method is compared with the winners of the LivDet-2020 competition. Here, PE is Printed Eyes; CL is Cosmetic Contact Lens; ED is Electronic Display; F/P is Fake/Prosthetic/Printed Eyes with Add-ons; and CI is Cadaver Iris. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Table 2.8: D-NetPAD performance in terms of TDR at 0.2% FDR on the GCT5 and GCT6 datasets. Table also provides information about training and testing data along with base architecture used in both models. . . . . . . . . . . . . . . 36 xi Table 2.9: Results (TDR and a relative decrease in TDR) for VGG19, ResNet101, and D-NetPAD models, when high frequencies are manipulated or Gaussian noise is applied to the input test images. . . . . . . . . . . . . . . . . . . . . . . . . . 47 Table 2.10: Description of two architectures used to detect iris PAs at the mobile platform along with their training data and computational efficiency. . . . . . . . . . . . . 48 Table 3.1: Description of video-based passive iris PA detection techniques. . . . . . . . . . 52 Table 3.2: Description of the dataset collected for multi-frame analysis on scene videos captured from a regular webcam. . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Table 3.3: Training and testing setup for intra-session (Exp. 01-05) and cross-session (Exp. 06) experiments on the IPV dataset. . . . . . . . . . . . . . . . . . . . . . 
66 Table 3.4: Training and testing setup for cross-attack (Exp. 07-11) and baseline (Exp. 12) experiments on the IPV dataset. . . . . . . . . . . . . . . . . . . . . . . . . 68 Table 3.5: ACER (%) of proposed methods across all experiments (Exp. 01-12) on the IPV dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Table 3.6: ACER (%) for all methods on the SiW [165] dataset. The ACER values outperforms the baseline [165] are shown in bold. . . . . . . . . . . . . . . . . . 69 Table 3.7: ACER (%) for all methods on the SiW-M [166] dataset. . . . . . . . . . . . . . . 70 Table 3.8: ACER (%) for all methods on the OULU-NPU [33] dataset. . . . . . . . . . . . 71 Table 4.1: Number of bonafide and PA samples corresponding to each imaging modality. . . 81 Table 4.2: APCER (%) and BPCER (%) of all algorithms on LivDet-Iris 2017 Dataset [304]. Results are presented by averaging APCER and BPCER of all test sets in the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Table 4.3: Data distribution among train, validation and test sets for all experiments (intra-attack and cross-attack scenarios). Here, CC is Cosmetic Contacts. . . . . 85 Table 4.4: TDR (%) at 0.2% FDR and ACER of all experiments (intra-attack and cross- attack) when using VGG19, ResNet50 and DenseNet121 architectures. . . . . . . 86 Table 5.1: Summary of training and test datasets along with the number of bonafide and PA images present in the datasets. The information about the sensors used to capture images is also provided. Here, “K. Test” means a known test set of the dataset, and “U. Test” means an unknown test set (see text for explanation). . . . 96 xii Table 5.2: The number of parameters (weights and bias) present in all convolutional layers of the VGG19, ResNet101, and D-NetPAD architectures. . . . . . . . . . . . . . 98 Table 5.3: The performance of VGG19, ResNet101, and D-NetPAD models in terms of True Detection Rate (%, higher the better) at 0.2% False Detection Rate on the LivDet-Iris-2017 and LivDet-Iris-2020 datasets. The performance is shown on original model (no parameter perturbations), perturbed model and an ensemble of model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Table 6.1: Different methodologies of retraining along with the information about the knowledge needs to transfer to the next task and the special requirements for the training of the current task. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Table 6.2: Description of the old and new training/test sets in the LivDet-Iris-2017 setup along with the number of bonafide and fake iris images present in the datasets. The information about the sensors used to capture images is also provided. Each test set represents different testing scenarios. The Clarkson and Notre Dame test sets correspond to the cross-PA scenario, whereas the Warsaw data corresponds to the cross-sensor scenario. The IIITD-WVU represents a cross- dataset scenario. Here, “K. Test” means a known test set of the dataset, and “U. Test” means an unknown test set. . . . . . . . . . . . . . . . . . . . . . . . 123 Table 6.3: The performance of all retraining methods in terms of True Detection Rate (%, higher the better) at 0.2% False Detection Rate on old (𝑇 𝑆 𝑜𝑙𝑑 ) and new (𝑇 𝑆 𝑛𝑒𝑤 ) test sets of the LivDet-Iris-2017 setup. . . . . . . . . . . . . . . . . . . 
125 Table 6.4: Description of the old and new train/test sets in the LivDet-Iris-2020 setup along with the number of bonafide and fake iris images present in the sets. The information about the sensors used to capture images is also provided. . . . . . . 128 Table 6.5: The performance of all retraining methods in terms of True Detection Rate (%, higher the better) at 0.2% False Detection Rate on the LivDet-Iris-2020 test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Table 6.6: The average accuracy (%, higher the better) of the proposed retraining ap- proach with different state-of-the-art continual learning approaches on the Split MNIST dataset. Methods with ‘+’ superscript are reported from [118], ‘o’ from [136], ‘*’ from [22] and ‘-’ from [153]. All methods utilize the same experimental setup and expert models but differs in hyperparameters (batch size, learning rate, and the number of epochs). We use the same hyperparam- eters as used in [118]. Each value is an average of ten runs. . . . . . . . . . . . . 131 xiii Table 7.1: Performance of three iris recognition techniques in terms of TMR (%) at 0.01%, 0.1%, and 1% FMRs, on the IITD and WVU datasets. The USITv3.0 is an open-source iris recognition toolkit, VeriEye is a commercial iris recognition SDK, and CNN-Pairwise is a deep learning-based technique. . . . . . . . . . . . 140 Table 7.2: Vulnerability assessment of three iris recognition techniques to iris morph attacks in terms of MMPMR (%) at different thresholds corresponding to 0.01%, 0.1%, and 1% FMRs on the IITD and WVU datasets. . . . . . . . . . . . 141 Table 8.1: Description of genuine and impostor pairs used in experiments from the BioCop-2008 dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 Table 8.2: Description of genuine and impostor pairs used in experiments from the BioCop-2009 dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 Table 8.3: Performance of different methods on the BioCop-2008 dataset. MT-CNN with ocular input outperforms on this dataset. . . . . . . . . . . . . . . . . . . . . . . 165 Table 8.4: Performance of different methods on the BioCop-2009 dataset. MT-CNN with iris input outperforms on Aoptix Insight and CrossMatch sensor images, whereas MT-CNN with ocular input outperforms on LG ICAM 4000 sensor images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 Table 8.5: Number of genuine and impostor pairs excluded from the test set due to the segmentation errors by the VeriEye technique on both the datasets. The numbers shown in the parenthesis are the total number of genuine and impostor pairs used in the test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 Table 8.6: TMR and EER of ocular and iris recognition methods on the entire test set of BioCop-2008 dataset when a small set (5,000 impostor pairs) is used for the training. Including additional training samples generated from the DVG-based method does not improve the performance. . . . . . . . . . . . . . . . . . . . . 169 Table 8.7: Data distribution among train and test sets from the PolyU dataset. . . . . . . . . 169 Table 8.8: TMR (%) at 0.1% FMR and EER of all ocular and iris recognition methods on the entire test set of the PolyU dataset. MT-CNN outperforms in both ocular and iris recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
170 Table 8.9: Data distribution among train and test sets for all three settings from the WVU dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 Table 8.10: TMRs and EER of ocular and iris recognition techniques on the entire test set of the WVU dataset. All techniques fail on this dataset. . . . . . . . . . . . . . . 172 xiv Table 8.11: Iris color distribution of genuine scores obtained from Multi-channel CNN in three settings: face-face, iris-iris and face-iris matching. The region used for the matching is ocular region. . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 xv LIST OF FIGURES Figure 1.1: Frontal view of the iris: (a) ocular image, (b) focused view of the iris pattern, and (c) frontal anatomy of the iris [40]. . . . . . . . . . . . . . . . . . . . . . . 2 Figure 1.2: Iris images captured using different imaging techniques: (a) captured in the visible spectrum illumination, and (b) captured in the near-infrared spectrum illumination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Figure 1.3: Cross-sectional view of the iris: (a) image captured from optical coherence tomography imaging and (b) transverse anatomy of the iris [295]. . . . . . . . . 4 Figure 1.4: Various steps of the automated iris recognition process. It consists of the acquisition of an iris image from an individual, segmentation of iris region, iris region normalization, extraction of features from the iris image, and then the matching of the iris template against the enrolled templates. . . . . . . . . . 7 Figure 1.5: Various applications of iris recognition system: (a) UAE border control sys- tem, (b) India’s national ID project, (c) Hashemite Kingdom of Jordan’s iris-enabled ATM, (d) Biometric e-passport, and (e) Mobile access control. . . . 9 Figure 1.6: Samples of a good quality iris image and few low-quality iris images. . . . . . . 9 Figure 1.7: Generic biometric system and the various points of attacks launched on the system [224]. The attacks shown in orange boxes (presentation and morph attacks) are our focused research area. . . . . . . . . . . . . . . . . . . . . . . 13 Figure 1.8: Various iris presentation attacks instruments. . . . . . . . . . . . . . . . . . . . 14 Figure 1.9: Pictorial diagram of iris morph attack. A single morphed iris image can authenticate two or more individuals, which violates the fundamental unique- ness characteristics of the biometric system. . . . . . . . . . . . . . . . . . . . 15 Figure 1.10: Examples of cross-modal matching: (a) iris modality matches with face, (b) deducing phenotypic traits from genomic data, and (c) mapping face image with the voice signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Figure 1.11: Different categories of techniques applied to detect iris presentation attacks: (a) technique utilizing a single NIR iris image captured from conventional iris recognition sensor and (b) technique utilizing a video captured from webcam (c) technique utilizing a single iris OCT image. All these techniques generate a Presentation Attack (PA) score between 0 and 1, where ‘0’ corresponds to bonafide input sample and ‘1’ corresponds to PA input. . . . . . . . . . . . . . 20 xvi Figure 2.1: Flowchart of the D-NetPAD algorithm. Iris region (red box) is detected and cropped from the ocular image and input to the D-NetPAD architecture. The base architecture used in D-NetPAD is DenseNet121 [122]. 
It produces a single PA score within a range of 0-1, which determines whether an input image is a bonafide (value towards 0) or a PA (value towards 1). . . . . . . . . . 27 Figure 2.2: Sample images of bonafide and different types of PAs (print, artificial eye, cosmetic contact, kindle replay, and transparent dome on print) taken from the Combined dataset. The last cosmetic contact image is taken from the NDCLD-2015 dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Figure 2.3: Misclassified images by the D-NetPAD algorithm on the JHU-APL03 test set. The first row shows bonafide images that are misclassified as PA. The second row shows PA images that are misclassified as bonafide. The PA score is displayed at the bottom of each image. The threshold for classification is 0.40, where a PA score below the threshold is considered to be a bonafide. . . . 30 Figure 2.4: Sample images of bonafide and different types of PAs (print, cosmetic contact) taken from each subset of the LivDet-2017 dataset. . . . . . . . . . . . . . . . 31 Figure 2.5: Histograms of the three trained models of D-NetPAD on the IIITD-WVU test set. For accurate classification, there should be minimal overlap between the two (red and green) distributions. This plot indicates the efficacy of the fine-tuned D-NetPAD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Figure 2.6: Sample images of bonafide and PAs (print, kindle display, artificial eye, cosmetic contact, and cadaver eyes) from the LivDet-2020 dataset. . . . . . . . 35 Figure 2.7: Failure cases on the GCT5 dataset. The first image is a bonafide misclassified bonafide image, and the other images are misclassified PA images. Three types of cosmetic contacts get misclassified: m6-009-0007-A77-1, m6-009- 0011-F40-1, and m6-009-0005-B44-1. The threshold is 0.38. . . . . . . . . . . 37 Figure 2.8: Histogram of VeriEye match scores corresponding to correctly classified and misclassified PA images when match with their bonafide images on the GCT5 data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Figure 2.9: Misclassified PA images (bottom row) along with their bonafide images (top row) and their matching score using VeriEye commercial iris matcher. . . . . . 39 Figure 2.10: Histogram of VeriEye match scores corresponding to different cosmetic con- tact PA types on the GCT5 data. . . . . . . . . . . . . . . . . . . . . . . . . . . 40 xvii Figure 2.11: Histogram of VeriEye match scores corresponding to correctly classified and misclassified PA images when match with their bonafide images on the GCT3 data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Figure 2.12: Histogram of VeriEye match scores corresponding to different cosmetic con- tact PA types on the GCT3 data. . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Figure 2.13: The architecture of D-NetPAD consists of four Dense blocks. We capture the features at the end of each Dense block, which are then visualized using t-sne plots (shown below each Dense block). The two-dimensional features of bonafide, artificial eyes, and cosmetic contacts overlap in the initial layers, but get separated in the last layer. The two blue clusters in each category correspond to the left and right eyes. . . . . . . . . . . . . . . . . . . . . . . . 43 Figure 2.14: Grad-CAM [245] heatmaps corresponding to bonafide (first row), artificial eye (second row), and cosmetic contact (last row). 
The last column represents the average heatmaps of each category. The heatmaps represent focused regions of the image by the D-NetPAD algorithm. Red-colored regions represent highly focused regions by the D-NetPAD, whereas blue regions represent low priority ones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 2.15: Frequency analysis of an input iris (bonafide or PA) image. In the first row, the left-most image is the original image, the center image is a low-pass filtered image with a cutoff frequency of 20 (higher frequencies are suppressed), and the right-most is a high-pass filtered image with a cutoff frequency of 5 (lower frequencies are suppressed). The second row represents their corresponding fourier transforms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Figure 2.16: Different manipulations applied over the original input image (first image): low-pass filtered images with 20, 30, and 50 cutoff frequencies, additive salt and pepper noise, and additive Gaussian noise. Only test images are subject to these manipulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Figure 2.17: The plot of TDR (%) @ 0.2% FDR against low-pass filter cutoff frequen- cies. Note the cutoff frequency beyond which the performance of D-NetPAD becomes stable (30 in this case). This cutoff frequency indicates that the D-NetPAD has not learned frequencies beyond this cutoff frequency. The performance steadiness of D-NetPAD is better than VGG19 and ResNet101. . . 47 Figure 2.18: Graphical User Interface (GUI) for three iris PA detectors developed by MSU which includes TL-PAD [46], Fusion Method [114] and D-NetPAD [249]. Patch-wise heatmap and filter-maps shown at the bottom of GUI are corresponds to the Fusion Method. . . . . . . . . . . . . . . . . . . . . . . . . 49 xviii Figure 2.19: Screenshots of Iris PA Detector app on Google Pixel 2. The first image shows the screen on the opening of the app. The second image shows the results after capturing iris images from IriShield USB BK2021U sensor. . . . . . . . . 50 Figure 3.1: Scene video (VIS) and iris image (NIR) of bonafide and PA biometric samples captured by a simple webcam and an iris sensor simultaneously. . . . . . . . . . 54 Figure 3.2: Different ways of presenting the same attack instrument (paper print) con- stitute different scenes. These scenes provide different cues for detecting PAs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Figure 3.3: The end-to-end architecture of the proposed framework. . . . . . . . . . . . . . 55 Figure 3.4: Inputs given to the Two-stream CNN network. The top row shows spatial frames, the middle row represents optical flow frames in the X-direction and the bottom row shows optical flow frames in the Y-direction. (a) corresponds to bonafide video frames, and (b) corresponds to PA video frames. . . . . . . . 60 Figure 3.5: Columns show intra-variations among different PAs using a single frame. Paper print PA variations: uses one or both eyes for presenting iris PA. Artificial eye PAs variations: use different materials, e.g., glass, plastic, prosthetic, or rubber eye. Kindle PAs variations: use different sizes and locations of an iris image on the Kindle display. Funny glasses PAs variations: uses plastic or paper print to mount over the funny glasses. Mannequin PAs: use two different materials and print/plastic to mount over them. . . . . . . . . . 
61 Figure 3.6: Comparison of ACERs of (a) Intra-session experiments (Exp.01-05), (b) Cross-session experiments (Exp.06), (c) Cross-attack experiments (Exp.07- 11), and (d) Baseline experiment (Exp.12) on the IPV dataset. . . . . . . . . . . 67 Figure 3.7: Sample video frames from various face PAD datasets: the first block shows frames from the SiW-M [166] dataset, the second block represents examples from the SiW [165] dataset and the third block shows samples from the OULU-NPU [33] dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 xix Figure 3.8: Frames of bonafide (first row), artificial eye (second row), and paper print (third row) videos overlaid with their corresponding Grad-CAM heatmaps. The columns correspond to the different frames of a video. Heatmap repre- sents the focused region of a frame by the trained model (Spatial ConvNet). Red gradient regions in the heatmaps represent high focused regions consid- ered by the trained model, whereas the blue-colored regions represent low focused regions. On the bonafide frames, the focus is mainly over the center of a face. On artificial eye frames, the focus is on the artificial eye mounted over the glasses. In the case of paper print video, the focus is on the print of the eyes. Different regions of focus in different categories help in differentiating bonafide videos from spoof one. . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 4.1: Components of the eye and iris sensed using OCT, NIR and VIS imag- ing. The anatomical image (https://www.vecteezy.com/vector-art/431288- parts-of-human-eye-with-name) is also shown. The red line in the VIS image shows the traverse scanning direction of the OCT scanner. . . . . . . . . . . . . 77 Figure 4.2: Typical optical setup of an OCT scanner. Low-coherence light is incident over the beam splitter, which splits the light into sample and reference arms. Back-reflected light from sample and reference arms are then collected by the photodetector. Cross-sectional OCT image (B-scan) is formed by combining a number of A-scans along the transverse direction. . . . . . . . . . . . . . . . 79 Figure 4.3: Comparative analysis of OCT, NIR and VIS imaging in detecting iris PAs. Three architectures, viz., VGG19, ResNet50, DenseNet121, are used for distinguishing between bonafides and PAs by emitting a PA score. A higher PA score indicates the input is a “PA" and a lower score indicates the input is a “bonafide" image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Figure 4.4: Age distribution of subjects in the dataset. . . . . . . . . . . . . . . . . . . . . 82 Figure 4.5: Samples of bonafide, artificial eyes and cosmetic contact lens images captured using (a) OCT, (b) NIR and (c) VIS imaging modalities. . . . . . . . . . . . . . 82 Figure 4.6: ROC curves of (a) Intra-EXP 1, (b) Cross-EXP 1 and (c) Cross-EXP 2 experiments using VGG19 architecture. The first ROC plot (a) also shows the confidence interval of 95%. NIR imaging is more efficient in discriminating bonafide and PA samples on this network. . . . . . . . . . . . . . . . . . . . . 87 Figure 4.7: ROC curves of (a) Intra-EXP 1, (b) Cross-EXP 1 and (c) Cross-EXP 2 experiments using ResNet50 architecture. OCT imaging results in better performance in distinguishing bonafide and PA images in the intra-attack scenario (a), whereas NIR imaging performs the best in the cross-attack scenario (b and c). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
87 xx Figure 4.8: ROC curves of (a) Intra-EXP 1, (b) Cross-EXP 1 and (c) Cross-EXP 2 experiments using DenseNet121 architecture. OCT imaging results in better performance in distinguishing bonafide and PA images in the intra-attack scenario (a), whereas NIR imaging performs the best in the cross-attack scenario (b and c). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Figure 4.9: (a) OCT, (b) NIR and (c) VIS images and their corresponding fixation regions for bonafide, artificial eyes and cosmetic contact lens samples. Red in the heatmaps represents high priority (high CNN activations) regions considered by the CNN architecture. Blue represents low priority regions. Red boxes mark the high priority regions. Different regions of focus help the CNN architecture to differentiate between bonafide and PA iris images. . . . . . . . . 89 Figure 4.10: t-SNE plots of Intra-EXP 1, Cross-EXP 1 and Cross-EXP 2 test data pertaining to OCT, NIR and VIS imaging. 2048 dimensions of features from the average pooling layer (penultimate layer) of ResNet50 network are reduced to two dimensions for visualization. Features of bonafide and PAs from OCT images are well separated in Intra-EXP 1 and Cross-EXP 2 experiments. NIR images show good separation in all three experiments. Features from VIS images are overlapping between the bonafide and PA categories (especially in the Cross- EXP 2 experiment). More the separation of features, better the classification. . . 90 Figure 5.1: Gaussian noise manipulation: (a) Performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when weights and bias parameters of the entire network are perturbed. (b) Performance of D-NetPAD when the individual layer’s parameters (weights and bias) are perturbed. Here, Conv1 means the first convolution layer of the D-NetPAD, Dense1_LastConv means the last convolution layer of the first dense block, and so on. . . . . . . . . . . . 99 Figure 5.2: Weight zeroing manipulation: (a) Performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when parameters of the entire network are perturbed. (b) Performance of D-NetPAD when the individual layer’s parameters are perturbed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Figure 5.3: Weight distribution of different layers of the trained D-NetPAD architecture. Mean (𝜇) and standard deviation (𝜎) are provided below each distribution. . . . 101 Figure 5.4: Variant of the weight zeroing manipulation (low-magnitude weights are set to zero): (a) Performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when parameters of the entire network are perturbed. (b) Perfor- mance of D-NetPAD when individual layer’s parameters are perturbed. . . . . . 102 xxi Figure 5.5: Variant of the weight zeroing manipulation (high-magnitude weights are set to zero): (a) Performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when parameters of the entire network are perturbed. (b) Perfor- mance of D-NetPAD when individual layer’s parameters are perturbed. . . . . . 103 Figure 5.6: Variant of the weight zeroing manipulation (randomly selected weights are set to zero and non-zero weights are scaled by factor 5): (a) Performance of D-NetPAD when individual layer’s parameters are perturbed. (b) Closer look at the performance of D-NetPAD when convolution layers of DenseBlock1 and DenseBlock2 are perturbed. . . . . . . . . . . . . . . . . . . . . . . . . . . 
103 Figure 5.7: Variant of the weight zeroing manipulation (randomly selected filters are set to zero): (a) Performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when filters of the entire network are perturbed. (b) Performance of D-NetPAD when individual layer’s parameters are perturbed. . . . . . . . . . 104 Figure 5.8: Weight scaling manipulation: (a) Performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when parameters of the entire network are perturbed simultaneously. (b) Performance of D-NetPAD when the indi- vidual layer’s parameters are perturbed. . . . . . . . . . . . . . . . . . . . . . . 105 Figure 5.9: The performance distributions when Gaussian perturbation is applied over the entire architecture at the specified scale on (a) D-NetPAD, (b) ResNet101, and (c) VGG19 architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Figure 5.10: The performance distributions when weights are set to zero over the entire architecture of (a) ResNet101 and (b) VGG19 for the specified proportion. The red vertical line represents the original performance of the architectures when weights are unperturbed. . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Figure 5.11: Ensemble process of perturbed models to improve the performance of DNN model without undergoing further training. . . . . . . . . . . . . . . . . . . . . 108 Figure 5.12: Performance distributions when three Gaussian noise manipulated D-NetPAD models are ensembled. The Gaussian distribution scaling parameter used in all three models is 0.1. The red vertical line corresponds to the original performance (without weight perturbations). (a) Performance distribution when the entire network is manipulated. In this case, 29 times TDR is higher than the original performance. (b) Performance distribution when only the last convolution layer of DenseBlock4 is manipulated. In this case, 79 times TDR is higher than the original performance. . . . . . . . . . . . . . . . . . . . 108 xxii Figure 5.13: Performance distributions when three Gaussian manipulated D-NetPAD mod- els are ensembled. The Gaussian distribution scaling parameters for the three models are 0.1, 0.2, and 0.3, respectively. The red vertical line corresponds to the original performance (without weight perturbations). (a) Performance distribution when the entire network is manipulated. In this case, four times TDR is higher than the original performance. (b) Performance distribution when only the last convolution layer of DenseBlock4 is manipulated. In this case, 69 times TDR is higher than the original performance. . . . . . . . . . . . 109 Figure 5.14: Performance distributions when three parameter-manipulated D-NetPAD mod- els are fused undergoing three different types of manipulations. The manip- ulations in the three models are additive Gaussian Noise (scale factor is 0.1), weight zeroing (proportion is 0.01), and weight scaling (scale factor is 1.1), respectively. The red vertical line corresponds to the original performance (without weight perturbations). (a) Performance distribution when the entire network is manipulated. In this case, zero-times TDR is higher than the origi- nal performance. (b) Performance distribution when only the last convolution layer of DenseBlock4 is manipulated. In this case, 100 times TDR is higher than the original performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Figure 6.1: The overall idea of the dynamic weight-based fusion strategy for retraining. 
We train two models (expert and in-domain models) on incoming training data, and a final decision is made based on the weighted sum of their prediction scores. The expert model provides the prediction score, and the in-domain model assigns weight to the prediction score. . . . . . . . . . . . . . . . . . . . 118 Figure 6.2: Illustration of a local outlier concept used in the mean-shifted intra-class loss. Blue-colored data points belong to one training set, C is the center of the training set, and red-colored data point P is a probe sample. There are two classes (Class 1 and Class 2) in the blue-colored train set. If we consider the global outlier concept, the red-colored probe sample would be inlier. However, if the local outlier concept is considered, the probe sample is an outlier to the Class 1 as well as to the blue-colored training set. The figure is better viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Figure 6.3: Histogram of weights dynamically allocated for all test samples (old and new) corresponds to (a) Clarkson,(b) Warsaw, (c) Notre-Dame, and (d) IIIT-WVU subsets of LivDet-Iris-2017 setup. In the case of Warsaw and Notre-Dame, ‘Known’ test splits are used for illustration. Weight values toward ‘0’ of the x-axis symbolize higher priority given to the Old Expert Model, whereas weight values towards ‘1’ of the x-axis denote higher priority given to the New Expert Model. New test data of the IIIT-WVU subset estimate weights around 0.5 as the distribution of the IIIT-WVU subset test set is independent of the training distribution of both expert models. The figure is better viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 xxiii Figure 6.4: The experimental setup of the Split MNIST dataset for the retraining scenario. The main task is to classify old and even digit images. The task is divided into five sub-tasks, where the first task is to classify ‘0’ and ‘1’ digits, the second task is to classify ‘2’ and ‘3’, and so on. The class labels remain the same for all sub-tasks: 0 for odd digit images and 1 for even digit images. . . . . 129 Figure 6.5: 3-D t-sne plot showing pre-trained ViT embeddings correspond to five sub- tasks of the Split MNIST dataset. The training samples of different classes are overlapping in the feature space. The figure is better viewed in color. . . . . 132 Figure 6.6: 3-D t-sne plot showing fine-tuned ViT embeddings correspond to five sub- tasks of the Split MNIST dataset. There is a formation of clusters of training samples belonging to the same class in the feature space. The figure is better viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Figure 7.1: (a) Three categories of techniques applied to detect iris presentation attacks. (b) Illustration of the iris morphing at the image-level. It consists of registra- tion of landmark points on both the images, alignment of images, and then blending into a single image. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Figure 7.2: Samples of morphed images generated from the IITD and WVU datasets. . . . . 142 Figure 7.3: Top: Match score distribution of genuine (green), imposter (red), and morph attacks (blue) on the IITD and WVU datasets using the USITv3.0 iris recog- nition technique. Bottom: Scatter plots of match scores, where morphed images match with their component identities. The dotted line represents the threshold at 0.01% FMR. . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 143 Figure 7.4: Distributions of similarity scores between the component images correspond- ing to successful (green) and unsuccessful (red) morphs using the RMSE (higher the value, lower the similarity) and SSIM (higher the value, higher the similarity) measures on the IITD and WVU datasets. . . . . . . . . . . . . . 144 Figure 8.1: The objective is to match a visible spectrum face image with the NIR spectrum iris image, or vice versa. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 Figure 8.2: Histograms of similarity scores obtained from ocular images under (a) intra- modal VIS, (b) intra-modal NIR, and (c) cross-modal scenario. Similarity scores are estimated using the Structural Similarity (SSIM) index on ocular images of the BioCop-2008 dataset. The statistics of the histograms are given below each figure. There are two observations: first, the similarity between genuine pairs (Genuine Mean) reduces in the cross-modal scenario as compared to the intra-modal scenario; second, the overlapping area between two distributions increases dramatically for the cross-modal. For accurate matching, the overlapping area should be as minimum as possible. . . . . . . . 150 xxiv Figure 8.3: The architecture of Multi-channel CNN (MT-CNN). The base architecture used in the MT-CNN is DenseNet201 [122]. It estimates a similarity score between the images of the two domains. . . . . . . . . . . . . . . . . . . . . . 152 Figure 8.4: The overall testing scenario of Pix2Pix GAN ID and MT-CNN for cross- modal matching. The Pix2Pix GAN ID’s generator synthesizes a NIR image from the VIS image. The MT-CNN then generates a similarity score from a pair of synthesized NIR and real NIR images. . . . . . . . . . . . . . . . . . . 153 Figure 8.5: Training architecture of the DVG-based model. The figure is adapted from [88]. It consists of two encoders that correspond to NIR and VIS input images and a decoder. The encoder transforms input image space into latent space. The decoder utilizes the latent space of NIR and VIS images and reconstructs them back into the image space. . . . . . . . . . . . . . . . . . . . . . . . . . . 156 Figure 8.6: Training procedure of the DVG-based method. . . . . . . . . . . . . . . . . . . 156 Figure 8.7: The testing procedure of the DVG-based method. Noise is an input to the Decoder 𝐷 𝐼 which generates a synthesized genuine pair. . . . . . . . . . . . . . 158 Figure 8.8: (a) A sample face image from the BioCop-2008 dataset. The face image is in the VIS spectrum. (b) Cropped left and right VIS ocular images from the face image. (c) Cropped left and right iris images from the left and right ocular images, respectively. The size of ocular and iris VIS images are 301 × 201 and 81 × 81, respectively. (d) Left and right NIR ocular images from the BioCop-2008 dataset. (e) Cropped left and right iris images from the left and right NIR ocular images, respectively. The size of ocular and iris NIR images are 640 × 480 and 180 × 190, respectively. . . . . . . . . . . . . . . . . 160 Figure 8.9: Samples of VIS and NIR ocular images from the PolyU dataset. The first and second row represents the corresponding VIS and NIR ocular images of four different subjects, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 161 Figure 8.10: (a) A sample face image from the WVU dataset. The face image is in the VIS spectrum. (b) Cropped left and right VIS ocular images from the face image. 
(c) Cropped left and right iris images from the left and right ocular images, respectively. The size of ocular and iris VIS images are 51 × 61 and 24 × 24, respectively. (d) Left and right NIR ocular images from the WVU dataset. (e) Cropped left and right iris images from the left and right NIR ocular images, respectively. The size of ocular and iris NIR images are 640 × 480 and 300 × 300, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 Figure 8.11: ROC curves of different methods and histogram (MT-CNN) in the Iris-Face matching scenario on the BioCop-2008 dataset. MT-CNN with ocular input outperforms on this dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 xxv Figure 8.12: ROC curves of different methods in the Iris-Face matching scenario on the BioCop-2009 dataset corresponding to (a) Aoptix Insight, (b) CrossMatch I SCAN 2, and (c) LG ICAM 4000 iris sensors. MT-CNN with iris input outperforms on Aoptix Insight and CrossMatch sensor images, whereas MT- CNN with ocular input outperforms on LG ICAM 4000 sensor images. (d) Histogram corresponds to the MT-CNN method with ocular input on LG ICAM 4000 sensor images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 Figure 8.13: Failure cases of the MT-CNN in genuine and impostor pairs from the BioCop- 2008 dataset. The last row represents the GradCam maps [8] which show regions focused by the network to make the decision. . . . . . . . . . . . . . . 168 Figure 8.14: Failure cases of the MT-CNN in genuine and impostor pairs from the BioCop- 2009 dataset. The last row represents the GradCam maps [8] which show regions focused by the network to make the decision. . . . . . . . . . . . . . . 168 Figure 8.15: StarGANv2 generated ocular images from VIS domain to NIR domain. . . . . . 170 Figure 8.16: StarGANv2 generated iris region images from VIS domain to NIR domain. . . . 171 Figure 8.17: ROC curves of iris and ocular recognition techniques and histogram (MT- CNN) on the entire set of the WVU dataset. All techniques fail on this dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 Figure 8.18: Failure cases of the MT-CNN in genuine and impostor pairs. The last row represents the GradCam maps [245] which show regions focused by the network to make the decision. The degraded and very low resolution of ocular images in the VIS spectrum causes the poor performance of cross- modal matching on the WVU dataset. . . . . . . . . . . . . . . . . . . . . . . . 173 Figure 8.19: t-SNE [280] plot of genuine and impostor pairs features obtained from the MT-CNN network. There is a large overlap between the features of the two distributions. The overlapping criteria could be used to identify on which dataset cross-modal matching would be feasible. . . . . . . . . . . . . . . . . . 173 Figure 8.20: (a) Histogram of genuine scores when ocular region from two face images (VIS) are matched. The threshold is 0.71 at 0.2% FMR. (b) Histogram of genuine scores when ocular region from two iris images (NIR) are matched. The threshold is 0.61 at 0.2% FMR. . . . . . . . . . . . . . . . . . . . . . . . . 175 Figure 8.21: Histogram of genuine scores when ocular region from the face (VIS) and iris images (NIR) are matched. The threshold is 0.79 at 0.2% FMR. All techniques fail on this dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 
175 xxvi CHAPTER 1 INTRODUCTION 1.1 Biometrics Human authentication is very much required in our day-to-day activities, for instance, log in to a laptop, computer, or mobile; access control to a building; attendance system; ATM transactions, credit card payment; or border crossing through airports. Traditional ways of human authentication - knowledge-based (personal identification number (PIN) or secret pattern) and token-based (card)- are not able to keep pace with the increasing demands of authentication. The traditional ways require remembering a large number of passwords or carrying a lot many cards or keys, which is inconvenient for users and also limits the security. Biometrics, which refers to the measurement and calculation of body characteristics, meet the high demands of authentication requirement as users need not remember the password or carry the token. It recognizes humans based on their physical (face, fingerprint, iris), behavioral (signature, gait), and psychophysiological (ECG, EEG) traits [126]. These traits should be unique, universal, permanent, and measurable. Iris, the annular region around the pupil, is one of the most popular biometric traits due to its high accuracy, fast matching, and great stability. 1.2 Anatomy of Iris Iris is an anterior part of the uveal tract covered by the cornea from the front and supported by the lens from the back [40]. It separates the anterior and posterior chambers of the eye. At its base, it is connected to the eye’s ciliary body and another end leads to the pupil. The frontal view of the iris is a colored annular region surrounding the pupil of an eye and surrounded by the sclera of an eye (Figure 1.1 (a)). The diameter of the iris is approximately 12 mm and the circumference is 38mm. The frontal iris pattern ((Figure 1.1 (b)) consists of two zones - pupillary and ciliary zone - separated by collarette. The pupillary zone (Figure 1.1 (c)) lies in the proximity 1 of the pupil and contains sphincter muscles, connecting crests, and pigment or pupillary ruff. The sphincter muscles encircle the pupil and controlled by parasympathetic nerve endings. It constricts the pupil in the presence of high illumination (miosis). The ciliary zone (Figure 1.1 (c)) consists of crypts, contraction furrows, dilator muscles. The dilator muscle fibers run radially and control by sympathetic nerve ending. It dilates the pupil in low illumination (mydriasis). Figure 1.2 shows the frontal view of the iris captured by different imaging techniques. Figure 1.1: Frontal view of the iris: (a) ocular image, (b) focused view of the iris pattern, and (c) frontal anatomy of the iris [40]. The cross-sectional view of captured using Optical Coherence Tomography (OCT) imaging is shown in Figure 1.3 (a). The arc-like structure is the cornea, whereas the cloud-like structure corresponds to the iris tissue structure. Figure 1.3 (b) shows iris cross-sectional anatomical view. It consists of four layers from anterior to posterior: (a) anterior border layer, (b) stroma layer, 2 Figure 1.2: Iris images captured using different imaging techniques: (a) captured in the visible spectrum illumination, and (b) captured in the near-infrared spectrum illumination. (c) anterior epithelium layer, and (d) posterior pigmented epithelium layer. The anterior border layer contains interconnected connective tissues (fibroblasts) and beneath them, pigment cells (melanocytes) derived from the anterior stroma. The border layer is absent in the crypts and contraction furrows. 
It is thickest in the pupillary zone and at the periphery of the ciliary zone. The stroma layer consists of a loose collagenous network of sphincter pupillae muscles, blood vessels, nerves, fibroblasts, melanocytes, clump cells, and mast cells. Melanocytes present in the anterior border and stroma layer mainly contribute to the color of the iris. Dark-colored iris (brown, black) are profuse in melanocytes, whereas light-colored iris (blue, green, or gray) are sparse in melanocytes. The anterior epithelium layer is about 12.4 𝜇m in thickness and mainly consists of dilator pupillae muscle. The last posterior pigment epithelium is a layer of cells derived from the internal layer of the optic cup. It mainly consists of pigmented cells. 1.3 Automated Iris Recognition System The complex structure of the iris results in randomness in the iris pattern, which in turn supports the uniqueness at the planetary scale of the global human population [68]. Iris patterns of left and right eyes of an individual and iris patterns of monozygotic twins are also not similar. Due to its uniqueness and universality, it is considered as a suitable and reliable biometric trait employed for authentication. Biometric authentication occurs in two stages — enrollment and recognition [126]. During 3 Figure 1.3: Cross-sectional view of the iris: (a) image captured from optical coherence tomography imaging and (b) transverse anatomy of the iris [295]. the enrollment stage, an individual presents their biometric sample (iris) to the acquisition unit. The sample is then transformed into a biometric template which gets stored in the enrollment database (also known as reference or gallery set). Each enrolled template is associated with a unique identifier. During the recognition stage, an individual again presents their biometric sample for authentication, which gets transformed into a biometric template. The query template is then matched against enrolled templates either in identification or verification mode. In identification, the query template is matched against all enrolled templates (one-to-many matching) and returns an identifier associated with it. In verification, the query template is input along with a claimed identity. Hence, the query template is matched against the claimed identity enrolled template (one-to-one matching) and returns the boolean value (verified or not verified). The brief of iris recognition pipeline (Figure 1.4) is provided in the following steps: • Acquisition: The acquisition module acquired an iris image from an individual using a specialized iris sensor. Most commercial iris sensors capture iris images in near-infrared (NIR) illumination range (700–900 nm), though most smartphones capture in visible spec- trum (VIS) range (400-700 nm). Different spectral bands can potentially be used to capture different components of the iris. NIR illumination predominantly captures the stromal fea- tures (fibrovascular layer) of the iris, whereas VIS captures information about the pigment melanin. 4 • Iris Segmentation: The iris segmentation module locates the iris region from the acquired image. It involves the detection of pupillary (the inner boundary between the pupil and the iris) and limbic (the outer boundary between the iris and the sclera) boundaries. However, researchers also include detection of occlusions to the iris region, such as specular reflections, upper and lower eyelashes, upper and lower eyelids. 
Iris segmentation techniques in the literature can be categorized as edge-based, region growing-based, active contour-based, and learning-based (machine learning and deep learning). The edge-based approaches first detect edge points in an iris image and then fit circular or elliptical models to detect pupillary and limbic boundaries. Most popular edge-based methods used for segmenting iris region are integrodifferential operators [69], Hough transforms [295]. Tan et al. [264] proposed a combination method of region clustering, semantic refinements, and integrodifferential operators. These techniques assume the circular inner and outer boundaries for the iris region. Other researchers fitted elliptical models [80, 158, 183, 196, 236, 319]. The region growing approaches detect various regions in the entire image and perform semantic refinements to obtain the iris region. Various works that fall under this category are [8, 87, 131, 307, 316]. Later, the boundaries shape assumption is relaxed in active contour-based approaches by fitting irregular boundaries. Various research works based on active contour to detect iris boundaries are [24, 31, 67, 127, 134, 147, 187, 247, 247, 282]. The last learning-based approaches of segmentation aim to classify image pixels into iris and non-iris categories using machine learning techniques, such as SVM classifier [231, 263], AdaBoost [157], triplet Markov fields (TMF) [27], fast-structured random forest [103] and graph-cuts [209]. With the success of deep learning techniques in other computer vision tasks, it has also been utilized in iris segmentation [19, 104, 112, 160, 238, 275]. Researchers also focus on segmenting the iris region in visible spectrum images [20, 207]. NICE I [4] presented a competition of iris segmentation techniques for the visible spectrum images. A detailed description of iris segmentation techniques are included in [35,130,218]. • Iris Normalization: After the iris segmentation, the circular iris region is mapped to a 5 fixed dimension region by the iris normalization module. Typically, the rubber sheet model proposed by Daugman [62] is used for iris normalization. It maps the segmented circular iris region defined in cartesian coordinates (x, y) to rectangular polar coordinates (r, 𝜃). It helps in minimizing variations in the area of the iris region due to the dilation and contraction of the pupil. • Feature Extraction: The feature extraction module is responsible for extracting salient or discriminative features from the normalized or unnormalized iris images and represents the images in a compressed form (template), commonly known as IrisCode. Daugman [69] utilized phase information from normalized iris using quadrature 2D-Gabor wavelets to cre- ate a feature template of 2,048 phase bits. Iris feature extraction techniques defined in the literature can be categorized as texture filtering approaches, texture analysis approaches, patch-based approaches, sparse coding, and deep learning representation techniques. Tex- ture filtering approaches extract iris texture using standard texture filters, such as Gabor filters [12, 62, 69, 173], wavelet transform [30, 32, 184, 283], ordinal features, [257], dis- crete cosine transform (DCT) [179], local intensity variation [151, 169], phase correla- tion approach [177, 320], Zernike moments [261], and binarized statistical image features (BSIF) [59, 226]. 
Texture analysis approaches explore underlying iris texture representation using different methods, such as gray-level co-occurrence matrices (GLCM) [77], local- global graph methodology [310], probabilistic graphical models [139], SURF features [175], dynamic programming [210], SIFT features [15, 200, 258], geometric key-based iris en- coding [262], bayesian estimation [268, 291], and shape-based features [48]. However, the performance of these methods is not showing much improvement over the traditional texture filtering approaches. Another category of feature extraction techniques is patch- based. The patch-based approaches tessellate an iris image into smaller patches, extract features from patches, and combine them. The research works that fall under this category are [25,162,179,205]. The patch-based encoding helps in handling occlusion and non-linear deformations. The sparse coding-based techniques are useful in handling iris images cap- 6 tured in non-ideal conditions (low resolution, blur and defocus) [148,149,202,317]. Recently, deep learning-based techniques are applied to extract discriminative features from the iris images [38, 91, 92, 164, 189, 208]. A detailed description of iris feature extraction techniques is included in [35, 37, 188]. • Comparator: The comparator module performs matching of two iris templates (query template and enrolled template) to establish an identity of an individual and outputs a match score. The match score could be a similarity measure or dissimilarity measure. However, in iris biometrics, hamming distance [62] is generally used to calculate the match score. It is an XOR operation between the two IrisCodes after masking the non-iris bits (eyelids or eyelashes). Hamming distance of two different irides should be equal to 0.5, whereas the hamming distance of two IrisCodes from the same iris should be 0. Figure 1.4: Various steps of the automated iris recognition process. It consists of the acquisition of an iris image from an individual, segmentation of iris region, iris region normalization, extraction of features from the iris image, and then the matching of the iris template against the enrolled templates. 1.3.1 Applications Iris recognition systems have been deployed in a wide range of applications around the world. Few prominent applications of iris recognition are listed below: 7 • The United Arab Emirates implemented an iris biometric system for border control.1 Entry through any means land, air, or seaports is now through iris biometrics. There are 1.2 million records in the database and 14 billion matches per day. • The Amsterdam Schiphol Airport, Netherlands, also implemented iris recognition to expe- dited, passport-free border security checks since 2001.2 • The Hashemite Kingdom of Jordan deployed an iris-enabled automated teller machine (ATM) at Cairo Amman Bank in 2009.3 In June 2012, UNHCR registered Syrian refugees in Jordan on Cairo Amman Bank ATMs.4 The system is used to provide financial assistance to refugees. • India’s Aadhaar project is the world’s largest biometrics-based identification system with more than 1.28 billion enrollments (April 2021) [6]. The project is to assign a 12-digit unique identification (UID) number to all the residents of India. The UID number is associated with an individual’s demographic and biometric information such as a photograph, ten fingerprints, and two iris scans. 
Various services are also linked with the UID number: electronic-Know Your Client (e-KYC) service, government subsidies distribution, telecom services, income tax services, and financial services. • Biometric e-passport has been endorsed by 120 countries (since June 2017) [1]. It supports facial, fingerprint, and iris recognition. • Various mobile platforms also implemented iris recognition for locking and unlocking the mobile devices5: Samsung Galaxy S8 and S9 series, Microsoft 950 XL, Fujitsu NX F-04G, Vivo X5Pro, ZTE Grand S3, Alcatel Idol 3, and UMI Iron. 1 https://www.cl.cam.ac.uk/ jgd1000/UAEdeployment.pdf 2 https://www.schiphol.nl/en/privium/how-the-iris-scan-works/ 3 https://www.cab.jo/services/IRIS%20Recognition 4 https://www.unhcr.org/innovation/using-biometrics-bring-assistance-refugees-jordan/ 5 https://webcusp.com/list-of-all-eye-scanner-iris-retina-recognition-smartphones/ 8 Figure 1.5: Various applications of iris recognition system: (a) UAE border control system, (b) India’s national ID project, (c) Hashemite Kingdom of Jordan’s iris-enabled ATM, (d) Biometric e-passport, and (e) Mobile access control. 1.3.2 Challenges Despite the wide deployment of iris recognition systems, there are various open challenges in iris recognition systems as listed below: 1. Low-quality Images: Iris recognition performs considerably well when iris images are captured in a controlled environment. However, when the iris images are captured in non- ideal conditions, it negatively impacts the performance of recognition. Non-ideal scenarios result in degradation of image quality in various ways, such as occlusion by eyelids, occlusion by eyelashes, specular reflections, motion blur, off-angle gaze, low resolution, illumination variations, eyeglasses, or contact lenses. Figure 1.6 shows a good quality iris image along with few low-quality iris images. Various works showed the impact of low-quality images on the performance and also proposed mitigation measures [21, 36, 128, 159, 209, 235, 237, 243, 271]. The NIST IREX II-IQCE report [260] evaluated various iris image quality assessment algorithms. Figure 1.6: Samples of a good quality iris image and few low-quality iris images. 2. Pupil Dilation: Generally, pupil diameter lies in the range of 1.5 mm to 8 mm and can be dilated to over 9 mm. Visual features also get affected on dilation, pupillary ruff become 9 thinner or even disappear, crypts become oblique, vessels become more tortuous, and the contraction and peripheral furrows deepen [40]. Pupil dilation occurs due to light intensity change, alcohol consumption, drug usage, age, disease, eye drops, a person’s emotional state, and perceptual events. Several studies [18, 36, 100, 115, 270] show the negative impact of pupil dilation on iris recognition. 3. Iris Aging: The resting pupil size decreases with age due to fibrotic changes in the sphincter and atrophy of the dilator muscles [40]. However, iris recognition is considered relatively stable despite these changes [99, 177, 179, 268]. On the contrary, in [36, 84, 269], the authors observed the decrease in iris recognition performance. Later, the works by [29,111,239,273, 294] analyzed various other covariates along with time-lapse responsible for the degradation of iris recognition performance, such as sensor aging (commonly denoted as pixel defects), segmentation errors, quality measures (blur, illumination variation, noise, occlusion), and geometrical factors (pupil dilation). 
The maximum time span used in the experiments to validate the influence of aging on iris recognition is nine years. Therefore, there is a need to perform experiments on a larger dataset with a longer time span to validate the effect of aging on automated iris recognition. 4. Iris Diseases: Ophthalmic disorders not only degrade the iris recognition performance but also increase the failure to enroll rate. These disorders may include cataract, glaucoma, iridis, rubeosis iridis, acute anterior uveitis, aniridia, ciliary body leiomyoma, lisch nodules, iris melanoma, heterochromia synechiae, hyphema, iris cysts, iris prolapse, corneal pathologies, and iridodialysis. American Academy of Ophthalmology6 reported 24.4 million cataract patients, 7.7 million diabetic retinopathy patients, 2.7 million glaucoma patients, and more than 2.1 million age-related macular patients. There occur more than 7.63 million cataract surgeries (aged 50+ years) in India [101] in the year 2020. Various research works focus on the impact of these disorders on iris recognition in [28, 142, 190, 274]. 6 https://www.aao.org/newsroom/eye-health-statistics 10 5. Sensor Interoperability: The challenge of sensor interoperability arises with the use of different iris sensors where users enrolled using one model of iris sensor and probe images are acquired using another new iris sensor. The situation of cross-sensor matching also arises when the iris image of one application (e.g., national ID) matches with iris images of another application (e.g., law enforcement) where capturing sensors are different. Bowyer et al. [36] showed higher false non-matches when images of different sensors are matched compared with the images of the same sensor. The cross-sensor matching accounts for the differences in the relative position of illumination sources and human eyes, camera characteristics, and spectrums (visible or near-infrared). Various research works focus on cross-sensor matching in [91, 186, 201]. 6. Security: In [90], researchers reconstructed iris images from iris templates and used them to attack commercial iris recognition systems, with a success rate of around 80%. Samsung’s new Galaxy S8 smartphone has also been defeated by German hackers using dummy eyes. In another case, eye drops were used to cause excessive mydriasis and bypass the iris recognition- based border control in UAE. Biometrics considers an integral part of the human body, so the compromise in the individual biometric template compromises his/her identity. Therefore, it is essential to maintain the integrity of the biometric system and protect it from various types of attacks. Section 1.4 provides details of different attacks employed on biometric systems. 7. Cross-modal Matching: Cross-modal matching associates data of one biometric modality to another modality. It is required when the legacy database or enrolled template of query identity is not available. It is also beneficial when we need to map genomic data to phenotypic traits [163]. However, the matching involves various challenges due to the difference in modalities, sensors, spectrums, and resolutions. Matching of iris images with face images is discussed in Section 1.5.he challenge of sensor interoperability arises with the use of different iris sensors where users enrolled using one model of iris sensor and probe images are acquired using another new iris sensor. 11 The focus of this thesis is on the last two (security and cross-modal matching) challenges of the iris recognition system. 
1.4 Security of Iris Recognition System In [3, 224], authors identified nine points of attack in the generic biometric system (Figure 1.7): (1) at the acquisition unit (presentation of a plastic eye to the iris sensor); (2) at the communication channel between the acquisition channel and feature extraction unit (modification of captured biometric sample); (3) at the feature extraction unit (trojan horse attack on feature extractor); (4) at the communication channel between feature extraction unit and comparison unit (attack on TCP/IP protocol and alteration of feature template); (5) at the data storage unit (tampering of enrolled templates); (6) at the communication channel between the data storage unit and comparison unit (attack on TCP/IP protocol and alteration of an enrolled template); (7) at the comparison unit (modification of matching algorithm or match score); (8) at the communication channel between comparison unit and decision unit (attack on TCP/IP protocol and modification of match score); (9) at the decision unit (change or flip of the decision). Cryptography or encryption techniques could prevent attacks at the communication channels (4, 6, 8 attack points). The attacks on the feature extraction (3 attack point), data storage (5 attack point), and comparison (7 attack point) units can be mitigated by keeping these units at a secure location. The attacks before feature extraction (1, 2 attack points) can be resolved by the implementation of additional modules for the detection of fake presentation or modification of a biometric sample. The focus of the thesis is on the attacks employed at attack points 1 (sensor-level) and 2 (image-level) shown in orange boxes of Figure 1.7. These attacks are easier to deploy as they do not require any internal knowledge of the system. The two significant attacks fall in these categories are presentation attacks and morph attacks. 1.4.1 Iris Presentation Attacks According to ISO/IEC 30107-1:2016 [3], a Presentation Attack (PA) is a “presentation to the biometric data capture subsystem with the goal of interfering with the operation of the biometric 12 Figure 1.7: Generic biometric system and the various points of attacks launched on the system [224]. The attacks shown in orange boxes (presentation and morph attacks) are our focused research area. system". The biometric characteristics or materials used to launch a presentation attack are termed Presentation Attack Instruments (PAIs). Examples of iris PAIs are printed iris images [57, 64, 113, 211], artificial eyes (plastic, glass, or doll eyes) [113, 152], cosmetic contacts [123, 212, 300], video display of an eye image [58, 214], cadaver eyes [58, 172], robotic eye models [143], holographic eye images [193], mannequin eye, and eye presentation under coercion. Figure 1.8 shows few samples of iris PAIs. In [3], iris PAIs are categorized as artificial, human-based, and other natural (animal or plant-based). The artificial PAIs are further categorized as complete (print or plastic eye) and partial (cosmetic contacts). The human-based PAIs are categorized as lifeless (cadaver eyes), altered (iris surgeries), non-conformant (off-gaze iris or occlusion by eyelids), coerced (unconscious or under duress), and conformant (zero effort impostor attempt). The objective is to detect the aforementioned presentation attacks along with future unknown attacks. Typically, presentation attack detection (PAD) techniques follow are similar procedures as used by the biometric recognition system. 
It consists of capturing raw data from the sensor (additional or same recognition sensor), feature extraction from the raw data, and classification of features into detection or not detection classes based on pre-specified decision criteria. The PAD process can be performed simultaneously (additional sensor) or sequential (biometric recognition sensor). 13 In the literature, existing iris PAD techniques are categorized as hardware-based or software- based. The software-based techniques utilize the iris image captured from the conventional iris sensors, whereas the hardware-based techniques employ additional hardware to detect the liveness characteristics (eye blinking, pupil dilation or contraction, etc.). Czajka and Bowyer [58] proposed another categorization based on the type of input (image or video) and type of response (active or passive). The categorization includes four categories: static iris passively imaged (static features from still iris image), static iris actively imaged (features from multi-spectral iris images), dynamic iris passively imaged (features from pupil hippus), and dynamic iris actively imaged (dynamic features from stimulated pupil reflex). Various competitions and assessments of these iris PAD techniques can be found in [58, 61, 304–306]. Figure 1.8: Various iris presentation attacks instruments. 1.4.2 Morph Attacks Morph attack is another focus of our thesis. It is generally employed at the image-level on the raw biometric image captured from the acquisition sensor and before the image transfer to the feature extraction module. Though, it can also be performed at feature-level [85, 225], but requires knowledge of the feature extraction module. The morph attack at the image-level can be directed by presenting the modified biometric image to the sensor or by digitally uploading it to the biometric system. The morph attack entails the generation of an image (morphed image) that embodies multiple different identities. Typically, a biometric image is associated with a single identity; however, the morphed image can successfully match with multiple identities (Figure 1.9). It violates the fundamental uniqueness property of the biometric system. Morph attack is 14 mainly studied in the context of face recognition, where a single passport with a morphed face image allowed two individuals to pass through the border control security. It has not been widely investigated in iris recognition. The morphed biometric images are generated using morphing techniques. In general, morphing [26] involves the creation of seamless transformation from one image into another. It creates intermediary morphed images by combining the two images in different proportions. It is a well- known field of research for the entertainment, education, or medical industry. However, in the context of biometric systems, it is recently being used to generate morphed biometric image that has the potential to attack the biometric system [86]. Figure 1.9: Pictorial diagram of iris morph attack. A single morphed iris image can authenticate two or more individuals, which violates the fundamental uniqueness characteristics of the biometric system. 1.5 Cross-modal Biometrics Typically, a biometric system matches samples of the same modality for human recognition. However, cross-modal biometrics refers to the matching of different modalities to establish the identity of an individual. 
For instance, matching of iris images with face images [129], deducing phenotypic traits from genomic data [163], or mapping face image with voice signal [168, 185]. Figure 1.10 shows some of the examples of cross-model matching. The motivation of performing cross-modal matching comes from the various scenarios: (a) identify an individual when the legacy 15 database or corresponding enrolled template is not available, (b) improve recognition confidence even if the legacy dataset is available, (c) match noisy probe image when it cannot be matched with its legacy database, i.e., masked face image, and (d) connect databases of different biometric modalities to identify an individual globally. Our focus is on the matching iris images with face images which constitutes the following challenges: 1. Cross-Modality: This involves matching iris modality images to face modality images. Though the ocular region is common in both the modalities, but the focus while acquisition is different in different modalities. In iris image acquisition, the focus is on the iris pattern, whereas in face image acquisition, the focus is on the entire face. 2. Cross-Sensor: Different sensors are used to capture the face and iris images. Sensors add various noises to the images, for instance, fixed pattern noise, pixel response non-uniformity (PRNU), random noise, etc. 3. Cross-Spectrum: Generally, face images are acquired using sensors typically operating in the visible spectrum (VIS) in contrast to iris images acquired using sensors operating in the near-infrared (NIR) spectrum. When considering the iris region only, NIR illumination (700-900nm) captures the stromal features (fibrovascular layer) of the iris, whereas VIS illumination (400-700nm) captures melanin pigment and a meshwork of ligament features. 4. Cross-Resolution: Iris or ocular regions of the face images are generally of very low resolution as compared to iris images. 1.6 Thesis Contributions The main contributions of this proposal are as follows: 1. We propose an effective and robust software-based iris presentation attack (PA) detector called D-NetPAD using a single near-infrared (NIR) iris image (Figure 1.11 (a)). It is 16 Figure 1.10: Examples of cross-modal matching: (a) iris modality matches with face, (b) deducing phenotypic traits from genomic data, and (c) mapping face image with the voice signal. based on the densely connected convolutional neural network architecture. It demonstrates generalizability across PA artifacts, sensors, and datasets. We conduct experiments on a proprietary dataset and two publicly available datasets (LivDet-Iris 2017 and LivDet-Iris 2020) that substantiate the effectiveness of the proposed method for iris PA detection. The proposed method results in a true detection rate of 98.58% at a false detection rate of 0.2% on the proprietary dataset and outperforms state-of-the-art methods on the LivDet-Iris 2017 and LivDet-Iris 2020 datasets. We also explore the explainability and interpretability of our method using t-SNE plots and Grad-CAM which help in visualizing intermediate feature distributions and fixation heatmaps, respectively. Further, we conduct a frequency analysis to explain the nature of features being extracted by the network. 2. We design a hybrid (a combination of hardware and software-based) iris presentation attack detector utilizing short videos (approx. 4 secs) captured from a webcam (Figure 1.11 (b)). The videos are in the visible spectrum focusing on the user interaction with the iris sensor. 
To extract discriminative features from the scene, we develop various spatial-temporal feature extraction techniques. Evaluation is performed on a proprietary dataset (IPV dataset) of 121 subjects. We also extend it for detecting PAs in face modality. For the face modality, experiments are performed on three publicly available datasets (SiW, SiW-M, and OULU- NPU). The proposed approach generalizes well across different environments (e.g., changes in illumination or background), presentation attacks, and modalities. 17 3. We propose a hardware-based iris PA detector utilizing Optical Coherence Tomography (OCT) imaging (Figure 1.11 (c)). The OCT imaging provides a cross-sectional view (internal structure) of an eye, whereas traditional imaging provides 2D iris textural information. Its viability is assessed by comparing its performance with respect to traditional iris imaging modalities, viz., near-infrared (NIR), and visible spectrum. PA detection is performed using three state-of-the-art deep architectures (VGG19, ResNet50, and DenseNet121) to differentiate between bonafide and PA samples for each of the three imaging modalities. Experiments are performed on a proprietary dataset of 2,169 bonafide, 177 Van Dyke eyes, and 360 cosmetic contact images acquired using all three imaging modalities under intra- attack (known PAs) and cross-attack (unknown PAs) scenarios. Promising results demonstrate OCT as a viable solution for iris presentation attack detection. 4. We assess the robustness of the iris PA detector against architectural parameter perturbations. The robustness analysis involves three state-of-the-art architectures (VGG, ResNet, and D- NetPAD), three types of parameter perturbations (Gaussian noise, weight zeroing, and weight scaling), and two settings (entire network and layer-wise). We conduct evaluations on the LivDet-Iris-2017 and LivDet-Iris-2020 datasets. Based on the robustness analysis, we propose improved models simply by perturbing parameters of the network without further training. Then we combine these perturbed models to improve performance over the original model. The ensemble models show a 47.59% average improvement on LivDet-Iris-2017 dataset and 5.44% on the LivDet-Iris-2020 dataset. 5. We propose a retrain methodology to maintain the performance of iris PAD in non-stationary environment. The methodology involves building a new expert model using new oncoming training data, and makes a final decision for a probe sample by a weighted sum of old and new iris PAD models scores. We assign the weights dynamically at the run-time for each probe sample using in-domain models (separate from iris PAD models). The in-domain model provides information about the membership of a probe sample to the training data. To 18 evaluate the proposed method, we experiment with three setups. The first two setups are in the application of detecting presentation attacks in iris biometric modality. The third setup compares the proposed method against state-of-the-art continual learning methods on the split MNIST dataset. 6. We investigate the other adversary attack called morph attacks in the context of iris biometrics. We perform iris morphing at the image level and generate morphed iris images from two available datasets (IIITD and WVU-Multimodal). We then demonstrate the vulnerability of three different iris recognition methods (VeriEye, USITv3.0, and CNN-Pairwise) to morph attacks with a success rate of over 90% at a false match rate of 0.01%. 
Finally, we provide preliminary results on the detection of morphed iris images. 7. We propose learning-based methods to perform cross-modal matching (iris modality images match against face modality images). Cross-modal recognition mainly encounters two chal- lenges: (i) a large domain gap due to different sensors, spectra, and resolutions, and (ii) an imbalance in the training data. We propose three deep learning approaches to address these two challenges. The first approach is at the feature-level, where we jointly extract discriminative features from two modality images to reduce their domain gap and termed it Multi-channel CNN. The second approach is at the image-level, where one domain image is transformed into another utilizing various GAN architectures. The third approach is at the training-level that resolves the imbalanced training data by generating samples of under- represented class using the Dual Variational Generation (DVG) framework. We conduct experiments on BioCop-2008, BioCop-2009, WVU multi-modal, and cross-spectrum PolyU datasets to substantiate the effectiveness of the proposed approaches. 1.7 Thesis Organization The remaining document is organized as follows: 19 Figure 1.11: Different categories of techniques applied to detect iris presentation attacks: (a) technique utilizing a single NIR iris image captured from conventional iris recognition sensor and (b) technique utilizing a video captured from webcam (c) technique utilizing a single iris OCT image. All these techniques generate a Presentation Attack (PA) score between 0 and 1, where ‘0’ corresponds to bonafide input sample and ‘1’ corresponds to PA input. • Chapter 2 describes the software-based iris PA detector (D-NetPAD) which utilizes a single NIR iris image. This chapter covers the first contribution. • Chapter 3 entails the hybrid iris PA detector which utilizes a scene video captured from a webcam. This chapter covers the second contribution. • Chapter 4 describes the hardware-based iris PA detector which utilizes OCT imaging. This chapter covers the third contribution. • Chapter 5 assess the robustness of the iris PA detector along with other deep neural networks. This chapter covers the fourth contribution. • Chapter 6 explores the retraining strategies for iris PA detectors to keep the performance over time. This chapter covers the fifth contribution. • Chapter 7 introduces the iris morph attacks, their potential to attack iris biometric systems, and their detection technique. This chapter covers the sixth contribution. 20 • Chapter 8 provides insight into cross-modality matching of iris images with face images. This chapter covers the seventh contribution. • Chapter 9 concludes the thesis and provides some directions for future work. 21 CHAPTER 2 IRIS PRESENTATION ATTACK DETECTION USING A SINGLE NIR IMAGE Parts of this chapter appeared in the following publications: R. Sharma and A. Ross, “D-NetPAD: An Explainable and Interpretable Iris Presentation At- tack Detector,” International Joint Conference on Biometrics (IJCB), September 2020. P. Das et al., "Iris Liveness Detection Competition (LivDet-Iris) - The 2020 Edition," Interna- tional Joint Conference on Biometrics (IJCB), September 2020. 2.1 Introduction In this chapter, we present a iris presentation attack detection (PAD) method which utilizes a near-infrared (NIR) iris image captured from the conventional iris sensor. Therefore, the method does not impose additional overhead when integrated with the existing iris recognition system. 
The method is based on a densely connected convolutional neural network (DenseNet). The DenseNet architecture has a unique property that each layer is connected to every other layer in a feed- forward fashion and the features across different layers correspond to different resolutions. The aggregation of these multi-resolution features efficiently characterizes the iris pattern as the iris pattern is arguably stochastic in nature and the intricate features of the iris stroma are manifested in multiple resolutions [63]. The source code and trained model of the proposed method are available at https://github.com/iPRoBe-lab/D-NetPAD. The main contributions of the work are as follows: 1. We propose an effective and robust iris PA detector named as D-NetPAD that is based on the DenseNet architecture [122]. We also demonstrate that the proposed detector exhibits generalizability across different PAs, sensors, and datasets. 22 2. We evaluate the performance of D-NetPAD on a proprietary dataset and two publicly available dataset (LivDet-2017 and LivDet-2020). 3. We perform visualizations using t-SNE plots [280] and Grad-CAM [245] to explain the performance of the proposed method. The t-SNE plots provide visualization of features obtained from the intermediate layers of the model. The Grad-CAM produces heatmaps emphasizing the salient regions in an iris image that are used by the network to detect iris PAs. 4. We also conduct a frequency analysis to understand the frequencies learned by the model and, based on that, interpret its performance. The rest of the chapter is organized as follows: Section 2.2 provides the brief description of the related work. Section 2.3 discusses the architecture of the proposed method. Section 2.4 describes the experimental setup and results on all the datasets. Section 2.5 provides a detailed analysis of the results obtained from the D-NetPAD. Section 2.7 concludes the chapter. 2.2 Related Work Existing techniques in the literature used to counter iris PAs can be categorized as being either hardware-based or software-based. Hardware-based techniques typically require physical devices in addition to the conventional iris sensor to aid in PA detection. The additional hardware assists in capturing intrinsic properties of the eye (e.g., corneal reflection, red-eye effect, etc.), involuntary behavioral characteristics (pupil hippus, eye blinking, etc.), or challenge-response behavior which could be voluntary (eye-tracking) or involuntary (pupil dynamics with external light). Daugman [66] suggested the use of spectrographic properties of the eye (tissue, blood) and four Purkinje images generated by the reflection from the outer and inner surface of the cornea and lens. Further, Lee et al. [152] analyzed the changes in the reflectance ratio between the iris and sclera under multi-spectral illumination. Czajka et al. [56] utilized IrisCUBE camera to capture pupil dynamics, whereas Kanematsu et al. [135] used CCD camera with two white LEDs to initiate 23 and record pupillary reflex. Hughes et al. [123] created 3D structural modeling of an eye using stereo imaging. Raghavendra and Busch [219] explored the inherent characteristics of Light Field Camera (LFC) in the VIS spectrum for iris detection. Komogortsev and Karpov [144] used the EyeLink II eye tracker to capture Oculomotor Plant Characteristics as a cue for detecting PA. Sharma and Ross [250] introduced the use of Optical Coherence Tomography (OCT) imaging for iris PA detection. 
On the other hand, software-based techniques extract salient features from the digital iris image in order to classify it as a bonafide or a PA.1 Daugman [65] distinguished patterned contacts from real iris images using the amplitude spectrum of the 2-D Fourier Transform. Other researchers also analyzed frequency spectrum using 2-D Fourier spectra [108], Wavelet Transform [109] and Laplacian pyramids [217]. Various handcrafted features are used to detect the iris PA, for instance, HVC [315], LBP [119], BSIF [220], and SID [97]. However, more recently, a number of deep- learning based methods have been proposed [46, 114, 176, 194, 301, 302]. Menotti et al. [176] proposed a deep architecture for PA detection called SpoofNet. Pala and Bhanu [194] developed a deep framework built upon triplet convolutional networks. Hoffman et al. [113] focused on detecting iris PAs utilizing a patch-batch convolutional neural network (CNN) that is observed to perform well in the cross-sensor and cross-dataset scenarios. They extend their work [114] by analyzing the importance of utilizing the periocular region in detecting iris PAs. Chen and Ross [46] proposed a multi-task CNN for first detecting the iris region and then classifying it. In their other work [45], they explored IrisCodes for PA detection, so that commercially used IrisCodes could be authenticated. Yadav et al. [302] utilized a Relativistic Average Standard Generative Adversarial Network (RaSGAN) as a one-class classifier to detect unseen or unknown iris PAs. In another work, Yadav and Ross [303] proposed Cyclic Image Translation Generative Adversarial Network (CIT- GAN) for augmenting under-represented iris PAs in the training set. Chen and Ross [47] worked in the direction of explaining the model predictions by incorporating attention mechanisms in the CNN network which provides visual explanations to the predictions. The Liveness Detection-Iris 1A “bonafide" image is sometimes referred to as a “live" image in the literature. 24 Competition (LivDet-Iris) held in 2013 [305], 2015 [306], 2017 [304] and 2020 [61] provided a comprehensive comparative report of different iris PA detection techniques. Czajka and Bowyer [58] also presented a detailed assessment of various state-of-the-art iris PA detection (PAD) algorithms. While most of these methods resulted in very high PA detection rates, generalizability across PAs, sensors, and datasets is still a challenging problem [79, 212, 300]. 2.3 D-NetPAD: Description and Rationale Dense Network Presentation Attack Detection (D-NetPAD) is based on the Densely Connected Convolutional Network 121 (DenseNet121) [122] architecture. The architecture consists of 121 convolutional layers of kernel size 7 × 7, followed by a max-pooling layer and a series of Dense blocks and Transition layers. There are four Dense blocks, and three Transition layers lie between successive Dense blocks. Each Dense block consists of two convolutional layers of kernel size 1 × 1 and 3 × 3. Both convolutional layers are followed by a non-linear ReLU activation layer. The Transition layer consists of one convolutional layer of kernel size 1 × 1 and an average pooling layer. It reduces the size of feature-maps, which is kept constant within a Dense block. The last layer is a fully connected layer. The work in [301] exploits the DenseNet architecture of depth 22 with three densely connected blocks. The most notable characteristic of DenseNet is that each layer connects to every other layer in a feed-forward fashion. 
In other words, each layer obtains feature-maps from preceding layers and passes its feature-maps to subsequent layers. The features from preceding layers are combined by concatenation as opposed to the summation performed in the ResNet [105] architecture. The concatenation removes the constraint of having the same dimension across the feature-maps. In this way, the architecture ensures the maximum flow of information in the forward direction and also resolves the most prevalent challenge of vanishing gradient in the backward direction. Another major advantage of DenseNet121 is that it supports such densely and deeply connected network with fewer trainable parameters (7,978,856) as compared to its counterpart ResNet50 (35,610,216) [105] or VGG19 (143,667,240) [255]. This is because DenseNet uses a small set of 25 filters in each layer (e.g., 12 filters/layer) compared to the traditional convolutional networks (∼128 or 256 filters/layer). DenseNet preserves the feature-maps and reuses it in the subsequent layers instead of relearning feature-maps every time. The reusability of feature-maps helps in alleviating the over-fitting problem, especially in the case of limited training data. These architectural tweaks help in generating an efficient feature representation for the highly textured iris pattern. Feature- maps at each layer capture specific spatial and frequency information and consolidation of these feature-maps result in the extraction of multi-resolution features. These features are efficient in characterizing the stochastic nature of the iris pattern. The intricacy of a bonafide iris pattern is not present in the spoofed iris (print eye, artificial eye, or cosmetic contact), and this difference is efficiently captured by the features generated from DenseNet. The consolidation of feature-maps at the last layer also smoothens the decision boundaries, resulting in better generalization across PA artifacts, sensors, and datasets. Figure 2.1 shows the flowchart of the proposed architecture. The iris sensor acquires an ocular image which is input to the iris detection module. In our implementation, we use the VeriEye iris detector, which outputs the centers of the iris and pupil along with their radii. The iris region is cropped from the ocular image using the center and radius of the iris. The cropped iris region is then resized to 224 × 224 and input to the pre-trained DenseNet121 network. The ImageNet dataset [72] is used to pre-train the network. The network produces a single presentation attack (PA) score, which lies between 0 and 1. A score approaching ‘1’ indicates that the input sample is a PA, whereas a score approaching ‘0’ indicates that the input sample is a bonafide. We determine the threshold by fixing the False Detection Rate to 0.2% in order to get the final classification. If the PA score is less than the specified threshold, the input sample is labeled as a bonafide; otherwise, it is a PA. During training, the learning rate used is 0.005, the batch size is 20, the optimization algorithm used is the stochastic gradient descent with a momentum of 0.9, the number of epochs is 50, and the loss function is cross-entropy. 26 Figure 2.1: Flowchart of the D-NetPAD algorithm. Iris region (red box) is detected and cropped from the ocular image and input to the D-NetPAD architecture. The base architecture used in D-NetPAD is DenseNet121 [122]. 
It produces a single PA score within a range of 0-1, which determines whether an input image is a bonafide (value towards 0) or a PA (value towards 1). 2.4 Evaluation and Results We performed experiments on a proprietary dataset as well as two publicly available benchmark dataset (LivDet-2017, LivDet-2020) to evaluate the performance of D-NetPAD. The proprietary dataset has several subsets and is, therefore, referred to as the “Combined Dataset" in the rest of the document. The Combined and the LivDet2020 datasets correspond to the cross-PA scenario, whereas the LivDet-2017 dataset creates a test-bed for cross-PA, cross-sensor, and cross-dataset testing scenarios. In the cross-PA scenario, we use PA instruments (PAIs) that were not used during the training. In the cross-sensor scenario, we evaluate images from different sensors than those used during the training. The cross-dataset scenario incorporates testing under different PAIs, sensors, data acquisition environments (indoor/outdoor, varying illumination conditions), subject populations, and platforms (desktop and mobile). The cross-dataset scenario accounts for large variations, making it the most challenging test scenario. 2.4.1 Combined Dataset: Description and Results The Combined Dataset was collected under the IARPA Odin program (Presentation Attack De- tection) [2]. The IrisAccess iCAM7000 sensor was used to collect the data. The dataset is a combination of various component datasets collected at different locations and times using differ- ent units of the same sensor. Table 2.1 provides the description of the component datasets. There are a total of 13,851 iris images out of which 9,660 are bonafide and 4,291 are PAs. The PA samples in the dataset correspond to the following attack instruments: print, artificial eye, cosmetic contacts, kindle replay, and transparent dome on print. Figure 2.2 shows sample images from the dataset. 27 The test set JHU-APL03 (Table 2.1) comprises two types of artificial eyes and 10 different types of cosmetic contacts. It corresponds to the cross-PA scenario as it contains six additional cosmetic contacts that are not used during training. As the process of collecting cosmetic contact images is a tedious and time-consuming process, its quantity is limited in the training set. Therefore, we utilize cosmetic contact images from NDCLD-2015 [267] to overcome the shortcoming. The bonafide images in the NDCLD-2015 dataset are not used as the Combined dataset has a large number of bonafide images. The NDCLD-2015 dataset was collected using the IrisGuard AD100 and IrisAccess LG4000 sensors. Figure 2.2: Sample images of bonafide and different types of PAs (print, artificial eye, cosmetic contact, kindle replay, and transparent dome on print) taken from the Combined dataset. The last cosmetic contact image is taken from the NDCLD-2015 dataset. Table 2.1: Description of different components of the Combined Dataset. Details of the train and test set of the Combined and NDCLD 2015 datasets are also provided in terms of the number of bonafide and PA images. Here, MSU stands for Michigan State University, CU stands for Clarkson University, and JHU-APL stands for Johns Hopkins University-Applied Physics Laboratory. 
Train Test Dataset MSU IrisPA01 CU IrisPA01 CU IrisPA02 PB IrisPA01 PB IrisPA02 PB IrisPA03 JHU-APL01 JHU-APL02 NDCLD 2015 JHU-APL03 Bonafide 381 962 1,107 446 518 518 1,394 1,371 - 2,963 Print 991 660 415 14 - - - - - - Artificial Eye 318 34 - 21 9 12 49 111 - 175 Cosmetic Contacts - - 208 - 21 94 78 120 2,236 177 Kindle Replay 51 79 - - - - - - - - Transparent Dome - - 503 9 - - 42 - - - Acquisition Time Period Nov 2017 Nov 2017 Dec 2018 April 2018 Feb 2019 Sept 2019 May 2018 May 2019 2015 Nov 2019 We evaluate the performance of the D-NetPAD in terms of True Detection Rate (TDR) at a False Detection Rate (FDR) of 0.2%. TDR is the percentage of PA samples correctly detected, whereas FDR is a percentage of bonafide samples incorrectly classified as PA.2 The D-NetPAD is compared against two deep learning-based methods ( [46] and [114]) as these are state-of-the-art (SoTA) methods. It is also compared with VGG19 [255] and ResNet101 [105] deep architectures. Table 2 Other commonly used evaluation measures for presentation attack detection are Attack Pre- sentation Classification Error Rate (APCER) and Bonafide Presentation Classification Error Rate (BPCER). TDR is 1−APCER, and FDR is the same as BPCER. 28 2.2 presents the results of all five algorithms. The D-NetPAD results in a TDR of 98.58% and outperforms the SoTA methods [114], [46], VGG19 and ResNet101 by 25.27%, 6.44%, 2.41% and 1.75%, respectively. Low performance of [114] (a network of eight convolutional layers) substantiates the use of deeper architectures. We also experiment to emphasize the importance of iris localization in the proposed method. When we input the entire ocular image into the DenseNet121 network, it resulted in a lower TDR of 94.59% at 0.2% FDR. Next, we analyze the failure cases (misclassified cases) of our method. At 0.2% FDR, there are four bonafide samples misclassified as PAs and five PAs samples misclassified as bonafide. Figure 2.3 shows the misclassified images. In the case of misclassified bonafide images, subjects wearing transparent contact lenses and glare of the light reflected from the glasses, resulting in misclassification. In the case of misclassified PA images, the D-NetPAD fails for a particular type of cosmetic contact lens (Halloween-style Extreme contact lens), where the pattern appears only at the periphery of the cosmetic contact. Segmentation ignores the outer region of the iris containing some artifacts of these cosmetic contacts. This resulted in a smaller region of the artifact being fed into the DenseNet for PA detection, leading to a misclassification. Table 2.2: The results of D-NetPAD in term of TDR (%) at 0.2% FDR on the Combined dataset. The method is compared with four other algorithms. Algorithms [114] [46] VGG19 ResNet101 D-NetPAD TDR (%) @ 0.2% FDR 78.69 92.61 96.26 96.88 98.58 2.4.2 LivDet-2017 Dataset: Description and Results Another dataset used for evaluation is the LivDet-2017 [304] dataset. The LivDet-2017 dataset is a combination of four datasets: Clarkson, Warsaw, Notre Dame, and IIITD-WVU datasets. Figure 2.4 shows few samples of LivDet-2017 dataset. Table 2.3 describes the types of PAs present in the datasets, and the number of images in the train and test sets of all four datasets. The Clarkson dataset represents the cross-PA testing scenario. The test set consists of 5 additional cosmetic contacts and prints of visible spectrum iris images captured using an iPhone 5. 
The Warsaw dataset helps in 29 Figure 2.3: Misclassified images by the D-NetPAD algorithm on the JHU-APL03 test set. The first row shows bonafide images that are misclassified as PA. The second row shows PA images that are misclassified as bonafide. The PA score is displayed at the bottom of each image. The threshold for classification is 0.40, where a PA score below the threshold is considered to be a bonafide. evaluating the cross-sensor testing scenario. It consists of two test sets: a “known" sensor and an “unknown" sensor. The IrisGuard AD100 sensor is used to capture the images of the training set and the “known" component of the test set. Images of the “unknown" component of the test set are captured by a setup composed of Aritech ARX-3M3C camera, SONY EX-View CCD sensor, Fujinon DV10X7.5A-SA2 lens, and B+W 092 NIR filter. The Notre Dame dataset corresponds to the cross-PA scenario. It also contains two test sets (“known" and “unknown"). The “unknown" test set includes cosmetic contacts not used in the training set. The IIITD-WVU dataset consists of data collected by IIITD and WVU. The IIITD data is used for training, whereas the WVU data is used for testing. The dataset corresponds to the cross-dataset scenario, where the test set incorporates variations in the sensors, data acquisition environment, subject population, and PA generation procedures. The training set is captured in a controlled environment using two iris sensors: Cogent dual iris sensor (CIS 202) and VistaFA2E single iris sensor. The test set is captured using the IriShield MK2120U mobile iris sensor at two different locations: indoors (controlled illumination) and outdoors (varying environmental conditions). The cross-dataset testing scenario represents the most difficult case. For a detailed evaluation of the D-NetPAD, we created three models of the D-NetPAD network, 30 Table 2.3: Description of the train and test sets of all four subsets of the LivDet-2017 dataset along with the number of bonafide and PA images present in the datasets. The information about the sensors is also provided. Each subset represents different testing scenarios. The Clarkson and Notre Dame test sets correspond to the cross-PA scenario, whereas the Warsaw data corresponds to the cross-sensor scenario. The IIITD-WVU represents a cross-dataset scenario. Here, “K. Test" means a known test set of the dataset, and “U. Test" means an unknown test set. Clarkson Warsaw Notre Dame IIITD-WVU Dataset (Cross-PA) (Cross-Sensor) (Cross-PA) (Cross-Dataset) Train Test Train K. Test U. Test Train K. Test U. Test Train Test Bonafide 2,469 1,485 1,844 974 2,350 600 900 900 2,250 702 Print 1,346 908 2,669 2,016 2,160 - - - 3,000 2,806 Cosmetic Contacts 1,122 765 - - - 600 900 900 1,000 701 Aritech ARX-3M3C, Cogent IrisAccess IrisGuard Fujinon DV10X7.5A, IrisGuard AD100, IriShield Sensor CIS 202, EOU2200 AD100 DV10X7.5A-SA2 lens IrisAccess LG4000 MK2120U VistaFA2E B+W 092 NIR filter Figure 2.4: Sample images of bonafide and different types of PAs (print, cosmetic contact) taken from each subset of the LivDet-2017 dataset. which differ in their training process: (i) Pre-trained D-NetPAD: The model trained on the Combined dataset is directly used; (ii) Scratch D-NetPAD: The model is trained from scratch on the LivDet-2017 train sets; and (iii) Fine-tuned D-NetPAD: The model that is pre-trained on the Combined dataset is fine-tuned using the LivDet-2017 train sets. 
The performance measures used are the same as those used in [304]: the Attack Presentation Classification Error Rate (APCER) and the Bonafide Presentation Classification Error Rate (BPCER). The APCER is the proportion of PA samples misclassified as bonafide, whereas the BPCER is the proportion of bonafide samples misclassified as PAs. The D-NetPAD is compared against the top three winners of the LivDet-2017 competition. Table 2.4 summarizes the results of all algorithms. While the pre-trained D-NetPAD model and the model trained from scratch perform on par with the state-of-the-art methods, the fine-tuned D-NetPAD model outperforms the other methods.

We also measured the performance of D-NetPAD in terms of its TDR at 0.2% FDR on the LivDet-2017 dataset. Table 2.5 compiles the results of D-NetPAD on all four subsets of the LivDet-2017 [1] dataset. A summary of the results is provided below.

Clarkson Test Dataset: The pre-trained D-NetPAD fails on the Clarkson test set. With respect to the training data of the pre-trained model, the Clarkson dataset corresponds to the cross-sensor and cross-PA scenarios. The images captured by the IrisAccess EOU2200 are visually quite different from the images captured by the iCAM 7000 iris sensor, which results in poor performance (28.63%). However, the result improves (92.05% and 93.51%) when the training set (scratch or fine-tuned) includes the Clarkson train set (i.e., the sensor information).

Warsaw Test Dataset: The pre-trained D-NetPAD achieves competent performance on the Warsaw dataset. The sensors and types of PA used in the Warsaw dataset are different from those used in training, but the images captured by the test sensors are visually similar, which results in a comparable TDR. Fine-tuning the pre-trained D-NetPAD using the train set of the Warsaw dataset results in 100% TDR.

Notre Dame Test Dataset: The dataset represents the cross-PA scenario, where the test set uses additional cosmetic contacts. The pre-trained D-NetPAD model, trained on diverse cosmetic contacts, generalizes well across previously unseen cosmetic contacts (93.55% and 91.00%). Its performance drops on the unknown test set (66.55%) when the model is trained from scratch, as the diversity of cosmetic contacts is limited in the Notre Dame train set. Fine-tuning the model with the Notre Dame train set achieves 100% TDR.

IIITD-WVU Test Dataset: This is the most challenging dataset: the test set images are captured using the IriShield MK2120U mobile iris sensor and under different capture environments (indoor and outdoor). The dataset also includes unseen PAs, resulting in very low TDRs for all three models (42.91%, 29.30%, and 48.85%). We further analyze the results on IIITD-WVU by plotting the PA score distributions of the bonafide and PA samples and estimating the d-prime distance between them (Figure 2.5). Though the TDR is quite low in the case of the fine-tuned D-NetPAD, its histogram shows a better separation (d′ = 2.64) between the score distributions of bonafide and PAs.

The D-NetPAD algorithm demonstrates robustness across cross-PA and cross-sensor testing scenarios after fine-tuning, but fails in the cross-dataset case, which is a combination of the cross-PA, cross-sensor, cross-environment, and cross-platform scenarios. Here, cross-platform implies training on images from an iris sensor meant for desktop use (e.g., IrisAccess iCAM7000) and testing on images from an iris sensor meant for mobile devices (e.g., IriShield MK2120U).
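The APCER/BPCER values and the d-prime separation referred to above can be computed from the score distributions as in the following sketch; the names are illustrative and the formulas are the standard definitions rather than code from the dissertation.

```python
import numpy as np

def apcer_bpcer(bonafide_scores, pa_scores, threshold):
    """APCER: fraction of PA samples misclassified as bonafide (score <= threshold).
       BPCER: fraction of bonafide samples misclassified as PA (score > threshold)."""
    apcer = np.mean(np.asarray(pa_scores) <= threshold)
    bpcer = np.mean(np.asarray(bonafide_scores) > threshold)
    return apcer, bpcer

def d_prime(bonafide_scores, pa_scores):
    """d-prime distance between the bonafide and PA score distributions."""
    b, p = np.asarray(bonafide_scores), np.asarray(pa_scores)
    return abs(p.mean() - b.mean()) / np.sqrt(0.5 * (p.var() + b.var()))

# Illustrative usage with synthetic scores.
rng = np.random.default_rng(1)
bona, pa = rng.beta(2, 8, 702), rng.beta(5, 3, 3507)
print(apcer_bpcer(bona, pa, threshold=0.4), d_prime(bona, pa))
```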
Table 2.4: D-NetPAD performance reported in terms of APCER and BPCER on all subsets of the LivDet-2017 dataset. The method is compared with three state-of-the-art algorithms in [304], which are the winners of the LivDet-2017 competition.

| | Clarkson | | Warsaw | | IIITD-WVU | | Notre Dame | | Averaged | |
| Algorithm | APCER | BPCER | APCER | BPCER | APCER | BPCER | APCER | BPCER | APCER | BPCER |
| CASIA | 9.61 | 5.65 | 3.4 | 8.6 | 23.16 | 16.1 | 11.33 | 7.56 | 11.88 | 9.48 |
| Anon1 | 15.54 | 3.64 | 6.11 | 5.51 | 29.4 | 3.99 | 7.78 | 0.28 | 14.71 | 3.36 |
| UNINA | 13.39 | 0.81 | 0.05 | 14.77 | 23.18 | 35.75 | 25.44 | 0.33 | 15.52 | 12.92 |
| Pre-trained D-NetPAD | 16.73 | 19.46 | 1.66 | 0.83 | 16.05 | 15.24 | 1.00 | 2.22 | 8.86 | 9.43 |
| Scratch D-NetPAD | 5.78 | 0.94 | 0 | 0.04 | 36.41 | 10.12 | 10.38 | 3.23 | 13.14 | 3.58 |
| Fine-tuned D-NetPAD | 2.99 | 2.97 | 0 | 0.54 | 1.88 | 8.84 | 0.33 | 0.27 | 1.3 | 3.15 |

Table 2.5: D-NetPAD performance reported in terms of the TDR (%) @ 0.2% FDR on different subsets of the LivDet-2017 dataset. Three models of D-NetPAD are generated by varying their training data.

| | Clarkson | Warsaw | | Notre Dame | | IIITD-WVU |
| Algorithm | Test | K. Test | U. Test | K. Test | U. Test | Test |
| Pre-trained D-NetPAD | 28.63 | 92.95 | 98.56 | 93.55 | 91.00 | 42.91 |
| Scratch D-NetPAD | 92.05 | 100 | 100 | 100 | 66.55 | 29.30 |
| Fine-tuned D-NetPAD | 93.51 | 100 | 100 | 100 | 99.77 | 48.85 |

Figure 2.5: Histograms of the three trained models of D-NetPAD on the IIITD-WVU test set. For accurate classification, there should be minimal overlap between the two (red and green) distributions. This plot indicates the efficacy of the fine-tuned D-NetPAD.

2.4.3 LivDet-2020 Dataset: Description and Results

The last dataset used for evaluation is the LivDet-2020 dataset [61]. The dataset is from the LivDet-Iris-2020 competition launched in May 2020 by five organizations: Clarkson University (USA), the University of Notre Dame (USA), the Warsaw University of Technology (Poland), the Idiap Research Institute (Switzerland), and the Medical University of Warsaw (Poland). The LivDet-2020 dataset differs from the previous editions ([304–306]) in that the organizers did not announce an official training set; only a testing set is provided. The testing set employed in the competition is a combination of data from three of the organizers: Clarkson University, the University of Notre Dame, and the Warsaw University of Technology. The dataset consists of 12,432 images (5,331 bonafide and 7,101 PA samples). The five presentation attack instrument (PAI) categories included in the dataset are printed eyes (1,049), cosmetic contact lenses (4,336), kindle/electronic displayed eyes (81), fake/prosthetic/printed eyes with add-ons (541), and cadaver eyes (1,094). The fake eyes with add-ons include five subcategories: cosmetic contacts on printed eyes, cosmetic contacts on doll eyes, clear contacts on printed eyes, eye domes on printed eyes, and doll eyes. Table 2.6 provides the number of images in each PAI category along with information on the sensors used to capture those images. Figure 2.6 shows a few samples from the LivDet-2020 dataset.

For the evaluation on the LivDet-Iris-2020 dataset, D-NetPAD is trained on the Combined dataset along with partial data from the Warsaw PostMortem v3 dataset [275] (1,200 cadaver iris images from the first 37 cadavers). The base architecture used is DenseNet161, which consists of four dense blocks and 161 convolutional layers.
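The D-NetPAD implementation itself is not reproduced here, but the following sketch shows how a torchvision DenseNet161 backbone can be adapted into a two-class (bonafide vs. PA) detector of the kind described above. The two-unit classification head and the 224 × 224 input size are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_pad_model(arch="densenet161", num_classes=2):
    """Adapt an ImageNet-pretrained DenseNet into a two-class PA detector."""
    backbone = getattr(models, arch)(pretrained=True)
    # DenseNet exposes its final fully connected layer as `classifier`.
    in_features = backbone.classifier.in_features
    backbone.classifier = nn.Linear(in_features, num_classes)
    return backbone

model = build_pad_model("densenet161").eval()
dummy = torch.randn(1, 3, 224, 224)                    # stand-in for a cropped iris region
pa_score = torch.softmax(model(dummy), dim=1)[:, 1]    # probability of the PA class
print(pa_score.item())
```

Swapping "densenet161" for "densenet121" or "densenet201" yields the other backbones used in this chapter.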
The evaluation measures used are APCER and BPCER. The APCER is calculated for the individual PA types as well as for the overall set of PAs. The proposed method is compared with the top three winners of the LivDet-2020 competition. Table 2.7 summarizes the results of all algorithms. The D-NetPAD outperforms the other methods by a large margin: it results in a weighted APCER of 2.76% at a BPCER of 1.61%, whereas the winner of the competition shows a 59.10% weighted APCER at a 0.46% BPCER. When considering the APCERs of individual PAs, attacks by cadaver eyes and fake eyes are easier to detect, whereas attacks by cosmetic contacts are the most challenging to detect. The high APCER on cosmetic contact PAs is also due to the unseen types of cosmetic contacts used in the testing set.

Table 2.6: Description of the test set of the LivDet-Iris-2020 dataset. It includes the number of images in each category and the sensors used to capture them.

| Class | Presentation Attack Instrument | Sample Count | Sensor |
| Bonafide | - | 5,331 | LG 4000, AD 100, Iris ID iCAM7000 |
| PA | Printed Eyes | 1,049 | Iris ID iCAM7000 |
| PA | Cosmetic Contact Lens | 4,336 | LG 4000, AD 100, Iris ID iCAM7000 |
| PA | Kindle Display | 81 | Iris ID iCAM7000 |
| PA | Fake/Prosthetic/Printed Eyes with Add-ons | 541 | Iris ID iCAM7000 |
| PA | Cadaver Iris | 1,094 | IriTech IriShield |

Table 2.7: D-NetPAD performance reported in terms of APCER and BPCER on the LivDet-2020 dataset. The results also include the APCER on the individual types of PAs. The method is compared with the winners of the LivDet-2020 competition. Here, PE is Printed Eyes; CL is Cosmetic Contact Lens; ED is Electronic Display; F/P is Fake/Prosthetic/Printed Eyes with Add-ons; and CI is Cadaver Iris.

| | APCER | | | | | Overall Performance | | |
| Algorithm | PE | CL | ED | F/P | CI | APCER (avg.) | BPCER | ACER |
| USACH/TOC | 23.64 | 66.01 | 9.87 | 25.69 | 86.10 | 59.10 | 0.46 | 29.78 |
| FraunhoferIGD | 14.87 | 72.80 | 53.08 | 19.04 | 0 | 48.68 | 11.59 | 30.14 |
| Competitor-3 | 72.64 | 43.68 | 83.95 | 73.19 | 89.85 | 57.8 | 40.31 | 49.06 |
| D-NetPAD | 2.38 | 3.85 | 1.23 | 0.18 | 0.18 | 2.76 | 1.61 | 2.18 |

Figure 2.6: Sample images of bonafide and PAs (print, kindle display, artificial eye, cosmetic contact, and cadaver eyes) from the LivDet-2020 dataset.

2.4.4 GCT5 and GCT6 Datasets: Description and Results

The Government Control Test (GCT) 5 and 6 datasets are also proprietary datasets collected under the IARPA Odin program (Presentation Attack Detection) [2]. Table 2.8 reports the performance of D-NetPAD on both datasets, along with the number of bonafide and PA images present in the training and testing sets. The base architecture of the GCT5 model is DenseNet161, whereas for GCT6 it is DenseNet201.

Table 2.8: D-NetPAD performance in terms of TDR at 0.2% FDR on the GCT5 and GCT6 datasets. The table also provides information about the training and testing data, along with the base architecture used in each model.

| | GCT5 model | GCT6 model |
| Base Architecture | DenseNet161 | DenseNet201 |
| Train | Combined Dataset, GCT3 (2,963 Bonafide, 352 PAs), Cross-sensor (9,606 Bonafide, 922 PAs), Warsaw Postmortem (2,400 PAs) | Combined Dataset, GCT3 (2,963 Bonafide, 352 PAs), Cross-sensor (9,606 Bonafide, 922 PAs), Warsaw Postmortem (2,400 PAs), GCT4 (337 Bonafide, 332 PAs), LivDet-2020 (5,315 Bonafide, 7,101 PAs), GCT6 Train (1,457 Bonafide, 598 PAs) |
| Test | GCT5 (1,354 Bonafide, 206 PAs) | GCT6 Test (3,112 Bonafide, 392 PAs) |
| TDR (%) @ 0.2% FDR | 95.63 | 99.23 |
| Threshold | 0.3839 | 0.4671 |

2.4.4.1 Failure Analysis

Subsequently, we analyze the failure cases that occur on the GCT5 data. There are ten misclassifications: one bonafide image and nine PA images. Figure 2.7 shows all the failure cases along with their Grad-CAM heatmaps. The Grad-CAM heatmaps provide information about the correctness of the iris segmentation and about the high-priority regions the network utilizes to make its decision. No segmentation errors occur in the failure cases.
The misclassified bonafide image contains circular artifacts in the iris region, possibly due to a soft transparent contact lens or the positioning of the light sources. The circular artifact resembles a cosmetic contact pattern and results in the misclassification. In the case of the PA misclassifications, three types of cosmetic contacts contribute to the errors: two are unknown cosmetic contacts (m6-009-0007-A77-1 and m6-009-0011-F40-1), and one is a known cosmetic contact, Extreme-SFX-Intrigue Brown (m6-009-0005-B44-1). The Extreme-SFX-Intrigue Brown cosmetic contacts were misclassified in the GCT3 results as well; the training data of the GCT5 model contains only a few images of this particular contact lens. In two misclassified PA images of the m6-009-0011-F40-1 contact, artifacts due to a face mask cause the misclassification.

We also match the entire PA images against their corresponding bonafide images to understand what portion of the underlying bonafide pattern is present in the PA samples. We use a commercially available high-performing iris matcher called VeriEye, whose threshold is 60 at 0.001% FMR. We observe that 82.52% (170/206) of the PA images produce match scores higher than 60, revealing an underlying iris pattern that is helpful for matching. Figure 2.8 shows a histogram of VeriEye match scores corresponding to correctly classified and misclassified PA images. The match scores of the misclassified PA images are high (more than 100) and lie in the right tail of the histogram. This indicates that these cosmetic contacts do not obscure the underlying iris pattern, which results in their misclassification. Figure 2.9 shows the misclassified PA images along with their bonafide counterparts and the corresponding match scores; all misclassified PA images show high match scores. We also plot the histogram of VeriEye match scores corresponding to the different types of cosmetic contacts (Figure 2.10). Most misclassifications occur on m6-009-0005-B44 cosmetic contact PA images (3/6), shown in brown in Figure 2.10; the second largest number of misclassifications occurs on m6-009-0007-A77 cosmetic contacts (4/28), shown in lime green; and the m6-009-0011-F40 cosmetic contact (2/32), shown in orange, is the next most misclassified type. All of these cosmetic contacts have high match scores, revealing the underlying iris patterns.

We perform a similar analysis on the test data of the Combined dataset. Figure 2.11 shows a histogram of VeriEye match scores corresponding to correctly classified and misclassified PA images. Here, we observe that 47.16% (75/159) of the PA images have match scores higher than 60 (0.001% FMR). There are five misclassified PA images, and four of the five match scores are higher than the threshold of 60. We also plot a histogram of match scores corresponding to the different types of cosmetic contacts (Figure 2.12); misclassifications mainly occur on the m6-009-0005-B44 cosmetic contact (5/23), shown in brown in Figure 2.12.

Figure 2.7: Failure cases on the GCT5 dataset. The first image is a misclassified bonafide image, and the other images are misclassified PA images. Three types of cosmetic contacts are misclassified: m6-009-0007-A77-1, m6-009-0011-F40-1, and m6-009-0005-B44-1. The threshold is 0.38.

Figure 2.8: Histogram of VeriEye match scores corresponding to correctly classified and misclassified PA images when matched with their bonafide images on the GCT5 data.

Figure 2.9: Misclassified PA images (bottom row) along with their bonafide images (top row) and their match scores from the VeriEye commercial iris matcher.

Figure 2.10: Histogram of VeriEye match scores corresponding to different cosmetic contact PA types on the GCT5 data.

Figure 2.11: Histogram of VeriEye match scores corresponding to correctly classified and misclassified PA images when matched with their bonafide images on the GCT3 data.

Figure 2.12: Histogram of VeriEye match scores corresponding to different cosmetic contact PA types on the GCT3 data.
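Given per-image match scores, the analysis above reduces to a threshold count and a pair of histograms, as in the sketch below. The scores here are placeholders rather than actual VeriEye outputs; the only value carried over from the text is the 0.001% FMR threshold of 60.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
correct = rng.gamma(shape=2.0, scale=40.0, size=197)        # correctly classified PAs (placeholder)
misclassified = rng.gamma(shape=6.0, scale=40.0, size=9)    # misclassified PAs (placeholder)

threshold = 60  # VeriEye threshold at 0.001% FMR
frac_above = np.mean(np.concatenate([correct, misclassified]) > threshold)
print(f"{frac_above:.2%} of PA images match their bonafide mate above the threshold")

plt.hist(correct, bins=30, alpha=0.6, label="correctly classified PA")
plt.hist(misclassified, bins=30, alpha=0.6, label="misclassified PA")
plt.axvline(threshold, linestyle="--", label="0.001% FMR threshold")
plt.xlabel("match score"); plt.ylabel("count"); plt.legend()
plt.savefig("match_score_hist.png")
```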
2.5 Explainability Analysis

2.5.1 Visualization Analysis

We visualize the results of the D-NetPAD using t-Distributed Stochastic Neighbor Embedding (t-SNE) [280] plots and Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps. We utilize the D-NetPAD model trained on the training set of the Combined dataset for this purpose, and use the samples in the JHU-APL03 test set to generate these visualizations.

The t-SNE helps in visualizing the features extracted from the D-NetPAD. It reduces the high-dimensional features extracted from the D-NetPAD to a lower dimension (two in our case), which is then used to construct a scatter plot. The architecture of the D-NetPAD consists of four Dense blocks. We capture the high-dimensional features at the end of each Dense block for visualization (Figure 2.13). For instance, the feature set captured at the end of Dense block 4 has a size of 1 × 1024 × 7 × 7, which is flattened to 1 × 50,176. The 50,176-dimensional row vector is then reduced to a two-dimensional vector. We draw three key observations from these plots:

1. The distributions of bonafide, artificial eye, and cosmetic contact features overlap after the initial Dense blocks, but separate for the later Dense blocks. As the depth of the network increases, the features of the different categories are better separated. This substantiates the high performance of D-NetPAD (Table 2.2).

2. The features of the different categories are sufficiently discriminated at the end of Dense block 4, which justifies the use of four Dense blocks in the architecture as opposed to three in [301].

3. The plots show two bonafide clusters, which correspond to the left and right eyes. The left and right irides exhibit differences due to the orientation of the upper and lower eyelids, the location of specular reflections, the relative position of the pupil center to the iris center, and background illumination variation. The D-NetPAD captures these variations in its features.

Figure 2.13: The architecture of D-NetPAD consists of four Dense blocks. We capture the features at the end of each Dense block, which are then visualized using t-SNE plots (shown below each Dense block). The two-dimensional features of bonafide, artificial eyes, and cosmetic contacts overlap in the initial layers, but get separated in the last layer. The two blue clusters in each category correspond to the left and right eyes.
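A minimal sketch of this visualization is given below, using forward hooks on a torchvision DenseNet121 to capture the output of each Dense block; the batch of random tensors stands in for pre-processed iris crops, and details such as the t-SNE initialization are assumptions.

```python
import torch
from torchvision import models
from sklearn.manifold import TSNE

model = models.densenet121(pretrained=True).eval()

# Capture the output of each Dense block with forward hooks.
features = {}
def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output.detach().flatten(1)   # (batch, C*H*W)
    return hook

for name in ["denseblock1", "denseblock2", "denseblock3", "denseblock4"]:
    getattr(model.features, name).register_forward_hook(make_hook(name))

# `images` would be a batch of pre-processed iris crops; random tensors are used here.
images = torch.randn(64, 3, 224, 224)
with torch.no_grad():
    model(images)

# Reduce each block's features to 2-D for a scatter plot (one plot per Dense block).
embeddings = {name: TSNE(n_components=2, init="pca").fit_transform(f.numpy())
              for name, f in features.items()}
print({k: v.shape for k, v in embeddings.items()})
```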
We further visualize the CNN activations using Grad-CAM [245] heatmaps. Grad-CAM produces a coarse localization map highlighting the salient regions in an image that were used by the network to generate its inference; these are the regions that produce high activations in the neural network. It is estimated using the gradient of a loss function, which is backpropagated through the convolutional layers to the input image [245]. Figure 2.14 presents the CNN activation heatmaps on bonafide, artificial eye, and cosmetic contact images taken from the JHU-APL03 test set. The last column represents the average heatmap of each category computed over the entire test set. The red regions indicate high activation, whereas the blue regions represent low activation. The first row of Figure 2.14 shows the heatmaps of bonafide sample images along with the average bonafide heatmap, where the high activation region is at the pupillary zone of the iris pattern. The second row of Figure 2.14 corresponds to the heatmaps of artificial eye images, where the focus is mainly on the left and right sub-regions of the iris. The last row shows the heatmaps of cosmetic contact images, where the focus is on the lower sub-region of the iris pattern. The average heatmaps show the distinctive regions of focus in each category, which helps in discriminating bonafide samples from PAs.

We quantify the Grad-CAM results by training a network on the Grad-CAM heatmaps of the three categories, in order to show the distinctiveness of the focused regions in each category. We utilize the DenseNet121 architecture as the backbone and train on a randomly selected 60% of the JHU-APL03 heatmaps from the Combined dataset. We repeat the experiment five times, and the top-1 accuracy is 97.90% ± 0.001. This performance validates our claim that the D-NetPAD focuses on distinctive regions for each category.

Figure 2.14: Grad-CAM [245] heatmaps corresponding to bonafide (first row), artificial eye (second row), and cosmetic contact (last row) images. The last column represents the average heatmap of each category. The heatmaps represent the regions of the image focused on by the D-NetPAD algorithm. Red-colored regions represent highly focused regions, whereas blue regions represent low-priority ones.
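For reference, the Grad-CAM computation can be sketched with PyTorch hooks as below. This is a generic implementation on a torchvision DenseNet121, not the exact code used to produce the D-NetPAD heatmaps.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Grad-CAM: capture the last Dense block's activations (forward hook) and their
# gradients (backward hook), then form a ReLU-weighted sum of the activation maps.
model = models.densenet121(pretrained=True).eval()
target_layer = model.features.denseblock4

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

image = torch.randn(1, 3, 224, 224)   # stand-in for a pre-processed iris crop
logits = model(image)
score = logits[0, logits.argmax()]    # class score whose evidence we want to localize
model.zero_grad()
score.backward()

weights = grads["g"].mean(dim=(2, 3), keepdim=True)        # global-average-pooled gradients
cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1] for display
```

Overlaying `cam` on the input image with a red-to-blue colormap yields heatmaps of the kind shown in Figure 2.14.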
2.5.2 Spatial Frequency Analysis

The iris is a highly textured pattern exhibiting numerous spatial frequencies. To understand which frequencies the D-NetPAD model has learned and how they impact iris PAD performance, we perform a spatial frequency analysis on the D-NetPAD model. We proceed under the assumption that the performance of the model is affected only by the manipulation of learned frequencies. We start by manipulating the higher frequencies, for two reasons. First, when we visually examine low- and high-pass filtered images (Figure 2.15), we observe that a high-pass filter (suppression of low frequencies) considerably obscures the iris pattern. Second, deep learning-based models learn low frequencies first (in the initial epochs) and then high frequencies (in later epochs) during training [213, 299]. In other words, the number of weight parameters contributing towards expressing low frequencies is larger than the number expressing high frequencies [213]. Due to this, a small manipulation of the low frequencies results in large shifts in performance. In the case of high frequencies, the more the architecture learns high frequencies, the more it tunes its parameters towards the intricacies of the training images, which may cause overfitting. So, the learning of high frequencies determines the effectiveness of model fitting on the training data (i.e., efficient fit or overfit).

Figure 2.15: Frequency analysis of an input iris (bonafide or PA) image. In the first row, the left-most image is the original image, the center image is a low-pass filtered image with a cutoff frequency of 20 (higher frequencies are suppressed), and the right-most is a high-pass filtered image with a cutoff frequency of 5 (lower frequencies are suppressed). The second row represents their corresponding Fourier transforms.

For high-frequency suppression, we use a low-pass filter with various cutoff frequencies. The cutoff frequency represents a radius from the center of the Fourier transform (second row of Figure 2.15). A low-pass filter passes frequencies below the cutoff frequency and attenuates higher frequencies. Figure 2.17 shows the performance of the D-NetPAD model, as well as the VGG19 and ResNet101 models, for various low-pass filter cutoff frequencies. We use the train and test sets of the Combined dataset for these experiments; the manipulation is applied only to the test images. There are two noteworthy observations. First, D-NetPAD shows a relatively lower drop in performance compared to the VGG19 and ResNet101 models. Second, the performance of the D-NetPAD model becomes steady beyond a cutoff frequency of 30, which implies that the model has not overfitted to frequencies beyond 30. Beyond a cutoff frequency of 60, the performance becomes constant, implying that the model has not learned from frequencies beyond 60.

Figure 2.17: The plot of TDR (%) @ 0.2% FDR against low-pass filter cutoff frequencies. Note the cutoff frequency beyond which the performance of D-NetPAD becomes stable (30 in this case). This cutoff frequency indicates that the D-NetPAD has not learned frequencies beyond it. The performance steadiness of D-NetPAD is better than that of VGG19 and ResNet101.

Another way of manipulating the high frequencies is to add them to the input images, which we do by contaminating the input images with salt-and-pepper noise. We also analyze the models when Gaussian noise (noise values that are Gaussian-distributed) is added to the input images. Figure 2.16 shows an example of an input image subject to high-frequency manipulation, (b)-(e), and the addition of Gaussian noise, (f). The performance is measured using the relative decrease in TDR (%) at 0.2% FDR. Table 2.9 provides the results of the VGG19, ResNet101, and D-NetPAD architectures when the input images are manipulated.

Figure 2.16: Different manipulations applied over the original input image (first image): low-pass filtered images with cutoff frequencies of 20, 30, and 50, additive salt-and-pepper noise, and additive Gaussian noise. Only test images are subject to these manipulations.

The D-NetPAD model shows a lower decrease in TDR compared to the VGG19 and ResNet101 models when the higher frequencies of the input images are manipulated (either by suppression or addition). The VGG19 and ResNet101 models have a large number of trainable parameters, which results in overfitting to the training data. The overfitted models learn the higher frequencies considerably well and are, therefore, more sensitive to them. On the contrary, the efficient learning of frequencies by the D-NetPAD makes it more robust to manipulations of the high frequencies and also substantiates its generalizability across PAs, sensors, and datasets. Gaussian noise randomly affects both lower and higher frequencies, resulting in a larger drop in the performance of all the networks, including D-NetPAD.
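The low-pass filtering used in these experiments amounts to masking the centered 2-D Fourier transform with a circular mask whose radius equals the cutoff frequency. A minimal NumPy sketch, with an illustrative synthetic image, is shown below.

```python
import numpy as np

def low_pass_filter(image, cutoff):
    """Suppress spatial frequencies beyond `cutoff` (a radius, in pixels, from the
    center of the shifted 2-D Fourier transform) and return the filtered image."""
    f = np.fft.fftshift(np.fft.fft2(image))
    rows, cols = image.shape
    y, x = np.ogrid[:rows, :cols]
    dist = np.sqrt((y - rows / 2) ** 2 + (x - cols / 2) ** 2)
    mask = dist <= cutoff                                  # circular low-pass mask
    filtered = np.fft.ifft2(np.fft.ifftshift(f * mask))
    return np.real(filtered)

# Example: a synthetic grayscale image filtered with the cutoffs used in Table 2.9.
img = np.random.rand(224, 224)
filtered = {c: low_pass_filter(img, c) for c in (20, 30, 50)}
```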
Table 2.9: Results (TDR and the relative decrease in TDR) for the VGG19, ResNet101, and D-NetPAD models when high frequencies are manipulated or Gaussian noise is applied to the input test images.

| | VGG19 | | ResNet101 | | D-NetPAD | |
| Input Test Images | TDR (%) @ 0.2% FDR | Relative Decrease in TDR (%) | TDR (%) @ 0.2% FDR | Relative Decrease in TDR (%) | TDR (%) @ 0.2% FDR | Relative Decrease in TDR (%) |
| Original Images | 96.26 | - | 96.88 | - | 98.58 | - |
| LowPass20 (Suppress high freq.) | 52.33 | 45.63 | 71.65 | 26.04 | 81.61 | 17.21 |
| LowPass30 (Suppress high freq.) | 86.60 | 10.03 | 88.47 | 8.68 | 94.39 | 4.25 |
| LowPass50 (Suppress high freq.) | 94.08 | 2.26 | 93.45 | 3.54 | 96.88 | 1.72 |
| Salt & Pepper (Add high freq.) | 74.14 | 22.97 | 68.22 | 29.58 | 80.99 | 17.84 |
| Gaussian Noise | 56.07 | 41.75 | 62.61 | 35.37 | 59.19 | 39.95 |

2.6 Deployment of D-NetPAD on Desktop and Mobile

We deploy the proposed D-NetPAD model on desktop as well as mobile platforms. In the desktop version, we capture iris images from the iCAM7000 iris sensor. The GPU used in the desktop is an Nvidia GeForce GTX 1070 with 8GB RAM. The D-NetPAD takes 0.037 sec to process a single image. Figure 2.18 shows screenshots of the desktop application.

Figure 2.18: Graphical User Interface (GUI) for three iris PA detectors developed by MSU, which include TL-PAD [46], the Fusion Method [114], and D-NetPAD [249]. The patch-wise heatmap and filter maps shown at the bottom of the GUI correspond to the Fusion Method.

For the mobile version, we capture both iris images (left and right) from the IriShield BK2121U binocular sensor. We utilize the DenseNet201 and MobileNetv2 architectures for mobile deployment. The training data used to train the models were collected by Clarkson University, taken from the Self-test5 and Self-test6 collections, and from the LivDet-Iris-2017 WVU subset. The training images are from the IriShield BK2121U iris sensor, except for the images from the LivDet-Iris-2017 WVU subset, which are from the IriShield MK2120U sensor. Table 2.10 provides the details of the training data.

Table 2.10: Description of the two architectures used to detect iris PAs on the mobile platform, along with their training data and computational efficiency.

| Architecture | Input Sensor | Time Taken (secs) | Training Data | Deployment Environment |
| DenseNet201 | IriShield BK2121U | 0.52 | Clarkson: 283 Bonafide, 183 Cosmetic Contacts, 1,131 Print; ST5: 216 Bonafide; ST6: 524 Bonafide; LivDet-2017 WVU: 702 Bonafide, 2,806 Print, 701 Cosmetic Contacts | Google Pixel 2, Octa-core, 4GB RAM |
| MobileNetv2 | IriShield BK2121U | 0.05 | (same as above) | (same as above) |

There are three options for model compression for mobile devices in the PyTorch library:
1. Dynamic quantization: weights are quantized ahead of time, but the activations are dynamically quantized during inference.
2. Static quantization: weights and activations are quantized, and calibration is required post-training.
3. Quantization-aware training: weights and activations are quantized, and the quantization numerics are modeled during training.

Dynamic quantization reduces the time complexity by a small margin without affecting the performance. Static quantization reduces the time complexity but also reduces the performance, and it requires calibration using the training data. The third option requires re-training of the model. Currently, we use dynamic quantization to compress the model for deployment on the mobile platform. The mobile platform we use is a Google Pixel 2; the time taken to process a single image is 0.52 sec for DenseNet201 and 0.05 sec for MobileNetv2, and the model loading time on the mobile platform is 0.67 secs. Figure 2.19 shows screenshots of the mobile application developed.

Figure 2.19: Screenshots of the Iris PA Detector app on a Google Pixel 2. The first image shows the screen on opening the app. The second image shows the results after capturing iris images from the IriShield USB BK2121U sensor.
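As a reference for the compression step described above, the sketch below applies PyTorch dynamic quantization and exports a TorchScript model. The MobileNetv2 backbone, the example input size, and the TorchScript export path are assumptions for illustration, not the exact pipeline used for the Pixel 2 app.

```python
import torch
import torch.nn as nn
from torchvision import models

# Dynamic quantization: weights are quantized ahead of time (here to int8 for the
# Linear layers), while activations are quantized on the fly at inference.
model = models.mobilenet_v2(pretrained=True).eval()
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# TorchScript the quantized model so it can be bundled with a mobile app.
example = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(quantized, example)
scripted.save("pad_mobile.pt")
```

Since dynamic quantization only replaces the Linear (and recurrent) layers, the convolutional backbone is left untouched, which is consistent with the modest speed-up noted above.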
2.7 Conclusion

We propose an effective and robust software-based iris PA detector called D-NetPAD. The D-NetPAD exploits the architectural benefits of DenseNet121. Experiments are performed on five datasets to assess its effectiveness. The test sets of these datasets correspond to cross-PA, cross-sensor, and cross-dataset scenarios, which measure the robustness of the D-NetPAD. We further explain the performance of the D-NetPAD using t-SNE plots, Grad-CAM heatmaps, and frequency analysis.

CHAPTER 3
IRIS PRESENTATION ATTACK DETECTION USING VISIBLE SPECTRUM VIDEO

3.1 Introduction

In this chapter, we present another iris presentation attack detection (PAD) method, which utilizes visible (VIS) spectrum scene video captured from a webcam. Here, the scene refers to the field-of-view of the webcam mounted over the iris sensor (Figure 3.3), capturing the user's interaction with the sensor. The scene video provides ancillary information such as human posture, actions, objects, human-object interactions, and their temporal changes. Our aim is to extract some of these cues from the multiple frames of the scene video using deep learning techniques. Capturing the video in the VIS spectrum provides information complementary to the iris image captured in the NIR spectrum (conventionally used in iris recognition). The use of a simple webcam makes the acquisition process inexpensive and convenient for users. The key contributions of the work are:

1. We propose a multi-frame analysis approach for detecting iris PAs from the scene video (VIS spectrum), which can be seamlessly incorporated into existing NIR image-based iris recognition systems.

2. We develop various spatial-temporal feature extraction techniques for analyzing the scene.

3. We collect a dataset, Iris Presentation Attack Videos (IPV), consisting of 672 iris bonafide and PA videos from 121 subjects, and perform experiments under three scenarios (intra-session, cross-session, and cross-attack).

4. We extend the multi-frame analysis approach to the detection of face PAs and perform experiments on three publicly available face PA datasets: SiW [165], SiW-M [166], and OULU-NPU [33]. Cross-modality experiments are also performed.

5. We also interpret the PA detection results using Grad-CAM [245] heatmaps. The Grad-CAM heatmap highlights the salient regions in the video that were used by the network to generate the inference.

Table 3.1: Description of video-based passive iris PA detection techniques.

| Authors | Hardware and Imaging | Algorithmic Details |
| Villalobos-Castaldi and Suaste-Gómez, 2014 [285] | IR video from custom imaging apparatus | Pupil dynamic features (hippus) |
| Kiran et al., 2015 [217] | VIS video from iPhone 5S and Nokia Lumia 1020, and NIR images | Laplacian pyramid decomposition followed by frequency responses at different orientations |
| Kiran et al., 2015 [214] | VIS video from iPhone 5S and Nokia Lumia 1020 | Enhanced Eulerian video magnification (EVM) |
| Thavalengal et al., 2016 [265] | VIS and NIR videos from custom mobile device | Multi-spectral features and multi-frame pupil localization |
| Our Method | VIS video from webcam and NIR images | Various variants of deep features to capture spatial-temporal information |

The rest of the chapter is organized as follows. Section 3.2 discusses the existing work for detecting iris and face PAs. Section 3.3 gives the details of the proposed approach. Section 3.4 describes the datasets used for the experiments. Section 3.5 provides the experimental setup and results. Section 3.6 provides a detailed analysis of the results. Finally, Section 3.7 concludes the chapter.
3.2 Related Work

A brief survey of software- and hardware-based iris presentation attack detection (PAD) techniques is provided in Section 2.2. Table 3.1 describes various video-based iris PA detection techniques that require no stimulation (i.e., passive techniques); these techniques are closely related to our acquisition setup.

In the face modality, there are several existing methods that focus on cues from the scene or context. Kim et al. [140] detect the liveness of a face by combining a background similarity score between a reference image and the input image (the region without a face and upper body) and a background motion index, which indicates the amount of motion in the background compared to the foreground. Anjos and Marcel [17] measure the correlation between the total amount of movement in the face and background regions. Later on, Anjos et al. [16] utilize optical flow for estimating the foreground/background motion correlation. Pan et al. [195] estimate the context information by comparing the difference of regions around fiducial points between a reference scene image and the input image using local binary pattern (LBP) descriptors; the context information is then combined with blinking information for liveness detection. Yan et al. [308] combine three cues, namely non-rigid motion, face-background consistency, and an image banding effect, for face PA detection. The non-rigid motion of a face (i.e., blinking) is estimated using low-rank matrix decomposition, the face-background consistency (the motion of the face with respect to the background) is calculated using GMM-based motion detection, and the image banding effect (imaging quality defects) is estimated using wavelet decomposition. Komulainen et al. [146] detect face PAs by fusing temporal information (using an MLP classifier) and texture information (using LBP) at the score level using linear logistic regression. In another work, Komulainen et al. [145] utilize a cascade of upper-body and spoofing-medium detectors based on histogram of oriented gradients (HOG) descriptors and linear support vector machines (SVMs). Patel et al. [199] integrate deep texture features and face movement cues (eye-blinking) for liveness detection; the deep texture features are learned from both aligned facial regions and the entire frame. Apart from these context-based face PA detection algorithms, other techniques can be found in [89, 221]. Various competitions and assessment reports are available in [34, 52, 178]. A detailed description of various iris and face PAD techniques is published in [172].

Figure 3.1: Scene video (VIS) and iris image (NIR) of bonafide and PA biometric samples captured by a simple webcam and an iris sensor simultaneously.

Figure 3.2: Different ways of presenting the same attack instrument (paper print) constitute different scenes. These scenes provide different cues for detecting PAs.

The proposed approach is advantageous over the existing solutions in the following ways: (a) the use of the entire frame discards the pre-processing routine (iris segmentation, or iris or face detection) and the errors introduced by it; (b) generally, the devices used in hardware-based techniques are expensive or inconvenient for users, whereas the webcam used in the proposed approach is cost-effective and non-obtrusive; in contrast with software-based techniques, which detect PAs after the acquisition of an image from the sensor, the proposed approach detects anomalies present in the scene simultaneously with the image acquisition; (c) the approach can easily extend to other
modalities (e.g., face or fingerprint) due to the similarity in the way the attacks are presented (e.g., print attacks in the face and iris modalities), as the results show for face PA detection. The approach has a limitation in detecting certain types of presentation attacks, e.g., cosmetic contact lenses (iris modality) and face makeup (face modality). It also fails in acquisition scenarios where scene information is not provided, for instance, the OULU-NPU [33] dataset.

Figure 3.3: The end-to-end architecture of the proposed framework.

3.3 Proposed Method

Figure 3.3 shows the architecture of the proposed framework. The hardware setup consists of a webcam mounted over a standard iris sensor. The webcam captures the scene video, and the iris sensor captures a NIR iris image. The iris image (I) and the scene video (V) then undergo different techniques to compute individual PA scores. A PA score is a confidence score that the given input is a PA; it is in the range [0, 1], where 1 represents high confidence that the input is a PA. To generate a PA score from the iris image (s_I), we adopt the CNN architecture proposed by Hoffman et al. [113], though other PA detection techniques can also be used. For computing the PA score from the scene video (s_V), we utilize several deep learning techniques (details are given below). Both PA scores are normalized and then averaged to obtain a final decision:

f(s_V, s_I) =
\begin{cases}
\text{Bonafide}, & \text{if } \dfrac{s_V + s_I}{2} \le T \\
\text{PA}, & \text{otherwise}
\end{cases}

where T is the threshold. We propose six different deep learning techniques to extract the spatio-temporal information contained in the multiple frames of a scene video and generate a PA score (s_V). We describe these techniques in the following subsections.

3.3.1 MLP

First, we capture only spatial information from the scene, as some of the PAs are strongly associated with the objects present in the scene, such as a paper print, kindle display, or artificial eye. We resize video frames to 80 × 80 and select 30 equally spaced frames per video. The resized video frames are input into a pre-trained Inception-v3 model (pre-trained on the ImageNet dataset [73]) to extract CNN features. The CNN features are then fed into a multi-layer perceptron (MLP) consisting of two hidden layers with 512 hidden units each and one softmax layer. We also use two dropout layers, one following each hidden layer, with a dropout value of 0.5. The training mini-batch size is 20. Training stops when there is no further reduction in the training loss. During testing, we estimate a final PA score for a video by summing the softmax scores obtained from 30 equally spaced frames. The method considers a video as a collection of independent frames, exploring only spatial information and ignoring the temporal information.

3.3.2 LSTM

To capture temporal information, we feed the same CNN features (Inception-v3) into a two-layer long short-term memory (LSTM) network instead of an MLP network. The first layer has 2048 hidden units, and the second layer is a fully connected layer with 512 nodes. There is a dropout of 0.5 following the fully connected layer. The mini-batch size and stopping criteria are the same as in the previous method. Temporal information is considered in this method, but only at the very end of the network.
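A minimal sketch of the MLP variant described above is given below: frozen Inception-v3 features feeding a two-hidden-layer MLP, with frame-level softmax scores summed into a video-level PA score. The 299 × 299 input size is Inception-v3's standard size and is used here for simplicity (the dissertation resizes frames to 80 × 80); training code is omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen Inception-v3 feature extractor (ImageNet weights); the final fc layer is
# replaced with an identity so the 2048-d pooled features are returned.
backbone = models.inception_v3(pretrained=True)
backbone.fc = nn.Identity()
backbone.eval()

# MLP head as described: two hidden layers of 512 units, dropout 0.5, two-class output.
mlp_head = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 2),   # logits; softmax applied at inference
)

# One video = 30 equally spaced frames (random tensors stand in for real frames).
frames = torch.randn(30, 3, 299, 299)
with torch.no_grad():
    feats = backbone(frames)                              # (30, 2048)
logits = mlp_head(feats)                                  # per-frame logits
video_pa_score = torch.softmax(logits, dim=1)[:, 1].sum() # sum of frame-level PA scores
print(video_pa_score.item())
```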
3.3.3 LRCN

To capture spatial and temporal information simultaneously, we use a long-term recurrent convolutional network (LRCN) [76], which is an end-to-end trainable architecture. The architecture has a small VGG-16-style network followed by one LSTM layer with 256 nodes and a final softmax layer. The training mini-batch size is 15, and the stopping criterion is the same as in the previous method. The method is trained from scratch, which requires a large amount of training data.

3.3.4 C3D

Another way to capture spatial-temporal information is a 3D ConvNet, where the third dimension corresponds to the temporal dynamics. To implement this, we use the architecture (C3D) proposed in [272], which consists of five 3D convolutional layers followed by 3D max-pooling layers, two fully connected layers, and a softmax layer. Unlike [272], we also use two dropout layers, one following each fully connected layer, with a value of 0.5 to prevent over-fitting on the small training data. The numbers of filters used in the five convolutional layers are 64, 128, 256, 512, and 512, all with the same filter size of 3 × 3 × 3. The fully connected layers have 4096 hidden units. All max-pooling layers have a kernel size of 2 × 2 × 2, except the first one, which has a kernel of size 1 × 2 × 2. The mini-batch size is 15, and the stopping criterion is the same as in the previous method. Again, due to the large number of parameters and training from scratch, it requires a large amount of training data.

3.3.5 3D ResNeXt-101

This architecture [298] also utilizes 3D convolutional layers, but it is a comparatively large architecture pre-trained on the Kinetics action recognition dataset [138]. The architecture introduces a new dimension called "cardinality" (the size of the set of transformations), in addition to depth and width. It consists of 101 convolutional layers depth-wise and 32 cardinalities. The input fed into the architecture is 16 equally spaced frames per video, resized to 112 × 112. The mini-batch size is 10, the number of epochs is 100, and the learning rate is 5 × 10−4.

3.3.6 Two-stream CNN Network

Due to the lack of large training data, we apply another architecture that captures spatial and temporal information separately, but in parallel. The architecture, called Two-stream CNN [254], decouples the scene video into spatial and temporal information by inputting it into two separate streams of ConvNets. The final score is obtained by averaging the softmax scores output by the two streams. Details of the two ConvNets are as follows.

3.3.6.1 Spatial ConvNet

The spatial stream performs PA detection utilizing RGB video frames. The backbone architecture is ResNet-101 pre-trained on the ImageNet dataset [73]. Simonyan and Zisserman [254] use a single RGB frame, whereas we select k equally spaced frames from the video and resize them to 224 × 224. We then input these frames into k separate ResNet-101 networks and combine their scores using an aggregation function (S) as:

S = \sum_{i=1}^{k} R(F_i; P)   (3.3.1)

where R(F_i; P) is the softmax score generated by the network with parameters P and frame F_i as input. The same parameters P are used in all k networks. Subsequently, the cross-entropy loss is calculated as

L(y, S) = -\sum_{i=1}^{C} y_i \left( S_i - \log \sum_{j=1}^{C} \exp S_j \right)   (3.3.2)

where C is the total number of classes and y_i is the true label of class i. The loss is back-propagated through the network and updates the parameters P. Though the spatial ConvNet is intended to capture spatial information, its loss calculation also captures a long-range temporal structure.
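The aggregation in Eqs. (3.3.1)-(3.3.2) can be sketched as follows, with a shared ResNet-101 applied to k frames and the summed class scores treated as the logits for the cross-entropy loss. The batch size and dummy inputs are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# Shared ResNet-101 applied to k frames; per-frame class scores are summed (Eq. 3.3.1)
# and the aggregate is trained with the cross-entropy loss of Eq. (3.3.2).
k, num_classes = 3, 2
net = models.resnet101(pretrained=True)
net.fc = nn.Linear(net.fc.in_features, num_classes)

frames = torch.randn(4, k, 3, 224, 224)   # (batch, k frames, C, H, W)
labels = torch.tensor([0, 1, 0, 1])       # 0 = bonafide, 1 = PA

# The same parameters P are shared across frames, so frames are folded into the batch.
b = frames.shape[0]
scores = net(frames.view(b * k, 3, 224, 224)).view(b, k, num_classes)
S = scores.sum(dim=1)                     # aggregation over frames (Eq. 3.3.1)
loss = F.cross_entropy(S, labels)         # Eq. (3.3.2), with S treated as logits
loss.backward()
```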
This aggregation strategy is motivated by the concept of Temporal Segment Networks (TSNs) [288], though the TSNs use a sequence of k snippets (sets of consecutive frames), whereas we use k frames. The backbone architecture in [288] is the Inception network with Batch Normalization [124], whereas we employ ResNet-101 as the backbone network. The aggregation functions used in [288] are maximum, averaging, and weighted averaging, whereas we use the sum as the aggregation function. We aggregate the loss estimated from all the selected frames and then update the parameters. We could instead update the parameters using a single frame at a time, which also increases the amount of data used for training but loses the temporal information; this concept is already utilized in the MLP method. We utilize separate networks for individual frames. An alternative option is to feed the frames as multiple channels into a single network, but this increases the number of trainable parameters; the C3D and 3D ResNeXt-101 methods use this concept. We experiment with different numbers of frames per video (k) for training and empirically select k = 3. During testing, we select 20 equally spaced frames and combine their corresponding softmax scores to obtain the final score.

3.3.6.2 Temporal ConvNet

The temporal stream utilizes a stack of optical flows [41] for PA detection. During training, we input 20 optical flow frames from a video into a single ResNet-101 network as multiple channels, where 10 frames correspond to the X-direction and 10 correspond to the Y-direction [254]. Figure 3.4 shows RGB (first row), X-direction optical flow (middle row), and Y-direction optical flow (last row) frames corresponding to bonafide and PA samples. The optical flow frames are also resized to 224 × 224. During testing, we randomly select 10 frames and calculate their 10 X-direction and 10 Y-direction optical flow frames to feed into the network. We average the 20 softmax scores to obtain the final decision for the video. This architecture helps in working with small-sized training data, but it has a high time complexity for computing the optical flow frames and a large memory requirement for storing them on disk.

Figure 3.4: Inputs given to the Two-stream CNN network. The top row shows spatial frames, the middle row represents optical flow frames in the X-direction, and the bottom row shows optical flow frames in the Y-direction. (a) corresponds to bonafide video frames, and (b) corresponds to PA video frames.

Both streams use a mini-batch size of 15, the number of epochs is 100, and the learning rate is 0.0005. We select all hyperparameters empirically.
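A sketch of building the 20-channel optical-flow input for the temporal stream is shown below. OpenCV's Farneback flow is used here as a stand-in for whichever dense optical flow method of [41] was actually employed, and the frame data are dummies.

```python
import cv2
import numpy as np

def optical_flow_stack(frames):
    """Build a 20-channel optical-flow input: 10 X-direction and 10 Y-direction
    flow fields computed from 11 consecutive frames, each resized to 224 x 224."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows_x, flows_y = [], []
    for prev, nxt in zip(gray[:-1], gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows_x.append(cv2.resize(flow[..., 0], (224, 224)))
        flows_y.append(cv2.resize(flow[..., 1], (224, 224)))
    return np.stack(flows_x + flows_y, axis=0)   # shape: (20, 224, 224)

# 11 consecutive BGR frames read from the scene video (dummy data here).
frames = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(11)]
stack = optical_flow_stack(frames)
print(stack.shape)
```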
3.4 Datasets

To evaluate iris PA detection, we introduce a proprietary dataset called Iris Presentation Attack Videos (IPV). Existing iris PA video datasets focus solely on the iris region and do not capture the scene information, which necessitates the collection of a new iris video dataset containing scene information. For face PA detection, we utilize three publicly available datasets: SiW [165], SiW-M [166], and OULU-NPU [33]. Details of all four datasets are as follows.

3.4.1 IPV Dataset

We collect the dataset in three sessions with different locations, operators, environments, and timings using a Logitech C920 webcam. Figure 3.3 shows the acquisition setup, where a webcam is mounted on an IrisID iCAM 7000 sensor. The recording of a video from the webcam starts automatically when the IrisID sensor instructs the user to align their eyes with the sensor, and it stops upon the capture of an iris image by the IrisID sensor. Videos are approximately 4-5 seconds long with a frame rate of 30 frames/sec.

The first session of the dataset was collected in lab1 and is termed IPV1. The second session was collected in lab2, five months after the first collection, and is termed IPV2. The two labs (lab1 and lab2) are at different locations and thus have different acquisition environments. The third session of data collection was conducted in lab1 again, after another three months, and is termed IPV3. The subjects of IPV2 are disjoint from the subjects of IPV1 and IPV3. The IPV1 session contains videos of only one subject to ensure that the PA detection techniques focus on the characteristics of bonafide or PA samples rather than the identity information of a user. The different types of PAs and their corresponding numbers of collected videos are: paper print (74), artificial eye (156), kindle display (51), funny glasses (166), and mannequin attacks (28). Table 3.2 provides a further description of the collected dataset.

The dataset has large variations in terms of the PA materials used and the different ways of presenting the PAs. For paper print PAs, we use two different paper types (glossy and matte) and two different types of prints (with and without the pupil cut out). We also use a transparent dome in some paper print PAs to mimic the shape of, and specular reflections from, an eye. The artificial eye PAs involve four different materials (plastic, glass, prosthetic, and rubber). We also create funny glasses PAs, where artificial eyes and paper print PAs are mounted over funny glasses. The mannequin PAs involve two different materials (plastic and polystyrene), mounted with paper prints and artificial eyes. More variation is introduced in the dataset by alternating between the use of one eye, the other eye, or both eyes to present a PA. Figure 3.5 shows a few variations of the dataset.

Table 3.2: Description of the dataset collected for multi-frame analysis on scene videos captured from a regular webcam.

| Session | IPV1 | IPV2 | IPV3 |
| No. of Bonafide Videos | 17 | 69 | 111 |
| No. of PA Videos | 67 | 20 | 388 |
| No. of Subjects | 1 | 80 | 42 |
| Type of PAs Collected | Paper print, Artificial eye, Kindle display | Funny glasses, Paper print, Artificial eye, Kindle display | Funny glasses, Mannequin |
| Acquisition Time Period | October 2017 | April 2018 | August 2018 |

Figure 3.5: Columns show intra-class variations among different PAs using a single frame. Paper print PA variations: one or both eyes are used for presenting the iris PA. Artificial eye PA variations: different materials are used, e.g., a glass, plastic, prosthetic, or rubber eye. Kindle PA variations: different sizes and locations of an iris image on the Kindle display. Funny glasses PA variations: plastic eyes or paper prints are mounted over the funny glasses. Mannequin PAs: two different materials are used, with prints or plastic eyes mounted over them.

3.4.2 SiW Dataset

The Spoof-in-the-Wild (SiW) [165] dataset contains bonafide and spoof videos of 165 subjects. There are 8 bonafide and up to 20 spoof videos for each subject. The dataset is collected in four sessions with different PIE variations. The videos are captured using two high-quality cameras: a Canon EOS T6 and a Logitech C920 webcam. The videos are 15 seconds in length, have a 30 fps frame rate, and are of 1080p HD resolution. The dataset provides two print and four replay video attacks for each subject. For the print attacks, images of two qualities (5184 × 3456 and 1920 × 1080) are printed using an HP Color LaserJet M652 printer. To generate the replay video attacks, four spoof media (Samsung Galaxy S8, Apple iPhone 7, Apple iPad Pro, and PC Asus MB168B) are used.
Figure 3.7 (second block) shows a few samples from the SiW dataset.

3.4.3 SiW-M Dataset

The Spoof-in-the-Wild database with Multiple Attack Types (SiW-M) [166] is built to benchmark face PA detection algorithms on unseen attacks (the cross-attack scenario). There are a total of 1,630 videos of 493 subjects with 13 different spoof attacks. The videos are 5-7 seconds in length, have a 30 fps frame rate, and are of 1080p HD resolution. The videos are recorded using a Logitech C920 webcam and a Canon EOS T6 in three sessions. The spoof attacks included in the dataset are five 3D mask attacks, three partial attacks, three makeup attacks, one replay attack, and one print attack. The 3D mask attacks include half, silicone, transparent, paper-craft, and mannequin masks. The makeup attacks constitute obfuscation, impersonation, and cosmetic makeup. The partial attacks include funny eye, paper glasses, and partial paper attacks. Figure 3.7 (first block) shows a few samples from the dataset.

3.4.4 OULU-NPU Dataset

The OULU-NPU [33] dataset is built to assess the generalizability of face PAD techniques in mobile scenarios. The dataset consists of a total of 4,950 bonafide and PA videos of 55 subjects. The videos were recorded using the front cameras of six mobile devices (Samsung Galaxy S6 edge, HTC Desire EYE, MEIZU X5, ASUS Zenfone Selfie, Sony XPERIA C5 Ultra Dual, and OPPO N3) in three sessions with different illumination conditions and locations. The presentation attacks included in the dataset are print and replay attacks, created using two different printers and two different display devices, respectively. During the capture, special attention was given to avoiding background scene differences between the bonafide and PA videos; consequently, the print and replay attack videos do not contain the bezels of the screens or the edges of the prints. Figure 3.7 (third block) shows a few video frame samples from the dataset.

3.5 Experimental Results and Analysis

To analyze the effectiveness of the proposed approaches, we perform various experiments to detect PAs in the iris and face modalities. For the iris modality, we conduct experiments in the intra-session, cross-session, and cross-attack scenarios on the IPV dataset. We also conduct a baseline experiment in the iris modality, where the complementary nature of the scene cues is evaluated against cues obtained from the iris region. For the face modality, we perform face PA detection experiments on the SiW [165], SiW-M [166], and OULU-NPU [33] datasets. Finally, we perform cross-modality PA detection experiments, where training is on iris PAs and testing on face PAs, and vice versa.

We report results in terms of the Average Classification Error Rate (ACER), which is the average of the Attack Presentation Classification Error Rate (APCER) and the Bonafide Presentation Classification Error Rate (BPCER). APCER is the proportion of PA samples misclassified as bonafide, whereas BPCER is the proportion of bonafide samples misclassified as PAs.
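For completeness, the sketch below computes APCER, BPCER, and ACER from binary ground-truth labels and predictions; it mirrors the definitions above rather than any specific evaluation script.

```python
import numpy as np

def acer(labels, predictions):
    """ACER from binary ground truth and predictions (1 = PA, 0 = bonafide).
    APCER: fraction of PA samples predicted as bonafide.
    BPCER: fraction of bonafide samples predicted as PA."""
    labels, predictions = np.asarray(labels), np.asarray(predictions)
    apcer = np.mean(predictions[labels == 1] == 0)
    bpcer = np.mean(predictions[labels == 0] == 1)
    return apcer, bpcer, (apcer + bpcer) / 2.0

# Tiny example: APCER = 1/3, BPCER = 1/2, ACER ~ 0.4167.
print(acer([1, 1, 1, 0, 0], [1, 0, 1, 0, 1]))
```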
3.5.1 Iris Modality

To evaluate the proposed approaches, we perform 12 experiments on the IPV dataset in three settings: the intra-session, cross-session, and cross-attack scenarios. Experiments 01-05 correspond to the intra-session scenario, experiment 06 to the cross-session scenario, and experiments 07-11 to the cross-attack scenario. One more experiment (Exp. 12) is performed, in which the PA score generated from a scene video is fused with the one proposed in [113] (which uses only the iris image) to show the complementary nature of the two cues.

3.5.1.1 Intra-session

In experiments 01-05, we select training and testing data from all three sessions to analyze the intra-session scenario. There are 150 videos from each category for training, and the rest are used for testing. Table 3.3 provides the details of the selection of videos in each session. Table 3.5 (columns 02-06) presents the results of all intra-session experiments. Two-stream CNN is an average fusion of the Spatial and Temporal ConvNets. The Spatial ConvNet performs the best; LSTM and MLP also produce comparable ACERs. These three methods capture spatial information and work on a pre-trained model. The other methods (LRCN, C3D, and Temporal ConvNet) do not perform as well. These models are trained from scratch, except the Temporal ConvNet, which uses a pre-trained model trained on RGB images instead of optical flow frames. Due to the small size of the collected data, the training of these methods is prone to over-fitting, as there is a large number of trainable parameters in the networks. 3D ResNeXt-101 follows the same concept as C3D, except that it is a large pre-trained model; as a result, it performs better than C3D.

3.5.1.2 Cross-session

Experiment 06 aims to analyze the cross-session scenario. In this experiment, training is performed on the data collected during the IPV3 session, and testing on the data collected during the IPV1 and IPV2 sessions. Table 3.3 provides further details about experiment 06, and Table 3.5 (column 07) presents its results. This is a difficult test condition, as one must account for variations in the data acquisition environment, subject population, and PA generation procedures. The ACER increases for every method. However, the MLP, LSTM, 3D ResNeXt-101, and Spatial ConvNet methods manage to perform reasonably well. On the other hand, the lack of training data and the adverse test scenario degrade the performance of the C3D, LRCN, and Temporal ConvNet methods drastically.

3.5.1.3 Cross-attack

To further evaluate the proposed approaches in a cross-attack scenario, we conduct five more experiments (Exps. 07-11) based on a leave-one-out strategy. The strategy trains the methods on all types of PAs except one, which is held out for testing as an unseen attack. Table 3.4 provides the training and testing setup of these experiments. This is an even more difficult testing condition than the cross-session scenario. However, Table 3.5 (columns 08-12) shows reasonably good ACERs for almost all methods, except the C3D method and a few PAs. The proposed approaches reliably detect unseen PAs, as they capture not only the characteristics of the PA material but also its presentation and other contextual information. The 3D ResNeXt-101 method fails to detect unseen artificial eye and funny glasses attacks, and the LSTM method fails to detect unseen funny glasses attacks; this could be due to the presence of bonafide videos in which users wear corrective eyeglasses. The mannequin attack is a difficult unseen attack for the majority of the techniques, whereas paper print is the simplest. Overall, the Two-stream CNN method performs the best in the cross-attack scenario. As another observation, temporal information is modeled better in this scenario (compared to the intra-session and cross-session scenarios), as can be seen from the results of LRCN and Temporal ConvNet. In the leave-one-out strategy, there are more samples for training compared to the intra-session and cross-session experimental setups.
This also supports our over-fitting hypothesis for the C3D, LRCN, and Temporal ConvNet methods, which explains their poor results under the intra-session and cross-session scenarios.

Table 3.3: Training and testing setup for the intra-session (Exp. 01-05) and cross-session (Exp. 06) experiments on the IPV dataset.

| | | IPV1 | | IPV2 | | IPV3 | |
| Experiments | Category | Train | Test | Train | Test | Train | Test |
| Exp. 01-05 | Bonafide | 10 | 7 | 50 | 19 | 90 | 21 |
| Exp. 01-05 | PA | 15 | 17 | 10 | 10 | 125 | 219 |
| Exp. 06 | Bonafide | 0 | 17 | 0 | 69 | 111 | 0 |
| Exp. 06 | PA | 0 | 32 | 0 | 20 | 111 | 0 |

Table 3.4: Training and testing setup for the cross-attack (Exp. 07-11) and baseline (Exp. 12) experiments on the IPV dataset.

| Experiments | Unseen PA | Category | Train | Test |
| Exp. 07 | Paper print | Bonafide | 197 | 35 |
| Exp. 07 | Paper print | PA | 401 | 74 |
| Exp. 08 | Artificial eye | Bonafide | 197 | 35 |
| Exp. 08 | Artificial eye | PA | 319 | 156 |
| Exp. 09 | Kindle display | Bonafide | 197 | 35 |
| Exp. 09 | Kindle display | PA | 424 | 51 |
| Exp. 10 | Funny glasses | Bonafide | 197 | 35 |
| Exp. 10 | Funny glasses | PA | 309 | 166 |
| Exp. 11 | Mannequin | Bonafide | 197 | 35 |
| Exp. 11 | Mannequin | PA | 447 | 28 |
| Exp. 12 | N/A | Bonafide | 99 | 89 |
| Exp. 12 | N/A | PA | 280 | 136 |

3.5.1.4 Baseline Experiments

From the results obtained in all three scenarios, one can deduce that the video of a scene does contain cues for detecting PAs in the iris modality and that these cues generalize across unseen attacks. We perform another experiment (Exp. 12) to examine the complementary nature of the scene cues with respect to cues from the iris region, by performing a score-level fusion of the cues obtained from the iris region and the scene video. One PA score is obtained from the existing iris PA detection technique [113], trained on the BERC-IF dataset [154] using a single iris image, and the other PA score is obtained from the proposed methods applied to the corresponding scene video. For testing, there are 89 bonafide videos along with their corresponding 178 bonafide iris images (89 × 2), and 136 PA videos along with their 177 iris images (41 × 2 + 95), where either or both iris images are PAs. Table 3.5 (last column) shows the results of the fusion. The ACER calculated using only the iris region for PA detection [113] is 10.9%, whereas when it is combined with the scene cues, it reaches 0% (fusion with the MLP method), thus demonstrating the complementary nature of the cues provided by the scene video. Figure 3.6 shows the ACERs of all methods across all experiments for better visualization.

Table 3.5: ACER (%) of the proposed methods across all experiments (Exp. 01-12) on the IPV dataset. Experiments 1-5 are intra-session, 6 is cross-session (tested on IPV01-02), 7-11 are cross-attack with the unseen PA indicated, and 12 is the baseline fusion experiment.

| Method | Exp. 1 | Exp. 2 | Exp. 3 | Exp. 4 | Exp. 5 | Exp. 6 (IPV01-02) | Exp. 7 (Print) | Exp. 8 (Artificial) | Exp. 9 (Kindle) | Exp. 10 (Funny Glasses) | Exp. 11 (Mannequin) | Exp. 12 |
| MLP | 0.61 | 0.61 | 0.61 | 0.20 | 0.20 | 3.66 | 0.0 | 3.01 | 2.85 | 20.00 | 18.57 | 0.0 |
| LSTM | 0.20 | 0.61 | 1.47 | 0.20 | 3.67 | 6.35 | 0.0 | 5.12 | 6.77 | 30.73 | 10.00 | 1.29 |
| LRCN | 15.57 | 10.94 | 29.87 | 13.93 | 14.68 | 40.78 | 4.10 | 3.52 | 8.82 | 0.30 | 20.71 | 11.10 |
| C3D | 16.66 | 23.68 | 5.81 | 25.49 | 6.93 | 37.90 | 14.67 | 46.31 | 10.70 | 13.20 | 24.28 | 10.50 |
| 3D ResNeXt-101 | 3.50 | 0.81 | 4.41 | 3.09 | 3.75 | 7.93 | 0.0 | 18.56 | 5.79 | 35.22 | 23.57 | 13.63 |
| Spatial ConvNet | 0.0 | 0.20 | 0.0 | 0.0 | 0.20 | 0.96 | 0.0 | 3.20 | 0.0 | 10.54 | 12.50 | 0.55 |
| Temporal ConvNet | 10.49 | 10.94 | 5.47 | 5.67 | 4.97 | 21.62 | 0.0 | 12.76 | 22.12 | 18.07 | 11.78 | 26.31 |
| Two-stream CNN | 3.25 | 6.08 | 1.26 | 3.14 | 1.62 | 2.12 | 0.0 | 4.31 | 0.0 | 10.54 | 6.07 | 6.41 |
| Cross-Modality | 5.88 | 3.75 | 4.16 | 9.63 | 10.90 | 11.78 | 6.15 | 19.84 | 3.38 | 26.95 | 24.64 | - |

Figure 3.6: Comparison of ACERs of (a) the intra-session experiments (Exp. 01-05), (b) the cross-session experiment (Exp. 06), (c) the cross-attack experiments (Exp. 07-11), and (d) the baseline experiment (Exp. 12) on the IPV dataset.

3.5.2 Face Modality

After the successful detection of iris PAs from scene video, we extend its use to detecting face PAs on three publicly available datasets: SiW [165], SiW-M [166], and OULU-NPU [33]. The SiW [165] and SiW-M [166] datasets do contain scene (contextual) information, whereas in the OULU-NPU [33] dataset special attention was given to avoiding scene information. Due to the computational complexity incurred in estimating optical flows for such large datasets, we did not analyze the Temporal ConvNet method for detecting face PAs.
12 (IPV01-02) (Print) (Artificial) (Kindle) (Funny Glasses) (Mannequin) MLP 0.61 0.61 0.61 0.20 0.20 3.66 0.0 3.01 2.85 20.00 18.57 0.0 LSTM 0.20 0.61 1.47 0.20 3.67 6.35 0.0 5.12 6.77 30.73 10.00 1.29 LRCN 15.57 10.94 29.87 13.93 14.68 40.78 4.10 3.52 8.82 0.30 20.71 11.10 C3D 16.66 23.68 5.81 25.49 6.93 37.90 14.67 46.31 10.70 13.20 24.28 10.50 3D ResNeXt-101 3.50 0.81 4.41 03.09 3.75 7.93 0.0 18.56 5.79 35.22 23.57 13.63 Spatial ConvNet 0.0 0.20 0.0 0.0 0.20 0.96 0.0 3.20 0.0 10.54 12.50 0.55 Temporal ConvNet 10.49 10.94 5.47 5.67 4.97 21.62 0.0 12.76 22.12 18.07 11.78 26.31 Two-stream CNN 3.25 6.08 1.26 3.14 1.62 2.12 0.0 4.31 0.0 10.54 6.07 6.41 Cross-Modality 5.88 3.75 4.16 9.63 10.90 11.78 6.15 19.84 3.38 26.95 24.64 - complexity incurred in estimating optical flows for such large datasets, we did not analyze the Temporal ConvNet method for detecting face PAs. 3.5.2.1 Results on SiW dataset The SiW [165] dataset provides three evaluation protocols along with a baseline method. Training and testing performed on the disjoint set of subjects in all three protocols. Training performs on 90 subjects and testing on rest (75 subjects). Protocol 1 evaluates the generalizability of algorithms under different face poses and expressions by considering only the first 60 frames for training (frontal view) and rest for testing. Protocol 2 represents the scenario of cross-medium of the same spoof type (replay attack). Training is on three replay attack media and tested on the fourth 68 Table 3.6: ACER (%) for all methods on the SiW [165] dataset. The ACER values outperforms the baseline [165] are shown in bold. Protocol Subset Subject# Attack Auxiliary [165] MLP LSTM LRCN C3D 3D ResNeXt-101 Spatial ConvNet Train 90 First 60 Frames 1 3.58 0.034 0.0835 0.569 0.725 1.412 0 Test 75 All Train 90 3 display 2 0.57 ± 0.69 0.083 ± 0.118 2.216 ± 3.620 2.031 ± 2.873 0.523 ± 0.574 0.208 ± 0.259 0±0 Test 75 1 display Train 90 print (display) 3 8.31 ± 3.81 11.38 ± 15.98 9.865 ± 13.951 1.063 ± 0.363 0.1849 ± 0.081 22.561 ± 0.241 2.0399 ± 0.246 Test 75 display (print) medium. The protocol uses a leave-one-out strategy and reports the mean and standard deviation of four experiments. Protocol 3 represents the scenario of cross-attack, where training is on print attacks and testing on replay attacks and vice versa. Table 3.6 presents results (ACER) on all three protocols. The proposed methods compared with the algorithm specified in the work [165]. For protocol 1, all scene-based methods outperform the baseline [165]. Scene information is invariant to the variations of face pose and expression when used for detecting PAs. For protocols 2 and 3 as well, cues from the entire scene are more crucial than cues from just facial region in detecting face PAs. The C3D and Spatial ConvNet methods perform the best on this dataset. The issue of limited training data gets resolved for the C3D method on this dataset. 3.5.2.2 Results on SiW-M dataset To further analyze the role of scene information in detecting the face PAs in the unseen attack scenario, we perform experiments on the SiW-M dataset [166]. The dataset specified 13 experi- mental splits for evaluating the performance on each presentation attack following the leave-one-out strategy. For each experiment split, training performed on 12 types of spoof attacks and 80% of the bonafide videos and testing on one left attack type and 20% of bonafide videos. There is no overlapping of subjects between the training and testing sets of bonafide videos. 
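The numbers reported in Tables 3.5-3.8 are ACER values computed from the PA scores produced by each method. As a reference, the following minimal sketch shows how APCER, BPCER, and ACER can be computed from a set of scores, assuming the standard definitions (APCER: proportion of PA samples misclassified as bonafide; BPCER: proportion of bonafide samples misclassified as PAs). The function name and the 0.5 threshold convention are illustrative and are not taken from the actual evaluation code.

```python
import numpy as np

def pad_error_rates(scores, labels, threshold=0.5):
    """Compute APCER, BPCER, and ACER (all in %) from PA scores.

    scores    : PA scores in [0, 1]; higher means more likely a PA
    labels    : 1 for PA samples, 0 for bonafide samples
    threshold : score above which a sample is declared a PA
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pa, bonafide = labels == 1, labels == 0

    # APCER: PA samples wrongly accepted as bonafide (score <= threshold)
    apcer = np.mean(scores[pa] <= threshold) * 100
    # BPCER: bonafide samples wrongly rejected as PAs (score > threshold)
    bpcer = np.mean(scores[bonafide] > threshold) * 100
    acer = (apcer + bpcer) / 2.0
    return apcer, bpcer, acer

# Toy example with a handful of detector scores
apcer, bpcer, acer = pad_error_rates(
    scores=[0.9, 0.8, 0.2, 0.1, 0.05], labels=[1, 1, 1, 0, 0])
print(f"APCER={apcer:.1f}%  BPCER={bpcer:.1f}%  ACER={acer:.1f}%")
```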
Table 3.7 presents the results (ACER) of all 13 experimental splits. The proposed scene-based methods compared with SVM-RBF + LBP [33], Auxiliary [165], and Deep Tree Learning [166] algorithms. All scene-based methods outperform methods proposed in [165] and [33] when looking at the average (last column). Except for the C3D and LRCN 69 Table 3.7: ACER (%) for all methods on the SiW-M [166] dataset. Mask Attacks Makeup Attacks Partial Attacks Methods Replay Print Average Half Silicone Trans. Paper Manne Obfusc. Imperson. Cosmetic Funny Eye Paper Glasses Partial Paper SVM-RBF + LBP [33] 20.6 18.4 31.3 21.4 45.5 11.6 13.8 59.3 23.9 16.7 35.9 39.2 11.7 26.9 ± 14.5 Auxiliary [165] 16.8 6.9 19.3 14.9 52.1 8.0 12.8 55.8 13.7 11.7 49.0 40.5 5.3 23.6 ± 18.5 Deep Tree Learning [166] 9.8 6.0 15.0 18.7 36.0 4.5 7.7 48.1 11.4 14.2 19.3 19.8 8.5 16.8 ± 11.1 MLP 6.77 4.35 8.24 16.03 11.66 0.76 1.53 26.75 2.35 10.84 2.23 5.43 1.15 7.54 ± 7.13 LSTM 5.24 5.0 13.79 20.44 12.18 0.76 1.92 26.88 4.04 14.61 3.17 9.73 1.15 9.14 ± 7.76 LRCN 7.92 17.39 12.61 31.66 34.5 4.86 18.65 45.11 5.58 23.92 16.25 31.58 1.15 19.32 ± 2.81 C3D 9.85 16.7 11.47 37.74 27.73 4.86 13.46 41.27 6.45 22.15 15.62 30.1 0.38 18.29 ± 12.22 3D ResNeXt-101 13.38 10.71 10.58 21.57 29.11 2.94 8.15 45.95 5.89 27.61 23.59 27.24 9.55 13.38 ± 1.75 Spatial ConvNet 5.61 1.61 11.59 18.51 24.58 0.00 0.00 28.89 0.81 6.76 17.45 14.32 0.00 10.01 ± 9.62 Cross-Modality 24.32 10.94 28.70 19.72 36.26 11.83 3.05 46.56 24.94 34.32 10.30 16.62 1.14 20.66 ± 12.96 methods, all other scene-based methods also outperform [166]. Considering individual unknown attacks, MLP and Spatial ConvNet methods show promising results under a cross-attack or unknown attack scenario. Results on SiW [165] and SiW-M [166] datasets show that cues from the entire image are more effective in detecting face PAs as it contains cues from the facial region as well as background region. 3.5.2.3 Results on OULU-NPU dataset The OULU-NPU dataset specified four evaluation protocols. Data is divided into three subject- disjoint subsets named training, development, and testing. Protocol 1 assesses the face PA detection algorithms under unseen illumination and location. The training uses data of Sessions 1 and 2, and testing is on data of session 3. Protocol 2 evaluates the effect of using different presentation attack instruments (PAI) in print and replay attacks. Training is on one type of print and replay attack and testing on another type of print and replay attacks. Protocol 3 analyses the effect of the input sensor variations on PA detection algorithms using the leave-one-out strategy. There are six sensors used to capture the data. The training performs on videos of five sensors and testing on the remaining one. Protocol 4 combines all three challenges and evaluates the generalizability of face PA detection methods under unseen environmental conditions, PAIs, and input sensors. Table 3.8 presents results (ACER) on all four protocols. The proposed scene-based methods are compared with SVM-RBF + LBP [33], Auxiliary [165], and De-spoofing [133]. For protocols 2 and 3, the Spatial ConvNet method performs the best. For all four protocols, 3D ResNeXt-101 and Spatial ConvNet methods are more effective than the 70 Table 3.8: ACER (%) for all methods on the OULU-NPU [33] dataset. 
Methods Protocol 1 Protocol 2 Protocol 3 Protocol 4 SVM-RBF + LBP [33] 13.5 14.2 12.1 ± 3.7 27.2 ± 14.3 Auxiliary [166] 1.6 2.7 2.9 ± 1.5 9.5 ± 6.0 De-spoofing [133] 1.5 4.3 3.6 ± 1.6 5.6 ± 5.7 MLP 22.29 16.25 18.68 ± 4.83 23.33 ± 8.97 LSTM 26.45 16.52 20.83 ± 7.49 23.75 ±5.72 LRCN 46.45 30.55 30.34 ± 7.01 48.75 ± 1.90 C3D 48.75 26.52 27.77 ± 7.24 49.16 ± 1.86 3D ResNeXt-101 5.83 5.13 4.72 ± 1.03 23.33 ± 8.97 Spatial ConvNet 3.54 2.5 2.84 ± 1.77 24.16 ± 16.62 baseline methods [33] even though the contextual information is deliberately kept out of the videos. Other scene-based methods perform poorly on this dataset. We anticipate these results on this dataset as the proposed approaches focus on cues from the entire frame (spatial) or along the temporal dimension. The dataset suppresses scene cues, which result in poor performance. Hence, the capture contextual information is advantageous for PA detection. 3.5.3 Cross-modality We also perform two more experiments to evaluate the usefulness of scene cues in PA detection across modalities. In the first experiment, training performs on face PAs taken from SiW-M [166] dataset, and evaluation is on all test splits of the iris IPV dataset. Table 3.5 (last row) shows all results (ACER). Though the results show the inferior performance of the cross-modality model over the intra-modality model, the cues learned from the face PAs are worthwhile in distinguishing bonafide and PA samples in iris modality. In the second cross-modality experiment, training performs on iris PAs (IPV dataset) and testing on the experimental splits of SiW-M [166] dataset. The last row of Table 3.7 presents its results. Surprisingly, it performs better than the SVM-RBF + LBP [33] and Auxiliary [165] techniques when examining the overall average (last column of table 3.7). Reasonable results of scene-based techniques under the cross-modality scenario validate the presence of common scene cues across PAs of different modalities. 71 Figure 3.7: Sample video frames from various face PAD datasets: the first block shows frames from the SiW-M [166] dataset, the second block represents examples from the SiW [165] dataset and the third block shows samples from the OULU-NPU [33] dataset. The key findings observed from all the experiments conducted in this work are as follows: 1. The scene provides useful information for detecting iris PAs under different (intra-session, cross-session, and cross-attack) scenarios (refer Iris Modality results). 2. The cues obtained from the scene video are complementary to the one obtained from the NIR iris region (refer to the last column of Table 3.5). 3. Scene analysis could be extended to other modalities. Outperforming results on the face modality (refer Face Modality results) validate the hypothesis. 4. Scene-based techniques utilize the common cues of presentation attacks across biometric modalities (refer to Cross-modality results). 5. Spatial ConvNet performs best in the majority of the experiments. 72 3.6 Analysis Using Heatmaps Figure 3.8: Frames of bonafide (first row), artificial eye (second row), and paper print (third row) videos overlaid with their corresponding Grad-CAM heatmaps. The columns correspond to the different frames of a video. Heatmap represents the focused region of a frame by the trained model (Spatial ConvNet). Red gradient regions in the heatmaps represent high focused regions considered by the trained model, whereas the blue-colored regions represent low focused regions. 
On the bonafide frames, the focus is mainly over the center of a face. On artificial eye frames, the focus is on the artificial eye mounted over the glasses. In the case of paper print video, the focus is on the print of the eyes. Different regions of focus in different categories help in differentiating bonafide videos from spoof one. We further visually analyze the result by generating “heatmaps" using Gradient-weighted Class Activation Mapping (Grad-CAM) [245]. Grad-CAM produces a coarse localization map highlight- ing the salient regions in an image that were used by the network to generate its inference. It is generated by estimating the gradient of the loss function and backpropagates it through the hidden layers to the input frame. We use the Spatial ConvNet model trained using experiment 11 (Iris Modality cross-attack) setup for generating heatmaps of bonafide, artificial eye, and paper print videos (most commonly used PAs). Figure 3.8 exhibits sample frames of bonafide, artificial eye, and paper print videos along with their heatmaps. The first row of Figure 3.8 shows the heatmaps of bonafide frames, where the high activation regions are at the center of a face. The other two rows of Figure 3.8 correspond to artificial and print PA frames, where high activation regions are around the artificial eye and print paper respectively. The presence of spoof artifacts in the video frames shifted the salient region towards the artifacts. Distinct region of focus aids the models to discriminate bonafide from PAs videos. 73 3.7 Conclusion and Future work We proposed an approach that utilizes multiple frames of a scene video for detecting the presentation attacks in iris and face biometric modalities. Experimental results validated the presence of significant cues in the scene video for detecting the PAs. In the case of iris modality, scene video cues are also complementary to the cues obtained from the NIR iris image. It has the generalizable capability as it produces reasonably good results with unseen attacks and modalities. We extended the approach for the face modality, but it could also be extended for other modalities such as fingerprint, where holding a fake fingerprint can be evidence for detecting the presentation attack. 74 CHAPTER 4 IRIS PRESENTATION ATTACK DETECTION USING A OCT IMAGE Parts of this chapter appeared in the following publication: R. Sharma and A. Ross, “Viability of Optical Coherence Tomography for Iris Presentation At- tack Detection,” International Conference on Pattern Recognition (ICPR), Milan, Italy, January 2021. 4.1 Introduction In this chapter, we present another iris presentation detection (PAD) method utilizing Optical Coherence Tomography (OCT) imaging. Existing PAD methods utilize NIR or VIS imaging which captures the stromal textural patterns of the iris, whereas OCT1 images capture the internal structure of the eye and the iris (Figure 4.1). The OCT imaging has been utilized for fingerprint PA detection [180]. But the unavailability of an OCT iris dataset and the high hardware costs associated with OCT has traditionally prevented its exploration for iris PA detection. However, the development of cost-effective OCT hardware [256] motivates us to consider it for iris PA detection. The main contributions of the work are as follows: 1. We propose a hardware-based iris PA detection technique based on OCT imaging technology. We also assess its viability by comparing its performance against traditional NIR and VIS imaging modalities. 2. 
We implement OCT-based iris PA detection using three state-of-the-art deep CNN mod- els which significantly differ in their architectures: VGG19 [255], ResNet50 [106] and DenseNet121 [121]. 1 OCT also employs NIR illumination but obtains cross-sectional views, not textural details. 75 3. We evaluate PA detection performance on a dataset of 2,169 bonafide, 177 Van Dyke eyes and 360 cosmetic contact lens images under intra-attack and cross-attack scenarios. Each input sample is captured in all three imaging modalities. 4. We also generate CNN visualizations (heatmaps [245] and t-SNE plots [280]) to further analyze the results on OCT, NIR and VIS images. Heatmaps are used to identify salient image regions that the deep architectures utilize to detect PAs. t-SNE plots aid in visualization of features extracted by the CNN architectures. The rest of the chapter is organized as follows. Section 4.2 discusses the existing work for detecting hardware-bsed iris PAs. Section 4.3 discusses background of the imaging modalities. Section 4.4 describes the proposed approach. Section 4.5 provides a description of the dataset. Section 4.6 describes the experimental setup and reports the results. Section 4.7 provides a detailed analysis of the results obtained from the proposed approach. Finally, Section 4.8 concludes the chapter. 4.2 Related Work Various presentation attack detection techniques in iris modality utilize different imaging tech- niques. Commonly used imaging technique is near-infrared (NIR) imaging [46,114,176,249,315]. Zhang et al. [315] utilized texture-based features, whereas works in [46, 114, 176, 249] used deep features to detect iris PAs. The authors in [98, 211] operated on visible spectrum (VIS) imaging utilizing LBP [98] and BSIF [211] features. Menotti et al. [176] showed results on both NIR and VIS iris images. Raghavendra and Busch [219] exploited characteristics of the Light Field Camera (LFC) for iris PA detection in the VIS spectrum. Sequeira et al. [246] suggested the use of a one-class classifier on VIS images for generalization across unseen attacks, i.e., attacks that were not used in the training phase. In [214], the authors utilized Eulerian Video Magnification (EVM) to detect PAs in VIS videos. Park and Kang [198] utilized a specialized tunable filter to capture iris images at different spectral bands ranging from 650nm to 1100nm. These multi-spectral images are then fused at the image level to detect PAs. Lee et al. [152] analyzed the reflectance properties of the 76 Figure 4.1: Components of the eye and iris sensed using OCT, NIR and VIS imaging. The anatomical image (https://www.vecteezy.com/vector-art/431288-parts-of-human-eye-with-name) is also shown. The red line in the VIS image shows the traverse scanning direction of the OCT scanner. iris and sclera in multi-spectral illumination. Chen et al. [49] captured images at the near-infrared (860nm) and blue (480nm) wavelengths, and then analyzed the conjunctival vasculature patterns and the iris textural patterns for liveness detection.2 Connell et al. [54] exploited the anatomy and geometry of the human eye using structured light to detect cosmetic contact lens. Thavalengal et al. [266] used both VIS and NIR images for iris liveness detection in smartphones. Hsieh et al. [117] utilized dual-band imaging hardware (VIS and NIR) to distinguish between the textured pattern of contact lens from real iris patterns using independent component analysis. 
4.3 Background of Iris Imaging Modalities The complex texture of the iris is characterized by its components, including, pigments (chro- mophore), blood vessels, muscles, crypts, contractile furrows, freckles, collarette and pupillary 2 Early literature used the term “liveness detection" to refer to the problem of PA detection. 77 frills. Different spectral bands can potentially be used to capture different components of the iris. NIR illumination, which operates in the 700-900nm range, predominantly captures the stromal features (fibrovascular layer) of the iris, whereas VIS (400-700nm) captures information about the pigment melanin. Optical Coherence Tomography (OCT) [120] is a non-invasive, micrometer- resolution imaging modality, that can be used to capture 2-D cross-sectional or 3-D volumetric images of an eye. It is mainly used for biomedical and clinical purposes, such as ophthalmology, optometry, cardiology and dermatology. It works with a low-coherence near-infrared (800nm- 1325nm) light source. OCT imaging captures cornea (circular arc), iris tissue structure, anterior humor (the space between iris and cornea) and the ciliary muscles (next to the iris tissues) of the eye as shown in Figure 4.1. OCT images are captured by shining the light source over a beam splitter, which splits the light into two beams, one directed to the sample arm (human eye) and another to the reference arm (mirror). The time delay and intensity of the back-reflected light from both the arms are estimated to create an axial back-scattering profile called A-Scan. Combination of A-Scans along transverse axis forms a 2-D cross-sectional image called B-Scan. The imaging setup of an OCT sensor is shown in Figure 4.2. OCT imaging primarily captures the structure and morphology of the eye as opposed to texture information that is typically observed in NIR and VIS images. A majority of commercial iris recognition systems and iris PA detection algorithms utilize NIR images for the following reasons. Firstly, NIR illumination penetrates deeper into the iris and elicits the textural pattern of both light and dark irides; in contrast, majority of VIS illumination is absorbed by higher levels of melanin in dark-colored irides resulting in poorly discernible iris texture. Secondly, background illumination variations and corneal reflections do not affect NIR imaging as much as RGB imagers. However, some iris recognition and PA detection algorithms have started using VIS imaging due to inexpensive hardware and a wide range of applications (mobile devices, surveillance, etc.) [98, 286]. Due to expensive hardware, OCT imaging has not been traditionally discussed in the literature for either iris recognition or PA detection. 78 Figure 4.2: Typical optical setup of an OCT scanner. Low-coherence light is incident over the beam splitter, which splits the light into sample and reference arms. Back-reflected light from sample and reference arms are then collected by the photodetector. Cross-sectional OCT image (B-scan) is formed by combining a number of A-scans along the transverse direction. 4.4 Proposed Approach In this work, we discuss the use of OCT imaging for iris PA detection. For classification of iris OCT images as bonafide or PA, we used three state-of-the-art deep CNN architectures: VGG19 [255], ResNet50 [106] and DenseNet121 [121]. These architectures output a single PA score in the range [0, 1], with a ‘1’ indicating a PA and ‘0’ indicating a bonafide. 
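To make this classification setup concrete, the sketch below shows one way of adapting an ImageNet pre-trained backbone so that it emits a single PA score in [0, 1], following the implementation details given later in this section (ImageNet mean/std normalization, 224 x 224 inputs, SGD with momentum 0.9, learning rate 0.005, cross-entropy loss). This is an illustrative reconstruction rather than the exact training code used in the thesis; in particular, a two-class softmax head whose PA-class probability serves as the score is assumed.

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

def build_pa_detector(arch="densenet121"):
    """Return an ImageNet pre-trained backbone with a 2-class head
    (class 0 = bonafide, class 1 = PA)."""
    if arch == "vgg19":
        net = models.vgg19(weights="IMAGENET1K_V1")
        net.classifier[6] = nn.Linear(net.classifier[6].in_features, 2)
    elif arch == "resnet50":
        net = models.resnet50(weights="IMAGENET1K_V1")
        net.fc = nn.Linear(net.fc.in_features, 2)
    else:  # densenet121
        net = models.densenet121(weights="IMAGENET1K_V1")
        net.classifier = nn.Linear(net.classifier.in_features, 2)
    return net

# Photometric normalization with ImageNet statistics, then resizing to 224 x 224
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = build_pa_detector("resnet50")
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def pa_score(model, image_tensor):
    """PA score in [0, 1]: the softmax probability of the PA class."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))
        return torch.softmax(logits, dim=1)[0, 1].item()
```

The same head-replacement pattern applies to all three backbones, so OCT, NIR, and VIS models differ only in the images they are fine-tuned on.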
Using the same CNN architectures, we compare the PA detection capability of OCT images against NIR and VIS images. Overview of the approach is depicted in Figure 4.3. In the subsequent sub-section, we provide implementation details of all three network architectures. To classify bonafide and PA iris images acquired from all three imaging modalities, we used three state-of-the-art deep architectures: VGG19 [255], ResNet50 [106] and DenseNet121 [121]. These three networks differ by the number of the convolutional layers, the number of trainable parameters and the connection type. VGG19 [255] has 19 convolutional layers with kernels of fixed size 3 × 3 throughout the network. It has 143,667,240 trainable parameters. ResNet50 [106] has 50 79 Figure 4.3: Comparative analysis of OCT, NIR and VIS imaging in detecting iris PAs. Three architectures, viz., VGG19, ResNet50, DenseNet121, are used for distinguishing between bonafides and PAs by emitting a PA score. A higher PA score indicates the input is a “PA" and a lower score indicates the input is a “bonafide" image. convolutional layers with residual connections (skip connections) to moderate gradient flow and allow the training of a large network. It has 35,610,216 trainable parameters. DenseNet121 [121] consists of 121 convolutional layers, where each layer is connected to every other layer resulting in a much reduced set of trainable parameters (7,978,856). Three different sized architectures are utilized in the study to eliminate the bias created due to the network architecture (under-fitting or over-fitting) in the comparison results. As the dataset used in the study is insufficient to train these deep architectures, we utilize pre-trained models on ImageNet dataset. Pre-trained models also help in faster convergence during the training process. ImageNet is a large dataset used for object classification containing 1.2 million images of 1000 classes. The images in ImageNet dataset are visible spectrum images, i.e., RGB. To preserve the usefulness of pre-trained weights for the OCT and NIR spectrum images, we normalize OCT, NIR and VIS images using the mean and the standard deviation calculated from the ImageNet dataset images. The photometrically normalized images are then re-sized to 224 × 224 and input to the aforementioned architectures. All three models are then fine-tuned using OCT, NIR and VIS iris images resulting in nine trained models. 80 Table 4.1: Number of bonafide and PA samples corresponding to each imaging modality. Imaging Modality Classes Sub-Classes OCT RGB NIR Bonafide 844 844 1371 Van Dyke Eye (Brown) 30 30 51 Artificial Eyes Van Dyke Eye (Blue) 29 29 56 Face Mask 2 2 4 Acuvue Accent Vivid 37 37 43 Cosmetic Contacts Air Optix Sterling Grey 41 41 43 Extreme FXS Halloween Blackout 42 42 34 The learning rate used in the training is 0.005, the batch size is 20, the optimization algorithm is stochastic gradient descent with momentum of 0.9, the number of epochs is 50, and the loss function is cross-entropy. During test and evaluation, each of these networks produce a single PA score which is used along with a threshold to determine if the input image is a PA or a bonafide. 4.5 Dataset The dataset is collected under the Odin program of IARPA [2] from 740 eyes (370 subjects). Figure 4.4 provides age distribution of subjects. The number of male and female subjects are 136 and 243, respectively. 
OCT, NIR and VIS images are collected sequentially for a subject using an RGB camera, iCAM7000 NIR sensor and THORLabs Telesto series (TEL1325LV2) OCT sensor [5], respectively. The OCT images are acquired at 1325nm wavelength having 7mm imaging depth and 12𝜇m axial imaging resolution. For a single sample, 50 cross-sectional frames are captured by the OCT sensor. However, temporal information is not significant among frames, so we use only the first frame. Iris PAs considered in this study are artificial eyes (Van Dyke eyes) and cosmetic contact lenses. For OCT and VIS, the dataset contains 844 bonafide images, 61 artificial eyes and 120 cosmetic contact lens images, whereas, for NIR, there are 1,371 bonafide images, 111 artificial eyes and 120 cosmetic contact lens images. Further sub-categorization of PA images is provided in Table 4.1. Figure 4.5 shows examples of bonafide and PA images acquired in all three spectra (OCT, NIR and VIS). 81 Figure 4.4: Age distribution of subjects in the dataset. Figure 4.5: Samples of bonafide, artificial eyes and cosmetic contact lens images captured using (a) OCT, (b) NIR and (c) VIS imaging modalities. 82 Table 4.2: APCER (%) and BPCER (%) of all algorithms on LivDet-Iris 2017 Dataset [304]. Results are presented by averaging APCER and BPCER of all test sets in the dataset. Algo. CASIA [304] Anon1 [304] UNINA [304] VGG19 ResNet50 DenseNet121 APCER 11.88 14.71 15.52 15.80 11.71 6.25 BPCER 9.48 3.36 12.92 1.20 3.24 10.39 4.6 Experimental Setup and Results Before evaluating the three imaging modalities (OCT, NIR and VIS), we assess the performance of three fine-tuned architectures (VGG19, ResNet50 and DenseNet121) on the LivDet-iris 2017 [304] dataset for iris PA detection. The dataset is an amalgamation of Clarkson, Warsaw, Notre Dame and IIITD-WVU datasets. Print and cosmetic contact lens PAs are included in the dataset. The experimental setup is kept the same as specified in the competition [304]. Evaluation measures are Attack Presentation Classification Error Rate (APCER) and Bonafide Presentation Classification Error Rate (BPCER), where APCER is the proportion of PA samples misclassified as bonafide and BPCER is the proportion of bonafide samples misclassified as PAs. All three architectures either outperform or are comparable to the state-of-the-art algorithms (CASIA, Anon1 and UNINA) on the LivDet-iris 2017 competition as shown in Table 4.2. Utilizing the three architectures, we perform comparative evaluation of OCT, NIR and VIS images in detecting iris PAs. Experiments are performed under intra- and cross-attack scenarios. Samples that were successfully captured in all three imaging modalities are selected for experiments. The dataset used for evaluation eventually has 723 bonafide samples, 59 artificial eyes and 120 cosmetic contact lens images captured in all three imaging modalities. The train, validation and test sets are eye-disjoint, i.e., they have data from different eyes and samples in the three sets are mutually exclusive. Intra-attack experiments examine which imaging modality performs best with known PAs (used during training), whereas cross-attack experiments analyze the generalizability across unknown PAs (not used in training). The evaluation measures used are True Detection Rate (TDR) at 0.2% False Detection Rate (FDR), and Average Classification Error Rate (ACER). TDR is the percentage of PA samples that were correctly detected, whereas FDR is a percentage 83 of bonafide samples that were misclassified as PA. 
ACER is the average of APCER and BPCER. Receiver operating characteristic (ROC) curves are also provided for a comprehensive overview. For successful detection, TDR should be comparatively higher and ACER should be comparatively lower. 4.6.1 Intra-attack Setup and Results In the intra-attack setup, three experiments are performed: Intra-EXP 1, Intra-EXP 2 and Intra-EXP 3. Intra-EXP 1 includes both the PAs (artificial eyes and cosmetic contact lens) and bonafide images in the training and test sets, whereas Intra-EXP 2 and Intra-EXP 3 include images from only one PA along with bonafide images for training and testing. Intra-EXP 2 and Intra-EXP 3 experiments are performed to test the difficulty level of differentiating a specific PA from bonafide samples. Details about the train, validation and test sets of all three experimental setups are provided in Table 4.3. In the first experiment (Intra-EXP 1), the data are split in a 70:30 ratio, where 70% of eyes is used for training and the remaining for testing (30%). Thereafter, five-fold cross-validation is employed on the training set, where 4 folds are used for training and one for validation. The validation set is used to estimate the threshold to be used on the test set for calculating ACER. The TDR at 0.2% FDR and the ACER for VGG19, ResNet50 and DenseNet121 architectures are provided in Table 4.4. ROC curves of Intra-EXP 1 for all three architectures are shown in Figures 4.6(a), 4.7(a) and 4.8(a). In the Intra-EXP 1 experiment, the best results are observed on OCT images, second-best on NIR images, and then on VIS images. All trained models (five) obtained from cross-validation show low standard deviation in the results when tested on OCT images (Figures 4.7(a) and 4.8(a)) compared to NIR and VIS images. Similar results are observed across all three network architectures (VGG19, ResNet50 and DenseNet121). This validates the robustness of PA detection when using OCT images. Considering individual PAs in Intra-EXP 2 and Intra-EXP 3 experiments, it is found that both types of PAs are perfectly classified (100% TDR) by the OCT and NIR modalities. There are a few errors when detecting cosmetic contact PAs using the VIS modality (98.63% TDR). So, 84 Table 4.3: Data distribution among train, validation and test sets for all experiments (intra-attack and cross-attack scenarios). Here, CC is Cosmetic Contacts. Train Set Validation Set Test Set Experiments Bonafide PAs Bonafide PAs Bonafide PAs Intra-EXP 1 404 100 101 25 218 54 (Both Artificial Eyes & CC) Intra-EXP 2 435 35 145 12 146 12 (Only Artificial Eyes) Intra-EXP 3 435 72 145 24 146 24 (Only CC) Cross-EXP 1 435 41 145 18 146 120 (CC are unknown) Cross-EXP 2 435 84 145 36 146 59 (Artificial eyes are unknown) in the intra-attack scenario, where attacks are known and used during training, the OCT modality perfectly separates (100% TDR at 0.2% FDR) bonafide and PA iris images by a higher margin compared to the NIR and VIS modalities. 4.6.2 Cross-attack Setup and Results To perform the cross-attack (generalization to unknown attacks) analysis, two experiments are conducted: Cross-EXP 1 and Cross-EXP 2. In the first experiment (Cross-EXP 1), training is performed on bonafide and artificial eye images, and testing is done on bonafide and cosmetic contact lens images. Bonafide images are split in a 60:20:20 ratio for the training, validation and test sets, respectively. Artificial eye images are split in a 70:30 ratio for the training and validation sets, respectively. 
All cosmetic contact images constitute the test set. In the second experiment (Cross-EXP 2), training is performed on bonafide and cosmetic contact lens images, and testing is done on bonafide and artificial eye images. Bonafide images are split in the same way as Cross-EXP 1. Cosmetic contact lens images are split in a 70:30 proportion for the training and validation sets, respectively. All artificial eye images are used in the test set. Further details of both the experimental setups are given in Table 4.3. The TDR at 0.2% FDR and the ACER for VGG19, ResNet50 and DenseNet121 architectures are provided in Table 4.4. ROC curves of all three architectures for the 85 Table 4.4: TDR (%) at 0.2% FDR and ACER of all experiments (intra-attack and cross-attack) when using VGG19, ResNet50 and DenseNet121 architectures. Evaluation VGG19 ResNet50 DenseNet121 Experiments Measure OCT NIR RGB OCT NIR RGB OCT NIR RGB Intra-EXP 1 ACER 0.08 ± 0.15 0.02 ± 0.01 0.09 ± 0.03 0.00 ± 0.00 0.00 ± 0.01 0.08 ± 0.00 0.02 ± 0.03 0.02 ± 0.02 0.07 ± 0.02 (Both Artificial & CC) TDR 100 ± 0.00 97.99 ± 2.66 82.58 ± 6.88 100 ± 0.00 97.33 ± 3.88 89.62 ± 3.62 100 ± 0.00 97.66 ± 3.26 86.66 ± 3.59 Intra-EXP 2 ACER 0.00 0.00 0.00 0.00 0.00 0.04 0.00 0.00 0.00 (Only Artificial Eyes) TDR 100 100 100 100 100 100 100 100 100 Intra-EXP 3 ACER 0.00 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.03 (Only CC) TDR 100 100 95.83 100 100 100 100 100 100 Cross-EXP 1 ACER 0.39 0.01 0.19 0.20 0.01 0.27 0.16 0.01 0.30 (CC are unknown) TDR 21.66 97.58 26.66 92.50 98.38 15.00 84.16 98.38 11.66 Cross-EXP 2 ACER 0.06 0.03 0.04 0.01 0.02 0.07 0.05 0.01 0.04 (Artificial eyes are unknown) TDR 86.44 98.38 93.22 94.91 96.77 81.35 94.91 96.77 91.52 two experiments are shown in Figures 4.6(b) and 4.6(c), 4.7(b) and 4.7(c), and 4.8(b) and 4.8(c), respectively. In the cross-attack scenario, the best results are observed on NIR images, followed by OCT images and then VIS images. Basically, the OCT and VIS modalities failed in detecting cosmetic contact images when training is performed using artificial eye PAs (see Figures 4.6(b), 4.7(b) and 4.8(b)). The feature sub-spaces of bonafide samples and cosmetic contact lens seem to overlap (middle column of Figure 4.10). However, when classifiers are trained on cosmetic contact images (Figure 4.6(c), 4.7(c) and 4.8(c)), they can detect artificial eye PAs as feature sub-space of artificial eyes seems to be well separated from that of bonafide samples (last column of Figure 4.10). Difficulty in detecting cosmetic contact PAs is also reflected in the Intra-EXP 2 and Intra-EXP 3 experiments. ResNet50 and DenseNet121 architectures are better suited for the cross-attack scenario than the VGG19 network, as a higher number of trainable parameters are present in VGG19 and the training data is insufficient. As the networks are pre-trained on the ImageNet dataset (containing VIS images), trainable parameters converge in the case of VIS and NIR images, but fail to converge for OCT images due to the fundamentally different image modality (Figure 5(a)). The main findings of the comparative analysis are: 1. In the intra-attack scenario, when PAs are known and used during training, OCT images provide more discriminative information for distinguishing between bonafide and PA samples. However, NIR imaging provides better generalizability across unknown iris PA attacks. 86 Figure 4.6: ROC curves of (a) Intra-EXP 1, (b) Cross-EXP 1 and (c) Cross-EXP 2 experiments using VGG19 architecture. 
The first ROC plot (a) also shows the confidence interval of 95%. NIR imaging is more efficient in discriminating bonafide and PA samples on this network. Figure 4.7: ROC curves of (a) Intra-EXP 1, (b) Cross-EXP 1 and (c) Cross-EXP 2 experiments using ResNet50 architecture. OCT imaging results in better performance in distinguishing bonafide and PA images in the intra-attack scenario (a), whereas NIR imaging performs the best in the cross- attack scenario (b and c). Figure 4.8: ROC curves of (a) Intra-EXP 1, (b) Cross-EXP 1 and (c) Cross-EXP 2 experiments using DenseNet121 architecture. OCT imaging results in better performance in distinguishing bonafide and PA images in the intra-attack scenario (a), whereas NIR imaging performs the best in the cross-attack scenario (b and c). 87 2. Cosmetic contact PAs are difficult to detect compared to artificial eyes, especially on VIS images. 3. ResNet50 and DenseNet121 architectures are well-suited for iris PA detection in the OCT imaging modality possibly due to the smaller number of trainable parameters compared to VGG-19. 4.7 CNN Visualization The performance of all three architectures is nearly perfect on OCT and NIR images. To further analyze the results, we generate heatmaps [245] and t-SNE plots [280]. Heatmaps provide the salient regions in OCT, NIR and VIS images where the classifier (ResNet50) focused on, in order to discriminate PAs from bonafide samples. Heatmaps are generated using Grad-CAM [245]. Grad-CAM uses a gradient of the loss function and backpropagates it through the convolutional layers to generate activations on the input image. OCT, NIR and VIS images of a bonafide, artificial eye and cosmetic contact lens are shown along with their heatmaps in Figure 4.9. In the case of OCT images (Figure 4.9(a)), the heatmap of the bonafide image highlights the iris regions, which is the most discriminative region compared to OCT PA images. The heatmap of an artificial eye image focuses over the outer structure. Cosmetic contact lens conceals the underlying iris pattern (partially or fully), which causes the focus to shift over to the corneal region corresponding to the pupil. In the case of NIR and VIS imaging (Figure 4.9(a) and 4.9(b)), heatmaps of bonafide sample focus over the iris pattern. For an artificial eye image, the heatmap is activated all over the image, whereas for a textured contact lens more emphasis is given to the circumference of the iris. Different regions of focus for different categories (bonafide and PA) aid the CNN architecture to discriminate between them. After visualizing activations on the input image, we also visualize the CNN features using a t-SNE plot [280]. The CNN features are extracted from the average pooling layer (penultimate layer, a layer before the last fully connected layer) of the ResNet50 architecture. The dimensionality of the features is 2048, which is reduced to two dimensions using t-Distributed Stochastic Neighbor 88 Figure 4.9: (a) OCT, (b) NIR and (c) VIS images and their corresponding fixation regions for bonafide, artificial eyes and cosmetic contact lens samples. Red in the heatmaps represents high priority (high CNN activations) regions considered by the CNN architecture. Blue represents low priority regions. Red boxes mark the high priority regions. Different regions of focus help the CNN architecture to differentiate between bonafide and PA iris images. Embedding (t-SNE). The t-SNE plots are shown in Figure 4.10. 
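As a rough illustration of how such plots can be produced, the sketch below extracts the 2048-dimensional average-pooling (penultimate-layer) features from a fine-tuned ResNet50 and projects them to two dimensions with t-SNE. The feature-extraction trick (replacing the final fully connected layer with an identity), the data-loader interface, and the class names are assumptions for illustration, not the exact visualization code used here.

```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def penultimate_features(model, loader, device="cpu"):
    """Collect 2048-D average-pooling features from a fine-tuned ResNet50."""
    model.fc = nn.Identity()            # drop the final classification layer
    model.eval().to(device)
    feats, labels = [], []
    for images, y in loader:            # loader yields (image batch, label batch)
        feats.append(model(images.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def plot_tsne(features, labels,
              class_names=("bonafide", "artificial eye", "cosmetic contact")):
    """Reduce the features to 2-D with t-SNE and scatter-plot them per class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for c, name in enumerate(class_names):
        pts = emb[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
    plt.legend()
    plt.show()
```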
These t-SNE plots correspond to Intra-EXP 1 (first column), Cross-EXP 1 (second column) and Cross-EXP 2 (third column) test data. Distribution of bonafide, artificial eyes and cosmetic contact images are observed to be well separated in OCT imaging in the case of Intra-EXP 1 and Cross-EXP 2 experiments. Separation of these features is also prominent in NIR imaging under the cross-attack scenario (Cross-EXP 1 and Cross-EXP 2). Features in the case of Cross-EXP 1 experiment overlap for VIS images. These plots substantiate our observations that OCT imaging works efficiently in the intra-attack scenario and moderately in the cross-attack scenario, while NIR imaging generalizes well in the cross-attack scenario. 89 Figure 4.10: t-SNE plots of Intra-EXP 1, Cross-EXP 1 and Cross-EXP 2 test data pertaining to OCT, NIR and VIS imaging. 2048 dimensions of features from the average pooling layer (penultimate layer) of ResNet50 network are reduced to two dimensions for visualization. Features of bonafide and PAs from OCT images are well separated in Intra-EXP 1 and Cross-EXP 2 experiments. NIR images show good separation in all three experiments. Features from VIS images are overlapping between the bonafide and PA categories (especially in the Cross-EXP 2 experiment). More the separation of features, better the classification. 90 4.8 Conclusion and Future Work In this chapter, we described the use of the OCT imaging modality for iris PA detection. By comparative analysis against other imaging modalities (traditional NIR and VIS), we determined that OCT is a viable solution for iris PA detection. Extensive experiments were conducted both in the intra-attack and cross-attack scenarios using three state-of-the-art deep architectures, and results were analyzed using CNN visualizations (heatmaps and t-SNE plots). Future work will involve collecting OCT data from more subjects and other types of PAs. Hardware cost continues to be a barrier for the use of OCT in iris recognition applications. However, as sophisticated presentation attacks are launched in the future, the OCT modality is likely to be of great benefit. 91 CHAPTER 5 ROBUSTNESS OF DEEP NEURAL NETWORKS 5.1 Introduction In this chapter, we empirically analyze the robustness of iris presentation attack detection (PAD) models by manipulating their architectural parameters. Here, we consider three state-of-the- art architectures (VGG [255], ResNet [105], and DenseNet [122]) under three types of parameter perturbations (Gaussian noise, weight zeroing and weight scaling). We apply the perturbations in two settings: over all the layers of a network simultaneously and over each layer at a time. Our main contributions are as follows: 1. We perform robustness analysis of three state-of-the-architectures (VGG [255], ResNet [105] and DenseNet [122]) against parameter perturbations. 2. We apply a large number of parameter perturbations (three types of perturbations and its variant in two settings) to analyze the robustness of deep neural networks in the context of iris presentation attack detection. 3. We leverage the robustness analysis to propose better performing ensemble models. 4. We perform experiments using five datasets. Three of the datasets (IARPA, NDCLD-2015, Warsaw Postmortem v3) are used for training, whereas the others (LivDet-Iris-2017 and LivDet-Iris-2020) are used for testing. This represents a cross-dataset scenario, where training and testing are performed on different datasets. 
The rest of the chapter is organized as follows: Section 5.2 discusses the existing work related to the robustness analysis of DNNs, Section 5.3 provides the details of various parameter perturbations used for the robustness analysis, Section 5.4 describes the application scenario considered in this work, Section 5.5 explains the dataset and experimental setup, Section 5.6 provides the robustness 92 analysis of the three architectures against considered parameter perturbations, and Section 5.7 describes how we leverage the robustness analysis to generate an ensemble of perturbed models for improving performance. Finally, Section 5.8 summarizes the chapter and provides future directions. 5.2 Related Work Deep Neural Networks (DNNs) have revolutionized the machine learning field through their superior performance in various tasks especially in the field of computer vision [105, 122, 255], natural language processing [75], and speech technology [74]. In essence, a DNN comprises a sequence of layers containing trainable parameters (weights and bias) to learn a complex mapping between input signals and output labels. For deploying DNNs in real-world applications, it is crucial to analyze their robustness or sensitivity to hardware/sensor noise introduction [51], environment changes [276] and adversarial attacks [94]. Robustness analysis also helps in building a quantized- weights model with commensurate performance [102, 293]. In the literature, robustness analysis of DNNs has been performed by perturbing either the input signal or the architectural parameters. The work in [83, 96, 137, 181, 191, 259] analyze DNN robustness by manipulating the input signals, whereas the work in [102, 253, 276, 293, 297] perturb architectural parameters to analyze robustness. Yeung et al. [309] provide a detailed sensitivity analysis of neural networks over input and parameter perturbations. In this work, we focus on the robustness analysis of DNNs when architectural parameters are perturbed. The authors in [253,276,293,297] provide a theoretical robustness analysis based on parameter perturbations. Shu and Zhu [253] propose an influence measure motivated by information geometry to quantify the effects of various perturbations to input signals and network parameters on DNN classifiers. Xiang et al. [297] design an iterative algorithm to compute the sensitivity of a DNN layer by layer, where sensitivity is defined as “the mathematical expectation of absolute output variation due to weight perturbation with respect to all possible inputs" [297]. Tsai et al. [276] study the robustness of the pairwise class margin function against weight perturbations. Weng et al. [293] compute a certified robustness bound for weight perturbations, within which a neural 93 network will not make erroneous outputs. In addition, they also identify a useful connection between the developed certification and the problem of weight quantization. Our work is motivated from [51], where they also empirically analyze the robustness of the pre- trained AlexNet and VGG16 networks to internal architecture and weight perturbations. However, our work is vastly different. First, we extend the work by evaluating the robustness of more recent DNN architectures: VGG, ResNet, and DenseNet. Second, we perform additional weight manipulations (weight scaling and perturbations over the entire network) in the robustness analysis. 
Third, we leverage the findings from the robustness analysis and propose an ensemble of perturbed models for improved performance without any further training.

5.3 Parameter Perturbations

We explore the stability of neural networks by perturbing their architectural parameters (weights and biases). From now on, we use the terms 'architectural parameters', 'parameters', and 'weights' interchangeably. To measure stability, we consider the change in the performance of the DNN when its weights are perturbed. Let the $n$ input samples be $\{x_1, x_2, \ldots, x_n\}$ and their outputs $\{y_1, y_2, \ldots, y_n\}$. Here, we label the positive class as '1' and the negative class as '0'. The predicted output values from a DNN approximator are $\{f(x_1, W_{org}), f(x_2, W_{org}), \ldots, f(x_n, W_{org})\}$, where $W_{org}$ are the learned parameters. We measure the performance of the DNN in terms of the True Detection Rate (TDR), the percentage of positive samples correctly classified:

$$TDR_{org} = \frac{\sum_{i=1}^{n} \mathbb{1}\left(f(x_i, W_{org}) > T\right)}{\sum_{i=1}^{n} y_i} \times 100 \qquad (5.3.1)$$

where $T$ is the threshold and $\mathbb{1}(\cdot)$ is the indicator function. An input sample with a predicted value above the threshold is assigned to the positive class. After weight perturbation, we estimate the outputs as $\{f(x_1, W_{mod}), f(x_2, W_{mod}), \ldots, f(x_n, W_{mod})\}$, where $W_{mod}$ are the perturbed parameters. We then use these predicted values to measure the performance of the DNN ($TDR_{mod}$). The higher the change in performance, the lower the robustness of the neural network to that particular perturbation.

We perturb the parameters in two settings: manipulating the parameters of all layers simultaneously and manipulating the parameters of one layer at a time. The first setting aims to understand the overall robustness of DNNs, whereas the second setting examines which layers have more impact on the stability of the model. We consider three types of perturbations: Gaussian noise manipulation, weight zeroing, and weight scaling. These perturbations resemble noise introduced by (a) defects in hardware implementations of neural networks [174] and (b) adversarial attacks [94]. Details of these perturbations are as follows:

1. Gaussian Noise Manipulation: Here, we manipulate the original parameters of the layers by adding Gaussian noise sampled from a normal distribution with zero mean and a scaled standard deviation. We control the scaling of the standard deviation by the scalar factor $\alpha$. The modified parameters are defined as

$$W_{mod} = W_{org} + N\left(0, \alpha \cdot \sigma(W_{org})\right) \qquad (5.3.2)$$

where $W_{org}$ are the original parameters, $W_{mod}$ are the modified parameters, and $N(\mu, \sigma)$ is the normal distribution. We calculate $\sigma(W_{org})$ for a particular layer by first flattening the parameter tensor to a 1-D array and then computing its standard deviation. So, the standard deviation and the Gaussian noise distribution differ for each layer. Consequently, the absolute perturbations applied to the different layers also vary; the relative perturbations, however, are the same across layers.

2. Weight Zeroing: In the second manipulation, we randomly select a certain proportion of parameters and set them to zero. The proportion of parameters is determined by a scalar factor $\beta$. The modified parameters are represented as

$$W_{org}[\,random(\beta, W_{org})\,] = 0 \qquad (5.3.3)$$

where $random(\cdot, \cdot)$ is a function that returns the indices of a $\beta$ proportion of randomly selected parameters from the original set of parameters. We also define another version of weight zeroing, where the weights are first sorted and then the $\beta$ proportion of low-magnitude weights is set to zero.

3.
Weight Scaling: The third perturbation scales the original parameters by a scalar factor 𝛾 as 𝑊𝑚𝑜𝑑 = 𝜸 ∗ 𝑊𝑜𝑟𝑔 . (5.3.4) 95 5.4 Application Scenario We perform our robustness analysis in the context of iris presentation attack detection (PAD). A presentation attack (PA) occurs when an adversary presents a fake or altered biometric sample such as printed eyes, plastic eyes, or cosmetic contact lenses to circumvent the iris recognition system [3]. Our application is to detect these PAs launched against an iris system. We formulate the detection problem as a two-class problem based on DNNs, where the input is a near-infrared iris image and the output is a PA score that is assigned one of two labels: “bonafide” or “PA”. 5.5 Datasets and Experimental Setup Table 5.1: Summary of training and test datasets along with the number of bonafide and PA images present in the datasets. The information about the sensors used to capture images is also provided. Here, “K. Test” means a known test set of the dataset, and “U. Test” means an unknown test set (see text for explanation). Train/Test Train Test Datasets Warsaw LivDet-Iris-2017 NDCLD Dataset IARPA PostMortem Clarkson Warsaw Notre Dame IIITD-WVU LivDet-Iris-2020 -2015 Subsets v3 (Cross-PA) (Cross-sensor) (Cross-PA) (Cross-Dataset) Splits Test K. Test U. Test K. Test U. Test Test Bonafide 19,453 - - 1,485 974 2,350 900 900 702 5,331 Print 1,005 - - 908 2,016 2,160 - - 2,806 1,049 Cosmetic 1,187 2,236 - 765 - - 900 900 701 4,336 Contacts Artificial 1,804 - - - - - - - - 541 Eyes Electronic 51 - - - - - - - - 81 Display Cadaver Eyes - - 1,200 - - - - - - 1,094 IrisGuard Aritech ARX-3M3C, Iris ID iCAM7000, COTS Iris AD100, IriShield IrisAccess IrisGuard Fujinon DV10X7.5A, IrisGuard AD100, IriShield IrisGuardAD100, Sensor Sensors x31 IrisAccess MK2120U EOU2200 AD100 DV10X7.5A-SA2 lens IrisAccess LG4000 MK2120U IrisAccess LG4000, LG4000 B+W 092 NIR filter IriTech IriShield The training data we use to build our iris PAD models are IARPA [2], NDCLD-2015 [267] and Warsaw PostMortem v3 [275] datasets. The IARPA dataset is a proprietary dataset collected under the IARPA Odin program [2]. It consists of 19,453 bonafide irides and 4,047 presentation attack (PA) samples. From the NDCLD-2015 dataset, we use 2,236 cosmetic contact lens images for training. From the Warsaw PostMortem v3 dataset, 1,200 cadaver iris images from the first 37 cadavers are used for training. Testing is performed on the LivDet-Iris-2017 [304] and LivDet- Iris-2020 [61] datasets. Both of these are publicly available competition datasets for evaluating iris presentation attack detection. The LivDet-Iris-2017 dataset [304] consists of four subsets: 96 Clarkson, Warsaw, Notre Dame, and IIITD-WVU. All subsets contain train and test partitions, and we use only the test partition. Warsaw and Notre Dame subsets further contains two splits in the test partition: ‘Known’ and ‘Unknown’. The ‘Known’ split corresponds to the scenario, where PAs of the same type or images from similar sensors are present in both train and test partitions, while the ‘Unknown’ split contains different types of PAs or images from different types of sensors in the train and test partitions. Our experimental setup corresponds to a cross-dataset scenario as we use different datasets for training and testing. In the case of LivDet-Iris-2020 [61], we use the entire dataset for testing, and this scenario also corresponds to the cross-dataset. 
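For concreteness, the three perturbations defined in Section 5.3 (Eqs. 5.3.2-5.3.4) can be realized as in the minimal PyTorch-style sketch below. The function names (gaussian_noise, zero_random_weights, scale_weights, perturb_model) are illustrative and are not taken from the thesis code; in a real experiment the perturbation is applied to a copy of the trained model so that the original weights are preserved.

```python
import torch

@torch.no_grad()
def gaussian_noise(param, alpha):
    """Eq. 5.3.2: add zero-mean Gaussian noise with std alpha * sigma(W_org),
    where sigma is computed from the flattened parameter tensor of that layer."""
    sigma = param.flatten().std()
    return param + torch.randn_like(param) * (alpha * sigma)

@torch.no_grad()
def zero_random_weights(param, beta):
    """Eq. 5.3.3: set a randomly selected fraction beta of the parameters to zero."""
    keep_mask = (torch.rand_like(param) >= beta).to(param.dtype)
    return param * keep_mask

@torch.no_grad()
def scale_weights(param, gamma):
    """Eq. 5.3.4: multiply the parameters by the scalar factor gamma."""
    return gamma * param

@torch.no_grad()
def perturb_model(model, perturb_fn, factor, layer_name=None):
    """Apply a perturbation to all parameters (weights and biases) of the network
    (layer_name=None), or only to the parameters of one named layer."""
    for name, param in model.named_parameters():
        if layer_name is None or name.startswith(layer_name):
            param.copy_(perturb_fn(param, factor))
    return model
```

In the whole-network setting, perturb_model is called with layer_name=None; in the layer-wise setting it is called once per layer of interest, each time on a fresh copy of the trained model.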
Table 5.1 describes all training and test sets along with the types of PAs and images present in them. We use three iris PA detectors for stability analysis. Two of the detectors utilize VGG19 [255] and ResNet101 [105] networks as their backbone architecture. The third detector is D-NetPAD [249], where the backbone architecture is DenseNet161 [122]. The D-NetPAD shows state-of-the-art performance on both LivDet-Iris-2017 and LivDet-Iris-2020 iris PAD competitions [61, 249]. The input given to these models is a cropped iris region resized to 224 × 224. For training, we initialize the model with the weights from the ImageNet dataset [72] and then fine-tuned the models using the training datasets described above. The learning rate was set to 0.005, the batch size was 20, the number of epochs was 50, the optimization algorithm was stochastic gradient descent with a momentum of 0.9, and the loss function used was cross-entropy. We measure the robustness of these DNNs by evaluating their performance as a function of the weight perturbations. The performance is estimated in terms of TDR (%) at 0.2% False Detection Rate (FDR). FDR is the percentage of bonafide samples incorrectly classified as PAs. In Table 5.3, the row corresponding to the ‘Original’ method reports the performance of these models on the LivDet-Iris-2017 and LivDet-Iris-2020 datasets before weights were perturbed. On the LivDet-Iris-2017 dataset, ResNet101 performs the best (average 74.55% TDR), whereas on the LivDet-Iris-2020 dataset, D-NetPAD performs the best (90.22% TDR). We also provide information about the number of weights and bias parameters present in all three models (Table 5.2). The VGG19 architecture has the highest number of parameters, followed by the ResNet101 97 Table 5.2: The number of parameters (weights and bias) present in all convolutional layers of the VGG19, ResNet101, and D-NetPAD architectures. Architecture VGG19 ResNet101 D-NetPAD Weights 139,570,240 42,451,584 26,366,448 Bias 19,202 52,674 109,970 Total 139,589,442 42,504,258 26,476,418 architecture. 5.6 Robustness Analysis 5.6.1 Gaussian Noise Addition The Gaussian noise manipulation involves the addition of Gaussian noise to the original parameters. Figure 5.1 (a) shows the performance of all the networks when we perturb parameters of all layers with the Gaussian noise. The scale factor (𝛼) used to modify the standard deviation is shown on the x-axis. From a trend standpoint, the performance of all networks decreases with an increase in the standard deviation. However, this decrease is not linear. In fact, there are some performance gains at certain scales. These scales are different for different networks. For instance, the VGG19 network shows improvement for 𝛼 = 0.3, 0.6, and 0.9, ResNet101 for 𝛼 = 0.1, 0.3, and 0.9, and D-NetPAD for 𝛼 = 0.1, 0.4 and 1.0. Surprisingly, certain scales give higher performance than the original model, such as 0.1 scale for the ResNet101 and D-NetPAD models, and 0.3 scale for the VGG19 model. It should be noted that all three networks are not robust to Gaussian noise perturbations, and we cannot conclude which network is comparatively robust under these weight perturbations. We further analyze the impact of perturbation at different layers on the performance of the models. We manipulate the parameters one layer at a time and observe the performance change. For the layer-wise analysis, we show the results of only the D-NetPAD model since the other two models also show similar performance trends. 
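The layer-wise protocol just described can be sketched as follows: for each layer of interest and each scale factor, a fresh copy of the trained model is perturbed at that layer only and re-evaluated in terms of TDR at 0.2% FDR. The example layer names and the score_fn routine (which returns PA scores for the bonafide and PA test samples) are hypothetical placeholders, not names from the thesis code.

```python
import copy
import numpy as np
import torch

def tdr_at_fdr(bonafide_scores, pa_scores, fdr=0.2):
    """TDR (%) at a fixed FDR (%): the threshold is chosen so that `fdr` percent of
    bonafide samples score above it; TDR is the percent of PA samples above it."""
    thr = np.percentile(np.asarray(bonafide_scores), 100.0 - fdr)
    return float(np.mean(np.asarray(pa_scores) > thr) * 100.0)

@torch.no_grad()
def perturb_layer_gaussian(model, layer_name, alpha):
    """Add zero-mean Gaussian noise (Eq. 5.3.2) to the parameters of one layer."""
    for name, param in model.named_parameters():
        if name.startswith(layer_name):
            sigma = param.flatten().std()
            param.add_(torch.randn_like(param) * alpha * sigma)
    return model

def layerwise_sweep(model, layers, alphas, score_fn, test_loader):
    """Re-evaluate TDR at 0.2% FDR after perturbing one layer at a time."""
    results = {}
    for layer in layers:                  # e.g. "features.conv0", "features.denseblock4"
        for alpha in alphas:              # e.g. [0.1, 0.2, ..., 1.0]
            perturbed = perturb_layer_gaussian(copy.deepcopy(model), layer, alpha)
            bona, pa = score_fn(perturbed, test_loader)   # hypothetical scoring routine
            results[(layer, alpha)] = tdr_at_fdr(bona, pa, fdr=0.2)
    return results
```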
In the case of D-NetPAD, we select the first convolution layer and the last convolution layers of four dense blocks for perturbation. Figure 5.1 1 Specific sensor names withheld at sponsor’s request 98 (a) (b) Figure 5.1: Gaussian noise manipulation: (a) Performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when weights and bias parameters of the entire network are perturbed. (b) Performance of D-NetPAD when the individual layer’s parameters (weights and bias) are perturbed. Here, Conv1 means the first convolution layer of the D-NetPAD, Dense1_LastConv means the last convolution layer of the first dense block, and so on. (b) shows the performance of D-NetPAD when the individual layer’s parameters are perturbed. We observe that the initial layers have more influence on the performance of the D-NetPAD compared to the later layers. The model is highly robust to the perturbations in the last convolution layer of the fourth dense block, even at a scale factor of 30. Cheney et al. [51] also observe the higher impact of perturbations in the initial layers on the performance. This is because the perturbations in the initial layers impact all the subsequent layers, resulting in a substantial decrease in performance. Change in middle layers exhibit large fluctuations in performance compared to the initial and later layers. 5.6.2 Weight Zeroing The weight zeroing manipulation involves random selection of a particular fraction of weight pa- rameters and setting them to zero. Figure 5.2 (a) shows the performance of all three architectures when we manipulate the entire set of network parameters, while Figure 5.2 (b) shows the perfor- mance of D-NetPAD when we perturb individual layers. Similar conclusions can be drawn from Figure 5.2 (a) as drawn from Figure 5.1 (a) that the overall performance of all three architectures de- 99 creases with an increase in the proportion of weights set to zero. However, certain perturbations give improved performance. For example zeroing 3% of weights improves the VGG19 network performance from 76.87% TDR (original) to 92.70% TDR, in the case of ResNet101 zeroing 3% of weights improves performance from 84.11% TDR (original) to 88.88% TDR. Again, all three networks are not robust to the zeroing out of randomly selected weights from the entire network. (a) (b) Figure 5.2: Weight zeroing manipulation: (a) Performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when parameters of the entire network are perturbed. (b) Performance of D-NetPAD when the individual layer’s parameters are perturbed. In the layer-wise setup (Figure 5.2 (b)), the performance of D-NetPAD is stable except for the first convolution layer. This is due to the fact that the original weights of the convolution layers have zero mean and small standard deviation ranging from 0.10 (first convolution layer) to 0.01 (last convolution layer) as shown in Figure 5.3. A similar performance trend is observed in the VGG19 and ResNet101 networks as well. To further analyze the effect of weight zeroing, we assess its three variants - first is to set low- magnitude weights to zero, second sets high-magnitude weights to zero and in the third randomly selected weights to make them zero and non-zero weights are scale to factor 5. The details of these variants are as follows: 1. Since most of the original weights are already close to 0, we set low-magnitude weights to zero. 
To further analyze the effect of weight zeroing, we assess four variants: the first sets low-magnitude weights to zero; the second sets high-magnitude weights to zero; the third sets randomly selected weights to zero and scales the remaining non-zero weights by a factor of five; and the fourth sets all weights of randomly selected filters to zero. The details of these variants are as follows:

1. Since most of the original weights are already close to zero, the first variant sets low-magnitude weights to zero. Figure 5.4 (a) shows the performance of all architectures when we manipulate the entire network in this fashion, while Figure 5.4 (b) shows the performance of D-NetPAD under layer-wise manipulation. The ResNet101 and D-NetPAD networks are robust to this manipulation: zeroing out even 33% of all weights does not affect their performance. VGG19 also shows robustness, with only a 10% drop in performance, though it is not as stable as ResNet101 and D-NetPAD. Figure 5.4 (b) shows the stability of D-NetPAD under layer-wise perturbations. Zeroing out even 30% of the first convolution layer's weights does not impact its performance. Remarkably, manipulating the last convolution layer of the first and second dense blocks produces a roughly linear increase in performance; the performance of D-NetPAD increases from 90.22% TDR to 96.28% TDR upon manipulating the last convolution layer of the first dense block. This suggests that we could zero out low-magnitude weights and reduce the size of the model without affecting its performance.

Figure 5.4: Variant of the weight zeroing manipulation (low-magnitude weights are set to zero): (a) performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when parameters of the entire network are perturbed; (b) performance of D-NetPAD when an individual layer's parameters are perturbed.

2. The second variant sets high-magnitude weights to zero. Figure 5.5 (a) shows the performance of all architectures when high-magnitude weights across the entire network are set to zero, while Figure 5.5 (b) shows the performance of D-NetPAD under layer-wise manipulation. There is a drastic drop in the performance of all three architectures when high-magnitude weights are set to zero, which shows that the high-magnitude weights are high-priority parameters. In the layer-wise analysis, manipulating the first convolution layer and the layers of DenseBlock1 causes a drop in performance, whereas manipulating the later layers does not impact the performance.

Figure 5.5: Variant of the weight zeroing manipulation (high-magnitude weights are set to zero): (a) performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when parameters of the entire network are perturbed; (b) performance of D-NetPAD when an individual layer's parameters are perturbed.

3. The third variant mimics the operation of a Dropout layer: randomly selected weights are set to zero and the remaining non-zero weights are scaled by a factor of five. Figure 5.6 (a) shows the performance of D-NetPAD under layer-wise manipulation. The last convolution layer of DenseBlock1 shows a different trend, with performance increasing as the manipulation proportion increases. To explore this further, we plot the performance when manipulating the layers between the first convolution layer and the last convolution layer of DenseBlock2 (Figure 5.6 (b)). We found that the layers in DenseBlock1 and DenseBlock2 show a similar pattern. This is due to the scaling of the non-zero weights, whose impact decreases as the proportion of weights set to zero increases. The impact of scaling weights is greater than that of setting weights to zero.
Figure 5.6: Variant of the weight zeroing manipulation (randomly selected weights are set to zero and the non-zero weights are scaled by a factor of 5): (a) performance of D-NetPAD when an individual layer's parameters are perturbed; (b) a closer look at the performance of D-NetPAD when the convolution layers of DenseBlock1 and DenseBlock2 are perturbed.

4. The fourth variant randomly selects filters from the layers and sets all of their weights to zero. Figure 5.7 (a) shows the performance of all architectures when randomly selected filters across the entire network are set to zero, while Figure 5.7 (b) shows the performance of D-NetPAD under layer-wise manipulation. Similar conclusions can be drawn from Figure 5.7 (a) as from Figure 5.2 (a): the overall performance of all three architectures decreases as the proportion of filters set to zero increases, although certain perturbations again give improved performance. Overall, none of the three networks is robust to zeroing out randomly selected filters across the entire network. As before, Figure 5.7 (b) shows that the performance is robust to manipulations in the later layers compared to the initial layers.

Figure 5.7: Variant of the weight zeroing manipulation (randomly selected filters are set to zero): (a) performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when filters of the entire network are perturbed; (b) performance of D-NetPAD when an individual layer's parameters are perturbed.

5.6.3 Weight Scaling
This manipulation scales the original parameters by a scalar value. Figure 5.8 (a) shows the performance of all three architectures when we manipulate the entire set of network parameters, while Figure 5.8 (b) presents the performance of D-NetPAD when we perturb specific layers. The performance at a scale of 1 corresponds to the original performance without weight perturbations. Scaling the weights of the entire network results in a radical drop in performance even for scalar factors close to one (0.8 or 1.1). In the layer-wise manipulation, the initial layers have a higher impact on the performance of D-NetPAD than the later layers; manipulating the last convolution layer does not impact the performance even at a scaling factor of 10. A similar performance trend is observed for the VGG19 and ResNet101 networks as well.

Figure 5.8: Weight scaling manipulation: (a) performance (TDR at 0.2% FDR) of VGG19, ResNet101, and D-NetPAD when parameters of the entire network are perturbed simultaneously; (b) performance of D-NetPAD when an individual layer's parameters are perturbed.

5.6.4 Findings
Here are the main findings from the aforementioned analysis:
1. All three networks decrease in performance when perturbations are applied over the entire network. However, the networks remain stable when low-magnitude weights are set to zero. The scaling of weights has a major negative impact on the performance of the networks.
2. The layer-wise robustness analysis shows that perturbations in the initial layers impact the performance to a greater extent than perturbations in the later layers, since a perturbation in an initial layer affects all subsequent layers and results in a drop in performance. Gaussian noise perturbations negatively impact the performance when applied layer-wise.
3. Certain perturbations improve the performance of the network models over the original ones. This observation indicates that the parameters learned by the models during training are not optimal; hence, there is further scope for optimizing the weights.
4. Zeroing out low-magnitude weights can improve performance while also reducing the size of the model.
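Finding 4 can be made concrete with a magnitude-based variant of the zeroing manipulation: within each layer, weights whose absolute values fall below a chosen quantile are set to zero. The sketch below is illustrative only; the helper name zero_low_magnitude and the use of a per-tensor quantile threshold are assumptions, and whether the thesis prunes per layer or over the whole network in one pass is not specified here.

```python
import copy
import torch
import torch.nn as nn

def zero_low_magnitude(model: nn.Module, proportion: float) -> nn.Module:
    """Return a copy of `model` in which, for every convolutional layer, the
    `proportion` of weights with the smallest magnitudes are set to zero."""
    perturbed = copy.deepcopy(model)
    with torch.no_grad():
        for module in perturbed.modules():
            if isinstance(module, nn.Conv2d):
                w = module.weight
                # Threshold at the requested magnitude quantile of this tensor.
                threshold = torch.quantile(w.abs().flatten(), proportion)
                w.mul_((w.abs() >= threshold).float())
    return perturbed
```

Because the zeroed entries could be dropped from a sparse representation, this kind of manipulation is also what allows the model size to shrink without retraining, as noted in finding 4.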
5.7 Performance Improvement
We observe that certain perturbations result in better performance, even higher than that of the original model. We leverage this observation to obtain better-performing models using these perturbations without any additional training. In this regard, we explore two directions: the first is to find a single perturbed model that consistently achieves good performance, and the second is to create an ensemble of perturbed models to obtain high performance.

5.7.1 Single Perturbed Model
Certain perturbations result in higher performance than the original model, for instance, a 0.1 scale factor of the Gaussian manipulation for the D-NetPAD and ResNet101 architectures, a 0.3 scale factor of the Gaussian manipulation for the VGG19 architecture, a 0.01 proportion of weight zeroing for ResNet101, and a 0.03 proportion of weight zeroing for VGG19. These manipulations involve random selection, so we repeat each of them 100 times and plot the resulting performance distributions. Figures 5.9 and 5.10 show the performance distributions corresponding to these manipulations. In approximately 20-40 of the 100 repetitions, these manipulations result in higher performance than the original model.

5.7.2 Ensemble of models
The second direction to improve the performance using weight manipulations, without further training, is an ensemble of models. An ensemble of models better spans the decision space and generalizes well on the test data [203]. For the initial analysis, we ensemble three perturbed models.

Figure 5.9: The performance distributions when the Gaussian perturbation is applied over the entire architecture at the specified scale on the (a) D-NetPAD, (b) ResNet101, and (c) VGG19 architectures.

Figure 5.10: The performance distributions when weights are set to zero over the entire architecture of (a) ResNet101 and (b) VGG19 for the specified proportion. The red vertical line represents the original performance of the architectures when the weights are unperturbed.

Figure 5.11 shows the process of assembling the perturbed models (a sketch of the score-level fusion is given after the following list). We consider three settings for the ensemble: models having the same manipulation and scalar factor, the same manipulation but different scalar factors, and different manipulations. In each setting, we also consider manipulations applied to the entire network and to the last convolution layer only.

1. Same parameter manipulation and scalar factor: In the first setting, we use the same manipulation and the same scalar factor to manipulate the parameters of the component models. The manipulation is the addition of Gaussian noise sampled from a Gaussian distribution with zero mean and a 0.1 scaling factor. All three models undergo the same manipulation, but their weights differ because the perturbations involve random sampling from the Gaussian distribution.

Figure 5.11: Ensemble process of perturbed models to improve the performance of a DNN model without undergoing further training.

We repeat the ensemble 100 times and plot the performance distribution. Figure 5.12 (a) shows the performance distribution when we manipulate the entire network, whereas Figure 5.12 (b) shows the performance distribution when only the last convolution layer is manipulated. TDR is higher than the original performance in 29 of the 100 repetitions for the entire-network manipulation, and in 79 repetitions for the last-convolution-layer manipulation.
Thus, the ensemble shows a higher chance of improved performance when we manipulate only the last convolution layer.

Figure 5.12: Performance distributions when three Gaussian-noise-manipulated D-NetPAD models are ensembled. The Gaussian distribution scaling parameter used in all three models is 0.1. The red vertical line corresponds to the original performance (without weight perturbations). (a) Performance distribution when the entire network is manipulated; in this case, TDR is higher than the original performance 29 times. (b) Performance distribution when only the last convolution layer of DenseBlock4 is manipulated; in this case, TDR is higher than the original performance 79 times.

2. Same parameter manipulation, but different scalar factors: In the second setting, we use the same manipulation but different scalar factors to manipulate the parameters of the component models. The manipulation is the addition of Gaussian noise sampled from a Gaussian distribution with zero mean and scaling factors of 0.1, 0.2, and 0.3. Figure 5.13 (a) shows the performance distribution when we manipulate the entire network, whereas Figure 5.13 (b) presents the performance distribution when only the last convolution layer is manipulated. TDR is higher than the original performance four times when the entire network is manipulated, whereas TDR improves 69 times when we manipulate only the last layer. We draw a similar conclusion: the ensemble shows a higher chance of improved performance when we manipulate only the last convolution layer.

Figure 5.13: Performance distributions when three Gaussian-manipulated D-NetPAD models are ensembled. The Gaussian distribution scaling parameters for the three models are 0.1, 0.2, and 0.3, respectively. The red vertical line corresponds to the original performance (without weight perturbations). (a) Performance distribution when the entire network is manipulated; in this case, TDR is higher than the original performance four times. (b) Performance distribution when only the last convolution layer of DenseBlock4 is manipulated; in this case, TDR is higher than the original performance 69 times.

3. Different manipulations: In the third setting, we utilize three models that undergo different parameter manipulations. The manipulations applied to the component models are: (a) Gaussian noise with a scale factor of 0.1, (b) weight zeroing with a proportion of 0.01, and (c) weight scaling with a scalar factor of 1.1. Figure 5.14 (a) shows the performance distribution when we manipulate the entire network, whereas Figure 5.14 (b) presents the performance distribution when we manipulate only the last convolution layer. TDR is never higher than the original performance in the case of the entire-network manipulation, whereas it is higher than the original 100 times in the case of the last-convolution-layer manipulation. Again, the ensemble with only the last layer manipulated shows a higher chance of improved performance.

Figure 5.14: Performance distributions when three parameter-manipulated D-NetPAD models, each undergoing a different type of manipulation, are fused. The manipulations in the three models are additive Gaussian noise (scale factor 0.1), weight zeroing (proportion 0.01), and weight scaling (scale factor 1.1), respectively. The red vertical line corresponds to the original performance (without weight perturbations). (a) Performance distribution when the entire network is manipulated; in this case, TDR is never higher than the original performance. (b) Performance distribution when only the last convolution layer of DenseBlock4 is manipulated; in this case, TDR is higher than the original performance 100 times.
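The ensemble procedure of Figure 5.11 can be sketched as follows: several independently perturbed copies of the trained detector each score the probe images, and the scores are fused. The function name ensemble_pa_score, the sigmoid mapping of a single PA logit, and the use of the sum rule here are illustrative assumptions; the sum rule is the fusion explicitly used for the ensembles in Section 5.7.3.

```python
import torch

def ensemble_pa_score(models, image_batch):
    """Fuse the PA scores of several perturbed models with a sum rule.

    `models` is a list of perturbed copies of the same trained detector
    (e.g., produced by the perturb_gaussian sketch shown earlier);
    `image_batch` is a batch of pre-processed, cropped iris images.
    """
    scores = []
    with torch.no_grad():
        for model in models:
            model.eval()
            logits = model(image_batch)
            # Assumption: the detector outputs one PA logit per image;
            # a sigmoid maps it to a PA score in [0, 1].
            scores.append(torch.sigmoid(logits.squeeze(-1)))
    return torch.stack(scores, dim=0).sum(dim=0)   # sum-rule fusion

# Hypothetical usage: three D-NetPAD copies perturbed with alpha = 0.1.
# ensemble = [perturb_gaussian(trained_model, 0.1) for _ in range(3)]
# fused = ensemble_pa_score(ensemble, probe_batch)
```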
5.7.3 Performance Validation on Another Dataset
Until now, we have observed the performance increase on the LivDet-Iris-2020 dataset. Here, we select high-performing models (single perturbed models and ensembles) based on their performance on the LivDet-Iris-2020 dataset and validate their performance on another dataset, i.e., the LivDet-Iris-2017 dataset. In the case of the ensembles, we consider two high-performing perturbed models and fuse their scores using the sum rule. We explore both directions of improved performance (single perturbed models and ensembles) for each architecture and compare their performance with the original models. Details of these models are given below:
1. Original Model: The model utilizes the originally trained parameters without any perturbation.
2. Perturbed Model: In the case of VGG19, we create a perturbed model by setting 80% of the low-magnitude weights of the seventh convolution layer to zero. For the ResNet101 model, a perturbed model is formed by setting 40% of the low-magnitude weights of the first convolution layer to zero, while for D-NetPAD, 90% of the low-magnitude weights of the last convolution layer of the first dense block are set to zero. The selection of these perturbed models is based on their performance on the LivDet-Iris-2020 dataset (Figure 5.4 (b)).
3. Ensemble Models: To create an ensemble model, we select two perturbed models and fuse their PA scores by the sum rule. In the case of VGG19, we fuse the perturbed model defined above and a model created by adding Gaussian noise with 𝛼 = 0.3, i.e., 𝑁(0, 0.3 · 𝜎(𝑊org)), to the entire network. In the case of ResNet101, we again use the perturbed model defined above, and the second model is created by adding Gaussian noise with 𝛼 = 0.1 to the entire network. Similarly, for D-NetPAD, we fuse the above-specified perturbed model and a model formed by adding Gaussian noise with 𝛼 = 0.1 to all layers.

Table 5.3 provides the performance of these three models for each of the three architectures (VGG19, ResNet101, and D-NetPAD). The performance of the perturbed and ensemble models is better than that of the original model on both datasets, and the observation is consistent across all three architectures. The perturbed models show an average improvement of 30.90% on LivDet-Iris-2017 and 3.86% on LivDet-Iris-2020, whereas the ensemble models show an average improvement of 47.59% on LivDet-Iris-2017 and 5.44% on LivDet-Iris-2020. One major advantage of these perturbed models is that they are created without any further training; another is that the high-performing perturbed models have a reduced model size.

Table 5.3: The performance of the VGG19, ResNet101, and D-NetPAD models in terms of True Detection Rate (%, higher the better) at 0.2% False Detection Rate on the LivDet-Iris-2017 and LivDet-Iris-2020 datasets. The performance is shown for the original model (no parameter perturbations), the perturbed model, and the ensemble of models. Clarkson, Warsaw, Notre Dame, and IIITD-WVU are subsets of LivDet-Iris-2017; "K. Test" and "U. Test" denote their known and unknown test splits.

Model | Clarkson Test | Warsaw K. Test | Warsaw U. Test | Notre Dame K. Test | Notre Dame U. Test | IIITD-WVU Test | LivDet-Iris-2020 Test
VGG19 Original | 51.32 | 86.25 | 10.12 | 100 | 99.00 | 1.44 | 76.87
VGG19 Perturbed | 51.81 | 73.90 | 7.71 | 100 | 99.00 | 6.67 | 78.55
VGG19 Ensemble | 67.64 | 88.14 | 21.71 | 100 | 99.11 | 8.49 | 82.93
ResNet101 Original | 15.82 | 89.93 | 91.67 | 100 | 99.44 | 50.47 | 84.11
ResNet101 Perturbed | 14.50 | 95.33 | 93.18 | 100 | 99.55 | 55.55 | 86.39
ResNet101 Ensemble | 14.71 | 95.18 | 94.51 | 100 | 99.33 | 56.26 | 87.00
D-NetPAD Original | 60.04 | 76.68 | 35.76 | 100 | 99.33 | 32.01 | 90.22
D-NetPAD Perturbed | 69.24 | 90.72 | 40.96 | 100 | 97.33 | 48.35 | 96.28
D-NetPAD Ensemble | 68.89 | 89.53 | 36.94 | 100 | 96.88 | 41.68 | 94.76

5.8 Summary and Future Work
We analyze the robustness of three DNN architectures (VGG19, ResNet101, and D-NetPAD) under three types of parameter perturbations (Gaussian noise addition, weight zeroing, and weight scaling). We apply the perturbations in two settings: modifying the weights across all layers and modifying the weights layer by layer. We found that DNNs are generally robust to a variant of weight zeroing in which low-magnitude weights are set to zero. From the layer-wise analysis, we observe that the DNNs are more stable to perturbations in the later layers than in the initial layers. Certain manipulations improve the performance over the original model. Based on these observations, we propose the use of an ensemble of perturbed models that performs consistently well on both the LivDet-Iris-2017 and LivDet-Iris-2020 datasets. As future work, we will focus on finding a theoretically optimal direction for the weight perturbations.

CHAPTER 6
RETRAINING OF DEEP NEURAL NETWORKS

6.1 Introduction
While a great deal of research in machine learning involves achieving higher performance on various classification and regression tasks, maintaining that performance in a non-stationary environment [182] is a less explored area. A non-stationary environment (a change in the data-capturing device, the deployment location, or the population group) degrades the performance of machine learning models. This performance degradation happens due to dataset shift [182]. Dataset shift involves a shift in the input or output distributions, or a shift in the relationship between input and output; the focus of this work is a shift in the input distribution. To maintain the performance of machine learning models under dataset shift, one needs to update the models with new incoming data. One solution is to fine-tune the model with the new data; however, this results in catastrophic forgetting of the previously learned information [70, 71, 161]. Another solution is to retrain the model using the entire data (old and new), but in real-world scenarios the old training data is generally unavailable due to security or privacy issues. So, the research problem is how to update the existing model so that it maintains its previous performance while improving its performance on new data, given that the old training data is unavailable. Mathematically, we define the problem as follows: let 𝑀 be an expert model trained on old training data 𝑇𝑅𝑜𝑙𝑑 and tested on old test data 𝑇𝑆𝑜𝑙𝑑, on which it works satisfactorily. Now new training data 𝑇𝑅𝑛𝑒𝑤 and new test data 𝑇𝑆𝑛𝑒𝑤 arrive. Given the existing trained model 𝑀 and the new training data 𝑇𝑅𝑛𝑒𝑤, how should we retrain 𝑀 such that it maintains its performance on 𝑇𝑆𝑜𝑙𝑑 and improves its performance on 𝑇𝑆𝑛𝑒𝑤? Here, we consider the old training data 𝑇𝑅𝑜𝑙𝑑 and test data 𝑇𝑆𝑜𝑙𝑑 to belong to one domain, and the new training data 𝑇𝑅𝑛𝑒𝑤 and test data 𝑇𝑆𝑛𝑒𝑤 to belong to another domain. We further assume the following constraints to define the real-world retraining scenario: 1.
Sequential learning: Data of different domains are supposed to learn in a sequence. 113 2. Unavailability of old training data 𝑇 𝑅𝑜𝑙𝑑 : Training samples from the previous domain might not be available due to privacy or security concerns. 3. Limited availability of new training data 𝑇 𝑅𝑛𝑒𝑤 : Training samples from new domain are generally small in number compared to the old training data. There might be an absence of training samples for certain classes. 4. Memory constraints: Information transfer from one domain to another is also restrained due to memory limitation. 5. Architectural capacity constraints: Finite capacity of an architecture limits its ability to learn new domain over time. 6. Knowledge constraints: Generally, a third party performs retraining of the deployed model. There could be a lack of expertise compared to the original architecture designer or developer. In this work, we propose a dynamic weight-based fusion retraining strategy, where we train a new expert model with new incoming training data and make a final decision for a probe sample by a weighted sum of the predicted scores from the old and new trained models. We assign the weights individually for each probe sample using in-domain models at the run-time. The in-domain models provide information about the membership of the probe sample to the old and new training data. Our main contributions of the work are as follows: 1. We propose a novel retraining methodology which involves dynamic weight-based fusion of expert models. We allocate dynamic weights at the run-time for each probe sample. 2. We propose an in-domain model to assign dynamic weights to the scores of the expert models. The in-domain model works on the principle of outlier detection. 3. We perform experiments on three setups: LivDet-Iris-2017, LivDet-Iris-2020, and Split MNIST. These setups illustrate two levels of dataset shift. The first shift is between 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑅𝑛𝑒𝑤 and the second shift is between 𝑇 𝑅𝑛𝑒𝑤 and 𝑇 𝑆 𝑛𝑒𝑤 . 114 The rest of the chapter is organized as follows: Section 6.2 discusses the related work on retrain- ing strategies, Section 6.3 explains the proposed method. Section 6.4 describes the experimental setup and results. Finally, Section 6.5 concludes the chapter. 6.2 Related Work Retraining is a process of including new samples in the old prediction pipeline. The ob- jective is to improve the performance of the deployed model on new test samples 𝑇 𝑆 𝑛𝑒𝑤 and maintain the performance of old test samples 𝑇 𝑆 𝑜𝑙𝑑 . The closest terminology to the retraining paradigm is continual learning, which involves sequential learning of the number of tasks without forgetting knowledge obtained from the previous tasks. Continual learning considers three scenar- ios [118, 278]: Task-Incremental Learning (Task-IL), Domain-Incremental Learning (Domain-IL), and Class-Incremental Learning (Class-IL). Task-IL incrementally learns several independent tasks, explicitly knowing the task identity. Domain-IL learns tasks of the same output label space but dif- fers input distribution. Here, task identity is not known. Class-IL incrementally learns new classes in each task without being given any information on the task identity. The retraining scenario is close to Domain-IL continual learning scenario.The difference is that Domain-IL assumes no shift in train and test distributions within a task, whereas the retraining scenario considers no assumption in this regard (includes both cases shift or no shift). 
Other synonyms of Domain-IL are lifelong learning [50], never-ending learning [155]. Other terms that could be confused with the retraining paradigm are multi-domain incremental learning [93], domain adaptation [289], and transfer learning [197,292]. Multi-domain incremental learning [93] concerns with sequentially learning a task, say image classification, on multiple visual domains with possibly different label spaces, whereas we consider the same label spaces. Domain adaptation involves learning a new task of a different domain without retaining old domain knowledge. Transfer learning also does not retain the previous task knowledge either. Two elementary retraining methodologies are: fine-tune the model using new training data (Fine-Tuned) or retrain the model using entire (old and new) training data (Full-Retrain). The 115 Fine-Tuned methodology results in catastrophic forgetting of the previous knowledge [70, 71, 161], whereas Full-Retrain is not a practical solution due to the unavailability of the old training data. Continual learning methodologies could also directly apply to the retraining paradigm, and in the literature, those are categorized as 1. Regularization-based Approaches: These approaches add regularization terms in the learn- ing process to penalize the drastic change in parameters of the mapping function. The regu- larization helps to prevent catastrophic forgetting of previously learned tasks. Regularisation- based methods have a limited capacity to learn a large number of the tasks. Some seminal works based on regularization approaches are elastic weight consolidation (EWC) [141], online EWC [244], Kronecker factored EWC (KFAC) [229], Synaptic Intelligence (SI) [313], Memory Aware Synapses (MAS) [13], Learning without Forgetting (LwF) [161], Orthogonal Weight Modification (OWM) [312], and Natural Continual Learning (NCL) [136]). 2. Dynamic Neural Networks/Parameter-Isolation Approaches: These approaches begin with a simplified architecture and, when needed, augment the network incrementally with new components to attain satisfactory performance on subsequent tasks [42,81,156,228,233]. In a practical scenario, a finite capacity of models limits their ability to learn a large number of tasks over time. 3. Replay-based/Memory/Rehearsal-based Approaches: Replay-based approaches comple- ment the existing expert models with memory to accommodate information about previously learned tasks. These approaches involve the usage of a subset of training samples from the previous tasks [14, 22, 44, 167, 204, 296] or the learning of generative or probabilistic models to simulate pseudo-samples from previously learned tasks [230, 252, 277]. However, as the number of tasks increases, it becomes difficult to maintain additional memory to store previous tasks information. We propose a fusion-based methodology that learns a separate expert model using new training data and makes a final decision by a weighted sum of old and new prediction scores. The work 116 Table 6.1: Different methodologies of retraining along with the information about the knowledge needs to transfer to the next task and the special requirements for the training of the current task. 
Knowledge Special requirements Approaches Methods transferred for the training of to next task new model Fine-Tuned None None Baselines Training includes Full-Retrain Old training data old training data Regularization-based [13, 136, 141, 161, 229, 244, 312, 313] None Learning constraints Approaches Dynamic Neural [42, 81, 156, 228, 233] None Increment architecture Networks Approaches Subset of Training includes subset Replay-based [14, 22, 44, 167, 204, 296] old training data of old training data Approaches Generation of synthetic [230, 252, 277] Generative model old training data Fusion-based Proposed Method In-domain Model None Approaches in [153] is close to our work, where they also learned separate expert models with incoming new training data and measured the marginal likelihood of the expert model using a density estimator, whereas we assign dynamic weights to expert models using in-domain model. We also explicitly discuss the information required to transfer to the subsequent tasks (apart from the expert model) along with special requirements during the training of the current expert model in all retraining methodologies in Table 6.1. The Fine-Tuned method does not require anything from the previous task, whereas the Full-Retrain method requires the entire old training dataset. Regularization-based approaches do not require information from the previous tasks but apply additional constraints in the learning process. Dynamic Neural Networks approaches require a change in the architecture with subsequent tasks. Replay-based methods require additional memory to transfer a generative model or subset of old training data to the subsequent tasks. The proposed fusion-based approach does not require a change in the learning process or architecture. The approach also consumes less memory compared to the relay-based approaches. However, there is an increase in the inference time compared to the other methodologies. 117 6.3 Proposed Algorithm In this work, we propose a dynamic weight-based fusion method to update existing expert model such that it maintains its performance on both 𝑇 𝑆 𝑜𝑙𝑑 and 𝑇 𝑆 𝑛𝑒𝑤 data. Figure 6.1 illustrates the proposed idea. Let’s consider that we have old training data 𝑇 𝑅𝑜𝑙𝑑 and its corresponding expert model. We also need to build an in-domain model on 𝑇 𝑅𝑜𝑙𝑑 , which provides weight information. When a new training data 𝑇 𝑅𝑛𝑒𝑤 come, we build two separate models (expert and in-domain models). During testing, we input a probe sample into all four models. Two expert models output prediction scores (𝑆1 and 𝑆2), whereas in-domain models output weights (𝑊1 and 𝑊2) assign to the prediction scores. We estimate the final prediction score as 𝑆 = 𝑊1 × 𝑆1 + 𝑊2 × 𝑆2. (6.3.1) Figure 6.1: The overall idea of the dynamic weight-based fusion strategy for retraining. We train two models (expert and in-domain models) on incoming training data, and a final decision is made based on the weighted sum of their prediction scores. The expert model provides the prediction score, and the in-domain model assigns weight to the prediction score. Our main contribution lies in the introduction of the in-domain model to estimate the dynamic weights. The in-domain model works on the principle of outlier detection, where training data of one expert model is considered as inliers and for each probe sample, we determine the degree of being an outlier with respect to the training data of the expert model. 
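The decision rule of Equation 6.3.1 can be sketched as follows: two expert models produce prediction scores for a probe sample, two in-domain models score how much of an outlier the probe is with respect to their respective training sets, and the final score is the weighted sum of the expert scores. As an assumption for illustration, the in-domain weighting is implemented here with scikit-learn's LocalOutlierFactor in novelty mode, followed by the inverse-LOF-plus-softmax mapping described later in this section; the callables `embed` (the feature extractor) and the expert scorers are hypothetical placeholders.

```python
import numpy as np
from scipy.special import softmax
from sklearn.neighbors import LocalOutlierFactor

class InDomainModel:
    """Outlier detector fitted on the feature embeddings of one training set."""

    def __init__(self, train_features: np.ndarray, n_neighbors: int = 20):
        self.lof = LocalOutlierFactor(
            n_neighbors=n_neighbors, metric="euclidean", novelty=True)
        self.lof.fit(train_features)

    def lof_score(self, probe_feature: np.ndarray) -> float:
        # score_samples returns the negative LOF; negate to recover the LOF
        # itself (around 1 for inliers, larger for outliers).
        return float(-self.lof.score_samples(probe_feature.reshape(1, -1))[0])

def fused_score(probe_image, experts, in_domain_models, embed):
    """Weighted-sum fusion S = W1*S1 + W2*S2 (Equation 6.3.1)."""
    feature = embed(probe_image)                      # in-domain feature embedding
    scores = np.array([expert(probe_image) for expert in experts])
    inv_lof = np.array([1.0 / m.lof_score(feature) for m in in_domain_models])
    weights = softmax(inv_lof)                        # weights in [0, 1], summing to 1
    return float(np.dot(weights, scores))
```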
To accomplish this, the in-domain model contains two components: (i) a feature extractor that represents the training distribution in a feature space, and (ii) a distance measure that provides the outlier score of a probe sample with respect to that training feature space. The details of these components are as follows.

Feature Extractor (FE): The base architecture we use for feature extraction is the Vision Transformer (ViT) [78]. The great success of transformers in natural language processing [116, 281] and computer vision [78] inspired us to use it for representing the training data. The input to the ViT architecture is a sequence of flattened 2D image patches 𝑥_p ∈ ℝ^{𝑁×(𝑃²·𝐶)} together with a 1D position vector (providing the position information of the image patches), where 𝑁 is the number of patches, 𝑃 is the patch size, and 𝐶 is the number of channels. We remove the MLP head used for classification from the original ViT architecture to make it a feature extractor. We use the "Base" version of ViT, which has 12 layers, a 768-dimensional hidden latent vector, and 16 × 16 input patches; the total number of learnable parameters in the architecture is 86M. We train the ViT feature extractor using two losses, the center loss and the mean-shifted intra-class loss, whose details are as follows:

1. Center Loss: The objective of the center loss is to extract features from the training samples such that the feature embeddings lie close to the center of the embeddings. The center of the training data embeddings is calculated as
𝑐 = E_{𝑥 ∈ 𝜒_train}[𝜙(𝑥)],   (6.3.2)
where 𝑥 is the input image, 𝜙(𝑥) is the feature embedding from the ViT model, and 𝜒_train is the training set. We update the center position in every epoch. The center loss is then calculated as
ℓ_center(𝑥) = ‖𝜙(𝑥) − 𝑐‖₂.   (6.3.3)
The loss reduces the intra-set variation among the training samples and forms a more compact feature representation, which helps in detecting outlier samples from other training sets.

2. Mean-Shifted Intra-Class Loss: The objective of this loss is to form a cluster of the samples belonging to the same class. To accomplish this, we first mean-shift the embeddings of the training samples as
𝜃(𝑥) = (𝜙(𝑥) − 𝑐) / ‖𝜙(𝑥) − 𝑐‖₂,   (6.3.4)
where 𝜙(𝑥) is the feature embedding of the input sample 𝑥 from the ViT model and 𝑐 is the center of the training samples in the ViT feature space. We then estimate a contrastive loss over two mean-shifted representations 𝑥′ and 𝑥″ belonging to the same class 𝐶ᵢ as
ℓ_msic(𝑥′, 𝑥″)|_{{𝑥′, 𝑥″} ∈ 𝐶ᵢ} = ℓ_con(𝜃(𝑥′), 𝜃(𝑥″)) = − log [ exp(𝜃(𝑥′)·𝜃(𝑥″)/𝜏) / Σ_{𝑖=1}^{2𝑁} 1[𝑥ᵢ ≠ 𝑥′] exp(𝜃(𝑥′)·𝜃(𝑥ᵢ)/𝜏) ],   (6.3.5)
where 𝜃(·) is the mean-shifted representation, 𝑁 is the batch size, 𝜏 is the temperature hyperparameter, and 1[·] is an indicator function.

Together, the two losses form clusters of samples belonging to the same class around the center of the training samples. This class-cluster formation helps in the detection of local outliers. By a local outlier, we mean a sample whose distance to the training set as a whole is small, but which is an outlier with respect to its class distribution. Consider Figure 6.2, where the blue-colored data points belong to one training set, C is the center of the training set, and the red-colored data point P is a probe sample; there are two classes (Class 1 and Class 2) of different densities. If we consider a global outlier measure, the probe sample would be an inlier to the blue-colored training set, since its distance from Class 1 is small compared to the distance between the data points of Class 1 and Class 2. According to a local outlier measure, however, it is an outlier to Class 1, since the distances among the data points of Class 1 are significantly smaller than the distance between the probe sample and the data points of Class 1. Consequently, the probe sample is also considered an outlier to the blue-colored training set.

Figure 6.2: Illustration of the local outlier concept used in the mean-shifted intra-class loss. The blue-colored data points belong to one training set, C is the center of the training set, and the red-colored data point P is a probe sample. There are two classes (Class 1 and Class 2) in the blue-colored training set. Under the global outlier concept, the red-colored probe sample would be an inlier; under the local outlier concept, the probe sample is an outlier to Class 1 as well as to the blue-colored training set. The figure is better viewed in color.

The total loss is the sum of the center loss and the mean-shifted intra-class loss:
ℓ_total(𝑥′, 𝑥″) = ℓ_center(𝑥′) + ℓ_center(𝑥″) + ℓ_msic(𝑥′, 𝑥″).   (6.3.6)
Based on these losses, we train the feature extractor and use its features to represent the training data.
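A compact sketch of the two losses (Equations 6.3.2-6.3.6) is given below, assuming each batch consists of paired samples x′, x″ from the same class. The pairing strategy, the batch-level averaging, and the use of a cross-entropy formulation of the contrastive term (equivalent to the negative log ratio in Equation 6.3.5, with only the anchor itself excluded from the denominator) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def center_loss(phi_x: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """l_center(x) = ||phi(x) - c||_2, averaged over the batch (Eq. 6.3.3)."""
    return (phi_x - center).norm(dim=1).mean()

def mean_shift(phi_x: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """theta(x) = (phi(x) - c) / ||phi(x) - c||_2 (Eq. 6.3.4)."""
    return F.normalize(phi_x - center, dim=1)

def msic_loss(theta1: torch.Tensor, theta2: torch.Tensor, tau: float = 0.25) -> torch.Tensor:
    """Contrastive loss over mean-shifted same-class pairs (Eq. 6.3.5).

    theta1[i] and theta2[i] are mean-shifted embeddings of two samples from
    the same class; all other samples in the 2N-sized batch act as negatives.
    """
    n = theta1.shape[0]
    reps = torch.cat([theta1, theta2], dim=0)               # 2N unit-norm embeddings
    sim = reps @ reps.t() / tau                             # pairwise similarities / tau
    sim.fill_diagonal_(float("-inf"))                       # enforce x_i != x' (self excluded)
    positives = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, positives.to(sim.device))   # -log(exp(pos) / sum(exp(all)))

# Total loss (Eq. 6.3.6), with `vit_features` a hypothetical ViT feature extractor
# and `c` the current center of the training embeddings (Eq. 6.3.2):
# phi1, phi2 = vit_features(x1), vit_features(x2)
# loss = (center_loss(phi1, c) + center_loss(phi2, c)
#         + msic_loss(mean_shift(phi1, c), mean_shift(phi2, c)))
```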
Distance Measure: After representing the training data and a probe sample in this feature space, we estimate the distance of the probe sample with respect to the distribution of the training data using the Local Outlier Factor (LOF) [39]. The LOF is an unsupervised outlier detection algorithm that assigns a score to each sample based on its local density deviation with respect to its neighbors; it considers as outliers those samples whose density is substantially lower than that of their neighbors. An LOF score of approximately one suggests that the sample has a density similar to its neighbors, a value less than one indicates that the sample has a higher local density than its neighbors, and a value greater than one indicates that the sample has a lower density than its neighbors. To assign weights to the individual expert models, we first invert the LOF scores and then perform SoftMax normalization to ensure that the weights lie in the range [0, 1] and sum to one.

6.4 Experimental Setup and Results
To evaluate the proposed methodology, we perform experiments on three setups: LivDet-Iris-2017, LivDet-Iris-2020, and Split MNIST. These setups consider dataset shift at two levels: 1) a shift between the training data of the old (𝑇𝑅𝑜𝑙𝑑) and new (𝑇𝑅𝑛𝑒𝑤) domains, and 2) a shift between the training (𝑇𝑅𝑛𝑒𝑤) and test (𝑇𝑆𝑛𝑒𝑤) data within the new domain. The LivDet-Iris-2017 setup represents the scenario where a dataset shift occurs between 𝑇𝑅𝑜𝑙𝑑 and 𝑇𝑅𝑛𝑒𝑤, but there is no shift between 𝑇𝑅𝑛𝑒𝑤 and 𝑇𝑆𝑛𝑒𝑤 (except in one case, explained in Section 6.4.1). The LivDet-Iris-2020 setup illustrates the scenario where a dataset shift occurs both between 𝑇𝑅𝑜𝑙𝑑 and 𝑇𝑅𝑛𝑒𝑤 and between 𝑇𝑅𝑛𝑒𝑤 and 𝑇𝑆𝑛𝑒𝑤. The Split MNIST setup depicts the condition where the distribution of 𝑇𝑅𝑜𝑙𝑑 is disjoint from that of 𝑇𝑅𝑛𝑒𝑤, but there is no shift between 𝑇𝑅𝑛𝑒𝑤 and 𝑇𝑆𝑛𝑒𝑤. The LivDet-Iris-2017 and LivDet-Iris-2020 setups concern the detection of presentation attacks (PAs) in the iris biometric modality. We formulate PA detection as a binary classification between bonafide and fake (print, cosmetic contacts, artificial eyes, and electronic display) iris images.
The Split MNIST setup used to compare the proposed method with existing state-of-the-art continual learning methods. 6.4.1 LivDet-Iris-2017 Setup and Results In this setup, we utilize two datasets: IARPA dataset [2] and LivDet-Iris-2017 dataset [304]. The IARPA dataset is a proprietary dataset collected under the IARPA Odin program [2]. We divide the dataset into two splits which in this setup considered as 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑆 𝑜𝑙𝑑 data. The LivDet-Iris- 2017 dataset [304] is a publicly available dataset for iris presentation attack detection. It consists of four subsets: Clarkson, Warsaw, Notre Dame, and IIITD-WVU. All these subsets consist of their corresponding train and test sets which in this setup considered as 𝑇 𝑅𝑛𝑒𝑤 and 𝑇 𝑆 𝑛𝑒𝑤 , respectively. Details of these subsets are as follows: 1. Clarkson subset: It consists of print and cosmetic contacts PAs. The subset represents the cross-PA testing scenario, where five additional cosmetic contacts and prints of visible spectrum iris images captured using an iPhone 5 are included in the test set. 2. Warsaw subset: It consists of only print iris PA. The subset consists of two test sets: “known" and “unknown". The “unknown" test represents a cross-sensor scenario, where different sensors are used to capture images of the training and test sets. 3. Notre-Dame subset: It consists of only cosmetic contact iris PA. This subset also contains two test sets (“known" and “unknown"). The “unknown" test set represents the cross-PA 122 Table 6.2: Description of the old and new training/test sets in the LivDet-Iris-2017 setup along with the number of bonafide and fake iris images present in the datasets. The information about the sensors used to capture images is also provided. Each test set represents different testing scenarios. The Clarkson and Notre Dame test sets correspond to the cross-PA scenario, whereas the Warsaw data corresponds to the cross-sensor scenario. The IIITD-WVU represents a cross-dataset scenario. Here, “K. Test” means a known test set of the dataset, and “U. Test” means an unknown test set. Old Train and Test Domains Domains New Train and Test Domains (LivDet-Iris-2017 Dataset) (IARPA Dataset) IARPA IARPA Clarkson Warsaw Notre Dame IIITD-WVU Datasets Split I Split II (Cross-PA) (Cross-Sensor) (Cross-PA) (Cross-Dataset) Train/Test Train Test Train Test Train K. Test U. Test Train K. Test U. Test Train Test Bonafide 9,660 2,963 2,469 1,485 1,844 974 2,350 600 900 900 2,250 702 Print 2,634 - 1,346 908 2,669 2,016 2,160 - - - 3,000 2,806 Cosmetic 2,757 177 1,122 765 - - - 600 900 900 1,000 701 Contacts Artificial 554 175 - - - - - - - - - - Eyes Electronic 130 - - - - - - - - - - - Display Aritech ARX-3M3C, Iris ID iCAM7000, Cogent Iris ID IrisAccess IrisAccess IrisGuard Fujinon DV10X7.5A, IrisGuard AD100, IriShield Sensor IrisGuard AD100, CIS 202, iCAM7000 EOU2200 EOU2200 AD100 DV10X7.5A-SA2 lens IrisAccess LG4000 MK2120U IrisAccess LG4000 VistaFA2E B+W 092 NIR filter scenario, where different cosmetic contacts introduce in the test set. 4. IIITD-WVU subset: It includes the PA images from both print and cosmetic contacts. The subset is a combination of data from IIITD and WVU collections. The subset corresponds to the cross-dataset scenario where training performs on the IIITD collection and testing on the WVU collection. The training set is captured in a controlled environment using two iris sensors: Cogent dual iris sensor (CIS 202) and VistaFA2E single iris sensor. 
The test set is captured using the IriShield MK2120U mobile iris sensor at two different locations: indoors (controlled illumination) and outdoors (varying environmental conditions). The setup represents the scenario where there is a dataset shift between 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑅𝑛𝑒𝑤 , but no shift between 𝑇 𝑅𝑛𝑒𝑤 and 𝑇 𝑆 𝑛𝑒𝑤 except in the case of the IITD-WVU dataset. In the IITD- WVU dataset, a shift also exists between 𝑇 𝑅𝑛𝑒𝑤 and 𝑇 𝑆 𝑛𝑒𝑤 . Table 6.2 describes all old and new training/test sets along with types of PAs and images present in all training and test sets. For evaluation, we compare it with following models: (i) Old Expert Model: trained only on 𝑇 𝑅𝑜𝑙𝑑 , (ii) New Expert Model: trained only on 𝑇 𝑅𝑛𝑒𝑤 , (iii) Fine-Tuned: trained on 𝑇 𝑅𝑜𝑙𝑑 and then fine-tuned using 𝑇 𝑅𝑛𝑒𝑤 , (iv) Full-Retrain: trained on both 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑅𝑛𝑒𝑤 , (v) Fusion- Equal Weights: fusion of scores from two expert models (Old and New Expert Models) with 123 equal weights, (vi) Fusion-Pre-trained ViT Features-Dynamic Weights: fusion of scores from two expert models with dynamic weights where in-domain model uses pre-trained ViT model for the feature representation, and (vii) Fusion-Dynamic Weights (proposed method): fusion of scores from two expert models with dynamic weights where proposed feature extractor (FE) model is used to represent the training data. To train FE model, we initialize weights with pre-trained model trained on the ImageNet-21k and JFT-300M datasets, the number of epochs used is 100, the batch size is 15, 𝜏 is 0.25, and the optimizer is stochastic gradient descent with a learning rate of 1e-5. For the implementation of LOF distance measure, we use default values provided in [39], the number of neighbors is 20, and the distance metric to estimate neighbors is euclidean distance. As an expert model, we use the model proposed in [249] for iris presentation attack detection. Table 6.3 presents the performance of all the models in terms of True Detection Rate (TDR (%)) at 0.2% False Detection Rate (FDR). TDR is the percentage of fake samples correctly detected, whereas FDR is a percentage of bonafide samples incorrectly detected as fake. The performance scores are reported individually on both test splits. The objective is to obtain high performance (higher TDR) on both test splits (𝑇 𝑆 𝑜𝑙𝑑 and 𝑇 𝑆 𝑛𝑒𝑤 ). The Full-Retrain model provides a benchmark for evaluating the performance of retraining methods. The Old and New Expert models perform better in their respective test splits but fail on other test splits. The Fine-Tuned model performs better on 𝑇 𝑆 𝑛𝑒𝑤 but fails to retain knowledge about the old (poor performance on 𝑇 𝑆 𝑜𝑙𝑑 ). The Full-Retrain model performs better on both test splits, but it is not a practical approach as old training data is generally unavailable. With respect to the fusion methodologies, we consider providing equal weights as our lower performance benchmark. We perform one ablation study where a pre-trained ViT model is used for the feature extraction. The proposed method outperforms both fusion-based methodologies, which validate that the proposed FE model better represents the training data and the weights are appropriately assigned to their respective models. We also visualize weight histograms of various test splits (Figure 6.3). The histograms are generated using weight values given to the New Expert Model for test data (both old and new). 
So, weight values toward ‘0’ of the x-axis symbolize higher priority given to the 124 Table 6.3: The performance of all retraining methods in terms of True Detection Rate (%, higher the better) at 0.2% False Detection Rate on old (𝑇 𝑆 𝑜𝑙𝑑 ) and new (𝑇 𝑆 𝑛𝑒𝑤 ) test sets of the LivDet- Iris-2017 setup. Old New Old New Old New Old New Test Domains (𝑇 𝑆 𝑜𝑙𝑑 ) (𝑇 𝑆 𝑛𝑒𝑤 ) (𝑇 𝑆 𝑜𝑙𝑑 ) (𝑇 𝑆 𝑛𝑒𝑤 ) (𝑇 𝑆 𝑜𝑙𝑑 ) (𝑇 𝑆 𝑛𝑒𝑤 ) (𝑇 𝑆 𝑜𝑙𝑑 ) (𝑇 𝑆 𝑛𝑒𝑤 ) IARPA Clarkson IARPA Warsaw IARPA Notre-Dame IARPA IIITD-WVU Datasets Split II (Cross-PA) Split II (Cross-Sensor) Split II (Cross-PA) Test (Cross-Dataset) Test Test Test K. Test U. Test Test K. Test U. Test Test Test Old Expert Model 98.44 28.63 98.44 92.95 98.56 98.44 93.55 91.00 98.44 42.91 New Expert Model 25.54 92.05 0.31 100 100 29.90 100 66.55 0.31 29.30 Fine-Tuned 86.91 93.51 45.48 100 100 98.75 100 99.77 83.17 48.85 Full-Retrain 96.57 91.63 93.76 100 100 96.57 100 100 96.57 66.81 Fusion of Old and New Domain Expert Models Equal Weights 97.50 89.67 97.81 99.45 100 99.37 99.88 96.22 98.44 43.62 Pre-trained ViT Features- 98.13 72.80 91.27 100 99.38 99.37 100 80.44 88.16 29.27 Dynamic Weights Fine-tuned ViT Features- 98.44 92.67 98.13 100 100 99.37 100 99.55 98.13 44.94 Dynamic Weights Old Expert Model, whereas weight values towards ‘1’ of the x-axis denote higher priority given to the New Expert Model. New test data of the IIIT-WVU subset produces weights around 0.5 as the distribution of IIIT-WVU subset test samples is independent of the training distribution of both expert models. So, assigning weights around 0.5 is an appropriate step. In all other cases, 𝑇 𝑆 𝑜𝑙𝑑 receives higher weights for the old expert model and 𝑇 𝑆 𝑛𝑒𝑤 receives higher weights for the new expert model. It is noteworthy that the proposed method outperforms even the Full-Retrain method except in the case of the IIIT-WVU test split. As specified earlier, the distribution of the IIIT-WVU test split is different from both training sets, which limits the performance of dynamic weight-based fusion in this particular case. 6.4.2 LivDet-Iris-2020 Setup and Results In this setup we utilizes three datasets: IARPA dataset [2], Warsaw PostMortem v3 dataset [7] and LivDet-Iris-2020 dataset [61]. We divide the IARPA dataset into three splits and consider them as 𝑇 𝑅𝑜𝑙𝑑 and two 𝑇 𝑅𝑛𝑒𝑤 sets. Warsaw PostMortem v3 dataset is used as a 𝑇 𝑅𝑛𝑒𝑤 and the LivDet-Iris-2020 dataset as 𝑇 𝑆 𝑛𝑒𝑤 . Description of these datasets are as follows: 1. Old training set (𝑇 𝑅𝑜𝑙𝑑 ): One split of IARPA dataset collected using iCAM7000 iris sensor. 2. New training set (𝑇 𝑅𝑛𝑒𝑤 ): 125 (a) (b) (c) (d) Figure 6.3: Histogram of weights dynamically allocated for all test samples (old and new) corre- sponds to (a) Clarkson,(b) Warsaw, (c) Notre-Dame, and (d) IIIT-WVU subsets of LivDet-Iris-2017 setup. In the case of Warsaw and Notre-Dame, ‘Known’ test splits are used for illustration. Weight values toward ‘0’ of the x-axis symbolize higher priority given to the Old Expert Model, whereas weight values towards ‘1’ of the x-axis denote higher priority given to the New Expert Model. New test data of the IIIT-WVU subset estimate weights around 0.5 as the distribution of the IIIT-WVU subset test set is independent of the training distribution of both expert models. The figure is better viewed in color. 126 a) IARPA split: It is the second split of the IARPA dataset. The type of fake images is the same as present in 𝑇 𝑅𝑜𝑙𝑑 . The images are also collected from the same sensor. 
So, there is no additional information other than the more varieties of cosmetic contacts. The total number of images is also limited compared to 𝑇 𝑅𝑜𝑙𝑑 . b) Cross-sensor data: This is the third split of the IARPA dataset. The split represents the cross-sensor scenario, where images of this split are captured using LGIris and VistaEY2 iris sensors, whereas 𝑇 𝑅𝑜𝑙𝑑 images are collected from the iCAM7000 sensor. The type of fake images is the same as 𝑇 𝑅𝑜𝑙𝑑 . So, the data contains additional information about the sensors, and the number of images is higher than in 𝑇 𝑅𝑜𝑙𝑑 . c) Post-mortem data: It consists of images from the Warsaw PostMortem v3 dataset [7]. It represents the cross-PA scenario where a new type of fake images from cadaver eyes are present. It does not contain any bonafide images. d) Combined data: It includes the data from all the above-stated data sets. So, it represents both cross-sensor and cross-PA scenarios. The total number of images is also higher than in 𝑇 𝑅𝑜𝑙𝑑 . 3. New test set (𝑇 𝑆 𝑜𝑙𝑑 and 𝑇 𝑆 𝑛𝑒𝑤 ): The test dataset is the LivDet-Iris-2020 [61] competition data. It is independent of 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑅𝑛𝑒𝑤 . The setup considers the scenario where dataset shift occurs both in 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑅𝑛𝑒𝑤 , and 𝑇 𝑅𝑛𝑒𝑤 and 𝑇 𝑆 𝑛𝑒𝑤 except in the first case of IARPA split (no shift between 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑅𝑛𝑒𝑤 ). Table 6.4 provides the number of bonafide and fake images used in these sets. Implementation details for the in-domain and expert models are the same as the previous experimental setup. Evaluation models are also the same: Old Expert Model, New Expert Model, Fine-Tuned, Full-Retrain, Fusion- Equal Weights, Fusion-Pre-trained ViT Features-Dynamic Weights, and Fusion-Dynamic Weights (proposed method). Table 6.5 presents the performance of all these models in terms of TDR (%) at 0.2% FDR on the LivDet-Iris-2020 test dataset. 127 Table 6.4: Description of the old and new train/test sets in the LivDet-Iris-2020 setup along with the number of bonafide and fake iris images present in the sets. The information about the sensors used to capture images is also provided. Domains Old (𝑇 𝑅𝑜𝑙𝑑 ) New (𝑇 𝑅𝑛𝑒𝑤 ) Old and New (𝑇 𝑆) Datasets IARPA Split I IARPA Split II Cross-sensor Post-Mortem Combined LivDet-Iris 2020 Train/Test Train Train Train Train Train Test Bonafide 9,660 2,963 9,606 - 12,569 5,331 Print 2,634 - - - - 1,049 Cosmetic Contacts 2,757 177 539 - 716 4,336 Artificial Eyes 554 175 383 - 558 541 Electronic Display 130 - - - - 81 Cadaver Eyes - - - 2,400 2,400 1,094 Iris ID iCAM7000, Iris ID iCAM7000, Iris ID iCAM7000, Iris ID LGIris, IriShield IrisGuard AD100, Sensor IrisGuard AD100, LGIris, VistaEY2, iCAM7000 VistaEY2 M2120U IrisAccess LG4000, IrisAccess LG4000 IriShield M2120U IriTech IriShield Table 6.5: The performance of all retraining methods in terms of True Detection Rate (%, higher the better) at 0.2% False Detection Rate on the LivDet-Iris-2020 test set. Train Dataset IARPA Split I IARPA Split II Cross-Sensor Post-Mortem Combined Test Dataset LivDet-Iris 2020 Old and New Expert Models 61.86 58.25 75.55 0.94 85.56 Fine-Tuned - 63.18 66.53 0 83.00 Full-Retrain - 77.96 76.96 67.76 94.05 Fusion of Old and New Domain Expert Models Equal Weights - 72.42 79.04 58.73 87.05 Pre-trained ViT Features- - 69.91 79.03 58.73 89.38 Dynamic Weights Fine-tuned ViT Features- - 69.27 81.36 61.99 93.62 Dynamic Weights The Old and New Expert models are not performing well on the test set as the distribution of the test set is different from 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑅𝑛𝑒𝑤 training sets. 
Similar is the case with the Fine- Tuned model. The Full-Retrain model outperforms the other models, but again it is not a practical approach due to the unavailability of the old training data. The proposed method outperforms both fusion-based methods (fusion with equal weights and Pre-trained ViT feature-based dynamic weight fusion). However, its performance is lower than the Full-Retrain method as the distribution of the test set does not match any of the training sets, and the proposed methodology depends on the training sets used by the expert models. 128 6.4.3 Split MNIST Setup and Results We further perform experiments on the MNIST dataset for comparing the proposed retraining methodology with existing state-of-the-art continual learning strategies. The original dataset consists of 28 × 28 pixel grey-scale images of ten digits. We use standard train and test split, with 60,000 training images (∼6,000 per digit) and 10,000 test images (∼1,000 per digit). The main task is to classify even digit images from odd digit images. The main task is subdivided into five binary sub-tasks, where the first task is to classify ‘0’ and ‘1’ digits, the second task is to classify ‘2’ and ‘3’, and so on. The splitting of the dataset according to the five sub-tasks is referred as Split MNIST in the literature [118, 278]. The tasks are learned sequentially. The class labels are the same for all tasks, giving labels 0 to odd digit images and 1 to even digit images. Here, distributions of training data corresponding to different tasks are disjoint, but there is no shift between training and testing distribution within a task. Figure 6.4 depicts the experimental setup for the Split MNIST dataset. Figure 6.4: The experimental setup of the Split MNIST dataset for the retraining scenario. The main task is to classify old and even digit images. The task is divided into five sub-tasks, where the first task is to classify ‘0’ and ‘1’ digits, the second task is to classify ‘2’ and ‘3’, and so on. The class labels remain the same for all sub-tasks: 0 for odd digit images and 1 for even digit images. For the expert model, we use multi-layer perceptron (MLP) architecture and learning parameters as used in the [118] for a fair comparison. The MLP architecture consists of two fully connected layers with 400 nodes each, followed by a softmax output layer. ReLU non-linearity is used in both fully connected layers. The loss function used is cross-entropy, the number of epochs is four and 129 batch size is 128, and the optimizer is stochastic gradient descent with a learning rate of 0.01. For each task, we build separate expert and in-domain models, which results in a total of ten models (five expert and five in-domain models). Implementation details of the in-domain model are the same as the previous experimental setup. For comparative evaluation, we use the baseline models as provided in [118] which include Fine- tuned and Full-Retrain models. We also compare the proposed method with other continual learning methods: regularization-based (EWC [141], online EWC [244], SI [313], KFAC [229], MAS [13], LwF [161], OWM [312], NCL [136]) and replay-based (BiC [296], ER [44], GDumb [204], RM [22], DGR [252], GEM [167], RtF [277]). In the fusion-based approaches, we consider Fusion- Equal Weights, Fusion-Manual Weights, Fusion-Pre-trained ViT Features-Dynamic Weights, and Fusion-Dynamic Weights (proposed method). In the Fusion-Manual Weights, we manually assign one to the correct expert model and zero to other models. 
Fusion-Equal Weights is our lower per- formance limit, whereas Fusion-Manual Weights is the upper limit. As training data of all sub-tasks are disjoint, the manual weight assignment is a reasonable choice for the upper limit. All methods utilize the same experimental setup and expert models but few methods from regularization-based and replay-based approaches differ in hyperparameters (batch size, learning rate, and the number of epochs). Table 6.6 provides the results of all the methods in terms of the accuracy (%). The proposed method outperforms Fine-Tuned baseline method and all regularization-based methods. Its performance is lower compared to three replay-based methods (DGR [252] and RtF [277] and GEM [167]). DGR [252] and RtF [277] are generative-based methods that involve separate training of a generative model along with an expert model. The generative model is then used to generate previous task samples and augment the training of the subsequent tasks. The process increases the training time of the subsequent tasks and makes the training of the expert model dependent on the generative model. However, the proposed method does not involve any generation of the samples, and the expert model is independent of additional models. GEM [167] method requires additional memory to store a subset of the previous task samples, which is a concern in terms of memory as well as privacy. The fusion-based methodology shows the maximum limit 130 Table 6.6: The average accuracy (%, higher the better) of the proposed retraining approach with different state-of-the-art continual learning approaches on the Split MNIST dataset. Methods with ‘+’ superscript are reported from [118], ‘o’ from [136], ‘*’ from [22] and ‘-’ from [153]. All methods utilize the same experimental setup and expert models but differs in hyperparameters (batch size, learning rate, and the number of epochs). We use the same hyperparameters as used in [118]. Each value is an average of ten runs. Approaches Method Accuracy (%) Fine-Tuned+ 63.20 ± 0.35 Baselines Full-Retrain+ 98.59 ± 0.15 EWC [141]+ 58.85 ± 2.59 Online EWC [244]+ 57.33 ± 1.44 SI [313]+ 64.76 ± 3.09 Regularization-based KFAC [229] 67.86 ± 1.33 Approaches MAS [13]+ 68.57 ± 6.85 LwF [161]+ 71.02 ± 1.26 OWM [312]𝑜 87.46 ± 0.74 NCL [136]𝑜 91.48 ± 0.64 BiC [296]∗ 77.75 ± 1.27 ER [44]− 85.69 GDumb [204]∗ 88.51 ± 0.52 Replay-based RM [22]∗ 92.65 ± 0.33 Approaches DGR [252]+ 95.74 ± 0.23 GEM [167]+ 96.16 ± 0.35 RtF [277]+ 97.31 ± 0.11 Equal Weights (Lower Limit) 84.20 ± 0.08 Manual Weights (Upper Limit) 98.66 ± 0.008 Fusion-based CN-DPM [153]− 93.23 Approaches Pre-trained ViT Features- 81.34 ± 0.005 Dynamic Weights Fine-tuned ViT Features- Dynamic Weights 94.32 ± 0.01 (Proposed Method) 131 as 98.66% (performance of Fusion-Manual Weights), which outperforms all methods even the Full-Retrain method. We also experiment with another distance measure (Mahalanobis distance), and it results in an accuracy of 97.03 ± 0.0001%, which is as par as the replay-based methods without any generation or storage of additional training data. In this setup, Mahalanobis distance performs the best as disjoint training distributions are effectively characterize by the mean and variance, whereas LOF outperforms in other setups. To understand the importance of training the ViT-based feature extractor with the proposed losses, we visualize the features extracted from the pre-trained ViT model (Figure 6.5) and our trained FE model (Figure 6.6). 
The visualization involves training embeddings corresponding to all five sub-tasks (shown in different colors). The embeddings are reduced to three dimensions using t-SNE [279]. The pre-trained model features show significant overlap among different task embeddings compared to our trained FE model. The performance and visualization both validate the use of loss functions involved in the training of ViT-based feature extractor. Figure 6.5: 3-D t-sne plot showing pre-trained ViT embeddings correspond to five sub-tasks of the Split MNIST dataset. The training samples of different classes are overlapping in the feature space. The figure is better viewed in color. 132 Figure 6.6: 3-D t-sne plot showing fine-tuned ViT embeddings correspond to five sub-tasks of the Split MNIST dataset. There is a formation of clusters of training samples belonging to the same class in the feature space. The figure is better viewed in color. 6.4.4 Findings The main findings from the three experimental setups are as follows: 1. When 𝑇 𝑆 𝑛𝑒𝑤 has a similar distribution as 𝑇 𝑅𝑜𝑙𝑑 or 𝑇 𝑅𝑛𝑒𝑤 data, our proposed approach outperforms other approaches, even the Full-Retrain method as shown in the LivDet-Iris- 2017 setup (Table 6.3). 2. When 𝑇 𝑆 𝑛𝑒𝑤 distribution is independent of 𝑇 𝑅𝑜𝑙𝑑 and 𝑇 𝑅𝑛𝑒𝑤 training data, the proposed approach outperforms other approaches, but not the Full-Retrain method as shown in the IIIT-WVU subset of LivDet-Iris-2017 setup (Table 6.3) and LivDet-Iris-2020 setup (Table 6.5). 3. In the case of disjoint training distribution between 𝑇 𝑅𝑛𝑒𝑤 and 𝑇 𝑅𝑜𝑙𝑑 , the proposed ap- proach outperforms baselines, regularization-based, and other fusion-based approaches. Its performance is lower than some replay-based methods. However, the performance improves by Mahalanobis distance as an outlier distance measure in this particular scenario. 133 4. The proposed in-domain model for dynamic weights allocation is appropriately assigning weights to their respective expert models, as exhibited by its higher performance compared to the Fusion-Equal Weights method in Tables 6.3, 6.5 and 6.6. Weight histograms in Figure 6.3 also validate the accurate allocation of the weights. 5. The proposed FE model better represents the training data as shown by comparing its per- formance with pre-trained ViT model in Tables 6.3, 6.5 and 6.6. Figure 6.6 also visually validate the finding. 6.5 Summary and Future Work We propose a dynamic weight-based fusion methodology to update the existing expert models such that it maintains the performance on old test data alongside improves the performance on new test data. The method asserts a new expert model on new training data and makes the final decision by the weighted sum of the prediction scores from all expert models. Evaluation of the proposed approach in three setups depicting two levels of dataset shift validates its effectiveness. In this work, by dataset shift, we mean shift in input distribution. As the method does not manipulate the existing expert models, it motivates the reuse of existing expert models without any manipulation in the training process or the architecture of the expert models. It also requires less memory as compared to the replay-based methods. However, there is an increase in the inference time as a probe image input to another model for estimating weights. Regarding scalability, there involves the addition of two models with every new incoming task, and hence the number of models linearly increases with an increase in tasks. 
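Because every prediction is a weighted sum over all expert models, the per-probe cost also grows with the number of experts; the decision rule itself, however, remains simple, as the following illustrative NumPy sketch shows. The arrays `expert_scores` and `indomain_scores` are assumed inputs (the latter standing in for the output of the in-domain model, for example negated LOF or Mahalanobis distances over FE features), and the softmax normalization is only one simple way to convert in-domain scores into weights; the exact normalization in the proposed method may differ.

import numpy as np

def fused_prediction(expert_scores, indomain_scores):
    # expert_scores:   shape (K,), prediction score of each of the K expert models
    #                  for the probe sample.
    # indomain_scores: shape (K,), how well the probe fits each expert's training
    #                  distribution (higher = more in-domain).
    expert_scores = np.asarray(expert_scores, dtype=float)
    indomain_scores = np.asarray(indomain_scores, dtype=float)
    # Softmax turns the in-domain scores into dynamic, non-negative weights that sum to one.
    w = np.exp(indomain_scores - indomain_scores.max())
    w /= w.sum()
    return float(np.dot(w, expert_scores))

# Example: three experts; the probe lies closest to the second expert's training data.
print(fused_prediction([0.20, 0.90, 0.40], [-5.0, -0.5, -3.0]))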
The number of models could reduce by applying pre-condition (performance difference or data distribution difference) before building additional models. In future work, we will define the pre-condition for improving the scalability of the proposed method. 134 CHAPTER 7 IRIS MORPHING ATTACK: CREATION AND DETECTION Parts of this chapter appeared in the following publication: R. Sharma and A. Ross, “Image-Level Iris Morph Attack,” IEEE International Conference on Image Processing (ICIP), 2021. 7.1 Introduction In this chapter, we investigate the problem of morph attacks in the context of iris biometrics. We employ a landmark-based iris morphing scheme at the image level which generates morphed iris images. The potential of generated morphed iris images is then analyzed over the three iris recognition systems. The main contributions of the work are as follows: 1. We propose a landmark-based method to perform iris morphing at the image-level. 2. We evaluate vulnerability of three iris recognition techniques (USITv3.0 [227], VeriEye1, and CNN-Pairwise [206]) to morphed iris images using two publicly available datasets (IITD [150] and WVU multi-modal2). The attack success rate is over 90% at 0.01% false match rate. 3. We explore the similarity required between the component images to create a successful morphed iris image. 4. We provide preliminary results on the detection of image-level morphed iris images. The rest of the chapter is organized as: Section 7.2 discusses the various morphing techniques in the context of biometrics, Section 7.3 provides the methodology used to create image-level 1 https://www.neurotechnology.com/verieye.html 2 https://biic.wvu.edu/data-sets/multimodal-dataset 135 morphed iris images, Section 7.4 describes the datasets, Section 7.5 provides the experimental setup and results on both the datasets, and Section 7.6 concludes the chapter. 7.2 Related Work Morphing can be performed at the image-level or feature-level. Morphing at the image-level is relatively simple as it does not require knowledge of the internal working of a biometric system, whereas morphing at the feature-level requires knowledge of the feature extraction module. The image-level morphed samples can directly be presented to the sensors or digitally uploaded to the biometric system. Commonly used morphing techniques at the image-level are landmark-based [26, 86, 171, 241]. The landmark-based techniques first detect corresponding landmark points in the two images then warp the images based on the detected landmarks, and finally blend the warped images. Shechtman et. al [251] posed the morphing as an optimization problem to achieve bidirectional similarity of each morphed image with its neighboring frames within the morph sequences as well as the input images. Recently, morphing techniques based on generative adversarial networks have been proposed [9, 60, 314]. However, their attack success rate at this time is still lower than the landmark-based techniques. Morphing techniques have also been proposed at the feature-level, e.g., minutiae-based [85], iris-codes [225]. Detailed surveys on morphing techniques in the context of morph attacks can be found in [242, 284]. Further, frameworks for evaluating the vulnerability of biometric systems to morphed samples are presented in [95, 240]. In the case of the iris modality, Rathgeb and Busch [225] proposed morphing at the feature- level, where iris-codes are morphed using stability-based bit substitution. Erdongan et. 
al [82] proposed morphing on the normalized iris images.3 They created composite normalized iris images based on the selection of pixels from the two images, considering their intensity and phase profiles. We propose a morphing scheme for generating morphed iris images using two unnormalized iris images.

3 Here, normalization refers to the unwrapping of the iris wherein it is mapped from Cartesian coordinates to pseudo-polar coordinates, resulting in a fixed-size rectangular entity.

7.3 Algorithmic Details

We generate a synthetic iris image (morphed image) from samples of two different identities such that the morphed image matches with both of its component identities. We utilize the landmark-based method to create morphed images. There are generally three steps to such an approach [242]: correspondence, warping, and blending. In the correspondence step, a set of correlated landmark points is detected in both images. In the warping step, the two images are non-linearly deformed to make them geometrically aligned with respect to the detected landmarks. Finally, the warped images are blended by linearly combining pixel values from both images at each location using a scalar value (blending factor). The scalar value controls the degree of contribution of each source image to the morphed image.

1. Correspondence: To establish the correspondence between two iris images, we first obtain the iris segmentation parameters – iris center, iris radius, pupil center, and pupil radius. Using the segmentation parameters, we estimate equally spaced landmarks on both the inner and outer iris boundaries. The landmark points are 10 degrees apart with respect to the iris center, resulting in 72 landmarks (36 on the inner iris boundary + 36 on the outer iris boundary). We select these 72 landmarks to minimize iris feature distortion during warping. A lower number of landmarks distorts the iris pattern during warping, and a higher number increases computational complexity. We also include the four extreme corner points of an image (top left, top right, bottom left, and bottom right) in the landmark set, creating a total of 76 landmark points. The corner points are required to align the iris regions of both images.

2. Warping: Given the landmark points, we compute the Delaunay triangulation using the convex hull method [23]. We average the corresponding triangle coordinates and compute their affine transformation matrix, $T$, as follows:

$T_{3\times 3} = A_{3\times 3}\, X_{3\times 3}^{-1}$.   (7.3.1)

Here, $A$ is the averaged triangle coordinates arranged column-wise, and $X$ is one of the corresponding triangles' coordinates. Using the transformation matrices, triangles from both images are warped to the averaged triangle coordinates. We further interpolate the missing values using bilinear interpolation.

3. Blending: Finally, we blend the pixels within the warped triangles using linear blending at each location $(i, j)$ as follows:

$M(i, j) = \alpha X_w(i, j) + (1 - \alpha) Y_w(i, j)$.   (7.3.2)

Here, $M$ is the morphed triangle, $X_w$ and $Y_w$ are the two corresponding warped triangles, and $\alpha$ is the blending factor. The blending factor is set to 0.5 to get an equal contribution of identity information from both images. Figure 7.1 shows a pictorial representation of these steps.

Figure 7.1: (a) Three categories of techniques applied to detect iris presentation attacks. (b) Illustration of the iris morphing at the image-level. It consists of registration of landmark points on both the images, alignment of images, and then blending into a single image.
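A condensed sketch of the three steps above is given below. It assumes OpenCV, NumPy, and SciPy, and assumes that the iris and pupil segmentation parameters are already available in a dictionary; SciPy's Delaunay triangulation stands in for the convex-hull-based triangulation of [23], and the helper names (`boundary_landmarks`, `morph_irises`) are illustrative rather than the exact implementation.

import cv2
import numpy as np
from scipy.spatial import Delaunay

def boundary_landmarks(cx, cy, r, step_deg=10):
    # Equally spaced landmarks on a circular boundary, 10 degrees apart (36 points).
    ang = np.deg2rad(np.arange(0, 360, step_deg))
    return np.stack([cx + r * np.cos(ang), cy + r * np.sin(ang)], axis=1)

def landmarks(img_shape, seg):
    # 72 boundary landmarks (36 inner + 36 outer) plus the 4 image corners.
    h, w = img_shape[:2]
    pts = np.vstack([
        boundary_landmarks(seg["pupil_cx"], seg["pupil_cy"], seg["pupil_r"]),
        boundary_landmarks(seg["iris_cx"], seg["iris_cy"], seg["iris_r"]),
        [[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]],
    ])
    return pts.astype(np.float32)

def morph_irises(img_x, img_y, seg_x, seg_y, alpha=0.5):
    # Warp both images to the averaged landmark geometry and blend them triangle by triangle.
    pts_x, pts_y = landmarks(img_x.shape, seg_x), landmarks(img_y.shape, seg_y)
    pts_m = (pts_x + pts_y) / 2.0                  # averaged landmark positions
    triangles = Delaunay(pts_m).simplices          # triangulation on the averaged shape
    morphed = np.zeros_like(img_x, dtype=np.float32)
    for t in triangles:
        dst = pts_m[t]
        mask = np.zeros(img_x.shape[:2], dtype=np.uint8)
        cv2.fillConvexPoly(mask, np.int32(dst), 1)
        warped = []
        for img, pts in ((img_x, pts_x), (img_y, pts_y)):
            # Affine transform mapping each source triangle onto the averaged triangle.
            A = cv2.getAffineTransform(np.float32(pts[t]), np.float32(dst))
            warped.append(cv2.warpAffine(img.astype(np.float32), A,
                                         (img_x.shape[1], img_x.shape[0]),
                                         flags=cv2.INTER_LINEAR))
        blend = alpha * warped[0] + (1.0 - alpha) * warped[1]   # Eq. (7.3.2)
        morphed[mask > 0] = blend[mask > 0]
    return np.uint8(np.clip(morphed, 0, 255))

For clarity the sketch warps the full image for every triangle and then masks it, which is simpler but slower than warping only the triangle region.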
7.4 Datasets To demonstrate the vulnerability of iris recognition techniques to morph attacks, we conduct experiments on the following two publicly available iris datasets: 1. IITD Iris Dataset [150]: The IITD iris dataset consists of 2,240 iris images from 224 subjects. There are ten iris images per subject (5 left and 5 right). The images are acquired using JIRIS, JPC1000, and digital CMOS sensors. The subjects in the dataset are in the age range of 138 14-55 years. There are 176 males and 48 females in the dataset. These images have a resolution of 320 × 240 pixels. 2. WVU Multi-modal Release 1 Dataset: The WVU multi-modal dataset consists of the iris, face, fingerprint, voice, palmprint, and hand-geometry modalities. We only use the iris modality, which contains 3,099 iris images from 244 subjects. There is an average of 12 images per subject. The resolution of these images is 640 × 480 pixels. 7.5 Evaluation and Results To evaluate the impact of image-level morphed images on iris recognition, we first compute the baseline performance of three iris recognition techniques (USITv3.0 [227], VeriEye, and CNN- Pairwise [206]) on the IITD and WVU multi-modal datasets. The two datasets are used to create morphed iris images. Subsequently, we assess the susceptibility of iris recognition techniques to the generated morphed iris images. Further, we also analyze the textural similarity of component images required to create a successful morphed iris image. Finally, we provide preliminary results on the detection of morphed iris images. 7.5.1 Baseline Recognition Performance We utilize three iris recognition techniques to assess their vulnerability to morphed iris images. The first is a best performing technique within the open-source iris recognition software toolkit, University of Salzburg Iris Toolkit (USIT v3.0) [227]. It extracts iris-code using quadratic spline wavelet (QSW) [151] and uses hamming distance to measure the dissimilarity between the iris- codes. The second is a commercially available off-the-shelf technique called VeriEye. The third is a deep learning-based CNN-Pairwise [206] technique. We utilizes DenseNet121 [122] as the base architecture. The network inputs two cropped iris images and formats them as multiple channels and then outputs a similarity score between 0 (impostor) and 1 (genuine). The iris images are manually segmented for the USITv3.0 and CNN-Pairwise techniques, whereas VeriEye uses its own iris segmentation module. The segmentation failures occured in VeriEye are manually corrected. 139 Table 7.1: Performance of three iris recognition techniques in terms of TMR (%) at 0.01%, 0.1%, and 1% FMRs, on the IITD and WVU datasets. The USITv3.0 is an open-source iris recognition toolkit, VeriEye is a commercial iris recognition SDK, and CNN-Pairwise is a deep learning-based technique. IITD Dataset (TMR(%)) WVU Dataset (TMR(%)) Algorithms FMR FMR FMR FMR FMR FMR 0.01% 0.1% 1% 0.01% 0.1% 1% USITv3.0 99.33 99.55 99.72 94.73 96.40 97.62 VeriEye 99.77 99.77 99.77 98.54 98.78 99.02 CNN-Pairwise 98.16 98.72 99.38 85.70 90.54 93.69 The recognition performance of these three techniques is evaluated on both datasets. As the CNN-Pairwise method is a deep learning-based, training is performed using 60% of the subjects, and the rest is used for testing (subject-disjoint strategy). Table 7.1 provides the True Match Rate (TMR) at 0.01%, 0.1%, and 1% False Match Rate (FMR). The VeriEye algorithm performs the best followed by the USITv3.0 algorithm. 
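For reference, the TMR-at-fixed-FMR values reported in Table 7.1 can be computed from raw genuine and impostor score lists along the following lines. This is a minimal NumPy sketch that assumes similarity scores (larger values indicate a better match); for distance-based matchers such as the Hamming distance used by USITv3.0, the comparisons are reversed.

import numpy as np

def tmr_at_fmr(genuine_scores, impostor_scores, fmr=0.001):
    # True Match Rate at a fixed False Match Rate (fmr=0.0001 corresponds to 0.01% FMR).
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    # Threshold chosen so that roughly an `fmr` fraction of impostor scores exceed it.
    threshold = np.quantile(impostor, 1.0 - fmr)
    return float(np.mean(genuine > threshold)), float(threshold)

# Example with synthetic score distributions.
rng = np.random.default_rng(0)
gen = rng.normal(0.8, 0.1, 1_000)      # genuine comparisons
imp = rng.normal(0.3, 0.1, 100_000)    # impostor comparisons
print(tmr_at_fmr(gen, imp, fmr=0.0001))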
CNN-Pairwise shows relatively lower performance (presumably) due to insufficient training data, whereas the other two techniques do not require training. 7.5.2 Morph Attack Setup and Results We utilize both the datasets to create image-level morphed iris images. In the IITD dataset, there are 224 left eye classes and 224 right eye classes. We randomly select one image per class for generating the morphs, which should result in 49,952 (224𝐶2 +224 𝐶2 ) morphed images. However, landmarks could not be detected in some images with partial irides, so a total of 49,816 morphs were created. In the WVU dataset, there are 237 left eye classes and 233 right eye classes, which should result in a total of 54,994 (237𝐶2 +233 𝐶2 ) morphs. However, due to landmark detection problems in partial irides, not all pairs could be considered, resulting in a total of 50,573 morphs. Figure 7.2 presents few samples of morphed images generated from both datasets along with their component images. We input the morphed iris images to three iris recognition techniques (morph attack) and measure their vulnerability in terms of Mated Morph Presentation Match Rate (MMPMR) [240]. MMPMR is the ratio of successful morph attacks to total morph attacks. The 140 Table 7.2: Vulnerability assessment of three iris recognition techniques to iris morph attacks in terms of MMPMR (%) at different thresholds corresponding to 0.01%, 0.1%, and 1% FMRs on the IITD and WVU datasets. IITD (MMPMR(%)) WVU (MMPMR(%)) Algorithms FMR FMR FMR FMR FMR FMR 0.01% 0.1% 1% 0.01% 0.1% 1% USITv3.0 93.8 95.64 96.91 85.96 93.82 97.07 VeriEye 95.77 97.07 97.85 90.22 94.48 97.02 CNN-Pairwise 17.64 24.76 25.32 47.70 54.23 56.76 morph attack succeeds when the morphed image matches with all of its component subjects at a specified threshold. Table 7.2 provides the performance of morph attacks in terms of MMPMR at different thresholds corresponding to 0.01%, 0.1%, and 1% FMRs. The VeriEye and USITv3.0 techniques are more susceptible to morph attacks (> 90% MMPMR). CNN-Pairwise shows a relatively lower morph attack success rate as it has been trained on a relatively small amount of data, and a slight perturbation in the images (due to morphing) results in non-matches. To evaluate further, we plot a histogram of the genuine, impostor, and morph match scores on both the datasets (top row of Figure 7.3). The distribution of morph match scores leans towards the genuine distribution, and a majority of the morph match scores are labeled as genuine when considering the threshold at 0.01% FMR. We also visualize the morph match scores using scatter plots (bottom row of Figure 7.3) with component identities along X and Y-axes. Most of the match scores are in a quadrant that corresponds to successful matches with both component subjects (top right corner). This shows how well the morphed images match with their component identities and substantiates the vulnerability of iris recognition techniques to morph attacks. 7.5.3 Analysis of Textural Similarity Next, we analyze the similarity between the component iris images used to create morphed iris images. To calculate the similarity, we utilize the Root Mean Square Error (RMSE) and Structural Similarity Index Measure (SSIM) [290] measures. RMSE calculates the pixel-wise difference between the two images (higher the value, lower the similarity), while SSIM estimates the structural 141 Figure 7.2: Samples of morphed images generated from the IITD and WVU datasets. 
similarity between the two images (higher the value, higher the similarity). Figure 7.4 presents the distribution of similarity scores as calculated by RMSE and SSIM on both the datasets. Distribution in green corresponds to successful morphs, whereas distribution in red corresponds to unsuccessful morphs. Match scores and threshold are according to the USITv3.0 iris recognition technique. The mean SSIM scores corresponding to successful and unsuccessful morphs are 0.49 and 0.45 (the mean difference is 0.029), respectively, on the IITD dataset. The mean difference increases to 0.032 when using RSME scores. Though distributions of successful and unsuccessful morphs significantly overlap, we can still conclude that there is a high chance of generating a successful morph if SSIM between the component images is more than 0.31 (or less than 0.20 in the case of RSME). A similar conclusion can be made on the WVU dataset that there is a high chance of obtaining a successful morph when SSIM between the component images is more than 0.49 (or less than 0.22 in the case of RSME). 142 IITD dataset WVU dataset Figure 7.3: Top: Match score distribution of genuine (green), imposter (red), and morph attacks (blue) on the IITD and WVU datasets using the USITv3.0 iris recognition technique. Bottom: Scatter plots of match scores, where morphed images match with their component identities. The dotted line represents the threshold at 0.01% FMR. 7.5.4 Morph Attack Detection The next natural step is to address morph attacks by detecting morphed images prior to inputting them into an iris recognition system. We present preliminary results on the detection of morphed iris images. Firstly, we perform detection using a pre-trained presentation attack detector [249]4 at a pre-defined threshold (corresponds to 0.2% False Detection Rate (FDR)). It results in 9.06% True Detection Rate (TDR) on morphed images from the IITD dataset and 16.82% on morphed images from the WVU dataset. Next, we fine-tune the detector with morphed iris images (60% from each dataset used for training and the rest for testing). It attains 86.83% TDR on the IITD 4 An iris presentation attack detector trained to detect artifacts such as printed iris, cosmetic contacts, and artificial eyes. 143 IITD Dataset WVU Dataset RMSE Similarity SSIM Similarity Figure 7.4: Distributions of similarity scores between the component images corresponding to successful (green) and unsuccessful (red) morphs using the RMSE (higher the value, lower the similarity) and SSIM (higher the value, higher the similarity) measures on the IITD and WVU datasets. morphed images and 99.55% TDR on the WVU morphed images at 0.2% FDR. In the proposed morphing technique, we did not perform post-processing on the morphed images due to which some artifacts are present outside the iris region. We hypothesize that these artifacts aid in the detection of morphed iris images. 7.6 Summary We successfully generate image-level morphed iris images, which can be used to denote two identities thereby posing a security concern. The morphed images show a high morph attack success rate (> 90%) on three high-performing iris recognition methods (USITv3.0, VeriEye, and 144 CNN-Pairwise) when assessed on two datasets (IITD and WVU multi-modal). We also explore the textural similarity required between the component samples to create a successful morphed image. Finally, we present preliminary results on the detection of morphed iris images. 
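As a reference for the vulnerability numbers reported in Section 7.5.2, the MMPMR metric of [240] used in Table 7.2 reduces to the following computation. This is a minimal NumPy sketch: `morph_vs_subject1` and `morph_vs_subject2` are illustrative arrays holding each morphed image's match score against its two component identities, and `threshold` is the verification threshold of the matcher at the chosen FMR; similarity scores are assumed.

import numpy as np

def mmpmr(morph_vs_subject1, morph_vs_subject2, threshold):
    # Mated Morph Presentation Match Rate: fraction of morphed images whose match
    # score exceeds the verification threshold for *all* component identities
    # (here, both of them).
    s1 = np.asarray(morph_vs_subject1, dtype=float)
    s2 = np.asarray(morph_vs_subject2, dtype=float)
    successful = (s1 > threshold) & (s2 > threshold)
    return float(successful.mean())

# Example: threshold taken at the matcher's 0.01% FMR operating point.
print(mmpmr([0.82, 0.41, 0.77], [0.79, 0.90, 0.74], threshold=0.7))  # -> 2/3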
145 CHAPTER 8 MATCHING IRIS IMAGES WITH FACE IMAGES 8.1 Introduction Biometric recognition involves matching two biometric samples primarily from the same modal- ity, such as the face, iris, fingerprint, or voice to identify an individual. However, cross-modal recognition implies matching of two biometric samples from different modalities. These two bio- metric samples are refered as an enrolled and a probe sample, where the enrolled sample exists in a corresponding legacy database. Cross-modal recognition helps in the case of unavailability of legacy datasets or where we need to analyze the relationship between two modalities. It also helps in boosting the recognition confidence even if the legacy dataset is available. Various efforts made in this direction are [129, 163, 168, 185, 234]. In this work, we focus on matching iris images captured in the near-infrared spectrum against face images captured in the visible (Figure 8.1). It also helps in recognizing humans when face recognition is not reliable, such as the presence of occlusions on the face (face mask). Two main challenges arise from matching a face image to an iris image: (i) a large domain gap and (ii) imbalanced training data. The domain gap results from the following factors: 1. Cross-Modality: There occurs matching of iris modality images to face modality images. 2. Cross-Sensor: Different sensors are used to capture the face and iris images. Sensors add various noises to the images, for instance, fixed pattern noise, pixel response non-uniformity (PRNU), random noise, etc. 3. Cross-Spectrum: Generally, face image captures in the visible spectrum (VIS), whereas iris image captures in the near-infrared (NIR) spectrum. When considering the iris region only, NIR illumination (700-900nm) captures the stromal features (fibrovascular layer) of the 146 iris, whereas VIS illumination (400-700nm) captures melanin pigment and a meshwork of ligament features. 4. Cross-Resolution: Iris or ocular regions cropped from face images are of very low resolution as compared to iris images. For example, in the BioCop-2008 dataset, the ocular or iris regions cropped from face images are of 0.06 or 0.006 megapixels, whereas it is of 0.3 or 0.03 megapixels on iris images. Figure 8.1: The objective is to match a visible spectrum face image with the NIR spectrum iris image, or vice versa. Previous literature focuses on one or two of these factors. To the best of our knowledge, only one paper [129] dealt with all four challenges. The authors propose various handcrafted features (Local Binary Patterns (LBP), Normalized Gradient Correlation (NGC), and Joint Dictionary- based Sparse Representation (JDSR)) and the score-level fusion of those features. They achieve a 23% Equal Error Rate (EER), which shows the difficulty of the task. Generally, techniques used to reduce the domain gap categorize as feature-level or image-level. At the feature-level, the focus is on the extraction of discriminative features invariant to the factors (cross-spectrum, cross-sensor, or cross-resolution). For cross-spectral iris recognition, Abdullah et al. [10] propose three descriptors: Gabor-difference of Gaussian (G-DoG), Gabor-binarized statistical image feature (G-BSIF), and Gabor-multi-scale Weberface (G-MSW) as well as a fusion of these features at the decision-level. Oktiana et al. [192] propose phase-based features utilizing phase-only correlation 147 (POC) and band-limited phase-only correlation (BLPOC). 
Wang and Kumar [287] investigate a range of deep learning architectures: CNN with softmax cross-entropy loss, Siamese network, and triplet network. Regarding cross-spectral ocular recognition, Sharma et al. [248] propose combined neural network architecture, which first trains two neural networks separately on each spectrum and then jointly learns the cross-spectral variability using cross-spectral training data. Raja et al. [215] utilize Binarized Statistical Image Features (BSIF) and perform matching using Chi-Square distance. Later, Raja et al. [216] propose another method based on steerable pyramid features and a multi-class SVM classifier. At the image-level, the domain gap is reduced by transforming one domain image into another. Burge and Monaco [43] approximate a NIR iris image using features derived from the color and structure of the VIS iris image. Zuo et al. [319] generate a NIR iris image from the VIS iris image using a feed-forward neural network. Ramaiah and Kumar [222, 223] synthesize visible texture from the NIR image using Markov Random Fields (MRF) for both iris and ocular images. More recently, Hernandez-Diaz et al. [110] propose Conditional Generative Adversarial Networks (cGAN) for synthesizing a NIR ocular image from the VIS ocular image or vice versa. Another major challenge in cross-modal recognition is imbalanced train data, where the number of pairs from different individuals (impostor pairs) is very high compared to pairs from the same individual (genuine pairs). As far as we know, no work on cross-modal or spectrum iris recognition focuses on the challenge in this literature. In this work, we focus on both these challenges and propose three deep learning approaches. The first is at the feature-level, where the aim is to extract common features from both the images and the method is called Multi-channel CNN. It is a convolution neural network (CNN) that inputs face and iris images as different channels and extract common features together. The second strategy is at the training-level, where we generate synthetic training samples to increase the training data for learning as the number of genuine pairs in the cross-modal setting is insufficient for the training of deep architecture. We use Dual Variational Generation (DVG) framework [88] to synthesize the genuine pairs for training. The third strategy is at the image-level, where we transform one modality image 148 into another modality using the Generative Adversarial Network (GAN) framework. We use various GAN architectures for image-to-image translation, such as BicycleGAN [318], ESRGAN [289], Pix2Pix GAN [125], and StarGANv2 [53]. Here, we present the results of Pix2Pix GAN [125], BicycleGAN [318] and StarGANv2 [53] as these GAN frameworks perform the best. The main contributions of the work are as follows: 1. We propose deep learning approaches at three different levels (feature-level, image-level, and training-level) to address the domain gap and imbalanced training data challenges for cross-modal recognition. 2. We evaluate the performance of the proposed approaches on four cross-modal datasets: BioCop-2008, BioCop-2009, cross-spectrum PolyU, and WVU datasets. Section 8.2 explains the architectural details of the proposed approaches. Section 8.3 describes datasets used in this work. Section 8.4 describes the experimental setup and results on both the datasets. Section 8.5 reports the impact of eye color on the performance of cross-model matching. Section 8.6 summarize the chapter. 
8.2 Proposed Approaches and Rationale In this section, we explain the two main challenges of cross-modal matching (domain gap and imbalanced training data) and the proposed approaches to address them. To understand the domain gap, we conduct an initial analysis using histograms of ocular images under the VIS intra-modal, NIR intra-modal, and cross-modal scenarios. The analysis is on the randomly selected subset (5,000 genuine pairs and 5,000 impostor pairs) of ocular images from the BioCop-2008 dataset. We crop ocular regions from face images to form VIS ocular images, which are of low resolution. We generate histograms using similarity scores among genuine and impostor pairs, where similarity scores are computed using the Structural Similarity (SSIM) index [290]. Figure 8.2 shows the histograms of all three scenarios. Below each histogram statistics are provided in terms of genuine distribution mean, impostor distribution mean, d-prime (distance between two 149 Figure 8.2: Histograms of similarity scores obtained from ocular images under (a) intra-modal VIS, (b) intra-modal NIR, and (c) cross-modal scenario. Similarity scores are estimated using the Structural Similarity (SSIM) index on ocular images of the BioCop-2008 dataset. The statistics of the histograms are given below each figure. There are two observations: first, the similarity between genuine pairs (Genuine Mean) reduces in the cross-modal scenario as compared to the intra-modal scenario; second, the overlapping area between two distributions increases dramatically for the cross-modal. For accurate matching, the overlapping area should be as minimum as possible. distributions) [170], and overlap area of distributions. There are two noteworthy observations: first, the similarity between genuine pairs (Genuine Mean) reduces under cross-modal scenario; second, the overlapping area increases dramatically for cross-modal (genuine and impostor distributions are almost overlapping). Due to the large domain gap, the similarity of genuine pairs overlaps the imposter pairs to a significant extent. To reduce the domain gap, we propose two approaches one at the feature-level and the other at the image-level. The feature-level solution (Multi-channel CNN) jointly learns discriminative features from a pair of cross-modal images and outputs similarity score. For image-level solution, we translate one domain image into another using various GAN architectures (Pix2Pix GAN [125], BicycleGAN [318] and StarGANv2 [53]) before matching. The second major challenge in cross-modal matching is imbalanced training data or insufficient genuine pairs training data. For example, if we consider genuine and impostor pairs of BioCop- 2008 and PolyU datasets, BioCop-2008 contains 3,534 genuine pairs and 2,287,398 impostor pairs, whereas the PolyU dataset contains 6,287 genuine pairs and 10,630,762 impostor pairs. The genuine-impostor pairs ratio for BioCop-2008 and PolyU datasets are 1:645 and 1:1690, respectively. This is the case for any biometric dataset in verification mode. A large number of datasets are available for intra-modal matching, whereas only a few datasets exist for cross-modal 150 matching. To address the imbalanced data, we synthesize a large number of genuine pairs using the DVG framework [88] to augment the training data. 
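The histogram statistics reported below each panel of Figure 8.2 (genuine mean, impostor mean, d-prime, and overlap area) can be reproduced from raw SSIM scores along the following lines. This is a minimal sketch assuming scikit-image for SSIM on same-sized 8-bit grayscale ocular images, the standard d-prime definition with pooled variance, and a simple histogram-intersection estimate of the overlap area; the exact computation in [170] and in our analysis may differ in detail.

import numpy as np
from skimage.metrics import structural_similarity

def pair_similarity(img_a, img_b):
    # SSIM between two (grayscale, uint8) ocular images of the same size.
    return structural_similarity(img_a, img_b)

def histogram_statistics(genuine, impostor, bins=50):
    # Genuine/impostor means, d-prime, and overlap area of the two score distributions.
    genuine, impostor = np.asarray(genuine, float), np.asarray(impostor, float)
    mu_g, mu_i = genuine.mean(), impostor.mean()
    # d-prime: separation of the two distributions in pooled standard-deviation units.
    d_prime = abs(mu_g - mu_i) / np.sqrt(0.5 * (genuine.var() + impostor.var()))
    # Overlap area estimated by histogram intersection over a common score range.
    lo, hi = min(genuine.min(), impostor.min()), max(genuine.max(), impostor.max())
    h_g, edges = np.histogram(genuine, bins=bins, range=(lo, hi), density=True)
    h_i, _ = np.histogram(impostor, bins=bins, range=(lo, hi), density=True)
    overlap = float(np.sum(np.minimum(h_g, h_i) * np.diff(edges)))
    return {"genuine_mean": mu_g, "impostor_mean": mu_i,
            "d_prime": d_prime, "overlap_area": overlap}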
A description of all three approaches, feature-level, image-level, and training-level are as follows: 8.2.1 Feature-level Approach: Multi-channel CNN (MT-CNN) In the first method, we attempt to reduce the domain gap with a Multi-channel CNN (MT-CNN). Generally, deep networks input three input channels, namely the Red (R), Green (G), and Blue (B) channels. In MT-CNN (Figure 8.3), we use six input channels, where three correspond to the VIS image (R, G, and B channels) and the other three to the NIR image (one NIR channel repeated thrice). The motivation comes from Aguilera et al. [11] work, which utilizes a two-channel architecture for similarity measurement from a pair of natural images and outperforms Siamese and Pseudo-Siamese networks. We also applied two channels (one for VIS and the other for NIR image), four channels (three for VIS and one for NIR image), and the Siamese network. We report the best network (six channels) in the result section. The six-channel network does not compress the information as in the case of a two-channel CNN, where R, G, and B channels compress into one gray channel. The six-channel network also has an advantage over the four-channel network as it equally weights both VIS and NIR images. In contrast to the Siamese network, where weights are shared in the later layers, the MT-CNN jointly processes information from both images at the first layer of the network [311]. The backbone architectures used in the MT-CNN is DenseNet201 [122]. The base networks is pre-trained on the ImageNet dataset [72], and then fine-tuned on iris training data described in Section 3. The output is a similarity score between 0 and 1, where ‘0’ implies an impostor pair and ‘1’ implies a genuine pair. Training performed using Stochastic Gradient Descent with a learning rate of 0.005, weight decay of 10−6 , and a momentum of 0.9. The batch size is 20 and the number of epochs is 50. 151 Figure 8.3: The architecture of Multi-channel CNN (MT-CNN). The base architecture used in the MT-CNN is DenseNet201 [122]. It estimates a similarity score between the images of the two domains. 8.2.2 Image-level Approach This method aims to transform one modality image (cropped VIS iris image) into another modality (NIR iris image) image. For image-to-image translation, we utilize three GAN architectures: Pix2Pix GAN [125], BicycleGAN [318] and StarGANv2 [53]. We translate a low-resolution (301 x 201) visible spectrum (VIS) iris image to a high-resolution (640 x 480) near-infrared spectrum (NIR) iris image. We are performing VIS to NIR image translation instead of NIR to VIS as: (i) NIR discerns iris patterns more effectively compared to VIS image, and translation from NIR to VIS loses relevant information, and (ii) compression of three channels of VIS image into one channel of NIR image is easier than an expansion of one channel of NIR image into three channels of VIS image. After image translation, we calculate similarity score of the GAN-generated NIR image with the real NIR image using Multi-channel CNN. We utilize the same losses for BicycleGAN [318] and StarGANv2 [53] as specified in the original work, whereas for Pix2Pix GAN we include additional identification loss. Therefore, we only describe the Pix2Pix GAN below. 8.2.2.1 Pix2Pix GAN with Identification Loss (Pix2Pix GAN ID) We attempt to reduce the domain gap using a deep generative model, Pix2Pix GAN [125]. It is a conditional Generative Adversarial Network (cGAN) designed for image-to-image translation. 
We introduce an identification loss into its objective function and term it Pix2Pix GAN ID. Our objective is to synthesize a NIR image of the same identity as the VIS image and match it against the real NIR image. Figure 8.4 represents the overall testing setup of cross-modal matching utilizing Pix2Pix GAN ID and MT-CNN.

Figure 8.4: The overall testing scenario of Pix2Pix GAN ID and MT-CNN for cross-modal matching. The Pix2Pix GAN ID's generator synthesizes a NIR image from the VIS image. The MT-CNN then generates a similarity score from a pair of synthesized NIR and real NIR images.

There are two components of Pix2Pix GAN ID: a generator and a discriminator. The generator aims to generate a realistic image with the constraint that the image should retain iris biometric information, whereas the discriminator distinguishes between real and synthesized NIR images. The base architecture used for the generator is U-Net256 [232] with skip connections. The discriminator used is the PatchGAN [164] classifier. The training of the generator and discriminator is performed using the following loss terms:

1. Adversarial Loss: It is the classical adversarial loss, where the discriminator and generator compete with each other until reaching an equilibrium. The adversarial loss is defined as follows:

$L_{GAN}(G, D) = E_{x,y}[\log D(x, y)] + E_{x}[\log(1 - D(x, G(x)))]$   (8.2.1)

where $G$ is the generator function, $D$ is the discriminator function, $x$ is the input VIS image, and $y$ is the target NIR image. The generator aims to minimize the objective, whereas the discriminator tries to maximize it. The generator does not directly affect the $\log D(x, y)$ term, so for the generator, minimizing the loss is equivalent to minimizing $E_{x}[\log(1 - D(x, G(x)))]$.

2. Per-pixel Loss: The per-pixel loss computes the $\ell_1$ distance between two images at the pixel level and reduces the mapping space from the VIS spectrum to the NIR spectrum. The loss formulation is as follows:

$L_{per\text{-}pixel}(G) = E_{x,y}\, \| G(x) - y \|_1$   (8.2.2)

where $\| \cdot \|_1$ is the $\ell_1$ norm between the synthetic ($G(x)$) and real ($y$) images.

3. Perceptual Loss: It is the $\ell_1$ distance between deep features extracted from the synthetic and real images. The features are extracted at multiple layers of the VGG19 network and concatenated to form a single feature descriptor. The VGG19 network is pre-trained on the ImageNet dataset [73]. The formulation is as follows:

$L_{perceptual}(G) = E_{x,y}\, \| \phi_P(G(x)) - \phi_P(y) \|_1$   (8.2.3)

where $\phi_P(G(x))$ and $\phi_P(y)$ are the VGG features extracted from the synthetic NIR image and the real NIR image, respectively. The perceptual loss [132] helps the generator to minimize the high-level semantic difference between the images. It ensures the smoothness and visual similarity of the generated image with the real NIR image.

4. Identity Loss: It is a cross-entropy loss estimated using the MT-CNN, where the input is a pair of real and synthetic NIR images, and the output is a similarity score (1 for genuine pairs and 0 for impostor pairs). Its formulation is as follows:

$L_{identity}(G) = -\left( t \log(M(G(x), y)) + (1 - t) \log(1 - M(G(x), y)) \right)$   (8.2.4)

where $M(G(x), y)$ is the similarity score output by the MT-CNN when the synthetic NIR image $G(x)$ and the real NIR image $y$ are given as input, and $t$ is the ground-truth label specifying whether the input pair is genuine or impostor. As we use only genuine pairs for training Pix2Pix GAN ID, the identity loss reduces to $L_{identity}(G) = -\log(M(G(x), y))$.
The overall loss function is as follows:

$G^{*} = \arg\min_{G} \left( L_{GAN} + L_{per\text{-}pixel} + L_{perceptual} + L_{identity} \right), \quad D^{*} = \arg\max_{D} L_{GAN}$.   (8.2.5)

8.2.3 Training-level: Dual Variational Generation

Another challenge of cross-modal matching is insufficient genuine training data, which also raises the imbalance issue between genuine and impostor pairs. To address this challenge, we utilize another deep generative framework: Dual Variational Generation (DVG) [88]. The DVG network is an unconditional variational autoencoder (VAE) that generates NIR and VIS paired images of the same identity from noise sampled from a standard normal distribution. The MT-CNN training is supplemented with the synthetic samples (i.e., the network is trained using both real VIS-NIR genuine pairs and synthetic VIS-NIR genuine pairs from the DVG). Figure 8.5 shows the overall architecture of cross-modal recognition training utilizing the DVG-based method and MT-CNN. Regarding the identity constraint, image-to-image translation focuses on identity preservation, where the identity of a synthesized image is the same as that of the input image. On the other hand, the DVG-based method focuses on the identity consistency of the generated pairs, which is anonymous. Generally, for image-to-image translation, only a few samples are available for learning identity preservation, whereas the DVG-based model utilizes the entire set of training genuine pairs for identity consistency.

The architecture of DVG consists of two encoders, corresponding to the NIR and VIS input images, and a decoder. Figure 8.6 represents the training architecture of the DVG model. Encoder $E_N$ is responsible for mapping the input NIR image $x_N$ to the latent space $z_N$, whereas encoder $E_V$ is responsible for mapping the input VIS image $x_V$ to the latent space $z_V$. The latent representations of the NIR and VIS images are then concatenated and fed into the decoder, which reconstructs the NIR and VIS images.

Figure 8.5: Training architecture of the DVG-based model. The figure is adapted from [88]. It consists of two encoders that correspond to NIR and VIS input images and a decoder. The encoder transforms the input image space into the latent space. The decoder utilizes the latent space of the NIR and VIS images and reconstructs them back into the image space.

Figure 8.6: Training procedure of the DVG-based method.

Training of the DVG-based model is performed using the following loss terms:

1. KL Divergence Loss: The constraint is applied over the encoders $E_N$ and $E_V$, which output the posterior distributions $q_{\phi_N}(z_N \mid x_N)$ and $q_{\phi_V}(z_V \mid x_V)$, using the Kullback-Leibler divergence:

$Loss_{KL} = D_{KL}\left( q_{\phi_N}(z_N \mid x_N) \,\|\, p(z_N) \right) + D_{KL}\left( q_{\phi_V}(z_V \mid x_V) \,\|\, p(z_V) \right)$   (8.2.6)

where $x_{*}$ is the NIR or VIS input, $z_{*}$ is the NIR or VIS latent output, $p(z_{*})$ is the NIR or VIS prior distribution, and $q_{\phi_{*}}(z_{*} \mid x_{*})$ is the posterior distribution output by the NIR or VIS encoder. We assume a multivariate standard normal distribution for the prior distributions.

2. Reconstruction Loss: This constraint is applied over the decoder, which reconstructs the input images $x_N$ and $x_V$. The formulation is as follows:

$Loss_{rec} = -E_{q_{\phi_N}(z_N \mid x_N) \,\cup\, q_{\phi_V}(z_V \mid x_V)}\left[ \log p_{\theta}(x_N, x_V \mid z_I) \right]$   (8.2.7)

where $z_I$ is the concatenation of the $z_N$ and $z_V$ latent vectors, and $p_{\theta}(x_N, x_V \mid z_I)$ is the joint distribution. The reconstruction loss is calculated by the root mean square of the input and reconstructed images.

3. Distribution Alignment Loss: We aim to project both NIR and VIS images to a common latent space.
To achieve this, we minimize the Wasserstein distance [107] between the posterior distributions of the NIR and VIS images. The loss formulation is as follows:

$Loss_{dist} = \frac{1}{2}\left( \| \mu_N - \mu_V \|_2^2 + \| \sigma_N - \sigma_V \|_2^2 \right)$   (8.2.8)

where $\mu_N$ and $\sigma_N$ are the mean and standard deviation output by the NIR encoder $E_N$, and $\mu_V$ and $\sigma_V$ are the mean and standard deviation output by the VIS encoder $E_V$.

4. Identity Loss: Another constraint is to preserve the biometric content within the domain and ensure that the identity is consistent across the domains. We use the MT-CNN for both identity preservation and consistency.

a) NIR-VIS pair identity loss: A cross-entropy loss is estimated using the MT-CNN to keep the same identity across the reconstructed NIR and VIS images. Its formulation is as follows:

$Loss_{id\text{-}pair} = -\log\left( M_{NV}(\hat{x}_N, \hat{x}_V) \right)$   (8.2.9)

where $\hat{x}_N$ is the DVG-reconstructed NIR image, $\hat{x}_V$ is the DVG-reconstructed VIS image, and $M_{NV}(\hat{x}_N, \hat{x}_V)$ is the MT-CNN with the NIR and VIS images as input.

b) NIR-NIR and VIS-VIS identity loss: It is a cross-entropy loss to preserve the identity within the domain. Reconstructed NIR or VIS images should be of the same identity as the original NIR or VIS images. Its formulation is as follows:

$Loss_{id\text{-}rec} = -\log\left( M_N(\hat{x}_N, x_N) \right) - \log\left( M_V(\hat{x}_V, x_V) \right)$   (8.2.10)

where $x_N$ is the NIR input image, $x_V$ is the VIS input image, $M_N(\ast, \ast)$ is the MT-CNN for NIR images, and $M_V(\ast, \ast)$ is the MT-CNN for VIS images. The original framework [88] utilizes an $\ell_2$ norm over the features extracted from Light-CNN to preserve the identity.

The overall objective for the DVG-based model is as follows:

$Loss_{Overall} = Loss_{rec} + \gamma_{KL}\, Loss_{KL} + \gamma_{dist}\, Loss_{dist} + \gamma_{id\text{-}pair}\, Loss_{id\text{-}pair} + \gamma_{id\text{-}rec}\, Loss_{id\text{-}rec}$   (8.2.11)

where the $\gamma_{*}$ are empirically set to 0.1.

During the testing of the DVG-based method (Figure 8.7), we discard both encoders and only utilize the decoder. Noise is sampled from the standard normal distribution, concatenated with itself, and fed into the decoder, which generates NIR and visible images of the same identity. The decoder generates genuine pairs that can be included in the training process of the Multi-channel CNN.

Figure 8.7: The testing procedure of the DVG-based method. Noise is an input to the decoder $D_I$, which generates a synthesized genuine pair.

8.3 Dataset Description

8.3.1 BioCop-2008 Dataset

The first dataset we use for our experiments is the FBI Biometric Collection of People (BioCoP-2008) dataset. The BioCoP-2008 is an extension of the dataset mentioned in [129]. It is a multi-modal biometric dataset consisting of face and ocular images collected in two sessions (SET1 and SET2). The subjects are the same in both sessions. The face images are acquired in visible illumination by an Olympus C8080 camera from 1,135 subjects, whereas the ocular images are acquired in NIR illumination by an Oki IrisPass M iris sensor from 1,097 subjects. The images are not simultaneously captured, so the images are not aligned. There are a total of 3,608 iris images and 2,270 frontal face images. The dataset also consists of face images with 45 and 90-degree pose angles, but we utilize only the frontal images. The original size of the VIS face images is 3264 × 2448. We cropped the left and right ocular regions from the face images, which results in ocular images of size 301 × 201. We further cropped the left and right iris regions from the left and right ocular images, respectively. The size of the cropped iris images is 81 × 81. The original size of the NIR ocular images is 640 × 480.
We cropped the left and right iris regions from the left and right NIR ocular images, respectively. The size of the cropped NIR iris images is 180 × 190. Figure 8.8 shows the original VIS face image, cropped VIS ocular image, cropped VIS iris image, and their corresponding NIR images. 8.3.2 BioCop-2009 Dataset The second dataset we use for our experiments is the FBI Biometric Collection of People (BioCoP- 2009) dataset. It is dataset is also a multi-modal biometric dataset consisting of images of a face and iris modalities. Subjects are the same in both the modalities collection. The face images are collected in two sessions from 1,100 subjects using Canon EOS 5D Mark II camera in the visible spectrum. Images are provided with different angles (-90, -45, 0, 45, and 90) and scales (raw, SAP50, and SAP51). We utilize only the frontal face (angle of 0) with SAP51 scale face images, 159 Figure 8.8: (a) A sample face image from the BioCop-2008 dataset. The face image is in the VIS spectrum. (b) Cropped left and right VIS ocular images from the face image. (c) Cropped left and right iris images from the left and right ocular images, respectively. The size of ocular and iris VIS images are 301 × 201 and 81 × 81, respectively. (d) Left and right NIR ocular images from the BioCop-2008 dataset. (e) Cropped left and right iris images from the left and right NIR ocular images, respectively. The size of ocular and iris NIR images are 640 × 480 and 180 × 190, respectively. so it results in a total of 2,199 face images. The resolution of face images is 2400 x 3200. We manually crop left and right ocular images from the face images. The size of the ocular images is 402 x 301. We further crop left and right iris regions from the left and right ocular images, respectively. The size of iris images varies according to the iris region. The data collected from the iris modality is acquired from 1,098 subjects in the NIR spectrum using three sensors: Aoptix Insight, CrossMatch I SCAN 2, and LG ICAM 4000. There are five sessions for each sensor. In each session, there are 2 images (one left and one right) from Aoptix Insight and CrossMatch I SCAN 2 sensors and 4 images from LG ICAM 4000 sensor. The total number of images from the Aoptix Insight sensor is 11,000, from CrossMatch I SCAN 2 is 10,910, and from LG ICAM 4000 is 21,980. The image size of NIR ocular images is 640 x 480. We further crop left and right iris regions from left and right NIR ocular images, respectively. The size of the cropped iris images varies according to the iris region. 160 8.3.3 PolyU Dataset The third dataset we use for our experiments is the publicly available PolyU dataset [223]. It is a bi-spectral dataset used to analyze the cross-spectral iris recognition algorithms. The sensor used for the data collection is an in-house imaging setup that acquires NIR and VIS iris images simul- taneously in a single shot. The two spectral images collected have pixel-to-pixel correspondences. The dataset consists of images from two sessions. In the first session, there are images of 209 subjects. Approximate 15 images are there for each eye (left and right). In the second session, there are images of 11 subjects. The subjects are the same in both sessions. The total number of images from both sessions are 12,574. The resolution of both spectral images is 640 × 480. The first two rows of Figure 8.9 show a few samples of the PolyU dataset. Figure 8.9: Samples of VIS and NIR ocular images from the PolyU dataset. 
The first and second row represents the corresponding VIS and NIR ocular images of four different subjects, respectively. 8.3.4 WVU Dataset The WVU multimodal dataset [55] is a biometric dataset consists of images from face, iris, fingerprint, hand geometry, palmprint, and voice modalities. We utilize only face and iris modality images. The face images are acquired in visible illumination using Sony EVI-D30 and Sony EVI- D31 cameras from 269 subjects. The iris images are acquired in NIR illumination using the Irispass 161 iris sensor from 244 subjects. There are 234 subjects common in both modalities (face and iris). The total number of iris images is 3,099, and frontal face images is 1,746. The original size of the VIS face images captured from two sensors are 768 × 576 and 640 × 480. We crop left and right ocular regions from the face image, which results in the ocular image of size approx. 51 × 61 (varies as per the size of the ocular region). We further crop left and right iris regions from the left and right ocular images, respectively. The size of the cropped iris images is approximately 24 × 24 (varies as per the size of the iris region). The original size of NIR ocular images is 640 × 480. We crop left and right iris regions from the left and right NIR ocular images, respectively. The size of the cropped NIR iris images is approximately 300 × 300 (varies as per the size of the iris region). Figure 8.10 shows the original VIS face image, cropped VIS ocular image, cropped VIS iris image, and their corresponding NIR images. Figure 8.10: (a) A sample face image from the WVU dataset. The face image is in the VIS spectrum. (b) Cropped left and right VIS ocular images from the face image. (c) Cropped left and right iris images from the left and right ocular images, respectively. The size of ocular and iris VIS images are 51 × 61 and 24 × 24, respectively. (d) Left and right NIR ocular images from the WVU dataset. (e) Cropped left and right iris images from the left and right NIR ocular images, respectively. The size of ocular and iris NIR images are 640 × 480 and 300 × 300, respectively. 162 8.4 Experimental Setup and Results 8.4.1 BioCop-2008 and BioCop-2009 Dataset For the evaluation of cross-modal matching on BioCop-2008 and BioCop-2009 datasets, we con- sider three matching scenarios and two types of input (iris and ocular). In the first scenario (Face-Face), the iris or ocular region from the face visible image is matched against the face visible image. In the second (Iris-Iris), the iris or ocular region from the iris NIR image (original dataset image) is matched against the iris NIR image. In the last scenario (Iris-Face), the iris or ocular re- gion from the iris NIR image is matched against the face visible image. In the BioCop-2009 dataset, there are three experiments in Iris-Iris and Iris-Face scenarios corresponding to iris images captured from three different iris sensors (Aoptix, CrossMatch, and LG4000). In all experiments, we perform training on 70% of subjects and testing on the rest (30%) using the subject disjoint protocol. Table 8.1 provides the details on the genuine and impostor pairs used for training and testing in all three scenarios for the BioCop-2008 dataset. Table 8.2 provides the same details for the BioCop-2009 dataset. We utilize the entire training genuine pairs, but partial impostor pairs (50,000) to reduce time complexity. However, testing is performed on the entire genuine and impostor pairs. 
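The genuine/impostor pair construction summarized in Tables 8.1 and 8.2 follows the usual verification protocol; a minimal sketch is given below. Here `vis_samples` and `nir_samples` are illustrative lists of (subject_id, image) records from the two modalities of one split, and the impostor subsampling mirrors the 50,000-pair budget described above; this is a sketch of the protocol, not the exact implementation.

import random
from itertools import product

def build_cross_modal_pairs(vis_samples, nir_samples, num_impostor_pairs=50_000, seed=0):
    # A pair is genuine when the subject identifiers of the VIS and NIR records agree.
    # The full impostor set is far larger than the genuine set (e.g., roughly 1:645 on
    # BioCop-2008), so only a random subset of impostor pairs is kept for training.
    genuine, impostor = [], []
    for (sid_v, img_v), (sid_n, img_n) in product(vis_samples, nir_samples):
        (genuine if sid_v == sid_n else impostor).append((img_v, img_n))
    random.Random(seed).shuffle(impostor)
    return genuine, impostor[:num_impostor_pairs]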
We repeat the random selection of impostor pairs for training five times and report the cross-validation results. We perform iris recognition using VeriEye, USITv3.0, MT-CNN techniques, and ocular recognition using MT-CNN. The VeriEye is a commercially available off-the-shelf technique that performs iris recognition. It is used as a baseline. The USITv3.0 is an open-source iris recognition software toolkit from the University of Salzburg Iris Toolkit. We utilize the best-performing technique from the toolkit. The technique extracts iris-code using quadratic spline wavelet (QSW) and uses hamming distance to measure the dissimilarity between the iris codes. The technique also performs iris recognition and is considered a baseline for our cross-modal evaluation. We manually segment the iris images for USITv3.0, whereas VeriEye utilizes its iris segmentation module. Evaluation measures used in the experiments are True Match Rate (TMR) at 0.1% False Match Rate (FMR) and Equal Error Rate (EER). Tables 8.3 and 8.4 present the performance of all methods on the 163 Table 8.1: Description of genuine and impostor pairs used in experiments from the BioCop-2008 dataset. Train Set Test Set Genuine Impostor Impostor Genuine Impostor Pairs Pairs Pairs Used Pairs Pairs Face-Face 1,588 629,642 50,000 682 115,940 Iris-Iris 1,044 389,713 50,000 425 67,906 Face-iris 2,448 1,875,168 50,000 1,030 338,870 Table 8.2: Description of genuine and impostor pairs used in experiments from the BioCop-2009 dataset. Train Set Test Set Genuine Impostor Impostor Genuine Impostor Pairs Pairs Pairs Used Pairs Pairs Face-Face 1,538 590,592 50,000 660 109,230 Iris-Iris 15,550 587,522 50,000 6,600 107,912 (Aoptix) Iris-Iris 15,320 575,322 50,000 6,600 107,912 (CrossMatch) Iris-Iris 69,610 587,522 50,000 29,700 107,912 (LG4000) Face-iris 15,420 1,178,112 50,000 6,580 215,824 (Aoptix) Face-Iris 15,300 1,170,442 50,000 6,520 213,850 (CrossMatch) Face-Iris 30,800 1,178,112 50,000 13,160 215,824 (LG4000) BioCop-2008 and BioCop-2009 datasets, respectively. Figure 8.11 shows the ROC curves of four methods in the Iris-Face matching scenario and the histogram corresponds to the MT-CNN on the BioCop-2008 dataset. Figures 8.12a, 8.12b, and 8.12c show the ROC curves of four methods in the Iris-Face matching scenario corresponding to three iris sensors images and 8.12d shows histogram corresponds to the MT-CNN method on the BioCop-2009 dataset. The MT-CNN performs the best in Face-Face and Iris-Face matching scenarios on both datasets. VeriEye technique performs the best in the case of Iris-Iris matching on the BioCop-2008 dataset and the MT-CNN on the BioCop-2009 dataset. There occur a few segmentation failures in the 164 Table 8.3: Performance of different methods on the BioCop-2008 dataset. MT-CNN with ocular input outperforms on this dataset. VeriEye USITv3.0 MT-CNN(Iris) MT-CNN(Ocular) Experiments TMR (%) @ TMR (%) @ TMR (%) @ TMR (%) @ EER (%) EER (%) EER (%) EER (%) 0.1% FMR 0.1% FMR 0.1% FMR 0.1% FMR Face-Face 47.38 21.84 45.60 20.74 95.74 0.86 98.53 0.84 Iris-Iris 98.80 0.67 95.76 1.64 96.70 1.20 82.58 4.67 Iris-Face 34.84 29.07 14.46 37.01 47.45±1.58 9.05±0.45 50.46±2.48 7.32±0.47 Figure 8.11: ROC curves of different methods and histogram (MT-CNN) in the Iris-Face matching scenario on the BioCop-2008 dataset. MT-CNN with ocular input outperforms on this dataset. case of VeriEye, which are provided in Table 8.6. Figures 12 and 13 show a few failure cases from the BioCop-2008 and BioCop-2009 datasets, respectively. 
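The EER values reported alongside TMR in Tables 8.3 and 8.4 can be obtained from the genuine and impostor score lists as sketched below. This is a minimal NumPy sketch that assumes similarity scores and approximates the EER as the operating point where FMR and FNMR are closest over the observed thresholds.

import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    # FMR: fraction of impostor scores accepted at a threshold.
    # FNMR: fraction of genuine scores rejected at the same threshold.
    # The EER is taken where the two rates are (approximately) equal.
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    fmr = np.array([(impostor >= t).mean() for t in thresholds])
    fnmr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(fmr - fnmr))
    return 0.5 * float(fmr[idx] + fnmr[idx])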
There is no clear winner between iris and ocular recognition when considering the MT-CNN. The results show the efficiency of the learning-based method (MT-CNN) over the hand-crafted features-based techniques in the cross- modal matching scenario. In another experimental setup, we used a small set of 5,000 impostor pairs for the training and evaluate the DVG-based method on the BioCop-2008 dataset. Using the DVG-based method, we generate 50,000 genuine pairs and are included in the training process of the MT-CNN. The testing set remains the same as in Table 8.1. Only the cross-modal (Iris-Face) scenario is tested. Table 8.6 shows the results of MT-CNN when trained on only real genuine pairs and when trained on both real and synthetically generated genuine pairs (DVG-based method). As the DVG-based method 165 Table 8.4: Performance of different methods on the BioCop-2009 dataset. MT-CNN with iris input outperforms on Aoptix Insight and CrossMatch sensor images, whereas MT-CNN with ocular input outperforms on LG ICAM 4000 sensor images. VeriEye USITv3.0 MT-CNN(Iris) MT-CNN(Ocular) Experiments TMR (%) @ TMR (%) @ TMR (%) @ TMR (%) @ EER (%) EER (%) EER (%) EER (%) 0.1% FMR 0.1% FMR 0.1% FMR 0.1% FMR Face-Face 88.86 5.66 82.42 7.13 98.18 0.44 97.87 1.27 Iris-Iris 96.40 3.24 94.93 3.97 99.89 0.11 99.83 0.12 (Aoptix) Iris-Iris 99.96 0.03 99.89 0.10 99.98 0.02 99.56 0.31 (CrossMatch) Iris-Iris 99.85 0.14 99.60 0.34 99.88 0.11 97.04 0.44 (LG4000) Iris-Face 51.30 25.38 29.76 32.88 80.48 2.18 46.79 2.36 (Aoptix) Iris-Face 58.48 22.34 40.73 27.94 76.13 2.88 70.03 2.99 (CrossMatch) Iris-Face 28.05 21.09 39.60 29.39 89.38±0.95 2.16±0.15 91.82±3.26 1.55±0.15 (LG4000) Table 8.5: Number of genuine and impostor pairs excluded from the test set due to the segmentation errors by the VeriEye technique on both the datasets. The numbers shown in the parenthesis are the total number of genuine and impostor pairs used in the test set. BioCop-2008 BioCop-2009 Genuine Impostor Genuine Impostor Pairs Pairs Pairs Pairs Face-Face 12 (682) 1,357 (115,940) 17 (660) 3,926 (109,230) Iris-Iris 6 (425) 1,617 (67,906) 4 (6,600) 328 (107,912) (Aoptix) Iris-Iris - - 52 (6,600) 1,302 (107,912) (CrossMatch) Iris-Iris - - 100 (29,700) 655 (107,912) (LG4000) Face-iris 24 (1,030) 7,611 (338,870) 100 (6,580) 2,943 (215,824) (Aoptix) Face-Iris - - 110 (6,520) 4,214 (213,850) (CrossMatch) Face-Iris - - 40 (13,160) 692 (215,824) (LG4000) 166 Figure 8.12: ROC curves of different methods in the Iris-Face matching scenario on the BioCop- 2009 dataset corresponding to (a) Aoptix Insight, (b) CrossMatch I SCAN 2, and (c) LG ICAM 4000 iris sensors. MT-CNN with iris input outperforms on Aoptix Insight and CrossMatch sensor images, whereas MT-CNN with ocular input outperforms on LG ICAM 4000 sensor images. (d) Histogram corresponds to the MT-CNN method with ocular input on LG ICAM 4000 sensor images. generates the genuine pairs from the distribution of available real genuine pairs, therefore there is no significant improvement occurred in the performance. 8.4.2 PolyU Dataset For the evaluation of the cross-spectrum setting on the PolyU dataset, we use the subject-disjoint strategy in the experiments, where 60% of subjects were present in training and 40% in testing. It generates 4,067 genuine pairs and 8,199,862 impostor pairs for the training, whereas 2,220 genuine 167 Figure 8.13: Failure cases of the MT-CNN in genuine and impostor pairs from the BioCop-2008 dataset. 
Figure 8.14: Failure cases of the MT-CNN in genuine and impostor pairs from the BioCop-2009 dataset. The last row represents the Grad-CAM maps [245], which show the regions focused on by the network to make the decision.

From the train and test sets, we utilize all genuine pairs but only 10,000 randomly selected impostor pairs to reduce the computational time. Table 8.7 provides the number of genuine and impostor pairs used for training and testing. The evaluation measures used for the comparison are the TMR (%) at 0.1% FMR and the EER.

Table 8.6: TMR and EER of ocular and iris recognition methods on the entire test set of the BioCop-2008 dataset when a small set (5,000 impostor pairs) is used for training. Including additional training samples generated from the DVG-based method does not improve the performance.

                        Ocular                             Iris
  Experiments           TMR(%)@0.1%FMR   EER(%)            TMR(%)@0.1%FMR   EER(%)
  Multi-channel CNN     28.84 ± 3.27     9.69 ± 0.50       29.88 ± 4.07     10.87 ± 0.67
  DVG-based Method      29.15 ± 1.45     9.61 ± 0.41       28.63 ± 3.12     11.02 ± 0.66

Table 8.7: Data distribution among train and test sets from the PolyU dataset.

                    Train Set                          Test Set
                    Genuine Pairs   Impostor Pairs     Genuine Pairs   Impostor Pairs
  PolyU Dataset     4,067           10,000             2,220           2,430,900

The methods used for comparison on this dataset are VeriEye, MT-CNN, Pix2Pix GAN ID, StarGANv2, and BicycleGAN. Table 8.8 presents the results of all ocular and iris recognition algorithms. Figure 8.15 shows NIR ocular samples generated by StarGANv2 from VIS ocular images, and Figure 8.16 shows NIR iris-region images generated by StarGANv2 from VIS iris images. The MT-CNN outperforms the other methods. The StarGANv2-generated images perform better than the Pix2Pix GAN ID-generated images, even though StarGANv2 is trained on unpaired images whereas Pix2Pix GAN ID is trained on paired images. However, there is still scope for improvement, as the MT-CNN still performs better than the GAN-generated images. Ocular recognition performs better than iris recognition on this dataset.

Table 8.8: TMR (%) at 0.1% FMR and EER of all ocular and iris recognition methods on the entire test set of the PolyU dataset. MT-CNN outperforms in both ocular and iris recognition.

                               Ocular                            Iris
  Experiments                  TMR(%)@0.1%FMR   EER(%)           TMR(%)@0.1%FMR   EER(%)
  VeriEye                      -                -                56.77            18.19
  BicycleGAN                   67.19            6.55             3.30             20.56
  Pix2Pix GAN ID + MT-CNN      94.46 ± 2.82     1.33 ± 0.66      26.30 ± 9.29     14.24 ± 1.88
  StarGANv2 [53] + MT-CNN      98.58 ± 0.41     0.42 ± 0.06      28.24 ± 5.12     8.47 ± 0.44
  MT-CNN                       99.25 ± 0.29     0.35 ± 0.08      83.54 ± 0.93     2.77 ± 0.32

Figure 8.15: StarGANv2-generated ocular images from the VIS domain to the NIR domain.

Figure 8.16: StarGANv2-generated iris-region images from the VIS domain to the NIR domain.

8.4.3 WVU Dataset

For the evaluation of the cross-modal setting on the WVU dataset, we again follow the subject-disjoint strategy, where 60% (140) of the subjects are used in training and 40% (94) in testing. We perform experiments in three settings as before: the first setting matches VIS face images with VIS face images (Face-Face), the second setting matches NIR iris images with NIR iris images (Iris-Iris), and the third setting matches VIS face images with NIR iris images (Iris-Face). Table 8.9 provides the number of genuine and impostor pairs utilized for training and testing in all three settings. For training, we randomly select 50,000 impostor pairs from the entire train set.
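To make the pairing protocol concrete, the following is a minimal Python sketch of how a subject-disjoint split and the genuine/impostor pair lists described above can be constructed. The dictionaries, variable names, and the 50,000-pair sub-sampling call are illustrative assumptions, not the exact code used in this work.

import itertools
import random

def subject_disjoint_split(subjects, train_fraction=0.6, seed=0):
    # Shuffle subjects (not images) and split them so train and test share no identities.
    subjects = sorted(subjects)
    random.Random(seed).shuffle(subjects)
    cut = int(train_fraction * len(subjects))
    return subjects[:cut], subjects[cut:]

def build_pairs(images_a, images_b, subjects):
    # images_a / images_b: dict mapping subject id -> list of image paths for each modality.
    genuine = [(a, b) for s in subjects
               for a in images_a[s] for b in images_b[s]]
    impostor = [(a, b) for s1, s2 in itertools.permutations(subjects, 2)
                for a in images_a[s1] for b in images_b[s2]]
    return genuine, impostor

# Hypothetical usage:
# train_subj, test_subj = subject_disjoint_split(list(face_images.keys()))
# gen_train, imp_train = build_pairs(face_images, iris_images, train_subj)
# imp_train = random.Random(0).sample(imp_train, 50000)  # random impostor sub-sampling
# (repeating the sub-sampling with different seeds yields the repeated cross-validation runs)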
We perform experiments for both ocular and iris recognition, where the input is an ocular and an iris image, respectively. The evaluation measures used are the TMR at 0.1% FMR and the EER. To establish the maximum achievable recognition performance, we perform face recognition using COTS Rank One Computing (ROC), which produces 99.98% TMR at 0.1% FMR. Table 8.10 provides the ocular and iris recognition results in all three settings. Figure 8.17 shows the Receiver Operating Characteristic (ROC) curves corresponding to all methods in the Iris-Face scenario, along with the score histogram corresponding to the MT-CNN. Ocular recognition achieves its best performance (68.35% TMR @ 0.1% FMR) on VIS images (intra-modal scenario), and iris recognition achieves its best performance (92.90% TMR @ 0.1% FMR) on NIR images (intra-modal scenario). Both ocular and iris recognition drop significantly (1.15% TMR @ 0.1% FMR) under the cross-modal scenario, where VIS images (of very low resolution) are matched against NIR images.

Table 8.9: Data distribution among train and test sets for all three settings from the WVU dataset.

                Train Set                          Test Set
                Genuine Pairs   Impostor Pairs     Genuine Pairs   Impostor Pairs
  Face-Face     10,164          50,000             9,104           920,890
  Iris-Iris     5,668           50,000             4,792           870,882
  Face-Iris     13,016          50,000             11,296          866,651

Table 8.10: TMRs and EER of ocular and iris recognition techniques on the entire test set of the WVU dataset. All techniques fail on this dataset.

                VeriEye                     USITv3.0                    MT-CNN (Iris)               MT-CNN (Ocular)
  Experiments   TMR(%)@0.1%FMR  EER(%)      TMR(%)@0.1%FMR  EER(%)      TMR(%)@0.1%FMR  EER(%)      TMR(%)@0.1%FMR  EER(%)
  Face-Face     -               -           0.59            42.47       32.96           11.40       68.33           6.91
  Iris-Iris     98.95           0.79        96.38           2.15        92.90           3.58        42.94           15.77
  Iris-Face     -               -           0.03            48.91       0.88            36.83       1.07            34.12

The failure of the MT-CNN method under the cross-modal scenario is due to the very low resolution of the VIS ocular and iris images. Figure 8.18 shows some failure cases when the MT-CNN method is used, and Figure 8.19 shows the t-SNE [280] plot of features extracted from the MT-CNN for the genuine and impostor pairs. There is a large overlap between the features of genuine pairs and impostor pairs, which results in poor performance on the WVU dataset.

Figure 8.17: ROC curves of iris and ocular recognition techniques and histogram (MT-CNN) on the entire set of the WVU dataset. All techniques fail on this dataset.

Figure 8.18: Failure cases of the MT-CNN in genuine and impostor pairs. The last row represents the Grad-CAM maps [245], which show the regions focused on by the network to make the decision. The degraded and very low resolution of the ocular images in the VIS spectrum causes the poor performance of cross-modal matching on the WVU dataset.

Figure 8.19: t-SNE [280] plot of genuine and impostor pair features obtained from the MT-CNN network. There is a large overlap between the features of the two distributions. This degree of overlap could be used to identify the datasets on which cross-modal matching would be feasible.

Table 8.11: Iris color distribution of genuine scores obtained from the multi-channel CNN in three settings: Face-Face, Iris-Iris, and Face-Iris matching. The region used for matching is the ocular region.

  Eye Colors    Blue   Gray   Green   Hazel   Brown   Black   Total
  Face-Face     170    8      80      64      296     60      678
  Iris-Iris     94     7      42      48      178     39      408
  Face-Iris     284    8      163     121     429     85      1090

8.5 Impact of Eye Color on Cross-modal Matching

We analyze the influence of eye color on the ocular matching scores. The method used is the MT-CNN, and the dataset is the BioCop-2008 dataset. Eye color is the result of the amount of melanin present in the iris.
Dark-colored irides contain more melanin than light-colored irides. A high concentration of melanin in the iris absorbs most of the incident light, resulting in dark-colored irides, whereas a lack of melanin causes the light to scatter, resulting in light-colored irides. According to the melanin pigment concentration, eye colors from light to dark can be ordered as blue, gray, green, hazel, brown, and black. These iris colors can be grouped into two broader categories: light-colored irides (blue, gray, green) and dark-colored irides (hazel, brown, black). The BioCop-2008 dataset provides the eye color information for each subject.

We use the MT-CNN ocular matching technique to analyze the effect of eye color on the matching scores. We use three multi-channel CNN models trained for Face (VIS)-Face (VIS), Iris (NIR)-Iris (NIR), and Face (VIS)-Iris (NIR) matching. Genuine scores from all three models are used to analyze the effect of eye color on the matching scores. Table 8.11 provides the eye color distribution of the genuine scores obtained in all three matching scenarios, along with the total number of genuine scores in each scenario. Figures 8.20a, 8.20b, and 8.21 show the histograms of genuine scores corresponding to different eye colors under the three scenarios (Face-Face, Iris-Iris, Iris-Face), respectively. Figure 8.20a (Face-Face matching) shows that most of the genuine scores falling below the threshold (0.71) belong to light-colored irides (blue, gray, and green). However, no such pattern is noticeable in the Iris-Iris and Face-Iris scenarios.

Figure 8.20: (a) Histogram of genuine scores when the ocular regions from two face images (VIS) are matched. The threshold is 0.71 at 0.2% FMR. (b) Histogram of genuine scores when the ocular regions from two iris images (NIR) are matched. The threshold is 0.61 at 0.2% FMR.

Figure 8.21: Histogram of genuine scores when the ocular regions from the face (VIS) and iris (NIR) images are matched. The threshold is 0.79 at 0.2% FMR.

8.6 Summary

There are two main challenges when face images are matched against iris images (cross-modal recognition): a large domain gap and imbalanced training data. We address these challenges with three deep learning approaches operating at the feature level, image level, and training level. For the first approach, we use a multi-channel CNN; for the second, we use three GAN-based architectures (BicycleGAN, Pix2Pix GAN ID, and StarGANv2); and for the third, we utilize Dual Variational Generation (DVG). The first two approaches (feature-level and image-level) aim to reduce the domain gap, whereas the third approach addresses the imbalanced training data issue. Superior results on cross-modal (BioCop-2008, BioCop-2009, and WVU) and cross-spectrum (PolyU) datasets show their effectiveness. We further analyze the impact of eye color on cross-modal matching performance and find that eye color does not affect it.

CHAPTER 9

CONCLUSION

9.1 Research Contributions

Iris recognition has been widely used in a number of large-scale or high-security real-world applications. In this thesis, we focus on several aspects of iris biometrics. Our primary focus is to provide countermeasures against two adversarial attacks: presentation attacks and morph attacks. The second focus is on cross-modal matching of NIR iris images with RGB face images.
The first adversarial attack we attempt to counteract is the presentation attack (PA), which occurs when an adversary presents a fake or altered biometric sample to the iris sensor in order to circumvent the biometric system. We propose three iris PA detection methodologies based on the input signal available to facilitate PA detection. The first method, called D-NetPAD, utilizes a near-infrared iris image for iris PA detection. The method is based on a densely connected convolutional network, which effectively characterizes the bonafide iris pattern to deflect iris PAs. It generalizes well across unseen attacks, sensors, and datasets, and it emerged as the best performer in the Intelligence Advanced Research Projects Activity (IARPA) Odin program and the LivDet-Iris-2017 and LivDet-Iris-2020 competitions. The second proposed method utilizes additional hardware (a webcam) to capture short videos (∼4 secs) depicting the subject's behavior during their interaction with the iris sensor. The last proposed method also employs additional hardware, namely, Optical Coherence Tomography (OCT) imaging. OCT imaging provides a 2D cross-sectional view (internal structure) of the eye. Iris PA detection using OCT works on the principle that low-coherence light passes through the cornea of bonafide eyes, whereas it is partially (cosmetic contact lens) or completely (plastic eye) blocked by iris PAs. The blocking of the light causes voids in the imaged iris region and aids in the detection of iris PAs. Along with these PA detection methods, we also explain their performance using t-SNE scatter plots and Grad-CAM heatmaps.

We not only strive for high performance of the iris PA detectors but also assess the robustness of these models under input-image and architectural-parameter perturbations. In the case of input image perturbations, we apply various low- and high-frequency manipulations, Gaussian noise, and salt & pepper noise to the input images. We observe that D-NetPAD is comparatively robust to these input manipulations. In the case of architectural parameter perturbations, we apply Gaussian noise, weight zeroing, and weight scaling, and we observe that the proposed iris PAD is robust only to the weight zeroing manipulations. The robustness analysis is not confined to iris PA detection methods; it can be applied to any deep neural network.

Maintaining the performance of iris PA detectors in a non-stationary environment is another requirement for deploying iris PAD in real-world applications. We propose a retraining methodology in which we build a new PA detector using newly incoming training data and make the final decision for a probe sample as a weighted sum of the old and new PA detector scores. We assign the weights dynamically for each probe sample using in-domain models (separate from the iris PA detectors). Each in-domain model provides information about the membership of a probe sample to the corresponding training data.

The second adversarial attack we focus on is the morph attack. A morph attack entails the generation of an image (morphed image) that embodies multiple different identities. Morph attacks have not been widely analyzed in iris recognition. To the best of our knowledge, we are the first to report the vulnerability of the iris biometric system to morph attacks at the image level. We develop a landmark-based iris morphing scheme and demonstrate the potential of morphed iris images to attack the systems. We also develop a deep learning-based network to detect morphed iris images.
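To illustrate the basic idea behind landmark-based morphing, the following is a simplified Python/OpenCV sketch in which corresponding landmarks are used to align one iris image to the other before pixel-wise alpha blending. It is an illustrative stand-in under these assumptions, not the morphing scheme developed in this thesis; the image files and landmark arrays are hypothetical.

import cv2
import numpy as np

def morph_pair(img1, img2, landmarks1, landmarks2, alpha=0.5):
    # Estimate a similarity transform that maps image-2 landmarks onto image-1 landmarks.
    M, _ = cv2.estimateAffinePartial2D(landmarks2.astype(np.float32),
                                       landmarks1.astype(np.float32))
    # Warp the second image into the coordinate frame of the first.
    aligned2 = cv2.warpAffine(img2, M, (img1.shape[1], img1.shape[0]))
    # Pixel-wise alpha blend of the two aligned images to form the morph.
    return cv2.addWeighted(img1, alpha, aligned2, 1.0 - alpha, 0)

# Usage with hypothetical inputs:
# img1 = cv2.imread("iris_subject_A.png", cv2.IMREAD_GRAYSCALE)
# img2 = cv2.imread("iris_subject_B.png", cv2.IMREAD_GRAYSCALE)
# landmarks_A, landmarks_B = ...  # N x 2 arrays of corresponding landmark coordinates
# morphed = morph_pair(img1, img2, landmarks_A, landmarks_B, alpha=0.5)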
The last contribution of the thesis is to improve the performance of human recognition when matching iris images against face images. Such matching across different modalities is called cross-modal recognition. There are two main challenges: (i) a large domain gap due to different sensors, spectra, and resolutions, and (ii) an imbalance in the training data. We attempt to resolve these challenges with three deep learning approaches. The first approach operates at the feature level using a multi-channel CNN, which jointly extracts discriminative features from the images of both modalities to reduce the domain gap. The second approach operates at the image level, where the image from one domain is transformed into the other utilizing various GAN architectures. The third approach is training-based and resolves the imbalance in the training data by generating samples of the genuine class using the Dual Variational Generation (DVG) framework.

9.2 Future Work

We identify the following directions that require more attention in the future:

1. We proposed various effective software- and hardware-based methods (Chapters 2, 3 and 4) for iris presentation attack (PA) detection. However, there is still scope for performance improvement across datasets (Table 2.4). This would entail focusing on the generalizability of the PA detection solutions by either applying domain transfer techniques or updating the existing PA detector with new training data; this must be done in a manner that minimizes the data needed from previously unseen domains. Work is also required to provide generalizability to morph attack detection.

2. We attempted to explain the overall results of the iris PA detectors using Grad-CAM, t-SNE plots, and frequency analysis. However, explainability must also be imparted at the individual image level.

3. During the sensitivity analysis of deep neural models against weight perturbations, we found that the weights learned using the stochastic gradient descent algorithm are not optimum. Even setting randomly selected weights to zero improves the performance of the models. This observation indicates that additional strategies are required to find optimum weights for deep neural networks. In our work, we empirically selected high-performing models; in future work, we could attempt to analytically find the direction of optimum weights. Another noteworthy observation is that setting low-magnitude weights to zero improves the performance and reduces the model size. So, leveraging the sensitivity analysis, we could work in the direction of model compression or quantization.

4. In the retraining work, we introduce two new models with every batch of incoming data (or new task), which results in a linear increase in the number of models with the number of tasks. This raises concerns about the scalability of the proposed method. In future work, we could improve the scalability of the method by applying pre-conditions (performance difference or data distribution difference with the already available models) before building additional models.

5. In the cross-modal matching of iris images with face images, we observed the high performance of the feature-level method, which involves the extraction of common features from both modalities. We also utilized existing generative adversarial networks (GANs) to translate images from one domain to the other for matching, though the performance is not on par with the feature-level method. GAN-based techniques have shown great potential in various computer vision tasks.
Therefore, we could focus on designing a GAN architecture specific to cross-modal matching with effective loss functions. 180 BIBLIOGRAPHY 181 BIBLIOGRAPHY [1] Biometric e-passport. https://en.wikipedia.org/wiki/Biometric_passport. [2] IARPA, ODNI:IARPA-BAA-16-04 (Thor). https://www.iarpa.gov/index.php/research- programs/odin/odin-baa. [3] ISO/IEC 30107-1:2016: Information technology – Biometric Presentation Attack Detection – Part 1: Framework. https://www.iso.org/standard/53227.html. [4] NICE.I-Noisy Iris Challenge Evaluation Part I. http://nice1.di.ubi.pt/index.html. [5] THORLabs Telesto series (TEL1325LV2) Spectral domain OCT scanner. https://www.thorlabs.com/catalogpages/Obsolete/2017/TEL1325LV2-BU.pdf. [6] Unique Identification Authority of India: Govt. of India. Aadhaar Dashboard. https://uidai. gov.in/aadhaar_dashboard/. [7] Warsaw University of Technology, Poland. http://zbum.ia.pw.edu.pl/EN/node/46. [8] Andrea F. Abate, Maria Frucci, Chiara Galdi, and Daniel Riccio. BIRD: Watershed based iris detection for mobile devices. Pattern Recognition Letters (PRL), 57:43–51, 2015. Mobile Iris CHallenge Evaluation part I (MICHE I). [9] R. Abdal, Y. Qin, and P. Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? International Conference on Computer Vision (ICCV), pages 4431–4440, 2019. [10] Mohammed A. M. Abdullah, Satnam S. Dlay, Wai L. Woo, and Jonathon A. Chambers. A novel framework for cross-spectral iris matching. IPSJ Transactions on Computer Vision and Applications, 8, 2016. [11] Cristhian A. Aguilera, Francisco J. Aguilera, Angel D. Sappa, Cristhian Aguilera, and Ricardo Toledo. Learning cross-spectral similarity measures with deep convolutional neural networks. Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), page 9, 2016. [12] Fares S. Al-Qunaieer and Lahouari Ghouti. Color iris recognition using hypercomplex gabor wavelets. Symposium on Bio-inspired Learning and Intelligent Systems for Security, pages 18–19, 2009. [13] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. European Conference on Computer Vision (ECCV), pages 139–154, 2018. 182 [14] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. Advances in Neural Information Processing Systems (NeurIPS)), 32, 2019. [15] Fernando Alonso-Fernandez, Pedro Tome-Gonzalez, Virginia Ruiz-Albacete, and Javier Ortega-Garcia. Iris recognition based on sift features. IEEE International Conference on Biometrics, Identity and Security (BIdS), pages 1–8, 2009. [16] A. Anjos, M. M. Chakka, and S. Marcel. Motion-based counter-measures to photo attacks in face recognition. IET Biometrics, 3(3):147–158, 2014. [17] André Anjos and Sébastien Marcel. Counter-measures to photo attacks in face recognition: a public database and a baseline. International Joint Conference on Biometrics (IJCB), 2011. [18] S. S. Arora, M. Vatsa, R. Singh, and A. Jain. Iris recognition under alcohol influence: A preliminary study. IAPR International Conference on Biometrics (ICB), pages 336–341, 2012. [19] M. Arsalan, R. A. Naqvi, D. S. Kim, P. H. Nguyen, M. Owais, and K. R. Park. IrisDenseNet: Robust iris segmentation using densely connected fully convolutional networks in the images by visible light and near-infrared light camera sensors. Sensors, 2018. 
[20] Muhammad Arsalan, Hyung Gil Hong, Rizwan Ali Naqvi, Min Beom Lee, Min Cheol Kim, Dong Seop Kim, Chan Sik Kim, and Kang Ryoung Park. Deep learning-based iris segmentation for iris recognition in visible light environment. Symmetry, 9(11), 2017. [21] Sarah E. Baker, Amanda Hentz, Kevin W. Bowyer, and Patrick J. Flynn. Degradation of iris recognition performance due to non-cosmetic prescription contact lenses. Computer Vision and Image Understanding (CVIU), 114(9):1030–1044, 2010. [22] Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, and Jonghyun Choi. Rainbow memory: Continual learning with a memory of diverse samples. Conference on Computer Vision and Pattern Recognition (CVPR), pages 8218–8227, June 2021. [23] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4):469–483, 1996. [24] Carlos A. C. M. Bastos, Ing Ren Tsang, and George D. C. Calvalcanti. A combined pulling amp; pushing and active contour method for pupil segmentation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 850–853, 2010. [25] A. Bastys, J. Kranauskas, and R. Masiulis. Iris recognition by local extremum points of multiscale taylor expansion. Pattern Recognition (PR), 42(9):1869–1877, 2009. [26] T. Beier and S. Neely. Feature-based image metamorphosis. SIGGRAPH Computer Graphics, 26(2):35–42, 1992. 183 [27] Dalila Benboudjema, Nadia Othman, Bernadette Dorizzi, and Wojciech Pieczynski. Chal- lenging eye segmentation using triplet markov spatial models. IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), pages 1927–1931, 2013. [28] Oliver Bergamin, M. Bridget Zimmerman, and Randy Kardon. Pupil light reflex in normal and diseased eyes: diagnosis of visual dysfunction using waveform partitioning. Ophthal- mology, 110:106–14, 02 2003. [29] T. Bergmüller, L. Debiasi, A. Uhl, and Z. Sun. Impact of sensor ageing on iris recognition. IEEE International Joint Conference on Biometrics (IJCB), pages 1–8, 2014. [30] Rajesh M. Bodade and Sanjay N. Talbar. Shift invariant iris feature extraction using rotated complex wavelet and complex wavelet for iris recognition system. International Conference on Advances in Pattern Recognition (ICAPR), pages 449–452, 2009. [31] Vishnu Naresh Boddeti, B.V.K. Vijaya Kumar, and Krishnan Ramkumar. Improved iris segmentation based on local texture statistics. Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pages 2147–2151, 2011. [32] W.W. Boles and B. Boashash. A human identification technique using images of the iris and wavelet transform. IEEE Transactions on Signal Processing, 46(4):1185–1188, 1998. [33] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid. OULU-NPU: a mobile face presentation attack database with real-world variations. IEEE International Conference on Automatic Face and Gesture recognition (FG), 2017. [34] Zinelabidine Boulkenafet, Jukka Komulainen, Zahid Akhtar, Azeddine Benlamoudi, Djamel Samai, Salah Eddine Bekhouche, Abdelkrim Ouafi, Fadi Dornaika, Abdelmalik taleb ahmed, L Qin, F Peng, Le-Bing Zhang, M Long, Shruti Bhilare, V Kanhangad, Artur Costa- Pazo, Esteban Vazquez-Fernandez, D Perez-Cabo, J J. Moreira-Perez, and A Hadid. A competition on generalized software-based face presentation attack detection in mobile scenarios. International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2017. [35] K. W. Bowyer, K. P. 
Hollingsworth, and P. J. Flynn. A survey of Iris biometrics research: 2008-2010. Springer, 2013. [36] Kevin Bowyer, Sarah Baker, Amanda Hentz, Karen Hollingsworth, Tanya Peters, and Patrick Flynn. Factors that degrade the match distribution in iris biometrics. Identity in the Infor- mation Society, 2:327–343, 12 2009. [37] Kevin W. Bowyer, Karen Hollingsworth, and Patrick J. Flynn. Image understanding for iris biometrics: A survey. Computer Vision and Image Understanding (CVIU), 110(2):281–307, 2008. 184 [38] Aidan Boyd, Adam Czajka, and Kevin Bowyer. Deep learning-based feature extraction in iris recognition: Use existing models, fine-tune or train from scratch? International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–9, 2019. [39] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. ACM SIGMOD International Conference on Management of Data, page 93–104, 2000. [40] A. Bron, R. Tripathi, and B. Tripathi. Wolff ’s anatomy of the eye and orbit. 1998. [41] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. European Conference on Computer Vision (ECCV), pages 25–36, 2004. [42] Adrian Bulat, Jean Kossaifi, Georgios Tzimiropoulos, and Maja Pantic. Incremental multi- domain learning with network latent tensor factorization. AAAI Conference on Artificial Intelligence, 34(07):10470–10477, 2020. [43] Mark J. Burge and Matthew K. Monaco. Multispectral iris fusion for enhancement, inter- operability, and cross wavelength matching. Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XV, 7334:494–501, 2009. [44] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet Kumar Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. Continual learn- ing with tiny episodic memories. arXiv, abs/1902.10486, 2019. [45] C. Chen and A. Ross. Exploring the use of iriscodes for presentation attack detection. International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2018. [46] C. Chen and A. Ross. A multi-task convolutional neural network for joint iris detection and presentation attack detection. IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW), 2018. [47] C. Chen and A. Ross. An explainable attention-guided iris presentation attack detector. IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW), 2021. [48] Jianxu Chen, Feng Shen, Danny Ziyi Chen, and Patrick J. Flynn. Iris recognition based on human-interpretable features. IEEE Transactions on Information Forensics and Security (TIFS), 11(7):1476–1485, 2016. [49] Rui Chen, Xirong Lin, and Tianhuai Ding. Liveness detection for iris recognition using multispectral images. Pattern Recognition Letters (PRL), 33(12):1513–1519, 2012. [50] Zhiyuan Chen, Bing Liu, Ronald Brachman, Peter Stone, and Francesca Rossi. Lifelong Machine Learning. Morgan Claypool Publishers, 2nd edition, 2018. 185 [51] Nicholas Cheney, Martin Schrimpf, and Gabriel Kreiman. On the robustness of convolutional neural networks to internal architecture and weight perturbations. arXiv, abs/1703.08245, 2017. [52] Ivana Chingovska, J. Yang, Zhen Lei, Dong Yi, Stan Z. 
Li, Olga Kähm, Christian Glaser, Naser Damer, Arjan Kuijper, Alexander Nouak, Jukka Komulainen, Tiago Freitas Pereira, Shubham Gupta, Shubham Khandelwal, Shubham Bansal, Ayush Rai, Tarun Krishna, Dushyant Goyal, Muhammad-Adeel Waris, Honglei Zhang, Iftikhar Ahmad, Serkan Ki- ranyaz, Moncef Gabbouj, Roberto Tronci, Maurizio Pili, Nicola Sirena, Fabio Roli, Javier Galbally, Julian Fiérrez, Allan da Silva Pinto, Hélio Pedrini, W. S. Schwartz, Anderson Rocha, André Anjos, and Sébastien Marcel. The 2nd competition on counter measures to 2D face spoofing attacks. International Conference on Biometrics (ICB), pages 1–6, 2013. [53] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse im- age synthesis for multiple domains. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [54] Jonathan Connell, Nalini Ratha, James Gentile, and Ruud Bolle. Fake iris detection using structured light. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8692–8696, 2013. [55] S. Crihalmeanu, A. Ross, S. Schuckers, and L. Hornak. A protocol for multibiometric data acquisition, storage and dissemination. Technical Report, WVU, Lane Department of Computer Science and Electrical Engineering, 2007. [56] A. Czajka. Pupil dynamics for iris liveness detection. IEEE Transactions on Information Forensics and Security (TIFS), 10(4):726–735, 2015. [57] Adam Czajka. Iris liveness detection by modeling dynamic pupil features. In Mark J. Burge and Kevin W. Bowyer, editors, Handbook of Iris Recognition, volume 1542, pages 439–467. Springer-Verlag, 2013. [58] Adam Czajka and Kevin W. Bowyer. Presentation attack detection for iris recognition: An assessment of the state-of-the-art. ACM Computing Surveys (CSUR), 51(4):86:1–86:35, 2018. [59] Adam Czajka, Daniel Moreira, Kevin Bowyer, and Patrick Flynn. Domain-specific human- inspired binarized statistical image features for iris recognition. IEEE Winter Conference on Applications of Computer Vision (WACV), pages 959–967, 2019. [60] N. Damer, A. M. Saladié, A. Braun, and A. Kuijper. MorGAN: Recognition vulnerability and attack detectability of face morphing attacks created by generative adversarial network. International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–10, 2018. [61] P. Das, J. McGrath, A. Boyd Z. Fang, G. Jang, A. Mohammadi, S. Purnapatra, D. Yambay, S. Marcel, M. Trokielewicz, P. Maciejewicz, K. Bowyer, A. Czajka, S. Schuckers, J. Tapia, 186 S. Gonzalez, M. Fang, N. Damer, F. Boutros, A. Kuijper, R. Sharma, C. Chen, and A. Ross. Iris liveness detection competition (LivDet-Iris) – the 2020 edition. International Joint Conference on Biometrics (IJCB), 2020. [62] J. Daugman. How iris recognition works. Transactions on Circuits and Systems for Video Technology (TCSVT), 14(1), 2004. [63] J Daugman and C Downing. Epigenetic randomness, complexity and singularity of human iris patterns. Proceedings of the Royal Society B: Biological Sciences (Proc Biol Sci), 268:1737–40, 2001. [64] John Daugman. Countermeasures against subterfuge. Biometrics: Personal Identification in Networked Society, pages 103–121, 1999. [65] John Daugman. Demodulation by complex-valued wavelets for stochastic pattern recog- nition. International Journal of Wavelets, Multi-resolution and Information Processing, 1:1–17, 2003. [66] John Daugman. Recognizing persons by their iris patterns. Advances in Biometric Person Authentication, 3338:5–25, 2004. [67] John Daugman. 
New methods in iris recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(5):1167–1175, 2007. [68] John Daugman. Collision avoidance on national and global scales: Understanding and using big biometric entropy. TechRxiv, Feb 2021. [69] John G. Daugman. High confidence visual recognition of persons by a test of statistical independence. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 15(11), 1993. [70] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv, abs/1909.08383(6), 2019. [71] Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 1–1, 2021. [72] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. [73] Jia Deng, Wei Dong, Richard Socher, Li jia Li, Kai Li, and Li Fei-fei. ImageNet: a large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. 187 [74] Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, et al. Recent advances in deep learning for speech research at microsoft. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8604–8608, 2013. [75] Li Deng and Yang Liu. Deep learning in natural language processing. Springer, Singapore, 2018. [76] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 39(4):677–691, 2017. [77] Ruggero Donida Labati and Fabio Scotti. Noisy iris segmentation with boundary regu- larization and reflections removal. Image and Vision Computing (IVC), 28(2):270–277, 2010. [78] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR), 2021. [79] James S. Doyle and Kevin W. Bowyer. Robust detection of textured contact lenses in iris recognition using BSIF. IEEE Access, 3:1672–1683, 2015. [80] Y. Du, E. Arslanturk, Z. Zhou, and C. Belcher. Video-based noncooperative iris image segmentation. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 41(1):64–74, 2011. [81] Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, and Marcus Rohrbach. Adversarial continual learning. European Conference on Computer Vision (ECCV), pages 386–402, 2020. [82] Gizem Erdogan. Contact Lenses in Iris Recognition. M.S. dissertation, graduate theses, dissertations, and problem reports, number 4965, West Virginia University, 2013. [83] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: From adversarial to random noise. 
International Conference on Neural Informa- tion Processing Systems (NeurIPS), page 1632–1640, 2016. [84] S. P. Fenker, E. Ortiz, and K. W. Bowyer. Template aging phenomenon in iris recognition. IEEE Access, 1:266–274, 2013. [85] M. Ferrara, R. Cappelli, and D. Maltoni. On the feasibility of creating double-identity fingerprints. IEEE Transactions on Information Forensics and Security (TIFS), 12(4):892– 900, 2017. 188 [86] M. Ferrara, A. Franco, and D. Maltoni. The magic passport. IEEE International Joint Conference on Biometrics (IJCB), pages 1–7, 2014. [87] Maria Frucci, Michele Nappi, Daniel Riccio, and Gabriella Sanniti di Baja. WIRE: watershed based iris recognition. Pattern Recognition (PR), 52:148–159, 2016. [88] C. Fu, X. Wu, Y. Hu, H. Huang, and R. He. Dual variational generation for low-shot heterogeneous face recognition. Neural Information Processing Systems (NIPS), 2019. [89] J. Galbally, S. Marcel, and J. Fierrez. Image quality assessment for fake biometric detection: Application to iris, fingerprint, and face recognition. IEEE Transactions on Image Processing (TIP), 23(2):710–724, 2014. [90] Javier Galbally, Arun Ross, Marta Gomez-Barrero, Julian Fierrez, and Javier Ortega-Garcia. Iris image reconstruction from binary templates: An efficient probabilistic approach based on genetic algorithms. Computer Vision and Image Understanding (CVIU), 117(10):1512– 1525, 2013. [91] A. Gangwar and A. Joshi. DeepIrisNet: Deep iris representation with applications in iris recognition and cross-sensor iris recognition. IEEE International Conference on Image Processing (ICIP), pages 2301–2305, 2016. [92] A. Gangwar, Akanksha Joshi, Padmaja Joshi, and Ramachandra Raghavendra. DeepIrisNet2: learning deep-iriscodes from scratch for segmentation-robust visible wavelength and near infrared iris recognition. arXiv, abs/1902.05390, 2019. [93] Prachi Garg, Rohit Saluja, Vineeth N Balasubramanian, Chetan Arora, Anbumani Subra- manian, and C. V. Jawahar. Multi-domain incremental learning for semantic segmentation. arXiv, abs/2110.12205, 2021. [94] Siddhant Garg, Adarsh Kumar, Vibhor Goel, and Yingyu Liang. Can adversarial weight perturbations inject neural backdoors. ACM International Conference on Information and Knowledge Management (CIKM), page 2029–2032, 2020. [95] M. Gomez-Barrero, C. Rathgeb, U. Scherhag, and C. Busch. Predicting the vulnerability of biometric systems to attacks based on morphed biometric information. IET Biometrics, 7(4):333–341, 2018. [96] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adver- sarial examples. International Conference on Learning Representations (ICLR), 2015. [97] Diego Gragnaniello, Giovanni Poggi, Carlo Sansone, and Luisa Verdoliva. An investigation of local descriptors for biometric spoofing detection. IEEE Transactions on Information Forensics and Security (TIFS), 10(4):849–863, 2015. [98] Diego Gragnaniello, Carlo Sansone, and Luisa Verdoliva. Iris liveness detection for mobile devices based on local descriptors. Pattern Recognition Letters (PRL), 57:81–87, 2015. 189 [99] P. Grother, J. R. Matey, E. Tabassi, G. W. Quinn, and M. Chumakov. IREX VI: temporal stability of iris recognition accuracy. NIST Interagency Report 7948, 2013. [100] P. Grother, G. W. Quinn, J. R. Matey, M. L. Ngan, W. J. Salamon, G. P. Fiumara, and C. I. Watson. IREX III - Performance of iris identification algorithms. NIST Interagency/Internal Report (NISTIR) 7836, 2012. [101] Murthy Gudlavalleti, Sanjeev Gupta, Neena John, and Praveen Vashist. 
Current status of cataract blindness and vision 2020: The right to sight initiative in india. Indian Journal of Ophthalmology (IJO), 56:489–94, 05 2008. [102] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. International Conference on Neural Information Processing Systems (NeurIPS), page 1135–1143, 2015. [103] M. Happold. Structured forest edge detectors for improved eyelid and iris segmentation. International Conference of the Biometrics Special Interest Group (CCPR), page 28–33, 2015. [104] Bilal Hassan, Ramsha Ahmed, Taimur Hassan, and Naoufel Werghi. SIP-SegNet: a deep convolutional encoder-decoder network for joint semantic segmentation and extraction of sclera, iris and pupil based on periocular region suppression. arXiv, abs/2003.00825, 2020. [105] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. [106] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. [107] R. He, X. Wu, Z. Sun, and T. Tan. Wasserstein CNN: Learning invariant features for NIR-VIS face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 41(07), 2019. [108] X. He, Y. Lu, and P. Shi. A fake iris detection method based on FFT and quality assessment. Chinese Conference on Pattern Recognition (CCPR), pages 1–4, 2008. [109] Zhaofeng He, Zhenan Sun, Tieniu Tan, and Zhuoshi Wei. Efficient iris spoof detection via boosted local binary patterns. International Conference on Biometrics (ICB), 5558:1080– 1090, 2009. [110] Kevin Hernandez-Diaz, Fernando Alonso-Fernandez, and Josef Bigun. Cross-spectral peri- ocular recognition with conditional adversarial networks. International Joint Conference on Biometrics (IJCB), 2020. 190 [111] H. Hofbauer, I. Tomeo-Reyes, and A. Uhl. Isolating iris template ageing in a semi-controlled environment. International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–5, 2016. [112] Heinz Hofbauer, Ehsaneddin Jalilian, and Andreas Uhl. Exploiting superior CNN-based iris segmentation for better recognition accuracy. Pattern Recognition Letters (PRL), 120:17–23, 2019. [113] Steven Hoffman, Renu Sharma, and Arun Ross. Convolutional neural networks for iris presentation attack detection: Toward cross-dataset and cross-sensor generalization. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1701– 17018, 2018. [114] Steven Hoffman, Renu Sharma, and Arun Ross. Iris + ocular: Generalized iris presentation attack detection using multiple convolutional neural networks. International Conference on Biometrics (ICB), 2019. [115] Karen Hollingsworth, Kevin Bowyer, and Patrick Flynn. Pupil dilation degrades iris bio- metric performance. Computer Vision and Image Understanding (CVIU), 113:150–157, 01 2009. [116] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classi- fication. Association for Computational Linguistics (ACL), 2018. [117] Sheng-Hsun Hsieh, Yunghui Li, Wei Wang, and Chung-Hao Tien. A novel anti-spoofing solution for iris recognition toward cosmetic contact lens attack using spectral ICA analysis. Sensors, 18:795–810, 2018. [118] Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. 
Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv, abs/1810.12488, 2019. [119] Yang Hu, Konstantinos Sirlantzis, and Gareth Howells. Iris liveness detection using regional features. Pattern Recognition Letters (PRL), 82:242–250, 2016. [120] D. Huang, E. A. Swanson, C. P. Lin, J. S. Schuman, W. G. Stinson, W. Chang, M. R. Hee, T. Flotte, K. Gregory, C. A. Puliafito, and et al. Optical coherence tomography. Science, 254:1178–1181, 1991. [121] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. Densely connected convolutional networks. Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261– 2269, 2017. [122] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolu- tional networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017. 191 [123] K. Hughes and K. W. Bowyer. Detection of contact-lens-based iris biometric spoofs using stereo imaging. Hawaii International Conference on System Sciences (HICSS), 2013. [124] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning (ICML), 37:448–456, 2015. [125] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017. [126] A. K. Jain, A. A. Ross, and K. Nandakumar. Introduction to Biometrics. Springer Publishing Company, 2011. [127] Ann A. Jarjes, Kuanquan Wang, and Ghassan J. Mohammed. Iris localization: Detect- ing accurate pupil contour and localizing limbus boundary. 2010 2nd International Asia Conference on Informatics in Control, Automation and Robotics (CAR 2010), 1:349–352, 2010. [128] Dae Sik Jeong, Jae Won Hwang, Byung Jun Kang, Kang Ryoung Park, Chee Sun Won, Dong-Kwon Park, and Jaihie Kim. A new iris segmentation method for non-ideal iris images. Image and Vision Computing (IVC), 28(2):254–260, 2010. [129] R. Jillela and A. Ross. Matching face against iris images using periocular information. IEEE International Conference on Image Processing (ICIP), pages 4997–5001, 2014. [130] Raghavender R. Jillela and A. Ross. Methods for Iris Segmentation. Springer London, 2013. [131] Liu Jin, Fu Xiao, and Wang Haopeng. Iris image segmentation based on k-means cluster. IEEE International Conference on Intelligent Computing and Intelligent Systems, 3:194–198, 2010. [132] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. European Conference on Computer Vision (ECCV), 2016. [133] Amin Jourabloo, Yaojie Liu, and Xiaoming Liu. Face de-spoofing: Anti-spoofing via noise modeling. European Conference on Computer Vision (ECCV), 2018. [134] Roy K. and Bhattacharya P. Iris recognition in nonideal situations. International Conference on Information Security (ISC), 5735, 2009. [135] Miwa Kanematsu, Hironobu Takano, and Kiyomi Nakamura. Highly reliable liveness de- tection method for iris recognition. SICE Annual Conference, pages 361–364, 2007. [136] Ta-Chu Kao, Kristopher Jensen, Gido van de Ven, Alberto Bernacchia, and Guillaume Hen- nequin. Natural continual learning: success is a journey, not (just) a destination. Advances in Neural Information Processing Systems (NeurIPS), 34:28067–28079, 2021. 192 [137] Nikolaos Karianakis, Jingming Dong, and Stefano Soatto. 
An empirical evaluation of current convolutional architectures’ ability to manage nuisance location and scale variability. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4442–4451, 2016. [138] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya- narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017. [139] R. Kerekes, B. Narayanaswamy, J. Thornton, M. Savvides, and B. V. K. Vijaya Kumar. Graphical model approach to iris matching under deformation and occlusion. IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 1–6, 2007. [140] Younghwan Kim, Jang-Hee Yoo, and Kyoungho Choi. A motion and similarity-based fake detection method for biometric face recognition systems. IEEE Transactions on Consumer Electronics (TCE), 57, 2011. [141] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (PNAS), 114(13):3521–3526, 2017. [142] Maki Kojima, Toshiki Shioiri, Toshihiro Hosoki, Hideaki Kitamura, Takehiko Bando, and Toshiyuki Someya. Pupillary light reflex in panic disorder: A trial using audiovisual stim- ulation. European Archives of Psychiatry and Clinical Neuroscience (EAPCN), 254:242–4, 09 2004. [143] O. V. Komogortsev, A. Karpov, and C. D. Holland. Attack of mechanical replicas: Liveness detection with eye movements. IEEE Transactions on Information Forensics and Security (TIFS), 10(4):716–725, 2015. [144] Oleg Komogortsev and Alex Karpov. Liveness detection via oculomotor plant characteristics: Attack of mechanical replicas. International Conference on Biometrics (ICB), pages 1–8, 2013. [145] Jukka Komulainen, Abdenour Hadid, and Matti Pietikäinen. Context based face anti- spoofing. International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–8, 2013. [146] Jukka Komulainen, Abdenour Hadid, Matti Pietikainen, André Anjos, and Sébastien Marcel. Complementary countermeasures for detecting scenic face spoofing attacks. International Conference on Biometrics (ICB), 2013. [147] Emine Krichen. Lef3a: Pupil segmentation using viterbi search algorithm. IAPR Interna- tional Conference on Biometrics (ICB), pages 323–329, 2012. 193 [148] A. Kumar, T.-S. Chan, and C.-W. Tan. Human identification from at-a-distance face im- ages using sparse representation of local iris features. IAPR International Conference on Biometrics (ICB), pages 303–309, 2012. [149] Ajay Kumar and Tak-Shing Chan. Iris recognition using quaternionic sparse orientation code (QSOC). IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 59–64, 2012. [150] Ajay Kumar and Arun Passi. Comparison and combination of iris matchers for reliable personal authentication. Pattern Recognition (PR), 43(3):1016 – 1026, 2010. [151] L. Ma, T. Tan, Y. Wang, and D. Zhang. Efficient iris recognition by characterizing key local variations. Transactions on Image Processing (TIP), 13(6):739–750, 2004. [152] S. J. Lee, K. R. Park, and J. Kim. Robust fake iris detection based on variation of the reflectance ratio between the iris and the sclera. Biometrics Symposium: Special Session on Research at the Biometric Consortium Conference, pages 1–6, 2006. 
[153] Soochan Lee, Junsoo Ha, Dongsu Zhang, and Gunhee Kim. A neural dirichlet process mixture model for task-free continual learning. International Conference on Learning Rep- resentations (ICLR), 2020. [154] Sung Lee, Kang Park, Youn Lee, Kwanghyuk Bae, and Jai Kim. Multifeature-based fake iris detection method. Optical Engineering, 46(12), 2007. [155] Timothée Lesort, Massimo Caccia, and Irina Rish. Understanding continual learning settings with data distribution drift analysis. arXiv, abs/2104.01678, 2021. [156] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize: Meta-learning for domain generalization. AAAI Conference on Artificial Intelligence, 2018. [157] Haiqing Li, Zhenan Sun, and Tieniu Tan. Robust iris segmentation based on learned boundary detectors. IAPR International Conference on Biometrics (ICB), pages 317–322, 2012. [158] X. Li. Modeling intra-class variation for nonideal iris recognition. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3832 LNCS:419–427, 2006. [159] Y. Li and M. Savvides. An automatic iris occlusion estimation method based on high- dimensional density estimation. IEEE Transactions on Pattern Analysis and Machine Intel- ligence (PAMI), 35(4):784–796, 2013. [160] Yung-Hui Li, Po-Jen Huang, and Yun Juan. An efficient and robust iris segmentation algorithm using deep learning. Mobile Information Systems, 2019. [161] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(12):2935–2947, 2017. 194 [162] Jie Lin, Jian-Ping Li, Hui Lin, and Ji Ming. Robust person identification with face and iris by modified PUM method. International Conference on Apperceiving Computing and Intelligence Analysis (ICACIA), pages 321–324, 2009. [163] Christoph Lippert, Riccardo Sabatini, M. Cyrus Maher, Eun Yong Kang, Seunghak Lee, Okan Arikan, Alena Harley, Axel Bernal, Peter Garst, Victor Lavrenko, Ken Yocum, Theodore Wong, Mingfu Zhu, Wen-Yun Yang, Chris Chang, Tim Lu, Charlie W. H. Lee, Barry Hicks, Smriti Ramakrishnan, Haibao Tang, Chao Xie, Jason Piper, Suzanne Brew- erton, Yaron Turpaz, Amalio Telenti, Rhonda K. Roby, Franz J. Och, and J. Craig Venter. Identification of individuals by trait prediction using whole-genome sequencing data. Pro- ceedings of the National Academy of Sciences (PNAS), 114(38):10166–10171, 2017. [164] Nianfeng Liu, Man Zhang, Haiqing Li, Zhenan Sun, and Tieniu Tan. DeepIris: learning pairwise filter bank for heterogeneous iris verification. Pattern Recognition Letters (PRL), 82:154–161, 2016. [165] Yaojie Liu, Amin Jourabloo, and Xiaoming Liu. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. Computer Vision and Pattern Recognition (CVPR), 2018. [166] Yaojie Liu, Joel Stehouwer, Amin Jourabloo, and Xiaoming Liu. Deep tree learning for zero-shot face anti-spoofing. Computer Vision and Pattern Recognition (CVPR), 2019. [167] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017. [168] Kamachi M., Hill H.and Lander K., and Vatikiotis-Bateson E. Putting the face to the voice: matching identity across modality. Current Biology, 13(19), 2003. [169] Li Ma, Tieniu Tan, Yunhong Wang, and Dexin Zhang. Local intensity variation analysis for iris recognition. Pattern Recognition (PR), 37(6):1287–1298, 2004. [170] Neil A. Macmillan and C. 
Douglas Creelman. Detection theory: A user’s guide. Lawrence Erlbaum Associates. [171] A. Makrushin, T. Neubert, and J. Dittmann. Automatic generation and detection of visually faultless facial morphs. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), pages 39–50, 2017. [172] Sébastien Marcel, Mark S. Nixon, Julian Fiérrez, and Nicholas W. D. Evans, editors. Hand- book of Biometric Anti-Spoofing - Presentation Attack Detection, Second Edition. Advances in Computer Vision and Pattern Recognition. Springer, 2019. [173] Libor Masek. Recognition of human iris patterns for biometric identification. Technical report, 2003. 195 [174] Carver Mead and Mohammed Ismail, editors. Analog VLSI Implementation of Neural Systems. The Kluwer International Series in Engineering and Computer Science. Kluwer / Springer US, 1989. [175] Hunny Mehrotra, Banshidhar Majhi, and Phalguni Gupta. Annular iris recognition using SURF. Pattern Recognition and Machine Intelligence (PAMI), pages 464–469, 2009. [176] David Menotti, Giovani Chiachia, Allan Pinto, William Robson Schwartz, Helio Pedrini, Alexandre Xavier Falcao, and Anderson Rocha. Deep Representations for Iris, Face, and Fingerprint Spoofing Detection. IEEE Transactions on Information Forensics and Security (TIFS), 10(4):864–879, 2015. [177] K. Miyazawa, K. Ito, T. Aoki, K. Kobayashi, and H. Nakajima. An effective approach for iris recognition using phase-based image matching. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(10):1741–1756, 2008. [178] Murali Mohan Chakka, André Anjos, Sébastien Marcel, Roberto Tronci, Daniele Muntoni, Gianluca Fadda, Maurizio Pili, Nicola Sirena, Gabriele Murgia, Marco Ristori, Fabio Roli, Junjie Yan, Dong Yi, Zhen Lei, Zhiwei Zhang, Stan Li, William Schwartz, Anderson Rocha, Helio Pedrini, and Matti Pietikainen. Competition on counter measures to 2-D facial spoofing attacks. International Joint Conference on Biometrics (IJCB), pages 1 – 6, 2011. [179] D. M. Monro, S. Rakshit, and D. Zhang. DCT-based iris recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(4):586–595, 2007. [180] Y. Moolla, L. Darlow, A. Sharma, A. Singh, and J. Van Der Merwe. Optical coherence tomography for fingerprint presentation attack detection. Handbook of Biometric Anti- Spoofing, pages 49–70, 2019. [181] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Uni- versal adversarial perturbations. IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), 2017. [182] Jose G. Moreno-Torres, Troy Raeder, RocÃo Alaiz-RodrÃguez, Nitesh V. Chawla, and Francisco Herrera. A unifying view on dataset shift in classification. Pattern Recognition (PR), 45(1):521–530, 2012. [183] Satish Mulleti and Chandra Sekhar Seelamantula. Ellipse fitting using the finite rate of innovation sampling principle. IEEE Transactions on Image Processing (TIP), 25(3):1451– 1464, 2016. [184] Tajbakhsh N., Misaghian K., and Bandari N.M. A region-based iris feature extraction method based on 2d-wavelet transform. Biometric ID Management and Multimodal Communication (BioID), 5707, 2009. 196 [185] A. Nagrani, S. Albanie, and A. Zisserman. Seeing voices and hearing faces: Cross-modal biometric matching. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8427–8436, 2018. [186] P. R. Nalla and A. Kumar. Toward more accurate iris recognition using cross-spectral matching. 
IEEE Transactions on Image Processing (TIP), 26(1):208–221, 2017. [187] K. Nguyen, C. Fookes, and S. Sridharan. Fusing shrinking and expanding active contour models for robust iris segmentation. International Conference on Information Sciences, Signal Processing and their Applications (ISSPA), pages 185–188, 2010. [188] Kien Nguyen, Clinton Fookes, Raghavender Jillela, Sridha Sridharan, and Arun Ross. Long range iris recognition: A survey. Pattern Recognition (PR), 72:123–143, 2017. [189] Kien Nguyen, Clinton Fookes, Arun Ross, and Sridha Sridharan. Iris recognition with off-the-shelf CNN features: A deep learning perspective. IEEE Access, 6:18848–18855, 2018. [190] Ishan Nigam, Mayank Vatsa, and Richa Singh. Ophthalmic Disorder Menagerie and Iris Recognition, pages 359–396. Springer London, 2016. [191] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein. Sensitivity and generalization in neural networks: an empirical study. Interna- tional Conference on Learning Representations (ICLR), 2018. [192] Maulisa Oktiana, Takahiko Horiuchi, Keita Hirai, Khairun Saddami, Fitri Arnia, Yuwaldi Away, and Khairul Munadi. Cross-spectral iris recognition using phase-based matching and homomorphic filtering. Heliyon, 6(2):e03407, 2020. [193] Andrzej Pacut and Adam Czajka. Aliveness detection for iris biometrics. IEEE International Carnahan Conferences Security Technology (ICCST), pages 122 – 129, 2006. [194] Federico Pala and Bir Bhanu. Iris liveness detection by relative distance comparisons. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 664–671, 2017. [195] Gang Pan, Lin Sun, Zhaohui Wu, and Yueming Wang. Monocular camera-based face liveness detection by combining eyeblink and scene context. Telecommunication Systems, 47(3):215–225, 2011. [196] Lili Pan, Mei Xie, Tao Zheng, and Jianli Ren. A robust iris localization model based on phase congruency and least trimmed squares estimation. Image Analysis and Processing (ICIAP), 2009. [197] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345–1359, 2010. 197 [198] Jong Hyun Park and Moon-Gi Kang. Multispectral iris authentication system against coun- terfeit attack using gradient-based image fusion. Optical Engineering, 46(11):1–14, 2007. [199] Keyurkumar Patel, Hu Han, and Anil K. Jain. Cross-database face antispoofing with robust feature representation. In Zhisheng You, Jie Zhou, Yunhong Wang, Zhenan Sun, Shiguang Shan, Weishi Zheng, Jianjiang Feng, and Qijun Zhao, editors, Biometric Recognition, pages 611–619, Cham, 2016. Springer International Publishing. [200] C. Patil and S. Patilkulkarni. An approach to enhance security environment based on sift feature extraction and matching to iris recognition. Information Processing and Management, page 527–530, 2010. [201] J. K. Pillai, M. Puertas, and R. Chellappa. Cross-sensor iris recognition through kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 36(1):73– 85, 2014. [202] Jaishanker K. Pillai, Vishal M. Patel, Rama Chellappa, and Nalini K. Ratha. Secure and robust iris recognition using random projections and sparse representations. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(9):1877–1893, 2011. [203] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45, 2006. [204] Ameya Prabhu, Philip H. S. Torr, and Puneet K. Dokania. 
GDumb: a simple approach that questions our progress in continual learning. European Conference on Computer Vision (ECCV), pages 524–540, 2020.
[205] Hugo Proença and Luis A. Alexandre. Toward noncooperative iris recognition: A classification approach using multiple signatures. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(4):607–612, 2007.
[206] H. Proença and J. C. Neves. A reminiscence of “mastermind”: Iris/periocular biometrics by “in-set” CNN iterative analysis. IEEE Transactions on Information Forensics and Security (TIFS), 14(7):1702–1712, 2019.
[207] Hugo Proença. Iris recognition: On the segmentation of degraded images acquired in the visible wavelength. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 32(8):1502–1516, 2010.
[208] Hugo Proença and João C. Neves. IRINA: iris recognition (even) in inaccurately segmented data. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6747–6756, 2017.
[209] S. J. Pundlik, D. L. Woodard, and S. T. Birchfield. Non-ideal iris segmentation using graph cuts. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1–6, 2008.
[210] K. R. Radhika, S. V. Sheela, M. K. Venkatesha, and G. N. Sekhar. Multi-modal authentication using continuous dynamic programming. Biometric ID Management and Multimodal Communication, pages 228–235, 2009.
[211] R. Raghavendra and Christoph Busch. Robust scheme for iris presentation attack detection using multiscale binarized statistical image features. IEEE Transactions on Information Forensics and Security (TIFS), 10(4):703–715, 2015.
[212] R. Raghavendra, K. B. Raja, and C. Busch. ContlensNet: robust iris contact lens detection using deep convolutional neural networks. IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1160–1167, 2017.
[213] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. International Conference on Machine Learning (ICML), 97:5301–5310, 2019.
[214] K. B. Raja, R. Raghavendra, and C. Busch. Video presentation attack detection in visible spectrum iris recognition using magnified phase information. IEEE Transactions on Information Forensics and Security (TIFS), 10(10):2048–2056, 2015.
[215] K. B. Raja, R. Raghavendra, and C. Busch. Cross-spectrum periocular authentication for NIR and visible images using bank of statistical filters. International Conference on Imaging Systems and Techniques (IST), pages 227–231, 2016.
[216] K. B. Raja, R. Raghavendra, and C. Busch. Scale-level score fusion of steered pyramid features for cross-spectral periocular verification. International Conference on Information Fusion (Fusion), pages 1–7, 2017.
[217] Kiran B. Raja, Raghavendra Ramachandra, and Christoph Busch. Presentation attack detection using Laplacian decomposed frequency response for visible spectrum and near-infra-red iris systems. International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–8, 2015.
[218] M. R. Rajput and G. S. Sable. Iris biometrics survey 2010–2015. IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), pages 2028–2033, 2016.
[219] Raghavendra Ramachandra and Christoph Busch. Presentation attack detection on visible spectrum iris recognition by exploring inherent characteristics of light field camera.
International Joint Conference on Biometrics (IJCB), pages 1–8, 2014.
[220] Raghavendra Ramachandra and Christoph Busch. Robust scheme for iris presentation attack detection using multiscale binarized statistical image features. IEEE Transactions on Information Forensics and Security (TIFS), 10:703–715, 2015.
[221] Raghavendra Ramachandra and Christoph Busch. Presentation attack detection methods for face recognition systems: A comprehensive survey. ACM Computing Surveys, 50(1):8:1–8:37, 2017.
[222] N. P. Ramaiah and A. Kumar. On matching cross-spectral periocular images for accurate biometrics identification. International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–6, 2016.
[223] N. Pattabhi Ramaiah and A. Kumar. Toward more accurate iris recognition using cross-spectral matching. IEEE Transactions on Image Processing (TIP), 26(1):208–221, 2017.
[224] Nalini Ratha, Jonathan Connell, and Ruud Bolle. Enhancing security and privacy in biometrics-based authentication systems. IBM Systems Journal, 40:614–634, 2001.
[225] C. Rathgeb and C. Busch. On the feasibility of creating morphed iris-codes. IEEE International Joint Conference on Biometrics (IJCB), pages 152–157, 2017.
[226] C. Rathgeb, F. Struck, and C. Busch. Efficient BSIF-based near-infrared iris recognition. International Conference on Image Processing Theory, Tools and Applications (IPTA), pages 1–6, 2016.
[227] Christian Rathgeb, Andreas Uhl, Peter Wild, and Heinz Hofbauer. Design Decisions for an Iris Recognition SDK, pages 359–396. Springer London, 2016.
[228] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, G. Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533–5542, 2017.
[229] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
[230] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems (NeurIPS), 32, 2019.
[231] Tang Rongnian and Weng Shaojie. Improving iris segmentation performance via borders recognition. International Conference on Intelligent Computation Technology and Automation, 2:580–583, 2011.
[232] O. Ronneberger, P. Fischer, and T. Brox. U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 9351:234–241, 2015.
[233] Amir Rosenfeld and John K. Tsotsos. Incremental learning through deep adaptation. arXiv, abs/1705.04228, 2018.
[234] A. Ross, S. Banerjee, C. Chen, A. Chowdhury, V. Mirjalili, R. Sharma, T. Swearingen, and S. Yadav. Some research problems in biometrics: The future beckons. International Conference on Biometrics (ICB), 2019.
[235] A. Ross and S. Shah. Segmenting non-ideal irises using geodesic active contours. Biometrics Symposium: Special Session on Research at the Biometric Consortium Conference, pages 1–6, 2006.
[236] Wayne J. Ryan, Damon L. Woodard, Andrew T. Duchowski, and Stan T. Birchfield. Adapting starburst for elliptical iris segmentation. IEEE Second International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–7, 2008.
[237] H. J. Santos-Villalobos, D. R. Barstow, M. Karakaya, C. B. Boehnen, and E. Chaum. ORNL biometric eye model for iris recognition.
IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 176–182, 2012.
[238] Mousumi Sardar, Subhashis Banerjee, and Sushmita Mitra. Iris segmentation using interactive deep learning. IEEE Access, 8:219322–219330, 2020.
[239] Nadezhda Sazonova, Fang Hua, Xuan Liu, Jeremiah Remus, Arun Ross, Lawrence Hornak, and Stephanie Schuckers. A study on quality-adjusted impact of time lapse on iris recognition. Proceedings of the SPIE, 8371:320–328, 2012.
[240] U. Scherhag, A. Nautsch, C. Rathgeb, M. Gomez-Barrero, R. Veldhuis, L. Spreeuwers, M. Schils, D. Maltoni, P. Grother, S. Marcel, R. Breithaupt, R. Raghavendra, and C. Busch. Biometric systems under morphing attacks: Assessment of morphing techniques and vulnerability reporting. International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–7, 2017.
[241] U. Scherhag, R. Raghavendra, K. B. Raja, M. Gomez-Barrero, C. Rathgeb, and C. Busch. On the vulnerability of face recognition systems towards morphed face attacks. International Workshop on Biometrics and Forensics (IWBF), pages 1–6, 2017.
[242] U. Scherhag, C. Rathgeb, J. Merkle, R. Breithaupt, and C. Busch. Face recognition systems under morphing attacks: A survey. IEEE Access, 7:23012–23026, 2019.
[243] S. A. C. Schuckers, N. A. Schmid, A. Abhyankar, V. Dorairaj, C. K. Boyce, and L. A. Hornak. On techniques for angle compensation in nonideal iris recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(5):1176–1190, 2007.
[244] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. International Conference on Machine Learning (ICML), pages 4528–4537, 2018.
[245] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. IEEE International Conference on Computer Vision (ICCV), 2017.
[246] A. F. Sequeira, S. Thavalengal, J. Ferryman, P. Corcoran, and J. S. Cardoso. A realistic evaluation of iris presentation attack detection. International Conference on Telecommunications and Signal Processing (TSP), pages 660–664, 2016.
[247] S. Shah and A. Ross. Iris segmentation using geodesic active contours. IEEE Transactions on Information Forensics and Security (TIFS), 4(4):824–836, 2009.
[248] A. Sharma, S. Verma, M. Vatsa, and R. Singh. On cross spectral periocular recognition. International Conference on Image Processing (ICIP), pages 5007–5011, 2014.
[249] R. Sharma and A. Ross. D-NetPAD: an explainable and interpretable iris presentation attack detector. International Joint Conference on Biometrics (IJCB), 2020.
[250] Renu Sharma and Arun Ross. Viability of optical coherence tomography for iris presentation attack detection. International Conference on Pattern Recognition (ICPR), 2021.
[251] E. Shechtman, A. Rav-Acha, M. Irani, and S. Seitz. Regenerative morphing. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[252] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
[253] Hai Shu and Hongtu Zhu. Sensitivity analysis of deep neural networks. arXiv, abs/1901.07152, 2019.
[254] Karen Simonyan and Andrew Zisserman.
Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.
[255] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.
[256] G. Song, K. K. Chu, S. Kim, M. Crose, B. Cox, E. T. Jelly, N. Ulrich, and A. Wax. First clinical application of low-cost OCT. Translational Vision Science and Technology (TVST), 8(3):61, 2019.
[257] Zhenan Sun and Tieniu Tan. Ordinal measures for iris recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31(12):2211–2226, 2009.
[258] Manisha Sam Sunder and Arun Ross. Iris image retrieval based on macro-features. International Conference on Pattern Recognition (ICPR), pages 1318–1321, 2010.
[259] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. International Conference on Learning Representations (ICLR), 2014.
[260] E. Tabassi, P. Grother, and W. Salamon. IREX II - IQCE: Iris Quality Calibration and Evaluation. NIST Interagency/Internal Report (NISTIR) 7820, 2011.
[261] C.-W. Tan and A. Kumar. Accurate iris recognition at a distance using stabilized iris encoding and Zernike moments phase features. IEEE Transactions on Image Processing (TIP), 23(9):3962–3974, 2014.
[262] C.-W. Tan and A. Kumar. Efficient and accurate at-a-distance iris recognition using geometric key-based iris encoding. IEEE Transactions on Information Forensics and Security (TIFS), 9(9):1518–1526, 2014.
[263] Chun-Wei Tan and Ajay Kumar. Unified framework for automated iris segmentation using distantly acquired face images. IEEE Transactions on Image Processing (TIP), 21(9):4068–4079, 2012.
[264] Tieniu Tan, Zhaofeng He, and Zhenan Sun. Efficient and robust segmentation of noisy iris images for non-cooperative iris recognition. Image and Vision Computing (IVC), 28(2):223–230, 2010.
[265] S. Thavalengal, T. Nedelcu, P. Bigioi, and P. Corcoran. Iris liveness detection for next generation smartphones. IEEE Transactions on Consumer Electronics (TCE), 62(2):95–102, 2016.
[266] Shejin Thavalengal, Tudor Nedelcu, Petronel Bigioi, and Peter Corcoran. Iris liveness detection for next generation smartphones. IEEE Transactions on Consumer Electronics (TCE), 62:95–102, 2016.
[267] The Notre Dame Contact Lense Dataset 2015. https://cvrl.nd.edu/projects/data/#the-notre-dame-contact-lense-dataset-2015ndcld15.
[268] J. Thornton, M. Savvides, and B. V. K. V. Kumar. A Bayesian approach to deformed pattern matching of iris images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(4):596–606, 2007.
[269] P. Tome-Gonzalez, F. Alonso-Fernandez, and J. Ortega-Garcia. On the effects of time variability in iris recognition. International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–6, 2008.
[270] I. Tomeo-Reyes, A. Ross, and V. Chandran. Investigating the impact of drug induced pupil dilation on automated iris recognition. IEEE International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–8, 2016.
[271] Inmaculada Tomeo-Reyes. Robust Iris Recognition using Decision Fusion and Degradation Modelling. Ph.D. dissertation, Queensland University of Technology, 2015.
[272] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri.
Learning spatiotemporal features with 3D convolutional networks. International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
[273] M. Trokielewicz. Linear regression analysis of template aging in iris biometrics. International Workshop on Biometrics and Forensics (IWBF), pages 1–6, 2015.
[274] Mateusz Trokielewicz, Adam Czajka, and Piotr Maciejewicz. Implications of ocular pathologies for iris recognition reliability. arXiv, abs/1809.00168, 2018.
[275] Mateusz Trokielewicz, Adam Czajka, and Piotr Maciejewicz. Post-mortem iris recognition with deep-learning-based image segmentation. Image and Vision Computing (IVC), 94:103866, 2020.
[276] Yu-Lin Tsai, Chia-Yi Hsu, Chia-Mu Yu, and Pin-Yu Chen. Formalizing generalization and adversarial robustness of neural networks to weight perturbations. Advances in Neural Information Processing Systems (NeurIPS), 34:19692–19704, 2021.
[277] Gido M. van de Ven and Andreas S. Tolias. Generative replay with feedback connections as a general strategy for continual learning. arXiv, abs/1809.10635, 2018.
[278] Gido M. van de Ven and Andreas S. Tolias. Three scenarios for continual learning. arXiv, abs/1904.07734, 2019.
[279] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(11), 2008.
[280] L. J. P. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research (JMLR), pages 2579–2605, 2008.
[281] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
[282] Mayank Vatsa, Richa Singh, and Afzel Noore. Improving iris recognition performance using segmentation, quality enhancement, match score fusion, and indexing. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(4):1021–1035, 2008.
[283] Vladan Velisavljevic. Low-complexity iris coding and recognition based on directionlets. IEEE Transactions on Information Forensics and Security (TIFS), 4(3):410–417, 2009.
[284] S. Venkatesh, R. Ramachandra, K. Raja, and C. Busch. Face morphing attack generation and detection: A comprehensive survey. arXiv, abs/2011.02045, 2020.
[285] F. M. Villalobos-Castaldi and E. Suaste-Gómez. In the use of the spontaneous pupillary oscillations as a new biometric trait. International Workshop on Biometrics and Forensics (IWBF), pages 1–6, 2014.
[286] Ritesh Vyas, Tirupathiraju Kanumuri, and Gyanendra Sheoran. Cross spectral iris recognition for surveillance based applications. Multimedia Tools and Applications (MTA), 78(5):5681–5699, 2019.
[287] Kuo Wang and Ajay Kumar. Cross-spectral iris recognition using CNN and supervised discrete hashing. Pattern Recognition (PR), 86:85–98, 2019.
[288] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. European Conference on Computer Vision (ECCV), pages 20–36, 2016.
[289] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: enhanced super-resolution generative adversarial networks. European Conference on Computer Vision (ECCV) Workshops, pages 63–79, 2018.
[290] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4):600–612, 2004.
[291] Zhuoshi Wei, Tieniu Tan, and Zhenan Sun. Nonlinear iris deformation correction based on Gaussian model. In Seong-Whan Lee and Stan Z. Li, editors, Advances in Biometrics, pages 780–789, 2007.
[292] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 2016.
[293] Tsui-Wei Weng, Pu Zhao, Sijia Liu, Pin-Yu Chen, Xue Lin, and Luca Daniel. Towards certificated model robustness against weight perturbations. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):6356–6363, 2020.
[294] Peter Wild, James Ferryman, and Andreas Uhl. Impact of (segmentation) quality on long vs. short-timespan assessments in iris recognition performance. IET Biometrics, 4:227–235, 2015.
[295] R. P. Wildes. Iris recognition: an emerging biometric technology. Proceedings of the IEEE, 85(9):1348–1363, 1997.
[296] Yue Wu, Yan-Jia Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[297] Lin Xiang, Xiaoqin Zeng, Yuhu Niu, and Yanjun Liu. Study of sensitivity to weight perturbation for convolution neural network. IEEE Access, 7:93898–93908, 2019.
[298] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1492–1500, 2017.
[299] Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. International Conference on Neural Information Processing (ICONIP), 11953:264–274, 2019.
[300] D. Yadav, N. Kohli, J. S. Doyle, R. Singh, M. Vatsa, and K. W. Bowyer. Unraveling the effect of textured contact lenses on iris recognition. IEEE Transactions on Information Forensics and Security (TIFS), 9(5):851–862, 2014.
[301] D. Yadav, N. Kohli, M. Vatsa, R. Singh, and A. Noore. Detecting textured contact lens in uncontrolled environment using DensePAD. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2336–2344, 2019.
[302] Shivangi Yadav, Cunjian Chen, and Arun Ross. Relativistic discriminator: A one-class classifier for generalized iris presentation attack detection. IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.
[303] Shivangi Yadav and Arun Ross. CIT-GAN: Cyclic image translation generative adversarial network with application in iris presentation attack detection. IEEE Winter Conference on Applications of Computer Vision (WACV), 2021.
[304] D. Yambay, B. Becker, N. Kohli, D. Yadav, A. Czajka, K. W. Bowyer, S. Schuckers, R. Singh, M. Vatsa, A. Noore, D. Gragnaniello, C. Sansone, L. Verdoliva, L. He, Y. Ru, H. Li, N. Liu, Z. Sun, and T. Tan. LivDet iris 2017 - Iris liveness detection competition 2017. IEEE International Joint Conference on Biometrics (IJCB), pages 733–741, 2017.
[305] D. Yambay, J. S. Doyle, K. W. Bowyer, A. Czajka, and S. Schuckers. LivDet-iris 2013 - Iris liveness detection competition 2013. IEEE International Joint Conference on Biometrics (IJCB), pages 1–8, 2014.
[306] David Yambay, Brian Walczak, Stephanie Schuckers, and Adam Czajka. LivDet-Iris 2015 - Iris liveness detection competition 2015. IEEE International Conference on Identity, Security, and Behavior Analysis (ISBA), pages 1–6, 2017.
[307] Fei Yan, Yantao Tian, Haiwei Wu, Yanhua Zhou, Liuyang Cao, and Changjiu Zhou. Iris segmentation using watershed and region merging.
IEEE Conference on Industrial Electronics and Applications, pages 835–840.
[308] Junjie Yan, Zhiwei Zhang, Zhen Lei, Dong Yi, and Stan Z. Li. Face liveness detection by exploring multiple scenic clues. International Conference on Control Automation Robotics and Vision (ICARCV), pages 188–193, 2012.
[309] Daniel S. Yeung, Ian Cloete, Daming Shi, and Wing W. Y. Ng. Sensitivity Analysis for Neural Networks. Springer Publishing Company, Incorporated, 2009.
[310] Sowon Yoon, Kwanghyuk Bae, Kang Ryoung Park, and Jaihie Kim. Pan-tilt-zoom based iris image capturing system for unconstrained user environments at a distance. Advances in Biometrics, pages 653–662, 2007.
[311] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. Conference on Computer Vision and Pattern Recognition (CVPR), pages 4353–4361, 2015.
[312] G. Zeng, Y. Chen, B. Cui, and S. Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1:364–372, 2019.
[313] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. International Conference on Machine Learning (ICML), pages 3987–3995, 2017.
[314] H. Zhang, S. Venkatesh, R. Ramachandra, K. Raja, N. Damer, and C. Busch. MIPGAN – generating robust and high quality morph attacks using identity prior driven GAN. arXiv, abs/2009.01729, 2020.
[315] Hui Bin Zhang, Zhenan Sun, Tieniu Tan, and Jianyu Wang. Learning hierarchical visual codebook for iris liveness detection. International Joint Conference on Biometrics (IJCB), 2011.
[316] Zijing Zhao and Ajay Kumar. An accurate iris segmentation framework under relaxed imaging constraints using total variation model. IEEE International Conference on Computer Vision (ICCV), pages 3828–3836, 2015.
[317] Bo-Ren Zheng, Dai-Yan Ji, and Yung-Hui Li. Heterogeneous iris recognition using heterogeneous eigeniris and sparse representation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3764–3768, 2014.
[318] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
[319] J. Zuo and N. A. Schmid. On a methodology for robust segmentation of nonideal iris images. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 40(3):703–718, 2010.
[320] Jinyu Zuo and Natalia A. Schmid. An automatic algorithm for evaluating the precision of iris segmentation. International Conference on Biometrics: Theory, Applications and Systems (BTAS), pages 1–6, 2008.