TOWARDS A ROBUST UNCONSTRAINED FACE RECOGNITION PIPELINE WITH DEEP NEURAL NETWORKS

By

Yichun Shi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2021

ABSTRACT

Face recognition is a classic problem in the field of computer vision and pattern recognition due to its wide applications in real-world problems such as access control, identity verification, physical security, surveillance, etc. Recent progress in deep learning techniques and access to large-scale face databases have led to a significant improvement of face recognition accuracy under constrained and semi-constrained scenarios. Deep neural networks have been shown to surpass human performance on Labeled Faces in the Wild (LFW), which consists of celebrity photos captured in the wild. However, in many applications, e.g., surveillance videos, where we cannot assume that the presented face is under controlled variations, the performance of current DNN-based methods drops significantly. The main challenges in such an unconstrained face recognition problem include, but are not limited to: lack of labeled data, robust face normalization, discriminative representation learning and the ambiguity of facial features caused by information loss.

In this thesis, we propose a set of methods that attempt to address the above challenges in unconstrained face recognition systems. Starting from a classic deep face recognition pipeline, we review how each step in this pipeline could fail on low-quality uncontrolled input faces, what kinds of solutions have been studied before, and then introduce our proposed methods. The various methods proposed in this thesis are independent but compatible with each other. Experiments on several challenging benchmarks, e.g., IJB-C and IJB-S, show that the proposed methods are able to improve the robustness and reliability of deep unconstrained face recognition systems. Our solution achieves state-of-the-art performance, i.e., 95.0% TAR@FAR=0.001% on the IJB-C dataset and a 61.98% Rank-1 retrieval rate on the surveillance-to-booking protocol of the IJB-S dataset.
Copyright by
YICHUN SHI
2021

Dedicated to my parents

ACKNOWLEDGMENTS

My deep gratitude first goes to my advisor, Dr. Anil K. Jain. Six years ago, I had the fortune to be admitted by Dr. Jain to his PRIP lab as a senior undergraduate student. Since then, he has been giving valuable suggestions and guidance for both my research and my life. In this lab, I have learned how to conduct research, from looking for topics and conducting experiments to writing papers. I have also gained inspiration and skills from Dr. Jain that will benefit me throughout my life. I appreciate his encouragement when I had troubles and his corrections when I made mistakes. His life-long enthusiasm for research has been, and will always be, a great inspiration for my career.

I thank all of the members of the PRIP lab for participating in my research and providing valuable feedback on my work. I will remember the joy that we shared and the discussions that we had.

I thank all the members of my PhD committee, namely Dr. Xiaoming Liu, Dr. Vishnu Naresh Boddeti, and Dr. Mi Zhang, for their valuable suggestions on my thesis work.

I thank my entire family for their love and support, especially my parents. Without them, I would not have had the chance to come to MSU to become a PhD student.

I thank all my friends in East Lansing, who have been supporting me and keeping me company. My thanks also go to all the other excellent scholars and fellow students whom I had the chance to talk with or learn from in the past five years.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

KEY TO ABBREVIATIONS

Chapter 1  Introduction
  1.1 Applications of Automatic Face Recognition
    1.1.1 Security
    1.1.2 Access Control
    1.1.3 Identification
    1.1.4 Surveillance
  1.2 The Development of Automatic Face Recognition
    1.2.1 Traditional Solutions
    1.2.2 Deep Face Recognition
    1.2.3 From Constrained to Unconstrained Face Recognition
  1.3 Pipeline of Automatic Face Recognition
  1.4 Challenges in Unconstrained Face Recognition
  1.5 Evaluation Metrics and Datasets
    1.5.1 Evaluation
    1.5.2 Datasets
  1.6 Dissertation Contributions
  1.7 Thesis Structure

Chapter 2  Learning Local Face Features with Visual Attention
  2.1 Introduction
  2.2 Related Work
    2.2.1 Parts-based Deep Face Recognition
    2.2.2 Visual Attention Network
  2.3 Approach
    2.3.1 Overall Architecture
    2.3.2 Attention Network
    2.3.3 Sub-Network for Modeling Facial Parts
    2.3.4 Promoting Sub-networks for Feature Exploration
  2.4 Experiments
    2.4.1 Implementation Details
    2.4.2 Evaluation of Proposed Modules on LFW
    2.4.3 Evaluation on IJB-A and IJB-B Benchmarks
  2.5 Conclusion

Chapter 3  Uncertainty Estimation for Deep Face Recognition
  3.1 Introduction
  3.2 Related Work
  3.3 Limitations of Deterministic Embeddings
  3.4 Probabilistic Face Embeddings
    3.4.1 Matching with PFEs
    3.4.2 Fusion with PFEs
    3.4.3 Learning
  3.5 Implementation Details
    3.5.1 Data Preprocessing
    3.5.2 Base Models
    3.5.3 Uncertainty Module
  3.6 Experiments
    3.6.1 Experiments on Different Base Embeddings
    3.6.2 Comparison with State-Of-The-Art
  3.7 Results on Different Architectures
    3.7.1 Qualitative Analysis
  3.8 Risk-controlled Face Recognition
  3.9 Conclusion

Chapter 4  Universal Face Representation Learning
  4.1 Related Work
  4.2 Proposed Approach
    4.2.1 Confidence-Aware Identification Loss
    4.2.2 Confidence-Aware Sub-Embeddings
    4.2.3 Sub-Embeddings Decorrelation
    4.2.4 Mining for Further Variations
    4.2.5 Uncertainty-Guided Probabilistic Aggregation
  4.3 Implementation Details
  4.4 Experiments
    4.4.1 Datasets
    4.4.2 Ablation Study
    4.4.3 Evaluation on General Datasets
    4.4.4 Evaluation on Mixed/Low Quality Datasets
  4.5 Conclusion

Chapter 5  Generalizing Face Representation with Unlabeled Images
  5.1 Introduction
  5.2 Related Work
    5.2.1 Semi-supervised Learning
    5.2.2 Domain Adaptation and Generalization
  5.3 Methodology
    5.3.1 Minimizing Error in the Labeled Domain
    5.3.2 Minimizing Domain Gap
    5.3.3 Minimizing Error in the Unlabeled Domains
  5.4 Experiments
    5.4.1 Implementation Details
    5.4.2 Datasets
    5.4.3 Ablation Study
    5.4.4 Quantity vs. Diversity
  5.5 Choice of the Unlabeled Dataset
    5.5.1 Comparison with State-of-the-Art FR Methods
  5.6 Conclusions

Chapter 6  Summary
  6.1 Contributions
  6.2 Suggestions for Future Work

APPENDIX

BIBLIOGRAPHY
LIST OF TABLES

Table 2.1  The architecture of the attention network.

Table 2.2  The architecture of the sub-networks.

Table 2.3  Evaluation results of the proposed model with/without certain modules on the standard LFW and BLUFR protocols. "PL" refers to "Promotion Loss". Accuracy is tested on the standard LFW verification protocol. Verification Rate (VR) and Detection and Identification Rate (DIR) are tested on the BLUFR protocol.

Table 2.4  Evaluation results on the IJB-A 1:1 Comparison and 1:N Search protocols.

Table 2.5  Evaluation results on the IJB-B 1:1 Baseline Verification and 1:N Mixed Media Identification protocols.

Table 3.1  Results of models trained on CASIA-WebFace. The better performance for each base model is shown in bold. IJB-A results are verification rates at FAR = 0.1%.

Table 3.2  Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on LFW, YTF and MegaFace. The MegaFace verification rates are computed at FAR = 0.0001%. Missing entries indicate that the authors did not report performance on the corresponding protocol.

Table 3.3  Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on CFP (frontal-profile protocol) and IJB-A.

Table 3.4  Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on IJB-C.

Table 3.5  Performance comparison on three protocols of IJB-S. The performance is reported in terms of rank retrieval (closed-set) and TPIR@FPIR (open-set) instead of the media-normalized version [1]. The numbers 1% and 10% in the second row refer to the FPIR.

Table 3.6  Results of different network architectures trained on CASIA-WebFace. The better performance for each base model is shown in bold. IJB-A results are verification rates at FAR = 0.1%.

Table 4.1  Ablation study over the whole framework. VA: Variation Augmentation (Section 4.2); CI: Confidence-aware Identification loss (Section 4.2.1); ME: Multiple Embeddings (Section 4.2.3); DE: Decorrelated Embeddings (Section 4.2.3); PA: Probabilistic Aggregation (Section 4.2.5). E(all) uses all the proposed modules.

Table 4.2  Our method compared to state-of-the-art methods on Type I datasets. The MegaFace verification rates are computed at FAR = 0.0001%. Missing entries indicate that the authors did not report performance on the corresponding protocol.

Table 4.3  Our model compared to state-of-the-art methods on IJB-A, IJB-C and IJB-S. Missing entries indicate that the authors did not report performance on the corresponding protocol. Some methods are fine-tuned on the target dataset during evaluation on the IJB-A benchmark; others are tested using the released models from the corresponding authors.

Table 5.1  Ablation study over different training methods of the embedding network. All models have the identification loss by default.

Table 5.2  Ablation study over different training methods of the augmentation network. The first row is a baseline that uses only the domain adversarial loss but no augmentation network. A single-mode translation network that does not use a latent style code is also compared.

Table 5.3  Performance comparison with state-of-the-art methods on the IJB-C dataset.

Table 5.4  Performance comparison with state-of-the-art methods on the IJB-B dataset.

Table 5.5  Performance on the IJB-S benchmark.

LIST OF FIGURES

Figure 1.1  Example applications of face recognition.

Figure 1.2  The pipeline of automatic face recognition systems. Here, we assume the face images are already detected and hence omit the detection step.

Figure 1.3  Example images of six representative datasets. The images are sampled from MS-Celeb-1M [2], LFW [3], CFP [4], IJB-A [5], IJB-S [1] and TinyFace [6], respectively.

Figure 2.1  Example images in LFW and IJB-B after alignment using MTCNN [7]. The images in the first row are well aligned and all the facial parts are located in a consistent way. The face images in the second and third rows, although aligned, still appear in quite different ways because of large pose variations or occlusion.

Figure 2.2  An example architecture of the proposed end-to-end network with two sub-networks. A 96×112 image is first fed into the base-network, which is a single CNN for face recognition. The feature map of the last convolutional layer of the base-network is then used both to learn a global representation with a fully connected layer, and to predict transformation matrices with an attention network of two stacked fully-connected layers. The regions of interest are sampled into patches of size 48×48. Smaller CNNs as sub-networks follow to learn local features from these automatically localized patches. All the global and local features are then concatenated and fused by another fully connected layer.

Figure 2.3  Magnitude of the weights of the fusion layer over different input dimensions when using different weights for the promotion loss. Without the promotion loss, many dimensions have little weight. Dropout helps to promote the weights, but diminishes the performance.

Figure 2.4  Example pairs that are misclassified by the base-network but are classified correctly on the LFW dataset. Pairs in the green box are genuine pairs and pairs in the red box are impostor pairs. We use the average threshold of BLUFR [8] face verification for VR@FAR = 0.1% on 10 splits.

Figure 2.5  Examples of the localized regions in Model A. The attention network localizes the eyes, nose and mouth accurately by learning without landmark labels. These accurately localized patches make it an easier task for the sub-networks to learn robust features from certain facial parts.

Figure 2.6  Example pairs that are misclassified by the base-network but are classified correctly by Model B on the IJB-B dataset. Pairs in the green box are genuine pairs and pairs in the red box are impostor pairs. We use the threshold of IJB-B 1:1 Baseline Verification for TAR@FAR = 0.1%.

Figure 3.1  Difference between deterministic face embeddings and probabilistic face embeddings (PFEs). Deterministic embeddings represent every face as a point in the latent space without regard to its feature ambiguity. A probabilistic face embedding (PFE) gives a distributional estimation of features in the latent space instead. Best viewed in color.

Figure 3.2  Illustration of the feature ambiguity dilemma. The plots show the cosine similarity on the LFW dataset with different degrees of degradation. Blue lines show the similarity between original images and their respective degraded versions. Red lines show the similarity between impostor pairs of degraded images. The shading indicates the standard deviation. With larger degrees of degradation, the model becomes more confident (very high/low scores) in a wrong way.

Figure 3.3  Example genuine pairs from the IJB-A dataset estimated with the lowest similarity scores and impostor pairs with the highest similarity scores (among all possible pairs) by a 64-layer CNN model. The genuine pairs mostly consist of one high-quality and one low-quality image while the impostor pairs are all low-quality images. Note that these pairs are not templates in the verification protocol.

Figure 3.4  Fusion with PFEs. (a) Illustration of the fusion process as a directed graphical model. (b) Given the Gaussian representations of faces (from the same identity), the fusion process outputs a new Gaussian distribution in the latent space with a more precise mean and lower uncertainty.

Figure 3.5  Repeated experiments on the feature ambiguity dilemma with the proposed PFE. The same model as in Figure 3.2 is used as the base model and is converted to a PFE by training an uncertainty module. No additional training data or data augmentation is used for training.

Figure 3.6  Example genuine pairs from the IJB-A dataset estimated with the lowest mutual likelihood scores and impostor pairs with the highest scores by the PFE version of the same 64-layer CNN model in Section 3.3. In comparison to Figure 3.3, most images here are high-quality ones with clear features, which can mislead the model to be confident in a wrong way. Note that these pairs are not templates in the verification protocol.

Figure 3.7  Distribution of estimated uncertainty on different datasets. Here, the uncertainty refers to the harmonic mean of the estimated variance across all feature dimensions. Note that the estimated uncertainty is proportional to the complexity of the datasets. Best viewed in color.

Figure 3.8  Visualization results on a high-quality, a low-quality and a mis-detected image from IJB-A. For each input, 5 images are reconstructed by a pre-trained decoder using the mean and 4 randomly sampled z vectors from the estimated distribution p(z|x).

Figure 3.9  Example images from LFW and IJB-A that are estimated with the highest (H) confidence/quality scores and the lowest (L) scores by our method and the MTCNN face detector.

Figure 3.10  Comparison of verification performance on LFW and IJB-A (not the original protocol) by filtering a proportion of images using different quality criteria.

Figure 4.1  Traditional recognition models require target domain data to adapt from the high-quality training data to conduct unconstrained/low-quality face recognition. A model ensemble is further needed for a universal representation purpose, which significantly increases model complexity. In contrast, our method works only on the original training data without any target domain data information, and can deal with unconstrained testing scenarios.

Figure 4.2  Samples from MS-Celeb-1M [2] with augmentation along different variations.

Figure 4.3  Overview of the proposed method. High-quality input images are first augmented according to pre-defined variations, i.e., blur, occlusion and pose. The feature representation is then split into sub-embeddings associated with sample-specific confidences. A confidence-aware identification loss and a variation decorrelation loss are developed to learn the sub-embeddings.

Figure 4.4  Illustration of confidence-aware embedding learning on data of varying quality. With confidence guidance, the learned prototype is closer to the high-quality samples, which represent the identity better.

Figure 4.5  The correlation matrices of sub-embeddings obtained by splitting the feature vector into different sizes. The correlation is computed in terms of distance to the class center.

Figure 4.6  The variation decorrelation loss disentangles different sub-embeddings by associating them with different variations. In this example, the first two sub-embeddings are forced to be invariant to occlusion while the second two sub-embeddings are forced to be invariant to blur. By pushing stronger invariance for each variation, the correlation/overlap between two variations is reduced.

Figure 4.7  Testing results on synthetic data of different variations from the IJB-A benchmark (TAR@FAR=0.01%). Different rows correspond to different augmentation strategies during training. Columns are different synthetic testing data. The performance of the proposed method improves monotonically as more augmentations are added.

Figure 4.8  t-SNE visualization of the features in a 2D space. Colors indicate the identities. Original training samples and augmented training samples are shown as circles and triangles, respectively.

Figure 4.9  Performance change with respect to different choices of K.

Figure 4.10  Heatmap visualization of sub-embedding uncertainty on different types of images from the IJB-C dataset, shown on the right of each face image. 16 values are arranged in a 4×4 grid (no spatial meaning). Brighter color indicates higher uncertainty.

Figure 5.1  Illustration of the problem settings in our work. Blue circles indicate the domains that the face images belong to. By utilizing diverse unlabeled images, we want to regularize the learning of the face embedding for more unconstrained face recognition scenarios.

Figure 5.2  Overview of the training framework of the embedding network. In each mini-batch, a random subset of the labeled data is augmented by the augmentation network to introduce additional diversity. The non-augmented labeled data are used to train the feature discriminator. The adversarial loss forces the distribution of the unlabeled features to align with the labeled one.

Figure 5.3  t-SNE visualization of the face embeddings using synthesized unlabeled images. Using part of MS-Celeb-1M as the unlabeled dataset, we create three sub-domains by processing the images with either random Gaussian noise, random occlusion or downsampling. (a) Different sub-domains show different domain shifts in the embedding space of the supervised baseline. (b) With the holistic binary domain adversarial loss, each of the sub-domains is aligned with the distribution of the labeled data.

Figure 5.4  Training framework of the augmentation network. The two pipelines are optimized jointly during training.

Figure 5.5  Example generated images of the augmentation network.

Figure 5.6  Ablation study of the augmentation network. Input images are shown in the first column. The subsequent columns show the results of different models trained without a certain module or loss. The texture style codes are randomly sampled from the normal distribution.

Figure 5.7  Evaluation results on IJB-C and IJB-S with different protocols and different numbers of labeled training data.

Figure 5.8  Evaluation results on IJB-S, IJB-C and LFW with different protocols and different numbers and choices of unlabeled training data. The red line here refers to the performance of the supervised baseline, which does not use any unlabeled data.

KEY TO ABBREVIATIONS

FR    Face Recognition
SOTA  State-of-the-art
ROC   Receiver Operating Characteristic
TAR   True Acceptance Rate
FAR   False Acceptance Rate
CMC   Cumulative Match Characteristic
DIR   Detection & Identification Rate
PCA   Principal Component Analysis
LDA   Linear Discriminant Analysis
DNN   Deep Neural Network
CNN   Convolutional Neural Network
PFE   Probabilistic Face Embedding
GAN   Generative Adversarial Network

Chapter 1

Introduction

Face recognition is a classic yet still active problem in the field of computer vision and pattern recognition. The general goal of Automatic Face Recognition (AFR) is to let the machine identify a person from his/her photos. Such a process can involve a set of typical challenges in computer vision problems: occlusion, illumination, out-of-plane rotation (pose change) and low image quality. On the other hand, AFR technology has a wide range of applications in forensics, access control, mobile payment, surveillance, etc., making it one of the most active research topics in the field of pattern recognition. In this chapter, we first review the applications of AFR systems and their development history, and then explain the pipeline of modern AFR systems and the challenges they face. Based on these challenges, we introduce our proposed methods and contributions.

1.1 Applications of Automatic Face Recognition

The applications of face recognition can be classified into two types: face verification and face identification. A face verification system, also known as 1:1 comparison, needs to decide whether two given face images (or collections) belong to the same person, while a face identification (1:N comparison) system needs to identify (search for) a probe image in a gallery set of N images. A more detailed explanation of face verification and face identification can be found in Section 1.5. Here, we list a few representative applications of face verification and identification.

Figure 1.1  Example applications of face recognition: (a) airport security [9]; (b) iPhone X Face ID [10]; (c) identification for payment [11]; (d) surveillance [12].
1.1.1 Security

In many security-sensitive scenarios, we need to verify a person's identity for safety reasons. For example, as shown in Figure 1.1, many AFR systems have been deployed at airports worldwide to check the identity of a passport holder [13]. With more and more passengers taking international trips each year, such a system can significantly increase the efficiency of border control and reduce the burden on staff at airports. Similarly, many immigration checkpoints have also adopted face recognition systems to accelerate the passenger verification process.

1.1.2 Access Control

Another application of face verification is to check the access permission to certain buildings, devices or files stored in a computer. In corporate buildings, AFR systems are used to replace traditional locks to control the entry gate. Starting with the iPhone X, face recognition has been deployed as an alternative to the PIN to secure the unlocking process (see Figure 1.1(b)). Compared to passwords and fingerprints, face recognition is more convenient to use since it only requires the user to look at the phone.

1.1.3 Identification

Besides face verification, another type of face recognition application needs to identify a person from a large set of known people. For example, if the police have a photo of a criminal, they can use it to retrieve similar faces from a mugshot database to figure out potential identities of the criminal. Child trafficking is a severe problem in many developing countries, where face identification can also be used to detect whether a child has been reported as lost [14]. Besides security applications, face identification can also be applied to mobile payments (see Figure 1.1(c)).

1.1.4 Surveillance

A special type of face identification application is associated with surveillance cameras. These surveillance cameras play a key role in the management of mega-cities across the world. By 2019, there were an estimated 770 million surveillance cameras installed around the world [15].
However, effectively utilizing these surveillance videos is not a simple task, since a large amount of human labor would be needed to monitor or review them. In contrast, a robust face detection and identification algorithm can go through a massive number of videos to localize potential criminals and operate 24 hours a day, seven days a week.

1.2 The Development of Automatic Face Recognition

1.2.1 Traditional Solutions

Since the first study on AFR by Takeo Kanade [16] in the 1970s, the technology of automatic face recognition has evolved drastically. Many different approaches have been explored to represent and compute the similarity between the two face images under consideration. In the early stages, methods that explicitly model the geometric shape or texture of faces were used to represent the face images. For example, Active Shape Models [17] represent a face by the coordinates of facial landmarks. However, such methods are limited in terms of representation power and are sensitive to the variations that can appear in face images, such as pose, illumination, and expression (PIE). Subspace-based representations, such as EigenFace [18] and FisherFace [19], were proposed to model face images by a set of basis components in a linear subspace. Each image is represented by the coefficients of the bases, which can be further used for recognition tasks. Later, manually designed visual descriptors, such as the Scale-Invariant Feature Transform (SIFT) [20] and Local Binary Patterns (LBP) [21], became popular in computer vision tasks. These features are shown to be effective on face recognition tasks as well [21] and achieve even better performance when combined with data-driven methods [22], such as Linear Discriminant Analysis (LDA) [19] and Joint Bayes [23].
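To make the descriptor-based approach concrete, below is a minimal NumPy sketch of the basic 8-neighbor LBP operator and the histogram descriptor built from it. The function names are illustrative; practical face recognition systems typically use uniform patterns and concatenate block-wise histograms rather than a single global one.

```python
import numpy as np

def lbp_codes(gray):
    """Basic 8-neighbor LBP: threshold each pixel's 8 neighbors against
    the center value and pack the results into an 8-bit code."""
    h, w = gray.shape
    center = gray[1:-1, 1:-1].astype(np.int64)
    codes = np.zeros((h - 2, w - 2), dtype=np.int64)
    # (dy, dx) offsets of the 8 neighbors, enumerated clockwise
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx].astype(np.int64)
        codes |= (neighbor >= center).astype(np.int64) << bit
    return codes

def lbp_descriptor(gray):
    """L1-normalized histogram of the 256 possible LBP codes,
    used as a texture descriptor of the (block of the) face image."""
    hist, _ = np.histogram(lbp_codes(gray), bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)
```

In the face recognition setting of [21], the aligned face is divided into a grid of blocks and the per-block histograms are concatenated into the final representation, which preserves coarse spatial layout.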
1.2.2 Deep Face Recognition

In recent years, due to the advent of large-scale web data and efficient parallel computing devices, i.e., GPUs, Convolutional Neural Networks (CNNs), as a purely data-driven method, have replaced traditional methods and achieved impressive performance on a wide range of computer vision tasks, such as image classification [24], detection [25] and segmentation [26]. A series of CNN-based face recognition algorithms have been proposed since 2014, including DeepFace [27], the DeepID series [28, 29] and FaceNet [30]. These algorithms not only outperform traditional methods by a large margin, but also beat human beings on face verification tasks [28, 31]. Compared to LBP and LDA, CNNs are able to learn a much more complicated non-linear mapping function to serve as the feature extractor, which is more discriminative and less sensitive to different facial variations. After the early work on CNN-based face recognition, subsequent studies have explored different loss functions to improve the discrimination power of the feature representation. Wen et al. [32] proposed the center loss to reduce intra-class variation. A series of works have also proposed to use metric learning for face recognition [30, 33]. Recent efforts have attempted to achieve discriminative embeddings with a single identification loss function where proxy or prototype vectors are used to represent each class in the embedding space [34, 35, 36, 37, 38].
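As an illustration of the prototype-based identification losses just mentioned, here is a minimal NumPy sketch of a normalized softmax with an additive angular margin on the ground-truth class, in the spirit of the margin-based losses cited above. The scale s and margin m values are illustrative, and a real implementation would be written in an autodiff framework with the prototypes as learnable weights.

```python
import numpy as np

def margin_softmax_loss(features, prototypes, labels, s=64.0, m=0.5):
    """Normalized-softmax loss with an additive angular margin.
    features:   (batch, d) face embeddings
    prototypes: (classes, d) one proxy/prototype vector per identity
    labels:     (batch,) ground-truth identity indices
    """
    # cosine similarity between each embedding and each class prototype
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cos = f @ w.T                                      # (batch, classes)
    theta = np.arccos(np.clip(cos, -1 + 1e-7, 1 - 1e-7))
    # add the margin m only to the angle of the ground-truth class,
    # which makes the target logit harder to satisfy
    rows = np.arange(len(labels))
    logits = cos.copy()
    logits[rows, labels] = np.cos(theta[rows, labels] + m)
    logits *= s
    # standard cross-entropy over the margin-adjusted, scaled logits
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

Because the margin shrinks the ground-truth logit, the loss for a given batch is never smaller with m > 0 than with m = 0, which is what pushes the embeddings toward larger angular separation between classes.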
1.2.3 From Constrained to Unconstrained Face Recognition

Along with the boost in AFR algorithms, the evaluation datasets and protocols for AFR systems have been updated frequently. In the early studies, the input face photos to be recognized were mostly captured under constrained settings, where there is a limited amount of variation in terms of pose, illumination and expression (PIE). For example, the Yale-B face dataset [39], released in 2000, only consists of gray-scale frontal faces with different illumination. In 2007, the Labeled Faces in the Wild (LFW) [3] dataset was released. As the first benchmark composed of face images captured in the wild (not under controlled settings), LFW became a major challenge to AFR systems before the deep learning era (see Figure 1.3). While traditional approaches [22] achieved 95.17% accuracy on LFW under the standard verification protocol, CNNs that were trained on large-scale datasets quickly saturated the benchmark with performance higher than 99.00% [31, 30]. Since then, more unconstrained benchmarks have been released to evaluate the performance of FR algorithms. For example, NIST released three benchmarks developed under the IARPA Janus program [40], namely IJB-A [5], IJB-B [41], and IJB-C [42], that are composed of a mixed set of celebrity photos along with video frames. Since the faces in these images were manually cropped by humans rather than by off-the-shelf face detectors, the faces in these datasets can include arbitrary PIE variations. Further, instead of closed-set recognition, the models today are required to perform recognition in an open-set setting, i.e., the test (query) subjects may not be present in the database of known subjects (gallery), which makes the tasks even more difficult. In spite of these challenges, deep neural networks quickly saturated even these benchmarks by learning from larger and larger training datasets and newer models. Recently, researchers have been focusing on more challenging cases, such as surveillance face recognition [1] and low-resolution faces in the wild [6], which represent a more realistic setting in real-world applications. More details on these datasets can be found in Section 1.5.
Figure 1.2  The pipeline of automatic face recognition systems. Here, we assume the face images are already detected and hence omit the detection step.

1.3 Pipeline of Automatic Face Recognition

As shown in Figure 1.2, the pipeline of AFR typically includes three steps: normalization, feature extraction and comparison. Here, we assume the faces have already been detected and we do not discuss the detection step.

Normalization  In this step, spatial transformations are conducted to reduce the facial variations before sending the input image to the feature extraction module. Different methods can be used to reduce such variations. The simplest solution is to use the location of the bounding box or landmarks to crop a canonical view of the input face [30, 43]. Some use more complicated 3D models to frontalize the face to further reduce the variation [27]. The most common solution is to detect 5 landmarks (eyes, nose and mouth corners) and apply a similarity transformation [32, 34].

Feature Extraction  In this step, either a manually designed or a learned representation is used to extract discriminative features from the faces. Both the LBP descriptor and the LDA method mentioned in the last section belong to this step. Today, almost all ongoing research uses a CNN as the feature extractor, which maps an RGB image to a fixed-length feature vector. The CNN is first trained on a large-scale web-crawled database, with millions of face images covering hundreds of thousands of identities [44, 2], and then the output vectors of its hidden layers are used as the extracted features to represent the faces. The network is trained either with metric learning loss functions [30] or with classification tasks [27].

Similarity Metric  The choice of similarity metric mainly depends on the representation. For example, for histogram-based features, such as LBP, the χ² measure is used to compute the distance.
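As a concrete instance of this histogram comparison, a minimal sketch follows. Conventions for the χ² distance vary slightly across papers; the 1/2 factor shown here is one common choice, and the small epsilon guards against empty bins.

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two L1-normalized histograms:
    0.5 * sum_i (h1_i - h2_i)^2 / (h1_i + h2_i + eps).
    Identical histograms give 0; larger values mean less similar faces."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))
```

Unlike the Euclidean distance, the per-bin normalization emphasizes differences in sparsely populated bins, which suits descriptors such as LBP whose histograms are highly non-uniform.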
Assuming the features are generated by a Gaussian distribution, Chen et al. [23] proposed to use a joint formulation of concatenated feature vectors to compute the facial similarity, which has also been shown to be effective on deep representations [45]. The most widely adopted similarity metric for deep face representations is cosine similarity. This is mainly because the hidden features learned by neural networks with a classification loss are distributed in a radial way [46, 34].

1.4 Challenges in Unconstrained Face Recognition

The major challenges of unconstrained face recognition, compared with constrained settings, lie in the large facial variations, including pose, illumination, expression, low resolution, occlusion, etc. Although illumination and expression were considered challenging for manually designed features, deep representations that are trained on large-scale web datasets turn out to be relatively invariant to such variations [29]. Thus, they are no longer a focus in recent studies. Different methods have been proposed to learn pose-invariant deep face representations. Some use pose labels during training to learn a pose-disentangled deep feature [47, 48] while others have utilized 3D models to build deep models that can frontalize the face images before feature extraction [49]. On the other hand, state-of-the-art (SOTA) face recognition models [38] that are trained on generic face datasets with deeper models and margin-based loss functions have also been shown to perform well on cross-pose face verification tasks [4]. Compared with other types of variations, low resolution and occlusion are more difficult because they imply the loss of information in the input faces. While deep representations have achieved human-level performance on constrained face photos, many evaluation benchmarks have been released to evaluate deep face recognition models on surveillance and web videos [1, 6], where low resolution and occlusion are the main challenges.
In a complete AFR system, the individual impact of the aforementioned facial variations depends on the choice of different modules in the FR pipeline. Here we briefly introduce the connection between different facial variations and the modules in an AFR pipeline, which further motivates our proposed methods in Chapters 2-5:

• The face normalization step is mainly correlated with cross-pose face recognition performance. A more complicated normalization method, e.g. 3D modeling, could significantly alleviate the pose variation problem. However, under unconstrained settings, given a low-quality image, the 3D model might not be able to accurately reconstruct the structure of the input face. In Chapter 2, we introduce an attention-based learning framework that can exploit local features from faces of different poses to handle this issue.

• In Chapter 3, we show that the common choice of similarity metric, i.e. cosine similarity between embedded vectors, can suffer from facial variations that cause information loss, such as low resolution and occlusion. We propose an uncertainty-based representation to solve this problem.

• The representation learning step is related to all kinds of variations. The performance differs depending on what kind of dataset and loss function we choose to train the model with. Thus, a trade-off often exists where the performance degrades on a certain type of data when we fit our model to handle other types of variations. In Chapter 4, we propose a universal representation learning framework that is able to simultaneously improve feature discrimination power under different variations. In Chapter 5, we further propose a semi-supervised representation learning framework that utilizes an auxiliary unlabeled dataset to augment the labeled training data and improve the generalizability of the face embeddings.

1.5 Evaluation Metrics and Datasets

1.5.1 Evaluation

The tasks of face recognition can be categorized into two types: face verification and face identification (or search).

In face verification, also known as 1:1 comparison, the system is required to determine whether a pair of face images belongs to the same subject by applying a threshold to the similarity score.
Two types of metrics are used here for evaluating face verification protocols. The first metric is the accuracy:

$$\mathrm{Accuracy}(N, T) = \frac{\text{Number of correct comparisons at threshold } T}{\text{Number of all possible pairs, } N}. \quad (1.1)$$

This metric is usually used for protocols with a balanced number of positive and negative pairs, such as in LFW [3] and CFP [4]. The threshold $T$ is usually determined under a cross-validation protocol. The second metric, which is closer to a real-world verification requirement, is to evaluate the True Accept Rate of input pairs at a fixed False Accept Rate. Formally, this metric is defined as:

$$\mathrm{TAR}(N_p, T) = \frac{\text{Number of accepted genuine pairs at threshold } T}{\text{Number of all genuine pairs, } N_p}, \quad (1.2)$$

$$\mathrm{FAR}(N_n, T) = \frac{\text{Number of accepted impostor pairs at threshold } T}{\text{Number of all impostor pairs, } N_n}. \quad (1.3)$$

To achieve a lower FAR, one needs to raise the threshold $T$, which causes a lower TAR as well. Therefore, we can evaluate the performance by choosing the threshold based on a desired FAR value.

In face identification, also known as 1:N comparison, the system is given a gallery set with images of known identities. Then, given a probe face image, the system needs to determine which person in the gallery the input face belongs to. In particular, depending on whether there exist non-mate probes (whose corresponding subject is not in the gallery), the identification protocol can be further categorized into closed-set identification and open-set identification. In open-set identification, the system needs to first determine whether the input face identity is in the gallery before trying to identify him/her. For closed-set identification, the rank retrieval rate is used to evaluate the performance, as shown below:

$$\mathrm{RetrievalRate}(N, K) = \frac{\text{Number of successfully retrieved probes within top-}K\text{ returns}}{\text{Number of all probes, } N}, \quad (1.4)$$

and the True Positive Identification Rate (TPIR) at a given False Positive Identification Rate (FPIR) is used to report the performance on open-set face recognition benchmarks:

$$\mathrm{TPIR}(N, T) = \frac{\text{Number of retrieved mate probes with score above threshold } T}{\text{Number of all mate probes, } N}, \quad (1.5)$$

$$\mathrm{FPIR}(N, T) = \frac{\text{Number of non-mate probes with score above threshold } T}{\text{Number of all non-mate probes, } N}. \quad (1.6)$$

Similar to TAR@FAR for verification, the threshold $T$
here is chosen depending on the FPIR. For probes with a mate in the gallery, a probe's mate is considered to be successfully retrieved only if it is returned as the top-1 result.

1.5.2 Datasets

Here, we briefly introduce all the datasets that we use for training and evaluating the AFR systems. Since all the models in our work are based on deep neural networks, which require a large number of images to fit the parameters, we use two public web-crawled datasets for training:

CASIA-Webface [50] contains about 0.5M high-quality celebrity photos of 10,575 subjects. All the face images are collected from the internet by searching celebrity names.

MS-Celeb-1M [2] contains 8M face photos of about 85K subjects. The images are collected in a similar way as CASIA-Webface. However, the original MS-Celeb-1M contains a large number of mislabeled images. Therefore, a cleaned version is usually used instead of the original one. In this thesis, we use a publicly available clean list¹ and the clean list from ArcFace [38] for the experiments.

We show some example images from MS-Celeb-1M in Figure 1.3 (a). The images in CASIA-Webface are similar to them. For evaluation, we consider 8 different benchmarks, whose images present different types and degrees of facial variations:

LFW [3] contains 13,233 near-frontal and high-quality face photos of 5,749 subjects. The verification protocol used in this thesis includes 6,000 face pairs.

¹ https://github.com/inlmouse/MS-Celeb-1M_WashList.

Figure 1.3 Example images of representative datasets: (a) CASIA-WebFace (2014), (b) MS-Celeb-1M (2016), (c) LFW (2007), (d) YTF (2011), (e) MegaFace (2016), (f) CFP (2016), (h) IJB-A (2015), (i) IJB-S (2018), (j) TinyFace (2019). The images are sampled from MS-Celeb-1M [2], LFW [3], CFP [4], IJB-A [5], IJB-S [1] and TinyFace [6], respectively.

YTF [51] contains 3,425 videos of 1,595 subjects. The verification protocol used in this thesis includes 5,000 video pairs.

MegaFace [52] contains 1M face images from Flickr as distractors. The FaceScrub dataset is used as the probe set in our experiments, which contains 3,530 high-quality face images of 80 subjects.
CFP [4] contains 7,000 frontal/profile face photos of 500 subjects. We only test on the frontal-profile (FP) protocol, which includes 7,000 pairs of frontal-profile faces.

IJB-A [5] is a template-based benchmark, containing 25,813 face images of 500 subjects. Each template includes a set of still photos or video frames. Compared with previous benchmarks, the faces in IJB-A have larger variations and present a more unconstrained scenario.

IJB-C [42] is an extension of IJB-A with 140,740 face images of 3,531 subjects. The verification protocol of IJB-C includes more impostor pairs so that we can compute True Accept Rates (TAR) at lower False Accept Rates (FAR).

IJB-S [1] is a surveillance video benchmark containing 350 surveillance videos spanning 30 hours in total, 5,656 enrollment images, and 202 enrollment videos of 202 subjects. Many faces in this dataset are of extreme pose or low quality, making it one of the most challenging face recognition benchmarks.

TinyFace [6] is a dataset for evaluating face recognition models on low-resolution face images. The dataset contains 5,139 labeled facial identities covered by 169,403 natural low-resolution face images. The closed-set identification rate is used to evaluate systems on this benchmark.

Example images from some of these datasets are shown in Figure 1.3.

1.6 Dissertation Contributions

The main contributions of this dissertation are as follows:

• A spatial transformer-based attention module that automatically detects salient facial regions to extract local features. The attention module can be trained without labels.

• A framework that efficiently combines multiple region attention modules to extract local features and incorporates them into the global facial representation. Experimental results on unconstrained face databases show that the method can effectively boost the performance of a base face matcher when more salient regions are combined.
• A new type of face representation that takes feature uncertainty into account. Given a pre-trained deterministic deep face embedding, the proposed method can convert it into a probabilistic face embedding (PFE) by representing each face image as a distribution in the latent space. The probabilistic embedding adds interpretability to deep face representations and can be used as a quality assessment method to control the enrollment of face images.

• A probabilistic method that effectively utilizes data uncertainty to combine and compare different probabilistic face embeddings.

• A universal feature learning framework that learns a set of sub-embeddings to tackle different variations in unconstrained face recognition. A confidence-controlled face identification loss and a variation-based decoupling loss are proposed to regularize the facial features to handle multiple variations. Experiments show that the proposed method can incrementally enhance the feature representations when more types of variations are introduced into the training data. Combining decoupled sub-embeddings with PFE leads to SOTA performance on several challenging face recognition benchmarks.

• A semi-supervised feature learning framework that incorporates an auxiliary unlabeled dataset into the training of deep face embeddings. A generator is trained to automatically discover the latent styles in the unlabeled dataset so that it can be used to augment the labeled dataset. Then, we can jointly regularize the embedding model from both the image space and the feature space to improve its generalizability.

1.7 Thesis Structure

Ch. 2 of this thesis presents a framework for enhancing global face features with local information.
Spatial transformers are used as attention modules to automatically localize salient facial regions and extract local features, which are then fused into the holistic features. Ch. 3 presents a new type of face representation, namely Probabilistic Face Embeddings (PFEs). PFEs incorporate data uncertainty into face representations and are able to improve face recognition performance by taking uncertainty into account during template fusion and template comparison. In Ch. 4, we propose a framework to learn a universal face representation. Different types of data augmentation are combined to mimic a setting where one has access to a large training dataset of unconstrained faces, and new loss functions are proposed to learn decoupled features from difficult training samples. Ch. 5 further studies the possibility of using an unlabeled dataset to augment a labeled training set in terms of diversity, where we show improved model generalizability to unconstrained faces. The last chapter presents the conclusions of this dissertation and directions for future work. The experimental results of the work in this thesis were previously presented in [53, 54, 55, 56].

Chapter 2

Learning Local Face Features with Visual Attention

2.1 Introduction

As shown in Figure 1.2, most AFR systems adopt a normalization step for pre-processing to ensure that the input faces are in a similar position and orientation, reducing the intra-class variations and making the recognition task easier [27, 31, 50, 30, 57]. However, as the complexity of unconstrained face images increases, 2D face images, even though aligned, can still appear very different, as shown in Figure 2.1. As such, constructing global face models becomes a very difficult task. Because of this difficulty, an attractive idea is to model different facial parts individually and combine them to generate a global representation. Recognizing complex objects by their parts is a popular technique in pattern recognition. In the well-known Deformable Part Model (DPM) [58], different part filters are learned and combined with a root filter to detect complex objects in images efficiently.
Similar ideas, such as decomposing faces into different parts, have been shown to work well for face detection [59, 60, 61]. A highly successful, parts-based face recognition approach, called the DeepID series [28, 29, 31], cropped a large number of different local patches either at fixed positions or around landmarks in the face image, trained a single deep convolutional network on each of these regions, and fused the representations from all the networks by training on a validation dataset.

Figure 2.1 Example images in LFW and IJB-B after alignment using MTCNN [7]. The images in the first row are well aligned and all the facial parts are located in a consistent way. The face images in the second and third rows, although aligned, still appear quite different because of large pose variations or occlusion.

The success of works like DeepID indicates that although the face is a nearly rigid object, building models for different face regions can also help improve the performance of face recognition systems.

One of the most important problems in parts-based face recognition approaches is the localization of the target parts. In other words, although the faces are aligned, the parts of a face shown in a fixed region can be quite different for different people at different poses, which reduces the discrimination ability of these parts-based models. One approach to solving this problem is to use the detected landmarks to crop rectangular patches around those respective landmarks. However, even with these landmarks, it is still difficult to decide what regions we should crop, since some regions may be useful for recognition and others may not. Given this difficulty, we turn to another technique, which finds and localizes discriminative regions automatically and has become popular in the vision community, i.e. the visual attention mechanism [62, 63, 64, 65].
By using a differentiable visual attention network, we can build an end-to-end system where the global recognition network and several parts-based networks are trained simultaneously. In this proposed end-to-end system, a fully connected layer for fusing features can be trained together with the recognition networks, which helps the sub-networks to explore discriminative features complementary to the global representation. In addition, the visual attention network learns to localize distinct local regions automatically without any landmark supervision. Our experiments show that the proposed approach can further improve state-of-the-art networks on challenging benchmarks such as IJB-A and IJB-B. More concisely, the contributions of this chapter can be summarized as follows:

• We designed an end-to-end face recognition system including a global network, parts-based networks, an attention network and a fusion layer, all trained simultaneously.

• We showed that discriminative regions can be localized automatically, without using facial landmarks, by a visual attention network.

• We showed that adding parts-based networks can further improve the performance of state-of-the-art deep networks on challenging protocols, including BLUFR, IJB-A and IJB-B, with little increase in complexity.

2.2 Related Work

2.2.1 Parts-based Deep Face Recognition

Our proposed approach is predominantly inspired by the success of the DeepID series [28, 29, 31].
In their first work [28], ten different regions were cropped from a face image (five large regions at fixed positions and five small regions around detected landmarks). For each region, RGB and gray-scale patches of five different scales were generated, and each was used to train a single convolutional neural network that outputs a feature vector of 160 dimensions. The features were then concatenated and the dimensionality was reduced with additional training on a validation set. In DeepID2, 400 patches at different positions, scales, color channels and horizontal flippings were cropped and used for training 200 different networks. After feature selection, 25 patches were selected to extract a 4,000-dimensional feature vector, which was finally reduced to a 180-dimensional vector with PCA. The authors showed that combining these features from different regions substantially improved the face recognition performance.

Figure 2.2 An example architecture of the proposed end-to-end network with K = 2 sub-networks. A 96 × 112 image is first fed into the base-network, which is a single CNN for face recognition. The feature map of the last convolutional layer of the base-network is then used both to learn a global representation with a fully connected layer, and to predict K transformation matrices with an attention network of two stacked fully-connected layers. The regions of interest are sampled into patches of size 48 × 48. K smaller CNNs follow as sub-networks to learn local features from these automatically localized patches. All the global and local features are then concatenated and fused by another fully connected layer.

2.2.2 Visual Attention Network

Visual attention is a mechanism to automatically localize objects of interest in an image, or parts of an object. Ba et al. [62] used a recurrent attention model to locate the objects in order to better perform multi-object classification. A similar scheme was used in [63] to generate captions for images.
Xiao et al. [66] proposed to use visual attention proposals for fine-grained object classification by clustering the channels of a feature map into different groups and generating patches based on the activation of individual groups. In [64], a recurrent structure of a CNN and an attention proposal network is proposed to zoom into small regions for fine-grained classification. The input of the attention network is the feature map of the last convolutional layer rather than raw images, so that the computational cost can be reduced. We adopt a similar strategy in our network. Only two levels of CNNs are used in our approach, but more than one patch is generated by the attention network. In addition, we use Spatial Transformers [65], which use a projective transformation matrix θ to transform the original input image, enabling us to better sample patches. By multiplying θ and the coordinates of pixels in the output image, the spatial transformer computes the corresponding coordinates of each pixel in the input image, and samples them through bilinear interpolation. This transformer is differentiable, allowing the attention network to be learned end-to-end without labels. In [65], experiments showed that the spatial transformer network is able to automatically localize distorted digits and street-view house numbers. Subsequently, the performance of fine-grained classification was improved by generating multiple region proposals. Finally, Zhong et al. [67] showed that by training an attention network with spatial transformers, an end-to-end face recognition network which automatically learns the alignment can achieve comparable results to those with pre-aligned images.

2.3 Approach

In this section, we outline an end-to-end network which includes a base-network for learning a global representation from the whole face image, several sub-networks for modeling specific facial parts, an attention network for generating region proposals to feed into the sub-networks, and a fusion layer to fuse the global and local features.
2.3.1 Overall Architecture

A graphical illustration of the overall architecture is shown in Figure 2.2. The input image size is 96 × 112. The proposed network begins with a base-network, which can be any single convolutional neural network for face recognition. In particular, we employ the Face-ResNet proposed in [68] because of its good generalization ability and its state-of-the-art performance. In order to reduce the computational cost of the attention network, we adopt a similar approach as [64], where the attention network is connected to the last hidden convolutional layer rather than the input image.

Table 2.1 The architecture of the attention network.

Type                            Output Size
BatchNorm + Fully Connected     128
BatchNorm + Fully Connected     8K

The attention network outputs K projective transformation matrices θ, each of which has 8 parameters. Here, K is a hyperparameter. For each of the K transformation matrices, a spatial transformer is used to sample a 48 × 48 patch from the region of interest via bilinear interpolation. The sampled patch is then used by a smaller sub-network to learn local features. The global representation is of 512 dimensions, while the length of each local feature vector is 128 dimensions. All of them are concatenated together and fused by a fully connected layer to generate a 512-dimensional representation.

A softmax layer is added to both the global representation and the fused representation for classification in the training phase. Notice that the gradient is not propagated back through the fusion layer to the global representation. This allows the base-network to be trained independently, and it encourages the sub-networks to explore new features complementary to the global representation. Experimental results show that such an approach enables the model to converge faster and leads to better generalizability. The softmax loss mainly learns to scatter the features of different classes, which corresponds to the inter-class dissimilarity. Therefore, in order to reduce the intra-class variation, we also adopt the center loss proposed in [32] with the recommended setting of α = 0.5 and λ = 0.003. The center loss is applied to both the global representation and the fused representation.
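For reference, the center-loss idea used above can be sketched in a few lines of numpy: the loss is the mean squared distance between each feature and its class center, and the centers are moved toward the batch means with rate α. This is a hedged illustration of the idea from [32], not the TensorFlow implementation used in the experiments; the feature values below are toy data.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Center loss: mean squared distance between each feature and the
    center of its class, pulling intra-class features together."""
    diffs = features - centers[labels]
    return 0.5 * float(np.mean(np.sum(diffs ** 2, axis=1)))

def update_centers(features, labels, centers, alpha=0.5):
    """Move each class center toward the mean of its batch features with rate alpha."""
    new_centers = centers.copy()
    for c in np.unique(labels):
        batch_mean = features[labels == c].mean(axis=0)
        new_centers[c] += alpha * (batch_mean - new_centers[c])
    return new_centers

# Toy batch: two classes in a 2-D feature space, centers initialized at zero.
feats = np.array([[1.0, 0.0], [0.8, 0.2], [-1.0, 0.0]])
labels = np.array([0, 0, 1])
centers = np.zeros((2, 2))
loss_before = center_loss(feats, labels, centers)
centers = update_centers(feats, labels, centers)
print(center_loss(feats, labels, centers))  # shrinks as centers approach class means
```

In training, this loss is added to the softmax loss with weight λ, so the softmax separates classes while the center term compacts them.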
2.3.2 Attention Network

Details about the attention network are shown in Table 2.1. Because the input to this network is the feature map of the last convolutional layer of the base-network, which contains rich semantic information, the attention network is composed of only two fully-connected layers, saving a large amount of computational resources. We add a batch normalization layer [69] along with a ReLU activation layer [70] both before and after the first fully-connected layer to accelerate the training of the attention network. The second fully connected layer outputs the K transformation matrices. Then a spatial transformer module is used to sample the corresponding partial regions according to each of these matrices.

Table 2.2 The architecture of the sub-networks.

Type              Output Size       Filter Size / Stride
Convolution       48 × 48 × 32      3 × 3 / 1
Convolution       48 × 48 × 64      3 × 3 / 1
Max Pooling       24 × 24 × 64      2 × 2 / 2
Convolution       24 × 24 × 64      3 × 3 / 1
Convolution       24 × 24 × 128     3 × 3 / 1
Max Pooling       12 × 12 × 128     2 × 2 / 2
Convolution       12 × 12 × 96      3 × 3 / 1
Convolution       12 × 12 × 192     3 × 3 / 1
Max Pooling       6 × 6 × 192       2 × 2 / 2
Convolution       6 × 6 × 128       3 × 3 / 1
Convolution       6 × 6 × 256       3 × 3 / 1
Fully Connected   128               -

Finally, there are several implementation subtleties to note.

First, because we are using a projective transformation, the sampled region is not restricted to a rectangular shape. This means that the original image could be warped. However, Zhong et al. [67] showed that a better performance can be achieved with a projective transformation than with a similarity transformation for face alignment. One plausible explanation is that neural networks do not perceive images in the same way as humans do. As such, the networks are able to learn good features even from warped images.

Second, we multiply the learning rate of the attention network by 0.0001. Without this scaling, the output transformation deviates too much before the network is able to learn a set of reasonable parameters.

Third, the weights of the last fully connected layer are initialized as zeros, while its biases are initialized as the flattened vector of the initial K transformation matrices. In the experiments, we use manual initialization for these matrices if K is small and random initialization if K is large.
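The spatial-transformer sampling step described in Section 2.2.2 and used here (a 3 × 3 projective matrix θ maps each output pixel's normalized coordinates to input coordinates, which are read out with bilinear interpolation) can be sketched in plain numpy. This is an illustration of the sampling mechanics only, not the differentiable TensorFlow implementation used in the experiments.

```python
import numpy as np

def projective_sample(image, theta, out_h, out_w):
    """Sample an out_h x out_w patch from a grayscale image with a 3x3
    projective matrix theta: for every output pixel, theta maps its
    normalized [-1, 1] coordinates to input coordinates, which are then
    read with bilinear interpolation."""
    H, W = image.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    mapped = theta @ coords                      # homogeneous input coords
    x = mapped[0] / mapped[2]                    # perspective divide
    y = mapped[1] / mapped[2]
    x = (x + 1) * (W - 1) / 2                    # to pixel coordinates
    y = (y + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    patch = (image[y0, x0] * (1 - wx) * (1 - wy)
             + image[y0, x0 + 1] * wx * (1 - wy)
             + image[y0 + 1, x0] * (1 - wx) * wy
             + image[y0 + 1, x0 + 1] * wx * wy)
    return patch.reshape(out_h, out_w)

# The identity transform reproduces the image exactly.
img = np.arange(16.0).reshape(4, 4)
out = projective_sample(img, np.eye(3), 4, 4)
```

Because bilinear interpolation is piecewise differentiable in the sampled coordinates, gradients can flow through this sampling back into the network that predicts θ, which is what lets the attention network train end-to-end without landmark labels.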
2.3.3 Sub-Network for Modeling Facial Parts

Since the information in a local region is relatively limited, it would be unnecessarily complex to use a network with as many parameters as the base-network to learn representations from these patches. As such, we use a simple architecture for all the sub-networks, as shown in Table 2.2. It is very similar to the network used in [50] except that it uses fewer layers. We add a fully connected layer at the end of the sub-network to learn a compressed local feature vector. Finally, we add batch normalization along with a ReLU layer after every convolutional and fully connected layer. Because the sub-networks take a smaller input and have fewer parameters compared with the base-network, they add only a little extra run-time to the whole model, as shown in Section 2.4.1.

2.3.4 Promoting Sub-networks for Feature Exploration

Although, theoretically, the larger the number of sub-networks, the more complementary local features can be learned to improve the robustness of the fused representation, we find that the improvement in performance after adding a large number of sub-networks is usually negligible. An explanation for this is found in the magnitude of the weights in the fusion layer for each dimension of the concatenated feature. Figure 2.3 shows that many local features have very small weights in the fusion layer. This indicates that some sub-networks contribute little to the final fused representation. Additionally, this could diminish the loss propagated back to the base-networks and prevent the sub-networks from learning efficiently. As such, some sub-networks become inactive during training. Therefore, inspired by [71], we add a promotion loss to explicitly promote the weights in the fusion layer for those local features. Notice that in [71], the promoted parameters are those related to a certain output class; in our case, however, they are those related to a certain input dimension. In particular, let us denote an input feature vector as x = [x_g, x_l], where x_g is the global feature vector and x_l is the vector of all local features concatenated into one column. The fused representation y is obtained with a fully connected layer, y = Wx + b. Corresponding to x_g and x_l, W can be viewed as the concatenation of two matrices W_g and W_l
, where

$$y = W_g x_g + W_l x_l + b. \quad (2.1)$$

The goal of the proposed promotion loss $L_p$ is to encourage the local weights to be similar to the global weights:

$$L_p = \frac{1}{D_l}\sum_{i=1}^{D_l}\left(\lVert W_l^i \rVert_2 - \alpha\right)^2, \quad (2.2)$$

where

$$\alpha = \frac{1}{D_g}\sum_{i=1}^{D_g}\lVert W_g^i \rVert_2, \quad (2.3)$$

and $D_l$, $D_g$ refer to the number of dimensions in the local and global feature vectors, respectively. $W_l^i$ refers to the $i$-th column of $W_l$, and similarly for $W_g^i$. The promotion loss is added as a regularization loss with coefficient λ. As shown in Figure 2.3, after adding the promotion loss, the distribution of the weights in the fusion layer becomes much more uniform, thus avoiding the problem of inactive sub-networks and encouraging the sub-networks to find more discriminative features.

2.4 Experiments

2.4.1 Implementation Details

We conduct all of our experiments using TensorFlow 1.2. First, we implement the Face-ResNet in [68]. We follow the same settings for the learning rate and center loss. All the images are first aligned using landmarks detected by MTCNN [7], and the models are trained on the CASIA-Webface dataset [50]. The resulting network achieves a verification accuracy of 98.77% on the standard LFW protocol. This result is quite comparable to the performance originally reported in [68]; however, we do note a slight drop in performance (from 99.00% to 98.77%). The most plausible explanation is that we are using a different library for implementation. All the following experiments are compared to this baseline result.

Figure 2.3 Magnitude of the weights of the fusion layer over different input dimensions when using different λ for the promotion loss. Without the promotion loss, many dimensions have little weight, resulting in inactive sub-networks. Dropout helps to promote the weights, but diminishes the performance.

For the sub-networks, we adopt two schemes to initialize the transformation matrices θ:

• Model A: a small network with K = 3 rectangular regions initialized in the upper, middle and bottom face, respectively.

• Model B: a relatively larger network with K = 12 randomly initialized rectangular regions, whose widths and heights are between 30% and 60% of the original image.
The reason that we manually initialize Model A is that when K is rather small, the randomly initialized regions are not guaranteed to be distributed well. For example, they may have a large amount of overlap and only cover a small part of the entire face image. This would result in missing crucial information useful for recognition. Therefore, we manually choose three rectangular regions that cover different parts of the face for Model A.

We follow the same training settings as [68] with a batch size of 256 and 28,000 training steps. The promotion loss weight is set to λ = 10⁻⁵ based on the results of a grid search. We use two Nvidia Geforce GTX 1080 Ti GPUs to train Model A and four for Model B. As for time complexity, there is only a slight increase in run-time: for the base-network, Model A and Model B, it takes 0.003s, 0.003s and 0.004s per image to extract features with one GPU, respectively.

In order to evaluate the proposed method and implementation, we first study the effectiveness of the proposed modules using the LFW dataset with both the standard and BLUFR protocols [8]. Then we evaluate the proposed model on the more challenging IJB-A [5] and IJB-B [41] benchmarks. Because the purpose of this chapter is to present a system that improves any face recognition network instead of achieving the best result on these specific protocols, and since most results on the benchmarks are based on different architectures and training datasets, we believe it is not fair to compare absolute performances. Thus, we only compare the relative performance of the proposed system with the original base-network.

Figure 2.4 Example pairs that are misclassified by the base-network but are classified correctly on the LFW dataset. Pairs in the green box are genuine pairs and pairs in the red box are impostor pairs. We use the average threshold of BLUFR [8] face verification for VR@FAR = 0.1% on 10 splits.
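The promotion loss of Eqs. (2.2)-(2.3) translates directly into code. The sketch below is a numpy transcription, assuming the fusion weights are stored with one column per input dimension as in the notation above; the random matrices stand in for trained weights and are purely illustrative.

```python
import numpy as np

def promotion_loss(W_g, W_l):
    """Promotion loss of Eqs. (2.2)-(2.3): push the column norms of the
    local-feature weights W_l toward the average column norm of the
    global-feature weights W_g, so no sub-network's features are ignored.
    Columns index input feature dimensions."""
    alpha = np.mean(np.linalg.norm(W_g, axis=0))            # Eq. (2.3)
    local_norms = np.linalg.norm(W_l, axis=0)
    return float(np.mean((local_norms - alpha) ** 2))        # Eq. (2.2)

rng = np.random.default_rng(0)
W_g = rng.normal(size=(512, 512))     # fusion weights for the 512-dim global feature
W_l_dead = np.zeros((512, 256))       # two under-used 128-dim local features
W_l_live = rng.normal(size=(512, 256))
# Near-zero local weights incur a much larger penalty than healthy ones,
# which is what drives the weight distribution in Figure 2.3 toward uniformity.
print(promotion_loss(W_g, W_l_dead), promotion_loss(W_g, W_l_live))
```

In training this term is simply added to the total loss with the coefficient λ = 10⁻⁵ chosen above.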
2.4.2 Evaluation of Proposed Modules on LFW

In the proposed network, we use an attention network to localize discriminative regions rather than cropping fixed patches, train a fusion layer to compress the concatenated feature, and add a promotion loss encouraging the sub-networks to explore more discriminative features. Here we evaluate the effectiveness of these modules by comparing the results with and without them on two protocols on the LFW dataset: standard and BLUFR [8]. The standard verification protocol of the original LFW dataset contains only 6,000 pairs of faces in all, which is insufficient to evaluate deep learning methods, evidenced by the fact that results are almost saturated on this protocol. Because of this, Liao et al. [8] made use of the whole LFW dataset to build the BLUFR protocol. In this protocol, a 10-fold cross-validation test is defined for both face verification and open-set face identification. For face verification, a verification rate (VR) is reported for each split at a strict false accept rate (FAR = 0.1%) by comparing around 156,915 genuine pairs and 46,960,863 impostor pairs¹, which is closer to a real-world scenario than the accuracy metric in the standard LFW protocol. For open-set identification, an identification rate (DIR) at Rank-1 corresponding to FAR = 1% is computed. We first test the performance of Model B without certain modules to verify their effectiveness. Then we train the proposed Model A and Model B with all modules and compare them with the base-network.

Table 2.3 Evaluation results of the proposed model with/without certain modules on the standard LFW and BLUFR protocols. "AN" means "Attention Network"; "FL" means "Fusion Layer"; "PL" refers to "Promotion Loss". "Y" indicates the module is used while "N" indicates it is not. Accuracy is tested on the standard LFW verification protocol. Verification Rate (VR) and Detection and Identification Rate (DIR) are tested on the BLUFR protocol.

Type       AN   FL   PL   Accuracy   VR@FAR=0.1%   DIR@FAR=1% (Rank-1)
Base-net   -    -    -    98.77%     94.96%        72.96%
Model B    N    Y    Y    98.67%     95.54%        74.33%
Model B    Y    N    Y    98.78%     95.63%        76.37%
Model B    Y    Y    N    98.75%     95.83%        75.75%
Model A    Y    Y    Y    98.85%     95.90%        77.51%
Model B    Y    Y    Y    98.98%     96.44%        77.96%
In Table 2.3, Base-net indicates the baseline single CNN, which is used as the base-network in our model. Attention Network (AN) indicates whether an attention network is used to automatically localize the regions for the sub-networks, rather than cropping fixed regions that are randomly initialized. Fusion Layer (FL) indicates whether we add a fully connected fusion layer or directly use the concatenated features as the representation. Promotion Loss (PL) indicates whether we add the promotion loss as a regularization on the fusion layer. The accuracy is tested on the standard protocol, while the Verification Rate (VR) and Detection and Identification Rate (DIR) are tested on the BLUFR protocol. Although all the results are similar on the standard LFW protocol, distinct differences can be observed in the BLUFR results. This is because the standard protocol only contains 6,000 pairs, which is not adequate to precisely reflect the performance of a highly sophisticated model. Based on the results on BLUFR, we can see that Model B consistently outperforms the base-network even without certain modules, and also that every module makes a contribution and is essential to guarantee the final performance of the whole model. After using all modules, the proposed Model A and Model B surpass the baseline by 4% in terms of DIR@FAR = 1% at Rank-1. This demonstrates that the proposed idea of an auto-aligned parts-based model does improve the performance of a single neural network. Moreover, with more sub-networks added, Model B (12 sub-networks) consistently outperforms Model A (3 sub-networks).

¹ The numbers are averaged over ten splits.

Figure 2.5 Examples of the localized regions in Model A. The attention network localizes the eyes, nose and mouth accurately by learning without landmark labels. These accurately localized patches make it an easier task for the sub-networks to learn robust features from certain facial parts.
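For completeness, the VR@FAR operating points reported above (equivalently TAR@FAR, Eqs. (1.2)-(1.3)) are computed by choosing the threshold on the impostor scores so that the target FAR is met, then measuring the fraction of genuine pairs accepted. The numpy sketch below illustrates the procedure on toy synthetic scores, not on benchmark data.

```python
import numpy as np

def vr_at_far(genuine_scores, impostor_scores, target_far):
    """Verification rate at a fixed false accept rate: pick the threshold T
    so that at most target_far of impostor pairs score above T (Eq. 1.3),
    then return the fraction of genuine pairs accepted at T (Eq. 1.2)."""
    imp = np.sort(np.asarray(impostor_scores))
    # Smallest threshold whose impostor acceptance fraction is <= target_far.
    k = int(np.ceil(len(imp) * (1.0 - target_far)))
    threshold = imp[min(k, len(imp) - 1)]
    vr = float(np.mean(np.asarray(genuine_scores) >= threshold))
    return vr, float(threshold)

# Synthetic cosine-like scores: impostors centered at 0, genuine pairs at 0.7.
rng = np.random.default_rng(1)
impostors = rng.normal(0.0, 0.1, 100_000)
genuine = rng.normal(0.7, 0.15, 10_000)
vr, T = vr_at_far(genuine, impostors, target_far=0.001)
print(f"VR@FAR=0.1%: {vr:.4f} (threshold {T:.3f})")
```

The same routine, swept over a range of target FARs, traces out the ROC curve from which all the TAR@FAR numbers in later benchmarks are read.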
To further evaluate the attention networks, we visualize the localized patches in Model A. Some examples are shown in Figure 2.5. Notice the different distributions of facial parts, even after alignment, due to the challenging poses of the input images. The attention network can still accurately find the target facial parts: in the localized patches in each column, all the facial parts are distributed in a similar way. These accurately localized patches make it an easier task for the sub-networks to learn robust features from certain facial parts. The attention network also allows adjusting which part to localize so that the sub-networks can find more discriminative features. Notice that the attention network is trained without landmark labels and, as such, its computation is almost free.

Table 2.4 Evaluation results on IJB-A 1:1 Comparison and 1:N Search protocols (mean ± standard deviation over the 10 splits).

           TAR@FAR (Verification)             CMC (Closed-set Identification)     FNIR (Open-set Identification)
Type       FAR=0.001        FAR=0.01          Rank-1           Rank-5             FPIR=0.01        FPIR=0.1
Base-net   0.542 ± 0.0917   0.7883 ± 0.0917   0.882 ± 0.0190   0.954 ± 0.0079     0.426 ± 0.0170   0.355 ± 0.0140
Model A    0.583 ± 0.0832   0.8075 ± 0.0264   0.889 ± 0.0068   0.957 ± 0.0068     0.418 ± 0.0147   0.353 ± 0.0137
Model B    0.602 ± 0.0692   0.8231 ± 0.0219   0.898 ± 0.0092   0.960 ± 0.0061     0.411 ± 0.0164   0.353 ± 0.0142

Table 2.5 Evaluation results on IJB-B 1:1 Baseline Verification and 1:N Mixed Media Identification protocols.

           TAR@FAR (Verification)   CMC (Closed-set Identification)   FNIR (Open-set Identification)
Type       FAR=0.001   FAR=0.01     Rank-1      Rank-5                FPIR=0.01   FPIR=0.1
Base-net   0.631       0.851        0.749       0.861                 0.149       0.032
Model A    0.652       0.861        0.768       0.875                 0.139       0.031
Model B    0.659       0.865        0.769       0.874                 0.135       0.032
2.4.3 Evaluation on IJB-A and IJB-B Benchmarks

The IARPA Janus Benchmarks, including IJB-A and IJB-B, were released to push forward the frontiers of unconstrained face recognition systems. In IJB-A, a manually labeled dataset containing images both from photos and video frames is used to build protocols for face identification (1:N Search) and face verification (1:1 Comparison). In comparison to LFW, the 5,712 images and 2,085 videos in the IJB-A benchmark have a wider geographic variation, larger pose variation, and images of low resolution or heavy occlusion, making it a much harder benchmark than both the standard LFW and BLUFR benchmarks. Again, a 10-fold cross-validation test is designed for both identification and verification in IJB-A. True Accept Rate (TAR) at False Accept Rate (FAR) is used to evaluate verification performance. For closed-set identification, the Cumulative Match Characteristic (CMC) measures the fraction of genuine gallery templates that are retrieved within a certain rank, and the False Negative Identification Rate (FNIR) at a given False Positive Identification Rate (FPIR) is reported to evaluate the performance in terms of open-set identification.

IJB-B is an extension of the IJB-A benchmark. It consists of 21,798 still images and 55,026 frames from 7,011 videos of 1,845 subjects. There is no cross-validation protocol in IJB-B. In particular, we use the 1:1 Baseline Verification protocol and the 1:N Mixed Media Identification protocol for IJB-B.

From the results in Table 2.4 and Table 2.5, we can see that the proposed models do improve the performance of the base-network on both the IJB-A and IJB-B benchmarks. This shows the effectiveness of the proposed idea of fusing features from local regions together with a global feature representation, even though the base-network is already quite sophisticated. Second, Model B outperforms Model A in most protocols, which indicates that more local regions and sub-networks could help achieve even larger performance gains.

Figure 2.6 Example pairs that are misclassified by the base-network but are classified correctly by Model B on the IJB-B dataset. Pairs in the green box are genuine pairs and pairs in the red box are impostor pairs. We use the threshold of IJB-B 1:1 Baseline Verification for TAR@FAR = 0.1%.
2.5 Conclusion

In this chapter, we have proposed a scheme for incorporating parts-based models into state-of-the-art CNNs for face recognition. A set of sub-networks is added to learn features from certain facial parts. A spatial transformer-based attention network learns to automatically localize the discriminative regions. We have further added a fusion layer to combine the global and local features, which, with the proposed promotion loss, encourages the sub-networks to find more discriminative features. The proposed approach can be applied to any single CNN to build an end-to-end system. Experiments on the most recent and challenging benchmarks show that the proposed strategy can help improve the performance of a single CNN without a significant increase in run-time. Evidence suggests that we can further improve the performance with even more sub-networks.

Chapter 3

Uncertainty Estimation for Deep Face Recognition

3.1 Introduction

When humans are asked to describe a face image, they not only give a description of the facial attributes, but also the confidence associated with them. For example, if the eyes are blurred in the image, a person will treat the eye size as uncertain information and focus on other features. Furthermore, if the image is completely corrupted and no attributes can be discerned, the subject may respond that he/she cannot identify this face. This kind of uncertainty (or confidence) estimation is common and important in human decision making.
On the other hand, the representations and similarity metrics used in state-of-the-art face recognition systems are generally confidence-agnostic. These methods depend on an embedding model (e.g., a deep neural network) to give a deterministic point representation for each face image in the latent feature space [30, 32, 34, 36, 38]. A point in the latent space represents the model's estimate of the facial features in the given image. If the error in this estimate is somehow bounded, the distance between two points can effectively measure the semantic similarity between the corresponding face images. But given a low-quality input, where the expected facial features are ambiguous or absent in the image, a large shift of the embedded points is inevitable, leading to false recognition (Figure 3.1a).

Figure 3.1 Difference between (a) deterministic face embeddings and (b) probabilistic face embeddings (PFEs). Deterministic embeddings represent every face as a point in the latent space without regard to its feature ambiguity. A probabilistic face embedding (PFE) instead gives a distributional estimate of the features in the latent space. Best viewed in color.

To address the above problems, we propose Probabilistic Face Embeddings (PFEs), which give a distributional estimate instead of a point estimate in the latent space for each input face image (Figure 3.1b). The mean of the distribution can be interpreted as the most likely latent feature values, while the span of the distribution represents the uncertainty of these estimates. PFE can address the unconstrained face recognition problem in a two-fold way: (1) during matching (face comparison), PFE penalizes uncertain features (dimensions) and pays more attention to more confident features; (2) for low-quality inputs, the confidence estimated by PFE can be used to reject the input or actively ask for human assistance to avoid false recognition. Besides, a natural solution can be derived to aggregate the PFE representations of a set of face images into a new distribution with lower uncertainty to increase the recognition performance. The implementation of PFE is open-sourced¹. The contributions of this chapter can be summarized as follows:

1.
An uncertainty-aware probabilistic face embedding (PFE) that represents face images as distributions instead of points.

2. A probabilistic framework that can be naturally derived for face matching and feature fusion using PFEs.

3. A simple method that converts existing deterministic embeddings into PFEs without additional training data.

4. Comprehensive experiments showing that the proposed PFE can improve the face recognition performance of deterministic embeddings and can effectively filter out low-quality inputs to enhance the robustness of face recognition systems.

¹ https://github.com/seasonSH/Probabilistic-Face-Embeddings

3.2 Related Work

Uncertainty Learning in DNNs. To improve the robustness and interpretability of discriminative deep neural networks (DNNs), deep uncertainty learning is receiving more attention [72, 73, 74]. There are two main types of uncertainty: model uncertainty and data uncertainty. Model uncertainty refers to the uncertainty of the model parameters given the training data and can be reduced by collecting additional training data [75, 76, 72, 73]. Data uncertainty accounts for the uncertainty in the output whose primary source is the inherent noise in the input data, and which hence cannot be eliminated with more training data [74]. The uncertainty studied in our work can be categorized as data uncertainty. Although techniques have been developed for estimating data uncertainty in different tasks, including classification and regression [74], they are not suitable for our task since our target space is not well-defined by given labels². Variational Autoencoders [77] can also be regarded as a method for estimating data uncertainty, but they mainly serve a generation purpose. Specific to face recognition, some studies [78, 79, 80] have leveraged model uncertainty for the analysis and learning of face representations, but to our knowledge, ours is the first work that utilizes data uncertainty³ for recognition tasks.

Probabilistic Face Representation. Modeling faces as probabilistic distributions is not a new idea. In the field of face template/video matching, there exists abundant literature on modeling faces as probabilistic distributions [82, 83], subspaces [84], or manifolds [83, 85] in the feature space.
However, the input for such methods is a set of face images rather than a single face image, and they use a between-distribution similarity or distance measure, e.g., KL-divergence, for comparison, which does not penalize the uncertainty. Meanwhile, some studies [86, 87] have attempted to build a fuzzy model of a given face using the features of face parts. In comparison, the proposed PFE represents each single face image as a distribution in the latent space encoded by DNNs, and we use an uncertainty-aware log-likelihood score to compare the distributions.

Quality-aware Pooling. In contrast to the methods above, recent work on face template/video matching aims to leverage the saliency of deep CNN embeddings by aggregating the deep features of all faces into a single compact vector [88, 89, 90, 91]. In these methods, a separate module learns to predict the quality of each face in the image set, which is then normalized for a weighted pooling of the feature vectors. We show that such a solution can be naturally derived under our framework, which not only gives a probabilistic explanation for quality-aware pooling methods, but also leads to a more general solution where an image set can itself be modeled as a PFE representation.

² Although we are given the identity labels, they cannot directly serve as target vectors in the latent feature space.
³ Some works in the literature have also used the terminology "uncertainty" for a different purpose [81].

Figure 3.2 Illustration of the feature ambiguity dilemma. The plots show the cosine similarity on the LFW dataset with different degrees of degradation: (a) Gaussian blur, (b) occlusion, (c) random Gaussian noise. Blue lines show the similarity between original images and their respective degraded versions. Red lines show the similarity between impostor pairs of degraded images. The shading indicates the standard deviation. With larger degrees of degradation, the model becomes more confident (very high/low scores) in a wrong way.
3.3 Limitations of Deterministic Embeddings

In this section, we explain the problems of deterministic face embeddings from both theoretical and empirical views. Let X denote the image space and Z denote the latent feature space of D dimensions. An ideal latent space Z should encode only identity-salient features and be disentangled from identity-irrelevant features. As such, each identity should have a unique intrinsic code z ∈ Z that best represents this person, and each face image x ∈ X is an observation sampled from p(x|z). The process of training face embeddings can be viewed as a joint process of searching for such a latent space Z and learning the inverse mapping p(z|x). For deterministic embeddings, the inverse mapping is a Dirac delta function p(z|x) = δ(z − f(x)), where f is the embedding function. Clearly, for any space Z, given the possibility of noise in x, it is unrealistic to recover the exact z, and the embedded point of a low-quality input would inevitably shift away from its intrinsic z (no matter how much training data we have).

The question is whether this shift can be bounded such that we still have smaller intra-class distances compared to inter-class distances. However, this is unrealistic for fully unconstrained face recognition, and we conduct an experiment to illustrate it. Let us start with a simple example: given a pair of identical images, a deterministic embedding will always map them to the same point, and therefore the distance between them will always be 0, even if these images do not contain a face.
This implies that a pair of images being similar, or even identical, does not necessarily mean that the probability of their belonging to the same person is high.

To demonstrate this, we conduct an experiment by manually degrading high-quality images and visualizing their similarity scores. We randomly select a high-quality image of each subject from the LFW dataset [3] and manually apply Gaussian blur, occlusion, and random Gaussian noise to the faces. In particular, we linearly increase the size of the Gaussian kernel, the occlusion ratio, and the standard deviation of the noise to control the degradation degree. At each degradation level, we extract the feature vectors with a 64-layer CNN⁴, which is comparable to state-of-the-art face recognition systems. The features are normalized onto a hyper-spherical embedding space. Then, two types of cosine similarities are reported: (1) the similarity between pairs of an original image and its respective degraded image, and (2) the similarity between degraded images of different identities. As shown in Figure 3.2, for all three types of degradation, the genuine similarity scores decrease to 0 while the impostor similarity scores converge to 1.0! These indicate two types of errors that can be expected in a fully unconstrained scenario even when the model is very confident (very high/low similarity scores): (1) false accepts of low-quality impostor pairs and (2) false rejects of genuine cross-quality pairs. To confirm this, we test the model on the IJB-A dataset by finding the impostor/genuine image pairs with the highest/lowest scores, respectively. The situation is exactly as we hypothesized (see Figure 3.3).

⁴ Trained on MS-Celeb-1M [2] with AM-Softmax [35].

Figure 3.3 (a) Example genuine pairs from the IJB-A dataset estimated with the lowest similarity scores and (b) impostor pairs with the highest similarity scores (among all possible pairs) by a 64-layer CNN model. The genuine pairs mostly consist of one high-quality and one low-quality image, while the impostor pairs are all low-quality images. Note that these pairs are not templates in the verification protocol.
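The measurement protocol of this experiment can be sketched as follows. This is a minimal, self-contained sketch: `embed` is a toy stand-in for the 64-layer CNN (a fixed random linear map followed by L2-normalization), so it illustrates what is being computed at each degradation level rather than reproducing the trained CNN's behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(img, W):
    # Stand-in for the face CNN: project and L2-normalize onto the
    # hyper-spherical embedding space used in the chapter.
    z = W @ img.ravel()
    return z / np.linalg.norm(z)

def cosine(a, b):
    return float(np.dot(a, b))

d_img, d_feat = 64, 16
W = rng.standard_normal((d_feat, d_img))
faces = rng.standard_normal((2, d_img))        # two different "identities"

genuine_scores, impostor_scores = [], []
for noise_std in [0.0, 1.0, 4.0]:              # increasing degradation level
    degraded = faces + noise_std * rng.standard_normal(faces.shape)
    # (1) original image vs. its own degraded version (genuine pair)
    genuine_scores.append(cosine(embed(faces[0], W), embed(degraded[0], W)))
    # (2) degraded images of two different identities (impostor pair)
    impostor_scores.append(cosine(embed(degraded[0], W), embed(degraded[1], W)))

print(genuine_scores)    # genuine similarity drops as degradation grows
print(impostor_scores)
```

With the toy linear embedding only the genuine-score decay is visible; the impostor scores converging to 1.0 is a property of the trained CNN's "dark space" and does not appear here.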
We call this the feature ambiguity dilemma, which is observed when deterministic embeddings are forced to estimate the features of ambiguous faces. The experiment also implies that there exists a dark space to which the ambiguous inputs are mapped and in which the distance metric is distorted.

3.4 Probabilistic Face Embeddings

To address the aforementioned problem caused by data uncertainty, we propose to encode the uncertainty into the face representation and take it into account during matching. Specifically, instead of building a model that gives a point estimate in the latent space, we estimate a distribution p(z|x) in the latent space to represent the potential appearance of a person's face⁵. In particular, we use a multivariate Gaussian distribution:

\[ p(\mathbf{z}\,|\,\mathbf{x}_i) = \mathcal{N}(\mathbf{z};\,\boldsymbol{\mu}_i,\,\boldsymbol{\sigma}_i^2\mathbf{I}) \tag{3.1} \]

where μ_i and σ_i are both D-dimensional vectors predicted by the network from the i-th input image x_i. Here we only consider a diagonal covariance matrix to reduce the complexity of the face representation. This representation should have the following properties:

1. The center μ should encode the most likely facial features of the input image.
2. The uncertainty σ should encode the model's confidence along each feature dimension.

⁵ Following the notation in Section 3.3.

In addition, we wish to use a single network to predict the distribution. Considering that new approaches for training face embeddings are still being developed, we aim to develop a method that can convert existing deterministic face embedding networks to PFEs in an easy manner. In the following, we first show how to compare and fuse PFE representations to demonstrate their strength, and then propose our method for learning PFEs.

3.4.1 Matching with PFEs

Given the PFE representations of a pair of images (x_i, x_j), we can directly measure the likelihood of them belonging to the same person (sharing the same latent code): p(z_i = z_j), where z_i ∼ p(z|x_i) and z_j ∼ p(z|x_j). Specifically,

\[ p(\mathbf{z}_i = \mathbf{z}_j) = \iint p(\mathbf{z}_i\,|\,\mathbf{x}_i)\,p(\mathbf{z}_j\,|\,\mathbf{x}_j)\,\delta(\mathbf{z}_i - \mathbf{z}_j)\,d\mathbf{z}_i\,d\mathbf{z}_j. \tag{3.2} \]

In practice, we would like to use the log likelihood instead, whose solution is given by:

\[ s(\mathbf{x}_i, \mathbf{x}_j) = \log p(\mathbf{z}_i = \mathbf{z}_j) = -\frac{1}{2}\sum_{l=1}^{D}\left(\frac{\big(\mu_i^{(l)} - \mu_j^{(l)}\big)^2}{\sigma_i^{2(l)} + \sigma_j^{2(l)}} + \log\big(\sigma_i^{2(l)} + \sigma_j^{2(l)}\big)\right) - \mathrm{const}, \tag{3.3} \]

where const = (D/2) log 2π, μ_i^(l) refers to the l-th dimension of μ_i, and similarly for σ_i^(l).

Note that this symmetric measure can be viewed as the expectation of the likelihood of one input's latent code conditioned on the other, that is,

\[ s(\mathbf{x}_i, \mathbf{x}_j) = \log \int p(\mathbf{z}\,|\,\mathbf{x}_i)\,p(\mathbf{z}\,|\,\mathbf{x}_j)\,d\mathbf{z} = \log \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z}|\mathbf{x}_i)}\big[p(\mathbf{z}\,|\,\mathbf{x}_j)\big] = \log \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z}|\mathbf{x}_j)}\big[p(\mathbf{z}\,|\,\mathbf{x}_i)\big]. \tag{3.4} \]

As such, we call it the mutual likelihood score (MLS). Different from KL-divergence, this score is unbounded and cannot be seen as a distance metric. It can be shown that the squared Euclidean distance is equivalent to a special case of MLS when all the uncertainties are assumed to be the same:

Property 1. If σ_i^(l) is a fixed number for all data x_i and dimensions l, MLS is equivalent to a scaled and shifted negative squared Euclidean distance.

Further, when the uncertainties are allowed to differ, we note that MLS has some interesting properties that make it different from a distance metric:

1. Attention mechanism: the first term in the brackets in Equation (3.3) can be seen as a weighted distance which assigns larger weights to less uncertain dimensions.
2. Penalty mechanism: the second term in the brackets in Equation (3.3) can be seen as a penalty term which penalizes dimensions that have high uncertainty.
3. If either input x_i or x_j has large uncertainty, MLS will be low (because of the penalty) irrespective of the distance between their means.
4. Only if both inputs have small uncertainties and their means are close to each other can MLS be very high.

The last two properties imply that PFE can solve the feature ambiguity dilemma if the network can effectively estimate σ_i.

Figure 3.4 Fusion with PFEs. (a) Illustration of the fusion process as a directed graphical model. (b) Given the Gaussian representations of faces (from the same identity), the fusion process outputs a new Gaussian distribution in the latent space with a more precise mean and lower uncertainty.
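For concreteness, the mutual likelihood score of Equation (3.3) can be sketched in a few lines of NumPy (a minimal sketch; the function name and array layout are our own, not from the released implementation):

```python
import numpy as np

def mutual_likelihood_score(mu_i, sigma2_i, mu_j, sigma2_j):
    """Mutual likelihood score (Eq. 3.3) between two PFEs with diagonal
    covariances.  mu_*: D-dim means; sigma2_*: D-dim variances sigma^2."""
    s2 = sigma2_i + sigma2_j
    D = len(mu_i)
    # First term: uncertainty-weighted ("attention") squared distance;
    # second term: penalty on dimensions with high uncertainty.
    return float(-0.5 * np.sum((mu_i - mu_j) ** 2 / s2 + np.log(s2))
                 - 0.5 * D * np.log(2 * np.pi))

# With equal constant variances MLS reduces to a scaled, shifted negative
# squared Euclidean distance (Property 1); with unequal variances, an
# uncertain input lowers the score regardless of the distance between means.
mu_a, mu_b = np.zeros(2), np.array([1.0, 0.0])
low, high = np.full(2, 0.01), np.full(2, 1.0)
print(mutual_likelihood_score(mu_a, low, mu_a, low))    # confident match: high
print(mutual_likelihood_score(mu_a, high, mu_a, high))  # uncertain match: lower
```

Note that the score is unbounded: shrinking both variances towards zero for a matching pair drives it towards +∞, which is why it cannot be interpreted as a distance.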
3.4.2 Fusion with PFEs

In many cases we have a template (set) of face images, for which we need to build a compact representation for matching. With PFEs, a conjugate formula can be derived for representation fusion (Figure 3.4). Let {x_1, x_2, ..., x_n} be a series of observations (face images) from the same identity and p(z|x_1, x_2, ..., x_n) be the posterior distribution after the n-th observation. Then, assuming all the observations are conditionally independent (given the latent code z), it can be shown that:

\[ p(\mathbf{z}\,|\,\mathbf{x}_1, \ldots, \mathbf{x}_{n+1}) = \alpha\,\frac{p(\mathbf{z}\,|\,\mathbf{x}_{n+1})}{p(\mathbf{z})}\,p(\mathbf{z}\,|\,\mathbf{x}_1, \ldots, \mathbf{x}_n), \tag{3.5} \]

where α is a normalization factor. To simplify the notation, let us only consider the one-dimensional case below; the solution can easily be extended to the multivariate case.

If p(z) is assumed to be a noninformative prior, i.e., a Gaussian distribution whose variance approaches infinity, the posterior distribution in Equation (3.5) is a new Gaussian distribution with lower uncertainty. Further, given a set of face images {x_1, x_2, ..., x_n}, the parameters of the fused representation can be directly given by:

\[ \hat{\mu}_n = \sum_{i=1}^{n} \frac{\hat{\sigma}_n^2}{\sigma_i^2}\,\mu_i, \tag{3.6} \]

\[ \frac{1}{\hat{\sigma}_n^2} = \sum_{i=1}^{n} \frac{1}{\sigma_i^2}. \tag{3.7} \]

In practice, because the conditional independence assumption is usually not true, e.g., video frames include a large amount of redundancy, Equation (3.7) will be biased by the number of images in the set. Therefore, we take the dimension-wise minimum to obtain the new uncertainty.

Relationship to Quality-aware Pooling. Consider a case where all the dimensions share the same uncertainty σ_i for the i-th input, and let the quality value q_i = 1/σ_i² be the output of the network. Then Equation (3.6) can be written as

\[ \hat{\boldsymbol{\mu}}_n = \frac{\sum_{i=1}^{n} q_i\,\boldsymbol{\mu}_i}{\sum_{j=1}^{n} q_j}. \tag{3.8} \]

If we do not use the uncertainty after fusion, the algorithm is the same as recent quality-aware aggregation methods for set-to-set face recognition [88, 89, 90].
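As an illustration, the fusion rule of Equations (3.6)-(3.7), together with the dimension-wise minimum used in practice, can be sketched as follows (a sketch under our own naming; the (n, D) array layout is an assumption):

```python
import numpy as np

def fuse_pfe(mus, sigma2s):
    """Fuse n PFE representations of the same identity.
    mus, sigma2s: (n, D) arrays of means and per-dimension variances."""
    prec = 1.0 / sigma2s                            # precisions 1 / sigma_i^2
    sigma2_post = 1.0 / prec.sum(axis=0)            # Eq. (3.7)
    mu_hat = sigma2_post * (prec * mus).sum(axis=0) # Eq. (3.6)
    # Video frames are rarely conditionally independent, so Eq. (3.7) is
    # biased by the set size; the new uncertainty is instead taken as the
    # dimension-wise minimum, as described in the text.
    return mu_hat, sigma2s.min(axis=0)

# A more confident observation (smaller variance) dominates the fused mean:
mus = np.array([[0.0, 0.0], [2.0, 2.0]])
sigma2s = np.array([[1.0, 1.0], [3.0, 3.0]])
mu_hat, sigma2_hat = fuse_pfe(mus, sigma2s)
print(mu_hat)    # pulled towards the first (more confident) observation
```

With equal variances the fused mean reduces to a plain average, recovering the quality-aware pooling of Equation (3.8) with uniform weights.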
3.4.3 Learning

Note that any deterministic embedding f, if properly optimized, can indeed satisfy the properties of PFEs: (1) the embedding space is a disentangled identity-salient latent space, and (2) f(x) represents the most likely features of the given input in the latent space. As such, in this work we consider a stage-wise training strategy: given a pre-trained embedding model f, we fix its parameters, take μ(x) = f(x), and optimize an additional uncertainty module to estimate σ(x). When the uncertainty module is trained on the same dataset as the embedding model, this stage-wise training strategy allows a fairer comparison between the PFE and the original embedding f(x) than an end-to-end learning strategy.

The uncertainty module is a network with two fully-connected layers which shares the same input as the bottleneck layer⁶. The optimization criterion is to maximize the mutual likelihood score of all genuine pairs (x_i, x_j). Formally, the loss function to minimize is

\[ \mathcal{L} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j)\in\mathcal{P}} s(\mathbf{x}_i, \mathbf{x}_j), \tag{3.9} \]

where P is the set of all genuine pairs and s is defined in Equation (3.3). In practice, the loss function is optimized within each mini-batch. Intuitively, this loss function can be understood as an alternative to maximizing p(z|x): if the latent distributions of all possible genuine pairs have a large overlap, the latent target z should have a large likelihood p(z|x) for any corresponding x. Notice that because μ(x) is fixed, the optimization will not lead to the collapse of all the μ(x) onto a single point.

3.5 Implementation Details

All the models in this chapter are implemented using TensorFlow r1.9. Two and four GeForce GTX 1080 Ti GPUs are used for training the base models on CASIA-WebFace [50] and MS-Celeb-1M [2], respectively. The uncertainty modules are trained using one GPU.

3.5.1 Data Preprocessing

All the face images are first passed through the MTCNN face detector [7] to detect 5 facial landmarks (two eyes, nose, and two mouth corners). Then, a similarity transformation is used to normalize the face images based on the five landmarks. After transformation, the images are resized to 112 × 96.
Before being passed into the networks, each pixel in the RGB image is normalized by subtracting 127.5 and dividing by 128.

⁶ The bottleneck layer refers to the layer which outputs the original face embedding.

3.5.2 Base Models

The base models for CASIA-WebFace [50] are trained for 28,000 steps using an SGD optimizer with a momentum of 0.9. The learning rate starts at 0.1, and is decreased to 0.01 and 0.001 after 16,000 and 24,000 steps, respectively. For the base model trained on MS-Celeb-1M [2], we train the network for 140,000 steps using the same optimizer settings. The learning rate starts at 0.1, and is decreased to 0.01 and 0.001 after 80,000 and 120,000 steps, respectively. The batch size, feature dimension, and weight decay are set to 256, 512, and 0.0005, respectively, in both cases.

3.5.3 Uncertainty Module

Architecture. The uncertainty modules for all models are two-layer perceptrons with the same architecture: FC-BN-ReLU-FC-BN-exp, where FC refers to fully-connected layers, BN refers to batch normalization layers [69], and the exp function ensures that the output values σ² are all positive [74]. The first FC layer shares the same input as the bottleneck layer, i.e., the output feature map of the last convolutional layer. The outputs of both FC layers are D-dimensional vectors, where D is the dimensionality of the latent space. In addition, we constrain the last BN layer to share the same γ and β across all dimensions, which we found helps stabilize the training.
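The FC-BN-ReLU-FC-BN-exp head can be sketched as a NumPy forward pass. This is a sketch only: batch normalization is shown in inference form, all parameter names are hypothetical, and the second BN shares a single γ and β across dimensions as described above.

```python
import numpy as np

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    # Inference-mode batch normalization.
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def uncertainty_module(h, p):
    """h: (N, C) input shared with the bottleneck layer.
    p: dict of (hypothetical) parameters.  Returns (N, D) variances."""
    x = h @ p["W1"] + p["b1"]                                         # FC
    x = np.maximum(batch_norm(x, p["m1"], p["v1"], p["g1"], p["be1"]), 0.0)  # BN-ReLU
    x = x @ p["W2"] + p["b2"]                                         # FC
    x = batch_norm(x, p["m2"], p["v2"], p["g_shared"], p["be_shared"])  # BN, shared gamma/beta
    return np.exp(x)          # exp keeps every variance strictly positive

rng = np.random.default_rng(1)
C, D, N = 8, 4, 3
p = {"W1": rng.standard_normal((C, D)), "b1": np.zeros(D),
     "m1": np.zeros(D), "v1": np.ones(D), "g1": np.ones(D), "be1": np.zeros(D),
     "W2": rng.standard_normal((D, D)), "b2": np.zeros(D),
     "m2": np.zeros(D), "v2": np.ones(D), "g_shared": 1.0, "be_shared": 0.0}
sigma2 = uncertainty_module(rng.standard_normal((N, C)), p)
print(sigma2.shape)
```

The final exp is what lets the module output variances directly rather than unconstrained logits.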
Training. For the models trained on CASIA-WebFace [50], we train the uncertainty module for 3,000 steps using an SGD optimizer with a momentum of 0.9. The learning rate starts at 0.001, and is decreased to 0.0001 after 2,000 steps. For the model trained on MS-Celeb-1M [2], we train the uncertainty module for 12,000 steps. The learning rate starts at 0.001, and is decreased to 0.0001 after 8,000 steps. The batch size in both cases is 256. For each mini-batch, we randomly select 4 images from each of 64 different subjects to compose the positive pairs (384 pairs in all). The weight decay is set to 0.0005 in all cases. A subset of the training data was separated as the validation set for choosing these hyper-parameters during the development phase.

Table 3.1 Results of models trained on CASIA-WebFace. "Original" refers to the deterministic embeddings; "PFE" uses the mutual likelihood score for matching. The better performance for each base model is shown in bold. IJB-A results are verification rates at FAR = 0.1%.

  Base Model                  Representation  LFW    YTF    CFP-FP  IJB-A
  Softmax + Center Loss [32]  Original        98.93  94.74  93.84   78.16
                              PFE             99.27  95.42  94.51   80.83
  Triplet [30]                Original        97.65  93.36  89.76   60.82
                              PFE             98.45  93.96  90.04   61.00
  A-Softmax [34]              Original        99.15  94.80  92.41   78.54
                              PFE             99.32  94.94  93.37   82.58
  AM-Softmax [35]             Original        99.28  95.64  94.77   84.69
                              PFE             99.55  95.92  95.06   87.58

Inference Speed. Feature extraction (passing through the whole network) using one GPU takes 1.5 ms per image. Note that given the small size of the uncertainty module, it has little impact on the feature extraction time. Matching a pair of images using cosine similarity and the mutual likelihood score takes 4 ns and 15 ns, respectively. Both are negligible in comparison with the feature extraction time.

3.6 Experiments

In this section, we first test the proposed PFE method on standard face recognition protocols to compare with deterministic embeddings. Then we conduct a qualitative analysis to gain more insight into how PFE behaves.
To comprehensively evaluate the efficacy of PFEs, we conduct experiments on 7 benchmarks, including the well-known LFW [3], YTF [51], and MegaFace [52], and four other more unconstrained benchmarks. We use CASIA-WebFace [50] and a cleaned version⁷ of MS-Celeb-1M [2] as training data, from which we remove the subjects that are also included in the test datasets⁸.

⁷ https://github.com/inlmouse/MS-Celeb-1M_WashList
⁸ 84 and 4,182 subjects were removed from CASIA-WebFace and MS-Celeb-1M, respectively.

Table 3.2 Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on LFW, YTF, and MegaFace. The MegaFace verification rates are computed at FAR = 0.0001%. "-" indicates that the authors did not report the performance on the corresponding protocol.

  Method           Training Data  LFW    YTF    MF1 Rank-1  MF1 Veri.
  DeepFace+ [27]   4M             97.35  91.4   -           -
  FaceNet [30]     200M           99.63  95.1   -           -
  DeepID2+ [31]    300K           99.47  93.2   -           -
  CenterFace [32]  0.7M           99.28  94.9   65.23       76.52
  SphereFace [34]  0.5M           99.42  95.0   75.77       89.14
  ArcFace [38]     5.8M           99.83  98.02  81.03       96.98
  CosFace [36]     5M             99.73  97.6   77.11       89.88
  L2-Face [37]     3.7M           99.78  96.08  -           -
  Baseline         4.4M           99.70  97.18  79.43       92.93
  PFE_fuse         4.4M           -      97.32  -           -
  PFE_fuse+match   4.4M           99.82  97.36  78.95       92.51

3.6.1 Experiments on Different Base Embeddings

Since our method works by converting existing deterministic embeddings, we want to evaluate how it works with different base embeddings, i.e., face representations trained with different loss functions. In particular, we implement the following state-of-the-art loss functions: Softmax+Center Loss [32], Triplet Loss [30], A-Softmax [34], and AM-Softmax [35]⁹. To be aligned with previous work [34, 36], we train a 64-layer residual network [34] with each of these loss functions on the CASIA-WebFace dataset as base models. All the features are L2-normalized onto a hyper-spherical embedding space. Then we train the uncertainty module for each base model on CASIA-WebFace again for 3,000 steps. We evaluate the performance on four benchmarks: LFW [3], YTF [51], CFP-FP [4], and IJB-A [5], which present different challenges in face recognition. The results are
shown in Table 3.1. PFE improves over the original representation in all cases, indicating that the proposed method is robust across different embeddings and testing scenarios.

Table 3.3 Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on CFP (frontal-profile protocol) and IJB-A.

  Method            Training Data  IJB-A TAR@FAR=0.1%  IJB-A TAR@FAR=1.0%  CFP-FP
  Yin et al. [92]   0.5M           73.9 ± 4.2          77.5 ± 2.5          94.39
  NAN [88]          3M             88.1 ± 1.1          94.1 ± 0.8          -
  QAN [89]          5M             89.31 ± 3.92        94.20 ± 1.53        -
  Cao et al. [44]   3.3M           90.4 ± 1.4          95.8 ± 0.6          -
  Multicolumn [90]  3.3M           92.0 ± 1.3          96.2 ± 0.5          -
  L2-Face [37]      3.7M           94.3 ± 0.5          97.00 ± 0.4         -
  Baseline          4.4M           93.30 ± 1.29        96.15 ± 0.71        92.78
  PFE_fuse          4.4M           94.59 ± 0.72        95.92 ± 0.73        -
  PFE_fuse+match    4.4M           95.25 ± 0.89        97.50 ± 0.43        93.34

Table 3.4 Results of our models (last three rows) trained on MS-Celeb-1M and state-of-the-art methods on IJB-C.

  Method            Training Data  TAR@FAR=0.001%  0.01%  0.1%   1%
  Yin et al. [93]   0.5M           -               -      69.3   83.8
  Cao et al. [44]   3.3M           74.7            84.0   91.0   96.0
  Multicolumn [90]  3.3M           77.1            86.2   92.7   96.8
  DCN [94]          3.3M           -               88.5   94.7   98.3
  Baseline          4.4M           70.10           85.37  93.61  96.91
  PFE_fuse          4.4M           83.14           92.38  95.47  97.36
  PFE_fuse+match    4.4M           89.64           93.25  95.49  97.17

Table 3.5 Performance comparison on the three protocols of IJB-S. The performance is reported in terms of rank retrieval (closed-set) and TPIR@FPIR (open-set) instead of the media-normalized version [1]. The numbers 1% and 10% in the second row refer to the FPIR.
  Surveillance-to-Single (Rank-1 / Rank-5 / Rank-10 / TPIR@FPIR=1% / 10%):
    C-FAN [91] (5.0M):      50.82 / 61.16 / 64.95 / 16.44 / 24.19
    Baseline (4.4M):        50.00 / 59.07 / 62.70 /  7.22 / 19.05
    PFE_fuse (4.4M):        53.44 / 61.40 / 65.05 / 10.53 / 22.87
    PFE_fuse+match (4.4M):  50.16 / 58.33 / 62.28 / 31.88 / 35.33

  Surveillance-to-Booking (Rank-1 / Rank-5 / Rank-10 / TPIR@FPIR=1% / 10%):
    C-FAN [91] (5.0M):      53.04 / 62.67 / 66.35 / 27.40 / 29.70
    Baseline (4.4M):        47.54 / 56.14 / 61.08 / 14.75 / 22.99
    PFE_fuse (4.4M):        55.45 / 63.17 / 66.38 / 16.70 / 26.20
    PFE_fuse+match (4.4M):  53.60 / 61.75 / 64.97 / 35.99 / 39.82

  Surveillance-to-Surveillance (Rank-1 / Rank-5 / Rank-10 / TPIR@FPIR=1% / 10%):
    C-FAN [91] (5.0M):      10.05 / 17.55 / 21.06 /  0.11 /  0.68
    Baseline (4.4M):         9.40 / 17.52 / 23.04 /  0.06 /  0.71
    PFE_fuse (4.4M):         8.18 / 14.52 / 19.31 /  0.09 /  0.63
    PFE_fuse+match (4.4M):   9.20 / 20.82 / 27.34 /  0.84 /  2.83

3.6.2 Comparison with State-of-the-Art

To compare with state-of-the-art face recognition methods, we use a different base model: a 64-layer network trained with AM-Softmax on the MS-Celeb-1M dataset. We then fix its parameters and train the uncertainty module on the same dataset for 12,000 steps. In the following experiments, we compare 3 methods:

• Baseline uses only the original features of the 64-layer deterministic embedding along with cosine similarity for matching. Average pooling is used in the case of template/video benchmarks.
• PFE_fuse uses the uncertainty estimate σ in PFE and Equation (3.6) to aggregate the features of templates, but uses cosine similarity for matching. If the uncertainty module estimates the feature uncertainty effectively, fusion with σ should outperform average pooling by assigning larger weights to confident features.
• PFE_fuse+match uses σ both for fusion and for matching (with mutual likelihood scores). Templates/videos are fused based on Equation (3.6) and Equation (3.7).

⁹ We also tried implementing ArcFace [38], but it did not converge well in our case, so we did not use it.

In Table 3.2 we show the results on three relatively easier benchmarks: LFW, YTF, and MegaFace.
Although the accuracy on LFW and YTF is nearly saturated, the proposed PFE still improves the performance of the original representation. Note that MegaFace is a biased dataset: because all the probes are high-quality images from FaceScrub, the positive pairs in MegaFace are both high-quality images, while the negative pairs contain at most one low-quality image¹⁰. Therefore, neither of the two types of errors caused by the feature ambiguity dilemma (Section 3.3) will show up on MegaFace, and it naturally favors deterministic embeddings. However, PFE still maintains the performance in this case. We also note that such a bias, namely the target gallery images being of higher quality than the rest of the gallery, would not exist in real-world applications.

In Table 3.3 and Table 3.4 we show the results on three more challenging datasets: CFP, IJB-A, and IJB-C. The images in these datasets present larger variations in pose, occlusion, etc., and facial features can be more ambiguous. As such, PFE achieves a more significant improvement on these three benchmarks. In particular, on IJB-C at FAR = 0.001%, PFE reduces the error rate by 64%. In addition, simply fusing the original features with the learned uncertainty (PFE_fuse) also helps the performance.

In Table 3.5 we report the results on the three protocols of the latest benchmark, IJB-S. Again, PFE is able to improve the performance in most cases. Notice that the gallery templates in the surveillance-to-single and surveillance-to-booking protocols all include high-quality frontal mugshots, which present little feature ambiguity. Therefore, we only see a slight performance gap in these two

¹⁰ The negative pairs of MegaFace in the verification protocol only include those between probes and distractors.

Table 3.6 Results of different network architectures trained on CASIA-WebFace. "Original" refers to the deterministic embeddings; "PFE" uses the mutual likelihood score for matching. The better performance for each base model is shown in bold. IJB-A results are verification rates at FAR = 0.1%.
(a) CASIA-Net

  Base Model                  Representation  LFW    YTF    CFP-FP  IJB-A
  Softmax + Center Loss [32]  Original        97.70  92.56  91.13   63.93
                              PFE             97.89  93.10  91.36   64.33
  Triplet [30]                Original        96.98  90.72  85.69   54.47
                              PFE             97.10  91.22  85.10   51.35
  A-Softmax [34]              Original        97.12  92.38  89.31   54.48
                              PFE             97.92  91.78  89.96   58.09
  AM-Softmax [35]             Original        98.32  93.50  90.24   71.28
                              PFE             98.63  94.00  90.50   75.92

(b) Light-CNN

  Base Model                  Representation  LFW    YTF    CFP-FP  IJB-A
  Softmax + Center Loss [32]  Original        97.77  92.34  90.96   60.42
                              PFE             98.28  93.24  92.29   62.41
  Triplet [30]                Original        97.48  92.46  90.01   52.34
                              PFE             98.15  93.62  90.54   56.81
  A-Softmax [34]              Original        98.07  92.72  89.34   63.21
                              PFE             98.47  93.44  90.54   71.96
  AM-Softmax [35]             Original        98.68  93.78  90.59   76.50
                              PFE             98.95  94.34  91.26   80.00

protocols. But in the most challenging surveillance-to-surveillance protocol, a larger improvement can be achieved by using the uncertainty for matching. Besides, PFE_fuse+match improves the performance significantly on all the open-set protocols, which indicates that MLS has more impact on the absolute pairwise scores than on the relative ranking.
3.7 Results on Different Architectures

Here, we evaluate the proposed method on two different network architectures for face recognition: CASIA-Net [50] and the 29-layer Light-CNN [95]. Notice that both networks require image shapes different from our preprocessed ones. Thus we pad our images with zero values and resize them to the target size. Since the main purpose of this experiment is to evaluate the efficacy of the uncertainty module rather than to compare with the original results of these networks, the difference in preprocessing should not affect a fair comparison. Besides, the original CASIA-Net does not converge with A-Softmax and AM-Softmax, so we add a bottleneck layer that outputs the embedding representation after the average pooling layer. We then conduct the experiments by comparing probabilistic embeddings with the base deterministic embeddings, similar to Section 3.6.1. The results are shown in Table 3.6a and Table 3.6b. Without tuning the architecture of the uncertainty module or the hyper-parameters, PFE still improves the performance in most cases.

Figure 3.5 Repeated experiments on the feature ambiguity dilemma with the proposed PFE: (a) Gaussian blur, (b) occlusion, (c) random noise. The same model as in Figure 3.2 is used as the base model and is converted to a PFE by training an uncertainty module. No additional training data or data augmentation is used for training.

Figure 3.6 (a) Example genuine pairs from the IJB-A dataset estimated with the lowest mutual likelihood scores and (b) impostor pairs with the highest scores by the PFE version of the same 64-layer CNN model in Section 3.3. In comparison to Figure 3.3, most images here are high-quality ones with clear features, which can mislead the model to be confident in a wrong way. Note that these pairs are not templates in the verification protocol.

3.7.1 Qualitative Analysis

Why and when does PFE improve performance?
We first repeat the experiments in Section 3.3 using the PFE representation and MLS. The same network is used as the base model here. As one can see in Figure 3.5, although the scores of low-quality impostor pairs still increase with the degradation level, they converge to a point that is lower than the majority of genuine scores. Similarly, the scores of cross-quality genuine pairs converge to a point that is higher than the majority of impostor scores. This means the two types of errors discussed in Section 3.3 could be solved by PFE. This is further confirmed by the IJB-A results in Figure 3.6. Figure 3.7 shows the distribution of estimated uncertainty on LFW, IJB-A, and IJB-S. As one can see, the "variance" of the uncertainty increases in the following order: LFW < IJB-A < IJB-S. Comparing this with the performance in Section 3.6.2, we can see that PFE tends to achieve a larger performance improvement on datasets with more diverse image quality.

Figure 3.7 Distribution of estimated uncertainty on different datasets. Here, "uncertainty" refers to the harmonic mean of σ across all feature dimensions. Note that the estimated uncertainty is proportional to the complexity of the datasets. Best viewed in color.

What does the DNN see and not see? To answer this question, we train a decoder network on the original embedding, then apply it to the PFE by sampling z from the estimated distribution p(z|x) of a given x. For a high-quality image (Figure 3.8, Row 1), the reconstructed images tend to be very consistent without much variation, implying the model is very certain about the facial features in this image. In contrast, for a lower-quality input (Figure 3.8, Row 2), larger variation can be observed in the reconstructed images. In particular, attributes that can be clearly discerned from the image (e.g., thick eyebrows) remain consistent, while attributes that cannot be discerned (e.g., eye shape) have larger variation. As for a mis-detected image (Figure 3.8, Row 3), significant variation can be observed in the reconstructed images: the model does not see any salient features in the given image.
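The sampling step behind this visualization is simply a reparameterized draw z ~ N(μ, σ²I); a sketch follows (the decoder itself is not shown, and all names are our own):

```python
import numpy as np

def sample_latents(mu, sigma2, n_samples, seed=0):
    """Draw n_samples latent codes z ~ N(mu, diag(sigma2)).  Each z would
    then be passed through the pre-trained decoder to render one
    reconstruction; large sigma2 yields visibly diverse samples."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + np.sqrt(sigma2) * eps

mu = np.array([0.5, -1.0, 2.0])
certain = sample_latents(mu, np.full(3, 1e-8), 4)     # high-quality input
uncertain = sample_latents(mu, np.full(3, 1.0), 4)    # ambiguous input
print(certain.std(axis=0))     # near zero: reconstructions stay consistent
print(uncertain.std(axis=0))   # large: reconstructions vary
```

This mirrors the qualitative observation above: small per-dimension variances pin the samples (and hence the reconstructions) to the mean, while large variances let them wander.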
3.8 Risk-controlled Face Recognition

In many scenarios, we may expect a higher performance than our system is able to achieve, or we may want to make sure the system's performance can be controlled when facing complex application scenarios. Therefore, we would expect the model to reject input images if it is not confident.

Figure 3.8 Visualization results on a high-quality, a low-quality and a mis-detected image from IJB-A. For each input, 5 images are reconstructed by a pre-trained decoder using the mean and 4 randomly sampled z vectors from the estimated distribution p(z|x).

A common solution for this is to filter the images with a quality assessment tool. We show that PFE provides a natural solution for this task. We take all the images from the LFW and IJB-A datasets for image-level face verification (we do not follow the original protocols here). The system is allowed to filter out a proportion of all images to maintain a better performance. We then report the TAR@FAR = 0.001% against the Filter Out Rate. We consider two criteria for filtering: (1) the detection score of MTCNN [7] and (2) a confidence value predicted by our uncertainty module. Here the confidence for the i-th sample is defined as the inverse of the harmonic mean of \sigma_i^2 across all dimensions. For fairness, both methods use the original deterministic embedding representations and cosine similarity for matching. To avoid saturated results, we use the model trained on CASIA-WebFace with AM-Softmax. The results are shown in Figure 3.10. As one can see, the predicted confidence value is a better indicator of the potential recognition accuracy of the input image. This is an expected result, since PFE is trained under supervision for the particular model, while an external quality estimator is unaware of the kind of features used for matching by the model. Example images with high/low confidence/quality scores are shown in Figure 3.9.

(a) Ours (b) MTCNN
Figure 3.9 Example images from LFW and IJB-A that are estimated with the highest (H) confidence/quality scores and the lowest (L) scores by our method and the MTCNN face detector.
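The filtering criterion can be made concrete with a short sketch. The confidence definition (inverse of the harmonic mean of \sigma_i^2 over dimensions) is from the text above; the helper names and toy data are illustrative.

```python
def pfe_confidence(sigma2):
    # Inverse of the harmonic mean of the per-dimension uncertainties:
    # harmonic_mean = D / sum(1/v)  =>  confidence = sum(1/v) / D.
    return sum(1.0 / v for v in sigma2) / len(sigma2)

def keep_after_filtering(sigma2_list, filter_out_rate):
    # Reject the least confident fraction of images, as in the
    # risk-controlled verification experiment; return kept indices.
    ranked = sorted(range(len(sigma2_list)),
                    key=lambda i: pfe_confidence(sigma2_list[i]),
                    reverse=True)
    n_keep = len(ranked) - int(len(ranked) * filter_out_rate)
    return sorted(ranked[:n_keep])
```

TAR@FAR would then be recomputed on the kept subset for each Filter Out Rate, which is how the curves in Figure 3.10 are traced.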
(a) LFW (b) IJB-A
Figure 3.10 Comparison of verification performance on LFW and IJB-A (not the original protocol) obtained by filtering out a proportion of images using different quality criteria.

3.9 Conclusion

We have proposed probabilistic face embeddings (PFEs), which represent face images as distributions in the latent space. Probabilistic solutions were derived to compare and aggregate the PFEs of face images. Unlike deterministic embeddings, PFEs do not suffer from the feature ambiguity dilemma in unconstrained face recognition. Quantitative and qualitative analysis in different settings showed that converting deterministic embeddings to PFEs can effectively improve face recognition performance. We have also shown that the uncertainty in PFEs is a good indicator of the discriminative quality of face images. In future work we will explore how to learn PFEs in an end-to-end manner and how to address the data dependency within face templates.

Chapter 4
Universal Face Representation Learning

In this chapter, we discuss the challenges faced by the feature extraction module in AFR systems and a potential solution. Almost all modern AFR systems use deep convolutional neural networks as the feature extraction module. As a black-box function, such deep neural networks are trained to map input images to a feature space with small intra-identity distance and large inter-identity distance, which has been achieved by prior works through loss design and datasets with rich within-class variations [30, 32, 34, 36, 38]. However, even very large public datasets manifest strong biases, such as ethnicity [96, 97] or head poses [98, 99, 100]. This lack of variation leads to significant performance drops on challenging test datasets; for example, the accuracy reported by the prior state-of-the-art [54] on IJB-S or TinyFace [1, 6] is about 30% lower than on IJB-A [5] or LFW [3].
Recent works seek to close the domain gap caused by such data bias through domain adaptation, i.e., identifying specific factors of variation and augmenting the training datasets [99], or further leveraging unlabeled data along such nameable factors [96]. While nameable variations are hard to identify exhaustively, prior works have sought to align the feature space between source and target domains [101, 97]. Alternatively, individual models might be trained on various datasets and ensembled to obtain good performance on each [102]. All these approaches either handle only specific variations, require access to test data distributions, or accrue additional run-time complexity to handle wider variations. In contrast, we propose learning a single deep

Figure 4.1 Traditional recognition models require target domain data to adapt from the high-quality training data to conduct unconstrained/low-quality face recognition. A model ensemble is further needed for a universal representation, which significantly increases model complexity. In contrast, our method works only on the original training data without any target domain information, and can deal with unconstrained testing scenarios.

feature representation that handles the variations in face recognition without requiring access to the test data distribution, retains run-time efficiency, and achieves strong performance across diverse situations, especially on low-quality images (see Figure 4.1).
This chapter introduces several novel contributions in Section 4.2 to learn such a universal representation. First, we note that inputs with non-frontal poses, low resolutions and heavy occlusions are key nameable factors that present challenges for applications, and training data may be synthetically augmented along them. But directly adding hard augmented examples into training leads to a more difficult optimization problem. We mitigate this by proposing an identification loss that accounts for per-sample confidence to learn a probabilistic feature embedding. Second, we seek to maximize the representation power of the embedding by decomposing it into sub-embeddings, each of which has an independent confidence value during training. Third, all the sub-embeddings are encouraged to be further decorrelated through two complementary regularizations over different partitions of the sub-embeddings, i.e., a classification loss on variations and an adversarial loss on different partitions. Fourth, we achieve further decorrelation by mining for additional variations for which synthetic augmentation is non-trivial. Finally, we account for the varying discrimination power of sub-embeddings for various factors through a probabilistic aggregation that accounts for their uncertainties.

In Section 4.4, we extensively evaluate the proposed methods on public datasets. Compared to our baseline model, the proposed method maintains high accuracy on general face recognition benchmarks, such as LFW and YTF, while significantly boosting performance on challenging datasets such as IJB-C and IJB-S, where new state-of-the-art performance is achieved. Detailed ablation studies show the impact of each of the above contributions in achieving these strong results. In summary, the main contributions of this chapter are:

• A method for learning a universal face representation by associating features with different variations, leading to improved generalization on diverse testing datasets.
• A confidence-aware identification loss that utilizes sample confidence during training to leverage hard samples.
• A feature decorrelation regularization that applies both a classification loss on variations and an adversarial loss on different partitions of the feature sub-embeddings, leading to improved performance.
• A training strategy to effectively combine synthesized data to train a face representation applicable to images outside the original training distribution.
• State-of-the-art results on several challenging benchmarks, such as IJB-A, IJB-C, TinyFace and IJB-S.

4.1 Related Work

Universal representation refers to a single model that can be applied to various visual domains (usually different tasks), e.g. objects, characters, road signs, while maintaining the performance of a set of domain-specific models [103, 104, 105, 106, 97]. The features learned by such a

(a) Blur (b) Occlusion (c) Pose (d) Randomly Combined
Figure 4.2 Samples from MS-Celeb-1M [2] with augmentation along different variations.

single model are believed to be more universal than those of domain-specific models. Different from domain generalization [107, 108, 109, 110, 111], which targets adaptability to unseen domains by learning from various seen domains, universal representation learning does not involve re-training on unseen domains. Several methods focus on increasing parameter efficiency by reducing the domain shift with techniques such as conditioned Batch Norm [103] and residual adapters [104, 105]. Based on SE modules [112], [106] proposes a domain-attentive module for intermediate (hidden) features of a universal object detection network. Our work differs from those methods in two ways: (1) it is a method for similarity metric learning rather than detection or classification tasks, and (2) it is model-agnostic. The features learned by our model can be directly applied to different domains by computing the pairwise similarity between samples of unseen classes.
4.2 Proposed Approach

In this section, we first introduce three augmentable variations, namely blur, occlusion and head pose, to augment the training data. Visual examples of augmented data are shown in Figure 4.2 and the details can be found in Section 4.3. Then in Section 4.2.1, we introduce a confidence-aware identification loss to learn from hard examples, which is further extended in Section 4.2.2 by splitting the feature vectors into sub-embeddings with independent confidences. In Section 4.2.3, we apply

Figure 4.3 Overview of the proposed method. High-quality input images are first augmented according to pre-defined variations, i.e., blur, occlusion and pose. The feature representation is then split into sub-embeddings associated with sample-specific confidences. A confidence-aware identification loss and a variation decorrelation loss are developed to learn the sub-embeddings.

the introduced augmentable variations to further decorrelate the feature embeddings. A method for discovering further non-augmentable variations is proposed to achieve better decorrelation. Finally, an uncertainty-guided pairwise metric is proposed for inference.

4.2.1 Confidence-Aware Identification Loss

We investigate the posterior probability of being classified to identity j \in \{1, 2, \ldots, N\}, given the input sample x_i. Denote the feature embedding of sample i as f_i and the j-th identity prototype vector as w_j, which is the identity template feature. A probabilistic embedding network \theta represents each sample x_i as a Gaussian distribution N(f_i, \sigma_i^2 I) in the feature space. The likelihood of x_i being a sample of class j is given by:

p(x_i | y = j) \propto p_\theta(w_j | x_i) = \frac{1}{(2\pi\sigma_i^2)^{D/2}} \exp\left(-\frac{\|f_i - w_j\|^2}{2\sigma_i^2}\right),   (4.1)

where D is the feature dimension. Further assuming the prior of assigning a sample to any identity to be equal, the posterior of x_i belonging to the j-th class is derived as:

p(y = j | x_i) = \frac{p(x_i | y = j) p(y = j)}{\sum_{c=1}^{N} p(x_i | y = c) p(y = c)} = \frac{\exp(-\|f_i - w_j\|^2 / (2\sigma_i^2))}{\sum_{c=1}^{N} \exp(-\|f_i - w_c\|^2 / (2\sigma_i^2))}.   (4.2)

For simplicity, let us define a confidence value s_i = 1/\sigma_i^2. Constraining both f_i and w_j to the \ell_2-normalized unit sphere, we have \|f_i - w_j\|^2 / (2\sigma_i^2) = s_i (1 - w_j^T f_i) and

p(y = j | x_i) = \frac{\exp(s_i w_j^T f_i)}{\sum_{c=1}^{N} \exp(s_i w_c^T f_i)}.   (4.3)

The effect of the confidence-aware posterior in Equation 4.3 is illustrated in Figure 4.4. When training is conducted among samples of various qualities, if we assume the same confidence across all samples, the learned prototype will be in the center of all samples. This is not ideal, as low-quality samples convey more ambiguous identity information. In contrast, if we set up a sample-specific confidence s_i, where high-quality samples show higher confidence, it will push the prototype w_j to be more similar to high-quality samples in order to maximize the posterior. Meanwhile, during the update of the embedding f_i, it provides a stronger push for a low-quality f_i to be closer to the prototype.

Adding a loss margin [36] over the exponential logit has been shown to be effective in narrowing the within-identity distribution. We also incorporate it into our loss:

L_{idt} = -\log \frac{\exp(s_i w_{y_i}^T f_i - m)}{\exp(s_i w_{y_i}^T f_i - m) + \sum_{j \neq y_i} \exp(s_i w_j^T f_i)},   (4.4)

where y_i is the ground-truth label of x_i. Our confidence-aware identification loss (C-Softmax) differs from the cosine loss [36] as follows: (1) each image has an independent and dynamic s_i rather than a constant shared scalar, and (2) the margin parameter m is not multiplied by s_i.

(a) w/o confidence (b) w/ confidence
Figure 4.4 Illustration of confidence-aware embedding learning on data of varying quality. With confidence guiding, the learned prototype is closer to the high-quality samples, which represent the identity better.

The independence of s_i allows it to gate the gradient signals of w_j and f_i during network training in a sample-specific way, as the confidence (degree of variation) of training samples can differ greatly. Though samples are specific, we aim to learn a homogeneous feature space such that the metric across different identities is consistent. Thus, allowing s_i to compensate for the confidence difference of the samples, we expect m to be consistently shared across all the identities.

4.2.2 Confidence-Aware Sub-Embeddings

Though the embedding f_i learned through a sample-specific gating s_i can deal with sample-level variations, we argue that the correlation among the entries of f_i itself is still high. To maximize the representation power and achieve a compact feature size, decorrelating the entries of the embedding is necessary. This encourages us to further break the entire embedding f_i into partitioned sub-embeddings, each of which is assigned a scalar confidence value.

As illustrated in Figure 4.3, we partition the entire feature embedding f_i into K equal-length sub-embeddings as in Equation 4.5. Accordingly, the prototype vector w_j and the confidence scalar

(a) sub-embedding of size 8 (b) sub-embedding of size 32
Figure 4.5 The correlation matrices of sub-embeddings obtained by splitting the feature vector into different sizes. The correlation is computed in terms of distance to the class center.

s_i are also partitioned into the same K groups:

w_j = [w_j^{(1)T}, w_j^{(2)T}, \ldots, w_j^{(K)T}],
f_i = [f_i^{(1)T}, f_i^{(2)T}, \ldots, f_i^{(K)T}],
s_i = [s_i^{(1)}, s_i^{(2)}, \ldots, s_i^{(K)}].   (4.5)

Each sub-embedding f_i^{(k)} is \ell_2-normalized onto the unit sphere separately. The final identification loss thus is:

L_{idt} = -\log \frac{\exp(a_{i,y_i} - m)}{\exp(a_{i,y_i} - m) + \sum_{j \neq y_i} \exp(a_{i,j})},   (4.6)

a_{i,j} = \frac{1}{K} \sum_{k=1}^{K} s_i^{(k)} w_j^{(k)T} f_i^{(k)}.   (4.7)

A common issue for neural networks is that they tend to be over-confident in their predictions [113].
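Equations 4.6 and 4.7 can be sketched directly. This minimal pure-Python version (helper names and toy dimensions are ours) shows how the per-sub-embedding confidences gate the cosine logits before the margin softmax; a real implementation would of course vectorize this in PyTorch.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logit(sub_feats, sub_protos, sub_conf):
    # a_{i,j} = (1/K) * sum_k s_i^(k) * <w_j^(k), f_i^(k)>   (Eq. 4.7)
    K = len(sub_feats)
    return sum(s * dot(w, f)
               for s, w, f in zip(sub_conf, sub_protos, sub_feats)) / K

def c_softmax_loss(sub_feats, protos, sub_conf, y, m):
    # L_idt = -log exp(a_{i,y} - m) / (exp(a_{i,y} - m) + sum_{j!=y} exp(a_{i,j}))
    # Note the margin m is subtracted from the target logit only and is
    # NOT multiplied by the confidence (Eq. 4.6).
    logits = [logit(sub_feats, p, sub_conf) for p in protos]
    num = math.exp(logits[y] - m)
    den = num + sum(math.exp(a) for j, a in enumerate(logits) if j != y)
    return -math.log(num / den)
```

For a well-classified sample, raising its confidence sharpens the posterior and lowers the loss, which is the mechanism that pulls prototypes toward high-quality samples.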
We add an additional \ell_2 regularization to constrain the confidence from growing arbitrarily large:

L_{reg} = \frac{1}{K} \sum_{k=1}^{K} s_i^{(k)2}.   (4.8)

(a) variation-correlated features (b) variation-decorrelated features
Figure 4.6 The variation decorrelation loss disentangles different sub-embeddings by associating them with different variations. In this example, the first two sub-embeddings are forced to be invariant to occlusion while the second two sub-embeddings are forced to be invariant to blur. By pushing stronger invariance for each variation, the correlation/overlap between the two variations is reduced.

4.2.3 Sub-Embedding Decorrelation

Setting up multiple sub-embeddings alone does not guarantee that the features in different groups learn complementary information. As empirically shown in Figure 4.5, we find the sub-embeddings are still highly correlated; i.e., dividing f_i into 16 equal groups, the average correlation among all the sub-embeddings is 0.57. If we penalize the sub-embeddings with different regularizations, the correlation among them can be reduced. By associating different sub-embeddings with different variations, we apply a variation classification loss on a subset of all the sub-embeddings while applying a variation adversarial loss in terms of the other variation types. Given multiple variations, these two regularization terms are forced on different subsets, leading to better sub-embedding decorrelation.

For each augmentable variation t \in \{1, 2, \ldots, M\}, we generate a binary mask V_t, which selects a random half of all sub-embeddings while setting the other half to zeros. The masks are generated at the beginning of training and remain fixed during training. We guarantee that the masks are different for different variations. We expect V_t(f_i) to reflect the t-th variation while being invariant to the others. Accordingly, we build a multi-label binary discriminator D by learning to predict all variations from each masked subset:

\min_D L_D = -\sum_{t=1}^{M} \log p_D(u_i = \hat{u}_i | V_t(f_i)) = -\sum_{t=1}^{M} \sum_{t'=1}^{M} \log p_D(u_i^{(t')} = \hat{u}_i^{(t')} | V_t(f_i)),   (4.9)

where u_i = [u_i^{(1)}, u_i^{(2)}, \ldots, u_i^{(M)}] are the binary labels (0/1) of the known variations and \hat{u}_i is the ground-truth label. For example, if t = 1 corresponds to resolution, \hat{u}_i^{(1)} would be 1 and 0 for high- and low-resolution images, respectively. Note that Equation 4.9 is only used for training the discriminator D. The corresponding classification and adversarial losses of the embedding network are then given by:

L_{cls} = -\sum_{t=1}^{M} \log p_D(u^{(t)} = \hat{u}_i^{(t)} | V_t(f_i)),   (4.10)

L_{adv} = -\sum_{t=1}^{M} \sum_{t' \neq t} \left( \frac{1}{2} \log p_D(u^{(t')} = 0 | V_t(f_i)) + \frac{1}{2} \log p_D(u^{(t')} = 1 | V_t(f_i)) \right).   (4.11)

The classification loss L_{cls} encourages V_t to be variation-specific, while L_{adv} is an adversarial loss that encourages invariance to the other variations. As long as no two masks are the same, this guarantees that the selected subset V_t is functionally different from any other V_{t'}. We thus achieve decorrelation between V_t and V_{t'}. The overall loss function for each sample is:

\min_\theta L = L_{idt} + \lambda_{reg} L_{reg} + \lambda_{cls} L_{cls} + \lambda_{adv} L_{adv}.   (4.12)

During optimization, Equation (4.12) is averaged across the samples in the mini-batch.

4.2.4 Mining for Further Variations

The limited number (three in our method) of augmentable variations leads to a limited decorrelation effect, as the number of masks V_t is too small. To further enhance the decorrelation, as well as to introduce more variations for better generalization ability, we aim to explore more variations with semantic meaning. Notice that not all variations are easy to augment; e.g., smiling or not is hard to augment. For such variations, we attempt to mine the variation labels from the original training data. In particular, we leverage an off-the-shelf attribute dataset, CelebA [114], to train an attribute classification model \theta with an identity adversarial loss:

\min_\theta L_\theta = -\log p(l | x) - \frac{1}{N} \sum_{c=1}^{N} \log p(y = c | x),
\min_D L_D = -\log p(y = y_x | x),   (4.13)

where l is the attribute label and y is the identity label. x is the input face image and N is the number of identities in the CelebA dataset. The first term trains the feature to classify facial attributes and the second term encourages the feature to be invariant to identities.

The attribute classifier is then applied to the recognition training set to generate T new soft variation labels, e.g. smiling or not, young or old. These additional binary variation labels are merged with the original augmentable variation labels as u_i = [u_i^{(1)}, \ldots, u_i^{(M)}, u_i^{(M+1)}, \ldots, u_i^{(M+T)}] and are then incorporated into the decorrelation learning framework in Section 4.2.3.

4.2.5 Uncertainty-Guided Probabilistic Aggregation

Considering the metric for inference, simply taking the average of the learned sub-embeddings is sub-optimal. This is because different sub-embeddings show different discriminative power for different variations; their importance should vary according to the given image pairs. Inspired by [54], we apply the uncertainty associated with each embedding to a pairwise similarity score:

score(x_i, x_j) = -\frac{1}{2} \sum_{k=1}^{K} \frac{\|f_i^{(k)} - f_j^{(k)}\|^2}{\sigma_i^{(k)2} + \sigma_j^{(k)2}} - \frac{D}{2} \sum_{k=1}^{K} \log(\sigma_i^{(k)2} + \sigma_j^{(k)2}).   (4.14)

Though Equation 4.8 provides regularization, we empirically find that the confidence learned with the identification loss still tends to be over-confident and hence cannot be used directly in Equation 4.14, so we fine-tune the original confidence branch to predict \sigma while fixing the other parts. We refer the readers to [54] for the training details of fine-tuning.
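The aggregation metric of Equation 4.14 can be sketched in pure Python (toy dimensions; variable names are ours): low-uncertainty sub-embeddings dominate the score because their distance term is divided by a smaller \sigma_i^2 + \sigma_j^2, while the log term discounts pairs whose uncertainties are small.

```python
import math

def probabilistic_score(fi, vi, fj, vj):
    # score(x_i, x_j) = -1/2 * sum_k ||f_i^k - f_j^k||^2 / (v_i^k + v_j^k)
    #                   - D/2 * sum_k log(v_i^k + v_j^k)        (Eq. 4.14)
    # fi, fj: K sub-embeddings (each a list of D floats);
    # vi, vj: K per-sub-embedding variances sigma^2.
    D = len(fi[0])
    score = 0.0
    for fa, va, fb, vb in zip(fi, vi, fj, vj):
        d2 = sum((a - b) ** 2 for a, b in zip(fa, fb))
        score -= 0.5 * d2 / (va + vb) + 0.5 * D * math.log(va + vb)
    return score
```

A genuine pair with matching sub-embeddings scores higher than an impostor pair, and a confident (low-variance) mismatch is penalized more heavily than an uncertain one, which is the behavior the aggregation relies on.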
4.3 Implementation Details

Training Details and Baseline   All the models are implemented with PyTorch v1.1. We use the cleaned list from ArcFace [38] for MS-Celeb-1M [2] as training data. After removing subjects that overlap with the testing sets, we have 4.8M images of 76.5K classes. We use the method in [115] for face alignment and crop all images to a size of 110×110. Random and center cropping are applied during training and testing, respectively, to transform the images into 100×100. The backbone of our embedding network \theta is a modified 100-layer ResNet from [38]. The network is split into two different branches after the last convolution layer, each of which includes one fully connected layer. The first branch outputs a 512-D vector, which is further split into 16 sub-embeddings. The other branch outputs a 16-D vector of confidence values for the sub-embeddings. The exp function is used to guarantee that all the confidence values s_i^{(k)} are positive. The model \theta that we use for mining additional variations is a four-layer CNN. The four layers have 64, 128, 256 and 512 kernels, respectively, all of which are 3×3. The embedding size is 512 for all models, and the features are split into 16 groups for the multi-embedding methods. The model D is a linear classifier. The baseline models in the experiments are trained with the CosFace loss function [36, 35], which achieves state-of-the-art performance on general face recognition tasks. The models without domain augmentation are trained for 18 epochs and the models with domain augmentation are trained for 27 epochs to ensure convergence. We empirically set \lambda_{reg}, \lambda_{cls} and \lambda_{adv} to 0.001, 2.0 and 2.0, respectively. The margin m is empirically set to 30. For non-augmentable variations, we choose T = 3 attributes, namely smiling, young and gender.

Variation Augmentation   (1) For low resolution, we use Gaussian blur with a kernel size between 3 and 11. (2) For occlusion, we split the images into 7×7 blocks and randomly replace some blocks with black masks. (3) For pose augmentation, we use PRNet [116] to fit the 3D model of near-frontal faces in the dataset and rotate them to a yaw angle between 40 and 60 degrees. All the augmentations are randomly combined, each with a probability of 30%.
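The random combination described above can be sketched as a sampling routine. The parameter ranges (blur kernel 3–11, 7×7 occlusion grid, 40–60° yaw, 30% probability each) come from the text; the odd-only kernel sizes, the number of occluded blocks, and all names are our assumptions for illustration.

```python
import random

def sample_augmentation_plan(rng, p=0.3):
    # Each of the three augmentable variations is applied independently
    # with probability p; the returned dict describes what to apply.
    plan = {}
    if rng.random() < p:
        plan["blur_kernel"] = rng.choice([3, 5, 7, 9, 11])  # odd sizes assumed
    if rng.random() < p:
        n = rng.randint(1, 10)  # number of masked blocks: our assumption
        plan["occluded_blocks"] = [(rng.randrange(7), rng.randrange(7))
                                   for _ in range(n)]
    if rng.random() < p:
        plan["yaw_degrees"] = rng.uniform(40.0, 60.0)
    return plan
```

Sampling a fresh plan per training image keeps roughly 34% of images fully clean (0.7^3), so the loss always sees a mix of easy and hard examples.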
4.4 Experiments

In this section, we first introduce different types of datasets reflecting different levels of variation. Different levels of variation indicate different image quality and thus lead to different performance. We then conduct a detailed ablation study over the proposed confidence-aware loss and all the proposed modules. Further, we show evaluations on these different types of testing datasets and compare to state-of-the-art methods.

4.4.1 Datasets

We evaluate our models on eight face recognition benchmarks, covering different real-world testing scenarios. The datasets are roughly categorized into three types based on the level of variations:

Type I: Limited Variation   LFW [3], CFP [4], YTF [51] and MegaFace [117] are four widely used benchmarks for general face recognition. We believe the variations in these datasets are limited, as only one or a few of the variations are present. In particular, YTF contains video samples with relatively lower resolution; CFP [4] contains face images with large pose variation but of high

(a) Baseline (b) Proposed
Figure 4.7 Testing results on synthetic data of different variations from the IJB-A benchmark (TAR@FAR=0.01%). Different rows correspond to different augmentation strategies during training. Columns are different synthetic testing data. The labels denote the blur, occlusion and pose variations, respectively. The performance of the proposed method improves monotonically as more augmentations are added.

resolution; MegaFace includes 1 million distractors crawled from the internet, while its labeled images are all high-quality frontal faces from the FaceScrub dataset [118]. For both LFW and YTF, we use the unrestricted verification protocol. For CFP, we focus on the frontal-profile (FP) protocol. We test on both the verification and identification protocols of MegaFace.

Type II: Mixed Quality   IJB-A [5] and IJB-C [42] include both high-quality celebrity photos taken in the wild and low-quality video frames with large variations of illumination, occlusion, head pose, etc. We test on both the verification and identification protocols of the two benchmarks.
Type III: Low Quality   We test on TinyFace [6] and IJB-S [1], two extremely challenging benchmarks that are mainly composed of low-quality face images. In particular, TinyFace only consists of low-resolution face images captured in the wild, which also include other variations such as occlusion and pose. IJB-S is a video face recognition dataset, where all images are video frames captured by surveillance cameras, except for a few high-quality registration photos for each person.

(a) Baseline (b) Proposed
Figure 4.8 t-SNE visualization of the features in a 2D space. Colors indicate the identities. Original training samples and augmented training samples are shown as circles and triangles, respectively.

Figure 4.9 Performance change with respect to different choices of K.

4.4.2 Ablation Study

Effect of Confidence-aware Learning   We train a set of models by gradually adding the nameable variations. The model is an 18-layer ResNet trained on a randomly selected subset of MS-Celeb-1M (0.6M images), with the confidence-aware identification loss and K = 16 embedding groups. As a controlled experiment, we apply the same type of augmentation on the IJB-A dataset to synthesize testing data of the corresponding variations. In Figure 4.7, the baseline model shows decreasing performance when gradually adding new variations, going down the grid from top row to bottom row. In comparison, the proposed method shows improving performance when adding new variations from top to bottom, which highlights the effect of our confidence-aware representation learning; it further allows more variations to be added into the training framework.

Table 4.1 Ablation study over the whole framework. VA: Variation Augmentation (Section 4.2); CI: Confidence-aware Identification loss (Section 4.2.1); ME: Multiple Embeddings (Section 4.2.3); DE: Decorrelated Embeddings (Section 4.2.3); PA: Probabilistic Aggregation (Section 4.2.5). E(all) uses all the proposed modules.

Model     VA CI ME DE PA | LFW   CFP-FP | IJB-A TAR@FAR=0.001%  0.01% | TinyFace R1  R5  | IJB-S R1  R5
Baseline                 | 99.75 98.16  | 82.20                 93.05 | 46.75    51.79   | 37.14 46.75
A         X              | 99.70 98.35  | 82.42                 93.86 | 55.26    59.04   | 51.27 58.94
B         X  X           | 99.78 98.30  | 94.70                 96.02 | 57.11    63.09   | 59.87 66.90
C         X  X  X        | 99.77 98.50  | 94.75                 96.27 | 57.30    63.73   | 59.66 66.30
          X  X  X     X  | 99.78 98.66  | 96.10                 97.29 | 55.04    60.97   | 59.71 66.32
D            X  X  X     | 99.65 97.77  | 80.06                 92.14 | 34.76    39.86   | 29.87 40.69
             X  X  X  X  | 99.68 98.00  | 94.37                 96.42 | 35.05    40.13   | 50.00 56.27
E(all)    X  X  X  X     | 99.75 98.30  | 95.00                 96.27 | 61.32    66.34   | 60.74 66.59
          X  X  X  X  X  | 99.78 98.64  | 96.00                 97.33 | 63.89    68.67   | 61.98 67.12

Table 4.2 Our method compared to state-of-the-art methods on Type I datasets. The MegaFace verification rates are computed at FAR = 0.0001%. "-" indicates that the authors did not report performance on the corresponding protocol.

Method              LFW    YTF    CFP-FP  MF1 Rank1  MF1 Veri.
FaceNet [30]        99.63  95.1   -       -          -
CenterFace [32]     99.28  94.9   -       65.23      76.52
SphereFace [34]     99.42  95.0   -       75.77      89.14
ArcFace [38]        99.83  98.02  98.37   81.03      96.98
CosFace [36]        99.73  97.6   -       77.11      89.88
Ours (Baseline)     99.75  97.16  98.16   80.03      95.54
Ours (Baseline+VA)  99.70  97.10  98.36   78.10      94.31
Ours (all)          99.75  97.68  98.30   79.10      94.92
Ours (all)+PA       99.78  97.92  98.64   78.60      95.04
We also visualize the features with t-SNE projected onto a 2D embedding space. Figure 4.8 shows that for the baseline model, under different variation augmentations, the features are mixed together and are thus erroneous for recognition. For the proposed model, samples generated by different variation augmentations remain clustered with their original samples, which indicates that identity is well preserved. Under the same settings as above, we also show the effect of using different numbers of groups K in Figure 4.9. At first, splitting the embedding space into more groups increases performance for both TARs. When the size of each sub-embedding becomes too small, the performance starts to drop because of the limited capacity of each sub-embedding.

Table 4.3 Our model compared to state-of-the-art methods on IJB-A, IJB-C and IJB-S. "-" indicates that the authors did not report performance on the corresponding protocol. Markers in the original table denote fine-tuning on the target dataset during evaluation on the IJB-A benchmark, and testing performance obtained using the models released by the corresponding authors.

Method              IJB-A(Vrf) FAR=0.001%  FAR=0.01%  | IJB-A(Idt) Rank1 | IJB-C(Vrf) FAR=0.001%  FAR=0.01%  | IJB-C(Idt) Rank1 | IJB-S(S2B) Rank1  Rank5  FPIR=1%
NAN [88]            -                      88.1±1.1   | 95.8±0.5         | -          -                      | -                | -      -      -
L2-Face [37]        90.9±0.7               94.3±0.5   | 97.3±0.5         | -          -                      | -                | -      -      -
DA-GAN [119]        94.6±0.1               97.3±0.5   | 99.0±0.2         | -          -                      | -                | -      -      -
[44]                -                      92.1±1.4   | 98.2±0.4         | 76.8       86.2                   | 91.4             | -      -      -
Multicolumn [90]    -                      92.0±1.3   | -                | 77.1       86.2                   | -                | -      -      -
ArcFace [38]        93.7±1.0               94.2±0.8   | 97.0±0.6         | 93.5       95.8                   | 95.87            | 57.36  64.95  41.23
Ours (Baseline)     82.6±8.3               93.3±3.0   | 95.5±0.7         | 43.9       86.7                   | 89.85            | 37.14  46.75  24.75
Ours (Baseline+VA)  82.4±8.1               93.9±3.5   | 95.8±0.6         | 47.6       90.6                   | 90.16            | 51.27  58.94  31.19
Ours (all)          95.0±0.9               96.3±0.6   | 97.5±0.4         | 91.6       93.7                   | 94.39            | 60.74  66.59  37.11
Ours (all)+PA       96.0±0.8               97.3±0.4   | 97.5±0.3         | 95.0       96.6                   | 96.00            | 61.98  67.12  42.73
Ablation on All Modules   We investigate each module's effect by examining the ablative models in Table 4.1. Starting from the baseline, model A is trained with variation augmentation. Based on model A, we add the confidence-aware identification loss to obtain model B. Model C is further trained by setting up multiple sub-embeddings. In model E, we further add the decorrelation loss. We also compare with a model D with all the modules except variation augmentation. Models C, D and E, which have multiple embeddings, are tested with and without probabilistic aggregation (PA). The methods are tested on two Type I datasets (LFW and CFP-FP), one Type II dataset (IJB-A) and one Type III dataset (TinyFace).

As shown in Table 4.1, compared to the baseline, adding variation augmentation improves performance on CFP-FP, TinyFace and IJB-A. These datasets present exactly the variations introduced by data augmentation, i.e., pose variation and low resolution. However, the performance on LFW fluctuates around the baseline, as LFW consists mostly of good-quality images with few variations. In comparison, models B and C are able to reduce the negative impact of the hard examples introduced by data augmentation and lead to a consistent performance boost across all benchmarks. Meanwhile, we observe that splitting into multiple sub-embeddings alone does not improve performance significantly (compare B to the first row of C), which

High-quality | Blur | Occlusion | Large-pose
Figure 4.10 Heatmap visualization of sub-embedding uncertainty on different types of images from the IJB-C dataset, shown to the right of each face image. The 16 values are arranged in 4×4 grids (no spatial meaning). Brighter color indicates higher uncertainty.

can be explained by the strongly correlated confidences among the sub-embeddings (see Figure 4.5). Nevertheless, with the decorrelation loss and probabilistic aggregation, different sub-embeddings are able to learn and combine complementary features to further boost the performance; i.e., the performance in the second row of model E is consistently better than in its first row.
4.4.3 Evaluation on General Datasets

We compare our method with state-of-the-art methods on general face recognition datasets, i.e., the Type I datasets with limited variation and high quality. Since the testing images are mostly of good quality, our method, which is designed to deal with larger variations, has limited advantage here. Even so, as shown in Table 4.2, our method is still among the best, outperforming most methods while being slightly worse than ArcFace. Notice that our baseline model already achieves good performance across all these testing sets. This verifies that the Type I testing sets do not show a significant domain gap from the training set: even without variation augmentation or embedding decorrelation, straightforward training leads to good performance.

4.4.4 Evaluation on Mixed/Low Quality Datasets

When evaluating on more challenging datasets, the state-of-the-art general methods encounter performance drops, as these datasets present large variations and thus a large domain gap from the good-quality training datasets. Table 4.3 shows the performance on three challenging benchmarks: IJB-A, IJB-C and IJB-S. The proposed model achieves consistently better results than the state of the art. In particular, simply adding variation augmentation (Baseline+VA) actually leads to worse performance on IJB-A and IJB-C. When variation augmentation is combined with our proposed modules, a significant performance boost is achieved. Further adding PA, we achieve even better performance across all datasets and protocols. Notice that IJB-A is a cross-validation protocol, and many works fine-tune on the training splits before evaluation (marked in Table 4.3). Even so, our method without fine-tuning still outperforms the state-of-the-art methods by a significant margin on the IJB-A verification protocol, which suggests that our method indeed learns a representation capable of dealing with unseen variations.
The last columns of Table 4.3 show the evaluation on IJB-S, which is so far the most challenging benchmark, targeting a real surveillance scenario with severely degraded images. We show the Surveillance-to-Booking (S2B) protocol of IJB-S. As IJB-S was only recently released, few studies have evaluated on this dataset. To comprehensively evaluate our model, we use the publicly released models from ArcFace [38] for comparison. Our method achieves consistently better performance on the Rank-1 and Rank-5 identification protocols. For TinyFace, as in Table 4.1, we achieve 63.89% and 68.67% rank-1 and rank-5 accuracy, whereas [6] reports 44.80% and 60.40%, and ArcFace achieves 47.39% and 52.28%. Combined with Table 4.2, our method achieves top-level accuracy on general recognition datasets and significantly better accuracy on challenging datasets, which demonstrates its advantage in dealing with extreme or unseen variations.

Uncertainty Visualization   Figure 4.10 shows uncertainty scores for the 16 sub-embeddings reshaped into 4×4 grids. High-quality and low-quality sub-embeddings are shown in dark and light colors, respectively. The uncertainty map presents different patterns for different variations.

4.5 Conclusion

In this work, we propose a universal face representation learning framework, URFace, to recognize faces under a wide range of variations. We first introduce three nameable variations into the MS-Celeb-1M training set via data augmentation. Traditional methods encounter convergence problems when directly feeding the augmented hard examples into training. We propose confidence-aware representation learning, partitioning the embedding into multiple sub-embeddings and relaxing the confidence to be sample- and sub-embedding-specific. Further, classification and adversarial losses on variations are proposed to decorrelate the sub-embeddings. By formulating inference with an uncertainty model, the sub-embeddings are aggregated properly. Experimental results show that our method achieves top performance on general benchmarks such as LFW and MegaFace, and significantly better accuracy on challenging benchmarks such as IJB-A, IJB-C and IJB-S.
Chapter 5

Generalizing Face Representation with Unlabeled Images

5.1 Introduction

Machine learning algorithms typically assume that training and testing data come from the same underlying distribution. In practice, however, we often encounter testing domains that differ from the population from which the training data is drawn. Since it is non-trivial to collect data for all possible testing domains, learning representations that generalize to heterogeneous testing data is desired [108, 120, 121, 122, 123]. For face recognition in particular, this problem is reflected in the domain gap between semi-constrained training datasets and unconstrained testing datasets. Nearly all state-of-the-art deep face networks are trained on large-scale web-crawled face images, most of which are high-quality celebrity photos [50, 2]. But in practice, we wish to deploy the trained FR systems in many other scenarios, e.g. unconstrained photos [5, 41, 42] and surveillance [1]. The large degree of face variation in the testing scenarios, compared to the training set, can result in a significant performance drop of the trained face models [42, 1].

The simplest solution to such a domain gap problem is to collect a large number of unconstrained labeled face images from different sources. However, due to privacy issues and human-labeling cost, it is extremely hard to collect such a database. Other popular solutions to this problem include transfer learning and domain adaptation, which require domain-specific data to train a model for each of the target domains [124, 125, 126, 127, 128, 129]. However, unconstrained face recognition needs a face representation that is robust to all different kinds of variations, so these domain-specific solutions are not appropriate.

Figure 5.1: Illustration of the problem settings in our work. Blue circles indicate the domains to which the face images belong. By utilizing diverse unlabeled images, we want to regularize the learning of the face embedding for more unconstrained face recognition scenarios.
Instead, it would be useful if we could utilize commonly available, unlabeled data to achieve a domain-agnostic face representation that generalizes to unconstrained testing scenarios (see Fig. 5.1). To achieve this goal, we ask the following questions in this chapter:

- Is it possible to improve model generalizability to unconstrained faces by introducing more diversity from auxiliary unlabeled data?
- What kind of, and how much, unlabeled data do we need?
- How much performance boost can we achieve with the unlabeled data?

In this chapter, we propose such a semi-supervised framework for learning robust face representations. The unlabeled images are collected from a public face detection dataset, WiderFace [130], which contains more diverse types (sub-domains) of face images compared to typical labeled face datasets used for training.

To utilize the unlabeled data, the proposed method jointly regularizes the embedding model in feature space and image space. We show that adversarial regularization can help to reduce domain gaps caused by facial variations, even in the absence of sub-domain labels. On the other hand, an image augmentation module is trained to discover the hidden sub-domain styles in the unlabeled data and apply them to the labeled training samples, thus increasing the discrimination power on difficult face examples. To our knowledge, this is the first study to use a heterogeneous unlabeled dataset to boost model performance for general unconstrained face recognition. The contributions of this chapter are summarized below:

- A semi-supervised learning framework for generalizing face representations with auxiliary unlabeled data.
- A multi-mode image translation module that performs data-driven augmentation and increases the diversity of the labeled training samples.
- Empirical results showing that regularization with unlabeled data helps to improve recognition performance on challenging testing datasets, e.g. IJB-B, IJB-C, and IJB-S.
5.2 Related Work

5.2.1 Semi-supervised Learning

Classic semi-supervised learning involves a small number of labeled images and a large number of unlabeled images [131, 132, 133, 134, 135, 136, 137, 138]. The goal is to improve recognition performance when we do not have sufficient labeled data. State-of-the-art semi-supervised learning methods can mainly be classified into four categories. (1) Pseudo-labeling methods generate labels for unlabeled data with the trained model and then use them for training [131]; in spite of its simplicity, this has been shown to be effective primarily for classification tasks where the labeled and unlabeled data share the same label space. (2) Temporal ensemble models maintain different versions of the model parameters to serve as teacher models for the current model [133, 134]. (3) Consistency-regularization methods apply certain types of augmentation to the unlabeled data while making sure the output prediction remains consistent after augmentation [132, 137, 138]. (4) Self-supervised learning, originally proposed for unsupervised learning, has recently been shown to be effective for semi-supervised learning as well [136]. Compared with the classic semi-supervised learning addressed in the literature, our problem differs in two senses of heterogeneity: different domains and different identities between the labeled and unlabeled data. These differences make many classic semi-supervised learning methods unsuitable for our task.
5.2.2 Domain Adaptation and Generalization

In domain adaptation, the user has a dataset for a source domain and another for a fixed target domain [124, 125, 126, 128, 129]. If the target domain is unlabeled, this leads to an unsupervised domain adaptation setting [125, 128, 129]. The goal is to improve the performance on the target domain so that it matches the performance on the source domain. This is achieved by reducing the domain gap between the two datasets in feature space. The problem with domain adaptation is that one needs to acquire a new dataset and train a new model whenever there is a new target domain. In domain generalization, the user is given a set of labeled datasets from different domains. The model is jointly trained on these datasets so that it can better generalize to unseen domains [108, 120, 121, 122, 123]. Our problem shares the same goal as domain generalization methods: we want to increase model generalizability rather than improve performance on a specific target domain. However, unlike domain generalization, we do not have identity labels for all the data, which makes our task even more difficult.

5.3 Methodology

Generally, in face representation learning, we are given a large labeled dataset $\mathcal{X} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ and $y_i$ are the face images and identity labels, respectively. The goal is to learn an embedding model $f$ such that $f(x)$ is discriminative enough to distinguish between different identities. However, since $f$ is only trained on the domain defined by $\mathcal{X}$, which usually consists of semi-constrained celebrity photos, it might not generalize to unconstrained settings. In our framework, we assume the availability of another unlabeled dataset $\mathcal{U} = \mathcal{U}_1 \cup \mathcal{U}_2 \cup \cdots \cup \mathcal{U}_k = \{u_1, u_2, \ldots, u_m\}$, collected from different sources (sub-domains). However, these sub-domain labels may not be available in real applications, so we do not assume access to them but instead seek solutions that can automatically leverage these hidden sub-domains. Then, we wish to simultaneously minimize three types of errors:

- Error due to limited discrimination power within the labeled domain $\mathcal{X}$.
- Error due to the feature domain gap between the labeled domain $\mathcal{X}$ and the hidden sub-domains $\mathcal{U}_i$.
- Error due to limited discrimination power within the unlabeled domain $\mathcal{U}$.

An overview of the framework is shown in Fig. 5.2.

Figure 5.2: Overview of the training framework of the embedding network. In each mini-batch, a random subset of the labeled data is augmented by the augmentation network to introduce additional diversity. The non-augmented labeled data are used to train the feature discriminator. The adversarial loss forces the distribution of the unlabeled features to align with that of the labeled ones.

Figure 5.3: t-SNE visualization of the face embeddings using synthesized unlabeled images. Using part of MS-Celeb-1M as the unlabeled dataset, we create three sub-domains by processing the images with random Gaussian noise, random occlusion or downsampling. (a) Without the domain adversarial loss, different sub-domains show different domain shifts in the embedding space of the supervised baseline. (b) With the holistic binary domain adversarial loss, each sub-domain is aligned with the distribution of the labeled data.
5.3.1 Minimizing Error in the Labeled Domain

The deep representation of a face image is usually a point in a hyper-spherical embedding space, where $\|f(x_i)\|_2 = 1$. State-of-the-art supervised face recognition methods all try to find an objective function that maximizes the inter-class margin such that the representation remains discriminative when tested on unseen identities. In this work, we choose the CosFace loss function [36, 35] for training on the labeled images:

$$\mathcal{L}_{idt} = -\mathbb{E}_{(x_i, y_i) \sim \mathcal{X}}\left[\log \frac{e^{s(W_{y_i}^T f_i - m)}}{e^{s(W_{y_i}^T f_i - m)} + \sum_{j \neq y_i} e^{s W_j^T f_i}}\right]. \qquad (5.1)$$

Here $s$ is a hyper-parameter controlling the temperature, $m$ is a margin hyper-parameter, and $W_j$ is the proxy vector of the $j$-th identity in the embedding space, which is also $\ell_2$-normalized. We choose the CosFace loss function because of its stability and high performance. It could potentially be replaced by any other supervised identification loss function.

5.3.2 Minimizing the Domain Gap

The unlabeled dataset $\mathcal{U}$ is assumed to be a diverse dataset collected from different sources, i.e. covering different sub-domains (types) of face images. If we had access to such sub-domain labels, a natural path to a domain-agnostic model would be to align each sub-domain with the feature distribution of the labeled images. However, the sub-domain labels might not be available in many cases. In our experiments, we find that pairwise domain alignment is not necessary; instead, a binary domain alignment loss is sufficient to align the sub-domains. Formally, given a feature discriminator network $D$, we can reduce the domain gap via an adversarial loss:

$$\mathcal{L}_{D} = -\mathbb{E}_{x \sim \mathcal{X}}[\log D(y=0 \mid f(x))] - \mathbb{E}_{u \sim \mathcal{U}}[\log D(y=1 \mid f(u))], \qquad (5.2)$$
$$\mathcal{L}_{adv} = -\mathbb{E}_{x \sim \mathcal{X}}[\log D(y=1 \mid f(x))] - \mathbb{E}_{u \sim \mathcal{U}}[\log D(y=0 \mid f(u))]. \qquad (5.3)$$

The discriminator $D$ is a multi-layer binary classifier optimized by $\mathcal{L}_D$. It tries to learn a non-linear classification boundary between the two datasets, while the embedding network needs to fool the discriminator by reducing the divergence between the distributions of $f(x)$ and $f(u)$. To see the effect of the domain alignment loss, we conduct a controlled experiment with a toy dataset. We split the MS-Celeb-1M [2
] dataset into labeled images and unlabeled images (no identity overlap). The unlabeled images are then processed with one of three degradations: random Gaussian noise, random occlusion, or downsampling. Thus, we create three sub-domains in the unlabeled dataset. The corresponding domain shift can be observed in the t-SNE plot in Fig. 5.3(a), where the model is trained only on the labeled split. Then, we incorporate the augmented unlabeled images into training with the binary domain adversarial loss. In Fig. 5.3(b), we observe that with the binary domain alignment loss, the distribution of each sub-domain is aligned with the original domain, indicating reduced domain gaps.

5.3.3 Minimizing Error in the Unlabeled Domains

The domain alignment loss in Section 5.3.2 helps to eliminate the error caused by domain gaps between unconstrained faces. The remaining task is thus to improve the discrimination power of the face representation among the unlabeled faces. Many semi-supervised classification methods address this problem by pseudo-labeling the unlabeled data [131, 137, 138], but this is not applicable to our problem, since our unlabeled dataset does not share the same label space with the labeled one. Furthermore, because of the data collection protocols, there is very little chance that one identity would have multiple unlabeled images. Thus, clustering-based methods are also infeasible for our task. Here, we consider addressing this issue with a multi-mode augmentation method.
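The CosFace identification loss of Eq. (5.1) and the binary domain-alignment losses of Eqs. (5.2)-(5.3) can be sketched in a few lines. This is a minimal NumPy illustration, not the actual PyTorch implementation; the function names and the convention that the discriminator output is the probability of the "unlabeled" class are our own assumptions for the sketch.

```python
import numpy as np

def cosface_loss(W, f, y, s=30.0, m=0.5):
    """CosFace loss (Eq. 5.1): cross-entropy over scaled cosine logits,
    with a margin m subtracted from the target-class cosine.
    W: (c x d) identity proxies, f: (n x d) embeddings, y: (n,) labels."""
    W = W / np.linalg.norm(W, axis=1, keepdims=True)   # l2-normalize proxies
    f = f / np.linalg.norm(f, axis=1, keepdims=True)   # l2-normalize embeddings
    logits = s * f @ W.T                               # s * cos(theta_j)
    logits[np.arange(len(y)), y] -= s * m              # margin on the target class
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()

def binary_domain_losses(d_labeled, d_unlabeled, eps=1e-8):
    """Binary domain-alignment losses (Eqs. 5.2-5.3). Inputs are the
    discriminator probabilities D(y=1 | f(.)) on labeled / unlabeled features:
    L_D trains D to separate the domains, L_adv trains f to fool it."""
    L_D = -(np.log(1 - d_labeled + eps).mean() + np.log(d_unlabeled + eps).mean())
    L_adv = -(np.log(d_labeled + eps).mean() + np.log(1 - d_unlabeled + eps).mean())
    return L_D, L_adv
```

In training, `L_D` would update only the discriminator and `L_adv` only the embedding network, alternating the two updates as in standard adversarial learning.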
Prior studies have shown that an image translation network, such as CycleGAN [139], can be effectively used as a data augmentation module for domain adaptation [140]. The main idea of the augmentation network is to learn the difference between two domains in the image space and then augment samples from the source domain to create training data with pseudo-labels in the target domain. Since our goal is to generalize the deep face representation to unconstrained faces, which involve a large variety of styles, a deterministic method such as CycleGAN would be unsuitable. Therefore, we propose to use a multi-mode image translation network that can discover the hidden domains in the unlabeled data and then augment the labeled training data with different styles. In particular, we need a function $G$ that maps labeled samples $x$ into the image space defined by the unlabeled faces, i.e. $p(x) \rightarrow p(u)$. Then, training the embedding $f$ on $G(x)$ makes it more discriminative in the image space defined by $\mathcal{U}$. There are two requirements on the function $G$: (1) it should not change the identity of the input image, and (2) it should be able to capture the different styles present in the unlabeled images. Inspired by recent progress in image translation frameworks [139, 141], we propose to train $G$ as a style-transfer network that learns the visual styles during transfer in an unsupervised manner. The network $G$ can then be used as a data-driven augmentation module that generates diverse samples given an input from the labeled dataset. During training, we randomly select a subset of the labeled images to be augmented and feed them into our identification learning framework. The details of training the augmentation network $G$ are given below in this section.

Figure 5.4: Training framework of the augmentation network $G$. The two pipelines are optimized jointly during training.

The overall loss function for the embedding network is given by:

$$\mathcal{L} = \lambda_{idt} \mathcal{L}_{idt} + \lambda_{adv} \mathcal{L}_{adv}, \qquad (5.4)$$

where $\mathcal{L}_{idt}$ also includes the augmented labeled samples.

Multi-mode Augmentation Network

The augmentation network $G$ is a fully convolutional network that maps one image to another. To preserve the geometric structure, our architecture does not involve any downsampling or upsampling.
In order to generate styles similar to the unlabeled images, an image discriminator $D_I$ is trained to distinguish between the texture styles of unlabeled images and generated images:

$$\mathcal{L}_{D_I} = -\mathbb{E}_{x \sim \mathcal{X}}[\log D_I(y=0 \mid G(x, z))] - \mathbb{E}_{u \sim \mathcal{U}}[\log D_I(y=1 \mid u)], \qquad (5.5)$$
$$\mathcal{L}_{G}^{adv} = -\mathbb{E}_{x \sim \mathcal{X}}[\log D_I(y=1 \mid G(x, z))]. \qquad (5.6)$$

Here $z \sim \mathcal{N}(0, I)$ is a random style vector that controls the style of the output image, and is injected into the generation process via Adaptive Instance Normalization (AdaIN) [142]. Although adversarial learning can ensure that the outputs lie in the unlabeled image space, it cannot ensure that (1) the content of the input is maintained in the output image and (2) the random style $z$ is actually used to generate diverse visual styles corresponding to the different sub-domains in the unlabeled images. We propose to utilize an additional reconstruction pipeline to simultaneously satisfy these two requirements. First, we introduce an additional style encoder $E_z$ to capture the corresponding style in the input image, as in [141]. A reconstruction loss is then enforced to keep the consistency of the image content:

$$\mathcal{L}_{G}^{rec} = \mathbb{E}_{x \sim \mathcal{X}}[\|x - G(x, E_z(x))\|^2] \qquad (5.7)$$
$$\qquad\qquad + \mathbb{E}_{u \sim \mathcal{U}}[\|u - G(u, E_z(u))\|^2]. \qquad (5.8)$$

Then, during the reconstruction, we add a latent style discriminator $D_z$ to guarantee that the distribution of $E_z(u)$ aligns with the prior distribution $\mathcal{N}(0, I)$:

$$\mathcal{L}_{D_z} = -\mathbb{E}_{u \sim \mathcal{U}}[\log D_z(y=0 \mid E_z(u))] - \mathbb{E}_{z \sim \mathcal{N}(0, I)}[\log D_z(y=1 \mid z)], \qquad (5.9)$$
$$\mathcal{L}_{z}^{adv} = -\mathbb{E}_{u \sim \mathcal{U}}[\log D_z(y=1 \mid E_z(u))]. \qquad (5.10)$$

The overall loss function of the generator is given by:

$$\mathcal{L}_{G} = \lambda_{G}^{adv} \mathcal{L}_{G}^{adv} + \lambda_{G}^{rec} \mathcal{L}_{G}^{rec} + \lambda_{z}^{adv} \mathcal{L}_{z}^{adv}. \qquad (5.11)$$

An overview of the training framework of $G$ is given in Fig. 5.4, and example generated images are shown in Fig. 5.5.

Figure 5.5: Example images generated by the augmentation network.

5.4 Experiments

5.4.1 Implementation Details

Training Details of the Recognition Model. All the models are implemented with PyTorch v1.1.
We use RetinaFace [143] for face detection and alignment. All images are resized to 112x112. A modified 50-layer ResNet from [38] is used as our architecture. The embedding size is 512 for all models. By default, all models are trained for 150,000 steps with a batch size of 256. For semi-supervised models, we use 64 unlabeled images and 192 labeled images in each mini-batch. For models which use the augmentation module, 20% of the labeled images are augmented by the generator network. The scale parameter $s$ and margin parameter $m$ are set to 30 and 0.5, respectively. We empirically set $\lambda_{idt}$ and $\lambda_{adv}$ to 1.0 and 0.01. For models that utilize consistency regularization, $\lambda_{CR}$ is set to 0.2. Random image translation, flipping, occlusion and downsampling are used as data perturbations for those models.

Training Details of the Generator Model. The generator is trained for 160,000 steps with a batch size of 8 images (4 from each dataset). The Adam optimizer is used with $\beta_1 = 0.5$ and $\beta_2 = 0.99$. The learning rate starts at 1e-4 and drops to 1e-5 after 80,000 steps. $\lambda_{G}^{adv}$, $\lambda_{G}^{rec}$ and $\lambda_{z}^{adv}$ are set to 1.0, 10.0 and 1.0, respectively. The architecture of the generator is based on MUNIT [141]. Let c5s1-k be a 5x5 convolutional layer with k filters and stride 1. dk-IN denotes a 3x3 convolutional layer with k filters and dilation 2, where IN means Instance Normalization [144]. Similarly, AdaIN means Adaptive Instance Normalization [142] and LN denotes Layer Normalization [145]. fc8 denotes a fully connected layer with 8 filters, and avgpool denotes a global average pooling layer. No normalization is used in the style encoder. We use LeakyReLU with slope 0.2 in the discriminator and the ReLU activation everywhere else. The architectures of the different modules are as follows:

- Style Encoder: c5s1-32, c3s2-64, c3s2-128, avgpool, fc8
- Generator: c5s1-32-IN, d32-IN, d32-AdaIN, d32-LN, d32-LN, c5s1-3
- Discriminator: c5s1-32, c3s2-64, c3s2-128

The length of the latent style code is set to 8. A style decoder (a multi-layer perceptron) with two hidden fully connected layers of 128 filters and no normalization transforms the latent style code into the parameters of the AdaIN layers.
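The AdaIN operation through which the style code enters the generator has a simple closed form [142]: each feature channel is normalized to zero mean and unit variance, then re-scaled and re-shifted by style-derived parameters. The sketch below illustrates that operation in NumPy; in our setup the parameters `gamma` and `beta` would come from the style decoder applied to the 8-dimensional latent code, but here they are passed in directly for illustration.

```python
import numpy as np

def adain(content, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization: per-channel normalization of
    `content` (C x H x W), followed by a style-specific affine transform.
    gamma, beta: length-C arrays produced by the style decoder."""
    mu = content.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    sigma = content.std(axis=(1, 2), keepdims=True)     # per-channel std
    normalized = (content - mu) / (sigma + eps)
    return gamma[:, None, None] * normalized + beta[:, None, None]
```

Because different style codes decode to different (gamma, beta) pairs, the same content image can be re-rendered with many output statistics, which is what makes the translation multi-mode rather than deterministic.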
5.4.2 Datasets

We use MS-Celeb-1M [2] as our labeled training dataset. For unlabeled images, we choose WiderFace [130] as our training data. WiderFace is a dataset collected by retrieving images from search engines with different event keywords. As a face detection dataset, it includes a much wider domain of photos and faces; many faces in this dataset still cannot be detected by state-of-the-art detection methods [143]. We only keep the detectable faces in the WiderFace training set as our training data. Our goal is to close the gap between face detection and recognition engines and improve recognition performance in a general setting with any detectable face. In the end, we were able to detect about 70K faces from WiderFace, less than 2% of the size of our labeled training data.

To evaluate the face representation models, we test on three benchmarks, namely IJB-B, IJB-C and IJB-S. Although our goal is to improve recognition performance on domains that are different from the training set, we do not want to lose discrimination power in the original domain (high-quality photos) either. Therefore, during ablation we also evaluate our models on the standard LFW [3] protocol, a celebrity photo dataset similar to the labeled training data (MS-Celeb-1M). Note that accuracy on the LFW protocol is highly saturated, so the main goal is simply to check whether there is a significant performance drop on constrained faces while increasing generalizability to unconstrained ones.

Table 5.1: Ablation study over different training methods of the embedding network. All models use the identification loss by default. DA and AN refer to Domain Alignment and Augmentation Network, respectively.

| Method      | IJB-C (Vrf) 1e-7 | 1e-6  | 1e-5  | IJB-C (Idt) Rank1 | Rank5 | IJB-S (V2S) Rank1 | Rank5 | LFW Accuracy |
|-------------|------------------|-------|-------|-------------------|-------|-------------------|-------|--------------|
| Baseline    | 62.90            | 82.94 | 90.73 | 94.90             | 96.77 | 53.23             | 62.91 | 99.80        |
| +DA         | 72.74            | 85.33 | 90.52 | 94.99             | 96.75 | 56.35             | 66.77 | 99.82        |
| +DA+AN (SM) | 74.80            | 87.58 | 91.94 | 95.51             | 97.09 | 56.98             | 65.66 | 99.80        |
| +DA+AN (MM) | 77.39            | 87.92 | 91.86 | 95.61             | 97.13 | 57.33             | 65.37 | 99.75        |
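As a side note on the training details of Section 5.4.1, the semi-supervised mini-batch composition (192 labeled plus 64 unlabeled images per batch, with 20% of the labeled images routed through the augmentation network) can be sketched as follows. The helper name and the tagging scheme are hypothetical, for illustration only; items stand in for (image, label) pairs.

```python
import random

def make_batch(labeled, unlabeled, n_labeled=192, n_unlabeled=64,
               aug_frac=0.2, rng=None):
    """Assemble one semi-supervised mini-batch: sample labeled and unlabeled
    items, and mark a random aug_frac of the labeled ones to be passed
    through the augmentation network G before the identification loss."""
    rng = rng or random.Random(0)
    batch_l = rng.sample(labeled, n_labeled)
    batch_u = rng.sample(unlabeled, n_unlabeled)
    n_aug = int(aug_frac * n_labeled)
    aug_idx = set(rng.sample(range(n_labeled), n_aug))   # labeled items to augment
    batch_l = [("augmented", x) if i in aug_idx else ("original", x)
               for i, x in enumerate(batch_l)]
    return batch_l, batch_u
```

In the actual pipeline, the non-augmented labeled features and the unlabeled features would additionally feed the feature discriminator of Section 5.3.2.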
5.4.3 Ablation Study

In this section, we conduct an ablation study to quantitatively evaluate the effect of the different modules proposed in this chapter. In particular, we have two modules to study: Domain Alignment (DA) and the Augmentation Network (AN). The performance is shown in Table 5.1. As already shown in Fig. 5.3, the domain adversarial loss is able to force smaller domain gaps between the sub-domains in WiderFace and the celebrity faces, even though we do not have access to those domain labels. Consequently, we observe performance improvements on most of the protocols of IJB-C and IJB-S. Introducing the augmentation network (AN) further helps to improve the performance on unconstrained benchmarks, where a multi-mode (MM) augmentation network outperforms a single-mode (SM) one.

We also ablate over the training modules of the augmentation network. In particular, we consider removing the following components for different variants: the latent style code for multi-mode generation (MM), the image discriminator ($D_I$), the reconstruction loss (Rec), the style discriminator ($D_z$) and the architecture without downsampling (ND). The qualitative results of the different models are shown in Fig. 5.6. Without the latent style code (Model a), the augmentation network can only output one deterministic image for each input, which mainly applies blurring to the input image. Without the image adversarial loss (Model b), the model cannot capture the realistic variations in the unlabeled dataset, and the style code can only change the color channels in this case. Without the reconstruction loss (Model c), the model is trained only with the adversarial loss and without the regularization of content preservation; therefore, we see clear artifacts in the output images. However, adding the reconstruction loss alone hardly helps, since the latent code used in the reconstruction of the unlabeled images could be very different from the prior distribution $p(z)$ that we use for generation. Therefore, similar artifacts can be observed if we do not add the latent code adversarial loss (Model d). As for the architecture, if we choose to use an encoder-decoder style network as in the original MUNIT [141], with downsampling and upsampling (Model e), we observe that the output images are always blurred due to the loss of spatial information. In contrast, with our architecture (Model f), the network is capable of augmenting images with diverse color, blurring and illumination styles, without clear artifacts.

Figure 5.6: Ablation study of the augmentation network. Input images are shown in the first column. The subsequent columns show the results of different models trained without a certain module or loss. The texture style codes are randomly sampled from the normal distribution.

Furthermore, we incorporate these different variants of the augmentation network into training and show the results in Table 5.2. The baseline model here only uses the domain alignment loss, without an augmentation network. Compared with this baseline, all of the different variants of the augmentation network achieve performance improvements, in spite of the artifacts in the generated images. But a more stable improvement across different evaluation protocols is observed for the proposed augmentation network. We also show more examples of augmented images in Figure 5.6.

Table 5.2: Ablation study over different training methods of the augmentation network. MM, $D_I$, Rec, $D_z$ and ND refer to the multi-mode latent style code, Image Discriminator, Reconstruction Loss, Style Discriminator and the no-downsampling architecture, respectively. The first row is a baseline that uses only the domain adversarial loss and no augmentation network; Model (a) is a single-mode translation network that does not use the latent style code.

| Model | Removed component        | IJB-C (Vrf) 1e-7 | 1e-6  | 1e-5  | IJB-C (Idt) Rank1 | Rank5 | IJB-S (V2S) Rank1 | Rank5 | LFW Accuracy |
|-------|--------------------------|------------------|-------|-------|-------------------|-------|-------------------|-------|--------------|
| -     | no augmentation network  | 72.74            | 85.33 | 90.52 | 94.99             | 96.75 | 56.35             | 66.77 | 99.82        |
| (a)   | latent style code (MM)   | 74.80            | 87.58 | 91.94 | 95.51             | 97.09 | 56.98             | 65.66 | 99.80        |
| (b)   | image discriminator      | 75.32            | 88.00 | 91.71 | 95.42             | 97.04 | 57.54             | 66.72 | 99.75        |
| (c)   | reconstruction loss      | 74.51            | 87.49 | 91.97 | 95.61             | 97.18 | 57.17             | 66.24 | 99.78        |
| (d)   | style discriminator      | 75.07            | 88.11 | 92.19 | 95.66             | 97.12 | 56.85             | 64.87 | 99.78        |
| (e)   | no-downsampling arch.    | 73.99            | 86.52 | 91.33 | 95.33             | 97.04 | 58.47             | 66.00 | 99.73        |
| (f)   | none (proposed)          | 77.39            | 87.92 | 91.86 | 95.61             | 97.13 | 57.33             | 65.37 | 99.75        |
5.4.4 Quantity vs. Diversity

Although we have shown in Sec. 5.4.3 that utilizing unlabeled data leads to better performance on challenging testing benchmarks, one would generally expect that simply increasing the amount of labeled training data could have a similar effect. Therefore, in this section we conduct a more detailed study to answer the question: which is more important for feature generalizability, the quantity or the diversity of the training data? In particular, we train several supervised models by adjusting the amount of labeled training data. For each such model, we also train a corresponding model with additional unlabeled data. The evaluation results are shown in Figure 5.7.

Figure 5.7: Evaluation results on IJB-C and IJB-S with different protocols and different amounts of labeled training data.

On the IJB-S dataset, which is significantly different from the labeled training data, we see that the models trained with unlabeled data consistently outperform the supervised baselines by a large margin. In particular, the proposed method achieves better performance than the supervised baseline even with only one-fourth of the overall labeled training data (1M vs. 4M), indicating the value of data diversity during training. Note that there is a significant performance boost when increasing the number of labeled samples from 0.5M to 1M. After that, however, the benefit of acquiring more labeled data plateaus, and in fact it is more helpful to introduce 70K unlabeled images than 3M additional labeled images.

On the IJB-C dataset, for both verification and identification protocols, we observe a similar trend as on IJB-S. In particular, larger improvements are achieved at lower FARs. This is because the verification threshold at lower FARs is affected by low-quality test data (difficult impostor pairs), which is more similar to our unlabeled data. Another interesting observation is that the improvement margin increases when there is more labeled data. In general semi-supervised learning, we would expect less improvement from unlabeled data when there is more labeled data; it is the opposite in our case because the unlabeled data has different characteristics than the labeled data.
So when the performance of the supervised model saturates with sufficient labeled data, transferring knowledge from diverse unlabeled data becomes more helpful.

For both IJB-S and IJB-C (TAR@FAR=1e-7), we observe that after a certain point, adding more labeled data no longer boosts performance, which instead starts to fluctuate. This happens because the new labeled data does not necessarily help with the hard cases. Based on these results, we conclude that when the amount of labeled training data is small, it is more important to increase the quantity of the labeled dataset; once there is sufficient labeled training data, the generalizability of the representation tends to saturate, and the diversity of the training data becomes more important.

Figure 5.8: Evaluation results on IJB-S, IJB-C and LFW with different protocols and different amounts and choices of unlabeled training data. The red line refers to the performance of the supervised baseline, which does not use any unlabeled data.

5.5 Choice of the Unlabeled Dataset

In Section 5.4.4, we discussed the impact of the quantity/diversity of training data on feature generalizability, where we conducted experiments by adjusting the number of labeled faces. Here, we extend the discussion with more experiments on the choice of the unlabeled dataset. In addition to the WiderFace dataset, we consider two other datasets: MegaFace [117] and CASIA-WebFace [50]. For MegaFace, we only use the distractor images in its identification protocol, which are crawled from album photos on Flickr and present a larger degree of variation compared with the faces in MS-Celeb-1M. CASIA-WebFace, similar to MS-Celeb-1M, is mainly composed of celebrity photos, and therefore should not introduce much additional diversity. Note that CASIA-WebFace is a labeled dataset, but we ignore its labels for this experiment. The diversity (facial variation) of the three datasets can be ranked as: WiderFace > MegaFace > CASIA-WebFace.
For both MegaFace and CASIA-WebFace, we choose a random subset to match the size of the WiderFace data. Furthermore, to see the impact of the quantity of unlabeled data, we also train models with different amounts of unlabeled data. Then, we evaluate all the models on IJB-S, IJB-C and LFW. The reason to evaluate on LFW here is to see the impact of the different unlabeled datasets on performance in the original domain. The results are shown in Figure 5.8. Note that due to the large number of experiments, we do not use the augmentation network here, but empirically we found the trends would be similar.

From Figure 5.8, it can be seen that, in general, the more diverse the unlabeled dataset is, the larger the performance boost it leads to. In particular, using CASIA-WebFace as the unlabeled dataset hardly improves performance on any protocol. This is expected because CASIA-WebFace is very similar to MS-Celeb-1M and hence cannot introduce additional diversity to regularize the training of the face representations. Using the MegaFace distractors as the unlabeled dataset improves performance on both IJB-C and IJB-S, both of which have more variation than MS-Celeb-1M. Using WiderFace as the unlabeled dataset further improves performance on the IJB-S dataset. Note that all the models in this experiment maintain high performance on the LFW dataset. In other words, using a more diverse unlabeled dataset does not deteriorate performance on the original domain and safely improves performance on the challenging new domains. An additional observation is that the size of the unlabeled dataset does not have a clear effect compared to its diversity.

5.5.1 Comparison with State-of-the-Art FR Methods

In Table 5.3 we show more complete results on the IJB-C dataset and compare our method with other state-of-the-art methods. In general, we observe that with fewer labeled training samples and

Table 5.3: Performance comparison with state-of-the-art methods on the IJB-C dataset.
| Method             | Data  | Model        | Vrf 1e-7 | 1e-6  | 1e-5  | 1e-4  | Idt Rank1 | Rank5 |
|--------------------|-------|--------------|----------|-------|-------|-------|-----------|-------|
| Cao et al. [44]    | 13.3M | SE-ResNet-50 | -        | -     | 76.8  | 86.2  | 91.4      | 95.1  |
| PFE [54]           | 4.4M  | ResNet-64    | -        | -     | 89.64 | 93.25 | 95.49     | 97.17 |
| ArcFace [38]       | 5.8M  | ResNet-50    | 67.40    | 80.52 | 88.36 | 92.52 | 93.26     | 95.33 |
| Ranjan et al. [146]| 5.6M  | ResNet-101   | 67.4     | 76.4  | 86.2  | 91.9  | 94.6      | 97.5  |
| AFRN [147]         | 3.1M  | ResNet-101   | -        | -     | 88.3  | 93.0  | 95.7      | 97.6  |
| Baseline           | 3.9M  | ResNet-50    | 62.90    | 82.94 | 90.73 | 94.57 | 94.90     | 96.77 |
| Proposed           | 4.0M  | ResNet-50    | 77.39    | 87.92 | 91.86 | 94.66 | 95.61     | 97.13 |

a smaller number of parameters, we are able to achieve state-of-the-art performance on most of the protocols. Particularly at low FARs, the proposed method outperforms the baseline methods by a good margin. This is because at a low FAR, the verification threshold is mainly determined by low-quality impostor pairs, which are instances of the difficult face samples that we target with the additional unlabeled data. A similar trend is observed on the IJB-B dataset (Table 5.4). Note that because of the smaller number of face pairs, we are only able to test at higher FARs on IJB-B.

Table 5.4: Performance comparison with state-of-the-art methods on the IJB-B dataset.

| Method             | Data  | Model        | Vrf 1e-6 | 1e-5  | 1e-4  | 1e-3  | Idt Rank1 | Rank5 |
|--------------------|-------|--------------|----------|-------|-------|-------|-----------|-------|
| Cao et al. [44]    | 13.3M | SE-ResNet-50 | -        | 70.5  | 83.1  | 90.8  | 90.2      | 94.6  |
| Comparator [94]    | 3.3M  | ResNet-50    | -        | -     | 84.9  | 93.7  | -         | -     |
| ArcFace [38]       | 5.8M  | ResNet-50    | 40.77    | 84.28 | 91.66 | 94.81 | 92.95     | 95.60 |
| Ranjan et al. [146]| 5.6M  | ResNet-101   | 48.4     | 80.4  | 89.8  | 94.4  | 93.3      | 96.6  |
| AFRN [147]         | 3.1M  | ResNet-101   | -        | 77.1  | 88.5  | 94.9  | 97.3      | 97.6  |
| Baseline           | 3.9M  | ResNet-50    | 40.12    | 84.38 | 92.79 | 95.90 | 93.85     | 96.55 |
| Proposed           | 4.0M  | ResNet-50    | 43.38    | 88.19 | 92.78 | 95.86 | 94.62     | 96.72 |

In Table 5.5 we show the results on two different protocols of IJB-S. Both the Surveillance-to-Still (V2S) and Surveillance-to-Booking (V2B) protocols use surveillance videos as probes and mugshots as the gallery. Therefore, the IJB-S results represent a cross-domain comparison problem. Overall, the proposed system achieves new state-of-the-art performance on both protocols.
Table 5.5: Performance on the IJB-S benchmark.

| Method       | S2S Rank1 | Rank5 | Rank10 | 1%    | 10%   | S2B Rank1 | Rank5 | Rank10 | 1%    | 10%   |
|--------------|-----------|-------|--------|-------|-------|-----------|-------|--------|-------|-------|
| MARN [148]   | 58.14     | 64.11 | -      | 21.47 | -     | 59.26     | 65.93 | -      | 32.07 | -     |
| PFE [54]     | 50.16     | 58.33 | 62.28  | 31.88 | 35.33 | 53.60     | 61.75 | 62.97  | 35.99 | 39.82 |
| ArcFace [38] | 50.39     | 60.42 | 64.74  | 32.39 | 42.99 | 52.25     | 61.19 | 65.63  | 34.87 | 43.50 |
| Baseline     | 53.23     | 62.91 | 67.83  | 31.88 | 43.32 | 54.26     | 64.18 | 69.26  | 32.39 | 44.32 |
| Proposed     | 59.29     | 66.91 | 69.63  | 39.92 | 50.49 | 60.58     | 67.70 | 70.63  | 40.80 | 50.31 |

5.6 Conclusions

In this chapter, we have proposed a semi-supervised framework for learning a robust face representation that generalizes to unconstrained faces beyond the labeled training data. Without collecting domain-specific data, we utilized a relatively small unlabeled dataset containing diverse styles of face images. To fully utilize the unlabeled dataset, two methods are proposed. First, we showed that domain adversarial learning, which is common in adaptation methods, can be applied in our setting to reduce domain gaps between labeled faces and hidden sub-domains. Second, we proposed an augmentation network that captures different visual styles in the unlabeled dataset and applies them to the labeled images during training, making the face representation more discriminative for unconstrained faces. Our experimental results show that as the number of labeled images increases, the performance of the supervised baseline tends to saturate on the challenging testing scenarios; introducing more diverse training data then becomes more important and helpful. In a few challenging protocols, we showed that the proposed method can outperform the supervised baseline with less than half of the labeled data. By training on the labeled MS-Celeb-1M dataset and the unlabeled WiderFace dataset, our final model achieves state-of-the-art performance on challenging benchmarks.

Chapter 6

Summary

In this thesis, we first review the history of the face recognition problem and its solutions. The recognition pipeline includes three steps: normalization, feature learning and similarity metric.
We show that existing methods in each step of this pipeline face certain challenges when applied to real-world face recognition scenarios. Thus, four methods are proposed to improve these steps. First, to handle large pose variation, an attention module is proposed to automatically localize salient facial areas. In contrast to conventional methods that normalize faces by transformation, the proposed method does not explicitly transform the input image. Instead, it automatically discovers salient facial areas and incorporates their information into the global face representation. Second, we propose a new type of face representation, namely probabilistic face embeddings (PFEs). We show that by converting deterministic face embeddings into PFEs, we not only achieve better interpretability and safety control, but also boost recognition performance by incorporating data uncertainty into the similarity metric. Third, for feature extraction, we found that a conventional deep learning framework can suffer from data bias if we simply introduce more variation to augment the training data. Thus, we propose a universal learning framework that decouples the feature embeddings during training to reduce the negative impact of different augmentations on each other. During testing, these decoupled features are combined under the uncertainty framework to handle different types of variations. However, such a learning framework is still limited by manually designed facial variations, which could differ from the data distribution of unconstrained faces in real-world applications. Finally, we propose a semi-supervised learning framework, which utilizes an auxiliary unlabeled dataset to regularize the embedding model during training. We use a generative model to automatically discover the latent styles within the unlabeled dataset and transfer them to augment the labeled images. We then combine the regularization in both the feature and image spaces to build a more generalizable face embedding that boosts unconstrained face recognition performance.

6.1 Contributions

The main contributions of this thesis are as follows:

1.
A spatial transformer-based attention module that automatically detects salient facial regions to extract local features. The attention module could serve as an alternative to complicated normalization techniques for reducing the variations in face images. Further, it could help to discover discriminative local features.

2. A framework that combines multiple region attention modules to extract local features and incorporates them into the global facial representation. Experimental results on unconstrained face databases show that the method effectively boosts performance, and the performance further increases when additional region attention modules are incorporated into the framework.

3. A new type of face representation that takes feature uncertainty into account. We show that deterministic embeddings, which are used in almost all ongoing studies on face recognition, suffer from a feature ambiguity dilemma, which cannot be solved by increasing the model size or augmenting the training data. Instead, we propose to convert pre-trained deep face representations into PFEs by representing each face image as a distribution in the latent space. The probabilistic embedding has better interpretability and can be used as a quality assessment method to control the enrollment of face images.

4. A probabilistic framework that effectively utilizes data uncertainty to combine and compare different PFEs to improve face recognition performance. Evaluation results on unconstrained face recognition benchmarks show that the method consistently improves recognition performance compared to conventional deterministic embeddings.

5. A universal feature learning framework that learns a set of decoupled face representations. A confidence-controlled face identification loss and a variation-based decoupling loss are proposed in the feature learning process to effectively handle different types of variations in the training data. Experiments show that conventional approaches could suffer from new variations added to the training data, while the proposed method incrementally enhances the feature representations when additional types of variations are introduced.

6.
By combining the universal face representation framework and PFEs, the proposed method achieves state-of-the-art performance on several challenging recognition benchmarks, including IJB-C, TinyFace, and IJB-S.

7. A semi-supervised learning framework for generalizing face representations with unlabeled data, in which the representation learning method applies joint regularization in both the image and feature domains.

8. A multi-mode image translation module that performs data-driven augmentation to increase the diversity of the labeled training samples.

9. Empirical results showing that the regularization with unlabeled data helps to improve recognition performance on unconstrained testing datasets.

6.2 Suggestions for Future Work

Some of the ongoing and possible future directions within the scope of robust unconstrained deep face recognition are as follows:

• Uncertainty-aware Representation Learning. In Chapter 3, we propose an uncertainty-aware face representation, i.e., PFE, to boost face recognition performance by incorporating uncertainty information into the face comparison process. However, the issue of data uncertainty exists in the learning process as well. Thus, another direction worth exploring is to study whether modeling data uncertainty could accelerate the training of face embeddings.

• Domain Generalization. In Chapter 4 and Chapter 5, we use manually designed transformations and an unlabeled dataset to generalize supervised models, respectively. Another option is to combine several heterogeneous labeled datasets from different sources to train a more generalizable model. Such a framework is known as domain generalization. Although we currently do not have access to large-scale face datasets with clear domain gaps, we believe it would be an interesting research direction to explore if one could collect such datasets.
• Self-supervised Learning on Unlabeled Data. In Chapter 5, we showed that diversity is more important than the sheer amount of unlabeled data. As such, we believe there is still room, in terms of methodology, to further exploit larger sets of unlabeled data to boost performance. A possible direction is to apply self-supervised learning to the unlabeled data, an approach that has recently been shown to be successful on image classification tasks.

APPENDIX

PUBLICATIONS

[1] D. Deb, S. Wiper, S. Gong, Y. Shi, C. Tymoszek, A. Fletcher, and A. K. Jain, "Face recognition: Primates in the wild," in 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), 2019.

[2] Y. Shi, C. Otto, and A. K. Jain, "Face clustering: representation and pairwise constraints," IEEE Transactions on Information Forensics and Security, 2018.

[3] Y. Shi and A. Jain, "Improving face recognition by exploring local features with visual attention," in ICB, 2018.

[4] Y. Shi and A. K. Jain, "DocFace: Matching ID document photos to selfies," in 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), 2019.

[5] Y. Shi and A. K. Jain, "DocFace+: ID document to selfie matching," IEEE Transactions on Biometrics, Behavior, and Identity Science, 2019.

[6] Y. Shi, D. Deb, and A. K. Jain, "WarpGAN: Automatic caricature generation," in CVPR, 2019.

[7] S. Gong, Y. Shi, N. D. Kalka, and A. K. Jain, "Video face recognition: Component-wise feature aggregation network (C-FAN)," in ICB, 2019.

[8] Y. Shi and A. K. Jain, "Probabilistic face embeddings," in ICCV, 2019.

[9] S. Gong, Y. Shi, and A. Jain, "Low quality video face recognition: Multi-mode aggregation recurrent network (MARN)," in CVPR Workshops, 2019.

[10] Y. Shi, X. Yu, K. Sohn, M. Chandraker, and A. K. Jain, "Towards universal representation learning for deep face recognition," in CVPR, 2020.

[11] Y. Shi and A. K. Jain, "Boosting unconstrained face recognition with auxiliary unlabeled data," in CVPR Workshops, 2021.

[12] Y. Shi, D. Aggarwal, and A. K. Jain, "Lifting 2D StyleGAN for 3D-aware face generation," in CVPR, 2021.

BIBLIOGRAPHY

[1] N. D. Kalka, B. Maze, J. A. Duncan, K. J. O'Connor, S. Elliott, K. Hebert, J. Bryan, and A. K. Jain, "IJB-S: IARPA Janus Surveillance Video Benchmark," in BTAS, 2018.
[2] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, "MS-Celeb-1M: A dataset and benchmark for large-scale face recognition," in ECCV, 2016.

[3] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments," Tech. Rep. 07-49, University of Massachusetts, Amherst, October 2007.

[4] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs, "Frontal to profile face verification in the wild," in WACV, 2016.

[5] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain, "Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A," in CVPR, 2015.

[6] Z. Cheng, X. Zhu, and S. Gong, "Low-resolution face recognition," in ACCV, 2018.

[7] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, 2016.

[8] S. Liao, Z. Lei, D. Yi, and S. Z. Li, "A benchmark study of large-scale unconstrained face recognition," in IJCB, 2014.

[9] "LAX trialling facial recognition and advanced imaging technology at security." https://www.futuretravelexperience.com/2018/02/lax-trialling-facial-recognition-and-advanced-imaging-technology-at-security/, 2018.

[10] "iPhone's Face ID isn't perfect, but you can make it better." https://www.cnet.com/how-to/iphones-face-id-problems-tricks-tips/, 2018.

[11] "China's Alipay adds sought-after beauty filters to face-scan payments." https://anith.com/chinas-alipay-adds-sought-after-beauty-filters-to-face-scan-payments-techcrunch/, 2019.

[12] "Please run Australia's facial recognition surveillance system on the ATO-san." https://www.zdnet.com/article/please-run-australias-facial-recognition-surveillance-system-on-the-ato-san/, 2019.

[13] "How facial recognition is taking over airports." https://www.cnn.com/travel/article/airports-facial-recognition/index.html, 2019.

[14] "How facial recognition is fighting child sex trafficking." https://www.wired.com/story/how-facial-recognition-fighting-child-sex-trafficking, 2019.

[15] "One billion surveillance cameras will be watching around the world in 2021, a new study says."
https://www.cnbc.com/2019/12/06/one-billion-surveillance-cameras-will-be-watching-globally-in-2021.html, 2019.

[16] T. Kanade, Picture Processing by Computer Complex and Recognition of Human Faces. PhD thesis, Kyoto University, 1973.

[17] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, 1995.

[18] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in CVPR, 1991.

[19] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Trans. on PAMI, vol. 19, no. 7, 1997.

[20] D. G. Lowe, "Object recognition from local scale-invariant features," in ICCV, 1999.

[21] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Trans. on PAMI, vol. 28, 2006.

[22] D. Chen, X. Cao, F. Wen, and J. Sun, "Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification," in CVPR, 2013.

[23] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, "Bayesian face revisited: A joint formulation," in ECCV, 2012.

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NeurIPS, 2012.

[25] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NeurIPS, 2015.

[26] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. on PAMI, 2017.

[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in CVPR, 2014.

[28] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in CVPR, 2014.

[29] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in NIPS, 2014.

[30] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in CVPR, 2015.

[31] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust," in CVPR, 2015.
[32] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in ECCV, 2016.

[33] K. Sohn, "Improved deep metric learning with multi-class n-pair loss objective," in NIPS, 2016.

[34] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep hypersphere embedding for face recognition," in CVPR, 2017.

[35] F. Wang, W. Liu, H. Liu, and J. Cheng, "Additive margin softmax for face verification," arXiv:1801.05599, 2018.

[36] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu, "CosFace: Large margin cosine loss for deep face recognition," in CVPR, 2018.

[37] R. Ranjan, C. D. Castillo, and R. Chellappa, "L2-constrained softmax loss for discriminative face verification," arXiv:1703.09507, 2017.

[38] J. Deng, J. Guo, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in CVPR, 2019.

[39] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Generative models for recognition under variable pose and illumination," in FG, 2000.

[40] "IARPA Janus program." https://www.iarpa.gov/index.php/research-programs/janus.

[41] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A. K. Jain, J. A. Duncan, K. Allen, et al., "IARPA Janus Benchmark-B face dataset," in CVPR Workshops, 2017.

[42] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, et al., "IARPA Janus Benchmark-C: Face dataset and protocol," in ICB, 2018.

[43] D. Wang, C. Otto, and A. K. Jain, "Face search at scale," IEEE Trans. on PAMI, vol. 39, 2016.

[44] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A dataset for recognising faces across pose and age," in FG, 2018.

[45] J.-C. Chen, V. M. Patel, and R. Chellappa, "Unconstrained face verification using deep CNN features," in WACV, 2016.

[46] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, "NormFace: L2 hypersphere embedding for face verification," ACM MM, 2017.

[47] L. Tran, X. Yin, and X. Liu, "Disentangled representation learning GAN for pose-invariant face recognition," in CVPR, 2017.

[48] X. Yin and X. Liu, "Multi-task convolutional neural network for pose-invariant face recognition," IEEE Trans. on Image Processing, 2017.
[49] J. Zhao, L. Xiong, Y. Cheng, Y. Cheng, J. Li, L. Zhou, Y. Xu, J. Karlekar, S. Pranata, S. Shen, et al., "3D-aided deep pose-invariant face recognition," in IJCAI, 2018.

[50] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv:1411.7923, 2014.

[51] L. Wolf, T. Hassner, and I. Maoz, "Face recognition in unconstrained videos with matched background similarity," in CVPR, 2011.

[52] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, "The MegaFace benchmark: 1 million faces for recognition at scale," in CVPR, 2016.

[53] Y. Shi and A. Jain, "Improving face recognition by exploring local features with visual attention," in ICB, 2018.

[54] Y. Shi and A. K. Jain, "Probabilistic face embeddings," in ICCV, 2019.

[55] Y. Shi, X. Yu, K. Sohn, M. Chandraker, and A. K. Jain, "Towards universal representation learning for deep face recognition," in CVPR, 2020.

[56] Y. Shi and A. K. Jain, "Boosting unconstrained face recognition with auxiliary unlabeled data," in CVPR Workshops, 2021.

[57] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al., "Deep face recognition," in BMVC, 2015.

[58] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. on PAMI, vol. 32, no. 9, 2010.

[59] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, "Face detection without bells and whistles," in ECCV, 2014.

[60] J. Yan, X. Zhang, Z. Lei, and S. Z. Li, "Face detection by structural models," Image and Vision Computing, vol. 32, no. 10, 2014.

[61] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "From facial parts responses to face detection: A deep learning approach," in ICCV, 2015.

[62] J. Ba, V. Mnih, and K. Kavukcuoglu, "Multiple object recognition with visual attention," arXiv:1412.7755, 2014.

[63] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.

[64] J. Fu, H. Zheng, and T. Mei, "Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition," in CVPR, 2017.

[65] M. Jaderberg, K. Simonyan, A. Zisserman, et al., "Spatial transformer networks," in NIPS, 2015.
[66] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, "The application of two-level attention models in deep convolutional neural network for fine-grained image classification," in CVPR, 2015.

[67] Y. Zhong, J. Chen, and B. Huang, "Toward end-to-end face recognition through alignment learning," IEEE Signal Processing Letters, vol. 24, no. 8, 2017.

[68] A. Hasnat, J. Bohné, J. Milgram, S. Gentric, and L. Chen, "DeepVisage: Making face recognition simple yet with powerful generalization skills," in ICCV, 2017.

[69] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.

[70] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.

[71] Y. Guo and L. Zhang, "One-shot face recognition by promoting underrepresented classes," arXiv:1707.05574, 2017.

[72] A. Kendall, V. Badrinarayanan, and R. Cipolla, "Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding," in BMVC, 2015.

[73] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in ICML, 2016.

[74] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?," in NIPS, 2017.

[75] D. J. MacKay, "A practical Bayesian framework for backpropagation networks," Neural Computation, 1992.

[76] R. M. Neal, Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.

[77] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in ICLR, 2013.

[78] S. Gong, V. N. Boddeti, and A. K. Jain, "On the capacity of face representation," arXiv:1709.10433, 2017.

[79] S. Khan, M. Hayat, W. Zamir, J. Shen, and L. Shao, "Striking the right balance with uncertainty," arXiv:1901.07590, 2019.

[80] U. Zafar, M. Ghafoor, T. Zia, G. Ahmed, A. Latif, K. R. Malik, and A. M. Sharif, "Face recognition with Bayesian convolutional networks for robust surveillance systems," EURASIP Journal on Image and Video Processing, 2019.

[81] Y. Xu, X. Fang, X. Li, J. Yang, J. You, H. Liu, and S. Teng, "Data uncertainty in face recognition," IEEE Trans. on Cybernetics, 2014.

[82] G. Shakhnarovich, J. W. Fisher, and T. Darrell, "Face recognition from long-term observations," in ECCV, 2002.
[83] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell, "Face recognition with image sets using manifold density divergence," in CVPR, 2005.

[84] H. Cevikalp and B. Triggs, "Face recognition based on image sets," in CVPR, 2010.

[85] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen, "Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification," in ICML, 2015.

[86] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang, "Probabilistic elastic matching for pose variant face verification," in CVPR, 2013.

[87] P. Hiremath, A. Danti, and C. Prabhakar, "Modelling uncertainty in representation of facial features for face recognition," in Face Recognition, 2007.

[88] J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, and G. Hua, "Neural aggregation network for video face recognition," in CVPR, 2017.

[89] Y. Liu, J. Yan, and W. Ouyang, "Quality aware network for set to set recognition," in CVPR, 2017.

[90] W. Xie and A. Zisserman, "Multicolumn networks for face recognition," in ECCV, 2018.

[91] S. Gong, Y. Shi, and A. K. Jain, "Video face recognition: Component-wise feature aggregation network (C-FAN)," in ICB, 2019.

[92] X. Yin and X. Liu, "Multi-task convolutional neural network for pose-invariant face recognition," IEEE Trans. on Image Processing, 2018.

[93] B. Yin, L. Tran, H. Li, X. Shen, and X. Liu, "Towards interpretable face recognition," arXiv:1805.00611, 2018.

[94] W. Xie, L. Shen, and A. Zisserman, "Comparator networks," in ECCV, 2018.

[95] X. Wu, R. He, Z. Sun, and T. Tan, "A light CNN for deep face representation with noisy labels," IEEE Trans. on Information Forensics and Security, 2015.

[96] K. Sohn, S. Liu, G. Zhong, X. Yu, M.-H. Yang, and M. Chandraker, "Unsupervised domain adaptation for distance metric learning," in CVPR, 2019.

[97] K. Sohn, W. Shang, X. Yu, and M. Chandraker, "Unsupervised domain adaptation for distance metric learning," in ICLR, 2019.

[98] I. Masi, A. T. Tran, T. Hassner, J. T. Leksut, and G. Medioni, "Do we really need to collect millions of faces for effective face recognition?," in ECCV, 2016.

[99] X. Peng, X. Yu, K. Sohn, D. Metaxas, and M. Chandraker, "Reconstruction-based disentanglement for pose-invariant face recognition," in ICCV, 2017.
[100] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, "Feature transfer learning for face recognition with under-represented data," in CVPR, 2019.

[101] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa, "Triplet probabilistic embedding for face verification and clustering," in BTAS, 2016.

[102] I. Masi, S. Rawls, G. Medioni, and P. Natarajan, "Pose-aware face recognition in the wild," in CVPR, 2016.

[103] H. Bilen and A. Vedaldi, "Universal representations: The missing link between faces, text, planktons, and cat breeds," arXiv:1701.07275, 2017.

[104] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, "Learning multiple visual domains with residual adapters," in NIPS, 2017.

[105] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, "Efficient parametrization of multi-domain deep neural networks," in CVPR, 2018.

[106] X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, "Towards universal object detection by domain attention," in CVPR, 2019.

[107] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba, "Undoing the damage of dataset bias," in ECCV, 2012.

[108] K. Muandet, D. Balduzzi, and B. Schölkopf, "Domain generalization via invariant feature representation," in ICML, 2013.

[109] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, "Deeper, broader and artier domain generalization," in ICCV, 2017.

[110] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, "Learning to generalize: Meta-learning for domain generalization," in AAAI, 2018.

[111] Y. Tamaazousti, H. Le Borgne, C. Hudelot, M. E. A. Seddik, and M. Tamaazousti, "Learning more universal representations for transfer-learning," IEEE Trans. on PAMI, 2019.

[112] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in CVPR, 2018.

[113] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in ICML, 2017.

[114] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in ICCV, 2015.

[115] X. Yu, F. Zhou, and M. Chandraker, "Deep deformation network for object landmark localization," in ECCV, 2016.

[116] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou, "Joint 3D face reconstruction and dense alignment with position map regression network," in ECCV, 2018.

[117] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, "The MegaFace benchmark: 1 million faces for recognition at scale," in CVPR, 2016.
[118] H.-W. Ng and S. Winkler, "A data-driven approach to cleaning large face datasets," in ICIP, 2014.

[119] J. Zhao, L. Xiong, J. Li, J. Xing, S. Yan, and J. Feng, "3D-aided dual-agent GANs for unconstrained face recognition," IEEE Trans. on PAMI, 2018.

[120] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, "Domain generalization for object recognition with multi-task autoencoders," in ICCV, 2015.

[121] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto, "Unified deep supervised domain adaptation and generalization," in ICCV, 2017.

[122] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot, "Domain generalization with adversarial feature learning," in CVPR, 2018.

[123] F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi, "Domain generalization by solving jigsaw puzzles," in CVPR, 2019.

[124] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Trans. on Neural Networks, 2010.

[125] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in ICML, 2015.

[126] M. Long, H. Zhu, J. Wang, and M. I. Jordan, "Deep transfer learning with joint adaptation networks," in ICML, 2017.

[127] K. Sohn, W. Shang, X. Yu, and M. Chandraker, "Unsupervised domain adaptation for face recognition in unlabeled videos," in CVPR, 2017.

[128] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, "Maximum classifier discrepancy for unsupervised domain adaptation," in CVPR, 2018.

[129] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann, "Contrastive adaptation network for unsupervised domain adaptation," in CVPR, 2019.

[130] S. Yang, P. Luo, C. C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," in CVPR, 2016.

[131] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in ICML Workshop, 2013.

[132] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, "Semi-supervised learning with ladder networks," in NeurIPS, 2015.

[133] S. Laine and T. Aila, "Temporal ensembling for semi-supervised learning," in ICLR, 2017.

[134] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in NeurIPS, 2017.
[135] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, "Unsupervised data augmentation for consistency training," arXiv:1904.12848, 2019.

[136] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer, "S4L: Self-supervised semi-supervised learning," in ICCV, 2019.

[137] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, "MixMatch: A holistic approach to semi-supervised learning," in NeurIPS, 2019.

[138] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," arXiv:2001.07685, 2020.

[139] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.

[140] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, "CyCADA: Cycle-consistent adversarial domain adaptation," in ICML, 2018.

[141] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation," in ECCV, 2018.

[142] X. Huang and S. J. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in ICCV, 2017.

[143] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou, "RetinaFace: Single-stage dense face localisation in the wild," arXiv:1905.00641, 2019.

[144] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv:1607.08022, 2016.

[145] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv:1607.06450, 2016.

[146] R. Ranjan, A. Bansal, J. Zheng, H. Xu, J. Gleason, B. Lu, A. Nanduri, J.-C. Chen, C. D. Castillo, and R. Chellappa, "A fast and accurate system for face detection, identification, and verification," IEEE Trans. on Biometrics, Behavior, and Identity Science, 2019.

[147] B.-N. Kang, Y. Kim, B. Jun, and D. Kim, "Attentional feature-pair relation networks for accurate face recognition," in ICCV, 2019.

[148] S. Gong, Y. Shi, and A. Jain, "Low quality video face recognition: Multi-mode aggregation recurrent network (MARN)," in ICCV Workshops, 2019.