FACE ANTI-SPOOFING: DETECTION, GENERALIZATION, AND VISUALIZATION

By

Yaojie Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2021

ABSTRACT

FACE ANTI-SPOOFING: DETECTION, GENERALIZATION, AND VISUALIZATION

By

Yaojie Liu

Face anti-spoofing is the process of distinguishing genuine faces and face presentation attacks: attackers presenting spoofing faces (e.g., photograph, digital screen, and mask) to the face recognition system and attempting to be authenticated as the genuine user. In recent years, face anti-spoofing has drawn increasing attention from the vision community, as it is a crucial step in preventing face recognition systems from a security breach. Previous approaches formulate face anti-spoofing as a binary classification problem, and many of them struggle to generalize to different conditions (such as pose, lighting, expressions, camera sensors, and unknown spoof types). Moreover, those methods work as a black box and cannot provide interpretation or visualization of their decisions. To address those challenges, we investigate face anti-spoofing in 3 stages: detection, generalization, and visualization. In the detection stage, we learn a CNN-RNN model to estimate the auxiliary tasks of face depth and rPPG signal estimation, which can bring additional knowledge for spoof detection. In the generalization stage, we investigate the detection of unknown spoof attacks and propose a novel Deep Tree Network (DTN) to well represent the unknown spoof attacks. In the visualization stage, we find "spoof trace", the subtle image pattern in spoof faces (e.g., color distortion, 3D mask edge, and Moire pattern), is effective to explain why a spoof is a spoof. We provide a proper physical modeling of the spoof traces and design a generative model to disentangle the spoof traces from input faces. In addition, we also show that a proper physical modeling can benefit other face problems, such as face shadow detection and removal. A proper shadow modeling can not only detect the shadow region effectively, but also remove the shadow in a visually plausible manner.

ACKNOWLEDGMENTS

Throughout preparing this dissertation I have received a great deal of support and assistance.
First of all, I would like to express my sincere gratitude to my advisor, Prof. Xiaoming Liu, for his invaluable advice, continuous support, and patience during my PhD study. Your knowledge, experience, and enthusiasm motivate me to improve my academic research as well as life planning. I would like to thank Prof. Arun Ross for his lead on the project. Your lead provides me with clear directions and great support in conducting research and achieving project goals. I would also like to thank the rest of my thesis committee: Prof. Anil Jain and Prof. Daniel Morris, for your insightful comments and suggestions, which push me to work on a broader research impact. It's my honor to have you as my committee.

My sincere thanks also goes to all my co-authors, Dr. Amin Jourabloo, Dr. Yousef Atoum, Joel Stehouwer, and Xiaohong Liu, for working with me on all these exciting research projects and holding on tight in catching those deadlines. I also would like to thank my mentors, Dr. Barry Theobald and Dr. Nicholas Apostoloff from Apple, and Dr. Xinyu Huang and Dr. Liu Ren from Bosch, for offering me the summer internship opportunities and leading me to work on diverse exciting projects. Special shout-out to Christopher Perry. I really enjoy working with you on the project and appreciate your skills and help on the face PAD solutions.

I thank my fellow labmates in the Computer Vision Lab: Joseph Roth, Xi Yin, Luan Tran, Ying Tai, Bangjie Yin, Ziyuan Zhang, Garrick Brazil, Feng Liu, Shengjie Zhu, Masa Hu, Andrew Hou, Abhinav Kumar, Vishal Asnani, and Xiao Guo, for sharing and exchanging knowledge and opinions, and for all the fun we have had in all the activities. Also I thank my friends: Jialin Liu, Zhiwei Wang, Joshua Engelsma, Xue Jiang, He Zhang, Jia Xue, and the list goes on and on.

Special thanks to my girlfriend Yufeng Wang and my dog Bagel the beagle, for enlightening my life and giving me emotional support.

Last but not the least, I would like to thank my parents for your unconditional love and sacrifice. Thousands of words are not even enough to express my gratitude and love to you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
Chapter 1  Introduction to Face Anti-Spoofing
  1.1 Introduction
  1.2 Overview of the thesis
    1.2.1 Contributions of the thesis

Chapter 2  Detection: Face Anti-Spoofing with Auxiliary Supervisions
  2.1 Introduction
  2.2 Prior Work
  2.3 Face Anti-Spoofing with Deep Network
    2.3.1 Depth Map Supervision
    2.3.2 rPPG Supervision
    2.3.3 Network Architecture
      2.3.3.1 CNN Network
      2.3.3.2 RNN Network
      2.3.3.3 Implementation Details
    2.3.4 Non-rigid Registration Layer
  2.4 Collection of Face Anti-Spoofing Database
  2.5 Experimental Results
    2.5.1 Experimental Setup
    2.5.2 Experimental Comparison
      2.5.2.1 Ablation Study
      2.5.2.2 Intra Testing
      2.5.2.3 Cross Testing
      2.5.2.4 Visualization and Analysis
  2.6 Conclusions

Chapter 3  Generalization: Zero-shot and Open-set Face Anti-Spoofing
  3.1 Introduction
  3.2 Prior Work
  3.3 Deep Tree Network for ZSFA
    3.3.1 Unsupervised Tree Learning
      3.3.1.1 Node Routing Function
      3.3.1.2 Tree of Known Spoofs
    3.3.2 Supervised Feature Learning
    3.3.3 Network Architecture
  3.4 Spoof in the Wild Database with Multiple Attack Types
  3.5 Experimental Results
    3.5.1 Experimental Setup
    3.5.2 Experimental Comparison
      3.5.2.1 Ablation Study
      3.5.2.2 Testing on existing databases
      3.5.2.3 Testing on SiW-M
      3.5.2.4 Visualization and Analysis
  3.6 Conclusions

Chapter 4  Visualization: Disentangling Spoof Traces with Physical Modeling
  4.1 Introduction
  4.2 Related Work
  4.3 Physics-based Spoof Trace Disentanglement
    4.3.1 Problem Formulation
    4.3.2 Disentanglement Generator
    4.3.3 Reconstruction and Synthesis
      4.3.3.1 Online 3D Warping Layer
    4.3.4 Multi-scale Discriminators
    4.3.5 Loss Functions and Training Steps
  4.4 Experiments
    4.4.1 Experimental Setup
    4.4.2 Anti-Spoofing for Known Spoof Types
    4.4.3 Anti-Spoofing for Unknown and Open-set Spoofs
    4.4.4 Spoof Trace Classification
    4.4.5 Ablation Study
    4.4.6 Visualization
  4.5 Conclusions

Chapter 5  Visualization: Blind Removal of Facial Foreign Shadow
  5.1 Introduction
  5.2 Related Work
  5.3 Proposed Method
    5.3.1 Shadow synthesis and modeling
    5.3.2 Grayscale shadow removal
    5.3.3 Colorization
    5.3.4 Temporal information sharing
    5.3.5 Training
  5.4 Training and Evaluation Data
  5.5 Experiment
    5.5.1 Experimental setup
    5.5.2 Shadow removal and segmentation
    5.5.3 Ablation Studies
  5.6 Conclusion

Chapter 6  Conclusions and Future Work
  6.1 Future Works

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1  The term definitions used in this work.
Table 2.1  The comparison of our collected SiW dataset with existing datasets for face anti-spoofing.
Table 2.2  TDR at different FDRs, cross testing on Oulu Protocol 1.
Table 2.3  ACER of our method at different N_f, on Oulu Protocol 2.
Table 2.4  The intra-testing results on four protocols of Oulu.
Table 2.5  The intra-testing results on three protocols of SiW.
Table 2.6  Cross testing on CASIA-MFSD vs. Replay-Attack.
Table 3.1  Comparing our SiW-M with existing face anti-spoofing datasets.
Table 3.2  Compare models with different routing strategies.
Table 3.3  Compare models with different tree losses and strategies. The two terms of rows 2-5 refer to using live or spoof data in tree learning. The last row is our method.
Table 3.4  AUC (%) of the model testing on CASIA, Replay, and MSU-MFSD.
Table 3.5  The evaluation and comparison of the testing on SiW-M.
Table 4.1  The evaluation on four protocols in OULU-NPU. Bold indicates the best score in each protocol.
Table 4.2  The evaluation on three protocols in SiW Dataset. We compare with the top 7 performances.
Table 4.3  The evaluation and ablation study on SiW-M Protocol I: known spoof detection.
Table 4.4  The evaluation on SiW-M Protocol II: unknown spoof detection.
Table 4.5  The evaluation on SiW-M Protocol III: open-set spoof detection.
Table 4.6  The performance comparison between impersonation attacks and obfuscation attacks.
Table 4.7  Confusion matrices of spoof medium classification based on spoof traces. The results are compared with the previous method Jourabloo et al. (2018). Green represents improvement over Jourabloo et al. (2018). Red represents performance drop.
Table 4.8  Confusion matrices of 6-class spoof trace classification on SiW-M database.
Table 5.1  A quantitative comparison for shadow removal on UCB dataset. Zhang et al. (2020b) is our implementation and trained using our synthesized data.
Table 5.2  A quantitative comparison of shadow segmentation on SFW database.

LIST OF FIGURES

Figure 2.1  Conventional CNN-based face anti-spoofing approaches utilize the binary supervision, which may lead to overfitting given the enormous solution space of CNN. This work designs a novel network architecture to leverage two auxiliary information as supervision: the depth map and rPPG signal, with the goals of improved generalization and explainable decisions during inference.
Figure 2.2  The overview of the proposed method.
Figure 2.3  The proposed CNN-RNN architecture. The number of filters are shown on top of each layer, the filter size of all filters is 3x3 with stride 1 for convolutional and 2 for pooling layers. Color code used: orange = convolution, green = pooling, purple = response map.
Figure 2.4  Example ground truth depth maps and rPPG signals.
Figure 2.5  The non-rigid registration layer.
Figure 2.6  The statistics of the subjects in the SiW database. Left side: The histogram shows the distribution of the face sizes.
Figure 2.7  Example live (top) and spoof (bottom) videos in SiW.
Figure 2.8  (a) 8 successful anti-spoofing examples and their estimated depth maps and rPPG signals. (b) 4 failure examples: the first two are live and the other two are spoof. Note our ability to estimate discriminative depth maps and rPPG signals.
Figure 2.9  Mean/Std of frontalized feature maps for live and spoof.
Figure 2.10  The MSE of estimating depth maps and rPPG signals.
Figure 3.1  To detect unknown spoof attacks, we propose a Deep Tree Network (DTN) to unsupervisedly learn a hierarchic embedding for known spoof attacks. Samples of unknown attacks will be routed through DTN and classified at the destined leaf node.
Figure 3.2  The proposed Deep Tree Network (DTN) architecture. (a) the overall structure of DTN. A tree node consists of a Convolutional Residual Unit (CRU) and a Tree Routing Unit (TRU), and a leaf node consists of a CRU and a Supervised Feature Learning (SFL) module. (b) the concept of Tree Routing Unit (TRU): finding the base with largest variations; (c) the structure of each Convolutional Residual Unit (CRU); (d) the structure of the Supervised Feature Learning (SFL) in the leaf nodes.
Figure 3.3  The examples of the live faces and 13 types of spoof attacks. The second row shows the ground truth masks for the pixel-wise supervision D_k. For (m, n) in the third row, m/n denotes the number of subjects/videos for each type of data.
Figure 3.4  The structure of the Tree Routing Unit (TRU).
Figure 3.5  Visualization of the Tree Routing.
Figure 3.6  Tree routing distribution of live/spoof data. X-axis denotes 8 leaf nodes, and y-axis denotes 15 types of data. The number in each cell represents the percentage (%) of data that fall in that leaf node. Each row sums to 1. (a) Print Protocol. (b) Transparent Mask Protocol. Yellow box denotes the unknown attacks.
Figure 3.7  t-SNE Visualization of the DTN leaf features.
Figure 4.1  The proposed approach can detect spoof faces, disentangle the spoof traces, and reconstruct the live counterparts. It can be applied to diverse spoof types and estimate distinct traces (e.g., Moire pattern in replay attack, eyebrow and wax in makeup attack, color distortion in print attack, and specular highlights in 3D mask attack). Zoom in for details.
Figure 4.2  The comparison of different deep-learning based face anti-spoofing: (a) direct FAS only provides a binary decision of spoofness; (b) auxiliary FAS can provide simple interpretation of spoofness. M denotes the auxiliary task, such as depth map estimation; (c) generative FAS can provide more intuitive interpretation of spoofness, but only for a limited number of spoof attacks; (d) the proposed method can provide spoof trace estimation for generic face spoof attacks.
Figure 4.3  Overview of the proposed Physics-guided Spoof Trace Disentanglement (PhySTD).
Figure 4.4  The proposed PhySTD network architecture. Except the last layer, each conv and transposed conv is concatenated with a batch normalization layer and a leaky ReLU layer. k3c64s2 indicates the kernel size of 3x3, the convolution channel of 64 and the stride of 2.
Figure 4.5  The visualization of image decomposition for different input faces: (a) live face (b) 3D mask attack (c) replay attack (d) print attack.
Figure 4.6  The online 3D warping layer. (a) Given the corresponding dense offset, we warp the spoof trace and add them to the target live face to create a new spoof. E.g., pixel (x, y) with offset (3, 5) is warped to pixel (x+3, y+5) in the new image. (b) To obtain a dense offset from the sparse offsets of the selected face shape vertices, Delaunay triangulation interpolation is adopted.
Figure 4.7  Preliminary mask P_0 for the negative term in inpainting mask loss. White pixels denote 1 and black pixels denote 0. White indicates the area should not be inpainted. P_0 for: (a) print, replay; (b) 3D mask and makeup; (c) partial attacks that cover the eye portion; (d) partial attacks that cover the mouth portion.
Figure 4.8  Examples of each spoof trace component. (a) the input sample faces. (b) B. (c) C. (d) T. (e) P. (f) the live counterpart reconstruction and zoom-in details. (g) results from Liu et al. (2020). (h) results from Step 1 + Step 2 with a single trace representation.
Figure 4.9  Examples of spoof trace disentanglement on SiW (a-h) and SiW-M (i-x). (a)-(d) items are print attacks and (e)-(h) items are replay attacks. (i)-(x) items are live, print, replay, half mask, silicone mask, paper mask, transparent mask, obfuscation makeup, impersonation makeup, cosmetic makeup, paper glasses, partial paper, funny eye glasses, and mannequin head. The first column is the input face, the second column is the overall spoof trace (I - I_hat), the third column is the reconstructed live.
Figure 4.10  Examples of the spoof data synthesis. The first row are the source spoof faces, the first column are the target live faces, and the remaining are the synthesized spoof faces from the live face with the corresponding spoof traces.
Figure 4.11  The t-SNE visualization of features from different scales and layers. The first 3 visualizations are from the encoder features F_1, F_2, F_3, and the last 2 visualizations are from the features that produce {B, C, T} and {P, I_P}.
Figure 4.12  The illustration of removing the disentangled spoof trace components one by one. The estimated spoof trace elements of input spoof (the first column) are progressively removed in the order of B, C, T, T_P. The last column shows the reconstructed live image after removing all three additive trace components and the inpainting trace. (a) Replay attack; (b) Makeup attack; (c) Mask attack; (d) Paper glasses attack.
Figure 4.13  The illustration of double spoof trace disentangling. The left 4 samples are live faces, and the right 4 samples are spoof faces. (a) Original Input. (b) 1st round live reconstruction. (c) 1st round spoof traces. (d) 2nd round live reconstruction. (e) 2nd round spoof traces.
Figure 5.1  The results of our shadow removal model on images from our Shadow Face in the Wild (SFW) database (top) and UCB database Zhang et al. (2020b) (bottom). The left to right are input face, output face, and shadow matte.
Figure 5.2  Examples of (a) foreign shadow, (b) strong self shadow, and (c) normal self shadow. Our model is designed to remove unwanted shadows in (a-b) while keeping normal shadow in (c).
Figure 5.3  Illustration of data synthesis components.
Figure 5.4  Illustration of our network architecture. The model mainly consists of an encoder, a shadow matte decoder, a color matrix decoder, and a shadow residual decoder.
The Temporal Sharing Module (TSM) can be easily plugged into the face encoder. Together with the temporal consistency loss L_T, we can leverage the unlabeled image frames efficiently. The green dashed lines indicate the short-cut connections and the orange dashed lines and boxes indicate the loss functions.
Figure 5.5  Illustration of the Temporal Sharing Module (TSM). It can be applied to temporal frames as well as mirrored input.
Figure 5.6  An illustration of SFW database. The first row shows the shadow faces collected under highly dynamic environments (e.g., varying shadows and head poses due to walking and driving); The second row shows the pixel-level annotations of shadow segmentation. Zoom in for viewing the quality of our annotation.
Figure 5.7  A qualitative comparison of shadow removal on testing images of UCB database. From top to bottom, we show shadow face and shadow removal results provided by Zhang et al. (2020b), the network with naive RGB shadow modeling, our single-frame network with grayscale shadow removal and colorization (GS+C), and our network with additional TSM and temporal loss.
Figure 5.8  Qualitative shadow removal evaluations on SFW database. From top to bottom, we show shadow face, shadow removal results from Le & Samaras (2019) and Zhang et al. (2020b), our single-frame model, and our temporal model, ground truth shadow segmentation (in bright purple), and predicted shadow mask (before thresholding).

LIST OF ALGORITHMS

Algorithm 1  PhySTD Training Iteration.
"For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis."

Chapter 1

Introduction to Face Anti-Spoofing

1.1 Introduction

Biometrics utilize physiological characteristics, such as face, fingerprint, and iris, or behavioral characteristics, such as typing rhythm and gait, to uniquely identify or authenticate an individual. As biometric systems are widely used in real-world applications including mobile phone authentication and access control, biometric spoofs, or Presentation Attacks (PA), are becoming a large threat, where a spoofed biometric sample is presented to the biometric system and attempted to be authenticated. Face, as one of the most popular biometric modalities, has received increasing attention in academia and industry in recent years (e.g., iPhone X). However, the attention also brings a growing incentive for hackers to design biometric presentation attacks (PA), or spoofs, to be authenticated as the genuine user. Due to the almost no-cost access to the human face, the spoof face can be as simple as a printed photo paper (i.e., print attack) and a digital image/video (i.e., replay attack), or as complicated as a 3D mask and facial cosmetic makeup. With proper handling, those spoofs can be visually very close to the genuine user's live face. As a result, these call for the need of developing robust face anti-spoofing algorithms.
In order to develop a face recognition system that is invulnerable to various types of PAs, there is an increasing demand on designing a robust face anti-spoofing (or PA detection) system to classify a face sample as live or spoof before recognizing its identity. As RGB image and video are the standard input to face recognition systems, most face anti-spoofing studies are RGB-based, either on a single image or a clip of video. Previous approaches to tackle face anti-spoofing can be categorized in three groups. The first is the motion-based methods that aim at classifying face videos based on detecting movements of facial parts. Eye-blinking is one cue proposed in Pan et al. (2007); Sun et al. (2007) to detect spoof attacks such as paper attack. In Kollreider et al. (2007), Kollreider et al. use lip motion to monitor the face liveness. Methods proposed in Chetty (2010); Chetty & Wagner (2006) combine audio and visual cues to verify the face liveness. These methods are suitable for static attacks, but not dynamic attacks such as replay or mask attacks. The second is the image quality and reflectance based methods, which design features to capture the illumination and noise information superimposed on the spoof images. As the image quality factors are heuristic based on human observation (not data-driven), they show very limited capability to generalize to complex situations with variations such as lighting, pose, cameras and expressions. The third is the texture-based methods, which discover discriminative texture characteristics unique to various attack mediums. Compared to the previous two groups, texture-based methods explore the intrinsic properties of the spoofing material and medium, and thus are more generalizable. However, due to a lack of understanding between pixel intensities and different types of attacks in the early studies, extracting robust texture features was challenging.
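The eye-blink cue mentioned above is often implemented in later practice via the eye aspect ratio (EAR) over detected eye landmarks; this is not the dissertation's method, only a minimal sketch of the idea, and the threshold 0.2 and frame count are illustrative assumptions:

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye aspect ratio over 6 eye landmarks ordered p1..p6:
    corner, two upper-lid points, corner, two lower-lid points.
    Large for an open eye, near zero for a closed eye."""
    eye = np.asarray(eye, dtype=float)
    v1 = np.linalg.norm(eye[1] - eye[5])   # vertical distance 1
    v2 = np.linalg.norm(eye[2] - eye[4])   # vertical distance 2
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance
    return (v1 + v2) / (2.0 * h)

def count_blinks(ear_sequence, threshold=0.2, min_frames=2):
    """Count blinks as runs of >= min_frames consecutive frames whose
    EAR falls below the threshold; a static photo yields zero blinks."""
    blinks, run = 0, 0
    for ear in ear_sequence:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:
        blinks += 1
    return blinks
```

A video with no blink runs would then be flagged as a potential print (paper) attack, which is exactly the failure mode the text notes for replay attacks, where recorded blinks defeat the cue.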
Most texture-based works utilize hand-crafted features and adopt shallow learning techniques (e.g., SVM and LDA) to develop an anti-spoofing system. Common local features that have been used in prior work include LBP in de Freitas Pereira et al. (2012, 2013); Määttä et al. (2011), HOG in Komulainen et al. (2013a); Yang et al. (2013), DoG in Peixoto et al. (2011); Tan et al. (2010), SIFT in Patel et al. (2016b) and SURF in Boulkenafet et al. (2017a). However, the aforementioned features to detect texture difference could be very sensitive to different illuminations, camera devices and identities. Researchers also seek solutions in different color spaces such as HSV and YCbCr in Boulkenafet et al. (2015, 2016), Fourier spectra Li et al. (2004) and Optical Flow Maps (OFM) in Bao et al. (2009).

With CNN proven to successfully outperform other learning paradigms in many computer vision tasks in Kalchbrenner et al. (2014); Krizhevsky et al. (2012); Lawrence et al. (1997), it is then introduced as a new approach to handle face anti-spoofing. In Li et al. (2016a); Patel et al. (2016a), the CNN serves as a feature extractor. Both methods fine-tune their network from a pretrained model (CaffeNet in Patel et al. (2016a), VGG-face model in Li et al. (2016a)), and extract the features to distinguish live vs. spoof. Yang et al. (2014) propose to learn a CNN as a binary classifier for face anti-spoofing. Registered face images with different spatial scales are stacked as input and live/spoof labeling is assigned as the output. In addition, Feng et al. (2016) propose to use multiple cues as the CNN input for live/spoof classification. They select Shearlet-based features to measure the image quality and the OFM of the face area as well as the whole scene area. And in Xu et al. (2015), Xu et al. propose an LSTM-CNN architecture to conduct a joint prediction for multiple frames of a video.
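The LBP-plus-shallow-classifier recipe cited above boils down to a histogram of local binary codes fed to an SVM. As a rough illustration only (the cited works typically use uniform or multi-scale LBP variants, not this basic form), the descriptor can be sketched in plain numpy:

```python
import numpy as np

def lbp_histogram(gray, bins=256):
    """Basic 8-neighbor LBP: threshold each 3x3 neighborhood at its
    center pixel, pack the 8 comparison bits into a code in [0, 255],
    and return the normalized code histogram as the texture feature."""
    g = np.asarray(gray, dtype=float)
    c = g[1:-1, 1:-1]                      # center pixels
    # 8 neighbors in a fixed clockwise order starting at top-left
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]
    codes = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[dy:dy + c.shape[0], dx:dx + c.shape[1]]
        codes |= (neighbor >= c).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=bins).astype(float)
    return hist / hist.sum()
```

In the classical pipelines, such histograms (often concatenated over face regions) are the fixed-length feature vectors handed to an SVM or LDA for the live/spoof decision.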
Though CNN-based methods provide significant improvement in terms of detection accuracy for face anti-spoofing, compared to other face related problems such as face recognition and face alignment, there are still substantially fewer efforts and explorations on face anti-spoofing using deep learning techniques. Therefore, in this work we aim to further explore the capability of CNN in handling face anti-spoofing, mainly in three aspects: improving the detection performance, generalization toward different domains such as unseen/unknown spoof types and capturing camera sensors, and providing visualization to the CNN's prediction.

1.2 Overview of the thesis

In Chapter 2, we argue the importance of auxiliary supervision to guide the learning toward more discriminative cues. A CNN-RNN model is learned to estimate the face depth with pixel-wise supervision, and to estimate rPPG signals with sequence-wise supervision. The estimated depth and rPPG are fused to distinguish live vs. spoof faces. Further, we introduce a new face anti-spoofing database that covers a large range of illumination, subject, and pose variations. Experiments show that our model achieves the state-of-the-art results on both intra- and cross-database testing.

While advanced face anti-spoofing methods are developed, new types of spoof attacks are also being created and becoming a threat to all existing systems. To study the generalization of the face anti-spoofing methods, in Chapter 3, we define the detection of unknown spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA). Previous ZSFA works only study 1-2 types of spoof attacks, such as print/replay, which limits the insight of this problem. In this chapter, we investigate the ZSFA problem in a wide range of 13 types of spoof attacks, including print, replay, 3D mask, and so on. A novel Deep Tree Network (DTN) is proposed to partition the spoof samples into semantic sub-groups in an unsupervised fashion. When a data sample arrives, being known or unknown attacks, DTN routes it to the most similar spoof cluster, and makes the binary decision. In addition, to enable the study of ZSFA, we introduce the first face anti-spoofing database that contains diverse types of spoof attacks. Experiments show that our proposed method achieves the state of the art on multiple testing protocols of ZSFA.
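The DTN's routing idea, projecting a sample's feature onto each node's largest-variation direction and descending by the sign, can be caricatured without any of the actual network. In this sketch PCA stands in for the learned routing basis (an assumption; the TRU in Chapter 3 learns its basis with an unsupervised objective during training, not by an explicit SVD at test time):

```python
import numpy as np

class RoutingNode:
    """A DTN-style routing node: project a feature onto the direction of
    largest variation among known-spoof features (found here via SVD/PCA)
    and send the sample left or right by the sign of the projection."""
    def __init__(self, features):
        self.mean = features.mean(axis=0)
        centered = features - self.mean
        # leading right-singular vector = largest-variation basis
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        self.basis = vt[0]
        self.left = self.right = None

    def route(self, x):
        return "left" if (x - self.mean) @ self.basis < 0 else "right"

def route_to_leaf(root, x, depth):
    """Follow routing decisions down a binary tree; the resulting path
    identifies the most similar spoof cluster for sample x."""
    path, node = [], root
    for _ in range(depth):
        side = node.route(x)
        path.append(side)
        node = node.left if side == "left" else node.right
        if node is None:
            break
    return path
```

An unknown attack type thus still lands in some leaf, the cluster of known spoofs it most resembles, where the leaf's supervised classifier makes the final live/spoof call.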
To gain a better visual understanding and interpretation of the spoof attacks, in Chapter 4, we identify a new problem of spoof trace disentangling. We show that the key to face anti-spoofing lies in the subtle image pattern, termed "spoof trace", e.g., color distortion, 3D mask edge, Moire pattern, and many others. Spoof trace disentangling is motivated by the noise modeling and denoising algorithms, for the purpose of inversely decomposing a spoof face into a spoof trace and a live face, and then utilizing the spoof trace for classification. Designing a generic model to estimate those spoof traces can improve not only the generalization of the spoof detection, but also the interpretability of the model's decision. We first provide a proper modeling of spoof trace as an additive noise to the genuine face. We design a novel adversarial learning framework to disentangle the spoof traces from input faces as a hierarchical combination of patterns at multiple scales. With the disentangled spoof traces, we unveil the live counterpart of the original spoof face, and further synthesize realistic new spoof faces after a proper geometric correction. Our method demonstrates superior spoof detection performance on both seen and unseen spoof scenarios while providing visually-convincing estimation of spoof traces.

In Chapter 5, we show that a proper physical modeling can benefit the face shadow detection and removal problem as well. In-the-wild face photographs often suffer from undesired foreign shadows cast by external objects, e.g., hands, phones, and trees. Removing facial foreign shadows not only improves image aesthetics but also mitigates the negative impacts on face-related tasks.
We tackle the blind removal of facial foreign shadow for both single image and videos, by making three contributions. First, we propose a novel two-stage shadow modeling that consists of grayscale shadow removal and colorization. This modeling provides an effective way to handle both color distortion and subsurface scattering effects. Second, we propose a novel Temporal Sharing Module to extract hierarchical features across multiple aligned video frames, which represents the shadow-free faces. Third, we collect a real face database with 280 videos captured under highly dynamic environments and annotate pixel-level shadow segmentation maps. Extensive experiments demonstrate the effectiveness of our approach comparing with both the baseline and state-of-the-art methods.

To better understand the work in this dissertation, Tab. 1.1 lists all the terms and the definitions used in this work.

Live face: Genuine face from the subject without any physical manipulation of its identity. Also known as bona fide face.
Spoof face: Face not from the original subject. Also known as presentation attack.
Impersonation attack: Attacks in which the attacker wants to be recognized as a different subject.
Obfuscation attack: Attacks in which the attacker wants to hide the identity of the attacker.
Spoof medium: Material used to present the spoof face, such as printed photo and digital screen.
Spoof trace: Patterns only existing in the spoof faces, such as moire pattern and 3D mask edges.

Table 1.1  The term definitions used in this work.
1.2.1 Contributions of the thesis

In this section, we list the contributions in this dissertation:

• Previous works regard face anti-spoofing as a binary classification problem. We propose to leverage novel auxiliary information (i.e., depth map and rPPG) to supervise the CNN-RNN learning for improved detection performance. The auxiliary supervisions bring additional knowledge to face anti-spoofing;

• We conduct an extensive study of the generalization problem of face anti-spoofing. Specifically, we study zero-shot face anti-spoofing on 13 different types of spoof attacks and propose a Deep Tree Network (DTN) to learn features hierarchically. DTN leverages existing spoof attack knowledge to effectively represent the unknown spoof attacks;

• To provide visual interpretation, we study a novel problem of spoof trace disentangling and propose a novel modeling to disentangle spoof traces into a hierarchical representation on generic spoof attacks. To our knowledge, this is the first work to solve face anti-spoofing in a generative and visually-intuitive approach;

• Inspired by the effective spoof trace modeling, we provide an effective modeling of the face with foreign shadow, and propose a novel approach to decompose RGB shadow removal into grayscale shadow removal and colorization, and a temporal sharing module to ensure video consistency;

• We collect databases for face anti-spoofing, including SiW and SiW-M, and for face shadow detection, termed SFW.

Chapter 2

Detection: Face Anti-Spoofing with Auxiliary Supervisions

2.1 Introduction

RGB image and video are the standard input to face anti-spoofing systems, similar to face recognition systems. Researchers start the texture-based anti-spoofing approaches by feeding handcrafted features to binary classifiers Boulkenafet et al. (2017a); de Freitas Pereira et al. (2012, 2013); Komulainen et al. (2013a); Määttä et al. (2011); Mirjalili & Ross (2017); Patel et al. (2016b); Yang et al. (2013). Later in the deep learning era, several Convolutional Neural Network (CNN) approaches utilize softmax loss as the supervision Feng et al. (2016); Li et al. (2016a); Patel et al. (2016a); Yang et al. (2014). It appears almost all prior work regard the face anti-spoofing problem as merely a binary (live vs. spoof) classification problem.
There are two main issues in learning deep anti-spoofing models with binary supervision, shown in Fig. 2.1. First, there are different levels of image degradation, namely spoof patterns, comparing a spoof face to a live one, which consist of skin detail loss, color distortion, moiré pattern, shape deformation and spoof artifacts (e.g., reflection) Li et al. (2004); Patel et al. (2016b). A CNN with softmax loss might discover arbitrary cues that are able to separate the two classes, such as screen bezel, but not the faithful spoof patterns. When those cues disappear during testing, these models would fail to distinguish spoof vs. live faces and result in poor generalization. Second, during the testing, models learnt with binary supervision will only generate a binary decision without explanation or rationale for the decision. In the pursuit of Explainable Artificial Intelligence Turek (2016), it is desirable for the learnt model to generate the spoof patterns that support the binary decision.

Figure 2.1  Conventional CNN-based face anti-spoofing approaches utilize the binary supervision, which may lead to overfitting given the enormous solution space of CNN. This work designs a novel network architecture to leverage two auxiliary information as supervision: the depth map and rPPG signal, with the goals of improved generalization and explainable decisions during inference.
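To make the contrast with binary supervision concrete, the auxiliary-supervision idea can be reduced to a toy objective: a pixel-wise loss on the estimated depth map plus a sequence-wise loss on the estimated rPPG signal, with the final spoofness score fusing the two estimates. The exact losses and fusion in Chapter 2 differ; the weights `lam` and `alpha` here are illustrative assumptions, not the dissertation's values:

```python
import numpy as np

def auxiliary_loss(pred_depth, gt_depth, pred_rppg, gt_rppg, lam=0.5):
    """Training objective sketch: pixel-wise depth-map MSE (CNN branch)
    plus sequence-wise rPPG MSE (RNN branch), combined with weight lam.
    Live faces are supervised with face-like depth and a pulse signal;
    spoof faces with a flat (zero) depth map and a zero rPPG signal."""
    depth_loss = np.mean((pred_depth - gt_depth) ** 2)
    rppg_loss = np.mean((pred_rppg - gt_rppg) ** 2)
    return depth_loss + lam * rppg_loss

def spoof_score(pred_depth, pred_rppg, alpha=0.5):
    """Inference-time fusion sketch: a live face should yield a non-flat
    depth map and a strong rPPG signal, so small norms of both indicate
    spoof. Higher score => more likely live."""
    return (1 - alpha) * np.mean(pred_depth ** 2) + alpha * np.mean(pred_rppg ** 2)
```

Under this supervision, the network cannot win by latching onto an arbitrary cue like a screen bezel: it must produce physically meaningful depth and pulse estimates, which is precisely the explainability argument made above.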
To address these issues, as shown in Fig. 2.2, we propose a deep model that uses the supervision from both the spatial and temporal auxiliary information rather than binary supervision, for the purpose of robustly detecting face PA from a face video. These auxiliary information are acquired based on our domain knowledge about the key differences between live and spoof faces, which include two perspectives: spatial and temporal. From the spatial perspective, it is known that live faces have face-like depth, e.g., the nose is closer to the camera than the cheek in frontal-view faces, while faces in print or replay attacks have flat or planar depth, e.g., all pixels on the image of a paper have the same depth to the camera. Hence, depth can be utilized as auxiliary information to supervise both live and spoof faces. From the temporal perspective, it was shown that the normal rPPG signals (i.e., heart pulse signal) are detectable from live, but not spoof, face videos Liu et al. (2016b); Nowara et al. (2017). Therefore, we provide different rPPG signals as auxiliary supervision, which guides the network to learn from live or spoof face videos respectively. To enable both supervisions, we design a network architecture with a short-cut connection to capture different scales and a novel non-rigid registration layer to handle the motion and pose change for rPPG estimation.

Furthermore, similar to many vision problems, data plays a significant role in training the anti-spoofing models. As we know, camera/screen quality is a critical factor to the quality of spoof faces. Existing face anti-spoofing databases, such as NUAA Tan et al. (2010), CASIA Zhang et al. (2012), Replay-Attack Chingovska et al. (2012), and MSU-MFSD Wen et al. (2015), were collected 3-5 years ago. Given the fast advance of consumer electronics, the types of equipment (e.g., cameras and spoof mediums) used in those data collections are outdated compared to the ones nowadays, regarding the resolution and imaging quality. More recent MSU-USSA Patel et al.
(2016b) and Oulu-NPU databases Boulkenafet et al. (2017b) have subjects with fewer variations in poses, illuminations, expressions (PIE). The lack of necessary variations would make it hard to learn an effective model. Given the clear need for more advanced databases, we collect a face anti-spoofing database, named Spoof in the Wild (SiW) database. The SiW database consists of 165 subjects, 6 spoofing mediums, and 4 sessions covering variations such as PIE, distance-to-camera, etc. SiW covers much larger variations than previous databases, as detailed in Tab. 2.1 and Sec. 2.4.

The main contributions of this work include:
- We propose to leverage novel auxiliary information (i.e., depth map and rPPG) to supervise the CNN learning for improved generalization.
- We propose a novel CNN-RNN architecture for end-to-end learning the depth map and rPPG signal.
- We release a new database that contains variations of PIE, and other practical factors. We achieve the state-of-the-art performance for face anti-spoofing.

Figure 2.2: The overview of the proposed method.

2.2 Prior Work

We review the prior face anti-spoofing works in three groups: texture-based methods, temporal-based methods, and remote photoplethysmography methods.

Texture-based Methods: Since most face recognition systems adopt only RGB cameras, using texture information has been a natural approach to tackling face anti-spoofing. Many prior works utilize hand-crafted features, such as LBP de Freitas Pereira et al. (2012, 2013); Määttä et al. (2011), HoG Komulainen et al. (2013a); Yang et al. (2013), SIFT Patel et al. (2016b) and SURF Boulkenafet et al. (2017a), and adopt traditional classifiers such as SVM and LDA. To overcome the influence of illumination variation, they seek solutions in a different input domain, such as HSV and YCbCr color space Boulkenafet et al. (2015, 2016), and Fourier spectrum Li et al. (2004).
As deep learning has proven to be effective in many computer vision problems, there are many recent attempts of using CNN-based features or CNNs in face anti-spoofing Feng et al. (2016); Li et al. (2016a); Patel et al. (2016a); Yang et al. (2014). Most of the work treats face anti-spoofing as a simple binary classification problem by applying the softmax loss. For example, Li et al. (2016a); Patel et al. (2016a) use CNN as a feature extractor and fine-tune from ImageNet-pretrained CaffeNet and VGG-face. The work of Feng et al. (2016); Li et al. (2016a) feed different designs of the face images into CNN, such as multi-scale faces and hand-crafted features, and directly classify live vs. spoof. One prior work that shares similarity with ours is Atoum et al. (2017), where Atoum et al. propose a two-stream CNN-based anti-spoofing method using texture and depth. We advance Atoum et al. (2017) in a number of aspects, including fusion with temporal supervision (i.e., rPPG), architecture design, a novel non-rigid registration layer, and comprehensive experimental support.

Temporal-based Methods: One of the earliest solutions for face anti-spoofing is based on temporal cues such as eye-blinking Pan et al. (2007); Patel et al. (2016a). Methods such as Kollreider et al. (2007); Shao et al. (2017) track the motion of mouth and lip to detect the face liveness. While these methods are effective to typical paper attacks, they become vulnerable when attackers present a replay attack or a paper attack with the eye/mouth portion being cut.

There are also methods relying on more general temporal features, instead of the specific facial motion. The most common approach is frame concatenation. Many handcrafted feature-based methods may improve intra-database testing performance by simply concatenating the features of consecutive frames to train the classifiers Boulkenafet et al. (2015); de Freitas Pereira et al. (2012); Komulainen et al. (2013b). Additionally, there are some works proposing temporal features, e.g., Haralick features Agarwal et al. (2016), motion magnification Bharadwaj et al. (2014), and optical flow Bao et al. (2009). In the deep learning era, Feng et al. feed the optical flow map and Shearlet image feature to CNN Feng et al. (2016). In Xu et al. (2015), Xu et al.
propose an LSTM-CNN architecture to utilize temporal information for binary classification. Overall, all prior methods still regard face anti-spoofing as a binary classification problem, and thus they have a hard time generalizing well in the cross-database testing. In this work, we extract discriminative temporal information by learning the rPPG signal of the face video.

Remote Photoplethysmography (rPPG): Remote photoplethysmography (rPPG) is the technique to track vital signals, such as heart rate, without any contact with human skin Bobbia et al. (2016); de Haan & Jeanne (2013); Po et al. (2017); Tulyakov et al. (2016); Wu et al. (2016). Research progresses from face videos with no motion or illumination change to videos with multiple variations. In de Haan & Jeanne (2013), Haan et al. estimate rPPG signals from RGB face videos with lighting and motion changes. It utilizes color difference to eliminate the specular reflection and estimate two orthogonal chrominance signals. After applying the band-pass filter (BPF), the ratio of the chrominance signals is used to compute the rPPG signal.

rPPG has previously been utilized to tackle face anti-spoofing Liu et al. (2016b); Nowara et al. (2017). In Liu et al. (2016b), rPPG signals are used for detecting the 3D mask attack, where the live faces exhibit a pulse of heart rate unlike the 3D masks. They use rPPG signals extracted by de Haan & Jeanne (2013) and compute the correlation features for classification. Similarly, Nowara et al. (2017) extract rPPG signals (also via de Haan & Jeanne (2013)) from three face regions and two non-face regions, for detecting print and replay attacks. Although in replay attacks, the rPPG extractor might still capture the normal pulse, the combination of multiple regions can differentiate live vs. spoof faces. While the analytic solution to rPPG extraction de Haan & Jeanne (2013) is easy to implement, we observe that it is sensitive to PIE variations. Hence, we employ a novel CNN-RNN architecture to learn a mapping from a face video to the rPPG signal, which is not only robust to PIE variations, but also discriminative to live vs. spoof.

Figure 2.3: The proposed CNN-RNN architecture. The number of filters is shown on top of each layer; the filter size is 3×3, with stride 1 for convolutional layers and 2 for pooling layers.
Color code used: orange = convolution, green = pooling, purple = response map.

2.3 Face Anti-Spoofing with Deep Network

The main idea of the proposed approach is to guide the deep network to focus on the known spoof patterns across spatial and temporal domains, rather than to extract any cues that could separate two classes but are not generalizable. As shown in Fig. 2.2, the proposed network combines CNN and RNN architectures in a coherent way. The CNN part utilizes the depth map supervision to discover subtle texture property that leads to distinct depths for live and spoof faces. Then, it feeds the estimated depth and the feature maps to a novel non-rigid registration layer to create aligned feature maps. The RNN part is trained with the aligned maps and the rPPG supervision, which examines temporal variability across video frames.

2.3.1 Depth Map Supervision

Depth maps are a representation of the 3D shape of the face in a 2D image, which shows the face location and the depth information of different facial areas. This representation is more informative than binary labels since it indicates one of the fundamental differences between live faces, and print and replay PAs. We utilize the depth maps in the depth loss function to supervise the CNN part. The pixel-based depth loss guides the CNN to learn a mapping from the face area within a receptive field to a labeled depth value: a scale within [0, 1] for live faces and 0 for spoof faces.

To estimate the depth map for a 2D face image, given a face image, we utilize the state-of-the-art dense face alignment (DeFA) methods Jourabloo & Liu (2017); Liu et al. (2017) to estimate the 3D shape of the face. The frontal dense 3D shape S_F ∈ R^{3×Q}, with Q vertices, is represented as a linear combination of identity bases {S_id^i}_{i=1}^{N_id} and expression bases {S_exp^i}_{i=1}^{N_exp}:

    S_F = S_0 + \sum_{i=1}^{N_id} \alpha_{id}^i S_{id}^i + \sum_{i=1}^{N_exp} \alpha_{exp}^i S_{exp}^i,    (2.1)

where α_id ∈ R^199 and α_exp ∈ R^29 are the identity and expression parameters, and α = [α_id, α_exp] are the shape parameters. We utilize the Basel 3D face model Paysan et al. (2009) and the FaceWarehouse Cao et al. (2014) as the identity and expression bases.
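Equation 2.1 is a standard linear 3DMM combination; the following is a minimal NumPy sketch of it, with array shapes and function name chosen for illustration (they are not taken from any released code).

```python
import numpy as np

def frontal_shape(S0, S_id, S_exp, alpha_id, alpha_exp):
    """Eq. 2.1: S_F = S0 + sum_i alpha_id[i]*S_id[i] + sum_i alpha_exp[i]*S_exp[i].

    S0:        (3, Q) mean shape
    S_id:      (N_id, 3, Q) identity bases (N_id = 199 for the Basel model)
    S_exp:     (N_exp, 3, Q) expression bases (N_exp = 29 for FaceWarehouse)
    alpha_id:  (N_id,) identity parameters
    alpha_exp: (N_exp,) expression parameters
    """
    # tensordot over the basis axis gives the weighted sum of basis shapes
    return (S0
            + np.tensordot(alpha_id, S_id, axes=1)
            + np.tensordot(alpha_exp, S_exp, axes=1))
```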
With the estimated pose parameters P = (s, R, t), where R is a 3D rotation matrix, t is a 3D translation, and s is a scale, we align the 3D shape S to the 2D face image:

    S = s R S_F + t.    (2.2)

Given the challenge of estimating the absolute depth from a 2D face, we normalize the z values of the 3D vertices in S to be within [0, 1]. That is, the vertex closest to the camera (e.g., nose) has a depth of one, and the vertex furthest away has the depth of zero. Then, we apply the Z-Buffer algorithm Zhu et al. (2016) to S for projecting the normalized z values to a 2D plane, which results in an estimated "ground truth" 2D depth map D ∈ R^{32×32} for a face image.

2.3.2 rPPG Supervision

rPPG signals have recently been utilized for face anti-spoofing Liu et al. (2016b); Nowara et al. (2017). The rPPG signal provides temporal information about face liveness, as it is related to the intensity changes of facial skin over time. These intensity changes are highly correlated with the blood flow. The traditional method de Haan & Jeanne (2013) for extracting rPPG signals has three drawbacks. First, it is sensitive to pose and expression variation, as it becomes harder to track a specific face area for measuring intensity changes. Second, it is also sensitive to illumination changes, since the extra lighting affects the amount of reflected light from the skin. Third, for the purpose of anti-spoofing, rPPG signals extracted from spoof videos might not be sufficiently distinguishable to signals of live videos.
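The pose alignment of Eq. 2.2 and the depth normalization of Sec. 2.3.1 can be sketched as below. This is a simplified illustration: the Z-buffer rasterization to the 32×32 map is omitted, and the convention that a larger z value means closer to the camera is our assumption for the sketch.

```python
import numpy as np

def normalized_vertex_depth(S_F, s, R, t):
    """Align the frontal shape to the image (Eq. 2.2: S = s*R*S_F + t) and map
    the vertex z values to [0, 1]: 1 for the vertex closest to the camera
    (assumed here to have the largest z), 0 for the furthest one."""
    S = s * (R @ S_F) + t.reshape(3, 1)   # Eq. 2.2, t broadcast over Q vertices
    z = S[2]
    d = (z - z.min()) / (z.max() - z.min() + 1e-8)
    return S, d
```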
One novel aspect of our approach is that, instead of computing the rPPG signal via de Haan & Jeanne (2013), our RNN part learns to estimate the rPPG signal. This eases the signal estimation from face videos with PIE variations, and also leads to more discriminative rPPG signals, as different rPPG supervisions are provided to live vs. spoof videos. We assume that the videos of the same subject under different PIE conditions have the same ground truth rPPG signal. This assumption is valid since the heartbeat is similar for the videos of the same subject that are captured in a short span of time (< 5 minutes). The rPPG signal extracted from the constrained videos (i.e., no PIE variation) is used as the "ground truth" supervision in the rPPG loss function for all live videos of the same subject. This consistent supervision helps the CNN and RNN parts to be robust to the PIE changes.

In order to extract the rPPG signal from a face video without PIE variations, we apply the DeFA Liu et al. (2017) to each frame and estimate the dense 3D face shape. We utilize the estimated 3D shape to track a face region. For a tracked region, we compute two orthogonal chrominance signals x_f = 3r_f − 2g_f and y_f = 1.5r_f + g_f − 1.5b_f, where r_f, g_f, b_f are the bandpass filtered versions of the r, g, b channels with the skin-tone normalization. We utilize the ratio of the standard deviations of the chrominance signals γ = σ(x_f)/σ(y_f) to compute blood flow signals de Haan & Jeanne (2013). We calculate the signal p as:

    p = 3(1 − γ/2) r_f − 2(1 + γ/2) g_f + (3γ/2) b_f.    (2.3)

By applying FFT to p, we obtain the rPPG signal f ∈ R^50, which shows the magnitude of each frequency.

2.3.3 Network Architecture

Our proposed network consists of two deep networks. First, a CNN part evaluates each frame separately and estimates the depth map and feature map of each frame. Second, a recurrent neural network (RNN) part evaluates the temporal variability across the feature maps of a sequence.

2.3.3.1 CNN Network

We design a Fully Convolutional Network (FCN) as our CNN part, as shown in Fig. 2.3. The CNN part contains multiple blocks of three convolutional layers, pooling and resizing layers, where each convolutional layer is followed by one exponential linear unit (ELU) layer and a batch normalization layer.
Then, the resizing layers resize the response maps after each block to a predefined size of 64×64 and concatenate the response maps. The bypass connections help the network to utilize extracted features from layers with different depths, similar to the ResNet structure He et al. (2016). After that, our CNN has two branches, one for estimating the depth map and the other for estimating the feature map.

The first output of the CNN is the estimated depth map of the input frame I ∈ R^{256×256}, which is supervised by the estimated "ground truth" depth D:

    Θ_D = \arg\min_{Θ_D} \sum_{i=1}^{N_d} || CNN_D(I_i; Θ_D) − D_i ||_1^2,    (2.4)

where Θ_D is the CNN parameters and N_d is the number of training images. The second output of the CNN is the feature map, which is fed into the non-rigid registration layer.

2.3.3.2 RNN Network

The RNN part aims to estimate the rPPG signal f of an input sequence with N_f frames {I_j}_{j=1}^{N_f}. As shown in Fig. 2.3, we utilize one LSTM layer with 100 hidden neurons, one fully connected layer, and an FFT layer that converts the response of the fully connected layer into the Fourier domain. Given the input sequence {I_j}_{j=1}^{N_f} and the "ground truth" rPPG signal f, we train the RNN to minimize the ℓ1 distance of the estimated rPPG signal to the "ground truth" f:

    Θ_R = \arg\min_{Θ_R} \sum_{i=1}^{N_s} || RNN_R([{F_j}_{j=1}^{N_f}]_i; Θ_R) − f_i ||_1^2,    (2.5)

where Θ_R is the RNN parameters, F_j ∈ R^{32×32} is the frontalized feature map (details in Sec. 2.3.4), and N_s is the number of sequences.

2.3.3.3 Implementation Details

Ground Truth Data: Given a set of live and spoof face videos, we provide the ground truth supervision for the depth map D and rPPG signal f, as in Fig. 2.4. We follow the procedure in Sec. 2.3.1 to compute "ground truth" data for live videos. For spoof videos, we set the ground truth depth maps to a plain surface, i.e., zero depth. Similarly, we follow the procedure in Sec. 2.3.2 to compute the "ground truth" rPPG signal from a patch on the forehead, for one live video of each subject without PIE variation. Also, we normalize the norm of the estimated rPPG signal such that ||f||_2 = 1. For spoof videos, we consider the rPPG signals to be zero.

Figure 2.4: Example ground truth depth maps and rPPG signals.
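The two auxiliary supervisions above reduce to simple per-sample losses; a minimal sketch of Eqs. 2.4 and 2.5 is shown below, reading ||·||_1^2 as the squared ℓ1 norm and using the all-zero targets for spoof samples described in the ground-truth paragraph. This is an illustration of the loss terms only, not the training loop.

```python
import numpy as np

def depth_loss(D_pred, D_gt):
    """Eq. 2.4 (one frame): squared l1 distance between the predicted 32x32
    depth map and the "ground truth" map (all zeros for a spoof frame)."""
    return np.sum(np.abs(D_pred - D_gt)) ** 2

def rppg_loss(f_pred, f_gt):
    """Eq. 2.5 (one sequence): squared l1 distance between the predicted rPPG
    spectrum (R^50) and the "ground truth" (all zeros for a spoof video)."""
    return np.sum(np.abs(f_pred - f_gt)) ** 2
```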
Note that, while the term "depth" is used here, our estimated depth is different to the conventional depth map in computer vision. It can be viewed as a "pseudo-depth" and serves the purpose of providing discriminative auxiliary supervision to the learning process. The same perspective applies to the supervision based on the pseudo-rPPG signal.

Training Strategy: Our proposed network combines the CNN and RNN parts for end-to-end training. The desired training data for the CNN part should be from diverse subjects, so as to make the training procedure more stable and increase the generalizability of the learned model. Meanwhile, the training data for the RNN part should be long sequences to leverage the temporal information across frames. These two preferences can be contradictory to each other, especially given the limited GPU memory. Hence, to satisfy both preferences, we design a two-stream training strategy. The first stream fits the preference of the CNN part, where the input includes face images I and the ground truth depth maps D. The second stream fits the RNN part, where the input includes face sequences {I_j}_{j=1}^{N_f}, the ground truth depth maps {D_j}_{j=1}^{N_f}, the estimated 3D shapes {S_j}_{j=1}^{N_f}, and the corresponding ground truth rPPG signals f. During training, our method alternates between these two streams to converge to a model that minimizes both the depth map and rPPG losses. Note that even though the first stream only updates the weights of the CNN part, the backpropagation of the second stream updates the weights of both CNN and RNN parts in an end-to-end manner.

Testing: To provide a classification score, we feed the testing sequence to our network and compute the depth map D̂ of the last frame and the rPPG signal f̂. Instead of designing a classifier using D̂ and f̂, we compute the score as:

    score = ||f̂||_2^2 + λ ||D̂||_2^2,    (2.6)

where λ is a constant weight for combining the two outputs of the network.

Figure 2.5: The non-rigid registration layer.
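The test-time score of Eq. 2.6 is a weighted sum of the two output energies; a minimal sketch follows, where the default λ = 0.015 is taken from the parameter setting reported later in Sec. 2.5.1.

```python
import numpy as np

def spoof_score(f_hat, D_hat, lam=0.015):
    """Eq. 2.6: score = ||f_hat||_2^2 + lam * ||D_hat||_2^2.
    Live faces (non-zero pseudo-depth and rPPG) yield high scores;
    spoof faces (both outputs pushed toward zero) yield low scores."""
    return float(np.sum(f_hat ** 2) + lam * np.sum(D_hat ** 2))
```

Thresholding this score then gives the final live-vs-spoof decision.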
2.3.4 Non-rigid Registration Layer

We design a new non-rigid registration layer to prepare data for the RNN part. This layer utilizes the estimated dense 3D shape to align the activation or feature maps from the CNN part. This layer is important to ensure that the RNN tracks and learns the changes of the activations for the same facial area across time, as well as across all subjects.

As shown in Fig. 2.5, this layer has three inputs: the feature map T ∈ R^{32×32}, the depth map D̂ and the 3D shape S. Within this layer, we threshold the depth map and generate a binary mask V ∈ R^{32×32}:

    V = D̂ ≥ threshold.    (2.7)

Then, we compute the inner product of the binary mask and the feature map U = T ⊙ V, which essentially utilizes the depth map as a visibility indicator for each pixel in the feature map. If the depth value for one pixel is less than the threshold, we consider that pixel to be invisible. Finally, we frontalize U by utilizing the estimated 3D shape S:

    F(i, j) = U(S(m_ij, 1), S(m_ij, 2)),    (2.8)

where m ∈ R^K is the predefined list of K indexes of the face area in S_0, and m_ij is the corresponding index of pixel (i, j). We utilize m to project the masked activation map U to the frontalized image F. This proposed non-rigid registration layer has three contributions to our network:

- By applying the non-rigid registration, the input data are aligned and the RNN can compare the feature maps without concerning about the facial pose or expression. In other words, it can learn the temporal changes in the activations of the feature maps for the same facial area.

- The non-rigid registration removes the background area in the feature map. Hence the background area would not participate in RNN learning, although the background information is already utilized in the layers of the CNN part.
- For spoof faces, the depth maps are likely to be closer to zero. Hence, the inner product with the depth maps substantially weakens the activations in the feature maps, which makes it easier for the RNN to output zero rPPG signals. Likewise, the backpropagation from the rPPG loss also encourages the CNN part to generate zero depth maps for either all frames, or one pixel location in the majority of the frames within an input sequence.

Table 2.1: The comparison of our collected SiW dataset with existing datasets for face anti-spoofing.

| Dataset | # of subj. | # of sess. | # of live/attack videos (V) or images (I) | Pose range | Different expres. | Extra light. | Display devices | Spoof attacks |
| NUAA Tan et al. (2010) | 15 | 3 | 5,105/7,509 (I) | Frontal | No | Yes | - | 1 Print |
| CASIA-MFSD Zhang et al. (2012) | 50 | 3 | 150/450 (V) | Frontal | No | No | iPad | 1 Print, 1 Replay |
| Replay-Attack Chingovska et al. (2012) | 50 | 1 | 200/1,000 (V) | Frontal | No | Yes | iPhone 3GS, iPad | 1 Print, 2 Replay |
| MSU-MFSD Wen et al. (2015) | 35 | 1 | 110/330 (V) | Frontal | No | No | iPad Air, iPhone 5S | 1 Print, 2 Replay |
| MSU-USSA Patel et al. (2016b) | 1,140 | 1 | 1,140/9,120 (I) | < 45° | Yes | Yes | MacBook, Nexus 5, Nvidia Shield Tablet | 2 Print, 6 Replay |
| Oulu-NPU Boulkenafet et al. (2017b) | 55 | 3 | 1,980/3,960 (V) | Frontal | No | Yes | Dell 1905FP, Macbook Retina | 2 Print, 2 Replay |
| SiW | 165 | 4 | 1,320/3,300 (V) | < 90° | Yes | Yes | iPad Pro, iPhone 7, Galaxy S8, Asus MB168B | 2 Print, 4 Replay |

Figure 2.6: The statistics of the subjects in the SiW database. Left side: the histogram shows the distribution of the face sizes.

2.4 Collection of Face Anti-Spoofing Database

With the advance of sensor technology, existing anti-spoofing systems can be vulnerable to emerging high-quality spoof mediums. One way to make the systems robust to these attacks is to collect new high-quality databases. In response to this need, we collect a new face anti-spoofing database named Spoof in the Wild (SiW) database, which has multiple advantages over previous datasets as shown in Tab. 2.1. First, it contains substantially more live subjects with diverse races, e.g., 3 times the subjects of Oulu-NPU. Note that MSU-USSA is constructed using existing images of celebrities without capturing live faces. Second, live videos are captured with two high-quality cameras (Canon EOS T6, Logitech C920 webcam) with different PIE variations.
SiW provides live and spoof 30-FPS videos from 165 subjects. For each subject, we have 8 live and 20 spoof videos, in total 4,620 videos. Some statistics of the subjects are shown in Fig. 2.6. The live videos are collected in four sessions. In Session 1, the subject moves his head with varying distances to the camera. In Session 2, the subject changes the yaw angle of the head within [−90°, 90°], and makes different face expressions. In Sessions 3 and 4, the subject repeats Sessions 1 and 2, while the collector moves the point light source around the face from different orientations.

The live videos captured by both cameras are of 1,920×1,080 resolution. We provide two print and four replay video attacks for each subject, with examples shown in Fig. 2.7. To generate different qualities of print attacks, we capture a high-resolution image (5,184×3,456) for each subject and use it to make a high-quality print attack. Also, we extract a frontal-view frame from a live video for a lower-quality print attack. We print the images with an HP Color LaserJet M652 printer. The print attack videos are captured by holding printed papers still or warping them in front of the cameras. To generate high-quality replay attack videos, we select four spoof mediums: Samsung Galaxy S8, iPhone 7, iPad Pro, and PC (Asus MB168B) screens. For each subject, we randomly select two of the four high-quality live videos to display in the spoof mediums.

Figure 2.7: Example live (top) and spoof (bottom) videos in SiW.

2.5 Experimental Results

2.5.1 Experimental Setup

Databases: We evaluate our method on multiple databases to demonstrate its generalizability. We utilize SiW and Oulu databases Boulkenafet et al. (2017b) as new high-resolution databases and perform intra and cross testing between them. Also, we use the CASIA-MFSD Zhang et al. (2012) and Replay-Attack Chingovska et al. (2012) databases for cross testing and comparing with the state of the art.
Parameter setting: The proposed method is implemented in TensorFlow Abadi et al. (2016) with a constant learning rate of 3e-3, and 10 epochs of the training phase. The batch size of the CNN stream is 10 and that of the CNN-RNN stream is 2 with N_f being 5. We randomly initialize our network by using a normal distribution with zero mean and std of 0.02. We set λ in Eq. 2.6 to 0.015 and the threshold in Eq. 2.7 to 0.1.

Evaluation metrics: To compare with prior works, we report our results with the following metrics: Attack Presentation Classification Error Rate (APCER) ISO/IEC-JTC-1/SC-37 (2016), Bona Fide Presentation Classification Error Rate (BPCER) ISO/IEC-JTC-1/SC-37 (2016), ACER = (APCER + BPCER)/2 ISO/IEC-JTC-1/SC-37 (2016), and Half Total Error Rate (HTER). The HTER is half of the summation of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR).

Table 2.2: TDR at different FDRs, cross testing on Oulu Protocol 1.

| FDR | 1% | 2% | 10% | 20% |
| Model 1 | 8.5% | 18.1% | 71.4% | 81.0% |
| Model 2 | 40.2% | 46.9% | 78.5% | 93.5% |
| Model 3 | 39.4% | 42.9% | 67.5% | 87.5% |
| Model 4 | 45.8% | 47.9% | 81.0% | 94.2% |

Table 2.3: ACER of our method at different N_f, on Oulu Protocol 2.

| Train \ Test | 5 | 10 | 20 |
| 5 | 4.16% | 4.16% | 3.05% |
| 10 | 4.02% | 3.61% | 2.78% |
| 20 | 4.10% | 3.67% | 2.98% |

2.5.2 Experimental Comparison

2.5.2.1 Ablation Study

Advantage of proposed architecture: We compare four architectures to demonstrate the advantages of the proposed loss layers and non-rigid registration layer. Model 1 has an architecture similar to the CNN part in our method (Fig. 2.3), except that it is extended with additional pooling layers, fully connected layers, and softmax loss for binary classification. Model 2 is the CNN part in our method with a depth map loss function. We simply use ||D̂||_2 for classification. Model 3 contains the CNN and RNN parts without the non-rigid registration layer. Both the depth map and rPPG loss functions are utilized in this model. However, the RNN part would process unregistered feature maps from the CNN. Model 4 is the proposed architecture.

We train all four models with the live and spoof videos from 20 subjects of SiW. We compute the cross-testing performance of all models on Protocol 1 of the Oulu database. The TDRs at different FDRs are reported in Tab. 2.2.
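The aggregate metrics defined in Sec. 2.5.1 reduce to simple means; the sketch below illustrates them, assuming the APCER/BPCER (and FRR/FAR) inputs have already been computed per the cited ISO/IEC standard.

```python
def acer(apcer, bpcer):
    """ACER: mean of the attack and bona fide presentation
    classification error rates (ISO/IEC 30107-3)."""
    return (apcer + bpcer) / 2.0

def hter(frr, far):
    """HTER: half of the sum of the false rejection rate (FRR)
    and the false acceptance rate (FAR)."""
    return (frr + far) / 2.0
```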
Model 1 has a poor performance due to the binary supervision. In comparison, by only using the depth map as supervision, Model 2 achieves substantially better performance. However, after adding the RNN part with the rPPG supervision, our proposed Model 4 can achieve further performance improvement. By comparing Model 4 and Model 3, we can see the advantage of the non-rigid registration layer. It is clear that the RNN part cannot use the feature maps directly for tracking the changes in the activations and estimating the rPPG signals.

Advantage of longer sequences: To show the advantage of utilizing longer sequences for estimating the rPPG, we train and test our model when the sequence length N_f is 5, 10, or 20, using intra-testing on Oulu Protocol 2. From Tab. 2.3, we can see that by increasing the sequence length, the ACER decreases due to more reliable rPPG estimation. Despite the benefit of longer sequences, in practice, we are limited by the GPU memory size, and forced to decrease the image size to 128×128 for all experiments in Tab. 2.3. Hence, we set N_f to be 5 with the image size of 256×256 in subsequent experiments, due to the importance of higher resolution (e.g., a lower ACER of 2.5% in Tab. 2.4 is achieved than 4.16%).

2.5.2.2 Intra Testing

We perform intra testing on the Oulu and SiW databases. For Oulu, we follow the four protocols Boulkenafet (2017) and report their APCER, BPCER and ACER. Tab. 2.4 shows the comparison of our proposed method and the best two methods for each protocol respectively, in the face anti-spoofing competition Boulkenafet (2017). Our method achieves the lowest ACER in 3 out of 4 protocols.
We have slightly worse ACER on Protocol 2. To set a baseline for future study on SiW, we define three protocols for SiW. Protocol 1 deals with variations in face pose and expression. We train using the first 60 frames of the training videos that are mainly frontal view faces, and test on all testing videos. Protocol 2 evaluates the performance of cross spoof medium of replay attack. Protocol 3 evaluates the performance of cross PA, i.e., from print attack to replay attack and vice versa. Tab. 2.5 shows the definition and our performance of each protocol.

Table 2.4: The intra-testing results on four protocols of Oulu.

| Prot. | Method | APCER (%) | BPCER (%) | ACER (%) |
| 1 | CPqD | 2.9 | 10.8 | 6.9 |
| 1 | GRADIANT | 1.3 | 12.5 | 6.9 |
| 1 | Ours | 1.6 | 1.6 | 1.6 |
| 2 | MixedFASNet | 9.7 | 2.5 | 6.1 |
| 2 | Ours | 2.7 | 2.7 | 2.7 |
| 2 | GRADIANT | 3.1 | 1.9 | 2.5 |
| 3 | MixedFASNet | 5.3±6.7 | 7.8±5.5 | 6.5±4.6 |
| 3 | GRADIANT | 2.6±3.9 | 5.0±5.3 | 3.8±2.4 |
| 3 | Ours | 2.7±1.3 | 3.1±1.7 | 2.9±1.5 |
| 4 | MassyHNU | 35.8±35.3 | 8.3±4.1 | 22.1±17.6 |
| 4 | GRADIANT | 5.0±4.5 | 15.0±7.1 | 10.0±5.0 |
| 4 | Ours | 9.3±5.6 | 10.4±6.0 | 9.5±6.0 |

Table 2.5: The intra-testing results on three protocols of SiW.

| Prot. | Subset | Subject # | Attack | APCER (%) | BPCER (%) | ACER (%) |
| 1 | Train / Test | 90 / 75 | First 60 Frames / All | 3.58 | 3.58 | 3.58 |
| 2 | Train / Test | 90 / 75 | 3 display / 1 display | 0.57±0.69 | 0.57±0.69 | 0.57±0.69 |
| 3 | Train / Test | 90 / 75 | print (display) / display (print) | 8.31±3.81 | 8.31±3.80 | 8.31±3.81 |

2.5.2.3 Cross Testing

To demonstrate the generalization of our method, we perform multiple cross-testing experiments. Our model is trained with the live and spoof videos of 80 subjects in SiW, and tested on all protocols of Oulu. The ACERs on Protocols 1-4 are respectively: 10.0%, 14.1%, 13.8±5.7%, and 10.0±8.8%. Comparing these cross-testing results to the intra-testing results in Boulkenafet (2017), we are ranked sixth on the average ACER of four protocols, among the 15 participants of the face anti-spoofing competition. Especially on Protocol 4, the hardest one among all protocols, we achieve the same ACER of 10.0% as the top performer. This is a notable result since cross testing is known to be substantially harder than intra testing, and yet our cross-testing result is comparable with the top intra-testing performance. This demonstrates the generalization ability of our learnt model.

Furthermore, we utilize the CASIA-MFSD and Replay-Attack databases to perform cross testing between them, which is widely used as a cross-testing benchmark. Tab. 2.6 compares the cross-testing HTER of different methods. Our proposed method reduces the cross-testing errors on the Replay-Attack and CASIA-MFSD databases by 8.9% and 24.6% respectively, relative to the previous SOTA.

Table 2.6: Cross testing on CASIA-MFSD vs. Replay-Attack.

| Method | CASIA → Replay | Replay → CASIA |
| Motion de Freitas Pereira et al. (2013) | 50.2% | 47.9% |
| LBP de Freitas Pereira et al. (2013) | 55.9% | 57.6% |
| LBP-TOP de Freitas Pereira et al. (2013) | 49.7% | 60.6% |
| Motion-Mag Bharadwaj et al. (2013) | 50.1% | 47.0% |
| Spectral Pinto et al. (2015) | 34.4% | 50.0% |
| CNN Yang et al. (2014) | 48.5% | 45.5% |
| LBP Boulkenafet et al. (2015) | 47.0% | 39.6% |
| Colour Texture Boulkenafet et al. (2016) | 30.3% | 37.7% |
| Ours | 27.6% | 28.4% |

2.5.2.4 Visualization and Analysis

Examples of successful and failure cases in estimating depth maps and rPPG signals are shown in Fig. 2.8. In the proposed architecture, the frontalized feature maps are utilized as input to the RNN part and are supervised by the rPPG loss function. The values of these maps can show the importance of different facial areas to rPPG estimation. Fig. 2.9 shows the mean and standard deviation of frontalized feature maps, computed from 1,080 live and spoof videos of Oulu. We can see that the side areas of forehead and cheek have higher influence for rPPG estimation.

Figure 2.8: (a) 8 successful anti-spoofing examples and their estimated depth maps and rPPG signals. (b) 4 failure examples: the first two are live and the other two are spoof. Note our ability to estimate discriminative depth maps and rPPG signals.

Figure 2.9: Mean/std of frontalized feature maps for live and spoof.

Figure 2.10: The MSE of estimating depth maps and rPPG signals.
While the goal of our system is to detect PAs, our model is trained to estimate the auxiliary information. Hence, in addition to anti-spoofing, we would also like to evaluate the accuracy of the auxiliary information estimation. For this purpose, we calculate the accuracy of estimating depth maps and rPPG signals, for the testing data in Protocol 2 of Oulu. As shown in Fig. 2.10, the accuracy of both estimations on spoof data is high, while that of the live data is relatively lower. Note that the depth estimation of the mouth area has more errors, which is consistent with the fewer activations of the same area in Fig. 2.9.

Finally, we conduct statistical analysis on the failure cases, since our system can determine potential causes using the auxiliary information. With Protocol 2 of Oulu, we identify 31 failure cases (2.7% ACER). For each case, we calculate whether a classifier using its depth map or rPPG signal alone would fail. In total, 29/31, 13/31, and 11/31 samples fail due to the depth map, rPPG signals, or both, respectively. This indicates the future research direction.

2.6 Conclusions

This chapter identifies the importance of auxiliary supervision to deep model-based face anti-spoofing. The proposed network combines CNN and RNN architectures to jointly estimate the depth of face images and the rPPG signal of face video. One improvement of auxiliary supervisions is to make each CNN prediction based on a local receptive field. It would effectively reduce the overfitting of CNN training with limited data. We introduce the SiW database that contains more subjects and variations than prior databases. Finally, we experimentally demonstrate the superiority of our method.

Chapter 3

Generalization: Zero-shot and Open-set Face Anti-Spoofing

3.1 Introduction

Attackers can utilize a wide variety of mediums to launch spoof attacks. The most common ones are replaying videos/images on digital screens, i.e., replay attack, and printed photographs, i.e., print attack. Different methods are proposed to handle replay and print attacks, based on either handcrafted features Boulkenafet et al. (2015); Määttä et al. (2011); Patel et al. (2016b) or CNN-based features Atoum et al. (2017); Feng et al. (2016); Jourabloo et al. (2018); Liu et al. (2018c).
Recently, high-quality 3D custom masks have also been used for attacking, i.e., 3D mask attack. In Liu et al. (2018b, 2016a,b), methods for detecting print/replay attacks are found to be less effective for this new spoof type, and hence the authors leverage the remote photoplethysmography (rPPG) to detect the heart rate pulse as the cue. Further, facial makeup may also influence the outcome of face recognition, i.e., makeup attack Chen et al. (2013). Many works Chang et al. (2018); Chen et al. (2013, 2014) study facial makeup, despite not treating it as an anti-spoofing problem.

All aforementioned methods present algorithmic solutions to the known spoof attack(s), where anti-spoofing models are trained and tested on the same type(s) of spoof attacks. However, in real-world applications, attackers can also initiate spoof attacks that we, the algorithm designers, are not aware of, termed unknown spoof attacks¹. Researchers increasingly pay attention to the generalization of anti-spoofing models, i.e., how well they are able to detect spoof attacks that have never been seen during the training. We define the problem of detecting unknown face spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA). Despite the success of face anti-spoofing on known attacks, ZSFA, on the other hand, is a new and unsolved challenge to the community.

Figure 3.1: To detect unknown spoof attacks, we propose a Deep Tree Network (DTN) to unsupervisedly learn a hierarchical embedding for known spoof attacks. Samples of unknown attacks will be routed through DTN and classified at the destined leaf node.
The attempts on ZSFA are Arashloo et al. (2017); Xiong & AbdAlmageed (2018). They address ZSFA between print and replay attacks, and regard it as an outlier detection problem for live faces (a.k.a. real human faces). With handcrafted features, the live faces are modeled via standard generative models, e.g., GMM, auto-encoder. During testing, an unknown attack is detected if it lies outside the estimated live distribution. These ZSFA works have three drawbacks:

¹ There is subtle distinction between 1) unseen attacks, attack types that are known to algorithm designers so that algorithms could be tailored to them, but their data are unseen during training; and 2) unknown attacks, attack types that are neither known to designers nor seen during training. We do not differentiate these two cases and term both unknown attacks.

Lacking spoof type variety: Prior models are developed w.r.t. print and replay attacks only. The respective feature design may not be applicable to different unknown attacks.

No spoof knowledge: Prior models only use live faces, without leveraging the available known spoof data. While the unknown attacks are different, the known spoof attacks may still provide valuable information to learn the model.

Limitation of feature selection: They use handcrafted features such as LBP to represent live faces, which were shown to be less effective for known spoof detection Li et al. (2016a); Liu et al. (2018c); Patel et al. (2016a); Yang et al. (2014). Recent deep learning models Jourabloo et al.
(2018); Liu et al. (2018c) show the advantage of CNN models for face anti-spoofing.

This work aims to address all three drawbacks. Since one ZSFA model may perform differently when the unknown spoof attack is different, it should be evaluated on a wide range of unknown attack types. In this work, we substantially expand the study of ZSFA from 2 types of spoof attacks to 13 types. Besides print and replay attacks, we include 5 types of 3D mask attacks, 3 types of makeup attacks, and 3 partial attacks. These attacks cover both impersonation, i.e., the attempt to be authenticated as someone else, and obfuscation, i.e., the attempt to cover the attacker's own identity. We collect the first face anti-spoofing database that includes these diverse spoof attacks, termed Spoof in the Wild database with Multiple Attack Types (SiW-M), shown in Tab. 3.1.

To tackle the broader ZSFA, we propose a Deep Tree Network (DTN). Assuming there are both homogeneous features among different spoof types and distinct features within each spoof type, a tree-like model is well-suited to handle this case: learning the homogeneous features in the early tree nodes and distinct features in later tree nodes. Without any auxiliary labels of spoof types, DTN learns to partition data in an unsupervised manner. At each tree node, the partition is performed along the direction of the largest data variation. In the end, it clusters the data into several sub-groups at the leaf level, and learns to detect spoof attacks for each sub-group independently, shown in Fig. 3.1. During the testing, a data sample is routed to the most similar leaf node to produce a binary decision of live vs. spoof.

In summary, our contributions in this work include:
- Conduct an extensive study of zero-shot face anti-spoofing on 13 types of spoof attacks;
- Propose a Deep Tree Network to learn features hierarchically and detect unknown spoofs;
- Collect a new database for ZSFA and achieve the state-of-the-art performance on multiple testing protocols.

3.2 Prior Work

Face anti-spoofing: Image-based face anti-spoofing refers to face anti-spoofing techniques that only take RGB images as input without extra information such as depth or heat. In early years, researchers utilize liveness cues, such as eye blinking and head motion, to detect print attacks Kollreider et al.
(2007); Pan et al. (2007); Patel et al. (2016a); Shao et al. (2017). However, when encountering unknown attacks, such as a photograph with the eye portion cut out, or video replay, those methods suffer a total failure. Later, research moved to a more general texture analysis to address print and replay attacks. Researchers mainly utilize handcrafted features, e.g., LBP Boulkenafet et al. (2015); de Freitas Pereira et al. (2012, 2013); Määttä et al. (2011), HoG Komulainen et al. (2013a); Yang et al. (2013), SIFT Patel et al. (2016b) and SURF Boulkenafet et al. (2017a), with traditional classifiers, e.g., SVM and LDA, to make a binary decision. Those methods perform well on testing data from the same database. However, when the testing conditions change, such as the lighting and background, they often suffer a large performance drop, which can be viewed as an overfitting issue. Moreover, they also show limitations in handling 3D mask attacks, as mentioned in Liu et al. (2016a).

To overcome the overfitting issue, researchers have made various attempts. Boulkenafet et al. extract the features in HSV + YCbCr space Boulkenafet et al. (2015). Works in Agarwal et al. (2016); Bao et al. (2009); Bharadwaj et al. (2014); Feng et al. (2016); Xu et al. (2015) consider features in the temporal domain. Recent works Agarwal et al. (2016); Atoum et al. (2017) augment the data by using image patches, and fuse the scores from the patches into a single decision. For 3D mask attacks, the heart pulse rate is estimated to differentiate 3D masks from real faces Li et al. (2016b); Liu et al. (2016a). In the deep learning era, researchers have proposed several CNN works Atoum et al. (2017); Feng et al. (2016); Jourabloo et al. (2018); Li et al. (2016a); Liu et al. (2018c); Patel et al. (2016a); Yang et al. (2014) that outperform the traditional methods.

Zero-shot learning and unknown spoof attacks: Zero-shot object recognition, or more generally, zero-shot learning, aims to recognize objects from unknown classes Socher et al. (2013), i.e., object classes unseen in training. The overall idea is to associate the known and unknown classes via a semantic embedding, whose embedding spaces can be attributes Lampert et al. (2009), word vectors Frome et al. (2013), text descriptions Zhang et al. (2017a) and human gaze Karessli et al. (2017).
Zero-shot learning for unknown spoof attacks, i.e., ZSFA, is a relatively new topic with unique properties. Firstly, unlike zero-shot object recognition, ZSFA emphasizes the detection of spoof attacks, instead of recognizing the specific spoof types. Secondly, unlike generic objects with rich semantic embedding, there is no explicitly defined semantic embedding for spoof patterns Jourabloo et al. (2018). As elaborated in Sec. 3.1, prior ZSFA works Arashloo et al. (2017); Xiong & AbdAlmageed (2018) only model the live data via handcrafted features and standard generative models, with several drawbacks. In this work, we propose a deep tree network to unsupervisedly learn the semantic embedding for known spoof attacks. The partition of the data naturally associates certain semantic attributes with the sub-groups. During testing, the unknown attacks are projected to the embedding to find the closest attributes for spoof detection.

| Dataset | Num. of subj./vid. | Pose | Expression | Lighting | Replay | Print | 3D mask | Makeup | Partial | Total spoof types |
|---|---|---|---|---|---|---|---|---|---|---|
| CASIA-FASD Zhang et al. (2012) | 50 / 600 | Frontal | No | No | 1 | 2 | 0 | 0 | 0 | 3 |
| Replay-Attack Chingovska et al. (2012) | 50 / 1,200 | Frontal | No | Yes | 1 | 1 | 0 | 0 | 0 | 2 |
| HKBU-MARs Liu et al. (2016a) | 35 / 1,008 | Frontal | No | Yes | 0 | 0 | 2 | 0 | 0 | 2 |
| Oulu-NPU Boulkenafet et al. (2017b) | 55 / 5,940 | Frontal | No | No | 1 | 1 | 0 | 0 | 0 | 2 |
| SiW Liu et al. (2018c) | 165 / 4,620 | < 90° | Yes | Yes | 1 | 1 | 0 | 0 | 0 | 2 |
| SiW-M (2019) | 493 / 1,630 | < 90° | Yes | Yes | 1 | 1 | 5 | 3 | 3 | 13 |

Table 3.1: Comparing our SiW-M with existing face anti-spoofing datasets.
Deep tree networks: Tree structures are often found helpful in tackling language-related tasks such as parsing and translation Chen et al. (2018), due to the intrinsic relations of words and sentences. E.g., tree models are applied to joint vision and language problems such as visual question reasoning Cao et al. (2018). The tree structure also has the property of learning features hierarchically. Face alignment works Kazemi & Sullivan (2014); Valle et al. (2018) utilize regression trees to estimate facial landmarks from coarse to fine. Xiong et al. propose a tree CNN to handle large-pose face recognition Xiong et al. (2015). In Kaneko et al. (2018), Kaneko et al. propose a GAN with decision trees to learn hierarchically interpretable representations. In our work, we utilize tree networks to learn the latent semantic embedding for ZSFA.

Face anti-spoofing databases: Given the significance of a good-quality database, researchers have released several face anti-spoofing databases, such as CASIA-FASD Zhang et al. (2012), Replay-Attack Chingovska et al. (2012), OULU-NPU Boulkenafet et al. (2017b), and SiW Liu et al. (2018c) for print/replay attacks, and HKBU-MARs Liu et al. (2016a) for 3D mask attacks. Early databases such as CASIA-FASD and Replay-Attack Zhang et al. (2012) have limited subject variety, pose/expression/lighting variations, and video resolutions. Recent databases Boulkenafet et al. (2017b); Liu et al. (2016a, 2018c) improve those aspects, and also set up diverse evaluation protocols. However, up to now, all databases focus on either print/replay attacks or 3D mask attacks. To provide a comprehensive study of face anti-spoofing, especially the challenging ZSFA, we for the first time collect a database with diverse types of spoof attacks, as in Tab. 3.1. The details of our database are in Sec. 3.4.

Figure 3.2: The proposed Deep Tree Network (DTN) architecture. (a) The overall structure of DTN. A tree node consists of a Convolutional Residual Unit (CRU) and a Tree Routing Unit (TRU), and a leaf node consists of a CRU and a Supervised Feature Learning (SFL) module. (b) The concept of the Tree Routing Unit (TRU): finding the base with the largest variations. (c) The structure of each Convolutional Residual Unit (CRU). (d) The structure of the Supervised Feature Learning (SFL) module in the leaf nodes.
3.3 Deep Tree Network for ZSFA

The main purposes of DTN are twofold: 1) discover the semantic sub-groups for known spoofs; 2) learn the features in a hierarchical way. The architecture of DTN is shown in Fig. 3.2. Each tree node consists of a Convolutional Residual Unit (CRU) and a Tree Routing Unit (TRU), while a leaf node consists of a CRU and a Supervised Feature Learning (SFL) module. The CRU is a block with convolutional layers and short-cut connections. The TRU defines a node routing function to route a data sample to one of the child nodes. The routing function partitions all visiting data along the direction with the largest data variation. The SFL module concatenates the classification supervision and the pixel-wise supervision to learn the features.

3.3.1 Unsupervised Tree Learning

3.3.1.1 Node Routing Function

For a TRU node, let's assume the input $x = f(I\,|\,\theta) \in \mathbb{R}^m$ is the vectorized feature response, where $I$ is the data input, $\theta$ denotes the parameters of the previous CRUs, and $\mathcal{S}$ is the set of data samples $I_k, k = 1, 2, \dots, K$ that visit this TRU node. In Xiong et al. (2015), Xiong et al. define a routing function as:

$$\varphi(x) = x^T v + \tau, \quad (3.1)$$

where $v$ denotes the projection vector and $\tau$ is the bias. Data $\mathcal{S}$ can then be split into $\mathcal{S}_{left}: \{I_k \mid \varphi(x_k) < 0, I_k \in \mathcal{S}\}$ and $\mathcal{S}_{right}: \{I_k \mid \varphi(x_k) \geq 0, I_k \in \mathcal{S}\}$, and directed to the left and right child node, respectively. To learn this function, they propose to maximize the distance between the means of $\mathcal{S}_{left}$ and $\mathcal{S}_{right}$, while keeping the mean of $\mathcal{S}$ centered at $0$. This unsupervised loss is formulated as:

$$\mathcal{L} = \Big(\frac{1}{N}\sum_{I_k \in \mathcal{S}} \varphi(x_k)\Big)^2 - \Big(\frac{1}{N_l}\sum_{I_k \in \mathcal{S}_{left}} \varphi(x_k) - \frac{1}{N_r}\sum_{I_k \in \mathcal{S}_{right}} \varphi(x_k)\Big)^2, \quad (3.2)$$

where $N$, $N_l$, $N_r$ denote the number of samples in each set.

However, in practice, minimizing Eqn. 3.2 might not lead to a satisfactory solution. Firstly, the loss can be minimized by increasing the norm of either $v$ or $x$, which is a trivial solution. Secondly, even when the norms of $v$ and $x$ are constrained, Eqn. 3.2 is affected by the density of data $\mathcal{S}$ and can be sensitive to outliers. In other words, the zero expectation of $\varphi(x)$ does not necessarily result in a balanced partition of data $\mathcal{S}$. Local minima could be reached when all data are split to one side.
In some cases, the tree may suffer from collapsing to a few (or even one) leaf nodes. To better partition the data, we propose a novel routing function and an unsupervised loss. Disregarding $\tau$, the dot product between $x^T$ and $v$ can be regarded as projecting $x$ onto the direction of $v$. We design $v$ such that we can observe the largest variation after projection. Inspired by the concept of PCA, the optimal solution naturally becomes the largest PCA basis of data $\mathcal{S}$. To achieve this, we constrain $v$ to be of norm $1$ and reformulate Eqn. 3.1 as:

$$\varphi(x) = (x - \mu)^T v, \quad \|v\| = 1, \quad (3.3)$$

where $\mu$ is the mean of data $\mathcal{S}$. Then, $v$ is identical to the largest eigenvector of the covariance matrix $\bar{X}_S^T \bar{X}_S$, where $\bar{X}_S = X_S - \mu$, and $X_S \in \mathbb{R}^{N \times K}$ is the data matrix. Based on the definition of the eigen-analysis $\bar{X}_S^T \bar{X}_S v = \lambda v$, our optimization aims to maximize $\lambda$:

$$\arg\max_v \lambda = \arg\max_v v^T \bar{X}_S^T \bar{X}_S v. \quad (3.4)$$

The loss for learning the routing function is formulated as:

$$\mathcal{L}_{route} = \exp\!\big(-\alpha\, v^T \bar{X}_S^T \bar{X}_S v\big) + \beta\, \mathrm{Tr}\big(\bar{X}_S^T \bar{X}_S\big), \quad (3.5)$$

where $\alpha$, $\beta$ are scalars, set as 1e-3 and 1e-2 in our experiments. We apply the exponential function on the first term to make the maximization problem bounded. The second term is introduced as a regularizer to prevent trivial solutions by constraining the trace of the covariance matrix of $\bar{X}_S$.

3.3.1.2 Tree of Known Spoofs

With the routing function, we can build the entire binary tree. Fig. 3.2 shows a binary tree of depth 4, with 8 leaf nodes. As mentioned earlier in Sec. 3.3, the tree is designed to find the semantic sub-groups from all known spoofs, and is termed the spoof tree. Similarly, we may also train a live tree with live faces only, as well as a general data tree with both live and spoof data. Compared to the spoof tree, the live tree and the general data tree have some drawbacks. The live tree does not convey semantic meaning for the spoofs, and the attributes learned at each node cannot help to route and better detect spoofs; the general data tree may result in imbalanced sub-groups, where samples of one class outnumber another. Such imbalance would cause bias for the supervised learning in the next stage.
Hence, when we compute Eqn. 3.5 to learn the routing functions, we only consider the spoof samples to construct $X_S$. To have a balanced sub-group for each leaf, we suppress the responses of the live data to zero, so that all live data can be evenly partitioned to the child nodes. Meanwhile, we also suppress the responses of the spoof data that do not visit this node, so that every node models the distribution of a unique spoof subset.

Formally, for each node, we maximize the routing function responses of the spoof data that visit this node (denoted as $\mathcal{S}$), while minimizing the responses of other data (denoted as $\bar{\mathcal{S}}$), including all live data and the spoof data that don't visit this node, i.e., that visit the neighboring nodes. To achieve this objective, we define the following loss:

$$\mathcal{L}_{uniq} = -\frac{1}{N}\sum_{I_k \in \mathcal{S}} \big\|x_k^T v\big\|^2 + \frac{1}{\bar{N}}\sum_{I_k \in \bar{\mathcal{S}}} \big\|x_k^T v\big\|^2. \quad (3.6)$$

3.3.2 Supervised Feature Learning

Given the routing functions, a data sample $I_k$ will be assigned to one of the leaf nodes. Let's define the feature output of the leaf node as $F(I_k\,|\,\theta)$, shortened as $F_k$ for simplicity. At each leaf node, we define two node-wise supervised tasks to learn discriminative features: 1) binary classification drives the learning of a high-level understanding of live vs. spoof faces; 2) pixel-wise mask regression draws the CNN's attention to low-level local feature learning.

Classification supervision: To learn a binary classifier, as shown in Fig. 3.2(d), we apply two additional convolution layers and two fully connected layers on $F_k$ to generate a feature vector $c_k \in \mathbb{R}^{500}$. We supervise the learning via the softmax cross entropy loss:

$$\mathcal{L}_{class} = -\frac{1}{N}\sum_{I_k \in \mathcal{S}} \Big\{(1 - y_k)\log(1 - p_k) + y_k \log p_k\Big\}, \quad (3.7)$$

$$p_k = \frac{\exp(w_1^T c_k)}{\exp(w_0^T c_k) + \exp(w_1^T c_k)}, \quad (3.8)$$

where $\mathcal{S}$ represents all the data samples that arrive at this leaf node, $N$ denotes the number of samples in $\mathcal{S}$, $\{w_0, w_1\}$ are the parameters of the last fully connected layer, and $y_k$ is the label of data sample $k$ ($1$ denotes spoof, and $0$ live).
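As a concrete reference for Eqns. 3.3-3.8, the node-level objectives can be sketched in plain numpy. This is a minimal sketch, not the training code: `routing_stats`, `route` and `leaf_posterior` are hypothetical helper names, the toy feature dimensions are illustrative, and the batch-normalization-based moving-average estimate of $\mu$ (Sec. 3.3.3) is replaced by a plain mean.

```python
import numpy as np

def routing_stats(X_visit, X_other, alpha=1e-3, beta=1e-2):
    """TRU objectives at one node (Eqns. 3.3-3.6).

    X_visit: (N, m) spoof features that visit this node.
    X_other: (Nbar, m) live features plus spoof features visiting other nodes.
    Returns the routing direction v, the route loss and the unique loss.
    """
    mu = X_visit.mean(axis=0)
    Xc = X_visit - mu                     # centered data matrix
    cov = Xc.T @ Xc                       # covariance (unnormalized)
    # v is the largest eigenvector of cov, i.e., the largest PCA basis
    _, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1]                    # unit norm by construction
    # Eqn. 3.5: bounded maximization of the projected variance + trace regularizer
    L_route = np.exp(-alpha * (v @ cov @ v)) + beta * np.trace(cov)
    # Eqn. 3.6: keep visiting-spoof responses large, suppress all others
    r_visit = (X_visit - mu) @ v
    r_other = (X_other - mu) @ v
    L_uniq = -np.mean(r_visit ** 2) + np.mean(r_other ** 2)
    return v, L_route, L_uniq

def route(x, mu, v):
    """Negative projection goes to the left child, non-negative to the right."""
    return "left" if (x - mu) @ v < 0 else "right"

def leaf_posterior(c, w0, w1):
    """Spoof posterior p_k at a leaf node: a two-way softmax (Eqn. 3.8)."""
    e0, e1 = np.exp(w0 @ c), np.exp(w1 @ c)
    return e1 / (e0 + e1)
```

In the full model these quantities are computed per mini-batch and per node, with the route and unique losses applied only on the TRU nodes (Eqn. 3.10).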
Pixel-wise supervision: We also concatenate another convolution layer to $F_k$ to generate a map response $M_k \in \mathbb{R}^{32 \times 32}$. Inspired by the prior work Liu et al. (2018c), we leverage the semantic prior knowledge of face shapes and spoof attack positions to provide a pixel-wise supervision. Using the dense face alignment model Liu et al. (2017), we provide a binary mask $D_k \in \mathbb{R}^{32 \times 32}$, shown in Fig. 3.3, to indicate the pixels of the spoof mediums. Thus, for a leaf node, the loss function for the pixel-wise supervision is:

$$\mathcal{L}_{mask} = \frac{1}{N}\sum_{I_k \in \mathcal{S}} \|M_k - D_k\|_1. \quad (3.9)$$

Overall loss: Finally, we apply the supervised losses on the $p$ leaf nodes and the unsupervised losses on the $q$ TRU nodes, and formulate our training loss as:

$$\mathcal{L} = \sum_{i=1}^{p} \big(\lambda_1 \mathcal{L}^i_{class} + \lambda_2 \mathcal{L}^i_{mask}\big) + \sum_{j=1}^{q} \big(\lambda_3 \mathcal{L}^j_{route} + \lambda_4 \mathcal{L}^j_{uniq}\big), \quad (3.10)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the regularization coefficients for each term, set as $0.001$, $1.0$, $2.0$, $0.001$ respectively. For a 4-layer DTN, $p = 8$ and $q = 7$.

Figure 3.3: Examples of the live faces and the 13 types of spoof attacks. The second row shows the ground truth masks $D_k$ for the pixel-wise supervision. For $(m, n)$ in the third row, $m/n$ denotes the number of subjects/videos for each type of data.

Figure 3.4: The structure of the Tree Routing Unit (TRU).

3.3.3 Network Architecture

Deep Tree Network (DTN): DTN is the main framework of the proposed model. It takes $I \in \mathbb{R}^{256 \times 256 \times 6}$ as input, where the 6 channels are the RGB + HSV color spaces. We concatenate three $3 \times 3$ convolution layers with 40 channels and 1 max-pooling layer, and group them as one Convolutional Residual Unit (CRU). Each convolution layer is equipped with ReLU and a group normalization layer Wu & He (2018), due to the dynamic batch size in the network. We also apply a shortcut connection for each convolution layer. For each tree node, we deploy one CRU before the TRU. At the leaf node, DTN produces the feature representation of input $I$ as $F(I\,|\,\theta) \in \mathbb{R}^{32 \times 32 \times 40}$, then uses one $1 \times 1$ convolution layer to generate the binary mask map $M$.

Tree Routing Unit (TRU): The TRU is the module routing a data sample to one of the child CRUs.
As shown in Fig. 3.4, it compresses the feature by using a $1 \times 1$ convolution layer and resizing the response spatially. For the root node, we compress the CRU feature to $x \in \mathbb{R}^{32 \times 32 \times 10}$, and for the later tree nodes, we compress the CRU feature to $x \in \mathbb{R}^{16 \times 16 \times 20}$. Compressing the input feature to a smaller size helps to reduce the burden of computing and storing the covariance matrix in Eqn. 3.5. E.g., the vectorized feature for the first CRU is $x \in \mathbb{R}^{655{,}360}$, and the covariance matrix of $x$ can take $\sim$400 GB of memory. However, after compression the vectorized feature is $x \in \mathbb{R}^{10{,}240}$, and the covariance matrix of $x$ only needs $\sim$0.1 GB of memory.

After that, we vectorize the output and apply the routing function $\varphi(x)$. To compute $\mu$ in Eqn. 3.3, instead of optimizing it as a variable of the network, we simply apply a batch normalization layer without scaling to save the moving average of each mini-batch. In the end, we project the compressed CRU response to the largest basis $v$ and obtain the projection coefficient. Then we assign the samples with negative coefficients to the left child CRU and the samples with positive coefficients to the right child CRU.

Implementation details: With the overall loss in Eqn. 3.10, our proposed network is trained in an end-to-end fashion. All losses are computed based on each mini-batch. The DTN modules and TRU modules are optimized alternately. While optimizing the DTN, we keep the parameters of the TRUs fixed, and vice versa.

3.4 Spoof in the Wild Database with Multiple Attack Types

To benchmark face anti-spoofing methods specifically for unknown attacks, we collect the Spoof in the Wild database with Multiple Attack Types (SiW-M). Compared with the previous databases in Tab. 3.1, SiW-M shows a great diversity in spoof attacks, subject identities, environments and other factors.
For the spoof data collection, we consider two scenarios: impersonation, which entails the use of a spoof to be recognized as someone else, and obfuscation, which entails the use of a spoof to remove the attacker's own identity. In total, we collect 968 videos of 13 types of spoof attacks, listed hierarchically in Fig. 3.3. For all 5 mask attacks, the 3 partial attacks, obfuscation makeup and cosmetic makeup, we record 1080P HD videos. For impersonation makeup, we collect 720P videos from Youtube due to the lack of special makeup artists. For print and replay attacks, we intend to collect videos from harder cases where an existing spoof detection system fails. Hence, we deploy an off-the-shelf face anti-spoofing algorithm Liu et al. (2018c) and record spoof videos when the algorithm predicts live.

For the live data, we include 660 videos from 493 subjects. In comparison, the number of subjects in SiW-M is 9 times larger than Oulu-NPU Boulkenafet et al. (2017b) and CASIA-FASD Zhang et al. (2012), and 3 times larger than SiW Liu et al. (2018c). In addition, the subjects are diverse in ethnicity and age. The live videos are collected in 3 sessions: 1) a room environment where the subjects are recorded with few variations such as pose, lighting and expression (PIE); 2) a different and much larger room where the subjects are also recorded with PIE variations; 3) a mobile phone mode, where the subjects are moving while the phone camera is recording, with extreme pose angles and lighting conditions introduced. Similar to the print and replay videos, we deploy the face anti-spoofing algorithm Liu et al. (2018c) to filter out the videos where the algorithm predicts spoof. Hence, this third session is a harder scenario.

In total, we collect 1,630 videos, each lasting 5-7 seconds. The 1080P videos are recorded by a Logitech C920 webcam and a Canon EOS T6. To use SiW-M for the study of ZSFA, we define the leave-one-out testing protocols. Each time we train a model with 12 types of spoof attacks plus 80% of the live videos, and test on the left-out attack type plus the remaining 20% of the live videos. There are no overlapping subjects between the training and testing sets of live videos.
3.5 Experimental Results

3.5.1 Experimental Setup

Databases: We evaluate our proposed method on multiple databases. We deploy the leave-one-out testing protocols on SiW-M and report the results of the 13 experiments. Also, we test on previous face anti-spoofing databases, including CASIA Zhang et al. (2012), Replay-Attack Chingovska et al. (2012), and MSU-MFSD Wen et al. (2015), to compare with the state of the art.

Evaluation metrics: We evaluate with the following metrics: Attack Presentation Classification Error Rate (APCER) ISO/IEC-JTC-1/SC-37 (2016), Bona Fide Presentation Classification Error Rate (BPCER) ISO/IEC-JTC-1/SC-37 (2016), the average of APCER and BPCER, i.e., the Average Classification Error Rate (ACER) ISO/IEC-JTC-1/SC-37 (2016), Equal Error Rate (EER), and Area Under the Curve (AUC). Note that, in the evaluation of unknown attacks, we assume there is no validation set to tune the model and the thresholds while calculating the metrics. Hence, we determine the threshold based on the training set and fix it for all testing protocols. A single test sample is one video frame, instead of one video.

Parameter setting: The proposed method is implemented in TensorFlow Abadi et al. (2016), and trained with a constant learning rate of 0.001 and a batch size of 32. It takes 15 epochs to converge. We randomly initialize all the weights using a normal distribution with 0 mean and 0.02 standard deviation.

3.5.2 Experimental Comparison

3.5.2.1 Ablation Study

All ablation studies use the Funny Eye protocol.

Different fusion methods: In the proposed model, both the norm of the mask maps and the binary spoof scores could be utilized for the final classification. To find the best fusion method, we compute the ACER from using the map norm, the softmax score, the maximum of the map norm and softmax score, and the average of the two values, and obtain 31.7%, 20.5%, 21.0%, and 19.3% respectively. Since the average of the mask norm and the binary spoof score performs the best, we use it for the remaining experiments. Moreover, we set 0.2 as the threshold to compute APCER, BPCER and ACER for all the experiments.
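The fixed-threshold metrics used throughout this section can be sketched as follows. This is a minimal sketch with a hypothetical `acer_metrics` helper, assuming spoofness scores in [0, 1] and the fixed 0.2 threshold described above; it is not the evaluation code used for the tables.

```python
import numpy as np

def acer_metrics(scores, labels, threshold=0.2):
    """APCER / BPCER / ACER at a fixed threshold.

    scores: spoofness scores, higher means more likely spoof.
    labels: 1 = attack presentation (spoof), 0 = bona fide (live).
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred_spoof = scores >= threshold
    # APCER: fraction of attack presentations classified as bona fide
    apcer = np.mean(~pred_spoof[labels == 1])
    # BPCER: fraction of bona fide presentations classified as attack
    bpcer = np.mean(pred_spoof[labels == 0])
    # ACER: the average of the two error rates
    return apcer, bpcer, (apcer + bpcer) / 2
```

Because the threshold is fixed from the training set rather than tuned per protocol, ACER reflects the deployment scenario for unknown attacks more faithfully than EER.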
Different routing methods: Routing is a crucial step to find the best subgroup for detecting the spoofness of a testing sample. To show the effect of proper routing, we evaluate 2 alternative routing strategies: random routing and pick-one-leaf. Random routing denotes randomly selecting one leaf node for a testing sample to produce the prediction; pick-one-leaf denotes constantly selecting one particular leaf node to produce the results, for which we report the mean score and standard deviation of the 8 selections. As shown in Tab. 3.2, both strategies perform worse than the proposed routing function. In addition, the large standard deviation of the pick-one-leaf strategy shows the large performance difference of the 8 subgroups on the same type of unknown attacks, and demonstrates the necessity of a proper routing.

Advantage of each loss function: We have three important designs in our unsupervised tree learning: the route loss $\mathcal{L}_{route}$, the data used to compute the route loss, and the unique loss $\mathcal{L}_{uniq}$. To show the effect of each loss and the training strategy, we train and compare networks with each loss excluded and with alternative strategies.

| Strategies | APCER | BPCER | ACER | EER |
|---|---|---|---|---|
| Random routing | 37.1 | 16.1 | 26.6 | 24.7 |
| Pick-one-leaf | 51.2±20.0 | 18.1±4.9 | 34.7±8.8 | 24.1±3.1 |
| Proposed routing function | 17.0 | 21.5 | 19.3 | 19.8 |

Table 3.2: Comparing models with different routing strategies.

| Methods | APCER | BPCER | ACER | EER |
|---|---|---|---|---|
| MPT Xiong et al. (2015) | 31.4 | 24.2 | 27.8 | 27.3 |
| Live data ✓, Spoof data ✓, Unique Loss ✗ | 1.4 | 73.3 | 37.3 | 31.2 |
| Live data ✗, Spoof data ✓, Unique Loss ✗ | 70.0 | 12.7 | 41.3 | 44.8 |
| Live data ✓, Spoof data ✓, Unique Loss ✓ | 54.2 | 12.5 | 33.4 | 36.2 |
| Live data ✗, Spoof data ✓, Unique Loss ✓ | 17.0 | 21.5 | 19.3 | 19.8 |

Table 3.3: Comparing models with different tree losses and strategies. The first two terms of rows 2-5 refer to using live or spoof data in the tree learning. The last row is our method.
| Methods | Video | Cut Photo | Warped Photo | Video | Digital Photo | Printed Photo | Printed Photo | HR Video | Mobile Video | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| OC-SVM_RBF + BSIF Arashloo et al. (2017) | 70.7 | 60.7 | 95.9 | 84.3 | 88.1 | 73.7 | 64.8 | 87.4 | 74.7 | 78.7±11.7 |
| SVM_RBF + LBP Boulkenafet et al. (2017b) | 91.5 | 91.7 | 84.5 | 99.1 | 98.2 | 87.3 | 47.7 | 99.5 | 97.6 | 88.6±16.3 |
| NN + LBP Xiong & AbdAlmageed (2018) | 94.2 | 88.4 | 79.9 | 99.8 | 95.2 | 78.9 | 50.6 | 99.9 | 93.5 | 86.7±15.6 |
| Ours | 90.0 | 97.3 | 97.5 | 99.9 | 99.9 | 99.6 | 81.6 | 99.9 | 97.5 | 95.9±6.2 |

Table 3.4: AUC (%) of the model testing on CASIA (Video, Cut Photo, Warped Photo), Replay-Attack (Video, Digital Photo, Printed Photo), and MSU-MFSD (Printed Photo, HR Video, Mobile Video).

First, we train a network with the routing function proposed in Xiong et al. (2015), and then 4 models with different modules turned on and off, as shown in Tab. 3.3. The model with MPT Xiong et al. (2015) routes data to only 2 leaf nodes out of 8 (i.e., the tree collapse issue), which limits the performance. Models without the unique loss exhibit the imbalanced routing issue, where sub-groups cannot be trained properly. Models using all data to learn the tree show worse performance than using spoof data only. Finally, the proposed method performs the best among all options.

3.5.2.2 Testing on existing databases

Following the protocol proposed in Arashloo et al. (2017), we use CASIA Zhang et al. (2012), Replay-Attack Chingovska et al. (2012) and MSU-MFSD Wen et al. (2015) to perform ZSFA testing between replay and print attacks. Tab. 3.4 compares the proposed method with the top three methods selected from over 20 methods in Arashloo et al. (2017); Boulkenafet et al. (2017b); Xiong & AbdAlmageed (2018). Our proposed method outperforms the prior state of the art by a convincing margin of 7.3%, and our smaller standard deviation further indicates a consistently good performance among the unknown attacks.
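For completeness, the threshold-free metrics reported in Tab. 3.4 (EER and AUC) can be sketched the same way. `eer_and_auc` is a hypothetical helper: it uses a simple sweep over observed thresholds for EER and the rank-sum view of AUC (ties between scores are ignored), not the evaluation code behind the tables.

```python
import numpy as np

def eer_and_auc(live_scores, spoof_scores):
    """EER and AUC from spoofness scores (higher = more likely spoof).

    live_scores / spoof_scores: iterables of scores for bona fide and
    attack samples. A coarse sweep, fine for a sketch.
    """
    live = np.asarray(live_scores)
    spoof = np.asarray(spoof_scores)
    best_gap, eer = 1.0, 1.0
    for t in np.unique(np.concatenate([live, spoof])):
        far = np.mean(live >= t)    # live accepted as spoof
        frr = np.mean(spoof < t)    # spoof accepted as live
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    # AUC as the probability that a random spoof outscores a random live
    auc = np.mean([s > l for s in spoof for l in live])
    return eer, auc
```

With perfectly separated scores the helper returns an EER of 0 and an AUC of 1, matching the intuition that the two metrics summarize the full score distribution rather than one operating point.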
3.5.2.3 Testing on SiW-M

We execute the 13 leave-one-out testing protocols on SiW-M. We compare with two of the most recent face anti-spoofing methods Boulkenafet et al. (2017b); Liu et al. (2018c), and set Liu et al. (2018c) as the baseline, which has demonstrated SOTA performance on various benchmarks. For a fair comparison with the baseline, we provide the same pixel-wise labeling (as in Fig. 3.3), and set the same threshold of 0.2 to compute APCER, BPCER, and ACER.

As shown in Tab. 3.5, our method achieves an overall better APCER, ACER and EER, improving over the baseline by 55%, 29%, and 5% respectively. Specifically, we reduce the ACERs of transparent mask, funny eye, and paper glasses by 31%, 61%, and 51%, where the baseline models can be considered total failures since they recognize most of the attacks as live. Note that ACER is more valuable in the context of ZSFA: there is no evaluation data for setting the threshold, and the thresholds for obtaining the EER performance vary considerably. For instance, the EERs of the paper glasses model are similar between the baseline and our method, but with a preset threshold, our method offers a much better ACER.

Moreover, the proposed method is a more compact model than Liu et al. (2018c). Given the input size of $256 \times 256 \times 6$, the baseline requires 87 GFlops to compute the result while our method only needs 6 GFlops ($15\times$ smaller). More analyses are shown with the visualization in Sec. 3.5.2.4.
| Methods | Metrics (%) | Replay | Print | Half | Silicone | Trans. | Paper | Manne. | Obfusc. | Imperson. | Cosmetic | Funny Eye | Paper Glasses | Paper | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM_RBF + LBP Boulkenafet et al. (2017b) | APCER | 19.1 | 15.4 | 40.8 | 20.3 | 70.3 | 0.0 | 4.6 | 96.9 | 35.3 | 11.3 | 53.3 | 58.5 | 0.6 | 32.8±29.8 |
| | BPCER | 22.1 | 21.5 | 21.9 | 21.4 | 20.7 | 23.1 | 22.9 | 21.7 | 12.5 | 22.2 | 18.4 | 20.0 | 22.9 | 21.0±2.9 |
| | ACER | 20.6 | 18.4 | 31.3 | 21.4 | 45.5 | 11.6 | 13.8 | 59.3 | 23.9 | 16.7 | 35.9 | 39.2 | 11.7 | 26.9±14.5 |
| | EER | 20.8 | 18.6 | 36.3 | 21.4 | 37.2 | 7.5 | 14.1 | 51.2 | 19.8 | 16.1 | 34.4 | 33.0 | 7.9 | 24.5±12.9 |
| Auxiliary Liu et al. (2018c) | APCER | 23.7 | 7.3 | 27.7 | 18.2 | 97.8 | 8.3 | 16.2 | 100.0 | 18.0 | 16.3 | 91.8 | 72.2 | 0.4 | 38.3±37.4 |
| | BPCER | 10.1 | 6.5 | 10.9 | 11.6 | 6.2 | 7.8 | 9.3 | 11.6 | 9.3 | 7.1 | 6.2 | 8.8 | 10.3 | 8.9±2.0 |
| | ACER | 16.8 | 6.9 | 19.3 | 14.9 | 52.1 | 8.0 | 12.8 | 55.8 | 13.7 | 11.7 | 49.0 | 40.5 | 5.3 | 23.6±18.5 |
| | EER | 14.0 | 4.3 | 11.6 | 12.4 | 24.6 | 7.8 | 10.0 | 72.3 | 10.1 | 9.4 | 21.4 | 18.6 | 4.0 | 17.0±17.7 |
| Ours | APCER | 1.0 | 0.0 | 0.7 | 24.5 | 58.6 | 0.5 | 3.8 | 73.2 | 13.2 | 12.4 | 17.0 | 17.0 | 0.2 | 17.1±23.3 |
| | BPCER | 18.6 | 11.9 | 29.3 | 12.8 | 13.4 | 8.5 | 23.0 | 11.5 | 9.6 | 16.0 | 21.5 | 22.6 | 16.8 | 16.6±6.2 |
| | ACER | 9.8 | 6.0 | 15.0 | 18.7 | 36.0 | 4.5 | 7.7 | 48.1 | 11.4 | 14.2 | 19.3 | 19.8 | 8.5 | 16.8±11.1 |
| | EER | 10.0 | 2.1 | 14.4 | 18.6 | 26.5 | 5.7 | 9.6 | 50.2 | 10.1 | 13.2 | 19.8 | 20.5 | 8.8 | 16.1±12.2 |

Table 3.5: The evaluation and comparison of the testing on SiW-M. Half, Silicone, Trans., Paper and Manne. are the mask attacks; Obfusc., Imperson. and Cosmetic are the makeup attacks; Funny Eye, Paper Glasses and Paper are the partial attacks.

Among all the attacks, replay, print, half mask, paper mask, and impersonation makeup are impersonation attacks. The average ACER/EER of the impersonation attacks is 9.3/8.5, which is lower than the overall average ACER/EER. This shows that the proposed method handles impersonation attacks better. When the attackers try to impersonate someone, the spoof face is required to be similar to a live face, and thus the network can extract features more easily. However, when the attackers just try to hide their own identity (obfuscation attacks), the spoof face does not need to look like a live face, which makes it easier to become an outlier of the data distribution and fool the anti-spoofing system.

3.5.2.4 Visualization and Analysis

To provide a better understanding of the tree learning and ZSFA, we visualize the results in several ways. First, we illustrate the tree routing results. In Fig. 3.5, we rank the spoof data based on the routing function values $\varphi(x)$, and provide 8 examples with responses from the smallest to the largest. This offers an intuitive understanding of what is learned at each tree node. We observe an obvious spoof style transfer: for the first-two-layer nodes $N_1$, $N_2$ and $N_3$, the transfer captures the change of general spoof attributes such as image quality and color temperature; for the third-layer tree nodes $N_4$, $N_5$, $N_6$, and $N_7$, the transfer involves more specific spoof type changes. E.g., $N_7$ transfers from eye-portion spoofs to full-face 3D mask spoofs.

Figure 3.5: Visualization of the tree routing.

Figure 3.6: Tree routing distributions of live/spoof data. The x-axis denotes the 8 leaf nodes, and the y-axis denotes the 15 types of data. The number in each cell represents the percentage (%) of data that fall in that leaf node. Each row sums to 1. (a) Print protocol. (b) Transparent Mask protocol. The yellow box denotes the unknown attacks.

Figure 3.7: t-SNE visualization of the DTN leaf features.

Further, Fig. 3.6 quantitatively analyzes the tree routing distributions of all types of data. We utilize two models, Print and Trans. Mask, to generate the distributions. It can be observed that the live samples are relatively more spread out over the 8 leaf nodes while the spoof attacks are routed to fewer
leaf nodes. The two distributions in Fig. 3.6 (a) & (b) share similar semantic sub-groups, which demonstrates the success of the proposed method in learning a semantic tree. E.g., in both models, about half of the trans. mask samples share the same leaf node as ob. makeup. By comparing the two distributions, most of the testing unknown spoofs in both models are successfully routed to the most similar sub-groups.

In addition, we use t-SNE Maaten & Hinton (2008) to visualize the feature space of the Print model. The t-SNE is able to project the output of the leaf node $F(I\,|\,\theta) \in \mathbb{R}^{32 \times 32 \times 40}$ to 2D by preserving the KL divergence distance. Fig. 3.7 shows that the features of different types of spoof attacks are well-clustered into 8 semantic sub-groups even though we don't provide any auxiliary labels. Based on these sub-groups, the features of the unknown print attacks lie well within the sub-groups of replay and silicone mask, and thus are recognized as spoof. Moreover, with the visualization, we can explain the performance variation among different spoof attacks, shown in Tab. 3.5. Among all, the performances of trans. mask, funny eye, paper glasses and ob. makeup are worse than the other protocols. The feature space shows that the live samples lie much closer to those attacks than to the others (the "→" places in Fig. 3.7), and hence it is harder to distinguish them from the live samples. This demonstrates the diverse properties of different unknown attacks and the necessity of such a wide-range evaluation.

3.6 Conclusions

This chapter tackles the zero-shot face anti-spoofing problem among 13 types of spoof attacks. The proposed method leverages a deep tree network to route the unknown attacks to the most proper leaf node for spoof detection. The tree is trained in an unsupervised fashion to find the feature basis with the largest variation to split the spoof data. We collect SiW-M, which contains more subjects and spoof types than any previous database. Finally, we experimentally show the superior performance of the proposed method.
Chapter 4

Visualization: Disentangling Spoof Traces with Physical Modeling

4.1 Introduction

As most face recognition systems are based on a monocular RGB camera, monocular RGB based face anti-spoofing has been studied for over a decade, and one of the most common approaches is based on texture analysis Boulkenafet et al. (2015, 2016); Patel et al. (2016b). Researchers noticed that presenting faces from spoof mediums introduces special texture differences, such as color distortions, unnatural specular highlights, Moiré patterns, etc. Those texture differences are inherent within the spoof mediums and thus hard to remove or camouflage. Conventional approaches build a feature extractor plus classifier pipeline, such as LBP+SVM and HOG+SVM in de Freitas Pereira et al. (2012); Komulainen et al. (2013a), and show good performance on several small databases with constrained environments. In recent years, many works such as Atoum et al. (2017); Liu et al. (2018c, 2019a); Shao et al. (2019a); Yang et al. (2019a) leverage deep learning techniques and show great progress in face anti-spoofing performance. Deep learning based methods can be generally grouped into 3 categories: direct FAS, auxiliary FAS, and generative FAS, as illustrated in Fig. 4.2.

Early works Xu et al. (2015); Yang et al. (2014) build vanilla CNNs with binary output to directly predict the spoofness of an input face (Fig. 4.2a). Methods in Liu et al. (2018c); Yang et al. (2019a) propose to learn an intermediate representation, e.g., depth or rPPG, instead of the binary classes, which can lead to better generalization and performance (Fig. 4.2b). Feng et al. (2020); Jourabloo et al. (2018); Stehouwer et al. (2020) additionally attempt to generate the visual patterns existing in the spoof samples (Fig. 4.2c), providing a more intuitive interpretation of the sample's spoofness.

Figure 4.1: The proposed approach can detect spoof faces, disentangle the spoof traces, and reconstruct the live counterparts. It can be applied to diverse spoof types and estimates distinct traces (e.g., the Moiré pattern in the replay attack, the artificial eyebrow and wax in the makeup attack, the color distortion in the print attack, and the specular highlights in the 3D mask attack). Zoom in for details.
Despite the success, there are still at least three unsolved problems in the topic of deep learning-based face anti-spoofing.

First, most prior works are designed to tackle limited spoof types, either print/replay or 3D masks solely, while a real-world anti-spoofing system may encounter a wide variety of spoof types, including print, replay, various 3D masks, facial makeup, and even unseen attack types. Therefore, to better reflect real-world performance, we need a benchmark to evaluate face anti-spoofing under known attacks, unknown attacks, and their combination (termed the open-set setting).

Second, many approaches formulate face anti-spoofing as a classification/regression problem, with a single score as the output. Although auxiliary FAS and generative FAS attempt to offer some extent of interpretation by saliency or noise analysis, there is little understanding of what the exact differences are between live and spoof, and what patterns the classifier's decision is based upon. A better interpretation would be to estimate the exact patterns differentiating a spoof face and its live counterpart, termed the spoof trace.

Thirdly, compared with other face analysis tasks such as recognition or alignment, the data for face anti-spoofing has several limitations. Most FAS databases are captured in constrained indoor environments, which have limited intra-subject variation and environment variation. Some special spoof types, such as makeup and customized silicone masks, require highly skilled experts to apply or create, at high cost, which results in very limited samples (i.e., long-tail data). Thus, how to learn from data with limited variations or samples is a challenge for FAS.

Figure 4.2: The comparison of different deep-learning based face anti-spoofing approaches. (a) Direct FAS only provides a binary decision of spoofness. (b) Auxiliary FAS can provide a simple interpretation of spoofness; M denotes the auxiliary task, such as depth map estimation. (c) Generative FAS can provide a more intuitive interpretation of spoofness, but only for a limited number of spoof attacks. (d) The proposed method can provide spoof trace estimation for generic face spoof attacks.
In this work, we aim to design a face anti-spoofing model that is applicable to a wide variety of spoof types, termed generic face anti-spoofing. We equip this model with the ability to explicitly disentangle the spoof traces from the input faces. Some examples of spoof trace disentanglement are shown in Fig. 4.1. This is a challenging objective due to the diversity of spoof traces and the lack of ground truth during model learning. However, we believe that this objective can bring several benefits:

1. Binary classification for face anti-spoofing would harvest any cue that helps classification, which might include spoof-irrelevant cues such as lighting, and thus hinder generalization. In contrast, spoof trace disentanglement explicitly tackles the most fundamental cue in spoofing, upon which the classification can be more grounded and witness better generalization.

2. With the trend of pursuing explainable AI as mentioned in Arrieta et al. (2020); Turek (2016), it is desirable for the face anti-spoofing model to generate the spoof patterns that support its binary decision, since the spoof trace serves as a good visual explanation of the model's decision. Certain properties (e.g., severity, methodology) of spoof attacks might potentially be revealed from the traces.

3. Disentangled spoof traces can enable the synthesis of realistic spoof samples, which addresses the issue of limited training data for the minority spoof types, such as special 3D masks and makeup.

As shown in Fig. 4.2d, we propose a Physics-guided Spoof Trace Disentanglement (PhySTD) to explore the spoof traces for generic face anti-spoofing. To model all types of spoofs, we formulate the spoof trace disentanglement as a combination of an additive process and an inpainting process. The additive process describes spoofing as the spoof material introducing extra patterns (e.g.
,moire pattern),wherethelivecounterpartcanberecoveredbyremovingthosepatterns.Inpaintingprocess describesasspoofmaterialfullycoveringcertainregionsoftheoriginalface,wherethe livecounterpartofthoseregionshastobefiguessedflasshowninBertalmioetal.(2000);Liu& 56 Shu(2015).Wefurtherdecomposethespooftracesintofrequency-dependentcomponents,sothat traceswithdifferentfrequencypropertiescanbeequallyhandled.Forthenetworkarchitecture,we extendabackbonenetworkforauxiliaryFASwithadecodertoperformthedisentanglement.With nogroundtruthofspooftraces,weadoptanoverallGAN-basedtrainingstrategy.Thegenerator takesaninputface,estimatesitsspoofness,anddisentanglesthespooftrace.Afterobtainingthe spooftrace,wecanreconstructthelivecounterpartfromthespoofandsynthesizenewspooffrom thelive.Thesynthesizedsamplesarethensenttomultiplediscriminatorswithrealsamplesfor adversarialtraining.Thesynthesizedspoofsamplesarefurtherutilizedtotrainthegeneratorina fullysupervisedfashion,thankstodisentangledspooftracesasgroundtruthforthesynthesized samples.Tocorrectpossiblegeometricdiscrepancyduringspoofsynthesis,weproposeanovel 3 D warpinglayertodeformspooftracestowardthetargetliveface. ApreliminaryversionofthisworkwaspublishedintheProceedingsEuropeanConferenceon ComputerVision(ECCV)2020Liuetal.(2020).Weextendtheworkfromthreeaspects. 1 )Guided bythephysicsofhowaspoofisgenerated,weintroduceaspoofgenerationfunction(SGF)tomodel thespooftracedisentanglementasacombinationofadditiveandinpaintingprocesses.SGFhasa betterandmorenaturalmodelingofgenericspoofattacks,suchaspaperglass. 2 )Previoustrace components f S ; B ; C ; T g arenotsupervisedhierarchicallysothatthereexistssemanticambiguity. Inthiswork,weintroduceseveralhierarchicaldesignsintheGANframeworktoremedysuch ambiguity. 3 )Weproposeanopen-settestingscenariotofurtherevaluatethereal-worldperformance forfacemodels.Bothknownandunknownattacksareincludedintheopen-settesting. 
We perform a side-by-side comparison between the proposed approach and the state-of-the-art (SOTA) face anti-spoofing solutions on multiple datasets and protocols. In summary, the main contributions of this work are as follows:

– We for the first time study spoof traces for generic face anti-spoofing, where a wide variety of spoof types are tackled with one framework;
– We propose a novel physics-guided model to disentangle spoof traces, and utilize the spoof traces to synthesize new data samples for enhanced training;
– We propose novel protocols for generic open-set face anti-spoofing;
– We achieve SOTA anti-spoofing performance and provide convincing visualization for a wide variety of spoof types.

4.2 Related Work

Face Anti-Spoofing Face anti-spoofing has been studied for more than a decade, and its development can be roughly divided into three stages. In the early years, researchers leveraged spontaneous human movement, such as eye blinking and head motion, to detect simple print photograph or static replay attacks Kollreider et al. (2007); Pan et al. (2007). However, when facing counterattacks, such as a print face with the eye region cut out, or replaying a face video, those methods would fail. In the second stage, researchers paid more attention to texture differences between live and spoof, which are inherent to spoof mediums. Researchers mainly extract hand-crafted features from the faces, e.g., LBP Boulkenafet et al. (2015); de Freitas Pereira et al. (2012, 2013); Määttä et al. (2011), HoG Komulainen et al. (2013a); Yang et al. (2013), SIFT Patel et al. (2016b) and SURF Boulkenafet et al. (2016), and train a classifier to split live vs. spoof, e.g., SVM and LDA.

Recently, face anti-spoofing solutions equipped with deep learning techniques have demonstrated significant improvements over the conventional methods. Methods in Feng et al. (2016); Li et al. (2016a); Patel et al. (2016a); Yang et al. (2014) train a deep neural network to learn a binary classification between live and spoof. In Atoum et al. (2017); Liu et al. (2018c, 2019a); Shao et al. (2019a); Yang et al. (2019a), additional supervisions, such as the face depth map and rPPG signal, are utilized to help the network learn more generalizable features. As the latest approaches achieve saturated performance on several benchmarks, researchers have started to explore more challenging cases, such as few-shot/zero-shot face anti-spoofing Liu et al. (2019a); Qin et al. (2019); Zhao et al. (2019) and domain adaptation in face anti-spoofing Shao et al. (2019a,b).

In this work, we aim to solve an interesting yet very challenging problem: disentangling and visualizing the spoof traces from an input face. A related work, Jourabloo et al. (2018), also adopts a GAN seeking to estimate the spoof traces. However, they formulate the traces as low-intensity noises, which is limited to print and replay attacks only and cannot provide convincing visual results. In contrast, we explore spoof traces for a much wider range of spoof attacks, visualize them with a novel disentanglement, and also evaluate the proposed method on the challenging cases, e.g., zero-shot face anti-spoofing.

Disentanglement Learning Disentanglement learning is often adopted to better represent complex data and features. DR-GAN Tran et al. (2017b) disentangles a face into identity and pose vectors for pose-invariant face recognition and view synthesis. Similarly, in gait recognition, Zhang et al. (2019) disentangles the representations of appearance, canonical, and pose features from an input gait video. 3D reconstruction works Liu et al. (2018a); Tran & Liu (2021) also disentangle the representation of a 3D face into identity, expressions, poses, albedo, and illuminations. For image synthesis, Esser et al. (2018) disentangles an image into appearance and shape with a U-Net and Variational Autoencoder (VAE).

Different from Liu et al. (2018a); Tran et al. (2017b); Zhang et al. (2019), we intend to disentangle features that have different scales and contain geometric information. We leverage multiple outputs to represent features at different scales, and adopt multiple-scale discriminators to properly learn them. Moreover, we propose a novel warping layer to tackle the geometric discrepancy during the disentanglement and reconstruction.

Figure 4.3 Overview of the proposed Physics-guided Spoof Trace Disentanglement (PhySTD).
Image Trace Modeling Image traces are certain signals existing in an image that can reveal information about the capturing camera, imaging setting, environment, and so on. Those signals often have much lower energy compared to the image content, which requires proper modeling to explore. Abdelhamed et al. (2018); Thai et al. (2013, 2016) observe the difference of image noises, and use them to recognize the capture cameras. From the frequency domain, Stehouwer et al. (2020) shows that the image noises from different cameras obey different noise distributions. Such techniques are applied to the field of image forensics, and later Chen et al. (2020); Wang et al. (2017) propose methods to remove such traces for image anti-forensics.

Recently, image trace modeling is widely used in image forgery detection and image adversarial attack detection Dang et al. (2020); Wu et al. (2019). In this work, we attempt to explore the traces of spoof face presentation. Due to different spoof mediums, spoof traces show large variations in content, intensity, and frequency distribution. We propose to disentangle the traces into additive traces and an inpainting trace, and for the additive traces, we further decompose them based on different frequency bands.

4.3 Physics-based Spoof Trace Disentanglement

4.3.1 Problem Formulation

Let the domain of live faces be denoted as L ⊂ R^(N×N×3) and spoof faces as S ⊂ R^(N×N×3), where N is the image size. We intend to obtain not only the correct prediction (live vs. spoof) of the input face, but also a convincing estimation of the spoof trace and live face reconstruction. To represent the spoof trace, our preliminary version assumes an additive relation between live and spoof, and uses 4 trace components {S, B, C, T} at different frequency bands:

I_spoof = (1 + ⌊S⌋_n1) I_live + ⌊B⌋_n1 + ⌊C⌋_n2 + T,   (4.1)

where S, B represent low-frequency traces, C represents mid-frequency ones, and T represents high-frequency ones. ⌊·⌋_n is the low band-pass operation; in practice, we achieve this by downsampling the original image to size n×n and upsampling it back. In the previous setting, n1 = 1 and n2 = 64. Compared to the simple representation with only a single component in Jourabloo et al. (2018), this multi-scale representation of {S, B, C, T} can largely improve the disentanglement quality and suppress undesired artifacts due to its multi-scale design. The model is designed to provide a valid estimation of the spoof traces {S, B, C, T} without respective ground truth. Our preliminary version Liu et al. (2020) aims to find a minimum intensity change that transfers an input face to the live domain:

argmin_Î ‖I − Î‖_F   s.t.  I ∈ (S ∪ L) and Î ∈ L,   (4.2)

where I is the source face, Î is the target face to be optimized, and I − Î is defined as the spoof trace. When the source face is live, I_live, I − Î should be 0 as I is already in L. When the source face is spoof, I_spoof, I − Î should be regularized to prevent unnecessary changes such as identity shift.

Despite the effectiveness of this representation, there are still two drawbacks. First, the spoof trace disentanglement is mainly formulated as an additive process. The optimization of Eqn. 4.2 limits the trace intensity, and the reconstruction for spoof regions with large appearance divergence might be sub-optimal, such as spoof glasses or masks. For those spoof regions, the physical relationship between the live and the spoof is better described as replacement rather than addition. Second, while our preliminary version represents the traces with hierarchical components, these components are learned with losses on their summation. Without careful supervision, the learned components can be ambiguous in their semantic meanings, e.g., the high-frequency component may include low-frequency information.
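For concreteness, the low band-pass operator ⌊·⌋_n and the trace composition of Eqn. 4.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation: the down/upsampling here uses block averaging with nearest-neighbor upsampling as an assumed stand-in for the resizing used in practice, and all array names are illustrative.

```python
import numpy as np

def lowpass(img, n):
    """⌊img⌋_n: downsample an N×N×C image to n×n by block averaging,
    then upsample back to N×N by pixel repetition (nearest neighbor)."""
    N = img.shape[0]
    assert N % n == 0, "illustration assumes n divides the image size"
    b = N // n
    small = img.reshape(n, b, n, b, -1).mean(axis=(1, 3))  # n×n×C
    return np.repeat(np.repeat(small, b, axis=0), b, axis=1)

def compose_spoof(I_live, S, B, C, T, n1=1, n2=64):
    """Eqn. 4.1: I_spoof = (1 + ⌊S⌋_n1) · I_live + ⌊B⌋_n1 + ⌊C⌋_n2 + T."""
    return (1 + lowpass(S, n1)) * I_live + lowpass(B, n1) + lowpass(C, n2) + T
```

With all four trace components set to zero, the composition returns the live face unchanged, matching the requirement below Eqn. 4.2 that I − Î be 0 for live inputs.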
To address the drawbacks, we introduce a spoof generation function (SGF) as an additive process followed by an inpainting process:

I_spoof = (1 − P)(I_live + T_A) + P · T_P,   (4.3)

where T_A ∈ R^(N×N×3) indicates the traces from the additive process, T_P indicates the traces from the inpainting process, and P ∈ R^(N×N×1) denotes the inpainting region. Given a spoof face, one may reconstruct the live counterpart by inverting Eqn. 4.3:

Î_live = (1 − P)(I_spoof − T_A) + P (Î_live + I_spoof − T_P),   (4.4)

As the inpainting process physically replaces content, the spoof trace T_P in the inpainting region P is identical to the spoof image I_spoof in the same region, and thus both cancel out in the second term of Eqn. 4.4. We further rename the Î_live in the second term as I_P to indicate the inpainting content within the inpainting region that should be estimated by the model. Therefore, the reconstruction of the live image becomes:

Î_live = (1 − P)(I_spoof − T_A) + P · I_P,   (4.5)

where T_A = ⌊B⌋_n1 + ⌊C⌋_n2 + T denotes the additive trace represented by three hierarchical components. n1 and n2 are set to 32 and 128, respectively. With a larger n1, the effect of the component S in the preliminary version can be incorporated into B, and hence we remove S for simplicity. Besides the additive traces, the model is further required to estimate the inpainting region P and the inpainting live content I_P. I_P is estimated based on the rest of the live facial region, without intensity constraint. We use a function G(·) to represent the reconstruction process of Eqn. 4.5. Accordingly, the optimization of Eqn. 4.2 is re-formulated by replacing Î with Eqn. 4.5:

argmin_{T_A, P, I_P} ‖I − (1−P)(I − T_A) − P · I_P‖_F → argmin_{T_A, P, I_P} ‖(1−P) T_A‖_F + ‖P (I − I_P)‖_F.   (4.6)

As we do not wish to impose any intensity constraint on I_P, the objective is formulated as:

argmin_{T_A, P} ‖(1−P) T_A‖_F + λ‖P‖_F   s.t.  I ∈ S ∪ L, Î ∈ L,   (4.7)

where λ is a weight to balance the two terms. In addition, based on Eqn. 4.3, we can define another function G(·|·) to synthesize new spoof faces, by transferring the spoof traces from I_i to I_j:

Î_spoof^(i→j) = G(I_j | I_i) = (1 − P_i)(I_j + T_A^i) + P_i · I_i.   (4.8)

Note that T_P in Eqn. 4.3 has been replaced with I_i, since the spoof image I_i contains the spoof trace for the inpainting region.

Figure 4.4 The proposed PhySTD network architecture. Except for the last layer, each conv and transposed conv is concatenated with a batch normalization layer and a leaky ReLU layer. k3c64s2 indicates a kernel size of 3×3, 64 convolution channels, and a stride of 2.

Estimating {T_A, P, I_P} from an input face I is termed spoof trace disentanglement. Given that no ground truth of the traces is available, this disentanglement can be achieved via generative adversarial training. As shown in Fig. 4.3, the proposed Physics-guided Spoof Trace Disentanglement (PhySTD) consists of a generator and discriminators. Given an input image, the generator is designed to predict the spoofness (represented by the pseudo-depth map) as well as estimate the additive traces {B, C, T} and the inpainting components {P, I_P}. With the traces, we can apply the function G(·) to reconstruct the live counterpart and the function G(·|·) to synthesize new spoof faces. We adopt a set of discriminators at multiple image resolutions to distinguish the real faces {I_live, I_spoof} from the synthetic faces {Î_live, Î_spoof}. To remedy the semantic ambiguity during {B, C, T} learning, three trace component combinations, {B}, {B, C}, and {B, C, T}, each contribute to the synthesis of a live reconstruction at one particular resolution, which is then supervised by a respective discriminator (details in Sec. 4.3.3). To learn a proper inpainting region P, we leverage both prior knowledge and the information from the additive traces.

Figure 4.5 The visualization of image decomposition for different input faces: (a) live face; (b) 3D mask attack; (c) replay attack; (d) print attack.

In the rest of this section, we present the details of the generator, the discriminators, the details of face reconstruction and synthesis, and the losses and training steps used in PhySTD.
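As a sanity check of Eqns. 4.3 and 4.5, the NumPy sketch below (illustrative variable names, binary mask assumed) synthesizes a spoof via the SGF and then reconstructs the live face. When the estimated inpainting content I_P matches the true live content inside P, the reconstruction is exact, since (1−P)² = (1−P) and (1−P)·P = 0 for a binary mask, which is exactly the cancellation argued below Eqn. 4.4.

```python
import numpy as np

def sgf_spoof(I_live, T_A, T_P, P):
    """Eqn. 4.3: additive process outside P, content replacement inside P."""
    return (1 - P) * (I_live + T_A) + P * T_P

def reconstruct_live(I_spoof, T_A, P, I_P):
    """Eqn. 4.5: remove additive traces outside P, fill P with estimated I_P."""
    return (1 - P) * (I_spoof - T_A) + P * I_P

# Round trip with a binary inpainting mask.
rng = np.random.default_rng(1)
I_live = rng.random((64, 64, 3))
T_A = 0.1 * rng.random((64, 64, 3))       # additive trace
T_P = rng.random((64, 64, 3))             # replacement content
P = np.zeros((64, 64, 1)); P[20:40, 20:40] = 1.0
I_spoof = sgf_spoof(I_live, T_A, T_P, P)
recon = reconstruct_live(I_spoof, T_A, P, I_P=I_live)  # oracle I_P
```

Here `recon` equals `I_live` exactly; in the model, of course, I_P is predicted rather than given, so the reconstruction quality inside P depends on the generator.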
4.3.2 Disentanglement Generator

As shown in Fig. 4.4, the disentanglement generator consists of a backbone encoder, a spoof trace decoder, and a depth estimation network. The backbone encoder aims to extract multi-scale features, the depth estimation network leverages the features to estimate the facial depth map, and the spoof trace decoder estimates the additive trace components {B, C, T} and the inpainting components {P, I_P}. The depth map and the spoof traces will be used to compute the spoofness score.

Backbone encoder The backbone encoder extracts features from the input images for both depth map estimation and spoof trace disentanglement. As shown in our preliminary work Liu et al. (2020), the spoof traces consist of components from different frequency bands: low-frequency traces include color distortion, mid-frequency traces include makeup strokes, and high-frequency traces include moiré patterns and mask edges. However, a vanilla CNN model might overlook high-frequency traces, since the energy of high-frequency traces is often much weaker than that of low-frequency traces. In order to encourage the network to equally regard traces with different physical properties, we explicitly decompose the image into three elements {I_B, I_C, I_T} as:

I_B = ⌊I⌋_n1,  I_C = ⌊I⌋_n2 − ⌊I⌋_n1,  I_T = I − ⌊I⌋_n2,   (4.9)

where n1 = 32, n2 = 128, and the image size N = 256. In addition, we amplify the values in I_C, I_T by two constants, 15 and 25, and then feed the concatenation of the three elements to the backbone network. Fig. 4.5 provides the visualization of the image decomposition. We observe that the traces that are less distinct in the original images become more highlighted in the I_T component: 3D mask and replay attacks bring unique patterns different from the live face pattern, while print attacks lack the necessary high-frequency details. Semantically, I_B, I_C, I_T share the same frequency domains with B, C, T respectively, and thus the decomposition potentially eases the learning of B, C, T. After that, the encoder progressively downsamples the decomposed image components 3 times to obtain features F_1 ∈ R^(128×128×64), F_2 ∈ R^(64×64×96), F_3 ∈ R^(32×32×128) via conv layers.
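The decomposition of Eqn. 4.9 is telescoping, so the three elements sum back to the input exactly regardless of how the low band-pass ⌊·⌋_n is realized. A NumPy sketch, with block-average down/upsampling as an assumed stand-in for the resizing used in practice (the amplification constants 15 and 25 are applied only at the network input, so they are omitted here):

```python
import numpy as np

def lowpass(img, n):
    """⌊img⌋_n via block-average downsampling and nearest-neighbor upsampling."""
    N = img.shape[0]
    b = N // n
    small = img.reshape(n, b, n, b, -1).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, b, axis=0), b, axis=1)

def decompose(I, n1=32, n2=128):
    """Eqn. 4.9: split I into low- (I_B), mid- (I_C), high-frequency (I_T) parts."""
    low = lowpass(I, n1)
    mid = lowpass(I, n2)
    return low, mid - low, I - mid   # I_B, I_C, I_T
```

Because I_B + I_C + I_T = ⌊I⌋_n1 + (⌊I⌋_n2 − ⌊I⌋_n1) + (I − ⌊I⌋_n2) = I, the decomposition loses no information; it only redistributes it across bands.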
Spoof trace decoder The decoder upsamples the feature F_3 with transposed conv layers back to the input face size of 256. The last layer outputs both the additive traces {B, C, T} and the inpainting components {P, I_P}. Similar to U-Net Ronneberger et al. (2015), we apply short-cut connections between the backbone encoder and the decoder to bypass the multi-scale details for a high-quality trace estimation.

Depth estimation network We still recognize the importance of the discriminative supervision used in auxiliary FAS, and thus introduce a depth estimation network to perform the pseudo-depth estimation for face anti-spoofing as proposed in Liu et al. (2018c). The depth estimation network takes as input the concatenated features F_1, F_2, F_3 from the backbone encoder and U_3 from the decoder. The features are put through a spatial attention mechanism from Yu et al. (2020b) and resized to the same size of K = 32. It outputs a face depth map M ∈ R^(32×32), where the depth values are normalized within [0, 1]. Regarding the number of parameters, both the spoof trace decoder and the depth estimation network are lightweight, while the backbone network is much heavier. With more network layers being shared to tackle both depth estimation and spoof trace disentanglement, the knowledge learnt from spoof trace disentanglement can be better shared with the depth estimation task, which can lead to a better performance.

Final scoring In the testing phase, we use the norm of the depth map and the intensity of the spoof traces for the real vs. spoof classification:

score = (1 / 2K²) ‖M‖_1 + (α_0 / 2N²) (‖B‖_1 + ‖C‖_1 + ‖T‖_1 + ‖P‖_1),   (4.10)

where α_0 is the weight for the spoof traces.

4.3.3 Reconstruction and Synthesis

There are multiple options to use the disentangled spoof traces: 1) live reconstruction, 2) spoof synthesis, and 3) "harder" sample synthesis, which will be described below respectively.
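Before turning to these, note that the scoring rule of Eqn. 4.10 above reduces to a few lines. A NumPy sketch (the weight α_0 is a free parameter, set from the training or validation set as described in Sec. 4.4.1; a live face with a near-zero depth map would score near zero, which is backwards — per Eqn. 4.16 the live ground-truth depth is face-like and the spoof depth is zero, so higher scores here indicate live under that convention; this sketch only evaluates the formula as written):

```python
import numpy as np

def spoof_score(M, B, C, T, P, alpha0=1.0, K=32, N=256):
    """Eqn. 4.10: depth-map L1 norm plus weighted L1 intensity of the traces."""
    depth_term = np.abs(M).sum() / (2 * K**2)
    trace_term = alpha0 * (np.abs(B).sum() + np.abs(C).sum()
                           + np.abs(T).sum() + np.abs(P).sum()) / (2 * N**2)
    return depth_term + trace_term
```

With an all-zero depth map and zero traces the score is exactly 0, and an all-ones depth map alone contributes 1/2, so the two terms are on comparable scales before α_0 is applied.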
Live reconstruction: Based on Eqn. 4.5, we propose a hierarchical reconstruction of the live face counterpart from the input images. To reconstruct faces at a certain resolution, each additive trace is included only if its frequency domain is lower than the target resolution. We apply three resolution settings {hi, mid, low} as:

Î_hi = (1 − P)(I − ⌊B⌋_n1 − ⌊C⌋_n2 − T) + P · I_P,
Î_mid = (1 − P)(⌊I⌋_n2 − ⌊B⌋_n1 − ⌊C⌋_n2) + P · I_P,
Î_low = (1 − P)(⌊I⌋_n1 − ⌊B⌋_n1) + P · I_P.   (4.11)

Spoof synthesis: Based on Eqn. 4.8, we can obtain a new spoof face by applying the spoof traces disentangled from a spoof face I_i to a live face I_j. However, spoof traces may contain face-dependent content associated with the original spoof subject. Directly applying them to a new face with different shapes or poses may result in misalignment and strong visual implausibility. Therefore, the spoof traces should go through a geometry correction before performing this synthesis. We propose an online 3D warping layer and will introduce it in the following subsection.

"Harder" sample synthesis: The disentangled spoof traces can not only reconstruct the live and synthesize new spoof, but also synthesize "harder" spoof samples by removing or amplifying part of the spoof traces. We can tune one or some of the trace elements {B, C, T, P} to make the spoof sample become "less spoofed", which is thus closer to a live face since the spoof traces are weakened. Such spoof data can be regarded as harder samples and may benefit the generalization of the disentanglement generator. For instance, when removing the low-frequency element B from a replay spoof trace, the generator may be forced to rely on other elements such as high-level texture patterns. To synthesize the "harder" sample Î_hard, we follow Eqn. 4.8 with two minor changes: 1) generate 3 random weights between [0, 1] and multiply each with one component of {B, C, T}; 2) randomly remove the inpainting process (i.e., set P = 0) with a probability of 0.5. Compared with other methods, such as brightness and contrast change Liu et al. (2019b), blurriness effect Yang et al. (2019a), or 3D distortion Guo et al. (2019), our approach can introduce more realistic and effective data samples, as shown in Sec. 4.4.

Figure 4.6 The online 3D warping layer. (a) Given the corresponding dense offset, we warp the spoof trace and add it to the target live face to create a new spoof. E.g., pixel (x, y) with offset (3, 5) is warped to pixel (x+3, y+5) in the new image. (b) To obtain a dense offset from the sparse offsets of the selected face shape vertices, Delaunay triangulation interpolation is adopted.

4.3.3.1 Online 3D Warping Layer

We propose an online 3D warping layer to correct the shape discrepancy. To obtain the warping, previous methods in Chang et al. (2018); Liu et al. (2018c) use face swapping and a pre-computed dense offset, respectively, where both methods are non-differentiable as well as memory intensive. In contrast, our warping layer is designed to be both differentiable and computationally efficient, which is necessary for online synthesis during the training.

First, the live reconstruction of a spoof face I_i can be expressed as:

G_i = G(I_i)[p_0],   (4.12)

where p_0 = {(0, 0), (0, 1), ..., (255, 255)} ∈ R^(256×256×2) enumerates the pixel locations in I_i. To align the spoof traces while synthesizing a new spoof face, a dense offset Δp_(i→j) ∈ R^(256×256×2) is required to indicate the deformation between face I_i and face I_j. A discrete deformation can be acquired from the distances of the corresponding facial landmarks between the two faces. During the data preparation, we use Liu et al. (2017) to fit a 3DMM model and extract the 2D locations of Q facial vertices for each face:

s = {(x_0, y_0), (x_1, y_1), ..., (x_Q, y_Q)} ∈ R^(Q×2).   (4.13)

A sparse offset on the corresponding vertices can then be computed between the two faces as Δs_(i→j) = s_j − s_i. To convert the sparse offset Δs_(i→j) to the dense offset Δp_(i→j), we apply a triangulation interpolation:

Δp_(i→j) = Tri(p_0; s_i, Δs_(i→j)),   (4.14)

where Tri(·) is the interpolation, s_i denotes the vertex locations, Δs_(i→j) are the vertex values, and we adopt Delaunay triangulation. The warping operation can be denoted as:

G_(i→j) = G(I_j | I_i)[p_0 + Δp_(i→j)],   (4.15)

where the offset Δp_(i→j) applies to all subject-i related elements {T_A^i, I_i, P_i}. Since the offset Δp_(i→j) is typically composed of fractional numbers, we implement bilinear interpolation to sample the fractional pixel locations. We select Q = 140 vertices to cover the face region so that they can represent non-rigid deformation due to pose and expression. As the pixel values in the warped face are a linear combination of the pixel values of the triangulation vertices, this entire process is differentiable. This process is illustrated in Fig. 4.6.

Algorithm 1 PhySTD Training Iteration.
Input: live faces I_live and facial landmarks s_live, spoof faces I_spoof and facial landmarks s_spoof, ground truth depth map M_0, preliminary mask P_0;
Output: reconstructed live Î_live, synthesized spoof Î_spoof, spoof traces {T_A^l, P^l, I_P^l; T_A^s, P^s, I_P^s}, depth maps {M^l, M^s};
while iteration < max_iteration do
  // training step 1
  1: compute T_A^l, P^l, I_P^l ← G(I_live) and compute T_A^s, P^s, I_P^s ← G(I_spoof);
  2: estimate the depth maps M^l, M^s;
  3: compute losses L_depth, L_P, L_R;
  // training step 2
  4: compute Î_low, Î_mid, Î_hi from T_A^s, P^s, I_P^s and I_spoof (Eqn. 4.4);
  5: compute the warping offset Δp_(s→l) from s_live, s_spoof (Eqn. 4.14);
  6: compute Î_spoof from the warped T_A^(s→l), P^(s→l) and I_live (Eqn. 4.15);
  7: send I_live, I_spoof, Î_low, Î_mid, Î_hi, Î_spoof to the discriminators;
  8: compute the adversarial loss for the generator L_G and for the discriminators L_D;
  // training step 3
  9: create harder samples Î_hard from T_A^(s→l), P^(s→l) and I_live with random perturbation on the traces;
  10: compute T_A^h, P^h, I_P^h ← G(Î_hard);
  11: compute the depth map M^h for Î_hard;
  12: compute losses L_S, L_H;
  // back-propagation
  13: back-propagate the losses from steps 3, 8, 12 to the corresponding parts and update the network;
end

4.3.4 Multi-scale Discriminators

Motivated by Wang et al. (2018b), we adopt multiple discriminators at different resolutions (e.g., 32, 96, and 256) in our GAN architecture. We follow the design of PatchGAN Isola et al. (2017), which essentially is a fully convolutional network. Fully convolutional networks are shown to be effective not only to synthesize high-quality images Isola et al. (2017); Wang et al. (2018b), but also to tackle face anti-spoofing problems Liu et al. (2018c). For each discriminator, we adopt the same structure but do not share the weights.

As shown in Fig. 4.4, we use in total 4 discriminators in our work: D_1, working at the lowest resolution of 32, focuses on low-frequency elements, since the higher-frequency traces are erased by downsampling; D_2, working at the resolution of 96, focuses on middle-level content patterns; D_3 and D_4, working at the highest resolution of 256, focus on the texture details. Our preliminary version resizes the real and synthetic samples {I, Î} to different resolutions and assigns them to each discriminator. To remove semantic ambiguity and provide correspondence to the trace components, we instead assign the hierarchical reconstructions from Eqn. 4.11 to the discriminators: we send the low-frequency pair {I_live, Î_low} to D_1, the middle-frequency pair {I_live, Î_mid} to D_2, the high-frequency pair {I_live, Î_hi} to D_3, and the real/synthetic spoof {I_spoof, Î_spoof} to D_4. Each discriminator outputs a 1-channel map in the range of [0, 1], where 0 denotes fake and 1 denotes real.

4.3.5 Loss Functions and Training Steps

We utilize multiple loss functions to supervise the learning of the depth maps and spoof traces. Each training iteration consists of three training steps. We first introduce the loss functions, followed by how they are used in the training steps.
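Steps 5–6 of Alg. 1 rely on the 3D warping layer of Sec. 4.3.3.1. Its bilinear sampling at fractional locations p_0 + Δp can be sketched in NumPy as below; this is a non-batched CPU illustration (the actual layer runs differentiably inside the network, and the dense offset would come from the Delaunay interpolation of Eqn. 4.14, which is omitted here).

```python
import numpy as np

def bilinear_warp(img, offset):
    """Sample img (H×W×C) at fractional locations (y+dy, x+dx).

    offset: H×W×2 array of (dy, dx) per pixel; out-of-range samples are
    clamped to the image border for simplicity."""
    H, W, _ = img.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    fy = np.clip(ys + offset[..., 0], 0, H - 1)
    fx = np.clip(xs + offset[..., 1], 0, W - 1)
    y0 = np.floor(fy).astype(int); x0 = np.floor(fx).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = (fy - y0)[..., None]; wx = (fx - x0)[..., None]
    # Weighted average of the 4 neighboring pixels.
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

A zero offset returns the image unchanged, and an integer offset reduces to a pure shift; because the output is a linear combination of input pixels, the operation is differentiable with respect to both the image and the offset, which is the property the text relies on.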
Depth map loss: We follow the auxiliary FAS in Liu et al. (2018c) to estimate an auxiliary depth map M, where the depth ground truth M_0 for a live face contains a face-like shape and the depth for spoof should be zero. We apply the L-1 norm on this loss as:

L_depth = (1/K²) E_(i∼L∪S) [ ‖M_i − M_i^0‖_F ],   (4.16)

where K = 32 is the size of M. We apply the dense face alignment Liu et al. (2017) to estimate the 3D shape and render the depth ground truth M_0.

Adversarial loss for G: We employ the LSGAN Mao et al. (2017) on the reconstructed live faces and synthesized spoof faces. It encourages the reconstructed live to look similar to real live from domain L, and the synthesized spoof faces to look similar to faces from domain S:

L_G = E_(i∼L, j∼S) [ ‖D_1(Î_low^j) − 1‖_F² + ‖D_2(Î_mid^j) − 1‖_F² + ‖D_3(Î_hi^j) − 1‖_F² + ‖D_4(Î_spoof^(j→i)) − 1‖_F² ].   (4.17)

Adversarial loss for D: The adversarial loss for the discriminators encourages D(·) to distinguish between real live vs. reconstructed live, and real spoof vs. synthesized spoof:

L_D = E_(i∼L, j∼S) [ ‖D_1(I_i) − 1‖_F² + ‖D_2(I_i) − 1‖_F² + ‖D_3(I_i) − 1‖_F² + ‖D_4(I_j) − 1‖_F² + ‖D_1(Î_low^j)‖_F² + ‖D_2(Î_mid^j)‖_F² + ‖D_3(Î_hi^j)‖_F² + ‖D_4(Î_spoof^(j→i))‖_F² ].   (4.18)

Figure 4.7 Preliminary mask P_0 for the negative term in the inpainting mask loss. White pixels denote 1 and black pixels denote 0. White indicates the area that should not be inpainted. P_0 for: (a) print, replay; (b) 3D mask and makeup; (c) partial attacks that cover the eye portion; (d) partial attacks that cover the mouth portion.

Inpainting mask loss: The ground truth inpainting region for all spoof attacks is barely possible to obtain, hence a fully supervised training as used in Tran et al. (2017a) for the inpainting mask is out of the question. However, we may still leverage the prior knowledge of spoof attacks to facilitate the estimation of the inpainting masks. The inpainting mask loss consists of a positive term and a negative term. First, the positive term encourages certain regions to be inpainted. As the goal of the inpainting process is to allow certain regions to change without intensity constraint, the region with larger magnitude of additive traces should have a higher probability to be inpainted. Hence, the positive term adopts an L-2 norm between the inpainting region P and the region where the additive trace is larger than a threshold β.

Second, the negative term discourages certain regions from being inpainted. While the ground truth inpainting mask is unknown, it is straightforward to mark a large portion of the region that should not be inpainted. For instance, the inpainting region for funny eyeglasses should not appear in the lower part of a face. Hence, we provide a preliminary mask P_0 to indicate the not-to-be-inpainted region, and adopt a normalized L-2 norm on the masked inpainting region P ⊙ P_0 as the negative term. The preliminary mask P_0 is illustrated in Fig. 4.7. Overall, the inpainting mask loss is formed as:

L_P = E_(i∼S) [ ‖P_i − 𝟙(T_A^i > β)‖_F² + ‖P_i ⊙ P_i^0‖_F² / ‖P_i^0‖_F² ].   (4.19)

Trace regularization: Based on Eqn. 4.6 with λ = 1, we regularize the intensity of the additive traces {B, C, T} and the inpainting region P. The regularizer loss is denoted as:

L_R = E_(i∼L∪S) [ ‖B‖_F² + ‖C‖_F² + ‖T‖_F² + ‖P‖_F² ].   (4.20)

Synthesized spoof loss: The synthesized spoof data come with ground truth spoof traces. As a result, we are able to define a supervised pixel loss for the generator to disentangle the exact spoof traces that were added:

L_S = E_(i∼L, j∼S) [ ‖G(⌈G_(j→i)⌉) − ⌈G_(j→i)⌉‖_1 ],   (4.21)

where G_(j→i) is the overall effect of {P_j, I_P^j, B_j, C_j, T_j} after warping to subject i, and ⌈·⌉ is the stop_gradient operation. Without stopping the gradient, G_(j→i) may collapse to 0.
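The inpainting mask loss of Eqn. 4.19 can be sketched as below. This is an illustrative NumPy rendering under stated assumptions: the indicator region 𝟙(T_A > β) is taken over the per-pixel channel-maximum magnitude of the additive trace (the exact channel reduction is a detail of the implementation, not spelled out in the text), and the small epsilon guards against an empty prior mask.

```python
import numpy as np

def inpaint_mask_loss(P, T_A, P0, beta=0.1):
    """Eqn. 4.19: positive term pulls P toward regions of strong additive
    traces; negative term penalizes inpainting inside the prior mask P0
    (P0 = 1 marks regions that should NOT be inpainted)."""
    # Indicator of where the additive trace magnitude exceeds beta.
    target = (np.abs(T_A).max(axis=-1, keepdims=True) > beta).astype(float)
    positive = ((P - target) ** 2).sum()
    negative = ((P * P0) ** 2).sum() / ((P0 ** 2).sum() + 1e-8)
    return positive + negative
```

When the predicted mask matches the trace-magnitude indicator and never overlaps P_0, the loss is zero, which is the behavior the two terms are designed to encourage.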
Depth map loss for "harder" samples: We send the "harder" synthesized spoof data to the depth estimation network to improve the data diversity, and hope to increase the FAS model's generalization:

L_H = (1/K²) E_(i∼Ŝ) [ ‖M_i − M_i^0‖_F ],   (4.22)

where Ŝ denotes the domain of synthesized spoof faces.

Training steps and total loss: Each training iteration has 3 training steps. In training step 1, live faces I_live and spoof faces I_spoof are fed into the generator G(·) to disentangle the spoof traces. The spoof traces are used to reconstruct the live counterpart Î_live and synthesize new spoof Î_spoof. The generator is updated with respect to the depth map loss L_depth, adversarial loss L_G, inpainting mask loss L_P, and regularizer loss L_R:

L = λ_1 L_depth + λ_2 L_G + λ_3 L_P + λ_4 L_R.   (4.23)

In training step 2, the discriminators are supervised with the adversarial loss L_D to compete with the generator. In training step 3, I_live and Î_hard are fed into the generator with the ground truth labels and traces to minimize the synthesized spoof loss L_S and the depth map loss L_H:

L = λ_5 L_S + λ_6 L_H,   (4.24)

where λ_1–λ_6 are the weights to balance the multitask training. Note that we send the original live faces I_live together with Î_hard for a balanced mini-batch, which is important when computing the moving average in the batch normalization layers. We execute all 3 steps in each mini-batch iteration, but reduce the learning rate for the discriminator step by half. The whole training process is depicted in Alg. 1.
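Step 9 of Alg. 1 perturbs the disentangled traces to create the "harder" samples. A minimal sketch following the two rules of Sec. 4.3.3 (three random weights in [0, 1] applied per component, and the inpainting process dropped with probability 0.5); how the perturbed components are then recombined via Eqn. 4.8 is omitted here:

```python
import numpy as np

def perturb_traces(B, C, T, P, rng):
    """Weaken disentangled traces to synthesize a 'harder' (less spoofed) sample.

    Rule 1: scale each of {B, C, T} by an independent weight in [0, 1].
    Rule 2: with probability 0.5, remove the inpainting process (P = 0)."""
    wB, wC, wT = rng.random(3)
    keep_inpaint = rng.random() >= 0.5
    P_out = P if keep_inpaint else np.zeros_like(P)
    return wB * B, wC * C, wT * T, P_out
```

Because every weight is at most 1, each perturbed trace is never stronger than the original, so the synthesized sample drifts toward the live domain, which is what makes it a harder training case for the generator.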
Protocol  Method                         APCER(%)    BPCER(%)    ACER(%)
1         STASN (Yang et al. (2019a))    1.2         2.5         1.9
          Auxiliary (Liu et al. (2018c)) 1.6         1.6         1.6
          DeSpoof (Jourabloo et al. (2018)) 1.2      1.7         1.5
          DRL (Zhang et al. (2020a))     1.7         0.8         1.3
          STDN (Liu et al. (2020))       0.8         1.3         1.1
          CDCN (Yu et al. (2020b))       0.4         1.7         1.0
          HMP (Yu et al. (2020a))        0.0         1.6         0.8
          CDCN++ (Yu et al. (2020b))     0.4         *0.0*       *0.2*
          Ours                           *0.0*       0.8         0.4
2         DeSpoof (Jourabloo et al. (2018)) 4.2      4.4         4.3
          Auxiliary (Liu et al. (2018c)) 2.7         2.7         2.7
          DRL (Zhang et al. (2020a))     1.1         3.6         2.4
          STASN (Yang et al. (2019a))    4.2         *0.3*       2.2
          STDN (Liu et al. (2020))       2.3         1.6         1.9
          HMP (Yu et al. (2020a))        2.6         0.8         1.7
          CDCN (Yu et al. (2020b))       0.4         1.7         1.5
          CDCN++ (Yu et al. (2020b))     1.8         0.8         1.3
          Ours                           *1.2*       1.3         *1.3*
3         DeSpoof (Jourabloo et al. (2018)) 4.0±1.8  3.8±1.2     3.6±1.6
          Auxiliary (Liu et al. (2018c)) 2.7±1.3     3.1±1.7     2.9±1.5
          STDN (Liu et al. (2020))       1.6±1.6     4.0±5.4     2.8±3.3
          STASN (Yang et al. (2019a))    4.7±3.9     *0.9±1.2*   2.8±1.6
          HMP (Yu et al. (2020a))        2.8±2.4     2.3±2.8     2.5±1.1
          CDCN (Yu et al. (2020b))       2.4±1.3     2.2±2.0     2.3±1.4
          DRL (Zhang et al. (2020a))     2.8±2.2     1.7±2.6     2.2±2.2
          CDCN++ (Yu et al. (2020b))     1.7±1.5     2.0±1.2     *1.8±0.7*
          Ours                           *1.7±1.4*   2.2±3.5     1.9±2.3
4         Auxiliary (Liu et al. (2018c)) 9.3±5.6     10.4±6.0    9.5±6.0
          STASN (Yang et al. (2019a))    6.7±10.6    8.3±8.4     7.5±4.7
          CDCN (Yu et al. (2020b))       4.6±4.6     9.2±8.0     6.9±2.9
          DeSpoof (Jourabloo et al. (2018)) 5.1±6.3  6.1±5.1     5.6±5.7
          HMP (Yu et al. (2020a))        2.9±4.0     7.5±6.9     5.2±3.7
          CDCN++ (Yu et al. (2020b))     4.2±3.4     5.8±4.9     5.0±2.9
          DRL (Zhang et al. (2020a))     5.4±2.9     *3.3±6.0*   4.8±6.4
          STDN (Liu et al. (2020))       2.3±3.6     5.2±5.4     3.8±4.2
          Ours                           *2.3±3.6*   4.2±5.4     *3.6±4.2*

Table 4.1 The evaluation on four protocols in OULU-NPU. Values marked with * (bold in the original) indicate the best score in each protocol.
4.4 Experiments

In this section, we introduce the experimental setup, and then present the results in the known, unknown, and open-set spoof scenarios, with comparisons to respective baselines. Next, we quantitatively evaluate the spoof traces by performing a spoof medium classification, and conduct an ablation study on each design in the proposed method. Finally, we provide visualization results on the spoof trace disentanglement, new spoof synthesis, and t-SNE visualization.

4.4.1 Experimental Setup

Databases We conduct experiments on three major databases: Oulu-NPU Boulkenafet et al. (2017b), SiW Liu et al. (2018c), and SiW-M Liu et al. (2019a). Oulu-NPU and SiW include print/replay attacks, while SiW-M includes 13 spoof types. We follow all the existing testing protocols and compare with SOTA methods. Similar to most prior works, we only use the face region for training and testing.

Evaluation metrics Two common metrics are used in this work for comparison: EER and APCER/BPCER/ACER. EER describes the theoretical performance and predetermines the threshold for making decisions. APCER/BPCER/ACER in ISO/IEC-JTC-1/SC-37 (2016) describe the practical performance given a predetermined threshold. For both evaluation metrics, a lower value means better performance. The threshold for APCER/BPCER/ACER is computed from either the training set or the validation set. In addition, we also report the True Detection Rate (TDR) at a given False Detection Rate (FDR). This metric describes the spoof detection rate at a strict tolerance to live errors, which is widely used to evaluate real-world systems IARPA (2016). In this work, we report TDR at FDR = 0.5%. For TDR, the higher the better.

Parameter setting PhySTD is implemented in TensorFlow with an initial learning rate of 5e-5. We train in total 150,000 iterations with a batch size of 8, and decrease the learning rate by a ratio of 10 every 45,000 iterations. We initialize the weights with a [0, 0.02] normal distribution. {λ_1, λ_2, λ_3, λ_4, λ_5, λ_6} are set to be {100, 5, 1, 1e-4, 10, 1}, and β = 0.1. α_0 is empirically determined from the training or validation set. We use the open-source face alignment Bulat & Tzimiropoulos (2017) and 3DMM fitting Liu et al. (2017) to crop the face and provide the 140 landmarks.
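Under the ISO convention referenced above (APCER: rate of spoof/attack presentations wrongly accepted as live; BPCER: rate of live presentations wrongly rejected; ACER: their mean), the threshold-based metrics reduce to a few lines. A sketch in NumPy; the score polarity (higher = more spoof-like, matching Eqn. 4.10's spoofness score) is an assumption of this illustration:

```python
import numpy as np

def acer_metrics(live_scores, spoof_scores, threshold):
    """Classify score >= threshold as spoof. Returns (APCER, BPCER, ACER) in %."""
    live = np.asarray(live_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    apcer = 100.0 * np.mean(spoof < threshold)   # attacks accepted as live
    bpcer = 100.0 * np.mean(live >= threshold)   # live rejected as spoof
    return apcer, bpcer, (apcer + bpcer) / 2.0
```

The EER is then the error rate at the threshold where APCER equals BPCER, and TDR@FDR=0.5% is the fraction of spoof samples detected at the threshold that misclassifies only 0.5% of live samples; both can be read off a sweep of `threshold` over the sorted scores.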
Protocol 1:
  Method                            APCER(%)  BPCER(%)  ACER(%)
  Auxiliary (Liu et al. (2018c))    3.6       3.6       3.6
  STASN (Yang et al. (2019a))       -         -         1.0
  Meta-FAS-DR (Zhao et al. (2019))  0.5       0.5       0.5
  HMP (Yu et al. (2020a))           0.6       0.2       0.5
  DRL (Zhang et al. (2020a))        0.1       0.5       0.3
  CDCN (Yu et al. (2020b))          0.1       0.2       0.1
  CDCN++ (Yu et al. (2020b))        0.1       0.2       0.1
  Ours                              0.0       0.0       0.0

Protocol 2 (mean±std):
  Auxiliary (Liu et al. (2018c))    0.6±0.7   0.6±0.7   0.6±0.7
  Meta-FAS-DR (Zhao et al. (2019))  0.3±0.3   0.3±0.3   0.3±0.3
  STASN (Yang et al. (2019a))       -         -         0.3±0.1
  HMP (Yu et al. (2020a))           0.1±0.2   0.2±0.0   0.1±0.1
  DRL (Zhang et al. (2020a))        0.1±0.2   0.1±0.1   0.1±0.0
  CDCN (Yu et al. (2020b))          0.0±0.0   0.1±0.1   0.1±0.0
  CDCN++ (Yu et al. (2020b))        0.0±0.0   0.1±0.1   0.0±0.1
  Ours                              0.0±0.0   0.0±0.0   0.0±0.0

Protocol 3 (mean±std):
  STASN (Yang et al. (2019a))       -         -         12.1±1.5
  Auxiliary (Liu et al. (2018c))    8.3±3.8   8.3±3.8   8.3±3.8
  Meta-FAS-DR (Zhao et al. (2019))  8.0±5.0   7.4±5.7   7.7±5.3
  DRL (Zhang et al. (2020a))        9.4±6.1   1.8±2.6   5.6±4.4
  HMP (Yu et al. (2020a))           2.6±0.9   2.3±0.5   2.5±0.7
  CDCN (Yu et al. (2020b))          2.4±1.3   2.2±2.0   2.3±1.4
  CDCN++ (Yu et al. (2020b))        1.7±1.5   2.0±1.2   1.8±0.7
  Ours                              13.1±9.4  1.6±0.6   7.4±4.3

Table 4.2 The evaluation on three protocols in the SiW dataset. We compare with the top 7 performances.

4.4.2 Anti-Spoofing for Known Spoof Types

Oulu-NPU  Oulu-NPU Boulkenafet et al. (2017b) is a commonly used face anti-spoofing benchmark due to its high-quality data and challenging testing protocols. Tab. 4.1 shows our performance on Oulu-NPU, compared with SOTA algorithms. Our method achieves the best overall performance on this database. Compared with our preliminary version Liu et al. (2020), we demonstrate improvements in all 4 protocols, with significant improvement on protocol 1 and protocol 3, i.e., reducing the ACER by 63.6% and 32.1% respectively. Compared with the SOTA, our approach achieves similar best performances on the first three protocols and outperforms the SOTA on the fourth protocol, which is the most challenging one. Note that, in protocol 3 and protocol 4, the performances on testing camera 6 are much lower than those on cameras 1-5: the ACERs for camera 6 are 6.4% and 10.2%, while the average ACERs for the other cameras are 1.0% and 2.0% respectively. Compared with the other cameras, we notice that camera 6 has stronger sensor noises, and our model recognizes them as unknown spoof traces, which leads to an increased false negative rate (i.e., BPCER). How to separate sensor noises from spoof traces can be an important future research topic.

SiW  SiW Liu et al. (2018c) is another recent high-quality database. It includes fewer capture cameras but more spoof mediums and environment variations, such as pose, illumination, and expression. The comparison on three protocols is shown in Tab. 4.2. We outperform the previous works on the first two protocols and rank in the middle on protocol 3. Protocol 3 aims to test the performance of unknown spoof detection, where the model is trained on one spoof attack (print or replay) and tested on the other. As we can see from Fig. 4.8-4.9, the traces of print and replay are different: the replay traces lie more in the high-frequency part (i.e., trace component T) and the print traces lie more in the low-frequency part (i.e., trace component S). This pattern divergence leads to the adaptation gap of our method when training on one attack and testing on the other.

SiW-M  SiW-M Liu et al. (2019a) contains a large diversity of spoof types, including print, replay, 3D mask, makeup, and partial attacks. This allows us to have a comprehensive evaluation of the proposed approach with different spoof attacks. To use SiW-M for known spoof detection, we randomly split the data of all types into train/test sets with a ratio of 60% vs. 40%, and the results are shown in Tab. 4.3. Compared to the preliminary version Liu et al. (2020), our method outperforms on most spoof types as well as on the overall EER performance by 47.9% relatively, which demonstrates the superiority of our method on known spoof attacks.

Columns: Replay | Print | 3D Mask: Half / Silic. / Trans. / Paper / Mann. | Makeup: Ob. / Im. / Cos. | Partial: Funny. / Papergls. / Paper | Overall

ACER (%):
  Auxiliary (Liu et al. (2018c)):      5.1, 5.0, 5.0, 10.2, 5.0, 9.8, 6.3, 19.6, 5.0, 26.5, 5.5, 5.2, 5.0 | 6.3
  SDTN (Liu et al. (2020)):            3.2, 3.1, 3.0, 9.0, 3.0, 3.4, 4.7, 3.0, 3.0, 24.5, 4.1, 3.7, 3.0 | 4.1
  Step 1:                              6.1, 5.4, 5.4, 5.4, 5.4, 5.4, 5.4, 22.7, 5.4, 26.8, 5.4, 5.5, 5.4 | 10.9
  Step 1 + Step 2 w/ single trace:     8.7, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 25.0, 7.9, 28.8, 7.8, 7.8, 7.8 | 13.8
  Step 1 + Step 2:                     4.1, 3.9, 3.9, 3.9, 4.0, 3.9, 4.0, 13.5, 4.0, 25.1, 3.9, 3.9, 3.9 | 4.6
  Step 1 + Step 2 + Step 3 (Ours):     3.2, 1.4, 1.0, 2.3, 1.3, 2.9, 2.5, 12.4, 1.2, 18.5, 1.7, 0.4, 1.6 | 2.8

EER (%):
  Auxiliary (Liu et al. (2018c)):      4.7, 0.0, 1.6, 10.5, 4.6, 10.0, 6.4, 12.7, 0.0, 19.6, 9.3, 7.5, 0.0 | 6.7
  SDTN (Liu et al. (2020)):            2.1, 2.2, 0.0, 7.2, 0.1, 3.9, 4.8, 0.0, 0.0, 19.6, 5.3, 5.4, 0.0 | 4.8
  Step 1:                              3.8, 2.7, 1.5, 2.7, 1.9, 1.8, 2.4, 15.1, 0.7, 28.7, 4.1, 4.9, 1.0 | 4.3
  Step 1 + Step 2 w/ single trace:     6.7, 5.3, 0.8, 1.5, 1.4, 3.3, 3.2, 21.5, 1.0, 27.1, 6.5, 6.1, 1.5 | 5.8
  Step 1 + Step 2:                     2.4, 3.1, 0.4, 2.6, 1.2, 3.0, 2.4, 9.5, 0.4, 23.5, 1.1, 0.5, 0.6 | 2.8
  Step 1 + Step 2 + Step 3 (Ours):     2.5, 1.0, 0.0, 2.1, 1.0, 1.9, 2.2, 8.2, 0.0, 18.5, 0.8, 0.0, 0.4 | 2.5

TPR@FNR=0.5% (%):
  SDTN (Liu et al. (2020)):            90.1, 76.1, 80.7, 71.5, 62.3, 74.4, 85.0, 100, 100, 33.8, 49.6, 30.6, 97.7 | 70.4
  Step 1:                              43.8, 43.3, 47.2, 44.5, 62.9, 54.8, 55.4, 16.7, 90.6, 31.5, 60.3, 56.7, 77.1 | 59.3
  Step 1 + Step 2 w/ single trace:     58.9, 76.8, 97.6, 94.2, 94.9, 66.3, 78.3, 13.3, 94.1, 49.1, 62.4, 58.5, 92.1 | 74.8
  Step 1 + Step 2:                     84.7, 74.7, 100, 70.1, 96.6, 77.5, 89.6, 36.9, 100, 40.1, 96.3, 99.4, 99.4 | 89.7
  Step 1 + Step 2 + Step 3 (Ours):     85.7, 85.4, 100, 76.6, 96.3, 80.2, 93.8, 41.1, 100, 55.8, 98.1, 100, 99.8 | 91.2

Table 4.3 The evaluation and ablation study on SiW-M Protocol I: known spoof detection.

Columns: Replay | Print | 3D Mask: Half / Silic. / Trans. / Paper / Mann. | Makeup: Ob. / Im. / Cos. | Partial: Fun. / Papergls. / Paper | Average (±std)

APCER (%):
  Auxiliary (Liu et al. (2018c)):        23.7, 7.3, 27.7, 18.2, 97.8, 8.3, 16.2, 100.0, 18.0, 16.3, 91.8, 72.2, 0.4 | 38.3±37.4
  LBP+SVM (Boulkenafet et al. (2017b)):  19.1, 15.4, 40.8, 20.3, 70.3, 0.0, 4.6, 96.9, 35.3, 11.3, 53.3, 58.5, 0.6 | 32.8±29.8
  DTL (Liu et al. (2019a)):              1.0, 0.0, 0.7, 24.5, 58.6, 0.5, 3.8, 73.2, 13.2, 12.4, 17.0, 17.0, 0.2 | 17.1±23.3
  CDCN (Yu et al. (2020b)):              8.2, 6.9, 8.3, 7.4, 20.5, 5.9, 5.0, 43.5, 1.6, 14.0, 24.5, 18.3, 1.2 | 12.7±11.7
  SDTN (Liu et al. (2020)):              1.6, 0.0, 0.5, 7.2, 9.7, 0.5, 0.0, 96.1, 0.0, 21.8, 14.4, 6.5, 0.0 | 12.2±26.1
  CDCN++ (Yu et al. (2020b)):            9.2, 6.0, 4.2, 7.4, 18.2, 0.0, 5.0, 39.1, 0.0, 14.0, 23.3, 14.3, 0.0 | 10.8±11.2
  HMP (Yu et al. (2020a)):               12.4, 5.2, 8.3, 9.7, 13.6, 0.0, 2.5, 30.4, 0.0, 12.0, 22.6, 15.9, 1.2 | 10.3±9.1
  Ours:                                  10.0, 4.9, 5.3, 16.7, 3.5, 2.0, 2.8, 92.8, 0.0, 37.5, 33.7, 23.2, 0.2 | 17.9±25.8

BPCER (%):
  LBP+SVM (Boulkenafet et al. (2017b)):  22.1, 21.5, 21.9, 21.4, 20.7, 23.1, 22.9, 21.7, 12.5, 22.2, 18.4, 20.0, 22.9 | 21.0±2.9
  DTL (Liu et al. (2019a)):              18.6, 11.9, 29.3, 12.8, 13.4, 8.5, 23.0, 11.5, 9.6, 16.0, 21.5, 22.6, 16.8 | 16.6±6.2
  SDTN (Liu et al. (2020)):              14.0, 14.6, 13.6, 18.6, 18.1, 8.1, 13.4, 10.3, 9.2, 17.2, 27.0, 35.5, 11.2 | 16.2±7.6
  CDCN (Yu et al. (2020b)):              9.3, 8.5, 13.9, 10.9, 21.0, 3.1, 7.0, 45.0, 2.3, 16.2, 26.4, 20.9, 5.4 | 14.6±11.7
  CDCN++ (Yu et al. (2020b)):            12.4, 8.5, 14.0, 13.2, 19.4, 7.0, 6.0, 45.0, 1.6, 14.0, 24.8, 20.9, 3.9 | 14.6±11.4
  HMP (Yu et al. (2020a)):               13.2, 6.2, 13.1, 10.8, 16.3, 3.9, 2.3, 34.1, 1.6, 13.9, 23.2, 17.1, 2.3 | 12.2±9.4
  Auxiliary (Liu et al. (2018c)):        10.1, 6.5, 10.9, 11.6, 6.2, 7.8, 9.3, 11.6, 9.3, 7.1, 6.2, 8.8, 10.3 | 8.9±2.0
  Ours:                                  3.8, 6.3, 4.4, 5.5, 11.3, 3.5, 6.0, 6.6, 1.8, 2.7, 6.5, 8.0, 1.1 | 5.7±2.8

ACER (%):
  LBP+SVM (Boulkenafet et al. (2017b)):  20.6, 18.4, 31.3, 21.4, 45.5, 11.6, 13.8, 59.3, 23.9, 16.7, 35.9, 39.2, 11.7 | 26.9±14.5
  Auxiliary (Liu et al. (2018c)):        16.8, 6.9, 19.3, 14.9, 52.1, 8.0, 12.8, 55.8, 13.7, 11.7, 49.0, 40.5, 5.3 | 23.6±18.5
  DTL (Liu et al. (2019a)):              9.8, 6.0, 15.0, 18.7, 36.0, 4.5, 13.4, 48.1, 11.4, 14.2, 19.3, 19.8, 8.5 | 16.8±11.1
  CDCN (Yu et al. (2020b)):              8.7, 7.7, 11.1, 9.1, 20.7, 4.5, 5.9, 44.2, 2.0, 15.1, 25.4, 19.6, 3.3 | 13.6±11.7
  SDTN (Liu et al. (2020)):              7.8, 7.3, 7.1, 12.9, 13.9, 4.3, 6.7, 53.2, 4.6, 19.5, 20.7, 21.0, 5.6 | 14.2±13.2
  CDCN++ (Yu et al. (2020b)):            10.8, 7.3, 9.1, 10.3, 18.8, 3.5, 5.6, 42.1, 0.8, 14.0, 24.0, 17.6, 1.9 | 12.7±11.2
  HMP (Yu et al. (2020a)):               12.8, 5.7, 10.7, 10.3, 14.9, 1.9, 2.4, 32.3, 0.8, 12.9, 22.9, 16.5, 1.7 | 11.2±9.2
  Ours:                                  6.9, 5.6, 4.8, 11.1, 7.4, 2.7, 4.4, 49.7, 0.9, 20.1, 20.1, 15.6, 0.6 | 11.5±13.2

EER (%):
  LBP+SVM (Boulkenafet et al. (2017b)):  20.8, 18.6, 36.3, 21.4, 37.2, 7.5, 14.1, 51.2, 19.8, 16.1, 34.4, 33.0, 7.9 | 24.5±12.9
  Auxiliary (Liu et al. (2018c)):        14.0, 4.3, 11.6, 12.4, 24.6, 7.8, 10.0, 72.3, 10.1, 9.4, 21.4, 18.6, 4.0 | 17.0±17.7
  DTL (Liu et al. (2019a)):              10.0, 2.1, 14.4, 18.6, 26.5, 5.7, 9.6, 50.2, 10.1, 13.2, 19.8, 20.5, 8.8 | 16.1±12.2
  CDCN (Yu et al. (2020b)):              8.2, 7.8, 8.3, 7.4, 20.5, 5.9, 5.0, 47.8, 1.6, 14.0, 24.5, 18.3, 1.1 | 13.1±12.6
  SDTN (Liu et al. (2020)):              7.6, 3.8, 8.4, 13.8, 14.5, 5.3, 4.4, 35.4, 0.0, 19.3, 21.0, 20.8, 1.6 | 12.0±10.0
  CDCN++ (Yu et al. (2020b)):            9.2, 5.6, 4.2, 11.1, 19.3, 5.9, 5.0, 43.5, 0.0, 14.0, 23.3, 14.3, 0.0 | 11.9±11.8
  HMP (Yu et al. (2020a)):               13.4, 5.2, 8.3, 9.7, 13.6, 5.8, 2.5, 33.8, 0.0, 14.0, 23.3, 16.6, 1.2 | 11.3±9.5
  Ours:                                  5.2, 4.4, 4.4, 10.1, 8.6, 2.6, 4.3, 47.2, 0.0, 19.6, 18.6, 12.4, 0.7 | 10.6±12.6

TPR@FNR=0.5% (%):
  SDTN (Liu et al. (2020)):              45.0, 40.5, 45.7, 36.7, 11.7, 40.9, 74.0, 0.0, 67.5, 16.0, 13.4, 9.4, 62.8 | 35.7±23.9
  Ours:                                  55.1, 46.4, 57.3, 65.1, 33.0, 91.7, 76.7, 0.0, 100.0, 46.4, 31.8, 15.4, 97.7 | 53.7±31.8

Table 4.4 The evaluation on SiW-M Protocol II: unknown spoof detection.
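The observation above, that print traces concentrate in the low-frequency component while replay traces concentrate in the high-frequency component, can be illustrated with a crude frequency split. The box-blur low-pass below is only a stand-in for the multi-scale decomposition in PhySTD, and `split_frequency` is a hypothetical helper, not part of the actual network:

```python
import numpy as np

def split_frequency(trace, k=7):
    """Split an (H, W) trace map into a low-frequency part (local
    mean, akin to color-distortion traces) and a high-frequency
    residual (akin to Moire/texture traces). A k x k box blur stands
    in for the low-pass filter; this is illustrative only."""
    pad = k // 2
    padded = np.pad(trace, pad, mode='edge')
    H, W = trace.shape
    low = np.zeros_like(trace, dtype=float)
    for i in range(H):
        for j in range(W):
            low[i, j] = padded[i:i + k, j:j + k].mean()
    high = trace - low                      # residual = high frequency
    return low, high
```

By construction the two parts sum back to the original trace, so the split loses no information; only the allocation between the two bands differs across spoof types.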
For experiments on SiW-M (Protocols I, II, and III), we additionally report the TPR at FNR equal to 0.5%. While EER and ACER provide the theoretical evaluation, users in real-world applications care more about the true spoof detection rate under a given live detection error rate, and hence TPR can better reflect how well the model can detect one or a few spoof attacks in practice. As shown in Tab. 4.3, we improve the overall TDR of our preliminary version Liu et al. (2020) by 29.5%.

Columns: Replay | Print | 3D Mask: Half / Silic. / Trans. / Paper / Mann. | Makeup: Ob. / Im. / Cos. | Partial: Funny. / Papergls. / Paper | Overall (±std)

ACER (%):
  Auxiliary (Liu et al. (2018c)):  6.7, 5.6, 8.5, 7.5, 11.6, 6.7, 6.4, 8.9, 5.7, 6.1, 14.3, 15.9, 5.4 | 8.4±3.4
  Ours:                            4.7, 3.5, 3.4, 3.3, 6.4, 2.6, 3.8, 7.0, 2.3, 3.2, 10.7, 7.3, 3.2 | 4.7±2.4

EER (%):
  Auxiliary (Liu et al. (2018c)):  6.4, 5.6, 7.7, 6.5, 10.3, 6.1, 6.1, 8.4, 5.1, 6.3, 15.3, 13.1, 5.7 | 7.9±3.2
  Ours:                            4.1, 2.8, 3.4, 3.1, 5.6, 3.6, 3.0, 6.7, 2.2, 3.4, 10.2, 8.6, 2.2 | 4.5±2.5

TPR@FNR=0.5% (%):
  Auxiliary (Liu et al. (2018c)):  60.4, 65.5, 64.4, 70.4, 47.5, 67.0, 71.6, 64.3, 75.1, 69.8, 45.8, 47.8, 62.9 | 62.5±9.7
  Ours:                            87.4, 78.7, 81.0, 84.5, 69.0, 86.3, 84.7, 85.0, 91.0, 89.3, 66.6, 64.4, 91.1 | 81.6±9.2

Table 4.5 The evaluation on SiW-M Protocol III: open-set spoof detection.

Figure 4.8 Examples of each spoof trace component. (a) the input sample faces. (b) B. (c) C. (d) T. (e) P. (f) the live counterpart reconstruction and zoom-in details. (g) results from Liu et al. (2020). (h) results from Step 1 + Step 2 with a single trace representation.

4.4.3 Anti-Spoofing for Unknown and Open-set Spoofs

Another important aspect is to test the performance on unknown spoofs. To use SiW-M for unknown spoof detection, Liu et al. (2019a) define the leave-one-out testing protocols, termed SiW-M Protocol II. In this protocol, each model (i.e., one column in Tab. 4.4) is trained with 12 types of spoof attacks (as known attacks) plus 80% of the live faces, and tested on the remaining 1 attack (as the unknown attack) plus the remaining 20% of live faces. As shown in Tab. 4.4, our PhySTD achieves significant improvement over our preliminary version, with relative gains of 11.7% on the overall EER, 19.0% on the overall ACER, and 50.4% on the overall TPR. Specifically, we reduce the EERs of half mask, paper glasses, transparent mask, replay attack, and partial paper relatively by 47.6%, 40.4%, 37.7%, 31.6%, and 56.3%, respectively. Overall, compared with the top 7 performances, we outperform the SOTA performance on EER/TPR and achieve comparable ACER. Among all, the detection of silicone mask, paper-crafted mask, mannequin head, impersonation makeup, and partial paper attacks is relatively good, with the detection accuracy (i.e., TPR@FNR=0.5%) above 65%. Obfuscation makeup is the most challenging one with a TPR of 0%, where we predict all the spoof samples as live. This is due to the fact that the makeup looks very similar to the live faces, while being dissimilar to any other spoof types. However, once we obtain a few samples, our model can quickly recognize the spoof traces on the eyebrow and cheek, synthesize new spoof samples, and successfully detect the attack (TPR=41.1% in Tab. 4.3).

Metrics (%)    Attacks        Protocol I  Protocol II  Protocol III
ACER           Impersonation  2.2         4.0          3.3
               Obfuscation    4.6         14.9         5.4
EER            Impersonation  1.4         3.1          3.2
               Obfuscation    3.6         14.0         5.1
TPR@FNR=0.5%   Impersonation  87.8        73.3         85.9
               Obfuscation    84.6        47.0         79.5

Table 4.6 The performance comparison between impersonation attacks and obfuscation attacks.

Moreover, in the real-world scenario, the testing samples can be either a known spoof attack or an unknown one. Thus, we propose SiW-M Protocol III to evaluate this open-set testing situation.
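The leave-one-out construction of Protocol II can be sketched as below. The helper name `protocol_ii_splits` is hypothetical, and the sketch glosses over how the 80/20 live split is actually drawn (by subject/video in the real protocol):

```python
def protocol_ii_splits(spoof_types, live_ids):
    """SiW-M Protocol II (leave-one-out): for each spoof type, train
    on the other 12 types plus 80% of the live faces, and test on the
    held-out type plus the remaining 20% of the live faces."""
    n_live_train = int(0.8 * len(live_ids))
    for unknown in spoof_types:
        train = {'spoof': [t for t in spoof_types if t != unknown],
                 'live': live_ids[:n_live_train]}
        test = {'spoof': [unknown],
                'live': live_ids[n_live_train:]}
        yield unknown, train, test
```

With the 13 SiW-M spoof types this yields 13 models, one per column of Tab. 4.4, each evaluated only on an attack type it has never seen.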
In Protocol III, we follow the train/test split from Protocol I, and then further remove one spoof type as the unknown attack. During testing, we test on the entire set of unknown spoof samples as well as the test split of the known spoof samples. The results are reported in Tab. 4.5. Compared to the SOTA face anti-spoofing method Liu et al. (2018c), our approach substantially outperforms it in all three metrics.

Figure 4.9 Examples of spoof trace disentanglement on SiW (a-h) and SiW-M (i-x). (a)-(d) items are print attacks and (e)-(h) items are replay attacks. (i)-(x) items are live, print, replay, half mask, silicone mask, paper mask, transparent mask, obfuscation makeup, impersonation makeup, cosmetic makeup, paper glasses, partial paper, funny eyeglasses, and mannequin head. The first column is the input face, the second column is the overall spoof trace (I − Î), and the third column is the reconstructed live.

Impersonation v.s. Obfuscation  In Tab. 4.6, we show the comparison of our performance on impersonation attacks and obfuscation attacks on the 3 protocols of the SiW-M database. On all 3 protocols, we see a better performance on impersonation attacks over obfuscation attacks, especially in the Protocol II unknown attack situations. Obfuscation attacks generally have a larger appearance discrepancy compared to impersonation attacks, and it's naturally more difficult for the CNN model to do out-of-distribution predictions. In practice, most of the attacks are impersonation attacks, and hence our solution can be effective in practical situations.

Label \ Predict   Live     Print 1   Print 2   Replay 1   Replay 2
Live              56 (-4)  1 (+1)    1 (+1)    1 (+1)     1 (+1)
Print 1           0        43 (+2)   11 (+9)   3 (-8)     3 (-3)
Print 2           0        9 (-25)   48 (+37)  1 (-8)     2 (-4)
Replay 1          1 (-9)   2 (-1)    3 (+3)    51 (+38)   3 (-28)
Replay 2          1 (-7)   2 (-5)    2 (+2)    3 (-3)     52 (+13)

Table 4.7 Confusion matrix of spoof medium classification based on spoof traces. The values in parentheses are the changes relative to the previous method Jourabloo et al. (2018); in the original table, green represents improvement over Jourabloo et al. (2018) and red represents performance drop.
Label \ Predict   Live   Print   Replay   Masks   Makeup   Partial
Live              1166   6       3        0       0        0
Print             1      40      1        3       0        1
Replay            3      1       32       1       0        1
Masks             3      1       1        90      0        3
Makeup            3      0       0        0       36       0
Partial           2      0       0        2       0        146

Table 4.8 Confusion matrix of 6-class spoof trace classification on the SiW-M database.

4.4.4 Spoof Traces Classification

To quantitatively evaluate the spoof trace disentanglement, we perform a spoof medium classification on the disentangled spoof traces and report the classification accuracy. The spoof traces should contain sufficient spoof information, so that they can be used for clustering without seeing the face. To make a fair comparison with Jourabloo et al. (2018), we remove the additional spoof type information from the preliminary mask P_0. That is, for this experiment, we only use the additive traces {B, C, T} to learn the trace classification. After training with only the binary live/spoof labels, we fix PhySTD and apply a simple CNN (i.e., AlexNet) on the estimated additive traces to do a supervised spoof medium classification. We follow the same 5-class testing protocol of Jourabloo et al. (2018) on Oulu-NPU Protocol 1. We report the classification accuracy as the ratio between correctly predicted samples from all classes and all testing samples, shown in Tab. 4.7. Our model can achieve a 5-class classification accuracy of 83.3%. If we treat the two print attacks as the same class and the two replay attacks as the same class, our model can achieve a 3-class classification accuracy of 92.0%. Compared with the prior method Jourabloo et al. (2018), we show an improvement of 29% on the 5-class model. In addition, we train the same CNN on the original images instead of the estimated spoof traces for the same spoof medium classification task, and the classification accuracy can only reach 80.6%. This further demonstrates that the estimated traces do contain significant information to distinguish different spoof mediums.

Figure 4.10 Examples of the spoof data synthesis. The first row shows the source spoof faces, the first column shows the target live faces, and the remaining images are the synthesized spoof faces from the live faces with the corresponding spoof traces.
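The "classify the medium from the trace" experiment can be mimicked with a much simpler classifier. Below, a nearest-class-mean model stands in for the AlexNet used in the text (the helper names are hypothetical, and the flattened-trace representation is an assumption for illustration):

```python
import numpy as np

def fit_prototypes(traces, labels):
    """Average the (flattened) estimated traces per spoof-medium
    class. `traces` is an (N, D) array, `labels` a length-N list."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    protos = np.stack([traces[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def predict(traces, classes, protos):
    """Assign each trace to the nearest class prototype (L2)."""
    d = ((traces[:, None, :] - protos[None]) ** 2).sum(-1)
    return [classes[i] for i in d.argmin(1)]

def confusion_matrix(y_true, y_pred, classes):
    """Rows = true label, columns = predicted label (as in Tab. 4.7)."""
    idx = {c: i for i, c in enumerate(classes)}
    M = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[idx[t], idx[p]] += 1
    return M
```

If the traces of different mediums are well separated, even this prototype classifier clusters them correctly, which is the intuition the chapter tests with a stronger CNN.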
We also execute the spoof trace classification task on more spoof types in the SiW-M database. We leverage the train/test split of SiW-M Protocol I. We train the PhySTD till convergence, and use the estimated traces from the training set to train the trace classification network. We explore the 6-class classification scenario, shown in Tab. 4.8. Our 6-class model can achieve a classification accuracy of 92.0%. Since the traces are more distinct among different spoof types, this performance is even better than the 5-class classification on the print/replay scenario in Oulu-NPU Protocol 1. This further demonstrates that PhySTD can estimate spoof traces that contain significant information about spoof mediums and can be applied to multiple spoof types.

4.4.5 Ablation Study

In this section, we show the importance of each design of our proposed approach on SiW-M Protocol I, in Tab. 4.3. Our baseline is the auxiliary FAS Liu et al. (2018c), without the temporal module. It consists of the backbone encoder and the depth estimation network. When including the image decomposition, the baseline becomes training step 1 in Alg. 1, as the traces are not activated without training step 2. To validate the effectiveness of the GAN training, we report the results from the baseline model with our GAN design, denoted as Step 1 + Step 2. We also provide a control experiment where the traces are represented by a single component, to demonstrate the effectiveness of the proposed 5-element trace representation. This model is denoted as Step 1 + Step 2 w/ single trace. In addition, we evaluate the effect of training with more synthesized data via enabling training step 3, denoted as Step 1 + Step 2 + Step 3, which is our final approach.

As shown in Tab. 4.3, the baseline model (Auxiliary) can achieve a decent performance of EER 6.7%. Adding image decomposition to the baseline (Step 1) can improve the EER from 6.7% to 4.3%, but more live samples are predicted with higher scores, causing a worse ACER. Adding a simple GAN design (Step 1 + Step 2 w/ single trace) may lead to a similar EER performance of 5.8%, but based on the TPR (59.3% → 74.8%) its practical performance is improved. With the proper physics-guided trace disentanglement, we can improve the EER to 2.8% and the TPR to 89.7%. Our final design achieves the performance of HTER 2.8%, EER 2.5%, and TPR 91.2%. Compared with our preliminary version, the EER is improved by 47.9%, the HTER is improved by 31.7%, and the TPR is improved by 29.5%.

4.4.6 Visualization

Spoof trace components  In Fig. 4.8, we provide an illustration of each spoof trace component. Strong color distortion (low-frequency trace) shows up in the print attacks. Moiré patterns in the replay attack are well detected in the high-frequency trace. The local specular highlights in the transparent mask are well presented in the low- and mid-frequency components, and the inpainting process further refines the most highlighted area. For the two glasses attacks, the color discrepancy is corrected in the low-frequency trace, and the sharp edges are corrected in the mid- and high-frequency traces. Each component shows a consistent semantic meaning on different spoof samples, and this successful trace disentanglement can lead to better visual results. As shown on the right side of Fig. 4.8, we compare with our preliminary version Liu et al. (2020) and the ablated GAN design with a single trace representation. The result of the single trace representation shows strong artifacts on most of the live reconstructions. The multi-scale design from our preliminary version has already shown a large visual quality improvement, but still has some spoof traces (e.g., glass edges) remaining in the live reconstruction. In contrast, our approach can further handle the missing traces and achieve better visualization.

Figure 4.11 The t-SNE visualization of features from different scales and layers. The first 3 visualizations are from the encoder features F_1, F_2, F_3, and the last 2 visualizations are from the features that produce {B, C, T} and {P, I_P}.
Live reconstruction  In Fig. 4.9, we show more examples from different spoof types in the SiW and SiW-M databases. The overall trace is the exact difference between the input face and its live reconstruction. For the live faces, the trace is zero, and for the spoof faces, our method removes spoof traces without unnecessary changes, such as identity shift, and makes them look like live faces. For example, strong color distortion shows up in print/replay attacks (Fig. 4.9 a-h) and some 3D mask attacks (Fig. 4.9 l-o). For makeup attacks (Fig. 4.9 q-s), the fake eyebrows, lipstick, wax, and cheek shade are clearly detected. The folds and edges (Fig. 4.9 t-w) are well detected and removed in paper-crafted masks, paper glasses, and partial paper attacks.

Figure 4.12 The illustration of removing the disentangled spoof trace components one by one. The estimated spoof trace elements of the input spoof (the first column) are progressively removed in the order of B, C, T, and the inpainting trace. The last column shows the reconstructed live image after removing all three additive trace components and the inpainting trace. (a) Replay attack; (b) Makeup attack; (c) Mask attack; (d) Paper glasses attack.

Spoof synthesis  Additionally, we show examples of new spoof synthesis using the disentangled spoof traces, which is an important contribution of this work. As shown in Fig. 4.10, the spoof traces can be precisely transferred to a new face without changing the identity of the target face. Due to the additional inpainting process, spoof attacks such as transparent masks and partial attacks can be better attached to the new live face. Thanks to the proposed 3D warping layer, the geometric discrepancy between the source spoof trace and the target face can be corrected during the synthesis.
Especially on the second source spoof, the right part of the traces is successfully transferred to the new live face while the left side remains live. This demonstrates that our trace regularization can suppress unnecessary artifacts generated by the network. Both the live reconstruction results in Fig. 4.9 and the spoof synthesis results in Fig. 4.10 demonstrate that our approach disentangles visually convincing spoof traces that help face anti-spoofing.

Figure 4.13 The illustration of double spoof trace disentangling. The left 4 samples are live faces, and the right 4 samples are spoof faces. (a) Original input. (b) 1st round live reconstruction. (c) 1st round spoof traces. (d) 2nd round live reconstruction. (e) 2nd round spoof traces.

Spoof trace removing process  As shown in Fig. 4.12, we illustrate the effects of the trace components by progressively removing them one by one. For the replay attack, the spoof sample comes with strong over-exposure as well as a clear Moiré pattern. Removing the low-frequency trace can effectively correct the over-exposure and color distortion caused by the digital screen, and removing the texture pattern in the high-frequency trace can peel off the high-frequency grid effect and reconstruct the live counterpart.

For the makeup attack, since there is no strong color range bias, removing the estimated low-frequency trace would mainly remove the lipstick color and fake eyebrow, but in the meantime bring a few artifacts at the edges. Next, removing the content pattern adequately lightens the shadow on the cheek and the fake eyebrows. Finally, removing the texture pattern corrects the spoof traces from wax, eyeliner, and shadow on the cheek.

Similarly, in mask and partial attacks, the reconstruction is gradually refined as we remove the components one by one.
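The progressive removal above can be sketched as a simple composition over the trace components. The exact composition operators used by PhySTD are not spelled out here, so the additive-subtraction-plus-inpainting form below is an assumption for illustration, and `reconstruct_live` is a hypothetical helper:

```python
import numpy as np

def reconstruct_live(I, B, C, T, P, I_P):
    """Progressively remove the disentangled trace components from a
    spoof face I (all arrays H x W x 3 in [0, 1]; P is an H x W x 1
    inpainting mask). The additive traces B (low-frequency),
    C (content) and T (texture/high-frequency) are subtracted, then
    the inpainting region P is replaced by the inpainted content I_P.
    Returns the intermediate results of each stage."""
    stages = [I]
    x = I - B; stages.append(x)         # remove low-frequency trace
    x = x - C; stages.append(x)         # remove content trace
    x = x - T; stages.append(x)         # remove texture trace
    x = (1 - P) * x + P * I_P           # fill the inpainting region
    stages.append(np.clip(x, 0, 1))
    return stages
```

For a live face, all traces and the inpainting mask are (near) zero, so every stage returns the input unchanged, which matches the zero-trace behavior described for live faces above.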
To validate the quality of spoof trace removal, we execute a double spoof trace disentanglement, shown in Fig. 4.13. For each sample, we execute a 2-round disentanglement, where the second-round input is the live reconstruction from the first round. As we can see from the figure, regardless of the original spoofness, the networks recognize all live reconstructions as live, which shows high confidence in the first-round spoof trace removal. The 1st-round average spoof score for live faces is 0.11, and the second round is 0.03. The 1st-round average spoof score for spoof faces is 0.89, and the second round is 0.03. However, we can still see different degrees of spoof traces left in the reconstructed live faces. How to effectively measure the quality of spoof trace removal and use it for a second-round supervision can be a future research direction.

To validate the relation between spoof trace intensity and the spoofness score, we execute a reverse double spoof trace disentanglement, where the input of the second disentanglement round is the original input with only 50% of the estimated spoof traces removed. For this experiment, the second-round score for live faces changed from 0.03 to 0.07, and the second-round score for spoof faces changed from 0.03 to 0.53. Based on the results, we can tell that the spoof trace intensity is positively correlated with the spoofness score.

t-SNE visualization  We use t-SNE Maaten & Hinton (2008) to visualize the encoder features F_1, F_2, F_3, and the features that produce {B, C, T} and {P, I_P}. t-SNE is able to project the output features from different scales and layers to 2D by preserving the KL divergence distance. As shown in Fig. 4.11, among the three feature scales in the encoder, F_3 is the most separable feature space, the next is F_1, and the worst is F_2. The features for the additive traces {B, C, T} are well-clustered into semantic sub-groups of live, makeup, mask, and partial attacks. As we know the inpainting masks for live samples are close to zero, the feature for the inpainting traces {P, I_P} shows that the inpainting process mostly updates the partial attacks, and then some makeup attacks and mask attacks, i.e., the green dots being further away from the black dots means they have a greater magnitude. This validates our prior knowledge of the inpainting process.

4.5 Conclusions

This work proposes a physics-guided spoof trace disentanglement network (PhySTD) to tackle the challenging problem of disentangling spoof traces from input faces. With the spoof traces, we reconstruct the live faces as well as synthesize new spoofs. To correct the geometric discrepancy in synthesis, we propose a 3D warping layer to deform the traces. The disentanglement not only improves the SOTA of face anti-spoofing in the known, unknown, and open-set spoof settings, but also provides visual evidence to support the model's decision. Note that, even though the proposed spoof trace modeling is based on a physical approximation of the spoof presentation attack, the whole learning process still relies intensely on the data. In addition, our visualization/interpretation of spoof attacks is on the image appearance, rather than at the feature level, such as Grad-CAM.

Chapter 5
Visualization: Blind Removal of Facial Foreign Shadow

5.1 Introduction

In our daily activities, many external objects around us can cast shadows on faces, termed facial foreign shadow. For instance, while we take a selfie outdoors, our hand and camera might block part of the sunlight and create a shadow on the face. Dynamic and scattered shadows may be produced by leaves while walking under trees. During driving, the driver may confront the high-contrast lighting caused by the direct sunlight and car pillars. While people may experience these situations every day, the shadow cast sometimes can be unwanted. In some cases, the shadows cast should be removed for aesthetic purposes, such as photoshop and face editing. In others, the shadows cast could negatively impact face-related tasks, such as face recognition, expression analysis, age estimation, and driver monitoring.

While facial foreign shadow removal is a relatively new topic, there are a few related studies.
Many works aim to handle the self shadow and relight the face under a different lighting, via quotient images Wen et al. (2003); Zhou et al. (2019), inverse rendering Nagano et al. (2019); Nestmeyer et al. (2020), and style transfer Gu et al. (2019); Lee et al. (2020). Those methods focus more on the global lighting distribution and might be limited in handling the arbitrary high-frequency structure caused by harsh foreign shadows. There are also face completion works under structured occlusions, such as square, circle, and lattice Li et al. (2020); Yang et al. (2019b); Zhang et al. (2017b). Compared with foreign shadow removal, face completion is relatively easier, as the shape is less complicated and the occlusion often comes with a single color such as white. Further, some works study shadows on generic objects Le & Samaras (2019); Shor & Lischinski (2008). While they excel at shadow detection, when applied to faces, observable artifacts can be detected in the de-shadowed results due to the lack of face priors.

Figure 5.1 The results of our shadow removal model on images from our Shadow Face in the Wild (SFW) database (top) and the UCB database Zhang et al. (2020b) (bottom). From left to right: input face, output face, and shadow matte.

The major problem of these prior methods is that they cannot handle the high-frequency structure caused by harsh shadows, as demonstrated in Sun et al. (2019); Zhang et al. (2020b). Instead of predicting illumination, Xuaner et al. propose a single image-based approach using only perceptual and pixel intensity losses, and train the network on a synthetic shadow dataset Zhang et al. (2020b). It turns out the pixel intensity loss works better to remove harsh shadows and recover details. However, a model based only on perceptual and pixel intensity losses does not generalize well in practice, as it is hard to build a training dataset covering real-world complex lighting conditions.
As stated above, this chapter aims to detect and remove the foreign shadow from in-the-wild faces. While we focus on the foreign shadow, we would also like to address the strong self-shadow caused by self occlusion (see Fig. 5.2). To tackle this problem, we face three major challenges. First, the shadow on in-the-wild faces is arbitrary, varying from different sizes and shapes, to different locations, colors, blurriness, and intensities. Prior works Le & Samaras (2019); Shor & Lischinski (2008); Zhang et al. (2020b) model the shadow directly in the RGB space. Given the level of diversity, they have a hard time addressing all the discrepancies, leaving some observable artifacts in de-shadowed faces. Second, there are few public databases for training and evaluation. To capture paired shadow and non-shadow faces, both the subject and the photographer need to be perfectly still, which is rarely feasible. Third, sometimes the shadow removal is extended from single images to video. On one hand, multiple frames (e.g., a live photo) may provide additional cues to single-image shadow removal. On the other hand, video shadow removal requires additional temporal consistency.

To address the aforementioned challenges, we propose a novel blind removal model of facial foreign shadow. To handle the shadow diversity, we propose a simple yet effective approach to decompose the direct RGB shadow removal into grayscale shadow removal and colorization. We show that, without color, the shadow modeling becomes a much simpler task and the grayscale removal model is easy to generalize to unseen data. After that, with the knowledge of the shadow region from the grayscale shadow removal, the colorization is turned into an image inpainting process. Without seeing the biased color information from the shadow region, the colorization process also becomes more generalizable. Moreover, to ensure the temporal consistency, we propose a temporal sharing module (TSM) to aggregate the information among multiple frames. TSM includes an efficient warping layer, in order to handle frames with pose and expression variations.
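The grayscale-then-colorize decomposition can be sketched as a two-stage pipeline. The two networks are passed in as callables because their internals are described later in the chapter; their signatures and the helper name `deshadow_rgb` are assumptions for illustration:

```python
import numpy as np

def deshadow_rgb(I, gray_removal, colorize):
    """Two-stage decomposition of RGB shadow removal:
    stage 1 removes the shadow in grayscale and predicts a shadow
    matte; stage 2 re-colorizes the de-shadowed luminance, treating
    the (previously shadowed) matte region as an inpainting target.
    `I` is H x W x 3; `gray_removal` and `colorize` stand in for the
    two learned networks."""
    gray = I.mean(axis=-1, keepdims=True)   # simple luminance proxy
    gray_free, matte = gray_removal(gray)   # stage 1: grayscale removal
    rgb_free = colorize(gray_free, matte, I)  # stage 2: colorization
    return rgb_free, matte
```

The point of the split is that stage 1 never has to model color distortion, and stage 2 never sees the biased colors inside the shadow region, which is what makes each stage easier to generalize.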
For training the model, we follow the process proposed by Zhang et al. (2020b) to build a synthetic database that contains paired shadow and shadow-free faces. Foreign shadows are generated with randomized properties and moving trajectories. Further, we collect a face database with 280 videos captured under highly dynamic environments for evaluation purposes. The external objects casting shadows include hands, books, leaves, trees, window blinds, car pillars, and buildings. To quantitatively evaluate shadow segmentation, we provide detailed pixel-level segmentation annotation for this database.

Figure 5.2 Examples of (a) foreign shadow, (b) strong self shadow, and (c) normal self shadow. Our model is designed to remove unwanted shadows in (a-b) while keeping the normal shadow in (c).

In summary, the main contributions of this chapter include:
- A novel approach to decompose RGB shadow removal into grayscale shadow removal and colorization;
- A temporal sharing module to ensure video consistency;
- A face shadow database under dynamic environments;
- SOTA results and photo-realistic de-shadow quality.

5.2 Related Work

Face relighting  Face relighting methods could be roughly divided into three categories: quotient image-based, style transfer, and inverse rendering. The color ratio (i.e., quotient image) is proposed in Shashua & Riklin-Raviv (2001) to transfer a frontal face from one lighting to another. This basic idea has been extended to handle different poses, use ratios of radiance environment maps, and generate synthetic relighting datasets in Stoschek (2000); Wen et al. (2003); Zhou et al.
(2019). Facial lighting can also be changed by style transfer Gu et al. (2019); Lee et al. (2020); Liao et al. (2017); Shih et al. (2014); Shu et al. (2017a); Sun et al. (2019). Similar to quotient images, style transfer methods need at least a reference image as the target style. Moreover, the face poses of the input and reference images are often very close. In the category of inverse rendering, a face image could be decomposed into multiple components such as geometry, reflectance, and lighting Egger et al. (2018); Nagano et al. (2019); Nestmeyer et al. (2020); Sengupta et al. (2018); Shu et al. (2017b); Tewari et al. (2017); Tran & Liu (2018); Wang et al. (2008). For example, in Nestmeyer et al. (2020), the intrinsic components (e.g., normal and albedo) are predicted and combined through diffuse rendering in the network, and a second network is then applied to learn the non-diffuse residual. In general, inverse rendering methods rely on multiple sub-networks for the decomposition, which are not sufficiently effective to handle facial foreign shadows in videos.

Face completion  Face completion aims to fill in the missing or occluded face regions with semantically meaningful information. In Zhang et al. (2017b), Zhang et al. proposed a DemeshNet with two sub-networks to remove mesh-like lines or watermarks on faces. Li et al. proposed a disentangling and fusing network that contains discriminators in three domains, i.e., occluded faces, clean faces, and structured occlusions Li et al. (2020). The face inpainting network in Yang et al. (2019b) comprises a landmark-predicting subnet and an inpainting subnet. Different from shadow removal, the structured occlusions either are opaque or contain repeated patterns. The networks are mainly used to hallucinate the invisible face regions.
Generic Shadow Detection and Removal Without much training data, early works in general-purpose shadow detection and removal mainly study shadow properties, especially around shadow edges Chuang et al. (2003); Finlayson et al. (2002); Wu et al. (2012); Wu & Tang (2005). For example, Wu et al. (2012) applied graph-cut inference to detect shadow regions, and then used shadow matting to generate soft shadow boundaries. Deep learning based methods have been proposed recently to detect and remove shadows Ding et al. (2019); Le & Samaras (2019); Qu et al. (2017); Wang et al. (2018a). Hu et al. (2018) designed the direction-aware spatial context module and applied a spatial RNN to detect shadows. Cun et al. (2020) learned to hierarchically aggregate the dilated multi-contexts and attentions. Authors in Zhang et al. (2020b) demonstrated that the general-purpose methods such as Cun et al. (2020); Hu et al. (2018) cannot preserve the authenticity of the input faces. One reason is that these general-purpose networks are unable to capture the face characteristics. For instance, human face skin is a highly scattering material that also has a complex absorption spectrum Donner & Jensen (2006). In this work, we propose a novel two-stage shadow modeling that can better handle both subsurface scattering effects and color distortion.

5.3 Proposed Method

5.3.1 Shadow synthesis and modeling

Shadow is produced by a foreign object that blocks part of the light rays from arriving at the face. Let a matte M represent the shadow shape; the shadow formation can be modeled as a blending

Figure 5.3 Illustration of data synthesis components.

between the well-illuminated face I_b and the under-illuminated face I_d:

I = I_b ⊙ (1 − M) + I_d ⊙ M, (5.1)

where ⊙ denotes element-wise multiplication. As the real-world shadow varies in both shape and intensity, it is vital to have {I, I_b} paired data with a large variety of M to train a generalized shadow removal model. However, it is hardly feasible to collect a large-scale dataset with such pairing, as the subject needs to be perfectly static while capturing the pair. Therefore, creating a synthetic dataset becomes our go-to approach to tackle this problem.
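The blending in Eqn. 5.1 can be sketched in a few lines of NumPy; `face_bright`, `face_dark`, and `matte` are hypothetical stand-ins for I_b, I_d, and M, not names from our implementation:

```python
import numpy as np

def blend_shadow(face_bright, face_dark, matte):
    """Eqn. 5.1: I = I_b * (1 - M) + I_d * M, all operations element-wise.

    face_bright, face_dark: HxWx3 float images in [0, 1].
    matte: HxWx1 shadow matte in [0, 1], where 1 means fully shadowed.
    """
    return face_bright * (1.0 - matte) + face_dark * matte
```

Where the matte is 0 the well-illuminated pixel is kept; where it is 1 the under-illuminated pixel is used; fractional values give soft shadow boundaries.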
Shadow synthesis As indicated in Zhang et al. (2020b), using Eqn. 5.1 to synthesize natural face shadows shall include additional variations: shape, intensity, subsurface scattering, and color. Let shape B be a binary mask that indicates the whole region affected by the foreign shadow. The shadow is often unevenly distributed, such as in mottled patterns or gradually changing patterns, depending on the relative distance between the object and the face, and the environmental lighting. We use a grayscale matte M_I to represent the uneven intensity. In addition, the light outside the shadow region would penetrate beneath the skin, reach the vessels, and scatter back, creating a red band around the shadow boundary. We represent such subsurface scattering effect by M_ss, which is computed by blurring B with a different kernel per RGB channel. Therefore, Eqn. 5.1 can be updated to:

I = I_b ⊙ (1 − B ⊙ M_ss) + I_d ⊙ B ⊙ M_ss ⊙ M_I. (5.2)

Moreover, the shadow region may be under certain color distortion, due to the blocking of part of the light. We formulate such color distortion by a 3×3 color transfer matrix C:

I_d = I_b C. (5.3)

The B, M_I, M_ss, and C are illustrated in Fig. 5.3. During the synthesis, given a well-illuminated face I_b, we generate random parameters for each component to synthesize different shadow faces I, which is detailed in Sec. 5.4.

Shadow modeling With synthetic pairwise data, we can train a model G(·) to detect and remove foreign shadows (I → I_b). Despite the complexity of the shadow synthesis process, prior works Le & Samaras (2019); Shor & Lischinski (2008); Zhang et al. (2020b) opt to simplify the relation between I and I_b in G(·) as:

W, N ← G(I | ω), (5.4)
Î_b = I ⊙ W + N, (5.5)

where ω are the parameters of the shadow removal model, and both the scaling W and offset N are of the same size as I. The motivation of this simplification is two-fold: 1) precisely estimating all shadow components (i.e., B, M_I, M_ss, C) can be very challenging, and 2) even with full supervision of all the components, reversing the shadow formation may raise a convergence issue. This is due to the ambiguity in the shadow parameterization, where one shadow can be generated from different combinations of shadow components.
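Under stated assumptions (a toy box blur standing in for the per-channel Gaussian kernels, and illustrative variable names of our own), the synthesis of Eqns. 5.2-5.3 might be sketched as:

```python
import numpy as np

def box_blur(mask, k):
    """Separable box blur with odd kernel size k; a simple stand-in for the
    per-channel blurs that produce the subsurface-scattering mask M_ss."""
    pad = k // 2
    padded = np.pad(mask, pad, mode='edge')
    kern = np.ones(k) / k
    out = np.apply_along_axis(lambda v: np.convolve(v, kern, 'valid'), 0, padded)
    out = np.apply_along_axis(lambda v: np.convolve(v, kern, 'valid'), 1, out)
    return out

def synthesize_shadow(face_bright, shape_mask, intensity, color_matrix,
                      kernels=(3, 5, 7)):
    """Eqns. 5.2-5.3: I = I_b * (1 - B * M_ss) + I_d * B * M_ss * M_I,
    with I_d = I_b C.

    face_bright: HxWx3 image; shape_mask: HxW binary mask B;
    intensity: HxW matte M_I; color_matrix: 3x3 color transfer matrix C.
    """
    # M_ss: blur B with a different kernel per RGB channel.
    m_ss = np.stack([box_blur(shape_mask.astype(float), k) for k in kernels],
                    axis=-1)
    core = shape_mask[..., None] * m_ss            # B * M_ss
    face_dark = face_bright @ color_matrix         # Eqn. 5.3: I_d = I_b C
    return face_bright * (1.0 - core) + face_dark * core * intensity[..., None]
```

During training-data generation, the scales, rotations, and blur parameters fed to such a routine are randomized per sample, as described in Sec. 5.4.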
However, the prior works based on Eqn. 5.5 have a hard time addressing all the discrepancies between the shadow and non-shadow regions, leaving some observable artifacts in the de-shadowed faces. We observe that it is not straightforward to derive Eqn. 5.5 from Eqn. 5.1. Due to the existence of the color transfer matrix C, W and N themselves become a function of I_b, instead of being independent of I_b. Thus, the model learning becomes a chicken-and-egg problem, which may easily turn into a memorization mode (i.e., a type of learning that generalizes poorly Corneanu et al. (2019)). To tackle this issue, we propose to decompose the color shadow removal into grayscale shadow removal and colorization. While dealing with shadow removal in grayscale, C in Eqn. 5.3 simply becomes a scalar, and hence both C and M_ss can be integrated into M_I as M'_I. We can then transform the relation of Eqn. 5.1 into:

Î_b,gs = I_gs ⊘ (1 − B + B ⊙ M'_I) = I_gs ⊘ W, (5.6)

where Î_b,gs and I_gs are the grayscale versions of I_b and I, ⊘ is element-wise division, and W = 1 − B + B ⊙ M'_I. It is clear that Eqn. 5.6 is in a closed form and well aligned with Eqn. 5.5. As W and N are detached from I_b, they are easier to learn. Next, we simply need to colorize the grayscale face to get the face recovery in RGB. With the knowledge provided by the grayscale shadow removal, we turn the blind color recovery into a mask-guided image inpainting.

Figure 5.4 Illustration of our network architecture. The model mainly consists of an encoder, a shadow matte decoder, a color matrix decoder, and a shadow residual decoder. The Temporal Sharing Module (TSM) can be easily plugged into the face encoder. Together with the temporal consistency loss L_T, we can leverage the unlabeled image frames effectively. The green dashed lines indicate the short-cut connections and the orange dashed lines and boxes indicate the loss functions.

The overall pipeline is shown in Fig. 5.4. Our approach consists of three major steps: 1)
grayscale shadow removal (Sec. 5.3.2), 2) colorization (Sec. 5.3.3), and 3) temporal information sharing (Sec. 5.3.4). Steps 1 and 2 are the key ingredients for single-frame shadow removal, and step 3 is the key ingredient for smooth video shadow removal. In Sec. 5.3.5, we discuss the losses and training strategies in detail.

5.3.2 Grayscale shadow removal

The grayscale shadow removal module takes an RGB face I ∈ R^(N²×3) as input, and outputs the scaling map W ∈ R^(N²×1) and offset map N ∈ R^(N²×1) that can recover a well-illuminated grayscale face Î_b,gs ∈ R^(N²×1) based on Eqn. 5.5. The module consists of an encoder, a stack of residual non-local blocks, and a decoder. The encoder extracts features F from input images for shadow removal. It contains 4 convolution layers and 3 downsampling steps. To encourage spatial consistency of facial lighting and albedo, we leverage the latest designs of the non-local block and visual transformer Carion et al. (2020); Dosovitskiy et al. (2020); Wang et al. (2018c). We stack 3 residual non-local blocks to process the encoder features with position encoding. After that, the decoder upsamples the features from the non-local blocks via 3 transposed convolution layers, and estimates W and N. We adopt a short-cut connection at each feature scale to bypass high-frequency information.

For the position encoding, we adopt the projected normalized coordinate code (PNCC) Zhu et al. (2017) and concatenate it to the encoder feature. PNCC is the normalized mean shape of a 3DMM Blanz & Vetter (2003), and is projected to a given face. It encodes the face semantics, as each vertex (e.g., eye corner) has its unique 3D coordinate between [0, 0, 0] and [1, 1, 1], regardless of the pose, expression, and identity. Compared with the conventional position encoding in Carion et al. (2020); Dosovitskiy et al. (2020), PNCC provides better face semantics that help to detect and remove shadows.
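When the shadow components are known, the closed form of Eqn. 5.6 can be verified numerically; in the network, W and N are of course predicted rather than computed from B and M'_I, and the variable names below are illustrative only:

```python
import numpy as np

def apply_grayscale_shadow(face_gs, b_mask, m_intensity):
    """Forward model in grayscale: I_gs = I_b,gs * W, where
    W = 1 - B + B * M'_I (all element-wise)."""
    w = 1.0 - b_mask + b_mask * m_intensity
    return face_gs * w, w

def remove_grayscale_shadow(shadow_gs, w, eps=1e-6):
    """Eqn. 5.6: recover the shadow-free grayscale face by element-wise
    division; eps guards against a degenerate (near-zero) scaling map."""
    return shadow_gs / np.maximum(w, eps)
```

Applying the two functions back to back recovers the input face exactly, which is the closed-form property that makes W easy to learn.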
5.3.3 Colorization

An important takeaway from the grayscale shadow removal module is that we can locate the shadow region as

B̂ = |Î_b,gs − I_gs| > τ, (5.7)

where B̂ is the shadow segmentation mask binarized with the threshold τ. With this knowledge, we can turn the blind color recovery process into an image inpainting process with a given inpainting region. In comparison, if no knowledge is provided to the colorization process, this two-step approach is nearly identical to the direct RGB shadow removal applied in previous work Le & Samaras (2019); Shor & Lischinski (2008); Zhang et al. (2020b), which may still suffer from the poor generalization issue.

Our colorization module breaks down into 3 steps: 1) erasing, 2) inpainting, 3) color space transformation. Structurally, the colorization module is similar to the grayscale shadow removal module. It consists of 3 residual non-local blocks and a decoder. First, based on the shadow mask B̂, we set the shadow region of F to 0 to circumvent any potential disturbance, and term it the inpainting feature. Secondly, the inpainting feature F ⊙ (1 − B̂) is concatenated with B̂ and the PNCC coding, and fed to the module. The non-local blocks aim to fill in the missing region in F, and the decoder is designed to produce an M-channel color space C ∈ R^(N²×M). In the end, we use three 1×1 convolution layers to transfer the grayscale face Î_b,gs with the color space C back to the RGB face Î_b. During the training, no gradients of the colorization module will be sent back to the grayscale shadow removal module via B̂.

5.3.4 Temporal information sharing

We can extend our network for single-frame processing to leverage the temporal information via a Temporal Sharing Module (TSM). Similar to other video-based image restoration problems, such as video deblurring, shadow motion can be arbitrary in shape and speed variations. Thus, the order of the frames might not carry useful cues for de-shadowing. As a result, we propose to adopt a temporal-wise max pooling to aggregate the illumination information among different frames, shown in Fig. 5.5.

Assuming F_1, F_2, ..., F_k are the features to be shared among k frames. Before computing the temporal-wise max pooling, we apply a warping layer to register features based on the face shape.
After the temporal-wise max pooling, we apply a reverse warping to re-align the shared feature back to each frame feature, and concatenate it with the original feature F_i for the next-stage computation. The TSM is a plug-in design for features at all scales. TSM can be used to not only share the temporal information, but also enforce the prior knowledge of face symmetry, which has been used in other tasks Wu et al. (2020). To achieve this, we treat the mirrored face as a different frame, and send it to TSM for information sharing. In case there is only a single frame available, the TSM simply concatenates with the original feature.

Figure 5.5 Illustration of the Temporal Sharing Module (TSM). It can be applied to temporal frames as well as mirrored input.

The warping layer leverages the pre-computed 68 facial landmarks via Bulat & Tzimiropoulos (2017). Given the landmarks for the neutral face s_0 and the face s_i at frame i, a sparse offset can be computed as s_{i→0} = s_0 − s_i ∈ R^(68×2) to indicate where each pixel at the landmark positions should be moved to. To obtain a dense offset map S_{i→0} ∈ R^(N²×2) indicating where each pixel in the entire feature map should be moved to, we apply a triangulation interpolation,

S_{i→0} ← Tri(s_i, s_{i→0}, N), (5.8)

where Tri(·) is Delaunay triangulation-based interpolation. The registration operation of feature F is denoted as:

F_{i→0} = F_i(S_0 + S_{i→0}), (5.9)

where S_0 = {(0, 0), (0, 1), ..., (N, N)} ∈ R^(N²×2) enumerates pixel locations in F_i. Similarly, when we get the shared feature F_max, we can use S_{0→i} to warp it back.

5.3.5 Training

We use synthetic shadow faces to train our model. We apply multiple losses to supervise all three steps in the model.
Shadow removal loss: With the paired shadow face and well-illuminated face, we enable a pixel-to-pixel supervision on the recovery in grayscale. Specifically, we introduce a weighting map to encourage the loss to focus more on the shadow and the shadow boundary, as

L_gs = E_{i∼P}[ ‖ (Î^i_b,gs − I^i_b,gs) ⊙ (1 + B_i + B^edge_i) ‖₁ / R ], (5.10)

where P indicates the synthesized data distribution, B^edge is the boundary of B, 1 + B + B^edge is the weighting map, and R is the normalization term of the weighting map. A similar loss is also applied to the RGB recovery as L_clr.

Image gradient loss: Human vision is very sensitive to high frequency artifacts, such as edges. To further suppress artifacts around shadow boundaries and recover high-frequency details beneath shadows, we adopt an image gradient loss to encourage the image gradients of Î_b and I_b to be similar. This loss is denoted as:

L_∇ = E_{i∼P}[ Σ_k ‖ ∇⌊Î^i_b⌋_k − ∇⌊I^i_b⌋_k ‖₁ ], (5.11)

where ∇ is the gradient operator and ⌊·⌋_k denotes downsampling by the ratio k = {1, 2, 4, 8}. Multiscale gradients help remove both sharp and blurry shadow boundaries.

Perceptual loss L_P: To ensure the visual quality, we adopt the perceptual loss between the recovered face Î_b and I_b.

GAN loss: Motivated by Wang et al. (2018b), we adopt a multiscale PatchGAN Isola et al. (2017) at the scales 1, 1/2, 1/4 of the original image's resolution. Each discriminator consists of 5 convolutional layers and 4 pooling layers, and outputs a 1-channel map in the range of [0, 1], where 0 denotes synthetic and 1 real. We use the hinge loss in the GAN training:

L_D = −E_{i∼P}[ Σ_{n=1,2,3} min(0, −1 + D_n(I^i_b)) ] − E_{i∼P}[ Σ_{n=1,2,3} min(0, −1 − D_n(Î^i_b)) ],
L_G = −E_{i∼P}[ Σ_{n=1,2,3} D_n(Î^i_b) ], (5.12)

where D_1, D_2, and D_3 are discriminators at 3 scales. L_D is the loss to train the discriminators and L_G is the loss to guide the shadow removal model to recover more realistic shadow-free faces.
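A minimal NumPy sketch of the weighted L1 of Eqn. 5.10 and the hinge losses of Eqn. 5.12; taking R as the mean of the weighting map is our assumption, and the single-scale functions below omit the sum over the three discriminators:

```python
import numpy as np

def weighted_l1(pred, target, b_mask, b_edge):
    """Eqn. 5.10 sketch: L1 re-weighted toward the shadow region and its
    boundary. R is assumed here to be the mean of the weighting map."""
    weight = 1.0 + b_mask + b_edge
    r = weight.mean()
    return np.mean(np.abs(pred - target) * weight / r)

def hinge_d_loss(d_real, d_fake):
    """Eqn. 5.12 (one scale): hinge loss for training the discriminator."""
    return (-np.mean(np.minimum(0.0, -1.0 + d_real))
            - np.mean(np.minimum(0.0, -1.0 - d_fake)))

def hinge_g_loss(d_fake):
    """Eqn. 5.12 (one scale): the generator pushes D's score on fakes up."""
    return -np.mean(d_fake)
```

With a perfect discriminator (real scores at +1, fake scores at -1) the hinge discriminator loss is zero, which is the saturation behavior the hinge formulation is chosen for.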
Temporal consistency loss: We adopt a temporal consistency loss to encourage the recovered faces from different frames to be similar. This consistency loss is denoted as:

L_T = E_{i∼P}[ ‖ Î^{i,t1}_b − Î^{i,t2}_b ‖₁ ], (5.13)

where t1, t2 can be either two nearby frames of the same video, or a frame with its mirrored image.

Overall Loss The generator is supervised by an overall loss as:

L = λ1 L_gs + λ2 L_clr + λ3 L_∇ + λ4 L_P + λ5 L_G + λ6 L_T. (5.14)

The discriminators are supervised with the adversarial loss L_D to compete with the generator. We execute the generator step and the discriminator step in each mini-batch iteration.

5.4 Training and Evaluation Data

Training data To synthesize our training data based on Eqn. 5.1-5.3, we manually select 15,000 face images from the FFHQ dataset Karras et al. (2019) that do not contain any foreign shadows or strong self shadows. The raw binary shadow shape B comes from: 1) 100 silhouette shapes; 2) the Perlin noise function. After that, the raw shapes are randomly augmented with different scales, rotations, and boundary blurriness. The intensity map M_I is also generated by a random Perlin noise function at two octaves.

To simulate common shadow motion in face videos, we propose two approaches to synthesize the shadow motion: translation and flickering. In translation mode, the shadow moves on the face region with randomly selected speed, direction, and rotation. The shape of the shadow is fixed for each video but the scale, rotation, and boundary blurriness can be continuously shifting from frame to frame. In flickering mode, the location from frame to frame is randomly picked and the changes of scale, rotation, and boundary blurriness are not continuous.
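A toy version of the translation mode can be sketched as below; the real pipeline also continuously varies scale, rotation, and boundary blur, which this sketch omits, and `np.roll` wraps at the border purely for simplicity:

```python
import numpy as np

def translate_shadow_frames(shape_mask, n_frames, velocity=(2, 1)):
    """Translation-mode sketch: the shadow shape is fixed for a video and
    moves with a constant per-frame pixel velocity (dy, dx)."""
    return [np.roll(shape_mask, shift=(velocity[0] * t, velocity[1] * t),
                    axis=(0, 1))
            for t in range(n_frames)]
```

Flickering mode would instead draw an independent random location (and independent scale, rotation, and blur) for every frame.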
Evaluation data To our knowledge, there is no large video database of real-world human faces with foreign shadows. One existing database, UCB Zhang et al. (2020b), includes a very limited number of 100 face images. More importantly, this database contains only single images, so that consistent image reconstruction on videos cannot be evaluated. In response to the need for a large video database, we collect a database termed Shadow Face in the Wild (SFW) for the evaluation of real-world facial shadow removal. In total, SFW includes 280 videos from 20 subjects. Some examples are shown in Fig. 5.6. Most videos are captured at 1080p resolution by various smartphone cameras.

Figure 5.6 An illustration of the SFW database. The first row shows the shadow faces collected under highly dynamic environments (e.g., varying shadows and head poses due to walking and driving); the second row shows the pixel-level annotations of shadow segmentation. Zoom in for viewing the quality of our annotation.

For each subject, the videos are collected in five sessions: indoor, outdoor standing, outdoor walking, outdoor extreme, and driving. The indoor session collects face videos in an indoor environment, where the lighting is relatively soft with no strong specular lights. For outdoor collection, the standing session requires the subject to hold a standing position with no ambient light variations, and the walking and extreme sessions require the subject to be moving, creating a changing ambient light. For the first three sessions, subjects use common objects to create shadows, such as a hand, phone, paper, pen, etc. For the outdoor extreme session, we strive to create more complex shadow patterns and require the subjects to walk under trees to create leaf-shaped shadows. In the last session, the subjects record videos in a moving car, where the shadows may come from the blocking of the sun visor, rear-view mirror, A-pillar, and surrounding buildings.

For the evaluation purpose, we annotate the pixel-level shadow segmentation maps of key frames selected from the video set, and more annotations will be added in the future.
5.5 Experiment

5.5.1 Experimental setup

Metrics To evaluate on the UCB dataset, we can directly compare the de-shadowed face images from our model with the ground truth face images. We evaluate the performance based on the following metrics: peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM).

Figure 5.7 A qualitative comparison of shadow removal on testing images of the UCB database. From top to bottom, we show the shadow face and shadow removal results provided by Zhang et al. (2020b), the network with naive RGB shadow modeling, our single-frame network with grayscale shadow removal and colorization (GS+C), and our network with additional TSM and temporal loss.

To evaluate on the SFW dataset, we evaluate how well the shadow is detected, since the ground truth shadow segmentation is provided. The predicted shadow masks are from Eqn. 5.7. We compute the area under the curve (AUC) of the receiver operating characteristic (ROC) curve and the accuracy based on the predicted shadow masks and the ground truth masks. The accuracy is computed as (TP + TN) / (N_p + N_n), where TP, TN, N_p, and N_n are true positives, true negatives, the number of shadow pixels, and the number of non-shadow pixels, respectively. We binarize the shadow matte M into a shadow mask with a threshold of 0.1.

Implementation details Our shadow removal network is implemented in TensorFlow with an initial learning rate of 1e-4. We train the network for 50,000 iterations in total with a batch size of 32, and decrease the learning rate by a ratio of 10 every 25,000 iterations. We initialize the weights with the normal distribution of [0, 0.02]. {λ1, λ2, λ3, λ4, λ5, λ6, τ} are set to be {100, 100, 1, 1, 1, 1, 0.1}. We use Bulat & Tzimiropoulos (2017) to crop the face and provide the 68 facial landmarks.

Removal Model          | PSNR   | SSIM
Input Image            | 19.671 | 0.766
Guo et al. (2012)      | 15.939 | 0.593
Hu et al. (2018)       | 18.956 | 0.699
Cun et al. (2020)      | 19.386 | 0.722
Zhang et al. (2020b)   | 23.816 | 0.782
Zhang et al. (2020b)*  | 20.220 | 0.677
RGB                    | 21.464 | 0.725
GS+C (Ours)            | 23.364 | 0.784
Temporal GS+C (Ours)   | 23.793 | 0.805

Table 5.1 A quantitative comparison for shadow removal on the UCB dataset.
Zhang et al. (2020b)* is our implementation, trained using our synthesized data.

5.5.2 Shadow removal and segmentation

We compare the results on the UCB dataset. The baseline method is the one from Zhang et al. (2020b), which also includes the performance of several previous works Cun et al. (2020); Guo et al. (2012); Hu et al. (2018). However, as no pre-trained models, training data, or training scripts of these methods are available, it is hard to obtain a fair comparison between these methods and ours on UCB. To bridge the gap, we re-implement Zhang et al. (2020b) with our best efforts, and train it using our synthesized data. We believe it to be a faithful implementation, as the conventional perceptual loss and pixel-wise loss are mainly used. Table 5.1 reports the comparison results of PSNR and SSIM. Our single-frame grayscale shadow removal + colorization model (GS+C) outperforms the methods of Cun et al. (2020); Guo et al. (2012); Hu et al. (2018) and achieves comparable performance with Zhang et al. (2020b). With the temporal sharing module (TSM) and temporal consistency loss (L_T), our method can achieve a competitive PSNR and outperform the reported Zhang et al. (2020b) on SSIM. Notice that both the PSNR and SSIM from our implemented Zhang et al. (2020b) are lower than the reported numbers. We believe it is mainly due to the large domain gap between our training data and the training data used in Zhang et al. (2020b). A qualitative comparison is shown in Fig. 5.7.

Figure 5.8 Qualitative shadow removal evaluations on the SFW database. From top to bottom, we show the shadow face, shadow removal results from Le & Samaras (2019) and Zhang et al. (2020b), our single-frame model and our temporal model, the ground truth shadow segmentation (in bright purple), and the predicted shadow mask (before thresholding).
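The mask binarization of Eqn. 5.7 and the accuracy metric of Sec. 5.5.1 can be sketched as follows; the function names are ours, not from the released code:

```python
import numpy as np

def shadow_mask(recovered_gs, input_gs, tau=0.1):
    """Eqn. 5.7: mark pixels whose grayscale value changed by more than tau
    between the input face and the recovered face."""
    return (np.abs(recovered_gs - input_gs) > tau).astype(np.uint8)

def segmentation_accuracy(pred_mask, gt_mask):
    """Accuracy = (TP + TN) / (N_p + N_n) from Sec. 5.5.1; the denominator
    is simply the total number of pixels."""
    tp = np.sum((pred_mask == 1) & (gt_mask == 1))
    tn = np.sum((pred_mask == 0) & (gt_mask == 0))
    return (tp + tn) / gt_mask.size
```

The AUC is computed analogously, but from the continuous difference map before thresholding, sweeping tau over its range.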
Secondly, we evaluate the models on the SFW database, which is more challenging due to its highly dynamic environments. We conduct a quantitative comparison on the performance of shadow segmentation, which is a required module in many applications. Table 5.2 shows the comparison with recent methods Hu et al. (2019); Le & Samaras (2019); Zhang et al. (2020b). Our method outperforms all others in terms of AUC and accuracy. Fig. 5.8 shows the qualitative comparison, and our method produces a much better recovery of facial foreign shadows compared with the baselines. Furthermore, our model is able to remove strong self-shadows (the forehead area of the 9th column) and keep normal shadows (the left-cheek area of the last column). As our datasets are highly diverse, we find that the methods in Le & Samaras (2019); Zhang et al. (2020b) cannot generalize well and their performance degrades.

Segmentation Model    | AUC   | Accuracy
Le & Samaras (2019)   | 0.603 | 0.683
Hu et al. (2019)      | 0.540 | 0.604
Zhang et al. (2020b)  | 0.686 | 0.756
GS+C                  | 0.898 | 0.874
Temporal GS+C         | 0.904 | 0.882

Table 5.2 A quantitative comparison of shadow segmentation on the SFW database.

5.5.3 Ablation Studies

We conduct ablation studies to better understand each component. The baseline methods are our implemented version of Zhang et al. (2020b) and the direct RGB shadow modeling with our backbone network. For a fair comparison, we match the computational resources of the GS+C model (i.e., doubling the bottleneck depth and the decoder channels). As shown in Tab. 5.1, our implemented Zhang et al.
(2020b) shows a performance of 20.220 PSNR and 0.677 SSIM. By updating to a better backbone network with non-local blocks and face position coding, our baseline model with RGB shadow modeling achieves a better PSNR of 21.464 and SSIM of 0.725. Next, our single-frame GS+C model outperforms the previous two baseline models, thanks to the effectiveness of the novel shadow modeling. And with the temporal design of TSM and the corresponding loss, our model can further improve the PSNR and SSIM to 23.793 and 0.805, respectively. Qualitative comparisons are shown in Fig. 5.7. We can see that both the single-frame and temporal GS+C models show better visual quality than the RGB model. In addition, the temporal model further improves the color consistency and suppresses the artifacts on several subjects.

5.6 Conclusion

In this chapter, we introduce the problem of blind removal of facial foreign shadows. We propose an effective shadow modeling to help the model generalize better. We decompose the conventional RGB shadow modeling into grayscale shadow modeling and colorization. We also propose a temporal sharing module (TSM) that can be easily integrated into any encoders and decoders to impose temporal consistency. Our method can produce photo-realistic de-shadowed faces with high PSNR and SSIM. Our large-scale video database collected under highly dynamic environments is another major contribution that can benefit various face-related research and applications.

Chapter 6

Conclusions and Future Work

Face is one of the most popular biometric modalities due to its convenience of usage, e.g., access control, phone unlock. Despite the high recognition accuracy, face recognition systems are not able to distinguish between real human faces and fake ones. Thus, they are vulnerable to face spoof attacks, which deceive the systems into recognizing the attacker as another person. To safely use face recognition, face anti-spoofing techniques are required to detect spoof attacks before performing recognition.
In Chapter 2, we propose a CNN-RNN model that is learned to estimate the face depth with pixel-wise supervision, and to estimate rPPG signals with sequence-wise supervision. The estimated depth and rPPG are fused to distinguish live vs. spoof faces. Experiments show that our model achieves improvements on both intra- and cross-database detection performance.

In Chapter 3, we study the generalization problem of face anti-spoofing. We define the detection of unknown spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA) and extend the study of ZSFA from 1-2 types to 13 types. We propose a novel Deep Tree Network (DTN) to partition the spoof samples into semantic sub-groups in an unsupervised fashion. Experiments show that our proposed method achieves the state of the art on multiple testing protocols of ZSFA.

In Chapter 4, we study a new problem of interpreting a face anti-spoofing model's decision. We provide a comprehensive modeling of the spoof traces of various spoof attacks, and design a novel adversarial learning framework to disentangle the spoof traces from input faces as a hierarchical combination of patterns at multiple scales. With the disentangled spoof traces, we unveil the live counterpart of the original spoof face, and further synthesize realistic new spoof faces after a proper geometric correction. Our method demonstrates superior spoof detection performance on both seen and unseen spoof scenarios while providing a visually-convincing estimation of spoof traces.

In Chapter 5, we show that proper physical modeling can also benefit other face problems, and study a new problem of face shadow removal. We propose an effective shadow modeling to help the model generalize better. We decompose the conventional RGB shadow modeling into grayscale shadow modeling and colorization. In addition, we propose a temporal sharing module (TSM) that can be easily integrated into any encoders and decoders to impose temporal consistency.

6.1 Future Works

Uncertainty of face anti-spoofing We look into open-set face anti-spoofing, and we sometimes find that the model may not always be very confident about its decision. In practice, it is sometimes okay to reject the sample or provide additional manual inspection when the model is not very confident. So a future direction is to enable the model to provide a confidence score with its decision.
Retrainability of face anti-spoofing In practice, the face anti-spoofing model may need to be delivered to different users to handle different situations, where a fine-tuning process, or retraining process, is engaged. To ease the retraining process, several topics can be further investigated, such as incremental learning, model fusion, and early stopping policy.

Sensor variations Sensor variation is an important factor in a practical face anti-spoofing system, which has not been quantitatively studied. While sensor variation causes a large negative impact on the face anti-spoofing performance, face anti-spoofing models have to be re-trained every time when switching to a new sensor, which is hardly feasible and time-consuming. We intend to evaluate the cross-sensor performance and propose solutions if the performance is not ideal.

Improving synthesis on large pose and expression We intend to modify the 3D Warping Layer to better handle large pose variations. Specifically, we additionally provide visibility in the landmark preparation, so the warping layer can leverage the visibility to implement a fast and differentiable z-buffer rendering. This can additionally leverage other large-pose face databases, such as 300-VW, to synthesize large-pose spoof faces for training, which are hard to include in real face anti-spoofing databases.

APPENDIX

Contribution on Co-authored Publications

Chapter 2 of this dissertation is based on the research papers Liu et al. (2018c) and Jourabloo et al. (2018). Amin Jourabloo and I have equal contributions to the proposed methods and implementation of these research papers. Here, I mention the detailed contributions of each individual:

1. Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision [Liu et al. (2018c)]

Yaojie Liu:
- Data collection for the SiW dataset;
- Providing the ground truth for the pseudo-depth map for the training images;
- Designing the CNN part of the network, including the Depth CNN and the non-rigid registration layer;
- Implementing the training on different datasets and protocols, and finalizing training settings and tricks;
- Carrying on the daily management of the SiW dataset, including granting access to research groups worldwide, and answering questions about the database.
Amin Jourabloo:
- Data collection for the SiW dataset;
- Providing the ground truth rPPG signals for the training videos;
- Designing the RNN part of the network, including the LSTM, FFT layer, and corresponding loss function;
- Implementing evaluation metrics (such as EER, HTER, and ACER), performing testing on different datasets and protocols, and generating qualitative and quantitative results and analysis.

2. Face De-Spoofing: Anti-Spoofing via Noise Modeling [Jourabloo et al. (2018)]

Yaojie Liu:
- Performing a case study on the spoof noise pattern, and exploring three important properties of the spoof noise pattern;
- Implementing experiments and analyzing results on different datasets and protocols (CASIA and Replay-Attack datasets).

Amin Jourabloo:
- Designing the loss functions and the network architecture for the face de-spoofing, including the DS Net, DQ Net, and VQ Net;
- Implementing the training on all protocols in the Oulu database, and finalizing training settings and tricks;
- Analyzing the experiment results on the Oulu dataset and executing several ablation studies.

Related Publications

In this section, I list all related publications I completed during my PhD study at Michigan State University:

1. Liu, Xiaohong, Yaojie Liu, Jun Chen & Xiaoming Liu. 2021. PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. arXiv preprint arXiv:2103.10596.
2. Liu, Yaojie & Xiaoming Liu. Physics-Guided Spoof Trace Disentanglement for Generic Face Anti-Spoofing. arXiv preprint arXiv:2012.05185.
3. Liu, Yaojie, Joel Stehouwer & Xiaoming Liu. 2020. On Disentangling Spoof Traces for Generic Face Anti-Spoofing. In ECCV.
4. Stehouwer, Joel, Amin Jourabloo, Yaojie Liu & Xiaoming Liu. 2020. Noise Modeling, Synthesis and Classification for Generic Object Anti-Spoofing. In CVPR, IEEE.
5. Liu, Yaojie, Joel Stehouwer, Amin Jourabloo & Xiaoming Liu. 2019. Presentation Attack Detection for Face in Mobile Phones. In Biometrics, Springer.
6. Liu, Yaojie, Joel Stehouwer, Amin Jourabloo & Xiaoming Liu. 2019. Deep Tree Learning for Zero-shot Face Anti-Spoofing. In CVPR, IEEE.
7. Jourabloo, Amin, Yaojie Liu & Xiaoming Liu. 2018. Face De-Spoofing: Anti-Spoofing via Noise Modeling. In ECCV.
8. Liu, Yaojie, Amin Jourabloo & Xiaoming Liu. 2018. Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision. In CVPR, IEEE.
9.
Atoum, Yousef, Yaojie Liu, Amin Jourabloo & Xiaoming Liu. 2017. Face Anti-Spoofing Using Patch and Depth-based CNNs. In IJCB, IEEE.
10. Liu, Yaojie, Amin Jourabloo, William Ren & Xiaoming Liu. 2017. Dense Face Alignment. In ICCVW, IEEE.
11. Liu, Yaojie, Xinyu Huang, Liu Ren & Xiaoming Liu. 2021. Blind Removal of Facial Foreign Shadow. Submitted to ICCV.

BIBLIOGRAPHY

Abadi, Martín, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI.

Abdelhamed, Abdelrahman, Stephen Lin & Michael S Brown. 2018. A high-quality denoising dataset for smartphone cameras. In CVPR, IEEE.

Agarwal, Akshay, Richa Singh & Mayank Vatsa. 2016. Face anti-spoofing using Haralick features. In BTAS, IEEE.

Arashloo, Shervin Rahimzadeh, Josef Kittler & William Christmas. 2017. An anomaly detection approach to face spoofing detection: A new formulation and evaluation protocol. IEEE Access.

Arrieta, Alejandro Barredo, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins et al. 2020. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion.

Atoum, Yousef, Yaojie Liu, Amin Jourabloo & Xiaoming Liu. 2017. Face anti-spoofing using patch and depth-based CNNs. In IJCB, IEEE.

Bao, Wei, Hong Li, Nan Li & Wei Jiang. 2009. A liveness detection method for face recognition based on optical flow field. In IASP.

Bertalmio, Marcelo, Guillermo Sapiro, Vincent Caselles & Coloma Ballester. 2000. Image inpainting. In Proceedings of the 27th annual conference on computer graphics and interactive techniques.

Bharadwaj, Samarth, Tejas I Dhamecha, Mayank Vatsa & Richa Singh. 2013. Computationally efficient face spoofing detection with motion magnification. In CVPRW, IEEE.

Bharadwaj, Samarth, Tejas I Dhamecha, Mayank Vatsa & Richa Singh. 2014. Face anti-spoofing via motion magnification and multifeature videolet aggregation.

Blanz, Volker & Thomas Vetter. 2003. Face recognition based on fitting a 3D morphable model. PAMI.

Bobbia, Serge, Yannick Benezeth & Julien Dubois. 2016. Remote photoplethysmography based on implicit living skin tissue segmentation. In ICPR.
Boulkenafet, Zinelabidine. 2017. A competition on generalized software-based face presentation attack detection in mobile scenarios. In IJCB, IEEE.

Boulkenafet, Zinelabidine, Jukka Komulainen & Abdenour Hadid. 2015. Face anti-spoofing based on color texture analysis. In ICIP, IEEE.

Boulkenafet, Zinelabidine, Jukka Komulainen & Abdenour Hadid. 2016. Face spoofing detection using colour texture analysis. TIFS.

Boulkenafet, Zinelabidine, Jukka Komulainen & Abdenour Hadid. 2017a. Face anti-spoofing using speeded-up robust features and Fisher vector encoding. Signal Processing Letters.

Boulkenafet, Zinelabidine, Jukka Komulainen, Lei Li, Xiaoyi Feng & Abdenour Hadid. 2017b. OULU-NPU: A mobile face presentation attack database with real-world variations. In FG.

Bulat, Adrian & Georgios Tzimiropoulos. 2017. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In ICCV, IEEE.

Cao, Chen, Yanlin Weng, Shun Zhou, Yiying Tong & Kun Zhou. 2014. FaceWarehouse: A 3D facial expression database for visual computing. Trans. Vis. Comput. Graphics.

Cao, Qingxing, Xiaodan Liang, Bailing Li, Guanbin Li & Liang Lin. 2018. Visual question reasoning on general dependency tree. In CVPR, IEEE.

Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov & Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV, Springer.

Chang, Huiwen, Jingwan Lu, Fisher Yu & Adam Finkelstein. 2018. PairedCycleGAN: Asymmetric style transfer for applying and removing makeup. In CVPR, IEEE.

Chen, Chang, Zhiwei Xiong, Xiaoming Liu & Feng Wu. 2020. Camera trace erasing. In CVPR, IEEE.

Chen, Cunjian, Antitza Dantcheva & Arun Ross. 2013. Automatic facial makeup detection with application in face recognition. In ICB, IEEE.

Chen, Cunjian, Antitza Dantcheva & Arun Ross. 2014. Impact of facial cosmetics on automatic gender and age estimation algorithms. In International conference on computer vision theory and applications (VISAPP), IEEE.

Chen, Xinyun, Chang Liu & Dawn Song. 2018. Tree-to-tree neural networks for program translation. In NIPS.

Chetty, Girija. 2010. Biometric liveness checking using multimodal fuzzy fusion. In International conference on fuzzy systems, IEEE.
Chetty, Girija & Michael Wagner. 2006. Audio-visual multimodal fusion for biometric person authentication and liveness verification. In Proceedings of the 2005 NICTA-HCSNet Multimodal User Interaction Workshop - Volume 57.
Chingovska, Ivana, André Anjos & Sébastien Marcel. 2012. On the effectiveness of local binary patterns in face anti-spoofing. In BIOSIG, IEEE.
Chuang, Yung-Yu, Dan B Goldman, Brian Curless, David H Salesin & Richard Szeliski. 2003. Shadow matting and compositing. In SIGGRAPH, ACM.
Corneanu, Ciprian A, Meysam Madadi, Sergio Escalera & Aleix M Martinez. 2019. What does it mean to learn in deep networks? And, how does one detect adversarial attacks? In CVPR, IEEE.
Cun, Xiaodong, Chi-Man Pun & Cheng Shi. 2020. Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting GAN. In AAAI.
Dang, Hao, Feng Liu, Joel Stehouwer, Xiaoming Liu & Anil K Jain. 2020. On the detection of digital face manipulation. In CVPR, IEEE.
Ding, Bin, Chengjiang Long, Ling Zhang & Chunxia Xiao. 2019. ARGAN: Attentive recurrent generative adversarial network for shadow detection and removal. In ICCV, IEEE.
Donner, Craig & Henrik Wann Jensen. 2006. A spectral BSSRDF for shading human skin. In Proceedings of the 17th Eurographics Conference on Rendering Techniques, 409–417.
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Egger, Bernhard, Sandro Schönborn, Andreas Schneider, Adam Kortylewski, Andreas Morel-Forster, Clemens Blumer & Thomas Vetter. 2018. Occlusion-aware 3D morphable models and an illumination prior for face image analysis. IJCV.
Esser, Patrick, Ekaterina Sutter & Björn Ommer. 2018. A variational U-Net for conditional appearance and shape generation. In CVPR, IEEE.
Feng, Haocheng, Zhibin Hong, Haixiao Yue, Yang Chen, Keyao Wang, Junyu Han, Jingtuo Liu & Errui Ding. 2020. Learning generalized spoof cues for face anti-spoofing. arXiv preprint arXiv:2005.03922.
Feng, Litong, Lai-Man Po, Yuming Li, Xuyuan Xu, Fang Yuan, Terence Chun-Ho Cheung & Kwok-Wai Cheung. 2016. Integration of image quality and motion cues for face anti-spoofing: A neural network approach. J. Visual Communication and Image Representation.
Finlayson, Graham D, Steven D Hordley & Mark S Drew. 2002. Removing shadows from images. In ECCV, Springer.
de Freitas Pereira, Tiago, André Anjos, José Mario De Martino & Sébastien Marcel. 2012. LBP-TOP based countermeasure against face spoofing attacks. In ACCV, IEEE.
de Freitas Pereira, Tiago, André Anjos, José Mario De Martino & Sébastien Marcel. 2013. Can face anti-spoofing countermeasures work in a real world scenario? In ICB, IEEE.
Frome, Andrea, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato & Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.
Gu, Shuyang, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen & Lu Yuan. 2019. Mask-guided portrait editing with conditional GANs. In CVPR, IEEE.
Guo, Jianzhu, Xiangyu Zhu, Jinchuan Xiao, Zhen Lei, Genxun Wan & Stan Z Li. 2019. Improving face anti-spoofing by 3D virtual synthesis. arXiv preprint arXiv:1901.00488.
Guo, Ruiqi, Qieyun Dai & Derek Hoiem. 2012. Paired regions for shadow detection and removal. PAMI.
de Haan, Gerard & Vincent Jeanne. 2013. Robust pulse rate from chrominance-based rPPG. Trans. Biomedical Engineering.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren & Jian Sun. 2016. Deep residual learning for image recognition. In CVPR, IEEE.
Hu, X, CW Fu, L Zhu, J Qin & PA Heng. 2019. Direction-aware spatial context features for shadow detection and removal. PAMI.
Hu, Xiaowei, Lei Zhu, Chi-Wing Fu, Jing Qin & Pheng-Ann Heng. 2018. Direction-aware spatial context features for shadow detection. In CVPR, IEEE.
IARPA. 2016. IARPA research program Odin. https://www.iarpa.gov/index.php/research-programs/odin.
ISO/IEC-JTC-1/SC-37. 2016. Biometrics. Information technology - biometric presentation attack detection - Part 1: Framework. https://www.iso.org/obp/ui/iso.
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou & Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In CVPR, IEEE.
Jourabloo, Amin & Xiaoming Liu. 2017. Pose-invariant face alignment via CNN-based dense 3D model fitting. IJCV.
Jourabloo, Amin, Yaojie Liu & Xiaoming Liu. 2018. Face de-spoofing: Anti-spoofing via noise modeling. In ECCV.
Kalchbrenner, Nal, Edward Grefenstette & Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
Kaneko, Takuhiro, Kaoru Hiramatsu & Kunio Kashino. 2018. Generative adversarial image synthesis with decision tree latent controller. In CVPR, IEEE.
Karessli, Nour, Zeynep Akata, Bernt Schiele & Andreas Bulling. 2017. Gaze embeddings for zero-shot image classification. In CVPR, IEEE.
Karras, Tero, Samuli Laine & Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR, IEEE.
Kazemi, Vahid & Josephine Sullivan. 2014. One millisecond face alignment with an ensemble of regression trees. In CVPR, IEEE.
Kollreider, Klaus, Hartwig Fronthaler, Maycel Isaac Faraj & Josef Bigun. 2007. Real-time face detection and motion analysis with application in liveness assessment. TIFS.
Komulainen, Jukka, Abdenour Hadid & Matti Pietikäinen. 2013a. Context based face anti-spoofing. In BTAS, IEEE.
Komulainen, Jukka, Abdenour Hadid, Matti Pietikäinen, André Anjos & Sébastien Marcel. 2013b. Complementary countermeasures for detecting scenic face spoofing attacks. In ICB, IEEE.
Krizhevsky, Alex, Ilya Sutskever & Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
Lampert, Christoph H, Hannes Nickisch & Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, IEEE.
Lawrence, Steve, C Lee Giles, Ah Chung Tsoi & Andrew D Back. 1997. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks.
Le, Hieu & Dimitris Samaras. 2019. Shadow removal via shadow image decomposition. In ICCV, IEEE.
Lee, Cheng-Han, Ziwei Liu, Lingyun Wu & Ping Luo. 2020. MaskGAN: Towards diverse and interactive facial image manipulation. In CVPR, IEEE.
Li, Jiangwei, Yunhong Wang, Tieniu Tan & Anil K Jain. 2004. Live face detection based on the analysis of Fourier spectra. In BTHI, SPIE.
Li, Lei, Xiaoyi Feng, Zinelabidine Boulkenafet, Zhaoqiang Xia, Mingming Li & Abdenour Hadid. 2016a. An original face anti-spoofing approach using partial convolutional neural network. In IPTA.
Li, Xiaobai, Jukka Komulainen, Guoying Zhao, Pong-Chi Yuen & Matti Pietikäinen. 2016b. Generalized face anti-spoofing by detecting pulse from face videos. In ICPR, IEEE.
Li, Zhihang, Yibo Hu, Ran He & Zhenan Sun. 2020. Learning disentangling and fusing networks for face completion under structured occlusions. Pattern Recognition.
Liao, Jing, Yuan Yao, Lu Yuan, Gang Hua & Sing Bing Kang. 2017. Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088.
Liu, Feng, Dan Zeng, Qijun Zhao & Xiaoming Liu. 2018a. Disentangling features in 3D face shapes for joint face reconstruction and recognition. In CVPR, IEEE.
Liu, Si-Qi, Xiangyuan Lan & Pong C Yuen. 2018b. Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection. In ECCV, 558–573.
Liu, Siqi, Baoyao Yang, Pong C Yuen & Guoying Zhao. 2016a. A 3D mask face anti-spoofing database with real world variations. In CVPRW, IEEE.
Liu, Siqi, Pong C Yuen, Shengping Zhang & Guoying Zhao. 2016b. 3D mask face anti-spoofing with remote photoplethysmography. In ECCV.
Liu, Yaojie, Amin Jourabloo & Xiaoming Liu. 2018c. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In CVPR, IEEE.
Liu, Yaojie, Amin Jourabloo, William Ren & Xiaoming Liu. 2017. Dense face alignment. In ICCVW, IEEE.
Liu, Yaojie & Chang Shu. 2015. A comparison of image inpainting techniques. In Sixth International Conference on Graphic and Image Processing (ICGIP), International Society for Optics and Photonics.
Liu, Yaojie, Joel Stehouwer, Amin Jourabloo & Xiaoming Liu. 2019a. Deep tree learning for zero-shot face anti-spoofing. In CVPR, IEEE.
Liu, Yaojie, Joel Stehouwer, Amin Jourabloo & Xiaoming Liu. 2019b. Presentation attack detection for face in mobile phones. Biometrics.
Liu, Yaojie, Joel Stehouwer & Xiaoming Liu. 2020. On disentangling spoof traces for generic face anti-spoofing. In ECCV.
Maaten, Laurens van der & Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research.
Määttä, Jukka, Abdenour Hadid & Matti Pietikäinen. 2011. Face spoofing detection from single images using micro-texture analysis. In IJCB, IEEE.
Mao, Xudong, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang & Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In ICCV, IEEE.
Mirjalili, Vahid & Arun Ross. 2017. Soft biometric privacy: Retaining biometric utility of face images while perturbing gender. In ICB, IEEE.
Nagano, Koki, Huiwen Luo, Zejian Wang, Jaewoo Seo, Jun Xing, Liwen Hu, Lingyu Wei & Hao Li. 2019. Deep face normalization. TOG.
Nestmeyer, Thomas, Jean-François Lalonde, Iain Matthews & Andreas Lehrmann. 2020. Learning physics-guided face relighting under directional light. In CVPR, IEEE.
Nowara, Ewa Magdalena, Ashutosh Sabharwal & Ashok Veeraraghavan. 2017. PPGSecure: Biometric presentation attack detection using photopletysmograms. In FG.
Pan, Gang, Lin Sun, Zhaohui Wu & Shihong Lao. 2007. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, IEEE.
Patel, Keyurkumar, Hu Han & Anil K Jain. 2016a. Cross-database face anti-spoofing with robust feature representation. In CCBR.
Patel, Keyurkumar, Hu Han & Anil K Jain. 2016b. Secure face unlock: Spoof detection on smartphones. TIFS.
Paysan, Pascal, Reinhard Knothe, Brian Amberg, Sami Romdhani & Thomas Vetter. 2009. A 3D face model for pose and illumination invariant face recognition. In AVSS.
Peixoto, Bruno, Carolina Michelassi & Anderson Rocha. 2011. Face liveness detection under bad illumination conditions. In ICIP, IEEE.
Pinto, Allan, Helio Pedrini, William Robson Schwartz & Anderson Rocha. 2015. Face spoofing detection through visual codebooks of spectral temporal cubes. TIP.
Po, Lai-Man, Litong Feng, Yuming Li, Xuyuan Xu, Terence Chun-Ho Cheung & Kwok-Wai Cheung. 2017. Block-based adaptive ROI for remote photoplethysmography. J. Multimedia Tools and Applications.
Qin, Yunxiao, Chenxu Zhao, Xiangyu Zhu, Zezheng Wang, Zitong Yu, Tianyu Fu, Feng Zhou, Jingping Shi & Zhen Lei. 2019. Learning meta model for zero- and few-shot face anti-spoofing. arXiv preprint arXiv:1904.12490.
Qu, Liangqiong, Jiandong Tian, Shengfeng He, Yandong Tang & Rynson WH Lau. 2017. DeshadowNet: A multi-context embedding deep network for shadow removal. In CVPR, IEEE.
Ronneberger, Olaf, Philipp Fischer & Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer.
Sengupta, Soumyadip, Angjoo Kanazawa, Carlos D Castillo & David W Jacobs. 2018. SfSNet: Learning shape, reflectance and illuminance of faces 'in the wild'. In CVPR, IEEE.
Shao, Rui, Xiangyuan Lan, Jiawei Li & Pong C Yuen. 2019a. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In CVPR, IEEE.
Shao, Rui, Xiangyuan Lan & Pong C Yuen. 2017. Deep convolutional dynamic texture learning with adaptive channel-discriminability for 3D mask face anti-spoofing. In IJCB, IEEE.
Shao, Rui, Xiangyuan Lan & Pong C Yuen. 2019b. Regularized fine-grained meta face anti-spoofing. arXiv preprint arXiv:1911.10771.
Shashua, Amnon & Tammy Riklin-Raviv. 2001. The quotient image: Class-based re-rendering and recognition with varying illuminations. PAMI.
Shih, YiChang, Sylvain Paris, Connelly Barnes, William T Freeman & Frédo Durand. 2014. Style transfer for headshot portraits. TOG.
Shor, Yael & Dani Lischinski. 2008. The shadow meets the mask: Pyramid-based shadow removal. In Computer Graphics Forum, Wiley Online Library.
Shu, Zhixin, Sunil Hadap, Eli Shechtman, Kalyan Sunkavalli, Sylvain Paris & Dimitris Samaras. 2017a. Portrait lighting transfer using a mass transport approach. TOG.
Shu, Zhixin, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman & Dimitris Samaras. 2017b. Neural face editing with intrinsic image disentangling. In CVPR, IEEE.
Socher, Richard, Milind Ganjoo, Christopher D Manning & Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In NIPS.
Stehouwer, Joel, Amin Jourabloo, Yaojie Liu & Xiaoming Liu. 2020. Noise modeling, synthesis and classification for generic object anti-spoofing. In CVPR, IEEE.
Stoschek, Arne. 2000. Image-based re-rendering of faces for continuous pose and illumination directions. In CVPR, IEEE.
Sun, Lin, Gang Pan, Zhaohui Wu & Shihong Lao. 2007. Blinking-based live face detection using conditional random fields. J. Advances in Biometrics.
Sun, Tiancheng, Jonathan T Barron, Yun-Ta Tsai, Zexiang Xu, Xueming Yu, Graham Fyffe, Christoph Rhemann, Jay Busch, Paul E Debevec & Ravi Ramamoorthi. 2019. Single image portrait relighting. TOG.
Tan, Xiaoyang, Yi Li, Jun Liu & Lin Jiang. 2010. Face liveness detection from a single image with sparse low rank bilinear discriminative model. In ECCV.
Tewari, Ayush, Michael Zollhofer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez & Christian Theobalt. 2017. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCVW, IEEE.
Thai, Thanh Hai, Remi Cogranne & Florent Retraint. 2013. Camera model identification based on the heteroscedastic noise model. TIP.
Thai, Thanh Hai, Florent Retraint & Rémi Cogranne. 2016. Camera model identification based on the generalized noise model in natural images. Digital Signal Processing.
Tran, Luan & Xiaoming Liu. 2018. Nonlinear 3D face morphable model. In CVPR, IEEE.
Tran, Luan & Xiaoming Liu. 2021. On learning 3D face morphable model from in-the-wild images. PAMI.
Tran, Luan, Xiaoming Liu, Jiayu Zhou & Rong Jin. 2017a. Missing modalities imputation via cascaded residual autoencoder. In CVPR, IEEE.
Tran, Luan, Xi Yin & Xiaoming Liu. 2017b. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, IEEE.
Tulyakov, Sergey, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F Cohn & Nicu Sebe. 2016. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In CVPR, IEEE.
Turek, Matt. 2016. Explainable Artificial Intelligence (XAI). https://www.darpa.mil/program/explainable-artificial-intelligence.
Valle, Roberto, Jose M Buenaposada, Antonio Valdes & Luis Baumela. 2018. A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. In ECCV.
Wang, Jifeng, Xiang Li & Jian Yang. 2018a. Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In CVPR, IEEE.
Wang, Ting-Chun, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz & Bryan Catanzaro. 2018b. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, IEEE.
Wang, Xiaolong, Ross Girshick, Abhinav Gupta & Kaiming He. 2018c. Non-local neural networks. In CVPR, IEEE.
Wang, Yang, Lei Zhang, Zicheng Liu, Gang Hua, Zhen Wen, Zhengyou Zhang & Dimitris Samaras. 2008. Face relighting from a single image under arbitrary unknown lighting conditions. PAMI.
Wang, Yu, Luca Bondi, Paolo Bestagini, Stefano Tubaro, Edward J Delp et al. 2017. A counter-forensic method for CNN-based camera model identification. In CVPRW, IEEE.
Wen, Di, Hu Han & A. K. Jain. 2015. Face spoof detection with image distortion analysis. TIFS.
Wen, Zhen, Zicheng Liu & Thomas S Huang. 2003. Face relighting with radiance environment maps. In CVPR, IEEE.
Wu, Bing-Fei, Yun-Wei Chu, Po-Wei Huang, Meng-Liang Chung & Tzu-Min Lin. 2016. A motion robust remote-PPG approach to driver's health state monitoring. In ACCV, IEEE.
Wu, Qi, Wende Zhang & BVK Vijaya Kumar. 2012. Strong shadow removal via patch-based shadow edge detection. In ICRA, IEEE.
Wu, Shangzhe, Christian Rupprecht & Andrea Vedaldi. 2020. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In CVPR, IEEE.
Wu, Tai-Pang & Chi-Keung Tang. 2005. A Bayesian approach for shadow extraction from a single image. In ICCV, IEEE.
Wu, Yue, Wael AbdAlmageed & Premkumar Natarajan. 2019. ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In CVPR, IEEE.
Wu, Yuxin & Kaiming He. 2018. Group normalization. In ECCV.
Xiong, Chao, Xiaowei Zhao, Danhang Tang, Karlekar Jayashree, Shuicheng Yan & Tae-Kyun Kim. 2015. Conditional convolutional neural network for modality-aware face recognition. In ICCV, IEEE.
Xiong, Fei & Wael AbdAlmageed. 2018. Unknown presentation attack detection with face RGB images. In BTAS, IEEE.
Xu, Zhenqi, Shan Li & Weihong Deng. 2015. Learning temporal features using LSTM-CNN architecture for face anti-spoofing. In ACPR, IEEE.
Yang, Jianwei, Zhen Lei & Stan Z Li. 2014. Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601.
Yang, Jianwei, Zhen Lei, Shengcai Liao & Stan Z Li. 2013. Face liveness detection with component dependent descriptor. In ICB, IEEE.
Yang, Xiao, Wenhan Luo, Linchao Bao, Yuan Gao, Dihong Gong, Shibao Zheng, Zhifeng Li & Wei Liu. 2019a. Face anti-spoofing: Model matters, so does data. In CVPR, IEEE.
Yang, Yang, Xiaojie Guo, Jiayi Ma, Lin Ma & Haibin Ling. 2019b. Generative landmark guided face inpainting. arXiv preprint arXiv:1911.11394.
Yu, Zitong, Xiaobai Li, Xuesong Niu, Jingang Shi & Guoying Zhao. 2020a. Face anti-spoofing with human material perception. In ECCV.
Yu, Zitong, Chenxu Zhao, Zezheng Wang, Yunxiao Qin, Zhuo Su, Xiaobai Li, Feng Zhou & Guoying Zhao. 2020b. Searching central difference convolutional networks for face anti-spoofing. In CVPR, IEEE.
Zhang, Ke-Yue, Taiping Yao, Jian Zhang, Ying Tai, Shouhong Ding, Jilin Li, Feiyue Huang, Haichuan Song & Lizhuang Ma. 2020a. Face anti-spoofing via disentangled representation learning. In ECCV.
Zhang, Li, Tao Xiang & Shaogang Gong. 2017a. Learning a deep embedding model for zero-shot learning. In CVPR, IEEE.
Zhang, Shu, Ran He, Zhenan Sun & Tieniu Tan. 2017b. DeMeshNet: Blind face inpainting for deep mesh face verification. TIFS.
Zhang, Xuaner, Jonathan T. Barron, Yun-Ta Tsai, Rohit Pandey, Xiuming Zhang, Ren Ng & David E. Jacobs. 2020b. Portrait shadow manipulation. TOG.
Zhang, Zhiwei, Junjie Yan, Sifei Liu, Zhen Lei, Dong Yi & Stan Z Li. 2012. A face antispoofing database with diverse attacks. In ICB.
Zhang, Ziyuan, Luan Tran, Xi Yin, Yousef Atoum, Jian Wan, Nanxin Wang & Xiaoming Liu. 2019. Gait recognition via disentangled representation learning. In CVPR, IEEE.
Zhao, Chenxu, Yunxiao Qin, Zezheng Wang, Tianyu Fu & Hailin Shi. 2019. Meta anti-spoofing: Learning to learn in face anti-spoofing. arXiv preprint arXiv:1904.12490.
Zhou, Hao, Sunil Hadap, Kalyan Sunkavalli & David W Jacobs. 2019. Deep single-image portrait relighting. In ICCV, IEEE.
Zhu, Xiangyu, Zhen Lei, Xiaoming Liu, Hailin Shi & Stan Z Li. 2016. Face alignment across large poses: A 3D solution. In CVPR, IEEE.
Zhu, Xiangyu, Xiaoming Liu, Zhen Lei & Stan Z Li. 2017. Face alignment in full pose range: A 3D total solution. PAMI.