DESIGNING CONVOLUTIONAL NEURAL NETWORKS FOR FACE ALIGNMENT AND ANTI-SPOOFING

By

Amin Jourabloo

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computer Science, Doctor of Philosophy

2019

ABSTRACT

DESIGNING CONVOLUTIONAL NEURAL NETWORKS FOR FACE ALIGNMENT AND ANTI-SPOOFING

By

Amin Jourabloo

Face alignment is the process of detecting a set of fiducial points on a face image, such as mouth corners, nose tip, etc. Face alignment is a key module in the pipeline of most facial analysis tasks, normally after face detection and before subsequent feature extraction and classification. As a result, improving the face alignment accuracy is helpful for numerous facial analysis tasks. Recently, face alignment works are popular in top vision venues and receive a lot of attention. In spite of the fruitful prior work and ongoing progress of face alignment, pose-invariant face alignment is still challenging. To address the inherent challenges associated with this problem, we propose pose-invariant face alignment by fitting a dense 3DMM, and integrating estimation of 3D shape and 2D facial landmarks from a single face image in a single CNN. We introduce a new layer, called the visualization layer, which is differentiable and allows backpropagation of an error from a later block to an earlier one.

Another application of facial analysis is face anti-spoofing, which has recently received a lot of attention. While face recognition systems serve as a verification portal for various devices (i.e., phone unlock, access control, and transportation security), attackers present face spoofs (i.e., presentation attacks, PA) to the system and attempt to be authenticated as the genuine user. We present our proposed deep models for face anti-spoofing that use the supervision from both the spatial and temporal auxiliary information, for the purpose of robustly detecting face PA from a face video.

This thesis is dedicated to my family,
my parents: Hassan and Kobra
my brother: Ahmad
my sister's family: Zahra, Hamed and Samyar

ACKNOWLEDGMENTS

This dissertation would not have been made possible without the help of many people. I am very honored to have Dr. Xiaoming Liu as my advisor. His expectation and encouragement have made me achieve more than I could ever have imagined. The time we spent to debug codes, brainstorm, and polish papers has refined my skills in critical thinking, presentation and writing. By setting himself as an example, he has taught me what a good researcher should be like. It is my great pleasure to have the opportunity to work with Dr. Anil K. Jain and Dr. Arun Ross in the lab. As a world-leading researcher, Dr. Jain has inspired many younger generations, including me, to pursue a Ph.D. Dr. Ross's patience and insightful comments at all presentations have shown me that every researcher desires to be heard.

I am grateful for my labmates: Joseph Roth, Morteza Safdarnejad, Muhammad Jamal Afridi, Yousef Atoum, Xi Yin, Luan Tran, Garrick Brazil, Yaojie Liu, Bangjie Yin, Joel Stehouwer, Adam Terwilliger, Hieu Nguyen, Shengjie Zhu, and Masa Hu. The valuable comments in paper review, the willingness to help, the encouragement when I am in a bad mood, and the entertainment together have made it a very pleasant journey.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction on Face Alignment and Face Anti-Spoofing
  1.1 Introduction
  1.2 Prior work on face alignment
  1.3 Prior work on face anti-spoofing
  1.4 Overview of the thesis
    1.4.1 Contributions of the thesis

Chapter 2  Pose-Invariant 3D Face Alignment
  2.1 Introduction
  2.2 Pose-Invariant 3D Face Alignment
    2.2.1 3D Face Modeling
    2.2.2 Cascaded Coupled-Regressor
    2.2.3 3D Surface-Enabled Visibility
  2.3 Experimental Results
  2.4 Summary

Chapter 3  Pose-Invariant Face Alignment via CNN-based Dense 3D Model Fitting
  3.1 Introduction
  3.2 Unconstrained 3D Face Alignment
    3.2.1 3D Morphable Model
    3.2.2 Data Augmentation
      3.2.2.1 Optimization
    3.2.3 Cascaded CNN Coupled-Regressor
    3.2.4 Conventional CNN (C-CNN)
    3.2.5 Mirror CNN (M-CNN)
      3.2.5.1 Mirror Loss
      3.2.5.2 Mirror CNN Architecture
    3.2.6 Visibility and 2D Appearance Features
    3.2.7 Testing
  3.3 Experimental Results
    3.3.1 Experimental Setup
    3.3.2 Comparison Experiments
  3.4 Summary

Chapter 4  Pose-Invariant Face Alignment with a Single CNN
  4.1 Introduction
  4.2 3D Face Alignment with Visualization Layer
    4.2.1 3D and 2D Face Shapes
    4.2.2 Proposed CNN Architecture
    4.2.3 Visualization Layer
  4.3 Experimental Results
    4.3.1 Quantitative Evaluations on AFLW and AFW
    4.3.2 Evaluation on 300W dataset
    4.3.3 Analysis of the Visualization Layer
    4.3.4 Time complexity
  4.4 Summary

Chapter 5  Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision
  5.1 Introduction
  5.2 Face Anti-Spoofing with Deep Network
    5.2.1 Depth Map Supervision
    5.2.2 rPPG Supervision
    5.2.3 Network Architecture
      5.2.3.1 CNN Network
      5.2.3.2 RNN Network
      5.2.3.3 Implementation Details
    5.2.4 Non-rigid Registration Layer
  5.3 Collection of Face Anti-Spoofing Database
  5.4 Experimental Results
    5.4.1 Experimental Setup
    5.4.2 Experimental Comparison
      5.4.2.1 Ablation Study
      5.4.2.2 Intra Testing
      5.4.2.3 Cross Testing
      5.4.2.4 Visualization and Analysis
  5.5 Summary

Chapter 6  Face De-Spoofing via Noise Modeling
  6.1 Introduction
  6.2 Face De-Spoofing
    6.2.1 A Case Study of Spoof Noise Pattern
    6.2.2 De-Spoof Network
      6.2.2.1 Network Overview
    6.2.3 DQ Net and VQ Net
      6.2.3.1 Discriminative Quality Net
      6.2.3.2 Visual Quality Net
    6.2.4 Loss functions
      6.2.4.1 Magnitude Loss
      6.2.4.2 Repetitive Loss
  6.3 Experimental Results
    6.3.1 Experimental Setup
    6.3.2 Ablation Study
    6.3.3 Experimental Comparison
      6.3.3.1 Intra Testing
      6.3.3.2 Cross Testing
    6.3.4 Qualitative Experiments
      6.3.4.1 Spoof medium classification
      6.3.4.2 Successful and failure cases
  6.4 Summary

Chapter 7  Conclusions and Future Work
  7.1 Limitations
  7.2 Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: The comparison of face alignment algorithms in pose handling (estimation errors may have different definitions).
Table 2.2: The NME (%) of three methods on AFLW.
Table 2.3: The comparison of four methods on AFW.
Table 2.4: Efficiency of four methods in FPS.
Table 3.1: NME (%) of the proposed method with different features with the C-CNN architecture and the base training set.
Table 3.2: The NME (%) of three methods on AFLW with the base training set.
Table 3.3: The NME (%) of three methods on AFLW with the extended training set and Caffe toolbox.
Table 3.4: The MAPE of six methods on AFW.
Table 3.5: The six-stage NMEs of implementing the C-CNN and M-CNN architectures with different training datasets and CNN toolboxes. The initial error is 25.8%.
Table 4.1: The number and size of convolutional filters in each visualization block. For all blocks, the two fully connected layers have the same lengths of 800 and 236.
Table 4.2: NME (%) of four methods on the AFLW dataset.
Table 4.3: NME (%) of the proposed method at each visualization block on the AFLW dataset. The initial NME is 25.8%.
Table 4.4: MAPE of five methods on the AFW dataset.
Table 4.5: The NME of different methods on the 300W dataset.
Table 4.6: The NME (%) of three architectures with different inputs (I: Input image, V: Visualization, F: Feature maps).
Table 4.7: NME (%) when different masks are used.
Table 4.8: NME (%) when using different numbers of visualization blocks (N_v) and convolutional layers (N_c).
Table 5.1: The comparison of our collected dataset with available datasets for face anti-spoofing.
Table 5.2: TDR at different FDRs, cross testing on Oulu Protocol 1.
Table 5.3: ACER of our method at different N_f, on Oulu Protocol 2.
Table 5.4: The intra-testing results on four protocols of Oulu.
Table 5.5: The intra-testing results on three protocols of SiW.
Table 5.6: Cross testing on CASIA-MFSD vs. Replay-Attack.
Table 6.1: The network structure of DS Net, DQ Net and VQ Net. Each convolutional layer is followed by an exponential linear unit (ELU) and a batch normalization layer. The input image size for DS Net is 256×256×6. All the convolutional filters are 3×3. 0\1 Map Net is the bottom-left part, i.e., conv1-10, conv1-11, and conv1-12.
Table 6.2: The accuracy of different outputs of the proposed architecture and their fusions.
Table 6.3: ACER of the proposed method with different image resolutions and blurriness. To create blurry images, we apply Gaussian filters with different kernel sizes to the input images.
Table 6.4: The intra testing results on 4 protocols of Oulu-NPU.
Table 6.5: The HTER of different methods for the cross testing between the CASIA-MFSD and the Replay-Attack databases. We mark the top-2 performances in bold.
Table 6.6: The confusion matrices of spoof medium classification based on spoof noise pattern.

LIST OF FIGURES

Figure 2.1: Given a face image with an arbitrary pose, our proposed algorithm automatically estimates the 2D locations and visibilities of facial landmarks, as well as 3D landmarks. The displayed 3D landmarks are estimated for the image in the center. Green/red points indicate visible/invisible landmarks.
Figure 2.2: Overall architecture of our proposed PIFA method, with three main modules (3D modeling, cascaded coupled-regressor learning, and 3D surface-enabled visibility estimation). Green/red arrows indicate surface normals pointing toward/away from the camera.
Figure 2.3: The training procedure of PIFA.
Figure 2.4: The NME of five pose groups for two methods.
Figure 2.5: The NME of each landmark for PIFA.
Figure 2.6: 2D and 3D alignment results of the BP4D-S dataset.
Figure 2.7: Testing results of AFLW (top) and AFW (bottom). As shown in the top row, we initialize face alignment by placing a 2D mean shape in the given bounding box of each image. Note the disparity between the initial landmarks and the estimated ones, as well as the diversity in pose, illumination and resolution among the images. Green/red points indicate visible/invisible estimated landmarks.
Figure 3.1: The proposed method estimates landmarks for large-pose faces by fitting a dense 3D shape. From left to right: initial landmarks, fitted 3D dense shape, estimated landmarks with visibility. The green/red/yellow dots in the right column show the visible/invisible/cheek landmarks, respectively.
Figure 3.2: The overall process of the proposed method.
Figure 3.3: The landmark marching process for updating vector d. (a-b) show the paths of cheek landmarks on the mean shape; (c) is the estimated face shape; (d) is the estimated face shape by ignoring the roll rotation; and (e) shows the locations of landmarks on the cheek.
Figure 3.4: Landmark marching g(S, m).
Figure 3.5: Architecture of C-CNN (the same CNN architecture is used for all six stages). Color code used: purple = extracted image feature, orange = Conv, brown = pooling + batch normalization, blue = fully connected layer, red = ReLU. The size and the number of filters for each layer are shown on the top and the bottom respectively.
Figure 3.6: Architecture of the M-CNN (the same CNN architecture is used for all six stages). Color code used: purple = extracted image feature, orange = Conv, brown = pooling + batch normalization, green = locally connected layer, blue = fully connected layer, red = batch normalization + ReLU + dropout. The size and the number of filters of each layer are shown on the top and the bottom of the top branch respectively.
Figure 3.7: The 3D surface normal computed as the average of normals around a 3D landmark (black arrow). Notice the relatively noisy surface normal of the "left eye corner" landmark (blue arrow).
Figure 3.8: Feature extraction process, (a-e) PAWF for the landmark on the right side of the right eye, (f-j) D3PF for the landmark on the right side of the lip.
Figure 3.9: Examples of extracting PAWF. When one of the four neighborhood points (red point in the bottom-right) is invisible, it connects to the 2D landmark (green point), extends the same distance further, and generates a new neighborhood point. This helps to include the background context around the nose.
Figure 3.10: Example of extracting D3PF.
Figure 3.11: (a) AFLW original (yellow) and added landmarks (green), (b) Comparison of mean NME of each landmark for RCPR (blue) and the proposed method (green). The radius of circles is determined by the mean NME multiplied with the face bounding box size.
Figure 3.12: Errors on the AFLW testing set after each stage of CNN for different feature extraction methods with the C-CNN architecture and the base training set. The initial error is 25.8%.
Figure 3.13: Comparison of NME for each pose with the C-CNN architecture and the base training set.
Figure 3.14: The comparison of CED for different methods with the C-CNN architecture and the base training set.
Figure 3.15: Result of the proposed method after the first stage CNN. This image shows that the first stage CNN can model the distribution of face poses. The right-view faces are at the top, the frontal-view faces are at the middle, and the left-view faces are at the bottom.
Figure 3.16: The distribution of visibility errors for each landmark. For six landmarks on the horizontal center of the face, their visibility errors are zeros since they are always visible.
Figure 3.17: The results of the proposed method on AFLW and AFW. The green/red/yellow dots show the visible/invisible/cheek landmarks, respectively. First row: initial landmarks for AFLW, Second: estimated 3D dense shapes, Third: estimated landmarks, Fourth and Fifth: estimated landmarks for AFLW, Sixth: estimated landmarks for AFW. Notice that despite the discrepancy between the diverse face poses and constant front-view landmark initialization (top row), our model can adaptively estimate the pose, fit a dense model and produce the 2D landmarks as a byproduct.
Figure 3.18: The result of the proposed method across stages, with the extracted features (1st and 3rd rows) and alignment results (2nd and 4th rows). Note the changes of the landmark position and visibility (the blue arrow) over stages.
Figure 4.1: For the purpose of learning an end-to-end face alignment model, our novel visualization layer reconstructs the 3D face shape (a) from the estimated parameters inside the CNN and synthesizes a 2D image (b) via the surface normal vectors of visible vertexes.
Figure 4.2: The proposed CNN architecture. We use green, orange, and purple to represent the visualization layer, convolutional layer, and fully connected layer, respectively. Please refer to Figure 4.3 for the details of the visualization block.
Figure 4.3: A visualization block consists of a visualization layer, two convolutional layers and two fully connected layers.
Figure 4.4: The frontal and side views of the mask a that has positive values in the middle and negative values in the contour area.
Figure 4.5: An example with four vertexes projected to the same pixel. Two of them have negative values in the z component of their normals (red arrows). Between the other two with positive values, the one with the smaller depth (closer to the image plane) is selected.
Figure 4.6: Architectures of three CNNs with different inputs.
Figure 4.7: The average of filter weights for input image, visualization and feature maps in the three architectures of Figure 4.6. The y-axis and x-axis show the average and the block index, respectively.
Figure 4.8: Mask 2, a different designed mask with five positive areas on the eyes, top of the nose and sides of the lip.
Figure 4.9: Results of alignment on the AFLW and AFW datasets; green landmarks show the estimated locations of visible landmarks and red landmarks show estimated locations of invisible landmarks. First row: provided bounding box by AFLW with initial locations of landmarks, Second: estimated 3D dense shapes, Third: estimated landmarks, Fourth to sixth: estimated landmarks for AFLW, Seventh: estimated landmarks for AFW.
Figure 4.10: Three examples of outputs of the visualization layer at each visualization block. The first row shows that the proposed method recovers the expression of the face gracefully, the third row shows the visualizations of a face with a more challenging pose.
Figure 5.1: Conventional CNN-based face anti-spoof approaches utilize the binary supervision, which may lead to overfitting given the enormous solution space of CNN. This work designs a novel network architecture to leverage two auxiliary information as supervision: the depth map and rPPG signal, with the goals of improved generalization and explainable decisions during inference.
Figure 5.2: The overview of the proposed method.
Figure 5.3: The proposed CNN-RNN architecture. The number of filters is shown on top of each layer, the size of all filters is 3×3 with stride 1 for convolutional and 2 for pooling layers. Color code used: orange = convolution, green = pooling, purple = response map.
Figure 5.4: Example ground truth depth map and rPPG signals.
Figure 5.5: The non-rigid registration layer.
Figure 5.6: The statistics of the subjects in the SiW database. Left side: The histogram shows the distribution of the face sizes.
Figure 5.7: Examples of the live and spoof attack videos in the SiW database. The first row shows a live subject with different PIE. The second row shows different types of the spoof attacks.
Figure 5.8: (a) 8 successful anti-spoofing examples and their estimated depth maps and rPPG signals. (b) 4 failure examples: the first two are live and the other two are spoof. Note our ability to estimate discriminative depth maps and rPPG signals.
Figure 5.9: Mean/Std of frontalized feature maps for live and spoof.
Figure 5.10: The MSE of estimating depth maps and rPPG signals.
Figure 6.1: The illustration of face spoofing and anti-spoofing processes. The de-spoofing process aims to estimate a spoof noise from a spoof face and reconstruct the live face. The estimated spoof noise should be discriminative for face anti-spoofing.
Figure 6.2: The illustration of the spoof noise pattern. Left: live face and its local regions. Right: Two registered spoof faces from print attack and replay attack. For each sample, we show the local region of the face, intensity difference to the live image, magnitude of 2D FFT, and the local peaks in the frequency domain that indicate the spoof noise pattern. Best viewed electronically.
Figure 6.3: The proposed network architecture.
Figure 6.4: The 2D visualization of the estimated spoof noise for test videos on Oulu-NPU Protocol 1. Left: the estimated noise, Right: the high-frequency band of the estimated noise. Color code used: black = live, green = printer 1, blue = printer 2, magenta = display 1, red = display 2.
Figure 6.5: The visualization of input images, estimated spoof noises and estimated live images for test videos of Protocol 1 of the Oulu-NPU database. The first four columns in the first row are paper attacks and the second four are the replay attacks. For a better visualization, we magnify the noise by 5 times and add the value with 128, to show both positive and negative noise.
Figure 6.6: The failure cases for converting the spoof images to the live ones.
Figure 7.1: Left: A representation of the estimated point cloud in iPhone X. Right: The hardware technology in Huawei P11 for capturing the point cloud.

Chapter 1

Introduction on Face Alignment and Face Anti-Spoofing

1.1 Introduction

Face alignment is the process of detecting a set of fiducial points on a face image, such as mouth corners, nose tip, etc. Face alignment is a key module in the pipeline of most facial analysis tasks, normally after face detection and before subsequent feature extraction and classification. As a result, improving the face alignment accuracy is helpful for numerous facial analysis tasks, e.g., face recognition [114], face anti-spoofing [55] and 3D face reconstruction [94].

Due to the importance of face alignment, it has been well studied during the past decades [115], with the well-known Active Shape Model [30] and Active Appearance Model (AAM) [70, 78]. Recently, face alignment works are popular in top vision venues and receive a lot of attention. Despite the continuous improvement on the alignment accuracy, face alignment is still a very challenging problem, due to the non-frontal face pose, low image quality, occlusion, etc. Among all the challenges, we identify pose-invariant face alignment as the one deserving substantial research efforts, for a number of reasons. First, face detection has substantially advanced its capability in detecting faces in all poses, including profile faces [138], which calls for the subsequent face alignment to handle faces with arbitrary poses. Second, many facial analysis tasks would benefit from the robust alignment of faces at all poses, such as expression recognition and 3D face reconstruction [94]. Third, there are very few existing approaches that can align a face with any view angle, or have conducted extensive evaluations on face images across ±90° yaw angles [135, 151], which is a clear contrast with the vast face alignment literature [115].

We present our proposed approaches for pose-invariant face alignment in Chapters 2 to 4. The core idea of our proposed methods is that instead of estimating 2D landmarks directly, we estimate the 3D shape of the face; by projecting the 3D shape to 2D, we can obtain the 2D locations of the landmarks.
Another application of facial analysis is face anti-spoofing, which has recently received a lot of attention. While face recognition systems serve as a verification portal for various devices (i.e., phone unlock, access control, and transportation security), attackers present face spoofs (i.e., presentation attacks, PA) to the system and attempt to be authenticated as the genuine user. The face PAs include printing the face on paper (print attack), replaying a face video on a digital device (replay attack), and wearing a mask (mask attack). To counteract PA, researchers have developed face anti-spoofing techniques [27, 38, 39, 65] to detect PA prior to a face image being recognized. Therefore, face anti-spoofing is vital to ensure that face recognition systems are robust to PA and safe to use.

In Chapters 5 and 6, we present our proposed deep models for face anti-spoofing that use the supervisions from both the spatial and temporal auxiliary information, for the purpose of robustly detecting face PA from a face image or a face video.

1.2 Prior work on face alignment

We review prior work on face alignment in seven areas related to the proposed methods: generic face alignment, pose-invariant face alignment, 3D face model fitting, face alignment via deep learning, sharing information in face alignment and deep learning, convolutional recurrent neural network (CRNN), and visualization in deep learning.

Generic face alignment  The first type of face alignment approach is based on the Constrained Local Model (CLM), where an early example is ASM [30]. The basic idea is to learn a set of local appearance models, one for each landmark, and the decisions from the local models are fused with a global shape model. There are generative or discriminative [32] approaches in learning the local model, and various approaches in utilizing the shape constraint [4]. While the local models are favored for higher estimation precision, they also create difficulties for alignment on low-resolution images due to limited local appearance. In contrast, the AAM method [29, 78] and its extensions [75, 97] learn a global appearance model, whose similarity to the input image drives the landmark estimation. While AAM is known to have difficulties with unseen subjects [42], the recent development has substantially improved its generalization capability [110]. Motivated by the Shape Regression Machine [140, 146] in the medical domain, cascaded regressor-based methods have been very popular in recent years [26, 111]. On one hand, the series of regressors progressively reduces the alignment error and leads to higher accuracy. On the other hand, advanced feature learning also renders ultra-efficient alignment procedures [56, 93]. Other than the three major types of algorithms, there are also works based on deep learning [142], graph-models [151], and semi-supervised learning [105].

Pose-invariant face alignment  The methods of [45, 135, 151] combine face detection, pose estimation and face alignment. By using a 3D shape model with an optimized mixture of parts, [135] is applicable to faces with a large range of poses. In [120], a face alignment method based on cascade regressors is proposed to handle invisible landmarks. Each stage is composed of two regressors for estimating the probability of landmark visibility and the location of landmarks. This method is applied to profile-view faces of the FERET database [89]. However, as a 2D landmark-based approach, it cannot estimate 3D face poses. Occlusion-invariant face alignment, such as RCPR [22], may also be applied to handle large poses since non-frontal faces are one type of occlusion. [109] is a very recent work that estimates 3D landmarks via regressors. However, it only tests on synthesized face images up to ~50° yaw.
3D face model fitting  Almost all prior works assume that the 2D landmarks of the input face image are either manually labeled or estimated via a face alignment method. In [48], a dense 3D face alignment method from videos is proposed. At first, a dense set of 2D landmarks is estimated by using the cascaded regressor. Then, an EM-based algorithm is utilized to estimate the 3D shape and 3D pose of the face from the estimated 2D landmarks. The authors in [92] aim to make sure that the locations of 2D contour landmarks are consistent with the 3D face shape. In [152], a 3D face model fitting method based on the similarity of frontal view face images is proposed.

Face alignment via deep learning  With the continuous success of deep learning in vision, researchers start to apply deep learning to face alignment. Sun et al. [99] proposed a three-stage face alignment algorithm with CNN. At the first stage, three CNNs are applied to different face parts to estimate positions of different landmarks, whose averages are regarded as the first stage results. At the next two stages, by using local patches with different sizes around each landmark, the landmark positions are refined. Similar face alignment algorithms based on multi-stage CNNs are further developed by Zhou et al. [144] and CFAN [139]. In [139], a face alignment method based on a cascade of stacked auto-encoder (SAE) networks can progressively refine the locations of 2D landmarks at each stage. TCDCN [142] uses a one-stage CNN to estimate the positions of five landmarks given a face image. The commonality among most of these prior works is that they only estimate 2D landmarks and the number of landmarks is limited to six.

Sharing information in face alignment and deep learning  Utilizing different side information in face alignment can improve the alignment accuracy. TCDCN [142] jointly estimates auxiliary attributes (e.g., gender, expression) with landmark locations to improve alignment accuracy. In [129], the mirrorability constraint, i.e., the alignment difference between a face and its mirrored counterpart, is used as a measure for evaluating the alignment results without the ground truth, and for choosing a better initialization. The consensus of regressors [136] in a Bayesian model is used to share information among different regressors. In [147], multiple initializations are used for each face image and a clustering method combines the estimated face shapes. For deep learning methods, sharing information is performed either by transferring the learned weights from a source domain to the target domain [134], or by using siamese networks [8, 137] to share the weights among branches of the network and make a decision with the combined responses of all branches.

Convolutional recurrent neural network (CRNN)  Methods based on CRNNs [107, 116, 122] are attempts to combine a cascade of regressors with joint optimization, for aligning mostly frontal faces. Their convolutional part extracts features from the whole image [122] or from the patches at the landmark locations [107]. The recurrent part facilitates joint optimization by sharing information among all regressors. Generally, the main drawbacks of CRNNs are: 1) existing CRNN methods are designed for near-frontal face alignment; 2) the CRNN methods share the same CNN at all stages.

Visualization in deep learning  Visualization techniques have been used in deep learning to assist in making a relative comparison among the input data and focusing on the region of interest.
One category of these methods exploits the deconvolutional and upsampling layers to either expand response maps [67, 87] or represent estimated parameters [132]. Alternatively, various types of feature maps, e.g., heatmaps and Z-Buffering, can represent the current estimation of landmarks and parameters. In [21, 80, 119], 2D landmark heatmaps represent the landmarks' locations. [21] proposes a two-step pose-invariant alignment based on heatmaps to make more precise estimations. The heatmaps suffer from three drawbacks: 1) lack of the capability to represent objects in details; 2) requirement of one heatmap per landmark due to its weak representation power; 3) they cannot estimate the visibility of landmarks. The Z-Buffer rendered using the estimated 3D face is also used to convey the results of a previous CNN to the next one [147]. However, the Z-Buffer representation is not differentiable, preventing end-to-end training.

1.3 Prior work on face anti-spoofing

We review the prior face anti-spoofing works in three groups: texture-based methods, temporal-based methods, and remote photoplethysmography methods.

Texture-based Methods  Since most face recognition systems adopt only RGB cameras, using texture information has been a natural approach to tackling face anti-spoofing. Many prior works utilize hand-crafted features, such as LBP [33, 34, 77], HoG [59, 131], SIFT [84] and SURF [18], and adopt traditional classifiers such as SVM and LDA. To overcome the influence of illumination variation, they seek solutions in a different input domain, such as HSV and YCbCr color spaces [16, 17], and the Fourier spectrum [65].

As deep learning has proven to be effective in many computer vision problems, there are many recent attempts of using CNN-based features or CNNs in face anti-spoofing [37, 66, 83, 130]. Most of the work treats face anti-spoofing as a simple binary classification problem by applying the softmax loss. For example, [66, 83] use CNN as a feature extractor and fine-tune from ImageNet-pretrained CaffeNet and VGG-face. The work of [37, 66] feeds different designs of the face images into CNN, such as multi-scale faces and hand-crafted features, and directly classifies live vs. spoof.

Temporal-based Methods  One of the earliest solutions for face anti-spoofing is based on temporal cues such as eye-blinking [82, 83]. Methods such as [58, 98] track the motion of mouth and lip to detect the face liveness. While these methods are effective against typical paper attacks, they become vulnerable when attackers present a replay attack or a paper attack with the eye/mouth portion being cut. There are also methods relying on more general temporal features, instead of the specific facial motion. The most common approach is frame concatenation. Many handcrafted feature-based methods may improve intra-database testing performance by simply concatenating the features of consecutive frames to train the classifiers [16, 33, 60]. Additionally, there are some works proposing temporal-specific features, e.g., Haralick features [3], motion magnification [10], and optical flow [6]. In the deep learning era, Feng et al. feed the optical flow map and Shearlet image feature to CNN [37]. In [125], Xu et al. propose an LSTM-CNN architecture to utilize temporal information for binary classification. Overall, all prior methods still regard face anti-spoofing as a binary classification problem, and thus they have a hard time generalizing well in cross-database testing.

Remote Photoplethysmography (rPPG)  Remote photoplethysmography (rPPG) is the technique to track vital signals, such as heart rate, without any contact with human skin [14, 35, 91, 108, 118]. Research starts with face videos with no motion or illumination change and moves to videos with multiple variations. In [35], Haan et al. estimate rPPG signals from RGB face videos with lighting and motion changes. It utilizes color difference to eliminate the specular reflection component and estimate two orthogonal chrominance signals. After applying the Band Pass Filter (BPF), the ratio of the chrominance signals is used to compute the rPPG signal.
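To make this pipeline concrete, below is a minimal Python sketch of a chrominance-based rPPG extractor in the spirit of [35]; the function name, the exact chrominance combinations, and the 0.7-4.0 Hz pass band are our illustrative assumptions, not code from [35] or from this thesis.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def chrom_rppg(rgb_traces, fps, low=0.7, high=4.0):
        """Sketch: estimate an rPPG signal from per-frame mean skin RGB values.

        rgb_traces: (T, 3) array, one mean RGB value per video frame.
        """
        # Temporally normalize each channel to suppress the DC level.
        norm = rgb_traces / rgb_traces.mean(axis=0)
        r, g, b = norm[:, 0], norm[:, 1], norm[:, 2]
        # Two (near-)orthogonal chrominance signals that reduce the
        # specular reflection component.
        x = 3.0 * r - 2.0 * g
        y = 1.5 * r + g - 1.5 * b
        # Band-pass filter to the plausible heart-rate band.
        nyq = fps / 2.0
        bb, aa = butter(3, [low / nyq, high / nyq], btype="band")
        xf, yf = filtfilt(bb, aa, x), filtfilt(bb, aa, y)
        # Combine using the (std-based) ratio of the chrominance signals.
        return xf - (xf.std() / yf.std()) * yf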
The rPPG signal has previously been utilized to tackle face anti-spoofing [69, 81]. In [69], rPPG signals are used for detecting the 3D mask attack, where the live faces exhibit a pulse of heart rate unlike the 3D masks. They use rPPG signals extracted by [35] and compute the correlation features for classification. Similarly, Magdalena et al. [81] extract rPPG signals (also via [35]) from three face regions and two non-face regions, for detecting print and replay attacks. Although in replay attacks the rPPG extractor might still capture the normal pulse, the combination of multiple regions can differentiate live vs. spoof faces. While the analytic solution to rPPG extraction [35] is easy to implement, we observe that it is sensitive to PIE variations.

1.4 Overview of the thesis

In the second chapter, we propose a novel regression-based approach for pose-invariant face alignment, which aims to estimate the 2D and 3D locations of a sparse set of face landmarks, as well as their visibilities in the 2D image, for a face with arbitrary pose (e.g., ±90° yaw). By extending the popular cascaded regressor for 2D landmark estimation, we learn two fern regressors for each cascade layer, one for predicting the update for the camera projection matrix, and the other for predicting the update for the 3D shape parameter. The learning of the two regressors is conducted alternately with the goal of minimizing the difference between the ground truth updates and the predicted updates.

In the third chapter, we propose to use Convolutional Neural Networks (CNN) as the regressor in the cascaded framework, to learn the mapping. The main advantage of CNN over the fern regression trees (in the previous chapter) is that it does not depend on hand-crafted feature extraction methods. The CNN can learn and extract more meaningful, generalizable and abstract features by hierarchical representation. This property is more important in pose-invariant face alignment because a change in the head pose (frontal to side-view) makes a considerable difference in the face images. While most prior work on CNN-based face alignment estimates no more than six 2D landmarks per image [99, 142], our cascaded CNN can produce a substantially larger number (34) of 2D and 3D landmarks. Further, using landmark marching [150], our algorithm can adaptively adjust the 3D landmarks during the fitting so that the local appearances around cheek landmarks contribute to the fitting process.

In the fourth chapter, we introduce a novel visualization layer. We propose a CNN architecture which consists of several blocks, called visualization blocks. This architecture can be considered as a cascade of shallow CNNs. The new layer visualizes the alignment result of a previous visualization block and utilizes it in a later visualization block. It is designed based on several guidelines. Firstly, it is derived from the surface normals of the underlying 3D face model and encodes the relative pose between the face and camera, partially inspired by the success of using surface normals for 3D face recognition [79]. Secondly, the visualization layer is differentiable, which allows the gradient to be computed analytically and enables end-to-end training. Lastly, a mask is utilized to differentiate between pixels in the middle and contour areas of a face.

The last two chapters contain our proposed methods for the face anti-spoofing problem. In the fifth chapter, we propose a deep model for face anti-spoofing that uses the supervision from both the spatial and temporal auxiliary information, for the purpose of robustly detecting face presentation attacks (PA) from a face video. This auxiliary information is acquired based on our domain knowledge about the key differences between live and spoof faces, which include two perspectives: spatial and temporal. From the spatial perspective, it is known that live faces have face-like depth, e.g., the nose is closer to the camera than the cheek in frontal-view faces, while faces in print or replay attacks have flat or planar depth, e.g., all pixels on the image of a paper have the same depth to the camera. Hence, depth can be utilized as auxiliary information to supervise both live and spoof faces.
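As a concrete illustration of this spatial supervision, per-frame depth labels could be constructed along the following lines; the 32×32 label resolution and the assumption of a precomputed face-like depth map for live frames are our own guesses, not the author's implementation.

    import numpy as np

    def depth_label(fitted_depth, is_live, size=32):
        """Sketch: live frames get a face-like depth map (e.g., rendered
        from a fitted 3D face shape); print/replay spoofs get a flat,
        all-zero map, since every pixel of a paper or screen has the
        same depth to the camera."""
        return fitted_depth if is_live else np.zeros((size, size))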
From the temporal perspective, it was shown that the normal rPPG signals (i.e., heart pulse signals) are detectable from live, but not spoof, face videos [69, 81]. Therefore, we provide different rPPG signals as auxiliary supervision, which guides the network to learn from live or spoof face videos respectively. To enable both supervisions, we design a network architecture with a short-cut connection to capture different scales and a novel non-rigid registration layer to handle the motion and pose change for rPPG estimation.

Finally, in the sixth chapter, we propose a CNN method that, given a spoof face image, can estimate the spoof noise and reconstruct the original live face. We propose several constraints and supervisions based on our prior knowledge of the spoof noise. First, a live face has no spoof noise. Second, we assume that the spoof noise of a spoof image is ubiquitous, i.e., it exists everywhere in the spatial domain of the image; and, third, the spoof noise is repetitive, i.e., it is the spatial repetition of certain noise in the image. With such constraints, a novel CNN architecture is presented in the sixth chapter. Given an image, one CNN is designed to synthesize the spoof noise pattern and reconstruct the corresponding live image. In order to examine the reconstructed live image, we train another CNN with auxiliary supervision and a GAN-like discriminator in an end-to-end fashion. These two networks are designed to ensure the quality of the reconstructed image regarding its discriminativeness between live and spoof, and the visual plausibility of the synthesized live image.

1.4.1 Contributions of the thesis

In this section, we list some contributions already made towards pose-invariant face alignment:

• We propose a pose-invariant face alignment by fitting a dense 3DMM, and integrating estimation of 3D shape and 2D facial landmarks from a single face image.
• We introduce a cascaded CNN-based 3D face model fitting algorithm that is applicable to all poses, with integrated landmark marching and contributions from local appearances around cheek landmarks during the fitting process.
• A visualization layer is presented which is differentiable, and allows backpropagation of error from a later block to an earlier one.

Also, we list some of the contributions made toward the face anti-spoofing problem:

• We propose to leverage novel auxiliary information (i.e., depth map and rPPG) to supervise the CNN learning for improved generalization of face anti-spoofing systems.
• We propose a novel CNN-RNN architecture for face anti-spoofing which performs end-to-end learning with the depth map and the rPPG signal.
• We offer a new perspective for detecting face spoofing from print attack and replay attack by inversely decomposing a spoof face image into the live face and the spoof noise, without having the ground truth of either.
Chapter 2

Pose-Invariant 3D Face Alignment

2.1 Introduction

Motivated by the needs to address the pose variation, and the lack of prior work in handling poses, as shown in Fig. 2.1, this chapter proposes a novel regression-based approach for pose-invariant face alignment, which aims to estimate the 2D and 3D locations of face landmarks, as well as their visibilities in the 2D image, for a face with arbitrary pose (e.g., ±90° yaw). By extending the popular cascaded regressor for 2D landmark estimation, we learn two regressors for each cascade layer, one for predicting the update for the camera projection matrix, and the other for predicting the update for the 3D shape parameter. The learning of the two regressors is conducted alternately with the goal of minimizing the difference between the ground truth updates and the predicted updates. By using the 3D surface normals of 3D landmarks, we can automatically estimate the visibilities of their 2D projected landmarks by inspecting whether the transformed surface normal has a positive z coordinate, and these visibilities are dynamically incorporated into the regressor learning such that only the local appearance of visible landmarks contributes to the learning. Finally, extensive experiments are conducted on a large subset of the AFLW dataset [57] with a wide range of poses, and the AFW dataset [151], with comparisons against a number of state-of-the-art methods. We demonstrate superior 2D alignment accuracy and quantitatively evaluate the 3D alignment accuracy.

Figure 2.1: Given a face image with an arbitrary pose, our proposed algorithm automatically estimates the 2D locations and visibilities of facial landmarks, as well as 3D landmarks. The displayed 3D landmarks are estimated for the image in the center. Green/red points indicate visible/invisible landmarks.

In summary, the main contributions of the proposed pose-invariant face alignment are:

• We propose a face alignment method that can estimate a sparse set of 2D/3D landmarks and their visibilities for a face image with an arbitrary pose.
• By integrating with a 3D point distribution model, a cascaded coupled-regressor approach is developed to estimate both the camera projection matrix and the 3D landmarks, where the 3D model enables automatically computed landmark visibilities via surface normals.
• A substantially larger number of non-frontal view face images is utilized in evaluation, with demonstrated superior performance over the state of the art.

2.2 Pose-Invariant 3D Face Alignment

This section presents the details of our proposed Pose-Invariant 3D Face Alignment (PIFA) algorithm, with emphasis on the training procedure. As shown in Fig. 2.2, we first learn a 3D Point Distribution Model (3DPDM) [31] from a set of labeled 3D scans, where a set of 2D landmarks on an image can be considered as a projection of a 3DPDM instance (i.e., 3D landmarks). For each 2D training face image, we assume that there exist the manually labeled 2D landmarks and their visibilities, as well as the corresponding 3D ground truth, i.e., 3D landmarks and the camera projection matrix.

Figure 2.2: Overall architecture of our proposed PIFA method, with three main modules (3D modeling, cascaded coupled-regressor learning, and 3D surface-enabled visibility estimation). Green/red arrows indicate surface normals pointing toward/away from the camera.

Given the training images and 2D/3D ground truth, we train a cascaded coupled-regressor that is composed of two regressors at each cascade layer, for the estimation of the update of the 3DPDM coefficients and the projection matrix respectively. Finally, the visibilities of the projected 3D landmarks are automatically computed via the domain knowledge of the 3D surface normals, and incorporated into the regressor learning procedure.
2.2.1 3D Face Modeling

Face alignment concerns the 2D face shape, represented by the locations of N 2D landmarks, i.e.,

    U = \begin{pmatrix} u_1 & u_2 & \cdots & u_N \\ v_1 & v_2 & \cdots & v_N \end{pmatrix}.    (2.1)

A 2D face shape U is a projection of a 3D face shape S, similarly represented by the homogeneous coordinates of N 3D landmarks, i.e.,

    S = \begin{pmatrix} x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N \\ z_1 & z_2 & \cdots & z_N \\ 1 & 1 & \cdots & 1 \end{pmatrix}.    (2.2)

Similar to the prior work [121], a weak perspective model is assumed for the projection,

    U = MS,    (2.3)

where M is a 2×4 projection matrix with seven degrees of freedom (yaw, pitch, roll, two scales and 2D translations). Following the basic idea of 3DPDM [31], we assume a 3D face shape is an instance of the 3DPDM,

    S = S_0 + \sum_{i=1}^{N_s} p_i S_i,    (2.4)

where S_0 and S_i are the mean shape and the i-th shape basis of the 3DPDM respectively, N_s is the total number of shape bases, and p_i is the i-th shape coefficient. Given a dataset of 3D scans with manual labels on N 3D landmarks per scan, we perform Procrustes analysis on the 3D scans to remove the global transformation, and then conduct Principal Component Analysis (PCA) to obtain S_0 and {S_i} (see the top-left part of Fig. 2.2).

The set of all shape coefficients p = (p_1, p_2, ..., p_{N_s}) is termed the 3D shape parameter of an image. At this point, the face alignment for a testing image I has been converted from the estimation of U to the estimation of P = {M, p}. The conversion is motivated by a few factors. First, without the 3D modeling, it is very difficult to model the out-of-plane rotation, which results in a varying number of visible landmarks depending on the rotation angle and the individual 3D face shape. Second, as pointed out by [121], by only using 1/6 of the number of the shape bases, a 3DPDM can have representation power equivalent to its 2D counterpart. Hence, using a 3D model might lead to a more compact representation of unknown parameters.

Ground truth P  Estimating P for a testing image implies the existence of ground truth P for each training image. However, while U can be manually labeled on a face image, P is normally unavailable unless a 3D scan is captured along with a face image. Therefore, in order to leverage the vast amount of existing 2D face alignment datasets, such as the AFLW dataset [57], it is desirable to estimate P for a face image and use it as the ground truth for learning.

Given a face image I, we denote the manually labeled 2D landmarks as U and the landmark visibility as v, an N-dim vector with binary elements indicating visible (1) or invisible (0) landmarks. Note that it is not necessary to label the 2D locations of invisible landmarks. We define the following objective function to estimate M and p,

    J(M, p) = \left\| \left( M \left( S_0 + \sum_{i=1}^{N_s} p_i S_i \right) - U \right) \odot V \right\|^2,    (2.5)

where V = (v^T; v^T) is a 2×N visibility matrix, ⊙ denotes the element-wise multiplication, and ||·||^2 is the sum of the squares of all matrix elements. Basically J(·,·) computes the difference between the visible 2D landmarks and their 3D projections. An alternating estimation scheme is utilized, i.e., by assuming p^0 = 0, we estimate M^k = argmin_M J(M, p^{k-1}), and then p^k = argmin_p J(M^k, p), iteratively until the changes of M and p are small enough. Both minimizations can be efficiently solved in closed forms via least-square error.
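The alternating minimization of Eq. 2.5 can be sketched as follows. This is our own illustration (unconstrained least squares with a fixed iteration count; the re-composition of M from its five pose parameters, discussed in Sec. 2.2.2, is omitted), not the author's code.

    import numpy as np

    def fit_ground_truth(U, v, S0, S_bases, n_iters=20):
        """Sketch: alternately solve M and p in Eq. 2.5.

        U: (2, N) labeled 2D landmarks; v: (N,) binary visibility;
        S0: (4, N) homogeneous mean shape; S_bases: (Ns, 4, N) bases.
        """
        vis = v.astype(bool)
        p = np.zeros(S_bases.shape[0])
        M = None
        for _ in range(n_iters):
            # Fix p, solve M: rows of M are least-squares fits of the
            # visible 2D coordinates against the current 3D shape.
            S = S0 + np.tensordot(p, S_bases, axes=1)              # (4, N)
            M = np.linalg.lstsq(S[:, vis].T, U[:, vis].T, rcond=None)[0].T
            # Fix M, solve p: the projected bases act as regressors
            # for the residual of the visible landmarks.
            A = np.stack([(M @ B)[:, vis].ravel() for B in S_bases], axis=1)
            r = (U[:, vis] - (M @ S0)[:, vis]).ravel()
            p = np.linalg.lstsq(A, r, rcond=None)[0]
        return M, p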
2.2.2 Cascaded Coupled-Regressor

For each training image I_i, we now have its ground truth as P_i = {M_i, p_i}, as well as their initialization, i.e., M_i^0 = g(M̄, b_i), p_i^0 = 0, and v_i^0 = 1. Here M̄ is the average of the ground truth projection matrices in the training set, b_i is a 4-dim vector indicating the bounding box location, and g(M, b) is a function that modifies the scale and translation of M based on b. Given a dataset of N_d training images, the question is how to formulate an optimization problem to estimate P_i. We decide to extend the successful cascaded regressors framework due to its accuracy and efficiency [26]. The general idea of cascaded regressors is to learn a series of regressors, where the k-th regressor estimates the difference between the current parameter P_i^{k-1} and the ground truth P_i, such that the estimated parameter gradually approximates the ground truth.

Motivated by this general idea, we adopt a cascaded coupled-regressor scheme where two regressors are learned at the k-th cascade layer, for the estimation of M_i and p_i respectively. Specifically, the learning task of the k-th regressor is,

    Θ_1^k = \arg\min_{Θ_1^k} \sum_{i=1}^{N_d} \left\| \Delta M_i^k - R_1^k(I_i, U_i, v_i^{k-1}; Θ_1^k) \right\|^2,    (2.6)

where

    U_i = M_i^{k-1} \left( S_0 + \sum_{i=1}^{N_s} p_i^{k-1} S_i \right)    (2.7)

is the current estimated 2D landmarks, \Delta M_i^k = M_i - M_i^{k-1}, and R_1^k(·; Θ_1^k) is the desired regressor with the parameter of Θ_1^k. After Θ_1^k is estimated, we obtain \Delta\hat{M}_i = R_1^k(·; Θ_1^k) for all training images and update M_i^k = M_i^{k-1} + \Delta\hat{M}_i. Note that this linear updating may potentially break the constraint of the projection matrix. Therefore, we estimate the scales and yaw, pitch, roll angles (s_x, s_y, α, β, γ) from M_i^k and compose a new M_i^k based on these five parameters.

Similarly, the second learning task of the k-th regressor is,

    Θ_2^k = \arg\min_{Θ_2^k} \sum_{i=1}^{N_d} \left\| \Delta p_i^k - R_2^k(I_i, U_i, v_i^k; Θ_2^k) \right\|^2,    (2.8)

where U_i is computed via Eq. 2.7 except M_i^{k-1} is replaced with M_i^k. We also obtain \Delta\hat{p}_i = R_2^k(·; Θ_2^k) for all training images and update p_i^k = p_i^{k-1} + \Delta\hat{p}_i. This iterative learning procedure continues for K cascade layers.

Learning R^k(·)  Our cascaded coupled-regressor scheme does not depend on the particular feature representation or the type of regressors. Therefore, we may define them based on prior work or any future development in features and regressors. Specifically, in this work we adopt the HOG-based linear regressor [126] and the fern regressor [22].

For the linear regressor, we denote a function f(I, U) to extract HOG features around a small rectangular region of each one of the N landmarks, which returns a 32N-dim feature vector. Thus, we define the regressor function as

    R(\cdot) = Θ^\intercal \, \mathrm{Diag}(v_i) \, f(I_i, U_i),    (2.9)

where Diag(v) is a function that duplicates each element of v 32 times and converts it into a diagonal matrix of size 32N. Note that we also add a constraint, λ||Θ||^2, to Eq. 2.6 or Eq. 2.8 for a more robust least-square solution. By plugging Eq. 2.9 into Eq. 2.6 or Eq. 2.8, the regressor parameter Θ (e.g., an N_s × 32N matrix for R_2^k) can be easily estimated in closed form.

For the fern regressor, we follow the training procedure of [22]. That is, we divide the face region into a 3×3 grid. At each cascade layer, we choose 3 out of the 9 zones with the least occlusion, computed based on the {v_i^k}. For each selected zone, a depth-5 random fern regressor is learned from the interpolated shape-indexed features selected by the correlation-based method [26] from that zone only. Finally, the learned R(·) is a weighted mean voting from the 3 fern regressors, where the weight is inversely proportional to the average amount of occlusion in that zone.
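For the linear regressor, plugging Eq. 2.9 into Eq. 2.6 or Eq. 2.8 together with the λ||Θ||^2 term yields a standard ridge regression; a minimal sketch of that closed-form solution is given below (our own illustration, assuming the visibility masking of Diag(v_i) has already been applied to the feature rows).

    import numpy as np

    def learn_linear_regressor(F, T, lam=120.0):
        """Sketch: closed-form ridge solution for one cascade layer.

        F: (Nd, D) masked HOG features, one row per training image;
        T: (Nd, K) target updates (vectorized dM or dp).
        Returns Theta (D, K) minimizing ||F Theta - T||^2 + lam ||Theta||^2.
        """
        D = F.shape[1]
        # Regularized normal equations: (F'F + lam I) Theta = F'T.
        return np.linalg.solve(F.T @ F + lam * np.eye(D), F.T @ T)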
2.2.3 3D Surface-Enabled Visibility

Up to now the only thing that has not been explained in the training procedure is how to estimate the visibility of the projected 2D landmarks, v_i. It is obvious that during the testing we have to estimate v at each cascade layer for a testing image, since there is no visibility information given. As a result, during the training procedure, we also have to estimate v per cascade layer for each training image, rather than using the manually labeled ground truth visibility that is useful for estimating ground truth P as shown in Eq. 2.5.

Depending on the camera projection matrix M, the visibility of each projected 2D landmark may dynamically change along different layers of the cascade (see the top-right part of Fig. 2.2). In order to estimate v, we decide to use the 3D face surface information. We start by assuming every individual has a similar 3D surface normal vector at each of its 3D landmarks. Then, by rotating the surface normal according to the rotation angle indicated by the projection matrix, we know whether the rotated surface normal is pointing toward the camera (i.e., visible) or away from the camera (i.e., invisible). In other words, the sign of the z-axis coordinate indicates visibility.

By taking a set of 3D scans with manually labeled 3D landmarks, we can compute the landmarks' average 3D surface normals, denoted as a 3×N matrix Ñ. Then we use the following equation to compute the visibility vector,

    v = \tilde{N}^\intercal \left( \frac{m_1}{\|m_1\|} \times \frac{m_2}{\|m_2\|} \right),    (2.10)

where m_1 and m_2 are the left-most three elements at the first and second rows of M respectively, and ||·|| denotes the L_2 norm. For fern regressors, v is a soft visibility within ±1. For linear regressors, we further compute v = ½(1 + sign(v)), which results in a hard visibility of either 1 or 0. In summary, we present the detailed training procedure in Fig. 2.3.

Figure 2.3: The training procedure of PIFA.

    Data: 3D model {{S_i}_{i=0}^{N_s}, Ñ}, labeled data {I_i, U_i, b_i}_{i=1}^{N_d}
    Result: Cascaded regressor parameters {Θ_1^k, Θ_2^k}_{k=1}^{K}
    /* 3D model fitting */
    foreach i = 1, ..., N_d do
        Estimate M_i and p_i via Eq. 2.5
    /* Initialization */
    foreach i = 1, ..., N_d do
        p_i^0 = 0            ▷ Assuming the mean 3D shape
        v_i^0 = 1            ▷ Assuming all landmarks visible
        M_i^0 = g(M̄, b_i) and U_i = M_i^0 S_0
    /* Regressor learning */
    foreach k = 1, ..., K do
        Estimate Θ_1^k via Eq. 2.6
        Update M_i^k and U_i for all images
        Compute v_i^k via Eq. 2.10 for all images
        Estimate Θ_2^k via Eq. 2.8
        Update p_i^k and U_i for all images

Model fitting  Given a testing image I with bounding box b and its initial parameters M^0 = g(M̄, b) and p^0 = 0, we can apply the learned cascaded coupled-regressor for face alignment. Basically we iteratively use R_1^k(·; Θ_1^k) to compute \Delta\hat{M}, update M^k, compute v^k, use R_2^k(·; Θ_2^k) to compute \Delta\hat{p}, and update p^k. Finally, the estimated 3D landmarks are \hat{S} = S_0 + \sum_i p_i^K S_i, and the estimated 2D landmarks are \hat{U} = M^K \hat{S}. Note that \hat{S} carries the individual 3D shape information of the subject, but not necessarily in the same pose as the 2D testing image.
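A sketch of the visibility test of Eq. 2.10, as used at every cascade layer during both training and model fitting (our own illustration):

    import numpy as np

    def landmark_visibility(N_avg, M, hard=True):
        """Sketch: visibility from average surface normals (Eq. 2.10).

        N_avg: (3, N) average 3D surface normals at the landmarks;
        M: (2, 4) weak perspective projection matrix.
        """
        m1, m2 = M[0, :3], M[1, :3]
        # The normalized cross product of the two projection rows points
        # along the camera axis; its agreement with each surface normal
        # decides whether that landmark faces the camera.
        axis = np.cross(m1 / np.linalg.norm(m1), m2 / np.linalg.norm(m2))
        v = N_avg.T @ axis                 # soft visibility within +-1
        return 0.5 * (1.0 + np.sign(v)) if hard else v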
Sincewearealsoestimating3Dlandmarks,itisimportanttotestonadatasetwith ground Table2.1:Thecomparisonoffacealignmentalgorithmsinposehandling(estimationerrorsmay havedifferent Method 3D Visibility Pose-relateddatabase Pose Training Testing Landmark Estimation landmark range face# face# # errors RCPR[22] No Yes COFW frontalw.occlu. 1 ; 345 507 19 8 : 5 CoR[136] No Yes COFW;LFPW-O;Helen-O frontalw.occlu. 1 ; 345;468;402 507;112;290 19;49;49 8 : 5 TSPM[151] No No AFW allposes 2 ; 118 468 6 11 : 1 CDM[135] No No AFW allposes 1 ; 300 468 6 9 : 1 OSRD[123] No No MVFW < 40 2 ; 050 450 68 N/A TCDCN[142] No No AFLW,AFW < 60 10 ; 000 3 ; 000; ˘ 313 5 8 : 0;8 : 2 PIFA Yes Yes AFLW,AFW allposes 3 ; 901 1 ; 299;468 21 ; 6 6 : 5;8 : 6 21 truth ,ratherthanestimated,3Dlandmarklocations.WeBP4D-Sdatabase[141]tobethebest forthispurpose,whichcontainspairsof2Dimagesand3Dscansofspontaneousfacialexpressions from41subjects.Eachpairhassemi-automaticallygenerated832Dand833Dlandmarks,and thepose.Weapplyarandomperturbationon2Dlandmarks(tomimicimprecisefacedetection) andgeneratetheirenclosedboundingbox.Withthegoalofselectingasmanynon-frontalview facesaspossible,wechooseasubsetwherethenumbersoffaceswhoseyawanglewithin [ 0 ; 10 ] , [ 10 ; 20 ] , [ 20 ; 30 ] are100,500,and500respectively.Werandomlyselecthalfof1 ; 100images fortrainingandtherestfortesting,withdisjointsubjects. Experimentsetup OurPIFAapproachneedsa3Dmodelof f S g N s i = 0 and ~ N .UsingtheBU-4DFE database[133]thatcontains6063Dfacialexpressionsequencesfrom101subjects,weevenly sample72scansfromeachsequenceandgatheratotalof72 606scans.Basedonthemethodin Sec.2.2.1,theresultantmodelhas N s = 30forAFLWandAFW,and N s = 200forBP4D-S. Duringthetrainingandtesting,foreachimagewithaboundingbox,weplacethemean2D landmarks(learnedfromthetrainingset)ontheimagesuchthatthelandmarksontheboundary arewithinthefouredgesofthebox.Fortrainingwithlinearregressors,weset K = 10, l = 120, while K = 75forfernregressors. Evaluationmetric Giventhegroundtruth2Dlandmarks U i ,theirvisibility v i ,andestimated landmarks ‹ U i of N t testingimages,wehavetwowaysofcomputingthelandmarkestimationerrors: 1)MeanAveragePixelError(MAPE)[135],whichistheaverageoftheestimationerrorsfor visiblelandmarks,i.e., MAPE = 1 å N t i j v i j 1 N t ; N å i ; j v i ( j ) jj ‹ U i ( : ; j ) U i ( : ; j ) jj ; (2.11) where j v i j 1 isthenumberofvisiblelandmarksofimage I i ,and U i ( : ; j ) isthe j thcolumnof U i .2) 22 NormalizedMeanError(NME),whichistheaverageofthenormalizedestimationerrorofvisible landmarks,i.e., NME = 1 N t N t å i ( 1 d i j v i j 1 N å j v i ( j ) jj ‹ U i ( : ; j ) U i ( : ; j ) jj ) ; (2.12) where d i isthesquarerootofthefaceboundingboxsize,asusedby[135].Notethatnormally d i istheinter-eyedistanceinpriorfacealignmentworkdealingwithnear-frontalfaces. Giventhegroundtruth3Dlandmarks S i andestimatedlandmarks ‹ S i ,weestimatethe globalrotation,translationandscaletransformationsothatthetransformed S i ,denotedas S 0 i ,has theminimumdistanceto ‹ S i .WethencomputetheMAPEviaEq2.11exceptreplacing U and ‹ U i with S 0 i and ‹ S i ,and v i = 1 .ThustheMAPEonlymeasurestheerrorduetonon-rigidshape deformation,ratherthantheposeestimation. 
Choiceofbaselinemethods Giventheexplosionoffacealignmentworkinrecentyears,itisim- portanttochooseappropriatebaselinemethodssoastomakesuretheproposedmethodadvances thestateoftheart.Inthiswork,weselectthreerecentworksasbaselinemethods:1)CDM[135] isaCLM-typemethodandtheoneclaimedtoperformpose-freefacealignment,whichhas exactlythesameobjectiveasours.OnAFWitalsooutperformstheotherwell-knownTSPM method[151]thatcanhandleallposefaces.2)TCDCN[142]isapowerfuldeeplearning-based methodpublishedinthemostrecentECCV.Althoughitonlyestimates5landmarksforupto ˘ 60 yaw,itrepresentstherecentdevelopmentinfacealignment.3)RCPR[22]isaregression-type methodthatrepresentstheocclusion-invariantfacealignment.Althoughitisanearlierworkthan CoR[136],wechooseitduetoitssuperiorperformanceonthelargeCOFWdataset(seeTab.1 of[136]).Itcanbeseenthatthesethreebaselinesnotonlyaremostrelevanttoourfocusonpose- invariantfacealignment,butalsowellrepresentthemajorcategoriesofexistingfacealignment algorithmsbasedon[115]. 23 Table2.2: TheNME(%)ofthreemethodsonAFLW. N t PIFA CDM RCPR 1 ; 299 6 : 52 7 : 15 783 6 : 08 8 : 65 Table2.3: ThecomparisonoffourmethodsonAFW. N t N Metric PIFA CDM RCPR TCDCN 468 6 MAPE 8 : 61 9 : 13 313 5 NME 9 : 42 9 : 30 8 : 20 ComparisononAFLW SincethesourcecodeofRCPRispubliclyavailable,weareableto performthetrainingandtestingofRCPRonourAFLWpartition.Weusetheavailable executableofCDMtocomputeitsperformanceonourtestset.Westrivetoprovidethesame setuptothebaselinesasours,suchastheinitialboundingbox,regressorlearning,etc.Forour PIFAmethod,weusethefernregressor.BecauseCDMintegratesfacedetectionandpose-free facealignment,noboundingboxwasgiventoCDManditsuccessfullydetectsandaligns783out of1 ; 299testingimages.Therefore,tocomparewithCDM,weevaluatetheNMEonthe same 783testingimages.AsshowninTab.2.2,ourPIFAshowssuperiorperformancetobothbaselines. AlthoughTCDCNalsoreportsperformanceonasubsetof3 ; 000AFLWimageswithin 60 yaw, itisevaluatedwith5landmarks,basedonNMEwhen d i istheinter-eyedistance.Hence,without thesourcecodeofTCDCN,itisdiftohaveafaircomparisononoursubsetofAFLWimages (e.g.,wecannot d i astheinter-eyedistanceduetoviewfaces).Onthe1 ; 299testing images,wealsotestourmethodwithlinearregressors,andachieveaNMEof7 : 50,whichshows thestrengthoffernregressors. ComparisononAFW UnlikeoursubsetofAFLW,theAFWdatasethasbeenevaluated byallthreebaselines,butdifferentmetricsareused.Therefore,theresultsofthebaselinesin Tab.2.3arefromthepublishedpapers,insteadofexecutingthetestingcode.Onenoteisthatfrom theTCDCNpaper[142],itappearsthatall5landmarksarevisibleonalldisplayedimagesand 24 Figure2.4: TheNMEofveposegroupsfortwomethods. novisibilityestimationisshown,whichmightsuggestthatTCDCNwasevaluatedonasubsetof AFWwithupto 60 yaw.Hence,weselectthetotalof313outof468faceswithinthispose rangeandtestouralgorithm.Sinceitislikelythatoursubsetcoulddifferto[142],pleasetake thisintoconsiderationwhilecomparingwithTCDCN.Overall,ourPIFAmethodstillperforms comparablyamongthefourmethods.ThisisespeciallyencouraginggiventhefactthatTCDCN utilizesasubstantiallylargertrainingsetof10 ; 000images-morethantwotimesofourtraining set.NotethatinadditiontoTab.2.2and2.3,ourPIFAalsohasotherasshowninTab.2.1. E.g.,wehave3Dandvisibilityestimation,whileRCPRhasno3DestimationandTCDCNdoes nothavevisibilityestimation. Estimationerroracrossposes Justlikepose-invariantfacerecognitionstudiestherecognition rateacrossposes[71,72],wealsoliketostudytheperformanceoffacealignmentacrossposes. 
As shown in Fig. 2.4, based on the estimated projection matrix M and its yaw angles, we partition all testing images of AFLW into five bins, each around a yaw angle. Then we compute the NME of the testing images within each bin, for our method and RCPR. We can observe that the profile-view images have in general larger NME than near-frontal images, which shows the challenge of pose-invariant face alignment. Further, the improvement of PIFA over RCPR is consistent across most of the poses.

Estimation error across landmarks. We are also interested in the estimation error across various landmarks, under a wide range of poses. Hence, for the AFLW test set, we compute the NME of each landmark for our method. As shown in Fig. 2.5, the two eye regions have the least amount of error. The two landmarks under the ears have the most error, which is consistent with the intuition. These observations also align well with prior face alignment studies on near-frontal faces.

Figure 2.5: The NME of each landmark for PIFA.

3D landmark estimation. By performing the training and testing on the BP4D-S dataset, we can evaluate the MAPE of 3D landmark estimation, with exemplar results shown in Fig. 2.6. Since there is limited 3D alignment work and much of it does not perform quantitative evaluation, such as [43], we are not able to find another method as the baseline. Instead, we use the 3D mean shape $\mathbf{S}_0$ as a baseline and compute its MAPE with respect to the ground truth 3D landmarks $\mathbf{S}_i$ (after global transformation). We find that the MAPE of the $\mathbf{S}_0$ baseline is 5.02, while our method has 4.75. Although our method offers a better estimation than the mean shape, this shows that 3D face alignment is still a very challenging problem. We hope the effort to quantitatively measure the 3D estimation error, which is more difficult than its 2D counterpart, will encourage more research activities to address this challenge.

Figure 2.6: 2D and 3D alignment results on the BP4D-S dataset.

Table 2.4: Efficiency of four methods in FPS.

PIFA | CDM | RCPR | TCDCN
3.0 | 0.2 | 3.0 | 58.8

Computational efficiency. Based on the efficiency reported in the publications of the baseline methods, we compare the computational efficiency of the four methods in Tab. 2.4. Only TCDCN is measured based on a C implementation while the other three are all based on Matlab implementations. It can be observed that TCDCN is the most efficient one. Considering that we estimate both 2D and 3D landmarks, at 3 FPS our unoptimized implementation is reasonably efficient. In our algorithm, the most computationally demanding part is feature extraction, while estimating the updates for the projection matrix and 3D shape parameter has closed-form solutions and is very efficient.

Qualitative results. We now show the qualitative face alignment results for images in the two datasets. As shown in Fig. 2.7, despite the large pose range of ±90° yaw, our algorithm does a good job of aligning the landmarks, and correctly predicts the landmark visibilities. These results are especially impressive if you consider that the same mean shape (2D landmarks) is used as the initialization of all testing images, which has very large deformations with respect to their landmark estimation.

Figure 2.7: Testing results of AFLW (top) and AFW (bottom). As shown in the top row, we initialize face alignment by placing a 2D mean shape in the given bounding box of each image. Note the disparity between the initial landmarks and the estimated ones, as well as the diversity in pose, illumination and resolution among the images. Green/red points indicate visible/invisible estimated landmarks.

2.4 Summary

Motivated by the fast progress of face alignment technologies and the need to align faces at all poses, this chapter draws attention to a relatively less explored problem of face alignment robust to pose variation. To this end, we propose a novel approach to tightly integrate the powerful cascaded regressor scheme and the 3D face model. The 3D model not only serves as a compact constraint,
but also offers an automatic and convenient way to estimate the visibilities of 2D landmarks, a key for successful pose-invariant face alignment. As a result, for a 2D image, our approach estimates the locations of 2D and 3D landmarks, as well as their 2D visibilities. We conduct an extensive experiment on a large collection of all-pose face images and compare with three state-of-the-art methods. While superior 2D landmark estimation has been shown, the performance on 3D landmark estimation indicates the future direction to improve this line of research.

Chapter 3

Pose-Invariant Face Alignment via CNN-based Dense 3D Model Fitting

3.1 Introduction

In the previous chapter, we proposed the PIFA method which can estimate the locations of a sparse set of 3D landmark points. In this chapter, we extend PIFA in a number of ways.

First, we propose to use a dense 3D Morphable Model (3DMM) to reconstruct the 3D shape of the face and use the projection matrix as the latent representation of a 2D face shape. Therefore, face alignment amounts to estimating this representation, i.e., performing the 3DMM fitting to a face image with arbitrary poses.

Second, we propose to use Convolutional Neural Networks (CNN) as the regressor in the cascaded framework, to learn the mapping. The main advantage of CNN over the fern regression trees (in the previous chapter) is that it does not depend on hand-crafted feature extraction methods. The CNN can learn and extract more meaningful, generalizable and abstract features by hierarchical representation. This property is more important in pose-invariant face alignment because a change in the head pose (frontal to side-view) makes a considerable difference in the face images.

While most prior work on CNN-based face alignment estimates no more than six 2D landmarks per image [99, 142], our cascaded CNN can produce a substantially larger number (34) of 2D and 3D landmarks. Further, using landmark marching [150], our algorithm can adaptively adjust the 3D landmarks during the fitting so that the local appearances around cheek landmarks contribute to the fitting process.

Figure 3.1: The proposed method estimates landmarks for large-pose faces by fitting a dense 3D shape. From left to right: initial landmarks, fitted 3D dense shape, estimated landmarks with visibility. The green/red/yellow dots in the right column show the visible/invisible/cheek landmarks, respectively.

Third, we propose two novel pose-invariant local features, as the input layer for CNN learning. We utilize the dense 3D face model as an oracle to build dense feature correspondence across various poses and expressions. We also utilize person-specific 3D surface normals to estimate the visibility of each landmark by inspecting whether its surface normal has a positive z coordinate, and the estimated visibilities are dynamically incorporated into the CNN regressor learning such that only the extracted features from visible landmarks contribute to the learning.

Fourth, the CNN regressor deals with a very challenging learning task given the diverse facial appearance across all poses. To facilitate the learning task under large variations of pose and expression, we develop two new constraints to learn the CNN regressors. One is that there is inherent ambiguity in representing a 2D face shape as the combination of the 3D shape and projection matrix. Therefore, in addition to regressing toward such a non-unique latent representation, we also propose to constrain the CNN regressor in its ability to directly estimate 2D face shapes. The other is that a horizontally mirrored version of a face image is still a valid face, and their alignment results should be the mirrored version of each other. In this work, we propose a CNN architecture with a new loss function that explicitly enforces these constraints. The new loss function minimizes the difference of face alignment results of a face image and its mirror, in a siamese network architecture [20]. Although this mirrorability constraint was an alignment accuracy measure used in post-processing [129], we integrate it directly in CNN learning.
These algorithm designs collectively lead to the extended pose-invariant face alignment algorithm. We conduct extensive experiments to demonstrate the capability of the proposed method in aligning faces across poses on two challenging datasets, AFLW [61] and AFW [151], with comparison to the state of the art.

We summarize the main contributions of this work as:

• Pose-invariant face alignment by fitting a dense 3DMM, and integrating estimation of 3D shape and 2D facial landmarks from a single face image.

• The cascaded CNN-based 3D face model fitting algorithm that is applicable to all poses, with integrated landmark marching and contribution from local appearances around cheek landmarks during the fitting process.

• Dense 3D face-enabled pose-invariant local features and utilizing person-specific surface normals to estimate the visibility of landmarks.

• A novel CNN architecture with mirrorability constraint that minimizes the difference of face alignment results of a face image and its mirror.

Figure 3.2: The overall process of the proposed method.

3.2 Unconstrained 3D Face Alignment

The core of our proposed 3D face alignment method is the ability to fit a dense 3D Morphable Model to a 2D face image with arbitrary poses. The unknown parameters of the fitting, the 3D shape parameters and the projection matrix parameters, are sequentially estimated through a cascade of CNN-based regressors. By employing the dense 3D shape model, we enjoy the benefits of being able to estimate the 3D shape of the face, locate the cheek landmarks, use person-specific 3D surface normals, and extract a pose-invariant local feature representation, which are less likely to be achieved with a simple PDM [52]. Fig. 3.2 shows the overall process of the proposed method.

3.2.1 3D Morphable Model

To represent a dense 3D shape of an individual's face, we use the 3D Morphable Model (3DMM),

\mathbf{S} = \mathbf{S}_0 + \sum_{i=1}^{N_{id}} p_{id}^i \mathbf{S}_{id}^i + \sum_{i=1}^{N_{exp}} p_{exp}^i \mathbf{S}_{exp}^i, \quad (3.1)

where $\mathbf{S}$ is the 3D shape matrix, $\mathbf{S}_0$ is the mean shape, $\mathbf{S}_{id}^i$ is the $i$th identity basis, $\mathbf{S}_{exp}^i$ is the $i$th expression basis, $p_{id}^i$ is the $i$th identity coefficient and $p_{exp}^i$ is the $i$th expression coefficient. The collection of both coefficients is denoted as the shape parameter of a 3D face, $\mathbf{p} = (\mathbf{p}_{id}^{\top}, \mathbf{p}_{exp}^{\top})^{\top}$. We use the Basel 3D face model as the identity bases [86] and FaceWarehouse as the expression bases [25]. The 3D shape $\mathbf{S}$, along with $\mathbf{S}_0$, $\mathbf{S}_{id}^i$, and $\mathbf{S}_{exp}^i$, is a $3 \times Q$ matrix which contains the $x$, $y$ and $z$ coordinates of $Q$ vertexes on the 3D face surface,

\mathbf{S} = \begin{pmatrix} x_1 & x_2 & \cdots & x_Q \\ y_1 & y_2 & \cdots & y_Q \\ z_1 & z_2 & \cdots & z_Q \end{pmatrix}. \quad (3.2)

Any 3D face model will be projected onto a 2D image where the face shape may be represented as a sparse set of $N$ landmarks, on the fiducial facial points. We denote the $x$ and $y$ coordinates of these 2D landmarks as a matrix $\mathbf{U}$,

\mathbf{U} = \begin{pmatrix} u_1 & u_2 & \cdots & u_N \\ v_1 & v_2 & \cdots & v_N \end{pmatrix}. \quad (3.3)

The relationship between the 3D shape $\mathbf{S}$ and the 2D landmarks $\mathbf{U}$ can be described by the weak perspective projection, i.e.,

\mathbf{U} = s\,\mathbf{R}\,\mathbf{S}(:, \mathbf{d}) + \mathbf{t}, \quad (3.4)

where $s$ is a scale parameter, $\mathbf{R}$ is the first two rows of a $3 \times 3$ rotation matrix controlled by three rotation angles $\alpha$, $\beta$, and $\gamma$ (pitch, yaw, roll), $\mathbf{t}$ is a translation parameter composed of $t_x$ and $t_y$, and $\mathbf{d}$ is an $N$-dim index vector indicating the indexes of semantically meaningful 3D vertexes that correspond to 2D landmarks. We form a projection vector $\mathbf{m} = (s, \alpha, \beta, \gamma, t_x, t_y)^{\top}$ which collects all parameters related to this projection. We assume the weak perspective projection model with six degrees of freedom, which is a typical model used in much prior face-related work [48, 121].
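As an illustration of Eqns. 3.1-3.4, the following NumPy sketch builds a dense shape from the shape parameter and projects selected vertexes with the weak perspective model. The Euler-angle composition order is an assumption, since the chapter does not spell out the rotation convention:

```python
import numpy as np

def render_shape(S0, S_id, S_exp, p_id, p_exp):
    """Eq. 3.1: dense 3D shape as mean + identity + expression combination.
    S0: (3, Q); S_id: (N_id, 3, Q); S_exp: (N_exp, 3, Q)."""
    return S0 + np.tensordot(p_id, S_id, axes=1) + np.tensordot(p_exp, S_exp, axes=1)

def euler_to_R(alpha, beta, gamma):
    """Full 3x3 rotation from pitch/yaw/roll angles (radians)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx                                  # assumed composition order

def project(S, m, d):
    """Eq. 3.4: weak perspective projection of the selected vertexes d."""
    s, alpha, beta, gamma, tx, ty = m
    R = euler_to_R(alpha, beta, gamma)[:2, :]            # first two rows
    return s * (R @ S[:, d]) + np.array([[tx], [ty]])
```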
Figure 3.3: The landmark marching process for updating vector d. (a-b) show the paths of cheek landmarks on the mean shape; (c) is the estimated face shape; (d) is the estimated face shape by ignoring the roll rotation; and (e) shows the locations of landmarks on the cheek.

At this point, we can represent any 2D face shape as the projection of a 3D face shape. In other words, the projection parameter $\mathbf{m}$ and shape parameter $\mathbf{p}$ can uniquely represent a 2D face shape. Therefore, the face alignment problem amounts to estimating $\mathbf{m}$ and $\mathbf{p}$, given a face image. Estimating $\mathbf{m}$ and $\mathbf{p}$ instead of estimating $\mathbf{U}$ is motivated by a few factors. First, without the 3D modeling, it is non-trivial to model the out-of-plane rotation, which has a varying number of landmarks depending on the rotation angle. Second, as pointed out by [121], by only using 1/6 of the number of the shape bases, 3DMM can have an equivalent representation power as its 2D counterpart. Hence, using the 3D model leads to a more compact representation of shape parameters.

Cheek landmarks correspondence. The projection relationship in Eqn. 3.4 is correct for frontal-view faces, given a constant index vector $\mathbf{d}$. However, as soon as a face turns to a non-frontal view, the original 3D landmarks on the cheek become invisible in the 2D image. Yet most 2D face alignment algorithms still detect 2D landmarks on the contour of the cheek, termed "cheek landmarks". Therefore, in order to still maintain the 3D-to-2D correspondences of Eqn. 3.4, it is desirable to estimate the 3D vertexes that match with these cheek landmarks. A few prior works have proposed various approaches to handle this [24, 92, 150]. In this paper, we leverage the landmark marching method proposed in [150]. Specifically, we define a set of paths, each storing the indexes of vertexes that are not only the closest ones to the original 3D cheek landmarks, but also on the contour of the 3D face as it turns.

Data: Estimated 3D face S and projection matrix parameter m
Result: Index vector d
/* Rotate S by the estimated α, β */
Ŝ = R(α, β, 0) S
if 0° < β < 70° then
    foreach i = 1, ..., 4 do
        V_cheek(i) = argmax_id Ŝ(1, Path_cheek(i))
if −70° < β < 0° then
    foreach i = 5, ..., 8 do
        V_cheek(i) = argmin_id Ŝ(1, Path_cheek(i))
Update 8 elements of d with V_cheek.

Figure 3.4: Landmark marching g(S, m).

Given a non-frontal 3D face $\mathbf{S}$, by ignoring the roll rotation $\gamma$, we rotate $\mathbf{S}$ using the $\alpha$ and $\beta$ angles (pitch and yaw), and search for a vertex in each path that has the maximum (minimum) $x$ coordinate, i.e., the boundary vertex on the right (left) cheek. These resulting vertexes will be the new 3D landmarks that correspond to the 2D cheek landmarks. We then update the relevant elements of $\mathbf{d}$ to make sure these vertexes are selected in the projection of Eqn. 3.4. This landmark marching process is summarized in Algorithm 3.4 as a function $\mathbf{d} \leftarrow g(\mathbf{S}, \mathbf{m})$. Note that when the face is approximately of profile view ($|\beta| > 70°$), we do not apply landmark marching since the marched landmarks would overlap with the existing 2D landmarks on the middle of the nose and mouth. Fig. 3.3 shows the set of defined paths on the 3D shape of the face and one example of applying Algorithm 3.4 for updating vector $\mathbf{d}$.

3.2.2 Data Augmentation

Given that the projection matrix parameter $\mathbf{m}$ and shape parameter $\mathbf{p}$ are the representation of a face shape, we should have a collection of face images with ground truth $\mathbf{m}$ and $\mathbf{p}$ so that the learning algorithm can be applied. However, while $\mathbf{U}$ can be manually labeled on a face image, $\mathbf{m}$ and $\mathbf{p}$ are normally unavailable unless a 3D scan is captured along with a face image. For most existing face alignment databases, such as the AFLW database [61], only 2D landmark locations and sometimes the visibilities of landmarks are manually labeled, with no associated 3D information such as $\mathbf{m}$ and $\mathbf{p}$. In order to make the learning possible, we propose a data augmentation process for 2D face images, with the goal of estimating their $\mathbf{m}$ and $\mathbf{p}$ representation. Specifically, given the labeled visible 2D landmarks $\mathbf{U}$ and the landmark visibilities $\mathbf{V}$, we estimate $\mathbf{m}$ and $\mathbf{p}$ by minimizing the following objective function:

J(\mathbf{m}, \mathbf{p}) = \left\lVert \left( s\,\mathbf{R}\,\mathbf{S}(:, g(\mathbf{S}, \mathbf{m})) + \mathbf{t} - \mathbf{U} \right) \odot \mathbf{V} \right\rVert_F^2, \quad (3.5)

which is the difference between the projection of the 3D landmarks and the 2D labeled landmarks.
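A minimal sketch of the marching function g(S, m) used in Eqn. 3.5, reusing euler_to_R from the sketch above. How the pre-defined cheek paths are stored, that angles are in radians, and that the first eight entries of d index the cheek landmarks are all assumptions made for illustration:

```python
import numpy as np

def landmark_marching(S, m, d, cheek_paths):
    """Sketch of g(S, m) (Algorithm 3.4). cheek_paths: list of 8 vertex-index
    arrays, right-cheek paths first (an assumed storage convention)."""
    _, alpha, beta, gamma, _, _ = m
    S_rot = euler_to_R(alpha, beta, 0.0) @ S            # rotate, ignoring roll
    d = d.copy()
    yaw = np.degrees(beta)
    if 0 < yaw < 70:                                     # update right-cheek landmarks
        for i in range(4):
            path = cheek_paths[i]
            d[i] = path[np.argmax(S_rot[0, path])]       # boundary vertex: max x
    elif -70 < yaw < 0:                                  # update left-cheek landmarks
        for i in range(4, 8):
            path = cheek_paths[i]
            d[i] = path[np.argmin(S_rot[0, path])]       # boundary vertex: min x
    return d                                             # profile views: d unchanged
```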
Note that although the landmark marching $g(\cdot, \cdot)$ makes cheek landmarks "visible" for non-profile views, the visibility $\mathbf{V}$ is still necessary to avoid invisible landmarks, such as outer eye corners and half of the face at the profile view, being part of the optimization.

3.2.2.1 Optimization

For convenient optimization of Eqn. 3.5, we define all projection parameters as a projection matrix, i.e.,

\mathbf{M} = \begin{bmatrix} s\mathbf{R} & \begin{matrix} t_x \\ t_y \end{matrix} \end{bmatrix} \in \mathbb{R}^{2 \times 4}. \quad (3.6)

Also, we fix $\mathbf{d} = g(\mathbf{S}, \mathbf{m})$ in Eqn. 3.5 by assuming it is a constant given the currently estimated $\mathbf{m}$ and $\mathbf{p}$. We then rewrite Eqn. 3.5 as,

J(\mathbf{M}, \mathbf{p}) = \left\lVert \left( \mathbf{M} \begin{bmatrix} \mathbf{S}(:, \mathbf{d}) \\ \mathbf{1}^{\top} \end{bmatrix} - \mathbf{U} \right) \odot \mathbf{V} \right\rVert_F^2. \quad (3.7)

To minimize this objective function, we alternate the minimization w.r.t. $\mathbf{M}$ and $\mathbf{p}$ at each iteration. We initialize the 3D shape parameter $\mathbf{p} = \mathbf{0}$ and estimate $\mathbf{M}$ by $\mathbf{M}^k = \arg\min_{\mathbf{M}} J(\mathbf{M}, \mathbf{p}^{k-1})$,

\mathbf{M}^k = \mathbf{U}_V \begin{bmatrix} \mathbf{S}(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix}^{\top} \left( \begin{bmatrix} \mathbf{S}(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix} \begin{bmatrix} \mathbf{S}(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix}^{\top} \right)^{-1}, \quad (3.8)

where $\mathbf{U}_V$ is the zero-mean positions (by removing the mean from all the elements) of the visible 2D landmarks, and $\mathbf{d}_V$ is a vector containing the indexes of the visible landmarks. Given the estimated $\mathbf{M}^k$, we then use the Singular Value Decomposition (SVD) to decompose it into various elements of the projection parameter $\mathbf{m}$, i.e., $\mathbf{M}^k = \mathbf{B}\mathbf{D}\mathbf{Q}^{\top}$. The diagonal element of $\mathbf{D}$ is the scale $s$, and we decompose the rotation matrix $\mathbf{R} = \mathbf{B}\mathbf{Q}^{\top} \in \mathbb{R}^{2 \times 3}$ into the three rotation angles $(\alpha, \beta, \gamma)$. Finally, the mean values of $\mathbf{U}$ are the translation parameters $t_x$ and $t_y$.

Then, we estimate $\mathbf{p}^k = \arg\min_{\mathbf{p}} J(\mathbf{M}^k, \mathbf{p})$. Given the orthogonal bases of the 3DMM, we choose to compute each element of $\mathbf{p}$ one by one. That is, $p_{id}^i$ is the contribution of the $i$-th identity basis in reconstructing the dense 3D face shape,

p_{id}^i = \frac{\operatorname{Tr}\!\left( \hat{\mathbf{U}}_V^{\top} \hat{\mathbf{U}}_{id_i} \right)}{\operatorname{Tr}\!\left( \hat{\mathbf{U}}_{id_i}^{\top} \hat{\mathbf{U}}_{id_i} \right)}, \quad (3.9)

where

\hat{\mathbf{U}}_V = \mathbf{M}^k \begin{bmatrix} \mathbf{S}(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix}, \qquad \hat{\mathbf{U}}_{id_i} = \mathbf{M}^k \begin{bmatrix} \mathbf{S}_{id}^i(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix}.

Here $\hat{\mathbf{U}}_V$ is the current residual of the positions of the 2D visible landmarks after subtracting the contribution of $\mathbf{M}^k$, and $\operatorname{Tr}(\cdot)$ is the trace function. Once $p_{id}^i$ is computed, we update $\hat{\mathbf{U}}_V$ by subtracting the contribution of the $i$-th basis and continue to compute $p_{id}^{i+1}$. We alternately estimate $\mathbf{M}$ and $\mathbf{p}$ until the changes of $\mathbf{M}$ and $\mathbf{p}$ are small enough. After each step of applying Eqn. 3.8 for computing a new estimation of $\mathbf{M}$ and decomposing it into its parameters $\mathbf{m}$, we apply the landmark marching algorithm (Algorithm 3.4) to update the vector $\mathbf{d}$.

3.2.3 Cascaded CNN Coupled-Regressor

Given a set of $N_d$ training face images and their augmented (i.e., "ground truth") $\mathbf{m}$ and $\mathbf{p}$ representation, we are interested in learning a mapping function that is able to predict $\mathbf{m}$ and $\mathbf{p}$ from the appearance of a face. Clearly this is a complicated non-linear mapping due to the diversity of facial appearance. Given the success of CNN in vision tasks such as pose estimation [88], face detection [64], and face alignment [142], we decide to marry the CNN with the cascade regressor framework by learning a series of CNN-based regressors to alternate the estimation of $\mathbf{m}$ and $\mathbf{p}$. To the best of our knowledge, this is the first time CNN is used in 3D face alignment, with the estimation of over 10 landmarks.

For each training image $\mathbf{I}_i$, in addition to the ground truth $\mathbf{m}_i$ and $\mathbf{p}_i$, we also initialize the image's representation by $\mathbf{m}_i^0 = h(\bar{\mathbf{m}}, \mathbf{b}_i)$ and $\mathbf{p}_i^0 = \mathbf{0}$. Here $\bar{\mathbf{m}}$ is the average of the ground truth parameters of the projection matrices in the training set, $\mathbf{b}_i$ is a 4-dim vector indicating the bounding box location, and $h(\mathbf{m}, \mathbf{b})$ is a function that modifies the scale and translations of $\mathbf{m}$ based on $\mathbf{b}$.
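The two inner steps of the alternating fit in Sec. 3.2.2.1 reduce to a linear least-squares solve (Eqn. 3.8) and a trace ratio (Eqn. 3.9); a NumPy sketch, assuming the visible-landmark selection and mean removal have already been applied:

```python
import numpy as np

def fit_projection_matrix(U_vis, S_vis):
    """Eq. 3.8: closed-form 2x4 projection matrix M.
    U_vis: (2, Nv) zero-mean visible 2D landmarks; S_vis: (3, Nv) matching 3D points."""
    S_h = np.vstack([S_vis, np.ones((1, S_vis.shape[1]))])   # homogeneous, 4 x Nv
    return U_vis @ S_h.T @ np.linalg.inv(S_h @ S_h.T)        # (2, 4)

def basis_coefficient(U_res, U_basis):
    """Eq. 3.9: contribution of one projected basis to the current landmark residual."""
    return np.trace(U_res.T @ U_basis) / np.trace(U_basis.T @ U_basis)
```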
Thus, at stage $k$ of the cascaded CNN, we can learn a CNN to estimate the desired update of the projection matrix parameter,

\Theta_m^k = \arg\min_{\Theta_m^k} J_{\Theta} = \sum_{i=1}^{N_d} \left\lVert \Delta\mathbf{m}_i^k - \text{CNN}_m^k(\mathbf{I}_i, \mathbf{U}_i, \mathbf{v}_i^{k-1}; \Theta_m^k) \right\rVert^2, \quad (3.10)

where the true projection update is the difference between the current projection matrix parameter and the ground truth, i.e., $\Delta\mathbf{m}_i^k = \mathbf{m}_i - \mathbf{m}_i^{k-1}$, $\mathbf{U}_i$ is the current estimated 2D landmarks, computed via Eqn. 3.4 based on $\mathbf{m}_i^{k-1}$ and $\mathbf{d}_i^{k-1}$, and $\mathbf{v}_i^{k-1}$ is the estimated landmark visibility at stage $k-1$.

Similarly, another CNN regressor can be learned to estimate the updates of the shape parameter,

\Theta_p^k = \arg\min_{\Theta_p^k} J_{\Theta} = \sum_{i=1}^{N_d} \left\lVert \Delta\mathbf{p}_i^k - \text{CNN}_p^k(\mathbf{I}_i, \mathbf{U}_i, \mathbf{v}_i^{k}; \Theta_p^k) \right\rVert^2. \quad (3.11)

Note that $\mathbf{U}_i$ will be re-computed via Eqn. 3.4, based on the $\mathbf{m}_i^k$ and $\mathbf{d}_i^k$ updated by CNN$_m$.

We use a six-stage cascaded CNN, including CNN$^1_m$, CNN$^2_m$, CNN$^3_p$, CNN$^4_m$, CNN$^5_p$, and CNN$^6_m$. At the first stage, the input layer of CNN$^1_m$ is the entire face region cropped by the initial bounding box, with the goal of roughly estimating the pose of the face. The input for the second to sixth stages is a 114×114 image that contains an array of 19×19 pose-invariant feature patches, extracted from the current estimated 2D landmarks $\mathbf{U}_i$. In our implementation, since we have $N = 34$ landmarks, the last two patches of the 114×114 image are filled with zero. Similarly, for invisible 2D landmarks, their corresponding patches will be filled with zeros as well. These feature patches encode sufficient information about the local appearance around the current 2D landmarks, which drives the CNN to optimize the parameters $\Theta_m^k$ or $\Theta_p^k$. Also, through concatenation, these feature patches share the information among different landmarks and jointly drive the CNN in parameter estimation. Our input representation can be extended to use a larger number of landmarks, and hence a more accurate dense 3D model can be estimated.

Note that since landmark marching is used, the estimated 2D landmarks $\mathbf{U}_i$ include the projection of the marched 3D landmarks, i.e., 2D cheek landmarks. As a result, the appearance features around these cheek landmarks are part of the input to the CNN as well. This is in sharp contrast to [52] where no cheek landmarks participate in the regressor learning. Effectively, these additional cheek landmarks serve as constraints to guide how the facial silhouettes at various poses should look, which essentially reflects the shape of the 3D face surface.

Another note is that, instead of alternating between the estimation of $\mathbf{m}$ and $\mathbf{p}$, another option is to jointly estimate both parameters in each CNN stage. Experimentally we observed that such a joint estimation schedule leads to a lower accuracy than the alternating scheme, potentially due to the different physical meanings of $\mathbf{m}$ and $\mathbf{p}$ and the ambiguity of multiple pairs of $\mathbf{m}$ and $\mathbf{p}$ corresponding to the same 2D shape. For the alternating scheme, we now present two different CNN architectures, and use the same CNN architecture for all six stages of the cascade.

Figure 3.5: Architecture of C-CNN (the same CNN architecture is used for all six stages). Color code used: purple = extracted image feature, orange = Conv, brown = pooling + batch normalization, blue = fully connected layer, red = ReLU. The size and the number of filters for each layer are shown on the top and the bottom respectively.

3.2.4 Conventional CNN (C-CNN)

The architecture of the CNN is shown in Fig. 3.5. It has three convolutional layers, where each one is followed by a pooling layer and a batch normalization layer. Then there is one fully connected layer with a ReLU layer and, at the end of the architecture, one fully connected layer and one Euclidean loss ($J_{\Theta}$) for estimating the projection matrix parameters or 3D shape parameters. We use the rectified linear unit (ReLU) [41] as the activation function, which enables the CNN to achieve the best performance without unsupervised pre-training.
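A hedged PyTorch sketch of one C-CNN stage follows. The layer types and their order match the description above, but the filter counts, hidden width and single-channel input are assumptions (the exact numbers appear only in Fig. 3.5); training would pair the output with nn.MSELoss, matching the Euclidean loss J_Θ:

```python
import torch
import torch.nn as nn

class StageCNN(nn.Module):
    """One C-CNN stage: three conv blocks (conv + pool + batch norm), a hidden
    fully connected layer with ReLU, and a linear output for the parameter update."""
    def __init__(self, out_dim):                 # out_dim: len(m)=6 or len(p)=228
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, 5), nn.MaxPool2d(2), nn.BatchNorm2d(20),
            nn.Conv2d(20, 40, 3), nn.MaxPool2d(2), nn.BatchNorm2d(40),
            nn.Conv2d(40, 60, 3), nn.MaxPool2d(2), nn.BatchNorm2d(60),
        )
        self.fc1 = nn.Linear(60 * 12 * 12, 500)  # 114x114 input shrinks to 12x12 maps
        self.out = nn.Linear(500, out_dim)

    def forward(self, x):
        h = self.features(x)
        h = torch.relu(self.fc1(h.flatten(1)))
        return self.out(h)                       # estimated update of m or p
```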
3.2.5 Mirror CNN (M-CNN)

We deal with two inherent ambiguities when we estimate the projection matrix parameter $\mathbf{m}$ and 3D shape parameter $\mathbf{p}$. First, multiple pairs of $\mathbf{m}$ and $\mathbf{p}$ can represent the same 2D face shape. Second, the estimated updates of $\mathbf{m}$ and $\mathbf{p}$ are not explicitly related to the face alignment error. In other words, the changes in $\mathbf{m}$ and $\mathbf{p}$ are not linearly related to the 2D shape changes. To remedy these ambiguities, we predict the 2D shape update simultaneously while estimating the $\mathbf{m}$ and $\mathbf{p}$ updates. We extend the CNN architecture of each cascade stage by encouraging the alignment results of a face image and its mirror to be highly correlated. To this end, we use the idea of the mirrorability constraint [129] with two main differences. First, we combine this constraint with the learning procedure rather than using it as a post-processing step. Second, we integrate the mirrorability constraint inside a siamese CNN [20] by sharing the network's weights between the input face image and its mirror image and adding a new loss function.

3.2.5.1 Mirror Loss

Given the input image and its mirror image with their initial bounding boxes, we use the function $h(\bar{\mathbf{m}}, \mathbf{b})$, which modifies the scale and translations of $\bar{\mathbf{m}}$ based on $\mathbf{b}$, for initialization. Then, according to the mirrorability constraint, we assume that the estimated update of the shape for the input image should be similar to the update of the shape for the mirror image, up to a reordering. This assumption is true when both images are initialized with the same landmarks up to a reordering, which holds in all cascade stages. We use the mirror loss to minimize the Euclidean distance between the estimated shape updates of the two images. The mirror loss at stage $k$ is,

J_M^k = \left\lVert \Delta\hat{\mathbf{U}}^k - \mathcal{C}\!\left( \Delta\hat{\mathbf{U}}_M^k \right) \right\rVert^2, \quad (3.12)

where $\Delta\hat{\mathbf{U}}^k$ is the input image's shape update, $\Delta\hat{\mathbf{U}}_M^k$ is the mirror image's shape update and $\mathcal{C}(\cdot)$ is a reordering function to indicate the landmark correspondence between the mirrored images.
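A small sketch of the mirror loss of Eqn. 3.12. The reordering C(·) is represented as a permutation array; whether the horizontal sign flip of the x components is folded into C(·) or applied separately is a convention the chapter leaves implicit, and this sketch applies it explicitly:

```python
import numpy as np

def mirror_loss(dU, dU_mirror, reorder):
    """Eq. 3.12: squared distance between an image's shape update and the
    reordered update of its horizontally flipped copy.
    dU, dU_mirror: (2, N) landmark updates; reorder: permutation mapping mirrored
    landmark indexes back to the original ones."""
    dU_m = dU_mirror[:, reorder].copy()
    dU_m[0] = -dU_m[0]                  # assumed: x-direction flips under mirroring
    return np.sum((dU - dU_m) ** 2)
```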
Figure 3.6: Architecture of the M-CNN (the same CNN architecture is used for all six stages). Color code used: purple = extracted image feature, orange = Conv, brown = pooling + batch normalization, green = locally connected layer, blue = fully connected layer, red = batch normalization + ReLU + dropout. The size and the number of filters of each layer are shown on the top and the bottom of the top branch respectively.

3.2.5.2 Mirror CNN Architecture

The new CNN architecture follows the siamese network [20] with two branches whose weights are shared. Fig. 3.6 shows the architecture of the M-CNN. The top and bottom branches are fed with the extracted input features from a training image and its mirror, respectively. Each branch has two convolutional layers and two locally connected layers. The locally connected layer [102] is similar to a convolutional layer but learns a separate set of filters for the various regions of its input. The locally connected layers are spatial-location dependent, which is a correct assumption for our extracted image feature at each stage. After each of these layers, we have one pooling and one batch normalization layer. At the end of the top branch, after a fully connected layer, batch normalization, ReLU and dropout layers, we have two fully connected layers, one for estimating the update of the parameters ($J_{\Theta}$) and the other for estimating the update of the 2D shape via the loss ($J_U$),

J_U^k = \left\lVert \Delta\mathbf{U}^k - \Delta\hat{\mathbf{U}}^k \right\rVert^2. \quad (3.13)

In the bottom branch, we only have one loss ($J_{MU}$) for estimating the update of the 2D shape in the mirror image. In total, we have four loss functions: one for the updates of $\mathbf{m}$ or $\mathbf{p}$, two for the 2D shape updates of the two images respectively, and one mirror loss. We minimize the total loss at stage $k$,

J_T^k = J_{\Theta}^k + \lambda_1 J_U^k + \lambda_2 J_{MU}^k + \lambda_3 J_M^k, \quad (3.14)

where $\lambda_1$ to $\lambda_3$ are weights for the loss functions. Although M-CNN appears more complicated to train than C-CNN, their testing procedures are the same. That is, the only useful result at each cascade stage of M-CNN is the estimated update of $\mathbf{m}$ or $\mathbf{p}$, which is also passed to the next stage to initialize the input image features. In other words, the mirror images and the estimated $\Delta\mathbf{U}$ of both images only serve as constraints in training, and are neither needed nor used in testing.

3.2.6 Visibility and 2D Appearance Features

One notable advantage of employing a dense 3D shape model is that more advanced 2D features, which might only be possible because of the 3D model, can be extracted and contribute to the cascaded CNN learning. In this work, these 2D features refer to the 2D landmark visibility and the appearance patch around each 2D landmark.

In order to compute the visibility of each 2D landmark, we leverage the basic idea of examining whether the 3D surface normal of the corresponding 3D landmark is pointing to the camera or not, under the current camera projection matrix [52]. Instead of using the average 3D surface normal for all humans, we extend it by using person-specific 3D surface normals. Specifically, given the current estimated 3D shape $\mathbf{S}$, we compute the 3D surface normals for a set of sparse vertexes around the 3D landmark of interest, and the average of these 3D normals is denoted as $\tilde{\mathbf{N}}$. Fig. 3.7 illustrates the advantage of using the average 3D surface normal. Given $\tilde{\mathbf{N}}$, we compute,

v = \tilde{\mathbf{N}}^{\top} (\mathbf{R}_1 \times \mathbf{R}_2), \quad (3.15)

where $\mathbf{R}_1$ and $\mathbf{R}_2$ are the two rows of $\mathbf{R}$. If $v$ is positive, the 2D landmark is considered visible and its 2D appearance feature will be part of the input for the CNN. Otherwise, it is invisible and the corresponding feature will be zero for the CNN. Note that this method does not estimate occlusion due to other objects such as hair.

Figure 3.7: The person-specific 3D surface normal as the average of normals around a 3D landmark (black arrow). Notice the relatively noisy surface normal of the 3D "left eye corner" landmark (blue arrow).

In addition to visibility estimation, a 3D shape model can also contribute to generating advanced appearance features as the input layer for the CNN. Specifically, we aim to extract a pose-invariant appearance patch around each estimated 2D landmark, and the array of these patches will form the input layer. In [128], a similar feature extraction is proposed by putting different scales of the input image together and forming a big image as the appearance feature. We now describe two proposed approaches to extract an appearance feature, i.e., a 19×19 patch, for the $n$th 2D landmark.
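Eqn. 3.15 amounts to a sign test between the averaged person-specific normal and the camera-facing axis; a minimal NumPy sketch:

```python
import numpy as np

def landmark_visibility(normals, R):
    """Eq. 3.15: a landmark is visible when the averaged surface normal around it
    points toward the camera. normals: (3, T) normals of sparse vertexes near the
    landmark; R: (2, 3) rotation rows of the projection."""
    n_avg = normals.mean(axis=1)
    n_avg /= np.linalg.norm(n_avg)       # person-specific average normal
    cam_axis = np.cross(R[0], R[1])      # direction toward the camera
    return float(n_avg @ cam_axis) > 0
```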
Piecewise affine-warped feature (PAWF). Feature correspondence is always very important for any visual learning, as evident by the importance of eye-based rectification to face recognition. Yet, due to the fact that a 2D face is the projection of a 3D surface with an arbitrary view angle, it is hard to make sure that a local patch extracted from this 2D image corresponds to the patch from another image, even when both patches are centered at the ground truth locations of the same $n$th 2D landmark. Here, "correspond" means that the patches cover exactly the same local region of the face anatomically. However, with a dense 3D shape model in hand, we may extract local patches across different subjects and poses with anatomical correspondence. These correspondences across subjects and poses facilitate the CNN to learn the appearance variation induced by misalignment, rather than by subjects or poses.

In an offline procedure, we first search for the $T$ vertexes on the mean 3D shape $\mathbf{S}_0$ that are the closest to the $n$th landmark (Fig. 3.8 (b)). Second, we rotate the $T$ vertexes such that the 3D surface normal of the $n$th landmark points toward the camera (Fig. 3.8 (c)). Third, among the $T$ vertexes we find four "neighborhood vertexes", which have the minimum and maximum $x$ and $y$ coordinates, and denote the four vertex IDs as a 4-dim vector $\mathbf{d}_p^{(n)}$ (Fig. 3.8 (d)). The first row of Fig. 3.8 shows the process of extracting PAWF for the right landmark of the right eye.

Figure 3.8: Feature extraction process, (a-e) PAWF for the landmark on the right side of the right eye, (f-j) D3PF for the landmark on the right side of the lip.

Figure 3.9: Examples of extracting PAWF. When one of the four neighborhood points (red point in the bottom-right) is invisible, it connects to the 2D landmark (green point), extends the same distance further, and generates a new neighborhood point. This helps to include the background context around the nose.

During the CNN learning, for the $n$th landmark of the $i$th image, we project the four neighborhood vertexes onto the $i$th image and obtain four neighborhood points, $\mathbf{U}_i^{(n)} = s\mathbf{R}\mathbf{S}(:, \mathbf{d}_p^{(n)}) + \mathbf{t}$, based on the current estimated projection parameter $\mathbf{m}$. Across all 2D face images, the $\mathbf{U}_i^{(n)}$ correspond to the same face vertexes anatomically. Therefore, we warp the imagery content within these neighborhood points to a 19×19 patch by using the piecewise affine transformation [78]. This novel feature representation can be well extracted in most cases, except for cases such as the nose tip at the profile view. In such cases, the projection of the $n$th landmark is outside the region defined by the neighborhood points, where one of the neighborhood points is invisible due to occlusion. When this happens, we change the location of the invisible point by using its relative distance to the projected landmark location, as shown in Fig. 3.9.

Direct 3D projected feature (D3PF). Both D3PF and PAWF start with the $T$ vertexes surrounding the $n$th 3D landmark (Fig. 3.8 (g)). Instead of finding four neighborhood vertexes as in PAWF, D3PF overlays a 19×19 grid covering the $T$ vertexes, and stores the vertexes of the grid points in $\mathbf{d}_d^{(n)}$ (Fig. 3.8 (i)). The second row of Fig. 3.8 shows the process of extracting D3PF. Similar to PAWF, we can now project the set of 3D vertexes $\mathbf{S}(:, \mathbf{d}_d^{(n)})$ onto the 2D image and extract a 19×19 patch via bilinear interpolation, as shown in Fig. 3.10. We also estimate the visibilities of the 3D vertexes $\mathbf{S}(:, \mathbf{d}_d^{(n)})$ via their surface normals, and zero will be placed in the patch for invisible ones. For D3PF, every pixel in the patch will correspond to the same pixel in the patches of other images, while for PAWF, this is true only for the four neighborhood points.

Figure 3.10: Example of extracting D3PF.
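A sketch of the PAWF patch extraction described above, reusing project from the earlier 3DMM sketch. For brevity it warps with a single perspective transform from the four neighborhood points (via OpenCV) rather than the piecewise affine transformation of [78], and it assumes the four points are ordered consistently with the destination corners:

```python
import cv2
import numpy as np

def extract_pawf(img, S, m, d_p, size=19):
    """Warp the region enclosed by the four projected neighborhood vertexes d_p
    into a size x size pose-invariant patch."""
    pts = project(S, m, d_p).T.astype(np.float32)        # (4, 2) neighborhood points
    dst = np.array([[0, 0], [size - 1, 0],
                    [size - 1, size - 1], [0, size - 1]], dtype=np.float32)
    H = cv2.getPerspectiveTransform(pts, dst)            # assumes matching point order
    return cv2.warpPerspective(img, H, (size, size))
```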
3.2.7 Testing

The testing procedures of both C-CNN and M-CNN are the same. Given a testing image $\mathbf{I}$ and its initial parameters $\mathbf{m}^0$ and $\mathbf{p}^0$, we apply the learned cascaded CNN coupled-regressor for face alignment. Basically, we iteratively use $R_m^k(\cdot; \Theta_m^k)$ to compute $\Delta\hat{\mathbf{m}}$ and update $\mathbf{m}^k$, then use $R_p^k(\cdot; \Theta_p^k)$ to compute $\Delta\hat{\mathbf{p}}$ and update $\mathbf{p}^k$. Finally, the dense 3D shape is constructed via Eqn. 3.1, and the estimated 2D landmarks are $\hat{\mathbf{U}} = s\mathbf{R}\hat{\mathbf{S}}(:, \mathbf{d}) + \mathbf{t}$. Note that we apply the feature extraction procedure one time for each CNN stage.

3.3 Experimental Results

In this section, we design experiments to answer the following questions: (1) What is the performance of the proposed method on challenging datasets in comparison to the state-of-the-art methods? (2) How do different feature extraction methods perform in pose-invariant face alignment? (3) What is the performance of the proposed method with different CNN architectures and with different deep learning toolboxes?

Figure 3.11: (a) AFLW original (yellow) and added landmarks (green), (b) comparison of the mean NME of each landmark for RCPR (blue) and the proposed method (green). The radius of the circles is determined by the mean NME multiplied with the face bounding box size.

3.3.1 Experimental Setup

Databases. Given that this work focuses on pose-invariant face alignment, we choose two publicly available face datasets with labeled landmarks and a wide range of poses.

The AFLW database [61] is a large face dataset with 25K face images. Each image is manually labeled with up to 21 landmarks, with a visibility label for each landmark. In [52], a subset of AFLW is selected to have a balanced distribution of yaw angles, including 3,901 images for training and 1,299 images for testing. We use the same subset and manually label 13 additional landmarks for all 5,200 images. We call these 3,901 images the base training set. The locations of the original landmarks and the added landmarks are shown in Fig. 3.11 (a). Using the ground truth landmarks of each image, we find the tightest bounding box, expand it by 10% of its size, and add 10% noise to the top-left corner, width and height of the bounding box (examples in the 1st row of Fig. 3.17). These randomly generated bounding boxes mimic the imprecise face detection window and will be used for both training and testing.

The AFW dataset [151] contains 468 faces in 205 images. Each face image is manually labeled with up to 6 landmarks and has a visibility label for each landmark. For each face image a detected bounding box is provided, which will be used as initialization. Given the small number of images, we only use this dataset for testing.

We use the $N_{id} = 199$ bases of the Basel Face Model [86] for representing identity variation and the $N_{exp} = 29$ bases of FaceWarehouse [25] for representing expression variation. In total, there are 228 bases representing 3D face shapes with 53,215 vertexes.

Synthetic training data. Unlike conventional face alignment, one of the main challenges in pose-invariant face alignment is the limited training images. There are only two publicly available face databases with a wide range of poses, along with landmark labeling. Therefore, utilizing synthetic face images is an efficient way to supply more images into the training set. Specifically, we add 16,556 face images with various poses, generated from 1,035 subjects of the LFPW dataset [7] by the method of [150], to the base training set. We call this new training set the extended training set.
Baseline selection. Given the explosion of face alignment work in recent years, it is important to choose appropriate baseline methods so as to make sure the proposed method advances the state of the art. We select the most recent pose-invariant face alignment methods for comparing with the proposed method. We compare the proposed method with two methods on AFLW: 1) PIFA [52] is a pose-invariant face alignment method which aligns faces of arbitrary poses with the assistance of a sparse 3D point distribution model; 2) RCPR [22] is a method based on a cascade of regressors that represents the occlusion-invariant face alignment. For comparison on AFW, we select three methods: 1) PIFA [52]; 2) CDM [135], a method based on the Constrained Local Model (CLM) and the first one claimed to perform pose-free face alignment; 3) TSPM [151], which is based on a mixture of trees with a shared pool of parts and can handle face alignment for large-pose face images. It can be seen that these baselines are most relevant to our focus on pose-invariant face alignment.

Parameter setting. For implementing the proposed methods, we use two different deep learning toolboxes. For implementing the C-CNN architecture, we use the MatConvNet toolbox [113] with a constant learning rate of 1e-4, with ten epochs for training each CNN and a batch size of 100. For the M-CNN architecture, we use the Caffe toolbox [49] with a learning rate of 1e-7 and the step learning rate policy with a drop rate of 0.9, in 70 epochs at each stage and a batch size of 100. We set the weight parameters $\lambda_1$ to $\lambda_3$ of the total loss in Eqn. 3.14 to 1. For RCPR, we use the parameters reported in its paper, with 100 iterations and 15 boosted regressors. For PIFA, we use 200 iterations and 5 boosted regressors. For PAWF and D3PF, at the second stage $T$ is 5,000, and 3,000 for the other stages. According to our empirical evaluation, six stages of CNN are sufficient for convergence of the fitting process.

Table 3.1: NME (%) of the proposed method with different features with the C-CNN architecture and the base training set.

PAWF + Cheek Landmarks | D3PF + Cheek Landmarks | PAWF | Extracted Patch
4.72 | 5.02 | 5.19 | 5.51

Evaluation metrics. Given the ground truth 2D landmarks $\mathbf{U}_i$, their visibility $\mathbf{v}_i$, and the estimated landmarks $\hat{\mathbf{U}}_i$ of $N_t$ testing images, we use two conventional metrics for measuring the error of up to 34 landmarks: 1) Mean Average Pixel Error (MAPE) [135], which is the average of the estimation errors for visible landmarks, i.e.,

\text{MAPE} = \frac{1}{\sum_{i}^{N_t} |\mathbf{v}_i|_1} \sum_{i,j}^{N_t, N} \mathbf{v}_i(j)\, \lVert \hat{\mathbf{U}}_i(:,j) - \mathbf{U}_i(:,j) \rVert, \quad (3.16)

where $|\mathbf{v}_i|_1$ is the number of visible landmarks of image $\mathbf{I}_i$, and $\mathbf{U}_i(:,j)$ is the $j$th column of $\mathbf{U}_i$. 2) Normalized Mean Error (NME), which is the average of the normalized estimation error of visible landmarks, i.e.,

\text{NME} = \frac{1}{N_t} \sum_{i}^{N_t} \left( \frac{1}{d_i |\mathbf{v}_i|_1} \sum_{j}^{N} \mathbf{v}_i(j)\, \lVert \hat{\mathbf{U}}_i(:,j) - \mathbf{U}_i(:,j) \rVert \right), \quad (3.17)

where $d_i$ is the square root of the face bounding box size [52]. The eye-to-eye distance is not used in NME since it is not well defined in large poses such as profile.

Figure 3.12: Errors on the AFLW testing set after each stage of CNN for different feature extraction methods with the C-CNN architecture and the base training set. The initial error is 25.8%.
3.3.2 Comparison Experiments

Feature extraction methods. To show the advantages of the proposed features, Table 3.1 compares the accuracy of the proposed method on AFLW with 34 landmarks, with various feature presentations (i.e., the input layer for CNN² to CNN⁶). For this experiment, we use the C-CNN architecture with the base training set. The "Extracted Patch" refers to extracting a constant-size (19×19) patch centered at an estimated 2D landmark, from a face image normalized using the bounding box, which is a baseline feature widely used in conventional 2D alignment methods [139, 147]. For the feature "+ Cheek Landmarks", up to four additional 19×19 patches of the contour landmarks, which are invisible for non-frontal faces, are replaced with patches of the cheek landmarks, and used in the input layer of CNN learning. The PAWF achieves higher accuracy than the D3PF. Comparing Columns 1 and 3 of Table 3.1 shows that extracting features from cheek landmarks is effective in acting as additional visual cues for the cascaded CNN regressors. The combination of using the cheek landmarks and extracting PAWF achieves the highest accuracy, and will be used in the remaining experiments. Fig. 3.12 shows the errors on the AFLW testing set after each stage of CNN for different feature extraction methods. There is no difference in the errors of the first-stage CNN because it uses the global appearance in the bounding box, rather than the array of local features.

Table 3.2: The NME (%) of three methods on AFLW with the base training set.

Proposed method (C-CNN) | PIFA | RCPR
4.72 | 8.04 | 6.26

Figure 3.13: Comparison of NME for each pose with the C-CNN architecture and the base training set.

CNN is known for demanding a large training set, while the 3,901-image AFLW training set is relatively small from CNN's perspective. However, our CNN-based regressor is still able to learn and align well on unseen images. We attribute this fact to the effective appearance features proposed in this work, i.e., the superior feature correspondence enabled by the dense face model reduces CNN's demand for massive training data.

Experiments on the AFLW dataset. We compare the proposed method with the two most related methods for aligning faces with arbitrary poses. For both RCPR and PIFA, we use their source code to perform training on the base training set. The NME of the three methods on the AFLW testing set is shown in Table 3.2. The proposed method achieves better results than the two baselines. The error comparison for each landmark is shown in Fig. 3.11 (b). As expected, the contour landmarks have relatively higher errors, and the proposed method has lower errors than RCPR across all of the landmarks.

By using the ground truth landmark locations of the test images, we divide all test images into six subsets according to the estimated yaw angle of each image. Fig. 3.13 compares the proposed method with RCPR. The proposed method achieves better results across different poses and, more importantly, is more robust, having less variation across poses. For a detailed comparison of the NME distribution, the Cumulative Errors Distribution (CED) diagrams of various methods are shown in Fig. 3.14. The improvement seems to be consistent over all NME values, and is especially larger around lower NMEs (≤ 8%). We use the t-SNE toolbox [112] to apply dimension reduction on the output of the ReLU layer in the first-stage CNN. The output of each test image is reduced to a two-dimensional point and all test images are plotted based on the locations of their points (Fig. 3.15). This shows that the first-stage CNN can model the distribution of face poses.

Figure 3.14: The comparison of CED for different methods with the C-CNN architecture and the base training set.

Figure 3.15: Result of the proposed method after the first-stage CNN. This image shows that the first-stage CNN can model the distribution of face poses. The right-view faces are at the top, the frontal-view faces are at the middle, and the left-view faces are at the bottom.
Table 3.3: The NME (%) of three methods on AFLW with the extended training set and the Caffe toolbox.

Proposed method (M-CNN) | Proposed method (C-CNN) | RCPR
4.52 | 5.38 | 7.04

Experiments on the AFLW dataset with M-CNN. We use the extended training set and the mirror CNN architecture (M-CNN) for this experiment. We report the NME results of three methods in Table 3.3. The M-CNN architecture, which incorporates the mirror constraint during the learning, achieves approximately a 16% reduction of error over the C-CNN architecture implemented with the Caffe toolbox. This shows the effectiveness of the mirrorability constraint in the new architecture.

The comparison of Table 3.2 and Table 3.3 shows that the accuracy of the RCPR method is lower with the extended training set than with the base training set. We attribute this to the low quality of the side parts of the synthesized large-pose face images. Although the method in [150] can synthesize side-view face images, the synthesized images can have some artifacts on the side part of the face. These artifacts make it hard for the local fern features-based RCPR method to simultaneously estimate the location and the visibility of landmarks.

In our proposed method, we arrange the extracted PAWF patches in a spatial array and use it as the input to the CNN. An alternative CNN input is to assign the extracted PAWF patches to different channels and construct a 19×19×34 input datum. To evaluate its performance, considering the change of the input size, we modify the CNN architecture in Fig. 3.6 by removing the third and the fourth pooling layers. The NME of M-CNN with the extended training set is 4.91%, which shows that arranging the PAWF patches as a large image is still superior.

Experiments on the AFW dataset. The AFW dataset contains faces of all pose ranges with labels of 6 landmarks. We report the MAPE for six methods in Table 3.4. For PIFA, CDM and TSPM, we show the error reported in their papers. Again we see the consistent improvement of our proposed method (with both architectures) over the baseline methods.

Table 3.4: The MAPE of six methods on AFW.

Proposed method (M-CNN + PAWF) | Proposed method (C-CNN + PAWF) | Proposed method (C-CNN + D3PF) | PIFA | CDM | TSPM
6.52 | 7.43 | 7.83 | 8.61 | 9.13 | 11.09

Comparison of two CNN toolboxes. We utilize two toolboxes for our implementations. We use the MatConvNet toolbox [113] to implement the C-CNN architecture (Fig. 3.5). However, the MatConvNet toolbox has limited ability in defining different branches for a CNN, which is required to train a siamese network. Therefore, we use the Caffe toolbox [49] to implement the M-CNN architecture (Fig. 3.6). Based on our experiments on the AFLW test set, there are noticeable differences between the testing results of these two toolboxes.

Table 3.5 shows the detailed comparison of the C-CNN and M-CNN architectures with different settings. Settings 1 and 2 compare the implementations of the C-CNN architecture on the AFLW training set, using the MatConvNet and Caffe toolboxes respectively. It shows the superior accuracy of the MatConvNet implementation in all stages, even when the extended training set is provided in Setting 4. These different testing results of the two toolboxes might be due to two reasons. One is that the implementation of the basic building blocks, the optimization, and the default parameters could be different in the two toolboxes. The other is the random initialization of network parameters. The comparison of Settings 2 and 3 shows the superiority of the M-CNN architecture. Setting 5 includes our final result with the M-CNN architecture and the extended training set.
Landmark visibility estimation. For evaluating the accuracy of our visibility prediction, we utilize the ground truth 3D shape of the test images and compute the visibility labels of landmarks due to self-occlusion. We define the "visibility error" as the metric, which is the average of the ratios between the number of incorrectly estimated visibility labels and the total number of landmarks per image. The proposed method achieves a visibility error of 4.1%. If we break down the visibility error for each landmark, their distribution is shown in Fig. 3.16.

Table 3.5: The six-stage NMEs of implementing the C-CNN and M-CNN architectures with different training datasets and CNN toolboxes. The initial error is 25.8%.

Sett. | Toolbox | Method / Data | S-1 | S-2 | S-3 | S-4 | S-5 | S-6
1 | MatConvNet | C-CNN / Base set | 7.68 | 5.93 | 5.58 | 4.94 | 4.89 | 4.72
2 | Caffe | C-CNN / Base set | 8.75 | 6.32 | 6.15 | 5.55 | 5.53 | 5.44
3 | Caffe | M-CNN / Base set | 7.18 | 6.06 | 5.83 | 5.08 | 4.91 | 4.76
4 | Caffe | C-CNN / Extended set | 8.44 | 6.78 | 6.60 | 5.75 | 5.70 | 5.38
5 | Caffe | M-CNN / Extended set | 7.41 | 6.16 | 5.80 | 4.76 | 4.67 | 4.52

Figure 3.16: The distribution of visibility errors for each landmark. For the six landmarks on the horizontal center of the face, the visibility errors are zero since they are always visible.

Qualitative results. Some examples of alignment results for the proposed method on the AFLW and AFW datasets are shown in Fig. 3.17. The result of the proposed method at each stage is shown in Fig. 3.18.

Time complexity. The speeds of the proposed method with PAWF and D3PF are 0.6 and 0.26 FPS respectively, with the Matlab implementation. The most time-consuming part of the proposed method is feature extraction, which consumes 80% of the total time. We believe this can be substantially improved with C coding and parallel feature extraction. Note that the speeds of the C-CNN and M-CNN architectures are the same because we only compute the response of the top branch of M-CNN in the testing phase.

Figure 3.17: The results of the proposed method on AFLW and AFW. The green/red/yellow dots show the visible/invisible/cheek landmarks, respectively. First row: initial landmarks for AFLW, Second: estimated 3D dense shapes, Third: estimated landmarks, Fourth and Fifth: estimated landmarks for AFLW, Sixth: estimated landmarks for AFW. Notice that despite the discrepancy between the diverse face poses and the constant front-view landmark initialization (top row), our model can adaptively estimate the pose, fit a dense model and produce the 2D landmarks as a byproduct.

3.4 Summary

We propose a method to fit a 3D dense shape to a face image with large poses by combining cascaded CNN regressors and the 3D Morphable Model (3DMM). We propose two types of pose-invariant features and one new CNN architecture for boosting the accuracy of face alignment. Also, we estimate the location of landmarks on the cheek, which also drives the 3D face model fitting. Finally, we achieve state-of-the-art performance on two challenging face alignment datasets with large poses.

Figure 3.18: The result of the proposed method across stages, with the extracted features (1st and 3rd rows) and alignment results (2nd and 4th rows). Note the changes of the landmark position and visibility (the blue arrow) over stages.
Chapter 4

Pose-Invariant Face Alignment with a Single CNN

4.1 Introduction

In the previous chapter, we proposed our PIFA method based on a cascade of CNN regressors and the 3D Morphable Model. The cascade of regressors is the dominant technology for pose-invariant face alignment [68, 147, 148]. Despite the recent success, the cascade of CNNs, when applied to large-pose face images, still suffers from the following drawbacks.

Lack of end-to-end training: It is a consensus that end-to-end training is desired for CNNs [23, 47]. However, in the existing methods, the CNNs are typically trained independently at each cascade stage. Sometimes even multiple CNNs are applied independently at each stage. For example, locations of different landmark sets are estimated by various CNNs and combined via a separate fusing module [99]. Therefore, these CNNs cannot be jointly optimized and might lead to a sub-optimal solution.

Hand-crafted feature extraction: Since the CNNs are trained independently, feature extraction is required to utilize the result of a previous CNN and provide input to the current CNN. Simple feature extraction methods are used, e.g., extracting patches based on 2D or 3D face shapes without considering other factors including pose and expression [99, 139]. Normally, the cascade of CNNs is a collection of shallow CNNs where each one has fewer than five layers. Hence, this framework cannot extract deep features by building upon the extracted features of early-stage CNNs.

Figure 4.1: For the purpose of learning an end-to-end face alignment model, our novel visualization layer reconstructs the 3D face shape (a) from the estimated parameters inside the CNN and synthesizes a 2D image (b) via the surface normal vectors of visible vertexes.

Slow training speed: Training a cascade of CNNs is usually time-consuming for two reasons. Firstly, the CNNs are trained sequentially, one after another. Secondly, feature extraction is required between two consecutive CNNs.

To address these issues, we introduce a novel visualization layer, as shown in Figure 4.1. Our CNN architecture consists of several blocks, which are called visualization blocks. This architecture can be considered as a cascade of shallow CNNs. The new layer visualizes the alignment result of a previous visualization block and utilizes it in a later visualization block. It is designed based on several guidelines. Firstly, it is derived from the surface normals of the underlying 3D face model and encodes the relative pose between the face and camera, partially inspired by the success of using surface normals for 3D face recognition [79]. Secondly, the visualization layer is differentiable, which allows the gradient to be computed analytically and enables end-to-end training. Lastly, a mask is utilized to differentiate between pixels in the middle and contour areas of a face.

Benefiting from the design of the visualization layer, our method has the following advantages and contributions:

The proposed method allows a block in the CNN to utilize the extracted features from previous blocks and extract deeper features. Therefore, extraction of hand-crafted features is no longer necessary.

Figure 4.2: The proposed CNN architecture. We use green, orange, and purple to represent the visualization layer, convolutional layer, and fully connected layer, respectively. Please refer to Figure 4.3 for the details of the visualization block.

The visualization layer is differentiable, allowing for backpropagation of an error from a later block to an earlier one. To the best of our knowledge, this is the first method for pose-invariant face alignment that utilizes only one single CNN and allows end-to-end training.

The proposed method converges faster during the training phase compared to the cascade of CNNs. Therefore, the training time is dramatically reduced.

4.2 3D Face Alignment with Visualization Layer

Given a single face image with an arbitrary pose, our goal is to estimate the 2D landmarks with their visibility labels by fitting a 3D face model. Towards this end, we propose a CNN architecture with end-to-end training for model fitting, as shown in Figure 4.2. In this section, we will describe the underlying 3D face model used in this work, followed by our CNN architecture and the visualization layer.
4.2.1 3D and 2D Face Shapes

We use the 3D Morphable Model (3DMM) to represent the 3D shape of a face $\mathbf{S}_p$ as a linear combination of the mean shape $\mathbf{S}_0$, identity bases $\mathbf{S}^I$ and expression bases $\mathbf{S}^E$:

\mathbf{S}_p = \mathbf{S}_0 + \sum_{k}^{N_I} p_I^k \mathbf{S}_I^k + \sum_{k}^{N_E} p_E^k \mathbf{S}_E^k. \quad (4.1)

We use the vector $\mathbf{p} = [\mathbf{p}_I, \mathbf{p}_E]$ to indicate the 3D shape parameters, where $\mathbf{p}_I = [p_I^0, \ldots, p_I^{N_I}]$ are the identity parameters and $\mathbf{p}_E = [p_E^0, \ldots, p_E^{N_E}]$ are the expression parameters. We use the Basel 3D face model [86], which has 199 bases, as our identity bases and the FaceWarehouse model [25] with 29 bases as our expression bases. Each 3D face shape consists of a set of $Q$ 3D vertexes:

\mathbf{S}_p = \begin{pmatrix} x_1^p & x_2^p & \cdots & x_Q^p \\ y_1^p & y_2^p & \cdots & y_Q^p \\ z_1^p & z_2^p & \cdots & z_Q^p \end{pmatrix}. \quad (4.2)

The 2D face shapes are the projection of the 3D shapes. In this work, we use the weak perspective projection model with 6 degrees of freedom, i.e., one for scale, three for rotation angles and two for translations, which projects the 3D face shape $\mathbf{S}_p$ onto 2D images to obtain the 2D shape $\mathbf{U}$:

\mathbf{U} = f(\mathbf{P}) = \mathbf{M} \begin{pmatrix} \mathbf{S}_p(:, \mathbf{b}) \\ \mathbf{1} \end{pmatrix}, \quad (4.3)

where

\mathbf{M} = \begin{bmatrix} m_1 & m_2 & m_3 & m_4 \\ m_5 & m_6 & m_7 & m_8 \end{bmatrix}, \quad (4.4)

and

\mathbf{U} = \begin{pmatrix} x_1^t & x_2^t & \cdots & x_N^t \\ y_1^t & y_2^t & \cdots & y_N^t \end{pmatrix}. \quad (4.5)

Here $\mathbf{U}$ is a set of $N$ 2D landmarks, and $\mathbf{M}$ is the camera projection matrix. With misuse of notation, we define the target parameters $\mathbf{P} = \{\mathbf{M}, \mathbf{p}\}$. The $N$-dim vector $\mathbf{b}$ includes the 3D vertex indexes which semantically correspond to the 2D landmarks. We denote $\mathbf{m}_1 = [m_1\ m_2\ m_3]$ and $\mathbf{m}_2 = [m_5\ m_6\ m_7]$ as the two rows of the scaled rotation component, while $m_4$ and $m_8$ are the translations.

Equation 4.3 establishes the relationship, or equivalency, between the 2D landmarks $\mathbf{U}$ and $\mathbf{P}$, i.e., the 3D shape parameters $\mathbf{p}$ and the camera projection matrix $\mathbf{M}$. Given that almost all the training images for face alignment have only 2D labels, i.e., $\mathbf{U}$, we perform a data augmentation step similar to [53] to compute their corresponding $\mathbf{P}$. Given an input image, our goal is to estimate the parameter $\mathbf{P}$, based on which the 2D landmarks and their visibilities can be naturally derived.

4.2.2 Proposed CNN Architecture

Our CNN architecture resembles the cascade of CNNs, while each "shallow CNN" is defined as a visualization block. Inside each block, a visualization layer based on the latest parameter estimation serves as a bridge between consecutive blocks. This design enables us to address the drawbacks of the typical cascade of CNNs identified in Section 4.1. We now describe the visualization block and CNN architecture, and dive into the details of the visualization layer in Section 4.2.3.

Visualization Block. Figure 4.3 shows the structure of our visualization block. The visualization layer generates a feature map based on the latest parameter $\mathbf{P}$ (details in Section 4.2.3). Each convolutional layer is followed by a batch normalization (BN) layer and a ReLU layer. It extracts deeper features based on the features provided by the previous visualization block and the visualization layer output. Between the two fully connected layers, the first one is followed by a ReLU layer and a dropout layer, while the second one simultaneously estimates the update of $\mathbf{M}$ and $\mathbf{p}$, denoted as $\Delta\mathbf{P}$. The outputs of the visualization block are deeper features and the new estimation of the parameters ($\Delta\mathbf{P} + \mathbf{P}$). As shown in Figure 4.3, the top part of the visualization block focuses on learning deeper features, while the bottom part utilizes those features to estimate the parameters in a ResNet-like structure [44]. During the backward pass of the training phase, the visualization block backpropagates the loss through both of its inputs to adjust the convolutional and the fully connected layers in previous blocks. This allows the block to extract better features for the next block and improve the overall parameter estimation.

Figure 4.3: A visualization block consists of a visualization layer, two convolutional layers and two fully connected layers.
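A PyTorch sketch of one visualization block is shown below. The fully connected widths (800 and 236) follow Table 4.1; the convolution channel counts and the use of padding are assumptions, and the sketch omits the pooling mentioned in Sec. 4.3.1:

```python
import torch
import torch.nn as nn

class VisualizationBlock(nn.Module):
    """One visualization block (Fig. 4.3): two conv layers (each with BN and ReLU)
    over the concatenated previous features and visualization image, then two fully
    connected layers regressing the parameter update dP."""
    def __init__(self, in_ch, out_ch, feat_hw, p_dim=236):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )
        self.fc1 = nn.Linear(out_ch * feat_hw * feat_hw, 800)
        self.drop = nn.Dropout(0.5)          # dropout rate is an assumption
        self.fc2 = nn.Linear(800, p_dim)

    def forward(self, feats, vis_map, P):
        h = self.convs(torch.cat([feats, vis_map], dim=1))  # deeper features
        g = self.drop(torch.relu(self.fc1(h.flatten(1))))
        dP = self.fc2(g)
        return h, P + dP                     # ResNet-like additive update of P
```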
CNN Architecture. The proposed architecture consists of several connected visualization blocks, as shown in Figure 4.2. The inputs include an image and an initial estimation of the parameter $\mathbf{P}^0$. The output is the final estimation of the parameter. Due to the joint optimization of all visualization blocks through backpropagation, the proposed architecture is able to converge with substantially fewer epochs during training, compared to the typical cascade of CNNs.

Loss Functions. Two types of loss functions are employed in our CNN architecture. The first one is a Euclidean loss between the estimation and the target of the parameter update, with each parameter weighted separately:

E_P^i = (\Delta\mathbf{P}^i - \Delta\bar{\mathbf{P}}^i)^{\top} \mathbf{W} (\Delta\mathbf{P}^i - \Delta\bar{\mathbf{P}}^i), \quad (4.6)

where $E_P^i$ is the loss, $\Delta\mathbf{P}^i$ is the estimation and $\Delta\bar{\mathbf{P}}^i$ is the target (or ground truth) at the $i$-th visualization block. The diagonal matrix $\mathbf{W}$ contains the weights. For each element of the shape parameter $\mathbf{p}$, its weight is the inverse of the standard deviation that was obtained from the data used in 3DMM training. To compensate for the relative scale among the parameters of $\mathbf{M}$, we compute the ratio $r$ between the average of the scaled rotation parameters and the average of the translation parameters in the training data. We set the weights of the scaled rotation parameters of $\mathbf{M}$ to $\frac{1}{r}$ and the weights of the translations of $\mathbf{M}$ to 1. The second type of loss function is the Euclidean loss on the resultant 2D landmarks:

E_S^i = \left\lVert f(\mathbf{P}^i + \Delta\mathbf{P}^i) - \bar{\mathbf{U}} \right\rVert^2, \quad (4.7)

where $\bar{\mathbf{U}}$ is the ground truth 2D landmarks, and $\mathbf{P}^i$ is the input parameter to the $i$-th block, i.e., the output of the $(i-1)$-th block. $f(\cdot)$ computes the 2D landmark locations using the currently updated parameters via Equation 4.3. For backpropagation of this loss function to the parameter $\Delta\mathbf{P}$, we use the chain rule to compute the gradient,

\frac{\partial E_S^i}{\partial \Delta\mathbf{P}^i} = \frac{\partial E_S^i}{\partial f} \frac{\partial f}{\partial \Delta\mathbf{P}^i}.

For the first three visualization blocks, the Euclidean loss on the parameter updates (Equation 4.6) is used, while the Euclidean loss on the 2D landmarks (Equation 4.7) is applied to the last three blocks. The first three blocks estimate parameters to roughly align the 3D shape to the face image and the last three blocks leverage the good initialization to estimate the parameters and the 2D landmark locations more precisely.

4.2.3 Visualization Layer

Several visualization techniques have been explored for facial analysis. In particular, Z-Buffering, which is widely used in prior works [12, 13], is a simple and fast 2D representation of the 3D shape. However, this representation is not differentiable. In contrast, our visualization is based on the surface normals of the 3D face, which describe the surface's orientation in a local neighbourhood. It has been successfully utilized for different facial analysis tasks, e.g., 3D face reconstruction [95] and 3D face recognition [79].

In this work, we use the $z$ coordinate of the surface normal of each vertex, transformed with the pose. It is an indicator of the "frontability" of a vertex, i.e., the amount that the surface normal is pointing towards the camera. This quantity is used to assign an intensity value at its projected 2D location to construct the visualization image. The frontability measure $\mathbf{g}$, a $Q$-dimensional vector, can be computed as,

\mathbf{g} = \max\left( \mathbf{0},\ \frac{(\mathbf{m}_1 \times \mathbf{m}_2)}{\lVert \mathbf{m}_1 \rVert\, \lVert \mathbf{m}_2 \rVert}\, \mathbf{N}_0 \right), \quad (4.8)

where $\times$ is the cross product, and $\lVert \cdot \rVert$ denotes the $L_2$ norm. The $3 \times Q$ matrix $\mathbf{N}_0$ is the surface normal vectors of a 3D face shape. To avoid the high computational cost of calculating the surface normals after each shape update, we approximate $\mathbf{N}_0$ with the surface normals of the mean 3D face. Note that both the face shape and pose are still continuously updated across the visualization blocks, and are used to determine the projected 2D locations. Hence, this approximation only slightly affects the intensity values. To transform the surface normals based on the pose, we apply the estimate of the scaled rotation matrix ($\mathbf{m}_1$ and $\mathbf{m}_2$) to the surface normals computed from the mean face. The value is then truncated with the lower bound of 0 (Equation 4.8).
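Eqn. 4.8 in NumPy form, where N0 holds the (approximated) mean-face surface normals:

```python
import numpy as np

def frontability(m1, m2, N0):
    """Eq. 4.8: per-vertex frontability from pose-transformed surface normals.
    m1, m2: the two scaled-rotation rows of M; N0: (3, Q) surface normals."""
    cam_axis = np.cross(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2))
    return np.maximum(0.0, cam_axis @ N0)    # (Q,), truncated at zero
```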
The pixel intensity of a visualized image $\mathbf{V}(u, v)$ is computed as the weighted average of the frontability measures within a local neighbourhood:

\mathbf{V}(u, v) = \frac{\sum_{q \in \mathbb{D}(u,v)} \mathbf{g}(q)\, \mathbf{a}(q)\, w(u, v, x_q^t, y_q^t)}{\sum_{q \in \mathbb{D}(u,v)} w(u, v, x_q^t, y_q^t)}, \quad (4.9)

where $\mathbb{D}(u, v)$ is the set of indexes of vertexes whose 2D projected locations are within the local neighborhood of the pixel $(u, v)$, and $(x_q^t, y_q^t)$ is the 2D projected location of the $q$-th 3D vertex. The weight $w$ is the distance metric between the pixel $(u, v)$ and the projected location $(x_q^t, y_q^t)$,

w(u, v, x_q^t, y_q^t) = \exp\left( -\frac{(u - x_q^t)^2 + (v - y_q^t)^2}{2\sigma^2} \right). \quad (4.10)

The $Q$-dim vector $\mathbf{a}$ is a mask with positive values for vertexes in the middle area of the face and negative values for vertexes around the contour area of the face:

\mathbf{a}(q) = \exp\left( -\frac{(x_n - x_q^p)^2 + (y_n - y_q^p)^2 + (z_n - z_q^p)^2}{2\sigma_n^2} \right), \quad (4.11)

where $(x_n, y_n, z_n)$ is the vertex coordinate of the nose tip. $\mathbf{a}$ is pre-computed and normalized to zero mean and unit standard deviation. The mask is utilized to discriminate between the middle and contour areas of the face. A visualization of the mask is provided in Figure 4.4.

Figure 4.4: The frontal and side views of the mask a that has positive values in the middle and negative values in the contour area.

Since the human face is a 3D object, visualizing it at an arbitrary view angle requires the estimation of the visibility of each 3D vertex. To avoid the computationally expensive visibility test via rendering, we adopt two strategies for approximation. Firstly, we prune the vertexes whose frontability measures $\mathbf{g}$ equal 0, i.e., the vertexes pointing against the camera. Secondly, if multiple vertexes project to the same image pixel, we keep only the one with the smallest depth value. An illustration is provided in Figure 4.5.

Figure 4.5: An example with four vertexes projected to the same pixel. Two of them have negative values in the z component of their normals (red arrows). Between the other two with positive values, the one with the smaller depth (closer to the image plane) is selected.

Backpropagation. To allow backpropagation of the loss functions through the visualization layer, we compute the derivative of $\mathbf{V}$ with respect to the elements of the parameters $\mathbf{M}$ and $\mathbf{p}$. Firstly, we compute the partial derivatives $\frac{\partial \mathbf{g}}{\partial m_k}$, $\frac{\partial w(u,v,x_i^t,y_i^t)}{\partial m_k}$ and $\frac{\partial w(u,v,x_i^t,y_i^t)}{\partial p_j}$. Then the derivatives $\frac{\partial \mathbf{V}}{\partial m_k}$ and $\frac{\partial \mathbf{V}}{\partial p_j}$ can be computed based on Equation 4.9.
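A dense-loop NumPy sketch of Eqns. 4.9-4.10, pruning back-facing vertexes as described above. For brevity it omits the second approximation (keeping only the smallest-depth vertex per pixel), and the neighborhood radius and σ are illustrative values:

```python
import numpy as np

def visualize(S, M, g, a, size=57, sigma=1.0, radius=2):
    """Splat each visible vertex's masked frontability into a Gaussian-weighted
    average around its projected pixel.
    S: (3, Q) 3D shape; M: (2, 4) projection; g, a: (Q,) frontability and mask."""
    xy = M @ np.vstack([S, np.ones((1, S.shape[1]))])   # projected 2D locations
    V = np.zeros((size, size))
    Wsum = np.zeros((size, size))
    for q in np.flatnonzero(g > 0):                     # prune back-facing vertexes
        x, y = xy[:, q]
        for v in range(max(0, int(y) - radius), min(size, int(y) + radius + 1)):
            for u in range(max(0, int(x) - radius), min(size, int(x) + radius + 1)):
                w = np.exp(-((u - x) ** 2 + (v - y) ** 2) / (2 * sigma ** 2))
                V[v, u] += w * g[q] * a[q]
                Wsum[v, u] += w
    return V / np.maximum(Wsum, 1e-8)                   # Eq. 4.9 weighted average
```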
4.3.1 Quantitative Evaluations on AFLW and AFW

The AFLW dataset [61] is very challenging with large-pose face images (up to 90 degrees of yaw). We use the subset with 3,901 training images and 1,299 testing images released by [53]. All face images in this subset are labeled with 34 landmarks and a bounding box. The AFW dataset [151] contains 205 images with 468 faces. Each image is labeled with at most 6 landmarks with visibility labels, as well as a bounding box. AFW is used only for testing in our experiments.

Table 4.1: The number and size of convolutional filters in each visualization block. For all blocks, the two fully connected layers have the same lengths of 800 and 236.

  Block #     1          2          3          4          5, 6
  Conv.       12 (5x5)   20 (3x3)   28 (3x3)   36 (3x3)   40 (3x3)
  layers      16 (5x5)   24 (3x3)   32 (3x3)   40 (3x3)   40 (3x3)

Table 4.2: NME (%) of four methods on the AFLW dataset.

  Proposed method   Extended-PIFA [53]   PIFA [52]   RCPR [22]
  4.45              4.72                 8.04        6.26

Table 4.3: NME (%) of the proposed method at each visualization block on the AFLW dataset. The initial NME is 25.8%.

  Block #   1      2      3      4      5      6
  NME       9.26   6.77   5.51   4.98   4.60   4.45

The bounding boxes in both datasets are used as the initialization for our algorithm, as well as for the baselines. We crop the region inside the bounding box and normalize it to 114x114. Due to the memory constraint of GPUs, we have a pooling layer in the first visualization block after the first convolutional layer to decrease the size of the feature maps by half. The input to the subsequent visualization blocks is of size 57x57. To augment the training data, we generate 20 different variations for each training image by adding noise to the location, width and height of the bounding boxes.

For quantitative evaluations, we use two conventional metrics. The first one is Mean Average Pixel Error (MAPE) [135], which is the average of the pixel errors for the visible landmarks. The other one is Normalized Mean Error (NME), i.e., the average of the normalized estimation errors of visible landmarks. The normalization factor is the square root of the face bounding box size [52], instead of the eye-to-eye distance used in frontal-view face alignment.

We compare our method with several state-of-the-art large-pose face alignment approaches. On AFLW, we compare with Extended-PIFA [53], PIFA [52] and RCPR [22] using the NME metric. Table 4.2 shows that our proposed method achieves a higher accuracy than the alternatives. The heatmap-based method named CALE [21] reported an NME of 2.96%, but suffers from several disadvantages. To demonstrate the capabilities of each visualization block, the NME computed using the estimated P after each block is shown in Table 4.3. If a higher alignment speed is desirable, it is possible to skip the last two visualization blocks with a reasonable NME. On the AFW dataset, comparisons are conducted with Extended-PIFA [53], PIFA [52], CDM [135] and TSPM [151] using the MAPE metric. The results in Table 4.4 again show the superiority of the proposed method.

Some examples of alignment results of the proposed method on the AFLW and AFW datasets are shown in Figure 4.9. Three examples of the visualization layer output at each visualization block are shown in Figure 4.10.

4.3.2 Evaluation on the 300W dataset

While our main goal is PIFA, we also evaluate on the most widely used near-frontal 300W dataset [96]. 300W contains 3,148 training and 689 testing images, which are divided into the common and challenging sets with 554 and 135 images, respectively. Table 4.5 shows the NME (normalized by the interocular distance) of the evaluated methods. The most relevant method is 3DDFA [147], which also estimates M and p. Our method outperforms it on both the common and challenging sets. Methods that do not employ shape constraints, e.g., via 3DMM, generally have higher freedom and could achieve slightly better accuracy on frontal face cases. Nonetheless, they are typically less robust in more challenging cases. Another comparison is with MDM [107] via the failure rate using a threshold of 0.08. The failure rates are 16.83% (ours) versus 6.80% (MDM) with 68 landmarks, and 8.99% (ours) versus 4.20% (MDM) with 51 landmarks.
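For reference, the NME used in Tables 4.2 and 4.3 can be computed with the short sketch below, a hypothetical helper assuming the bounding-box-size normalization of [52].

```python
import numpy as np

def nme(pred, gt, visible, bbox_w, bbox_h):
    """Normalized Mean Error on visible landmarks: the per-image mean
    landmark error divided by the square root of the face bounding-box
    size, as described in the evaluation protocol above.

    pred, gt : (N, 2) predicted / ground-truth landmark locations.
    visible  : (N,) boolean visibility labels.
    """
    errs = np.linalg.norm(pred[visible] - gt[visible], axis=1)
    return errs.mean() / np.sqrt(bbox_w * bbox_h)

pred = np.array([[10., 10.], [20., 20.]])
gt = np.array([[11., 10.], [20., 22.]])
print(nme(pred, gt, np.array([True, True]), 100, 100))  # -> 0.015
```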
Table 4.4: MAPE of five methods on the AFW dataset.

  Proposed method   Extended-PIFA [53]   PIFA [52]   CDM [135]   TSPM [151]
  6.27              7.43                 8.61        9.13        11.09

Table 4.5: The NME of different methods on the 300W dataset.

  Method            Common   Challenging   Full
  ESR [26]          5.28     17.00         7.58
  RCPR [22]         6.18     17.26         8.35
  SDM [124]         5.57     15.40         7.50
  LBF [93]          4.95     11.98         6.32
  CFSS [147]        4.73     9.98          5.76
  RCFA [116]        4.03     9.85          5.32
  RAR [122]         4.12     8.35          4.94
  3DDFA [147]       6.15     10.59         7.01
  3DDFA+SDM         5.53     9.56          6.31
  Proposed method   5.43     9.88          6.30

4.3.3 Analysis of the Visualization Layer

We perform four sets of experiments to study the properties of the visualization layer and network architectures.

Influence of visualization layers. To analyze the influence of the visualization layer in the testing phase, we add 5% noise to the fully connected layer parameters of each visualization block, and compute the alignment error on the AFLW test set. The NMEs are [4.46, 4.53, 4.60, 4.66, 4.80, 5.16] when each block is perturbed separately. This analysis shows that the visualized images have more influence on the later blocks, since imprecise parameters of early blocks can be compensated by later blocks. In another experiment, we train the network without any visualization layers. The NME on AFLW is 7.18%, which shows the importance of the visualization layers in guiding the network training.

Advantage of deeper features. We train the three CNN architectures shown in Figure 4.6 on AFLW. The inputs of the visualization blocks in the first architecture are the input image I, the feature maps F and the visualization image V. The inputs of the second and the third architectures are {F, V} and {I, V}, respectively. The NME of each architecture is shown in Table 4.6.

Table 4.6: The NME (%) of three architectures with different inputs (I: input image, V: visualization, F: feature maps).

  Architecture a   Architecture b   Architecture c
  (I, F, V)        (F, V)           (I, V)
  4.45             4.48             5.06

While the first one performs the best, the substantially lower performance of the third one demonstrates the importance of the deeper features learned across blocks.

At the first convolutional layer of each visualization block, we compute the average of the filter weights, across both the kernel size and the number of maps. The averages for the three types of input features are shown in Figure 4.7. As observed, the weights decrease across blocks, leading to a more precise estimation of the small-scale parameter updates. Considering the number of filters in Table 4.1, the total impact of the feature maps is higher than that of the other two inputs in all blocks. This again shows the importance of deeper features in guiding the network to estimate the parameters. Furthermore, the average weight of the visualization is higher than that of the input image, demonstrating the stronger influence of the proposed visualization during training.

Figure 4.6: Architectures of the three CNNs with different inputs: (a), (b), (c).

Table 4.7: NME (%) when different masks are used.

  Mask 1   Mask 2   No Mask
  4.45     4.49     5.31

Advantage of using masks. To show the advantage of using the mask in the visualization layer, we conduct an experiment with different masks. Specifically, we design another mask for comparison, as shown in Figure 4.8. It has five positive areas, i.e., the eyes, the nose tip and the two lip corners. The values are normalized to zero mean and unit standard deviation. Compared to the original mask in Fig. 4.4, this mask is more complicated and conveys more information about the informative facial areas to the network. Moreover, to show the necessity of using the mask, we also test using visualization layers without any mask. The NMEs of the trained networks with the different masks are shown in Table 4.7. Comparison between the first and the third columns shows the benefit of using the mask, by differentiating the middle and contour areas of the face. By comparing the first and second columns, we can see that utilizing a more complicated mask does not further improve the result, indicating that the original mask provides sufficient information for its purpose.
Different numbers of blocks and layers. Given the total number of 12 convolutional layers in our network, we can partition them into visualization blocks of various sizes. To compare their performance, we train two additional CNNs. The first one consists of 4 visualization blocks with 3 convolutional layers in each. The other comes with 3 blocks and 4 convolutional layers per block. Hence, all three architectures have 12 convolutional layers in total. The NMEs of these architectures are shown in Table 4.8. Similar to [22], this shows that the number of regressors is important for face alignment, and we can potentially achieve a higher accuracy by increasing the number of visualization blocks.

Figure 4.7: The average of the filter weights for the input image, visualization and feature maps in the three architectures of Figure 4.6. The y-axis and x-axis show the average and the block index, respectively.

Figure 4.8: Mask 2, a differently designed mask with five positive areas on the eyes, the top of the nose and the sides of the lip.

Table 4.8: NME (%) when using different numbers of visualization blocks (N_v) and convolutional layers (N_c).

  N_v = 6, N_c = 2   N_v = 4, N_c = 3   N_v = 3, N_c = 4
  4.45               4.61               4.83

4.3.4 Time complexity

Compared to the cascade of CNNs, one of the main advantages of end-to-end training of a single CNN is the reduced training time. The proposed method needs 33 epochs, which take around 2.5 days. With the same training and testing datasets, [53] requires 70 epochs for each CNN. With a total of six CNNs, it needs around 7 days. Similarly, the method in [147] needs around 12 days to train three CNNs, each one with 20 epochs, despite using different training data. Compared to [53], our method reduces the training time by more than half. The testing speed of the proposed method is 4.3 FPS on a Titan X GPU. It is much faster than the 0.6 FPS speed of [53] and is similar to the 4 FPS speed of [122].

Figure 4.9: Results of alignment on the AFLW and AFW datasets; green landmarks show the estimated locations of visible landmarks and red landmarks show the estimated locations of invisible landmarks. First row: provided bounding box by AFLW with initial locations of landmarks; second: estimated 3D dense shapes; third: estimated landmarks; fourth to sixth: estimated landmarks for AFLW; seventh: estimated landmarks for AFW.

Figure 4.10: Three examples of the outputs of the visualization layer at each visualization block (initialization and Blocks 1 to 6). The first row shows that the proposed method recovers the expression of the face gracefully; the third row shows the visualizations of a face with a more challenging pose.

4.4 Summary

We propose a pose-invariant face alignment method with end-to-end training in a single CNN. The key is a differentiable visualization layer, which is integrated into the network and enables joint optimization by backpropagating the error from later visualization blocks to earlier ones. It allows each visualization block to utilize the features extracted from previous blocks and to extract deeper features, without relying on hand-crafted features. In addition, the proposed method converges faster during the training phase compared to the cascade of CNNs. Through extensive experiments, we demonstrate the superior results of the proposed method over the state-of-the-art approaches.
Chapter 5

Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision

5.1 Introduction

With the increasing use of smart devices in our daily lives, people are seeking secure and convenient ways to access their personal information. Biometrics, such as face, fingerprint, and iris, are widely utilized for person authentication due to their intrinsic distinctiveness and convenience to use. Face, as one of the most popular modalities, has received increasing attention in academia and industry in recent years (e.g., iPhone X). However, the attention also brings a growing incentive for hackers to design biometric presentation attacks (PA), or spoofs, to be authenticated as the genuine user. Due to the almost no-cost access to the human face, a spoof face can be as simple as a printed photo paper (i.e., print attack) or a digital image/video (i.e., replay attack), or as complicated as a 3D mask or facial cosmetic makeup. With proper handling, those spoofs can be visually very close to the genuine user's live face. As a result, these call for the need to develop robust face anti-spoofing algorithms.

Figure 5.1: Conventional CNN-based face anti-spoofing approaches utilize binary supervision, which may lead to overfitting given the enormous solution space of CNNs. This work designs a novel network architecture to leverage two types of auxiliary information as supervision, the depth map and the rPPG signal, with the goals of improved generalization and explainable decisions during inference.

RGB images and video are the standard input to face anti-spoofing systems, since the majority of face recognition systems adopt RGB cameras as the sensor. Researchers started with texture-based approaches by feeding handcrafted features to binary classifiers [18, 33, 34, 59, 77, 84, 131]. Later, in the deep learning era, several Convolutional Neural Network (CNN) approaches utilize the softmax loss as the supervision [37, 66, 83, 130]. It appears that almost all prior works regard the face anti-spoofing problem as merely a binary (live vs. spoof) classification problem.

There are two main issues in learning deep anti-spoofing models with binary supervision. First, there are different levels of image degradation, namely spoof patterns, when comparing a spoof face to a live one, which consist of skin detail loss, color distortion, moiré pattern, shape deformation and spoof artifacts (e.g., reflection) [65, 84]. A CNN with softmax loss might discover arbitrary cues that are able to separate the two classes, such as a screen bezel, but not the faithful spoof patterns. When those cues disappear during testing, such models fail to distinguish spoof vs. live faces and result in poor generalization. Second, during testing, models learnt with binary supervision will only generate a binary decision without explanation or rationale for the decision. In the pursuit of Explainable Artificial Intelligence [1], it is desirable for the learnt model to generate the spoof patterns that support the final binary decision.

To address these issues, as shown in Fig. 5.1, we propose a deep model that uses supervision from both spatial and temporal auxiliary information rather than binary supervision, for the purpose of robustly detecting face PA from a face video. This auxiliary information is acquired based on our domain knowledge about the key differences between live and spoof faces, which include two perspectives: spatial and temporal. From the spatial perspective, it is known that live faces have face-like depth, e.g., the nose is closer to the camera than the cheek in frontal-view faces, while faces in print or replay attacks have flat or planar depth, e.g., all pixels on the image of a paper have the same depth to the camera. Hence, depth can be utilized as auxiliary information to supervise both live and spoof faces. From the temporal perspective, it was shown that the normal rPPG signals (i.e., the heart pulse signal) are detectable from live, but not spoof, face videos [69, 81].
Therefore, we provide different rPPG signals as auxiliary supervision, which guides the network to learn from live or spoof face videos respectively. To enable both supervisions, we design a network architecture with a short-cut connection to capture different scales and a novel non-rigid registration layer to handle the motion and pose change for rPPG estimation.

Furthermore, similar to other vision problems, data plays a significant role in training the anti-spoofing models. As we know, camera/screen quality is a critical factor to the quality of spoof faces. Existing face anti-spoofing databases, such as NUAA [104], CASIA [143], Replay-Attack [28], and MSU-MFSD [117], were collected 3-5 years ago. Given the fast development pace of consumer electronics, the types of equipment (e.g., cameras and spoofing mediums) used in those data collections are outdated compared to the ones available nowadays, regarding resolution and imaging quality. The more recent MSU-USSA [84] and OULU [19] databases have subjects with fewer variations in facial poses, illuminations, and expressions (PIE). The lack of necessary variations would make it hard to learn an effective model. Given the clear need for more advanced databases, we collect a face anti-spoofing database for training and evaluation, named the Spoof in the Wild (SiW) database. The SiW database consists of 299 subjects, 6 spoofing mediums, and 4 sessions covering variations such as PIE, distance-to-camera, etc. SiW covers much larger variations than previous databases, as detailed in Tab. 5.1 and Sec. 5.3.

Figure 5.2: The overview of the proposed method.

The main contributions of this work include:

- We propose to leverage novel auxiliary information (i.e., depth map and rPPG) to supervise the CNN learning for improved generalization.

- We propose a novel CNN-RNN architecture for end-to-end learning of the depth map and rPPG signal.

- We release a new database that contains variations of PIE and other practical factors. We achieve state-of-the-art performance for face anti-spoofing.

5.2 Face Anti-Spoofing with Deep Network

The main idea of the proposed approach is to guide the deep network to focus on the known spoof patterns across the spatial and temporal domains, rather than to extract any cues that could separate the two classes but are not generalizable. As shown in Fig. 5.2, the proposed network combines CNN and RNN architectures in a coherent way. The CNN part utilizes the depth map supervision to discover subtle texture properties that lead to distinct depths for live and spoof faces. Then, it feeds the estimated depth and the feature maps to a novel non-rigid registration layer to create aligned feature maps. The RNN part is trained with the aligned maps and the rPPG supervision, which examines the temporal variability across video frames.

Figure 5.3: The proposed CNN-RNN architecture. The number of filters is shown on top of each layer; the size of all filters is 3x3, with stride 1 for convolutional and 2 for pooling layers. Color code used: orange = convolution, green = pooling, purple = response map.

5.2.1 Depth Map Supervision

Depth maps are a representation of the 3D shape of the face in a 2D image, which shows the face location and the depth information of different facial areas. This representation is more informative than binary labels, since it indicates one of the fundamental differences between live faces and print and replay PAs. We utilize the depth maps in the depth loss function to supervise the CNN part. The pixel-based depth loss guides the CNN to learn a mapping from the face area within a receptive field to a labeled depth value: a scale within [0, 1] for live faces and 0 for spoof faces.
To estimate the depth map for a 2D face image, we utilize the state-of-the-art dense face alignment (DeFA) methods [54, 74] to estimate the 3D shape of the face. The frontal dense 3D shape S_F \in R^{3 \times Q}, with Q vertices, is represented as a linear combination of identity bases \{S_{id}^i\}_{i=1}^{N_{id}} and expression bases \{S_{exp}^i\}_{i=1}^{N_{exp}},

S_F = S_0 + \sum_{i=1}^{N_{id}} \alpha_{id}^i S_{id}^i + \sum_{i=1}^{N_{exp}} \alpha_{exp}^i S_{exp}^i,   (5.1)

where \alpha_{id} \in R^{199} and \alpha_{exp} \in R^{29} are the identity and expression parameters, and \alpha = [\alpha_{id}, \alpha_{exp}] are the shape parameters. We utilize the Basel 3D face model [86] and FaceWarehouse [25] as the identity and expression bases.

With the estimated pose parameters P = (s, R, t), where R \in R^{3 \times 3} is a rotation matrix, t \in R^3 is a 3D translation, and s is a scale, we align the 3D shape S to the 2D face image:

S = s R S_F + t.   (5.2)

Given the challenge of estimating the absolute depth from a 2D face, we normalize the z values of the 3D vertices in S to be within [0, 1]. That is, the vertex closest to the camera (e.g., the nose) has a depth of one, and the vertex furthest away has a depth of zero. Then, we apply the Z-Buffer algorithm [149] to S to project the normalized z values onto a 2D plane, which results in an estimated "ground truth" 2D depth map D \in R^{32 \times 32} for a face image.

5.2.2 rPPG Supervision

rPPG signals have recently been utilized for face anti-spoofing [69, 81]. The rPPG signal provides temporal information about face liveness, as it is related to the changes in the intensities of facial skin over time. These intensity changes are highly correlated with the blood flow. The traditional method [35] for extracting rPPG signals has three drawbacks. First, it is sensitive to pose and expression variations, since it becomes harder to track a specific face area for measuring intensity changes. Second, it is also sensitive to changes in illumination, since the extra lighting affects the amount of light reflected from the skin surface. Third, for the purpose of anti-spoofing, the rPPG signal extracted from spoof videos might not be sufficiently distinguishable from the signal of live videos.

One novel aspect of our approach is that, instead of computing the rPPG signal via [35], our RNN part learns to estimate the rPPG signal. This eases the signal estimation from face videos with PIE variations, and also leads to more discriminative rPPG signals, as different rPPG supervisions are provided to live vs. spoof videos. We assume that the videos of the same subject under different PIE conditions have the same ground truth rPPG signal. This assumption is valid since the heart beat is similar across videos of the same subject that are captured within a short span of time (< 5 minutes). The rPPG signal extracted from the constrained videos (i.e., no PIE variation) is used as the "ground truth" supervision in the rPPG loss function for all live videos of the same subject. This consistent supervision helps the CNN and RNN parts to be robust to the PIE changes.

In order to extract the rPPG signal from a face video without PIE variation, we apply DeFA [74] to each frame and estimate the dense 3D face shape. We utilize the estimated 3D shape to track a face region. For a tracked region, we compute two orthogonal chrominance signals x_f = 3r_f - 2g_f and y_f = 1.5r_f + g_f - 1.5b_f, where r_f, g_f, b_f are the bandpass filtered versions of the r, g, b channels with skin-tone normalization. We utilize the ratio of the standard deviations of the chrominance signals, \gamma = \sigma(x_f) / \sigma(y_f), to compute the blood flow signal [35]. We calculate the signal p as:

p = 3\left(1 - \frac{\gamma}{2}\right) r_f - 2\left(1 + \frac{\gamma}{2}\right) g_f + \frac{3\gamma}{2} b_f.   (5.3)

By applying the FFT to p, we obtain the rPPG signal f \in R^{50}, which shows the magnitude of each frequency.
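The "ground truth" rPPG extraction of Eq. 5.3 can be sketched as follows. This is a minimal SciPy/NumPy illustration under stated assumptions: r, g, b are the mean skin-tone-normalized color traces of a tracked face region sampled at fs Hz, and the band-pass range (roughly the 0.7 to 4 Hz heart-rate band) is our choice, not specified in the thesis.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def rppg_signal(r, g, b, fs=30.0, n_bins=50):
    """Chrominance-based rPPG sketch (Eq. 5.3), followed by an FFT to
    obtain the 50-bin magnitude spectrum f, normalized to unit L2 norm."""
    bp = lambda x: filtfilt(*butter(3, [0.7, 4.0], btype='band', fs=fs), x)
    rf, gf, bf = bp(r), bp(g), bp(b)           # bandpass filtered channels
    xf = 3 * rf - 2 * gf                       # chrominance signals
    yf = 1.5 * rf + gf - 1.5 * bf
    gamma = np.std(xf) / np.std(yf)
    p = (3 * (1 - gamma / 2) * rf
         - 2 * (1 + gamma / 2) * gf
         + (3 * gamma / 2) * bf)
    f = np.abs(np.fft.rfft(p))[:n_bins]        # magnitude of each frequency
    return f / np.linalg.norm(f)               # so that ||f||_2 = 1
```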
5.2.3 Network Architecture

Our proposed network consists of two deep networks. First, a CNN part evaluates each frame separately and estimates the depth map and feature map of each frame. Second, a recurrent neural network (RNN) part evaluates the temporal variability across the feature maps of a sequence.

5.2.3.1 CNN Network

We design a Fully Convolutional Network (FCN) as our CNN part, as shown in Fig. 5.3. The CNN part contains multiple blocks of three convolutional layers, pooling and resizing layers, where each convolutional layer is followed by one exponential linear unit (ELU) layer and a batch normalization layer. Then, the resizing layers resize the response maps after each block to a predefined size of 64x64 and concatenate the response maps. The bypass connections help the network to utilize extracted features from layers with different depths, similar to the ResNet structure [44]. After that, our CNN has two branches, one for estimating the depth map and the other for estimating the feature map.

The first output of the CNN is the estimated depth map of the input frame I \in R^{256 \times 256}, which is supervised by the estimated "ground truth" depth D,

\Theta_D = \arg\min_{\Theta_D} \sum_{i=1}^{N_d} \| CNN_D(I_i; \Theta_D) - D_i \|_1^2,   (5.4)

where \Theta_D are the CNN parameters and N_d is the number of training images. The second output of the CNN is the feature map, which is fed into the non-rigid registration layer.

5.2.3.2 RNN Network

The RNN part aims to estimate the rPPG signal f of an input sequence with N_f frames \{I_j\}_{j=1}^{N_f}. As shown in Fig. 5.3, we utilize one LSTM layer with 100 hidden neurons, one fully connected layer, and an FFT layer that converts the response of the fully connected layer into the Fourier domain. Given the input sequence \{I_j\}_{j=1}^{N_f} and the "ground truth" rPPG signal f, we train the RNN to minimize the \ell_1 distance of the estimated rPPG signal to the "ground truth" f,

\Theta_R = \arg\min_{\Theta_R} \sum_{i=1}^{N_s} \| RNN_R([\{F_j\}_{j=1}^{N_f}]_i; \Theta_R) - f_i \|_1^2,   (5.5)

where \Theta_R are the RNN parameters, F_j \in R^{32 \times 32} is the frontalized feature map (details in Sec. 5.2.4), and N_s is the number of sequences.

Figure 5.4: Example ground truth depth maps and rPPG signals.

5.2.3.3 Implementation Details

Ground Truth Data. Given a set of live and spoof face videos, we provide the ground truth supervision for the depth map D and the rPPG signal f. We follow the procedure in Sec. 5.2.1 to compute the "ground truth" data for live videos. For spoof videos, we set the ground truth depth maps to a plain surface, i.e., zero depth. Similarly, we follow the procedure in Sec. 5.2.2 to compute the "ground truth" rPPG signal from a patch on the forehead, for one live video of each subject without PIE variation. Also, we normalize the norm of the estimated rPPG signal such that ||f||_2 = 1. For spoof videos, we consider the rPPG signals to be zero. Fig. 5.4 shows examples of the ground truth depth maps and rPPG signals.

Note that, while the term "depth" is used here, our estimated depth is different from the conventional depth map in computer vision. It can be viewed as a "pseudo-depth" that serves the purpose of providing discriminative auxiliary supervision to the learning process. The same perspective applies to the supervision based on the pseudo-rPPG signal.

Training Strategy. Our proposed network combines the CNN and RNN parts for end-to-end training. The desired training data for the CNN part should come from diverse subjects, so as to make the training procedure more stable and increase the generalizability of the learnt model. Meanwhile, the training data for the RNN part should be long sequences, to leverage the temporal information across frames. These two preferences can be contradictory, especially given the limited GPU memory. Hence, to satisfy both preferences, we design a two-stream training strategy, described next.
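Before detailing the two streams, the per-sample form of the two auxiliary losses (Eqs. 5.4 and 5.5) can be written down directly. The sketch below is a minimal NumPy reading, assuming the squared-L1 interpretation of the norms; it is illustrative only, not the thesis implementation.

```python
import numpy as np

def depth_loss(d_pred, d_gt):
    """Per-frame depth loss of Eq. 5.4: squared L1 distance between the
    estimated 32x32 depth map and the 'ground truth' pseudo-depth
    (face-like depth for live frames, zeros for spoof frames)."""
    return np.abs(d_pred - d_gt).sum() ** 2

def rppg_loss(f_pred, f_gt):
    """Per-sequence rPPG loss of Eq. 5.5: squared L1 distance between
    the estimated spectrum and the 'ground truth' rPPG (zeros for spoof)."""
    return np.abs(f_pred - f_gt).sum() ** 2

# Spoof targets: a zero depth map and a zero rPPG spectrum.
d_gt_spoof, f_gt_spoof = np.zeros((32, 32)), np.zeros(50)
print(depth_loss(np.full((32, 32), 0.01), d_gt_spoof))  # penalizes nonzero depth
```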
The first stream satisfies the preference of the CNN part, where the input includes face images I and the ground truth depth maps D. The second stream satisfies the preference of the RNN part, where the input includes face sequences \{I_j\}_{j=1}^{N_f}, the ground truth depth maps \{D_j\}_{j=1}^{N_f}, the estimated 3D shapes \{S_j\}_{j=1}^{N_f}, and the corresponding ground truth rPPG signals f. During training, our method alternates between these two streams to converge to a model that minimizes both the depth map and rPPG losses. Note that even though the first stream only updates the weights of the CNN part, the backpropagation of the second stream updates the weights of both the CNN and RNN parts in an end-to-end manner.

Testing. To provide a classification score, we feed the testing sequence to our network and compute the estimated depth map \hat{D} of the last frame and the rPPG signal \hat{f}. To avoid the overfitting that could come from utilizing a binary loss function in the network, we compute the final score as:

score = \| \hat{f} \|_2^2 + \lambda \| \hat{D} \|_2^2,   (5.6)

where \lambda is a constant weight for combining the two outputs of the network.

Figure 5.5: The non-rigid registration layer.

5.2.4 Non-rigid Registration Layer

We design a new non-rigid registration layer to prepare the data for the RNN part. This layer utilizes the estimated dense 3D shape to align the activations or feature maps from the CNN part. This layer is important to ensure that the RNN tracks and learns the changes of the activations for the same facial area across time, as well as across all subjects.

As shown in Fig. 5.5, this layer has three inputs: the feature map T \in R^{32 \times 32}, the depth map \hat{D} and the 3D shape S. Within this layer, we threshold the depth map and generate a binary mask V \in R^{32 \times 32}:

V = \hat{D} \geq threshold.   (5.7)

Then, we compute the inner product of the binary mask and the feature map, U = T \odot V, which essentially utilizes the depth map as a visibility indicator for each pixel in the feature map. If the depth value for one pixel is less than the threshold, we consider that pixel to be invisible. Finally, we frontalize U by utilizing the estimated 3D shape S,

F(i, j) = U(S(m_{ij}, 1), S(m_{ij}, 2)),   (5.8)

where m \in R^K is the list of K indexes of the face area in S_0, and m_{ij} is the corresponding index for the pixel (i, j). We utilize m to project the masked activation values U to the frontalized image F.

This non-rigid registration layer has three main contributions to the proposed network architecture:

- By applying the non-rigid registration, the input data are aligned and the RNN can compare the feature maps without concern about the facial pose or expression. In other words, it can learn the temporal changes in the activations of the feature maps for the same facial area.

- The non-rigid registration removes the background area from the feature map. Hence the background area does not participate in RNN learning, although the background information is already utilized in the layers of the CNN part.

- For spoof faces, the depth maps are likely to be close to zero. Hence, the inner product with the depth maps substantially weakens the activations in the feature maps, which makes it convenient for the RNN to output zero rPPG signals. Likewise, the backpropagation from the rPPG loss also encourages the CNN part to generate zero depth maps, for either all frames or one pixel location in the majority of the frames within an input sequence.
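The non-rigid registration layer of Eqs. 5.7 and 5.8 can be sketched as below. This is a simplified NumPy version; in particular, the precomputed lookup table front_idx, which maps each frontalized pixel to a location in T via the estimated 3D shape S, is our assumption about how S(m_ij, :) is realized in practice.

```python
import numpy as np

def nonrigid_register(T, D_hat, front_idx, threshold=0.1):
    """Sketch of the non-rigid registration layer.

    T         : (32, 32) feature map from the CNN part.
    D_hat     : (32, 32) estimated depth map.
    front_idx : (32, 32, 2) per frontalized pixel (i, j), the (row, col)
                in T of the corresponding face vertex (from the 3D shape).
    """
    V = (D_hat >= threshold).astype(T.dtype)     # visibility mask (Eq. 5.7)
    U = T * V                                    # suppress invisible pixels
    F = U[front_idx[..., 0], front_idx[..., 1]]  # frontalize (Eq. 5.8)
    return F

# Toy usage with an identity lookup table (no actual warping).
T = np.random.rand(32, 32)
D_hat = np.random.rand(32, 32)
idx = np.stack(np.meshgrid(np.arange(32), np.arange(32), indexing='ij'), -1)
print(nonrigid_register(T, D_hat, idx).shape)    # (32, 32)
```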
Figure 5.6: The statistics of the subjects in the SiW database. Left side: the histogram shows the distribution of the face sizes.

5.3 Collection of the Face Anti-Spoofing Database

With the fast advancement of imaging sensor technology, existing anti-spoofing systems could become vulnerable to emerging high-quality spoof mediums. One way to make a system robust to these attacks is to collect new high-quality databases. In response to this need, we collect a new face anti-spoofing database named the Spoof in the Wild (SiW) database, which has multiple advantages over previous datasets, as shown in Tab. 5.1. First, it contains substantially more live subjects with diverse races, e.g., 5 times the subjects of Oulu-NPU. Note that MSU-USSA is constructed based on existing images of celebrities without capturing live face videos. Second, the live videos are captured with two high-quality cameras (Canon EOS T6, Logitech C920 webcam) with different PIE variations.

SiW provides live and spoof videos from 299 subjects. For each subject, we have 8 live and 16 spoof videos, in total 7,170 videos. Some statistics of the subjects are shown in Fig. 5.6. The live videos are collected in four sessions. In Session 1, the subject moves his head with varying distances to the camera. In Session 2, the subject changes the yaw angle of the head within [-90°, 90°] and makes different face expressions. In Sessions 3 and 4, the subject repeats Sessions 1 and 2, while the collector moves a point light source around the face from different orientations.

Figure 5.7: Examples of the live and spoof attack videos in the SiW database. The first row shows a live subject with different PIE. The second row shows different types of the spoof attacks.

The live videos from the Canon EOS T6 and Logitech C920 webcam are in 1,920x1,080 resolution. We provide two print and four replay video attacks for each subject in SiW. Some examples of the live and spoof videos are shown in Fig. 5.7. To generate different qualities of print attacks, we capture a high-resolution image (5,184x3,456) for each subject and use it to make a high-quality print attack. Also, we extract a frontal-view frame from a live video for a lower-quality print attack. We print the images with an HP Color LaserJet M652 printer. The print attack videos are captured by holding the printed images still and warping them in front of the cameras. To generate high-quality replay attack videos, we select four spoof mediums: Samsung Galaxy S8, iPhone 7, iPad Pro, and PC screens. For each subject, we randomly select two videos from the four high-quality live videos to display on the spoof mediums.

Table 5.1: The comparison of our collected dataset with available datasets for face anti-spoofing.

  Dataset             Year  # subj.  # sess.  # live/attack vid.(V), ima.(I)  Pose range   Diff. expres.  Extra light.  Display devices                              Spoof attacks
  NUAA [104]          2010  15       3        5,105/7,509 (I)                 Frontal      No             Yes           -                                            Print
  CASIA-MFSD [143]    2012  50       3        150/450 (V)                     Frontal      No             No            iPad                                         Print, Replay
  Replay-Attack [28]  2012  50       1        200/1,000 (V)                   Frontal      No             Yes           iPhone 3GS, iPad                             Print, 2 Replay
  MSU-MFSD [117]      2015  35       1        110/330 (V)                     Frontal      No             No            iPad Air, iPhone 5S                          Print, 2 Replay
  MSU-USSA [84]       2016  1,140    1        1,140/9,120 (I)                 [-45°, 45°]  Yes            Yes           MacBook, Nexus 5, Nvidia Shield Tablet       2 Print, 6 Replay
  Oulu-NPU [19]       2017  55       3        1,980/3,960 (V)                 Frontal      No             Yes           Dell 1905FP, MacBook Retina                  2 Print, 2 Replay
  SiW                 2018  165      4        1,320/3,300 (V)                 [-90°, 90°]  Yes            Yes           iPad Pro, iPhone 7S, Galaxy S8, Asus MB168B  2 Print, 4 Replay

5.4 Experimental Results

5.4.1 Experimental Setup

Databases. We evaluate our method on multiple databases to demonstrate its generalizability. We utilize the SiW and Oulu [19] databases as new high-resolution databases and perform intra and cross testing between them. Also, we use the CASIA-MFSD [143] and Replay-Attack [28] databases for cross testing and comparison with the state of the art.

Table 5.2: TDR at different FDRs, cross testing on Oulu Protocol 1.

  FDR       1%     2%     10%    20%
  Model 1   8.5%   18.1%  71.4%  81.0%
  Model 2   40.2%  46.9%  78.5%  93.5%
  Model 3   39.4%  42.9%  67.5%  87.5%
  Model 4   45.8%  47.9%  81.0%  94.2%

Table 5.3: ACER of our method at different N_f, on Oulu Protocol 2.

  Test \ Train   5      10     20
  5              4.16%  4.16%  3.05%
  10             4.02%  3.61%  2.78%
  20             4.10%  3.67%  2.98%
Parameter setting. The proposed method is implemented in TensorFlow [2] with a constant learning rate of 3e-3 and 10 epochs of the training phase. The batch size of the CNN stream is 10 and that of the CNN-RNN stream is 2, with N_f being 5. We randomly initialize our network using a normal distribution with zero mean and a standard deviation of 0.02. We set \lambda in Eq. 5.6 to 0.015 and the threshold in Eq. 5.7 to 0.1.

Evaluation metrics. To compare with prior works, we report our results with the following metrics: the Attack Presentation Classification Error Rate (APCER) [46], the Bona Fide Presentation Classification Error Rate (BPCER) [46], ACER = (APCER + BPCER)/2 [46], and the Half Total Error Rate (HTER). The HTER is half of the summation of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR).

Table 5.4: The intra-testing results on the four protocols of Oulu.

  Prot.  Method           APCER (%)   BPCER (%)   ACER (%)
  1      CPqD             2.9         10.8        6.9
         GRADIANT         1.3         12.5        6.9
         Proposed method  1.6         1.6         1.6
  2      MixedFASNet      9.7         2.5         6.1
         Proposed method  2.7         2.7         2.7
         GRADIANT         3.1         1.9         2.5
  3      MixedFASNet      5.3±6.7     7.8±5.5     6.5±4.6
         GRADIANT         2.6±3.9     5.0±5.3     3.8±2.4
         Proposed method  2.7±1.3     3.1±1.7     2.9±1.5
  4      Massy_HNU        35.8±35.3   8.3±4.1     22.1±17.6
         GRADIANT         5.0±4.5     15.0±7.1    10.0±5.0
         Proposed method  9.3±5.6     10.4±6.0    9.5±6.0

5.4.2 Experimental Comparison

5.4.2.1 Ablation Study

Advantage of the proposed architecture. We compare four architectures to demonstrate the advantages of the proposed loss layers and non-rigid registration layer. Model 1 has an architecture similar to the CNN part in our method (Fig. 5.3), except that it is extended with additional pooling layers, fully connected layers, and a softmax loss for binary classification. Model 2 is the CNN part in our method with the depth map loss function; we simply use \|\hat{D}\|_2 for classification. Model 3 contains the CNN and RNN parts without the non-rigid registration layer. Both the depth map and rPPG loss functions are utilized in this model. However, the RNN part processes unregistered feature maps from the CNN. Model 4 is the proposed architecture.

We train all four models with the live and spoof videos from 20 subjects of SiW. We compute the cross-testing performance of all models on Protocol 1 of the Oulu database. The TDRs at different FDRs are reported in Tab. 5.2. Model 1 has poor performance due to the binary supervision. In comparison, by only using the depth map as supervision, Model 2 achieves substantially better performance. After adding the RNN part with the rPPG supervision, our proposed Model 4 achieves a further performance improvement. By comparing Models 4 and 3, we can see the advantage of the non-rigid registration layer. It is clear that the RNN part cannot use the feature maps directly for tracking the changes in the activations and estimating the rPPG signals.

Advantage of longer sequences. To show the advantage of utilizing longer sequences for estimating the rPPG, we train and test our model with sequence lengths N_f of 5, 10, and 20, using intra testing on Oulu Protocol 2. From Tab. 5.3, we can see that by increasing the sequence length, the ACER decreases, due to more reliable rPPG estimation. Despite the benefit of longer sequences, in practice we are limited by the GPU memory size and forced to decrease the image size to 128x128 for all experiments in Tab. 5.3. Hence, we set N_f to 5 with an image size of 256x256 in the subsequent experiments, due to the importance of higher resolution (e.g., a lower ACER of 2.5% in Tab. 5.4 is achieved than the 4.16% here).
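For reference, the evaluation metrics used throughout this chapter can be computed as in the sketch below. It is a simplified reading: strictly, APCER is defined per presentation-attack-instrument species and the maximum over species is reported; the sketch collapses this to a single attack set for brevity.

```python
import numpy as np

def acer_metrics(scores_attack, scores_live, tau):
    """APCER / BPCER / ACER at decision threshold tau, where a higher
    score means the sample is more likely to be live."""
    apcer = np.mean(scores_attack >= tau)   # attacks accepted as live
    bpcer = np.mean(scores_live < tau)      # live faces rejected
    return apcer, bpcer, (apcer + bpcer) / 2

def hter(far, frr):
    """Half Total Error Rate: half the sum of FAR and FRR."""
    return (far + frr) / 2

print(acer_metrics(np.array([0.1, 0.6]), np.array([0.8, 0.9]), 0.5))
# -> (0.5, 0.0, 0.25)
```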
5.4.2.2 Intra Testing

We perform intra testing on the Oulu and SiW databases. For Oulu, we follow the four protocols [15] and report their APCER, BPCER and ACER. Tab. 5.4 shows the comparison of our proposed method and the best two methods for each protocol in the face anti-spoofing competition [15]. Our method achieves the lowest ACER in 3 out of 4 protocols; we have a slightly worse ACER on Protocol 2. To set a baseline for future study on SiW, we define three protocols for SiW. Protocol 1 deals with variations in face pose and expression: we train using the first 60 frames of the training videos, which are mainly frontal-view faces, and test on all testing videos. Protocol 2 evaluates the performance of cross-spoof-medium testing for the replay attack. Protocol 3 evaluates the performance of cross-PA testing, i.e., from print attack to replay attack and vice versa. Tab. 5.5 shows the protocol definitions and our performance on each protocol.

Table 5.5: The intra-testing results on the three protocols of SiW.

  Prot.  Subset  Subject #  Attack           APCER (%)   BPCER (%)   ACER (%)
  1      Train   90         First 60 frames  3.58        3.58        3.58
         Test    75         All
  2      Train   90         3 display        0.57±0.69   0.57±0.69   0.57±0.69
         Test    75         1 display
  3      Train   90         print (display)  8.31±3.81   8.31±3.80   8.31±3.81
         Test    75         display (print)

Figure 5.8: (a) 8 successful anti-spoofing examples and their estimated depth maps and rPPG signals. (b) 4 failure examples: the first two are live and the other two are spoof. Note our ability to estimate discriminative depth maps and rPPG signals.

5.4.2.3 Cross Testing

To demonstrate the generalization of our method, we perform multiple cross-testing experiments. Our model is trained with the live and spoof videos of 80 subjects in SiW, and tested on all protocols of Oulu. The ACERs on Protocols 1-4 are, respectively: 10.0%, 14.1%, 13.8±5.7%, and 10.0±8.8%. Comparing these cross-testing results to the intra-testing results in [15], we are ranked sixth on the average ACER of the four protocols, among the 15 participants of the face anti-spoofing competition. Especially on Protocol 4, the hardest one among all protocols, we achieve the same ACER of 10.0% as the top performer. This is a notable result, since cross testing is known to be substantially harder than intra testing, and yet our cross-testing result is comparable with the top intra-testing performance. This demonstrates the generalization ability of our learnt model.

Furthermore, we utilize the CASIA-MFSD and Replay-Attack databases to perform cross testing between them, which is widely used as a cross-testing benchmark. Tab. 5.6 compares the cross-testing HTER of different methods. Our proposed method reduces the cross-testing errors on the Replay-Attack and CASIA-MFSD databases by 8.9% and 24.6% respectively, relative to the previous SOTA.

Table 5.6: Cross testing on CASIA-MFSD vs. Replay-Attack (HTER).

  Method               Train CASIA-MFSD /   Train Replay-Attack /
                       Test Replay-Attack   Test CASIA-MFSD
  Motion [34]          50.2%                47.9%
  LBP [34]             55.9%                57.6%
  LBP-TOP [34]         49.7%                60.6%
  Motion-Mag [11]      50.1%                47.0%
  Spectral cubes [90]  34.4%                50.0%
  CNN [130]            48.5%                45.5%
  LBP [16]             47.0%                39.6%
  Colour Texture [17]  30.3%                37.7%
  Proposed method      27.6%                28.4%

5.4.2.4 Visualization and Analysis

In the proposed architecture, the frontalized feature maps are utilized as input to the RNN part and are supervised by the rPPG loss function. The values of these maps can show the importance of different facial areas to the rPPG estimation. Fig. 5.9 shows the mean and standard deviation of the frontalized feature maps, computed from 1,080 live and spoof videos of Oulu. We can see that the side areas of the forehead and cheek have a higher influence on the rPPG estimation.

Figure 5.9: Mean/std of the frontalized feature maps for live and spoof.

While the goal of our system is to detect PAs, our model is trained to estimate the auxiliary information. Hence, in addition to anti-spoofing, we would also like to evaluate the accuracy of the auxiliary information estimation.

Figure 5.10: The MSE of estimating depth maps and rPPG signals.
For this purpose, we calculate the accuracy of estimating depth maps and rPPG signals for the testing data in Protocol 2 of Oulu. As shown in Fig. 5.10, the accuracy of both estimations on spoof data is high, while that on the live data is relatively lower. Note that the depth estimation of the mouth area has more errors, which is consistent with the fewer activations of the same area in Fig. 5.9. Examples of successful and failure cases in estimating depth maps and rPPG signals are shown in Fig. 5.8.

Finally, we conduct a statistical analysis of the failure cases, since our system can determine potential causes using the auxiliary information. With Protocol 2 of Oulu, we identify 31 failure cases (2.7% ACER). For each case, we calculate whether using its depth map or rPPG signal alone would fail. In total, 29/31, 13/31, and 11/31 samples fail due to the depth map, the rPPG signals, or both, respectively. This indicates the future research direction.

5.5 Summary

This chapter identifies the importance of auxiliary supervision for deep model-based face anti-spoofing. The proposed network combines CNN and RNN architectures to jointly estimate the depth of face images and the rPPG signal of face videos. We introduce the SiW database, which contains more subjects and variations than prior databases. Finally, we experimentally demonstrate the superiority of our method.

Chapter 6

Face De-Spoofing: Anti-Spoofing via Noise Modeling

6.1 Introduction

The print and the replay attacks are the two most common spoof types, and they have been well studied previously from different perspectives. The cue-based methods aim to detect liveness cues [82, 83] (e.g., eye blinking, head motion) to classify live videos. But these methods can be fooled by video replay attacks. The texture-based methods attempt to compare the texture difference between live and spoof faces, using features such as LBP [33, 34] and HOG [59, 131].

Similar to the texture-based methods, CNN-based methods [66, 83, 130] design a unified process of feature extraction and classification. With a softmax loss based binary supervision, they have the risk of overfitting on the training data. Regardless of the perspective, almost all prior works treat face anti-spoofing as a black box binary classification problem. In contrast, in this chapter, we propose to open the black box by modeling the process of how a spoof image is generated from its original live image.

Our approach is motivated by the classic image de-X problems, such as image de-noising and de-blurring [36, 51, 62, 85]. In image de-noising, the corrupted image is regarded as a degradation from additive noise, e.g., salt-and-pepper noise and white Gaussian noise. In image de-blurring, the uncorrupted image is degraded by motion, which can be described as a process of convolution.

Figure 6.1: The illustration of the face spoofing and de-spoofing processes. The de-spoofing process aims to estimate a spoof noise from a spoof face and reconstruct the live face. The estimated spoof noise should be discriminative for face anti-spoofing.

Similarly, in face anti-spoofing, the spoof image can be viewed as a re-rendering of the live image, but with some "special" noise from the spoof medium and the environment. Hence, the natural question is: can we recover the underlying live image when given a spoof image, similar to image de-noising? Yes. This chapter shows "how" to do this. We call the process of decomposing a spoof face into the spoof noise pattern and a live face Face De-Spoofing, shown in Fig. 6.1. Similar to the previous de-X works, the degraded image x \in R^m can be formulated as a function of the original image \hat{x}, the degradation matrix A \in R^{m \times m} and an additive noise n \in R^m.
x = A\hat{x} + n = \hat{x} + (A - I)\hat{x} + n = \hat{x} + N(\hat{x}),   (6.1)

where N(\hat{x}) = (A - I)\hat{x} + n is the image-dependent noise function. Instead of solving for A and n, we decide to estimate N(\hat{x}) directly, since this is more solvable under the deep learning framework [40, 63, 100, 101, 145]. Essentially, by estimating N(\hat{x}) and \hat{x}, we aim to peel off the spoof noise and reconstruct the original live face. Likewise, if given a live face, the face de-spoofing model should return the face itself plus zero noise. Note that our face de-spoofing is designed to handle print attacks, replay attacks and possibly make-up attacks, but our experiments are limited to the first two PAs.

The benefits of face de-spoofing are twofold: 1) it reverses, or undoes, the spoof generation process, which helps us to model and visualize the spoof noise pattern of different spoof mediums; 2) the spoof noise itself is discriminative between live and spoof images and hence is useful for face anti-spoofing.

While face de-spoofing shares the same challenges as other image de-X problems, it has a few distinct difficulties to conquer:

No Ground Truth: Image de-X works often use synthetic data, where the original undegraded image can be used as ground truth for supervised learning. In contrast, we have no access to \hat{x}, the corresponding live face of a spoof face image.

No Noise Model: There is no comprehensive study and understanding of the spoof noise. Hence it is not clear how we can constrain the solution space to faithfully estimate the spoof noise pattern.

Diverse Spoof Mediums: Each type of spoof utilizes different spoof mediums for generating spoof images. Each spoof medium represents a type of noise pattern.

To address these challenges, we propose several constraints and supervisions based on our prior knowledge and the conclusions from a case study (in Section 6.2.1). Given that a live face has no spoof noise, we impose the constraint that N(\hat{x}) of a live image is zero. Based on our study, we assume that the spoof noise of a spoof image is ubiquitous, i.e., it exists everywhere in the spatial domain of the image, and is repetitive, i.e., it is the spatial repetition of certain noise in the image. The repetitiveness can be encouraged by maximizing the high-frequency magnitude of the estimated noise in the Fourier domain.

With such constraints and the auxiliary supervisions proposed in [73], a novel CNN architecture is presented in this chapter. Given an image, one CNN is designed to synthesize the spoof noise pattern and reconstruct the corresponding live image. In order to examine the reconstructed live image, we train another CNN with auxiliary supervision and a GAN-like discriminator in an end-to-end fashion. These two networks are designed to ensure the quality of the reconstructed image regarding its discriminativeness between live and spoof, and the visual plausibility of the synthesized live image.

To summarize, the main contributions of this work include:

- We offer a new perspective for detecting the face spoofs of print attack and replay attack, by inversely decomposing a spoof face image into the live face and the spoof noise, without having the ground truth of either.

- A novel CNN architecture is proposed for face de-spoofing, where appropriate constraints and auxiliary supervisions are imposed.

- We demonstrate the value of face de-spoofing by its contribution to face anti-spoofing and the visualization of the spoof noise patterns.

6.2 Face De-Spoofing

In this section, we start with a case study of the spoof noise pattern, which demonstrates a few important characteristics of the noise. This study motivates us to design the novel CNN architecture that will be presented in Sec. 6.2.2.1.

6.2.1 A Case Study of Spoof Noise Pattern

The core task of face de-spoofing is to estimate the relevant noise pattern in the given face image. Despite the strength of using a CNN model, we are still facing the challenge of learning without the ground truth of the noise pattern. To address this challenge, we would like to carry out a case study on the noise pattern, with the objectives of answering the following questions: 1) is Eqn. 6.1 a good model of the spoof noise? 2) what characteristics does the spoof noise hold?

Figure 6.2: The illustration of the spoof noise pattern. Left: a live face and its local regions. Right: two registered spoof faces from a print attack and a replay attack. For each sample, we show the local region of the face, the intensity difference to the live image, the magnitude of the 2D FFT, and the local peaks in the frequency domain that indicate the spoof noise pattern. Best viewed electronically.
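The decomposition of Eqn. 6.1 can be illustrated numerically. The sketch below uses synthetic stand-ins for A and n, since the real degradation is unknown and is what the DS Net (introduced later) learns to estimate.

```python
import numpy as np

def spoof_noise(live, A, n):
    """Illustration of the degradation model in Eq. 6.1 on a flattened
    image x_hat in R^m: x = A x_hat + n = x_hat + N(x_hat), with
    N(x_hat) = (A - I) x_hat + n the image-dependent spoof noise."""
    m = live.size
    x = A @ live + n
    N = (A - np.eye(m)) @ live + n
    assert np.allclose(x, live + N)           # the two forms of Eq. 6.1 agree
    return x, N

live = np.array([0.2, 0.5, 0.8])              # a tiny 3-pixel "live image"
A = 0.9 * np.eye(3)                           # e.g., a color-gamut shrink
n = np.array([0.05, 0.0, -0.02])              # e.g., additive artifacts
x, N = spoof_noise(live, A, n)
print(N)                                      # the noise to be estimated
```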
Let us denote a genuine face as \hat{I}. By using printed paper or video replay on digital devices, the attacker can manufacture a spoof image I from \hat{I}. Considering no non-rigid deformation between I and \hat{I}, we summarize the degradation from \hat{I} to I in the following steps:

1. Color distortion: Color distortion is due to the narrower color gamut of the spoof medium (e.g., an LCD screen or toner cartridge). It is a projection from the original color space to a tinier color subspace. This noise is dependent on the color intensity of the subject, and hence it may apply as a degradation matrix to the genuine face \hat{I} during the degradation.

2. Display artifacts: Spoof mediums often use several nearby dots/sensors to approximate one pixel's color, and they may also display the face differently than the original size. The approximation and down-sampling procedures cause a certain degree of high-frequency information loss, blurring, and pixel perturbation. This noise may also apply as a degradation matrix due to its subject dependence.

3. Presenting artifacts: When presenting the spoof medium to the camera, the medium interacts with the environment and brings several artifacts, including reflection and transparency of the surface. This noise may apply as an additive noise.

4. Imaging artifacts: Imaging lattice patterns such as screen pixels on the camera's sensor array (e.g., CMOS and CCD) causes interference of light. This effect leads to aliasing and creates the moiré pattern, which appears in replay attacks and some print attacks with strong lattice artifacts. This noise may apply as an additive noise.

These four steps show that the spoof image I can be generated via applying degradation matrices and additive noises to \hat{I}, which is basically conveyed by Eqn. 6.1. As expressed by Eqn. 6.1, the spoof image is the summation of the live image and an image-dependent noise. To further validate this model, we show an example in Fig. 6.2. Given a high-quality live image, we carefully produce two spoof images via print and replay attack, with minimal non-rigid deformation. After each spoof image is registered with the live image, the live image becomes the ground truth live image if we would perform de-spoofing on the spoof image. This allows us to compute the difference between the live and spoof images, which is the noise pattern N(\hat{I}). To analyze its frequency properties, we perform an FFT on the spoof noise and show the 2D shifted magnitude response.

In both spoof cases, we observe a high response in the low-frequency domain, which is related to the color distortion and display artifacts. In the print attack, the repetitive noise of Step 3 leads to a few "peak" responses in the high-frequency domain. Similarly, in the replay attack, the visible moiré pattern appears as several spurs in the low-frequency domain, and the lattice pattern that causes the moiré pattern is represented as peaks in the high-frequency domain. Moreover, the spoof patterns are uniformly distributed in the image domain due to the uniform texture of the spoof mediums. And the high response of the repetitive pattern in the frequency domain exactly demonstrates that it appears widely in the image and thus can be viewed as ubiquitous.
Under this ideal registration, the comparison between live and spoof images provides us a basic understanding of the spoof noise pattern. It is a type of texture that has the characteristics of being repetitive and ubiquitous. Based on this modeling and these noise characteristics, we design a network to estimate the noise without access to the precisely registered ground truth live image that this case study has.

6.2.2 De-Spoof Network

6.2.2.1 Network Overview

Figure 6.3: The proposed network architecture.

Figure 6.3 shows the overall network architecture of our proposed method. It consists of three parts: the De-Spoof Net (DS Net), the Discriminative Quality Net (DQ Net), and the Visual Quality Net (VQ Net). The DS Net is designed to estimate the spoof noise pattern N (i.e., the output of N(\hat{I})) from the input image I. The live face \hat{I} can then be reconstructed by subtracting the estimated noise N from the input image I. This reconstructed image \hat{I} should be both visually appealing and indeed live, which will be safeguarded by the VQ Net and the DQ Net, respectively. All networks can be trained in an end-to-end fashion. The details of the network structure are shown in Tab. 6.1.

Table 6.1: The network structures of DS Net, DQ Net and VQ Net. Each convolutional layer is followed by an exponential linear unit (ELU) and a batch normalization layer. The input image size for DS Net is 256x256x6. All the convolutional filters are 3x3. The 0\1 Map Net is the bottom-left part, i.e., conv1-10, conv1-11, and conv1-12. Each entry lists layer, channels/stride, and output size.

  DS Net (encoder): input image; conv1-0 24/1 256; conv1-1 20/1 256; conv1-2 25/1 256; conv1-3 20/1 256; pool1-1 -/2 128; conv1-4 20/1 128; conv1-5 25/1 128; conv1-6 20/1 128; pool1-2 -/2 64; conv1-7 20/1 64; conv1-8 25/1 64; conv1-9 20/1 64; pool1-3 -/2 32; short-cut connection pool1-1+pool1-2+pool1-3; conv1-10 28/1 32; conv1-11 16/1 32; conv1-12 1/1 32.
  DS Net (decoder): input pool1-1+pool1-2+pool1-3; resize -/- 256; conv2-1 28/1 256; conv2-2 24/1 256; conv2-3 20/1 256; conv2-4 20/1 256; conv2-5 20/1 256; conv2-6 16/1 256; conv2-7 16/1 256; conv2-8 6/1 256; live = image - conv2-8.
  DQ Net: input live; conv3-0 64/1 256; conv3-1 128/1 256; conv3-2 196/1 256; conv3-3 128/1 256; pool3-1 -/2 128; conv3-4 128/1 128; conv3-5 196/1 128; conv3-6 128/1 128; pool3-2 -/2 64; conv3-7 128/1 64; conv3-8 196/1 64; conv3-9 128/1 64; pool3-3 -/2 32; short-cut connection pool3-1+pool3-2+pool3-3; conv3-10 128/1 32; conv3-11 64/1 32; conv3-12 1/1 32.
  VQ Net: input {image, live}; conv4-1 24/2 256; conv4-2 20/2 256; pool4-1 -/2 128; conv4-3 20/1 128; conv4-4 16/1 128; pool4-2 -/2 64; conv4-5 12/1 64; conv4-6 6/1 64; pool4-3 -/2 32; vectorize 1024; fc4-1 1/1 100; dropout 0.2; fc4-2 1/1 2.

As the core part, the DS Net is designed as an encoder-decoder structure with the input I \in R^{256 \times 256 \times 6}. Here the 6 channels are the RGB + HSV color spaces, following the suggestion in [5]. In the encoder part, we stack 10 convolutional layers with 3 pooling layers. Inspired by the residual network [44], we follow them with a short-cut connection: concatenating the responses from pool1-1 and pool1-2 with pool1-3, and then sending them to conv1-10. This operation helps us to pass the feature responses from different scales to the later stages and eases the training procedure. Going through 3 more convolutional layers, the responses F \in R^{32 \times 32 \times 32} from conv1-12 are the feature representation of the spoof noise patterns. The higher magnitudes the responses have, the more likely the input is a spoof.

Out of the encoder, the feature representation F is fed into the decoder to reconstruct the spoof noise pattern. F is directly resized to the input spatial size of 256x256. This introduces no extra grid artifacts, which exist in the alternative approach of using a deconvolutional layer. Then, we pass the resized F through several convolutional layers to reconstruct the noise pattern N. According to Eqn. 6.1, the reconstructed live image can be retrieved by: \hat{x} = x - N(\hat{x}) = I - N.
Each convolutional layer in the DS Net is equipped with an exponential linear unit (ELU) and a batch normalization layer. To supervise the training of the DS Net, we design multiple loss functions: losses from the DQ Net and VQ Net for the image quality, a 0\1 map loss, and noise property losses. We introduce these loss functions in Secs. 6.2.3-6.2.4.

6.2.3 DQ Net and VQ Net

While we do not have the ground truth to supervise the estimated spoof noise pattern, it is possible to supervise the reconstructed live image, which implicitly guides the noise estimation. To estimate a good-quality spoof noise, the reconstructed live image should be both quantitatively and visually recognized as live. For this purpose, we propose two networks in our architecture: the Discriminative Quality Net (DQ Net) and the Visual Quality Net (VQ Net). The VQ Net aims to guarantee that the reconstructed live face is photorealistic. The DQ Net is proposed to guarantee that the reconstructed face would indeed be considered as live, based on the judgment of a pre-trained face anti-spoofing network. The details of our proposed architecture are shown in Tab. 6.1.

6.2.3.1 Discriminative Quality Net

We follow the state-of-the-art network architecture for face anti-spoofing [73] to build our DQ Net. It is a fully convolutional network with three blocks and three additional convolutional layers. Each block consists of three convolutional layers and one pooling layer. The feature maps after each pooling layer are resized and stacked to feed into the following convolutional layers. Finally, the DQ Net is supervised to estimate the pseudo-depth D of an input face, where D for a live face is the depth of the face shape and D for a spoof face is a zero map, as for a flat surface. We adopt the 3D face alignment algorithm in [74] to estimate the face shape and render the depth via Z-Buffering.

Similar to the previous work [50], the DQ Net is pre-trained to obtain the semantic knowledge of live faces and spoof faces. During the training of the DS Net, the parameters of the DQ Net are fixed. Since the reconstructed images \hat{I} are live images, the corresponding pseudo-depth D should be the depth of the face shape. The backpropagation of the error from the DQ Net guides the DS Net to estimate the spoof noise pattern that should be subtracted from the input image,

J_{DQ} = \| CNN_{DQ}(\hat{I}) - D \|_1,   (6.2)

where CNN_{DQ} is a fixed network and D is the depth of the face shape.

6.2.3.2 Visual Quality Net

We deploy a GAN to verify the visual quality of the estimated live image \hat{I}. Given both a real live image I_{live} and a synthesized live image \hat{I}, the VQ Net is trained to distinguish between I_{live} and \hat{I}. Meanwhile, the DS Net tries to reconstruct photorealistic live images for which the VQ Net would classify them as non-synthetic (or real) images. The VQ Net consists of 6 convolutional layers and a fully connected layer with a 2D vector as the output, which represents the probability of the input image being real or synthetic. In each iteration during training, the VQ Net is evaluated with two batches. In the first one, the DS Net is fixed and we update the VQ Net,

J_{VQ_{train}} = -E_{I \in R} \log(CNN_{VQ}(I)) - E_{I \in S} \log(1 - CNN_{VQ}(CNN_{DS}(I))),   (6.3)

where R and S are the sets of real and synthetic images, respectively. In the second batch, the VQ Net is fixed and the DS Net is updated,

J_{VQ_{test}} = -E_{I \in S} \log(CNN_{VQ}(CNN_{DS}(I))).   (6.4)
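The two GAN objectives of Eqs. 6.3 and 6.4 can be sketched as below. Here vq (mapping an image to the probability of being a real live image) and ds (mapping an image to its reconstructed live image) are hypothetical stand-in callables; in the thesis they are CNNs trained in alternation, first updating the VQ Net with the DS Net fixed, then updating the DS Net with the VQ Net fixed.

```python
import numpy as np

def vq_losses(vq, ds, real_batch, spoof_batch):
    """Monte-Carlo estimates of J_VQ_train (Eq. 6.3) and J_VQ_test (Eq. 6.4)."""
    eps = 1e-8                                  # numerical safety for log
    real = np.array([vq(i) for i in real_batch])
    synth = np.array([vq(ds(i)) for i in spoof_batch])
    j_vq_train = -np.mean(np.log(real + eps)) - np.mean(np.log(1 - synth + eps))
    j_vq_test = -np.mean(np.log(synth + eps))   # DS Net fools the discriminator
    return j_vq_train, j_vq_test

# Toy stand-ins: a 'discriminator' on the image mean, a scaling 'de-spoofer'.
vq = lambda img: float(np.clip(img.mean(), 0.01, 0.99))
ds = lambda img: img * 0.95
print(vq_losses(vq, ds, [np.full((4, 4), 0.9)], [np.full((4, 4), 0.4)]))
```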
6.2.4 Loss Functions

The main challenge in spoof modeling is the lack of ground truth for the spoof noise pattern. Since we have concluded some properties of the spoof noise in Sec. 6.2.1, we can leverage them to design several novel loss functions that constrain the convergence space. First, we introduce the magnitude loss to enforce the spoof noise of a live image to be zero. Second, the zero\one map loss is used to enforce the ubiquitousness of the spoof noise. Third, we encourage the repetitiveness property of the spoof noise via the repetitive loss. We describe the three loss functions as follows.

6.2.4.1 Magnitude Loss

The spoof noise pattern for live images is zero. The magnitude loss can be utilized to impose this constraint on the estimated noise. Given the estimated noise N and the reconstructed live image \hat{I} = I - N of an original live image I, we have

J_m = \| N \|_1.   (6.5)

Zero\One Map Loss: To learn discriminative features in the encoder layers, we define a sub-task in the DS Net to estimate a zero-map for live faces and a one-map for spoof faces. Since this is a per-pixel supervision, it is also a constraint of ubiquitousness on the noise. Moreover, the 0\1 map enables the receptive field of each pixel to cover a local area, which helps to learn generalizable features for this problem. Formally, given the extracted features F from an input face image I in the encoder, we have

J_z = \| CNN_{01map}(F; \Theta) - M \|_1,   (6.6)

where M \in 0^{32 \times 32} or M \in 1^{32 \times 32} is the zero\one map label.

6.2.4.2 Repetitive Loss

Based on the previous discussion, we assume the spoof noise pattern to be repetitive, because it is generated from a repetitive spoof medium. To encourage the repetitiveness, we convert the estimated noise N to the Fourier domain and compute the maximum value in the high-frequency band. The existence of a high peak is indicative of a repetitive pattern. We would like to maximize this peak for spoof images, but minimize it for live images, via the following loss function:

J_r = -\max(H(\mathcal{F}(N), k)) for I \in Spoof, and J_r = \| \max(H(\mathcal{F}(N), k)) \|_1 for I \in Live,

where \mathcal{F} is the Fourier transform operator and H is an operator for masking the low-frequency domain of an image, i.e., setting a k x k region in the center of the shifted 2D Fourier response to zero.

Finally, the total loss function in our training is the weighted summation of the aforementioned loss functions and the supervisions for the image qualities,

J_T = J_z + \lambda_1 J_m + \lambda_2 J_r + \lambda_3 J_{DQ} + \lambda_4 J_{VQ_{test}},   (6.7)

where \lambda_1, \lambda_2, \lambda_3, \lambda_4 are the weights. During training, we alternate between optimizing Eqn. 6.7 and Eqn. 6.3.

6.3 Experimental Results

6.3.1 Experimental Setup

Databases. We evaluate our work on three face anti-spoofing databases with print and replay attacks: Oulu-NPU [19], CASIA-MFSD [143] and Replay-Attack [28]. Oulu-NPU [19] is a high-resolution database that considers many real-world variations. Oulu-NPU also includes 4 testing protocols: Protocol 1 evaluates on the illumination variation, Protocol 2 examines the influence of different spoof mediums, Protocol 3 inspects the effect of different camera devices, and Protocol 4 contains all the challenges above, which is close to the scenario of cross testing. CASIA-MFSD [143] contains videos with resolutions 640x480 and 1280x720. Replay-Attack [28] includes videos of 320x240. These two databases are often used for cross testing [83].

Parameter setting. We implement our method in TensorFlow [2]. Models are trained with a batch size of 6 and a learning rate of 3e-5. We set k = 64 in the repetitive loss and set \lambda_1 to \lambda_4 in Eqn. 6.7 to 3, 0.005, 0.1 and 0.016, respectively. The DQ Net is trained separately and remains fixed during the update of the DS Net and VQ Net, but all sub-networks are trained with the same and respective data in each protocol.

Evaluation metrics. To compare with previous methods, we use the Attack Presentation Classification Error Rate (APCER) [46], the Bona Fide Presentation Classification Error Rate (BPCER) [46] and ACER = (APCER + BPCER)/2 [46] for the intra testing on Oulu-NPU, and the Half Total Error Rate (HTER) [9], half of the summation of FAR and FRR, for the cross testing between CASIA-MFSD and Replay-Attack.
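As a concrete reading of the repetitive loss of Sec. 6.2.4.2, with the k = 64 mask just mentioned, the following minimal NumPy sketch operates on a single noise channel; the full method applies this within training, not as a standalone function.

```python
import numpy as np

def repetitive_loss(N, k=64, is_spoof=True):
    """Repetitive loss sketch on an estimated noise channel N (2D array).
    The shifted 2D FFT magnitude is masked by zeroing a central k x k
    low-frequency region (the operator H); the peak of the remaining
    high-frequency band is maximized for spoof and pushed to 0 for live."""
    F = np.abs(np.fft.fftshift(np.fft.fft2(N)))
    c0, c1 = F.shape[0] // 2, F.shape[1] // 2
    F[c0 - k // 2:c0 + k // 2, c1 - k // 2:c1 + k // 2] = 0.0   # mask H
    peak = F.max()
    return -peak if is_spoof else np.abs(peak)

noise = np.random.randn(256, 256) * 0.01
print(repetitive_loss(noise, is_spoof=True))
```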
Table 6.2: The accuracy of the different outputs of the proposed architecture and their fusions.

  Method   0\1 map   Spoof noise   Depth map   Fusion (noise, depth)   Fusion of all three
                                               Maximum    Average      Maximum    Average
  APCER    2.50      1.70          1.66        1.70       1.27         1.70       1.27
  BPCER    2.52      1.70          1.68        1.73       1.73         1.73       1.73
  ACER     2.51      1.70          1.67        1.72       1.50         1.72       1.50

6.3.2 Ablation Study

Using Oulu-NPU Protocol 1, we perform three studies: on the effect of score fusion, on the importance of each loss function, and on the influence of image resolution and blurriness.

Different fusion methods. In the proposed architecture, three outputs can be utilized for classification: the norms of either the 0\1 map, the spoof noise pattern or the depth map. Because of the discriminativeness enabled by our learning, we can simply use a rudimentary classifier like the L-1 norm. Note that a more advanced classifier is applicable and would likely lead to higher performance. Table 6.2 shows the performance of each output and their fusions with the maximum and average rules. It shows that the fusion of the spoof noise and the depth map achieves the best performance. However, adding the 0\1 map scores does not improve the accuracy, since it contains the same information as the spoof noise. Hence, for the rest of the experiments, we report performance from the average fusion of the spoof noise N and the depth map \hat{D}, i.e., score = (\|N\|_1 + \|\hat{D}\|_1)/2.

Advantage of each loss function. We have three main loss functions in our proposed architecture. To show the effect of each loss function, we train a network with each loss excluded one by one. By disabling the magnitude loss, the 0\1 map loss and the repetitive loss, we obtain ACERs of 5.24, 2.34 and 1.50, respectively. To further validate the repetitive loss, we perform an experiment on high-resolution images by changing the network input to the cheek region of the original 1080P resolution. The ACER of the network with the repetitive loss is 2.92, but the network without it cannot converge.

Resolution and blurriness. As shown in the ablation study of the repetitive loss, image quality is critical for achieving a high accuracy. The spoof noise pattern may not be detectable in low-resolution or motion-blurred images. The testing results on different image resolutions and blurriness are shown in Tab. 6.3. These results validate that the spoof noise pattern is less discriminative for lower-resolution or blurry images, as the high-frequency part of the input images contains most of the spoof noise pattern.

Table 6.3: ACER of the proposed method with different image resolutions and blurriness. To create blurry images, we apply Gaussian filters with different kernel sizes to the input images.

  Resolution   256x256   128x128   64x64
  APCER        1.27      2.27      5.24
  BPCER        1.73      3.36      5.30
  ACER         1.50      3.07      5.27

  Blurriness   1x1    3x3    5x5    7x7    9x9
  APCER        1.27   2.29   3.12   3.95   4.79
  BPCER        1.73   2.50   3.33   4.16   5.00
  ACER         1.50   2.39   3.22   4.06   4.89

Table 6.4: The intra-testing results on the 4 protocols of Oulu-NPU.

  Protocol  Method            APCER (%)   BPCER (%)   ACER (%)
  1         CPqD [15]         2.9         10.8        6.9
            GRADIANT [15]     1.3         12.5        6.9
            Auxiliary [73]    1.6         1.6         1.6
            Ours              1.2         1.7         1.5
  2         MixedFASNet [15]  9.7         2.5         6.1
            Ours              4.2         4.4         4.3
            Auxiliary [73]    2.7         2.7         2.7
            GRADIANT          3.1         1.9         2.5
  3         MixedFASNet       5.3±6.7     7.8±5.5     6.5±4.6
            GRADIANT          2.6±3.9     5.0±5.3     3.8±2.4
            Ours              4.0±1.8     3.8±1.2     3.6±1.6
            Auxiliary [73]    2.7±1.3     3.1±1.7     2.9±1.5
  4         Massy_HNU [15]    35.8±35.3   8.3±4.1     22.1±17.6
            GRADIANT          5.0±4.5     15.0±7.1    10.0±5.0
            Auxiliary [73]    9.3±5.6     10.4±6.0    9.5±6.0
            Ours              5.1±6.3     6.1±5.1     5.6±5.7
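For clarity, the average fusion score adopted in the experiments below amounts to the following sketch; whether the L1 norms are summed or averaged per element is our assumption, as the thesis does not specify the normalization.

```python
import numpy as np

def despoof_score(N, D_hat):
    """Average fusion: score = (||N||_1 + ||D_hat||_1) / 2 over the
    estimated spoof noise N and depth map D_hat. A higher score means
    the input is more likely a spoof (larger noise, nonzero depth map
    response from the 0-depth-for-spoof convention is reversed here)."""
    return (np.abs(N).mean() + np.abs(D_hat).mean()) / 2

print(despoof_score(np.zeros((256, 256, 6)), np.zeros((32, 32))))  # live -> 0
```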
6.3.3 Experimental Comparison

To show the performance of our proposed method, we present our accuracy in the intra testing of Oulu-NPU and the cross testing between CASIA-MFSD and Replay-Attack.

6.3.3.1 Intra Testing

We compare our intra testing performance on all 4 protocols of Oulu-NPU. Table 6.4 shows the comparison of our method and the best 3 out of 18 previous methods [15, 73]. Our proposed method achieves promising results on all protocols. Specifically, we outperform the previous state of the art by a large margin in Protocol 4, which is the most challenging protocol and similar to cross testing.

Table 6.4: The intra testing results on the 4 protocols of Oulu-NPU.

Protocol | Method | APCER (%) | BPCER (%) | ACER (%)
1 | CPqD [15]        | 2.9  | 10.8 | 6.9
1 | GRADIANT [15]    | 1.3  | 12.5 | 6.9
1 | Auxiliary [73]   | 1.6  | 1.6  | 1.6
1 | Ours             | 1.2  | 1.7  | 1.5
2 | MixedFASNet [15] | 9.7  | 2.5  | 6.1
2 | Ours             | 4.2  | 4.4  | 4.3
2 | Auxiliary [73]   | 2.7  | 2.7  | 2.7
2 | GRADIANT         | 3.1  | 1.9  | 2.5
3 | MixedFASNet      | 5.3±6.7   | 7.8±5.5  | 6.5±4.6
3 | GRADIANT         | 2.6±3.9   | 5.0±5.3  | 3.8±2.4
3 | Ours             | 4.0±1.8   | 3.8±1.2  | 3.6±1.6
3 | Auxiliary [73]   | 2.7±1.3   | 3.1±1.7  | 2.9±1.5
4 | Massy_HNU [15]   | 35.8±35.3 | 8.3±4.1  | 22.1±17.6
4 | GRADIANT         | 5.0±4.5   | 15.0±7.1 | 10.0±5.0
4 | Auxiliary [73]   | 9.3±5.6   | 10.4±6.0 | 9.5±6.0
4 | Ours             | 5.1±6.3   | 6.1±5.1  | 5.6±5.7

6.3.3.2 Cross Testing

We perform cross testing between CASIA-MFSD [143] and Replay-Attack [28]. As shown in Tab. 6.5, our method achieves competitive performance on the cross testing from CASIA-MFSD to Replay-Attack. However, we achieve a worse HTER compared to the best performing methods from Replay-Attack to CASIA-MFSD. We hypothesize the reason is that images of CASIA-MFSD are of much higher resolution than those of Replay-Attack. This shows that the model trained with higher-resolution data can generalize well on lower-resolution testing data, but not the other way around. This is one limitation of the proposed method, and worthy of further research.

Table 6.5: The HTER of different methods for the cross testing between the CASIA-MFSD and the Replay-Attack databases. The top-2 performances in each column are marked with *.

Method | Train: CASIA-MFSD, Test: Replay-Attack | Train: Replay-Attack, Test: CASIA-MFSD
Motion [34]          | 50.2%  | 47.9%
LBP-TOP [34]         | 49.7%  | 60.6%
Motion-Mag [11]      | 50.1%  | 47.0%
Spectral cubes [90]  | 34.4%  | 50.0%
CNN [130]            | 48.5%  | 45.5%
LBP [16]             | 47.0%  | 39.6%
Colour Texture [17]  | 30.3%  | 37.7%*
Auxiliary [73]       | 27.6%* | 28.4%*
Ours                 | 28.5%* | 41.1%

6.3.4 Qualitative Experiments

6.3.4.1 Spoof medium classification

The estimated spoof noise pattern of the test images can be used for clustering them into different groups, where each group represents one spoof medium. To visualize the results, we use t-SNE [76] for dimension reduction. The t-SNE projects the noise N ∈ R^{256×256×6} to 2 dimensions by best preserving the KL divergence distance. Fig. 6.4 shows the distributions of the testing videos on Oulu-NPU Protocol 1. The left image shows that the noise of live is well-clustered, and the noise of spoof is subject dependent, which is consistent with our noise assumption. To obtain a better visualization, we utilize a high-pass filter to extract the high-frequency information of the noise pattern for dimension reduction. The right image shows that the high-frequency part has more subject independent information about the spoof type and can be utilized for classification of the spoof medium.

Figure 6.4: The 2D visualization of the estimated spoof noise for test videos on Oulu-NPU Protocol 1. Left: the estimated noise. Right: the high-frequency band of the estimated noise. Color code used: black = live, green = printer 1, blue = printer 2, magenta = display 1, red = display 2.

To further show the discriminative power of the estimated spoof noise, we divide the testing set of Protocol 1 into training and testing parts and train an SVM for spoof medium classification. We train two models, a three-class classifier (live, print and display) and a five-class classifier (live, print 1, print 2, display 1 and display 2), and they achieve classification accuracies of 82.0% and 54.3% respectively, as shown in Tab. 6.6. Most classification errors of the five-class model are within the same spoof medium. This result is noteworthy given that no label of spoof medium type is provided during the learning of the spoof noise model. Yet the estimated noise actually carries appreciable information regarding the medium type; hence we can observe reasonable results of spoof medium classification. This demonstrates that the estimated noise contains spoof medium information, and indeed we are moving toward estimating the faithful spoof noise residing in each spoof image. In the future, if the performance of spoof medium classification improves, this could bring new impact to applications such as forensics.
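The spoof medium classification above can be sketched with scikit-learn; the linear kernel and the use of flattened noise estimates as features are assumptions, since the chapter does not specify the SVM configuration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def classify_medium(X_train, y_train, X_test, y_test):
    """Fit an SVM on flattened spoof-noise estimates and return the
    confusion matrix and accuracy, as reported in Tab. 6.6."""
    clf = SVC(kernel="linear")  # kernel choice is an assumption
    clf.fit(X_train, y_train)
    return confusion_matrix(y_test, clf.predict(X_test)), clf.score(X_test, y_test)
```

The resulting confusion matrices for the three-class and five-class models are shown in Tab. 6.6.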
Table 6.6: The confusion matrices of spoof medium classification based on the spoof noise pattern.

Three-class model:
Actual \ Predicted | live | print | display
live    | 59 | 1  | 0
print   | 0  | 88 | 32
display | 13 | 8  | 99

Five-class model:
Actual \ Predicted | live | print 1 | print 2 | display 1 | display 2
live      | 59 | 0  | 1  | 0  | 0
print 1   | 0  | 41 | 2  | 11 | 6
print 2   | 0  | 34 | 11 | 9  | 6
display 1 | 10 | 6  | 0  | 13 | 31
display 2 | 8  | 7  | 0  | 6  | 39

6.3.4.2 Successful and failure cases

We show several success and failure cases in Figs. 6.5 and 6.6. Fig. 6.5 shows that the estimated spoof noises are similar within each medium but different from the other mediums. We suspect that the yellowish color in the first four columns is due to the stronger color distortion in the paper attack. The first row shows that the estimated noise for the live images is nearly zero. For the failure cases, we only have a few false positive cases. The failures are due to undesired noise estimation, which will motivate us toward further research.

Figure 6.5: The visualization of input images, estimated spoof noises and estimated live images for test videos of Protocol 1 of the Oulu-NPU database. The first four columns in the first row are paper attacks and the second four are the replay attacks. For a better visualization, we magnify the noise by 5 times and add 128 to its value, to show both positive and negative noise.

Figure 6.6: The failure cases for converting the spoof images to the live ones.

6.4 Summary

This chapter introduces a new perspective for solving face anti-spoofing by inversely decomposing a spoof face into the live face and the spoof noise pattern. A novel CNN architecture with multiple appropriate supervisions is proposed. We design loss functions to encourage the noise pattern of the spoof images to be ubiquitous and repetitive, while the noise of the live images should be zero. We visualize the spoof noise pattern, which can help to gain a deeper understanding of the noise added by each spoof medium. We evaluate the proposed method on multiple widely-used face anti-spoofing databases.

Chapter 7

Conclusions and Future Work

Face alignment is an important research topic because it has many applications in face recognition, face tracking, expression estimation, augmented and virtual reality, etc. By utilizing deep learning methods, face alignment systems have improved dramatically, and they can be used in many commercial applications for face tracking and augmented reality tasks, e.g., Snapchat and Facebook.

In Chapters 2 to 4, we propose three face alignment methods with the same idea of converting 2D face alignment to 3D face alignment and reconstructing the 3D shape of the face. We demonstrate our advantage in performing face alignment for large-pose face images by utilizing the estimated 3D shape. These three methods are extended versions of one another, and both the accuracy and the speed of our pose-invariant face alignment are improved in each method.

The face is one of the most popular biometric modalities that can be used for access control and authentication. Although face recognition systems can achieve very high verification accuracy, if we want to use face recognition systems in practice we need to have a robust face anti-spoofing system. In Chapters 5 and 6, we propose our CNN-based face anti-spoofing methods that use the supervision from both the spatial and temporal auxiliary information for the purpose of robustly detecting face PA from a face video.

7.1 Limitations

Pose-invariant Face Alignment: The main limitation of the proposed PIFA methods is that the estimated 3D shape of the face does not contain the detailed changes in the 3D shape, e.g., wrinkles. This is due to the limited number of ground-truth 2D landmarks which are utilized during training as supervision. Tran et al. [106] show that utilizing unsupervised loss functions based on texture reconstruction is helpful for obtaining a more detailed 3D shape estimation.
Face Anti-Spoofing: The main challenge of anti-spoofing methods, for all biometric modalities, is the unknown spoof attack (spoof medium). The anti-spoofing methods are mainly constrained by the spoof medium data which are used for training them. Making the anti-spoofing methods more generalizable and robust to new spoof mediums is the main limitation of the proposed methods.

7.2 Future Work

Detailed 3D Shape for Face: The 3D shapes estimated by our proposed methods are limited to the 3DMM bases which we utilize to represent the 3D shape. In order to have more powerful bases to represent the 3D shapes, we can learn the bases from the data and make a more detailed reconstruction by incorporating unsupervised loss functions. In [106], a CNN-based method is proposed to learn a set of non-linear bases for representing the 3D shape of the face. Similarly, Tan et al. [103] utilize variational auto-encoders for representing the 3D meshes, in order to have more flexibility for representing the 3D shape and to allow non-linear deformations.

General Object Anti-Spoofing: The idea of face anti-spoofing can be extended to detecting the spoof images of all objects. General object anti-spoofing has applications in detecting spoof images on online shopping websites, e.g., eBay and Amazon. General object anti-spoofing is much more challenging than face anti-spoofing due to the various materials and different textures of objects. The goal is to detect the print attack and the replay attack for general objects.

3D Point Cloud Classification: 3D point cloud classification has several applications in computer vision. Recently, with the advancement of hardware technology, we can have sensors on the phone for capturing the 3D point cloud. In the iPhone X, the point cloud data is utilized for face identification (left in Fig. 7.1). Similarly, other cell phone companies are utilizing the point cloud for face identification. For example, the right side of Fig. 7.1 shows the hardware technology in the Huawei P11 for capturing the point cloud. These data can be utilized for detecting objects and for measuring the lengths of objects and the distances to the objects.

Figure 7.1: Left: A representation of the estimated point cloud in the iPhone X. Right: The hardware technology in the Huawei P11 for capturing the point cloud.

BIBLIOGRAPHY

[1] Explainable Artificial Intelligence (XAI). https://www.darpa.mil/program/explainable-

[2] M. Abadi, A. Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

[3] A. Agarwal, R. Singh, and M. Vatsa. Face anti-spoofing using Haralick features. IEEE, 2016.

[4] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. pages 3444–3451. IEEE, 2013.

[5] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu. Face anti-spoofing using patch and depth-based CNNs. IEEE, 2017.

[6] W. Bao, H. Li, N. Li, and W. Jiang. A liveness detection method for face recognition based on optical flow field. In IASP. IEEE, 2009.

[7] P. N. Belhumeur, D. W. Jacobs, D. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. pages 545–552. IEEE, 2011.

[8] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. 34(4):98, 2015.

[9] S. Bengio and J. Mariéthoz. A statistical significance test for person authentication. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.

[10] S. Bharadwaj, T. Dhamecha, M. Vatsa, and R. Singh. Face anti-spoofing via motion magnification and multifeature videolet aggregation. 2014.

[11] S. Bharadwaj, T. I. Dhamecha, M. Vatsa, and R. Singh. Computationally efficient face spoofing detection with motion magnification. IEEE, 2013.

[12] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In ACM SIGGRAPH, pages 187–194, 1999.

[13] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. 25(9):1063–1074, 2003.

[14] S. Bobbia, Y. Benezeth, and J. Dubois. Remote photoplethysmography based on implicit living skin tissue segmentation. pages 361–365, 2016.

[15] Z. Boulkenafet. A competition on generalized software-based face presentation attack detection in mobile scenarios. IEEE, 2017.

[16] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face anti-spoofing based on color texture analysis. IEEE, 2015.
[17] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face spoofing detection using colour texture analysis. 11(8):1818–1830, 2016.

[18] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face antispoofing using speeded-up robust features and Fisher vector encoding. IEEE Signal Processing Letters, 24(2):141–145, 2017.

[19] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid. OULU-NPU: A mobile face presentation attack database with real-world variations. IEEE, 2017.

[20] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature verification using a "Siamese" time delay neural network. 7(04):669–688, 1993.

[21] A. Bulat and G. Tzimiropoulos. Convolutional aggregation of local evidence for large pose face alignment. 2016.

[22] X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. pages 1513–1520, 2013.

[23] H. Caesar, J. Uijlings, and V. Ferrari. Region-based semantic segmentation with end-to-end training. pages 381–397, 2016.

[24] C. Cao, Q. Hou, and K. Zhou. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics (TOG), 33(4):43, 2014.

[25] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: a 3D facial expression database for visual computing. 20(3):413–425, 2014.

[26] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. 107(2):177–190, 2014.

[27] G. Chetty and M. Wagner. Multi-level liveness verification for face-voice biometric authentication. In BC, 2006.

[28] I. Chingovska, A. Anjos, and S. Marcel. On the effectiveness of local binary patterns in face anti-spoofing. IEEE, 2012.

[29] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. 23(6):681–685, June 2001.

[30] T. Cootes, C. Taylor, and A. Lanitis. Active shape models: Evaluation of a multi-resolution method for improving image search. volume 1, pages 327–336, 1994.

[31] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models – their training and application. 61(1):38–59, Jan 1995.

[32] D. Cristinacce and T. Cootes. Boosted regression active shape models. volume 2, pages 880–889, 2007.

[33] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel. LBP-TOP based countermeasure against face spoofing attacks. Springer, 2012.

[34] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel. Can face anti-spoofing countermeasures work in a real world scenario? IEEE, 2013.

[35] G. de Haan and V. Jeanne. Robust pulse rate from chrominance-based rPPG. IEEE Trans. Biomedical Engineering, 60(10):2878–2886, 2013.

[36] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. Springer, 2014.

[37] L. Feng, L.-M. Po, Y. Li, X. Xu, F. Yuan, T. C.-H. Cheung, and K.-W. Cheung. Integration of image quality and motion cues for face anti-spoofing: a neural network approach. Journal of Visual Communication and Image Representation, 38:451–460, 2016.

[38] R. W. Frischholz and U. Dieckmann. BioID: a multimodal biometric identification system. J. Computer, 33(2):64–68, 2000.

[39] R. W. Frischholz and A. Werner. Avoiding replay-attacks in a face recognition system using head-pose estimation. In AMFGW, pages 234–235, 2003.

[40] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand. Deep joint demosaicking and denoising. 35(6):191, 2016.

[41] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proc. Artificial Intelligence and Statistics (AISTATS), pages 315–323, 2011.

[42] R. Gross, I. Matthews, and S. Baker. Generic vs. person specific active appearance models. 23(11):1080–1093, Nov. 2005.

[43] L. Gu and T. Kanade. 3D alignment of face in a single image. volume 1, pages 1305–1312, 2006.

[44] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE, 2016.

[45] G.-S. Hsu, K.-H. Chang, and S.-C. Huang. Regressive tree structured model for facial landmark localization. pages 3855–3861, 2015.
[46] ISO/IEC JTC 1/SC 37 Biometrics. Information technology – biometric presentation attack detection – part 1: Framework. International Organization for Standardization (https://www.iso.org/obp/ui/iso), 2016.

[47] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. pages 2017–2025, 2015.

[48] L. A. Jeni, J. F. Cohn, and T. Kanade. Dense 3D face alignment from 2D videos in real-time. volume 1, pages 1–8, 2015.

[49] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678, 2014.

[50] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. Springer, 2016.

[51] A. Jourabloo, A. Feghahati, and M. Jamzad. New algorithms for recovering highly corrupted images with impulse noise. Scientia Iranica, 19(6):1738–1745, 2012.

[52] A. Jourabloo and X. Liu. Pose-invariant 3D face alignment. pages 3694–3702, 2015.

[53] A. Jourabloo and X. Liu. Large-pose face alignment via CNN-based dense 3D model fitting. 2016.

[54] A. Jourabloo and X. Liu. Pose-invariant face alignment via CNN-based dense 3D model fitting. pages 1–17, 2017.

[55] A. Jourabloo, X. Yin, and X. Liu. Attribute preserved face de-identification. 2015.

[56] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. pages 1867–1874, 2014.

[57] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

[58] K. Kollreider, H. Fronthaler, M. I. Faraj, and J. Bigun. Real-time face detection and motion analysis with application in liveness assessment. 2(3):548–558, 2007.

[59] J. Komulainen, A. Hadid, and M. Pietikainen. Context based face anti-spoofing. IEEE, 2013.

[60] J. Komulainen, A. Hadid, M. Pietikäinen, A. Anjos, and S. Marcel. Complementary countermeasures for detecting scenic face spoofing attacks. 2013.

[61] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. pages 2144–2151, 2011.

[62] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. IEEE, 2016.

[63] S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. IEEE, 2017.

[64] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. pages 5325–5334, 2015.

[65] J. Li, Y. Wang, T. Tan, and A. K. Jain. Live face detection based on the analysis of Fourier spectra. In Biometric Technology for Human Identification. SPIE, 2004.

[66] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid. An original face anti-spoofing approach using partial convolutional neural network. In Image Processing Theory Tools and Applications (IPTA), 2016 6th International Conference on. IEEE, 2016.

[67] Y. Li, B. Sun, T. Wu, Y. Wang, and W. Gao. Face detection with end-to-end integration of a ConvNet and a 3D model. 2016.

[68] F. Liu, D. Zeng, Q. Zhao, and X. Liu. Joint face alignment and 3D face reconstruction. pages 545–560, 2016.

[69] S. Liu, P. C. Yuen, S. Zhang, and G. Zhao. 3D mask face anti-spoofing with remote photoplethysmography. pages 85–100, 2016.

[70] X. Liu. Discriminative face alignment. 31(11):1941–1954, 2009.

[71] X. Liu and T. Chen. Pose-robust face recognition using geometry assisted probabilistic modeling. volume 1, pages 502–509, 2005.

[72] X. Liu, J. Rittscher, and T. Chen. Optimal pose for face recognition. volume 2, pages 1439–1446, 2006.

[73] Y. Liu, A. Jourabloo, and X. Liu. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. IEEE, 2018.

[74] Y. Liu, A. Jourabloo, W. Ren, and X. Liu. Dense face alignment. IEEE, 2017.

[75] S. Lucey, R. Navarathna, A. B. Ashraf, and S. Sridharan. Fourier Lucas-Kanade algorithm. 35(6):1383–1396, 2013.
[76] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[77] J. Määttä, A. Hadid, and M. Pietikäinen. Face spoofing detection from single images using micro-texture analysis. IEEE, 2011.

[78] I. Matthews and S. Baker. Active appearance models revisited. 60(2):135–164, 2004.

[79] H. Mohammadzade and D. Hatzinakos. Iterative closest normal point for 3D face recognition. 35(2):381–397, 2013.

[80] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. pages 483–499, 2016.

[81] E. M. Nowara, A. Sabharwal, and A. Veeraraghavan. PPGSecure: Biometric presentation attack detection using photoplethysmograms. pages 56–62, 2017.

[82] G. Pan, L. Sun, Z. Wu, and S. Lao. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. IEEE, 2007.

[83] K. Patel, H. Han, and A. K. Jain. Cross-database face antispoofing with robust feature representation. In Chinese Conference on Biometric Recognition. Springer, 2016.

[84] K. Patel, H. Han, and A. K. Jain. Secure face unlock: Spoof detection on smartphones. 11(10):2268–2283, 2016.

[85] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. IEEE, 2016.

[86] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. pages 296–301, 2009.

[87] X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas. A recurrent encoder-decoder network for sequential face alignment. pages 38–56. Springer, 2016.

[88] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman. Deep convolutional neural networks for efficient pose estimation in gesture videos. pages 538–552, 2015.

[89] P. J. Phillips, H. Moon, S. Rizvi, P. J. Rauss, et al. The FERET evaluation methodology for face-recognition algorithms. 22(10):1090–1104, 2000.

[90] A. Pinto, H. Pedrini, W. R. Schwartz, and A. Rocha. Face spoofing detection through visual codebooks of spectral temporal cubes. 24(12):4726–4740, 2015.

[91] L.-M. Po, L. Feng, Y. Li, X. Xu, T. C.-H. Cheung, and K.-W. Cheung. Block-based adaptive ROI for remote photoplethysmography. J. Multimedia Tools and Applications, pages 1–27, 2017.

[92] C. Qu, E. Monari, T. Schuchert, and J. Beyerer. Adaptive contour fitting for pose-invariant 3D face shape reconstruction. 2015.

[93] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 FPS via regressing local binary features. pages 1685–1692, 2014.

[94] J. Roth, Y. Tong, and X. Liu. Unconstrained 3D face reconstruction. pages 2606–2615, 2015.

[95] J. Roth, Y. Tong, and X. Liu. Adaptive 3D face reconstruction from unconstrained photo collections. 2016.

[96] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. pages 397–403, 2013.

[97] E. Sánchez-Lozano, F. De la Torre, and D. González-Jiménez. Continuous regression for non-rigid image alignment. pages 250–263. Springer, 2012.

[98] R. Shao, X. Lan, and P. C. Yuen. Deep convolutional dynamic texture learning with adaptive channel-discriminability for 3D mask face anti-spoofing. In IJCB, 2017.

[99] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. pages 3476–3483, 2013.

[100] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. IEEE, 2017.

[101] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. IEEE, 2017.

[102] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. pages 1701–1708, 2014.

[103] Q. Tan, L. Gao, Y.-K. Lai, and S. Xia. Variational autoencoders for deforming 3D mesh models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5841–5850, 2018.

[104] X. Tan, Y. Li, J. Liu, and L. Jiang. Face liveness detection from a single image with sparse low rank bilinear discriminative model. pages 504–517, 2010.
[105] Y. Tong, X. Liu, F. W. Wheeler, and P. Tu. Automatic facial landmark labeling with minimal supervision. 2009.

[106] L. Tran and X. Liu. Nonlinear 3D face morphable model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7346–7355, 2018.

[107] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. 2016.

[108] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. pages 2396–2404, 2016.

[109] S. Tulyakov and N. Sebe. Regressing a 3D face shape from a single image. pages 3748–3755, 2015.

[110] G. Tzimiropoulos and M. Pantic. Optimization problems for fast AAM fitting in-the-wild. pages 593–600. IEEE, 2013.

[111] M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. pages 2729–2736. IEEE, 2010.

[112] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. 9(2579–2605), 2008.

[113] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. In ACM MM, pages 689–692, 2015.

[114] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, and Y. Ma. Toward a practical face recognition system: Robust alignment and illumination by sparse representation. 34(2):372–386, 2012.

[115] N. Wang, X. Gao, D. Tao, and X. Li. Facial feature point detection: A comprehensive survey. arXiv preprint arXiv:1410.1037, 2014.

[116] W. Wang, S. Tulyakov, and N. Sebe. Recurrent convolutional face alignment. 2016.

[117] D. Wen, H. Han, and A. Jain. Face spoof detection with image distortion analysis. 10(4):746–761, 2015.

[118] B.-F. Wu, Y.-W. Chu, P.-W. Huang, M.-L. Chung, and T.-M. Lin. A motion robust remote-PPG approach to driver's health state monitoring. pages 463–476, 2016.

[119] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3D interpreter network. 2016.

[120] Y. Wu and Q. Ji. Robust facial landmark detection under significant head poses and occlusion. pages 3658–3666, 2015.

[121] J. Xiao, S. Baker, I. Matthews, and T. Kanade. Real-time combined 2D+3D active appearance models. volume 2, pages 535–542, 2004.

[122] S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim. Robust facial landmark detection via recurrent attentive-refinement networks. pages 57–72, 2016.

[123] J. Xing, Z. Niu, J. Huang, W. Hu, and S. Yan. Towards multi-view and partially-occluded face alignment. pages 1829–1836. IEEE, 2014.

[124] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. pages 532–539, 2013.

[125] Z. Xu, S. Li, and W. Deng. Learning temporal features using LSTM-CNN architecture for face anti-spoofing. In IAPR Asian Conference on Pattern Recognition. IEEE, 2015.

[126] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Learn to combine multiple hypotheses for accurate face alignment. pages 392–396. IEEE, 2013.

[127] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu. HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. pages 2740–2748, 2015.

[128] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Convolutional channel features. pages 82–90, 2015.

[129] H. Yang and I. Patras. Mirror, mirror on the wall, tell me, is the error small? pages 4685–4693, 2015.

[130] J. Yang, Z. Lei, and S. Z. Li. Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601, 2014.

[131] J. Yang, Z. Lei, S. Liao, and S. Z. Li. Face liveness detection with component dependent descriptor. IEEE, 2013.

[132] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. pages 1099–1107, 2015.

[133] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3D dynamic facial expression database. 2008.

[134] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? pages 3320–3328, 2014.
[135] X. Yu, J. Huang, S. Zhang, W. Yan, and D. N. Metaxas. Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model. pages 1944–1951, 2013.

[136] X. Yu, Z. Lin, J. Brandt, and D. N. Metaxas. Consensus of regression for occlusion-robust facial feature localization. pages 105–118, 2014.

[137] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. pages 4353–4361, 2015.

[138] C. Zhang and Z. Zhang. A survey of recent advances in face detection. Technical report, Microsoft Research, 2010.

[139] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. pages 1–16, 2014.

[140] J. Zhang, S. Zhou, D. Comaniciu, and L. McMillan. Conditional density learning via regression with application to deformable shape segmentation. 2008.

[141] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. 32(10):692–706, 2014.

[142] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. pages 94–108, 2014.

[143] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li. A face antispoofing database with diverse attacks. IEEE, 2012.

[144] E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. pages 386–391, 2013.

[145] R. Zhou, R. Achanta, and S. Süsstrunk. Deep residual network for joint demosaicing and super-resolution. arXiv preprint arXiv:1802.06573, 2018.

[146] S. Zhou and D. Comaniciu. Shape regression machine. pages 13–25, 2007.

[147] S. Zhu, C. Li, C. Change Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. pages 4998–5006, 2015.

[148] S. Zhu, C. Li, C. C. Loy, and X. Tang. Unconstrained face alignment via cascaded compositional learning. 2016.

[149] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. pages 146–155, 2016.

[150] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. pages 787–796, 2015.

[151] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. pages 2879–2886, 2012.

[152] X. Zhu, J. Yan, D. Yi, Z. Lei, and S. Z. Li. Discriminative 3D morphable model fitting. pages 1–8, 2015.