DESIGNING CONVOLUTIONAL NEURAL NETWORKS FOR FACE ALIGNMENT AND ANTI-SPOOFING

By

Amin Jourabloo

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computer Science, Doctor of Philosophy

2019

ABSTRACT

DESIGNING CONVOLUTIONAL NEURAL NETWORKS FOR FACE ALIGNMENT AND ANTI-SPOOFING

By

Amin Jourabloo

Face alignment is the process of detecting a set of fiducial points on a face image, such as mouth corners, nose tip, etc. Face alignment is a key module in the pipeline of most facial analysis tasks, normally after face detection and before subsequent feature extraction and classification. As a result, improving the face alignment accuracy is helpful for numerous facial analysis tasks. Recently, face alignment works are popular in top vision venues and receive a lot of attention. In spite of the fruitful prior work and ongoing progress of face alignment, pose-invariant face alignment is still challenging. To address the inherent challenges associated with this problem, we propose pose-invariant face alignment by fitting a dense 3DMM, and integrating estimation of 3D shape and 2D facial landmarks from a single face image in a single CNN. We introduce a new layer, called the visualization layer, which is differentiable and allows backpropagation of an error from a later block to an earlier one.

Another application of facial analysis is face anti-spoofing, which has recently received a lot of attention. While face recognition systems serve as a verification portal for various devices (i.e., phone unlock, access control, and transportation security), attackers present face spoofs (i.e., presentation attacks, PA) to the system and attempt to be authenticated as the genuine user. We present our proposed deep models for face anti-spoofing that use the supervision from both the spatial and temporal auxiliary information, for the purpose of robustly detecting face PA from a face video.

This thesis is dedicated to my family,
my parents: Hassan and Kobra
my brother: Ahmad
my sister's family: Zahra, Hamed and Samyar

ACKNOWLEDGMENTS

This dissertation would not have been made possible without the help of many people. I am very honored to have Dr. Xiaoming Liu as my advisor. His expectation and encouragement have made me achieve more than I could ever have imagined. The time we spent to debug codes, brainstorm, and polish papers has refined my skills in critical thinking, presentation and writing. By setting himself as an example, he has taught me what a good researcher should be like. It is my great pleasure to have the opportunity to work with Dr. Anil K. Jain and Dr. Arun Ross in the lab. As a world-leading researcher, Dr. Jain has inspired many younger generations, including me, to pursue a Ph.D. Dr. Ross's patience and insightful comments at all presentations have shown me that every researcher desires to be heard.

I am grateful for my labmates: Joseph Roth, Morteza Safdarnejad, Muhammad Jamal Afridi, Yousef Atoum, Xi Yin, Luan Tran, Garrick Brazil, Yaojie Liu, Bangjie Yin, Joel Stehouwer, Adam Terwilliger, Hieu Nguyen, Shengjie Zhu, and Masa Hu. The valuable comments in paper review, the willingness to help, the encouragement when I am in a bad mood, and the entertainment together have made it a very pleasant journey.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction on Face Alignment and Face Anti-Spoofing
  1.1 Introduction
  1.2 Prior work on face alignment
  1.3 Prior work on face anti-spoofing
  1.4 Overview of the thesis
    1.4.1 Contributions of the thesis

Chapter 2  Pose-Invariant 3D Face Alignment
  2.1 Introduction
  2.2 Pose-Invariant 3D Face Alignment
    2.2.1 3D Face Modeling
    2.2.2 Cascaded Coupled-Regressor
    2.2.3 3D Surface-Enabled Visibility
  2.3 Experimental Results
  2.4 Summary

Chapter 3  Pose-Invariant Face Alignment via CNN-based Dense 3D Model Fitting
  3.1 Introduction
  3.2 Unconstrained 3D Face Alignment
    3.2.1 3D Morphable Model
    3.2.2 Data Augmentation
      3.2.2.1 Optimization
    3.2.3 Cascaded CNN Coupled-Regressor
    3.2.4 Conventional CNN (C-CNN)
    3.2.5 Mirror CNN (M-CNN)
      3.2.5.1 Mirror Loss
      3.2.5.2 Mirror CNN Architecture
    3.2.6 Visibility and 2D Appearance Features
    3.2.7 Testing
  3.3 Experimental Results
    3.3.1 Experimental Setup
    3.3.2 Comparison Experiments
  3.4 Summary

Chapter 4  Pose-Invariant Face Alignment with a Single CNN
  4.1 Introduction
  4.2 3D Face Alignment with Visualization Layer
    4.2.1 3D and 2D Face Shapes
    4.2.2 Proposed CNN Architecture
    4.2.3 Visualization Layer
  4.3 Experimental Results
    4.3.1 Quantitative Evaluations on AFLW and AFW
    4.3.2 Evaluation on 300W dataset
    4.3.3 Analysis of the Visualization Layer
    4.3.4 Time complexity
  4.4 Summary

Chapter 5  Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision
  5.1 Introduction
  5.2 Face Anti-Spoofing with Deep Network
    5.2.1 Depth Map Supervision
    5.2.2 rPPG Supervision
    5.2.3 Network Architecture
      5.2.3.1 CNN Network
      5.2.3.2 RNN Network
      5.2.3.3 Implementation Details
    5.2.4 Non-rigid Registration Layer
  5.3 Collection of Face Anti-Spoofing Database
  5.4 Experimental Results
    5.4.1 Experimental Setup
    5.4.2 Experimental Comparison
      5.4.2.1 Ablation Study
      5.4.2.2 Intra Testing
      5.4.2.3 Cross Testing
      5.4.2.4 Visualization and Analysis
  5.5 Summary

Chapter 6  Face De-Spoofing via Noise Modeling
  6.1 Introduction
  6.2 Face De-Spoofing
    6.2.1 A Case Study of Spoof Noise Pattern
    6.2.2 De-Spoof Network
      6.2.2.1 Network Overview
    6.2.3 DQ Net and VQ Net
      6.2.3.1 Discriminative Quality Net
      6.2.3.2 Visual Quality Net
    6.2.4 Loss functions
      6.2.4.1 Magnitude Loss
      6.2.4.2 Repetitive Loss
  6.3 Experimental Results
    6.3.1 Experimental Setup
    6.3.2 Ablation Study
    6.3.3 Experimental Comparison
      6.3.3.1 Intra Testing
      6.3.3.2 Cross Testing
    6.3.4 Qualitative Experiments
      6.3.4.1 Spoof medium classification
      6.3.4.2 Successful and failure cases
  6.4 Summary

Chapter 7  Conclusions and Future Work
  7.1 Limitations
  7.2 Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: The comparison of face alignment algorithms in pose handling (estimation errors may have different definitions).
Table 2.2: The NME (%) of three methods on AFLW.
Table 2.3: The comparison of four methods on AFW.
Table 2.4: Efficiency of four methods in FPS.
Table 3.1: NME (%) of the proposed method with different features with the C-CNN architecture and the base training set.
Table 3.2: The NME (%) of three methods on AFLW with the base training set.
Table 3.3: The NME (%) of three methods on AFLW with the extended training set and Caffe toolbox.
Table 3.4: The MAPE of six methods on AFW.
Table 3.5: The six-stage NMEs of implementing the C-CNN and M-CNN architectures with different training datasets and CNN toolboxes. The initial error is 25.8%.
Table 4.1: The number and size of convolutional filters in each visualization block. For all blocks, the two fully connected layers have the same lengths of 800 and 236.
Table 4.2: NME (%) of four methods on the AFLW dataset.
Table 4.3: NME (%) of the proposed method at each visualization block on the AFLW dataset. The initial NME is 25.8%.
Table 4.4: MAPE of five methods on the AFW dataset.
Table 4.5: The NME of different methods on the 300W dataset.
Table 4.6: The NME (%) of three architectures with different inputs (I: Input image, V: Visualization, F: Feature maps).
Table 4.7: NME (%) when different masks are used.
Table 4.8: NME (%) when using different numbers of visualization blocks (N_v) and convolutional layers (N_c).
Table 5.1: The comparison of our collected dataset with available datasets for face anti-spoofing.
Table 5.2: TDR at different FDRs, cross testing on Oulu Protocol 1.
Table 5.3: ACER of our method at different N_f, on Oulu Protocol 2.
Table 5.4: The intra-testing results on four protocols of Oulu.
Table 5.5: The intra-testing results on three protocols of SiW.
Table 5.6: Cross testing on CASIA-MFSD vs. Replay-Attack.
Table 6.1: The network structure of DS Net, DQ Net and VQ Net. Each convolutional layer is followed by an exponential linear unit (ELU) and a batch normalization layer. The input image size for DS Net is 256×256×6. All the convolutional filters are 3×3. 0\1 Map Net is the bottom-left part, i.e., conv1-10, conv1-11, and conv1-12.
Table 6.2: The accuracy of different outputs of the proposed architecture and their fusions.
Table 6.3: ACER of the proposed method with different image resolutions and blurriness. To create blurry images, we apply Gaussian filters with different kernel sizes to the input images.
Table 6.4: The intra testing results on 4 protocols of Oulu-NPU.
Table 6.5: The HTER of different methods for the cross testing between the CASIA-MFSD and the Replay-Attack databases. We mark the top-2 performances in bold.
Table 6.6: The confusion matrices of spoof medium classification based on spoof noise pattern.

LIST OF FIGURES

Figure 2.1: Given a face image with an arbitrary pose, our proposed algorithm automatically estimates the 2D locations and visibilities of facial landmarks, as well as 3D landmarks. The displayed 3D landmarks are estimated for the image in the center. Green/red points indicate visible/invisible landmarks.
Figure 2.2: Overall architecture of our proposed PIFA method, with three main modules (3D modeling, cascaded coupled-regressor learning, and 3D surface-enabled visibility estimation). Green/red arrows indicate surface normals pointing toward/away from the camera.
Figure 2.3: The training procedure of PIFA.
Figure 2.4: The NME of five pose groups for two methods.
Figure 2.5: The NME of each landmark for PIFA.
Figure 2.6: 2D and 3D alignment results of the BP4D-S dataset.
Figure 2.7: Testing results of AFLW (top) and AFW (bottom). As shown in the top row, we initialize face alignment by placing a 2D mean shape in the given bounding box of each image. Note the disparity between the initial landmarks and the estimated ones, as well as the diversity in pose, illumination and resolution among the images. Green/red points indicate visible/invisible estimated landmarks.
Figure 3.1: The proposed method estimates landmarks for large-pose faces by fitting a dense 3D shape. From left to right: initial landmarks, fitted 3D dense shape, estimated landmarks with visibility. The green/red/yellow dots in the right column show the visible/invisible/cheek landmarks, respectively.
Figure 3.2: The overall process of the proposed method.
Figure 3.3: The landmark marching process for updating vector d. (a-b) show the paths of cheek landmarks on the mean shape; (c) is the estimated face shape; (d) is the estimated face shape by ignoring the roll rotation; and (e) shows the locations of landmarks on the cheek.
Figure 3.4: Landmark marching g(S, m).
Figure 3.5: Architecture of C-CNN (the same CNN architecture is used for all six stages). Color code used: purple = extracted image feature, orange = Conv, brown = pooling + batch normalization, blue = fully connected layer, red = ReLU. The size and the number of filters for each layer are shown on the top and the bottom respectively.
Figure 3.6: Architecture of the M-CNN (the same CNN architecture is used for all six stages). Color code used: purple = extracted image feature, orange = Conv, brown = pooling + batch normalization, green = locally connected layer, blue = fully connected layer, red = batch normalization + ReLU + dropout. The size and the number of filters of each layer are shown on the top and the bottom of the top branch respectively.
Figure 3.7: The 3D surface normal computed as the average of normals around a 3D landmark (black arrow). Notice the relatively noisy surface normal of the "left eye corner" landmark (blue arrow).
Figure 3.8: Feature extraction process, (a-e) PAWF for the landmark on the right side of the right eye, (f-j) D3PF for the landmark on the right side of the lip.
Figure 3.9: Examples of extracting PAWF. When one of the four neighborhood points (red point in the bottom-right) is invisible, it connects to the 2D landmark (green point), extends the same distance further, and generates a new neighborhood point. This helps to include the background context around the nose.
Figure 3.10: Example of extracting D3PF.
Figure 3.11: (a) AFLW original (yellow) and added landmarks (green), (b) Comparison of mean NME of each landmark for RCPR (blue) and the proposed method (green). The radius of circles is determined by the mean NME multiplied with the face bounding box size.
Figure 3.12: Errors on the AFLW testing set after each stage of CNN for different feature extraction methods with the C-CNN architecture and the base training set. The initial error is 25.8%.
Figure 3.13: Comparison of NME for each pose with the C-CNN architecture and the base training set.
Figure 3.14: The comparison of CED for different methods with the C-CNN architecture and the base training set.
Figure 3.15: Result of the proposed method after the first stage CNN. This image shows that the first stage CNN can model the distribution of face poses. The right-view faces are at the top, the frontal-view faces are at the middle, and the left-view faces are at the bottom.
Figure 3.16: The distribution of visibility errors for each landmark. For six landmarks on the horizontal center of the face, their visibility errors are zeros since they are always visible.
Figure 3.17: The results of the proposed method on AFLW and AFW. The green/red/yellow dots show the visible/invisible/cheek landmarks, respectively. First row: initial landmarks for AFLW, Second: estimated 3D dense shapes, Third: estimated landmarks, Fourth and Fifth: estimated landmarks for AFLW, Sixth: estimated landmarks for AFW. Notice that despite the discrepancy between the diverse face poses and constant front-view landmark initialization (top row), our model can adaptively estimate the pose, fit a dense model and produce the 2D landmarks as a byproduct.
Figure 3.18: The result of the proposed method across stages, with the extracted features (1st and 3rd rows) and alignment results (2nd and 4th rows). Note the changes of the landmark position and visibility (the blue arrow) over stages.
Figure 4.1: For the purpose of learning an end-to-end face alignment model, our novel visualization layer reconstructs the 3D face shape (a) from the estimated parameters inside the CNN and synthesizes a 2D image (b) via the surface normal vectors of visible vertexes.
Figure 4.2: The proposed CNN architecture. We use green, orange, and purple to represent the visualization layer, convolutional layer, and fully connected layer, respectively. Please refer to Figure 4.3 for the details of the visualization block.
Figure 4.3: A visualization block consists of a visualization layer, two convolutional layers and two fully connected layers.
Figure 4.4: The frontal and side views of the mask a that has positive values in the middle and negative values in the contour area.
Figure 4.5: An example with four vertexes projected to the same pixel. Two of them have negative values in the z component of their normals (red arrows). Between the other two with positive values, the one with the smaller depth (closer to the image plane) is selected.
Figure 4.6: Architectures of three CNNs with different inputs.
Figure 4.7: The average of filter weights for input image, visualization and feature maps in the three architectures of Figure 4.6. The y-axis and x-axis show the average and the block index, respectively.
Figure 4.8: Mask 2, a different designed mask with five positive areas on the eyes, top of the nose and sides of the lip.
Figure 4.9: Results of alignment on the AFLW and AFW datasets; green landmarks show the estimated locations of visible landmarks and red landmarks show estimated locations of invisible landmarks. First row: provided bounding box by AFLW with initial locations of landmarks, Second: estimated 3D dense shapes, Third: estimated landmarks, Fourth to sixth: estimated landmarks for AFLW, Seventh: estimated landmarks for AFW.
Figure 4.10: Three examples of outputs of the visualization layer at each visualization block. The first row shows that the proposed method recovers the expression of the face gracefully, the third row shows the visualizations of a face with a more challenging pose.
Figure 5.1: Conventional CNN-based face anti-spoof approaches utilize the binary supervision, which may lead to overfitting given the enormous solution space of CNN. This work designs a novel network architecture to leverage two auxiliary information as supervision: the depth map and rPPG signal, with the goals of improved generalization and explainable decisions during inference.
Figure 5.2: The overview of the proposed method.
Figure 5.3: The proposed CNN-RNN architecture. The number of filters is shown on top of each layer, the size of all filters is 3×3 with stride 1 for convolutional and 2 for pooling layers. Color code used: orange = convolution, green = pooling, purple = response map.
Figure 5.4: Example ground truth depth map and rPPG signals.
Figure 5.5: The non-rigid registration layer.
Figure 5.6: The statistics of the subjects in the SiW database. Left side: The histogram shows the distribution of the face sizes.
Figure 5.7: Examples of the live and spoof attack videos in the SiW database. The first row shows a live subject with different PIE. The second row shows different types of the spoof attacks.
Figure 5.8: (a) 8 successful anti-spoofing examples and their estimated depth maps and rPPG signals. (b) 4 failure examples: the first two are live and the other two are spoof. Note our ability to estimate discriminative depth maps and rPPG signals.
Figure 5.9: Mean/Std of frontalized feature maps for live and spoof.
Figure 5.10: The MSE of estimating depth maps and rPPG signals.
Figure 6.1: The illustration of face spoofing and anti-spoofing processes. The de-spoofing process aims to estimate a spoof noise from a spoof face and reconstruct the live face. The estimated spoof noise should be discriminative for face anti-spoofing.
Figure 6.2: The illustration of the spoof noise pattern. Left: live face and its local regions. Right: Two registered spoof faces from print attack and replay attack. For each sample, we show the local region of the face, intensity difference to the live image, magnitude of 2D FFT, and the local peaks in the frequency domain that indicate the spoof noise pattern. Best viewed electronically.
Figure 6.3: The proposed network architecture.
Figure 6.4: The 2D visualization of the estimated spoof noise for test videos on Oulu-NPU Protocol 1. Left: the estimated noise, Right: the high-frequency band of the estimated noise. Color code used: black = live, green = printer 1, blue = printer 2, magenta = display 1, red = display 2.
Figure 6.5: The visualization of input images, estimated spoof noises and estimated live images for test videos of Protocol 1 of the Oulu-NPU database. The first four columns in the first row are paper attacks and the second four are the replay attacks. For a better visualization, we magnify the noise by 5 times and add the value with 128, to show both positive and negative noise.
Figure 6.6: The failure cases for converting the spoof images to the live ones.
Figure 7.1: Left: A representation of the estimated point cloud in iPhone X. Right: The hardware technology in Huawei P11 for capturing the point cloud.

Chapter 1

Introduction on Face Alignment and Face Anti-Spoofing

1.1 Introduction

Face alignment is the process of detecting a set of fiducial points on a face image, such as mouth corners, nose tip, etc. Face alignment is a key module in the pipeline of most facial analysis tasks, normally after face detection and before subsequent feature extraction and classification. As a result, improving the face alignment accuracy is helpful for numerous facial analysis tasks, e.g., face recognition [114], face anti-spoofing [55] and 3D face reconstruction [94].

Due to the importance of face alignment, it has been well studied during the past decades [115], with the well-known Active Shape Model [30] and Active Appearance Model (AAM) [70, 78]. Recently, face alignment works are popular in top vision venues and receive a lot of attention. Despite the continuous improvement on the alignment accuracy, face alignment is still a very challenging problem, due to the non-frontal face pose, low image quality, occlusion, etc. Among all the challenges, we identify pose-invariant face alignment as the one deserving substantial research efforts, for a number of reasons. First, face detection has substantially advanced its capability in detecting faces in all poses, including profile faces [138], which calls for the subsequent face alignment to handle faces with arbitrary poses. Second, many facial analysis tasks would benefit from the robust alignment of faces at all poses, such as expression recognition and 3D face reconstruction [94]. Third, there are very few existing approaches that can align a face with any view angle, or have conducted extensive evaluations on face images across ±90° yaw angles [135, 151], which is a clear contrast with the vast face alignment literature [115].

We present our proposed approaches for pose-invariant face alignment in Chapters 2 to 4. The core idea of our proposed methods is that instead of estimating 2D landmarks directly, we estimate the 3D shape of the face; by projecting the 3D shape to 2D, we can obtain the 2D locations of the landmarks.
Another application of facial analysis is face anti-spoofing, which has recently received a lot of attention. While face recognition systems serve as a verification portal for various devices (i.e., phone unlock, access control, and transportation security), attackers present face spoofs (i.e., presentation attacks, PA) to the system and attempt to be authenticated as the genuine user. The face PAs include printing the face on paper (print attack), replaying a face video on a digital device (replay attack), and wearing a mask (mask attack). To counteract PA, researchers have developed face anti-spoofing techniques [27, 38, 39, 65] to detect PA prior to a face image being recognized. Therefore, face anti-spoofing is vital to ensure that face recognition systems are robust to PA and safe to use.

In Chapters 5 and 6, we present our proposed deep models for face anti-spoofing that use the supervisions from both the spatial and temporal auxiliary information, for the purpose of robustly detecting face PA from a face image or a face video.

1.2 Prior work on face alignment

We review prior work on face alignment in seven areas related to the proposed methods: generic face alignment, pose-invariant face alignment, 3D face model fitting, face alignment via deep learning, sharing information in face alignment and deep learning, convolutional recurrent neural network (CRNN), and visualization in deep learning.

Generic face alignment  The first type of face alignment approach is based on the Constrained Local Model (CLM), where an early example is ASM [30]. The basic idea is to learn a set of local appearance models, one for each landmark, and the decisions from the local models are fused with a global shape model. There are generative or discriminative [32] approaches in learning the local model, and various approaches in utilizing the shape constraint [4]. While the local models are favored for higher estimation precision, they also create difficulties for alignment on low-resolution images due to limited local appearance. In contrast, the AAM method [29, 78] and its extensions [75, 97] learn a global appearance model, whose similarity to the input image drives the landmark estimation. While AAM is known to have difficulties with unseen subjects [42], the recent development has substantially improved its generalization capability [110]. Motivated by the Shape Regression Machine [140, 146] in the medical domain, cascaded regressor-based methods have been very popular in recent years [26, 111]. On one hand, the series of regressors progressively reduces the alignment error and leads to higher accuracy. On the other hand, advanced feature learning also renders ultra-efficient alignment procedures [56, 93]. Other than the three major types of algorithms, there are also works based on deep learning [142], graph-models [151], and semi-supervised learning [105].

Pose-invariant face alignment  The methods of [45, 135, 151] combine face detection, pose estimation and face alignment. By using a 3D shape model with an optimized mixture of parts, [135] is applicable to faces with a large range of poses. In [120], a face alignment method based on cascade regressors is proposed to handle invisible landmarks. Each stage is composed of two regressors for estimating the probability of landmark visibility and the location of landmarks. This method is applied to profile-view faces of the FERET database [89]. However, as a 2D landmark-based approach, it cannot estimate 3D face poses. Occlusion-invariant face alignment, such as RCPR [22], may also be applied to handle large poses since non-frontal faces are one type of occlusion. [109] is a very recent work that estimates 3D landmarks via regressors. However, it only tests on synthesized face images up to ~50° yaw.
3D face model fitting  Almost all prior works assume that the 2D landmarks of the input face image are either manually labeled or estimated via a face alignment method. In [48], a dense 3D face alignment method from videos is proposed. At first, a dense set of 2D landmarks is estimated by using the cascaded regressor. Then, an EM-based algorithm is utilized to estimate the 3D shape and 3D pose of the face from the estimated 2D landmarks. The authors in [92] aim to make sure that the locations of 2D contour landmarks are consistent with the 3D face shape. In [152], a 3D face model fitting method based on the similarity of frontal view face images is proposed.

Face alignment via deep learning  With the continuous success of deep learning in vision, researchers start to apply deep learning to face alignment. Sun et al. [99] proposed a three-stage face alignment algorithm with CNN. At the first stage, three CNNs are applied to different face parts to estimate positions of different landmarks, whose averages are regarded as the first stage results. At the next two stages, by using local patches with different sizes around each landmark, the landmark positions are refined. Similar face alignment algorithms based on multi-stage CNNs are further developed by Zhou et al. [144] and CFAN [139]. In [139], a face alignment method based on a cascade of stacked auto-encoder (SAE) networks can progressively refine the locations of 2D landmarks at each stage. TCDCN [142] uses a one-stage CNN to estimate the positions of five landmarks given a face image. The commonality among most of these prior works is that they only estimate 2D landmarks and the number of landmarks is limited to six.

Sharing information in face alignment and deep learning  Utilizing different side information in face alignment can improve the alignment accuracy. TCDCN [142] jointly estimates auxiliary attributes (e.g., gender, expression) with landmark locations to improve alignment accuracy. In [129], the mirrorability constraint, i.e., the alignment difference between a face and its mirrored counterpart, is used as a measure for evaluating the alignment results without the ground truth, and for choosing a better initialization. The consensus of regressors [136] in a Bayesian model is used to share information among different regressors. In [147], multiple initializations are used for each face image and a clustering method combines the estimated face shapes. For deep learning methods, sharing information is performed either by transferring the learned weights from a source domain to the target domain [134], or by using siamese networks [8, 137] to share the weights among branches of the network and make a decision with the combined responses of all branches.

Convolutional recurrent neural network (CRNN)  Methods based on CRNNs [107, 116, 122] are attempts to combine a cascade of regressors with joint optimization, for aligning mostly frontal faces. Their convolutional part extracts features from the whole image [122] or from the patches at the landmark locations [107]. The recurrent part facilitates joint optimization by sharing information among all regressors. Generally, the main drawbacks of CRNNs are: 1) existing CRNN methods are designed for near-frontal face alignment; 2) the CRNN methods share the same CNN at all stages.

Visualization in deep learning  Visualization techniques have been used in deep learning to assist in making a relative comparison among the input data and focusing on the region of interest.
One category of these methods exploits the deconvolutional and upsampling layers to either expand response maps [67, 87] or represent estimated parameters [132]. Alternatively, various types of feature maps, e.g., heatmaps and Z-Buffering, can represent the current estimation of landmarks and parameters. In [21, 80, 119], 2D landmark heatmaps represent the landmarks' locations. [21] proposes a two-step pose-invariant alignment based on heatmaps to make more precise estimations. The heatmaps suffer from three drawbacks: 1) lack of the capability to represent objects in details; 2) requirement of one heatmap per landmark due to its weak representation power; 3) they cannot estimate the visibility of landmarks. The Z-Buffer rendered using the estimated 3D face is also used to convey the results of a previous CNN to the next one [147]. However, the Z-Buffer representation is not differentiable, preventing end-to-end training.

1.3 Prior work on face anti-spoofing

We review the prior face anti-spoofing works in three groups: texture-based methods, temporal-based methods, and remote photoplethysmography methods.

Texture-based Methods  Since most face recognition systems adopt only RGB cameras, using texture information has been a natural approach to tackling face anti-spoofing. Many prior works utilize hand-crafted features, such as LBP [33, 34, 77], HoG [59, 131], SIFT [84] and SURF [18], and adopt traditional classifiers such as SVM and LDA. To overcome the influence of illumination variation, they seek solutions in a different input domain, such as HSV and YCbCr color spaces [16, 17], and the Fourier spectrum [65].

As deep learning has proven to be effective in many computer vision problems, there are many recent attempts of using CNN-based features or CNNs in face anti-spoofing [37, 66, 83, 130]. Most of the work treats face anti-spoofing as a simple binary classification problem by applying the softmax loss. For example, [66, 83] use CNN as a feature extractor and fine-tune from ImageNet-pretrained CaffeNet and VGG-face. The work of [37, 66] feeds different designs of the face images into CNN, such as multi-scale faces and hand-crafted features, and directly classifies live vs. spoof.

Temporal-based Methods  One of the earliest solutions for face anti-spoofing is based on temporal cues such as eye-blinking [82, 83]. Methods such as [58, 98] track the motion of mouth and lip to detect the face liveness. While these methods are effective against typical paper attacks, they become vulnerable when attackers present a replay attack or a paper attack with the eye/mouth portion being cut. There are also methods relying on more general temporal features, instead of the specific facial motion. The most common approach is frame concatenation. Many handcrafted feature-based methods may improve intra-database testing performance by simply concatenating the features of consecutive frames to train the classifiers [16, 33, 60]. Additionally, there are some works proposing temporal-specific features, e.g., Haralick features [3], motion magnification [10], and optical flow [6]. In the deep learning era, Feng et al. feed the optical flow map and Shearlet image feature to CNN [37]. In [125], Xu et al. propose an LSTM-CNN architecture to utilize temporal information for binary classification. Overall, all prior methods still regard face anti-spoofing as a binary classification problem, and thus they have a hard time generalizing well in cross-database testing.

Remote Photoplethysmography (rPPG)  Remote photoplethysmography (rPPG) is the technique to track vital signals, such as heart rate, without any contact with human skin [14, 35, 91, 108, 118]. Research starts with face videos with no motion or illumination change and moves to videos with multiple variations. In [35], Haan et al. estimate rPPG signals from RGB face videos with lighting and motion changes. It utilizes color difference to eliminate the specular reflection component and estimate two orthogonal chrominance signals. After applying the Band Pass Filter (BPF), the ratio of the chrominance signals is used to compute the rPPG signal.
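To make this pipeline concrete, below is a minimal Python sketch of a chrominance-based rPPG extractor in the spirit of [35]; the function name, the exact chrominance combinations, and the 0.7-4.0 Hz pass band are our illustrative assumptions, not code from [35] or from this thesis.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def chrom_rppg(rgb_traces, fps, low=0.7, high=4.0):
        """Sketch: estimate an rPPG signal from per-frame mean skin RGB values.

        rgb_traces: (T, 3) array, one mean RGB value per video frame.
        """
        # Temporally normalize each channel to suppress the DC level.
        norm = rgb_traces / rgb_traces.mean(axis=0)
        r, g, b = norm[:, 0], norm[:, 1], norm[:, 2]
        # Two (near-)orthogonal chrominance signals that reduce the
        # specular reflection component.
        x = 3.0 * r - 2.0 * g
        y = 1.5 * r + g - 1.5 * b
        # Band-pass filter to the plausible heart-rate band.
        nyq = fps / 2.0
        bb, aa = butter(3, [low / nyq, high / nyq], btype="band")
        xf, yf = filtfilt(bb, aa, x), filtfilt(bb, aa, y)
        # Combine using the (std-based) ratio of the chrominance signals.
        return xf - (xf.std() / yf.std()) * yf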
The rPPG signal has previously been utilized to tackle face anti-spoofing [69, 81]. In [69], rPPG signals are used for detecting the 3D mask attack, where the live faces exhibit a pulse of heart rate unlike the 3D masks. They use rPPG signals extracted by [35] and compute the correlation features for classification. Similarly, Magdalena et al. [81] extract rPPG signals (also via [35]) from three face regions and two non-face regions, for detecting print and replay attacks. Although in replay attacks the rPPG extractor might still capture the normal pulse, the combination of multiple regions can differentiate live vs. spoof faces. While the analytic solution to rPPG extraction [35] is easy to implement, we observe that it is sensitive to PIE variations.

1.4 Overview of the thesis

In the second chapter, we propose a novel regression-based approach for pose-invariant face alignment, which aims to estimate the 2D and 3D locations of a sparse set of face landmarks, as well as their visibilities in the 2D image, for a face with arbitrary pose (e.g., ±90° yaw). By extending the popular cascaded regressor for 2D landmark estimation, we learn two fern regressors for each cascade layer, one for predicting the update for the camera projection matrix, and the other for predicting the update for the 3D shape parameter. The learning of the two regressors is conducted alternately with the goal of minimizing the difference between the ground truth updates and the predicted updates.

In the third chapter, we propose to use Convolutional Neural Networks (CNN) as the regressor in the cascaded framework, to learn the mapping. The main advantage of CNN over the fern regression trees (in the previous chapter) is that it does not depend on hand-crafted feature extraction methods. The CNN can learn and extract more meaningful, generalizable and abstract features by hierarchical representation. This property is more important in pose-invariant face alignment because a change in the head pose (frontal to side-view) makes a considerable difference in the face images. While most prior work on CNN-based face alignment estimates no more than six 2D landmarks per image [99, 142], our cascaded CNN can produce a substantially larger number (34) of 2D and 3D landmarks. Further, using landmark marching [150], our algorithm can adaptively adjust the 3D landmarks during the fitting so that the local appearances around cheek landmarks contribute to the fitting process.

In the fourth chapter, we introduce a novel visualization layer. We propose a CNN architecture which consists of several blocks, called visualization blocks. This architecture can be considered as a cascade of shallow CNNs. The new layer visualizes the alignment result of a previous visualization block and utilizes it in a later visualization block. It is designed based on several guidelines. Firstly, it is derived from the surface normals of the underlying 3D face model and encodes the relative pose between the face and camera, partially inspired by the success of using surface normals for 3D face recognition [79]. Secondly, the visualization layer is differentiable, which allows the gradient to be computed analytically and enables end-to-end training. Lastly, a mask is utilized to differentiate between pixels in the middle and contour areas of a face.

The last two chapters contain our proposed methods for the face anti-spoofing problem. In the fifth chapter, we propose a deep model for face anti-spoofing that uses the supervision from both the spatial and temporal auxiliary information, for the purpose of robustly detecting face presentation attacks (PA) from a face video. This auxiliary information is acquired based on our domain knowledge about the key differences between live and spoof faces, which include two perspectives: spatial and temporal. From the spatial perspective, it is known that live faces have face-like depth, e.g., the nose is closer to the camera than the cheek in frontal-view faces, while faces in print or replay attacks have flat or planar depth, e.g., all pixels on the image of a paper have the same depth to the camera. Hence, depth can be utilized as auxiliary information to supervise both live and spoof faces.
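As a concrete illustration of this spatial supervision, per-frame depth labels could be constructed along the following lines; the 32×32 label resolution and the assumption of a precomputed face-like depth map for live frames are our own guesses, not the author's implementation.

    import numpy as np

    def depth_label(fitted_depth, is_live, size=32):
        """Sketch: live frames get a face-like depth map (e.g., rendered
        from a fitted 3D face shape); print/replay spoofs get a flat,
        all-zero map, since every pixel of a paper or screen has the
        same depth to the camera."""
        return fitted_depth if is_live else np.zeros((size, size))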
From the temporal perspective, it was shown that the normal rPPG signals (i.e., heart pulse signals) are detectable from live, but not spoof, face videos [69, 81]. Therefore, we provide different rPPG signals as auxiliary supervision, which guides the network to learn from live or spoof face videos respectively. To enable both supervisions, we design a network architecture with a short-cut connection to capture different scales and a novel non-rigid registration layer to handle the motion and pose change for rPPG estimation.

Finally, in the sixth chapter, we propose a CNN method that, given a spoof face image, can estimate the spoof noise and reconstruct the original live face. We propose several constraints and supervisions based on our prior knowledge of the spoof noise. First, a live face has no spoof noise. Second, we assume that the spoof noise of a spoof image is ubiquitous, i.e., it exists everywhere in the spatial domain of the image; and, third, the spoof noise is repetitive, i.e., it is the spatial repetition of certain noise in the image. With such constraints, a novel CNN architecture is presented in the sixth chapter. Given an image, one CNN is designed to synthesize the spoof noise pattern and reconstruct the corresponding live image. In order to examine the reconstructed live image, we train another CNN with auxiliary supervision and a GAN-like discriminator in an end-to-end fashion. These two networks are designed to ensure the quality of the reconstructed image regarding its discriminativeness between live and spoof, and the visual plausibility of the synthesized live image.

1.4.1 Contributions of the thesis

In this section, we list some contributions already made towards pose-invariant face alignment:

• We propose a pose-invariant face alignment by fitting a dense 3DMM, and integrating estimation of 3D shape and 2D facial landmarks from a single face image.
• We introduce a cascaded CNN-based 3D face model fitting algorithm that is applicable to all poses, with integrated landmark marching and contributions from local appearances around cheek landmarks during the fitting process.
• A visualization layer is presented which is differentiable, and allows backpropagation of error from a later block to an earlier one.

Also, we list some of the contributions made toward the face anti-spoofing problem:

• We propose to leverage novel auxiliary information (i.e., depth map and rPPG) to supervise the CNN learning for improved generalization of face anti-spoofing systems.
• We propose a novel CNN-RNN architecture for face anti-spoofing which performs end-to-end learning with the depth map and the rPPG signal.
• We offer a new perspective for detecting face spoofing from print attack and replay attack by inversely decomposing a spoof face image into the live face and the spoof noise, without having the ground truth of either.
Chapter 2

Pose-Invariant 3D Face Alignment

2.1 Introduction

Motivated by the needs to address the pose variation, and the lack of prior work in handling poses, as shown in Fig. 2.1, this chapter proposes a novel regression-based approach for pose-invariant face alignment, which aims to estimate the 2D and 3D locations of face landmarks, as well as their visibilities in the 2D image, for a face with arbitrary pose (e.g., ±90° yaw). By extending the popular cascaded regressor for 2D landmark estimation, we learn two regressors for each cascade layer, one for predicting the update for the camera projection matrix, and the other for predicting the update for the 3D shape parameter. The learning of the two regressors is conducted alternately with the goal of minimizing the difference between the ground truth updates and the predicted updates. By using the 3D surface normals of 3D landmarks, we can automatically estimate the visibilities of their 2D projected landmarks by inspecting whether the transformed surface normal has a positive z coordinate, and these visibilities are dynamically incorporated into the regressor learning such that only the local appearance of visible landmarks contributes to the learning. Finally, extensive experiments are conducted on a large subset of the AFLW dataset [57] with a wide range of poses, and the AFW dataset [151], with comparisons against a number of state-of-the-art methods. We demonstrate superior 2D alignment accuracy and quantitatively evaluate the 3D alignment accuracy.

Figure 2.1: Given a face image with an arbitrary pose, our proposed algorithm automatically estimates the 2D locations and visibilities of facial landmarks, as well as 3D landmarks. The displayed 3D landmarks are estimated for the image in the center. Green/red points indicate visible/invisible landmarks.

In summary, the main contributions of the proposed pose-invariant face alignment are:

• We propose a face alignment method that can estimate a sparse set of 2D/3D landmarks and their visibilities for a face image with an arbitrary pose.
• By integrating with a 3D point distribution model, a cascaded coupled-regressor approach is developed to estimate both the camera projection matrix and the 3D landmarks, where the 3D model enables automatically computed landmark visibilities via surface normals.
• A substantially larger number of non-frontal view face images is utilized in evaluation, with demonstrated superior performance over the state of the art.

2.2 Pose-Invariant 3D Face Alignment

This section presents the details of our proposed Pose-Invariant 3D Face Alignment (PIFA) algorithm, with emphasis on the training procedure. As shown in Fig. 2.2, we first learn a 3D Point Distribution Model (3DPDM) [31] from a set of labeled 3D scans, where a set of 2D landmarks on an image can be considered as a projection of a 3DPDM instance (i.e., 3D landmarks). For each 2D training face image, we assume that there exist the manually labeled 2D landmarks and their visibilities, as well as the corresponding 3D ground truth, i.e., 3D landmarks and the camera projection matrix.

Figure 2.2: Overall architecture of our proposed PIFA method, with three main modules (3D modeling, cascaded coupled-regressor learning, and 3D surface-enabled visibility estimation). Green/red arrows indicate surface normals pointing toward/away from the camera.

Given the training images and 2D/3D ground truth, we train a cascaded coupled-regressor that is composed of two regressors at each cascade layer, for the estimation of the update of the 3DPDM coefficients and the projection matrix respectively. Finally, the visibilities of the projected 3D landmarks are automatically computed via the domain knowledge of the 3D surface normals, and incorporated into the regressor learning procedure.
2.2.1 3D Face Modeling

Face alignment concerns the 2D face shape, represented by the locations of N 2D landmarks, i.e.,

    U = \begin{pmatrix} u_1 & u_2 & \cdots & u_N \\ v_1 & v_2 & \cdots & v_N \end{pmatrix}.    (2.1)

A 2D face shape U is a projection of a 3D face shape S, similarly represented by the homogeneous coordinates of N 3D landmarks, i.e.,

    S = \begin{pmatrix} x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N \\ z_1 & z_2 & \cdots & z_N \\ 1 & 1 & \cdots & 1 \end{pmatrix}.    (2.2)

Similar to the prior work [121], a weak perspective model is assumed for the projection,

    U = MS,    (2.3)

where M is a 2×4 projection matrix with seven degrees of freedom (yaw, pitch, roll, two scales and 2D translations). Following the basic idea of 3DPDM [31], we assume a 3D face shape is an instance of the 3DPDM,

    S = S_0 + \sum_{i=1}^{N_s} p_i S_i,    (2.4)

where S_0 and S_i are the mean shape and the i-th shape basis of the 3DPDM respectively, N_s is the total number of shape bases, and p_i is the i-th shape coefficient. Given a dataset of 3D scans with manual labels on N 3D landmarks per scan, we perform Procrustes analysis on the 3D scans to remove the global transformation, and then conduct Principal Component Analysis (PCA) to obtain S_0 and {S_i} (see the top-left part of Fig. 2.2).

The set of all shape coefficients p = (p_1, p_2, ..., p_{N_s}) is termed the 3D shape parameter of an image. At this point, the face alignment for a testing image I has been converted from the estimation of U to the estimation of P = {M, p}. The conversion is motivated by a few factors. First, without the 3D modeling, it is very difficult to model the out-of-plane rotation, which results in a varying number of visible landmarks depending on the rotation angle and the individual 3D face shape. Second, as pointed out by [121], by only using 1/6 of the number of the shape bases, a 3DPDM can have representation power equivalent to its 2D counterpart. Hence, using a 3D model might lead to a more compact representation of unknown parameters.

Ground truth P  Estimating P for a testing image implies the existence of ground truth P for each training image. However, while U can be manually labeled on a face image, P is normally unavailable unless a 3D scan is captured along with a face image. Therefore, in order to leverage the vast amount of existing 2D face alignment datasets, such as the AFLW dataset [57], it is desirable to estimate P for a face image and use it as the ground truth for learning.

Given a face image I, we denote the manually labeled 2D landmarks as U and the landmark visibility as v, an N-dim vector with binary elements indicating visible (1) or invisible (0) landmarks. Note that it is not necessary to label the 2D locations of invisible landmarks. We define the following objective function to estimate M and p,

    J(M, p) = \left\| \left( M \left( S_0 + \sum_{i=1}^{N_s} p_i S_i \right) - U \right) \odot V \right\|^2,    (2.5)

where V = (v^T; v^T) is a 2×N visibility matrix, ⊙ denotes the element-wise multiplication, and ||·||^2 is the sum of the squares of all matrix elements. Basically J(·,·) computes the difference between the visible 2D landmarks and their 3D projections. An alternating estimation scheme is utilized, i.e., by assuming p^0 = 0, we estimate M^k = argmin_M J(M, p^{k-1}), and then p^k = argmin_p J(M^k, p), iteratively until the changes of M and p are small enough. Both minimizations can be efficiently solved in closed forms via least-square error.
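The alternating minimization of Eq. 2.5 can be sketched as follows. This is our own illustration (unconstrained least squares with a fixed iteration count; the re-composition of M from its five pose parameters, discussed in Sec. 2.2.2, is omitted), not the author's code.

    import numpy as np

    def fit_ground_truth(U, v, S0, S_bases, n_iters=20):
        """Sketch: alternately solve M and p in Eq. 2.5.

        U: (2, N) labeled 2D landmarks; v: (N,) binary visibility;
        S0: (4, N) homogeneous mean shape; S_bases: (Ns, 4, N) bases.
        """
        vis = v.astype(bool)
        p = np.zeros(S_bases.shape[0])
        M = None
        for _ in range(n_iters):
            # Fix p, solve M: rows of M are least-squares fits of the
            # visible 2D coordinates against the current 3D shape.
            S = S0 + np.tensordot(p, S_bases, axes=1)              # (4, N)
            M = np.linalg.lstsq(S[:, vis].T, U[:, vis].T, rcond=None)[0].T
            # Fix M, solve p: the projected bases act as regressors
            # for the residual of the visible landmarks.
            A = np.stack([(M @ B)[:, vis].ravel() for B in S_bases], axis=1)
            r = (U[:, vis] - (M @ S0)[:, vis]).ravel()
            p = np.linalg.lstsq(A, r, rcond=None)[0]
        return M, p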
2.2.2 Cascaded Coupled-Regressor

For each training image I_i, we now have its ground truth as P_i = {M_i, p_i}, as well as their initialization, i.e., M_i^0 = g(M̄, b_i), p_i^0 = 0, and v_i^0 = 1. Here M̄ is the average of the ground truth projection matrices in the training set, b_i is a 4-dim vector indicating the bounding box location, and g(M, b) is a function that modifies the scale and translation of M based on b. Given a dataset of N_d training images, the question is how to formulate an optimization problem to estimate P_i. We decide to extend the successful cascaded regressors framework due to its accuracy and efficiency [26]. The general idea of cascaded regressors is to learn a series of regressors, where the k-th regressor estimates the difference between the current parameter P_i^{k-1} and the ground truth P_i, such that the estimated parameter gradually approximates the ground truth.

Motivated by this general idea, we adopt a cascaded coupled-regressor scheme where two regressors are learned at the k-th cascade layer, for the estimation of M_i and p_i respectively. Specifically, the learning task of the k-th regressor is,

    Θ_1^k = \arg\min_{Θ_1^k} \sum_{i=1}^{N_d} \left\| \Delta M_i^k - R_1^k(I_i, U_i, v_i^{k-1}; Θ_1^k) \right\|^2,    (2.6)

where

    U_i = M_i^{k-1} \left( S_0 + \sum_{i=1}^{N_s} p_i^{k-1} S_i \right)    (2.7)

is the current estimated 2D landmarks, \Delta M_i^k = M_i - M_i^{k-1}, and R_1^k(·; Θ_1^k) is the desired regressor with the parameter of Θ_1^k. After Θ_1^k is estimated, we obtain \Delta\hat{M}_i = R_1^k(·; Θ_1^k) for all training images and update M_i^k = M_i^{k-1} + \Delta\hat{M}_i. Note that this linear updating may potentially break the constraint of the projection matrix. Therefore, we estimate the scales and yaw, pitch, roll angles (s_x, s_y, α, β, γ) from M_i^k and compose a new M_i^k based on these five parameters.

Similarly, the second learning task of the k-th regressor is,

    Θ_2^k = \arg\min_{Θ_2^k} \sum_{i=1}^{N_d} \left\| \Delta p_i^k - R_2^k(I_i, U_i, v_i^k; Θ_2^k) \right\|^2,    (2.8)

where U_i is computed via Eq. 2.7 except M_i^{k-1} is replaced with M_i^k. We also obtain \Delta\hat{p}_i = R_2^k(·; Θ_2^k) for all training images and update p_i^k = p_i^{k-1} + \Delta\hat{p}_i. This iterative learning procedure continues for K cascade layers.

Learning R^k(·)  Our cascaded coupled-regressor scheme does not depend on the particular feature representation or the type of regressors. Therefore, we may define them based on prior work or any future development in features and regressors. Specifically, in this work we adopt the HOG-based linear regressor [126] and the fern regressor [22].

For the linear regressor, we denote a function f(I, U) to extract HOG features around a small rectangular region of each one of the N landmarks, which returns a 32N-dim feature vector. Thus, we define the regressor function as

    R(\cdot) = Θ^\intercal \, \mathrm{Diag}(v_i) \, f(I_i, U_i),    (2.9)

where Diag(v) is a function that duplicates each element of v 32 times and converts it into a diagonal matrix of size 32N. Note that we also add a constraint, λ||Θ||^2, to Eq. 2.6 or Eq. 2.8 for a more robust least-square solution. By plugging Eq. 2.9 into Eq. 2.6 or Eq. 2.8, the regressor parameter Θ (e.g., an N_s × 32N matrix for R_2^k) can be easily estimated in closed form.

For the fern regressor, we follow the training procedure of [22]. That is, we divide the face region into a 3×3 grid. At each cascade layer, we choose 3 out of the 9 zones with the least occlusion, computed based on the {v_i^k}. For each selected zone, a depth-5 random fern regressor is learned from the interpolated shape-indexed features selected by the correlation-based method [26] from that zone only. Finally, the learned R(·) is a weighted mean voting from the 3 fern regressors, where the weight is inversely proportional to the average amount of occlusion in that zone.
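For the linear regressor, plugging Eq. 2.9 into Eq. 2.6 or Eq. 2.8 together with the λ||Θ||^2 term yields a standard ridge regression; a minimal sketch of that closed-form solution is given below (our own illustration, assuming the visibility masking of Diag(v_i) has already been applied to the feature rows).

    import numpy as np

    def learn_linear_regressor(F, T, lam=120.0):
        """Sketch: closed-form ridge solution for one cascade layer.

        F: (Nd, D) masked HOG features, one row per training image;
        T: (Nd, K) target updates (vectorized dM or dp).
        Returns Theta (D, K) minimizing ||F Theta - T||^2 + lam ||Theta||^2.
        """
        D = F.shape[1]
        # Regularized normal equations: (F'F + lam I) Theta = F'T.
        return np.linalg.solve(F.T @ F + lam * np.eye(D), F.T @ T)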
2.2.3 3D Surface-Enabled Visibility

Up to now the only thing that has not been explained in the training procedure is how to estimate the visibility of the projected 2D landmarks, v_i. It is obvious that during the testing we have to estimate v at each cascade layer for a testing image, since there is no visibility information given. As a result, during the training procedure, we also have to estimate v per cascade layer for each training image, rather than using the manually labeled ground truth visibility that is useful for estimating ground truth P as shown in Eq. 2.5.

Depending on the camera projection matrix M, the visibility of each projected 2D landmark may dynamically change along different layers of the cascade (see the top-right part of Fig. 2.2). In order to estimate v, we decide to use the 3D face surface information. We start by assuming every individual has a similar 3D surface normal vector at each of its 3D landmarks. Then, by rotating the surface normal according to the rotation angle indicated by the projection matrix, we know whether the rotated surface normal is pointing toward the camera (i.e., visible) or away from the camera (i.e., invisible). In other words, the sign of the z-axis coordinate indicates visibility.

By taking a set of 3D scans with manually labeled 3D landmarks, we can compute the landmarks' average 3D surface normals, denoted as a 3×N matrix Ñ. Then we use the following equation to compute the visibility vector,

    v = \tilde{N}^\intercal \left( \frac{m_1}{\|m_1\|} \times \frac{m_2}{\|m_2\|} \right),    (2.10)

where m_1 and m_2 are the left-most three elements at the first and second rows of M respectively, and ||·|| denotes the L_2 norm. For fern regressors, v is a soft visibility within ±1. For linear regressors, we further compute v = ½(1 + sign(v)), which results in a hard visibility of either 1 or 0. In summary, we present the detailed training procedure in Fig. 2.3.

Figure 2.3: The training procedure of PIFA.

    Data: 3D model {{S_i}_{i=0}^{N_s}, Ñ}, labeled data {I_i, U_i, b_i}_{i=1}^{N_d}
    Result: Cascaded regressor parameters {Θ_1^k, Θ_2^k}_{k=1}^{K}
    /* 3D model fitting */
    foreach i = 1, ..., N_d do
        Estimate M_i and p_i via Eq. 2.5
    /* Initialization */
    foreach i = 1, ..., N_d do
        p_i^0 = 0            ▷ Assuming the mean 3D shape
        v_i^0 = 1            ▷ Assuming all landmarks visible
        M_i^0 = g(M̄, b_i) and U_i = M_i^0 S_0
    /* Regressor learning */
    foreach k = 1, ..., K do
        Estimate Θ_1^k via Eq. 2.6
        Update M_i^k and U_i for all images
        Compute v_i^k via Eq. 2.10 for all images
        Estimate Θ_2^k via Eq. 2.8
        Update p_i^k and U_i for all images

Model fitting  Given a testing image I with bounding box b and its initial parameters M^0 = g(M̄, b) and p^0 = 0, we can apply the learned cascaded coupled-regressor for face alignment. Basically we iteratively use R_1^k(·; Θ_1^k) to compute \Delta\hat{M}, update M^k, compute v^k, use R_2^k(·; Θ_2^k) to compute \Delta\hat{p}, and update p^k. Finally, the estimated 3D landmarks are \hat{S} = S_0 + \sum_i p_i^K S_i, and the estimated 2D landmarks are \hat{U} = M^K \hat{S}. Note that \hat{S} carries the individual 3D shape information of the subject, but not necessarily in the same pose as the 2D testing image.
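A sketch of the visibility test of Eq. 2.10, as used at every cascade layer during both training and model fitting (our own illustration):

    import numpy as np

    def landmark_visibility(N_avg, M, hard=True):
        """Sketch: visibility from average surface normals (Eq. 2.10).

        N_avg: (3, N) average 3D surface normals at the landmarks;
        M: (2, 4) weak perspective projection matrix.
        """
        m1, m2 = M[0, :3], M[1, :3]
        # The normalized cross product of the two projection rows points
        # along the camera axis; its agreement with each surface normal
        # decides whether that landmark faces the camera.
        axis = np.cross(m1 / np.linalg.norm(m1), m2 / np.linalg.norm(m2))
        v = N_avg.T @ axis                 # soft visibility within +-1
        return 0.5 * (1.0 + np.sign(v)) if hard else v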
Sincewearealsoestimating3Dlandmarks,itisimportanttotestonadatasetwith ground Table2.1:Thecomparisonoffacealignmentalgorithmsinposehandling(estimationerrorsmay havedifferent Method 3D Visibility Pose-relateddatabase Pose Training Testing Landmark Estimation landmark range face# face# # errors RCPR[22] No Yes COFW frontalw.occlu. 1 ; 345 507 19 8 : 5 CoR[136] No Yes COFW;LFPW-O;Helen-O frontalw.occlu. 1 ; 345;468;402 507;112;290 19;49;49 8 : 5 TSPM[151] No No AFW allposes 2 ; 118 468 6 11 : 1 CDM[135] No No AFW allposes 1 ; 300 468 6 9 : 1 OSRD[123] No No MVFW < 40 2 ; 050 450 68 N/A TCDCN[142] No No AFLW,AFW < 60 10 ; 000 3 ; 000; ˘ 313 5 8 : 0;8 : 2 PIFA Yes Yes AFLW,AFW allposes 3 ; 901 1 ; 299;468 21 ; 6 6 : 5;8 : 6 21 truth ,ratherthanestimated,3Dlandmarklocations.WeBP4D-Sdatabase[141]tobethebest forthispurpose,whichcontainspairsof2Dimagesand3Dscansofspontaneousfacialexpressions from41subjects.Eachpairhassemi-automaticallygenerated832Dand833Dlandmarks,and thepose.Weapplyarandomperturbationon2Dlandmarks(tomimicimprecisefacedetection) andgeneratetheirenclosedboundingbox.Withthegoalofselectingasmanynon-frontalview facesaspossible,wechooseasubsetwherethenumbersoffaceswhoseyawanglewithin [ 0 ; 10 ] , [ 10 ; 20 ] , [ 20 ; 30 ] are100,500,and500respectively.Werandomlyselecthalfof1 ; 100images fortrainingandtherestfortesting,withdisjointsubjects. Experimentsetup OurPIFAapproachneedsa3Dmodelof f S g N s i = 0 and ~ N .UsingtheBU-4DFE database[133]thatcontains6063Dfacialexpressionsequencesfrom101subjects,weevenly sample72scansfromeachsequenceandgatheratotalof72 606scans.Basedonthemethodin Sec.2.2.1,theresultantmodelhas N s = 30forAFLWandAFW,and N s = 200forBP4D-S. Duringthetrainingandtesting,foreachimagewithaboundingbox,weplacethemean2D landmarks(learnedfromthetrainingset)ontheimagesuchthatthelandmarksontheboundary arewithinthefouredgesofthebox.Fortrainingwithlinearregressors,weset K = 10, l = 120, while K = 75forfernregressors. Evaluationmetric Giventhegroundtruth2Dlandmarks U i ,theirvisibility v i ,andestimated landmarks ‹ U i of N t testingimages,wehavetwowaysofcomputingthelandmarkestimationerrors: 1)MeanAveragePixelError(MAPE)[135],whichistheaverageoftheestimationerrorsfor visiblelandmarks,i.e., MAPE = 1 å N t i j v i j 1 N t ; N å i ; j v i ( j ) jj ‹ U i ( : ; j ) U i ( : ; j ) jj ; (2.11) where j v i j 1 isthenumberofvisiblelandmarksofimage I i ,and U i ( : ; j ) isthe j thcolumnof U i .2) 22 NormalizedMeanError(NME),whichistheaverageofthenormalizedestimationerrorofvisible landmarks,i.e., NME = 1 N t N t å i ( 1 d i j v i j 1 N å j v i ( j ) jj ‹ U i ( : ; j ) U i ( : ; j ) jj ) ; (2.12) where d i isthesquarerootofthefaceboundingboxsize,asusedby[135].Notethatnormally d i istheinter-eyedistanceinpriorfacealignmentworkdealingwithnear-frontalfaces. Giventhegroundtruth3Dlandmarks S i andestimatedlandmarks ‹ S i ,weestimatethe globalrotation,translationandscaletransformationsothatthetransformed S i ,denotedas S 0 i ,has theminimumdistanceto ‹ S i .WethencomputetheMAPEviaEq2.11exceptreplacing U and ‹ U i with S 0 i and ‹ S i ,and v i = 1 .ThustheMAPEonlymeasurestheerrorduetonon-rigidshape deformation,ratherthantheposeestimation. 
Choiceofbaselinemethods Giventheexplosionoffacealignmentworkinrecentyears,itisim- portanttochooseappropriatebaselinemethodssoastomakesuretheproposedmethodadvances thestateoftheart.Inthiswork,weselectthreerecentworksasbaselinemethods:1)CDM[135] isaCLM-typemethodandtheoneclaimedtoperformpose-freefacealignment,whichhas exactlythesameobjectiveasours.OnAFWitalsooutperformstheotherwell-knownTSPM method[151]thatcanhandleallposefaces.2)TCDCN[142]isapowerfuldeeplearning-based methodpublishedinthemostrecentECCV.Althoughitonlyestimates5landmarksforupto ˘ 60 yaw,itrepresentstherecentdevelopmentinfacealignment.3)RCPR[22]isaregression-type methodthatrepresentstheocclusion-invariantfacealignment.Althoughitisanearlierworkthan CoR[136],wechooseitduetoitssuperiorperformanceonthelargeCOFWdataset(seeTab.1 of[136]).Itcanbeseenthatthesethreebaselinesnotonlyaremostrelevanttoourfocusonpose- invariantfacealignment,butalsowellrepresentthemajorcategoriesofexistingfacealignment algorithmsbasedon[115]. 23 Table2.2: TheNME(%)ofthreemethodsonAFLW. N t PIFA CDM RCPR 1 ; 299 6 : 52 7 : 15 783 6 : 08 8 : 65 Table2.3: ThecomparisonoffourmethodsonAFW. N t N Metric PIFA CDM RCPR TCDCN 468 6 MAPE 8 : 61 9 : 13 313 5 NME 9 : 42 9 : 30 8 : 20 ComparisononAFLW SincethesourcecodeofRCPRispubliclyavailable,weareableto performthetrainingandtestingofRCPRonourAFLWpartition.Weusetheavailable executableofCDMtocomputeitsperformanceonourtestset.Westrivetoprovidethesame setuptothebaselinesasours,suchastheinitialboundingbox,regressorlearning,etc.Forour PIFAmethod,weusethefernregressor.BecauseCDMintegratesfacedetectionandpose-free facealignment,noboundingboxwasgiventoCDManditsuccessfullydetectsandaligns783out of1 ; 299testingimages.Therefore,tocomparewithCDM,weevaluatetheNMEonthe same 783testingimages.AsshowninTab.2.2,ourPIFAshowssuperiorperformancetobothbaselines. AlthoughTCDCNalsoreportsperformanceonasubsetof3 ; 000AFLWimageswithin 60 yaw, itisevaluatedwith5landmarks,basedonNMEwhen d i istheinter-eyedistance.Hence,without thesourcecodeofTCDCN,itisdiftohaveafaircomparisononoursubsetofAFLWimages (e.g.,wecannot d i astheinter-eyedistanceduetoviewfaces).Onthe1 ; 299testing images,wealsotestourmethodwithlinearregressors,andachieveaNMEof7 : 50,whichshows thestrengthoffernregressors. ComparisononAFW UnlikeoursubsetofAFLW,theAFWdatasethasbeenevaluated byallthreebaselines,butdifferentmetricsareused.Therefore,theresultsofthebaselinesin Tab.2.3arefromthepublishedpapers,insteadofexecutingthetestingcode.Onenoteisthatfrom theTCDCNpaper[142],itappearsthatall5landmarksarevisibleonalldisplayedimagesand 24 Figure2.4: TheNMEofveposegroupsfortwomethods. novisibilityestimationisshown,whichmightsuggestthatTCDCNwasevaluatedonasubsetof AFWwithupto 60 yaw.Hence,weselectthetotalof313outof468faceswithinthispose rangeandtestouralgorithm.Sinceitislikelythatoursubsetcoulddifferto[142],pleasetake thisintoconsiderationwhilecomparingwithTCDCN.Overall,ourPIFAmethodstillperforms comparablyamongthefourmethods.ThisisespeciallyencouraginggiventhefactthatTCDCN utilizesasubstantiallylargertrainingsetof10 ; 000images-morethantwotimesofourtraining set.NotethatinadditiontoTab.2.2and2.3,ourPIFAalsohasotherasshowninTab.2.1. E.g.,wehave3Dandvisibilityestimation,whileRCPRhasno3DestimationandTCDCNdoes nothavevisibilityestimation. Estimationerroracrossposes Justlikepose-invariantfacerecognitionstudiestherecognition rateacrossposes[71,72],wealsoliketostudytheperformanceoffacealignmentacrossposes. 
As shown in Fig. 2.4, based on the estimated projection matrix M and its yaw angles, we partition all testing images of AFLW into five bins, each around a yaw angle. Then we compute the NME of the testing images within each bin, for our method and RCPR. We can observe that the profile-view images have in general larger NME than near-frontal images, which shows the challenge of pose-invariant face alignment. Further, the improvement of PIFA over RCPR is consistent across most of the poses.

Estimation error across landmarks. We are also interested in the estimation error across various landmarks, under a wide range of poses. Hence, for the AFLW test set, we compute the NME of each landmark for our method. As shown in Fig. 2.5, the two eye regions have the least amount of error. The two landmarks under the ears have the most error, which is consistent with the intuition. These observations also align well with prior face alignment studies on near-frontal faces.

Figure 2.5: The NME of each landmark for PIFA.

3D landmark estimation. By performing the training and testing on the BP4D-S dataset, we can evaluate the MAPE of 3D landmark estimation, with exemplar results shown in Fig. 2.6. Since there is limited 3D alignment work and much of it does not perform quantitative evaluation, such as [43], we are not able to find another method as the baseline. Instead, we use the 3D mean shape $\mathbf{S}_0$ as a baseline and compute its MAPE with respect to the ground truth 3D landmarks $\mathbf{S}_i$ (after global transformation). We find that the MAPE of the $\mathbf{S}_0$ baseline is 5.02, while our method has 4.75. Although our method offers a better estimation than the mean shape, this shows that 3D face alignment is still a very challenging problem. We hope the effort to quantitatively measure the 3D estimation error, which is more difficult than its 2D counterpart, will encourage more research activities to address this challenge.

Figure 2.6: 2D and 3D alignment results on the BP4D-S dataset.

Table 2.4: Efficiency of four methods in FPS.

PIFA | CDM | RCPR | TCDCN
3.0 | 0.2 | 3.0 | 58.8

Computational efficiency. Based on the efficiency reported in the publications of the baseline methods, we compare the computational efficiency of the four methods in Tab. 2.4. Only TCDCN is measured based on a C implementation while the other three are all based on Matlab implementations. It can be observed that TCDCN is the most efficient one. Considering that we estimate both 2D and 3D landmarks, at 3 FPS our unoptimized implementation is reasonably efficient. In our algorithm, the most computationally demanding part is feature extraction, while estimating the updates for the projection matrix and 3D shape parameter has closed-form solutions and is very efficient.

Qualitative results. We now show the qualitative face alignment results for images in the two datasets. As shown in Fig. 2.7, despite the large pose range of ±90° yaw, our algorithm does a good job of aligning the landmarks, and correctly predicts the landmark visibilities. These results are especially impressive if you consider that the same mean shape (2D landmarks) is used as the initialization of all testing images, which has very large deformations with respect to their landmark estimation.

Figure 2.7: Testing results of AFLW (top) and AFW (bottom). As shown in the top row, we initialize face alignment by placing a 2D mean shape in the given bounding box of each image. Note the disparity between the initial landmarks and the estimated ones, as well as the diversity in pose, illumination and resolution among the images. Green/red points indicate visible/invisible estimated landmarks.

2.4 Summary

Motivated by the fast progress of face alignment technologies and the need to align faces at all poses, this chapter draws attention to a relatively less explored problem of face alignment robust to pose variation. To this end, we propose a novel approach to tightly integrate the powerful cascaded regressor scheme and the 3D face model. The 3D model not only serves as a compact constraint,
but also offers an automatic and convenient way to estimate the visibilities of 2D landmarks, a key for successful pose-invariant face alignment. As a result, for a 2D image, our approach estimates the locations of 2D and 3D landmarks, as well as their 2D visibilities. We conduct an extensive experiment on a large collection of all-pose face images and compare with three state-of-the-art methods. While superior 2D landmark estimation has been shown, the performance on 3D landmark estimation indicates the future direction to improve this line of research.

Chapter 3

Pose-Invariant Face Alignment via CNN-based Dense 3D Model Fitting

3.1 Introduction

In the previous chapter, we proposed the PIFA method which can estimate the locations of a sparse set of 3D landmark points. In this chapter, we extend PIFA in a number of ways.

First, we propose to use a dense 3D Morphable Model (3DMM) to reconstruct the 3D shape of the face and use the projection matrix as the latent representation of a 2D face shape. Therefore, face alignment amounts to estimating this representation, i.e., performing the 3DMM fitting to a face image with arbitrary poses.

Second, we propose to use Convolutional Neural Networks (CNN) as the regressor in the cascaded framework, to learn the mapping. The main advantage of CNN over the fern regression trees (in the previous chapter) is that it does not depend on hand-crafted feature extraction methods. The CNN can learn and extract more meaningful, generalizable and abstract features by hierarchical representation. This property is more important in pose-invariant face alignment because a change in the head pose (frontal to side-view) makes a considerable difference in the face images.

While most prior work on CNN-based face alignment estimates no more than six 2D landmarks per image [99, 142], our cascaded CNN can produce a substantially larger number (34) of 2D and 3D landmarks. Further, using landmark marching [150], our algorithm can adaptively adjust the 3D landmarks during the fitting so that the local appearances around cheek landmarks contribute to the fitting process.

Figure 3.1: The proposed method estimates landmarks for large-pose faces by fitting a dense 3D shape. From left to right: initial landmarks, fitted 3D dense shape, estimated landmarks with visibility. The green/red/yellow dots in the right column show the visible/invisible/cheek landmarks, respectively.

Third, we propose two novel pose-invariant local features, as the input layer for CNN learning. We utilize the dense 3D face model as an oracle to build dense feature correspondence across various poses and expressions. We also utilize person-specific 3D surface normals to estimate the visibility of each landmark by inspecting whether its surface normal has a positive z coordinate, and the estimated visibilities are dynamically incorporated into the CNN regressor learning such that only the extracted features from visible landmarks contribute to the learning.

Fourth, the CNN regressor deals with a very challenging learning task given the diverse facial appearance across all poses. To facilitate the learning task under large variations of pose and expression, we develop two new constraints to learn the CNN regressors. One is that there is inherent ambiguity in representing a 2D face shape as the combination of the 3D shape and projection matrix. Therefore, in addition to regressing toward such a non-unique latent representation, we also propose to constrain the CNN regressor in its ability to directly estimate 2D face shapes. The other is that a horizontally mirrored version of a face image is still a valid face, and their alignment results should be the mirrored version of each other. In this work, we propose a CNN architecture with a new loss function that explicitly enforces these constraints. The new loss function minimizes the difference of face alignment results of a face image and its mirror, in a siamese network architecture [20]. Although this mirrorability constraint was an alignment accuracy measure used in post-processing [129], we integrate it directly in CNN learning.
These algorithm designs collectively lead to the extended pose-invariant face alignment algorithm. We conduct extensive experiments to demonstrate the capability of the proposed method in aligning faces across poses on two challenging datasets, AFLW [61] and AFW [151], with comparison to the state of the art.

We summarize the main contributions of this work as:

• Pose-invariant face alignment by fitting a dense 3DMM, and integrating estimation of 3D shape and 2D facial landmarks from a single face image.

• The cascaded CNN-based 3D face model fitting algorithm that is applicable to all poses, with integrated landmark marching and contribution from local appearances around cheek landmarks during the fitting process.

• Dense 3D face-enabled pose-invariant local features and utilizing person-specific surface normals to estimate the visibility of landmarks.

• A novel CNN architecture with mirrorability constraint that minimizes the difference of face alignment results of a face image and its mirror.

Figure 3.2: The overall process of the proposed method.

3.2 Unconstrained 3D Face Alignment

The core of our proposed 3D face alignment method is the ability to fit a dense 3D Morphable Model to a 2D face image with arbitrary poses. The unknown parameters of the fitting, the 3D shape parameters and the projection matrix parameters, are sequentially estimated through a cascade of CNN-based regressors. By employing the dense 3D shape model, we enjoy the benefits of being able to estimate the 3D shape of the face, locate the cheek landmarks, use person-specific 3D surface normals, and extract a pose-invariant local feature representation, which are less likely to be achieved with a simple PDM [52]. Fig. 3.2 shows the overall process of the proposed method.

3.2.1 3D Morphable Model

To represent a dense 3D shape of an individual's face, we use the 3D Morphable Model (3DMM),

\mathbf{S} = \mathbf{S}_0 + \sum_{i=1}^{N_{id}} p_{id}^i \mathbf{S}_{id}^i + \sum_{i=1}^{N_{exp}} p_{exp}^i \mathbf{S}_{exp}^i, \quad (3.1)

where $\mathbf{S}$ is the 3D shape matrix, $\mathbf{S}_0$ is the mean shape, $\mathbf{S}_{id}^i$ is the $i$th identity basis, $\mathbf{S}_{exp}^i$ is the $i$th expression basis, $p_{id}^i$ is the $i$th identity coefficient and $p_{exp}^i$ is the $i$th expression coefficient. The collection of both coefficients is denoted as the shape parameter of a 3D face, $\mathbf{p} = (\mathbf{p}_{id}^{\top}, \mathbf{p}_{exp}^{\top})^{\top}$. We use the Basel 3D face model as the identity bases [86] and FaceWarehouse as the expression bases [25]. The 3D shape $\mathbf{S}$, along with $\mathbf{S}_0$, $\mathbf{S}_{id}^i$, and $\mathbf{S}_{exp}^i$, is a $3 \times Q$ matrix which contains the $x$, $y$ and $z$ coordinates of $Q$ vertexes on the 3D face surface,

\mathbf{S} = \begin{pmatrix} x_1 & x_2 & \cdots & x_Q \\ y_1 & y_2 & \cdots & y_Q \\ z_1 & z_2 & \cdots & z_Q \end{pmatrix}. \quad (3.2)

Any 3D face model will be projected onto a 2D image where the face shape may be represented as a sparse set of $N$ landmarks, on the fiducial facial points. We denote the $x$ and $y$ coordinates of these 2D landmarks as a matrix $\mathbf{U}$,

\mathbf{U} = \begin{pmatrix} u_1 & u_2 & \cdots & u_N \\ v_1 & v_2 & \cdots & v_N \end{pmatrix}. \quad (3.3)

The relationship between the 3D shape $\mathbf{S}$ and the 2D landmarks $\mathbf{U}$ can be described by the weak perspective projection, i.e.,

\mathbf{U} = s\,\mathbf{R}\,\mathbf{S}(:, \mathbf{d}) + \mathbf{t}, \quad (3.4)

where $s$ is a scale parameter, $\mathbf{R}$ is the first two rows of a $3 \times 3$ rotation matrix controlled by three rotation angles $\alpha$, $\beta$, and $\gamma$ (pitch, yaw, roll), $\mathbf{t}$ is a translation parameter composed of $t_x$ and $t_y$, and $\mathbf{d}$ is an $N$-dim index vector indicating the indexes of semantically meaningful 3D vertexes that correspond to 2D landmarks. We form a projection vector $\mathbf{m} = (s, \alpha, \beta, \gamma, t_x, t_y)^{\top}$ which collects all parameters related to this projection. We assume the weak perspective projection model with six degrees of freedom, which is a typical model used in much prior face-related work [48, 121].
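As an illustration of Eqns. 3.1-3.4, the following NumPy sketch builds a dense shape from the shape parameter and projects selected vertexes with the weak perspective model. The Euler-angle composition order is an assumption, since the chapter does not spell out the rotation convention:

```python
import numpy as np

def render_shape(S0, S_id, S_exp, p_id, p_exp):
    """Eq. 3.1: dense 3D shape as mean + identity + expression combination.
    S0: (3, Q); S_id: (N_id, 3, Q); S_exp: (N_exp, 3, Q)."""
    return S0 + np.tensordot(p_id, S_id, axes=1) + np.tensordot(p_exp, S_exp, axes=1)

def euler_to_R(alpha, beta, gamma):
    """Full 3x3 rotation from pitch/yaw/roll angles (radians)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx                                  # assumed composition order

def project(S, m, d):
    """Eq. 3.4: weak perspective projection of the selected vertexes d."""
    s, alpha, beta, gamma, tx, ty = m
    R = euler_to_R(alpha, beta, gamma)[:2, :]            # first two rows
    return s * (R @ S[:, d]) + np.array([[tx], [ty]])
```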
Figure 3.3: The landmark marching process for updating vector d. (a-b) show the paths of cheek landmarks on the mean shape; (c) is the estimated face shape; (d) is the estimated face shape by ignoring the roll rotation; and (e) shows the locations of landmarks on the cheek.

At this point, we can represent any 2D face shape as the projection of a 3D face shape. In other words, the projection parameter $\mathbf{m}$ and shape parameter $\mathbf{p}$ can uniquely represent a 2D face shape. Therefore, the face alignment problem amounts to estimating $\mathbf{m}$ and $\mathbf{p}$, given a face image. Estimating $\mathbf{m}$ and $\mathbf{p}$ instead of estimating $\mathbf{U}$ is motivated by a few factors. First, without the 3D modeling, it is non-trivial to model the out-of-plane rotation, which has a varying number of landmarks depending on the rotation angle. Second, as pointed out by [121], by only using 1/6 of the number of the shape bases, 3DMM can have an equivalent representation power as its 2D counterpart. Hence, using the 3D model leads to a more compact representation of shape parameters.

Cheek landmarks correspondence. The projection relationship in Eqn. 3.4 is correct for frontal-view faces, given a constant index vector $\mathbf{d}$. However, as soon as a face turns to a non-frontal view, the original 3D landmarks on the cheek become invisible in the 2D image. Yet most 2D face alignment algorithms still detect 2D landmarks on the contour of the cheek, termed "cheek landmarks". Therefore, in order to still maintain the 3D-to-2D correspondences of Eqn. 3.4, it is desirable to estimate the 3D vertexes that match with these cheek landmarks. A few prior works have proposed various approaches to handle this [24, 92, 150]. In this paper, we leverage the landmark marching method proposed in [150]. Specifically, we define a set of paths, each storing the indexes of vertexes that are not only the closest ones to the original 3D cheek landmarks, but also on the contour of the 3D face as it turns.

Data: Estimated 3D face S and projection matrix parameter m
Result: Index vector d
/* Rotate S by the estimated α, β */
Ŝ = R(α, β, 0) S
if 0° < β < 70° then
    foreach i = 1, ..., 4 do
        V_cheek(i) = argmax_id Ŝ(1, Path_cheek(i))
if −70° < β < 0° then
    foreach i = 5, ..., 8 do
        V_cheek(i) = argmin_id Ŝ(1, Path_cheek(i))
Update 8 elements of d with V_cheek.

Figure 3.4: Landmark marching g(S, m).

Given a non-frontal 3D face $\mathbf{S}$, by ignoring the roll rotation $\gamma$, we rotate $\mathbf{S}$ using the $\alpha$ and $\beta$ angles (pitch and yaw), and search for a vertex in each path that has the maximum (minimum) $x$ coordinate, i.e., the boundary vertex on the right (left) cheek. These resulting vertexes will be the new 3D landmarks that correspond to the 2D cheek landmarks. We then update the relevant elements of $\mathbf{d}$ to make sure these vertexes are selected in the projection of Eqn. 3.4. This landmark marching process is summarized in Algorithm 3.4 as a function $\mathbf{d} \leftarrow g(\mathbf{S}, \mathbf{m})$. Note that when the face is approximately of profile view ($|\beta| > 70°$), we do not apply landmark marching since the marched landmarks would overlap with the existing 2D landmarks on the middle of the nose and mouth. Fig. 3.3 shows the set of defined paths on the 3D shape of the face and one example of applying Algorithm 3.4 for updating vector $\mathbf{d}$.

3.2.2 Data Augmentation

Given that the projection matrix parameter $\mathbf{m}$ and shape parameter $\mathbf{p}$ are the representation of a face shape, we should have a collection of face images with ground truth $\mathbf{m}$ and $\mathbf{p}$ so that the learning algorithm can be applied. However, while $\mathbf{U}$ can be manually labeled on a face image, $\mathbf{m}$ and $\mathbf{p}$ are normally unavailable unless a 3D scan is captured along with a face image. For most existing face alignment databases, such as the AFLW database [61], only 2D landmark locations and sometimes the visibilities of landmarks are manually labeled, with no associated 3D information such as $\mathbf{m}$ and $\mathbf{p}$. In order to make the learning possible, we propose a data augmentation process for 2D face images, with the goal of estimating their $\mathbf{m}$ and $\mathbf{p}$ representation. Specifically, given the labeled visible 2D landmarks $\mathbf{U}$ and the landmark visibilities $\mathbf{V}$, we estimate $\mathbf{m}$ and $\mathbf{p}$ by minimizing the following objective function:

J(\mathbf{m}, \mathbf{p}) = \left\lVert \left( s\,\mathbf{R}\,\mathbf{S}(:, g(\mathbf{S}, \mathbf{m})) + \mathbf{t} - \mathbf{U} \right) \odot \mathbf{V} \right\rVert_F^2, \quad (3.5)

which is the difference between the projection of the 3D landmarks and the 2D labeled landmarks.
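A minimal sketch of the marching function g(S, m) used in Eqn. 3.5, reusing euler_to_R from the sketch above. How the pre-defined cheek paths are stored, that angles are in radians, and that the first eight entries of d index the cheek landmarks are all assumptions made for illustration:

```python
import numpy as np

def landmark_marching(S, m, d, cheek_paths):
    """Sketch of g(S, m) (Algorithm 3.4). cheek_paths: list of 8 vertex-index
    arrays, right-cheek paths first (an assumed storage convention)."""
    _, alpha, beta, gamma, _, _ = m
    S_rot = euler_to_R(alpha, beta, 0.0) @ S            # rotate, ignoring roll
    d = d.copy()
    yaw = np.degrees(beta)
    if 0 < yaw < 70:                                     # update right-cheek landmarks
        for i in range(4):
            path = cheek_paths[i]
            d[i] = path[np.argmax(S_rot[0, path])]       # boundary vertex: max x
    elif -70 < yaw < 0:                                  # update left-cheek landmarks
        for i in range(4, 8):
            path = cheek_paths[i]
            d[i] = path[np.argmin(S_rot[0, path])]       # boundary vertex: min x
    return d                                             # profile views: d unchanged
```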
Note that although the landmark marching $g(\cdot, \cdot)$ makes cheek landmarks "visible" for non-profile views, the visibility $\mathbf{V}$ is still necessary to avoid invisible landmarks, such as outer eye corners and half of the face at the profile view, being part of the optimization.

3.2.2.1 Optimization

For convenient optimization of Eqn. 3.5, we define all projection parameters as a projection matrix, i.e.,

\mathbf{M} = \begin{bmatrix} s\mathbf{R} & \begin{matrix} t_x \\ t_y \end{matrix} \end{bmatrix} \in \mathbb{R}^{2 \times 4}. \quad (3.6)

Also, we fix $\mathbf{d} = g(\mathbf{S}, \mathbf{m})$ in Eqn. 3.5 by assuming it is a constant given the currently estimated $\mathbf{m}$ and $\mathbf{p}$. We then rewrite Eqn. 3.5 as,

J(\mathbf{M}, \mathbf{p}) = \left\lVert \left( \mathbf{M} \begin{bmatrix} \mathbf{S}(:, \mathbf{d}) \\ \mathbf{1}^{\top} \end{bmatrix} - \mathbf{U} \right) \odot \mathbf{V} \right\rVert_F^2. \quad (3.7)

To minimize this objective function, we alternate the minimization w.r.t. $\mathbf{M}$ and $\mathbf{p}$ at each iteration. We initialize the 3D shape parameter $\mathbf{p} = \mathbf{0}$ and estimate $\mathbf{M}$ by $\mathbf{M}^k = \arg\min_{\mathbf{M}} J(\mathbf{M}, \mathbf{p}^{k-1})$,

\mathbf{M}^k = \mathbf{U}_V \begin{bmatrix} \mathbf{S}(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix}^{\top} \left( \begin{bmatrix} \mathbf{S}(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix} \begin{bmatrix} \mathbf{S}(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix}^{\top} \right)^{-1}, \quad (3.8)

where $\mathbf{U}_V$ is the zero-mean positions (by removing the mean from all the elements) of the visible 2D landmarks, and $\mathbf{d}_V$ is a vector containing the indexes of the visible landmarks. Given the estimated $\mathbf{M}^k$, we then use the Singular Value Decomposition (SVD) to decompose it into various elements of the projection parameter $\mathbf{m}$, i.e., $\mathbf{M}^k = \mathbf{B}\mathbf{D}\mathbf{Q}^{\top}$. The diagonal element of $\mathbf{D}$ is the scale $s$, and we decompose the rotation matrix $\mathbf{R} = \mathbf{B}\mathbf{Q}^{\top} \in \mathbb{R}^{2 \times 3}$ into the three rotation angles $(\alpha, \beta, \gamma)$. Finally, the mean values of $\mathbf{U}$ are the translation parameters $t_x$ and $t_y$.

Then, we estimate $\mathbf{p}^k = \arg\min_{\mathbf{p}} J(\mathbf{M}^k, \mathbf{p})$. Given the orthogonal bases of the 3DMM, we choose to compute each element of $\mathbf{p}$ one by one. That is, $p_{id}^i$ is the contribution of the $i$-th identity basis in reconstructing the dense 3D face shape,

p_{id}^i = \frac{\operatorname{Tr}\!\left( \hat{\mathbf{U}}_V^{\top} \hat{\mathbf{U}}_{id_i} \right)}{\operatorname{Tr}\!\left( \hat{\mathbf{U}}_{id_i}^{\top} \hat{\mathbf{U}}_{id_i} \right)}, \quad (3.9)

where

\hat{\mathbf{U}}_V = \mathbf{M}^k \begin{bmatrix} \mathbf{S}(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix}, \qquad \hat{\mathbf{U}}_{id_i} = \mathbf{M}^k \begin{bmatrix} \mathbf{S}_{id}^i(:, \mathbf{d}_V) \\ \mathbf{1}^{\top} \end{bmatrix}.

Here $\hat{\mathbf{U}}_V$ is the current residual of the positions of the 2D visible landmarks after subtracting the contribution of $\mathbf{M}^k$, and $\operatorname{Tr}(\cdot)$ is the trace function. Once $p_{id}^i$ is computed, we update $\hat{\mathbf{U}}_V$ by subtracting the contribution of the $i$-th basis and continue to compute $p_{id}^{i+1}$. We alternately estimate $\mathbf{M}$ and $\mathbf{p}$ until the changes of $\mathbf{M}$ and $\mathbf{p}$ are small enough. After each step of applying Eqn. 3.8 for computing a new estimation of $\mathbf{M}$ and decomposing it into its parameters $\mathbf{m}$, we apply the landmark marching algorithm (Algorithm 3.4) to update the vector $\mathbf{d}$.

3.2.3 Cascaded CNN Coupled-Regressor

Given a set of $N_d$ training face images and their augmented (i.e., "ground truth") $\mathbf{m}$ and $\mathbf{p}$ representation, we are interested in learning a mapping function that is able to predict $\mathbf{m}$ and $\mathbf{p}$ from the appearance of a face. Clearly this is a complicated non-linear mapping due to the diversity of facial appearance. Given the success of CNN in vision tasks such as pose estimation [88], face detection [64], and face alignment [142], we decide to marry the CNN with the cascade regressor framework by learning a series of CNN-based regressors to alternate the estimation of $\mathbf{m}$ and $\mathbf{p}$. To the best of our knowledge, this is the first time CNN is used in 3D face alignment, with the estimation of over 10 landmarks.

For each training image $\mathbf{I}_i$, in addition to the ground truth $\mathbf{m}_i$ and $\mathbf{p}_i$, we also initialize the image's representation by $\mathbf{m}_i^0 = h(\bar{\mathbf{m}}, \mathbf{b}_i)$ and $\mathbf{p}_i^0 = \mathbf{0}$. Here $\bar{\mathbf{m}}$ is the average of the ground truth parameters of the projection matrices in the training set, $\mathbf{b}_i$ is a 4-dim vector indicating the bounding box location, and $h(\mathbf{m}, \mathbf{b})$ is a function that modifies the scale and translations of $\mathbf{m}$ based on $\mathbf{b}$.
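The two inner steps of the alternating fit in Sec. 3.2.2.1 reduce to a linear least-squares solve (Eqn. 3.8) and a trace ratio (Eqn. 3.9); a NumPy sketch, assuming the visible-landmark selection and mean removal have already been applied:

```python
import numpy as np

def fit_projection_matrix(U_vis, S_vis):
    """Eq. 3.8: closed-form 2x4 projection matrix M.
    U_vis: (2, Nv) zero-mean visible 2D landmarks; S_vis: (3, Nv) matching 3D points."""
    S_h = np.vstack([S_vis, np.ones((1, S_vis.shape[1]))])   # homogeneous, 4 x Nv
    return U_vis @ S_h.T @ np.linalg.inv(S_h @ S_h.T)        # (2, 4)

def basis_coefficient(U_res, U_basis):
    """Eq. 3.9: contribution of one projected basis to the current landmark residual."""
    return np.trace(U_res.T @ U_basis) / np.trace(U_basis.T @ U_basis)
```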
Thus, at stage $k$ of the cascaded CNN, we can learn a CNN to estimate the desired update of the projection matrix parameter,

\Theta_m^k = \arg\min_{\Theta_m^k} J_{\Theta} = \sum_{i=1}^{N_d} \left\lVert \Delta\mathbf{m}_i^k - \text{CNN}_m^k(\mathbf{I}_i, \mathbf{U}_i, \mathbf{v}_i^{k-1}; \Theta_m^k) \right\rVert^2, \quad (3.10)

where the true projection update is the difference between the current projection matrix parameter and the ground truth, i.e., $\Delta\mathbf{m}_i^k = \mathbf{m}_i - \mathbf{m}_i^{k-1}$, $\mathbf{U}_i$ is the current estimated 2D landmarks, computed via Eqn. 3.4 based on $\mathbf{m}_i^{k-1}$ and $\mathbf{d}_i^{k-1}$, and $\mathbf{v}_i^{k-1}$ is the estimated landmark visibility at stage $k-1$.

Similarly, another CNN regressor can be learned to estimate the updates of the shape parameter,

\Theta_p^k = \arg\min_{\Theta_p^k} J_{\Theta} = \sum_{i=1}^{N_d} \left\lVert \Delta\mathbf{p}_i^k - \text{CNN}_p^k(\mathbf{I}_i, \mathbf{U}_i, \mathbf{v}_i^{k}; \Theta_p^k) \right\rVert^2. \quad (3.11)

Note that $\mathbf{U}_i$ will be re-computed via Eqn. 3.4, based on the $\mathbf{m}_i^k$ and $\mathbf{d}_i^k$ updated by CNN$_m$.

We use a six-stage cascaded CNN, including CNN$^1_m$, CNN$^2_m$, CNN$^3_p$, CNN$^4_m$, CNN$^5_p$, and CNN$^6_m$. At the first stage, the input layer of CNN$^1_m$ is the entire face region cropped by the initial bounding box, with the goal of roughly estimating the pose of the face. The input for the second to sixth stages is a 114×114 image that contains an array of 19×19 pose-invariant feature patches, extracted from the current estimated 2D landmarks $\mathbf{U}_i$. In our implementation, since we have $N = 34$ landmarks, the last two patches of the 114×114 image are filled with zero. Similarly, for invisible 2D landmarks, their corresponding patches will be filled with zeros as well. These feature patches encode sufficient information about the local appearance around the current 2D landmarks, which drives the CNN to optimize the parameters $\Theta_m^k$ or $\Theta_p^k$. Also, through concatenation, these feature patches share the information among different landmarks and jointly drive the CNN in parameter estimation. Our input representation can be extended to use a larger number of landmarks, and hence a more accurate dense 3D model can be estimated.

Note that since landmark marching is used, the estimated 2D landmarks $\mathbf{U}_i$ include the projection of the marched 3D landmarks, i.e., 2D cheek landmarks. As a result, the appearance features around these cheek landmarks are part of the input to the CNN as well. This is in sharp contrast to [52] where no cheek landmarks participate in the regressor learning. Effectively, these additional cheek landmarks serve as constraints to guide how the facial silhouettes at various poses should look, which essentially reflects the shape of the 3D face surface.

Another note is that, instead of alternating between the estimation of $\mathbf{m}$ and $\mathbf{p}$, another option is to jointly estimate both parameters in each CNN stage. Experimentally we observed that such a joint estimation schedule leads to a lower accuracy than the alternating scheme, potentially due to the different physical meanings of $\mathbf{m}$ and $\mathbf{p}$ and the ambiguity of multiple pairs of $\mathbf{m}$ and $\mathbf{p}$ corresponding to the same 2D shape. For the alternating scheme, we now present two different CNN architectures, and use the same CNN architecture for all six stages of the cascade.

Figure 3.5: Architecture of C-CNN (the same CNN architecture is used for all six stages). Color code used: purple = extracted image feature, orange = Conv, brown = pooling + batch normalization, blue = fully connected layer, red = ReLU. The size and the number of filters for each layer are shown on the top and the bottom respectively.

3.2.4 Conventional CNN (C-CNN)

The architecture of the CNN is shown in Fig. 3.5. It has three convolutional layers, where each one is followed by a pooling layer and a batch normalization layer. Then there is one fully connected layer with a ReLU layer and, at the end of the architecture, one fully connected layer and one Euclidean loss ($J_{\Theta}$) for estimating the projection matrix parameters or 3D shape parameters. We use the rectified linear unit (ReLU) [41] as the activation function, which enables the CNN to achieve the best performance without unsupervised pre-training.
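A hedged PyTorch sketch of one C-CNN stage follows. The layer types and their order match the description above, but the filter counts, hidden width and single-channel input are assumptions (the exact numbers appear only in Fig. 3.5); training would pair the output with nn.MSELoss, matching the Euclidean loss J_Θ:

```python
import torch
import torch.nn as nn

class StageCNN(nn.Module):
    """One C-CNN stage: three conv blocks (conv + pool + batch norm), a hidden
    fully connected layer with ReLU, and a linear output for the parameter update."""
    def __init__(self, out_dim):                 # out_dim: len(m)=6 or len(p)=228
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, 5), nn.MaxPool2d(2), nn.BatchNorm2d(20),
            nn.Conv2d(20, 40, 3), nn.MaxPool2d(2), nn.BatchNorm2d(40),
            nn.Conv2d(40, 60, 3), nn.MaxPool2d(2), nn.BatchNorm2d(60),
        )
        self.fc1 = nn.Linear(60 * 12 * 12, 500)  # 114x114 input shrinks to 12x12 maps
        self.out = nn.Linear(500, out_dim)

    def forward(self, x):
        h = self.features(x)
        h = torch.relu(self.fc1(h.flatten(1)))
        return self.out(h)                       # estimated update of m or p
```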
3.2.5 Mirror CNN (M-CNN)

We deal with two inherent ambiguities when we estimate the projection matrix parameter $\mathbf{m}$ and 3D shape parameter $\mathbf{p}$. First, multiple pairs of $\mathbf{m}$ and $\mathbf{p}$ can represent the same 2D face shape. Second, the estimated updates of $\mathbf{m}$ and $\mathbf{p}$ are not explicitly related to the face alignment error. In other words, the changes in $\mathbf{m}$ and $\mathbf{p}$ are not linearly related to the 2D shape changes. To remedy these ambiguities, we predict the 2D shape update simultaneously while estimating the $\mathbf{m}$ and $\mathbf{p}$ updates. We extend the CNN architecture of each cascade stage by encouraging the alignment results of a face image and its mirror to be highly correlated. To this end, we use the idea of the mirrorability constraint [129] with two main differences. First, we combine this constraint with the learning procedure rather than using it as a post-processing step. Second, we integrate the mirrorability constraint inside a siamese CNN [20] by sharing the network's weights between the input face image and its mirror image and adding a new loss function.

3.2.5.1 Mirror Loss

Given the input image and its mirror image with their initial bounding boxes, we use the function $h(\bar{\mathbf{m}}, \mathbf{b})$, which modifies the scale and translations of $\bar{\mathbf{m}}$ based on $\mathbf{b}$, for initialization. Then, according to the mirrorability constraint, we assume that the estimated update of the shape for the input image should be similar to the update of the shape for the mirror image, up to a reordering. This assumption is true when both images are initialized with the same landmarks up to a reordering, which holds in all cascade stages. We use the mirror loss to minimize the Euclidean distance between the estimated shape updates of the two images. The mirror loss at stage $k$ is,

J_M^k = \left\lVert \Delta\hat{\mathbf{U}}^k - \mathcal{C}\!\left( \Delta\hat{\mathbf{U}}_M^k \right) \right\rVert^2, \quad (3.12)

where $\Delta\hat{\mathbf{U}}^k$ is the input image's shape update, $\Delta\hat{\mathbf{U}}_M^k$ is the mirror image's shape update and $\mathcal{C}(\cdot)$ is a reordering function to indicate the landmark correspondence between the mirrored images.
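A small sketch of the mirror loss of Eqn. 3.12. The reordering C(·) is represented as a permutation array; whether the horizontal sign flip of the x components is folded into C(·) or applied separately is a convention the chapter leaves implicit, and this sketch applies it explicitly:

```python
import numpy as np

def mirror_loss(dU, dU_mirror, reorder):
    """Eq. 3.12: squared distance between an image's shape update and the
    reordered update of its horizontally flipped copy.
    dU, dU_mirror: (2, N) landmark updates; reorder: permutation mapping mirrored
    landmark indexes back to the original ones."""
    dU_m = dU_mirror[:, reorder].copy()
    dU_m[0] = -dU_m[0]                  # assumed: x-direction flips under mirroring
    return np.sum((dU - dU_m) ** 2)
```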
Figure 3.6: Architecture of the M-CNN (the same CNN architecture is used for all six stages). Color code used: purple = extracted image feature, orange = Conv, brown = pooling + batch normalization, green = locally connected layer, blue = fully connected layer, red = batch normalization + ReLU + dropout. The size and the number of filters of each layer are shown on the top and the bottom of the top branch respectively.

3.2.5.2 Mirror CNN Architecture

The new CNN architecture follows the siamese network [20] with two branches whose weights are shared. Fig. 3.6 shows the architecture of the M-CNN. The top and bottom branches are fed with the extracted input features from a training image and its mirror, respectively. Each branch has two convolutional layers and two locally connected layers. The locally connected layer [102] is similar to a convolutional layer but learns a separate set of filters for the various regions of its input. The locally connected layers are spatial-location dependent, which is a correct assumption for our extracted image feature at each stage. After each of these layers, we have one pooling and one batch normalization layer. At the end of the top branch, after a fully connected layer, batch normalization, ReLU and dropout layers, we have two fully connected layers, one for estimating the update of the parameters ($J_{\Theta}$) and the other for estimating the update of the 2D shape via the loss ($J_U$),

J_U^k = \left\lVert \Delta\mathbf{U}^k - \Delta\hat{\mathbf{U}}^k \right\rVert^2. \quad (3.13)

In the bottom branch, we only have one loss ($J_{MU}$) for estimating the update of the 2D shape in the mirror image. In total, we have four loss functions: one for the updates of $\mathbf{m}$ or $\mathbf{p}$, two for the 2D shape updates of the two images respectively, and one mirror loss. We minimize the total loss at stage $k$,

J_T^k = J_{\Theta}^k + \lambda_1 J_U^k + \lambda_2 J_{MU}^k + \lambda_3 J_M^k, \quad (3.14)

where $\lambda_1$ to $\lambda_3$ are weights for the loss functions. Although M-CNN appears more complicated to train than C-CNN, their testing procedures are the same. That is, the only useful result at each cascade stage of M-CNN is the estimated update of $\mathbf{m}$ or $\mathbf{p}$, which is also passed to the next stage to initialize the input image features. In other words, the mirror images and the estimated $\Delta\mathbf{U}$ of both images only serve as constraints in training, and are neither needed nor used in testing.

3.2.6 Visibility and 2D Appearance Features

One notable advantage of employing a dense 3D shape model is that more advanced 2D features, which might only be possible because of the 3D model, can be extracted and contribute to the cascaded CNN learning. In this work, these 2D features refer to the 2D landmark visibility and the appearance patch around each 2D landmark.

In order to compute the visibility of each 2D landmark, we leverage the basic idea of examining whether the 3D surface normal of the corresponding 3D landmark is pointing to the camera or not, under the current camera projection matrix [52]. Instead of using the average 3D surface normal for all humans, we extend it by using person-specific 3D surface normals. Specifically, given the current estimated 3D shape $\mathbf{S}$, we compute the 3D surface normals for a set of sparse vertexes around the 3D landmark of interest, and the average of these 3D normals is denoted as $\tilde{\mathbf{N}}$. Fig. 3.7 illustrates the advantage of using the average 3D surface normal. Given $\tilde{\mathbf{N}}$, we compute,

v = \tilde{\mathbf{N}}^{\top} (\mathbf{R}_1 \times \mathbf{R}_2), \quad (3.15)

where $\mathbf{R}_1$ and $\mathbf{R}_2$ are the two rows of $\mathbf{R}$. If $v$ is positive, the 2D landmark is considered visible and its 2D appearance feature will be part of the input for the CNN. Otherwise, it is invisible and the corresponding feature will be zero for the CNN. Note that this method does not estimate occlusion due to other objects such as hair.

Figure 3.7: The person-specific 3D surface normal as the average of normals around a 3D landmark (black arrow). Notice the relatively noisy surface normal of the 3D "left eye corner" landmark (blue arrow).

In addition to visibility estimation, a 3D shape model can also contribute to generating advanced appearance features as the input layer for the CNN. Specifically, we aim to extract a pose-invariant appearance patch around each estimated 2D landmark, and the array of these patches will form the input layer. In [128], a similar feature extraction is proposed by putting different scales of the input image together and forming a big image as the appearance feature. We now describe two proposed approaches to extract an appearance feature, i.e., a 19×19 patch, for the $n$th 2D landmark.
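Eqn. 3.15 amounts to a sign test between the averaged person-specific normal and the camera-facing axis; a minimal NumPy sketch:

```python
import numpy as np

def landmark_visibility(normals, R):
    """Eq. 3.15: a landmark is visible when the averaged surface normal around it
    points toward the camera. normals: (3, T) normals of sparse vertexes near the
    landmark; R: (2, 3) rotation rows of the projection."""
    n_avg = normals.mean(axis=1)
    n_avg /= np.linalg.norm(n_avg)       # person-specific average normal
    cam_axis = np.cross(R[0], R[1])      # direction toward the camera
    return float(n_avg @ cam_axis) > 0
```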
Piecewise affine-warped feature (PAWF). Feature correspondence is always very important for any visual learning, as evident by the importance of eye-based rectification to face recognition. Yet, due to the fact that a 2D face is the projection of a 3D surface with an arbitrary view angle, it is hard to make sure that a local patch extracted from this 2D image corresponds to the patch from another image, even when both patches are centered at the ground truth locations of the same $n$th 2D landmark. Here, "correspond" means that the patches cover exactly the same local region of the face anatomically. However, with a dense 3D shape model in hand, we may extract local patches across different subjects and poses with anatomical correspondence. These correspondences across subjects and poses facilitate the CNN to learn the appearance variation induced by misalignment, rather than by subjects or poses.

In an offline procedure, we first search for the $T$ vertexes on the mean 3D shape $\mathbf{S}_0$ that are the closest to the $n$th landmark (Fig. 3.8 (b)). Second, we rotate the $T$ vertexes such that the 3D surface normal of the $n$th landmark points toward the camera (Fig. 3.8 (c)). Third, among the $T$ vertexes we find four "neighborhood vertexes", which have the minimum and maximum $x$ and $y$ coordinates, and denote the four vertex IDs as a 4-dim vector $\mathbf{d}_p^{(n)}$ (Fig. 3.8 (d)). The first row of Fig. 3.8 shows the process of extracting PAWF for the right landmark of the right eye.

Figure 3.8: Feature extraction process, (a-e) PAWF for the landmark on the right side of the right eye, (f-j) D3PF for the landmark on the right side of the lip.

Figure 3.9: Examples of extracting PAWF. When one of the four neighborhood points (red point in the bottom-right) is invisible, it connects to the 2D landmark (green point), extends the same distance further, and generates a new neighborhood point. This helps to include the background context around the nose.

During the CNN learning, for the $n$th landmark of the $i$th image, we project the four neighborhood vertexes onto the $i$th image and obtain four neighborhood points, $\mathbf{U}_i^{(n)} = s\mathbf{R}\mathbf{S}(:, \mathbf{d}_p^{(n)}) + \mathbf{t}$, based on the current estimated projection parameter $\mathbf{m}$. Across all 2D face images, the $\mathbf{U}_i^{(n)}$ correspond to the same face vertexes anatomically. Therefore, we warp the imagery content within these neighborhood points to a 19×19 patch by using the piecewise affine transformation [78]. This novel feature representation can be well extracted in most cases, except for cases such as the nose tip at the profile view. In such cases, the projection of the $n$th landmark is outside the region defined by the neighborhood points, where one of the neighborhood points is invisible due to occlusion. When this happens, we change the location of the invisible point by using its relative distance to the projected landmark location, as shown in Fig. 3.9.

Direct 3D projected feature (D3PF). Both D3PF and PAWF start with the $T$ vertexes surrounding the $n$th 3D landmark (Fig. 3.8 (g)). Instead of finding four neighborhood vertexes as in PAWF, D3PF overlays a 19×19 grid covering the $T$ vertexes, and stores the vertexes of the grid points in $\mathbf{d}_d^{(n)}$ (Fig. 3.8 (i)). The second row of Fig. 3.8 shows the process of extracting D3PF. Similar to PAWF, we can now project the set of 3D vertexes $\mathbf{S}(:, \mathbf{d}_d^{(n)})$ onto the 2D image and extract a 19×19 patch via bilinear interpolation, as shown in Fig. 3.10. We also estimate the visibilities of the 3D vertexes $\mathbf{S}(:, \mathbf{d}_d^{(n)})$ via their surface normals, and zero will be placed in the patch for invisible ones. For D3PF, every pixel in the patch will correspond to the same pixel in the patches of other images, while for PAWF, this is true only for the four neighborhood points.

Figure 3.10: Example of extracting D3PF.
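A sketch of the PAWF patch extraction described above, reusing project from the earlier 3DMM sketch. For brevity it warps with a single perspective transform from the four neighborhood points (via OpenCV) rather than the piecewise affine transformation of [78], and it assumes the four points are ordered consistently with the destination corners:

```python
import cv2
import numpy as np

def extract_pawf(img, S, m, d_p, size=19):
    """Warp the region enclosed by the four projected neighborhood vertexes d_p
    into a size x size pose-invariant patch."""
    pts = project(S, m, d_p).T.astype(np.float32)        # (4, 2) neighborhood points
    dst = np.array([[0, 0], [size - 1, 0],
                    [size - 1, size - 1], [0, size - 1]], dtype=np.float32)
    H = cv2.getPerspectiveTransform(pts, dst)            # assumes matching point order
    return cv2.warpPerspective(img, H, (size, size))
```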
3.2.7 Testing

The testing procedures of both C-CNN and M-CNN are the same. Given a testing image $\mathbf{I}$ and its initial parameters $\mathbf{m}^0$ and $\mathbf{p}^0$, we apply the learned cascaded CNN coupled-regressor for face alignment. Basically, we iteratively use $R_m^k(\cdot; \Theta_m^k)$ to compute $\Delta\hat{\mathbf{m}}$ and update $\mathbf{m}^k$, then use $R_p^k(\cdot; \Theta_p^k)$ to compute $\Delta\hat{\mathbf{p}}$ and update $\mathbf{p}^k$. Finally, the dense 3D shape is constructed via Eqn. 3.1, and the estimated 2D landmarks are $\hat{\mathbf{U}} = s\mathbf{R}\hat{\mathbf{S}}(:, \mathbf{d}) + \mathbf{t}$. Note that we apply the feature extraction procedure one time for each CNN stage.

3.3 Experimental Results

In this section, we design experiments to answer the following questions: (1) What is the performance of the proposed method on challenging datasets in comparison to the state-of-the-art methods? (2) How do different feature extraction methods perform in pose-invariant face alignment? (3) What is the performance of the proposed method with different CNN architectures and with different deep learning toolboxes?

Figure 3.11: (a) AFLW original (yellow) and added landmarks (green), (b) comparison of the mean NME of each landmark for RCPR (blue) and the proposed method (green). The radius of the circles is determined by the mean NME multiplied with the face bounding box size.

3.3.1 Experimental Setup

Databases. Given that this work focuses on pose-invariant face alignment, we choose two publicly available face datasets with labeled landmarks and a wide range of poses.

The AFLW database [61] is a large face dataset with 25K face images. Each image is manually labeled with up to 21 landmarks, with a visibility label for each landmark. In [52], a subset of AFLW is selected to have a balanced distribution of yaw angles, including 3,901 images for training and 1,299 images for testing. We use the same subset and manually label 13 additional landmarks for all 5,200 images. We call these 3,901 images the base training set. The locations of the original landmarks and the added landmarks are shown in Fig. 3.11 (a). Using the ground truth landmarks of each image, we find the tightest bounding box, expand it by 10% of its size, and add 10% noise to the top-left corner, width and height of the bounding box (examples in the 1st row of Fig. 3.17). These randomly generated bounding boxes mimic the imprecise face detection window and will be used for both training and testing.

The AFW dataset [151] contains 468 faces in 205 images. Each face image is manually labeled with up to 6 landmarks and has a visibility label for each landmark. For each face image a detected bounding box is provided, which will be used as initialization. Given the small number of images, we only use this dataset for testing.

We use the $N_{id} = 199$ bases of the Basel Face Model [86] for representing identity variation and the $N_{exp} = 29$ bases of FaceWarehouse [25] for representing expression variation. In total, there are 228 bases representing 3D face shapes with 53,215 vertexes.

Synthetic training data. Unlike conventional face alignment, one of the main challenges in pose-invariant face alignment is the limited training images. There are only two publicly available face databases with a wide range of poses, along with landmark labeling. Therefore, utilizing synthetic face images is an efficient way to supply more images into the training set. Specifically, we add 16,556 face images with various poses, generated from 1,035 subjects of the LFPW dataset [7] by the method of [150], to the base training set. We call this new training set the extended training set.
Baseline selection. Given the explosion of face alignment work in recent years, it is important to choose appropriate baseline methods so as to make sure the proposed method advances the state of the art. We select the most recent pose-invariant face alignment methods for comparing with the proposed method. We compare the proposed method with two methods on AFLW: 1) PIFA [52] is a pose-invariant face alignment method which aligns faces of arbitrary poses with the assistance of a sparse 3D point distribution model; 2) RCPR [22] is a method based on a cascade of regressors that represents the occlusion-invariant face alignment. For comparison on AFW, we select three methods: 1) PIFA [52]; 2) CDM [135], a method based on the Constrained Local Model (CLM) and the first one claimed to perform pose-free face alignment; 3) TSPM [151], which is based on a mixture of trees with a shared pool of parts and can handle face alignment for large-pose face images. It can be seen that these baselines are most relevant to our focus on pose-invariant face alignment.

Parameter setting. For implementing the proposed methods, we use two different deep learning toolboxes. For implementing the C-CNN architecture, we use the MatConvNet toolbox [113] with a constant learning rate of 1e-4, with ten epochs for training each CNN and a batch size of 100. For the M-CNN architecture, we use the Caffe toolbox [49] with a learning rate of 1e-7 and the step learning rate policy with a drop rate of 0.9, in 70 epochs at each stage and a batch size of 100. We set the weight parameters $\lambda_1$ to $\lambda_3$ of the total loss in Eqn. 3.14 to 1. For RCPR, we use the parameters reported in its paper, with 100 iterations and 15 boosted regressors. For PIFA, we use 200 iterations and 5 boosted regressors. For PAWF and D3PF, at the second stage $T$ is 5,000, and 3,000 for the other stages. According to our empirical evaluation, six stages of CNN are sufficient for convergence of the fitting process.

Table 3.1: NME (%) of the proposed method with different features with the C-CNN architecture and the base training set.

PAWF + Cheek Landmarks | D3PF + Cheek Landmarks | PAWF | Extracted Patch
4.72 | 5.02 | 5.19 | 5.51

Evaluation metrics. Given the ground truth 2D landmarks $\mathbf{U}_i$, their visibility $\mathbf{v}_i$, and the estimated landmarks $\hat{\mathbf{U}}_i$ of $N_t$ testing images, we use two conventional metrics for measuring the error of up to 34 landmarks: 1) Mean Average Pixel Error (MAPE) [135], which is the average of the estimation errors for visible landmarks, i.e.,

\text{MAPE} = \frac{1}{\sum_{i}^{N_t} |\mathbf{v}_i|_1} \sum_{i,j}^{N_t, N} \mathbf{v}_i(j)\, \lVert \hat{\mathbf{U}}_i(:,j) - \mathbf{U}_i(:,j) \rVert, \quad (3.16)

where $|\mathbf{v}_i|_1$ is the number of visible landmarks of image $\mathbf{I}_i$, and $\mathbf{U}_i(:,j)$ is the $j$th column of $\mathbf{U}_i$. 2) Normalized Mean Error (NME), which is the average of the normalized estimation error of visible landmarks, i.e.,

\text{NME} = \frac{1}{N_t} \sum_{i}^{N_t} \left( \frac{1}{d_i |\mathbf{v}_i|_1} \sum_{j}^{N} \mathbf{v}_i(j)\, \lVert \hat{\mathbf{U}}_i(:,j) - \mathbf{U}_i(:,j) \rVert \right), \quad (3.17)

where $d_i$ is the square root of the face bounding box size [52]. The eye-to-eye distance is not used in NME since it is not well defined in large poses such as profile.

Figure 3.12: Errors on the AFLW testing set after each stage of CNN for different feature extraction methods with the C-CNN architecture and the base training set. The initial error is 25.8%.
3.3.2 Comparison Experiments

Feature extraction methods. To show the advantages of the proposed features, Table 3.1 compares the accuracy of the proposed method on AFLW with 34 landmarks, with various feature presentations (i.e., the input layer for CNN² to CNN⁶). For this experiment, we use the C-CNN architecture with the base training set. The "Extracted Patch" refers to extracting a constant-size (19×19) patch centered at an estimated 2D landmark, from a face image normalized using the bounding box, which is a baseline feature widely used in conventional 2D alignment methods [139, 147]. For the feature "+ Cheek Landmarks", up to four additional 19×19 patches of the contour landmarks, which are invisible for non-frontal faces, are replaced with patches of the cheek landmarks, and used in the input layer of CNN learning. The PAWF achieves higher accuracy than the D3PF. Comparing Columns 1 and 3 of Table 3.1 shows that extracting features from cheek landmarks is effective in acting as additional visual cues for the cascaded CNN regressors. The combination of using the cheek landmarks and extracting PAWF achieves the highest accuracy, and will be used in the remaining experiments. Fig. 3.12 shows the errors on the AFLW testing set after each stage of CNN for different feature extraction methods. There is no difference in the errors of the first-stage CNN because it uses the global appearance in the bounding box, rather than the array of local features.

Table 3.2: The NME (%) of three methods on AFLW with the base training set.

Proposed method (C-CNN) | PIFA | RCPR
4.72 | 8.04 | 6.26

Figure 3.13: Comparison of NME for each pose with the C-CNN architecture and the base training set.

CNN is known for demanding a large training set, while the 3,901-image AFLW training set is relatively small from CNN's perspective. However, our CNN-based regressor is still able to learn and align well on unseen images. We attribute this fact to the effective appearance features proposed in this work, i.e., the superior feature correspondence enabled by the dense face model reduces CNN's demand for massive training data.

Experiments on the AFLW dataset. We compare the proposed method with the two most related methods for aligning faces with arbitrary poses. For both RCPR and PIFA, we use their source code to perform training on the base training set. The NME of the three methods on the AFLW testing set is shown in Table 3.2. The proposed method achieves better results than the two baselines. The error comparison for each landmark is shown in Fig. 3.11 (b). As expected, the contour landmarks have relatively higher errors, and the proposed method has lower errors than RCPR across all of the landmarks.

By using the ground truth landmark locations of the test images, we divide all test images into six subsets according to the estimated yaw angle of each image. Fig. 3.13 compares the proposed method with RCPR. The proposed method achieves better results across different poses and, more importantly, is more robust, having less variation across poses. For a detailed comparison of the NME distribution, the Cumulative Errors Distribution (CED) diagrams of various methods are shown in Fig. 3.14. The improvement seems to be consistent over all NME values, and is especially larger around lower NMEs (≤ 8%). We use the t-SNE toolbox [112] to apply dimension reduction on the output of the ReLU layer in the first-stage CNN. The output of each test image is reduced to a two-dimensional point and all test images are plotted based on the locations of their points (Fig. 3.15). This shows that the first-stage CNN can model the distribution of face poses.

Figure 3.14: The comparison of CED for different methods with the C-CNN architecture and the base training set.

Figure 3.15: Result of the proposed method after the first-stage CNN. This image shows that the first-stage CNN can model the distribution of face poses. The right-view faces are at the top, the frontal-view faces are at the middle, and the left-view faces are at the bottom.
Table 3.3: The NME (%) of three methods on AFLW with the extended training set and the Caffe toolbox.

Proposed method (M-CNN) | Proposed method (C-CNN) | RCPR
4.52 | 5.38 | 7.04

Experiments on the AFLW dataset with M-CNN. We use the extended training set and the mirror CNN architecture (M-CNN) for this experiment. We report the NME results of three methods in Table 3.3. The M-CNN architecture, which incorporates the mirror constraint during the learning, achieves approximately a 16% reduction of error over the C-CNN architecture implemented with the Caffe toolbox. This shows the effectiveness of the mirrorability constraint in the new architecture.

The comparison of Table 3.2 and Table 3.3 shows that the accuracy of the RCPR method is lower with the extended training set than with the base training set. We attribute this to the low quality of the side parts of the synthesized large-pose face images. Although the method in [150] can synthesize side-view face images, the synthesized images can have some artifacts on the side part of the face. These artifacts make it hard for the local fern features-based RCPR method to simultaneously estimate the location and the visibility of landmarks.

In our proposed method, we arrange the extracted PAWF patches in a spatial array and use it as the input to the CNN. An alternative CNN input is to assign the extracted PAWF patches to different channels and construct a 19×19×34 input datum. To evaluate its performance, considering the change of the input size, we modify the CNN architecture in Fig. 3.6 by removing the third and the fourth pooling layers. The NME of M-CNN with the extended training set is 4.91%, which shows that arranging the PAWF patches as a large image is still superior.

Experiments on the AFW dataset. The AFW dataset contains faces of all pose ranges with labels of 6 landmarks. We report the MAPE for six methods in Table 3.4. For PIFA, CDM and TSPM, we show the error reported in their papers. Again we see the consistent improvement of our proposed method (with both architectures) over the baseline methods.

Table 3.4: The MAPE of six methods on AFW.

Proposed method (M-CNN + PAWF) | Proposed method (C-CNN + PAWF) | Proposed method (C-CNN + D3PF) | PIFA | CDM | TSPM
6.52 | 7.43 | 7.83 | 8.61 | 9.13 | 11.09

Comparison of two CNN toolboxes. We utilize two toolboxes for our implementations. We use the MatConvNet toolbox [113] to implement the C-CNN architecture (Fig. 3.5). However, the MatConvNet toolbox has limited ability in defining different branches for a CNN, which is required to train a siamese network. Therefore, we use the Caffe toolbox [49] to implement the M-CNN architecture (Fig. 3.6). Based on our experiments on the AFLW test set, there are noticeable differences between the testing results of these two toolboxes.

Table 3.5 shows the detailed comparison of the C-CNN and M-CNN architectures with different settings. Settings 1 and 2 compare the implementations of the C-CNN architecture on the AFLW training set, using the MatConvNet and Caffe toolboxes respectively. It shows the superior accuracy of the MatConvNet implementation in all stages, even when the extended training set is provided in Setting 4. These different testing results of the two toolboxes might be due to two reasons. One is that the implementation of the basic building blocks, the optimization, and the default parameters could be different in the two toolboxes. The other is the random initialization of network parameters. The comparison of Settings 2 and 3 shows the superiority of the M-CNN architecture. Setting 5 includes our final result with the M-CNN architecture and the extended training set.
Landmark visibility estimation. For evaluating the accuracy of our visibility prediction, we utilize the ground truth 3D shape of the test images and compute the visibility labels of landmarks due to self-occlusion. We define the "visibility error" as the metric, which is the average of the ratios between the number of incorrectly estimated visibility labels and the total number of landmarks per image. The proposed method achieves a visibility error of 4.1%. If we break down the visibility error for each landmark, their distribution is shown in Fig. 3.16.

Table 3.5: The six-stage NMEs of implementing the C-CNN and M-CNN architectures with different training datasets and CNN toolboxes. The initial error is 25.8%.

Sett. | Toolbox | Method / Data | S-1 | S-2 | S-3 | S-4 | S-5 | S-6
1 | MatConvNet | C-CNN / Base set | 7.68 | 5.93 | 5.58 | 4.94 | 4.89 | 4.72
2 | Caffe | C-CNN / Base set | 8.75 | 6.32 | 6.15 | 5.55 | 5.53 | 5.44
3 | Caffe | M-CNN / Base set | 7.18 | 6.06 | 5.83 | 5.08 | 4.91 | 4.76
4 | Caffe | C-CNN / Extended set | 8.44 | 6.78 | 6.60 | 5.75 | 5.70 | 5.38
5 | Caffe | M-CNN / Extended set | 7.41 | 6.16 | 5.80 | 4.76 | 4.67 | 4.52

Figure 3.16: The distribution of visibility errors for each landmark. For the six landmarks on the horizontal center of the face, the visibility errors are zero since they are always visible.

Qualitative results. Some examples of alignment results for the proposed method on the AFLW and AFW datasets are shown in Fig. 3.17. The result of the proposed method at each stage is shown in Fig. 3.18.

Time complexity. The speeds of the proposed method with PAWF and D3PF are 0.6 and 0.26 FPS respectively, with the Matlab implementation. The most time-consuming part of the proposed method is feature extraction, which consumes 80% of the total time. We believe this can be substantially improved with C coding and parallel feature extraction. Note that the speeds of the C-CNN and M-CNN architectures are the same because we only compute the response of the top branch of M-CNN in the testing phase.

Figure 3.17: The results of the proposed method on AFLW and AFW. The green/red/yellow dots show the visible/invisible/cheek landmarks, respectively. First row: initial landmarks for AFLW, Second: estimated 3D dense shapes, Third: estimated landmarks, Fourth and Fifth: estimated landmarks for AFLW, Sixth: estimated landmarks for AFW. Notice that despite the discrepancy between the diverse face poses and the constant front-view landmark initialization (top row), our model can adaptively estimate the pose, fit a dense model and produce the 2D landmarks as a byproduct.

3.4 Summary

We propose a method to fit a 3D dense shape to a face image with large poses by combining cascaded CNN regressors and the 3D Morphable Model (3DMM). We propose two types of pose-invariant features and one new CNN architecture for boosting the accuracy of face alignment. Also, we estimate the location of landmarks on the cheek, which also drives the 3D face model fitting. Finally, we achieve state-of-the-art performance on two challenging face alignment datasets with large poses.

Figure 3.18: The result of the proposed method across stages, with the extracted features (1st and 3rd rows) and alignment results (2nd and 4th rows). Note the changes of the landmark position and visibility (the blue arrow) over stages.
Chapter 4

Pose-Invariant Face Alignment with a Single CNN

4.1 Introduction

In the previous chapter, we proposed our PIFA method based on a cascade of CNN regressors and the 3D Morphable Model. The cascade of regressors is the dominant technology for pose-invariant face alignment [68, 147, 148]. Despite the recent success, the cascade of CNNs, when applied to large-pose face images, still suffers from the following drawbacks.

Lack of end-to-end training: It is a consensus that end-to-end training is desired for CNNs [23, 47]. However, in the existing methods, the CNNs are typically trained independently at each cascade stage. Sometimes even multiple CNNs are applied independently at each stage. For example, locations of different landmark sets are estimated by various CNNs and combined via a separate fusing module [99]. Therefore, these CNNs cannot be jointly optimized and might lead to a sub-optimal solution.

Hand-crafted feature extraction: Since the CNNs are trained independently, feature extraction is required to utilize the result of a previous CNN and provide input to the current CNN. Simple feature extraction methods are used, e.g., extracting patches based on 2D or 3D face shapes without considering other factors including pose and expression [99, 139]. Normally, the cascade of CNNs is a collection of shallow CNNs where each one has fewer than five layers. Hence, this framework cannot extract deep features by building upon the extracted features of early-stage CNNs.

Figure 4.1: For the purpose of learning an end-to-end face alignment model, our novel visualization layer reconstructs the 3D face shape (a) from the estimated parameters inside the CNN and synthesizes a 2D image (b) via the surface normal vectors of visible vertexes.

Slow training speed: Training a cascade of CNNs is usually time-consuming for two reasons. Firstly, the CNNs are trained sequentially, one after another. Secondly, feature extraction is required between two consecutive CNNs.

To address these issues, we introduce a novel visualization layer, as shown in Figure 4.1. Our CNN architecture consists of several blocks, which are called visualization blocks. This architecture can be considered as a cascade of shallow CNNs. The new layer visualizes the alignment result of a previous visualization block and utilizes it in a later visualization block. It is designed based on several guidelines. Firstly, it is derived from the surface normals of the underlying 3D face model and encodes the relative pose between the face and camera, partially inspired by the success of using surface normals for 3D face recognition [79]. Secondly, the visualization layer is differentiable, which allows the gradient to be computed analytically and enables end-to-end training. Lastly, a mask is utilized to differentiate between pixels in the middle and contour areas of a face.

Benefiting from the design of the visualization layer, our method has the following advantages and contributions:

The proposed method allows a block in the CNN to utilize the extracted features from previous blocks and extract deeper features. Therefore, extraction of hand-crafted features is no longer necessary.

Figure 4.2: The proposed CNN architecture. We use green, orange, and purple to represent the visualization layer, convolutional layer, and fully connected layer, respectively. Please refer to Figure 4.3 for the details of the visualization block.

The visualization layer is differentiable, allowing for backpropagation of an error from a later block to an earlier one. To the best of our knowledge, this is the first method for pose-invariant face alignment that utilizes only one single CNN and allows end-to-end training.

The proposed method converges faster during the training phase compared to the cascade of CNNs. Therefore, the training time is dramatically reduced.

4.2 3D Face Alignment with Visualization Layer

Given a single face image with an arbitrary pose, our goal is to estimate the 2D landmarks with their visibility labels by fitting a 3D face model. Towards this end, we propose a CNN architecture with end-to-end training for model fitting, as shown in Figure 4.2. In this section, we will describe the underlying 3D face model used in this work, followed by our CNN architecture and the visualization layer.
4.2.1 3D and 2D Face Shapes

We use the 3D Morphable Model (3DMM) to represent the 3D shape of a face $\mathbf{S}_p$ as a linear combination of the mean shape $\mathbf{S}_0$, identity bases $\mathbf{S}^I$ and expression bases $\mathbf{S}^E$:

\mathbf{S}_p = \mathbf{S}_0 + \sum_{k}^{N_I} p_I^k \mathbf{S}_I^k + \sum_{k}^{N_E} p_E^k \mathbf{S}_E^k. \quad (4.1)

We use the vector $\mathbf{p} = [\mathbf{p}_I, \mathbf{p}_E]$ to indicate the 3D shape parameters, where $\mathbf{p}_I = [p_I^0, \ldots, p_I^{N_I}]$ are the identity parameters and $\mathbf{p}_E = [p_E^0, \ldots, p_E^{N_E}]$ are the expression parameters. We use the Basel 3D face model [86], which has 199 bases, as our identity bases and the FaceWarehouse model [25] with 29 bases as our expression bases. Each 3D face shape consists of a set of $Q$ 3D vertexes:

\mathbf{S}_p = \begin{pmatrix} x_1^p & x_2^p & \cdots & x_Q^p \\ y_1^p & y_2^p & \cdots & y_Q^p \\ z_1^p & z_2^p & \cdots & z_Q^p \end{pmatrix}. \quad (4.2)

The 2D face shapes are the projection of the 3D shapes. In this work, we use the weak perspective projection model with 6 degrees of freedom, i.e., one for scale, three for rotation angles and two for translations, which projects the 3D face shape $\mathbf{S}_p$ onto 2D images to obtain the 2D shape $\mathbf{U}$:

\mathbf{U} = f(\mathbf{P}) = \mathbf{M} \begin{pmatrix} \mathbf{S}_p(:, \mathbf{b}) \\ \mathbf{1} \end{pmatrix}, \quad (4.3)

where

\mathbf{M} = \begin{bmatrix} m_1 & m_2 & m_3 & m_4 \\ m_5 & m_6 & m_7 & m_8 \end{bmatrix}, \quad (4.4)

and

\mathbf{U} = \begin{pmatrix} x_1^t & x_2^t & \cdots & x_N^t \\ y_1^t & y_2^t & \cdots & y_N^t \end{pmatrix}. \quad (4.5)

Here $\mathbf{U}$ is a set of $N$ 2D landmarks, and $\mathbf{M}$ is the camera projection matrix. With misuse of notation, we define the target parameters $\mathbf{P} = \{\mathbf{M}, \mathbf{p}\}$. The $N$-dim vector $\mathbf{b}$ includes the 3D vertex indexes which semantically correspond to the 2D landmarks. We denote $\mathbf{m}_1 = [m_1\ m_2\ m_3]$ and $\mathbf{m}_2 = [m_5\ m_6\ m_7]$ as the two rows of the scaled rotation component, while $m_4$ and $m_8$ are the translations.

Equation 4.3 establishes the relationship, or equivalency, between the 2D landmarks $\mathbf{U}$ and $\mathbf{P}$, i.e., the 3D shape parameters $\mathbf{p}$ and the camera projection matrix $\mathbf{M}$. Given that almost all the training images for face alignment have only 2D labels, i.e., $\mathbf{U}$, we perform a data augmentation step similar to [53] to compute their corresponding $\mathbf{P}$. Given an input image, our goal is to estimate the parameter $\mathbf{P}$, based on which the 2D landmarks and their visibilities can be naturally derived.

4.2.2 Proposed CNN Architecture

Our CNN architecture resembles the cascade of CNNs, while each "shallow CNN" is defined as a visualization block. Inside each block, a visualization layer based on the latest parameter estimation serves as a bridge between consecutive blocks. This design enables us to address the drawbacks of the typical cascade of CNNs identified in Section 4.1. We now describe the visualization block and CNN architecture, and dive into the details of the visualization layer in Section 4.2.3.

Visualization Block. Figure 4.3 shows the structure of our visualization block. The visualization layer generates a feature map based on the latest parameter $\mathbf{P}$ (details in Section 4.2.3). Each convolutional layer is followed by a batch normalization (BN) layer and a ReLU layer. It extracts deeper features based on the features provided by the previous visualization block and the visualization layer output. Between the two fully connected layers, the first one is followed by a ReLU layer and a dropout layer, while the second one simultaneously estimates the update of $\mathbf{M}$ and $\mathbf{p}$, denoted as $\Delta\mathbf{P}$. The outputs of the visualization block are deeper features and the new estimation of the parameters ($\Delta\mathbf{P} + \mathbf{P}$). As shown in Figure 4.3, the top part of the visualization block focuses on learning deeper features, while the bottom part utilizes those features to estimate the parameters in a ResNet-like structure [44]. During the backward pass of the training phase, the visualization block backpropagates the loss through both of its inputs to adjust the convolutional and the fully connected layers in previous blocks. This allows the block to extract better features for the next block and improve the overall parameter estimation.

Figure 4.3: A visualization block consists of a visualization layer, two convolutional layers and two fully connected layers.
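A PyTorch sketch of one visualization block is shown below. The fully connected widths (800 and 236) follow Table 4.1; the convolution channel counts and the use of padding are assumptions, and the sketch omits the pooling mentioned in Sec. 4.3.1:

```python
import torch
import torch.nn as nn

class VisualizationBlock(nn.Module):
    """One visualization block (Fig. 4.3): two conv layers (each with BN and ReLU)
    over the concatenated previous features and visualization image, then two fully
    connected layers regressing the parameter update dP."""
    def __init__(self, in_ch, out_ch, feat_hw, p_dim=236):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )
        self.fc1 = nn.Linear(out_ch * feat_hw * feat_hw, 800)
        self.drop = nn.Dropout(0.5)          # dropout rate is an assumption
        self.fc2 = nn.Linear(800, p_dim)

    def forward(self, feats, vis_map, P):
        h = self.convs(torch.cat([feats, vis_map], dim=1))  # deeper features
        g = self.drop(torch.relu(self.fc1(h.flatten(1))))
        dP = self.fc2(g)
        return h, P + dP                     # ResNet-like additive update of P
```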
CNN Architecture. The proposed architecture consists of several connected visualization blocks, as shown in Figure 4.2. The inputs include an image and an initial estimation of the parameter $\mathbf{P}^0$. The output is the final estimation of the parameter. Due to the joint optimization of all visualization blocks through backpropagation, the proposed architecture is able to converge with substantially fewer epochs during training, compared to the typical cascade of CNNs.

Loss Functions. Two types of loss functions are employed in our CNN architecture. The first one is a Euclidean loss between the estimation and the target of the parameter update, with each parameter weighted separately:

E_P^i = (\Delta\mathbf{P}^i - \Delta\bar{\mathbf{P}}^i)^{\top} \mathbf{W} (\Delta\mathbf{P}^i - \Delta\bar{\mathbf{P}}^i), \quad (4.6)

where $E_P^i$ is the loss, $\Delta\mathbf{P}^i$ is the estimation and $\Delta\bar{\mathbf{P}}^i$ is the target (or ground truth) at the $i$-th visualization block. The diagonal matrix $\mathbf{W}$ contains the weights. For each element of the shape parameter $\mathbf{p}$, its weight is the inverse of the standard deviation that was obtained from the data used in 3DMM training. To compensate for the relative scale among the parameters of $\mathbf{M}$, we compute the ratio $r$ between the average of the scaled rotation parameters and the average of the translation parameters in the training data. We set the weights of the scaled rotation parameters of $\mathbf{M}$ to $\frac{1}{r}$ and the weights of the translations of $\mathbf{M}$ to 1. The second type of loss function is the Euclidean loss on the resultant 2D landmarks:

E_S^i = \left\lVert f(\mathbf{P}^i + \Delta\mathbf{P}^i) - \bar{\mathbf{U}} \right\rVert^2, \quad (4.7)

where $\bar{\mathbf{U}}$ is the ground truth 2D landmarks, and $\mathbf{P}^i$ is the input parameter to the $i$-th block, i.e., the output of the $(i-1)$-th block. $f(\cdot)$ computes the 2D landmark locations using the currently updated parameters via Equation 4.3. For backpropagation of this loss function to the parameter $\Delta\mathbf{P}$, we use the chain rule to compute the gradient,

\frac{\partial E_S^i}{\partial \Delta\mathbf{P}^i} = \frac{\partial E_S^i}{\partial f} \frac{\partial f}{\partial \Delta\mathbf{P}^i}.

For the first three visualization blocks, the Euclidean loss on the parameter updates (Equation 4.6) is used, while the Euclidean loss on the 2D landmarks (Equation 4.7) is applied to the last three blocks. The first three blocks estimate parameters to roughly align the 3D shape to the face image and the last three blocks leverage the good initialization to estimate the parameters and the 2D landmark locations more precisely.

4.2.3 Visualization Layer

Several visualization techniques have been explored for facial analysis. In particular, Z-Buffering, which is widely used in prior works [12, 13], is a simple and fast 2D representation of the 3D shape. However, this representation is not differentiable. In contrast, our visualization is based on the surface normals of the 3D face, which describe the surface's orientation in a local neighbourhood. It has been successfully utilized for different facial analysis tasks, e.g., 3D face reconstruction [95] and 3D face recognition [79].

In this work, we use the $z$ coordinate of the surface normal of each vertex, transformed with the pose. It is an indicator of the "frontability" of a vertex, i.e., the amount that the surface normal is pointing towards the camera. This quantity is used to assign an intensity value at its projected 2D location to construct the visualization image. The frontability measure $\mathbf{g}$, a $Q$-dimensional vector, can be computed as,

\mathbf{g} = \max\left( \mathbf{0},\ \frac{(\mathbf{m}_1 \times \mathbf{m}_2)}{\lVert \mathbf{m}_1 \rVert\, \lVert \mathbf{m}_2 \rVert}\, \mathbf{N}_0 \right), \quad (4.8)

where $\times$ is the cross product, and $\lVert \cdot \rVert$ denotes the $L_2$ norm. The $3 \times Q$ matrix $\mathbf{N}_0$ is the surface normal vectors of a 3D face shape. To avoid the high computational cost of calculating the surface normals after each shape update, we approximate $\mathbf{N}_0$ with the surface normals of the mean 3D face. Note that both the face shape and pose are still continuously updated across the visualization blocks, and are used to determine the projected 2D locations. Hence, this approximation only slightly affects the intensity values. To transform the surface normals based on the pose, we apply the estimate of the scaled rotation matrix ($\mathbf{m}_1$ and $\mathbf{m}_2$) to the surface normals computed from the mean face. The value is then truncated with the lower bound of 0 (Equation 4.8).
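Eqn. 4.8 in NumPy form, where N0 holds the (approximated) mean-face surface normals:

```python
import numpy as np

def frontability(m1, m2, N0):
    """Eq. 4.8: per-vertex frontability from pose-transformed surface normals.
    m1, m2: the two scaled-rotation rows of M; N0: (3, Q) surface normals."""
    cam_axis = np.cross(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2))
    return np.maximum(0.0, cam_axis @ N0)    # (Q,), truncated at zero
```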
The pixel intensity of a visualized image $\mathbf{V}(u, v)$ is computed as the weighted average of the frontability measures within a local neighbourhood:

\mathbf{V}(u, v) = \frac{\sum_{q \in \mathbb{D}(u,v)} \mathbf{g}(q)\, \mathbf{a}(q)\, w(u, v, x_q^t, y_q^t)}{\sum_{q \in \mathbb{D}(u,v)} w(u, v, x_q^t, y_q^t)}, \quad (4.9)

where $\mathbb{D}(u, v)$ is the set of indexes of vertexes whose 2D projected locations are within the local neighborhood of the pixel $(u, v)$, and $(x_q^t, y_q^t)$ is the 2D projected location of the $q$-th 3D vertex. The weight $w$ is the distance metric between the pixel $(u, v)$ and the projected location $(x_q^t, y_q^t)$,

w(u, v, x_q^t, y_q^t) = \exp\left( -\frac{(u - x_q^t)^2 + (v - y_q^t)^2}{2\sigma^2} \right). \quad (4.10)

The $Q$-dim vector $\mathbf{a}$ is a mask with positive values for vertexes in the middle area of the face and negative values for vertexes around the contour area of the face:

\mathbf{a}(q) = \exp\left( -\frac{(x_n - x_q^p)^2 + (y_n - y_q^p)^2 + (z_n - z_q^p)^2}{2\sigma_n^2} \right), \quad (4.11)

where $(x_n, y_n, z_n)$ is the vertex coordinate of the nose tip. $\mathbf{a}$ is pre-computed and normalized to zero mean and unit standard deviation. The mask is utilized to discriminate between the middle and contour areas of the face. A visualization of the mask is provided in Figure 4.4.

Figure 4.4: The frontal and side views of the mask a that has positive values in the middle and negative values in the contour area.

Since the human face is a 3D object, visualizing it at an arbitrary view angle requires the estimation of the visibility of each 3D vertex. To avoid the computationally expensive visibility test via rendering, we adopt two strategies for approximation. Firstly, we prune the vertexes whose frontability measures $\mathbf{g}$ equal 0, i.e., the vertexes pointing against the camera. Secondly, if multiple vertexes project to the same image pixel, we keep only the one with the smallest depth value. An illustration is provided in Figure 4.5.

Figure 4.5: An example with four vertexes projected to the same pixel. Two of them have negative values in the z component of their normals (red arrows). Between the other two with positive values, the one with the smaller depth (closer to the image plane) is selected.

Backpropagation. To allow backpropagation of the loss functions through the visualization layer, we compute the derivative of $\mathbf{V}$ with respect to the elements of the parameters $\mathbf{M}$ and $\mathbf{p}$. Firstly, we compute the partial derivatives $\frac{\partial \mathbf{g}}{\partial m_k}$, $\frac{\partial w(u,v,x_i^t,y_i^t)}{\partial m_k}$ and $\frac{\partial w(u,v,x_i^t,y_i^t)}{\partial p_j}$. Then the derivatives $\frac{\partial \mathbf{V}}{\partial m_k}$ and $\frac{\partial \mathbf{V}}{\partial p_j}$ can be computed based on Equation 4.9.
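A dense-loop NumPy sketch of Eqns. 4.9-4.10, pruning back-facing vertexes as described above. For brevity it omits the second approximation (keeping only the smallest-depth vertex per pixel), and the neighborhood radius and σ are illustrative values:

```python
import numpy as np

def visualize(S, M, g, a, size=57, sigma=1.0, radius=2):
    """Splat each visible vertex's masked frontability into a Gaussian-weighted
    average around its projected pixel.
    S: (3, Q) 3D shape; M: (2, 4) projection; g, a: (Q,) frontability and mask."""
    xy = M @ np.vstack([S, np.ones((1, S.shape[1]))])   # projected 2D locations
    V = np.zeros((size, size))
    Wsum = np.zeros((size, size))
    for q in np.flatnonzero(g > 0):                     # prune back-facing vertexes
        x, y = xy[:, q]
        for v in range(max(0, int(y) - radius), min(size, int(y) + radius + 1)):
            for u in range(max(0, int(x) - radius), min(size, int(x) + radius + 1)):
                w = np.exp(-((u - x) ** 2 + (v - y) ** 2) / (2 * sigma ** 2))
                V[v, u] += w * g[q] * a[q]
                Wsum[v, u] += w
    return V / np.maximum(Wsum, 1e-8)                   # Eq. 4.9 weighted average
```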
4.3.1 Quantitative Evaluations on AFLW and AFW

The AFLW dataset [61] is very challenging with large-pose face images (up to 90 degrees of yaw). We use the subset with 3,901 training images and 1,299 testing images released by [53]. All face images in this subset are labeled with 34 landmarks and a bounding box. The AFW dataset [151] contains 205 images with 468 faces. Each image is labeled with at most 6 landmarks with visibility labels, as well as a bounding box. AFW is used only for testing in our experiments.

Table 4.1: The number and size of convolutional filters in each visualization block. For all blocks, the two fully connected layers have the same lengths of 800 and 236.

  Block #     1          2          3          4          5, 6
  Conv.       12 (5x5)   20 (3x3)   28 (3x3)   36 (3x3)   40 (3x3)
  layers      16 (5x5)   24 (3x3)   32 (3x3)   40 (3x3)   40 (3x3)

Table 4.2: NME (%) of four methods on the AFLW dataset.

  Proposed method   Extended-PIFA [53]   PIFA [52]   RCPR [22]
  4.45              4.72                 8.04        6.26

Table 4.3: NME (%) of the proposed method at each visualization block on the AFLW dataset. The initial NME is 25.8%.

  Block #   1      2      3      4      5      6
  NME       9.26   6.77   5.51   4.98   4.60   4.45

The bounding boxes in both datasets are used as the initialization for our algorithm, as well as for the baselines. We crop the region inside the bounding box and normalize it to 114x114. Due to the memory constraint of GPUs, we have a pooling layer in the first visualization block after the first convolutional layer to decrease the size of the feature maps by half. The input to the subsequent visualization blocks is of size 57x57. To augment the training data, we generate 20 different variations for each training image by adding noise to the location, width and height of the bounding boxes.

For quantitative evaluations, we use two conventional metrics. The first one is Mean Average Pixel Error (MAPE) [135], which is the average of the pixel errors for the visible landmarks. The other one is Normalized Mean Error (NME), i.e., the average of the normalized estimation errors of visible landmarks. The normalization factor is the square root of the face bounding box size [52], instead of the eye-to-eye distance used in frontal-view face alignment.

We compare our method with several state-of-the-art large-pose face alignment approaches. On AFLW, we compare with Extended-PIFA [53], PIFA [52] and RCPR [22] using the NME metric. Table 4.2 shows that our proposed method achieves a higher accuracy than the alternatives. The heatmap-based method named CALE [21] reported an NME of 2.96%, but suffers from several disadvantages. To demonstrate the capabilities of each visualization block, the NME computed using the estimated P after each block is shown in Table 4.3. If a higher alignment speed is desirable, it is possible to skip the last two visualization blocks with a reasonable NME. On the AFW dataset, comparisons are conducted with Extended-PIFA [53], PIFA [52], CDM [135] and TSPM [151] using the MAPE metric. The results in Table 4.4 again show the superiority of the proposed method.

Some examples of alignment results of the proposed method on the AFLW and AFW datasets are shown in Figure 4.9. Three examples of the visualization layer output at each visualization block are shown in Figure 4.10.

4.3.2 Evaluation on the 300W dataset

While our main goal is PIFA, we also evaluate on the most widely used near-frontal 300W dataset [96]. 300W contains 3,148 training and 689 testing images, which are divided into the common and challenging sets with 554 and 135 images, respectively. Table 4.5 shows the NME (normalized by the interocular distance) of the evaluated methods. The most relevant method is 3DDFA [147], which also estimates M and p. Our method outperforms it on both the common and challenging sets. Methods that do not employ shape constraints, e.g., via 3DMM, generally have higher freedom and could achieve slightly better accuracy on frontal face cases. Nonetheless, they are typically less robust in more challenging cases. Another comparison is with MDM [107] via the failure rate using a threshold of 0.08. The failure rates are 16.83% (ours) versus 6.80% (MDM) with 68 landmarks, and 8.99% (ours) versus 4.20% (MDM) with 51 landmarks.
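For reference, the NME used in Tables 4.2 and 4.3 can be computed with the short sketch below, a hypothetical helper assuming the bounding-box-size normalization of [52].

```python
import numpy as np

def nme(pred, gt, visible, bbox_w, bbox_h):
    """Normalized Mean Error on visible landmarks: the per-image mean
    landmark error divided by the square root of the face bounding-box
    size, as described in the evaluation protocol above.

    pred, gt : (N, 2) predicted / ground-truth landmark locations.
    visible  : (N,) boolean visibility labels.
    """
    errs = np.linalg.norm(pred[visible] - gt[visible], axis=1)
    return errs.mean() / np.sqrt(bbox_w * bbox_h)

pred = np.array([[10., 10.], [20., 20.]])
gt = np.array([[11., 10.], [20., 22.]])
print(nme(pred, gt, np.array([True, True]), 100, 100))  # -> 0.015
```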
Table 4.4: MAPE of five methods on the AFW dataset.

  Proposed method   Extended-PIFA [53]   PIFA [52]   CDM [135]   TSPM [151]
  6.27              7.43                 8.61        9.13        11.09

Table 4.5: The NME of different methods on the 300W dataset.

  Method            Common   Challenging   Full
  ESR [26]          5.28     17.00         7.58
  RCPR [22]         6.18     17.26         8.35
  SDM [124]         5.57     15.40         7.50
  LBF [93]          4.95     11.98         6.32
  CFSS [147]        4.73     9.98          5.76
  RCFA [116]        4.03     9.85          5.32
  RAR [122]         4.12     8.35          4.94
  3DDFA [147]       6.15     10.59         7.01
  3DDFA+SDM         5.53     9.56          6.31
  Proposed method   5.43     9.88          6.30

4.3.3 Analysis of the Visualization Layer

We perform four sets of experiments to study the properties of the visualization layer and network architectures.

Influence of visualization layers. To analyze the influence of the visualization layer in the testing phase, we add 5% noise to the fully connected layer parameters of each visualization block, and compute the alignment error on the AFLW test set. The NMEs are [4.46, 4.53, 4.60, 4.66, 4.80, 5.16] when each block is perturbed separately. This analysis shows that the visualized images have more influence on the later blocks, since imprecise parameters of early blocks can be compensated by later blocks. In another experiment, we train the network without any visualization layers. The NME on AFLW is 7.18%, which shows the importance of the visualization layers in guiding the network training.

Advantage of deeper features. We train the three CNN architectures shown in Figure 4.6 on AFLW. The inputs of the visualization blocks in the first architecture are the input image I, the feature maps F and the visualization image V. The inputs of the second and the third architectures are {F, V} and {I, V}, respectively. The NME of each architecture is shown in Table 4.6.

Table 4.6: The NME (%) of three architectures with different inputs (I: input image, V: visualization, F: feature maps).

  Architecture a   Architecture b   Architecture c
  (I, F, V)        (F, V)           (I, V)
  4.45             4.48             5.06

While the first one performs the best, the substantially lower performance of the third one demonstrates the importance of the deeper features learned across blocks.

At the first convolutional layer of each visualization block, we compute the average of the filter weights, across both the kernel size and the number of maps. The averages for the three types of input features are shown in Figure 4.7. As observed, the weights decrease across blocks, leading to a more precise estimation of the small-scale parameter updates. Considering the number of filters in Table 4.1, the total impact of the feature maps is higher than that of the other two inputs in all blocks. This again shows the importance of deeper features in guiding the network to estimate the parameters. Furthermore, the average weight of the visualization is higher than that of the input image, demonstrating the stronger influence of the proposed visualization during training.

Figure 4.6: Architectures of the three CNNs with different inputs: (a), (b), (c).

Table 4.7: NME (%) when different masks are used.

  Mask 1   Mask 2   No Mask
  4.45     4.49     5.31

Advantage of using masks. To show the advantage of using the mask in the visualization layer, we conduct an experiment with different masks. Specifically, we design another mask for comparison, as shown in Figure 4.8. It has five positive areas, i.e., the eyes, the nose tip and the two lip corners. The values are normalized to zero mean and unit standard deviation. Compared to the original mask in Fig. 4.4, this mask is more complicated and conveys more information about the informative facial areas to the network. Moreover, to show the necessity of using the mask, we also test using visualization layers without any mask. The NMEs of the trained networks with the different masks are shown in Table 4.7. Comparison between the first and the third columns shows the benefit of using the mask, by differentiating the middle and contour areas of the face. By comparing the first and second columns, we can see that utilizing a more complicated mask does not further improve the result, indicating that the original mask provides sufficient information for its purpose.
Different numbers of blocks and layers. Given the total number of 12 convolutional layers in our network, we can partition them into visualization blocks of various sizes. To compare their performance, we train two additional CNNs. The first one consists of 4 visualization blocks with 3 convolutional layers in each. The other comes with 3 blocks and 4 convolutional layers per block. Hence, all three architectures have 12 convolutional layers in total. The NMEs of these architectures are shown in Table 4.8. Similar to [22], this shows that the number of regressors is important for face alignment, and we can potentially achieve a higher accuracy by increasing the number of visualization blocks.

Figure 4.7: The average of the filter weights for the input image, visualization and feature maps in the three architectures of Figure 4.6. The y-axis and x-axis show the average and the block index, respectively.

Figure 4.8: Mask 2, a differently designed mask with five positive areas on the eyes, the top of the nose and the sides of the lip.

Table 4.8: NME (%) when using different numbers of visualization blocks (N_v) and convolutional layers (N_c).

  N_v = 6, N_c = 2   N_v = 4, N_c = 3   N_v = 3, N_c = 4
  4.45               4.61               4.83

4.3.4 Time complexity

Compared to the cascade of CNNs, one of the main advantages of end-to-end training of a single CNN is the reduced training time. The proposed method needs 33 epochs, which take around 2.5 days. With the same training and testing datasets, [53] requires 70 epochs for each CNN. With a total of six CNNs, it needs around 7 days. Similarly, the method in [147] needs around 12 days to train three CNNs, each one with 20 epochs, despite using different training data. Compared to [53], our method reduces the training time by more than half. The testing speed of the proposed method is 4.3 FPS on a Titan X GPU. It is much faster than the 0.6 FPS speed of [53] and is similar to the 4 FPS speed of [122].

Figure 4.9: Results of alignment on the AFLW and AFW datasets; green landmarks show the estimated locations of visible landmarks and red landmarks show the estimated locations of invisible landmarks. First row: provided bounding box by AFLW with initial locations of landmarks; second: estimated 3D dense shapes; third: estimated landmarks; fourth to sixth: estimated landmarks for AFLW; seventh: estimated landmarks for AFW.

Figure 4.10: Three examples of the outputs of the visualization layer at each visualization block (initialization and Blocks 1 to 6). The first row shows that the proposed method recovers the expression of the face gracefully; the third row shows the visualizations of a face with a more challenging pose.

4.4 Summary

We propose a pose-invariant face alignment method with end-to-end training in a single CNN. The key is a differentiable visualization layer, which is integrated into the network and enables joint optimization by backpropagating the error from later visualization blocks to earlier ones. It allows each visualization block to utilize the features extracted from previous blocks and to extract deeper features, without relying on hand-crafted features. In addition, the proposed method converges faster during the training phase compared to the cascade of CNNs. Through extensive experiments, we demonstrate the superior results of the proposed method over the state-of-the-art approaches.
Chapter 5

Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision

5.1 Introduction

With the increasing use of smart devices in our daily lives, people are seeking secure and convenient ways to access their personal information. Biometrics, such as face, fingerprint, and iris, are widely utilized for person authentication due to their intrinsic distinctiveness and convenience to use. Face, as one of the most popular modalities, has received increasing attention in academia and industry in recent years (e.g., iPhone X). However, the attention also brings a growing incentive for hackers to design biometric presentation attacks (PA), or spoofs, to be authenticated as the genuine user. Due to the almost no-cost access to the human face, a spoof face can be as simple as a printed photo paper (i.e., print attack) or a digital image/video (i.e., replay attack), or as complicated as a 3D mask or facial cosmetic makeup. With proper handling, those spoofs can be visually very close to the genuine user's live face. As a result, these call for the need to develop robust face anti-spoofing algorithms.

Figure 5.1: Conventional CNN-based face anti-spoofing approaches utilize binary supervision, which may lead to overfitting given the enormous solution space of CNNs. This work designs a novel network architecture to leverage two types of auxiliary information as supervision, the depth map and the rPPG signal, with the goals of improved generalization and explainable decisions during inference.

RGB images and video are the standard input to face anti-spoofing systems, since the majority of face recognition systems adopt RGB cameras as the sensor. Researchers started with texture-based approaches by feeding handcrafted features to binary classifiers [18, 33, 34, 59, 77, 84, 131]. Later, in the deep learning era, several Convolutional Neural Network (CNN) approaches utilize the softmax loss as the supervision [37, 66, 83, 130]. It appears that almost all prior works regard the face anti-spoofing problem as merely a binary (live vs. spoof) classification problem.

There are two main issues in learning deep anti-spoofing models with binary supervision. First, there are different levels of image degradation, namely spoof patterns, when comparing a spoof face to a live one, which consist of skin detail loss, color distortion, moiré pattern, shape deformation and spoof artifacts (e.g., reflection) [65, 84]. A CNN with softmax loss might discover arbitrary cues that are able to separate the two classes, such as a screen bezel, but not the faithful spoof patterns. When those cues disappear during testing, such models fail to distinguish spoof vs. live faces and result in poor generalization. Second, during testing, models learnt with binary supervision will only generate a binary decision without explanation or rationale for the decision. In the pursuit of Explainable Artificial Intelligence [1], it is desirable for the learnt model to generate the spoof patterns that support the final binary decision.

To address these issues, as shown in Fig. 5.1, we propose a deep model that uses supervision from both spatial and temporal auxiliary information rather than binary supervision, for the purpose of robustly detecting face PA from a face video. This auxiliary information is acquired based on our domain knowledge about the key differences between live and spoof faces, which include two perspectives: spatial and temporal. From the spatial perspective, it is known that live faces have face-like depth, e.g., the nose is closer to the camera than the cheek in frontal-view faces, while faces in print or replay attacks have flat or planar depth, e.g., all pixels on the image of a paper have the same depth to the camera. Hence, depth can be utilized as auxiliary information to supervise both live and spoof faces. From the temporal perspective, it was shown that the normal rPPG signals (i.e., the heart pulse signal) are detectable from live, but not spoof, face videos [69, 81].
Therefore, we provide different rPPG signals as auxiliary supervision, which guides the network to learn from live or spoof face videos respectively. To enable both supervisions, we design a network architecture with a short-cut connection to capture different scales and a novel non-rigid registration layer to handle the motion and pose change for rPPG estimation.

Furthermore, similar to other vision problems, data plays a significant role in training the anti-spoofing models. As we know, camera/screen quality is a critical factor to the quality of spoof faces. Existing face anti-spoofing databases, such as NUAA [104], CASIA [143], Replay-Attack [28], and MSU-MFSD [117], were collected 3-5 years ago. Given the fast development pace of consumer electronics, the types of equipment (e.g., cameras and spoofing mediums) used in those data collections are outdated compared to the ones available nowadays, regarding resolution and imaging quality. The more recent MSU-USSA [84] and OULU [19] databases have subjects with fewer variations in facial poses, illuminations, and expressions (PIE). The lack of necessary variations would make it hard to learn an effective model. Given the clear need for more advanced databases, we collect a face anti-spoofing database for training and evaluation, named the Spoof in the Wild (SiW) database. The SiW database consists of 299 subjects, 6 spoofing mediums, and 4 sessions covering variations such as PIE, distance-to-camera, etc. SiW covers much larger variations than previous databases, as detailed in Tab. 5.1 and Sec. 5.3.

Figure 5.2: The overview of the proposed method.

The main contributions of this work include:

- We propose to leverage novel auxiliary information (i.e., depth map and rPPG) to supervise the CNN learning for improved generalization.

- We propose a novel CNN-RNN architecture for end-to-end learning of the depth map and rPPG signal.

- We release a new database that contains variations of PIE and other practical factors. We achieve state-of-the-art performance for face anti-spoofing.

5.2 Face Anti-Spoofing with Deep Network

The main idea of the proposed approach is to guide the deep network to focus on the known spoof patterns across the spatial and temporal domains, rather than to extract any cues that could separate the two classes but are not generalizable. As shown in Fig. 5.2, the proposed network combines CNN and RNN architectures in a coherent way. The CNN part utilizes the depth map supervision to discover subtle texture properties that lead to distinct depths for live and spoof faces. Then, it feeds the estimated depth and the feature maps to a novel non-rigid registration layer to create aligned feature maps. The RNN part is trained with the aligned maps and the rPPG supervision, which examines the temporal variability across video frames.

Figure 5.3: The proposed CNN-RNN architecture. The number of filters is shown on top of each layer; the size of all filters is 3x3, with stride 1 for convolutional and 2 for pooling layers. Color code used: orange = convolution, green = pooling, purple = response map.

5.2.1 Depth Map Supervision

Depth maps are a representation of the 3D shape of the face in a 2D image, which shows the face location and the depth information of different facial areas. This representation is more informative than binary labels, since it indicates one of the fundamental differences between live faces and print and replay PAs. We utilize the depth maps in the depth loss function to supervise the CNN part. The pixel-based depth loss guides the CNN to learn a mapping from the face area within a receptive field to a labeled depth value: a scale within [0, 1] for live faces and 0 for spoof faces.
To estimate the depth map for a 2D face image, we utilize the state-of-the-art dense face alignment (DeFA) methods [54, 74] to estimate the 3D shape of the face. The frontal dense 3D shape S_F \in R^{3 \times Q}, with Q vertices, is represented as a linear combination of identity bases \{S_{id}^i\}_{i=1}^{N_{id}} and expression bases \{S_{exp}^i\}_{i=1}^{N_{exp}},

S_F = S_0 + \sum_{i=1}^{N_{id}} \alpha_{id}^i S_{id}^i + \sum_{i=1}^{N_{exp}} \alpha_{exp}^i S_{exp}^i,   (5.1)

where \alpha_{id} \in R^{199} and \alpha_{exp} \in R^{29} are the identity and expression parameters, and \alpha = [\alpha_{id}, \alpha_{exp}] are the shape parameters. We utilize the Basel 3D face model [86] and FaceWarehouse [25] as the identity and expression bases.

With the estimated pose parameters P = (s, R, t), where R \in R^{3 \times 3} is a rotation matrix, t \in R^3 is a 3D translation, and s is a scale, we align the 3D shape S to the 2D face image:

S = s R S_F + t.   (5.2)

Given the challenge of estimating the absolute depth from a 2D face, we normalize the z values of the 3D vertices in S to be within [0, 1]. That is, the vertex closest to the camera (e.g., the nose) has a depth of one, and the vertex furthest away has a depth of zero. Then, we apply the Z-Buffer algorithm [149] to S to project the normalized z values onto a 2D plane, which results in an estimated "ground truth" 2D depth map D \in R^{32 \times 32} for a face image.

5.2.2 rPPG Supervision

rPPG signals have recently been utilized for face anti-spoofing [69, 81]. The rPPG signal provides temporal information about face liveness, as it is related to the changes in the intensities of facial skin over time. These intensity changes are highly correlated with the blood flow. The traditional method [35] for extracting rPPG signals has three drawbacks. First, it is sensitive to pose and expression variations, since it becomes harder to track a specific face area for measuring intensity changes. Second, it is also sensitive to changes in illumination, since the extra lighting affects the amount of light reflected from the skin surface. Third, for the purpose of anti-spoofing, the rPPG signal extracted from spoof videos might not be sufficiently distinguishable from the signal of live videos.

One novel aspect of our approach is that, instead of computing the rPPG signal via [35], our RNN part learns to estimate the rPPG signal. This eases the signal estimation from face videos with PIE variations, and also leads to more discriminative rPPG signals, as different rPPG supervisions are provided to live vs. spoof videos. We assume that the videos of the same subject under different PIE conditions have the same ground truth rPPG signal. This assumption is valid since the heart beat is similar across videos of the same subject that are captured within a short span of time (< 5 minutes). The rPPG signal extracted from the constrained videos (i.e., no PIE variation) is used as the "ground truth" supervision in the rPPG loss function for all live videos of the same subject. This consistent supervision helps the CNN and RNN parts to be robust to the PIE changes.

In order to extract the rPPG signal from a face video without PIE variation, we apply DeFA [74] to each frame and estimate the dense 3D face shape. We utilize the estimated 3D shape to track a face region. For a tracked region, we compute two orthogonal chrominance signals x_f = 3r_f - 2g_f and y_f = 1.5r_f + g_f - 1.5b_f, where r_f, g_f, b_f are the bandpass filtered versions of the r, g, b channels with skin-tone normalization. We utilize the ratio of the standard deviations of the chrominance signals, \gamma = \sigma(x_f) / \sigma(y_f), to compute the blood flow signal [35]. We calculate the signal p as:

p = 3\left(1 - \frac{\gamma}{2}\right) r_f - 2\left(1 + \frac{\gamma}{2}\right) g_f + \frac{3\gamma}{2} b_f.   (5.3)

By applying the FFT to p, we obtain the rPPG signal f \in R^{50}, which shows the magnitude of each frequency.
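The "ground truth" rPPG extraction of Eq. 5.3 can be sketched as follows. This is a minimal SciPy/NumPy illustration under stated assumptions: r, g, b are the mean skin-tone-normalized color traces of a tracked face region sampled at fs Hz, and the band-pass range (roughly the 0.7 to 4 Hz heart-rate band) is our choice, not specified in the thesis.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def rppg_signal(r, g, b, fs=30.0, n_bins=50):
    """Chrominance-based rPPG sketch (Eq. 5.3), followed by an FFT to
    obtain the 50-bin magnitude spectrum f, normalized to unit L2 norm."""
    bp = lambda x: filtfilt(*butter(3, [0.7, 4.0], btype='band', fs=fs), x)
    rf, gf, bf = bp(r), bp(g), bp(b)           # bandpass filtered channels
    xf = 3 * rf - 2 * gf                       # chrominance signals
    yf = 1.5 * rf + gf - 1.5 * bf
    gamma = np.std(xf) / np.std(yf)
    p = (3 * (1 - gamma / 2) * rf
         - 2 * (1 + gamma / 2) * gf
         + (3 * gamma / 2) * bf)
    f = np.abs(np.fft.rfft(p))[:n_bins]        # magnitude of each frequency
    return f / np.linalg.norm(f)               # so that ||f||_2 = 1
```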
5.2.3 Network Architecture

Our proposed network consists of two deep networks. First, a CNN part evaluates each frame separately and estimates the depth map and feature map of each frame. Second, a recurrent neural network (RNN) part evaluates the temporal variability across the feature maps of a sequence.

5.2.3.1 CNN Network

We design a Fully Convolutional Network (FCN) as our CNN part, as shown in Fig. 5.3. The CNN part contains multiple blocks of three convolutional layers, pooling and resizing layers, where each convolutional layer is followed by one exponential linear unit (ELU) layer and a batch normalization layer. Then, the resizing layers resize the response maps after each block to a predefined size of 64x64 and concatenate the response maps. The bypass connections help the network to utilize extracted features from layers with different depths, similar to the ResNet structure [44]. After that, our CNN has two branches, one for estimating the depth map and the other for estimating the feature map.

The first output of the CNN is the estimated depth map of the input frame I \in R^{256 \times 256}, which is supervised by the estimated "ground truth" depth D,

\Theta_D = \arg\min_{\Theta_D} \sum_{i=1}^{N_d} \| CNN_D(I_i; \Theta_D) - D_i \|_1^2,   (5.4)

where \Theta_D are the CNN parameters and N_d is the number of training images. The second output of the CNN is the feature map, which is fed into the non-rigid registration layer.

5.2.3.2 RNN Network

The RNN part aims to estimate the rPPG signal f of an input sequence with N_f frames \{I_j\}_{j=1}^{N_f}. As shown in Fig. 5.3, we utilize one LSTM layer with 100 hidden neurons, one fully connected layer, and an FFT layer that converts the response of the fully connected layer into the Fourier domain. Given the input sequence \{I_j\}_{j=1}^{N_f} and the "ground truth" rPPG signal f, we train the RNN to minimize the \ell_1 distance of the estimated rPPG signal to the "ground truth" f,

\Theta_R = \arg\min_{\Theta_R} \sum_{i=1}^{N_s} \| RNN_R([\{F_j\}_{j=1}^{N_f}]_i; \Theta_R) - f_i \|_1^2,   (5.5)

where \Theta_R are the RNN parameters, F_j \in R^{32 \times 32} is the frontalized feature map (details in Sec. 5.2.4), and N_s is the number of sequences.

Figure 5.4: Example ground truth depth maps and rPPG signals.

5.2.3.3 Implementation Details

Ground Truth Data. Given a set of live and spoof face videos, we provide the ground truth supervision for the depth map D and the rPPG signal f. We follow the procedure in Sec. 5.2.1 to compute the "ground truth" data for live videos. For spoof videos, we set the ground truth depth maps to a plain surface, i.e., zero depth. Similarly, we follow the procedure in Sec. 5.2.2 to compute the "ground truth" rPPG signal from a patch on the forehead, for one live video of each subject without PIE variation. Also, we normalize the norm of the estimated rPPG signal such that ||f||_2 = 1. For spoof videos, we consider the rPPG signals to be zero. Fig. 5.4 shows examples of the ground truth depth maps and rPPG signals.

Note that, while the term "depth" is used here, our estimated depth is different from the conventional depth map in computer vision. It can be viewed as a "pseudo-depth" that serves the purpose of providing discriminative auxiliary supervision to the learning process. The same perspective applies to the supervision based on the pseudo-rPPG signal.

Training Strategy. Our proposed network combines the CNN and RNN parts for end-to-end training. The desired training data for the CNN part should come from diverse subjects, so as to make the training procedure more stable and increase the generalizability of the learnt model. Meanwhile, the training data for the RNN part should be long sequences, to leverage the temporal information across frames. These two preferences can be contradictory, especially given the limited GPU memory. Hence, to satisfy both preferences, we design a two-stream training strategy, described next.
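Before detailing the two streams, the per-sample form of the two auxiliary losses (Eqs. 5.4 and 5.5) can be written down directly. The sketch below is a minimal NumPy reading, assuming the squared-L1 interpretation of the norms; it is illustrative only, not the thesis implementation.

```python
import numpy as np

def depth_loss(d_pred, d_gt):
    """Per-frame depth loss of Eq. 5.4: squared L1 distance between the
    estimated 32x32 depth map and the 'ground truth' pseudo-depth
    (face-like depth for live frames, zeros for spoof frames)."""
    return np.abs(d_pred - d_gt).sum() ** 2

def rppg_loss(f_pred, f_gt):
    """Per-sequence rPPG loss of Eq. 5.5: squared L1 distance between
    the estimated spectrum and the 'ground truth' rPPG (zeros for spoof)."""
    return np.abs(f_pred - f_gt).sum() ** 2

# Spoof targets: a zero depth map and a zero rPPG spectrum.
d_gt_spoof, f_gt_spoof = np.zeros((32, 32)), np.zeros(50)
print(depth_loss(np.full((32, 32), 0.01), d_gt_spoof))  # penalizes nonzero depth
```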
The first stream satisfies the preference of the CNN part, where the input includes face images I and the ground truth depth maps D. The second stream satisfies the preference of the RNN part, where the input includes face sequences \{I_j\}_{j=1}^{N_f}, the ground truth depth maps \{D_j\}_{j=1}^{N_f}, the estimated 3D shapes \{S_j\}_{j=1}^{N_f}, and the corresponding ground truth rPPG signals f. During training, our method alternates between these two streams to converge to a model that minimizes both the depth map and rPPG losses. Note that even though the first stream only updates the weights of the CNN part, the backpropagation of the second stream updates the weights of both the CNN and RNN parts in an end-to-end manner.

Testing. To provide a classification score, we feed the testing sequence to our network and compute the estimated depth map \hat{D} of the last frame and the rPPG signal \hat{f}. To avoid the overfitting that could come from utilizing a binary loss function in the network, we compute the final score as:

score = \| \hat{f} \|_2^2 + \lambda \| \hat{D} \|_2^2,   (5.6)

where \lambda is a constant weight for combining the two outputs of the network.

Figure 5.5: The non-rigid registration layer.

5.2.4 Non-rigid Registration Layer

We design a new non-rigid registration layer to prepare the data for the RNN part. This layer utilizes the estimated dense 3D shape to align the activations or feature maps from the CNN part. This layer is important to ensure that the RNN tracks and learns the changes of the activations for the same facial area across time, as well as across all subjects.

As shown in Fig. 5.5, this layer has three inputs: the feature map T \in R^{32 \times 32}, the depth map \hat{D} and the 3D shape S. Within this layer, we threshold the depth map and generate a binary mask V \in R^{32 \times 32}:

V = \hat{D} \geq threshold.   (5.7)

Then, we compute the inner product of the binary mask and the feature map, U = T \odot V, which essentially utilizes the depth map as a visibility indicator for each pixel in the feature map. If the depth value for one pixel is less than the threshold, we consider that pixel to be invisible. Finally, we frontalize U by utilizing the estimated 3D shape S,

F(i, j) = U(S(m_{ij}, 1), S(m_{ij}, 2)),   (5.8)

where m \in R^K is the list of K indexes of the face area in S_0, and m_{ij} is the corresponding index for the pixel (i, j). We utilize m to project the masked activation values U to the frontalized image F.

This non-rigid registration layer has three main contributions to the proposed network architecture:

- By applying the non-rigid registration, the input data are aligned and the RNN can compare the feature maps without concern about the facial pose or expression. In other words, it can learn the temporal changes in the activations of the feature maps for the same facial area.

- The non-rigid registration removes the background area from the feature map. Hence the background area does not participate in RNN learning, although the background information is already utilized in the layers of the CNN part.

- For spoof faces, the depth maps are likely to be close to zero. Hence, the inner product with the depth maps substantially weakens the activations in the feature maps, which makes it convenient for the RNN to output zero rPPG signals. Likewise, the backpropagation from the rPPG loss also encourages the CNN part to generate zero depth maps, for either all frames or one pixel location in the majority of the frames within an input sequence.
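The non-rigid registration layer of Eqs. 5.7 and 5.8 can be sketched as below. This is a simplified NumPy version; in particular, the precomputed lookup table front_idx, which maps each frontalized pixel to a location in T via the estimated 3D shape S, is our assumption about how S(m_ij, :) is realized in practice.

```python
import numpy as np

def nonrigid_register(T, D_hat, front_idx, threshold=0.1):
    """Sketch of the non-rigid registration layer.

    T         : (32, 32) feature map from the CNN part.
    D_hat     : (32, 32) estimated depth map.
    front_idx : (32, 32, 2) per frontalized pixel (i, j), the (row, col)
                in T of the corresponding face vertex (from the 3D shape).
    """
    V = (D_hat >= threshold).astype(T.dtype)     # visibility mask (Eq. 5.7)
    U = T * V                                    # suppress invisible pixels
    F = U[front_idx[..., 0], front_idx[..., 1]]  # frontalize (Eq. 5.8)
    return F

# Toy usage with an identity lookup table (no actual warping).
T = np.random.rand(32, 32)
D_hat = np.random.rand(32, 32)
idx = np.stack(np.meshgrid(np.arange(32), np.arange(32), indexing='ij'), -1)
print(nonrigid_register(T, D_hat, idx).shape)    # (32, 32)
```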
Figure 5.6: The statistics of the subjects in the SiW database. Left side: the histogram shows the distribution of the face sizes.

5.3 Collection of the Face Anti-Spoofing Database

With the fast advancement of imaging sensor technology, existing anti-spoofing systems could become vulnerable to emerging high-quality spoof mediums. One way to make a system robust to these attacks is to collect new high-quality databases. In response to this need, we collect a new face anti-spoofing database named the Spoof in the Wild (SiW) database, which has multiple advantages over previous datasets, as shown in Tab. 5.1. First, it contains substantially more live subjects with diverse races, e.g., 5 times the subjects of Oulu-NPU. Note that MSU-USSA is constructed based on existing images of celebrities without capturing live face videos. Second, the live videos are captured with two high-quality cameras (Canon EOS T6, Logitech C920 webcam) with different PIE variations.

SiW provides live and spoof videos from 299 subjects. For each subject, we have 8 live and 16 spoof videos, in total 7,170 videos. Some statistics of the subjects are shown in Fig. 5.6. The live videos are collected in four sessions. In Session 1, the subject moves his head with varying distances to the camera. In Session 2, the subject changes the yaw angle of the head within [-90°, 90°] and makes different face expressions. In Sessions 3 and 4, the subject repeats Sessions 1 and 2, while the collector moves a point light source around the face from different orientations.

Figure 5.7: Examples of the live and spoof attack videos in the SiW database. The first row shows a live subject with different PIE. The second row shows different types of the spoof attacks.

The live videos from the Canon EOS T6 and Logitech C920 webcam are in 1,920x1,080 resolution. We provide two print and four replay video attacks for each subject in SiW. Some examples of the live and spoof videos are shown in Fig. 5.7. To generate different qualities of print attacks, we capture a high-resolution image (5,184x3,456) for each subject and use it to make a high-quality print attack. Also, we extract a frontal-view frame from a live video for a lower-quality print attack. We print the images with an HP Color LaserJet M652 printer. The print attack videos are captured by holding the printed images still and warping them in front of the cameras. To generate high-quality replay attack videos, we select four spoof mediums: Samsung Galaxy S8, iPhone 7, iPad Pro, and PC screens. For each subject, we randomly select two videos from the four high-quality live videos to display on the spoof mediums.

Table 5.1: The comparison of our collected dataset with available datasets for face anti-spoofing.

  Dataset             Year  # subj.  # sess.  # live/attack vid.(V), ima.(I)  Pose range   Diff. expres.  Extra light.  Display devices                              Spoof attacks
  NUAA [104]          2010  15       3        5,105/7,509 (I)                 Frontal      No             Yes           -                                            Print
  CASIA-MFSD [143]    2012  50       3        150/450 (V)                     Frontal      No             No            iPad                                         Print, Replay
  Replay-Attack [28]  2012  50       1        200/1,000 (V)                   Frontal      No             Yes           iPhone 3GS, iPad                             Print, 2 Replay
  MSU-MFSD [117]      2015  35       1        110/330 (V)                     Frontal      No             No            iPad Air, iPhone 5S                          Print, 2 Replay
  MSU-USSA [84]       2016  1,140    1        1,140/9,120 (I)                 [-45°, 45°]  Yes            Yes           MacBook, Nexus 5, Nvidia Shield Tablet       2 Print, 6 Replay
  Oulu-NPU [19]       2017  55       3        1,980/3,960 (V)                 Frontal      No             Yes           Dell 1905FP, MacBook Retina                  2 Print, 2 Replay
  SiW                 2018  165      4        1,320/3,300 (V)                 [-90°, 90°]  Yes            Yes           iPad Pro, iPhone 7S, Galaxy S8, Asus MB168B  2 Print, 4 Replay

5.4 Experimental Results

5.4.1 Experimental Setup

Databases. We evaluate our method on multiple databases to demonstrate its generalizability. We utilize the SiW and Oulu [19] databases as new high-resolution databases and perform intra and cross testing between them. Also, we use the CASIA-MFSD [143] and Replay-Attack [28] databases for cross testing and comparison with the state of the art.

Table 5.2: TDR at different FDRs, cross testing on Oulu Protocol 1.

  FDR       1%     2%     10%    20%
  Model 1   8.5%   18.1%  71.4%  81.0%
  Model 2   40.2%  46.9%  78.5%  93.5%
  Model 3   39.4%  42.9%  67.5%  87.5%
  Model 4   45.8%  47.9%  81.0%  94.2%

Table 5.3: ACER of our method at different N_f, on Oulu Protocol 2.

  Test \ Train   5      10     20
  5              4.16%  4.16%  3.05%
  10             4.02%  3.61%  2.78%
  20             4.10%  3.67%  2.98%
Parameter setting. The proposed method is implemented in TensorFlow [2] with a constant learning rate of 3e-3 and 10 epochs of the training phase. The batch size of the CNN stream is 10 and that of the CNN-RNN stream is 2, with N_f being 5. We randomly initialize our network using a normal distribution with zero mean and a standard deviation of 0.02. We set \lambda in Eq. 5.6 to 0.015 and the threshold in Eq. 5.7 to 0.1.

Evaluation metrics. To compare with prior works, we report our results with the following metrics: the Attack Presentation Classification Error Rate (APCER) [46], the Bona Fide Presentation Classification Error Rate (BPCER) [46], ACER = (APCER + BPCER)/2 [46], and the Half Total Error Rate (HTER). The HTER is half of the summation of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR).

Table 5.4: The intra-testing results on the four protocols of Oulu.

  Prot.  Method           APCER (%)   BPCER (%)   ACER (%)
  1      CPqD             2.9         10.8        6.9
         GRADIANT         1.3         12.5        6.9
         Proposed method  1.6         1.6         1.6
  2      MixedFASNet      9.7         2.5         6.1
         Proposed method  2.7         2.7         2.7
         GRADIANT         3.1         1.9         2.5
  3      MixedFASNet      5.3±6.7     7.8±5.5     6.5±4.6
         GRADIANT         2.6±3.9     5.0±5.3     3.8±2.4
         Proposed method  2.7±1.3     3.1±1.7     2.9±1.5
  4      Massy_HNU        35.8±35.3   8.3±4.1     22.1±17.6
         GRADIANT         5.0±4.5     15.0±7.1    10.0±5.0
         Proposed method  9.3±5.6     10.4±6.0    9.5±6.0

5.4.2 Experimental Comparison

5.4.2.1 Ablation Study

Advantage of the proposed architecture. We compare four architectures to demonstrate the advantages of the proposed loss layers and non-rigid registration layer. Model 1 has an architecture similar to the CNN part in our method (Fig. 5.3), except that it is extended with additional pooling layers, fully connected layers, and a softmax loss for binary classification. Model 2 is the CNN part in our method with the depth map loss function; we simply use \|\hat{D}\|_2 for classification. Model 3 contains the CNN and RNN parts without the non-rigid registration layer. Both the depth map and rPPG loss functions are utilized in this model. However, the RNN part processes unregistered feature maps from the CNN. Model 4 is the proposed architecture.

We train all four models with the live and spoof videos from 20 subjects of SiW. We compute the cross-testing performance of all models on Protocol 1 of the Oulu database. The TDRs at different FDRs are reported in Tab. 5.2. Model 1 has poor performance due to the binary supervision. In comparison, by only using the depth map as supervision, Model 2 achieves substantially better performance. After adding the RNN part with the rPPG supervision, our proposed Model 4 achieves a further performance improvement. By comparing Models 4 and 3, we can see the advantage of the non-rigid registration layer. It is clear that the RNN part cannot use the feature maps directly for tracking the changes in the activations and estimating the rPPG signals.

Advantage of longer sequences. To show the advantage of utilizing longer sequences for estimating the rPPG, we train and test our model with sequence lengths N_f of 5, 10, and 20, using intra testing on Oulu Protocol 2. From Tab. 5.3, we can see that by increasing the sequence length, the ACER decreases, due to more reliable rPPG estimation. Despite the benefit of longer sequences, in practice we are limited by the GPU memory size and forced to decrease the image size to 128x128 for all experiments in Tab. 5.3. Hence, we set N_f to 5 with an image size of 256x256 in the subsequent experiments, due to the importance of higher resolution (e.g., a lower ACER of 2.5% in Tab. 5.4 is achieved than the 4.16% here).
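For reference, the evaluation metrics used throughout this chapter can be computed as in the sketch below. It is a simplified reading: strictly, APCER is defined per presentation-attack-instrument species and the maximum over species is reported; the sketch collapses this to a single attack set for brevity.

```python
import numpy as np

def acer_metrics(scores_attack, scores_live, tau):
    """APCER / BPCER / ACER at decision threshold tau, where a higher
    score means the sample is more likely to be live."""
    apcer = np.mean(scores_attack >= tau)   # attacks accepted as live
    bpcer = np.mean(scores_live < tau)      # live faces rejected
    return apcer, bpcer, (apcer + bpcer) / 2

def hter(far, frr):
    """Half Total Error Rate: half the sum of FAR and FRR."""
    return (far + frr) / 2

print(acer_metrics(np.array([0.1, 0.6]), np.array([0.8, 0.9]), 0.5))
# -> (0.5, 0.0, 0.25)
```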
5.4.2.2 Intra Testing

We perform intra testing on the Oulu and SiW databases. For Oulu, we follow the four protocols [15] and report their APCER, BPCER and ACER. Tab. 5.4 shows the comparison of our proposed method and the best two methods for each protocol in the face anti-spoofing competition [15]. Our method achieves the lowest ACER in 3 out of 4 protocols; we have a slightly worse ACER on Protocol 2. To set a baseline for future study on SiW, we define three protocols for SiW. Protocol 1 deals with variations in face pose and expression: we train using the first 60 frames of the training videos, which are mainly frontal-view faces, and test on all testing videos. Protocol 2 evaluates the performance of cross-spoof-medium testing for the replay attack. Protocol 3 evaluates the performance of cross-PA testing, i.e., from print attack to replay attack and vice versa. Tab. 5.5 shows the protocol definitions and our performance on each protocol.

Table 5.5: The intra-testing results on the three protocols of SiW.

  Prot.  Subset  Subject #  Attack           APCER (%)   BPCER (%)   ACER (%)
  1      Train   90         First 60 frames  3.58        3.58        3.58
         Test    75         All
  2      Train   90         3 display        0.57±0.69   0.57±0.69   0.57±0.69
         Test    75         1 display
  3      Train   90         print (display)  8.31±3.81   8.31±3.80   8.31±3.81
         Test    75         display (print)

Figure 5.8: (a) 8 successful anti-spoofing examples and their estimated depth maps and rPPG signals. (b) 4 failure examples: the first two are live and the other two are spoof. Note our ability to estimate discriminative depth maps and rPPG signals.

5.4.2.3 Cross Testing

To demonstrate the generalization of our method, we perform multiple cross-testing experiments. Our model is trained with the live and spoof videos of 80 subjects in SiW, and tested on all protocols of Oulu. The ACERs on Protocols 1-4 are, respectively: 10.0%, 14.1%, 13.8±5.7%, and 10.0±8.8%. Comparing these cross-testing results to the intra-testing results in [15], we are ranked sixth on the average ACER of the four protocols, among the 15 participants of the face anti-spoofing competition. Especially on Protocol 4, the hardest one among all protocols, we achieve the same ACER of 10.0% as the top performer. This is a notable result, since cross testing is known to be substantially harder than intra testing, and yet our cross-testing result is comparable with the top intra-testing performance. This demonstrates the generalization ability of our learnt model.

Furthermore, we utilize the CASIA-MFSD and Replay-Attack databases to perform cross testing between them, which is widely used as a cross-testing benchmark. Tab. 5.6 compares the cross-testing HTER of different methods. Our proposed method reduces the cross-testing errors on the Replay-Attack and CASIA-MFSD databases by 8.9% and 24.6% respectively, relative to the previous SOTA.

Table 5.6: Cross testing on CASIA-MFSD vs. Replay-Attack (HTER).

  Method               Train CASIA-MFSD /   Train Replay-Attack /
                       Test Replay-Attack   Test CASIA-MFSD
  Motion [34]          50.2%                47.9%
  LBP [34]             55.9%                57.6%
  LBP-TOP [34]         49.7%                60.6%
  Motion-Mag [11]      50.1%                47.0%
  Spectral cubes [90]  34.4%                50.0%
  CNN [130]            48.5%                45.5%
  LBP [16]             47.0%                39.6%
  Colour Texture [17]  30.3%                37.7%
  Proposed method      27.6%                28.4%

5.4.2.4 Visualization and Analysis

In the proposed architecture, the frontalized feature maps are utilized as input to the RNN part and are supervised by the rPPG loss function. The values of these maps can show the importance of different facial areas to the rPPG estimation. Fig. 5.9 shows the mean and standard deviation of the frontalized feature maps, computed from 1,080 live and spoof videos of Oulu. We can see that the side areas of the forehead and cheek have a higher influence on the rPPG estimation.

Figure 5.9: Mean/std of the frontalized feature maps for live and spoof.

While the goal of our system is to detect PAs, our model is trained to estimate the auxiliary information. Hence, in addition to anti-spoofing, we would also like to evaluate the accuracy of the auxiliary information estimation.

Figure 5.10: The MSE of estimating depth maps and rPPG signals.
For this purpose, we calculate the accuracy of estimating depth maps and rPPG signals for the testing data in Protocol 2 of Oulu. As shown in Fig. 5.10, the accuracy of both estimations on spoof data is high, while that on the live data is relatively lower. Note that the depth estimation of the mouth area has more errors, which is consistent with the fewer activations of the same area in Fig. 5.9. Examples of successful and failure cases in estimating depth maps and rPPG signals are shown in Fig. 5.8.

Finally, we conduct a statistical analysis of the failure cases, since our system can determine potential causes using the auxiliary information. With Protocol 2 of Oulu, we identify 31 failure cases (2.7% ACER). For each case, we calculate whether using its depth map or rPPG signal alone would fail. In total, 29/31, 13/31, and 11/31 samples fail due to the depth map, the rPPG signals, or both, respectively. This indicates the future research direction.

5.5 Summary

This chapter identifies the importance of auxiliary supervision for deep model-based face anti-spoofing. The proposed network combines CNN and RNN architectures to jointly estimate the depth of face images and the rPPG signal of face videos. We introduce the SiW database, which contains more subjects and variations than prior databases. Finally, we experimentally demonstrate the superiority of our method.

Chapter 6

Face De-Spoofing: Anti-Spoofing via Noise Modeling

6.1 Introduction

The print and the replay attacks are the two most common spoof types, and they have been well studied previously from different perspectives. The cue-based methods aim to detect liveness cues [82, 83] (e.g., eye blinking, head motion) to classify live videos. But these methods can be fooled by video replay attacks. The texture-based methods attempt to compare the texture difference between live and spoof faces, using features such as LBP [33, 34] and HOG [59, 131].

Similar to the texture-based methods, CNN-based methods [66, 83, 130] design a unified process of feature extraction and classification. With a softmax loss based binary supervision, they have the risk of overfitting on the training data. Regardless of the perspective, almost all prior works treat face anti-spoofing as a black box binary classification problem. In contrast, in this chapter, we propose to open the black box by modeling the process of how a spoof image is generated from its original live image.

Our approach is motivated by the classic image de-X problems, such as image de-noising and de-blurring [36, 51, 62, 85]. In image de-noising, the corrupted image is regarded as a degradation from additive noise, e.g., salt-and-pepper noise and white Gaussian noise. In image de-blurring, the uncorrupted image is degraded by motion, which can be described as a process of convolution.

Figure 6.1: The illustration of the face spoofing and de-spoofing processes. The de-spoofing process aims to estimate a spoof noise from a spoof face and reconstruct the live face. The estimated spoof noise should be discriminative for face anti-spoofing.

Similarly, in face anti-spoofing, the spoof image can be viewed as a re-rendering of the live image, but with some "special" noise from the spoof medium and the environment. Hence, the natural question is: can we recover the underlying live image when given a spoof image, similar to image de-noising? Yes. This chapter shows "how" to do this. We call the process of decomposing a spoof face into the spoof noise pattern and a live face Face De-Spoofing, shown in Fig. 6.1. Similar to the previous de-X works, the degraded image x \in R^m can be formulated as a function of the original image \hat{x}, the degradation matrix A \in R^{m \times m} and an additive noise n \in R^m.
x = A\hat{x} + n = \hat{x} + (A - I)\hat{x} + n = \hat{x} + N(\hat{x}),   (6.1)

where N(\hat{x}) = (A - I)\hat{x} + n is the image-dependent noise function. Instead of solving for A and n, we decide to estimate N(\hat{x}) directly, since this is more solvable under the deep learning framework [40, 63, 100, 101, 145]. Essentially, by estimating N(\hat{x}) and \hat{x}, we aim to peel off the spoof noise and reconstruct the original live face. Likewise, if given a live face, the face de-spoofing model should return the face itself plus zero noise. Note that our face de-spoofing is designed to handle print attacks, replay attacks and possibly make-up attacks, but our experiments are limited to the first two PAs.

The benefits of face de-spoofing are twofold: 1) it reverses, or undoes, the spoof generation process, which helps us to model and visualize the spoof noise pattern of different spoof mediums; 2) the spoof noise itself is discriminative between live and spoof images and hence is useful for face anti-spoofing.

While face de-spoofing shares the same challenges as other image de-X problems, it has a few distinct difficulties to conquer:

No Ground Truth: Image de-X works often use synthetic data, where the original undegraded image can be used as ground truth for supervised learning. In contrast, we have no access to \hat{x}, the corresponding live face of a spoof face image.

No Noise Model: There is no comprehensive study and understanding of the spoof noise. Hence it is not clear how we can constrain the solution space to faithfully estimate the spoof noise pattern.

Diverse Spoof Mediums: Each type of spoof utilizes different spoof mediums for generating spoof images. Each spoof medium represents a type of noise pattern.

To address these challenges, we propose several constraints and supervisions based on our prior knowledge and the conclusions from a case study (in Section 6.2.1). Given that a live face has no spoof noise, we impose the constraint that N(\hat{x}) of a live image is zero. Based on our study, we assume that the spoof noise of a spoof image is ubiquitous, i.e., it exists everywhere in the spatial domain of the image, and is repetitive, i.e., it is the spatial repetition of certain noise in the image. The repetitiveness can be encouraged by maximizing the high-frequency magnitude of the estimated noise in the Fourier domain.

With such constraints and the auxiliary supervisions proposed in [73], a novel CNN architecture is presented in this chapter. Given an image, one CNN is designed to synthesize the spoof noise pattern and reconstruct the corresponding live image. In order to examine the reconstructed live image, we train another CNN with auxiliary supervision and a GAN-like discriminator in an end-to-end fashion. These two networks are designed to ensure the quality of the reconstructed image regarding its discriminativeness between live and spoof, and the visual plausibility of the synthesized live image.

To summarize, the main contributions of this work include:

- We offer a new perspective for detecting the face spoofs of print attack and replay attack, by inversely decomposing a spoof face image into the live face and the spoof noise, without having the ground truth of either.

- A novel CNN architecture is proposed for face de-spoofing, where appropriate constraints and auxiliary supervisions are imposed.

- We demonstrate the value of face de-spoofing by its contribution to face anti-spoofing and the visualization of the spoof noise patterns.

6.2 Face De-Spoofing

In this section, we start with a case study of the spoof noise pattern, which demonstrates a few important characteristics of the noise. This study motivates us to design the novel CNN architecture that will be presented in Sec. 6.2.2.1.

6.2.1 A Case Study of Spoof Noise Pattern

The core task of face de-spoofing is to estimate the relevant noise pattern in the given face image. Despite the strength of using a CNN model, we are still facing the challenge of learning without the ground truth of the noise pattern. To address this challenge, we would like to carry out a case study on the noise pattern, with the objectives of answering the following questions: 1) is Eqn. 6.1 a good model of the spoof noise? 2) what characteristics does the spoof noise hold?

Figure 6.2: The illustration of the spoof noise pattern. Left: a live face and its local regions. Right: two registered spoof faces from a print attack and a replay attack. For each sample, we show the local region of the face, the intensity difference to the live image, the magnitude of the 2D FFT, and the local peaks in the frequency domain that indicate the spoof noise pattern. Best viewed electronically.
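The decomposition of Eqn. 6.1 can be illustrated numerically. The sketch below uses synthetic stand-ins for A and n, since the real degradation is unknown and is what the DS Net (introduced later) learns to estimate.

```python
import numpy as np

def spoof_noise(live, A, n):
    """Illustration of the degradation model in Eq. 6.1 on a flattened
    image x_hat in R^m: x = A x_hat + n = x_hat + N(x_hat), with
    N(x_hat) = (A - I) x_hat + n the image-dependent spoof noise."""
    m = live.size
    x = A @ live + n
    N = (A - np.eye(m)) @ live + n
    assert np.allclose(x, live + N)           # the two forms of Eq. 6.1 agree
    return x, N

live = np.array([0.2, 0.5, 0.8])              # a tiny 3-pixel "live image"
A = 0.9 * np.eye(3)                           # e.g., a color-gamut shrink
n = np.array([0.05, 0.0, -0.02])              # e.g., additive artifacts
x, N = spoof_noise(live, A, n)
print(N)                                      # the noise to be estimated
```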
Let us denote a genuine face as \hat{I}. By using printed paper or video replay on digital devices, the attacker can manufacture a spoof image I from \hat{I}. Considering no non-rigid deformation between I and \hat{I}, we summarize the degradation from \hat{I} to I in the following steps:

1. Color distortion: Color distortion is due to the narrower color gamut of the spoof medium (e.g., an LCD screen or toner cartridge). It is a projection from the original color space to a tinier color subspace. This noise is dependent on the color intensity of the subject, and hence it may apply as a degradation matrix to the genuine face \hat{I} during the degradation.

2. Display artifacts: Spoof mediums often use several nearby dots/sensors to approximate one pixel's color, and they may also display the face differently than the original size. The approximation and down-sampling procedures cause a certain degree of high-frequency information loss, blurring, and pixel perturbation. This noise may also apply as a degradation matrix due to its subject dependence.

3. Presenting artifacts: When presenting the spoof medium to the camera, the medium interacts with the environment and brings several artifacts, including reflection and transparency of the surface. This noise may apply as an additive noise.

4. Imaging artifacts: Imaging lattice patterns such as screen pixels on the camera's sensor array (e.g., CMOS and CCD) causes interference of light. This effect leads to aliasing and creates the moiré pattern, which appears in replay attacks and some print attacks with strong lattice artifacts. This noise may apply as an additive noise.

These four steps show that the spoof image I can be generated via applying degradation matrices and additive noises to \hat{I}, which is basically conveyed by Eqn. 6.1. As expressed by Eqn. 6.1, the spoof image is the summation of the live image and an image-dependent noise. To further validate this model, we show an example in Fig. 6.2. Given a high-quality live image, we carefully produce two spoof images via print and replay attack, with minimal non-rigid deformation. After each spoof image is registered with the live image, the live image becomes the ground truth live image if we would perform de-spoofing on the spoof image. This allows us to compute the difference between the live and spoof images, which is the noise pattern N(\hat{I}). To analyze its frequency properties, we perform an FFT on the spoof noise and show the 2D shifted magnitude response.

In both spoof cases, we observe a high response in the low-frequency domain, which is related to the color distortion and display artifacts. In the print attack, the repetitive noise of Step 3 leads to a few "peak" responses in the high-frequency domain. Similarly, in the replay attack, the visible moiré pattern appears as several spurs in the low-frequency domain, and the lattice pattern that causes the moiré pattern is represented as peaks in the high-frequency domain. Moreover, the spoof patterns are uniformly distributed in the image domain due to the uniform texture of the spoof mediums. And the high response of the repetitive pattern in the frequency domain exactly demonstrates that it appears widely in the image and thus can be viewed as ubiquitous.
Under this ideal registration, the comparison between live and spoof images provides us a basic understanding of the spoof noise pattern. It is a type of texture that has the characteristics of being repetitive and ubiquitous. Based on this modeling and these noise characteristics, we design a network to estimate the noise without access to the precisely registered ground truth live image that this case study has.

6.2.2 De-Spoof Network

6.2.2.1 Network Overview

Figure 6.3: The proposed network architecture.

Figure 6.3 shows the overall network architecture of our proposed method. It consists of three parts: the De-Spoof Net (DS Net), the Discriminative Quality Net (DQ Net), and the Visual Quality Net (VQ Net). The DS Net is designed to estimate the spoof noise pattern N (i.e., the output of N(\hat{I})) from the input image I. The live face \hat{I} can then be reconstructed by subtracting the estimated noise N from the input image I. This reconstructed image \hat{I} should be both visually appealing and indeed live, which will be safeguarded by the VQ Net and the DQ Net, respectively. All networks can be trained in an end-to-end fashion. The details of the network structure are shown in Tab. 6.1.

Table 6.1: The network structures of DS Net, DQ Net and VQ Net. Each convolutional layer is followed by an exponential linear unit (ELU) and a batch normalization layer. The input image size for DS Net is 256x256x6. All the convolutional filters are 3x3. The 0\1 Map Net is the bottom-left part, i.e., conv1-10, conv1-11, and conv1-12. Each entry lists layer, channels/stride, and output size.

  DS Net (encoder): input image; conv1-0 24/1 256; conv1-1 20/1 256; conv1-2 25/1 256; conv1-3 20/1 256; pool1-1 -/2 128; conv1-4 20/1 128; conv1-5 25/1 128; conv1-6 20/1 128; pool1-2 -/2 64; conv1-7 20/1 64; conv1-8 25/1 64; conv1-9 20/1 64; pool1-3 -/2 32; short-cut connection pool1-1+pool1-2+pool1-3; conv1-10 28/1 32; conv1-11 16/1 32; conv1-12 1/1 32.
  DS Net (decoder): input pool1-1+pool1-2+pool1-3; resize -/- 256; conv2-1 28/1 256; conv2-2 24/1 256; conv2-3 20/1 256; conv2-4 20/1 256; conv2-5 20/1 256; conv2-6 16/1 256; conv2-7 16/1 256; conv2-8 6/1 256; live = image - conv2-8.
  DQ Net: input live; conv3-0 64/1 256; conv3-1 128/1 256; conv3-2 196/1 256; conv3-3 128/1 256; pool3-1 -/2 128; conv3-4 128/1 128; conv3-5 196/1 128; conv3-6 128/1 128; pool3-2 -/2 64; conv3-7 128/1 64; conv3-8 196/1 64; conv3-9 128/1 64; pool3-3 -/2 32; short-cut connection pool3-1+pool3-2+pool3-3; conv3-10 128/1 32; conv3-11 64/1 32; conv3-12 1/1 32.
  VQ Net: input {image, live}; conv4-1 24/2 256; conv4-2 20/2 256; pool4-1 -/2 128; conv4-3 20/1 128; conv4-4 16/1 128; pool4-2 -/2 64; conv4-5 12/1 64; conv4-6 6/1 64; pool4-3 -/2 32; vectorize 1024; fc4-1 1/1 100; dropout 0.2; fc4-2 1/1 2.

As the core part, the DS Net is designed as an encoder-decoder structure with the input I \in R^{256 \times 256 \times 6}. Here the 6 channels are the RGB + HSV color spaces, following the suggestion in [5]. In the encoder part, we stack 10 convolutional layers with 3 pooling layers. Inspired by the residual network [44], we follow them with a short-cut connection: concatenating the responses from pool1-1 and pool1-2 with pool1-3, and then sending them to conv1-10. This operation helps us to pass the feature responses from different scales to the later stages and eases the training procedure. Going through 3 more convolutional layers, the responses F \in R^{32 \times 32 \times 32} from conv1-12 are the feature representation of the spoof noise patterns. The higher magnitudes the responses have, the more likely the input is a spoof.

Out of the encoder, the feature representation F is fed into the decoder to reconstruct the spoof noise pattern. F is directly resized to the input spatial size of 256x256. This introduces no extra grid artifacts, which exist in the alternative approach of using a deconvolutional layer. Then, we pass the resized F through several convolutional layers to reconstruct the noise pattern N. According to Eqn. 6.1, the reconstructed live image can be retrieved by: \hat{x} = x - N(\hat{x}) = I - N.
Each convolutional layer in the DS Net is equipped with an exponential linear unit (ELU) and a batch normalization layer. To supervise the training of the DS Net, we design multiple loss functions: losses from the DQ Net and VQ Net for the image quality, a 0\1 map loss, and noise property losses. We introduce these loss functions in Secs. 6.2.3-6.2.4.

6.2.3 DQ Net and VQ Net

While we do not have the ground truth to supervise the estimated spoof noise pattern, it is possible to supervise the reconstructed live image, which implicitly guides the noise estimation. To estimate a good-quality spoof noise, the reconstructed live image should be both quantitatively and visually recognized as live. For this purpose, we propose two networks in our architecture: the Discriminative Quality Net (DQ Net) and the Visual Quality Net (VQ Net). The VQ Net aims to guarantee that the reconstructed live face is photorealistic. The DQ Net is proposed to guarantee that the reconstructed face would indeed be considered as live, based on the judgment of a pre-trained face anti-spoofing network. The details of our proposed architecture are shown in Tab. 6.1.

6.2.3.1 Discriminative Quality Net

We follow the state-of-the-art network architecture for face anti-spoofing [73] to build our DQ Net. It is a fully convolutional network with three blocks and three additional convolutional layers. Each block consists of three convolutional layers and one pooling layer. The feature maps after each pooling layer are resized and stacked to feed into the following convolutional layers. Finally, the DQ Net is supervised to estimate the pseudo-depth D of an input face, where D for a live face is the depth of the face shape and D for a spoof face is a zero map, as for a flat surface. We adopt the 3D face alignment algorithm in [74] to estimate the face shape and render the depth via Z-Buffering.

Similar to the previous work [50], the DQ Net is pre-trained to obtain the semantic knowledge of live faces and spoof faces. During the training of the DS Net, the parameters of the DQ Net are fixed. Since the reconstructed images \hat{I} are live images, the corresponding pseudo-depth D should be the depth of the face shape. The backpropagation of the error from the DQ Net guides the DS Net to estimate the spoof noise pattern that should be subtracted from the input image,

J_{DQ} = \| CNN_{DQ}(\hat{I}) - D \|_1,   (6.2)

where CNN_{DQ} is a fixed network and D is the depth of the face shape.

6.2.3.2 Visual Quality Net

We deploy a GAN to verify the visual quality of the estimated live image \hat{I}. Given both a real live image I_{live} and a synthesized live image \hat{I}, the VQ Net is trained to distinguish between I_{live} and \hat{I}. Meanwhile, the DS Net tries to reconstruct photorealistic live images for which the VQ Net would classify them as non-synthetic (or real) images. The VQ Net consists of 6 convolutional layers and a fully connected layer with a 2D vector as the output, which represents the probability of the input image being real or synthetic. In each iteration during training, the VQ Net is evaluated with two batches. In the first one, the DS Net is fixed and we update the VQ Net,

J_{VQ_{train}} = -E_{I \in R} \log(CNN_{VQ}(I)) - E_{I \in S} \log(1 - CNN_{VQ}(CNN_{DS}(I))),   (6.3)

where R and S are the sets of real and synthetic images, respectively. In the second batch, the VQ Net is fixed and the DS Net is updated,

J_{VQ_{test}} = -E_{I \in S} \log(CNN_{VQ}(CNN_{DS}(I))).   (6.4)
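The two GAN objectives of Eqs. 6.3 and 6.4 can be sketched as below. Here vq (mapping an image to the probability of being a real live image) and ds (mapping an image to its reconstructed live image) are hypothetical stand-in callables; in the thesis they are CNNs trained in alternation, first updating the VQ Net with the DS Net fixed, then updating the DS Net with the VQ Net fixed.

```python
import numpy as np

def vq_losses(vq, ds, real_batch, spoof_batch):
    """Monte-Carlo estimates of J_VQ_train (Eq. 6.3) and J_VQ_test (Eq. 6.4)."""
    eps = 1e-8                                  # numerical safety for log
    real = np.array([vq(i) for i in real_batch])
    synth = np.array([vq(ds(i)) for i in spoof_batch])
    j_vq_train = -np.mean(np.log(real + eps)) - np.mean(np.log(1 - synth + eps))
    j_vq_test = -np.mean(np.log(synth + eps))   # DS Net fools the discriminator
    return j_vq_train, j_vq_test

# Toy stand-ins: a 'discriminator' on the image mean, a scaling 'de-spoofer'.
vq = lambda img: float(np.clip(img.mean(), 0.01, 0.99))
ds = lambda img: img * 0.95
print(vq_losses(vq, ds, [np.full((4, 4), 0.9)], [np.full((4, 4), 0.4)]))
```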
6.2.4 Loss Functions

The main challenge in spoof modeling is the lack of ground truth for the spoof noise pattern. Since we have concluded some properties of the spoof noise in Sec. 6.2.1, we can leverage them to design several novel loss functions that constrain the convergence space. First, we introduce the magnitude loss to enforce the spoof noise of a live image to be zero. Second, the zero\one map loss is used to enforce the ubiquitousness of the spoof noise. Third, we encourage the repetitiveness property of the spoof noise via the repetitive loss. We describe the three loss functions as follows.

6.2.4.1 Magnitude Loss

The spoof noise pattern for live images is zero. The magnitude loss can be utilized to impose this constraint on the estimated noise. Given the estimated noise N and the reconstructed live image \hat{I} = I - N of an original live image I, we have

J_m = \| N \|_1.   (6.5)

Zero\One Map Loss: To learn discriminative features in the encoder layers, we define a sub-task in the DS Net to estimate a zero-map for live faces and a one-map for spoof faces. Since this is a per-pixel supervision, it is also a constraint of ubiquitousness on the noise. Moreover, the 0\1 map enables the receptive field of each pixel to cover a local area, which helps to learn generalizable features for this problem. Formally, given the extracted features F from an input face image I in the encoder, we have

J_z = \| CNN_{01map}(F; \Theta) - M \|_1,   (6.6)

where M \in 0^{32 \times 32} or M \in 1^{32 \times 32} is the zero\one map label.

6.2.4.2 Repetitive Loss

Based on the previous discussion, we assume the spoof noise pattern to be repetitive, because it is generated from a repetitive spoof medium. To encourage the repetitiveness, we convert the estimated noise N to the Fourier domain and compute the maximum value in the high-frequency band. The existence of a high peak is indicative of a repetitive pattern. We would like to maximize this peak for spoof images, but minimize it for live images, via the following loss function:

J_r = -\max(H(\mathcal{F}(N), k)) for I \in Spoof, and J_r = \| \max(H(\mathcal{F}(N), k)) \|_1 for I \in Live,

where \mathcal{F} is the Fourier transform operator and H is an operator for masking the low-frequency domain of an image, i.e., setting a k x k region in the center of the shifted 2D Fourier response to zero.

Finally, the total loss function in our training is the weighted summation of the aforementioned loss functions and the supervisions for the image qualities,

J_T = J_z + \lambda_1 J_m + \lambda_2 J_r + \lambda_3 J_{DQ} + \lambda_4 J_{VQ_{test}},   (6.7)

where \lambda_1, \lambda_2, \lambda_3, \lambda_4 are the weights. During training, we alternate between optimizing Eqn. 6.7 and Eqn. 6.3.

6.3 Experimental Results

6.3.1 Experimental Setup

Databases. We evaluate our work on three face anti-spoofing databases with print and replay attacks: Oulu-NPU [19], CASIA-MFSD [143] and Replay-Attack [28]. Oulu-NPU [19] is a high-resolution database that considers many real-world variations. Oulu-NPU also includes 4 testing protocols: Protocol 1 evaluates on the illumination variation, Protocol 2 examines the influence of different spoof mediums, Protocol 3 inspects the effect of different camera devices, and Protocol 4 contains all the challenges above, which is close to the scenario of cross testing. CASIA-MFSD [143] contains videos with resolutions 640x480 and 1280x720. Replay-Attack [28] includes videos of 320x240. These two databases are often used for cross testing [83].

Parameter setting. We implement our method in TensorFlow [2]. Models are trained with a batch size of 6 and a learning rate of 3e-5. We set k = 64 in the repetitive loss and set \lambda_1 to \lambda_4 in Eqn. 6.7 to 3, 0.005, 0.1 and 0.016, respectively. The DQ Net is trained separately and remains fixed during the update of the DS Net and VQ Net, but all sub-networks are trained with the same and respective data in each protocol.

Evaluation metrics. To compare with previous methods, we use the Attack Presentation Classification Error Rate (APCER) [46], the Bona Fide Presentation Classification Error Rate (BPCER) [46] and ACER = (APCER + BPCER)/2 [46] for the intra testing on Oulu-NPU, and the Half Total Error Rate (HTER) [9], half of the summation of FAR and FRR, for the cross testing between CASIA-MFSD and Replay-Attack.
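As a concrete reading of the repetitive loss of Sec. 6.2.4.2, with the k = 64 mask just mentioned, the following minimal NumPy sketch operates on a single noise channel; the full method applies this within training, not as a standalone function.

```python
import numpy as np

def repetitive_loss(N, k=64, is_spoof=True):
    """Repetitive loss sketch on an estimated noise channel N (2D array).
    The shifted 2D FFT magnitude is masked by zeroing a central k x k
    low-frequency region (the operator H); the peak of the remaining
    high-frequency band is maximized for spoof and pushed to 0 for live."""
    F = np.abs(np.fft.fftshift(np.fft.fft2(N)))
    c0, c1 = F.shape[0] // 2, F.shape[1] // 2
    F[c0 - k // 2:c0 + k // 2, c1 - k // 2:c1 + k // 2] = 0.0   # mask H
    peak = F.max()
    return -peak if is_spoof else np.abs(peak)

noise = np.random.randn(256, 256) * 0.01
print(repetitive_loss(noise, is_spoof=True))
```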
Table 6.2: The accuracy of the different outputs of the proposed architecture and their fusions.

  Method   0\1 map   Spoof noise   Depth map   Fusion (noise, depth)   Fusion of all three
                                               Maximum    Average      Maximum    Average
  APCER    2.50      1.70          1.66        1.70       1.27         1.70       1.27
  BPCER    2.52      1.70          1.68        1.73       1.73         1.73       1.73
  ACER     2.51      1.70          1.67        1.72       1.50         1.72       1.50

6.3.2 Ablation Study

Using Oulu-NPU Protocol 1, we perform three studies: on the effect of score fusion, on the importance of each loss function, and on the influence of image resolution and blurriness.

Different fusion methods. In the proposed architecture, three outputs can be utilized for classification: the norms of either the 0\1 map, the spoof noise pattern or the depth map. Because of the discriminativeness enabled by our learning, we can simply use a rudimentary classifier like the L-1 norm. Note that a more advanced classifier is applicable and would likely lead to higher performance. Table 6.2 shows the performance of each output and their fusions with the maximum and average rules. It shows that the fusion of the spoof noise and the depth map achieves the best performance. However, adding the 0\1 map scores does not improve the accuracy, since it contains the same information as the spoof noise. Hence, for the rest of the experiments, we report performance from the average fusion of the spoof noise N and the depth map \hat{D}, i.e., score = (\|N\|_1 + \|\hat{D}\|_1)/2.

Advantage of each loss function. We have three main loss functions in our proposed architecture. To show the effect of each loss function, we train a network with each loss excluded one by one. By disabling the magnitude loss, the 0\1 map loss and the repetitive loss, we obtain ACERs of 5.24, 2.34 and 1.50, respectively. To further validate the repetitive loss, we perform an experiment on high-resolution images by changing the network input to the cheek region of the original 1080P resolution. The ACER of the network with the repetitive loss is 2.92, but the network without it cannot converge.

Resolution and blurriness. As shown in the ablation study of the repetitive loss, image quality is critical for achieving a high accuracy. The spoof noise pattern may not be detectable in low-resolution or motion-blurred images. The testing results on different image resolutions and blurriness are shown in Tab. 6.3. These results validate that the spoof noise pattern is less discriminative for lower-resolution or blurry images, as the high-frequency part of the input images contains most of the spoof noise pattern.

Table 6.3: ACER of the proposed method with different image resolutions and blurriness. To create blurry images, we apply Gaussian filters with different kernel sizes to the input images.

  Resolution   256x256   128x128   64x64
  APCER        1.27      2.27      5.24
  BPCER        1.73      3.36      5.30
  ACER         1.50      3.07      5.27

  Blurriness   1x1    3x3    5x5    7x7    9x9
  APCER        1.27   2.29   3.12   3.95   4.79
  BPCER        1.73   2.50   3.33   4.16   5.00
  ACER         1.50   2.39   3.22   4.06   4.89

Table 6.4: The intra-testing results on the 4 protocols of Oulu-NPU.

  Protocol  Method            APCER (%)   BPCER (%)   ACER (%)
  1         CPqD [15]         2.9         10.8        6.9
            GRADIANT [15]     1.3         12.5        6.9
            Auxiliary [73]    1.6         1.6         1.6
            Ours              1.2         1.7         1.5
  2         MixedFASNet [15]  9.7         2.5         6.1
            Ours              4.2         4.4         4.3
            Auxiliary [73]    2.7         2.7         2.7
            GRADIANT          3.1         1.9         2.5
  3         MixedFASNet       5.3±6.7     7.8±5.5     6.5±4.6
            GRADIANT          2.6±3.9     5.0±5.3     3.8±2.4
            Ours              4.0±1.8     3.8±1.2     3.6±1.6
            Auxiliary [73]    2.7±1.3     3.1±1.7     2.9±1.5
  4         Massy_HNU [15]    35.8±35.3   8.3±4.1     22.1±17.6
            GRADIANT          5.0±4.5     15.0±7.1    10.0±5.0
            Auxiliary [73]    9.3±5.6     10.4±6.0    9.5±6.0
            Ours              5.1±6.3     6.1±5.1     5.6±5.7
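For clarity, the average fusion score adopted in the experiments below amounts to the following sketch; whether the L1 norms are summed or averaged per element is our assumption, as the thesis does not specify the normalization.

```python
import numpy as np

def despoof_score(N, D_hat):
    """Average fusion: score = (||N||_1 + ||D_hat||_1) / 2 over the
    estimated spoof noise N and depth map D_hat. A higher score means
    the input is more likely a spoof (larger noise, nonzero depth map
    response from the 0-depth-for-spoof convention is reversed here)."""
    return (np.abs(N).mean() + np.abs(D_hat).mean()) / 2

print(despoof_score(np.zeros((256, 256, 6)), np.zeros((32, 32))))  # live -> 0
```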
6.3.3 Experimental Comparison

To show the performance of our proposed method, we present our accuracy in the intra testing of Oulu-NPU and the cross testing between CASIA-MFSD and Replay-Attack.

6.3.3.1 Intra Testing

We compare our intra testing performance on all 4 protocols of Oulu-NPU. Table 6.4 shows the comparison of our method and the best 3 out of 18 previous methods [15, 73]. Our proposed method achieves promising results on all protocols. Specifically, we outperform the previous state of the art by a large margin in Protocol 4, which is the most challenging protocol and similar to cross testing.

Table 6.4: The intra testing results on the 4 protocols of Oulu-NPU.

Protocol | Method | APCER (%) | BPCER (%) | ACER (%)
1 | CPqD [15]        | 2.9  | 10.8 | 6.9
1 | GRADIANT [15]    | 1.3  | 12.5 | 6.9
1 | Auxiliary [73]   | 1.6  | 1.6  | 1.6
1 | Ours             | 1.2  | 1.7  | 1.5
2 | MixedFASNet [15] | 9.7  | 2.5  | 6.1
2 | Ours             | 4.2  | 4.4  | 4.3
2 | Auxiliary [73]   | 2.7  | 2.7  | 2.7
2 | GRADIANT         | 3.1  | 1.9  | 2.5
3 | MixedFASNet      | 5.3±6.7   | 7.8±5.5  | 6.5±4.6
3 | GRADIANT         | 2.6±3.9   | 5.0±5.3  | 3.8±2.4
3 | Ours             | 4.0±1.8   | 3.8±1.2  | 3.6±1.6
3 | Auxiliary [73]   | 2.7±1.3   | 3.1±1.7  | 2.9±1.5
4 | Massy_HNU [15]   | 35.8±35.3 | 8.3±4.1  | 22.1±17.6
4 | GRADIANT         | 5.0±4.5   | 15.0±7.1 | 10.0±5.0
4 | Auxiliary [73]   | 9.3±5.6   | 10.4±6.0 | 9.5±6.0
4 | Ours             | 5.1±6.3   | 6.1±5.1  | 5.6±5.7

6.3.3.2 Cross Testing

We perform cross testing between CASIA-MFSD [143] and Replay-Attack [28]. As shown in Tab. 6.5, our method achieves competitive performance on the cross testing from CASIA-MFSD to Replay-Attack. However, we achieve a worse HTER compared to the best performing methods from Replay-Attack to CASIA-MFSD. We hypothesize the reason is that images of CASIA-MFSD are of much higher resolution than those of Replay-Attack. This shows that the model trained with higher-resolution data can generalize well on lower-resolution testing data, but not the other way around. This is one limitation of the proposed method, and worthy of further research.

Table 6.5: The HTER of different methods for the cross testing between the CASIA-MFSD and the Replay-Attack databases. The top-2 performances in each column are marked with *.

Method | Train: CASIA-MFSD, Test: Replay-Attack | Train: Replay-Attack, Test: CASIA-MFSD
Motion [34]          | 50.2%  | 47.9%
LBP-TOP [34]         | 49.7%  | 60.6%
Motion-Mag [11]      | 50.1%  | 47.0%
Spectral cubes [90]  | 34.4%  | 50.0%
CNN [130]            | 48.5%  | 45.5%
LBP [16]             | 47.0%  | 39.6%
Colour Texture [17]  | 30.3%  | 37.7%*
Auxiliary [73]       | 27.6%* | 28.4%*
Ours                 | 28.5%* | 41.1%

6.3.4 Qualitative Experiments

6.3.4.1 Spoof medium classification

The estimated spoof noise pattern of the test images can be used for clustering them into different groups, where each group represents one spoof medium. To visualize the results, we use t-SNE [76] for dimension reduction. The t-SNE projects the noise N ∈ R^{256×256×6} to 2 dimensions by best preserving the KL divergence distance. Fig. 6.4 shows the distributions of the testing videos on Oulu-NPU Protocol 1. The left image shows that the noise of live is well-clustered, and the noise of spoof is subject dependent, which is consistent with our noise assumption. To obtain a better visualization, we utilize a high-pass filter to extract the high-frequency information of the noise pattern for dimension reduction. The right image shows that the high-frequency part has more subject independent information about the spoof type and can be utilized for classification of the spoof medium.

Figure 6.4: The 2D visualization of the estimated spoof noise for test videos on Oulu-NPU Protocol 1. Left: the estimated noise. Right: the high-frequency band of the estimated noise. Color code used: black = live, green = printer 1, blue = printer 2, magenta = display 1, red = display 2.

To further show the discriminative power of the estimated spoof noise, we divide the testing set of Protocol 1 into training and testing parts and train an SVM for spoof medium classification. We train two models, a three-class classifier (live, print and display) and a five-class classifier (live, print 1, print 2, display 1 and display 2), and they achieve classification accuracies of 82.0% and 54.3% respectively, as shown in Tab. 6.6. Most classification errors of the five-class model are within the same spoof medium. This result is noteworthy given that no label of spoof medium type is provided during the learning of the spoof noise model. Yet the estimated noise actually carries appreciable information regarding the medium type; hence we can observe reasonable results of spoof medium classification. This demonstrates that the estimated noise contains spoof medium information, and indeed we are moving toward estimating the faithful spoof noise residing in each spoof image. In the future, if the performance of spoof medium classification improves, this could bring new impact to applications such as forensics.
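The spoof medium classification above can be sketched with scikit-learn; the linear kernel and the use of flattened noise estimates as features are assumptions, since the chapter does not specify the SVM configuration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def classify_medium(X_train, y_train, X_test, y_test):
    """Fit an SVM on flattened spoof-noise estimates and return the
    confusion matrix and accuracy, as reported in Tab. 6.6."""
    clf = SVC(kernel="linear")  # kernel choice is an assumption
    clf.fit(X_train, y_train)
    return confusion_matrix(y_test, clf.predict(X_test)), clf.score(X_test, y_test)
```

The resulting confusion matrices for the three-class and five-class models are shown in Tab. 6.6.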
Table 6.6: The confusion matrices of spoof medium classification based on the spoof noise pattern.

Three-class model:
Actual \ Predicted | live | print | display
live    | 59 | 1  | 0
print   | 0  | 88 | 32
display | 13 | 8  | 99

Five-class model:
Actual \ Predicted | live | print 1 | print 2 | display 1 | display 2
live      | 59 | 0  | 1  | 0  | 0
print 1   | 0  | 41 | 2  | 11 | 6
print 2   | 0  | 34 | 11 | 9  | 6
display 1 | 10 | 6  | 0  | 13 | 31
display 2 | 8  | 7  | 0  | 6  | 39

6.3.4.2 Successful and failure cases

We show several success and failure cases in Figs. 6.5 and 6.6. Fig. 6.5 shows that the estimated spoof noises are similar within each medium but different from the other mediums. We suspect that the yellowish color in the first four columns is due to the stronger color distortion in the paper attack. The first row shows that the estimated noise for the live images is nearly zero. For the failure cases, we only have a few false positive cases. The failures are due to undesired noise estimation, which will motivate us toward further research.

Figure 6.5: The visualization of input images, estimated spoof noises and estimated live images for test videos of Protocol 1 of the Oulu-NPU database. The first four columns in the first row are paper attacks and the second four are the replay attacks. For a better visualization, we magnify the noise by 5 times and add 128 to its value, to show both positive and negative noise.

Figure 6.6: The failure cases for converting the spoof images to the live ones.

6.4 Summary

This chapter introduces a new perspective for solving face anti-spoofing by inversely decomposing a spoof face into the live face and the spoof noise pattern. A novel CNN architecture with multiple appropriate supervisions is proposed. We design loss functions to encourage the noise pattern of the spoof images to be ubiquitous and repetitive, while the noise of the live images should be zero. We visualize the spoof noise pattern, which can help to gain a deeper understanding of the noise added by each spoof medium. We evaluate the proposed method on multiple widely-used face anti-spoofing databases.

Chapter 7

Conclusions and Future Work

Face alignment is an important research topic because it has many applications in face recognition, face tracking, expression estimation, augmented and virtual reality, etc. By utilizing deep learning methods, face alignment systems have improved dramatically, and they can be used in many commercial applications for face tracking and augmented reality tasks, e.g., Snapchat and Facebook.

In Chapters 2 to 4, we propose three face alignment methods with the same idea of converting 2D face alignment to 3D face alignment and reconstructing the 3D shape of the face. We demonstrate our advantage in performing face alignment for large-pose face images by utilizing the estimated 3D shape. These three methods are extended versions of one another, and both the accuracy and the speed of our pose-invariant face alignment are improved in each method.

The face is one of the most popular biometric modalities that can be used for access control and authentication. Although face recognition systems can achieve very high verification accuracy, if we want to use face recognition systems in practice we need to have a robust face anti-spoofing system. In Chapters 5 and 6, we propose our CNN-based face anti-spoofing methods that use the supervision from both the spatial and temporal auxiliary information for the purpose of robustly detecting face PA from a face video.

7.1 Limitations

Pose-invariant Face Alignment: The main limitation of the proposed PIFA methods is that the estimated 3D shape of the face does not contain the detailed changes in the 3D shape, e.g., wrinkles. This is due to the limited number of ground-truth 2D landmarks which are utilized during training as supervision. Tran et al. [106] show that utilizing unsupervised loss functions based on texture reconstruction is helpful for obtaining a more detailed 3D shape estimation.
Face Anti-Spoofing: The main challenge of anti-spoofing methods, for all biometric modalities, is the unknown spoof attack (spoof medium). The anti-spoofing methods are mainly constrained by the spoof medium data which are used for training them. Making the anti-spoofing methods more generalizable and robust to new spoof mediums is the main limitation of the proposed methods.

7.2 Future Work

Detailed 3D Shape for Face: The 3D shapes estimated by our proposed methods are limited to the 3DMM bases which we utilize to represent the 3D shape. In order to have more powerful bases to represent the 3D shapes, we can learn the bases from the data and make a more detailed reconstruction by incorporating unsupervised loss functions. In [106], a CNN-based method is proposed to learn a set of non-linear bases for representing the 3D shape of the face. Similarly, Tan et al. [103] utilize variational auto-encoders for representing the 3D meshes, in order to have more flexibility for representing the 3D shape and to allow non-linear deformations.

General Object Anti-Spoofing: The idea of face anti-spoofing can be extended to detecting the spoof images of all objects. General object anti-spoofing has applications in detecting spoof images on online shopping websites, e.g., eBay and Amazon. General object anti-spoofing is much more challenging than face anti-spoofing due to the various materials and different textures of objects. The goal is to detect the print attack and the replay attack for general objects.

3D Point Cloud Classification: 3D point cloud classification has several applications in computer vision. Recently, with the advancement of hardware technology, we can have sensors on the phone for capturing the 3D point cloud. In the iPhone X, the point cloud data is utilized for face identification (left in Fig. 7.1). Similarly, other cell phone companies are utilizing the point cloud for face identification. For example, the right side of Fig. 7.1 shows the hardware technology in the Huawei P11 for capturing the point cloud. These data can be utilized for detecting objects and for measuring the lengths of objects and the distances to the objects.

Figure 7.1: Left: A representation of the estimated point cloud in the iPhone X. Right: The hardware technology in the Huawei P11 for capturing the point cloud.

BIBLIOGRAPHY

[1] Explainable Artificial Intelligence (XAI). https://www.darpa.mil/program/explainable-

[2] M. Abadi, A. Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

[3] A. Agarwal, R. Singh, and M. Vatsa. Face anti-spoofing using Haralick features. IEEE, 2016.

[4] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. pages 3444–3451. IEEE, 2013.

[5] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu. Face anti-spoofing using patch and depth-based CNNs. IEEE, 2017.

[6] W. Bao, H. Li, N. Li, and W. Jiang. A liveness detection method for face recognition based on optical flow field. In IASP. IEEE, 2009.

[7] P. N. Belhumeur, D. W. Jacobs, D. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. pages 545–552. IEEE, 2011.

[8] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. 34(4):98, 2015.

[9] S. Bengio and J. Mariéthoz. A statistical significance test for person authentication. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.

[10] S. Bharadwaj, T. Dhamecha, M. Vatsa, and R. Singh. Face anti-spoofing via motion magnification and multifeature videolet aggregation. 2014.

[11] S. Bharadwaj, T. I. Dhamecha, M. Vatsa, and R. Singh. Computationally efficient face spoofing detection with motion magnification. IEEE, 2013.

[12] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In ACM SIGGRAPH, pages 187–194, 1999.

[13] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. 25(9):1063–1074, 2003.

[14] S. Bobbia, Y. Benezeth, and J. Dubois. Remote photoplethysmography based on implicit living skin tissue segmentation. pages 361–365, 2016.

[15] Z. Boulkenafet. A competition on generalized software-based face presentation attack detection in mobile scenarios. IEEE, 2017.

[16] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face anti-spoofing based on color texture analysis. IEEE, 2015.
[17] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face spoofing detection using colour texture analysis. 11(8):1818–1830, 2016.

[18] Z. Boulkenafet, J. Komulainen, and A. Hadid. Face antispoofing using speeded-up robust features and Fisher vector encoding. IEEE Signal Processing Letters, 24(2):141–145, 2017.

[19] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid. OULU-NPU: A mobile face presentation attack database with real-world variations. IEEE, 2017.

[20] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature verification using a "Siamese" time delay neural network. 7(04):669–688, 1993.

[21] A. Bulat and G. Tzimiropoulos. Convolutional aggregation of local evidence for large pose face alignment. 2016.

[22] X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. pages 1513–1520, 2013.

[23] H. Caesar, J. Uijlings, and V. Ferrari. Region-based semantic segmentation with end-to-end training. pages 381–397, 2016.

[24] C. Cao, Q. Hou, and K. Zhou. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics (TOG), 33(4):43, 2014.

[25] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: a 3D facial expression database for visual computing. 20(3):413–425, 2014.

[26] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. 107(2):177–190, 2014.

[27] G. Chetty and M. Wagner. Multi-level liveness verification for face-voice biometric authentication. In BC, 2006.

[28] I. Chingovska, A. Anjos, and S. Marcel. On the effectiveness of local binary patterns in face anti-spoofing. IEEE, 2012.

[29] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. 23(6):681–685, June 2001.

[30] T. Cootes, C. Taylor, and A. Lanitis. Active shape models: Evaluation of a multi-resolution method for improving image search. volume 1, pages 327–336, 1994.

[31] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models – their training and application. 61(1):38–59, Jan 1995.

[32] D. Cristinacce and T. Cootes. Boosted regression active shape models. volume 2, pages 880–889, 2007.

[33] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel. LBP-TOP based countermeasure against face spoofing attacks. Springer, 2012.

[34] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel. Can face anti-spoofing countermeasures work in a real world scenario? IEEE, 2013.

[35] G. de Haan and V. Jeanne. Robust pulse rate from chrominance-based rPPG. IEEE Trans. Biomedical Engineering, 60(10):2878–2886, 2013.

[36] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. Springer, 2014.

[37] L. Feng, L.-M. Po, Y. Li, X. Xu, F. Yuan, T. C.-H. Cheung, and K.-W. Cheung. Integration of image quality and motion cues for face anti-spoofing: a neural network approach. Journal of Visual Communication and Image Representation, 38:451–460, 2016.

[38] R. W. Frischholz and U. Dieckmann. BioID: a multimodal biometric identification system. J. Computer, 33(2):64–68, 2000.

[39] R. W. Frischholz and A. Werner. Avoiding replay-attacks in a face recognition system using head-pose estimation. In AMFGW, pages 234–235, 2003.

[40] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand. Deep joint demosaicking and denoising. 35(6):191, 2016.

[41] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proc. Artificial Intelligence and Statistics (AISTATS), pages 315–323, 2011.

[42] R. Gross, I. Matthews, and S. Baker. Generic vs. person specific active appearance models. 23(11):1080–1093, Nov. 2005.

[43] L. Gu and T. Kanade. 3D alignment of face in a single image. volume 1, pages 1305–1312, 2006.

[44] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE, 2016.

[45] G.-S. Hsu, K.-H. Chang, and S.-C. Huang. Regressive tree structured model for facial landmark localization. pages 3855–3861, 2015.
[46] ISO/IEC JTC 1/SC 37 Biometrics. Information technology – biometric presentation attack detection – part 1: Framework. International Organization for Standardization (https://www.iso.org/obp/ui/iso), 2016.

[47] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. pages 2017–2025, 2015.

[48] L. A. Jeni, J. F. Cohn, and T. Kanade. Dense 3D face alignment from 2D videos in real-time. volume 1, pages 1–8, 2015.

[49] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678, 2014.

[50] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. Springer, 2016.

[51] A. Jourabloo, A. Feghahati, and M. Jamzad. New algorithms for recovering highly corrupted images with impulse noise. Scientia Iranica, 19(6):1738–1745, 2012.

[52] A. Jourabloo and X. Liu. Pose-invariant 3D face alignment. pages 3694–3702, 2015.

[53] A. Jourabloo and X. Liu. Large-pose face alignment via CNN-based dense 3D model fitting. 2016.

[54] A. Jourabloo and X. Liu. Pose-invariant face alignment via CNN-based dense 3D model fitting. pages 1–17, 2017.

[55] A. Jourabloo, X. Yin, and X. Liu. Attribute preserved face de-identification. 2015.

[56] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. pages 1867–1874, 2014.

[57] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

[58] K. Kollreider, H. Fronthaler, M. I. Faraj, and J. Bigun. Real-time face detection and motion analysis with application in liveness assessment. 2(3):548–558, 2007.

[59] J. Komulainen, A. Hadid, and M. Pietikainen. Context based face anti-spoofing. IEEE, 2013.

[60] J. Komulainen, A. Hadid, M. Pietikäinen, A. Anjos, and S. Marcel. Complementary countermeasures for detecting scenic face spoofing attacks. 2013.

[61] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. pages 2144–2151, 2011.

[62] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. IEEE, 2016.

[63] S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. IEEE, 2017.

[64] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. pages 5325–5334, 2015.

[65] J. Li, Y. Wang, T. Tan, and A. K. Jain. Live face detection based on the analysis of Fourier spectra. In Biometric Technology for Human Identification. SPIE, 2004.

[66] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid. An original face anti-spoofing approach using partial convolutional neural network. In Image Processing Theory Tools and Applications (IPTA), 2016 6th International Conference on. IEEE, 2016.

[67] Y. Li, B. Sun, T. Wu, Y. Wang, and W. Gao. Face detection with end-to-end integration of a ConvNet and a 3D model. 2016.

[68] F. Liu, D. Zeng, Q. Zhao, and X. Liu. Joint face alignment and 3D face reconstruction. pages 545–560, 2016.

[69] S. Liu, P. C. Yuen, S. Zhang, and G. Zhao. 3D mask face anti-spoofing with remote photoplethysmography. pages 85–100, 2016.

[70] X. Liu. Discriminative face alignment. 31(11):1941–1954, 2009.

[71] X. Liu and T. Chen. Pose-robust face recognition using geometry assisted probabilistic modeling. volume 1, pages 502–509, 2005.

[72] X. Liu, J. Rittscher, and T. Chen. Optimal pose for face recognition. volume 2, pages 1439–1446, 2006.

[73] Y. Liu, A. Jourabloo, and X. Liu. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. IEEE, 2018.

[74] Y. Liu, A. Jourabloo, W. Ren, and X. Liu. Dense face alignment. IEEE, 2017.

[75] S. Lucey, R. Navarathna, A. B. Ashraf, and S. Sridharan. Fourier Lucas-Kanade algorithm. 35(6):1383–1396, 2013.
[76] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[77] J. Määttä, A. Hadid, and M. Pietikäinen. Face spoofing detection from single images using micro-texture analysis. IEEE, 2011.

[78] I. Matthews and S. Baker. Active appearance models revisited. 60(2):135–164, 2004.

[79] H. Mohammadzade and D. Hatzinakos. Iterative closest normal point for 3D face recognition. 35(2):381–397, 2013.

[80] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. pages 483–499, 2016.

[81] E. M. Nowara, A. Sabharwal, and A. Veeraraghavan. PPGSecure: Biometric presentation attack detection using photoplethysmograms. pages 56–62, 2017.

[82] G. Pan, L. Sun, Z. Wu, and S. Lao. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. IEEE, 2007.

[83] K. Patel, H. Han, and A. K. Jain. Cross-database face antispoofing with robust feature representation. In Chinese Conference on Biometric Recognition. Springer, 2016.

[84] K. Patel, H. Han, and A. K. Jain. Secure face unlock: Spoof detection on smartphones. 11(10):2268–2283, 2016.

[85] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. IEEE, 2016.

[86] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. pages 296–301, 2009.

[87] X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas. A recurrent encoder-decoder network for sequential face alignment. pages 38–56. Springer, 2016.

[88] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman. Deep convolutional neural networks for efficient pose estimation in gesture videos. pages 538–552, 2015.

[89] P. J. Phillips, H. Moon, S. Rizvi, P. J. Rauss, et al. The FERET evaluation methodology for face-recognition algorithms. 22(10):1090–1104, 2000.

[90] A. Pinto, H. Pedrini, W. R. Schwartz, and A. Rocha. Face spoofing detection through visual codebooks of spectral temporal cubes. 24(12):4726–4740, 2015.

[91] L.-M. Po, L. Feng, Y. Li, X. Xu, T. C.-H. Cheung, and K.-W. Cheung. Block-based adaptive ROI for remote photoplethysmography. J. Multimedia Tools and Applications, pages 1–27, 2017.

[92] C. Qu, E. Monari, T. Schuchert, and J. Beyerer. Adaptive contour fitting for pose-invariant 3D face shape reconstruction. 2015.

[93] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 FPS via regressing local binary features. pages 1685–1692, 2014.

[94] J. Roth, Y. Tong, and X. Liu. Unconstrained 3D face reconstruction. pages 2606–2615, 2015.

[95] J. Roth, Y. Tong, and X. Liu. Adaptive 3D face reconstruction from unconstrained photo collections. 2016.

[96] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. pages 397–403, 2013.

[97] E. Sánchez-Lozano, F. De la Torre, and D. González-Jiménez. Continuous regression for non-rigid image alignment. pages 250–263. Springer, 2012.

[98] R. Shao, X. Lan, and P. C. Yuen. Deep convolutional dynamic texture learning with adaptive channel-discriminability for 3D mask face anti-spoofing. In IJCB, 2017.

[99] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. pages 3476–3483, 2013.

[100] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. IEEE, 2017.

[101] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. IEEE, 2017.

[102] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. pages 1701–1708, 2014.

[103] Q. Tan, L. Gao, Y.-K. Lai, and S. Xia. Variational autoencoders for deforming 3D mesh models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5841–5850, 2018.

[104] X. Tan, Y. Li, J. Liu, and L. Jiang. Face liveness detection from a single image with sparse low rank bilinear discriminative model. pages 504–517, 2010.
[105] Y. Tong, X. Liu, F. W. Wheeler, and P. Tu. Automatic facial landmark labeling with minimal supervision. 2009.

[106] L. Tran and X. Liu. Nonlinear 3D face morphable model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7346–7355, 2018.

[107] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. 2016.

[108] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. pages 2396–2404, 2016.

[109] S. Tulyakov and N. Sebe. Regressing a 3D face shape from a single image. pages 3748–3755, 2015.

[110] G. Tzimiropoulos and M. Pantic. Optimization problems for fast AAM fitting in-the-wild. pages 593–600. IEEE, 2013.

[111] M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. pages 2729–2736. IEEE, 2010.

[112] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. 9(2579–2605), 2008.

[113] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. In ACM MM, pages 689–692, 2015.

[114] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, and Y. Ma. Toward a practical face recognition system: Robust alignment and illumination by sparse representation. 34(2):372–386, 2012.

[115] N. Wang, X. Gao, D. Tao, and X. Li. Facial feature point detection: A comprehensive survey. arXiv preprint arXiv:1410.1037, 2014.

[116] W. Wang, S. Tulyakov, and N. Sebe. Recurrent convolutional face alignment. 2016.

[117] D. Wen, H. Han, and A. Jain. Face spoof detection with image distortion analysis. 10(4):746–761, 2015.

[118] B.-F. Wu, Y.-W. Chu, P.-W. Huang, M.-L. Chung, and T.-M. Lin. A motion robust remote-PPG approach to driver's health state monitoring. pages 463–476, 2016.

[119] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3D interpreter network. 2016.

[120] Y. Wu and Q. Ji. Robust facial landmark detection under significant head poses and occlusion. pages 3658–3666, 2015.

[121] J. Xiao, S. Baker, I. Matthews, and T. Kanade. Real-time combined 2D+3D active appearance models. volume 2, pages 535–542, 2004.

[122] S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim. Robust facial landmark detection via recurrent attentive-refinement networks. pages 57–72, 2016.

[123] J. Xing, Z. Niu, J. Huang, W. Hu, and S. Yan. Towards multi-view and partially-occluded face alignment. pages 1829–1836. IEEE, 2014.

[124] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. pages 532–539, 2013.

[125] Z. Xu, S. Li, and W. Deng. Learning temporal features using LSTM-CNN architecture for face anti-spoofing. In IAPR Asian Conference on Pattern Recognition. IEEE, 2015.

[126] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Learn to combine multiple hypotheses for accurate face alignment. pages 392–396. IEEE, 2013.

[127] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu. HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. pages 2740–2748, 2015.

[128] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Convolutional channel features. pages 82–90, 2015.

[129] H. Yang and I. Patras. Mirror, mirror on the wall, tell me, is the error small? pages 4685–4693, 2015.

[130] J. Yang, Z. Lei, and S. Z. Li. Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601, 2014.

[131] J. Yang, Z. Lei, S. Liao, and S. Z. Li. Face liveness detection with component dependent descriptor. IEEE, 2013.

[132] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. pages 1099–1107, 2015.

[133] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3D dynamic facial expression database. 2008.

[134] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? pages 3320–3328, 2014.
[135] X. Yu, J. Huang, S. Zhang, W. Yan, and D. N. Metaxas. Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model. pages 1944–1951, 2013.

[136] X. Yu, Z. Lin, J. Brandt, and D. N. Metaxas. Consensus of regression for occlusion-robust facial feature localization. pages 105–118, 2014.

[137] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. pages 4353–4361, 2015.

[138] C. Zhang and Z. Zhang. A survey of recent advances in face detection. Technical report, Microsoft Research, 2010.

[139] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. pages 1–16, 2014.

[140] J. Zhang, S. Zhou, D. Comaniciu, and L. McMillan. Conditional density learning via regression with application to deformable shape segmentation. 2008.

[141] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. 32(10):692–706, 2014.

[142] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. pages 94–108, 2014.

[143] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li. A face antispoofing database with diverse attacks. IEEE, 2012.

[144] E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. pages 386–391, 2013.

[145] R. Zhou, R. Achanta, and S. Süsstrunk. Deep residual network for joint demosaicing and super-resolution. arXiv preprint arXiv:1802.06573, 2018.

[146] S. Zhou and D. Comaniciu. Shape regression machine. pages 13–25, 2007.

[147] S. Zhu, C. Li, C. Change Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. pages 4998–5006, 2015.

[148] S. Zhu, C. Li, C. C. Loy, and X. Tang. Unconstrained face alignment via cascaded compositional learning. 2016.

[149] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. pages 146–155, 2016.

[150] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. pages 787–796, 2015.

[151] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. pages 2879–2886, 2012.

[152] X. Zhu, J. Yan, D. Yi, Z. Lei, and S. Z. Li. Discriminative 3D morphable model fitting. pages 1–8, 2015.