FACE ANTI-SPOOFING: DETECTION, GENERALIZATION, AND VISUALIZATION

By

Yaojie Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2021

ABSTRACT

FACE ANTI-SPOOFING: DETECTION, GENERALIZATION, AND VISUALIZATION

By

Yaojie Liu

Face anti-spoofing is the process of distinguishing genuine faces and face presentation attacks: attackers presenting spoofing faces (e.g., photograph, digital screen, and mask) to the face recognition system and attempting to be authenticated as the genuine user. In recent years, face anti-spoofing has drawn increasing attention from the vision community, as it is a crucial step in preventing face recognition systems from a security breach. Previous approaches formulate face anti-spoofing as a binary classification problem, and many of them struggle to generalize to different conditions (such as pose, lighting, expressions, camera sensors, and unknown spoof types). Moreover, those methods work as a black box and cannot provide interpretation or visualization of their decisions. To address those challenges, we investigate face anti-spoofing in 3 stages: detection, generalization, and visualization. In the detection stage, we learn a CNN-RNN model to estimate the auxiliary tasks of face depth and rPPG signal estimation, which can bring additional knowledge for spoof detection. In the generalization stage, we investigate the detection of unknown spoof attacks and propose a novel Deep Tree Network (DTN) to well represent the unknown spoof attacks. In the visualization stage, we find "spoof trace", the subtle image pattern in spoof faces (e.g., color distortion, 3D mask edge, and Moire pattern), is effective to explain why a spoof is a spoof. We provide a proper physical modeling of the spoof traces and design a generative model to disentangle the spoof traces from input faces. In addition, we also show that a proper physical modeling can benefit other face problems, such as face shadow detection and removal. A proper shadow modeling can not only detect the shadow region effectively, but also remove the shadow in a visually plausible manner.

ACKNOWLEDGMENTS

Throughout preparing this dissertation I have received a great deal of support and assistance.
First of all, I would like to express my sincere gratitude to my advisor, Prof. Xiaoming Liu, for his invaluable advice, continuous support, and patience during my PhD study. Your knowledge, experience, and enthusiasm motivate me to improve my academic research as well as life planning. I would like to thank Prof. Arun Ross for his lead on the project. Your lead provides me with clear directions and great support in conducting research and achieving project goals. I would also like to thank the rest of my thesis committee: Prof. Anil Jain and Prof. Daniel Morris, for your insightful comments and suggestions, which push me to work on a broader research impact. It's my honor to have you as my committee.

My sincere thanks also goes to all my co-authors, Dr. Amin Jourabloo, Dr. Yousef Atoum, Joel Stehouwer, and Xiaohong Liu, for working with me on all these exciting research projects and holding on tight in catching those deadlines. I also would like to thank my mentors, Dr. Barry Theobald and Dr. Nicholas Apostoloff from Apple, and Dr. Xinyu Huang and Dr. Liu Ren from Bosch, for offering me the summer internship opportunities and leading me to work on diverse exciting projects. Special shout-out to Christopher Perry. I really enjoy working with you on the project and appreciate your skills and help on the face PAD solutions.

I thank my fellow labmates in the Computer Vision Lab: Joseph Roth, Xi Yin, Luan Tran, Ying Tai, Bangjie Yin, Ziyuan Zhang, Garrick Brazil, Feng Liu, Shengjie Zhu, Masa Hu, Andrew Hou, Abhinav Kumar, Vishal Asnani, and Xiao Guo, for sharing and exchanging knowledge and opinions, and for all the fun we have had in all the activities. Also I thank my friends: Jialin Liu, Zhiwei Wang, Joshua Engelsma, Xue Jiang, He Zhang, Jia Xue, and the list goes on and on.

Special thanks to my girlfriend Yufeng Wang and my dog Bagel the beagle, for enlightening my life and giving me emotional support.

Last but not the least, I would like to thank my parents for your unconditional love and sacrifice. Thousands of words are not even enough to express my gratitude and love to you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
Chapter 1  Introduction to Face Anti-Spoofing
  1.1 Introduction
  1.2 Overview of the thesis
    1.2.1 Contributions of the thesis

Chapter 2  Detection: Face Anti-Spoofing with Auxiliary Supervisions
  2.1 Introduction
  2.2 Prior Work
  2.3 Face Anti-Spoofing with Deep Network
    2.3.1 Depth Map Supervision
    2.3.2 rPPG Supervision
    2.3.3 Network Architecture
      2.3.3.1 CNN Network
      2.3.3.2 RNN Network
      2.3.3.3 Implementation Details
    2.3.4 Non-rigid Registration Layer
  2.4 Collection of Face Anti-Spoofing Database
  2.5 Experimental Results
    2.5.1 Experimental Setup
    2.5.2 Experimental Comparison
      2.5.2.1 Ablation Study
      2.5.2.2 Intra Testing
      2.5.2.3 Cross Testing
      2.5.2.4 Visualization and Analysis
  2.6 Conclusions

Chapter 3  Generalization: Zero-shot and Open-set Face Anti-Spoofing
  3.1 Introduction
  3.2 Prior Work
  3.3 Deep Tree Network for ZSFA
    3.3.1 Unsupervised Tree Learning
      3.3.1.1 Node Routing Function
      3.3.1.2 Tree of Known Spoofs
    3.3.2 Supervised Feature Learning
    3.3.3 Network Architecture
  3.4 Spoof in the Wild Database with Multiple Attack Types
  3.5 Experimental Results
    3.5.1 Experimental Setup
    3.5.2 Experimental Comparison
      3.5.2.1 Ablation Study
      3.5.2.2 Testing on existing databases
      3.5.2.3 Testing on SiW-M
      3.5.2.4 Visualization and Analysis
  3.6 Conclusions

Chapter 4  Visualization: Disentangling Spoof Traces with Physical Modeling
  4.1 Introduction
  4.2 Related Work
  4.3 Physics-based Spoof Trace Disentanglement
    4.3.1 Problem Formulation
    4.3.2 Disentanglement Generator
    4.3.3 Reconstruction and Synthesis
      4.3.3.1 Online 3D Warping Layer
    4.3.4 Multi-scale Discriminators
    4.3.5 Loss Functions and Training Steps
  4.4 Experiments
    4.4.1 Experimental Setup
    4.4.2 Anti-Spoofing for Known Spoof Types
    4.4.3 Anti-Spoofing for Unknown and Open-set Spoofs
    4.4.4 Spoof Trace Classification
    4.4.5 Ablation Study
    4.4.6 Visualization
  4.5 Conclusions

Chapter 5  Visualization: Blind Removal of Facial Foreign Shadow
  5.1 Introduction
  5.2 Related Work
  5.3 Proposed Method
    5.3.1 Shadow synthesis and modeling
    5.3.2 Grayscale shadow removal
    5.3.3 Colorization
    5.3.4 Temporal information sharing
    5.3.5 Training
  5.4 Training and Evaluation Data
  5.5 Experiment
    5.5.1 Experimental setup
    5.5.2 Shadow removal and segmentation
    5.5.3 Ablation Studies
  5.6 Conclusion

Chapter 6  Conclusions and Future Work
  6.1 Future Works

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1  The term definitions used in this work.
Table 2.1  The comparison of our collected SiW dataset with existing datasets for face anti-spoofing.
Table 2.2  TDR at different FDRs, cross testing on Oulu Protocol 1.
Table 2.3  ACER of our method at different N_f, on Oulu Protocol 2.
Table 2.4  The intra-testing results on four protocols of Oulu.
Table 2.5  The intra-testing results on three protocols of SiW.
Table 2.6  Cross testing on CASIA-MFSD vs. Replay-Attack.
Table 3.1  Comparing our SiW-M with existing face anti-spoofing datasets.
Table 3.2  Compare models with different routing strategies.
Table 3.3  Compare models with different tree losses and strategies. The two terms of rows 2-5 refer to using live or spoof data in tree learning. The last row is our method.
Table 3.4  AUC (%) of the model testing on CASIA, Replay, and MSU-MFSD.
Table 3.5  The evaluation and comparison of the testing on SiW-M.
Table 4.1  The evaluation on four protocols in OULU-NPU. Bold indicates the best score in each protocol.
Table 4.2  The evaluation on three protocols in SiW Dataset. We compare with the top 7 performances.
Table 4.3  The evaluation and ablation study on SiW-M Protocol I: known spoof detection.
Table 4.4  The evaluation on SiW-M Protocol II: unknown spoof detection.
Table 4.5  The evaluation on SiW-M Protocol III: open-set spoof detection.
Table 4.6  The performance comparison between impersonation attacks and obfuscation attacks.
Table 4.7  Confusion matrices of spoof medium classification based on spoof traces. The results are compared with the previous method Jourabloo et al. (2018). Green represents improvement over Jourabloo et al. (2018). Red represents performance drop.
Table 4.8  Confusion matrices of 6-class spoof trace classification on SiW-M database.
Table 5.1  A quantitative comparison for shadow removal on UCB dataset. Zhang et al. (2020b) is our implementation and trained using our synthesized data.
Table 5.2  A quantitative comparison of shadow segmentation on SFW database.

LIST OF FIGURES

Figure 2.1  Conventional CNN-based face anti-spoofing approaches utilize the binary supervision, which may lead to overfitting given the enormous solution space of CNN. This work designs a novel network architecture to leverage two auxiliary information as supervision: the depth map and rPPG signal, with the goals of improved generalization and explainable decisions during inference.
Figure 2.2  The overview of the proposed method.
Figure 2.3  The proposed CNN-RNN architecture. The number of filters are shown on top of each layer, the filter size of all filters is 3x3 with stride 1 for convolutional and 2 for pooling layers. Color code used: orange = convolution, green = pooling, purple = response map.
Figure 2.4  Example ground truth depth maps and rPPG signals.
Figure 2.5  The non-rigid registration layer.
Figure 2.6  The statistics of the subjects in the SiW database. Left side: The histogram shows the distribution of the face sizes.
Figure 2.7  Example live (top) and spoof (bottom) videos in SiW.
Figure 2.8  (a) 8 successful anti-spoofing examples and their estimated depth maps and rPPG signals. (b) 4 failure examples: the first two are live and the other two are spoof. Note our ability to estimate discriminative depth maps and rPPG signals.
Figure 2.9  Mean/Std of frontalized feature maps for live and spoof.
Figure 2.10  The MSE of estimating depth maps and rPPG signals.
Figure 3.1  To detect unknown spoof attacks, we propose a Deep Tree Network (DTN) to unsupervisedly learn a hierarchic embedding for known spoof attacks. Samples of unknown attacks will be routed through DTN and classified at the destined leaf node.
Figure 3.2  The proposed Deep Tree Network (DTN) architecture. (a) the overall structure of DTN. A tree node consists of a Convolutional Residual Unit (CRU) and a Tree Routing Unit (TRU), and a leaf node consists of a CRU and a Supervised Feature Learning (SFL) module. (b) the concept of Tree Routing Unit (TRU): finding the base with largest variations; (c) the structure of each Convolutional Residual Unit (CRU); (d) the structure of the Supervised Feature Learning (SFL) in the leaf nodes.
Figure 3.3  The examples of the live faces and 13 types of spoof attacks. The second row shows the ground truth masks for the pixel-wise supervision D_k. For (m, n) in the third row, m/n denotes the number of subjects/videos for each type of data.
Figure 3.4  The structure of the Tree Routing Unit (TRU).
Figure 3.5  Visualization of the Tree Routing.
Figure 3.6  Tree routing distribution of live/spoof data. X-axis denotes 8 leaf nodes, and y-axis denotes 15 types of data. The number in each cell represents the percentage (%) of data that fall in that leaf node. Each row sums to 1. (a) Print Protocol. (b) Transparent Mask Protocol. Yellow box denotes the unknown attacks.
Figure 3.7  t-SNE Visualization of the DTN leaf features.
Figure 4.1  The proposed approach can detect spoof faces, disentangle the spoof traces, and reconstruct the live counterparts. It can be applied to diverse spoof types and estimate distinct traces (e.g., Moire pattern in replay attack, eyebrow and wax in makeup attack, color distortion in print attack, and specular highlights in 3D mask attack). Zoom in for details.
Figure 4.2  The comparison of different deep-learning based face anti-spoofing: (a) direct FAS only provides a binary decision of spoofness; (b) auxiliary FAS can provide simple interpretation of spoofness. M denotes the auxiliary task, such as depth map estimation; (c) generative FAS can provide more intuitive interpretation of spoofness, but only for a limited number of spoof attacks; (d) the proposed method can provide spoof trace estimation for generic face spoof attacks.
Figure 4.3  Overview of the proposed Physics-guided Spoof Trace Disentanglement (PhySTD).
Figure 4.4  The proposed PhySTD network architecture. Except the last layer, each conv and transposed conv is concatenated with a batch normalization layer and a leaky ReLU layer. k3c64s2 indicates the kernel size of 3x3, the convolution channel of 64 and the stride of 2.
Figure 4.5  The visualization of image decomposition for different input faces: (a) live face (b) 3D mask attack (c) replay attack (d) print attack.
Figure 4.6  The online 3D warping layer. (a) Given the corresponding dense offset, we warp the spoof trace and add them to the target live face to create a new spoof. E.g., pixel (x, y) with offset (3, 5) is warped to pixel (x+3, y+5) in the new image. (b) To obtain a dense offset from the sparse offsets of the selected face shape vertices, Delaunay triangulation interpolation is adopted.
Figure 4.7  Preliminary mask P_0 for the negative term in inpainting mask loss. White pixels denote 1 and black pixels denote 0. White indicates the area should not be inpainted. P_0 for: (a) print, replay; (b) 3D mask and makeup; (c) partial attacks that cover the eye portion; (d) partial attacks that cover the mouth portion.
Figure 4.8  Examples of each spoof trace component. (a) the input sample faces. (b) B. (c) C. (d) T. (e) P. (f) the live counterpart reconstruction and zoom-in details. (g) results from Liu et al. (2020). (h) results from Step 1 + Step 2 with a single trace representation.
Figure 4.9  Examples of spoof trace disentanglement on SiW (a-h) and SiW-M (i-x). (a)-(d) items are print attacks and (e)-(h) items are replay attacks. (i)-(x) items are live, print, replay, half mask, silicone mask, paper mask, transparent mask, obfuscation makeup, impersonation makeup, cosmetic makeup, paper glasses, partial paper, funny eye glasses, and mannequin head. The first column is the input face, the second column is the overall spoof trace (I - I_hat), the third column is the reconstructed live.
Figure 4.10  Examples of the spoof data synthesis. The first row are the source spoof faces, the first column are the target live faces, and the remaining are the synthesized spoof faces from the live face with the corresponding spoof traces.
Figure 4.11  The t-SNE visualization of features from different scales and layers. The first 3 visualizations are from the encoder features F_1, F_2, F_3, and the last 2 visualizations are from the features that produce {B, C, T} and {P, I_P}.
Figure 4.12  The illustration of removing the disentangled spoof trace components one by one. The estimated spoof trace elements of input spoof (the first column) are progressively removed in the order of B, C, T, T_P. The last column shows the reconstructed live image after removing all three additive trace components and the inpainting trace. (a) Replay attack; (b) Makeup attack; (c) Mask attack; (d) Paper glasses attack.
Figure 4.13  The illustration of double spoof trace disentangling. The left 4 samples are live faces, and the right 4 samples are spoof faces. (a) Original Input. (b) 1st round live reconstruction. (c) 1st round spoof traces. (d) 2nd round live reconstruction. (e) 2nd round spoof traces.
Figure 5.1  The results of our shadow removal model on images from our Shadow Face in the Wild (SFW) database (top) and UCB database Zhang et al. (2020b) (bottom). The left to right are input face, output face, and shadow matte.
Figure 5.2  Examples of (a) foreign shadow, (b) strong self shadow, and (c) normal self shadow. Our model is designed to remove unwanted shadows in (a-b) while keeping normal shadow in (c).
Figure 5.3  Illustration of data synthesis components.
Figure 5.4  Illustration of our network architecture. The model mainly consists of an encoder, a shadow matte decoder, a color matrix decoder, and a shadow residual decoder.
The Temporal Sharing Module (TSM) can be easily plugged into the face encoder. Together with the temporal consistency loss L_T, we can leverage the unlabeled image frames efficiently. The green dashed lines indicate the short-cut connections and the orange dashed lines and boxes indicate the loss functions.
Figure 5.5  Illustration of the Temporal Sharing Module (TSM). It can be applied to temporal frames as well as mirrored input.
Figure 5.6  An illustration of SFW database. The first row shows the shadow faces collected under highly dynamic environments (e.g., varying shadows and head poses due to walking and driving); The second row shows the pixel-level annotations of shadow segmentation. Zoom in for viewing the quality of our annotation.
Figure 5.7  A qualitative comparison of shadow removal on testing images of UCB database. From top to bottom, we show shadow face and shadow removal results provided by Zhang et al. (2020b), the network with naive RGB shadow modeling, our single-frame network with grayscale shadow removal and colorization (GS+C), and our network with additional TSM and temporal loss.
Figure 5.8  Qualitative shadow removal evaluations on SFW database. From top to bottom, we show shadow face, shadow removal results from Le & Samaras (2019) and Zhang et al. (2020b), our single-frame model, and our temporal model, ground truth shadow segmentation (in bright purple), and predicted shadow mask (before thresholding).

LIST OF ALGORITHMS

Algorithm 1  PhySTD Training Iteration.
"For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis."

Chapter 1

Introduction to Face Anti-Spoofing

1.1 Introduction

Biometrics utilize physiological characteristics, such as face, fingerprint, and iris, or behavioral characteristics, such as typing rhythm and gait, to uniquely identify or authenticate an individual. As biometric systems are widely used in real-world applications including mobile phone authentication and access control, biometric spoofs, or Presentation Attacks (PA), are becoming a large threat, where a spoofed biometric sample is presented to the biometric system and attempted to be authenticated. Face, as one of the most popular biometric modalities, has received increasing attention in academia and industry in recent years (e.g., iPhone X). However, the attention also brings a growing incentive for hackers to design biometric presentation attacks (PA), or spoofs, to be authenticated as the genuine user. Due to the almost no-cost access to the human face, the spoof face can be as simple as a printed photo paper (i.e., print attack) and a digital image/video (i.e., replay attack), or as complicated as a 3D mask and facial cosmetic makeup. With proper handling, those spoofs can be visually very close to the genuine user's live face. As a result, these call for the need of developing robust face anti-spoofing algorithms.
In order to develop a face recognition system that is invulnerable to various types of PAs, there is an increasing demand on designing a robust face anti-spoofing (or PA detection) system to classify a face sample as live or spoof before recognizing its identity. As RGB image and video are the standard input to face recognition systems, most face anti-spoofing studies are RGB-based, either on a single image or a clip of video. Previous approaches to tackle face anti-spoofing can be categorized in three groups. The first is the motion-based methods that aim at classifying face videos based on detecting movements of facial parts. Eye-blinking is one cue proposed in Pan et al. (2007); Sun et al. (2007) to detect spoof attacks such as paper attack. In Kollreider et al. (2007), Kollreider et al. use lip motion to monitor the face liveness. Methods proposed in Chetty (2010); Chetty & Wagner (2006) combine audio and visual cues to verify the face liveness. These methods are suitable for static attacks, but not dynamic attacks such as replay or mask attacks. The second is the image quality and reflectance based methods, which design features to capture the illumination and noise information superimposed on the spoof images. As the image quality factors are heuristic based on human observation (not data-driven), they show very limited capability to generalize to complex situations with variations such as lighting, pose, cameras and expressions. The third is the texture-based methods, which discover discriminative texture characteristics unique to various attack mediums. Compared to the previous two groups, texture-based methods explore the intrinsic properties of the spoofing material and medium, and thus are more generalizable. However, due to a lack of understanding between pixel intensities and different types of attacks in the early studies, extracting robust texture features was challenging.
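The eye-blink cue mentioned above is often implemented in later practice via the eye aspect ratio (EAR) over detected eye landmarks; this is not the dissertation's method, only a minimal sketch of the idea, and the threshold 0.2 and frame count are illustrative assumptions:

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye aspect ratio over 6 eye landmarks ordered p1..p6:
    corner, two upper-lid points, corner, two lower-lid points.
    Large for an open eye, near zero for a closed eye."""
    eye = np.asarray(eye, dtype=float)
    v1 = np.linalg.norm(eye[1] - eye[5])   # vertical distance 1
    v2 = np.linalg.norm(eye[2] - eye[4])   # vertical distance 2
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance
    return (v1 + v2) / (2.0 * h)

def count_blinks(ear_sequence, threshold=0.2, min_frames=2):
    """Count blinks as runs of >= min_frames consecutive frames whose
    EAR falls below the threshold; a static photo yields zero blinks."""
    blinks, run = 0, 0
    for ear in ear_sequence:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:
        blinks += 1
    return blinks
```

A video with no blink runs would then be flagged as a potential print (paper) attack, which is exactly the failure mode the text notes for replay attacks, where recorded blinks defeat the cue.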
Most texture-based works utilize hand-crafted features and adopt shallow learning techniques (e.g., SVM and LDA) to develop an anti-spoofing system. Common local features that have been used in prior work include LBP in de Freitas Pereira et al. (2012, 2013); Määttä et al. (2011), HOG in Komulainen et al. (2013a); Yang et al. (2013), DoG in Peixoto et al. (2011); Tan et al. (2010), SIFT in Patel et al. (2016b) and SURF in Boulkenafet et al. (2017a). However, the aforementioned features to detect texture difference could be very sensitive to different illuminations, camera devices and identities. Researchers also seek solutions in different color spaces such as HSV and YCbCr in Boulkenafet et al. (2015, 2016), Fourier spectra Li et al. (2004) and Optical Flow Maps (OFM) in Bao et al. (2009).

With CNN proven to successfully outperform other learning paradigms in many computer vision tasks in Kalchbrenner et al. (2014); Krizhevsky et al. (2012); Lawrence et al. (1997), it is then introduced as a new approach to handle face anti-spoofing. In Li et al. (2016a); Patel et al. (2016a), the CNN serves as a feature extractor. Both methods fine-tune their network from a pretrained model (CaffeNet in Patel et al. (2016a), VGG-face model in Li et al. (2016a)), and extract the features to distinguish live vs. spoof. Yang et al. (2014) propose to learn a CNN as a binary classifier for face anti-spoofing. Registered face images with different spatial scales are stacked as input and live/spoof labeling is assigned as the output. In addition, Feng et al. (2016) propose to use multiple cues as the CNN input for live/spoof classification. They select Shearlet-based features to measure the image quality and the OFM of the face area as well as the whole scene area. And in Xu et al. (2015), Xu et al. propose an LSTM-CNN architecture to conduct a joint prediction for multiple frames of a video.
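The LBP-plus-shallow-classifier recipe cited above boils down to a histogram of local binary codes fed to an SVM. As a rough illustration only (the cited works typically use uniform or multi-scale LBP variants, not this basic form), the descriptor can be sketched in plain numpy:

```python
import numpy as np

def lbp_histogram(gray, bins=256):
    """Basic 8-neighbor LBP: threshold each 3x3 neighborhood at its
    center pixel, pack the 8 comparison bits into a code in [0, 255],
    and return the normalized code histogram as the texture feature."""
    g = np.asarray(gray, dtype=float)
    c = g[1:-1, 1:-1]                      # center pixels
    # 8 neighbors in a fixed clockwise order starting at top-left
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]
    codes = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[dy:dy + c.shape[0], dx:dx + c.shape[1]]
        codes |= (neighbor >= c).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=bins).astype(float)
    return hist / hist.sum()
```

In the classical pipelines, such histograms (often concatenated over face regions) are the fixed-length feature vectors handed to an SVM or LDA for the live/spoof decision.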
Though CNN-based methods provide significant improvement in terms of detection accuracy for face anti-spoofing, compared to other face related problems such as face recognition and face alignment, there are still substantially fewer efforts and explorations on face anti-spoofing using deep learning techniques. Therefore, in this work we aim to further explore the capability of CNN in handling face anti-spoofing, mainly in three aspects: improving the detection performance, generalization toward different domains such as unseen/unknown spoof types and capturing camera sensors, and providing visualization to the CNN's prediction.

1.2 Overview of the thesis

In Chapter 2, we argue the importance of auxiliary supervision to guide the learning toward more discriminative cues. A CNN-RNN model is learned to estimate the face depth with pixel-wise supervision, and to estimate rPPG signals with sequence-wise supervision. The estimated depth and rPPG are fused to distinguish live vs. spoof faces. Further, we introduce a new face anti-spoofing database that covers a large range of illumination, subject, and pose variations. Experiments show that our model achieves the state-of-the-art results on both intra- and cross-database testing.

While advanced face anti-spoofing methods are developed, new types of spoof attacks are also being created and becoming a threat to all existing systems. To study the generalization of the face anti-spoofing methods, in Chapter 3, we define the detection of unknown spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA). Previous ZSFA works only study 1-2 types of spoof attacks, such as print/replay, which limits the insight of this problem. In this chapter, we investigate the ZSFA problem in a wide range of 13 types of spoof attacks, including print, replay, 3D mask, and so on. A novel Deep Tree Network (DTN) is proposed to partition the spoof samples into semantic sub-groups in an unsupervised fashion. When a data sample arrives, being known or unknown attacks, DTN routes it to the most similar spoof cluster, and makes the binary decision. In addition, to enable the study of ZSFA, we introduce the first face anti-spoofing database that contains diverse types of spoof attacks. Experiments show that our proposed method achieves the state of the art on multiple testing protocols of ZSFA.
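The DTN's routing idea, projecting a sample's feature onto each node's largest-variation direction and descending by the sign, can be caricatured without any of the actual network. In this sketch PCA stands in for the learned routing basis (an assumption; the TRU in Chapter 3 learns its basis with an unsupervised objective during training, not by an explicit SVD at test time):

```python
import numpy as np

class RoutingNode:
    """A DTN-style routing node: project a feature onto the direction of
    largest variation among known-spoof features (found here via SVD/PCA)
    and send the sample left or right by the sign of the projection."""
    def __init__(self, features):
        self.mean = features.mean(axis=0)
        centered = features - self.mean
        # leading right-singular vector = largest-variation basis
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        self.basis = vt[0]
        self.left = self.right = None

    def route(self, x):
        return "left" if (x - self.mean) @ self.basis < 0 else "right"

def route_to_leaf(root, x, depth):
    """Follow routing decisions down a binary tree; the resulting path
    identifies the most similar spoof cluster for sample x."""
    path, node = [], root
    for _ in range(depth):
        side = node.route(x)
        path.append(side)
        node = node.left if side == "left" else node.right
        if node is None:
            break
    return path
```

An unknown attack type thus still lands in some leaf, the cluster of known spoofs it most resembles, where the leaf's supervised classifier makes the final live/spoof call.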
To gain a better visual understanding and interpretation of the spoof attacks, in Chapter 4, we identify a new problem of spoof trace disentangling. We show that the key to face anti-spoofing lies in the subtle image pattern, termed "spoof trace", e.g., color distortion, 3D mask edge, Moire pattern, and many others. Spoof trace disentangling is motivated by the noise modeling and denoising algorithms, for the purpose of inversely decomposing a spoof face into a spoof trace and a live face, and then utilizing the spoof trace for classification. Designing a generic model to estimate those spoof traces can improve not only the generalization of the spoof detection, but also the interpretability of the model's decision. We first provide a proper modeling of spoof trace as an additive noise to the genuine face. We design a novel adversarial learning framework to disentangle the spoof traces from input faces as a hierarchical combination of patterns at multiple scales. With the disentangled spoof traces, we unveil the live counterpart of the original spoof face, and further synthesize realistic new spoof faces after a proper geometric correction. Our method demonstrates superior spoof detection performance on both seen and unseen spoof scenarios while providing visually-convincing estimation of spoof traces.

In Chapter 5, we show that a proper physical modeling can benefit the face shadow detection and removal problem as well. In-the-wild face photographs often suffer from undesired foreign shadows cast by external objects, e.g., hands, phones, and trees. Removing facial foreign shadows not only improves image aesthetics but also mitigates the negative impacts on face-related tasks.
We tackle the blind removal of facial foreign shadow for both single image and videos, by making three contributions. First, we propose a novel two-stage shadow modeling that consists of grayscale shadow removal and colorization. This modeling provides an effective way to handle both color distortion and subsurface scattering effects. Second, we propose a novel Temporal Sharing Module to extract hierarchical features across multiple aligned video frames, which represents the shadow-free faces. Third, we collect a real face database with 280 videos captured under highly dynamic environments and annotate pixel-level shadow segmentation maps. Extensive experiments demonstrate the effectiveness of our approach comparing with both the baseline and state-of-the-art methods.

To better understand the work in this dissertation, Tab. 1.1 lists all the terms and the definitions used in this work.

Live face: Genuine face from the subject without any physical manipulation of its identity. Also known as bona fide face.
Spoof face: Face not from the original subject. Also known as presentation attack.
Impersonation attack: Attacks in which the attacker wants to be recognized as a different subject.
Obfuscation attack: Attacks in which the attacker wants to hide the identity of the attacker.
Spoof medium: Material used to present the spoof face, such as printed photo and digital screen.
Spoof trace: Patterns only existing in the spoof faces, such as moire pattern and 3D mask edges.

Table 1.1  The term definitions used in this work.
1.2.1 Contributions of the thesis

In this section, we list the contributions in this dissertation:

• Previous works regard face anti-spoofing as a binary classification problem. We propose to leverage novel auxiliary information (i.e., depth map and rPPG) to supervise the CNN-RNN learning for improved detection performance. The auxiliary supervisions bring additional knowledge to face anti-spoofing;

• We conduct an extensive study of the generalization problem of face anti-spoofing. Specifically, we study zero-shot face anti-spoofing on 13 different types of spoof attacks and propose a Deep Tree Network (DTN) to learn features hierarchically. DTN leverages existing spoof attack knowledge to effectively represent the unknown spoof attacks;

• To provide visual interpretation, we study a novel problem of spoof trace disentangling and propose a novel modeling to disentangle spoof traces into a hierarchical representation on generic spoof attacks. To our knowledge, this is the first work to solve face anti-spoofing in a generative and visually-intuitive approach;

• Inspired by the effective spoof trace modeling, we provide an effective modeling of the face with foreign shadow, and propose a novel approach to decompose RGB shadow removal into grayscale shadow removal and colorization, and a temporal sharing module to ensure video consistency;

• We collect databases for face anti-spoofing, including SiW and SiW-M, and for face shadow detection, termed SFW.

Chapter 2

Detection: Face Anti-Spoofing with Auxiliary Supervisions

2.1 Introduction

RGB image and video are the standard input to face anti-spoofing systems, similar to face recognition systems. Researchers start the texture-based anti-spoofing approaches by feeding handcrafted features to binary classifiers Boulkenafet et al. (2017a); de Freitas Pereira et al. (2012, 2013); Komulainen et al. (2013a); Määttä et al. (2011); Mirjalili & Ross (2017); Patel et al. (2016b); Yang et al. (2013). Later in the deep learning era, several Convolutional Neural Network (CNN) approaches utilize softmax loss as the supervision Feng et al. (2016); Li et al. (2016a); Patel et al. (2016a); Yang et al. (2014). It appears almost all prior work regard the face anti-spoofing problem as merely a binary (live vs. spoof) classification problem.
There are two main issues in learning deep anti-spoofing models with binary supervision, shown in Fig. 2.1. First, there are different levels of image degradation, namely spoof patterns, comparing a spoof face to a live one, which consist of skin detail loss, color distortion, moiré pattern, shape deformation and spoof artifacts (e.g., reflection) Li et al. (2004); Patel et al. (2016b). A CNN with softmax loss might discover arbitrary cues that are able to separate the two classes, such as screen bezel, but not the faithful spoof patterns. When those cues disappear during testing, these models would fail to distinguish spoof vs. live faces and result in poor generalization. Second, during the testing, models learnt with binary supervision will only generate a binary decision without explanation or rationale for the decision. In the pursuit of Explainable Artificial Intelligence Turek (2016), it is desirable for the learnt model to generate the spoof patterns that support the binary decision.

Figure 2.1  Conventional CNN-based face anti-spoofing approaches utilize the binary supervision, which may lead to overfitting given the enormous solution space of CNN. This work designs a novel network architecture to leverage two auxiliary information as supervision: the depth map and rPPG signal, with the goals of improved generalization and explainable decisions during inference.
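To make the contrast with binary supervision concrete, the auxiliary-supervision idea can be reduced to a toy objective: a pixel-wise loss on the estimated depth map plus a sequence-wise loss on the estimated rPPG signal, with the final spoofness score fusing the two estimates. The exact losses and fusion in Chapter 2 differ; the weights `lam` and `alpha` here are illustrative assumptions, not the dissertation's values:

```python
import numpy as np

def auxiliary_loss(pred_depth, gt_depth, pred_rppg, gt_rppg, lam=0.5):
    """Training objective sketch: pixel-wise depth-map MSE (CNN branch)
    plus sequence-wise rPPG MSE (RNN branch), combined with weight lam.
    Live faces are supervised with face-like depth and a pulse signal;
    spoof faces with a flat (zero) depth map and a zero rPPG signal."""
    depth_loss = np.mean((pred_depth - gt_depth) ** 2)
    rppg_loss = np.mean((pred_rppg - gt_rppg) ** 2)
    return depth_loss + lam * rppg_loss

def spoof_score(pred_depth, pred_rppg, alpha=0.5):
    """Inference-time fusion sketch: a live face should yield a non-flat
    depth map and a strong rPPG signal, so small norms of both indicate
    spoof. Higher score => more likely live."""
    return (1 - alpha) * np.mean(pred_depth ** 2) + alpha * np.mean(pred_rppg ** 2)
```

Under this supervision, the network cannot win by latching onto an arbitrary cue like a screen bezel: it must produce physically meaningful depth and pulse estimates, which is precisely the explainability argument made above.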
To address these issues, as shown in Fig. 2.2, we propose a deep model that uses the supervision from both the spatial and temporal auxiliary information rather than binary supervision, for the purpose of robustly detecting face PA from a face video. These auxiliary information are acquired based on our domain knowledge about the key differences between live and spoof faces, which include two perspectives: spatial and temporal. From the spatial perspective, it is known that live faces have face-like depth, e.g., the nose is closer to the camera than the cheek in frontal-view faces, while faces in print or replay attacks have flat or planar depth, e.g., all pixels on the image of a paper have the same depth to the camera. Hence, depth can be utilized as auxiliary information to supervise both live and spoof faces. From the temporal perspective, it was shown that the normal rPPG signals (i.e., heart pulse signal) are detectable from live, but not spoof, face videos Liu et al. (2016b); Nowara et al. (2017). Therefore, we provide different rPPG signals as auxiliary supervision, which guides the network to learn from live or spoof face videos respectively. To enable both supervisions, we design a network architecture with a short-cut connection to capture different scales and a novel non-rigid registration layer to handle the motion and pose change for rPPG estimation.

Furthermore, similar to many vision problems, data plays a significant role in training the anti-spoofing models. As we know, camera/screen quality is a critical factor to the quality of spoof faces. Existing face anti-spoofing databases, such as NUAA Tan et al. (2010), CASIA Zhang et al. (2012), Replay-Attack Chingovska et al. (2012), and MSU-MFSD Wen et al. (2015), were collected 3-5 years ago. Given the fast advance of consumer electronics, the types of equipment (e.g., cameras and spoof mediums) used in those data collections are outdated compared to the ones nowadays, regarding the resolution and imaging quality. More recent MSU-USSA Patel et al.
(2016b) and Oulu-NPU databases Boulkenafet et al. (2017b) have subjects with fewer variations in poses, illuminations, expressions (PIE). The lack of necessary variations would make it hard to learn an effective model. Given the clear need for more advanced databases, we collect a face anti-spoofing database, named Spoof in the Wild (SiW) database. The SiW database consists of 165 subjects, 6 spoofing mediums, and 4 sessions covering variations such as PIE, distance-to-camera, etc. SiW covers much larger variations than previous databases, as detailed in Tab. 2.1 and Sec. 2.4.

The main contributions of this work include:
- We propose to leverage novel auxiliary information (i.e., depth map and rPPG) to supervise the CNN learning for improved generalization.
- We propose a novel CNN-RNN architecture for end-to-end learning the depth map and rPPG signal.
- We release a new database that contains variations of PIE, and other practical factors. We achieve the state-of-the-art performance for face anti-spoofing.

Figure 2.2: The overview of the proposed method.

2.2 Prior Work

We review the prior face anti-spoofing works in three groups: texture-based methods, temporal-based methods, and remote photoplethysmography methods.

Texture-based Methods: Since most face recognition systems adopt only RGB cameras, using texture information has been a natural approach to tackling face anti-spoofing. Many prior works utilize hand-crafted features, such as LBP de Freitas Pereira et al. (2012, 2013); Määttä et al. (2011), HoG Komulainen et al. (2013a); Yang et al. (2013), SIFT Patel et al. (2016b) and SURF Boulkenafet et al. (2017a), and adopt traditional classifiers such as SVM and LDA. To overcome the influence of illumination variation, they seek solutions in a different input domain, such as HSV and YCbCr color space Boulkenafet et al. (2015, 2016), and Fourier spectrum Li et al. (2004).
As deep learning has proven to be effective in many computer vision problems, there are many recent attempts of using CNN-based features or CNNs in face anti-spoofing Feng et al. (2016); Li et al. (2016a); Patel et al. (2016a); Yang et al. (2014). Most of the work treats face anti-spoofing as a simple binary classification problem by applying the softmax loss. For example, Li et al. (2016a); Patel et al. (2016a) use CNN as a feature extractor and fine-tune from ImageNet-pretrained CaffeNet and VGG-face. The work of Feng et al. (2016); Li et al. (2016a) feed different designs of the face images into CNN, such as multi-scale faces and hand-crafted features, and directly classify live vs. spoof. One prior work that shares similarity with ours is Atoum et al. (2017), where Atoum et al. propose a two-stream CNN-based anti-spoofing method using texture and depth. We advance Atoum et al. (2017) in a number of aspects, including fusion with temporal supervision (i.e., rPPG), architecture design, a novel non-rigid registration layer, and comprehensive experimental support.

Temporal-based Methods: One of the earliest solutions for face anti-spoofing is based on temporal cues such as eye-blinking Pan et al. (2007); Patel et al. (2016a). Methods such as Kollreider et al. (2007); Shao et al. (2017) track the motion of mouth and lip to detect the face liveness. While these methods are effective to typical paper attacks, they become vulnerable when attackers present a replay attack or a paper attack with the eye/mouth portion being cut.

There are also methods relying on more general temporal features, instead of the specific facial motion. The most common approach is frame concatenation. Many handcrafted feature-based methods may improve intra-database testing performance by simply concatenating the features of consecutive frames to train the classifiers Boulkenafet et al. (2015); de Freitas Pereira et al. (2012); Komulainen et al. (2013b). Additionally, there are some works proposing temporal features, e.g., Haralick features Agarwal et al. (2016), motion magnification Bharadwaj et al. (2014), and optical flow Bao et al. (2009). In the deep learning era, Feng et al. feed the optical flow map and Shearlet image feature to CNN Feng et al. (2016). In Xu et al. (2015), Xu et al.
propose an LSTM-CNN architecture to utilize temporal information for binary classification. Overall, all prior methods still regard face anti-spoofing as a binary classification problem, and thus they have a hard time generalizing well in the cross-database testing. In this work, we extract discriminative temporal information by learning the rPPG signal of the face video.

Remote Photoplethysmography (rPPG): Remote photoplethysmography (rPPG) is the technique to track vital signals, such as heart rate, without any contact with human skin Bobbia et al. (2016); de Haan & Jeanne (2013); Po et al. (2017); Tulyakov et al. (2016); Wu et al. (2016). Research progresses from face videos with no motion or illumination change to videos with multiple variations. In de Haan & Jeanne (2013), Haan et al. estimate rPPG signals from RGB face videos with lighting and motion changes. It utilizes color difference to eliminate the specular reflection and estimate two orthogonal chrominance signals. After applying the band-pass filter (BPF), the ratio of the chrominance signals is used to compute the rPPG signal.

rPPG has previously been utilized to tackle face anti-spoofing Liu et al. (2016b); Nowara et al. (2017). In Liu et al. (2016b), rPPG signals are used for detecting the 3D mask attack, where the live faces exhibit a pulse of heart rate unlike the 3D masks. They use rPPG signals extracted by de Haan & Jeanne (2013) and compute the correlation features for classification. Similarly, Nowara et al. (2017) extract rPPG signals (also via de Haan & Jeanne (2013)) from three face regions and two non-face regions, for detecting print and replay attacks. Although in replay attacks, the rPPG extractor might still capture the normal pulse, the combination of multiple regions can differentiate live vs. spoof faces. While the analytic solution to rPPG extraction de Haan & Jeanne (2013) is easy to implement, we observe that it is sensitive to PIE variations. Hence, we employ a novel CNN-RNN architecture to learn a mapping from a face video to the rPPG signal, which is not only robust to PIE variations, but also discriminative to live vs. spoof.

Figure 2.3: The proposed CNN-RNN architecture. The number of filters is shown on top of each layer; the filter size is 3×3, with stride 1 for convolutional layers and 2 for pooling layers.
Color code used: orange = convolution, green = pooling, purple = response map.

2.3 Face Anti-Spoofing with Deep Network

The main idea of the proposed approach is to guide the deep network to focus on the known spoof patterns across spatial and temporal domains, rather than to extract any cues that could separate two classes but are not generalizable. As shown in Fig. 2.2, the proposed network combines CNN and RNN architectures in a coherent way. The CNN part utilizes the depth map supervision to discover subtle texture property that leads to distinct depths for live and spoof faces. Then, it feeds the estimated depth and the feature maps to a novel non-rigid registration layer to create aligned feature maps. The RNN part is trained with the aligned maps and the rPPG supervision, which examines temporal variability across video frames.

2.3.1 Depth Map Supervision

Depth maps are a representation of the 3D shape of the face in a 2D image, which shows the face location and the depth information of different facial areas. This representation is more informative than binary labels since it indicates one of the fundamental differences between live faces, and print and replay PAs. We utilize the depth maps in the depth loss function to supervise the CNN part. The pixel-based depth loss guides the CNN to learn a mapping from the face area within a receptive field to a labeled depth value: a scale within [0, 1] for live faces and 0 for spoof faces.

To estimate the depth map for a 2D face image, given a face image, we utilize the state-of-the-art dense face alignment (DeFA) methods Jourabloo & Liu (2017); Liu et al. (2017) to estimate the 3D shape of the face. The frontal dense 3D shape S_F ∈ R^{3×Q}, with Q vertices, is represented as a linear combination of identity bases {S_id^i}_{i=1}^{N_id} and expression bases {S_exp^i}_{i=1}^{N_exp}:

    S_F = S_0 + \sum_{i=1}^{N_id} \alpha_{id}^i S_{id}^i + \sum_{i=1}^{N_exp} \alpha_{exp}^i S_{exp}^i,    (2.1)

where α_id ∈ R^199 and α_exp ∈ R^29 are the identity and expression parameters, and α = [α_id, α_exp] are the shape parameters. We utilize the Basel 3D face model Paysan et al. (2009) and the FaceWarehouse Cao et al. (2014) as the identity and expression bases.
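Equation 2.1 is a standard linear 3DMM combination; the following is a minimal NumPy sketch of it, with array shapes and function name chosen for illustration (they are not taken from any released code).

```python
import numpy as np

def frontal_shape(S0, S_id, S_exp, alpha_id, alpha_exp):
    """Eq. 2.1: S_F = S0 + sum_i alpha_id[i]*S_id[i] + sum_i alpha_exp[i]*S_exp[i].

    S0:        (3, Q) mean shape
    S_id:      (N_id, 3, Q) identity bases (N_id = 199 for the Basel model)
    S_exp:     (N_exp, 3, Q) expression bases (N_exp = 29 for FaceWarehouse)
    alpha_id:  (N_id,) identity parameters
    alpha_exp: (N_exp,) expression parameters
    """
    # tensordot over the basis axis gives the weighted sum of basis shapes
    return (S0
            + np.tensordot(alpha_id, S_id, axes=1)
            + np.tensordot(alpha_exp, S_exp, axes=1))
```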
With the estimated pose parameters P = (s, R, t), where R is a 3D rotation matrix, t is a 3D translation, and s is a scale, we align the 3D shape S to the 2D face image:

    S = s R S_F + t.    (2.2)

Given the challenge of estimating the absolute depth from a 2D face, we normalize the z values of the 3D vertices in S to be within [0, 1]. That is, the vertex closest to the camera (e.g., nose) has a depth of one, and the vertex furthest away has the depth of zero. Then, we apply the Z-Buffer algorithm Zhu et al. (2016) to S for projecting the normalized z values to a 2D plane, which results in an estimated "ground truth" 2D depth map D ∈ R^{32×32} for a face image.

2.3.2 rPPG Supervision

rPPG signals have recently been utilized for face anti-spoofing Liu et al. (2016b); Nowara et al. (2017). The rPPG signal provides temporal information about face liveness, as it is related to the intensity changes of facial skin over time. These intensity changes are highly correlated with the blood flow. The traditional method de Haan & Jeanne (2013) for extracting rPPG signals has three drawbacks. First, it is sensitive to pose and expression variation, as it becomes harder to track a specific face area for measuring intensity changes. Second, it is also sensitive to illumination changes, since the extra lighting affects the amount of reflected light from the skin. Third, for the purpose of anti-spoofing, rPPG signals extracted from spoof videos might not be sufficiently distinguishable to signals of live videos.
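The pose alignment of Eq. 2.2 and the depth normalization of Sec. 2.3.1 can be sketched as below. This is a simplified illustration: the Z-buffer rasterization to the 32×32 map is omitted, and the convention that a larger z value means closer to the camera is our assumption for the sketch.

```python
import numpy as np

def normalized_vertex_depth(S_F, s, R, t):
    """Align the frontal shape to the image (Eq. 2.2: S = s*R*S_F + t) and map
    the vertex z values to [0, 1]: 1 for the vertex closest to the camera
    (assumed here to have the largest z), 0 for the furthest one."""
    S = s * (R @ S_F) + t.reshape(3, 1)   # Eq. 2.2, t broadcast over Q vertices
    z = S[2]
    d = (z - z.min()) / (z.max() - z.min() + 1e-8)
    return S, d
```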
One novel aspect of our approach is that, instead of computing the rPPG signal via de Haan & Jeanne (2013), our RNN part learns to estimate the rPPG signal. This eases the signal estimation from face videos with PIE variations, and also leads to more discriminative rPPG signals, as different rPPG supervisions are provided to live vs. spoof videos. We assume that the videos of the same subject under different PIE conditions have the same ground truth rPPG signal. This assumption is valid since the heartbeat is similar for the videos of the same subject that are captured in a short span of time (< 5 minutes). The rPPG signal extracted from the constrained videos (i.e., no PIE variation) is used as the "ground truth" supervision in the rPPG loss function for all live videos of the same subject. This consistent supervision helps the CNN and RNN parts to be robust to the PIE changes.

In order to extract the rPPG signal from a face video without PIE variations, we apply the DeFA Liu et al. (2017) to each frame and estimate the dense 3D face shape. We utilize the estimated 3D shape to track a face region. For a tracked region, we compute two orthogonal chrominance signals x_f = 3r_f − 2g_f and y_f = 1.5r_f + g_f − 1.5b_f, where r_f, g_f, b_f are the bandpass filtered versions of the r, g, b channels with the skin-tone normalization. We utilize the ratio of the standard deviations of the chrominance signals γ = σ(x_f)/σ(y_f) to compute blood flow signals de Haan & Jeanne (2013). We calculate the signal p as:

    p = 3(1 − γ/2) r_f − 2(1 + γ/2) g_f + (3γ/2) b_f.    (2.3)

By applying FFT to p, we obtain the rPPG signal f ∈ R^50, which shows the magnitude of each frequency.

2.3.3 Network Architecture

Our proposed network consists of two deep networks. First, a CNN part evaluates each frame separately and estimates the depth map and feature map of each frame. Second, a recurrent neural network (RNN) part evaluates the temporal variability across the feature maps of a sequence.

2.3.3.1 CNN Network

We design a Fully Convolutional Network (FCN) as our CNN part, as shown in Fig. 2.3. The CNN part contains multiple blocks of three convolutional layers, pooling and resizing layers, where each convolutional layer is followed by one exponential linear unit (ELU) layer and a batch normalization layer.
Then, the resizing layers resize the response maps after each block to a predefined size of 64×64 and concatenate the response maps. The bypass connections help the network to utilize extracted features from layers with different depths, similar to the ResNet structure He et al. (2016). After that, our CNN has two branches, one for estimating the depth map and the other for estimating the feature map.

The first output of the CNN is the estimated depth map of the input frame I ∈ R^{256×256}, which is supervised by the estimated "ground truth" depth D:

    Θ_D = \arg\min_{Θ_D} \sum_{i=1}^{N_d} || CNN_D(I_i; Θ_D) − D_i ||_1^2,    (2.4)

where Θ_D is the CNN parameters and N_d is the number of training images. The second output of the CNN is the feature map, which is fed into the non-rigid registration layer.

2.3.3.2 RNN Network

The RNN part aims to estimate the rPPG signal f of an input sequence with N_f frames {I_j}_{j=1}^{N_f}. As shown in Fig. 2.3, we utilize one LSTM layer with 100 hidden neurons, one fully connected layer, and an FFT layer that converts the response of the fully connected layer into the Fourier domain. Given the input sequence {I_j}_{j=1}^{N_f} and the "ground truth" rPPG signal f, we train the RNN to minimize the ℓ1 distance of the estimated rPPG signal to the "ground truth" f:

    Θ_R = \arg\min_{Θ_R} \sum_{i=1}^{N_s} || RNN_R([{F_j}_{j=1}^{N_f}]_i; Θ_R) − f_i ||_1^2,    (2.5)

where Θ_R is the RNN parameters, F_j ∈ R^{32×32} is the frontalized feature map (details in Sec. 2.3.4), and N_s is the number of sequences.

2.3.3.3 Implementation Details

Ground Truth Data: Given a set of live and spoof face videos, we provide the ground truth supervision for the depth map D and rPPG signal f, as in Fig. 2.4. We follow the procedure in Sec. 2.3.1 to compute "ground truth" data for live videos. For spoof videos, we set the ground truth depth maps to a plain surface, i.e., zero depth. Similarly, we follow the procedure in Sec. 2.3.2 to compute the "ground truth" rPPG signal from a patch on the forehead, for one live video of each subject without PIE variation. Also, we normalize the norm of the estimated rPPG signal such that ||f||_2 = 1. For spoof videos, we consider the rPPG signals to be zero.

Figure 2.4: Example ground truth depth maps and rPPG signals.
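The two auxiliary supervisions above reduce to simple per-sample losses; a minimal sketch of Eqs. 2.4 and 2.5 is shown below, reading ||·||_1^2 as the squared ℓ1 norm and using the all-zero targets for spoof samples described in the ground-truth paragraph. This is an illustration of the loss terms only, not the training loop.

```python
import numpy as np

def depth_loss(D_pred, D_gt):
    """Eq. 2.4 (one frame): squared l1 distance between the predicted 32x32
    depth map and the "ground truth" map (all zeros for a spoof frame)."""
    return np.sum(np.abs(D_pred - D_gt)) ** 2

def rppg_loss(f_pred, f_gt):
    """Eq. 2.5 (one sequence): squared l1 distance between the predicted rPPG
    spectrum (R^50) and the "ground truth" (all zeros for a spoof video)."""
    return np.sum(np.abs(f_pred - f_gt)) ** 2
```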
Note that, while the term "depth" is used here, our estimated depth is different to the conventional depth map in computer vision. It can be viewed as a "pseudo-depth" and serves the purpose of providing discriminative auxiliary supervision to the learning process. The same perspective applies to the supervision based on the pseudo-rPPG signal.

Training Strategy: Our proposed network combines the CNN and RNN parts for end-to-end training. The desired training data for the CNN part should be from diverse subjects, so as to make the training procedure more stable and increase the generalizability of the learned model. Meanwhile, the training data for the RNN part should be long sequences to leverage the temporal information across frames. These two preferences can be contradictory to each other, especially given the limited GPU memory. Hence, to satisfy both preferences, we design a two-stream training strategy. The first stream fits the preference of the CNN part, where the input includes face images I and the ground truth depth maps D. The second stream fits the RNN part, where the input includes face sequences {I_j}_{j=1}^{N_f}, the ground truth depth maps {D_j}_{j=1}^{N_f}, the estimated 3D shapes {S_j}_{j=1}^{N_f}, and the corresponding ground truth rPPG signals f. During training, our method alternates between these two streams to converge to a model that minimizes both the depth map and rPPG losses. Note that even though the first stream only updates the weights of the CNN part, the backpropagation of the second stream updates the weights of both CNN and RNN parts in an end-to-end manner.

Testing: To provide a classification score, we feed the testing sequence to our network and compute the depth map D̂ of the last frame and the rPPG signal f̂. Instead of designing a classifier using D̂ and f̂, we compute the score as:

    score = ||f̂||_2^2 + λ ||D̂||_2^2,    (2.6)

where λ is a constant weight for combining the two outputs of the network.

Figure 2.5: The non-rigid registration layer.
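The test-time score of Eq. 2.6 is a weighted sum of the two output energies; a minimal sketch follows, where the default λ = 0.015 is taken from the parameter setting reported later in Sec. 2.5.1.

```python
import numpy as np

def spoof_score(f_hat, D_hat, lam=0.015):
    """Eq. 2.6: score = ||f_hat||_2^2 + lam * ||D_hat||_2^2.
    Live faces (non-zero pseudo-depth and rPPG) yield high scores;
    spoof faces (both outputs pushed toward zero) yield low scores."""
    return float(np.sum(f_hat ** 2) + lam * np.sum(D_hat ** 2))
```

Thresholding this score then gives the final live-vs-spoof decision.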
2.3.4 Non-rigid Registration Layer

We design a new non-rigid registration layer to prepare data for the RNN part. This layer utilizes the estimated dense 3D shape to align the activation or feature maps from the CNN part. This layer is important to ensure that the RNN tracks and learns the changes of the activations for the same facial area across time, as well as across all subjects.

As shown in Fig. 2.5, this layer has three inputs: the feature map T ∈ R^{32×32}, the depth map D̂ and the 3D shape S. Within this layer, we threshold the depth map and generate a binary mask V ∈ R^{32×32}:

    V = D̂ ≥ threshold.    (2.7)

Then, we compute the inner product of the binary mask and the feature map U = T ⊙ V, which essentially utilizes the depth map as a visibility indicator for each pixel in the feature map. If the depth value for one pixel is less than the threshold, we consider that pixel to be invisible. Finally, we frontalize U by utilizing the estimated 3D shape S:

    F(i, j) = U(S(m_ij, 1), S(m_ij, 2)),    (2.8)

where m ∈ R^K is the predefined list of K indexes of the face area in S_0, and m_ij is the corresponding index of pixel (i, j). We utilize m to project the masked activation map U to the frontalized image F. This proposed non-rigid registration layer has three contributions to our network:

- By applying the non-rigid registration, the input data are aligned and the RNN can compare the feature maps without concerning about the facial pose or expression. In other words, it can learn the temporal changes in the activations of the feature maps for the same facial area.

- The non-rigid registration removes the background area in the feature map. Hence the background area would not participate in RNN learning, although the background information is already utilized in the layers of the CNN part.
- For spoof faces, the depth maps are likely to be closer to zero. Hence, the inner product with the depth maps substantially weakens the activations in the feature maps, which makes it easier for the RNN to output zero rPPG signals. Likewise, the backpropagation from the rPPG loss also encourages the CNN part to generate zero depth maps for either all frames, or one pixel location in the majority of the frames within an input sequence.

Table 2.1: The comparison of our collected SiW dataset with existing datasets for face anti-spoofing.

| Dataset | # of subj. | # of sess. | # of live/attack videos (V) or images (I) | Pose range | Different expres. | Extra light. | Display devices | Spoof attacks |
| NUAA Tan et al. (2010) | 15 | 3 | 5,105/7,509 (I) | Frontal | No | Yes | - | 1 Print |
| CASIA-MFSD Zhang et al. (2012) | 50 | 3 | 150/450 (V) | Frontal | No | No | iPad | 1 Print, 1 Replay |
| Replay-Attack Chingovska et al. (2012) | 50 | 1 | 200/1,000 (V) | Frontal | No | Yes | iPhone 3GS, iPad | 1 Print, 2 Replay |
| MSU-MFSD Wen et al. (2015) | 35 | 1 | 110/330 (V) | Frontal | No | No | iPad Air, iPhone 5S | 1 Print, 2 Replay |
| MSU-USSA Patel et al. (2016b) | 1,140 | 1 | 1,140/9,120 (I) | < 45° | Yes | Yes | MacBook, Nexus 5, Nvidia Shield Tablet | 2 Print, 6 Replay |
| Oulu-NPU Boulkenafet et al. (2017b) | 55 | 3 | 1,980/3,960 (V) | Frontal | No | Yes | Dell 1905FP, Macbook Retina | 2 Print, 2 Replay |
| SiW | 165 | 4 | 1,320/3,300 (V) | < 90° | Yes | Yes | iPad Pro, iPhone 7, Galaxy S8, Asus MB168B | 2 Print, 4 Replay |

Figure 2.6: The statistics of the subjects in the SiW database. Left side: the histogram shows the distribution of the face sizes.

2.4 Collection of Face Anti-Spoofing Database

With the advance of sensor technology, existing anti-spoofing systems can be vulnerable to emerging high-quality spoof mediums. One way to make the systems robust to these attacks is to collect new high-quality databases. In response to this need, we collect a new face anti-spoofing database named Spoof in the Wild (SiW) database, which has multiple advantages over previous datasets as shown in Tab. 2.1. First, it contains substantially more live subjects with diverse races, e.g., 3 times the subjects of Oulu-NPU. Note that MSU-USSA is constructed using existing images of celebrities without capturing live faces. Second, live videos are captured with two high-quality cameras (Canon EOS T6, Logitech C920 webcam) with different PIE variations.
SiW provides live and spoof 30-FPS videos from 165 subjects. For each subject, we have 8 live and 20 spoof videos, in total 4,620 videos. Some statistics of the subjects are shown in Fig. 2.6. The live videos are collected in four sessions. In Session 1, the subject moves his head with varying distances to the camera. In Session 2, the subject changes the yaw angle of the head within [−90°, 90°], and makes different face expressions. In Sessions 3 and 4, the subject repeats Sessions 1 and 2, while the collector moves the point light source around the face from different orientations.

The live videos captured by both cameras are of 1,920×1,080 resolution. We provide two print and four replay video attacks for each subject, with examples shown in Fig. 2.7. To generate different qualities of print attacks, we capture a high-resolution image (5,184×3,456) for each subject and use it to make a high-quality print attack. Also, we extract a frontal-view frame from a live video for a lower-quality print attack. We print the images with an HP Color LaserJet M652 printer. The print attack videos are captured by holding printed papers still or warping them in front of the cameras. To generate high-quality replay attack videos, we select four spoof mediums: Samsung Galaxy S8, iPhone 7, iPad Pro, and PC (Asus MB168B) screens. For each subject, we randomly select two of the four high-quality live videos to display in the spoof mediums.

Figure 2.7: Example live (top) and spoof (bottom) videos in SiW.

2.5 Experimental Results

2.5.1 Experimental Setup

Databases: We evaluate our method on multiple databases to demonstrate its generalizability. We utilize SiW and Oulu databases Boulkenafet et al. (2017b) as new high-resolution databases and perform intra and cross testing between them. Also, we use the CASIA-MFSD Zhang et al. (2012) and Replay-Attack Chingovska et al. (2012) databases for cross testing and comparing with the state of the art.
Parameter setting: The proposed method is implemented in TensorFlow Abadi et al. (2016) with a constant learning rate of 3e-3, and 10 epochs of the training phase. The batch size of the CNN stream is 10 and that of the CNN-RNN stream is 2 with N_f being 5. We randomly initialize our network by using a normal distribution with zero mean and std of 0.02. We set λ in Eq. 2.6 to 0.015 and the threshold in Eq. 2.7 to 0.1.

Evaluation metrics: To compare with prior works, we report our results with the following metrics: Attack Presentation Classification Error Rate (APCER) ISO/IEC-JTC-1/SC-37 (2016), Bona Fide Presentation Classification Error Rate (BPCER) ISO/IEC-JTC-1/SC-37 (2016), ACER = (APCER + BPCER)/2 ISO/IEC-JTC-1/SC-37 (2016), and Half Total Error Rate (HTER). The HTER is half of the summation of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR).

Table 2.2: TDR at different FDRs, cross testing on Oulu Protocol 1.

| FDR | 1% | 2% | 10% | 20% |
| Model 1 | 8.5% | 18.1% | 71.4% | 81.0% |
| Model 2 | 40.2% | 46.9% | 78.5% | 93.5% |
| Model 3 | 39.4% | 42.9% | 67.5% | 87.5% |
| Model 4 | 45.8% | 47.9% | 81.0% | 94.2% |

Table 2.3: ACER of our method at different N_f, on Oulu Protocol 2.

| Train \ Test | 5 | 10 | 20 |
| 5 | 4.16% | 4.16% | 3.05% |
| 10 | 4.02% | 3.61% | 2.78% |
| 20 | 4.10% | 3.67% | 2.98% |

2.5.2 Experimental Comparison

2.5.2.1 Ablation Study

Advantage of proposed architecture: We compare four architectures to demonstrate the advantages of the proposed loss layers and non-rigid registration layer. Model 1 has an architecture similar to the CNN part in our method (Fig. 2.3), except that it is extended with additional pooling layers, fully connected layers, and softmax loss for binary classification. Model 2 is the CNN part in our method with a depth map loss function. We simply use ||D̂||_2 for classification. Model 3 contains the CNN and RNN parts without the non-rigid registration layer. Both the depth map and rPPG loss functions are utilized in this model. However, the RNN part would process unregistered feature maps from the CNN. Model 4 is the proposed architecture.

We train all four models with the live and spoof videos from 20 subjects of SiW. We compute the cross-testing performance of all models on Protocol 1 of the Oulu database. The TDRs at different FDRs are reported in Tab. 2.2.
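The aggregate metrics defined in Sec. 2.5.1 reduce to simple means; the sketch below illustrates them, assuming the APCER/BPCER (and FRR/FAR) inputs have already been computed per the cited ISO/IEC standard.

```python
def acer(apcer, bpcer):
    """ACER: mean of the attack and bona fide presentation
    classification error rates (ISO/IEC 30107-3)."""
    return (apcer + bpcer) / 2.0

def hter(frr, far):
    """HTER: half of the sum of the false rejection rate (FRR)
    and the false acceptance rate (FAR)."""
    return (frr + far) / 2.0
```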
Model 1 has a poor performance due to the binary supervision. In comparison, by only using the depth map as supervision, Model 2 achieves substantially better performance. However, after adding the RNN part with the rPPG supervision, our proposed Model 4 can achieve further performance improvement. By comparing Model 4 and Model 3, we can see the advantage of the non-rigid registration layer. It is clear that the RNN part cannot use the feature maps directly for tracking the changes in the activations and estimating the rPPG signals.

Advantage of longer sequences: To show the advantage of utilizing longer sequences for estimating the rPPG, we train and test our model when the sequence length N_f is 5, 10, or 20, using intra-testing on Oulu Protocol 2. From Tab. 2.3, we can see that by increasing the sequence length, the ACER decreases due to more reliable rPPG estimation. Despite the benefit of longer sequences, in practice, we are limited by the GPU memory size, and forced to decrease the image size to 128×128 for all experiments in Tab. 2.3. Hence, we set N_f to be 5 with the image size of 256×256 in subsequent experiments, due to the importance of higher resolution (e.g., a lower ACER of 2.5% in Tab. 2.4 is achieved than 4.16%).

2.5.2.2 Intra Testing

We perform intra testing on the Oulu and SiW databases. For Oulu, we follow the four protocols Boulkenafet (2017) and report their APCER, BPCER and ACER. Tab. 2.4 shows the comparison of our proposed method and the best two methods for each protocol respectively, in the face anti-spoofing competition Boulkenafet (2017). Our method achieves the lowest ACER in 3 out of 4 protocols.
We have slightly worse ACER on Protocol 2. To set a baseline for future study on SiW, we define three protocols for SiW. Protocol 1 deals with variations in face pose and expression. We train using the first 60 frames of the training videos that are mainly frontal view faces, and test on all testing videos. Protocol 2 evaluates the performance of cross spoof medium of replay attack. Protocol 3 evaluates the performance of cross PA, i.e., from print attack to replay attack and vice versa. Tab. 2.5 shows the definition and our performance of each protocol.

Table 2.4: The intra-testing results on four protocols of Oulu.

| Prot. | Method | APCER (%) | BPCER (%) | ACER (%) |
| 1 | CPqD | 2.9 | 10.8 | 6.9 |
| 1 | GRADIANT | 1.3 | 12.5 | 6.9 |
| 1 | Ours | 1.6 | 1.6 | 1.6 |
| 2 | MixedFASNet | 9.7 | 2.5 | 6.1 |
| 2 | Ours | 2.7 | 2.7 | 2.7 |
| 2 | GRADIANT | 3.1 | 1.9 | 2.5 |
| 3 | MixedFASNet | 5.3±6.7 | 7.8±5.5 | 6.5±4.6 |
| 3 | GRADIANT | 2.6±3.9 | 5.0±5.3 | 3.8±2.4 |
| 3 | Ours | 2.7±1.3 | 3.1±1.7 | 2.9±1.5 |
| 4 | MassyHNU | 35.8±35.3 | 8.3±4.1 | 22.1±17.6 |
| 4 | GRADIANT | 5.0±4.5 | 15.0±7.1 | 10.0±5.0 |
| 4 | Ours | 9.3±5.6 | 10.4±6.0 | 9.5±6.0 |

Table 2.5: The intra-testing results on three protocols of SiW.

| Prot. | Subset | Subject # | Attack | APCER (%) | BPCER (%) | ACER (%) |
| 1 | Train / Test | 90 / 75 | First 60 Frames / All | 3.58 | 3.58 | 3.58 |
| 2 | Train / Test | 90 / 75 | 3 display / 1 display | 0.57±0.69 | 0.57±0.69 | 0.57±0.69 |
| 3 | Train / Test | 90 / 75 | print (display) / display (print) | 8.31±3.81 | 8.31±3.80 | 8.31±3.81 |

2.5.2.3 Cross Testing

To demonstrate the generalization of our method, we perform multiple cross-testing experiments. Our model is trained with the live and spoof videos of 80 subjects in SiW, and tested on all protocols of Oulu. The ACERs on Protocols 1-4 are respectively: 10.0%, 14.1%, 13.8±5.7%, and 10.0±8.8%. Comparing these cross-testing results to the intra-testing results in Boulkenafet (2017), we are ranked sixth on the average ACER of four protocols, among the 15 participants of the face anti-spoofing competition. Especially on Protocol 4, the hardest one among all protocols, we achieve the same ACER of 10.0% as the top performer. This is a notable result since cross testing is known to be substantially harder than intra testing, and yet our cross-testing result is comparable with the top intra-testing performance. This demonstrates the generalization ability of our learnt model.

Furthermore, we utilize the CASIA-MFSD and Replay-Attack databases to perform cross testing between them, which is widely used as a cross-testing benchmark. Tab. 2.6 compares the cross-testing HTER of different methods. Our proposed method reduces the cross-testing errors on the Replay-Attack and CASIA-MFSD databases by 8.9% and 24.6% respectively, relative to the previous SOTA.

Table 2.6: Cross testing on CASIA-MFSD vs. Replay-Attack.

| Method | CASIA → Replay | Replay → CASIA |
| Motion de Freitas Pereira et al. (2013) | 50.2% | 47.9% |
| LBP de Freitas Pereira et al. (2013) | 55.9% | 57.6% |
| LBP-TOP de Freitas Pereira et al. (2013) | 49.7% | 60.6% |
| Motion-Mag Bharadwaj et al. (2013) | 50.1% | 47.0% |
| Spectral Pinto et al. (2015) | 34.4% | 50.0% |
| CNN Yang et al. (2014) | 48.5% | 45.5% |
| LBP Boulkenafet et al. (2015) | 47.0% | 39.6% |
| Colour Texture Boulkenafet et al. (2016) | 30.3% | 37.7% |
| Ours | 27.6% | 28.4% |

2.5.2.4 Visualization and Analysis

Examples of successful and failure cases in estimating depth maps and rPPG signals are shown in Fig. 2.8. In the proposed architecture, the frontalized feature maps are utilized as input to the RNN part and are supervised by the rPPG loss function. The values of these maps can show the importance of different facial areas to rPPG estimation. Fig. 2.9 shows the mean and standard deviation of frontalized feature maps, computed from 1,080 live and spoof videos of Oulu. We can see that the side areas of forehead and cheek have higher influence for rPPG estimation.

Figure 2.8: (a) 8 successful anti-spoofing examples and their estimated depth maps and rPPG signals. (b) 4 failure examples: the first two are live and the other two are spoof. Note our ability to estimate discriminative depth maps and rPPG signals.

Figure 2.9: Mean/std of frontalized feature maps for live and spoof.

Figure 2.10: The MSE of estimating depth maps and rPPG signals.
While the goal of our system is to detect PAs, our model is trained to estimate the auxiliary information. Hence, in addition to anti-spoofing, we would also like to evaluate the accuracy of the auxiliary information estimation. For this purpose, we calculate the accuracy of estimating depth maps and rPPG signals, for the testing data in Protocol 2 of Oulu. As shown in Fig. 2.10, the accuracy of both estimations on spoof data is high, while that of the live data is relatively lower. Note that the depth estimation of the mouth area has more errors, which is consistent with the fewer activations of the same area in Fig. 2.9.

Finally, we conduct statistical analysis on the failure cases, since our system can determine potential causes using the auxiliary information. With Protocol 2 of Oulu, we identify 31 failure cases (2.7% ACER). For each case, we calculate whether a classifier using its depth map or rPPG signal alone would fail. In total, 29/31, 13/31, and 11/31 samples fail due to the depth map, rPPG signals, or both, respectively. This indicates the future research direction.

2.6 Conclusions

This chapter identifies the importance of auxiliary supervision to deep model-based face anti-spoofing. The proposed network combines CNN and RNN architectures to jointly estimate the depth of face images and the rPPG signal of face video. One improvement of auxiliary supervisions is to make each CNN prediction based on a local receptive field. It would effectively reduce the overfitting of CNN training with limited data. We introduce the SiW database that contains more subjects and variations than prior databases. Finally, we experimentally demonstrate the superiority of our method.

Chapter 3

Generalization: Zero-shot and Open-set Face Anti-Spoofing

3.1 Introduction

Attackers can utilize a wide variety of mediums to launch spoof attacks. The most common ones are replaying videos/images on digital screens, i.e., replay attack, and printed photographs, i.e., print attack. Different methods are proposed to handle replay and print attacks, based on either handcrafted features Boulkenafet et al. (2015); Määttä et al. (2011); Patel et al. (2016b) or CNN-based features Atoum et al. (2017); Feng et al. (2016); Jourabloo et al. (2018); Liu et al. (2018c).
Recently, high-quality 3D custom masks have also been used for attacking, i.e., 3D mask attack. In Liu et al. (2018b, 2016a,b), methods for detecting print/replay attacks are found to be less effective for this new spoof type, and hence the authors leverage the remote photoplethysmography (rPPG) to detect the heart rate pulse as the cue. Further, facial makeup may also influence the outcome of face recognition, i.e., makeup attack Chen et al. (2013). Many works Chang et al. (2018); Chen et al. (2013, 2014) study facial makeup, despite not treating it as an anti-spoofing problem.

All aforementioned methods present algorithmic solutions to the known spoof attack(s), where anti-spoofing models are trained and tested on the same type(s) of spoof attacks. However, in real-world applications, attackers can also initiate spoof attacks that we, the algorithm designers, are not aware of, termed unknown spoof attacks¹. Researchers increasingly pay attention to the generalization of anti-spoofing models, i.e., how well they are able to detect spoof attacks that have never been seen during the training. We define the problem of detecting unknown face spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA). Despite the success of face anti-spoofing on known attacks, ZSFA, on the other hand, is a new and unsolved challenge to the community.

Figure 3.1: To detect unknown spoof attacks, we propose a Deep Tree Network (DTN) to unsupervisedly learn a hierarchical embedding for known spoof attacks. Samples of unknown attacks will be routed through DTN and classified at the destined leaf node.
The attempts on ZSFA are Arashloo et al. (2017); Xiong & AbdAlmageed (2018). They address ZSFA between print and replay attacks, and regard it as an outlier detection problem for live faces (a.k.a. real human faces). With handcrafted features, the live faces are modeled via standard generative models, e.g., GMM, auto-encoder. During testing, an unknown attack is detected if it lies outside the estimated live distribution. These ZSFA works have three drawbacks:

¹ There is subtle distinction between 1) unseen attacks, attack types that are known to algorithm designers so that algorithms could be tailored to them, but their data are unseen during training; and 2) unknown attacks, attack types that are neither known to designers nor seen during training. We do not differentiate these two cases and term both unknown attacks.

Lacking spoof type variety: Prior models are developed w.r.t. print and replay attacks only. The respective feature design may not be applicable to different unknown attacks.

No spoof knowledge: Prior models only use live faces, without leveraging the available known spoof data. While the unknown attacks are different, the known spoof attacks may still provide valuable information to learn the model.

Limitation of feature selection: They use handcrafted features such as LBP to represent live faces, which were shown to be less effective for known spoof detection Li et al. (2016a); Liu et al. (2018c); Patel et al. (2016a); Yang et al. (2014). Recent deep learning models Jourabloo et al.
(2018); Liu et al. (2018c) show the advantage of CNN models for face anti-spoofing.

This work aims to address all three drawbacks. Since one ZSFA model may perform differently when the unknown spoof attack is different, it should be evaluated on a wide range of unknown attack types. In this work, we substantially expand the study of ZSFA from 2 types of spoof attacks to 13 types. Besides print and replay attacks, we include 5 types of 3D mask attacks, 3 types of makeup attacks, and 3 partial attacks. These attacks cover both impersonation, i.e., the attempt to be authenticated as someone else, and obfuscation, i.e., the attempt to cover the attacker's own identity. We collect the first face anti-spoofing database that includes these diverse spoof attacks, termed Spoof in the Wild database with Multiple Attack Types (SiW-M), shown in Tab. 3.1.

To tackle the broader ZSFA, we propose a Deep Tree Network (DTN). Assuming there are both homogeneous features among different spoof types and distinct features within each spoof type, a tree-like model is well-suited to handle this case: learning the homogeneous features in the early tree nodes and distinct features in later tree nodes. Without any auxiliary labels of spoof types, DTN learns to partition data in an unsupervised manner. At each tree node, the partition is performed along the direction of the largest data variation. In the end, it clusters the data into several sub-groups at the leaf level, and learns to detect spoof attacks for each sub-group independently, shown in Fig. 3.1. During the testing, a data sample is routed to the most similar leaf node to produce a binary decision of live vs. spoof.

In summary, our contributions in this work include:
- Conduct an extensive study of zero-shot face anti-spoofing on 13 types of spoof attacks;
- Propose a Deep Tree Network to learn features hierarchically and detect unknown spoofs;
- Collect a new database for ZSFA and achieve the state-of-the-art performance on multiple testing protocols.

3.2 Prior Work

Face anti-spoofing: Image-based face anti-spoofing refers to face anti-spoofing techniques that only take RGB images as input without extra information such as depth or heat. In early years, researchers utilize liveness cues, such as eye blinking and head motion, to detect print attacks Kollreider et al.
(2007); Pan et al. (2007); Patel et al. (2016a); Shao et al. (2017). However, when encountering unknown attacks, such as a photograph with the eye portion cut out, or video replay, those methods suffer a total failure. Later, research moved to a more general texture analysis to address print and replay attacks. Researchers mainly utilize handcrafted features, e.g., LBP Boulkenafet et al. (2015); de Freitas Pereira et al. (2012, 2013); Määttä et al. (2011), HoG Komulainen et al. (2013a); Yang et al. (2013), SIFT Patel et al. (2016b) and SURF Boulkenafet et al. (2017a), with traditional classifiers, e.g., SVM and LDA, to make a binary decision. Those methods perform well on testing data from the same database. However, when the testing conditions change, such as the lighting and background, they often suffer a large performance drop, which can be viewed as an overfitting issue. Moreover, they also show limitations in handling 3D mask attacks, as mentioned in Liu et al. (2016a).

To overcome the overfitting issue, researchers have made various attempts. Boulkenafet et al. extract the features in HSV + YCbCr space Boulkenafet et al. (2015). Works in Agarwal et al. (2016); Bao et al. (2009); Bharadwaj et al. (2014); Feng et al. (2016); Xu et al. (2015) consider features in the temporal domain. Recent works Agarwal et al. (2016); Atoum et al. (2017) augment the data by using image patches, and fuse the scores from the patches into a single decision. For 3D mask attacks, the heart pulse rate is estimated to differentiate 3D masks from real faces Li et al. (2016b); Liu et al. (2016a). In the deep learning era, researchers have proposed several CNN works Atoum et al. (2017); Feng et al. (2016); Jourabloo et al. (2018); Li et al. (2016a); Liu et al. (2018c); Patel et al. (2016a); Yang et al. (2014) that outperform the traditional methods.

Zero-shot learning and unknown spoof attacks: Zero-shot object recognition, or more generally, zero-shot learning, aims to recognize objects from unknown classes Socher et al. (2013), i.e., object classes unseen in training. The overall idea is to associate the known and unknown classes via a semantic embedding, whose embedding spaces can be attributes Lampert et al. (2009), word vectors Frome et al. (2013), text descriptions Zhang et al. (2017a) and human gaze Karessli et al. (2017).
Zero-shot learning for unknown spoof attacks, i.e., ZSFA, is a relatively new topic with unique properties. Firstly, unlike zero-shot object recognition, ZSFA emphasizes the detection of spoof attacks, instead of recognizing the specific spoof types. Secondly, unlike generic objects with rich semantic embedding, there is no explicitly defined semantic embedding for spoof patterns Jourabloo et al. (2018). As elaborated in Sec. 3.1, prior ZSFA works Arashloo et al. (2017); Xiong & AbdAlmageed (2018) only model the live data via handcrafted features and standard generative models, with several drawbacks. In this work, we propose a deep tree network to unsupervisedly learn the semantic embedding for known spoof attacks. The partition of the data naturally associates certain semantic attributes with the sub-groups. During testing, the unknown attacks are projected to the embedding to find the closest attributes for spoof detection.

| Dataset | Num. of subj./vid. | Pose | Expression | Lighting | Replay | Print | 3D mask | Makeup | Partial | Total spoof types |
|---|---|---|---|---|---|---|---|---|---|---|
| CASIA-FASD Zhang et al. (2012) | 50 / 600 | Frontal | No | No | 1 | 2 | 0 | 0 | 0 | 3 |
| Replay-Attack Chingovska et al. (2012) | 50 / 1,200 | Frontal | No | Yes | 1 | 1 | 0 | 0 | 0 | 2 |
| HKBU-MARs Liu et al. (2016a) | 35 / 1,008 | Frontal | No | Yes | 0 | 0 | 2 | 0 | 0 | 2 |
| Oulu-NPU Boulkenafet et al. (2017b) | 55 / 5,940 | Frontal | No | No | 1 | 1 | 0 | 0 | 0 | 2 |
| SiW Liu et al. (2018c) | 165 / 4,620 | < 90° | Yes | Yes | 1 | 1 | 0 | 0 | 0 | 2 |
| SiW-M (2019) | 493 / 1,630 | < 90° | Yes | Yes | 1 | 1 | 5 | 3 | 3 | 13 |

Table 3.1: Comparing our SiW-M with existing face anti-spoofing datasets.
Deep tree networks: Tree structures are often found helpful in tackling language-related tasks such as parsing and translation Chen et al. (2018), due to the intrinsic relations of words and sentences. E.g., tree models are applied to joint vision and language problems such as visual question reasoning Cao et al. (2018). The tree structure also has the property of learning features hierarchically. Face alignment works Kazemi & Sullivan (2014); Valle et al. (2018) utilize regression trees to estimate facial landmarks from coarse to fine. Xiong et al. propose a tree CNN to handle large-pose face recognition Xiong et al. (2015). In Kaneko et al. (2018), Kaneko et al. propose a GAN with decision trees to learn hierarchically interpretable representations. In our work, we utilize tree networks to learn the latent semantic embedding for ZSFA.

Face anti-spoofing databases: Given the significance of a good-quality database, researchers have released several face anti-spoofing databases, such as CASIA-FASD Zhang et al. (2012), Replay-Attack Chingovska et al. (2012), OULU-NPU Boulkenafet et al. (2017b), and SiW Liu et al. (2018c) for print/replay attacks, and HKBU-MARs Liu et al. (2016a) for 3D mask attacks. Early databases such as CASIA-FASD and Replay-Attack Zhang et al. (2012) have limited subject variety, pose/expression/lighting variations, and video resolutions. Recent databases Boulkenafet et al. (2017b); Liu et al. (2016a, 2018c) improve those aspects, and also set up diverse evaluation protocols. However, up to now, all databases focus on either print/replay attacks or 3D mask attacks. To provide a comprehensive study of face anti-spoofing, especially the challenging ZSFA, we for the first time collect a database with diverse types of spoof attacks, as in Tab. 3.1. The details of our database are in Sec. 3.4.

Figure 3.2: The proposed Deep Tree Network (DTN) architecture. (a) The overall structure of DTN. A tree node consists of a Convolutional Residual Unit (CRU) and a Tree Routing Unit (TRU), and a leaf node consists of a CRU and a Supervised Feature Learning (SFL) module. (b) The concept of the Tree Routing Unit (TRU): finding the base with the largest variations. (c) The structure of each Convolutional Residual Unit (CRU). (d) The structure of the Supervised Feature Learning (SFL) module in the leaf nodes.
3.3 Deep Tree Network for ZSFA

The main purposes of DTN are twofold: 1) discover the semantic sub-groups for known spoofs; 2) learn the features in a hierarchical way. The architecture of DTN is shown in Fig. 3.2. Each tree node consists of a Convolutional Residual Unit (CRU) and a Tree Routing Unit (TRU), while a leaf node consists of a CRU and a Supervised Feature Learning (SFL) module. The CRU is a block with convolutional layers and short-cut connections. The TRU defines a node routing function to route a data sample to one of the child nodes. The routing function partitions all visiting data along the direction with the largest data variation. The SFL module concatenates the classification supervision and the pixel-wise supervision to learn the features.

3.3.1 Unsupervised Tree Learning

3.3.1.1 Node Routing Function

For a TRU node, let's assume the input $x = f(I\,|\,\theta) \in \mathbb{R}^m$ is the vectorized feature response, where $I$ is the data input, $\theta$ denotes the parameters of the previous CRUs, and $\mathcal{S}$ is the set of data samples $I_k, k = 1, 2, \dots, K$ that visit this TRU node. In Xiong et al. (2015), Xiong et al. define a routing function as:

$$\varphi(x) = x^T v + \tau, \quad (3.1)$$

where $v$ denotes the projection vector and $\tau$ is the bias. Data $\mathcal{S}$ can then be split into $\mathcal{S}_{left}: \{I_k \mid \varphi(x_k) < 0, I_k \in \mathcal{S}\}$ and $\mathcal{S}_{right}: \{I_k \mid \varphi(x_k) \geq 0, I_k \in \mathcal{S}\}$, and directed to the left and right child node, respectively. To learn this function, they propose to maximize the distance between the means of $\mathcal{S}_{left}$ and $\mathcal{S}_{right}$, while keeping the mean of $\mathcal{S}$ centered at $0$. This unsupervised loss is formulated as:

$$\mathcal{L} = \Big(\frac{1}{N}\sum_{I_k \in \mathcal{S}} \varphi(x_k)\Big)^2 - \Big(\frac{1}{N_l}\sum_{I_k \in \mathcal{S}_{left}} \varphi(x_k) - \frac{1}{N_r}\sum_{I_k \in \mathcal{S}_{right}} \varphi(x_k)\Big)^2, \quad (3.2)$$

where $N$, $N_l$, $N_r$ denote the number of samples in each set.

However, in practice, minimizing Eqn. 3.2 might not lead to a satisfactory solution. Firstly, the loss can be minimized by increasing the norm of either $v$ or $x$, which is a trivial solution. Secondly, even when the norms of $v$ and $x$ are constrained, Eqn. 3.2 is affected by the density of data $\mathcal{S}$ and can be sensitive to outliers. In other words, the zero expectation of $\varphi(x)$ does not necessarily result in a balanced partition of data $\mathcal{S}$. Local minima could be reached when all data are split to one side.
In some cases, the tree may suffer from collapsing to a few (or even one) leaf nodes. To better partition the data, we propose a novel routing function and an unsupervised loss. Disregarding $\tau$, the dot product between $x^T$ and $v$ can be regarded as projecting $x$ onto the direction of $v$. We design $v$ such that we can observe the largest variation after projection. Inspired by the concept of PCA, the optimal solution naturally becomes the largest PCA basis of data $\mathcal{S}$. To achieve this, we constrain $v$ to be of norm $1$ and reformulate Eqn. 3.1 as:

$$\varphi(x) = (x - \mu)^T v, \quad \|v\| = 1, \quad (3.3)$$

where $\mu$ is the mean of data $\mathcal{S}$. Then, $v$ is identical to the largest eigenvector of the covariance matrix $\bar{X}_S^T \bar{X}_S$, where $\bar{X}_S = X_S - \mu$, and $X_S \in \mathbb{R}^{N \times K}$ is the data matrix. Based on the definition of the eigen-analysis $\bar{X}_S^T \bar{X}_S v = \lambda v$, our optimization aims to maximize $\lambda$:

$$\arg\max_v \lambda = \arg\max_v v^T \bar{X}_S^T \bar{X}_S v. \quad (3.4)$$

The loss for learning the routing function is formulated as:

$$\mathcal{L}_{route} = \exp\!\big(-\alpha\, v^T \bar{X}_S^T \bar{X}_S v\big) + \beta\, \mathrm{Tr}\big(\bar{X}_S^T \bar{X}_S\big), \quad (3.5)$$

where $\alpha$, $\beta$ are scalars, set as 1e-3 and 1e-2 in our experiments. We apply the exponential function on the first term to make the maximization problem bounded. The second term is introduced as a regularizer to prevent trivial solutions by constraining the trace of the covariance matrix of $\bar{X}_S$.

3.3.1.2 Tree of Known Spoofs

With the routing function, we can build the entire binary tree. Fig. 3.2 shows a binary tree of depth 4, with 8 leaf nodes. As mentioned earlier in Sec. 3.3, the tree is designed to find the semantic sub-groups from all known spoofs, and is termed the spoof tree. Similarly, we may also train a live tree with live faces only, as well as a general data tree with both live and spoof data. Compared to the spoof tree, the live tree and the general data tree have some drawbacks. The live tree does not convey semantic meaning for the spoofs, and the attributes learned at each node cannot help to route and better detect spoofs; the general data tree may result in imbalanced sub-groups, where samples of one class outnumber another. Such imbalance would cause bias for the supervised learning in the next stage.
Hence, when we compute Eqn. 3.5 to learn the routing functions, we only consider the spoof samples to construct $X_S$. To have a balanced sub-group for each leaf, we suppress the responses of the live data to zero, so that all live data can be evenly partitioned to the child nodes. Meanwhile, we also suppress the responses of the spoof data that do not visit this node, so that every node models the distribution of a unique spoof subset.

Formally, for each node, we maximize the routing function responses of the spoof data that visit this node (denoted as $\mathcal{S}$), while minimizing the responses of other data (denoted as $\bar{\mathcal{S}}$), including all live data and the spoof data that don't visit this node, i.e., that visit the neighboring nodes. To achieve this objective, we define the following loss:

$$\mathcal{L}_{uniq} = -\frac{1}{N}\sum_{I_k \in \mathcal{S}} \big\|x_k^T v\big\|^2 + \frac{1}{\bar{N}}\sum_{I_k \in \bar{\mathcal{S}}} \big\|x_k^T v\big\|^2. \quad (3.6)$$

3.3.2 Supervised Feature Learning

Given the routing functions, a data sample $I_k$ will be assigned to one of the leaf nodes. Let's define the feature output of the leaf node as $F(I_k\,|\,\theta)$, shortened as $F_k$ for simplicity. At each leaf node, we define two node-wise supervised tasks to learn discriminative features: 1) binary classification drives the learning of a high-level understanding of live vs. spoof faces; 2) pixel-wise mask regression draws the CNN's attention to low-level local feature learning.

Classification supervision: To learn a binary classifier, as shown in Fig. 3.2(d), we apply two additional convolution layers and two fully connected layers on $F_k$ to generate a feature vector $c_k \in \mathbb{R}^{500}$. We supervise the learning via the softmax cross entropy loss:

$$\mathcal{L}_{class} = -\frac{1}{N}\sum_{I_k \in \mathcal{S}} \Big\{(1 - y_k)\log(1 - p_k) + y_k \log p_k\Big\}, \quad (3.7)$$

$$p_k = \frac{\exp(w_1^T c_k)}{\exp(w_0^T c_k) + \exp(w_1^T c_k)}, \quad (3.8)$$

where $\mathcal{S}$ represents all the data samples that arrive at this leaf node, $N$ denotes the number of samples in $\mathcal{S}$, $\{w_0, w_1\}$ are the parameters of the last fully connected layer, and $y_k$ is the label of data sample $k$ ($1$ denotes spoof, and $0$ live).
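As a concrete reference for Eqns. 3.3-3.8, the node-level objectives can be sketched in plain numpy. This is a minimal sketch, not the training code: `routing_stats`, `route` and `leaf_posterior` are hypothetical helper names, the toy feature dimensions are illustrative, and the batch-normalization-based moving-average estimate of $\mu$ (Sec. 3.3.3) is replaced by a plain mean.

```python
import numpy as np

def routing_stats(X_visit, X_other, alpha=1e-3, beta=1e-2):
    """TRU objectives at one node (Eqns. 3.3-3.6).

    X_visit: (N, m) spoof features that visit this node.
    X_other: (Nbar, m) live features plus spoof features visiting other nodes.
    Returns the routing direction v, the route loss and the unique loss.
    """
    mu = X_visit.mean(axis=0)
    Xc = X_visit - mu                     # centered data matrix
    cov = Xc.T @ Xc                       # covariance (unnormalized)
    # v is the largest eigenvector of cov, i.e., the largest PCA basis
    _, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1]                    # unit norm by construction
    # Eqn. 3.5: bounded maximization of the projected variance + trace regularizer
    L_route = np.exp(-alpha * (v @ cov @ v)) + beta * np.trace(cov)
    # Eqn. 3.6: keep visiting-spoof responses large, suppress all others
    r_visit = (X_visit - mu) @ v
    r_other = (X_other - mu) @ v
    L_uniq = -np.mean(r_visit ** 2) + np.mean(r_other ** 2)
    return v, L_route, L_uniq

def route(x, mu, v):
    """Negative projection goes to the left child, non-negative to the right."""
    return "left" if (x - mu) @ v < 0 else "right"

def leaf_posterior(c, w0, w1):
    """Spoof posterior p_k at a leaf node: a two-way softmax (Eqn. 3.8)."""
    e0, e1 = np.exp(w0 @ c), np.exp(w1 @ c)
    return e1 / (e0 + e1)
```

In the full model these quantities are computed per mini-batch and per node, with the route and unique losses applied only on the TRU nodes (Eqn. 3.10).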
Pixel-wise supervision: We also concatenate another convolution layer to $F_k$ to generate a map response $M_k \in \mathbb{R}^{32 \times 32}$. Inspired by the prior work Liu et al. (2018c), we leverage the semantic prior knowledge of face shapes and spoof attack positions to provide a pixel-wise supervision. Using the dense face alignment model Liu et al. (2017), we provide a binary mask $D_k \in \mathbb{R}^{32 \times 32}$, shown in Fig. 3.3, to indicate the pixels of the spoof mediums. Thus, for a leaf node, the loss function for the pixel-wise supervision is:

$$\mathcal{L}_{mask} = \frac{1}{N}\sum_{I_k \in \mathcal{S}} \|M_k - D_k\|_1. \quad (3.9)$$

Overall loss: Finally, we apply the supervised losses on the $p$ leaf nodes and the unsupervised losses on the $q$ TRU nodes, and formulate our training loss as:

$$\mathcal{L} = \sum_{i=1}^{p} \big(\lambda_1 \mathcal{L}^i_{class} + \lambda_2 \mathcal{L}^i_{mask}\big) + \sum_{j=1}^{q} \big(\lambda_3 \mathcal{L}^j_{route} + \lambda_4 \mathcal{L}^j_{uniq}\big), \quad (3.10)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the regularization coefficients for each term, set as $0.001$, $1.0$, $2.0$, $0.001$ respectively. For a 4-layer DTN, $p = 8$ and $q = 7$.

Figure 3.3: Examples of the live faces and the 13 types of spoof attacks. The second row shows the ground truth masks $D_k$ for the pixel-wise supervision. For $(m, n)$ in the third row, $m/n$ denotes the number of subjects/videos for each type of data.

Figure 3.4: The structure of the Tree Routing Unit (TRU).

3.3.3 Network Architecture

Deep Tree Network (DTN): DTN is the main framework of the proposed model. It takes $I \in \mathbb{R}^{256 \times 256 \times 6}$ as input, where the 6 channels are the RGB + HSV color spaces. We concatenate three $3 \times 3$ convolution layers with 40 channels and 1 max-pooling layer, and group them as one Convolutional Residual Unit (CRU). Each convolution layer is equipped with ReLU and a group normalization layer Wu & He (2018), due to the dynamic batch size in the network. We also apply a shortcut connection for each convolution layer. For each tree node, we deploy one CRU before the TRU. At the leaf node, DTN produces the feature representation of input $I$ as $F(I\,|\,\theta) \in \mathbb{R}^{32 \times 32 \times 40}$, then uses one $1 \times 1$ convolution layer to generate the binary mask map $M$.

Tree Routing Unit (TRU): The TRU is the module routing a data sample to one of the child CRUs.
As shown in Fig. 3.4, it compresses the feature by using a $1 \times 1$ convolution layer and resizing the response spatially. For the root node, we compress the CRU feature to $x \in \mathbb{R}^{32 \times 32 \times 10}$, and for the later tree nodes, we compress the CRU feature to $x \in \mathbb{R}^{16 \times 16 \times 20}$. Compressing the input feature to a smaller size helps to reduce the burden of computing and storing the covariance matrix in Eqn. 3.5. E.g., the vectorized feature for the first CRU is $x \in \mathbb{R}^{655{,}360}$, and the covariance matrix of $x$ can take $\sim$400 GB of memory. However, after compression the vectorized feature is $x \in \mathbb{R}^{10{,}240}$, and the covariance matrix of $x$ only needs $\sim$0.1 GB of memory.

After that, we vectorize the output and apply the routing function $\varphi(x)$. To compute $\mu$ in Eqn. 3.3, instead of optimizing it as a variable of the network, we simply apply a batch normalization layer without scaling to save the moving average of each mini-batch. In the end, we project the compressed CRU response to the largest basis $v$ and obtain the projection coefficient. Then we assign the samples with negative coefficients to the left child CRU and the samples with positive coefficients to the right child CRU.

Implementation details: With the overall loss in Eqn. 3.10, our proposed network is trained in an end-to-end fashion. All losses are computed based on each mini-batch. The DTN modules and TRU modules are optimized alternately. While optimizing the DTN, we keep the parameters of the TRUs fixed, and vice versa.

3.4 Spoof in the Wild Database with Multiple Attack Types

To benchmark face anti-spoofing methods specifically for unknown attacks, we collect the Spoof in the Wild database with Multiple Attack Types (SiW-M). Compared with the previous databases in Tab. 3.1, SiW-M shows a great diversity in spoof attacks, subject identities, environments and other factors.
For the spoof data collection, we consider two scenarios: impersonation, which entails the use of a spoof to be recognized as someone else, and obfuscation, which entails the use of a spoof to remove the attacker's own identity. In total, we collect 968 videos of 13 types of spoof attacks, listed hierarchically in Fig. 3.3. For all 5 mask attacks, the 3 partial attacks, obfuscation makeup and cosmetic makeup, we record 1080P HD videos. For impersonation makeup, we collect 720P videos from Youtube due to the lack of special makeup artists. For print and replay attacks, we intend to collect videos from harder cases where an existing spoof detection system fails. Hence, we deploy an off-the-shelf face anti-spoofing algorithm Liu et al. (2018c) and record spoof videos when the algorithm predicts live.

For the live data, we include 660 videos from 493 subjects. In comparison, the number of subjects in SiW-M is 9 times larger than Oulu-NPU Boulkenafet et al. (2017b) and CASIA-FASD Zhang et al. (2012), and 3 times larger than SiW Liu et al. (2018c). In addition, the subjects are diverse in ethnicity and age. The live videos are collected in 3 sessions: 1) a room environment where the subjects are recorded with few variations such as pose, lighting and expression (PIE); 2) a different and much larger room where the subjects are also recorded with PIE variations; 3) a mobile phone mode, where the subjects are moving while the phone camera is recording, with extreme pose angles and lighting conditions introduced. Similar to the print and replay videos, we deploy the face anti-spoofing algorithm Liu et al. (2018c) to filter out the videos where the algorithm predicts spoof. Hence, this third session is a harder scenario.

In total, we collect 1,630 videos, each lasting 5-7 seconds. The 1080P videos are recorded by a Logitech C920 webcam and a Canon EOS T6. To use SiW-M for the study of ZSFA, we define the leave-one-out testing protocols. Each time we train a model with 12 types of spoof attacks plus 80% of the live videos, and test on the left-out attack type plus the remaining 20% of the live videos. There are no overlapping subjects between the training and testing sets of live videos.
3.5 Experimental Results

3.5.1 Experimental Setup

Databases: We evaluate our proposed method on multiple databases. We deploy the leave-one-out testing protocols on SiW-M and report the results of the 13 experiments. Also, we test on previous face anti-spoofing databases, including CASIA Zhang et al. (2012), Replay-Attack Chingovska et al. (2012), and MSU-MFSD Wen et al. (2015), to compare with the state of the art.

Evaluation metrics: We evaluate with the following metrics: Attack Presentation Classification Error Rate (APCER) ISO/IEC-JTC-1/SC-37 (2016), Bona Fide Presentation Classification Error Rate (BPCER) ISO/IEC-JTC-1/SC-37 (2016), the average of APCER and BPCER, i.e., the Average Classification Error Rate (ACER) ISO/IEC-JTC-1/SC-37 (2016), Equal Error Rate (EER), and Area Under the Curve (AUC). Note that, in the evaluation of unknown attacks, we assume there is no validation set to tune the model and the thresholds while calculating the metrics. Hence, we determine the threshold based on the training set and fix it for all testing protocols. A single test sample is one video frame, instead of one video.

Parameter setting: The proposed method is implemented in TensorFlow Abadi et al. (2016), and trained with a constant learning rate of 0.001 and a batch size of 32. It takes 15 epochs to converge. We randomly initialize all the weights using a normal distribution with 0 mean and 0.02 standard deviation.

3.5.2 Experimental Comparison

3.5.2.1 Ablation Study

All ablation studies use the Funny Eye protocol.

Different fusion methods: In the proposed model, both the norm of the mask maps and the binary spoof scores could be utilized for the final classification. To find the best fusion method, we compute the ACER from using the map norm, the softmax score, the maximum of the map norm and softmax score, and the average of the two values, and obtain 31.7%, 20.5%, 21.0%, and 19.3% respectively. Since the average of the mask norm and the binary spoof score performs the best, we use it for the remaining experiments. Moreover, we set 0.2 as the threshold to compute APCER, BPCER and ACER for all the experiments.
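The fixed-threshold metrics used throughout this section can be sketched as follows. This is a minimal sketch with a hypothetical `acer_metrics` helper, assuming spoofness scores in [0, 1] and the fixed 0.2 threshold described above; it is not the evaluation code used for the tables.

```python
import numpy as np

def acer_metrics(scores, labels, threshold=0.2):
    """APCER / BPCER / ACER at a fixed threshold.

    scores: spoofness scores, higher means more likely spoof.
    labels: 1 = attack presentation (spoof), 0 = bona fide (live).
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred_spoof = scores >= threshold
    # APCER: fraction of attack presentations classified as bona fide
    apcer = np.mean(~pred_spoof[labels == 1])
    # BPCER: fraction of bona fide presentations classified as attack
    bpcer = np.mean(pred_spoof[labels == 0])
    # ACER: the average of the two error rates
    return apcer, bpcer, (apcer + bpcer) / 2
```

Because the threshold is fixed from the training set rather than tuned per protocol, ACER reflects the deployment scenario for unknown attacks more faithfully than EER.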
Different routing methods: Routing is a crucial step to find the best subgroup for detecting the spoofness of a testing sample. To show the effect of proper routing, we evaluate 2 alternative routing strategies: random routing and pick-one-leaf. Random routing denotes randomly selecting one leaf node for a testing sample to produce the prediction; pick-one-leaf denotes constantly selecting one particular leaf node to produce the results, for which we report the mean score and standard deviation of the 8 selections. As shown in Tab. 3.2, both strategies perform worse than the proposed routing function. In addition, the large standard deviation of the pick-one-leaf strategy shows the large performance difference of the 8 subgroups on the same type of unknown attacks, and demonstrates the necessity of a proper routing.

Advantage of each loss function: We have three important designs in our unsupervised tree learning: the route loss $\mathcal{L}_{route}$, the data used to compute the route loss, and the unique loss $\mathcal{L}_{uniq}$. To show the effect of each loss and the training strategy, we train and compare networks with each loss excluded and with alternative strategies.

| Strategies | APCER | BPCER | ACER | EER |
|---|---|---|---|---|
| Random routing | 37.1 | 16.1 | 26.6 | 24.7 |
| Pick-one-leaf | 51.2±20.0 | 18.1±4.9 | 34.7±8.8 | 24.1±3.1 |
| Proposed routing function | 17.0 | 21.5 | 19.3 | 19.8 |

Table 3.2: Comparing models with different routing strategies.

| Methods | APCER | BPCER | ACER | EER |
|---|---|---|---|---|
| MPT Xiong et al. (2015) | 31.4 | 24.2 | 27.8 | 27.3 |
| Live data ✓, Spoof data ✓, Unique Loss ✗ | 1.4 | 73.3 | 37.3 | 31.2 |
| Live data ✗, Spoof data ✓, Unique Loss ✗ | 70.0 | 12.7 | 41.3 | 44.8 |
| Live data ✓, Spoof data ✓, Unique Loss ✓ | 54.2 | 12.5 | 33.4 | 36.2 |
| Live data ✗, Spoof data ✓, Unique Loss ✓ | 17.0 | 21.5 | 19.3 | 19.8 |

Table 3.3: Comparing models with different tree losses and strategies. The first two terms of rows 2-5 refer to using live or spoof data in the tree learning. The last row is our method.
| Methods | Video | Cut Photo | Warped Photo | Video | Digital Photo | Printed Photo | Printed Photo | HR Video | Mobile Video | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| OC-SVM_RBF + BSIF Arashloo et al. (2017) | 70.7 | 60.7 | 95.9 | 84.3 | 88.1 | 73.7 | 64.8 | 87.4 | 74.7 | 78.7±11.7 |
| SVM_RBF + LBP Boulkenafet et al. (2017b) | 91.5 | 91.7 | 84.5 | 99.1 | 98.2 | 87.3 | 47.7 | 99.5 | 97.6 | 88.6±16.3 |
| NN + LBP Xiong & AbdAlmageed (2018) | 94.2 | 88.4 | 79.9 | 99.8 | 95.2 | 78.9 | 50.6 | 99.9 | 93.5 | 86.7±15.6 |
| Ours | 90.0 | 97.3 | 97.5 | 99.9 | 99.9 | 99.6 | 81.6 | 99.9 | 97.5 | 95.9±6.2 |

Table 3.4: AUC (%) of the model testing on CASIA (Video, Cut Photo, Warped Photo), Replay-Attack (Video, Digital Photo, Printed Photo), and MSU-MFSD (Printed Photo, HR Video, Mobile Video).

First, we train a network with the routing function proposed in Xiong et al. (2015), and then 4 models with different modules turned on and off, as shown in Tab. 3.3. The model with MPT Xiong et al. (2015) routes data to only 2 leaf nodes out of 8 (i.e., the tree collapse issue), which limits the performance. Models without the unique loss exhibit the imbalanced routing issue, where sub-groups cannot be trained properly. Models using all data to learn the tree show worse performance than using spoof data only. Finally, the proposed method performs the best among all options.

3.5.2.2 Testing on existing databases

Following the protocol proposed in Arashloo et al. (2017), we use CASIA Zhang et al. (2012), Replay-Attack Chingovska et al. (2012) and MSU-MFSD Wen et al. (2015) to perform ZSFA testing between replay and print attacks. Tab. 3.4 compares the proposed method with the top three methods selected from over 20 methods in Arashloo et al. (2017); Boulkenafet et al. (2017b); Xiong & AbdAlmageed (2018). Our proposed method outperforms the prior state of the art by a convincing margin of 7.3%, and our smaller standard deviation further indicates a consistently good performance among the unknown attacks.
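For completeness, the threshold-free metrics reported in Tab. 3.4 (EER and AUC) can be sketched the same way. `eer_and_auc` is a hypothetical helper: it uses a simple sweep over observed thresholds for EER and the rank-sum view of AUC (ties between scores are ignored), not the evaluation code behind the tables.

```python
import numpy as np

def eer_and_auc(live_scores, spoof_scores):
    """EER and AUC from spoofness scores (higher = more likely spoof).

    live_scores / spoof_scores: iterables of scores for bona fide and
    attack samples. A coarse sweep, fine for a sketch.
    """
    live = np.asarray(live_scores)
    spoof = np.asarray(spoof_scores)
    best_gap, eer = 1.0, 1.0
    for t in np.unique(np.concatenate([live, spoof])):
        far = np.mean(live >= t)    # live accepted as spoof
        frr = np.mean(spoof < t)    # spoof accepted as live
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    # AUC as the probability that a random spoof outscores a random live
    auc = np.mean([s > l for s in spoof for l in live])
    return eer, auc
```

With perfectly separated scores the helper returns an EER of 0 and an AUC of 1, matching the intuition that the two metrics summarize the full score distribution rather than one operating point.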
3.5.2.3 Testing on SiW-M

We execute the 13 leave-one-out testing protocols on SiW-M. We compare with two of the most recent face anti-spoofing methods Boulkenafet et al. (2017b); Liu et al. (2018c), and set Liu et al. (2018c) as the baseline, which has demonstrated SOTA performance on various benchmarks. For a fair comparison with the baseline, we provide the same pixel-wise labeling (as in Fig. 3.3), and set the same threshold of 0.2 to compute APCER, BPCER, and ACER.

As shown in Tab. 3.5, our method achieves an overall better APCER, ACER and EER, improving over the baseline by 55%, 29%, and 5% respectively. Specifically, we reduce the ACERs of transparent mask, funny eye, and paper glasses by 31%, 61%, and 51%, where the baseline models can be considered total failures since they recognize most of the attacks as live. Note that ACER is more valuable in the context of ZSFA: there is no evaluation data for setting the threshold, and the thresholds for obtaining the EER performance vary considerably. For instance, the EERs of the paper glasses model are similar between the baseline and our method, but with a preset threshold, our method offers a much better ACER.

Moreover, the proposed method is a more compact model than Liu et al. (2018c). Given the input size of $256 \times 256 \times 6$, the baseline requires 87 GFlops to compute the result while our method only needs 6 GFlops ($15\times$ smaller). More analyses are shown with the visualization in Sec. 3.5.2.4.
| Methods | Metrics (%) | Replay | Print | Half | Silicone | Trans. | Paper | Manne. | Obfusc. | Imperson. | Cosmetic | Funny Eye | Paper Glasses | Paper | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM_RBF + LBP Boulkenafet et al. (2017b) | APCER | 19.1 | 15.4 | 40.8 | 20.3 | 70.3 | 0.0 | 4.6 | 96.9 | 35.3 | 11.3 | 53.3 | 58.5 | 0.6 | 32.8±29.8 |
| | BPCER | 22.1 | 21.5 | 21.9 | 21.4 | 20.7 | 23.1 | 22.9 | 21.7 | 12.5 | 22.2 | 18.4 | 20.0 | 22.9 | 21.0±2.9 |
| | ACER | 20.6 | 18.4 | 31.3 | 21.4 | 45.5 | 11.6 | 13.8 | 59.3 | 23.9 | 16.7 | 35.9 | 39.2 | 11.7 | 26.9±14.5 |
| | EER | 20.8 | 18.6 | 36.3 | 21.4 | 37.2 | 7.5 | 14.1 | 51.2 | 19.8 | 16.1 | 34.4 | 33.0 | 7.9 | 24.5±12.9 |
| Auxiliary Liu et al. (2018c) | APCER | 23.7 | 7.3 | 27.7 | 18.2 | 97.8 | 8.3 | 16.2 | 100.0 | 18.0 | 16.3 | 91.8 | 72.2 | 0.4 | 38.3±37.4 |
| | BPCER | 10.1 | 6.5 | 10.9 | 11.6 | 6.2 | 7.8 | 9.3 | 11.6 | 9.3 | 7.1 | 6.2 | 8.8 | 10.3 | 8.9±2.0 |
| | ACER | 16.8 | 6.9 | 19.3 | 14.9 | 52.1 | 8.0 | 12.8 | 55.8 | 13.7 | 11.7 | 49.0 | 40.5 | 5.3 | 23.6±18.5 |
| | EER | 14.0 | 4.3 | 11.6 | 12.4 | 24.6 | 7.8 | 10.0 | 72.3 | 10.1 | 9.4 | 21.4 | 18.6 | 4.0 | 17.0±17.7 |
| Ours | APCER | 1.0 | 0.0 | 0.7 | 24.5 | 58.6 | 0.5 | 3.8 | 73.2 | 13.2 | 12.4 | 17.0 | 17.0 | 0.2 | 17.1±23.3 |
| | BPCER | 18.6 | 11.9 | 29.3 | 12.8 | 13.4 | 8.5 | 23.0 | 11.5 | 9.6 | 16.0 | 21.5 | 22.6 | 16.8 | 16.6±6.2 |
| | ACER | 9.8 | 6.0 | 15.0 | 18.7 | 36.0 | 4.5 | 7.7 | 48.1 | 11.4 | 14.2 | 19.3 | 19.8 | 8.5 | 16.8±11.1 |
| | EER | 10.0 | 2.1 | 14.4 | 18.6 | 26.5 | 5.7 | 9.6 | 50.2 | 10.1 | 13.2 | 19.8 | 20.5 | 8.8 | 16.1±12.2 |

Table 3.5: The evaluation and comparison of the testing on SiW-M. Half, Silicone, Trans., Paper and Manne. are the mask attacks; Obfusc., Imperson. and Cosmetic are the makeup attacks; Funny Eye, Paper Glasses and Paper are the partial attacks.

Among all the attacks, replay, print, half mask, paper mask, and impersonation makeup are impersonation attacks. The average ACER/EER of the impersonation attacks is 9.3/8.5, which is lower than the overall average ACER/EER. This shows that the proposed method handles impersonation attacks better. When the attackers try to impersonate someone, the spoof face is required to be similar to a live face, and thus the network can extract features more easily. However, when the attackers just try to hide their own identity (obfuscation attacks), the spoof face does not need to look like a live face, which makes it easier to become an outlier of the data distribution and fool the anti-spoofing system.

3.5.2.4 Visualization and Analysis

To provide a better understanding of the tree learning and ZSFA, we visualize the results in several ways. First, we illustrate the tree routing results. In Fig. 3.5, we rank the spoof data based on the routing function values $\varphi(x)$, and provide 8 examples with responses from the smallest to the largest. This offers an intuitive understanding of what is learned at each tree node. We observe an obvious spoof style transfer: for the first-two-layer nodes $N_1$, $N_2$ and $N_3$, the transfer captures the change of general spoof attributes such as image quality and color temperature; for the third-layer tree nodes $N_4$, $N_5$, $N_6$, and $N_7$, the transfer involves more specific spoof type changes. E.g., $N_7$ transfers from eye-portion spoofs to full-face 3D mask spoofs.

Figure 3.5: Visualization of the tree routing.

Figure 3.6: Tree routing distributions of live/spoof data. The x-axis denotes the 8 leaf nodes, and the y-axis denotes the 15 types of data. The number in each cell represents the percentage (%) of data that fall in that leaf node. Each row sums to 1. (a) Print protocol. (b) Transparent Mask protocol. The yellow box denotes the unknown attacks.

Figure 3.7: t-SNE visualization of the DTN leaf features.

Further, Fig. 3.6 quantitatively analyzes the tree routing distributions of all types of data. We utilize two models, Print and Trans. Mask, to generate the distributions. It can be observed that the live samples are relatively more spread out over the 8 leaf nodes while the spoof attacks are routed to fewer
leaf nodes. The two distributions in Fig. 3.6 (a) & (b) share similar semantic sub-groups, which demonstrates the success of the proposed method in learning a semantic tree. E.g., in both models, about half of the trans. mask samples share the same leaf node as ob. makeup. By comparing the two distributions, most of the testing unknown spoofs in both models are successfully routed to the most similar sub-groups.

In addition, we use t-SNE Maaten & Hinton (2008) to visualize the feature space of the Print model. The t-SNE is able to project the output of the leaf node $F(I\,|\,\theta) \in \mathbb{R}^{32 \times 32 \times 40}$ to 2D by preserving the KL divergence distance. Fig. 3.7 shows that the features of different types of spoof attacks are well-clustered into 8 semantic sub-groups even though we don't provide any auxiliary labels. Based on these sub-groups, the features of the unknown print attacks lie well within the sub-groups of replay and silicone mask, and thus are recognized as spoof. Moreover, with the visualization, we can explain the performance variation among different spoof attacks, shown in Tab. 3.5. Among all, the performances of trans. mask, funny eye, paper glasses and ob. makeup are worse than the other protocols. The feature space shows that the live samples lie much closer to those attacks than to the others (the "→" places in Fig. 3.7), and hence it is harder to distinguish them from the live samples. This demonstrates the diverse properties of different unknown attacks and the necessity of such a wide-range evaluation.

3.6 Conclusions

This chapter tackles the zero-shot face anti-spoofing problem among 13 types of spoof attacks. The proposed method leverages a deep tree network to route the unknown attacks to the most proper leaf node for spoof detection. The tree is trained in an unsupervised fashion to find the feature basis with the largest variation to split the spoof data. We collect SiW-M, which contains more subjects and spoof types than any previous database. Finally, we experimentally show the superior performance of the proposed method.
Chapter 4

Visualization: Disentangling Spoof Traces with Physical Modeling

4.1 Introduction

As most face recognition systems are based on a monocular RGB camera, monocular RGB based face anti-spoofing has been studied for over a decade, and one of the most common approaches is based on texture analysis Boulkenafet et al. (2015, 2016); Patel et al. (2016b). Researchers noticed that presenting faces from spoof mediums introduces special texture differences, such as color distortions, unnatural specular highlights, Moiré patterns, etc. Those texture differences are inherent within the spoof mediums and thus hard to remove or camouflage. Conventional approaches build a feature extractor plus classifier pipeline, such as LBP+SVM and HOG+SVM in de Freitas Pereira et al. (2012); Komulainen et al. (2013a), and show good performance on several small databases with constrained environments. In recent years, many works such as Atoum et al. (2017); Liu et al. (2018c, 2019a); Shao et al. (2019a); Yang et al. (2019a) leverage deep learning techniques and show great progress in face anti-spoofing performance. Deep learning based methods can be generally grouped into 3 categories: direct FAS, auxiliary FAS, and generative FAS, as illustrated in Fig. 4.2.

Early works Xu et al. (2015); Yang et al. (2014) build vanilla CNNs with binary output to directly predict the spoofness of an input face (Fig. 4.2a). Methods in Liu et al. (2018c); Yang et al. (2019a) propose to learn an intermediate representation, e.g., depth or rPPG, instead of the binary classes, which can lead to better generalization and performance (Fig. 4.2b). Feng et al. (2020); Jourabloo et al. (2018); Stehouwer et al. (2020) additionally attempt to generate the visual patterns existing in the spoof samples (Fig. 4.2c), providing a more intuitive interpretation of the sample's spoofness.

Figure 4.1: The proposed approach can detect spoof faces, disentangle the spoof traces, and reconstruct the live counterparts. It can be applied to diverse spoof types and estimates distinct traces (e.g., the Moiré pattern in the replay attack, the artificial eyebrow and wax in the makeup attack, the color distortion in the print attack, and the specular highlights in the 3D mask attack). Zoom in for details.
Despite the success, there are still at least three unsolved problems in the topic of deep learning-based face anti-spoofing.

First, most prior works are designed to tackle limited spoof types, either print/replay or 3D masks solely, while a real-world anti-spoofing system may encounter a wide variety of spoof types, including print, replay, various 3D masks, facial makeup, and even unseen attack types. Therefore, to better reflect real-world performance, we need a benchmark to evaluate face anti-spoofing under known attacks, unknown attacks, and their combination (termed the open-set setting).

Second, many approaches formulate face anti-spoofing as a classification/regression problem, with a single score as the output. Although auxiliary FAS and generative FAS attempt to offer some extent of interpretation by saliency or noise analysis, there is little understanding of what the exact differences are between live and spoof, and what patterns the classifier's decision is based upon. A better interpretation would be to estimate the exact patterns differentiating a spoof face and its live counterpart, termed the spoof trace.

Thirdly, compared with other face analysis tasks such as recognition or alignment, the data for face anti-spoofing has several limitations. Most FAS databases are captured in constrained indoor environments, which have limited intra-subject variation and environment variation. Some special spoof types, such as makeup and customized silicone masks, require highly skilled experts to apply or create, at high cost, which results in very limited samples (i.e., long-tail data). Thus, how to learn from data with limited variations or samples is a challenge for FAS.

Figure 4.2: The comparison of different deep-learning based face anti-spoofing approaches. (a) Direct FAS only provides a binary decision of spoofness. (b) Auxiliary FAS can provide a simple interpretation of spoofness; M denotes the auxiliary task, such as depth map estimation. (c) Generative FAS can provide a more intuitive interpretation of spoofness, but only for a limited number of spoof attacks. (d) The proposed method can provide spoof trace estimation for generic face spoof attacks.
In this work, we aim to design a face anti-spoofing model that is applicable to a wide variety of spoof types, termed generic face anti-spoofing. We equip this model with the ability to explicitly disentangle the spoof traces from the input faces. Some examples of spoof trace disentanglement are shown in Fig. 4.1. This is a challenging objective due to the diversity of spoof traces and the lack of ground truth during model learning. However, we believe that this objective can bring several benefits:

1. Binary classification for face anti-spoofing would harvest any cue that helps classification, which might include spoof-irrelevant cues such as lighting, and thus hinder generalization. In contrast, spoof trace disentanglement explicitly tackles the most fundamental cue in spoofing, upon which the classification can be more grounded and witness better generalization.

2. With the trend of pursuing explainable AI as mentioned in Arrieta et al. (2020); Turek (2016), it is desirable for the face anti-spoofing model to generate the spoof patterns that support its binary decision, since the spoof trace serves as a good visual explanation of the model's decision. Certain properties (e.g., severity, methodology) of spoof attacks might potentially be revealed from the traces.

3. Disentangled spoof traces can enable the synthesis of realistic spoof samples, which addresses the issue of limited training data for the minority spoof types, such as special 3D masks and makeup.

As shown in Fig. 4.2d, we propose a Physics-guided Spoof Trace Disentanglement (PhySTD) to explore the spoof traces for generic face anti-spoofing. To model all types of spoofs, we formulate the spoof trace disentanglement as a combination of an additive process and an inpainting process. The additive process describes spoofing as the spoof material introducing extra patterns (e.g.
,moire pattern),wherethelivecounterpartcanberecoveredbyremovingthosepatterns.Inpaintingprocess describesasspoofmaterialfullycoveringcertainregionsoftheoriginalface,wherethe livecounterpartofthoseregionshastobefiguessedflasshowninBertalmioetal.(2000);Liu& 56 Shu(2015).Wefurtherdecomposethespooftracesintofrequency-dependentcomponents,sothat traceswithdifferentfrequencypropertiescanbeequallyhandled.Forthenetworkarchitecture,we extendabackbonenetworkforauxiliaryFASwithadecodertoperformthedisentanglement.With nogroundtruthofspooftraces,weadoptanoverallGAN-basedtrainingstrategy.Thegenerator takesaninputface,estimatesitsspoofness,anddisentanglesthespooftrace.Afterobtainingthe spooftrace,wecanreconstructthelivecounterpartfromthespoofandsynthesizenewspooffrom thelive.Thesynthesizedsamplesarethensenttomultiplediscriminatorswithrealsamplesfor adversarialtraining.Thesynthesizedspoofsamplesarefurtherutilizedtotrainthegeneratorina fullysupervisedfashion,thankstodisentangledspooftracesasgroundtruthforthesynthesized samples.Tocorrectpossiblegeometricdiscrepancyduringspoofsynthesis,weproposeanovel 3 D warpinglayertodeformspooftracestowardthetargetliveface. ApreliminaryversionofthisworkwaspublishedintheProceedingsEuropeanConferenceon ComputerVision(ECCV)2020Liuetal.(2020).Weextendtheworkfromthreeaspects. 1 )Guided bythephysicsofhowaspoofisgenerated,weintroduceaspoofgenerationfunction(SGF)tomodel thespooftracedisentanglementasacombinationofadditiveandinpaintingprocesses.SGFhasa betterandmorenaturalmodelingofgenericspoofattacks,suchaspaperglass. 2 )Previoustrace components f S ; B ; C ; T g arenotsupervisedhierarchicallysothatthereexistssemanticambiguity. Inthiswork,weintroduceseveralhierarchicaldesignsintheGANframeworktoremedysuch ambiguity. 3 )Weproposeanopen-settestingscenariotofurtherevaluatethereal-worldperformance forfacemodels.Bothknownandunknownattacksareincludedintheopen-settesting. 
We perform a side-by-side comparison between the proposed approach and the state-of-the-art (SOTA) face anti-spoofing solutions on multiple datasets and protocols. In summary, the main contributions of this work are as follows:

– We for the first time study spoof traces for generic face anti-spoofing, where a wide variety of spoof types are tackled with one framework;
– We propose a novel physics-guided model to disentangle spoof traces, and utilize the spoof traces to synthesize new data samples for enhanced training;
– We propose novel protocols for generic open-set face anti-spoofing;
– We achieve SOTA anti-spoofing performance and provide convincing visualization for a wide variety of spoof types.

4.2 Related Work

Face Anti-Spoofing Face anti-spoofing has been studied for more than a decade, and its development can be roughly divided into three stages. In the early years, researchers leveraged spontaneous human movement, such as eye blinking and head motion, to detect simple print photograph or static replay attacks Kollreider et al. (2007); Pan et al. (2007). However, when facing counterattacks, such as a print face with the eye region cut out, or replaying a face video, those methods would fail. In the second stage, researchers paid more attention to texture differences between live and spoof, which are inherent to spoof mediums. Researchers mainly extract hand-crafted features from the faces, e.g., LBP Boulkenafet et al. (2015); de Freitas Pereira et al. (2012, 2013); Määttä et al. (2011), HoG Komulainen et al. (2013a); Yang et al. (2013), SIFT Patel et al. (2016b) and SURF Boulkenafet et al. (2016), and train a classifier to split live vs. spoof, e.g., SVM and LDA.

Recently, face anti-spoofing solutions equipped with deep learning techniques have demonstrated significant improvements over the conventional methods. Methods in Feng et al. (2016); Li et al. (2016a); Patel et al. (2016a); Yang et al. (2014) train a deep neural network to learn a binary classification between live and spoof. In Atoum et al. (2017); Liu et al. (2018c, 2019a); Shao et al. (2019a); Yang et al. (2019a), additional supervisions, such as the face depth map and rPPG signal, are utilized to help the network learn more generalizable features. As the latest approaches achieve saturated performance on several benchmarks, researchers have started to explore more challenging cases, such as few-shot/zero-shot face anti-spoofing Liu et al. (2019a); Qin et al. (2019); Zhao et al. (2019) and domain adaptation in face anti-spoofing Shao et al. (2019a,b).

In this work, we aim to solve an interesting yet very challenging problem: disentangling and visualizing the spoof traces from an input face. A related work, Jourabloo et al. (2018), also adopts a GAN seeking to estimate the spoof traces. However, they formulate the traces as low-intensity noises, which is limited to print and replay attacks only and cannot provide convincing visual results. In contrast, we explore spoof traces for a much wider range of spoof attacks, visualize them with a novel disentanglement, and also evaluate the proposed method on the challenging cases, e.g., zero-shot face anti-spoofing.

Disentanglement Learning Disentanglement learning is often adopted to better represent complex data and features. DR-GAN Tran et al. (2017b) disentangles a face into identity and pose vectors for pose-invariant face recognition and view synthesis. Similarly, in gait recognition, Zhang et al. (2019) disentangles the representations of appearance, canonical, and pose features from an input gait video. 3D reconstruction works Liu et al. (2018a); Tran & Liu (2021) also disentangle the representation of a 3D face into identity, expressions, poses, albedo, and illuminations. For image synthesis, Esser et al. (2018) disentangles an image into appearance and shape with a U-Net and Variational Autoencoder (VAE).

Different from Liu et al. (2018a); Tran et al. (2017b); Zhang et al. (2019), we intend to disentangle features that have different scales and contain geometric information. We leverage multiple outputs to represent features at different scales, and adopt multiple-scale discriminators to properly learn them. Moreover, we propose a novel warping layer to tackle the geometric discrepancy during the disentanglement and reconstruction.

Figure 4.3 Overview of the proposed Physics-guided Spoof Trace Disentanglement (PhySTD).
Image Trace Modeling Image traces are certain signals existing in an image that can reveal information about the capturing camera, imaging setting, environment, and so on. Those signals often have much lower energy compared to the image content, which requires proper modeling to explore. Abdelhamed et al. (2018); Thai et al. (2013, 2016) observe the difference of image noises, and use them to recognize the capture cameras. From the frequency domain, Stehouwer et al. (2020) shows that the image noises from different cameras obey different noise distributions. Such techniques are applied to the field of image forensics, and later Chen et al. (2020); Wang et al. (2017) propose methods to remove such traces for image anti-forensics.

Recently, image trace modeling is widely used in image forgery detection and image adversarial attack detection Dang et al. (2020); Wu et al. (2019). In this work, we attempt to explore the traces of spoof face presentation. Due to different spoof mediums, spoof traces show large variations in content, intensity, and frequency distribution. We propose to disentangle the traces into additive traces and an inpainting trace, and for the additive traces, we further decompose them based on different frequency bands.

4.3 Physics-based Spoof Trace Disentanglement

4.3.1 Problem Formulation

Let the domain of live faces be denoted as L ⊂ R^(N×N×3) and spoof faces as S ⊂ R^(N×N×3), where N is the image size. We intend to obtain not only the correct prediction (live vs. spoof) of the input face, but also a convincing estimation of the spoof trace and live face reconstruction. To represent the spoof trace, our preliminary version assumes an additive relation between live and spoof, and uses 4 trace components {S, B, C, T} at different frequency bands:

I_spoof = (1 + ⌊S⌋_n1) I_live + ⌊B⌋_n1 + ⌊C⌋_n2 + T,   (4.1)

where S, B represent low-frequency traces, C represents mid-frequency ones, and T represents high-frequency ones. ⌊·⌋_n is the low band-pass operation; in practice, we achieve this by downsampling the original image to size n×n and upsampling it back. In the previous setting, n1 = 1 and n2 = 64. Compared to the simple representation with only a single component in Jourabloo et al. (2018), this multi-scale representation of {S, B, C, T} can largely improve the disentanglement quality and suppress undesired artifacts due to its multi-scale design. The model is designed to provide a valid estimation of the spoof traces {S, B, C, T} without respective ground truth. Our preliminary version Liu et al. (2020) aims to find a minimum intensity change that transfers an input face to the live domain:

argmin_Î ‖I − Î‖_F   s.t.  I ∈ (S ∪ L) and Î ∈ L,   (4.2)

where I is the source face, Î is the target face to be optimized, and I − Î is defined as the spoof trace. When the source face is live, I_live, I − Î should be 0 as I is already in L. When the source face is spoof, I_spoof, I − Î should be regularized to prevent unnecessary changes such as identity shift.

Despite the effectiveness of this representation, there are still two drawbacks. First, the spoof trace disentanglement is mainly formulated as an additive process. The optimization of Eqn. 4.2 limits the trace intensity, and the reconstruction for spoof regions with large appearance divergence might be sub-optimal, such as spoof glasses or masks. For those spoof regions, the physical relationship between the live and the spoof is better described as replacement rather than addition. Second, while our preliminary version represents the traces with hierarchical components, these components are learned with losses on their summation. Without careful supervision, the learned components can be ambiguous in their semantic meanings, e.g., the high-frequency component may include low-frequency information.
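For concreteness, the low band-pass operator ⌊·⌋_n and the trace composition of Eqn. 4.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation: the down/upsampling here uses block averaging with nearest-neighbor upsampling as an assumed stand-in for the resizing used in practice, and all array names are illustrative.

```python
import numpy as np

def lowpass(img, n):
    """⌊img⌋_n: downsample an N×N×C image to n×n by block averaging,
    then upsample back to N×N by pixel repetition (nearest neighbor)."""
    N = img.shape[0]
    assert N % n == 0, "illustration assumes n divides the image size"
    b = N // n
    small = img.reshape(n, b, n, b, -1).mean(axis=(1, 3))  # n×n×C
    return np.repeat(np.repeat(small, b, axis=0), b, axis=1)

def compose_spoof(I_live, S, B, C, T, n1=1, n2=64):
    """Eqn. 4.1: I_spoof = (1 + ⌊S⌋_n1) · I_live + ⌊B⌋_n1 + ⌊C⌋_n2 + T."""
    return (1 + lowpass(S, n1)) * I_live + lowpass(B, n1) + lowpass(C, n2) + T
```

With all four trace components set to zero, the composition returns the live face unchanged, matching the requirement below Eqn. 4.2 that I − Î be 0 for live inputs.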
To address the drawbacks, we introduce a spoof generation function (SGF) as an additive process followed by an inpainting process:

I_spoof = (1 − P)(I_live + T_A) + P · T_P,   (4.3)

where T_A ∈ R^(N×N×3) indicates the traces from the additive process, T_P indicates the traces from the inpainting process, and P ∈ R^(N×N×1) denotes the inpainting region. Given a spoof face, one may reconstruct the live counterpart by inverting Eqn. 4.3:

Î_live = (1 − P)(I_spoof − T_A) + P (Î_live + I_spoof − T_P),   (4.4)

As the inpainting process physically replaces content, the spoof trace T_P in the inpainting region P is identical to the spoof image I_spoof in the same region, and thus both cancel out in the second term of Eqn. 4.4. We further rename the Î_live in the second term as I_P to indicate the inpainting content within the inpainting region that should be estimated by the model. Therefore, the reconstruction of the live image becomes:

Î_live = (1 − P)(I_spoof − T_A) + P · I_P,   (4.5)

where T_A = ⌊B⌋_n1 + ⌊C⌋_n2 + T denotes the additive trace represented by three hierarchical components. n1 and n2 are set to 32 and 128, respectively. With a larger n1, the effect of the component S in the preliminary version can be incorporated into B, and hence we remove S for simplicity. Besides the additive traces, the model is further required to estimate the inpainting region P and the inpainting live content I_P. I_P is estimated based on the rest of the live facial region, without intensity constraint. We use a function G(·) to represent the reconstruction process of Eqn. 4.5. Accordingly, the optimization of Eqn. 4.2 is re-formulated by replacing Î with Eqn. 4.5:

argmin_{T_A, P, I_P} ‖I − (1−P)(I − T_A) − P · I_P‖_F → argmin_{T_A, P, I_P} ‖(1−P) T_A‖_F + ‖P (I − I_P)‖_F.   (4.6)

As we do not wish to impose any intensity constraint on I_P, the objective is formulated as:

argmin_{T_A, P} ‖(1−P) T_A‖_F + λ‖P‖_F   s.t.  I ∈ S ∪ L, Î ∈ L,   (4.7)

where λ is a weight to balance the two terms. In addition, based on Eqn. 4.3, we can define another function G(·|·) to synthesize new spoof faces, by transferring the spoof traces from I_i to I_j:

Î_spoof^(i→j) = G(I_j | I_i) = (1 − P_i)(I_j + T_A^i) + P_i · I_i.   (4.8)

Note that T_P in Eqn. 4.3 has been replaced with I_i, since the spoof image I_i contains the spoof trace for the inpainting region.

Figure 4.4 The proposed PhySTD network architecture. Except for the last layer, each conv and transposed conv is concatenated with a batch normalization layer and a leaky ReLU layer. k3c64s2 indicates a kernel size of 3×3, 64 convolution channels, and a stride of 2.

Estimating {T_A, P, I_P} from an input face I is termed spoof trace disentanglement. Given that no ground truth of the traces is available, this disentanglement can be achieved via generative adversarial training. As shown in Fig. 4.3, the proposed Physics-guided Spoof Trace Disentanglement (PhySTD) consists of a generator and discriminators. Given an input image, the generator is designed to predict the spoofness (represented by the pseudo-depth map) as well as estimate the additive traces {B, C, T} and the inpainting components {P, I_P}. With the traces, we can apply the function G(·) to reconstruct the live counterpart and the function G(·|·) to synthesize new spoof faces. We adopt a set of discriminators at multiple image resolutions to distinguish the real faces {I_live, I_spoof} from the synthetic faces {Î_live, Î_spoof}. To remedy the semantic ambiguity during {B, C, T} learning, three trace component combinations, {B}, {B, C}, and {B, C, T}, each contribute to the synthesis of a live reconstruction at one particular resolution, which is then supervised by a respective discriminator (details in Sec. 4.3.3). To learn a proper inpainting region P, we leverage both prior knowledge and the information from the additive traces.

Figure 4.5 The visualization of image decomposition for different input faces: (a) live face; (b) 3D mask attack; (c) replay attack; (d) print attack.

In the rest of this section, we present the details of the generator, the discriminators, the details of face reconstruction and synthesis, and the losses and training steps used in PhySTD.
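As a sanity check of Eqns. 4.3 and 4.5, the NumPy sketch below (illustrative variable names, binary mask assumed) synthesizes a spoof via the SGF and then reconstructs the live face. When the estimated inpainting content I_P matches the true live content inside P, the reconstruction is exact, since (1−P)² = (1−P) and (1−P)·P = 0 for a binary mask, which is exactly the cancellation argued below Eqn. 4.4.

```python
import numpy as np

def sgf_spoof(I_live, T_A, T_P, P):
    """Eqn. 4.3: additive process outside P, content replacement inside P."""
    return (1 - P) * (I_live + T_A) + P * T_P

def reconstruct_live(I_spoof, T_A, P, I_P):
    """Eqn. 4.5: remove additive traces outside P, fill P with estimated I_P."""
    return (1 - P) * (I_spoof - T_A) + P * I_P

# Round trip with a binary inpainting mask.
rng = np.random.default_rng(1)
I_live = rng.random((64, 64, 3))
T_A = 0.1 * rng.random((64, 64, 3))       # additive trace
T_P = rng.random((64, 64, 3))             # replacement content
P = np.zeros((64, 64, 1)); P[20:40, 20:40] = 1.0
I_spoof = sgf_spoof(I_live, T_A, T_P, P)
recon = reconstruct_live(I_spoof, T_A, P, I_P=I_live)  # oracle I_P
```

Here `recon` equals `I_live` exactly; in the model, of course, I_P is predicted rather than given, so the reconstruction quality inside P depends on the generator.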
4.3.2 Disentanglement Generator

As shown in Fig. 4.4, the disentanglement generator consists of a backbone encoder, a spoof trace decoder, and a depth estimation network. The backbone encoder aims to extract multi-scale features, the depth estimation network leverages the features to estimate the facial depth map, and the spoof trace decoder estimates the additive trace components {B, C, T} and the inpainting components {P, I_P}. The depth map and the spoof traces will be used to compute the spoofness score.

Backbone encoder The backbone encoder extracts features from the input images for both depth map estimation and spoof trace disentanglement. As shown in our preliminary work Liu et al. (2020), the spoof traces consist of components from different frequency bands: low-frequency traces include color distortion, mid-frequency traces include makeup strokes, and high-frequency traces include moiré patterns and mask edges. However, a vanilla CNN model might overlook high-frequency traces, since the energy of high-frequency traces is often much weaker than that of low-frequency traces. In order to encourage the network to equally regard traces with different physical properties, we explicitly decompose the image into three elements {I_B, I_C, I_T} as:

I_B = ⌊I⌋_n1,  I_C = ⌊I⌋_n2 − ⌊I⌋_n1,  I_T = I − ⌊I⌋_n2,   (4.9)

where n1 = 32, n2 = 128, and the image size N = 256. In addition, we amplify the values in I_C, I_T by two constants, 15 and 25, and then feed the concatenation of the three elements to the backbone network. Fig. 4.5 provides the visualization of the image decomposition. We observe that the traces that are less distinct in the original images become more highlighted in the I_T component: 3D mask and replay attacks bring unique patterns different from the live face pattern, while print attacks lack the necessary high-frequency details. Semantically, I_B, I_C, I_T share the same frequency domains with B, C, T respectively, and thus the decomposition potentially eases the learning of B, C, T. After that, the encoder progressively downsamples the decomposed image components 3 times to obtain features F_1 ∈ R^(128×128×64), F_2 ∈ R^(64×64×96), F_3 ∈ R^(32×32×128) via conv layers.
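The decomposition of Eqn. 4.9 is telescoping, so the three elements sum back to the input exactly regardless of how the low band-pass ⌊·⌋_n is realized. A NumPy sketch, with block-average down/upsampling as an assumed stand-in for the resizing used in practice (the amplification constants 15 and 25 are applied only at the network input, so they are omitted here):

```python
import numpy as np

def lowpass(img, n):
    """⌊img⌋_n via block-average downsampling and nearest-neighbor upsampling."""
    N = img.shape[0]
    b = N // n
    small = img.reshape(n, b, n, b, -1).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, b, axis=0), b, axis=1)

def decompose(I, n1=32, n2=128):
    """Eqn. 4.9: split I into low- (I_B), mid- (I_C), high-frequency (I_T) parts."""
    low = lowpass(I, n1)
    mid = lowpass(I, n2)
    return low, mid - low, I - mid   # I_B, I_C, I_T
```

Because I_B + I_C + I_T = ⌊I⌋_n1 + (⌊I⌋_n2 − ⌊I⌋_n1) + (I − ⌊I⌋_n2) = I, the decomposition loses no information; it only redistributes it across bands.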
Spoof trace decoder The decoder upsamples the feature F_3 with transposed conv layers back to the input face size of 256. The last layer outputs both the additive traces {B, C, T} and the inpainting components {P, I_P}. Similar to U-Net Ronneberger et al. (2015), we apply short-cut connections between the backbone encoder and the decoder to bypass the multi-scale details for a high-quality trace estimation.

Depth estimation network We still recognize the importance of the discriminative supervision used in auxiliary FAS, and thus introduce a depth estimation network to perform the pseudo-depth estimation for face anti-spoofing as proposed in Liu et al. (2018c). The depth estimation network takes as input the concatenated features F_1, F_2, F_3 from the backbone encoder and U_3 from the decoder. The features are put through a spatial attention mechanism from Yu et al. (2020b) and resized to the same size of K = 32. It outputs a face depth map M ∈ R^(32×32), where the depth values are normalized within [0, 1]. Regarding the number of parameters, both the spoof trace decoder and the depth estimation network are lightweight, while the backbone network is much heavier. With more network layers being shared to tackle both depth estimation and spoof trace disentanglement, the knowledge learnt from spoof trace disentanglement can be better shared with the depth estimation task, which can lead to a better performance.

Final scoring In the testing phase, we use the norm of the depth map and the intensity of the spoof traces for the real vs. spoof classification:

score = (1 / 2K²) ‖M‖_1 + (α_0 / 2N²) (‖B‖_1 + ‖C‖_1 + ‖T‖_1 + ‖P‖_1),   (4.10)

where α_0 is the weight for the spoof traces.

4.3.3 Reconstruction and Synthesis

There are multiple options to use the disentangled spoof traces: 1) live reconstruction, 2) spoof synthesis, and 3) "harder" sample synthesis, which will be described below respectively.
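Before turning to these, note that the scoring rule of Eqn. 4.10 above reduces to a few lines. A NumPy sketch (the weight α_0 is a free parameter, set from the training or validation set as described in Sec. 4.4.1; a live face with a near-zero depth map would score near zero, which is backwards — per Eqn. 4.16 the live ground-truth depth is face-like and the spoof depth is zero, so higher scores here indicate live under that convention; this sketch only evaluates the formula as written):

```python
import numpy as np

def spoof_score(M, B, C, T, P, alpha0=1.0, K=32, N=256):
    """Eqn. 4.10: depth-map L1 norm plus weighted L1 intensity of the traces."""
    depth_term = np.abs(M).sum() / (2 * K**2)
    trace_term = alpha0 * (np.abs(B).sum() + np.abs(C).sum()
                           + np.abs(T).sum() + np.abs(P).sum()) / (2 * N**2)
    return depth_term + trace_term
```

With an all-zero depth map and zero traces the score is exactly 0, and an all-ones depth map alone contributes 1/2, so the two terms are on comparable scales before α_0 is applied.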
Live reconstruction: Based on Eqn. 4.5, we propose a hierarchical reconstruction of the live face counterpart from the input images. To reconstruct faces at a certain resolution, each additive trace is included only if its frequency domain is lower than the target resolution. We apply three resolution settings {hi, mid, low} as:

Î_hi = (1 − P)(I − ⌊B⌋_n1 − ⌊C⌋_n2 − T) + P · I_P,
Î_mid = (1 − P)(⌊I⌋_n2 − ⌊B⌋_n1 − ⌊C⌋_n2) + P · I_P,
Î_low = (1 − P)(⌊I⌋_n1 − ⌊B⌋_n1) + P · I_P.   (4.11)

Spoof synthesis: Based on Eqn. 4.8, we can obtain a new spoof face by applying the spoof traces disentangled from a spoof face I_i to a live face I_j. However, spoof traces may contain face-dependent content associated with the original spoof subject. Directly applying them to a new face with different shapes or poses may result in misalignment and strong visual implausibility. Therefore, the spoof traces should go through a geometry correction before performing this synthesis. We propose an online 3D warping layer and will introduce it in the following subsection.

"Harder" sample synthesis: The disentangled spoof traces can not only reconstruct the live and synthesize new spoof, but also synthesize "harder" spoof samples by removing or amplifying part of the spoof traces. We can tune one or some of the trace elements {B, C, T, P} to make the spoof sample become "less spoofed", which is thus closer to a live face since the spoof traces are weakened. Such spoof data can be regarded as harder samples and may benefit the generalization of the disentanglement generator. For instance, when removing the low-frequency element B from a replay spoof trace, the generator may be forced to rely on other elements such as high-level texture patterns. To synthesize the "harder" sample Î_hard, we follow Eqn. 4.8 with two minor changes: 1) generate 3 random weights between [0, 1] and multiply each with one component of {B, C, T}; 2) randomly remove the inpainting process (i.e., set P = 0) with a probability of 0.5. Compared with other methods, such as brightness and contrast change Liu et al. (2019b), blurriness effect Yang et al. (2019a), or 3D distortion Guo et al. (2019), our approach can introduce more realistic and effective data samples, as shown in Sec. 4.4.

Figure 4.6 The online 3D warping layer. (a) Given the corresponding dense offset, we warp the spoof trace and add it to the target live face to create a new spoof. E.g., pixel (x, y) with offset (3, 5) is warped to pixel (x+3, y+5) in the new image. (b) To obtain a dense offset from the sparse offsets of the selected face shape vertices, Delaunay triangulation interpolation is adopted.

4.3.3.1 Online 3D Warping Layer

We propose an online 3D warping layer to correct the shape discrepancy. To obtain the warping, previous methods in Chang et al. (2018); Liu et al. (2018c) use face swapping and a pre-computed dense offset, respectively, where both methods are non-differentiable as well as memory intensive. In contrast, our warping layer is designed to be both differentiable and computationally efficient, which is necessary for online synthesis during the training.

First, the live reconstruction of a spoof face I_i can be expressed as:

G_i = G(I_i)[p_0],   (4.12)

where p_0 = {(0, 0), (0, 1), ..., (255, 255)} ∈ R^(256×256×2) enumerates the pixel locations in I_i. To align the spoof traces while synthesizing a new spoof face, a dense offset Δp_(i→j) ∈ R^(256×256×2) is required to indicate the deformation between face I_i and face I_j. A discrete deformation can be acquired from the distances of the corresponding facial landmarks between the two faces. During the data preparation, we use Liu et al. (2017) to fit a 3DMM model and extract the 2D locations of Q facial vertices for each face:

s = {(x_0, y_0), (x_1, y_1), ..., (x_Q, y_Q)} ∈ R^(Q×2).   (4.13)

A sparse offset on the corresponding vertices can then be computed between the two faces as Δs_(i→j) = s_j − s_i. To convert the sparse offset Δs_(i→j) to the dense offset Δp_(i→j), we apply a triangulation interpolation:

Δp_(i→j) = Tri(p_0; s_i, Δs_(i→j)),   (4.14)

where Tri(·) is the interpolation, s_i denotes the vertex locations, Δs_(i→j) are the vertex values, and we adopt Delaunay triangulation. The warping operation can be denoted as:

G_(i→j) = G(I_j | I_i)[p_0 + Δp_(i→j)],   (4.15)

where the offset Δp_(i→j) applies to all subject-i related elements {T_A^i, I_i, P_i}. Since the offset Δp_(i→j) is typically composed of fractional numbers, we implement bilinear interpolation to sample the fractional pixel locations. We select Q = 140 vertices to cover the face region so that they can represent non-rigid deformation due to pose and expression. As the pixel values in the warped face are a linear combination of the pixel values of the triangulation vertices, this entire process is differentiable. This process is illustrated in Fig. 4.6.

Algorithm 1 PhySTD Training Iteration.
Input: live faces I_live and facial landmarks s_live, spoof faces I_spoof and facial landmarks s_spoof, ground truth depth map M_0, preliminary mask P_0;
Output: reconstructed live Î_live, synthesized spoof Î_spoof, spoof traces {T_A^l, P^l, I_P^l; T_A^s, P^s, I_P^s}, depth maps {M^l, M^s};
while iteration < max_iteration do
  // training step 1
  1: compute T_A^l, P^l, I_P^l ← G(I_live) and compute T_A^s, P^s, I_P^s ← G(I_spoof);
  2: estimate the depth maps M^l, M^s;
  3: compute losses L_depth, L_P, L_R;
  // training step 2
  4: compute Î_low, Î_mid, Î_hi from T_A^s, P^s, I_P^s and I_spoof (Eqn. 4.4);
  5: compute the warping offset Δp_(s→l) from s_live, s_spoof (Eqn. 4.14);
  6: compute Î_spoof from the warped T_A^(s→l), P^(s→l) and I_live (Eqn. 4.15);
  7: send I_live, I_spoof, Î_low, Î_mid, Î_hi, Î_spoof to the discriminators;
  8: compute the adversarial loss for the generator L_G and for the discriminators L_D;
  // training step 3
  9: create harder samples Î_hard from T_A^(s→l), P^(s→l) and I_live with random perturbation on the traces;
  10: compute T_A^h, P^h, I_P^h ← G(Î_hard);
  11: compute the depth map M^h for Î_hard;
  12: compute losses L_S, L_H;
  // back-propagation
  13: back-propagate the losses from steps 3, 8, 12 to the corresponding parts and update the network;
end

4.3.4 Multi-scale Discriminators

Motivated by Wang et al. (2018b), we adopt multiple discriminators at different resolutions (e.g., 32, 96, and 256) in our GAN architecture. We follow the design of PatchGAN Isola et al. (2017), which essentially is a fully convolutional network. Fully convolutional networks are shown to be effective not only to synthesize high-quality images Isola et al. (2017); Wang et al. (2018b), but also to tackle face anti-spoofing problems Liu et al. (2018c). For each discriminator, we adopt the same structure but do not share the weights.

As shown in Fig. 4.4, we use in total 4 discriminators in our work: D_1, working at the lowest resolution of 32, focuses on low-frequency elements, since the higher-frequency traces are erased by downsampling; D_2, working at the resolution of 96, focuses on middle-level content patterns; D_3 and D_4, working at the highest resolution of 256, focus on the texture details. Our preliminary version resizes the real and synthetic samples {I, Î} to different resolutions and assigns them to each discriminator. To remove semantic ambiguity and provide correspondence to the trace components, we instead assign the hierarchical reconstructions from Eqn. 4.11 to the discriminators: we send the low-frequency pair {I_live, Î_low} to D_1, the middle-frequency pair {I_live, Î_mid} to D_2, the high-frequency pair {I_live, Î_hi} to D_3, and the real/synthetic spoof {I_spoof, Î_spoof} to D_4. Each discriminator outputs a 1-channel map in the range of [0, 1], where 0 denotes fake and 1 denotes real.

4.3.5 Loss Functions and Training Steps

We utilize multiple loss functions to supervise the learning of the depth maps and spoof traces. Each training iteration consists of three training steps. We first introduce the loss functions, followed by how they are used in the training steps.
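Steps 5–6 of Alg. 1 rely on the 3D warping layer of Sec. 4.3.3.1. Its bilinear sampling at fractional locations p_0 + Δp can be sketched in NumPy as below; this is a non-batched CPU illustration (the actual layer runs differentiably inside the network, and the dense offset would come from the Delaunay interpolation of Eqn. 4.14, which is omitted here).

```python
import numpy as np

def bilinear_warp(img, offset):
    """Sample img (H×W×C) at fractional locations (y+dy, x+dx).

    offset: H×W×2 array of (dy, dx) per pixel; out-of-range samples are
    clamped to the image border for simplicity."""
    H, W, _ = img.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    fy = np.clip(ys + offset[..., 0], 0, H - 1)
    fx = np.clip(xs + offset[..., 1], 0, W - 1)
    y0 = np.floor(fy).astype(int); x0 = np.floor(fx).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = (fy - y0)[..., None]; wx = (fx - x0)[..., None]
    # Weighted average of the 4 neighboring pixels.
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

A zero offset returns the image unchanged, and an integer offset reduces to a pure shift; because the output is a linear combination of input pixels, the operation is differentiable with respect to both the image and the offset, which is the property the text relies on.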
Depth map loss: We follow the auxiliary FAS in Liu et al. (2018c) to estimate an auxiliary depth map M, where the depth ground truth M_0 for a live face contains a face-like shape and the depth for spoof should be zero. We apply the L-1 norm on this loss as:

L_depth = (1/K²) E_(i∼L∪S) [ ‖M_i − M_i^0‖_F ],   (4.16)

where K = 32 is the size of M. We apply the dense face alignment Liu et al. (2017) to estimate the 3D shape and render the depth ground truth M_0.

Adversarial loss for G: We employ the LSGAN Mao et al. (2017) on the reconstructed live faces and synthesized spoof faces. It encourages the reconstructed live to look similar to real live from domain L, and the synthesized spoof faces to look similar to faces from domain S:

L_G = E_(i∼L, j∼S) [ ‖D_1(Î_low^j) − 1‖_F² + ‖D_2(Î_mid^j) − 1‖_F² + ‖D_3(Î_hi^j) − 1‖_F² + ‖D_4(Î_spoof^(j→i)) − 1‖_F² ].   (4.17)

Adversarial loss for D: The adversarial loss for the discriminators encourages D(·) to distinguish between real live vs. reconstructed live, and real spoof vs. synthesized spoof:

L_D = E_(i∼L, j∼S) [ ‖D_1(I_i) − 1‖_F² + ‖D_2(I_i) − 1‖_F² + ‖D_3(I_i) − 1‖_F² + ‖D_4(I_j) − 1‖_F² + ‖D_1(Î_low^j)‖_F² + ‖D_2(Î_mid^j)‖_F² + ‖D_3(Î_hi^j)‖_F² + ‖D_4(Î_spoof^(j→i))‖_F² ].   (4.18)

Figure 4.7 Preliminary mask P_0 for the negative term in the inpainting mask loss. White pixels denote 1 and black pixels denote 0. White indicates the area that should not be inpainted. P_0 for: (a) print, replay; (b) 3D mask and makeup; (c) partial attacks that cover the eye portion; (d) partial attacks that cover the mouth portion.

Inpainting mask loss: The ground truth inpainting region for all spoof attacks is barely possible to obtain, hence a fully supervised training as used in Tran et al. (2017a) for the inpainting mask is out of the question. However, we may still leverage the prior knowledge of spoof attacks to facilitate the estimation of the inpainting masks. The inpainting mask loss consists of a positive term and a negative term. First, the positive term encourages certain regions to be inpainted. As the goal of the inpainting process is to allow certain regions to change without intensity constraint, the region with larger magnitude of additive traces should have a higher probability to be inpainted. Hence, the positive term adopts an L-2 norm between the inpainting region P and the region where the additive trace is larger than a threshold β.

Second, the negative term discourages certain regions from being inpainted. While the ground truth inpainting mask is unknown, it is straightforward to mark a large portion of the region that should not be inpainted. For instance, the inpainting region for funny eyeglasses should not appear in the lower part of a face. Hence, we provide a preliminary mask P_0 to indicate the not-to-be-inpainted region, and adopt a normalized L-2 norm on the masked inpainting region P ⊙ P_0 as the negative term. The preliminary mask P_0 is illustrated in Fig. 4.7. Overall, the inpainting mask loss is formed as:

L_P = E_(i∼S) [ ‖P_i − 𝟙(T_A^i > β)‖_F² + ‖P_i ⊙ P_i^0‖_F² / ‖P_i^0‖_F² ].   (4.19)

Trace regularization: Based on Eqn. 4.6 with λ = 1, we regularize the intensity of the additive traces {B, C, T} and the inpainting region P. The regularizer loss is denoted as:

L_R = E_(i∼L∪S) [ ‖B‖_F² + ‖C‖_F² + ‖T‖_F² + ‖P‖_F² ].   (4.20)

Synthesized spoof loss: The synthesized spoof data come with ground truth spoof traces. As a result, we are able to define a supervised pixel loss for the generator to disentangle the exact spoof traces that were added:

L_S = E_(i∼L, j∼S) [ ‖G(⌈G_(j→i)⌉) − ⌈G_(j→i)⌉‖_1 ],   (4.21)

where G_(j→i) is the overall effect of {P_j, I_P^j, B_j, C_j, T_j} after warping to subject i, and ⌈·⌉ is the stop_gradient operation. Without stopping the gradient, G_(j→i) may collapse to 0.
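The inpainting mask loss of Eqn. 4.19 can be sketched as below. This is an illustrative NumPy rendering under stated assumptions: the indicator region 𝟙(T_A > β) is taken over the per-pixel channel-maximum magnitude of the additive trace (the exact channel reduction is a detail of the implementation, not spelled out in the text), and the small epsilon guards against an empty prior mask.

```python
import numpy as np

def inpaint_mask_loss(P, T_A, P0, beta=0.1):
    """Eqn. 4.19: positive term pulls P toward regions of strong additive
    traces; negative term penalizes inpainting inside the prior mask P0
    (P0 = 1 marks regions that should NOT be inpainted)."""
    # Indicator of where the additive trace magnitude exceeds beta.
    target = (np.abs(T_A).max(axis=-1, keepdims=True) > beta).astype(float)
    positive = ((P - target) ** 2).sum()
    negative = ((P * P0) ** 2).sum() / ((P0 ** 2).sum() + 1e-8)
    return positive + negative
```

When the predicted mask matches the trace-magnitude indicator and never overlaps P_0, the loss is zero, which is the behavior the two terms are designed to encourage.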
Depth map loss for "harder" samples: We send the "harder" synthesized spoof data to the depth estimation network to improve the data diversity, and hope to increase the FAS model's generalization:

L_H = (1/K²) E_(i∼Ŝ) [ ‖M_i − M_i^0‖_F ],   (4.22)

where Ŝ denotes the domain of synthesized spoof faces.

Training steps and total loss: Each training iteration has 3 training steps. In training step 1, live faces I_live and spoof faces I_spoof are fed into the generator G(·) to disentangle the spoof traces. The spoof traces are used to reconstruct the live counterpart Î_live and synthesize new spoof Î_spoof. The generator is updated with respect to the depth map loss L_depth, adversarial loss L_G, inpainting mask loss L_P, and regularizer loss L_R:

L = λ_1 L_depth + λ_2 L_G + λ_3 L_P + λ_4 L_R.   (4.23)

In training step 2, the discriminators are supervised with the adversarial loss L_D to compete with the generator. In training step 3, I_live and Î_hard are fed into the generator with the ground truth labels and traces to minimize the synthesized spoof loss L_S and the depth map loss L_H:

L = λ_5 L_S + λ_6 L_H,   (4.24)

where λ_1–λ_6 are the weights to balance the multitask training. Note that we send the original live faces I_live together with Î_hard for a balanced mini-batch, which is important when computing the moving average in the batch normalization layers. We execute all 3 steps in each mini-batch iteration, but reduce the learning rate for the discriminator step by half. The whole training process is depicted in Alg. 1.
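Step 9 of Alg. 1 perturbs the disentangled traces to create the "harder" samples. A minimal sketch following the two rules of Sec. 4.3.3 (three random weights in [0, 1] applied per component, and the inpainting process dropped with probability 0.5); how the perturbed components are then recombined via Eqn. 4.8 is omitted here:

```python
import numpy as np

def perturb_traces(B, C, T, P, rng):
    """Weaken disentangled traces to synthesize a 'harder' (less spoofed) sample.

    Rule 1: scale each of {B, C, T} by an independent weight in [0, 1].
    Rule 2: with probability 0.5, remove the inpainting process (P = 0)."""
    wB, wC, wT = rng.random(3)
    keep_inpaint = rng.random() >= 0.5
    P_out = P if keep_inpaint else np.zeros_like(P)
    return wB * B, wC * C, wT * T, P_out
```

Because every weight is at most 1, each perturbed trace is never stronger than the original, so the synthesized sample drifts toward the live domain, which is what makes it a harder training case for the generator.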
Protocol  Method                         APCER(%)    BPCER(%)    ACER(%)
1         STASN (Yang et al. (2019a))    1.2         2.5         1.9
          Auxiliary (Liu et al. (2018c)) 1.6         1.6         1.6
          DeSpoof (Jourabloo et al. (2018)) 1.2      1.7         1.5
          DRL (Zhang et al. (2020a))     1.7         0.8         1.3
          STDN (Liu et al. (2020))       0.8         1.3         1.1
          CDCN (Yu et al. (2020b))       0.4         1.7         1.0
          HMP (Yu et al. (2020a))        0.0         1.6         0.8
          CDCN++ (Yu et al. (2020b))     0.4         *0.0*       *0.2*
          Ours                           *0.0*       0.8         0.4
2         DeSpoof (Jourabloo et al. (2018)) 4.2      4.4         4.3
          Auxiliary (Liu et al. (2018c)) 2.7         2.7         2.7
          DRL (Zhang et al. (2020a))     1.1         3.6         2.4
          STASN (Yang et al. (2019a))    4.2         *0.3*       2.2
          STDN (Liu et al. (2020))       2.3         1.6         1.9
          HMP (Yu et al. (2020a))        2.6         0.8         1.7
          CDCN (Yu et al. (2020b))       0.4         1.7         1.5
          CDCN++ (Yu et al. (2020b))     1.8         0.8         1.3
          Ours                           *1.2*       1.3         *1.3*
3         DeSpoof (Jourabloo et al. (2018)) 4.0±1.8  3.8±1.2     3.6±1.6
          Auxiliary (Liu et al. (2018c)) 2.7±1.3     3.1±1.7     2.9±1.5
          STDN (Liu et al. (2020))       1.6±1.6     4.0±5.4     2.8±3.3
          STASN (Yang et al. (2019a))    4.7±3.9     *0.9±1.2*   2.8±1.6
          HMP (Yu et al. (2020a))        2.8±2.4     2.3±2.8     2.5±1.1
          CDCN (Yu et al. (2020b))       2.4±1.3     2.2±2.0     2.3±1.4
          DRL (Zhang et al. (2020a))     2.8±2.2     1.7±2.6     2.2±2.2
          CDCN++ (Yu et al. (2020b))     1.7±1.5     2.0±1.2     *1.8±0.7*
          Ours                           *1.7±1.4*   2.2±3.5     1.9±2.3
4         Auxiliary (Liu et al. (2018c)) 9.3±5.6     10.4±6.0    9.5±6.0
          STASN (Yang et al. (2019a))    6.7±10.6    8.3±8.4     7.5±4.7
          CDCN (Yu et al. (2020b))       4.6±4.6     9.2±8.0     6.9±2.9
          DeSpoof (Jourabloo et al. (2018)) 5.1±6.3  6.1±5.1     5.6±5.7
          HMP (Yu et al. (2020a))        2.9±4.0     7.5±6.9     5.2±3.7
          CDCN++ (Yu et al. (2020b))     4.2±3.4     5.8±4.9     5.0±2.9
          DRL (Zhang et al. (2020a))     5.4±2.9     *3.3±6.0*   4.8±6.4
          STDN (Liu et al. (2020))       2.3±3.6     5.2±5.4     3.8±4.2
          Ours                           *2.3±3.6*   4.2±5.4     *3.6±4.2*

Table 4.1 The evaluation on four protocols in OULU-NPU. Values marked with * (bold in the original) indicate the best score in each protocol.
4.4 Experiments

In this section, we introduce the experimental setup, and then present the results in the known, unknown, and open-set spoof scenarios, with comparisons to respective baselines. Next, we quantitatively evaluate the spoof traces by performing a spoof medium classification, and conduct an ablation study on each design in the proposed method. Finally, we provide visualization results on the spoof trace disentanglement, new spoof synthesis, and t-SNE visualization.

4.4.1 Experimental Setup

Databases We conduct experiments on three major databases: Oulu-NPU Boulkenafet et al. (2017b), SiW Liu et al. (2018c), and SiW-M Liu et al. (2019a). Oulu-NPU and SiW include print/replay attacks, while SiW-M includes 13 spoof types. We follow all the existing testing protocols and compare with SOTA methods. Similar to most prior works, we only use the face region for training and testing.

Evaluation metrics Two common metrics are used in this work for comparison: EER and APCER/BPCER/ACER. EER describes the theoretical performance and predetermines the threshold for making decisions. APCER/BPCER/ACER in ISO/IEC-JTC-1/SC-37 (2016) describe the practical performance given a predetermined threshold. For both evaluation metrics, a lower value means better performance. The threshold for APCER/BPCER/ACER is computed from either the training set or the validation set. In addition, we also report the True Detection Rate (TDR) at a given False Detection Rate (FDR). This metric describes the spoof detection rate at a strict tolerance to live errors, which is widely used to evaluate real-world systems IARPA (2016). In this work, we report TDR at FDR = 0.5%. For TDR, the higher the better.

Parameter setting PhySTD is implemented in TensorFlow with an initial learning rate of 5e-5. We train in total 150,000 iterations with a batch size of 8, and decrease the learning rate by a ratio of 10 every 45,000 iterations. We initialize the weights with a [0, 0.02] normal distribution. {λ_1, λ_2, λ_3, λ_4, λ_5, λ_6} are set to be {100, 5, 1, 1e-4, 10, 1}, and β = 0.1. α_0 is empirically determined from the training or validation set. We use the open-source face alignment Bulat & Tzimiropoulos (2017) and 3DMM fitting Liu et al. (2017) to crop the face and provide the 140 landmarks.
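Under the ISO convention referenced above (APCER: rate of spoof/attack presentations wrongly accepted as live; BPCER: rate of live presentations wrongly rejected; ACER: their mean), the threshold-based metrics reduce to a few lines. A sketch in NumPy; the score polarity (higher = more spoof-like, matching Eqn. 4.10's spoofness score) is an assumption of this illustration:

```python
import numpy as np

def acer_metrics(live_scores, spoof_scores, threshold):
    """Classify score >= threshold as spoof. Returns (APCER, BPCER, ACER) in %."""
    live = np.asarray(live_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    apcer = 100.0 * np.mean(spoof < threshold)   # attacks accepted as live
    bpcer = 100.0 * np.mean(live >= threshold)   # live rejected as spoof
    return apcer, bpcer, (apcer + bpcer) / 2.0
```

The EER is then the error rate at the threshold where APCER equals BPCER, and TDR@FDR=0.5% is the fraction of spoof samples detected at the threshold that misclassifies only 0.5% of live samples; both can be read off a sweep of `threshold` over the sorted scores.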
Protocol 1:
  Method                            APCER(%)  BPCER(%)  ACER(%)
  Auxiliary (Liu et al. (2018c))    3.6       3.6       3.6
  STASN (Yang et al. (2019a))       -         -         1.0
  Meta-FAS-DR (Zhao et al. (2019))  0.5       0.5       0.5
  HMP (Yu et al. (2020a))           0.6       0.2       0.5
  DRL (Zhang et al. (2020a))        0.1       0.5       0.3
  CDCN (Yu et al. (2020b))          0.1       0.2       0.1
  CDCN++ (Yu et al. (2020b))        0.1       0.2       0.1
  Ours                              0.0       0.0       0.0

Protocol 2 (mean±std):
  Auxiliary (Liu et al. (2018c))    0.6±0.7   0.6±0.7   0.6±0.7
  Meta-FAS-DR (Zhao et al. (2019))  0.3±0.3   0.3±0.3   0.3±0.3
  STASN (Yang et al. (2019a))       -         -         0.3±0.1
  HMP (Yu et al. (2020a))           0.1±0.2   0.2±0.0   0.1±0.1
  DRL (Zhang et al. (2020a))        0.1±0.2   0.1±0.1   0.1±0.0
  CDCN (Yu et al. (2020b))          0.0±0.0   0.1±0.1   0.1±0.0
  CDCN++ (Yu et al. (2020b))        0.0±0.0   0.1±0.1   0.0±0.1
  Ours                              0.0±0.0   0.0±0.0   0.0±0.0

Protocol 3 (mean±std):
  STASN (Yang et al. (2019a))       -         -         12.1±1.5
  Auxiliary (Liu et al. (2018c))    8.3±3.8   8.3±3.8   8.3±3.8
  Meta-FAS-DR (Zhao et al. (2019))  8.0±5.0   7.4±5.7   7.7±5.3
  DRL (Zhang et al. (2020a))        9.4±6.1   1.8±2.6   5.6±4.4
  HMP (Yu et al. (2020a))           2.6±0.9   2.3±0.5   2.5±0.7
  CDCN (Yu et al. (2020b))          2.4±1.3   2.2±2.0   2.3±1.4
  CDCN++ (Yu et al. (2020b))        1.7±1.5   2.0±1.2   1.8±0.7
  Ours                              13.1±9.4  1.6±0.6   7.4±4.3

Table 4.2 The evaluation on three protocols in the SiW dataset. We compare with the top 7 performances.

4.4.2 Anti-Spoofing for Known Spoof Types

Oulu-NPU  Oulu-NPU Boulkenafet et al. (2017b) is a commonly used face anti-spoofing benchmark due to its high-quality data and challenging testing protocols. Tab. 4.1 shows our performance on Oulu-NPU, compared with SOTA algorithms. Our method achieves the best overall performance on this database. Compared with our preliminary version Liu et al. (2020), we demonstrate improvements in all 4 protocols, with significant improvement on protocol 1 and protocol 3, i.e., reducing the ACER by 63.6% and 32.1% respectively. Compared with the SOTA, our approach achieves similar best performances on the first three protocols and outperforms the SOTA on the fourth protocol, which is the most challenging one. Note that, in protocol 3 and protocol 4, the performances on testing camera 6 are much lower than those on cameras 1-5: the ACERs for camera 6 are 6.4% and 10.2%, while the average ACERs for the other cameras are 1.0% and 2.0% respectively. Compared with the other cameras, we notice that camera 6 has stronger sensor noises, and our model recognizes them as unknown spoof traces, which leads to an increased false negative rate (i.e., BPCER). How to separate sensor noises from spoof traces can be an important future research topic.

SiW  SiW Liu et al. (2018c) is another recent high-quality database. It includes fewer capture cameras but more spoof mediums and environment variations, such as pose, illumination, and expression. The comparison on three protocols is shown in Tab. 4.2. We outperform the previous works on the first two protocols and rank in the middle on protocol 3. Protocol 3 aims to test the performance of unknown spoof detection, where the model is trained on one spoof attack (print or replay) and tested on the other. As we can see from Fig. 4.8-4.9, the traces of print and replay are different: the replay traces lie more in the high-frequency part (i.e., trace component T) and the print traces lie more in the low-frequency part (i.e., trace component S). This pattern divergence leads to the adaptation gap of our method when training on one attack and testing on the other.

SiW-M  SiW-M Liu et al. (2019a) contains a large diversity of spoof types, including print, replay, 3D mask, makeup, and partial attacks. This allows us to have a comprehensive evaluation of the proposed approach with different spoof attacks. To use SiW-M for known spoof detection, we randomly split the data of all types into train/test sets with a ratio of 60% vs. 40%, and the results are shown in Tab. 4.3. Compared to the preliminary version Liu et al. (2020), our method outperforms on most spoof types as well as on the overall EER performance by 47.9% relatively, which demonstrates the superiority of our method on known spoof attacks.

Columns: Replay | Print | 3D Mask: Half / Silic. / Trans. / Paper / Mann. | Makeup: Ob. / Im. / Cos. | Partial: Funny. / Papergls. / Paper | Overall

ACER (%):
  Auxiliary (Liu et al. (2018c)):      5.1, 5.0, 5.0, 10.2, 5.0, 9.8, 6.3, 19.6, 5.0, 26.5, 5.5, 5.2, 5.0 | 6.3
  SDTN (Liu et al. (2020)):            3.2, 3.1, 3.0, 9.0, 3.0, 3.4, 4.7, 3.0, 3.0, 24.5, 4.1, 3.7, 3.0 | 4.1
  Step 1:                              6.1, 5.4, 5.4, 5.4, 5.4, 5.4, 5.4, 22.7, 5.4, 26.8, 5.4, 5.5, 5.4 | 10.9
  Step 1 + Step 2 w/ single trace:     8.7, 7.8, 7.8, 7.8, 7.8, 7.8, 7.8, 25.0, 7.9, 28.8, 7.8, 7.8, 7.8 | 13.8
  Step 1 + Step 2:                     4.1, 3.9, 3.9, 3.9, 4.0, 3.9, 4.0, 13.5, 4.0, 25.1, 3.9, 3.9, 3.9 | 4.6
  Step 1 + Step 2 + Step 3 (Ours):     3.2, 1.4, 1.0, 2.3, 1.3, 2.9, 2.5, 12.4, 1.2, 18.5, 1.7, 0.4, 1.6 | 2.8

EER (%):
  Auxiliary (Liu et al. (2018c)):      4.7, 0.0, 1.6, 10.5, 4.6, 10.0, 6.4, 12.7, 0.0, 19.6, 9.3, 7.5, 0.0 | 6.7
  SDTN (Liu et al. (2020)):            2.1, 2.2, 0.0, 7.2, 0.1, 3.9, 4.8, 0.0, 0.0, 19.6, 5.3, 5.4, 0.0 | 4.8
  Step 1:                              3.8, 2.7, 1.5, 2.7, 1.9, 1.8, 2.4, 15.1, 0.7, 28.7, 4.1, 4.9, 1.0 | 4.3
  Step 1 + Step 2 w/ single trace:     6.7, 5.3, 0.8, 1.5, 1.4, 3.3, 3.2, 21.5, 1.0, 27.1, 6.5, 6.1, 1.5 | 5.8
  Step 1 + Step 2:                     2.4, 3.1, 0.4, 2.6, 1.2, 3.0, 2.4, 9.5, 0.4, 23.5, 1.1, 0.5, 0.6 | 2.8
  Step 1 + Step 2 + Step 3 (Ours):     2.5, 1.0, 0.0, 2.1, 1.0, 1.9, 2.2, 8.2, 0.0, 18.5, 0.8, 0.0, 0.4 | 2.5

TPR@FNR=0.5% (%):
  SDTN (Liu et al. (2020)):            90.1, 76.1, 80.7, 71.5, 62.3, 74.4, 85.0, 100, 100, 33.8, 49.6, 30.6, 97.7 | 70.4
  Step 1:                              43.8, 43.3, 47.2, 44.5, 62.9, 54.8, 55.4, 16.7, 90.6, 31.5, 60.3, 56.7, 77.1 | 59.3
  Step 1 + Step 2 w/ single trace:     58.9, 76.8, 97.6, 94.2, 94.9, 66.3, 78.3, 13.3, 94.1, 49.1, 62.4, 58.5, 92.1 | 74.8
  Step 1 + Step 2:                     84.7, 74.7, 100, 70.1, 96.6, 77.5, 89.6, 36.9, 100, 40.1, 96.3, 99.4, 99.4 | 89.7
  Step 1 + Step 2 + Step 3 (Ours):     85.7, 85.4, 100, 76.6, 96.3, 80.2, 93.8, 41.1, 100, 55.8, 98.1, 100, 99.8 | 91.2

Table 4.3 The evaluation and ablation study on SiW-M Protocol I: known spoof detection.

Columns: Replay | Print | 3D Mask: Half / Silic. / Trans. / Paper / Mann. | Makeup: Ob. / Im. / Cos. | Partial: Fun. / Papergls. / Paper | Average (±std)

APCER (%):
  Auxiliary (Liu et al. (2018c)):        23.7, 7.3, 27.7, 18.2, 97.8, 8.3, 16.2, 100.0, 18.0, 16.3, 91.8, 72.2, 0.4 | 38.3±37.4
  LBP+SVM (Boulkenafet et al. (2017b)):  19.1, 15.4, 40.8, 20.3, 70.3, 0.0, 4.6, 96.9, 35.3, 11.3, 53.3, 58.5, 0.6 | 32.8±29.8
  DTL (Liu et al. (2019a)):              1.0, 0.0, 0.7, 24.5, 58.6, 0.5, 3.8, 73.2, 13.2, 12.4, 17.0, 17.0, 0.2 | 17.1±23.3
  CDCN (Yu et al. (2020b)):              8.2, 6.9, 8.3, 7.4, 20.5, 5.9, 5.0, 43.5, 1.6, 14.0, 24.5, 18.3, 1.2 | 12.7±11.7
  SDTN (Liu et al. (2020)):              1.6, 0.0, 0.5, 7.2, 9.7, 0.5, 0.0, 96.1, 0.0, 21.8, 14.4, 6.5, 0.0 | 12.2±26.1
  CDCN++ (Yu et al. (2020b)):            9.2, 6.0, 4.2, 7.4, 18.2, 0.0, 5.0, 39.1, 0.0, 14.0, 23.3, 14.3, 0.0 | 10.8±11.2
  HMP (Yu et al. (2020a)):               12.4, 5.2, 8.3, 9.7, 13.6, 0.0, 2.5, 30.4, 0.0, 12.0, 22.6, 15.9, 1.2 | 10.3±9.1
  Ours:                                  10.0, 4.9, 5.3, 16.7, 3.5, 2.0, 2.8, 92.8, 0.0, 37.5, 33.7, 23.2, 0.2 | 17.9±25.8

BPCER (%):
  LBP+SVM (Boulkenafet et al. (2017b)):  22.1, 21.5, 21.9, 21.4, 20.7, 23.1, 22.9, 21.7, 12.5, 22.2, 18.4, 20.0, 22.9 | 21.0±2.9
  DTL (Liu et al. (2019a)):              18.6, 11.9, 29.3, 12.8, 13.4, 8.5, 23.0, 11.5, 9.6, 16.0, 21.5, 22.6, 16.8 | 16.6±6.2
  SDTN (Liu et al. (2020)):              14.0, 14.6, 13.6, 18.6, 18.1, 8.1, 13.4, 10.3, 9.2, 17.2, 27.0, 35.5, 11.2 | 16.2±7.6
  CDCN (Yu et al. (2020b)):              9.3, 8.5, 13.9, 10.9, 21.0, 3.1, 7.0, 45.0, 2.3, 16.2, 26.4, 20.9, 5.4 | 14.6±11.7
  CDCN++ (Yu et al. (2020b)):            12.4, 8.5, 14.0, 13.2, 19.4, 7.0, 6.0, 45.0, 1.6, 14.0, 24.8, 20.9, 3.9 | 14.6±11.4
  HMP (Yu et al. (2020a)):               13.2, 6.2, 13.1, 10.8, 16.3, 3.9, 2.3, 34.1, 1.6, 13.9, 23.2, 17.1, 2.3 | 12.2±9.4
  Auxiliary (Liu et al. (2018c)):        10.1, 6.5, 10.9, 11.6, 6.2, 7.8, 9.3, 11.6, 9.3, 7.1, 6.2, 8.8, 10.3 | 8.9±2.0
  Ours:                                  3.8, 6.3, 4.4, 5.5, 11.3, 3.5, 6.0, 6.6, 1.8, 2.7, 6.5, 8.0, 1.1 | 5.7±2.8

ACER (%):
  LBP+SVM (Boulkenafet et al. (2017b)):  20.6, 18.4, 31.3, 21.4, 45.5, 11.6, 13.8, 59.3, 23.9, 16.7, 35.9, 39.2, 11.7 | 26.9±14.5
  Auxiliary (Liu et al. (2018c)):        16.8, 6.9, 19.3, 14.9, 52.1, 8.0, 12.8, 55.8, 13.7, 11.7, 49.0, 40.5, 5.3 | 23.6±18.5
  DTL (Liu et al. (2019a)):              9.8, 6.0, 15.0, 18.7, 36.0, 4.5, 13.4, 48.1, 11.4, 14.2, 19.3, 19.8, 8.5 | 16.8±11.1
  CDCN (Yu et al. (2020b)):              8.7, 7.7, 11.1, 9.1, 20.7, 4.5, 5.9, 44.2, 2.0, 15.1, 25.4, 19.6, 3.3 | 13.6±11.7
  SDTN (Liu et al. (2020)):              7.8, 7.3, 7.1, 12.9, 13.9, 4.3, 6.7, 53.2, 4.6, 19.5, 20.7, 21.0, 5.6 | 14.2±13.2
  CDCN++ (Yu et al. (2020b)):            10.8, 7.3, 9.1, 10.3, 18.8, 3.5, 5.6, 42.1, 0.8, 14.0, 24.0, 17.6, 1.9 | 12.7±11.2
  HMP (Yu et al. (2020a)):               12.8, 5.7, 10.7, 10.3, 14.9, 1.9, 2.4, 32.3, 0.8, 12.9, 22.9, 16.5, 1.7 | 11.2±9.2
  Ours:                                  6.9, 5.6, 4.8, 11.1, 7.4, 2.7, 4.4, 49.7, 0.9, 20.1, 20.1, 15.6, 0.6 | 11.5±13.2

EER (%):
  LBP+SVM (Boulkenafet et al. (2017b)):  20.8, 18.6, 36.3, 21.4, 37.2, 7.5, 14.1, 51.2, 19.8, 16.1, 34.4, 33.0, 7.9 | 24.5±12.9
  Auxiliary (Liu et al. (2018c)):        14.0, 4.3, 11.6, 12.4, 24.6, 7.8, 10.0, 72.3, 10.1, 9.4, 21.4, 18.6, 4.0 | 17.0±17.7
  DTL (Liu et al. (2019a)):              10.0, 2.1, 14.4, 18.6, 26.5, 5.7, 9.6, 50.2, 10.1, 13.2, 19.8, 20.5, 8.8 | 16.1±12.2
  CDCN (Yu et al. (2020b)):              8.2, 7.8, 8.3, 7.4, 20.5, 5.9, 5.0, 47.8, 1.6, 14.0, 24.5, 18.3, 1.1 | 13.1±12.6
  SDTN (Liu et al. (2020)):              7.6, 3.8, 8.4, 13.8, 14.5, 5.3, 4.4, 35.4, 0.0, 19.3, 21.0, 20.8, 1.6 | 12.0±10.0
  CDCN++ (Yu et al. (2020b)):            9.2, 5.6, 4.2, 11.1, 19.3, 5.9, 5.0, 43.5, 0.0, 14.0, 23.3, 14.3, 0.0 | 11.9±11.8
  HMP (Yu et al. (2020a)):               13.4, 5.2, 8.3, 9.7, 13.6, 5.8, 2.5, 33.8, 0.0, 14.0, 23.3, 16.6, 1.2 | 11.3±9.5
  Ours:                                  5.2, 4.4, 4.4, 10.1, 8.6, 2.6, 4.3, 47.2, 0.0, 19.6, 18.6, 12.4, 0.7 | 10.6±12.6

TPR@FNR=0.5% (%):
  SDTN (Liu et al. (2020)):              45.0, 40.5, 45.7, 36.7, 11.7, 40.9, 74.0, 0.0, 67.5, 16.0, 13.4, 9.4, 62.8 | 35.7±23.9
  Ours:                                  55.1, 46.4, 57.3, 65.1, 33.0, 91.7, 76.7, 0.0, 100.0, 46.4, 31.8, 15.4, 97.7 | 53.7±31.8

Table 4.4 The evaluation on SiW-M Protocol II: unknown spoof detection.
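The observation above, that print traces concentrate in the low-frequency component while replay traces concentrate in the high-frequency component, can be illustrated with a crude frequency split. The box-blur low-pass below is only a stand-in for the multi-scale decomposition in PhySTD, and `split_frequency` is a hypothetical helper, not part of the actual network:

```python
import numpy as np

def split_frequency(trace, k=7):
    """Split an (H, W) trace map into a low-frequency part (local
    mean, akin to color-distortion traces) and a high-frequency
    residual (akin to Moire/texture traces). A k x k box blur stands
    in for the low-pass filter; this is illustrative only."""
    pad = k // 2
    padded = np.pad(trace, pad, mode='edge')
    H, W = trace.shape
    low = np.zeros_like(trace, dtype=float)
    for i in range(H):
        for j in range(W):
            low[i, j] = padded[i:i + k, j:j + k].mean()
    high = trace - low                      # residual = high frequency
    return low, high
```

By construction the two parts sum back to the original trace, so the split loses no information; only the allocation between the two bands differs across spoof types.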
For experiments on SiW-M (Protocols I, II, and III), we additionally report the TPR at FNR equal to 0.5%. While EER and ACER provide the theoretical evaluation, users in real-world applications care more about the true spoof detection rate under a given live detection error rate, and hence TPR can better reflect how well the model can detect one or a few spoof attacks in practice. As shown in Tab. 4.3, we improve the overall TDR of our preliminary version Liu et al. (2020) by 29.5%.

Columns: Replay | Print | 3D Mask: Half / Silic. / Trans. / Paper / Mann. | Makeup: Ob. / Im. / Cos. | Partial: Funny. / Papergls. / Paper | Overall (±std)

ACER (%):
  Auxiliary (Liu et al. (2018c)):  6.7, 5.6, 8.5, 7.5, 11.6, 6.7, 6.4, 8.9, 5.7, 6.1, 14.3, 15.9, 5.4 | 8.4±3.4
  Ours:                            4.7, 3.5, 3.4, 3.3, 6.4, 2.6, 3.8, 7.0, 2.3, 3.2, 10.7, 7.3, 3.2 | 4.7±2.4

EER (%):
  Auxiliary (Liu et al. (2018c)):  6.4, 5.6, 7.7, 6.5, 10.3, 6.1, 6.1, 8.4, 5.1, 6.3, 15.3, 13.1, 5.7 | 7.9±3.2
  Ours:                            4.1, 2.8, 3.4, 3.1, 5.6, 3.6, 3.0, 6.7, 2.2, 3.4, 10.2, 8.6, 2.2 | 4.5±2.5

TPR@FNR=0.5% (%):
  Auxiliary (Liu et al. (2018c)):  60.4, 65.5, 64.4, 70.4, 47.5, 67.0, 71.6, 64.3, 75.1, 69.8, 45.8, 47.8, 62.9 | 62.5±9.7
  Ours:                            87.4, 78.7, 81.0, 84.5, 69.0, 86.3, 84.7, 85.0, 91.0, 89.3, 66.6, 64.4, 91.1 | 81.6±9.2

Table 4.5 The evaluation on SiW-M Protocol III: open-set spoof detection.

Figure 4.8 Examples of each spoof trace component. (a) the input sample faces. (b) B. (c) C. (d) T. (e) P. (f) the live counterpart reconstruction and zoom-in details. (g) results from Liu et al. (2020). (h) results from Step 1 + Step 2 with a single trace representation.

4.4.3 Anti-Spoofing for Unknown and Open-set Spoofs

Another important aspect is to test the performance on unknown spoofs. To use SiW-M for unknown spoof detection, Liu et al. (2019a) define the leave-one-out testing protocols, termed SiW-M Protocol II. In this protocol, each model (i.e., one column in Tab. 4.4) is trained with 12 types of spoof attacks (as known attacks) plus 80% of the live faces, and tested on the remaining 1 attack (as the unknown attack) plus the remaining 20% of live faces. As shown in Tab. 4.4, our PhySTD achieves significant improvement over our preliminary version, with relative gains of 11.7% on the overall EER, 19.0% on the overall ACER, and 50.4% on the overall TPR. Specifically, we reduce the EERs of half mask, paper glasses, transparent mask, replay attack, and partial paper relatively by 47.6%, 40.4%, 37.7%, 31.6%, and 56.3%, respectively. Overall, compared with the top 7 performances, we outperform the SOTA performance on EER/TPR and achieve comparable ACER. Among all, the detection of silicone mask, paper-crafted mask, mannequin head, impersonation makeup, and partial paper attacks is relatively good, with the detection accuracy (i.e., TPR@FNR=0.5%) above 65%. Obfuscation makeup is the most challenging one with a TPR of 0%, where we predict all the spoof samples as live. This is due to the fact that the makeup looks very similar to the live faces, while being dissimilar to any other spoof types. However, once we obtain a few samples, our model can quickly recognize the spoof traces on the eyebrow and cheek, synthesize new spoof samples, and successfully detect the attack (TPR=41.1% in Tab. 4.3).

Metrics (%)    Attacks        Protocol I  Protocol II  Protocol III
ACER           Impersonation  2.2         4.0          3.3
               Obfuscation    4.6         14.9         5.4
EER            Impersonation  1.4         3.1          3.2
               Obfuscation    3.6         14.0         5.1
TPR@FNR=0.5%   Impersonation  87.8        73.3         85.9
               Obfuscation    84.6        47.0         79.5

Table 4.6 The performance comparison between impersonation attacks and obfuscation attacks.

Moreover, in the real-world scenario, the testing samples can be either a known spoof attack or an unknown one. Thus, we propose SiW-M Protocol III to evaluate this open-set testing situation.
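The leave-one-out construction of Protocol II can be sketched as below. The helper name `protocol_ii_splits` is hypothetical, and the sketch glosses over how the 80/20 live split is actually drawn (by subject/video in the real protocol):

```python
def protocol_ii_splits(spoof_types, live_ids):
    """SiW-M Protocol II (leave-one-out): for each spoof type, train
    on the other 12 types plus 80% of the live faces, and test on the
    held-out type plus the remaining 20% of the live faces."""
    n_live_train = int(0.8 * len(live_ids))
    for unknown in spoof_types:
        train = {'spoof': [t for t in spoof_types if t != unknown],
                 'live': live_ids[:n_live_train]}
        test = {'spoof': [unknown],
                'live': live_ids[n_live_train:]}
        yield unknown, train, test
```

With the 13 SiW-M spoof types this yields 13 models, one per column of Tab. 4.4, each evaluated only on an attack type it has never seen.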
In Protocol III, we follow the train/test split from Protocol I, and then further remove one spoof type as the unknown attack. During testing, we test on the entire set of unknown spoof samples as well as the test split of the known spoof samples. The results are reported in Tab. 4.5. Compared to the SOTA face anti-spoofing method Liu et al. (2018c), our approach substantially outperforms it in all three metrics.

Figure 4.9 Examples of spoof trace disentanglement on SiW (a-h) and SiW-M (i-x). (a)-(d) items are print attacks and (e)-(h) items are replay attacks. (i)-(x) items are live, print, replay, half mask, silicone mask, paper mask, transparent mask, obfuscation makeup, impersonation makeup, cosmetic makeup, paper glasses, partial paper, funny eyeglasses, and mannequin head. The first column is the input face, the second column is the overall spoof trace (I − Î), and the third column is the reconstructed live.

Impersonation v.s. Obfuscation  In Tab. 4.6, we show the comparison of our performance on impersonation attacks and obfuscation attacks on the 3 protocols of the SiW-M database. On all 3 protocols, we see a better performance on impersonation attacks over obfuscation attacks, especially in the Protocol II unknown attack situations. Obfuscation attacks generally have a larger appearance discrepancy compared to impersonation attacks, and it's naturally more difficult for the CNN model to do out-of-distribution predictions. In practice, most of the attacks are impersonation attacks, and hence our solution can be effective in practical situations.

Label \ Predict   Live     Print 1   Print 2   Replay 1   Replay 2
Live              56 (-4)  1 (+1)    1 (+1)    1 (+1)     1 (+1)
Print 1           0        43 (+2)   11 (+9)   3 (-8)     3 (-3)
Print 2           0        9 (-25)   48 (+37)  1 (-8)     2 (-4)
Replay 1          1 (-9)   2 (-1)    3 (+3)    51 (+38)   3 (-28)
Replay 2          1 (-7)   2 (-5)    2 (+2)    3 (-3)     52 (+13)

Table 4.7 Confusion matrix of spoof medium classification based on spoof traces. The values in parentheses are the changes relative to the previous method Jourabloo et al. (2018); in the original table, green represents improvement over Jourabloo et al. (2018) and red represents performance drop.
Label \ Predict   Live   Print   Replay   Masks   Makeup   Partial
Live              1166   6       3        0       0        0
Print             1      40      1        3       0        1
Replay            3      1       32       1       0        1
Masks             3      1       1        90      0        3
Makeup            3      0       0        0       36       0
Partial           2      0       0        2       0        146

Table 4.8 Confusion matrix of 6-class spoof trace classification on the SiW-M database.

4.4.4 Spoof Traces Classification

To quantitatively evaluate the spoof trace disentanglement, we perform a spoof medium classification on the disentangled spoof traces and report the classification accuracy. The spoof traces should contain sufficient spoof information, so that they can be used for clustering without seeing the face. To make a fair comparison with Jourabloo et al. (2018), we remove the additional spoof type information from the preliminary mask P_0. That is, for this experiment, we only use the additive traces {B, C, T} to learn the trace classification. After training with only the binary live/spoof labels, we fix PhySTD and apply a simple CNN (i.e., AlexNet) on the estimated additive traces to do a supervised spoof medium classification. We follow the same 5-class testing protocol of Jourabloo et al. (2018) on Oulu-NPU Protocol 1. We report the classification accuracy as the ratio between correctly predicted samples from all classes and all testing samples, shown in Tab. 4.7. Our model can achieve a 5-class classification accuracy of 83.3%. If we treat the two print attacks as the same class and the two replay attacks as the same class, our model can achieve a 3-class classification accuracy of 92.0%. Compared with the prior method Jourabloo et al. (2018), we show an improvement of 29% on the 5-class model. In addition, we train the same CNN on the original images instead of the estimated spoof traces for the same spoof medium classification task, and the classification accuracy can only reach 80.6%. This further demonstrates that the estimated traces do contain significant information to distinguish different spoof mediums.

Figure 4.10 Examples of the spoof data synthesis. The first row shows the source spoof faces, the first column shows the target live faces, and the remaining images are the synthesized spoof faces from the live faces with the corresponding spoof traces.
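The "classify the medium from the trace" experiment can be mimicked with a much simpler classifier. Below, a nearest-class-mean model stands in for the AlexNet used in the text (the helper names are hypothetical, and the flattened-trace representation is an assumption for illustration):

```python
import numpy as np

def fit_prototypes(traces, labels):
    """Average the (flattened) estimated traces per spoof-medium
    class. `traces` is an (N, D) array, `labels` a length-N list."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    protos = np.stack([traces[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def predict(traces, classes, protos):
    """Assign each trace to the nearest class prototype (L2)."""
    d = ((traces[:, None, :] - protos[None]) ** 2).sum(-1)
    return [classes[i] for i in d.argmin(1)]

def confusion_matrix(y_true, y_pred, classes):
    """Rows = true label, columns = predicted label (as in Tab. 4.7)."""
    idx = {c: i for i, c in enumerate(classes)}
    M = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[idx[t], idx[p]] += 1
    return M
```

If the traces of different mediums are well separated, even this prototype classifier clusters them correctly, which is the intuition the chapter tests with a stronger CNN.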
We also execute the spoof trace classification task on more spoof types in the SiW-M database. We leverage the train/test split of SiW-M Protocol I. We train the PhySTD till convergence, and use the estimated traces from the training set to train the trace classification network. We explore the 6-class classification scenario, shown in Tab. 4.8. Our 6-class model can achieve a classification accuracy of 92.0%. Since the traces are more distinct among different spoof types, this performance is even better than the 5-class classification on the print/replay scenario in Oulu-NPU Protocol 1. This further demonstrates that PhySTD can estimate spoof traces that contain significant information about spoof mediums and can be applied to multiple spoof types.

4.4.5 Ablation Study

In this section, we show the importance of each design of our proposed approach on SiW-M Protocol I, in Tab. 4.3. Our baseline is the auxiliary FAS Liu et al. (2018c), without the temporal module. It consists of the backbone encoder and the depth estimation network. When including the image decomposition, the baseline becomes training step 1 in Alg. 1, as the traces are not activated without training step 2. To validate the effectiveness of the GAN training, we report the results from the baseline model with our GAN design, denoted as Step 1 + Step 2. We also provide a control experiment where the traces are represented by a single component, to demonstrate the effectiveness of the proposed 5-element trace representation. This model is denoted as Step 1 + Step 2 w/ single trace. In addition, we evaluate the effect of training with more synthesized data via enabling training step 3, denoted as Step 1 + Step 2 + Step 3, which is our final approach.

As shown in Tab. 4.3, the baseline model (Auxiliary) can achieve a decent performance of EER 6.7%. Adding image decomposition to the baseline (Step 1) can improve the EER from 6.7% to 4.3%, but more live samples are predicted with higher scores, causing a worse ACER. Adding a simple GAN design (Step 1 + Step 2 w/ single trace) may lead to a similar EER performance of 5.8%, but based on the TPR (59.3% → 74.8%) its practical performance is improved. With the proper physics-guided trace disentanglement, we can improve the EER to 2.8% and the TPR to 89.7%. Our final design achieves the performance of HTER 2.8%, EER 2.5%, and TPR 91.2%. Compared with our preliminary version, the EER is improved by 47.9%, the HTER is improved by 31.7%, and the TPR is improved by 29.5%.

4.4.6 Visualization

Spoof trace components  In Fig. 4.8, we provide an illustration of each spoof trace component. Strong color distortion (low-frequency trace) shows up in the print attacks. Moiré patterns in the replay attack are well detected in the high-frequency trace. The local specular highlights in the transparent mask are well presented in the low- and mid-frequency components, and the inpainting process further refines the most highlighted area. For the two glasses attacks, the color discrepancy is corrected in the low-frequency trace, and the sharp edges are corrected in the mid- and high-frequency traces. Each component shows a consistent semantic meaning on different spoof samples, and this successful trace disentanglement can lead to better visual results. As shown on the right side of Fig. 4.8, we compare with our preliminary version Liu et al. (2020) and the ablated GAN design with a single trace representation. The result of the single trace representation shows strong artifacts on most of the live reconstructions. The multi-scale design from our preliminary version has already shown a large visual quality improvement, but still has some spoof traces (e.g., glass edges) remaining in the live reconstruction. In contrast, our approach can further handle the missing traces and achieve better visualization.

Figure 4.11 The t-SNE visualization of features from different scales and layers. The first 3 visualizations are from the encoder features F_1, F_2, F_3, and the last 2 visualizations are from the features that produce {B, C, T} and {P, I_P}.
Live reconstruction  In Fig. 4.9, we show more examples from different spoof types in the SiW and SiW-M databases. The overall trace is the exact difference between the input face and its live reconstruction. For the live faces, the trace is zero, and for the spoof faces, our method removes spoof traces without unnecessary changes, such as identity shift, and makes them look like live faces. For example, strong color distortion shows up in print/replay attacks (Fig. 4.9 a-h) and some 3D mask attacks (Fig. 4.9 l-o). For makeup attacks (Fig. 4.9 q-s), the fake eyebrows, lipstick, wax, and cheek shade are clearly detected. The folds and edges (Fig. 4.9 t-w) are well detected and removed in paper-crafted masks, paper glasses, and partial paper attacks.

Figure 4.12 The illustration of removing the disentangled spoof trace components one by one. The estimated spoof trace elements of the input spoof (the first column) are progressively removed in the order of B, C, T, and the inpainting trace. The last column shows the reconstructed live image after removing all three additive trace components and the inpainting trace. (a) Replay attack; (b) Makeup attack; (c) Mask attack; (d) Paper glasses attack.

Spoof synthesis  Additionally, we show examples of new spoof synthesis using the disentangled spoof traces, which is an important contribution of this work. As shown in Fig. 4.10, the spoof traces can be precisely transferred to a new face without changing the identity of the target face. Due to the additional inpainting process, spoof attacks such as transparent masks and partial attacks can be better attached to the new live face. Thanks to the proposed 3D warping layer, the geometric discrepancy between the source spoof trace and the target face can be corrected during the synthesis.
Especially on the second source spoof, the right part of the traces is successfully transferred to the new live face while the left side remains live. This demonstrates that our trace regularization can suppress unnecessary artifacts generated by the network. Both the live reconstruction results in Fig. 4.9 and the spoof synthesis results in Fig. 4.10 demonstrate that our approach disentangles visually convincing spoof traces that help face anti-spoofing.

Figure 4.13 The illustration of double spoof trace disentangling. The left 4 samples are live faces, and the right 4 samples are spoof faces. (a) Original input. (b) 1st round live reconstruction. (c) 1st round spoof traces. (d) 2nd round live reconstruction. (e) 2nd round spoof traces.

Spoof trace removing process  As shown in Fig. 4.12, we illustrate the effects of the trace components by progressively removing them one by one. For the replay attack, the spoof sample comes with strong over-exposure as well as a clear Moiré pattern. Removing the low-frequency trace can effectively correct the over-exposure and color distortion caused by the digital screen, and removing the texture pattern in the high-frequency trace can peel off the high-frequency grid effect and reconstruct the live counterpart.

For the makeup attack, since there is no strong color range bias, removing the estimated low-frequency trace would mainly remove the lipstick color and fake eyebrow, but in the meantime bring a few artifacts at the edges. Next, removing the content pattern adequately lightens the shadow on the cheek and the fake eyebrows. Finally, removing the texture pattern corrects the spoof traces from wax, eyeliner, and shadow on the cheek.

Similarly, in mask and partial attacks, the reconstruction is gradually refined as we remove the components one by one.
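The progressive removal above can be sketched as a simple composition over the trace components. The exact composition operators used by PhySTD are not spelled out here, so the additive-subtraction-plus-inpainting form below is an assumption for illustration, and `reconstruct_live` is a hypothetical helper:

```python
import numpy as np

def reconstruct_live(I, B, C, T, P, I_P):
    """Progressively remove the disentangled trace components from a
    spoof face I (all arrays H x W x 3 in [0, 1]; P is an H x W x 1
    inpainting mask). The additive traces B (low-frequency),
    C (content) and T (texture/high-frequency) are subtracted, then
    the inpainting region P is replaced by the inpainted content I_P.
    Returns the intermediate results of each stage."""
    stages = [I]
    x = I - B; stages.append(x)         # remove low-frequency trace
    x = x - C; stages.append(x)         # remove content trace
    x = x - T; stages.append(x)         # remove texture trace
    x = (1 - P) * x + P * I_P           # fill the inpainting region
    stages.append(np.clip(x, 0, 1))
    return stages
```

For a live face, all traces and the inpainting mask are (near) zero, so every stage returns the input unchanged, which matches the zero-trace behavior described for live faces above.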
To validate the quality of spoof trace removal, we execute a double spoof trace disentanglement, shown in Fig. 4.13. For each sample, we execute a 2-round disentanglement, where the second-round input is the live reconstruction from the first round. As we can see from the figure, regardless of the original spoofness, the networks recognize all live reconstructions as live, which shows high confidence in the first-round spoof trace removal. The 1st-round average spoof score for live faces is 0.11, and the second round is 0.03. The 1st-round average spoof score for spoof faces is 0.89, and the second round is 0.03. However, we can still see different degrees of spoof traces left in the reconstructed live faces. How to effectively measure the quality of spoof trace removal and use it for a second-round supervision can be a future research direction.

To validate the relation between spoof trace intensity and the spoofness score, we execute a reverse double spoof trace disentanglement, where the input of the second disentanglement round is the original input with only 50% of the estimated spoof traces removed. For this experiment, the second-round score for live faces changed from 0.03 to 0.07, and the second-round score for spoof faces changed from 0.03 to 0.53. Based on the results, we can tell that the spoof trace intensity is positively correlated with the spoofness score.

t-SNE visualization  We use t-SNE Maaten & Hinton (2008) to visualize the encoder features F_1, F_2, F_3, and the features that produce {B, C, T} and {P, I_P}. t-SNE is able to project the output features from different scales and layers to 2D by preserving the KL divergence distance. As shown in Fig. 4.11, among the three feature scales in the encoder, F_3 is the most separable feature space, the next is F_1, and the worst is F_2. The features for the additive traces {B, C, T} are well-clustered into semantic sub-groups of live, makeup, mask, and partial attacks. As we know the inpainting masks for live samples are close to zero, the feature for the inpainting traces {P, I_P} shows that the inpainting process mostly updates the partial attacks, and then some makeup attacks and mask attacks, i.e., the green dots being further away from the black dots means they have a greater magnitude. This validates our prior knowledge of the inpainting process.

4.5 Conclusions

This work proposes a physics-guided spoof trace disentanglement network (PhySTD) to tackle the challenging problem of disentangling spoof traces from input faces. With the spoof traces, we reconstruct the live faces as well as synthesize new spoofs. To correct the geometric discrepancy in synthesis, we propose a 3D warping layer to deform the traces. The disentanglement not only improves the SOTA of face anti-spoofing in the known, unknown, and open-set spoof settings, but also provides visual evidence to support the model's decision. Note that, even though the proposed spoof trace modeling is based on a physical approximation of the spoof presentation attack, the whole learning process still relies intensely on the data. In addition, our visualization/interpretation of spoof attacks is on the image appearance, rather than at the feature level, such as Grad-CAM.

Chapter 5
Visualization: Blind Removal of Facial Foreign Shadow

5.1 Introduction

In our daily activities, many external objects around us can cast shadows on faces, termed facial foreign shadow. For instance, while we take a selfie outdoors, our hand and camera might block part of the sunlight and create a shadow on the face. Dynamic and scattered shadows may be produced by leaves while walking under trees. During driving, the driver may confront the high-contrast lighting caused by the direct sunlight and car pillars. While people may experience these situations every day, the shadow cast sometimes can be unwanted. In some cases, the shadows cast should be removed for aesthetic purposes, such as photoshop and face editing. In others, the shadows cast could negatively impact face-related tasks, such as face recognition, expression analysis, age estimation, and driver monitoring.

While facial foreign shadow removal is a relatively new topic, there are a few related studies.
Many works aim to handle the self shadow and relight the face under a different lighting, via quotient images Wen et al. (2003); Zhou et al. (2019), inverse rendering Nagano et al. (2019); Nestmeyer et al. (2020), and style transfer Gu et al. (2019); Lee et al. (2020). Those methods focus more on the global lighting distribution and might be limited in handling the arbitrary high-frequency structure caused by harsh foreign shadows. There are also face completion works under structured occlusions, such as square, circle, and lattice Li et al. (2020); Yang et al. (2019b); Zhang et al. (2017b). Compared with foreign shadow removal, face completion is relatively easier, as the shape is less complicated and the occlusion often comes with a single color such as white. Further, some works study shadows on generic objects Le & Samaras (2019); Shor & Lischinski (2008). While they excel at shadow detection, when applied to faces, observable artifacts can be detected in the de-shadowed results due to the lack of face priors.

Figure 5.1 The results of our shadow removal model on images from our Shadow Face in the Wild (SFW) database (top) and the UCB database Zhang et al. (2020b) (bottom). From left to right: input face, output face, and shadow matte.

The major problem of these prior methods is that they cannot handle the high-frequency structure caused by harsh shadows, as demonstrated in Sun et al. (2019); Zhang et al. (2020b). Instead of predicting illumination, Xuaner et al. propose a single image-based approach using only perceptual and pixel intensity losses, and train the network on a synthetic shadow dataset Zhang et al. (2020b). It turns out the pixel intensity loss works better to remove harsh shadows and recover details. However, a model based only on perceptual and pixel intensity losses does not generalize well in practice, as it is hard to build a training dataset covering real-world complex lighting conditions.
As stated above, this chapter aims to detect and remove the foreign shadow from in-the-wild faces. While we focus on the foreign shadow, we would also like to address the strong self-shadow caused by self occlusion (see Fig. 5.2). To tackle this problem, we face three major challenges. First, the shadow on in-the-wild faces is arbitrary, varying from different sizes and shapes, to different locations, colors, blurriness, and intensities. Prior works Le & Samaras (2019); Shor & Lischinski (2008); Zhang et al. (2020b) model the shadow directly in the RGB space. Given the level of diversity, they have a hard time addressing all the discrepancies, leaving some observable artifacts in de-shadowed faces. Second, there are few public databases for training and evaluation. To capture paired shadow and non-shadow faces, both the subject and the photographer need to be perfectly still, which is rarely feasible. Third, sometimes the shadow removal is extended from single images to video. On one hand, multiple frames (e.g., a live photo) may provide additional cues to single-image shadow removal. On the other hand, video shadow removal requires additional temporal consistency.

To address the aforementioned challenges, we propose a novel blind removal model of facial foreign shadow. To handle the shadow diversity, we propose a simple yet effective approach to decompose the direct RGB shadow removal into grayscale shadow removal and colorization. We show that, without color, the shadow modeling becomes a much simpler task and the grayscale removal model is easy to generalize to unseen data. After that, with the knowledge of the shadow region from the grayscale shadow removal, the colorization is turned into an image inpainting process. Without seeing the biased color information from the shadow region, the colorization process also becomes more generalizable. Moreover, to ensure the temporal consistency, we propose a temporal sharing module (TSM) to aggregate the information among multiple frames. TSM includes an efficient warping layer, in order to handle frames with pose and expression variations.
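The grayscale-then-colorize decomposition can be sketched as a two-stage pipeline. The two networks are passed in as callables because their internals are described later in the chapter; their signatures and the helper name `deshadow_rgb` are assumptions for illustration:

```python
import numpy as np

def deshadow_rgb(I, gray_removal, colorize):
    """Two-stage decomposition of RGB shadow removal:
    stage 1 removes the shadow in grayscale and predicts a shadow
    matte; stage 2 re-colorizes the de-shadowed luminance, treating
    the (previously shadowed) matte region as an inpainting target.
    `I` is H x W x 3; `gray_removal` and `colorize` stand in for the
    two learned networks."""
    gray = I.mean(axis=-1, keepdims=True)   # simple luminance proxy
    gray_free, matte = gray_removal(gray)   # stage 1: grayscale removal
    rgb_free = colorize(gray_free, matte, I)  # stage 2: colorization
    return rgb_free, matte
```

The point of the split is that stage 1 never has to model color distortion, and stage 2 never sees the biased colors inside the shadow region, which is what makes each stage easier to generalize.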
For training the model, we follow the process proposed by Zhang et al. (2020b) to build a synthetic database that contains paired shadow and shadow-free faces. Foreign shadows are generated with randomized properties and moving trajectories. Further, we collect a face database with 280 videos captured under highly dynamic environments for evaluation purposes. The external objects casting shadows include hands, books, leaves, trees, window blinds, car pillars, and buildings. To quantitatively evaluate shadow segmentation, we provide detailed pixel-level segmentation annotation for this database.

Figure 5.2 Examples of (a) foreign shadow, (b) strong self shadow, and (c) normal self shadow. Our model is designed to remove unwanted shadows in (a-b) while keeping the normal shadow in (c).

In summary, the main contributions of this chapter include:
- A novel approach to decompose RGB shadow removal into grayscale shadow removal and colorization;
- A temporal sharing module to ensure video consistency;
- A face shadow database under dynamic environments;
- SOTA results and photo-realistic de-shadow quality.

5.2 Related Work

Face relighting  Face relighting methods could be roughly divided into three categories: quotient image-based, style transfer, and inverse rendering. The color ratio (i.e., quotient image) is proposed in Shashua & Riklin-Raviv (2001) to transfer a frontal face from one lighting to another. This basic idea has been extended to handle different poses, use ratios of radiance environment maps, and generate synthetic relighting datasets in Stoschek (2000); Wen et al. (2003); Zhou et al.
(2019). Facial lighting can also be changed by style transfer Gu et al. (2019); Lee et al. (2020); Liao et al. (2017); Shih et al. (2014); Shu et al. (2017a); Sun et al. (2019). Similar to quotient images, style transfer methods need at least a reference image as the target style. Moreover, the face poses of the input and reference images are often very close. In the category of inverse rendering, a face image could be decomposed into multiple components such as geometry, reflectance, and lighting Egger et al. (2018); Nagano et al. (2019); Nestmeyer et al. (2020); Sengupta et al. (2018); Shu et al. (2017b); Tewari et al. (2017); Tran & Liu (2018); Wang et al. (2008). For example, in Nestmeyer et al. (2020), the intrinsic components (e.g., normal and albedo) are predicted and combined through diffuse rendering in the network, and a second network is then applied to learn the non-diffuse residual. In general, inverse rendering methods rely on multiple sub-networks for the decomposition, which are not sufficiently effective to handle facial foreign shadows in videos.

Face completion  Face completion aims to fill in the missing or occluded face regions with semantically meaningful information. In Zhang et al. (2017b), Zhang et al. proposed a DemeshNet with two sub-networks to remove mesh-like lines or watermarks on faces. Li et al. proposed a disentangling and fusing network that contains discriminators in three domains, i.e., occluded faces, clean faces, and structured occlusions Li et al. (2020). The face inpainting network in Yang et al. (2019b) comprises a landmark-predicting subnet and an inpainting subnet. Different from shadow removal, the structured occlusions either are opaque or contain repeated patterns. The networks are mainly used to hallucinate the invisible face regions.
Generic Shadow Detection and Removal Without much training data, early works in general-purpose shadow detection and removal mainly study shadow properties, especially around shadow edges Chuang et al. (2003); Finlayson et al. (2002); Wu et al. (2012); Wu & Tang (2005). For example, Wu et al. (2012) applied graph-cut inference to detect shadow regions, and then used shadow matting to generate soft shadow boundaries. Deep learning based methods have been proposed recently to detect and remove shadows Ding et al. (2019); Le & Samaras (2019); Qu et al. (2017); Wang et al. (2018a). Hu et al. (2018) designed the direction-aware spatial context module and applied a spatial RNN to detect shadows. Cun et al. (2020) learned to hierarchically aggregate the dilated multi-contexts and attentions. Authors in Zhang et al. (2020b) demonstrated that the general-purpose methods such as Cun et al. (2020); Hu et al. (2018) cannot preserve the authenticity of the input faces. One reason is that these general-purpose networks are unable to capture the face characteristics. For instance, human face skin is a highly scattering material that also has a complex absorption spectrum Donner & Jensen (2006). In this work, we propose a novel two-stage shadow modeling that can better handle both subsurface scattering effects and color distortion.

5.3 Proposed Method

5.3.1 Shadow synthesis and modeling

Shadow is produced by a foreign object that blocks part of the light rays from arriving at the face. Let a matte M represent the shadow shape; the shadow formation can be modeled as a blending

Figure 5.3 Illustration of data synthesis components.

between the well-illuminated face I_b and the under-illuminated face I_d:

I = I_b ⊙ (1 − M) + I_d ⊙ M, (5.1)

where ⊙ denotes element-wise multiplication. As the real-world shadow varies in both shape and intensity, it is vital to have {I, I_b} paired data with a large variety of M to train a generalized shadow removal model. However, it is hardly feasible to collect a large-scale dataset with such pairing, as the subject needs to be perfectly static while capturing the pair. Therefore, creating a synthetic dataset becomes our go-to approach to tackle this problem.
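The blending in Eqn. 5.1 can be sketched in a few lines of NumPy; `face_bright`, `face_dark`, and `matte` are hypothetical stand-ins for I_b, I_d, and M, not names from our implementation:

```python
import numpy as np

def blend_shadow(face_bright, face_dark, matte):
    """Eqn. 5.1: I = I_b * (1 - M) + I_d * M, all operations element-wise.

    face_bright, face_dark: HxWx3 float images in [0, 1].
    matte: HxWx1 shadow matte in [0, 1], where 1 means fully shadowed.
    """
    return face_bright * (1.0 - matte) + face_dark * matte
```

Where the matte is 0 the well-illuminated pixel is kept; where it is 1 the under-illuminated pixel is used; fractional values give soft shadow boundaries.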
Shadow synthesis As indicated in Zhang et al. (2020b), using Eqn. 5.1 to synthesize natural face shadows shall include additional variations: shape, intensity, subsurface scattering, and color. Let shape B be a binary mask that indicates the whole region affected by the foreign shadow. The shadow is often unevenly distributed, such as in mottled patterns or gradually changing patterns, depending on the relative distance between the object and the face, and the environmental lighting. We use a grayscale matte M_I to represent the uneven intensity. In addition, the light outside the shadow region would penetrate beneath the skin, reach the vessels, and scatter back, creating a red band around the shadow boundary. We represent such subsurface scattering effect by M_ss, which is computed by blurring B with a different kernel per RGB channel. Therefore, Eqn. 5.1 can be updated to:

I = I_b ⊙ (1 − B ⊙ M_ss) + I_d ⊙ B ⊙ M_ss ⊙ M_I. (5.2)

Moreover, the shadow region may be under certain color distortion, due to the blocking of part of the light. We formulate such color distortion by a 3×3 color transfer matrix C:

I_d = I_b C. (5.3)

The B, M_I, M_ss, and C are illustrated in Fig. 5.3. During the synthesis, given a well-illuminated face I_b, we generate random parameters for each component to synthesize different shadow faces I, which is detailed in Sec. 5.4.

Shadow modeling With synthetic pairwise data, we can train a model G(·) to detect and remove foreign shadows (I → I_b). Despite the complexity of the shadow synthesis process, prior works Le & Samaras (2019); Shor & Lischinski (2008); Zhang et al. (2020b) opt to simplify the relation between I and I_b in G(·) as:

W, N ← G(I | ω), (5.4)
Î_b = I ⊙ W + N, (5.5)

where ω are the parameters of the shadow removal model, and both the scaling W and offset N are of the same size as I. The motivation of this simplification is two-fold: 1) precisely estimating all shadow components (i.e., B, M_I, M_ss, C) can be very challenging, and 2) even with full supervision of all the components, reversing the shadow formation may raise a convergence issue. This is due to the ambiguity in the shadow parameterization, where one shadow can be generated from different combinations of shadow components.
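Under stated assumptions (a toy box blur standing in for the per-channel Gaussian kernels, and illustrative variable names of our own), the synthesis of Eqns. 5.2-5.3 might be sketched as:

```python
import numpy as np

def box_blur(mask, k):
    """Separable box blur with odd kernel size k; a simple stand-in for the
    per-channel blurs that produce the subsurface-scattering mask M_ss."""
    pad = k // 2
    padded = np.pad(mask, pad, mode='edge')
    kern = np.ones(k) / k
    out = np.apply_along_axis(lambda v: np.convolve(v, kern, 'valid'), 0, padded)
    out = np.apply_along_axis(lambda v: np.convolve(v, kern, 'valid'), 1, out)
    return out

def synthesize_shadow(face_bright, shape_mask, intensity, color_matrix,
                      kernels=(3, 5, 7)):
    """Eqns. 5.2-5.3: I = I_b * (1 - B * M_ss) + I_d * B * M_ss * M_I,
    with I_d = I_b C.

    face_bright: HxWx3 image; shape_mask: HxW binary mask B;
    intensity: HxW matte M_I; color_matrix: 3x3 color transfer matrix C.
    """
    # M_ss: blur B with a different kernel per RGB channel.
    m_ss = np.stack([box_blur(shape_mask.astype(float), k) for k in kernels],
                    axis=-1)
    core = shape_mask[..., None] * m_ss            # B * M_ss
    face_dark = face_bright @ color_matrix         # Eqn. 5.3: I_d = I_b C
    return face_bright * (1.0 - core) + face_dark * core * intensity[..., None]
```

During training-data generation, the scales, rotations, and blur parameters fed to such a routine are randomized per sample, as described in Sec. 5.4.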
However, the prior works based on Eqn. 5.5 have a hard time addressing all the discrepancies between the shadow and non-shadow regions, leaving some observable artifacts in the de-shadowed faces. We observe that it is not straightforward to derive Eqn. 5.5 from Eqn. 5.1. Due to the existence of the color transfer matrix C, W and N themselves become a function of I_b, instead of being independent of I_b. Thus, the model learning becomes a chicken-and-egg problem, which may easily turn into a memorization mode (i.e., a type of learning that generalizes poorly Corneanu et al. (2019)). To tackle this issue, we propose to decompose the color shadow removal into grayscale shadow removal and colorization. While dealing with shadow removal in grayscale, C in Eqn. 5.3 simply becomes a scalar, and hence both C and M_ss can be integrated into M_I as M'_I. We can then transform the relation of Eqn. 5.1 into:

Î_b,gs = I_gs ⊘ (1 − B + B ⊙ M'_I) = I_gs ⊘ W, (5.6)

where Î_b,gs and I_gs are the grayscale versions of I_b and I, ⊘ is element-wise division, and W = 1 − B + B ⊙ M'_I. It is clear that Eqn. 5.6 is in a closed form and well aligned with Eqn. 5.5. As W and N are detached from I_b, they are easier to learn. Next, we simply need to colorize the grayscale face to get the face recovery in RGB. With the knowledge provided by the grayscale shadow removal, we turn the blind color recovery into a mask-guided image inpainting.

Figure 5.4 Illustration of our network architecture. The model mainly consists of an encoder, a shadow matte decoder, a color matrix decoder, and a shadow residual decoder. The Temporal Sharing Module (TSM) can be easily plugged into the face encoder. Together with the temporal consistency loss L_T, we can leverage the unlabeled image frames effectively. The green dashed lines indicate the short-cut connections and the orange dashed lines and boxes indicate the loss functions.

The overall pipeline is shown in Fig. 5.4. Our approach consists of three major steps: 1)
grayscale shadow removal (Sec. 5.3.2), 2) colorization (Sec. 5.3.3), and 3) temporal information sharing (Sec. 5.3.4). Steps 1 and 2 are the key ingredients for single-frame shadow removal, and step 3 is the key ingredient for smooth video shadow removal. In Sec. 5.3.5, we discuss the losses and training strategies in detail.

5.3.2 Grayscale shadow removal

The grayscale shadow removal module takes an RGB face I ∈ R^(N²×3) as input, and outputs the scaling map W ∈ R^(N²×1) and offset map N ∈ R^(N²×1) that can recover a well-illuminated grayscale face Î_b,gs ∈ R^(N²×1) based on Eqn. 5.5. The module consists of an encoder, a stack of residual non-local blocks, and a decoder. The encoder extracts features F from input images for shadow removal. It contains 4 convolution layers and 3 downsampling steps. To encourage spatial consistency of facial lighting and albedo, we leverage the latest designs of the non-local block and visual transformer Carion et al. (2020); Dosovitskiy et al. (2020); Wang et al. (2018c). We stack 3 residual non-local blocks to process the encoder features with position encoding. After that, the decoder upsamples the features from the non-local blocks via 3 transposed convolution layers, and estimates W and N. We adopt a short-cut connection at each feature scale to bypass high-frequency information.

For the position encoding, we adopt the projected normalized coordinate code (PNCC) Zhu et al. (2017) and concatenate it to the encoder feature. PNCC is the normalized mean shape of a 3DMM Blanz & Vetter (2003), and is projected to a given face. It encodes the face semantics, as each vertex (e.g., eye corner) has its unique 3D coordinate between [0, 0, 0] and [1, 1, 1], regardless of the pose, expression, and identity. Compared with the conventional position encoding in Carion et al. (2020); Dosovitskiy et al. (2020), PNCC provides better face semantics that help to detect and remove shadows.
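When the shadow components are known, the closed form of Eqn. 5.6 can be verified numerically; in the network, W and N are of course predicted rather than computed from B and M'_I, and the variable names below are illustrative only:

```python
import numpy as np

def apply_grayscale_shadow(face_gs, b_mask, m_intensity):
    """Forward model in grayscale: I_gs = I_b,gs * W, where
    W = 1 - B + B * M'_I (all element-wise)."""
    w = 1.0 - b_mask + b_mask * m_intensity
    return face_gs * w, w

def remove_grayscale_shadow(shadow_gs, w, eps=1e-6):
    """Eqn. 5.6: recover the shadow-free grayscale face by element-wise
    division; eps guards against a degenerate (near-zero) scaling map."""
    return shadow_gs / np.maximum(w, eps)
```

Applying the two functions back to back recovers the input face exactly, which is the closed-form property that makes W easy to learn.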
5.3.3 Colorization

An important takeaway from the grayscale shadow removal module is that we can locate the shadow region as

B̂ = |Î_b,gs − I_gs| > τ, (5.7)

where B̂ is the shadow segmentation mask binarized with the threshold τ. With this knowledge, we can turn the blind color recovery process into an image inpainting process with a given inpainting region. In comparison, if no knowledge is provided to the colorization process, this two-step approach is nearly identical to the direct RGB shadow removal applied in previous work Le & Samaras (2019); Shor & Lischinski (2008); Zhang et al. (2020b), which may still suffer from the poor generalization issue.

Our colorization module breaks down into 3 steps: 1) erasing, 2) inpainting, 3) color space transformation. Structurally, the colorization module is similar to the grayscale shadow removal module. It consists of 3 residual non-local blocks and a decoder. First, based on the shadow mask B̂, we set the shadow region of F to 0 to circumvent any potential disturbance, and term it the inpainting feature. Secondly, the inpainting feature F ⊙ (1 − B̂) is concatenated with B̂ and the PNCC coding, and fed to the module. The non-local blocks aim to fill in the missing region in F, and the decoder is designed to produce an M-channel color space C ∈ R^(N²×M). In the end, we use three 1×1 convolution layers to transfer the grayscale face Î_b,gs with the color space C back to the RGB face Î_b. During the training, no gradients of the colorization module will be sent back to the grayscale shadow removal module via B̂.

5.3.4 Temporal information sharing

We can extend our network for single-frame processing to leverage the temporal information via a Temporal Sharing Module (TSM). Similar to other video-based image restoration problems, such as video deblurring, shadow motion can be arbitrary in shape and speed variations. Thus, the order of the frames might not carry useful cues for de-shadowing. As a result, we propose to adopt a temporal-wise max pooling to aggregate the illumination information among different frames, shown in Fig. 5.5.

Assuming F_1, F_2, ..., F_k are the features to be shared among k frames. Before computing the temporal-wise max pooling, we apply a warping layer to register features based on the face shape.
After the temporal-wise max pooling, we apply a reverse warping to re-align the shared feature back to each frame feature, and concatenate it with the original feature F_i for the next-stage computation. The TSM is a plug-in design for features at all scales. TSM can be used to not only share the temporal information, but also enforce the prior knowledge of face symmetry, which has been used in other tasks Wu et al. (2020). To achieve this, we treat the mirrored face as a different frame, and send it to TSM for information sharing. In case there is only a single frame available, the TSM simply concatenates with the original feature.

Figure 5.5 Illustration of the Temporal Sharing Module (TSM). It can be applied to temporal frames as well as mirrored input.

The warping layer leverages the pre-computed 68 facial landmarks via Bulat & Tzimiropoulos (2017). Given the landmarks for the neutral face s_0 and the face s_i at frame i, a sparse offset can be computed as s_{i→0} = s_0 − s_i ∈ R^(68×2) to indicate where each pixel at the landmark positions should be moved to. To obtain a dense offset map S_{i→0} ∈ R^(N²×2) indicating where each pixel in the entire feature map should be moved to, we apply a triangulation interpolation,

S_{i→0} ← Tri(s_i, s_{i→0}, N), (5.8)

where Tri(·) is Delaunay triangulation-based interpolation. The registration operation of feature F is denoted as:

F_{i→0} = F_i(S_0 + S_{i→0}), (5.9)

where S_0 = {(0, 0), (0, 1), ..., (N, N)} ∈ R^(N²×2) enumerates pixel locations in F_i. Similarly, when we get the shared feature F_max, we can use S_{0→i} to warp it back.

5.3.5 Training

We use synthetic shadow faces to train our model. We apply multiple losses to supervise all three steps in the model.
Shadow removal loss: With the paired shadow face and well-illuminated face, we enable a pixel-to-pixel supervision on the recovery in grayscale. Specifically, we introduce a weighting map to encourage the loss to focus more on the shadow and the shadow boundary, as

L_gs = E_{i∼P}[ ‖ (Î^i_b,gs − I^i_b,gs) ⊙ (1 + B_i + B^edge_i) ‖₁ / R ], (5.10)

where P indicates the synthesized data distribution, B^edge is the boundary of B, 1 + B + B^edge is the weighting map, and R is the normalization term of the weighting map. A similar loss is also applied to the RGB recovery as L_clr.

Image gradient loss: Human vision is very sensitive to high frequency artifacts, such as edges. To further suppress artifacts around shadow boundaries and recover high-frequency details beneath shadows, we adopt an image gradient loss to encourage the image gradients of Î_b and I_b to be similar. This loss is denoted as:

L_∇ = E_{i∼P}[ Σ_k ‖ ∇⌊Î^i_b⌋_k − ∇⌊I^i_b⌋_k ‖₁ ], (5.11)

where ∇ is the gradient operator and ⌊·⌋_k denotes downsampling by the ratio k = {1, 2, 4, 8}. Multiscale gradients help remove both sharp and blurry shadow boundaries.

Perceptual loss L_P: To ensure the visual quality, we adopt the perceptual loss between the recovered face Î_b and I_b.

GAN loss: Motivated by Wang et al. (2018b), we adopt a multiscale PatchGAN Isola et al. (2017) at the scales 1, 1/2, 1/4 of the original image's resolution. Each discriminator consists of 5 convolutional layers and 4 pooling layers, and outputs a 1-channel map in the range of [0, 1], where 0 denotes synthetic and 1 real. We use the hinge loss in the GAN training:

L_D = −E_{i∼P}[ Σ_{n=1,2,3} min(0, −1 + D_n(I^i_b)) ] − E_{i∼P}[ Σ_{n=1,2,3} min(0, −1 − D_n(Î^i_b)) ],
L_G = −E_{i∼P}[ Σ_{n=1,2,3} D_n(Î^i_b) ], (5.12)

where D_1, D_2, and D_3 are discriminators at 3 scales. L_D is the loss to train the discriminators and L_G is the loss to guide the shadow removal model to recover more realistic shadow-free faces.
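A minimal NumPy sketch of the weighted L1 of Eqn. 5.10 and the hinge losses of Eqn. 5.12; taking R as the mean of the weighting map is our assumption, and the single-scale functions below omit the sum over the three discriminators:

```python
import numpy as np

def weighted_l1(pred, target, b_mask, b_edge):
    """Eqn. 5.10 sketch: L1 re-weighted toward the shadow region and its
    boundary. R is assumed here to be the mean of the weighting map."""
    weight = 1.0 + b_mask + b_edge
    r = weight.mean()
    return np.mean(np.abs(pred - target) * weight / r)

def hinge_d_loss(d_real, d_fake):
    """Eqn. 5.12 (one scale): hinge loss for training the discriminator."""
    return (-np.mean(np.minimum(0.0, -1.0 + d_real))
            - np.mean(np.minimum(0.0, -1.0 - d_fake)))

def hinge_g_loss(d_fake):
    """Eqn. 5.12 (one scale): the generator pushes D's score on fakes up."""
    return -np.mean(d_fake)
```

With a perfect discriminator (real scores at +1, fake scores at -1) the hinge discriminator loss is zero, which is the saturation behavior the hinge formulation is chosen for.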
Temporal consistency loss: We adopt a temporal consistency loss to encourage the recovered faces from different frames to be similar. This consistency loss is denoted as:

L_T = E_{i∼P}[ ‖ Î^{i,t1}_b − Î^{i,t2}_b ‖₁ ], (5.13)

where t1, t2 can be either two nearby frames of the same video, or a frame with its mirrored image.

Overall Loss The generator is supervised by an overall loss as:

L = λ1 L_gs + λ2 L_clr + λ3 L_∇ + λ4 L_P + λ5 L_G + λ6 L_T. (5.14)

The discriminators are supervised with the adversarial loss L_D to compete with the generator. We execute the generator step and the discriminator step in each mini-batch iteration.

5.4 Training and Evaluation Data

Training data To synthesize our training data based on Eqn. 5.1-5.3, we manually select 15,000 face images from the FFHQ dataset Karras et al. (2019) that do not contain any foreign shadows or strong self shadows. The raw binary shadow shape B comes from: 1) 100 silhouette shapes; 2) the Perlin noise function. After that, the raw shapes are randomly augmented with different scales, rotations, and boundary blurriness. The intensity map M_I is also generated by a random Perlin noise function at two octaves.

To simulate common shadow motion in face videos, we propose two approaches to synthesize the shadow motion: translation and flickering. In translation mode, the shadow moves on the face region with randomly selected speed, direction, and rotation. The shape of the shadow is fixed for each video but the scale, rotation, and boundary blurriness can be continuously shifting from frame to frame. In flickering mode, the location from frame to frame is randomly picked and the changes of scale, rotation, and boundary blurriness are not continuous.
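A toy version of the translation mode can be sketched as below; the real pipeline also continuously varies scale, rotation, and boundary blur, which this sketch omits, and `np.roll` wraps at the border purely for simplicity:

```python
import numpy as np

def translate_shadow_frames(shape_mask, n_frames, velocity=(2, 1)):
    """Translation-mode sketch: the shadow shape is fixed for a video and
    moves with a constant per-frame pixel velocity (dy, dx)."""
    return [np.roll(shape_mask, shift=(velocity[0] * t, velocity[1] * t),
                    axis=(0, 1))
            for t in range(n_frames)]
```

Flickering mode would instead draw an independent random location (and independent scale, rotation, and blur) for every frame.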
Evaluation data To our knowledge, there is no large video database of real-world human faces with foreign shadows. One existing database, UCB Zhang et al. (2020b), includes a very limited number of 100 face images. More importantly, this database contains only single images, so that consistent image reconstruction on videos cannot be evaluated. In response to the need for a large video database, we collect a database termed Shadow Face in the Wild (SFW) for the evaluation of real-world facial shadow removal. In total, SFW includes 280 videos from 20 subjects. Some examples are shown in Fig. 5.6. Most videos are captured at 1080p resolution by various smartphone cameras.

Figure 5.6 An illustration of the SFW database. The first row shows the shadow faces collected under highly dynamic environments (e.g., varying shadows and head poses due to walking and driving); the second row shows the pixel-level annotations of shadow segmentation. Zoom in for viewing the quality of our annotation.

For each subject, the videos are collected in five sessions: indoor, outdoor standing, outdoor walking, outdoor extreme, and driving. The indoor session collects face videos in an indoor environment, where the lighting is relatively soft with no strong specular lights. For outdoor collection, the standing session requires the subject to hold a standing position with no ambient light variations, and the walking and extreme sessions require the subject to be moving, creating a changing ambient light. For the first three sessions, subjects use common objects to create shadows, such as a hand, phone, paper, pen, etc. For the outdoor extreme session, we strive to create more complex shadow patterns and require the subjects to walk under trees to create leaf-shaped shadows. In the last session, the subjects record videos in a moving car, where the shadows may come from the blocking of the sun visor, rear-view mirror, A-pillar, and surrounding buildings.

For the evaluation purpose, we annotate the pixel-level shadow segmentation maps of key frames selected from the video set, and more annotations will be added in the future.
5.5 Experiment

5.5.1 Experimental setup

Metrics To evaluate on the UCB dataset, we can directly compare the de-shadowed face images from our model with the ground truth face images. We evaluate the performance based on the following metrics: peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM).

Figure 5.7 A qualitative comparison of shadow removal on testing images of the UCB database. From top to bottom, we show the shadow face and shadow removal results provided by Zhang et al. (2020b), the network with naive RGB shadow modeling, our single-frame network with grayscale shadow removal and colorization (GS+C), and our network with additional TSM and temporal loss.

To evaluate on the SFW dataset, we evaluate how well the shadow is detected, since the ground truth shadow segmentation is provided. The predicted shadow masks are from Eqn. 5.7. We compute the area under the curve (AUC) of the receiver operating characteristic (ROC) curve and the accuracy based on the predicted shadow masks and the ground truth masks. The accuracy is computed as (TP + TN) / (N_p + N_n), where TP, TN, N_p, and N_n are true positives, true negatives, the number of shadow pixels, and the number of non-shadow pixels, respectively. We binarize the shadow matte M into a shadow mask with a threshold of 0.1.

Implementation details Our shadow removal network is implemented in TensorFlow with an initial learning rate of 1e-4. We train the network for 50,000 iterations in total with a batch size of 32, and decrease the learning rate by a ratio of 10 every 25,000 iterations. We initialize the weights with the normal distribution of [0, 0.02]. {λ1, λ2, λ3, λ4, λ5, λ6, τ} are set to be {100, 100, 1, 1, 1, 1, 0.1}. We use Bulat & Tzimiropoulos (2017) to crop the face and provide the 68 facial landmarks.

Removal Model          | PSNR   | SSIM
Input Image            | 19.671 | 0.766
Guo et al. (2012)      | 15.939 | 0.593
Hu et al. (2018)       | 18.956 | 0.699
Cun et al. (2020)      | 19.386 | 0.722
Zhang et al. (2020b)   | 23.816 | 0.782
Zhang et al. (2020b)*  | 20.220 | 0.677
RGB                    | 21.464 | 0.725
GS+C (Ours)            | 23.364 | 0.784
Temporal GS+C (Ours)   | 23.793 | 0.805

Table 5.1 A quantitative comparison for shadow removal on the UCB dataset.
Zhang et al. (2020b)* is our implementation, trained using our synthesized data.

5.5.2 Shadow removal and segmentation

We compare the results on the UCB dataset. The baseline method is the one from Zhang et al. (2020b), which also includes the performance of several previous works Cun et al. (2020); Guo et al. (2012); Hu et al. (2018). However, as no pre-trained models, training data, or training scripts of these methods are available, it is hard to obtain a fair comparison between these methods and ours on UCB. To bridge the gap, we re-implement Zhang et al. (2020b) with our best efforts, and train it using our synthesized data. We believe it to be a faithful implementation, as the conventional perceptual loss and pixel-wise loss are mainly used. Table 5.1 reports the comparison results of PSNR and SSIM. Our single-frame grayscale shadow removal + colorization model (GS+C) outperforms the methods of Cun et al. (2020); Guo et al. (2012); Hu et al. (2018) and achieves comparable performance with Zhang et al. (2020b). With the temporal sharing module (TSM) and temporal consistency loss (L_T), our method can achieve a competitive PSNR and outperform the reported Zhang et al. (2020b) on SSIM. Notice that both the PSNR and SSIM from our implemented Zhang et al. (2020b) are lower than the reported numbers. We believe it is mainly due to the large domain gap between our training data and the training data used in Zhang et al. (2020b). A qualitative comparison is shown in Fig. 5.7.

Figure 5.8 Qualitative shadow removal evaluations on the SFW database. From top to bottom, we show the shadow face, shadow removal results from Le & Samaras (2019) and Zhang et al. (2020b), our single-frame model and our temporal model, the ground truth shadow segmentation (in bright purple), and the predicted shadow mask (before thresholding).
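The mask binarization of Eqn. 5.7 and the accuracy metric of Sec. 5.5.1 can be sketched as follows; the function names are ours, not from the released code:

```python
import numpy as np

def shadow_mask(recovered_gs, input_gs, tau=0.1):
    """Eqn. 5.7: mark pixels whose grayscale value changed by more than tau
    between the input face and the recovered face."""
    return (np.abs(recovered_gs - input_gs) > tau).astype(np.uint8)

def segmentation_accuracy(pred_mask, gt_mask):
    """Accuracy = (TP + TN) / (N_p + N_n) from Sec. 5.5.1; the denominator
    is simply the total number of pixels."""
    tp = np.sum((pred_mask == 1) & (gt_mask == 1))
    tn = np.sum((pred_mask == 0) & (gt_mask == 0))
    return (tp + tn) / gt_mask.size
```

The AUC is computed analogously, but from the continuous difference map before thresholding, sweeping tau over its range.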
Secondly, we evaluate the models on the SFW database, which is more challenging due to its highly dynamic environments. We conduct a quantitative comparison on the performance of shadow segmentation, which is a required module in many applications. Table 5.2 shows the comparison with recent methods Hu et al. (2019); Le & Samaras (2019); Zhang et al. (2020b). Our method outperforms all others in terms of AUC and accuracy. Fig. 5.8 shows the qualitative comparison, and our method produces a much better recovery of facial foreign shadows compared with the baselines. Furthermore, our model is able to remove strong self-shadows (the forehead area of the 9th column) and keep normal shadows (the left-cheek area of the last column). As our datasets are highly diverse, we find that the methods in Le & Samaras (2019); Zhang et al. (2020b) cannot generalize well and their performance degrades.

Segmentation Model    | AUC   | Accuracy
Le & Samaras (2019)   | 0.603 | 0.683
Hu et al. (2019)      | 0.540 | 0.604
Zhang et al. (2020b)  | 0.686 | 0.756
GS+C                  | 0.898 | 0.874
Temporal GS+C         | 0.904 | 0.882

Table 5.2 A quantitative comparison of shadow segmentation on the SFW database.

5.5.3 Ablation Studies

We conduct ablation studies to better understand each component. The baseline methods are our implemented version of Zhang et al. (2020b) and the direct RGB shadow modeling with our backbone network. For a fair comparison, we match the computational resources of the GS+C model (i.e., doubling the bottleneck depth and the decoder channels). As shown in Tab. 5.1, our implemented Zhang et al.
(2020b) shows a performance of 20.220 PSNR and 0.677 SSIM. By updating to a better backbone network with non-local blocks and face position coding, our baseline model with RGB shadow modeling achieves a better PSNR of 21.464 and SSIM of 0.725. Next, our single-frame GS+C model outperforms the previous two baseline models, thanks to the effectiveness of the novel shadow modeling. And with the temporal design of TSM and the corresponding loss, our model can further improve the PSNR and SSIM to 23.793 and 0.805, respectively. Qualitative comparisons are shown in Fig. 5.7. We can see that both the single-frame and temporal GS+C models show better visual quality than the RGB model. In addition, the temporal model further improves the color consistency and suppresses the artifacts on several subjects.

5.6 Conclusion

In this chapter, we introduce the problem of blind removal of facial foreign shadows. We propose an effective shadow modeling to help the model generalize better. We decompose the conventional RGB shadow modeling into grayscale shadow modeling and colorization. We also propose a temporal sharing module (TSM) that can be easily integrated into any encoders and decoders to impose temporal consistency. Our method can produce photo-realistic de-shadowed faces with high PSNR and SSIM. Our large-scale video database collected under highly dynamic environments is another major contribution that can benefit various face-related research and applications.

Chapter 6

Conclusions and Future Work

Face is one of the most popular biometric modalities due to its convenience of usage, e.g., access control, phone unlock. Despite the high recognition accuracy, face recognition systems are not able to distinguish between real human faces and fake ones. Thus, they are vulnerable to face spoof attacks, which deceive the systems into recognizing the attacker as another person. To safely use face recognition, face anti-spoofing techniques are required to detect spoof attacks before performing recognition.
In Chapter 2, we propose a CNN-RNN model that is learned to estimate the face depth with pixel-wise supervision, and to estimate rPPG signals with sequence-wise supervision. The estimated depth and rPPG are fused to distinguish live vs. spoof faces. Experiments show that our model achieves improvements on both intra- and cross-database detection performance.

In Chapter 3, we study the generalization problem of face anti-spoofing. We define the detection of unknown spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA) and extend the study of ZSFA from 1-2 types to 13 types. We propose a novel Deep Tree Network (DTN) to partition the spoof samples into semantic sub-groups in an unsupervised fashion. Experiments show that our proposed method achieves the state of the art on multiple testing protocols of ZSFA.

In Chapter 4, we study a new problem of interpreting a face anti-spoofing model's decision. We provide a comprehensive modeling of the spoof traces of various spoof attacks, and design a novel adversarial learning framework to disentangle the spoof traces from input faces as a hierarchical combination of patterns at multiple scales. With the disentangled spoof traces, we unveil the live counterpart of the original spoof face, and further synthesize realistic new spoof faces after a proper geometric correction. Our method demonstrates superior spoof detection performance on both seen and unseen spoof scenarios while providing a visually-convincing estimation of spoof traces.

In Chapter 5, we show that proper physical modeling can also benefit other face problems, and study a new problem of face shadow removal. We propose an effective shadow modeling to help the model generalize better. We decompose the conventional RGB shadow modeling into grayscale shadow modeling and colorization. In addition, we propose a temporal sharing module (TSM) that can be easily integrated into any encoders and decoders to impose temporal consistency.

6.1 Future Works

Uncertainty of face anti-spoofing We look into open-set face anti-spoofing, and we sometimes find that the model may not always be very confident about its decision. In practice, it is sometimes okay to reject the sample or provide additional manual inspection when the model is not very confident. So a future direction is to enable the model to provide a confidence score with its decision.
Retrainability of face anti-spoofing In practice, the face anti-spoofing model may need to be delivered to different users to handle different situations, where a fine-tuning process, or retraining process, is engaged. To ease the retraining process, several topics can be further investigated, such as incremental learning, model fusion, and early stopping policy.

Sensor variations Sensor variation is an important factor in a practical face anti-spoofing system, which has not been quantitatively studied. While sensor variation causes a large negative impact on the face anti-spoofing performance, face anti-spoofing models have to be re-trained every time when switching to a new sensor, which is hardly feasible and time-consuming. We intend to evaluate the cross-sensor performance and propose solutions if the performance is not ideal.

Improving synthesis on large pose and expression We intend to modify the 3D Warping Layer to better handle large pose variations. Specifically, we additionally provide visibility in the landmark preparation, so the warping layer can leverage the visibility to implement a fast and differentiable z-buffer rendering. This can additionally leverage other large-pose face databases, such as 300-VW, to synthesize large-pose spoof faces for training, which are hard to include in real face anti-spoofing databases.

APPENDIX

Contribution on Co-authored Publications

Chapter 2 of this dissertation is based on the research papers Liu et al. (2018c) and Jourabloo et al. (2018). Amin Jourabloo and I have equal contributions to the proposed methods and implementation of these research papers. Here, I mention the detailed contributions of each individual:

1. Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision [Liu et al. (2018c)]

Yaojie Liu:
- Data collection for the SiW dataset;
- Providing the ground truth for the pseudo-depth map for the training images;
- Designing the CNN part of the network, including the Depth CNN and the non-rigid registration layer;
- Implementing the training on different datasets and protocols, and finalizing training settings and tricks;
- Carrying on the daily management of the SiW dataset, including granting access to research groups worldwide, and answering questions about the database.
Amin Jourabloo:
- Data collection for the SiW dataset;
- Providing the ground truth rPPG signals for the training videos;
- Designing the RNN part of the network, including the LSTM, FFT layer, and corresponding loss function;
- Implementing evaluation metrics (such as EER, HTER, and ACER), performing testing on different datasets and protocols, and generating qualitative and quantitative results and analysis.

2. Face De-Spoofing: Anti-Spoofing via Noise Modeling [Jourabloo et al. (2018)]

Yaojie Liu:
- Performing a case study on the spoof noise pattern, and exploring three important properties of the spoof noise pattern;
- Implementing experiments and analyzing results on different datasets and protocols (CASIA and Replay-Attack datasets).

Amin Jourabloo:
- Designing the loss functions and the network architecture for the face de-spoofing, including the DS Net, DQ Net, and VQ Net;
- Implementing the training on all protocols in the Oulu database, and finalizing training settings and tricks;
- Analyzing the experiment results on the Oulu dataset and executing several ablation studies.

Related Publications

In this section, I list all related publications I completed during my PhD study at Michigan State University:

1. Liu, Xiaohong, Yaojie Liu, Jun Chen & Xiaoming Liu. 2021. PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. arXiv preprint arXiv:2103.10596.
2. Liu, Yaojie & Xiaoming Liu. Physics-Guided Spoof Trace Disentanglement for Generic Face Anti-Spoofing. arXiv preprint arXiv:2012.05185.
3. Liu, Yaojie, Joel Stehouwer & Xiaoming Liu. 2020. On Disentangling Spoof Traces for Generic Face Anti-Spoofing. In ECCV.
4. Stehouwer, Joel, Amin Jourabloo, Yaojie Liu & Xiaoming Liu. 2020. Noise Modeling, Synthesis and Classification for Generic Object Anti-Spoofing. In CVPR, IEEE.
5. Liu, Yaojie, Joel Stehouwer, Amin Jourabloo & Xiaoming Liu. 2019. Presentation Attack Detection for Face in Mobile Phones. In Biometrics, Springer.
6. Liu, Yaojie, Joel Stehouwer, Amin Jourabloo & Xiaoming Liu. 2019. Deep Tree Learning for Zero-shot Face Anti-Spoofing. In CVPR, IEEE.
7. Jourabloo, Amin, Yaojie Liu & Xiaoming Liu. 2018. Face De-Spoofing: Anti-Spoofing via Noise Modeling. In ECCV.
8. Liu, Yaojie, Amin Jourabloo & Xiaoming Liu. 2018. Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision. In CVPR, IEEE.
9.
Atoum, Yousef, Yaojie Liu, Amin Jourabloo & Xiaoming Liu. 2017. Face Anti-Spoofing Using Patch and Depth-based CNNs. In IJCB, IEEE.
10. Liu, Yaojie, Amin Jourabloo, William Ren & Xiaoming Liu. 2017. Dense Face Alignment. In ICCVW, IEEE.
11. Liu, Yaojie, Xinyu Huang, Liu Ren & Xiaoming Liu. 2021. Blind Removal of Facial Foreign Shadow. Submitted to ICCV.

BIBLIOGRAPHY

Abadi, Martín, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI.

Abdelhamed, Abdelrahman, Stephen Lin & Michael S Brown. 2018. A high-quality denoising dataset for smartphone cameras. In CVPR, IEEE.

Agarwal, Akshay, Richa Singh & Mayank Vatsa. 2016. Face anti-spoofing using Haralick features. In BTAS, IEEE.

Arashloo, Shervin Rahimzadeh, Josef Kittler & William Christmas. 2017. An anomaly detection approach to face spoofing detection: A new formulation and evaluation protocol. IEEE Access.

Arrieta, Alejandro Barredo, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins et al. 2020. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion.

Atoum, Yousef, Yaojie Liu, Amin Jourabloo & Xiaoming Liu. 2017. Face anti-spoofing using patch and depth-based CNNs. In IJCB, IEEE.

Bao, Wei, Hong Li, Nan Li & Wei Jiang. 2009. A liveness detection method for face recognition based on optical flow field. In IASP.

Bertalmio, Marcelo, Guillermo Sapiro, Vincent Caselles & Coloma Ballester. 2000. Image inpainting. In Proceedings of the 27th annual conference on computer graphics and interactive techniques.

Bharadwaj, Samarth, Tejas I Dhamecha, Mayank Vatsa & Richa Singh. 2013. Computationally efficient face spoofing detection with motion magnification. In CVPRW, IEEE.

Bharadwaj, Samarth, Tejas I Dhamecha, Mayank Vatsa & Richa Singh. 2014. Face anti-spoofing via motion magnification and multifeature videolet aggregation.

Blanz, Volker & Thomas Vetter. 2003. Face recognition based on fitting a 3D morphable model. PAMI.

Bobbia, Serge, Yannick Benezeth & Julien Dubois. 2016. Remote photoplethysmography based on implicit living skin tissue segmentation. In ICPR.
Boulkenafet, Zinelabidine. 2017. A competition on generalized software-based face presentation attack detection in mobile scenarios. In IJCB, IEEE.

Boulkenafet, Zinelabidine, Jukka Komulainen & Abdenour Hadid. 2015. Face anti-spoofing based on color texture analysis. In ICIP, IEEE.

Boulkenafet, Zinelabidine, Jukka Komulainen & Abdenour Hadid. 2016. Face spoofing detection using colour texture analysis. TIFS.

Boulkenafet, Zinelabidine, Jukka Komulainen & Abdenour Hadid. 2017a. Face anti-spoofing using speeded-up robust features and Fisher vector encoding. Signal Processing Letters.

Boulkenafet, Zinelabidine, Jukka Komulainen, Lei Li, Xiaoyi Feng & Abdenour Hadid. 2017b. OULU-NPU: A mobile face presentation attack database with real-world variations. In FG.

Bulat, Adrian & Georgios Tzimiropoulos. 2017. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In ICCV, IEEE.

Cao, Chen, Yanlin Weng, Shun Zhou, Yiying Tong & Kun Zhou. 2014. FaceWarehouse: A 3D facial expression database for visual computing. Trans. Vis. Comput. Graphics.

Cao, Qingxing, Xiaodan Liang, Bailing Li, Guanbin Li & Liang Lin. 2018. Visual question reasoning on general dependency tree. In CVPR, IEEE.

Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov & Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV, Springer.

Chang, Huiwen, Jingwan Lu, Fisher Yu & Adam Finkelstein. 2018. PairedCycleGAN: Asymmetric style transfer for applying and removing makeup. In CVPR, IEEE.

Chen, Chang, Zhiwei Xiong, Xiaoming Liu & Feng Wu. 2020. Camera trace erasing. In CVPR, IEEE.

Chen, Cunjian, Antitza Dantcheva & Arun Ross. 2013. Automatic facial makeup detection with application in face recognition. In ICB, IEEE.

Chen, Cunjian, Antitza Dantcheva & Arun Ross. 2014. Impact of facial cosmetics on automatic gender and age estimation algorithms. In International conference on computer vision theory and applications (VISAPP), IEEE.

Chen, Xinyun, Chang Liu & Dawn Song. 2018. Tree-to-tree neural networks for program translation. In NIPS.

Chetty, Girija. 2010. Biometric liveness checking using multimodal fuzzy fusion. In International conference on fuzzy systems, IEEE.
Chetty, Girija & Michael Wagner. 2006. Audio-visual multimodal fusion for biometric person authentication and liveness verification. In Proceedings of the 2005 NICTA-HCSNet Multimodal User Interaction Workshop - Volume 57.
Chingovska, Ivana, André Anjos & Sébastien Marcel. 2012. On the effectiveness of local binary patterns in face anti-spoofing. In BIOSIG, IEEE.
Chuang, Yung-Yu, Dan B Goldman, Brian Curless, David H Salesin & Richard Szeliski. 2003. Shadow matting and compositing. In SIGGRAPH, ACM.
Corneanu, Ciprian A, Meysam Madadi, Sergio Escalera & Aleix M Martinez. 2019. What does it mean to learn in deep networks? And, how does one detect adversarial attacks? In CVPR, IEEE.
Cun, Xiaodong, Chi-Man Pun & Cheng Shi. 2020. Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting GAN. In AAAI.
Dang, Hao, Feng Liu, Joel Stehouwer, Xiaoming Liu & Anil K Jain. 2020. On the detection of digital face manipulation. In CVPR, IEEE.
Ding, Bin, Chengjiang Long, Ling Zhang & Chunxia Xiao. 2019. ARGAN: Attentive recurrent generative adversarial network for shadow detection and removal. In ICCV, IEEE.
Donner, Craig & Henrik Wann Jensen. 2006. A spectral BSSRDF for shading human skin. In Proceedings of the 17th Eurographics Conference on Rendering Techniques, 409–417.
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Egger, Bernhard, Sandro Schönborn, Andreas Schneider, Adam Kortylewski, Andreas Morel-Forster, Clemens Blumer & Thomas Vetter. 2018. Occlusion-aware 3D morphable models and an illumination prior for face image analysis. IJCV.
Esser, Patrick, Ekaterina Sutter & Björn Ommer. 2018. A variational U-Net for conditional appearance and shape generation. In CVPR, IEEE.
Feng, Haocheng, Zhibin Hong, Haixiao Yue, Yang Chen, Keyao Wang, Junyu Han, Jingtuo Liu & Errui Ding. 2020. Learning generalized spoof cues for face anti-spoofing. arXiv preprint arXiv:2005.03922.
Feng, Litong, Lai-Man Po, Yuming Li, Xuyuan Xu, Fang Yuan, Terence Chun-Ho Cheung & Kwok-Wai Cheung. 2016. Integration of image quality and motion cues for face anti-spoofing: A neural network approach. J. Visual Communication and Image Representation.
Finlayson, Graham D, Steven D Hordley & Mark S Drew. 2002. Removing shadows from images. In ECCV, Springer.
de Freitas Pereira, Tiago, André Anjos, José Mario De Martino & Sébastien Marcel. 2012. LBP-TOP based countermeasure against face spoofing attacks. In ACCV, IEEE.
de Freitas Pereira, Tiago, André Anjos, José Mario De Martino & Sébastien Marcel. 2013. Can face anti-spoofing countermeasures work in a real world scenario? In ICB, IEEE.
Frome, Andrea, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato & Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.
Gu, Shuyang, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen & Lu Yuan. 2019. Mask-guided portrait editing with conditional GANs. In CVPR, IEEE.
Guo, Jianzhu, Xiangyu Zhu, Jinchuan Xiao, Zhen Lei, Genxun Wan & Stan Z Li. 2019. Improving face anti-spoofing by 3D virtual synthesis. arXiv preprint arXiv:1901.00488.
Guo, Ruiqi, Qieyun Dai & Derek Hoiem. 2012. Paired regions for shadow detection and removal. PAMI.
de Haan, Gerard & Vincent Jeanne. 2013. Robust pulse rate from chrominance-based rPPG. Trans. Biomedical Engineering.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren & Jian Sun. 2016. Deep residual learning for image recognition. In CVPR, IEEE.
Hu, X, CW Fu, L Zhu, J Qin & PA Heng. 2019. Direction-aware spatial context features for shadow detection and removal. PAMI.
Hu, Xiaowei, Lei Zhu, Chi-Wing Fu, Jing Qin & Pheng-Ann Heng. 2018. Direction-aware spatial context features for shadow detection. In CVPR, IEEE.
IARPA. 2016. IARPA research program Odin. https://www.iarpa.gov/index.php/research-programs/odin.
ISO/IEC-JTC-1/SC-37. 2016. Biometrics. Information technology - biometric presentation attack detection - Part 1: Framework. https://www.iso.org/obp/ui/iso.
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou & Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In CVPR, IEEE.
Jourabloo, Amin & Xiaoming Liu. 2017. Pose-invariant face alignment via CNN-based dense 3D model fitting. IJCV.
Jourabloo, Amin, Yaojie Liu & Xiaoming Liu. 2018. Face de-spoofing: Anti-spoofing via noise modeling. In ECCV.
Kalchbrenner, Nal, Edward Grefenstette & Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
Kaneko, Takuhiro, Kaoru Hiramatsu & Kunio Kashino. 2018. Generative adversarial image synthesis with decision tree latent controller. In CVPR, IEEE.
Karessli, Nour, Zeynep Akata, Bernt Schiele & Andreas Bulling. 2017. Gaze embeddings for zero-shot image classification. In CVPR, IEEE.
Karras, Tero, Samuli Laine & Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR, IEEE.
Kazemi, Vahid & Josephine Sullivan. 2014. One millisecond face alignment with an ensemble of regression trees. In CVPR, IEEE.
Kollreider, Klaus, Hartwig Fronthaler, Maycel Isaac Faraj & Josef Bigun. 2007. Real-time face detection and motion analysis with application in liveness assessment. TIFS.
Komulainen, Jukka, Abdenour Hadid & Matti Pietikäinen. 2013a. Context based face anti-spoofing. In BTAS, IEEE.
Komulainen, Jukka, Abdenour Hadid, Matti Pietikäinen, André Anjos & Sébastien Marcel. 2013b. Complementary countermeasures for detecting scenic face spoofing attacks. In ICB, IEEE.
Krizhevsky, Alex, Ilya Sutskever & Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
Lampert, Christoph H, Hannes Nickisch & Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, IEEE.
Lawrence, Steve, C Lee Giles, Ah Chung Tsoi & Andrew D Back. 1997. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks.
Le, Hieu & Dimitris Samaras. 2019. Shadow removal via shadow image decomposition. In ICCV, IEEE.
Lee, Cheng-Han, Ziwei Liu, Lingyun Wu & Ping Luo. 2020. MaskGAN: Towards diverse and interactive facial image manipulation. In CVPR, IEEE.
Li, Jiangwei, Yunhong Wang, Tieniu Tan & Anil K Jain. 2004. Live face detection based on the analysis of Fourier spectra. In BTHI, SPIE.
Li, Lei, Xiaoyi Feng, Zinelabidine Boulkenafet, Zhaoqiang Xia, Mingming Li & Abdenour Hadid. 2016a. An original face anti-spoofing approach using partial convolutional neural network. In IPTA.
Li, Xiaobai, Jukka Komulainen, Guoying Zhao, Pong-Chi Yuen & Matti Pietikäinen. 2016b. Generalized face anti-spoofing by detecting pulse from face videos. In ICPR, IEEE.
Li, Zhihang, Yibo Hu, Ran He & Zhenan Sun. 2020. Learning disentangling and fusing networks for face completion under structured occlusions. Pattern Recognition.
Liao, Jing, Yuan Yao, Lu Yuan, Gang Hua & Sing Bing Kang. 2017. Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088.
Liu, Feng, Dan Zeng, Qijun Zhao & Xiaoming Liu. 2018a. Disentangling features in 3D face shapes for joint face reconstruction and recognition. In CVPR, IEEE.
Liu, Si-Qi, Xiangyuan Lan & Pong C Yuen. 2018b. Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection. In ECCV, 558–573.
Liu, Siqi, Baoyao Yang, Pong C Yuen & Guoying Zhao. 2016a. A 3D mask face anti-spoofing database with real world variations. In CVPRW, IEEE.
Liu, Siqi, Pong C Yuen, Shengping Zhang & Guoying Zhao. 2016b. 3D mask face anti-spoofing with remote photoplethysmography. In ECCV.
Liu, Yaojie, Amin Jourabloo & Xiaoming Liu. 2018c. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In CVPR, IEEE.
Liu, Yaojie, Amin Jourabloo, William Ren & Xiaoming Liu. 2017. Dense face alignment. In ICCVW, IEEE.
Liu, Yaojie & Chang Shu. 2015. A comparison of image inpainting techniques. In Sixth International Conference on Graphic and Image Processing (ICGIP), International Society for Optics and Photonics.
Liu, Yaojie, Joel Stehouwer, Amin Jourabloo & Xiaoming Liu. 2019a. Deep tree learning for zero-shot face anti-spoofing. In CVPR, IEEE.
Liu, Yaojie, Joel Stehouwer, Amin Jourabloo & Xiaoming Liu. 2019b. Presentation attack detection for face in mobile phones. Biometrics.
Liu, Yaojie, Joel Stehouwer & Xiaoming Liu. 2020. On disentangling spoof traces for generic face anti-spoofing. In ECCV.
Maaten, Laurens van der & Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research.
Määttä, Jukka, Abdenour Hadid & Matti Pietikäinen. 2011. Face spoofing detection from single images using micro-texture analysis. In IJCB, IEEE.
Mao, Xudong, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang & Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In ICCV, IEEE.
Mirjalili, Vahid & Arun Ross. 2017. Soft biometric privacy: Retaining biometric utility of face images while perturbing gender. In ICB, IEEE.
Nagano, Koki, Huiwen Luo, Zejian Wang, Jaewoo Seo, Jun Xing, Liwen Hu, Lingyu Wei & Hao Li. 2019. Deep face normalization. TOG.
Nestmeyer, Thomas, Jean-François Lalonde, Iain Matthews & Andreas Lehrmann. 2020. Learning physics-guided face relighting under directional light. In CVPR, IEEE.
Nowara, Ewa Magdalena, Ashutosh Sabharwal & Ashok Veeraraghavan. 2017. PPGSecure: Biometric presentation attack detection using photopletysmograms. In FG.
Pan, Gang, Lin Sun, Zhaohui Wu & Shihong Lao. 2007. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, IEEE.
Patel, Keyurkumar, Hu Han & Anil K Jain. 2016a. Cross-database face anti-spoofing with robust feature representation. In CCBR.
Patel, Keyurkumar, Hu Han & Anil K Jain. 2016b. Secure face unlock: Spoof detection on smartphones. TIFS.
Paysan, Pascal, Reinhard Knothe, Brian Amberg, Sami Romdhani & Thomas Vetter. 2009. A 3D face model for pose and illumination invariant face recognition. In AVSS.
Peixoto, Bruno, Carolina Michelassi & Anderson Rocha. 2011. Face liveness detection under bad illumination conditions. In ICIP, IEEE.
Pinto, Allan, Helio Pedrini, William Robson Schwartz & Anderson Rocha. 2015. Face spoofing detection through visual codebooks of spectral temporal cubes. TIP.
Po, Lai-Man, Litong Feng, Yuming Li, Xuyuan Xu, Terence Chun-Ho Cheung & Kwok-Wai Cheung. 2017. Block-based adaptive ROI for remote photoplethysmography. J. Multimedia Tools and Applications.
Qin, Yunxiao, Chenxu Zhao, Xiangyu Zhu, Zezheng Wang, Zitong Yu, Tianyu Fu, Feng Zhou, Jingping Shi & Zhen Lei. 2019. Learning meta model for zero- and few-shot face anti-spoofing. arXiv preprint arXiv:1904.12490.
Qu, Liangqiong, Jiandong Tian, Shengfeng He, Yandong Tang & Rynson WH Lau. 2017. DeshadowNet: A multi-context embedding deep network for shadow removal. In CVPR, IEEE.
Ronneberger, Olaf, Philipp Fischer & Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer.
Sengupta, Soumyadip, Angjoo Kanazawa, Carlos D Castillo & David W Jacobs. 2018. SfSNet: Learning shape, reflectance and illuminance of faces 'in the wild'. In CVPR, IEEE.
Shao, Rui, Xiangyuan Lan, Jiawei Li & Pong C Yuen. 2019a. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In CVPR, IEEE.
Shao, Rui, Xiangyuan Lan & Pong C Yuen. 2017. Deep convolutional dynamic texture learning with adaptive channel-discriminability for 3D mask face anti-spoofing. In IJCB, IEEE.
Shao, Rui, Xiangyuan Lan & Pong C Yuen. 2019b. Regularized fine-grained meta face anti-spoofing. arXiv preprint arXiv:1911.10771.
Shashua, Amnon & Tammy Riklin-Raviv. 2001. The quotient image: Class-based re-rendering and recognition with varying illuminations. PAMI.
Shih, YiChang, Sylvain Paris, Connelly Barnes, William T Freeman & Frédo Durand. 2014. Style transfer for headshot portraits. TOG.
Shor, Yael & Dani Lischinski. 2008. The shadow meets the mask: Pyramid-based shadow removal. In Computer Graphics Forum, Wiley Online Library.
Shu, Zhixin, Sunil Hadap, Eli Shechtman, Kalyan Sunkavalli, Sylvain Paris & Dimitris Samaras. 2017a. Portrait lighting transfer using a mass transport approach. TOG.
Shu, Zhixin, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman & Dimitris Samaras. 2017b. Neural face editing with intrinsic image disentangling. In CVPR, IEEE.
Socher, Richard, Milind Ganjoo, Christopher D Manning & Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In NIPS.
Stehouwer, Joel, Amin Jourabloo, Yaojie Liu & Xiaoming Liu. 2020. Noise modeling, synthesis and classification for generic object anti-spoofing. In CVPR, IEEE.
Stoschek, Arne. 2000. Image-based re-rendering of faces for continuous pose and illumination directions. In CVPR, IEEE.
Sun, Lin, Gang Pan, Zhaohui Wu & Shihong Lao. 2007. Blinking-based live face detection using conditional random fields. J. Advances in Biometrics.
Sun, Tiancheng, Jonathan T Barron, Yun-Ta Tsai, Zexiang Xu, Xueming Yu, Graham Fyffe, Christoph Rhemann, Jay Busch, Paul E Debevec & Ravi Ramamoorthi. 2019. Single image portrait relighting. TOG.
Tan, Xiaoyang, Yi Li, Jun Liu & Lin Jiang. 2010. Face liveness detection from a single image with sparse low rank bilinear discriminative model. In ECCV.
Tewari, Ayush, Michael Zollhofer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez & Christian Theobalt. 2017. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCVW, IEEE.
Thai, Thanh Hai, Remi Cogranne & Florent Retraint. 2013. Camera model identification based on the heteroscedastic noise model. TIP.
Thai, Thanh Hai, Florent Retraint & Rémi Cogranne. 2016. Camera model identification based on the generalized noise model in natural images. Digital Signal Processing.
Tran, Luan & Xiaoming Liu. 2018. Nonlinear 3D face morphable model. In CVPR, IEEE.
Tran, Luan & Xiaoming Liu. 2021. On learning 3D face morphable model from in-the-wild images. PAMI.
Tran, Luan, Xiaoming Liu, Jiayu Zhou & Rong Jin. 2017a. Missing modalities imputation via cascaded residual autoencoder. In CVPR, IEEE.
Tran, Luan, Xi Yin & Xiaoming Liu. 2017b. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, IEEE.
Tulyakov, Sergey, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F Cohn & Nicu Sebe. 2016. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In CVPR, IEEE.
Turek, Matt. 2016. Explainable Artificial Intelligence (XAI). https://www.darpa.mil/program/explainable-artificial-intelligence.
Valle, Roberto, Jose M Buenaposada, Antonio Valdes & Luis Baumela. 2018. A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. In ECCV.
Wang, Jifeng, Xiang Li & Jian Yang. 2018a. Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In CVPR, IEEE.
Wang, Ting-Chun, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz & Bryan Catanzaro. 2018b. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, IEEE.
Wang, Xiaolong, Ross Girshick, Abhinav Gupta & Kaiming He. 2018c. Non-local neural networks. In CVPR, IEEE.
Wang, Yang, Lei Zhang, Zicheng Liu, Gang Hua, Zhen Wen, Zhengyou Zhang & Dimitris Samaras. 2008. Face relighting from a single image under arbitrary unknown lighting conditions. PAMI.
Wang, Yu, Luca Bondi, Paolo Bestagini, Stefano Tubaro, Edward J Delp et al. 2017. A counter-forensic method for CNN-based camera model identification. In CVPRW, IEEE.
Wen, Di, Hu Han & A. K. Jain. 2015. Face spoof detection with image distortion analysis. TIFS.
Wen, Zhen, Zicheng Liu & Thomas S Huang. 2003. Face relighting with radiance environment maps. In CVPR, IEEE.
Wu, Bing-Fei, Yun-Wei Chu, Po-Wei Huang, Meng-Liang Chung & Tzu-Min Lin. 2016. A motion robust remote-PPG approach to driver's health state monitoring. In ACCV, IEEE.
Wu, Qi, Wende Zhang & BVK Vijaya Kumar. 2012. Strong shadow removal via patch-based shadow edge detection. In ICRA, IEEE.
Wu, Shangzhe, Christian Rupprecht & Andrea Vedaldi. 2020. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In CVPR, IEEE.
Wu, Tai-Pang & Chi-Keung Tang. 2005. A Bayesian approach for shadow extraction from a single image. In ICCV, IEEE.
Wu, Yue, Wael AbdAlmageed & Premkumar Natarajan. 2019. ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In CVPR, IEEE.
Wu, Yuxin & Kaiming He. 2018. Group normalization. In ECCV.
Xiong, Chao, Xiaowei Zhao, Danhang Tang, Karlekar Jayashree, Shuicheng Yan & Tae-Kyun Kim. 2015. Conditional convolutional neural network for modality-aware face recognition. In ICCV, IEEE.
Xiong, Fei & Wael AbdAlmageed. 2018. Unknown presentation attack detection with face RGB images. In BTAS, IEEE.
Xu, Zhenqi, Shan Li & Weihong Deng. 2015. Learning temporal features using LSTM-CNN architecture for face anti-spoofing. In ACPR, IEEE.
Yang, Jianwei, Zhen Lei & Stan Z Li. 2014. Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601.
Yang, Jianwei, Zhen Lei, Shengcai Liao & Stan Z Li. 2013. Face liveness detection with component dependent descriptor. In ICB, IEEE.
Yang, Xiao, Wenhan Luo, Linchao Bao, Yuan Gao, Dihong Gong, Shibao Zheng, Zhifeng Li & Wei Liu. 2019a. Face anti-spoofing: Model matters, so does data. In CVPR, IEEE.
Yang, Yang, Xiaojie Guo, Jiayi Ma, Lin Ma & Haibin Ling. 2019b. Generative landmark guided face inpainting. arXiv preprint arXiv:1911.11394.
Yu, Zitong, Xiaobai Li, Xuesong Niu, Jingang Shi & Guoying Zhao. 2020a. Face anti-spoofing with human material perception. In ECCV.
Yu, Zitong, Chenxu Zhao, Zezheng Wang, Yunxiao Qin, Zhuo Su, Xiaobai Li, Feng Zhou & Guoying Zhao. 2020b. Searching central difference convolutional networks for face anti-spoofing. In CVPR, IEEE.
Zhang, Ke-Yue, Taiping Yao, Jian Zhang, Ying Tai, Shouhong Ding, Jilin Li, Feiyue Huang, Haichuan Song & Lizhuang Ma. 2020a. Face anti-spoofing via disentangled representation learning. In ECCV.
Zhang, Li, Tao Xiang & Shaogang Gong. 2017a. Learning a deep embedding model for zero-shot learning. In CVPR, IEEE.
Zhang, Shu, Ran He, Zhenan Sun & Tieniu Tan. 2017b. DeMeshNet: Blind face inpainting for deep mesh face verification. TIFS.
Zhang, Xuaner, Jonathan T. Barron, Yun-Ta Tsai, Rohit Pandey, Xiuming Zhang, Ren Ng & David E. Jacobs. 2020b. Portrait shadow manipulation. TOG.
Zhang, Zhiwei, Junjie Yan, Sifei Liu, Zhen Lei, Dong Yi & Stan Z Li. 2012. A face antispoofing database with diverse attacks. In ICB.
Zhang, Ziyuan, Luan Tran, Xi Yin, Yousef Atoum, Jian Wan, Nanxin Wang & Xiaoming Liu. 2019. Gait recognition via disentangled representation learning. In CVPR, IEEE.
Zhao, Chenxu, Yunxiao Qin, Zezheng Wang, Tianyu Fu & Hailin Shi. 2019. Meta anti-spoofing: Learning to learn in face anti-spoofing. arXiv preprint arXiv:1904.12490.
Zhou, Hao, Sunil Hadap, Kalyan Sunkavalli & David W Jacobs. 2019. Deep single-image portrait relighting. In ICCV, IEEE.
Zhu, Xiangyu, Zhen Lei, Xiaoming Liu, Hailin Shi & Stan Z Li. 2016. Face alignment across large poses: A 3D solution. In CVPR, IEEE.
Zhu, Xiangyu, Xiaoming Liu, Zhen Lei & Stan Z Li. 2017. Face alignment in full pose range: A 3D total solution. PAMI.