LEARNING 3D MODEL FROM 2D IN-THE-WILD IMAGES

By

Luan Quoc Tran

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2020

ABSTRACT

LEARNING 3D MODEL FROM 2D IN-THE-WILD IMAGES

By

Luan Quoc Tran

Understanding the 3D world is one of computer vision's fundamental problems. While a human has no difficulty understanding the 3D structure of an object upon seeing its 2D image, such a 3D inference task remains extremely challenging for computer vision systems. To better handle the ambiguity in this inverse problem, one must rely on additional prior assumptions, such as constraining faces to lie in a restricted subspace from a 3D model. Conventional 3D models are learned from a set of 3D scans or computer-aided design (CAD) models, and represented by two sets of PCA basis functions. Due to the type and amount of training data, as well as the linear bases, the representation power of these models can be limited. To address these problems, this thesis proposes an innovative framework to learn a nonlinear 3D model from a large collection of in-the-wild images, without collecting 3D scans. Specifically, given an input image (of a face or an object), a network encoder estimates the projection, lighting, shape and albedo parameters. Two decoders serve as the nonlinear model to map from the shape and albedo parameters to the 3D shape and albedo, respectively. With the projection parameter, lighting, 3D shape, and albedo, a novel analytically-differentiable rendering layer is designed to reconstruct the original input. The entire network is end-to-end trainable with only weak supervision. We demonstrate the superior representation power of our models on different domains (face, generic objects), and their contribution to many other applications on facial analysis and monocular 3D object reconstruction.

This thesis is dedicated to my beautiful wife My Nhat Nguyen, whose encouragement along the way was paramount to making it thus far.

ACKNOWLEDGMENTS

Foremost, I would like to express my sincere gratitude to my advisor Prof. Xiaoming Liu for the continuous support of my Ph.D. study. His desire to see me succeed has pushed me to obtain far more than I could imagine alone. The late nights spent writing papers together, attention to the smallest details, and desire to push the bounds of knowledge have inspired my dedication to excellence. I am deeply indebted for his refinement of my writing and presentation skills.

I would also like to thank the remainder of my committee members, Dr. Arun Ross, Dr. Jiayu Zhou and Dr. Daniel Morris, for their valuable insights and contributions along the way.

I am grateful to my Computer Vision Lab members, both present and past, Dr. Joseph Roth, Dr. Xi Yin, Dr. Amin Jourabloo, Dr. Morteza Safdarnejad, Dr. Yousef Atoum, Yaojie Liu, Garrick Brazil, Adam Terwilliger, Joel Stehouwer, Bangjie Yin, Hieu Nguyen, Shengjie Zhu and Masa Hu, for the excellent working atmosphere. The willingness to answer any questions and the late nights working together cause all of our work to flourish. I will also never forget the memories we have together, from board game nights to travel trips, that made my PhD a very pleasant journey.

A word of appreciation to Brenda Hodge, Katherine Trinklein, Steven Smith and Amy King for their administrative assistance.

Finally, I would like to thank my family - my parents and my sister - who have provided me with moral and emotional support in my life. The largest thanks go to my beautiful wife My Nhat.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction and Contributions
  1.1  Thesis Contributions
  1.2  Thesis Organization

Chapter 2  Background and Related Work
  2.1  3D Morphable Model
  2.2  Improving Linear 3DMM
  2.3  2D Face Alignment
  2.4  3D Face Reconstruction
  2.5  3D Object Modeling and Reconstruction

Chapter 3  Learning 3D Face Morphable Model from In-the-wild Images
  3.1  Introduction
  3.2  The Proposed Nonlinear 3DMM
    3.2.1  Nonlinear 3DMM
      3.2.1.1  Problem Formulation
      3.2.1.2  Albedo & Shape Representation
      3.2.1.3  In-Network Physically-Based Face Rendering
      3.2.1.4  Occlusion-aware Rendering
      3.2.1.5  Model Learning
  3.3  Experimental Results
    3.3.1  Ablation Study
      3.3.1.1  Effect of Regularization
      3.3.1.2  Modeling Lighting and Shape Representation
      3.3.1.3  Comparison to Autoencoders
    3.3.2  Expressiveness
    3.3.3  Representation Power
    3.3.4  Applications
      3.3.4.1  Face Alignment
      3.3.4.2  3D Face Reconstruction
    3.3.5  Runtime
  3.4  Conclusions

Chapter 4  Towards High-Fidelity Nonlinear 3D Face Morphable Model
  4.1  Introduction
  4.2  Proposed Method
    4.2.1  Nonlinear 3DMM with Proxy and Residual
    4.2.2  Global Local Based Network Architecture
  4.3  Experimental Results
    4.3.1  Ablation Study
    4.3.2  Representation Power
    4.3.3  Identity-Preserving
    4.3.4  3D Reconstruction
    4.3.5  Face editing
  4.4  Conclusions

Chapter 5  Intrinsic 3D Decomposition, Segmentation, and Modeling Generic Objects
  5.1  Introduction
    5.1.1  3D Shape and Albedo Representation
    5.1.2  Physics-Based Rendering
    5.1.3  Model Learning
      5.1.3.1  Unsupervised Joint Modeling and Fitting
      5.1.3.2  Supervised Prior Learning with Synthetic Images
    5.1.4  Implementation Details
      5.1.4.1  Model training
    5.1.5  Network Structure
  5.2  Experimental Results
    5.2.1  Experiment Setup
    5.2.2  Ablation Study
    5.2.3  Unsupervised Segmentation
    5.2.4  3D Image Decomposition
    5.2.5  Single-view 3D Reconstruction
      5.2.5.1  Reconstruction on synthetic images
      5.2.5.2  Reconstruction on real images

Chapter 6  Conclusions and Future Work

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: The architectures of the E, D_A and D_S networks.
Table 3.2: Face alignment performance on AFLW2000.
Table 3.3: Quantitative comparison of texture representation power (average reconstruction error on the non-occluded face portion).
Table 3.4: 3D scan reconstruction comparison (NME).
Table 3.5: Running time of various 3D face reconstruction methods.
Table 4.1: Quantitative comparison of texture representation power (average reconstruction error on the non-occluded face portion).
Table 5.1: Colored voxel encoder network structure.
Table 5.2: Image encoder network structure (slightly modified from ResNet-18).
Table 5.3: Effect of loss terms on pose and reconstruction estimation.
Table 5.4: Segmentation and shape representation comparisons (IoU/CD) on ShapeNet part [181]. IoU is utilized to measure segmentation against ground-truth parts. CD is used for shape representation evaluation. Chair* is training on the chair+table joint set.
Table 5.5: Quantitative comparison of single-view 3D reconstruction on synthetic images of ShapeNet.
Table 5.6: Real image 3D reconstruction on PASCAL 3D+ with CD.
Table 5.7: Real image 3D reconstruction on Pix3D with CD.
Table A1: DR-GAN and its partial variants performance comparison.
Table A2: Comparison of single vs. multi-image DR-GAN on CFP.
Table A3: Performance on IJB-A when removing images by threshold w_t. "Selected" shows the percentage of retained images.
Table A4: Fusion scheme comparisons on the IJB-A dataset.
Table A5: Loss function comparisons. All use "mean min" fusion.
Table A6: Performance comparison on the IJB-A dataset.
Table A7: Performance (Accuracy) comparison on CFP.
Table A8: Identification rate (%) comparison on the Multi-PIE dataset.
Table A9: Representation f(x) vs. synthetic image x̂ on IJB-A.

LIST OF FIGURES

Figure 2.1: The visual abstract of the seminal work by Blanz and Vetter [13]. It proposes a statistical model for faces to perform 3D reconstruction from 2D images and a parametric face space which enables controlled manipulation.
Figure 3.1: Conventional 3DMM employs linear bases models for shape/albedo, which are trained with 3D face scans and associated controlled 2D images. We propose a nonlinear 3DMM to model shape/albedo via deep neural networks (DNNs). It can be trained from in-the-wild face images without 3D scans, and also better reconstructs the original images due to the inherent nonlinearity.
Figure 3.2: Jointly learning a nonlinear 3DMM and its fitting algorithm from an unconstrained 2D in-the-wild face image collection, in a weakly supervised fashion. L_S is a visualization of shading on a sphere with lighting parameters L.
Figure 3.3: Three albedo representations. (a) Albedo value per vertex, (b) Albedo as a 2D frontal face, (c) UV space 2D unwarped albedo.
Figure 3.4: UV space shape representation. From left to right: individual channels for the x, y and z spatial dimensions and the combined shape image.
Figure 3.5: Forward and backward pass of the rendering layer.
Figure 3.6: Rendering with segmentation masks. Left to right: segmentation results, naive rendering, occlusion-aware rendering.
Figure 3.7: Effect of albedo regularizations: albedo symmetry (sym) and albedo constancy (const). When there is no regularization being used, shading is mostly baked into the albedo.
Using the symmetry property helps to resolve the global lighting. Using the constancy constraint further removes shading from the albedo, which results in a better 3D shape.
Figure 3.8: Effect of shape smoothness regularization.
Figure 3.9: Comparison to convolutional autoencoders (AE). Our approach produces results of higher quality. Also, it provides access to the 3D facial shape, albedo, lighting, and projection matrix.
Figure 3.10: Each column shows shape changes when varying one element of f_S, by 10 times standard deviations, in opposite directions. Ordered by the magnitude of shape changes.
Figure 3.11: Each column shows albedo changes when varying one element of f_A in opposite directions.
Figure 3.12: Nonlinear 3DMM generates shape and albedo embedded with different attributes.
Figure 3.13: Texture representation power comparison. Our nonlinear model can better reconstruct the facial texture.
Figure 3.14: Shape representation power comparison (l_S = 160). The error maps show the normalized per-vertex error.
Figure 3.15: 3DMM fits to faces with diverse skin color, pose, expression, lighting, facial hair, and faithfully recovers these cues. The left half shows results from the AFLW2000 dataset, the right half shows results from CelebA.
Figure 3.16: Our face alignment results. Invisible landmarks are marked as red. We can well handle extreme pose, lighting and expression.
Figure 3.17: Face alignment Cumulative Errors Distribution (CED) curves on AFLW2000-3D on 2D (left) and 3D landmarks (right). NMEs are shown in the legend boxes.
Figure 3.18: 3D reconstruction results comparison to Tewari et al. [153]. Their reconstructed shapes suffer from surface shrinkage when dealing with challenging texture or shape outside the linear model subspace. They can't handle large pose variation well either. Meanwhile, our nonlinear model is more robust to these variations.
Figure 3.19: 3D reconstruction results comparison to Tewari et al. [152]. Our model better reconstructs the input image in both texture (facial hair direction in the first image) and shape (nasolabial folds in the second image).
Figure 3.20: 3D reconstruction results comparison to Sela et al. [129]. Besides showing the shape, we also show their estimated depth and correspondence map. Facial hair or occlusion can cause serious problems in their output maps.
Figure 3.21: 3D reconstruction results comparison to VRN by Jackson et al. [63] on the CelebA dataset. Volumetric shape representation results in non-smooth 3D shape and loses correspondence between reconstructed shapes.
Figure 3.22: 3D reconstruction quantitative evaluation on FaceWarehouse. We obtain a lower error compared to PRN [46] and 3DDFA+ [195].
Figure 3.23: 3D face reconstruction results on the Florence dataset [9]. The NME of each method is shown in the legend.
Figure 4.1: The proposed framework. Each shape or albedo decoder consists of two branches to reconstruct the true element and its proxy. Proxies free shape and albedo from strong regularizations, allowing them to learn models with a high level of details.
Figure 4.2: The proposed global local based network architecture.
Figure 4.3: Reconstruction results with different loss functions.
Figure 4.4: Image reconstruction with our 3DMM model using the proxy and the true shape and albedo. Our shape and albedo can faithfully recover details of the face. Note: for the shape, we show the shading in UV space, a better visualization than the raw S_UV.
Figure 4.5: Effect of soft symmetry loss on our shape model.
Figure 4.6: Texture representation power comparison. Our nonlinear model can better reconstruct the facial texture.
Figure 4.7: Shape representation power comparison. Given a 3D shape, we optimize the feature f_S to approximate the original one.
Figure 4.8: The distance between the input images and their reconstructions from three models. For better visualization, images are sorted based on their distance to our model's reconstructions.
Figure 4.9: 3DMM fits to faces with diverse skin color, pose, expression, lighting, and faithfully recovers these cues.
Figure 4.10: 3D reconstruction comparison to Tewari et al. [153].
Figure 4.11: 3D reconstruction comparisons to nonlinear 3DMM approaches by Tewari et al. [152] or Tran and Liu [161]. Our model can reconstruct face images with a higher level of details. Please zoom in for more details. Best viewed electronically.
Figure 4.12: 3D reconstruction comparisons to Sela et al. [139] or Tran et al. [159], which go beyond latent space representations.
Figure 4.13: Lighting transfer results. We transfer the lighting of source images (first row) to target images (first column). We have similar performance compared to the state-of-the-art method of Shu et al. [143] despite being orders of magnitude faster (150 ms vs. 3 min per image).
Figure 4.14: Growing mustache editing results. The first column shows original images, the following columns show edited images with increasing magnitudes. Compared to the results of Shu et al. [144] (last row), our edited images are more realistic and identity preserved.
Figure 4.15: Adding stickers to faces. The sticker is naturally added onto faces following the surface normal or lighting.
Figure 5.1: This work decomposes a 2D image of generic objects into albedo, 3D shape, illumination, and camera projection.
Figure 5.2: Shape and albedo decoder networks. The shape decoder D_S takes a shape latent representation f_S and a spatial point x = (x, y, z) and produces the implicit field value for each branch. The output layer groups the branch outputs, via max pooling, to form the final spatial probability of occupancy. The albedo decoder D_A receives both latent representations f_S, f_A and estimates the albedo colors of 4 branches, one of which is selected by the shape branch/segmentation and returned as the albedo color of x.
Figure 5.3: Ray tracing for surface point detection. In Linear search, candidates (red points) are uniformly distributed in the grid. In Linear-Binary search, after the first point inside the object is found, Binary search is used between the last outside point and the current inside point for all remaining iterations.
Figure 5.4: Color voxelization of ShapeNet models. Original 3D mesh (left) and 64^3 colored voxels (right).
Figure 5.5: The shape decoder network is composed of 3 fully connected layers, denoted as "FC". The shape latent vector (128-dim) is concatenated, denoted "+", with the xyz query, making a 131-dim vector, and is provided as input to the first layer. The LeakyReLU activation is applied to the first 2 FC layers while the final value is obtained with Sigmoid activation, denoted as "Sig.".
Figure 5.6: The albedo decoder network is also composed of 3 fully connected layers. Specifically, it takes the point coordinate (x, y, z), along with the shape and albedo feature vectors, and outputs the RGB color value. 'TH' denotes Tanh activation.
Figure 5.7: One example of boundary point selection for local feature extraction.
Figure 5.8: Local feature distance under noise of different standard deviations.
Figure 5.9: 3D reconstruction using models learned with (third row) and without real images (second row). Higher quality reconstruction is observed in the bottom row.
Figure 5.10: Unsupervised segmentation results on the ShapeNet part dataset. We render the original meshes with different colors representing different parts.
Figure 5.11: Visualization of albedo branch outputs for our 5 categories. We render the albedo with the reconstructed mesh.
Figure 5.12: 3D image decomposition on real-world images. Our work decomposes a 2D image of generic objects into albedo, completed 3D shape and illumination.
Figure 5.13: Qualitative comparison for single-view 3D reconstruction on the ShapeNet, Pascal 3D+, and Pix3D datasets.
Figure 5.14: Qualitative comparison for single-view 3D reconstruction on real images from Pascal 3D+ (left) and Pix3D (right).
Figure 5.15: Additional 3D reconstruction results on the Pascal 3D+ [177] dataset.
Figure 5.16: Additional 3D reconstruction results on Pix3D [147]. For each input image, we show reconstructions by ShapeHD [174], and ground truth. Our reconstructions resemble the ground truth.
Figure A1: Given one or multiple in-the-wild face images as the input, DR-GAN can produce a unified identity representation, by virtually rotating the face to arbitrary poses. The learnt representation is both discriminative and generative, i.e., the representation is able to demonstrate superior PIFR performance, and synthesize identity-preserved faces at target poses specified by the pose code.
Figure A2: Comparison of previous GAN architectures and our proposed DR-GAN.
Figure A3: Generator in multi-image DR-GAN. From an image set of a subject, we can fuse the features into a single representation via dynamically learnt coefficients and synthesize images in any pose.
Figure A4: The mean faces of 13 pose groups in CASIA-WebFace. The blurriness shows the challenges of pose estimation for large poses.
Figure A5: Generated faces of DR-GAN and its partial variants.
Figure A6: Responses of the two filters with the highest responses to identity (left), and pose (right). Responses of each row are of the same subject, and each column of the same pose. Note the within-row similarity on the left and the within-column similarity on the right.
Figure A7: Coefficient distributions on IJB-A (a) and CFP (b). For IJB-A, we visualize images at four regions of the distribution. For CFP, we plot the distributions for frontal faces (blue) and profile faces (red) separately and show images at the heads and tails of each distribution.
Figure A8: The correlation between the estimated coefficients and the classification probabilities.
Figure A9: Face rotation comparison on Multi-PIE. Given the input (in illumination 07 and 75 degree pose), we show synthetic images of the L2 loss (top), adversarial loss (middle), and ground truth (bottom). Columns 2-5 show the ability of DR-GAN in simultaneous face rotation and re-lighting.
Figure A10: Interpolation of f(x), c, and z. (a) Synthetic images by interpolating between the identity representations of two faces (Columns 1 and 12). Note the smooth transition between different genders and facial attributes. (b) Pose angles 0, 15, 30, 45, 60, 75, 90 degrees are available in the training set. DR-GAN interpolates in-between unseen poses via continuous pose codes, shown above Row 3. (c) For each image at Column 1, DR-GAN synthesizes two images at z = -1 (Column 2) and z = 1 (Column 12), and in-between images by interpolating along the two z.
Figure A11: Face rotation on CFP: (a) input, (b) frontalized faces, (c) real frontal faces, (d) rotated faces at 15, 30, 45 degree poses. We expect the frontalized faces to preserve the identity, rather than all facial attributes. This is very challenging for face rotation due to the in-the-wild variations and extreme profile views. The artifact in the image boundary is due to image extrapolation in pre-processing. When the inputs are frontal faces with variations in roll, expression, or occlusions, the synthetic faces can remove these variations.
Figure A12: Face frontalization on IJB-A. For each of four subjects, we show 11 input images with estimated coefficients overlaid at the top left corner (first row) and their frontalized counterparts (second row). The last column is the ground truth frontal and the synthetic frontal from the fused representation of all 11 images. Note the challenges of large poses, occlusion, and low resolution, and our opportunistic frontalization.
Figure A13: Face frontalization on IJB-A for an image set (first subject) and a video sequence (second subject). For each subject, we show 11 input images (first row), their respective frontalized faces (second row) and the frontalized faces using incrementally fused representations from all previous inputs up to this image (third row). In the last column, we show the ground truth frontal face.

Chapter 1

Introduction and Contributions

Understanding 3D structure is a long-standing problem with much interest in computer vision. A human has no difficulty understanding the 3D structure of an object upon seeing its 2D image. Even without geometric cues (motion or stereopsis), our visual system can still infer detailed surfaces or plausibly hidden parts. Meanwhile, such a 3D inference task remains extremely challenging for computer vision systems.

One object in particular, the face, is highly studied, since obtaining a user-specific 3D face surface model is useful for many applications, including but not limited to face recognition [6, 102, 185], video editing [47, 155], avatar puppeteering [20, 23, 189] or virtual make-up [48, 83].

Inferring a 3D face mesh from a single photograph is arduous and ill-posed since the image formation process blends multiple components (shape, albedo) as well as the environment (lighting) into a single color for each pixel. To better handle the ambiguity, one must rely on additional prior assumptions, such as constraining 3D objects to lie in a restricted subspace, e.g., 3D Morphable Models (3DMM) [13] learned from a small collection of 3D scans.

Traditionally, a 3DMM is learnt through supervision by performing dimension reduction, typically Principal Component Analysis (PCA), on a training set of co-captured 3D face scans and 2D images. To model highly variable 3D face shapes, a large amount of high-quality 3D face scans is required. However, this requirement is expensive to fulfill as acquiring face scans is very laborious, in both the data capturing and post-processing stages. The first 3DMM [13] was built from scans of 200 subjects with a similar ethnicity/age group. They were also captured in well-controlled conditions, with only neutral expressions. Hence, it is fragile to large variances in face identity. The widely used Basel Face Model (BFM) [121] is also built with only 200 subjects in neutral expressions. Lack of expression can be compensated using expression bases from FaceWarehouse [24] or BD-3FE [183], which are learned from the offsets to the neutral pose. After more than a decade, almost all existing models use no more than 300 training scans. Such small training sets are far from adequate to describe the full variability of human faces [19]. Until recently, with a significant effort as well as a novel automated and robust model construction pipeline, Booth et al. [19] built the first large-scale 3DMM from scans of ~10,000 subjects, which is still not released to the public.
Second, the texture model of a 3DMM is normally built with a small number of 2D face images co-captured with 3D scans, under well-controlled conditions. Despite a considerable improvement in 3D acquisition devices in the last few years, these devices still cannot operate in arbitrary in-the-wild conditions. Therefore, all the current 3D facial datasets have been captured in laboratory environments. Hence, such models are only learnt to represent the facial texture in similar, rather than in-the-wild, conditions. This substantially limits the application scenarios of 3DMM.

Finally, the representation power of a 3DMM is limited by not only the size or type of training data but also its formulation. Facial variations are nonlinear in nature. E.g., the variations in different facial expressions or poses are nonlinear, which violates the linear assumption of PCA-based models. Thus, a PCA model is unable to interpret facial variations sufficiently well. This is especially true for facial texture. For all current 3DMM models, their low-dimension albedo subspace faces the same problem of lacking facial hair, e.g., beards. To reduce the fitting error, it compensates for unexplainable texture by altering the surface normal, or shrinking the face shape [198]. Either way, linear 3DMM-based applications often degrade in performance when handling out-of-subspace variations.

Given the barrier of 3DMM in its data, supervision and linear bases, this thesis aims to revolutionize the paradigm of learning 3DMM by answering a fundamental question:

Whether and how can we learn a nonlinear 3D Morphable Model of face shape and albedo from a set of in-the-wild 2D face images, without collecting 3D face scans?

If the answer were yes, this would be in sharp contrast to the conventional 3DMM approach, and remedy all aforementioned limitations. Fortunately, we have developed approaches to offer positive answers to this question. With the recent development of deep neural networks, we view that it is the right time to undertake this new paradigm of 3DMM learning. Therefore, the core of this thesis is regarding how to learn this new 3DMM, what is the representation power of the model, and what is the benefit of the model to facial analysis.

1.1 Thesis Contributions

In this thesis, we propose a novel paradigm to learn a nonlinear 3DMM model from a large in-the-wild 2D face image collection, without acquiring 3D face scans, by leveraging the power of deep neural networks to capture variations and structures in complex face data. The framework is also further extended to generic objects, with substantially larger shape deformation, thanks to a novel representation. In summary, this dissertation makes the following contributions:

- To overcome the shortage of annotated 3D data, we develop a framework to jointly learn the 3D model and the model fitting algorithm via weak supervision, by leveraging a large collection of 2D images without 3D scans. Two modules are optimized end-to-end with the objective to reconstruct the input image. This objective allows us to use any photographs for model training without any 3D labels.

- Different from previous methods that focus on modeling only 3D shape, the proposed nonlinear 3DMM fully models shape, albedo and lighting, which enables us to train the model in a weakly supervised fashion.

- By using neural networks to represent all model components, our model can better model nonlinear shape/albedo variations. Hence our model has greater representation power than its traditional linear counterpart.

- In realization that the strong regularization and global-based modeling are the roadblocks to achieving a high-fidelity 3DMM model, we propose to relax the regularization by using proxies and propose a global-local network architecture.

- To extend the learning framework to generic objects, which usually have large shape deformation as well as inconsistent shape topology, we propose a novel representation, the colored occupancy field, in which each 3D spatial point is classified as inside/outside the 3D shape as well as assigned an albedo color (see the sketch after this list).
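To make the colored occupancy idea concrete, here is a minimal Python sketch of querying one spatial point; the callables `shape_decoder` and `albedo_decoder` are hypothetical stand-ins for the trained networks described in Chapter 5, not the actual implementation:

import numpy as np

def query_colored_occupancy(point, f_S, f_A, shape_decoder, albedo_decoder):
    """Query one 3D point against a colored occupancy field (illustrative only).

    point: (3,) xyz coordinate; f_S, f_A: latent shape/albedo vectors.
    shape_decoder(point, f_S) -> probability that the point is inside the shape.
    albedo_decoder(point, f_S, f_A) -> RGB albedo assigned to the point.
    """
    occupancy = shape_decoder(point, f_S)   # value in [0, 1]
    is_inside = occupancy > 0.5             # hard inside/outside classification
    rgb = albedo_decoder(point, f_S, f_A) if is_inside else None
    return is_inside, rgb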
1.2 Thesis Organization

The rest of this dissertation is organized as follows. Chapter 2 gives more background introduction and reviews related work on 3D reconstruction. Chapter 3 develops the learning framework on nonlinear 3DMM. Chapter 4 improves the model in both learning objective and architecture. Chapter 5 presents the extension of the framework to generic objects with a novel representation, the colored occupancy field. Chapter 6 concludes this dissertation.

Chapter 2

Background and Related Work

Now that a basic understanding of the problem is known, I will present some background information and related work necessary for fully understanding this thesis.

2.1 3D Morphable Model

The 3D Morphable Model (3DMM) [13] and its 2D counterpart, the Active Appearance Model [37, 94, 91], provide parametric models for synthesizing faces, where faces are modeled using two components: shape and albedo (skin reflectance).

Blanz and Vetter [13] propose the first generic 3D face model learned from scan data. They define a linear subspace to represent shape and albedo using principal component analysis (PCA) and show how to fit the model to data. The 3D face space can be represented with PCA as:

S = S̄ + G α,    (2.1)

where S ∈ R^{3Q} is a 3D face mesh with Q vertices, S̄ ∈ R^{3Q} is the mean shape, and α ∈ R^{l_S} is the shape parameter corresponding to the 3D shape bases G. The shape bases can be further split into G = [G_id, G_exp], where G_id is trained from 3D scans with neutral expressions, and G_exp is from the offsets between expression and neutral scans.

Figure 2.1: The visual abstract of the seminal work by Blanz and Vetter [13]. It proposes a statistical model for faces to perform 3D reconstruction from 2D images and a parametric face space which enables controlled manipulation.

The albedo of the face A ∈ R^{3Q} is defined within the mean shape S̄, which describes the R, G, B colors of the Q corresponding vertices. A is also formulated as a linear combination of basis functions:

A = Ā + R β,    (2.2)

where Ā is the mean albedo, R is the albedo bases, and β ∈ R^{l_T} is the albedo parameter.

The 3DMM can be used to synthesize novel views of the face. Firstly, a 3D face is projected onto the image plane with the weak perspective projection model:

V = R S,    (2.3)

g(S, m) = V_{2D} = f · Pr · V + t_{2d} = M(m) [S; 1],    (2.4)

where g(S, m) is the projection function leading to the 2D positions V_{2D} of the 3D rotated vertices V, f is the scale factor, Pr = [1 0 0; 0 1 0] is the orthographic projection matrix, R is the rotation matrix constructed from three rotation angles (pitch, yaw, roll), and t_{2d} is the translation vector. While the projection matrix M is of size 2 × 4, it has six degrees of freedom, which is parameterized by a 6-dim vector m. Then, the 2D image is rendered using texture and an illumination model such as the Phong model [122] or Spherical Harmonics [125].
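As a concrete illustration of Eqns. 2.1-2.4, the following numpy sketch synthesizes a face from linear bases and projects it with the weak perspective model. The randomly initialized bases, latent sizes, and pose values are placeholders, not the real BFM model:

import numpy as np

Q = 53215                        # number of mesh vertices (see Sec. 3.3)
l_S, l_T = 40, 40                # latent sizes, chosen arbitrarily here

# Stand-ins for the learned PCA model: means and bases (Eqns. 2.1, 2.2).
S_mean = np.zeros(3 * Q); G = np.random.randn(3 * Q, l_S) * 0.01
A_mean = np.full(3 * Q, 0.5); R_bases = np.random.randn(3 * Q, l_T) * 0.01

alpha = np.random.randn(l_S)     # shape parameter
beta = np.random.randn(l_T)      # albedo parameter
S = (S_mean + G @ alpha).reshape(Q, 3)        # Eqn. 2.1
A = (A_mean + R_bases @ beta).reshape(Q, 3)   # Eqn. 2.2

# Weak perspective projection (Eqns. 2.3, 2.4): scale f, rotation, translation.
f, t2d = 1.0, np.array([0.0, 0.0])
yaw = np.deg2rad(30.0)           # rotation about the y axis only, for brevity
Rot = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                [0, 1, 0],
                [-np.sin(yaw), 0, np.cos(yaw)]])
V = S @ Rot.T                                 # Eqn. 2.3: V = R S
Pr = np.array([[1.0, 0, 0], [0, 1.0, 0]])     # orthographic projection matrix
V2D = f * (V @ Pr.T) + t2d                    # Eqn. 2.4: 2D vertex positions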
Since Blanz and Vetter's seminal work [13], there has been a large amount of effort on improving the 3DMM modeling mechanism. In [13], the dense correspondence between facial meshes is solved with a regularised form of optical flow. However, this technique is only effective in a constrained setting, where subjects share similar ethnicities and ages. To overcome this challenge, Patel and Smith [120] employ a Thin Plate Splines (TPS) warp [16] to register the meshes into a common reference frame. Alternatively, Paysan et al. [121] use a Nonrigid Iterative Closest Point [7] to directly align 3D scans. In a different direction, Amberg et al. [6] extended Blanz and Vetter's PCA-based model to emotive facial shapes by adopting an additional PCA modeling of the residuals from the neutral pose. This results in a single linear model of both identity and expression variation of the 3D facial shape. Vlasic et al. [166] use a multilinear model to represent the combined effect of identity and expression variation on the facial shape. Later, Bolkart and Wuhrer [15] show how such a multilinear model can be estimated directly from the 3D scans using a joint optimization over the model parameters and groupwise registration of 3D scans.

2.2 Improving Linear 3DMM

With PCA bases, the statistical distribution underlying a 3DMM is Gaussian. Koppen et al. [77] argue that a single-mode Gaussian cannot well represent the real-world distribution. They introduce the Gaussian Mixture 3DMM that models the global population as a mixture of Gaussian subpopulations, each with its own mean, but shared covariance. Booth et al. [17, 18] aim to improve the texture of 3DMM to go beyond controlled settings by learning an "in-the-wild" feature-based texture model. In another direction, Tran et al. [158] learn to regress a robust and discriminative 3DMM representation, by leveraging multiple images from the same subject. However, all these works are still based on statistical PCA bases. Duong et al. [112] address the problem of linearity in face modeling by using Deep Boltzmann Machines. However, they only work with 2D faces and sparse landmarks, and hence cannot handle faces with large-pose variations or occlusion well. Concurrent to our work, Tewari et al. [152] learn a (potentially non-linear) corrective model on top of a linear model. The final model is a summation of the base linear model and the learned corrective model, which contrasts to our model. Furthermore, our model has the advantage of using a 2D representation of both shape and albedo, which maintains spatial relations between vertices and leverages CNN power for image synthesis. Finally, thanks to our novel rendering layer, we are able to employ perceptual and adversarial losses to improve the reconstruction quality.

2.3 2D Face Alignment

2D Face Alignment [172, 90] can be cast as a regression problem where 2D landmark locations are regressed directly [42]. For large-pose or occluded faces, strong priors of 3D face shape from a 3D model have been shown to be beneficial [67]. Hence, there is increasing attention in conducting face alignment by fitting a 3D face model to a single 2D image [68, 193, 195, 86, 106, 71, 69]. Among the prior works, iterative approaches with cascades of regressors tend to be preferred. At each cascade, there is a single regressor [165, 67] or even two regressors [175] used to improve the prediction. Recently, Jourabloo and Liu [71] propose a CNN architecture that enables end-to-end training of their network cascade. Contrasted to the aforementioned works that use a fixed 3DMM model, our model and model fitting are learned jointly. This results in a more powerful model: a single-pass encoder, which is learned jointly with the model, achieves state-of-the-art face alignment performance on different benchmark datasets.
2.4 3D Face Reconstruction

Face reconstruction creates a 3D face model from an image collection [130, 131] or even from a single image [128, 139]. This long-standing problem draws a lot of interest because of its wide applications. 3DMM also demonstrates its strength in face reconstruction, especially in the monocular case. This problem is highly under-constrained, as with a single image, the information present about the surface is limited. Hence, 3D face reconstruction must rely on prior knowledge like 3DMM [132]. The statistical PCA linear 3DMM is the most commonly used approach. Besides 3DMM fitting methods [14, 55, 190, 43, 153, 88], recently, Richardson et al. [129] design a refinement network that adds facial details on top of the 3DMM-based geometry. However, this approach can only learn a 2.5D depth map, which loses the correspondence property of 3DMM. The follow-up work by Sela et al. [139] tries to overcome this weakness by learning a correspondence map. Despite having some impressive reconstruction results, both these methods are limited by training data synthesized from the linear 3DMM model. Hence, they fail to handle out-of-subspace variations, e.g., facial hair.

2.5 3D Object Modeling and Reconstruction

Recently, autoencoders have been widely used for 3D object modeling [65, 126, 85, 8, 38, 146] due to their efficient feature representation. These methods can be naturally applied to single-image 3D reconstruction. The reconstruction process encodes the input image with deep convolutional networks, and then uses the trained decoder to reconstruct the corresponding 3D shapes from the shape latent vectors. However, most of these methods suffer from the domain mismatch issue since the models are trained on synthetic data.

Another related direction, e.g., MarrNet [173] and ShapeHD [174], is to develop a two-step pipeline. They first recover 2.5D sketches (depth and normal maps), from which a voxelized 3D shape can be further inferred. The VON [192] method also benefits from this two-step process for realistic image synthesis. However, despite the fact that the use of 2.5D sketches can relax the burden on domain transfer and constrain the reconstructed 3D shape to be consistent with 2D observations, these methods still have two limitations: 1) even with high-resolution voxels, they are far from producing visually compelling shapes; 2) they do not learn disentangled and interpretable latent vectors that allow image manipulation under different conditions (e.g., pose and lighting).

Chapter 3

Learning 3D Face Morphable Model from In-the-wild Images

3.1 Introduction

The 3D Morphable Model (3DMM) is a statistical model of 3D facial shape and texture in a space where there are explicit correspondences [13]. The morphable model framework provides two key benefits: first, a point-to-point correspondence between the reconstruction and all other models, enabling "morphing", and second, modeling of underlying transformations between types of faces (male to female, neutral to smile, etc.). 3DMM has been widely applied in numerous areas including computer vision [13, 186, 159], computer graphics [5, 141, 154, 155], human behavioral analysis [6, 185] and craniofacial surgery [145].

This chapter is adapted from the following publications:
[1] Luan Tran and Xiaoming Liu, "Nonlinear 3D Face Morphable Model", in CVPR, 2018.
[2] Luan Tran and Xiaoming Liu, "On Learning 3D Face Morphable Model from In-the-wild Images", in TPAMI, 2019.

Given the barrier of 3DMM in its data, supervision and linear bases, we propose a novel paradigm to learn a nonlinear 3DMM model from a large in-the-wild 2D face image collection, without acquiring 3D face scans.

Figure 3.1: Conventional 3DMM employs linear bases models for shape/albedo, which are trained with 3D face scans and associated controlled 2D images. We propose a nonlinear 3DMM to model shape/albedo via deep neural networks (DNNs). It can be trained from in-the-wild face images without 3D scans, and also better reconstructs the original images due to the inherent nonlinearity.
As shown in Fig. 3.1, starting with the observation that the linear 3DMM formulation is equivalent to a single-layer network, using a deep network architecture naturally increases the model capacity. Hence, we utilize two convolutional neural network decoders, instead of two PCA spaces, as the shape and albedo model components, respectively. Each decoder takes a shape or albedo parameter as input and outputs the dense 3D face mesh or a face skin reflectance (albedo), respectively. These two decoders are essentially the nonlinear 3DMM.

Further, we learn the fitting algorithm to fit our nonlinear 3DMM, which is formulated as a CNN encoder. The encoder network takes a face image as input and generates the shape and albedo parameters, from which the two decoders estimate the shape and albedo.

The 3D face and albedo would perfectly reconstruct the input face, if the fitting algorithm and 3DMM are well learnt. Therefore, we design a differentiable rendering layer to generate a reconstructed face by fusing the 3D face, albedo, lighting, and the camera projection parameters estimated by the encoder. Finally, an end-to-end learning scheme is constructed where the encoder and two decoders are learnt jointly to minimize the difference between the reconstructed face and the input face. Jointly learning the 3DMM and the model fitting encoder allows us to leverage a large collection of in-the-wild 2D images without relying on 3D scans. We show significantly improved shape and facial texture representation power over the linear 3DMM. Consequently, this also benefits other tasks such as 2D face alignment, 3D face reconstruction, and face editing.

In summary, this chapter makes the following main contributions.

- We learn a nonlinear 3DMM model, fully modeling shape, albedo and lighting, that has greater representation power than its traditional linear counterpart.

- Both shape and albedo are represented as 2D images, which helps maintain spatial relations as well as leverage CNN power in image synthesis.

- We jointly learn the model and the model fitting algorithm via weak supervision, by leveraging a large collection of 2D images without 3D scans. The novel rendering layer enables the end-to-end training.

- The new 3DMM further improves performance in related facial analysis tasks: face alignment and face reconstruction.

3.2 The Proposed Nonlinear 3DMM

3.2.1 Nonlinear 3DMM

As mentioned in Sec. 3.1, the linear 3DMM has problems such as requiring 3D face scans for supervised learning, being unable to leverage massive in-the-wild face images for learning, and limited representation power due to the linear bases. We propose to learn a nonlinear 3DMM model using only large-scale in-the-wild 2D face images.

Figure 3.2: Jointly learning a nonlinear 3DMM and its fitting algorithm from an unconstrained 2D in-the-wild face image collection, in a weakly supervised fashion. L_S is a visualization of shading on a sphere with lighting parameters L.

3.2.1.1 Problem Formulation

In the linear 3DMM (Sec. 2.1), the factorization of each of the components (shape, albedo) can be seen as a matrix multiplication between coefficients and bases. From a neural network's perspective, this can be viewed as a shallow network with only one fully connected layer and no activation function. Naturally, to increase the model's representation power, the shallow network can be extended to a deep architecture. In this work, we design a novel learning scheme to jointly learn a deep 3DMM model and its inference (or fitting) algorithm.

Specifically, as shown in Fig. 3.2, we use two deep networks to decode the shape and albedo parameters into the 3D facial shape and albedo, respectively. To make the framework end-to-end trainable, these parameters are estimated by an encoder network, which is essentially the fitting algorithm of our 3DMM. Three deep networks join forces for the ultimate goal of reconstructing the input face image, with the assistance of a physically-based rendering layer. Fig. 3.2 visualizes the architecture of the proposed framework. Each component will be presented in the following sections.

Figure 3.3: Three albedo representations. (a) Albedo value per vertex, (b) Albedo as a 2D frontal face, (c) UV space 2D unwarped albedo.
Formally, given a set of K 2D face images {I_i}_{i=1}^K, we aim to learn an encoder E: I → (P, L, f_S, f_A) that estimates the projection matrix P, lighting parameter L, shape parameter f_S ∈ R^{l_S}, and albedo parameter f_A ∈ R^{l_A}; a 3D shape decoder D_S: f_S → S that decodes the shape parameter to a 3D shape S ∈ R^{3Q}; and an albedo decoder D_A: f_A → A that decodes the albedo parameter to a realistic albedo A ∈ R^{3Q}, with the objective that the rendered image with P, L, S, and A can well approximate the original image. Mathematically, the objective function is:

argmin_{E, D_S, D_A} Σ_{i=1}^{K} ‖ Î_i − I_i ‖_1,    (3.1)

Î = R(E_P(I), E_L(I), D_S(E_S(I)), D_A(E_A(I))),

where R(P, L, S, A) is the rendering layer (Sec. 3.2.1.3).

3.2.1.2 Albedo & Shape Representation

Fig. 3.3 illustrates three possible albedo representations. In the traditional 3DMM, albedo is defined per vertex (Fig. 3.3(a)). This representation is also adopted in recent work such as [153, 152]. There is an albedo intensity value corresponding to each vertex in the face mesh. Despite being widely used, this representation has its limitations. Since 3D vertices are not defined on a 2D grid, this representation is mostly parameterized as a vector, which not only loses the spatial relation of its vertices, but also prevents it from leveraging the convenience of deploying CNNs on the 2D albedo. In contrast, given the rapid progress in image synthesis, it is desirable to choose a 2D image, e.g., a frontal-view face image in Fig. 3.3(b), as an albedo representation. However, frontal faces contain little information of the two sides, which would lose much albedo information for side-view faces.

In light of these considerations, we use an unwrapped 2D texture as our texture representation (Fig. 3.3(c)). Specifically, each 3D vertex v is projected onto the UV space using a cylindrical unwarp. Assuming that the face mesh has the top pointing up the y axis, the projection of v = (x, y, z) onto the UV space v_uv = (u, v) is computed as:

v → α_1 · arctan(x/z) + β_1,  u → α_2 · y + β_2,    (3.2)

where α_1, α_2, β_1, β_2 are constant scale and translation scalars to place the unwrapped face into the image boundaries. Here, the per-vertex albedo A ∈ R^{3Q} can be easily computed by sampling from its UV space counterpart A_uv ∈ R^{U×V}:

A(v) = A_uv(v_uv).    (3.3)

Usually, this involves sub-pixel sampling via bilinear interpolation:

A(v) = Σ_{u′∈{⌊u⌋,⌈u⌉}, v′∈{⌊v⌋,⌈v⌉}} A_uv(u′, v′)(1 − |u − u′|)(1 − |v − v′|),    (3.4)

where v_uv = (u, v) is the UV space projection of v via Eqn. 3.2.

Albedo information is naturally expressed in the UV space, but spatial data can be embedded in the same space as well. Here, a 3D facial mesh can be represented as a 2D image with three channels, one for each spatial dimension x, y and z. Fig. 3.4 gives an example of this UV space shape representation S_uv ∈ R^{U×V}.

Figure 3.4: UV space shape representation. From left to right: individual channels for the x, y and z spatial dimensions and the combined shape image.

Table 3.1: The architectures of the E, D_A and D_S networks.

E network:
Layer    | Filter/Stride | Output Size
Conv11   | 7×7/2         | 112×112×32
Conv12   | 3×3/1         | 112×112×64
Conv21   | 3×3/2         | 56×56×64
Conv22   | 3×3/1         | 56×56×64
Conv23   | 3×3/1         | 56×56×128
Conv31   | 3×3/2         | 28×28×128
Conv32   | 3×3/1         | 28×28×96
Conv33   | 3×3/1         | 28×28×192
Conv41   | 3×3/2         | 14×14×192
Conv42   | 3×3/1         | 14×14×128
Conv43   | 3×3/1         | 14×14×256
Conv51   | 3×3/2         | 7×7×256
Conv52   | 3×3/1         | 7×7×160
Conv53   | 3×3/1         | 7×7×(l_S + l_A + 64)
AvgPool  | 7×7/1         | 1×1×(l_S + l_A + 64)
FC (m)   | 64            | 6
FC (L)   | 64            | 27

D_A / D_S network:
Layer    | Filter/Stride | Output Size
FC       |               | 6×7×320
FConv52  | 3×3/2         | 12×14×160
FConv51  | 3×3/1         | 12×14×256
FConv43  | 3×3/2         | 24×28×256
FConv42  | 3×3/1         | 24×28×128
FConv41  | 3×3/1         | 24×28×192
FConv33  | 3×3/2         | 48×56×192
FConv32  | 3×3/1         | 48×56×96
FConv31  | 3×3/1         | 48×56×128
FConv23  | 3×3/2         | 96×112×128
FConv22  | 3×3/1         | 96×112×64
FConv21  | 3×3/1         | 96×112×64
FConv13  | 3×3/2         | 192×224×64
FConv12  | 3×3/1         | 192×224×32
FConv11  | 3×3/1         | 192×224×3
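To make the cylindrical unwarp and bilinear sampling of Eqns. 3.2-3.4 concrete, here is a minimal numpy sketch; the scale/translation constants α_1, β_1, α_2, β_2 are not specified in the text, so the values below are assumptions:

import numpy as np

U_SIZE, V_SIZE = 192, 224
a1, b1 = 60.0, V_SIZE / 2.0      # assumed scale/translation for the v axis
a2, b2 = -1.0, U_SIZE / 2.0      # assumed scale/translation for the u axis

def uv_project(vertex):
    """Cylindrical unwarp of a 3D vertex onto the UV plane (Eqn. 3.2)."""
    x, y, z = vertex
    v = a1 * np.arctan2(x, z) + b1   # arctan(x/z), robust to z near zero
    u = a2 * y + b2
    return u, v

def sample_albedo(A_uv, vertex):
    """Bilinearly sample per-vertex albedo from the UV image (Eqns. 3.3, 3.4)."""
    u, v = uv_project(vertex)
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    color = np.zeros(3)
    for uu in (u0, u0 + 1):
        for vv in (v0, v0 + 1):
            w = (1 - abs(u - uu)) * (1 - abs(v - vv))    # bilinear weight
            color += w * A_uv[uu % U_SIZE, vv % V_SIZE]  # wrap to stay in bounds
    return color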
Representing3DfaceshapeinUVspaceallowustouseaCNNforshapedecoder D S instead ofusingamulti-layerperceptron(MLP)asinourpreliminaryversion[160].Avoidingusingwide 17 Figure3.5: Forwardandbackwardpassoftherenderinglayer. fully-connectedlayersallowustousedeepernetworkfor D S ,potentiallymodelmorecomplex shapevariations.Thisresultsinbetterresultsasbeingdemonstratedinourexperiment (Sec.3.3.1.2). Thereferenceshapeusedhasthemouthopen.Thischangehelpsthenetworktoavoidlearning alargegradientnearthetwolips'bordersintheverticaldirectionwhenthemouthisopen. Toregressthese2Drepresentationofshapeandalbedo,wecanemployCNNsasshapeand albedonetworksrespectively., D S , D A areCNNconstructedbymultiplefractionally- stridedconvolutionlayers.AftereachconvolutionisbatchnormandeLUactivation,exceptthelast convolutionlayersofencoderanddecoders.Theoutputlayerhasa tanh activationtoconstraint theoutputtobeintherangeof [ 1 ; 1 ] .ThedetailednetworkarchitectureispresentedinTab.3.1. 3.2.1.3In-NetworkPhysically-BasedFaceRendering Toreconstructafaceimagefromthealbedo A ,shape S ,lightingparameter L ,andprojection parameter m ,wearenderinglayer R ( m ; L ; S ; A ) torenderafaceimagefromtheabove parameters.Thisisaccomplishedinthreesteps,asshowninFig.3.5.Firstly,thefacialtextureis computedusingthealbedo A andthesurfacenormalmapoftherotatedshape N ( V )= N ( P ; S ) . Here,following[169],weassumedistantilluminationandapurely Lambertian surface 18 Hencetheincomingradiancecanbeapproximatedusingsphericalharmonics(SH)basisfunctions H b : R 3 ! R ,andcontrolledbycoef L .,thetextureinUVspace T uv 2 R U V iscomposedofalbedo A uv andshading C uv : T uv = A uv C uv = A uv B 2 å b = 1 L b H b ( N ( m ; S uv )) ; (3.5) where B isthenumberofsphericalharmonicsbands.Weuse B = 3,whichleadsto B 2 = 9 coefin L foreachofthreecolorchannels.Secondly,the3Dshape/mesh S isprojectedto theimageplaneviaEqn.2.4.Finally,the3DmeshisthenrenderedusingaZ-bufferrenderer, whereeachpixelisassociatedwithasingletriangleofthemesh, ‹ I ( m ; n )= R ( P ; L ; S uv ; A uv ) m ; n = T uv ( å v i 2 F uv ( g ; m ; n ) l i v i ) ; (3.6) where F ( g ; m ; n )= f v 1 ; v 2 ; v 3 g isanoperationreturningthreeverticesofthetrianglethatencloses thepixel ( m ; n ) afterprojection g ; F uv ( g ; m ; n ) isthesameoperationwithresultantverticesmapped intothereferencedUVspaceusingEqn.3.2.Inordertohandleocclusions,whenasinglepixel residesinmorethanonetriangle,thetrianglethatisclosesttotheimageplaneisselected.The locationofeachpixelisdeterminedbyinterpolatingthelocationofthreeverticesviabarycentric coordinates f l i g 3 i = 1 . Therearealternativedesignstoourrenderinglayer.Ifthetexturerepresentationisper vertex,asinFig.3.3(a),onemaywarptheinputimage I i ontothevertexspaceofthe3Dshape S , whosedistancetotheper-vertextexturerepresentationcanformareconstructionloss.Thisdesign isadoptedbytherecentworkof[153,152].Incomparison,ourrenderedimageisona 19 2Dgridwhilethealternativeisontopofthe3Dmesh.Asaresult,ourrenderedimagecanenjoy theconvenienceofapplyingtheperceptuallossoradversarialloss,whichisshowntobecritical inimprovingthequalityofsynthetictexture.Anotherdesignforrenderinglayerisimagewarping basedonthesplineinterpolation,asin[36].However,thiswarpingiscontinuous:everypixelin theinputwillmaptotheoutput.Hencethiswarpingoperationfailsintheoccludedregion.Asa result,Cole etal .[36]limittheirscopetoonlysynthesizingfrontal-viewfacesbywarpingfrom normalizedfaces. TheCUDAimplementationofourrenderinglayerispubliclyavailableat https://github. com/tranluan/Nonlinear_Face_3DMM . 
3.2.1.4 Occlusion-aware Rendering

Very often, in-the-wild faces are occluded by glasses, hair, hands, etc. Trying to reconstruct such abnormal occluded regions could make the model learning more difficult, or result in a model with external occlusions baked in. Hence, we propose to use a segmentation mask to exclude occluded regions from the rendering pipeline:

Î ← Î ⊙ M + I ⊙ (1 − M).    (3.7)

As a result, these occluded regions won't affect our optimization process. The foreground mask M is estimated using the segmentation method given by Nirkin et al. [113]. Examples of segmentation masks and rendering results can be found in Fig. 3.6.

Figure 3.6: Rendering with segmentation masks. Left to right: segmentation results, naive rendering, occlusion-aware rendering.

3.2.1.5 Model Learning

The entire network is end-to-end trained to reconstruct the input images, with the loss function:

L = L_rec(Î, I) + λ_lan L_lan + λ_reg L_reg,    (3.8)

where the reconstruction loss L_rec enforces the rendered image Î to be similar to the input I, the landmark loss L_lan enforces the geometry constraint, and the regularization loss L_reg encourages plausible solutions.

Reconstruction Loss. The main objective of the network is to reconstruct the original face via a disentangled representation. Hence, we enforce the reconstructed image to be similar to the original input image:

L^i_rec(Î, I) = (1/|V|) Σ_{q∈V} ‖ Î(q) − I(q) ‖_2,    (3.9)

where V is the set of all pixels in the images covered by the estimated face mesh. Different norms can be used to measure the closeness. To better handle outliers, we adopt the robust l_{2,1} norm, where the distance in the 3D RGB color space is based on l_2, and the summation over all pixels enforces sparsity based on the l_1-norm [155, 156].

To improve on the blurry reconstruction results of l_p losses, in our preliminary work [160], thanks to our rendering layer, we employed an adversarial loss to enhance the image realism. However, an adversarial objective only encourages the reconstruction to be close to the real image distribution, but not necessarily the input image. Also, it is known to be unstable to optimize. Here, we propose to use a perceptual loss to enforce the closeness between images Î and I, which overcomes both of the adversarial loss's weaknesses. Besides encouraging the pixels of the output image Î to exactly match the pixels of the input I, we encourage them to have similar feature representations as computed by the loss network φ:

L^f_rec(Î, I) = (1/|C|) Σ_{j∈C} (1/(W_j H_j C_j)) ‖ φ_j(Î) − φ_j(I) ‖²_2.    (3.10)

We choose VGG-Face [118] as our φ to leverage its face-related features and also because of simplicity. The loss is summed over C, a subset of layers of φ. Here φ_j(I) is the activation of the j-th layer of φ when processing the image I, with dimension W_j × H_j × C_j. This feature reconstruction loss is one of the perceptual losses widely used in different image processing tasks [66].

The reconstruction loss is a weighted sum of the two terms:

L_rec(Î, I) = L^i_rec(Î, I) + λ_f L^f_rec(Î, I).    (3.11)

Sparse Landmark Alignment. To help achieve better model fitting, which in turn helps to improve the model learning itself, we employ the landmark alignment loss, measuring the Euclidean distance between estimated and ground truth landmarks, as an auxiliary task:

L_lan = ‖ P [S(:, d); 1] − U ‖²_2,    (3.12)

where U ∈ R^{2×68} contains the manually labeled 2D landmark locations, and d is a constant 68-dim vector storing the indexes of the 68 3D vertices corresponding to the labeled 2D landmarks.
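As a minimal numpy sketch of Eqn. 3.12, the landmark loss selects the 68 landmark vertices, projects them with the current camera, and penalizes the squared distance to the labels:

import numpy as np

def landmark_loss(P, S, d, U):
    """Sparse landmark alignment loss of Eqn. 3.12.

    P: (2, 4) projection matrix; S: (3, Q) mesh vertices;
    d: (68,) integer indexes of the landmark vertices;
    U: (2, 68) manually labeled 2D landmarks.
    """
    S_lan = S[:, d]                                  # (3, 68) selected vertices
    S_hom = np.vstack([S_lan, np.ones((1, 68))])     # homogeneous coords, (4, 68)
    proj = P @ S_hom                                 # (2, 68) projected landmarks
    return np.sum((proj - U) ** 2)                   # squared l2 distance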
Different from traditional face alignment work where the shape bases are fixed, our work jointly learns the basis functions (i.e., the shape decoder D_S) as well. Minimizing the landmark loss while updating D_S only moves a tiny subset of vertices. If the shape S is represented as a vector and D_S is an MLP consisting of fully connected layers, vertices are independent; hence L_lan only adjusts 68 vertices. In case S is represented in the UV space and D_S is a CNN, local neighbor regions could also be modified. In both cases, updating D_S based on L_lan only moves a subset of vertices, which could lead to implausible shapes. Hence, when optimizing the landmark loss, we fix the decoder D_S and only update the encoder.

Also, note that different from some prior work [49], our network only requires ground-truth landmarks during training. It is able to predict landmarks via P and S during test time.

Regularizations. To ensure plausible reconstruction, we add a few regularization terms:

L_reg = L_sym(A) + λ_con L_con(A) + λ_smo L_smo(S).    (3.13)

Albedo Symmetry. As the face is symmetric, we enforce the albedo symmetry constraint:

L_sym(A) = ‖ A_uv − flip(A_uv) ‖_1.    (3.14)

Employed on the 2D albedo, this constraint can be easily implemented via a horizontal image flip operation flip().
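For example, the symmetry term of Eqn. 3.14 reduces to a one-line flip-and-compare on the UV albedo image; a numpy sketch:

import numpy as np

def albedo_symmetry_loss(A_uv):
    """Eqn. 3.14: l1 distance between the UV albedo and its horizontal flip.

    A_uv: (H, W, 3) albedo image; the flip is along the width axis, which
    corresponds to the left-right axis of the unwrapped face.
    """
    return np.sum(np.abs(A_uv - A_uv[:, ::-1, :]))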
3.3ExperimentalResults Theexperimentsstudythreeaspectsoftheproposednonlinear3DMM,intermsofitsexpressive- ness,representationpower,andapplicationstofacialanalysis.Usingfacialmeshtriangle byBaselFaceModel(BFM)[121],wetrainour3DMMusing300W-LPdataset[193],whichcon- tains122 ; 450in-the-wildfaceimages,inawideposerangefrom 90 to90 .Imagesareloosely squarecroppedaroundthefaceandscaleto256 256.Duringtraining,imagesofsize224 224 arerandomlycroppedfromtheseimagestointroducetranslationvariations. 25 SymConstInputOverlayAlbedoShadingTexture X XX Figure3.7: Effectofalbedoregularizations:albedosymmetry(sym)andalbedoconstancy(const).When thereisnoregularizationbeingused,shadingismostlybakedintothealbedo.Usingthesymmetryproperty helpstoresolvethegloballighting.Usingconstancyconstraintfutherremovesshadingfromthealbedo, whichresultsinabetter3Dshape. ThemodelisoptimizedusingAdamoptimizerwithalearningrateof0 : 001inbothtraining stages.Wesetthefollowingparameters: Q = 53 ; 215, U = 192 ; V = 224, l S = l T = 160. l values aresettomakelossestohavesimilarmagnitudes. 3.3.1AblationStudy 3.3.1.1EffectofRegularization AlbedoRegularization. Inthiswork,toregularizealbedolearning,weemploytwoconstraints toefremoveshadingfromalbedonamelyalbedosymmetryandconstancy.Todemonstrate theeffectoftheseregularizationterms,wecompareourfullmodelwithitspartialvariants:one withoutanyalbedoreqularizationandonewiththesymmetryconstraintonly.Fig.3.7shows visualcomparisonofthesemodels.Learningwithoutanyconstraintsresultsinthelightingis 26 InputOverlayShapeOverlayShape WithsmoothnessWithoutsmoothness Figure3.8: Effectofshapesmoothnessregularization. totallyexplainedbythealbedo,meanwhileistheshadingisalmostconstant(Fig.3.7(a)).Using symmetryhelptocorrectthegloballighting.However,symmetricgeometrydetailsarestillbaked intothealbedo(Fig.3.7(b)).Enforcingalbedoconstancyhelpstofurtherremoveshadingfrom it(Fig.3.7(c)).Combiningthesetworegularizationshelpstolearnplausiblealbedoandlighting, whichimprovestheshapeestimation. ShapeSmoothnessRegularization. Wealsoevaluatetheneedinshaperegularization.Fig.3.8 showsvisualcomparisonsbetweenourmodelanditsvariantwithouttheshapesmoothnesscon- straint.Withoutthesmoothnesstermthelearnedshapebecomesnoisyespeciallyontwosidesof theface.Thereasonisthat,thehairregionisnotcompletelyexcludedduringtrainingbecauseof imprecisesegmentationestimation. 3.3.1.2ModelingLightingandShapeRepresentation Inthiswork,wemaketwomajoralgorithmicdifferenceswithourpreliminarywork[160]:incor- poratinglightingintothemodelandchangingtheshaperepresentation. Ourpreviouswork[160]modelsthetexturedirectly,whilethisworkdisentanglestheshading fromthealbedo.Asargued,modelingthelightingshouldhaveapositiveimpactonshapelearning. 27 Table3.2: FacealignmentperformanceonALFW2000. MethodLightingUVshapeNME Our[160]4.70 Our X 4.30 Our XX 4 : 12 Hencewecompareourmodelswithresultsfrom[160]infacealignmenttask. Also,inourpreliminarywork[160],aswellasintraditional3DMM,shapeisrepresentedas avector,whereverticesareindependent.Despitethisshortage,thisapproachhasbeenwidely adoptedduetoitssimplicityandsamplingefy.Inthiswork,weexploreanalternativeto thisrepresentation:representthe3Dshapeasapositionmapinthe2DUVspace.Thisrepresen- tationhasthreechannels:oneforeachspatialdimension.Thisrepresentationmaintainsthespatial relationamongfacialmesh'svertices.Also,wecanuseCNNastheshapedecoderreplacingan expensiveMLP.Herewealsoevaluatetheperformancegainbyswitchingtothisrepresentation. 
Tab. 3.2 reports the performance on the face alignment task of different variants. As a result, modeling lighting helps to reduce the error from 4.70 to 4.30. Using the 2D representation, with the convenience of using a CNN, the error is further reduced to 4.12.

3.3.1.3 Comparison to Autoencoders

We compare our model-based approach with a convolutional autoencoder in Fig. 3.9. The autoencoder network has a similar depth and model size as ours. It gives blurry reconstruction results, as the dataset contains large variations in face appearance, pose angle, and even diverse backgrounds. Our model-based approach obtains sharper reconstruction results and provides semantic parameters allowing access to different components, including the 3D shape, albedo, lighting, and projection matrix.

Figure 3.9: Comparison to convolutional autoencoders (AE). Our approach produces results of higher quality. Also, it provides access to the 3D facial shape, albedo, lighting, and projection matrix.

3.3.2 Expressiveness

Exploring the feature space. We feed the entire CelebA dataset [97] with ~200k images to our network to obtain the empirical distribution of our shape and texture parameters. By varying the mean parameter along each dimension proportionally to its standard deviation, we can get a sense of how each element contributes to the shape and texture. We sort elements in the shape parameter f_S based on their differences to the mean 3D shape. Fig. 3.10 shows four examples of shape changes, whose differences rank No. 1, 40, 80, and 120 among 160 elements. Most of the top changes are expression related. Similarly, in Fig. 3.11, we visualize different texture changes by adjusting only one element of f_A off the mean parameter f̄_A. The elements with the same four ranks as the shape counterparts are selected.

Attribute Embedding. To better understand the different shape and albedo instances embedded in our two decoders, we dig into their attribute meaning. For a given attribute, e.g., male, we feed images with that attribute $\{I_i\}_{i=1}^{n}$ into our encoder E to obtain two sets of parameters $\{f^i_S\}_{i=1}^{n}$ and $\{f^i_A\}_{i=1}^{n}$. These sets represent the corresponding empirical distributions of the data in the low-dimensional spaces. Computing the mean parameters f̄_S, f̄_A and feeding them into their respective decoders, also using the mean lighting parameter, we can reconstruct the mean shape and texture with that attribute. Fig. 3.12 visualizes the reconstructed textured 3D mesh related to some attributes. Differences among attributes are present in both shape and texture. Here we can observe the power of our nonlinear 3DMM to model small details such as "bags under eyes" or "rosy cheeks", etc.

Figure 3.10: Each column shows shape changes when varying one element of f_S, by 10 times the standard deviation, in opposite directions. Ordered by the magnitude of shape changes.

Figure 3.11: Each column shows albedo changes when varying one element of f_A in opposite directions.

Figure 3.12: Nonlinear 3DMM generates shape and albedo embedded with different attributes (male, mustache, bags under eyes, old, female, rosy cheeks, bushy eyebrows, smiling).

3.3.3 Representation Power

We compare the representation power of the proposed nonlinear 3DMM vs. the traditional linear 3DMM.

Albedo. Given a face image, assuming we know the ground truth shape and projection parameters, we can unwarp the texture into the UV space, as we generate the "pseudo ground truth" texture in
the weakly supervised training step. With the ground truth texture, by using gradient descent, we can jointly estimate a lighting parameter L and an albedo parameter f_A whose decoded texture matches the ground truth. Alternatively, we can minimize the reconstruction error in the image space, through the rendering layer, with the ground truth S and P. Empirically, the two methods give similar performance, but we choose the first option as it involves only one warping step, instead of doing rendering in every optimization iteration. For the linear model, we use the albedo bases of the Basel Face Model (BFM) [121]. As in Fig. 3.13, our nonlinear texture is closer to the ground truth than the linear model. This is expected since the linear model is trained with controlled images. Quantitatively, our nonlinear model has a significantly lower averaged L1 reconstruction error than the linear model (0.053 vs. 0.097, as in Tab. 3.3).

Figure 3.13: Texture representation power comparison. Our nonlinear model can better reconstruct the facial texture.

Table 3.3: Quantitative comparison of texture representation power (average reconstruction error on the non-occluded face portion).

Method | Linear | Nonlinear w. Grad. Desc. | Nonlinear w. Network
L1     | 0.062  | 0.053                    | 0.057

3D Shape. We also compare the power of nonlinear and linear 3DMMs in representing real-world 3D scans. We compare with BFM [121], the most commonly used 3DMM at present. We use ten 3D face scans provided by [121], which are not included in the training set of BFM. As these face meshes are already registered using the same triangle definition as BFM, no registration is necessary. Given the ground truth shape, by using gradient descent, we can estimate a shape parameter whose decoded shape matches the ground truth. We define the matching criterion on both vertex distances and surface normal directions. This empirically improves the fidelity of the results compared to only optimizing vertex distances. Also, to emphasize the compactness of nonlinear models, we train different models with different latent space sizes. Fig. 3.14 shows the visual quality of the two models' reconstructions. Our reconstructions closely match the face shapes' details. To quantify the difference, we use NME, the averaged per-vertex error between the recovered and ground truth shapes, normalized by the inter-ocular distance. Our nonlinear model has a significantly smaller reconstruction error than the linear model, 0.0146 vs. 0.0241 (Tab. 3.4). Also, the nonlinear models are more compact. They can achieve similar performance to linear models whose latent space sizes are doubled.

Figure 3.14: Shape representation power comparison (l_S = 160). The error maps show the normalized per-vertex error.

Table 3.4: 3D scan reconstruction comparison (NME).

Latent size l_S | 40     | 80     | 160
Linear          | 0.0321 | 0.0279 | 0.0241
Nonlinear [160] | 0.0277 | 0.0236 | 0.0196
Nonlinear       | 0.0268 | 0.0214 | 0.0146

3.3.4 Applications

Having shown the capability of our nonlinear 3DMM (i.e., the two decoders), we now demonstrate the applications of our entire network, which has the additional encoder. Many applications of 3DMM are centered on its ability to fit to 2D face images. Similar to the linear 3DMM, our nonlinear 3DMM can be utilized for model fitting, which decomposes a 2D face into its shape, albedo, and lighting. Fig. 3.15 visualizes our 3DMM fitting results on the AFLW2000 and CelebA datasets. Our encoder estimates the shape S and albedo A, as well as the lighting L and projection matrix P. We can recover personal facial characteristics in both shape and albedo. Our albedo can present facial hair, which is normally hard to recover with a linear 3DMM.

Figure 3.15: 3DMM fitting to faces with diverse skin color, pose, expression, lighting, and facial hair, faithfully recovering these cues. The left half shows results from the AFLW2000 dataset; the right half shows results from CelebA.

Figure 3.16: Our face alignment results. Invisible landmarks are marked in red. We can handle extreme pose, lighting, and expression well.
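At inference time, the model fitting application above is a single feed-forward decomposition. A minimal sketch of that flow is below, assuming encoder/decoder modules with the interfaces described in this chapter (E producing f_S, f_A, L, P; D_S and D_A mapping latent vectors to UV-space shape and albedo); the function and field names are illustrative.

```python
import torch

def decompose_face(image, E, D_S, D_A, render):
    # One forward pass decomposes a 2D face into its intrinsic components.
    # E:   image -> (f_S, f_A, L, P)   with projection P and SH lighting L
    # D_S: f_S -> UV position map S    D_A: f_A -> UV albedo map A
    with torch.no_grad():
        f_S, f_A, L, P = E(image)
        S = D_S(f_S)                 # 3D shape (UV position map)
        A = D_A(f_A)                 # albedo, free of shading
        recon = render(S, A, L, P)   # differentiable rendering layer
    return {'shape': S, 'albedo': A, 'lighting': L,
            'projection': P, 'reconstruction': recon}
```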
3.3.4.1 Face Alignment

Face alignment is a critical step for many facial analysis tasks such as face recognition [162, 163]. With the enhancement in our modeling, we hope to improve this task (Fig. 3.16). We compare face alignment performance with state-of-the-art methods, 3DDFA [193], DeFA [96], 3D-FAN [22] and PRN [46], on the AFLW2000 dataset in both 2D and 3D settings.

The accuracy is evaluated using the Normalized Mean Error (NME) as the evaluation metric, with the bounding box size as the normalization factor [22]. For a fair comparison with these methods in terms of computational complexity, we use ResNet18 [60] as our encoder for this comparison. Here, 3DDFA and DeFA use the linear 3DMM model (BFM). Even though they are trained with a larger training corpus (DeFA) or use a cascade of CNNs iteratively refining the estimation (3DDFA), these methods are still outperformed by our nonlinear model (Fig. 3.17). Meanwhile, 3D-FAN and PRN achieve competitive performance by bypassing the linear 3DMM model. 3D-FAN uses a heatmap representation. PRN uses the position map representation, which shares a similar spirit to our UV representation. Not only does our model outperform these methods in terms of regressing landmark locations (Fig. 3.17), it also directly provides head pose information as well as the facial albedo and environment lighting condition.

Figure 3.17: Face alignment Cumulative Errors Distribution (CED) curves on AFLW2000-3D on 2D (left) and 3D landmarks (right). NMEs are shown in the legend boxes.

3.3.4.2 3D Face Reconstruction

We compare our approach to three recent representative face reconstruction works: 3DMM fitting networks learned in an unsupervised (Tewari et al. [153, 152]) or supervised fashion (Sela et al. [139]), and also a non-3DMM approach (Jackson et al. [63]).

MoFA, the monocular reconstruction work by Tewari et al. [153], is relevant to us as they also learn to fit 3DMM in an unsupervised fashion. Even when trained on in-the-wild images, their method is still limited to the linear bases. Hence their reconstructions suffer from surface shrinkage when dealing with challenging texture, i.e., facial hair (Fig. 3.18). Our network faithfully models these in-the-wild textures, which leads to better 3D shape reconstruction.

Concurrently, Tewari et al. [152] try to improve the linear 3DMM representation power by learning a corrective space on top of a traditional linear model. Despite sharing a similar spirit, our model exploits the spatial relation between neighboring vertices and uses CNNs as shape/albedo decoders, which are more efficient than MLPs. As a result, our reconstructions more closely match the input images in both texture and shape (Fig. 3.19).

Figure 3.18: 3D reconstruction results comparison to Tewari et al. [153]. Their reconstructed shapes suffer from surface shrinkage when dealing with challenging texture or shape outside the linear model subspace. They cannot handle large pose variation well either. Meanwhile, our nonlinear model is more robust to these variations.

Figure 3.19: 3D reconstruction results comparison to Tewari et al. [152]. Our model better reconstructs the input image in both texture (facial hair direction in the first image) and shape (nasolabial folds in the second image).

The high-quality 3D reconstruction works by Richardson et al. [128, 129] and Sela et al. [139] obtain impressive results on adding fine-level details to the face shape when images are within the span of the synthetic training corpus or the employed 3DMM model. However, their performance degrades when dealing with variations outside the training data span, e.g., facial hair. Our approach is not only robust to facial hair and make-up, but also automatically learns to reconstruct such variations based on the jointly learned model. We provide comparisons with them in Fig. 3.20, using the code provided by the authors.

Figure 3.20: 3D reconstruction results comparison to Sela et al. [139]. Besides showing the shape, we also show their estimated depth and correspondence maps. Facial hair or occlusion can cause serious problems in their output maps.
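The NME numbers reported in this section all follow the same recipe: the mean per-landmark (or per-vertex) Euclidean error divided by a normalization factor, the bounding box size for alignment [22] or the inter-ocular distance for 3D scan evaluation. A minimal sketch, assuming predicted and ground truth landmark arrays in correspondence; the sqrt(w*h) box normalization is one common convention, shown here as an example:

```python
import numpy as np

def nme(pred, gt, norm_factor):
    # pred, gt: (N, 2) or (N, 3) landmark/vertex arrays in correspondence.
    # norm_factor: e.g., sqrt(box_w * box_h) for alignment on AFLW2000,
    # or the inter-ocular distance for 3D scan evaluation.
    errors = np.linalg.norm(pred - gt, axis=1)
    return errors.mean() / norm_factor

# Example: bounding-box-normalized NME for one face.
# score = nme(pred_landmarks, gt_landmarks, np.sqrt(w * h))
```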
The current state-of-the-art method by Sela et al. [139] consists of three steps: an image-to-image network estimating a depth map and a correspondence map, non-rigid registration, and a detail reconstruction step. Their image-to-image network is trained on synthetic data generated by the linear model. Besides the domain gap between synthetic and real images, this network faces a more serious problem: the lack of facial hair in the low-dimensional texture subspace of the linear model. The network's output tends to ignore these unexplainable regions (Fig. 3.20), which leads to failure in later steps. Our network is more robust in handling these in-the-wild variations. Furthermore, our approach is orthogonal to Sela et al. [139]'s detail reconstruction module or Richardson et al. [129]'s refinement step. Employing these on top of our fitting could lead to promising further improvement.

We also compare our approach with a non-3DMM approach, VRN by Jackson et al. [63]. To avoid using the low-dimensional subspace of the linear 3DMM, it directly regresses a volumetric representation of the 3D shape via an encoder-decoder network with skip connections. This potentially helps the network to explore a larger solution space than the linear model, however at the cost of losing correspondence between facial meshes. Fig. 3.21 shows a visual comparison of 3D reconstructions between VRN and ours. In general, VRN robustly handles in-the-wild texture variations. However, because of the volumetric shape representation, the surface is not smooth and is partially limited in presenting medium-level details as ours does. Also, our model further provides the projection matrix, lighting, and albedo, which makes it applicable to more applications.

Figure 3.21: 3D reconstruction results comparison to VRN by Jackson et al. [63] on the CelebA dataset. The volumetric shape representation results in non-smooth 3D shapes and loses correspondence between reconstructed shapes.

Figure 3.22: 3D reconstruction quantitative evaluation on FaceWarehouse. We obtain a lower error compared to PRN [46] and 3DDFA+ [195].

Quantitative Comparisons. To quantitatively compare our method with prior works, we evaluate monocular 3D reconstruction performance on FaceWarehouse [24] and the Florence dataset [9], in which ground truth 3D shapes are available. Due to the difference in mesh topology, ICP [7] is used to establish correspondence between estimated shapes and ground truth point clouds. Similar to previous experiments, NME (averaged per-vertex error normalized by the inter-ocular distance) is used as the comparison metric.

FaceWarehouse. We compare our method with prior works with available pretrained models on all 19 expressions of the 150 subjects of the FaceWarehouse database [24]. Visual and quantitative comparisons are shown in Fig. 3.22. Our model can faithfully resemble the input expression and surpasses all other regression methods (PRN [46] and 3DDFA+ [195]) in terms of dense face alignment.

Florence. Using the experimental setting proposed in [63], we also quantitatively compare our approach with state-of-the-art methods (e.g., VRN [63] and PRN [46]) on the Florence dataset [9]. Each subject is rendered with multiple poses: pitch rotations of -15°, 20° and 25°, and yaw rotations between -80° and 80°. Our model consistently outperforms other methods across different view angles (Fig. 3.23).

Figure 3.23: 3D face reconstruction results on the Florence dataset [9]: (a) CED curves, (b) NME. The NME of each method is shown in the legend.

Table 3.5: Running time of various 3D face reconstruction methods.
Method            | Encoder | Decoder    | Post-processing | Rendering
Sela et al. [139] | ~10 ms  | -          | ~180 s          | -
VRN [63]          | ~10 ms  | -          | -               | -
MoFA [153]        | ~4 ms   | Negligible | -               | -
Our               | 2.7 ms  | 5.5 ms     | -               | 140 ms

3.3.5 Runtime

In this section, we compare the running time of multiple 3D reconstruction approaches. Since different methods are implemented in different frameworks/languages, this comparison aims only to provide relative comparisons between them. Sela et al. [139] and VRN [63] both use an encoder-decoder network with skip connections, with similar runtime. However, Sela et al. [139] requires an expensive non-rigid registration step as well as a refinement module. We get a comparable encoder running time to the 3DMM regression network of MoFA [153]. However, since they directly use linear bases, their decoding step is trivial, a single matrix multiplication; our model requires decoding features via two CNNs for shape and texture, respectively. We also note that the running time of the rendering layer is higher than that of the other components. Luckily, rendering to reconstruct the input is not required during testing.

3.4 Conclusions

Since its debut in 1999, 3DMM has become a cornerstone of facial analysis research with applications to many problems. Despite its impact, it has drawbacks in requiring training data of 3D scans, learning from controlled 2D images, and limited representation power due to linear bases for both shape and texture. These drawbacks could be formidable when fitting 3DMM to unconstrained faces, or learning 3DMM for generic objects such as shoes. This work demonstrates that there exists an alternative approach to 3DMM learning, where a nonlinear 3DMM can be learned from a large set of in-the-wild face images without collecting 3D face scans. Further, the model fitting algorithm can be learned jointly with the 3DMM, in an end-to-end fashion.

Our experiments cover diverse aspects of our learned model, some of which might need the subjective judgment of the readers. We hope that both the judgment and the quantitative results can be viewed under the context that, unlike linear 3DMM, no genuine 3D scans are used in our learning. Finally, we believe that unsupervisedly or weakly-supervisedly learning 3D models from large-scale in-the-wild 2D images is one promising research direction. This work is one step along this direction.

Chapter 4

Towards High-Fidelity Nonlinear 3D Face Morphable Model

This chapter is adapted from the following publication:
[1] Luan Tran, Feng Liu, and Xiaoming Liu, "Towards High-Fidelity Nonlinear 3D Face Morphable Model," in CVPR, 2019.

4.1 Introduction

In Chapter 3, we presented our proposed framework using deep neural networks to represent the 3DMM basis functions, to increase the model representation power, and to learn the model directly from unconstrained 2D images to better capture in-the-wild variations.

However, even with better representation power, this model and related works [152] still rely on many constraints to regularize the model learning. Hence, their objective involves the conflicting requirements of a strong regularization for a global shape vs. a weak regularization for capturing higher-level details. For example, in order to faithfully separate shading and albedo, albedo is usually assumed to be piecewise constant [82, 144], which prevents learning albedo with a high level of details. In this chapter, besides learning the shape and the albedo, we propose to learn additional shape and albedo proxies, on which we can enforce regularizations. This also allows us to flexibly pair the true shape with the strongly regularized albedo proxy to learn the detailed shape, or vice versa. As a result, each element can be learned with high fidelity without sacrificing the other element's quality.

On a different note, many 3DMM models fail to represent small details because of their parameterization. Many global 3D face parameterizations have been proposed to overcome the ambiguities associated with monocular face fitting, such as noise or occlusion. However, because they are designed to model the whole face at once, it is difficult to use them to represent small details.
Meanwhile, local-based models offer more flexibility than global approaches, but at the cost of being less constrained to realistically represent human faces. We propose using dual-pathway networks to provide a better balance between global and local-based models. From the latent space, there is a global pathway focusing on the inference of the global face structure and multiple local pathways generating details of different semantic facial parts. Their corresponding features are then fused together in successive processing to generate the final shape and albedo. This network also helps the local pathways to specialize in their facial parts, which both improves the quality and saves computation power.

In this chapter, we improve the nonlinear 3D face morphable model in both the learning objective and the network architecture:

- We solve the conflicting objective problem by learning additional shape and albedo proxies with proper regularization.
- The novel pairing scheme allows learning both detailed shape and albedo without sacrificing one's quality.
- The global-local-based network architecture offers a better balance between model robustness and flexibility.
- The proposed model allows, for the first time, 3D face reconstruction by solely optimizing latent representations.

Figure 4.1: The proposed framework. Each shape or albedo decoder consists of two branches to reconstruct the true element and its proxy. Proxies free the shape and albedo from strong regularizations, allowing them to learn models with a high level of details.

4.2 Proposed Method

4.2.1 Nonlinear 3DMM with Proxy and Residual

Recall from the last chapter that, in the original nonlinear 3DMM, the overall objective can be summarized as:

$L = L_{recon}(\hat{I}, I) + L_{lan} + L_{reg}$,  (4.1)

with

$L_{reg} = L_{sym}(A) + \lambda_{con} L_{con}(A) + \lambda_{smo} L_{smo}(S)$.  (4.2)

Proxy and Residual Learning. Strong regularization has been shown to be critical in ensuring the plausibility of the learned models [152, 161]. However, the strong regularization also prevents the model from recovering a high level of details in either shape or albedo. Hence, this prevents us from achieving the ultimate goal of learning a high-fidelity 3DMM model.

In this work, we propose to learn an additional proxy shape ($\breve{S}$) and proxy albedo ($\breve{A}$), on which we can apply the regularization. All presented regularizations will now be moved to the proxies:

$L_{reg} = L_{sym}(\breve{A}) + \lambda_{con} L_{con}(\breve{A}) + \lambda_{smo} L_{smo}(\breve{S})$.  (4.3)

There will be no regularization applied directly to the actual shape S and albedo A, other than a weak regularization encouraging each to be close to its proxy:

$L_{res} = \| \Delta S \|_1 + \| \Delta A \|_1 = \| S - \breve{S} \|_1 + \| A - \breve{A} \|_1$.  (4.4)

By pairing the two shapes S, $\breve{S}$ and the two albedos A, $\breve{A}$, we can render four different output images (Fig. 4.1). Any of them can be used to compare with the original input image. We rewrite our reconstruction loss as:

$L_{rec} = L_{rec}(\hat{I}(\breve{S}, \breve{A}), I) + L_{rec}(\hat{I}(\breve{S}, A), I) + L_{rec}(\hat{I}(S, \breve{A}), I)$.  (4.5)

Pairing strongly regularized proxies and weakly regularized components is a critical point in our approach. Using proxies allows us to learn high-fidelity shape and albedo without sacrificing the quality of either component. This pairing is inspired by the observation that Shape-from-Shading techniques are able to recover a detailed face mesh by assuming an over-regularized albedo, or even using the mean albedo [129]. Here, the $L_{rec}(\hat{I}(S, \breve{A}), I)$ loss promotes S to recover more details, as $\breve{A}$ is constrained by the piecewise-constant $L_{con}(\breve{A})$ objective. Vice versa, $L_{rec}(\hat{I}(\breve{S}, A), I)$ aims to learn a better albedo. In order for these two losses to work as desired, the proxies $\breve{S}$ and $\breve{A}$ should perform well enough to approximate the input images by themselves. Without $L_{rec}(\hat{I}(\breve{S}, \breve{A}), I)$, a valid solution that minimizes $L_{rec}(\hat{I}(S, \breve{A}), I)$ is a combination of a constant albedo proxy and a noisy shape creating surface normals with dark shading in necessary regions, i.e., eyebrows.
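A compact sketch of this pairing scheme follows, assuming a differentiable `render(S, A, L, P)` layer and an image-space reconstruction loss `l_rec` as introduced above; the function and argument names are illustrative. Note that the (S, A) pairing is deliberately left out of the reconstruction loss, as discussed next.

```python
def pairing_losses(S, S_proxy, A, A_proxy, L, P, image, render, l_rec):
    # Eqn. 4.5: render three of the four shape/albedo pairings and compare
    # each against the input image. Pairing the detailed shape S with the
    # strongly regularized proxy albedo pushes details into S (and vice versa).
    loss_rec = (l_rec(render(S_proxy, A_proxy, L, P), image)
                + l_rec(render(S_proxy, A, L, P), image)
                + l_rec(render(S, A_proxy, L, P), image))
    # Eqn. 4.4: weak residual regularization keeping S, A near their proxies.
    loss_res = (S - S_proxy).abs().mean() + (A - A_proxy).abs().mean()
    return loss_rec, loss_res
```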
Another notable design choice is that we intentionally leave out the loss function on $\hat{I}(S, A)$, even though this theoretically is the most important objective. This is to avoid the case where the shape S learns an in-between solution that works well with both $\breve{A}$ and A, and vice versa.

Occlusion Imputation. With the proposed objective function, our model is able to faithfully reconstruct input images. However, we empirically found that, besides the visible regions, the model tends to keep the invisible regions smooth, since there is no supervision on those areas other than the residual magnitude loss pulling the shape and albedo closer to their proxies. To learn a more meaningful model, which is beneficial to other applications, i.e., face editing or face synthesis, we propose to use a soft symmetry loss [159] on occluded regions:

$L_{res\text{-}sym}(S) = \big\| T \odot \big( \Delta S^{uv}_z - \mathrm{flip}(\Delta S^{uv}_z) \big) \big\|_1$,  (4.6)

where T is a mask in UV space indicating the visibility of each pixel, approximated based on the current surface normal direction. Even though the shape itself is not symmetric, i.e., a face with an asymmetric expression, we enforce the symmetry property on its depth residual.

4.2.2 Global-Local-Based Network Architecture

While global-based models are usually robust to noise and mismatches, they are usually over-constrained and do not provide sufficient flexibility to represent high-frequency deformations as local-based models do. In order to take the best of both worlds, we propose to use dual-pathway networks for our shape and albedo decoders.

Here, we transfer the success of combining local and global models in image synthesis [110, 62] to 3D face modeling. The general architecture of a decoder is shown in Fig. 4.2. From the latent vector, there is a global pathway focusing on the inference of the global structure and a local pathway with four small sub-networks generating details of different facial parts, including the eyes, nose, and mouth. The global pathway is built from fractionally strided convolution layers with five up-sampling steps. Meanwhile, each sub-network in the local pathway has a similar architecture but is shallower, with only three up-sampling steps. Using different small sub-networks for each facial part offers two benefits: i) with fewer up-sampling steps, the network is better able to represent high-frequency details in early layers; ii) each sub-network can learn part-specific filters, which is more computationally efficient than applying filters across the global face.

Figure 4.2: The proposed global-local-based network architecture.

As shown in Fig. 4.2, to fuse the two pathways' features, we first integrate the four local pathways' outputs into one single feature tensor. Different from other works that synthesize face images with different yaw angles [162, 163, 73] with no fixed keypoint locations, our 3DMM generates the facial albedo as well as the 3D shape in the UV space with a fixed topology. Merging these local feature tensors is efficiently done with a zero-padding operation. A max-pooling fusion strategy is also used to reduce the stitching artifacts in the overlapping areas. The resultant feature is simply concatenated with the global pathway's feature, which has the same spatial resolution. Successive convolution layers integrate information from both pathways and generate the final albedo/shape (or their proxies).

Figure 4.3: Reconstruction results with different loss functions (input; $l_{2,1}$; $l_{2,1}$ + gradient difference; $l_{2,1}$ + perceptual).

4.3 Experimental Results

The experiments study different aspects of the proposed nonlinear 3DMM, in terms of its representation power and applications to facial analysis. The model is trained following the same settings as in Chapter 3, including the training dataset, mesh topology, and optimizer parameters.

4.3.1 Ablation Study

Reconstruction Loss Functions.
We study the effects of different reconstruction losses on the quality of the reconstructed images (Fig. 4.3). As expected, the model trained with the $l_{2,1}$ loss only results in blurry reconstructions, similar to other $l_p$ losses. To make the reconstruction more realistic, we explore other options such as the gradient difference [104] or perceptual loss [66]. While adding the gradient difference loss creates more details in the reconstruction, combining the perceptual loss with $l_{2,1}$ gives the best results, with a high level of details and realism. For the rest of this chapter, we will refer to the model trained using this combination.

Figure 4.4: Image reconstruction with our 3DMM model using the proxy and the true shape and albedo. Our shape and albedo can faithfully recover details of the face. Note: for the shape, we show the shading in UV space, a better visualization than the raw $S^{uv}$.

Understanding image pairing. Fig. 4.4 shows results of fitting our model to a 2D face image. By using the proxies or the final components (shape or albedo), we can render four different reconstructed images with different quality and characteristics. The image generated by the two proxies $\breve{S}$, $\breve{A}$ is quite blurry but is still able to capture major variations in the input face. By pairing S with the proxy $\breve{A}$, S is enforced to capture a high level of details to bring the image closer to the input. Similarly, A is also encouraged to capture more details by pairing with the proxy $\breve{S}$. The image $\hat{I}(S, A)$ inherently achieves a high level of details and realism, even without direct optimization.

Residual Soft Symmetry Loss. We study the effects of the residual soft symmetry loss on recovering details in occluded face regions. As shown in Fig. 4.5, without $L_{res\text{-}sym}$, the learned model can result in an unnatural shape, in which one side of the face is over-smooth in occluded regions, while the other side still has a high level of details. Our model learned with $L_{res\text{-}sym}$ can consistently create details across the face, even in occluded areas.

Figure 4.5: Effect of the soft symmetry loss on our shape model (with vs. without $L_{res\text{-}sym}$).

Table 4.1: Quantitative comparison of texture representation power (average reconstruction error on the non-occluded face portion).

Method                        | Reconstruction error ($l_{2,1}$)
Linear [193]                  | 0.1287
Nonlinear [161]               | 0.0427
Nonlinear + GL (Ours)         | 0.0386
Nonlinear + GL + Proxy (Ours) | 0.0363

4.3.2 Representation Power

We compare the representation power of the proposed nonlinear 3DMM with the Basel Face Model [121], the most commonly used linear 3DMM. We also make comparisons with the recently proposed nonlinear 3DMM [160].

Texture. We evaluate our model's power to represent in-the-wild facial texture. Given a face image, along with the ground truth shape and projection matrix, we can jointly estimate an albedo parameter f_A and a lighting parameter L whose decoded texture can reconstruct the original image. To accomplish this, we use SGD on f_A and L with the initial parameters estimated by our encoder E. For the linear model, Zhu et al. [193]'s fitting results of the Basel albedo using the Phong illumination model [122] are used. As in Fig. 4.6, the nonlinear model significantly outperforms the Basel
Face Model. Despite being close to the original image, the reconstruction results of the Tran et al. [161] model are still blurry. Using the global-local-based network architecture ("+GL") with the same loss functions helps to bring the image closer to the input. However, these models are still constrained by regularizations on the albedo. By learning with the proxy technique ("+Proxy"), our model can learn a more realistic albedo with more high-frequency details on the face. This conclusion is further supported by the quantitative comparison in Tab. 4.1. We report the averaged $l_{2,1}$ reconstruction error over the face portion of each image. Our model achieves the lowest averaged reconstruction error among the four models, 0.0363, which is a 15% error reduction relative to the recent nonlinear 3DMM work [161].

Figure 4.6: Texture representation power comparison. Our nonlinear model can better reconstruct the facial texture.

Shape. Similarly, we also compare the models' power to represent real-world 3D scans. Using the ten 3D face meshes provided by [121], which share the same triangle topology with us, we can optimize the shape parameter to generate, through the decoder, shapes matching the ground truth scans. The matching criterion is defined on both vertex distances (Euclidean) and surface normal directions (cosine distance), which empirically improves the fidelity of the reconstructed meshes compared to optimizing vertex distances only. Fig. 4.7 shows the visual comparisons between different reconstructed meshes. Our reconstructions closely match the face shapes' details. To quantify the difference, we use NME, the averaged per-vertex Euclidean distance between the recovered and ground truth meshes, normalized by the inter-ocular distance. The proposed model has a smaller reconstruction error than the linear model, and also smaller than the nonlinear model by Tran et al. [161] (0.0139 vs. 0.0146 [161] and 0.0241 [121]).

Figure 4.7: Shape representation power comparison (NME: linear [121] 0.0241, nonlinear [161] 0.0146, ours 0.0139). Given a 3D shape, we optimize the feature f_S to approximate the original one.

4.3.3 Identity-Preserving

We explore the effect of our proposed 3DMM on preserving identity when reconstructing face images. Using DR-GAN [163], a pretrained face recognition network, we can compute the cosine distance between the input and its reconstruction from different models. Fig. 4.8 shows the plot of these score distributions. At each horizontal mark, there are exactly three points presenting the distances between an image and its reconstructions from the three models. Images are sorted based on the distance to our reconstruction. For the majority of the cases (77.2%), our reconstruction has the smallest difference to the input in the identity space.

Figure 4.8: The distance between the input images and their reconstructions from three models. For better visualization, images are sorted based on their distance to our model's reconstructions.

4.3.4 3D Reconstruction

Using our model D_S, D_A, together with the model fitting CNN E, we can decompose a 2D photograph into different components: 3D shape, albedo, and lighting (Fig. 4.9). Here we compare our 3D reconstruction results with different lines of work: linear 3DMM fitting [153], nonlinear 3DMM [152, 161], and approaches beyond 3DMM [63, 139].

For the linear 3DMM model, the representative work, MoFA by Tewari et al. [153, 151], learns to regress 3DMM parameters in an unsupervised fashion. Even when trained on in-the-wild images, it is still limited to the linear subspace, with limited power for recovering in-the-wild texture. This results in surface shrinkage when dealing with challenging textures, i.e., facial hair, as discussed in [152, 160, 161]. Besides, even with regular skin texture, their reconstruction is still blurry and has fewer details compared to ours (Fig. 4.10).

Figure 4.9: 3DMM fitting to faces with diverse skin color, pose, expression, and lighting, faithfully recovering these cues.

Figure 4.10: 3D reconstruction comparison to Tewari et al. [153].
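The identity-preservation experiment above (Sec. 4.3.3) reduces to comparing face embeddings. A minimal sketch, assuming some pretrained face recognition network `embed` (DR-GAN in this chapter) that maps images to identity feature vectors; the helper name and batching are illustrative.

```python
import torch
import torch.nn.functional as F

def identity_distance(embed, image, reconstruction):
    # Cosine distance in the identity space of a pretrained recognizer:
    # 0 means identical identity features, 2 means opposite.
    f_in = F.normalize(embed(image), dim=-1)
    f_re = F.normalize(embed(reconstruction), dim=-1)
    return 1.0 - (f_in * f_re).sum(dim=-1)
```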
Figure 4.11: 3D reconstruction comparisons to the nonlinear 3DMM approaches by Tewari et al. [152] and Tran and Liu [161]. Our model can reconstruct face images with a higher level of details. Please zoom in for more details. Best viewed electronically.

The most related works to our proposed model are Tewari et al. [152] and Tran and Liu [161], in which 3DMM bases are embedded in neural networks. With more representation power, these models can recover details which the traditional 3DMM usually cannot, i.e., make-up, facial hair. However, the model learning process is attached to strong regularization, which limits their ability to recover high-frequency details of the face. Our proposed model enhances the learning process in both the learning objective and the network architecture to allow higher-fidelity reconstructions (Fig. 4.11).

Figure 4.12: 3D reconstruction comparisons to Sela et al. [139] and Tran et al. [159], which go beyond latent space representations.

To improve 3D reconstruction quality, many approaches also try to move beyond the 3DMM, such as Richardson et al. [129], Sela et al. [139], or Tran et al. [159]. The current state-of-the-art 3D monocular face reconstruction method by Sela et al. [139] uses a detail reconstruction step to help reconstruct high-fidelity meshes. However, their depth map regression step is trained on synthetic data generated by the linear 3DMM. Besides the domain gap between synthetic and real images, it faces a more serious problem: the lack of facial hair in the low-dimensional texture subspace. Hence, this network's output tends to ignore these unexplainable regions, which leads to failure in later steps. Our network is more robust in handling these in-the-wild variations (Fig. 4.12). The approach of Tran et al. [159] shares a similar objective with us, to be both robust and maintain a high level of details in 3D reconstruction. However, they use an over-constrained foundation, which loses the personal characteristics of each face mesh. As a result, their 3D shapes look similar across different subjects (Fig. 4.12).

4.3.5 Face Editing

Decomposing a face image into individual components gives us the ability to edit the face by manipulating any component. Here we show three examples of face editing using our model.

Relighting. First we show an application replacing the lighting of a target face image using the lighting from a source face (Fig. 4.13). After estimating the lighting parameters $L_{source}$ of the source image, we render the transferred shading using the target shape $S_{target}$ and the source lighting $L_{source}$. This transferred shading can be used to replace the original shading. Alternatively, the value of $L_{source}$ can be arbitrarily chosen based on the SH lighting model, without the need of source images. Also, here we use the original texture instead of the output of our decoder to maintain image details.

Figure 4.13: Lighting transfer results. We transfer the lighting of source images (first row) to target images (first column). We have similar performance compared to the state-of-the-art method of Shu et al. [143], despite being orders of magnitude faster (150 ms vs. 3 min per image).

Attribute Manipulation. Given faces fitted by our 3DMM model, we can edit images by naively modifying one or more elements in the albedo or shape representation. More interestingly, we can even manipulate semantic attributes, such as growing a beard, smiling, etc. The approach is similar to learning the attribute embedding in Sec. 3.3.2.

Figure 4.14: Growing mustache editing results. The first column shows the original images; the following columns show edited images with increasing magnitudes. Compared to Shu et al. [144]'s results (last row), our edited images are more realistic and identity-preserving.

Assume we would like to edit the appearance
only. For a given attribute, e.g., beard, we feed two sets of images, with and without that attribute, $\{I^p_i\}_{i=1}^{n}$ and $\{I^n_i\}_{i=1}^{n}$, into our encoder to obtain two average parameters $\bar{f}^p_A$ and $\bar{f}^n_A$. Their difference, $\Delta f_A = \bar{f}^p_A - \bar{f}^n_A$, is the direction to move from the distribution of negative images to positive ones. By adding $\Delta f_A$ with different magnitudes, we can generate images with different degrees of change. To achieve high-quality editing with identity preserved, the final editing result is obtained by adding the residual, the difference between the modified image and our reconstruction, to the original input image. This is a critical difference to Shu et al. [144], improving the quality of our results (Fig. 4.14).

Figure 4.15: Adding stickers to faces. The sticker is naturally added onto faces, following the surface normal and lighting.

Adding Stickers. With a more precise 3D face mesh reconstruction, the quality of successive tasks is also improved. Here, we show an application of our model to face editing: adding stickers or tattoos onto faces. Using the estimated shape as well as the projection matrix, we can unwrap the facial texture into the UV space. Thanks to the lighting decomposition, we can also remove the shading from the texture to get the detailed albedo. From here we can directly edit the albedo by adding a sticker, tattoo, or make-up. Finally, the edited images can be rendered using the modified albedo together with the other original elements. Fig. 4.15 shows our editing results of adding stickers onto different people's faces.

4.4 Conclusions

In realization that strong regularization and global-based modeling are the roadblocks to achieving a high-fidelity 3DMM model, this chapter presents a novel approach to improve nonlinear 3DMM modeling in both the learning objective and the network architecture. Hopefully, with the insights and findings discussed, this can be a step toward unlocking the possibility of building a model which can capture mid and high-level details of the face, through which high-fidelity 3D face reconstruction can be achieved solely by doing model fitting.

Chapter 5

Intrinsic 3D Decomposition, Segmentation, and Modeling Generic Objects

This chapter is adapted from the following work:
[1] Feng Liu, Luan Tran and Xiaoming Liu, "Intrinsic 3D Decomposition and Modeling for Generic Objects via Colored Occupancy Field," under submission. (Luan Tran and Feng Liu make equal contribution to this work.)

5.1 Introduction

Understanding 3D structure is one of computer vision's fundamental problems. A human has no difficulty understanding the 3D structure of an object upon seeing its 2D image. Even without geometric cues (motion or stereopsis), our visual system can still infer detailed surfaces or plausibly fill in hidden parts. Meanwhile, such a 3D inferring task remains extremely challenging for computer vision systems.

In recent years, with advancements in deep learning, many works have shown human-level performance on 2D image understanding, such as object detection [59], recognition [61, 163], and segmentation [57, 21]. One of the main reasons for this success is the abundance of annotated data. For the majority of 2D understanding tasks, nowadays, there are usually many databases with sufficient annotated images. Hence, decent performance can be obtained using end-to-end supervised learning. However, extending this success to supervised learning for 3D inference is far behind, due to the limited availability of 3D labels.

With the introduction of large 3D Computer-Aided Design (CAD) databases like ObjectNet3D [176] and ShapeNet [26], the majority of recent works on 3D monocular object reconstruction [56, 54, 34] and intrinsic image decomposition [64, 142] rely entirely on synthetic images generated from the CAD models. However, using synthetic data alone has major drawbacks. First of all, creating CAD models is not scalable. Making a single 3D object instance is labor intensive and requires expertise in computer graphics. Hence, it is not feasible to build models for all available objects.
Secondly, there is still an obvious gap between synthetically rendered images and real images, even with advanced rendering techniques in computer graphics. Therefore, these methods have limited ability in reconstruction from real-world images.

Meanwhile, there are large collections of 2D images for almost any object category. If those images can be effectively used in either 3D object modeling or learning to fit the model, it could have a great impact on 3D object reconstruction. Essentially, the reason that real-world 2D images have not been effectively used in generic object 3D reconstruction is the lack of corresponding ground truth 3D shapes for these images, and thus no supervised learning.

Early attempts [89, 164] on learning 3D shape models from 2D photographs in an unsupervised fashion are still limited in exploiting 2D images. Given an input image, they mainly try to learn a 3D model to reconstruct the 2D silhouette of the object. To learn a better model, multiple views of the same object with ground-truth pose or keypoint annotations are needed. More importantly, they ignore additional monocular cues, e.g., shading, that contain rich 3D information. One common issue among prior works is the lack of modeling for albedo, one key element in image formation. As a result, analysis-by-synthesis approaches are not applicable to 3D modeling of generic objects.

To address these issues, we propose a novel paradigm to jointly learn a completed and segmented 3D model, consisting of both 3D shape and albedo, as well as a model fitting module to estimate the shape, albedo, lighting and camera matrix from 2D images, as in Fig. 5.1. Different from prior 3D reconstruction work, this is the first work modeling both shape and albedo of a generic object in a semi-supervised manner. Modeling albedo, together with estimating the environment lighting condition, enables us to fully exploit the shading cues from 2D images to estimate the 3D shape.

Figure 5.1: This work decomposes a 2D image of generic objects into albedo, 3D shape, illumination, and camera projection.

Specifically, considering large intra-class variations in mesh topology, we propose to use a colored occupancy field to completely represent a 3D object. For every spatial point, the colored occupancy field provides the probability of whether it is inside the object and also the RGB value of its albedo. The surface of the object is implicitly represented as the iso-surface at a certain threshold of the occupancy probability. The colored occupancy field can theoretically represent a shape at an arbitrarily high resolution, which only depends on the sampling density of spatial points. Moreover, also due to the lack of consistency in meshes' topology, dense correspondence between 3D shapes is missing. We propose to jointly model the object part segmentation, which exploits its implicit correlation with shape and albedo, and also creates explicit constraints for our model learning.

In summary, the contributions of this chapter include:

- We build the first 3D model that fully models segmented 3D shape and albedo for generic objects, using a colored occupancy field as the representation.
- Modeling intrinsic components allows us to not only better exploit visual cues, but also, for the first time, use real images for model training in an unsupervised manner.
- Incorporating unsupervised part segmentation enables better constraints for the shape fitting and pose estimation.
- We demonstrate superior performance on 3D reconstruction of generic objects from a single 2D image.

5.1.1 3D Shape and Albedo Representation

Shape Implicit Field. In contrast to the 2D domain, the community has not yet agreed on a 3D representation that is both memory efficient and inferable from data. Recently, a lot of attention has focused on implicit representations, where each shape can be represented by a function
$o: \mathbb{R}^3 \to [0, 1]$. This function takes a spatial location $x \in \mathbb{R}^3$ as input and outputs its probability of occupancy [34, 108, 117]. With this implicit representation, the shape can be viewed at an arbitrarily high resolution. Another appealing property of this representation is that the surface normal can be analytically computed using the spatial derivative $\frac{d\, D_S(f_S, x)}{d x}$ via back-propagation through the network. This is helpful for successive analysis tasks such as rendering.

As in [34, 108, 117], leveraging deep neural networks, a family or an instance of shape functions can be represented using a decoder network $D_S$, where each shape S is encoded by a latent representation $f_S \in \mathbb{R}^{d_S}$ (Fig. 5.2.a):

$D_S : \mathbb{R}^3 \times \mathbb{R}^{d_S} \to [0, 1]$.  (5.1)

The shape decoder's architecture follows BAE-NET [33]. BAE-NET is a joint shape co-segmentation and reconstruction network, which takes the shape latent representation $f_S$ and a spatial point $x = (x, y, z)$ as inputs. It is composed of 3 fully connected layers, each followed by a LeakyReLU, except the output (Sigmoid). The final layer gives the implicit fields for four branches $(o_1, o_2, o_3, o_4)$. Finally, a max-pooling operator on the branch outputs results in the final implicit field o. BAE-NET is much shallower and thinner compared to IM-NET [34], since it cares more about the quality of segmentation rather than reconstruction. We propose to integrate the shape segmentation into albedo learning, which is shown to benefit both segmentation and reconstruction.

Figure 5.2: Shape and albedo decoder networks. The shape decoder D_S takes a shape latent representation f_S and a spatial point x = (x, y, z) and produces the implicit field for each branch. The output layer groups the branch outputs, via max pooling, to form the spatial probability of occupancy. The albedo decoder D_A receives both latent representations f_S, f_A and estimates the albedo colors of 4 branches, one of which is selected by the shape branch/segmentation and returned as the albedo color of x.

Albedo Implicit Field. For a completed model, each vertex on the shape surface is assigned an RGB albedo color. Extending the idea of the occupancy field to albedo, we propose to represent the albedo as a colored field. The albedo decoder $D_A$ returns an RGB color for any spatial location $x \in \mathbb{R}^3$. One approach for the colored field is to naïvely use a single albedo latent representation $f_A$ to represent a colored shape, i.e., $D_A(f_A, x)$. However, this puts a redundant burden on $f_A$ to encode the object geometry, e.g., the position of the tires and body of a car. Hence, we propose to take the shape latent vector $f_S$ as an additional input to the albedo decoder, $D_A(f_A, f_S, x)$ (Fig. 5.2.b):

$D_A : \mathbb{R}^3 \times \mathbb{R}^{d_T} \times \mathbb{R}^{d_S} \to \mathbb{R}^3$.  (5.2)

For simplicity, we will omit $f_S$, $f_A$ in D in later sections.

The albedo decoder has a similar architecture to the shape decoder, with a few differences. The input to the network has an additional vector, the albedo representation $f_A$. The output is applied a Tanh activation. Also, the third layer gives the color fields for four branches $(c_1, c_2, c_3, c_4)$, each with 3 channels. At every spatial location, the final color is $c_k$, where $k = \operatorname{argmax}_i(o_i)$ (Fig. 5.2). One key motivation for integrating shape segmentation into the albedo decoder is that different parts of an object often differ in both shape and texture. The four albedo branches essentially represent the dominant albedo colors of the object, whose learning will in turn encourage the shape decoder to segment parts that differ not only in shape, but also in dominant albedo.
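A minimal PyTorch sketch of this branch design follows, with the max-pooled occupancy and the argmax-selected branch color as described above; the layer widths, slopes, and names are illustrative, not the exact configurations of [33, 34].

```python
import torch
import torch.nn as nn

class ColoredOccupancyDecoder(nn.Module):
    def __init__(self, d_s=128, d_a=128, hidden=256, branches=4):
        super().__init__()
        # Shape decoder: (x, f_S) -> per-branch occupancy (Eqn. 5.1).
        self.shape_net = nn.Sequential(
            nn.Linear(3 + d_s, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, branches), nn.Sigmoid())
        # Albedo decoder: (x, f_S, f_A) -> per-branch RGB (Eqn. 5.2).
        self.albedo_net = nn.Sequential(
            nn.Linear(3 + d_s + d_a, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, branches * 3), nn.Tanh())

    def forward(self, x, f_s, f_a):
        # x: (N, 3) query points; f_s, f_a: (N, d) latent codes.
        o_branches = self.shape_net(torch.cat([x, f_s], dim=-1))      # (N, 4)
        c_branches = self.albedo_net(
            torch.cat([x, f_s, f_a], dim=-1)).view(len(x), -1, 3)     # (N, 4, 3)
        occupancy, k = o_branches.max(dim=-1)         # max pool over branches
        albedo = c_branches[torch.arange(len(x)), k]  # color of argmax branch
        return occupancy, albedo, k                   # k doubles as part label
```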
5.1.2 Physics-Based Rendering

To render an object image from the shape and albedo, represented by latent vectors $f_S$, $f_A$, as well as lighting L and projection matrix P, we first find a set of W x H surface points, one corresponding to each pixel. Then the RGB color of each pixel is computed via a lighting model using the lighting parameters L and the decoder outputs.

Camera model. We assume a full perspective camera model. Any spatial point x in the 3D world space can be projected into 2D by a multiplication between a projection matrix P and its homogeneous coordinate representation,

$u = P [x, 1]^T$,  (5.3)

where P is a 3 x 4 full perspective projection matrix. Essentially, P can be extended to its 4 x 4 version with zero translation in the z-direction. With an abuse of notation in homogeneous coordinates, the relation between a 3D point x and its camera space projection u can be written as:

$u = P x$, and $x = P^{-1} u$.  (5.4)

Figure 5.3: Ray tracing for surface point detection. In Linear search, candidates (red points) are uniformly distributed in the grid. In Linear-Binary search, after the first point inside the object is found, Binary search is used between the last outside point and the current inside point for all remaining iterations.

Surface point detection. To render a 2D image, for each ray from the camera to a pixel j = (u, v), we select one "surface point". Here, a surface point is defined as the first interior point ($D_S(x) > \tau$), or the exterior point with the largest $D_S(x)$ in case the ray does not hit the object. For efficient network training, instead of finding exact surface points, we approximate them using Linear search or Linear-Binary search (Fig. 5.3).

Intuitively, with a distance margin error of ε, in Linear search, from an initial location at the object boundary, we evaluate $D_S(x)$ for all spatial point candidates x with a step size of ε. In Linear-Binary search, after the first interior point is found, as $D_S(x)$ is a continuous function, a Binary search can be used to better approximate the surface point. For better parallelization, the number of points evaluated on each ray is the same. In this case, Linear-Binary search does not result in a speedup, but leads to a better approximation of surface points, hence better render quality.

Image formation. We assume distant low-frequency illumination and a purely Lambertian surface reflectance. Hence the incoming radiance can be approximated via Spherical Harmonics (SH) basis functions $H_b : \mathbb{R}^3 \to \mathbb{R}$, controlled by coefficients L. At pixel j with corresponding surface point $x_j$, the image color value is computed as the product of albedo A and shading C:

$I_j = A_j \cdot C_j = A_j \cdot \sum_{b=1}^{B^2} \gamma_b H_b(n_j)$  (5.5)
$\;\;\;\;= D_A(x_j) \cdot \sum_{b=1}^{B^2} \gamma_b H_b\Big( \sigma\Big( \frac{d\, D_S(x_j)}{d x_j} \Big) \Big)$,  (5.6)

where $n_j = \sigma\big( \frac{d\, D_S(x_j)}{d x_j} \big)$ is the $L_2$-normalized surface normal at $x_j$, and $\sigma(\cdot)$ is a vector normalization function. We use B = 3 SH bands, which leads to $B^2 = 9$ coefficients in L for each of the three color channels.

5.1.3 Model Learning

Our model is designed to learn from real-world 2D images. However, in addition, we also need to learn a shape prior from 3D CAD models, due to the inherent ambiguity in inverse problems. We first describe learning from 2D images, and then learning from CAD models.

5.1.3.1 Unsupervised Joint Modeling and Fitting

Given a set of 2D images, without corresponding ground truth 3D shapes, we define the loss function as:

$L = L_{img} + \lambda_{sil} L_{sil} + \lambda_{fea\text{-}const} L_{fea\text{-}const} + \lambda_{reg} L_{reg}$,  (5.7)

where $L_{img}$ is the photometric loss, $L_{sil}$ enforces consistency between the predicted silhouette and the ground truth silhouette, $L_{fea\text{-}const}$ is the local feature consistency loss, and $L_{reg}$ consists of different regularization terms.

Silhouette Consistency Loss. Given the object's silhouette mask M for each image, obtained by an off-the-shelf segmentation method [21], the silhouette consistency loss is:

$L_{sil} = \frac{1}{WH} \sum_{j=1}^{WH} L\big( D_S(f_S, x_j),\, o_j \big)$  (5.8)
$\;\;\;\;= \frac{1}{WH} \sum_{j=1}^{WH} L\big( D_S(E_S,\, E_P^{-1} u_j),\, o_j \big)$,  (5.9)

where $E_S$ and $E_P$ denote the encoder's shape and projection outputs for the input image. With the occupancy field, the target value $o_j$ is defined as $o_j = 0.5$ if $M_j = 1$, otherwise $o_j = 0$.
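The rendering and the losses above all consume one surface point per ray. A minimal sketch of the linear-binary search from Fig. 5.3, for a single ray, assuming a callable occupancy field `D_S` and a fixed per-ray evaluation budget as described in the text; the step counts and the 0.5 iso-surface threshold are illustrative.

```python
import torch

def find_surface_point(D_S, origin, direction, t_max,
                       n_linear=32, n_binary=8, tau=0.5):
    # Linear phase: march along the ray with uniform steps until the first
    # point with occupancy above the iso-surface threshold tau is found.
    ts = torch.linspace(0.0, t_max, n_linear)
    pts = origin + ts[:, None] * direction          # (n_linear, 3) candidates
    occ = D_S(pts)                                  # (n_linear,) occupancies
    inside = (occ > tau).nonzero()
    if len(inside) == 0:
        # Ray misses the object: return the point with the largest occupancy.
        return pts[occ.argmax()]
    idx = int(inside[0, 0])
    hi, lo = ts[idx], ts[max(idx - 1, 0)]           # first inside, last outside
    # Binary phase: refine between the last outside and first inside points.
    for _ in range(n_binary):
        mid = 0.5 * (lo + hi)
        if D_S(origin + mid * direction) > tau:
            hi = mid
        else:
            lo = mid
    return origin + hi * direction
```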
Here, we also analyze how our silhouette loss differs from prior work. If the 3D shape is represented as a mesh, there is no gradient when comparing two binary masks, unless the predicted silhouette is expensively approximated as in Soft Rasterizer [89]. If the shape is represented by a voxel, the loss can provide a gradient to adjust voxel occupancy predictions, but not the object orientation [164]. Our loss can update both the occupancy field and the camera projection estimation (Eqn. 5.9).

Photometric Loss. To enforce similarity between our reconstruction and the input, we use an $L_1$ loss on the foreground:

$L_{img} = \frac{1}{|M|} \big\| (\hat{I} - I) \odot M \big\|_1$.  (5.10)

To the best of our knowledge, this is the first work on generic 3D object modeling that can fully exploit the RGB color information to supervise the shape learning, rather than just silhouette guidance [89]. This is only possible due to two designs of our approach: 1) we learn the completed model including albedo; 2) the shape implicit representation (in contrast to voxels) provides accurate, efficient surface normal computation, which allows shading decomposition.

Local Feature Consistency Loss. We propose a novel local feature consistency loss based on the 3D segmentation provided by the shape decoder. We select q boundary points $U_{3D} \in \mathbb{R}^{q \times 3}$ from all pairs of neighboring segments based on the shape decoder branches. Then these 3D points are projected to 2D locations $U_{2D} \in \mathbb{R}^{q \times 2}$ on the image plane using the estimated projection matrix P. Similar to [178], we retrieve features on each feature map using the locations $U_{2D}$ and form the local image features $F \in \mathbb{R}^{q \times 256}$, where 256 is the feature dimension. Finally, we perform PCA to obtain the eigenvector associated with the largest eigenvalue ($v \in \mathbb{R}^{1 \times 256}$), which describes the largest variation among the visual features of the q points. Despite the different colors of two images of the same object category, we assume that this major variation is similar. Thus, we define the local feature consistency loss as:

$L_{fea\text{-}const} = \frac{1}{|B|} \sum_{(i,j) \in B} \| v_i - v_j \|_1$,  (5.11)

where B is the training batch.

Regularization. We define two regularization terms to constrain the learning.

Albedo local constancy: following Retinex theory [82], which assumes albedo to be piecewise constant, we enforce gradient sparsity in two directions, similar to [144]:

$L_{alb\text{-}const} = \sum_{t \in \mathcal{N}_j} w(j, t)\, \big\| A_j - A_t \big\|_2^p$,  (5.12)

where $\mathcal{N}_j$ represents pixel j's set of 4 neighboring pixels. With the assumption that pixels with the same chromaticity (i.e., $c_j = I_j / |I_j|$) are more likely to have the same albedo, we set the constant weight $w(j, t) = \exp\big(-\alpha \| c_j - c_t \|\big)$, where the color is referenced from the input image. In our experiments we set α = 15 and p = 0.8, as in [107].

Batch-wise White Shading: due to the ambiguity in the magnitude of lighting, and therefore the intensity of shading, it is necessary to incorporate constraints on the shading magnitude to prevent the network from generating arbitrarily bright/dark shading. To handle these ambiguities, we use a Batch-wise White Shading [144] constraint:

$L_{bws} = \Big\| \frac{1}{m} \sum_{j=1}^{m} C^{(r)}_{s,j} - c \Big\|_1$,  (5.13)

where $C^{(r)}_{s,j}$ is the red-channel diffuse shading of pixel j, and m is the number of foreground pixels in a training batch. c is a constant for the target average shading, which is set to 1. The same constraint is applied to the blue and green channels.

5.1.3.2 Supervised Prior Learning with Synthetic Images

The CAD models help to learn the shape prior and provide supervision in training.

Learning Shape and Albedo Decoders. To learn the shape and albedo models (decoders), we adopt the widely used technique of training encoder-decoder networks [34, 50]. Here the input to the encoder is a colored voxel, and the encoder $E'$ is a 3D CNN. Voxels are picked over 2D images as they contain all shape information, which better eliminates ambiguity for the encoding process.

Given a dataset of N models, each model can be represented as a colored 3D occupancy voxel V. Equivalently, each model can also be represented with K spatial points $x \in \mathbb{R}^3$ and their occupancy labels $o \in [0, 1]$ and albedo c. The model learning objective is written as:
Givenadatasetof N models,eachofwhichcanberepresentedasacolored3Doccupancy voxel V .Equivalently,eachmodelcanalsoberepresentedwith K spatialpoints x 2 R 3 andits occupancylabel o 2 [ 0 ; 1 ] andalbedo c .Thismodellearningobjectiveiswrittenas: 73 argmin D S ; D A ; E 0 N å i = 1 K å j = 1 L ( D S ( E 0 S ( V i ) ; x j ) ; o j )+ L ( D A ( E 0 S ( V i ) ; E 0 A ( V i ) ; x j ) ; c j ) : (5.14) Theloss L (softmaxcross-entropyor L p )penalizesdeviationofthenetworkpredictionfromthe actualvalue o j , c j . Wealsoadoptprogressivetrainingtechniques[34],totrainourmodelongraduallyincreas- ingresolutiondata.Sincethemodelstructuredoesn'tchangewhenswitchingtrainingdataof differentresolutions,thushigher-resolutionmodelscanbetrainedwithpre-trainedweightson low-resolutiondata.Progressivetrainingstabilizesandspeedsupthetraining. LearningImageEncoder. ForeachCADmodel,werendermultipleimagesofthesameobject withdifferentposesandlightingconditions.Hereeachtrainingsampleisatripletofvoxel,2D imageanditscorrespondinggroundtruthprojectionmatrix ( V ; I ; e P ) .Theycanbeusedasan additionalsupervisionforourencoderanddecoders. L S = E S ( I ) E 0 S ( V ) 2 2 ; (5.15) L A = E A ( I ) E 0 A ( V ) 2 2 ; (5.16) L P = E P ( I ) e P 2 2 ; (5.17) Thegroundtruthlatentrepresentationsareobtainedfromthegroundtruthvoxel( E 0 ( V ) ). 5.1.4ImplementationDetails 5.1.4.1Modeltraining Thefullmodelistrainedinthreestages.First,theshapeandalbedodecoderistrainedwithcolored voxeldata.Thentheencoderistrainedwith2Dsyntheticimagesasinputs.Bothsupervisedand 74 unsupervisedlossesareusedinthisstage.Finally,themodelmodule(encoderandalbedo decoder)canbeusingrealimageswithunsupervisedlosses.Weempiricallyfoundthat, therealimagestraininghasincrementalontheshapedecoder.Butit improvesthegeneralizationabilityofourencoderonmodeltorealimages.Hence,we decidetotheweightoftheshapedecoderafterthestage.TheencoderisaResNet- 18,whiledecodersare3layersMLPs[33].Weightsareinitializedfromanormaldistributionwith astandarddeviationof0 : 02.Adamoptimizerisusedwithalearningrateof0 : 0001inallstages. 5.1.5NetworkStructure ColoredVoxelEncoder. Tolearntheshapeandalbedomodels(prior)simultaneously,ourvoxel encoderrequirescoloredvoxelsasinput.WeobtaincolorvoxelizationfortheShapeNet3Dmesh modelsbythework[29].Fig.5.4showstwoexamplesofcolorvoxelization.Thevoxelencoder architecture(Table5.1)is3DCNN,whichisadoptedfrom[34,33]. Figure5.4: ColorvoxelizationofShapeNetmodels.Original3Dmesh(left)and64 3 coloredvoxel(right). ShapeandAlbedoDecoders. Theshapedecoderarchitectureisfollowedtheworkof[33] (unsupervisedcase).Thenetworktakesshapelatentrepresentation f S andaspatialpoint ( x ; y ; z ) asinputs.Itiscomposedof3fullyconnectedlayerseachofwhichareappliedwithLeakyReLU, exceptthealoutputisapplied Sigmoid activation(Fig.5.5).Thealbedodecoderarchitecture 75 Table5.1: Coloredvoxelencodernetworkstructure. LayerKernelsizeStride Activation function Outputsize (d1,d2,d3,C) input--- ( 64 ; 64 ; 64 ; 3 ) conv3d ( 4 ; 4 ; 4 )( 2 ; 2 ; 2 ) LReLU ( 32 ; 32 ; 32 ; 32 ) conv3d ( 4 ; 4 ; 4 )( 2 ; 2 ; 2 ) LReLU ( 16 ; 16 ; 16 ; 64 ) conv3d ( 4 ; 4 ; 4 )( 2 ; 2 ; 2 ) LReLU ( 8 ; 8 ; 8 ; 128 ) conv3d ( 4 ; 4 ; 4 )( 2 ; 2 ; 2 ) LReLU ( 4 ; 4 ; 4 ; 256 ) conv3d ( 4 ; 4 ; 4 )( 1 ; 1 ; 1 ) Sigmoid ( 1 ; 1 ; 1 ; 256 ) f A ---128 f S ---128 issimilar,withonlytwodifferences.Theinputtothenetworkhasanadditionalvector,albedo latentrepresentation f A .Theoutputisapplied Tanh activation.Fig.5.6depictsthealbedodecoder architecture. 
Figure 5.5: The shape decoder network is composed of 3 fully connected layers, denoted as "FC". The shape latent vector (128-dim) is concatenated, denoted "+", with the xyz query, making a 131-dim vector, which is provided as input to the first layer. The LeakyReLU activation is applied to the first 2 FC layers, while the final value is obtained with a Sigmoid activation, denoted as "Sig.".

Local Feature Extraction. We select q boundary points $U_{3D} \in \mathbb{R}^{q \times 3}$ from all pairs of neighboring segments based on the shape decoder branches. Then these 3D points are projected to 2D locations $U_{2D} \in \mathbb{R}^{q \times 2}$ on the image plane using the estimated projection matrix P. Fig. 5.7 shows one example of the selected visible points. We set q = 50 in our experiments.

The image encoder is a modified ResNet-18. Table 5.2 illustrates the detailed network architecture. Given the 3D points $U_{3D}$, we identify the projected locations $U_{2D}$ on the feature map layers of the encoder. Here, we concatenate features from the outputs of conv1, conv2 and conv3 (see
Pascal3D + augments12rigidcategoriesofPascalVOC2012[44]with3Dannotations.Weselect thesame5categories(plane,car,chair,couchandtable)withoursyntheticdata.Thetraining subsetoffromPascal3D + imagesweconsideredafteroccludedinstances,whichwould affecttheimagedecompositiontrainingprocess. Metrics. Weadoptthestandard3Dreconstructionmetric:IoUandChamferDistance(CD)[108] forevaluation.Tocomparewithmethodsthatoutputpointclouds,weusemarchingcubesto obtainmeshesfrom256 3 -voxelizedmodels.ForIoU,largerisbetter.ForCD,smallerisbetter. 79 Table5.3: Effectoflosstermsonposeandreconstructionestimation. AzimuthangleerrorReconstructionerror(CD) w/o L sil 18 : 51 0 : 136 w/o L fea-const 15 : 02 0 : 124 w/o L reg 13 : 01 0 : 131 Fullmodel12 : 20 0 : 116 5.2.2AblationStudy EffectofUnsupervisedTraining. Bymodelingthecompletedshapeandestimatingimagefor- mationparameters,ourmethodcanleveragein-the-wildimageswithoutannotationsofitsground truthshapeviaunsupervisedlosses.Herewedemonstratetheofaddingrealimagesinto trainingtoimproveourmodelabilityonrealimages.Fig.5.9showsvisualreconstructions onimagesfromPix3DandPascal3D + datasetsofourmodelatdifferentstageoftraining:amodel trainedwithsyntheticdataonlyandamodeltrainedwithadditionalrealimages. EffectofLossTerms. Wecompareourfullmodelwithitspartialvariants,withoutsilhouette consistencyloss,localfeatureconsistencyloss,oralbedoregularizationloss.Weconductexperi- mentsonPascal3D+database(carcategory)andevaluatetheposeestimationandreconstruction. Table5.3showsquantitativecomparisonofthesefourmodels.Asthesilhouetteprovidesstrong constraintsonglobalshapeandpose,withoutsilhouetteloss,theperformanceonbothmetricsare severelyimpaired.Theregularizationhelpstodisentangleshadingfromalbedo,whichleadsto bettersurfacenormal,thusbettershapeandposeThelocalfeatureconsistencylosshelps tothemodelwhichimprovestheposeandshapeestimation.Theseresults demonstratethatallthelosscomponentspresentedinthisworkcontributetotheperformance. 80 Table5.4: Segmentationandshaperepresentationcomparisons(IoU/CD)onShapeNetpart[181].IoU isutilizedtomeasureforsegmentationagainstground-truthparts.CDisusedforshaperepresentation evaluation.Chair*istrainingonchair+tablejointset. Shape(#parts)airplane(3)chair(3)chair*(4)table(2) BAE-Net80 : 4 = 0 : 1986 : 6 = 0 : 2783 : 7 = 87 : 0 = 0 : 30 Proposed83 : 0 = 0 : 1687 : 4 = 0 : 2384 : 1 = 0 : 2888 : 2 = 0 : 25 5.2.3UnsupervisedSegmentation Asmodelingshape,albedoandco-segmentationareclosely-relatedtasks[188],jointlymodel- ingthemallowsustoexploittheircorrelation.Followingthesametrainingandtestingsetting with[33],weevaluateourmodel'sco-segmentationandshaperepresentationpoweronthecat- egoryofairplane,chairandtable.AsinTable5.4,ourmodelachievesahighersegmentation accuracy,comparingwithBAE-NET[33].Further,wecomparethepoweroftwomethodsin representing3Dshapes.Byfeedingaground-truthvoxelshapefromthetestingsettothevoxel encoderandshapedecoder,wecanestimatetheshapeparameterwhosedecodedshapematchesthe ground-truthCADmodel.ThelowerCD,aswellashigherIoU,inTable5.4showthatthenovel designofourshapeandalbedodecodersimprovesboththesegmentationandreconstruction. 
We show additional unsupervised segmentation results of our 5 categories on the ShapeNet Part dataset in Fig. 5.10. We assign a color to the output of each branch of our shape decoder, and reasonable parts are obtained. Since our segmentation is unsupervised and the model for each category is trained separately, our results are not guaranteed to produce the same part counts for all categories. Fig. 5.11 shows the estimations of albedo colors of 4 branches. The four albedo branches do represent the dominant albedo colors of the objects.

Figure 5.10: Unsupervised segmentation results on the ShapeNet Part dataset. We render the original meshes with different colors representing different parts.

Figure 5.11: Visualization of albedo branch outputs for our 5 categories. We render the albedo with the reconstructed mesh.

Table 5.5: Quantitative comparison of single-view 3D reconstruction on synthetic images of ShapeNet.

           Chamfer Distance                                            IoU
Category   3D-R2N2  PSG    Pix2Mesh  AtlasNet  IM-SVR  Proposed       3D-R2N2  PSG    Pix2Mesh  AtlasNet  IM-SVR  Proposed
airplane   0.227    0.137  0.187     0.104     0.137   0.110          0.426    0.515  0.392     -         0.554   0.577
car        0.213    0.169  0.180     0.141     0.123   0.092          0.661    0.501  0.220     -         0.745   0.773
chair      0.270    0.247  0.265     0.209     0.199   0.155          0.439    0.402  0.257     -         0.522   0.546
couch      0.229    0.224  0.212     0.177     0.181   0.178          0.626    0.600  0.279     -         0.641   0.651
table      0.239    0.222  0.218     0.190     0.173   0.164          0.420    0.312  0.233     -         0.450   0.479
Mean       0.278    0.188  0.216     0.175     0.187   0.165          0.493    0.473  0.300     -         0.546   0.567

5.2.4 3D Image Decomposition

We further provide several 3D image decomposition results on real-world images in Fig. 5.12. Since our network produces a full 3D shape, we can change the reconstruction or any single component to a different viewpoint.

5.2.5 Single-view 3D Reconstruction

5.2.5.1 Reconstruction on synthetic images

Monocular 3D reconstruction performance is first evaluated on synthetic images. We compare our model against multiple state-of-the-art baselines that leverage various 3D representations: 3D-R2N2 [35] (voxel), Point Set Generation (PSG) [45] (point cloud), Pixel2Mesh [147], AtlasNet [54] (mesh), and IM-SVR [34] (implicit field). For our model, we employ both supervised and unsupervised losses.

In general, our model is able to predict 3D shapes that closely resemble the ground truth shapes (Fig. 5.13.a). Our approach outperforms the other methods in most categories and achieves the best mean score (both IoU and CD, Tab. 5.5). While using the same shape representation as us, IM-SVR [34] only learns to reconstruct the 3D shape by minimizing the difference between its latent representation and ground-truth latent vectors. By modeling albedo, our model benefits from learning with both supervised and unsupervised (photometric, silhouette) losses. This results in better performance in both quantitative and qualitative comparisons.

Figure 5.12: 3D image decomposition on real-world images. Our work decomposes a 2D image of generic objects into albedo, completed 3D shape and illumination.

Figure 5.13: Qualitative comparison for single-view 3D reconstruction on the ShapeNet, Pascal 3D+, and Pix3D datasets.
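Among the unsupervised losses, the silhouette term admits a compact illustration. The soft-IoU form below is one plausible instantiation, assuming a differentiably rendered silhouette; the exact form of our $L_{sil}$ may differ.

```python
# A sketch of one plausible silhouette consistency term (a soft IoU between a
# differentiably rendered silhouette and the object mask); the thesis' exact
# L_sil may differ in form.
import torch

def silhouette_loss(rendered_sil, gt_mask, eps=1e-6):
    """rendered_sil: (B, H, W) soft silhouette in [0, 1]; gt_mask: (B, H, W)
    binary mask. Returns 1 - soft IoU, averaged over the batch."""
    inter = (rendered_sil * gt_mask).sum(dim=(1, 2))
    union = (rendered_sil + gt_mask - rendered_sil * gt_mask).sum(dim=(1, 2))
    return (1.0 - inter / (union + eps)).mean()

sil = torch.sigmoid(torch.randn(4, 128, 128))        # hypothetical render
mask = (torch.rand(4, 128, 128) > 0.5).float()       # hypothetical mask
print(silhouette_loss(sil, mask))
```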
5.2.5.2 Reconstruction on real images

We also evaluate our approach for reconstruction on two real image databases, Pascal 3D+ [177] and Pix3D [147]. Our model is fine-tuned on real images from the Pascal 3D+ train subset without access to ground truth 3D shapes. Since most reconstruction methods can only infer shapes for synthetic images, here we compare the proposed method with the state-of-the-art methods which can work for real-world images, including 3D-R2N2 [35], differentiable ray consistency (DRC) [164], ShapeHD [174] and DAREC [123]. Again, our work is the first one that can fully leverage real images to learn the model in an unsupervised fashion. For the Pascal 3D+ evaluation, we use the val subset of the 5 categories. For Pix3D, we use 3 categories (chair, couch and table), which overlap with our 5 real categories.

Figure 5.14: Qualitative comparison for single-view 3D reconstruction on real images from Pascal 3D+ (left) and Pix3D (right).

As shown in Fig. 5.14, our model infers reasonable shapes even in challenging conditions. Quantitatively, Table 5.6 suggests that the proposed method performs better than the other methods on the Pascal 3D+ database. As Pascal 3D+ only has 10 CAD models for each object category as ground truth 3D shapes, the ground truth labels and the scores can be inaccurate, failing to reflect the shape details. We therefore conduct an experiment on the more precise 3D annotation database Pix3D. As shown in Table 5.7, our model also has the lowest Chamfer Distance and the best visual quality, as in Fig. 5.14, compared to the baselines.

To provide more comprehensive comparisons on 3D reconstruction quality, we provide more reconstruction results on the Pascal 3D+ [177] (Fig. 5.15) and Pix3D [147] datasets (Fig. 5.16). Comparisons are made with ShapeHD [174] and AtlasNet [54] using pre-trained models provided by the authors.

Table 5.6: Real image 3D reconstruction on PASCAL 3D+ with CD.

Category   3D-R2N2   DRC     ShapeHD   DAREC   Proposed
plane      0.305     0.112   0.094     0.108   0.102
car        0.305     0.099   0.129     0.101   0.113
chair      0.238     0.158   0.137     0.135   0.119
couch      0.347     0.169   0.176     -       0.148
table      0.321     0.162   0.153     -       0.127
Mean       0.303     0.140   0.138     -       0.122

Table 5.7: Real image 3D reconstruction on Pix3D with CD.

Category   3D-R2N2   DRC     ShapeHD   DAREC   Proposed
chair      0.239     0.160   0.123     0.112   0.091
couch      0.307     0.178   0.137     -       0.114
table      0.289     0.163   0.133     -       0.127
Mean       0.278     0.167   0.131     -       0.110

5.3 Conclusions

With the objective of 3D modeling from real-world 2D images, this chapter presents a semi-supervised learning approach that jointly learns the fitting algorithm and the models. Since our approach offers completed albedo and 3D shape models, as well as intrinsic decomposition from images, we are able to effectively leverage real images in the training. As a result, we observe substantial improvement in the quality of 3D reconstruction from a single image. In essence, our proposed method is applicable to 3D modeling and reconstruction for any object category if both i) an in-the-wild 2D image collection and ii) CAD models of the object are available. We are interested in applying this method to a wide variety of object categories and building a "zoo" of 3D models.

Figure 5.15: Additional 3D reconstruction results on the Pascal 3D+ [177] dataset.

Figure 5.16: Additional 3D reconstruction results on Pix3D [147]. For each input image, we show reconstructions by ShapeHD [174], and the ground truth. Our reconstructions resemble the ground truth.
Chapter 6

Conclusions and Future Work

Reconstructing faces or generic objects from a single photograph is extremely challenging due to the ambiguity in the image formation process. Reconstruction quality is highly dependent on the expressiveness of the underlying model. Given the limited amount of annotated 3D data, throughout this thesis, I have presented an approach to learn and improve 3D models' representation power as well as fitting ability by using large collections of 2D in-the-wild images. Even while achieving state-of-the-art performance, the current model still has limitations.

Lighting model

The Lambertian lighting model, which is used in this thesis, is known to be a poor approximation for the complex reflectance properties of facial skin or generic objects. When humans sweat, the skin clearly exhibits specular reflection, particularly on the nose and forehead. The specular reflection is even more obvious on other objects like cars. A more complex lighting assumption is necessary to accurately handle these scenarios.

I believe better modeling of the lighting is critical for unsupervised/weakly supervised approaches, as using a simplified approximation of a real rendering process prevents the model from learning the true shape or albedo, since these truthful elements could lead to a higher loss value under a poor approximation of the lighting model. In computer graphics, extremely complex, physically-valid lighting models have been developed for materials of relevance to faces, for example for skin [79] and hair [100]. However, these methods have proven to be too complex and too computationally expensive to integrate into 3DMM pipelines.

Feedback Mechanism

Different from classification tasks, where small changes in predicted probability can be tolerated as long as the final (class ranking) results aren't changed, our models do regression on pose and shape/albedo parameters. High precision estimation is usually required. Currently, across all chapters, we use a single encoder to estimate parameters from the input image. With multiple down-sampling operations in the network structure, maintaining fine information of the face, including precise landmark locations and small facial structures, could be challenging. As a result, the estimated shape and pose could be off from the ground-truth values.

Besides, in our tasks, visualizing our current estimations, in the form of reconstructed images, gives us the luxury of comparing our estimation to the original input. Studying the discrepancy between the reconstruction and the input image could provide a form of feedback signal that we can use to further refine the current estimation. Hence, one interesting idea that we could explore is to learn a second encoder that takes both the original input and our rendered image as inputs and tries to produce parameter residuals to our initial predicted parameters.
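As a purely speculative illustration of this feedback idea, the sketch below shows what such a residual-producing second encoder might look like; all layer sizes, names, and the render() placeholder are hypothetical, not an implemented component of this thesis.

```python
# A speculative sketch of the feedback idea: a second encoder takes the input
# image and the current rendering, and regresses parameter residuals. All
# sizes, names, and the render() placeholder are hypothetical.
import torch
import torch.nn as nn

class RefinementEncoder(nn.Module):
    def __init__(self, param_dim=128):
        super().__init__()
        self.net = nn.Sequential(            # input: image + rendering, 6 ch
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, param_dim))        # residual over the parameters

    def forward(self, image, rendering, params):
        x = torch.cat([image, rendering], dim=1)  # expose the discrepancy
        return params + self.net(x)               # refined estimate

# One refinement step, with render() standing in for the rendering layer:
#   params = encoder(image)
#   params = refiner(image, render(params), params)
```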
APPENDIX

Representation Learning GAN for Pose-invariant Face Recognition (DR-GAN)

While the other chapters in this thesis look at image formation/synthesis in a model-driven approach, there are other approaches that can learn to manipulate images without using any 3D models. In this appendix, I would like to introduce one of our works in that direction, with an application on face synthesis and face recognition.

A1 Introduction

Face recognition is one of the most widely studied topics in computer vision due to its wide application in law enforcement, biometrics, marketing, etc. Recently, great progress has been achieved in face recognition with deep learning-based methods [149, 119, 138]. For example, surpassing human performance is reported by Schroff et al. [138] on the Labeled Faces in the Wild (LFW) database. However, one of the shortcomings of the LFW database is that it does not offer a high degree of pose variation, the variation that has been shown to be a major challenge in face recognition. Up to now, the key ability of Pose-Invariant Face Recognition (PIFR) desired by real-world applications is far from solved [92, 93, 25, 4, 41]. A recent study [140] observes a significant drop, over 10%, in performance of most algorithms from frontal-frontal to frontal-profile face verification, while human performance only degrades slightly. This indicates that pose variation remains a challenge in face recognition and warrants future study.

In PIFR, the facial appearance change caused by pose variation often surpasses the intrinsic appearance differences between individuals. To overcome these challenges, a wide variety of approaches have been proposed, which can be grouped into two categories. First, some work employs face frontalization on the input image to synthesize a frontal-view face, where traditional face recognition algorithms are applicable [58, 194], or an identity representation can be obtained via modeling the face frontalization/rotation process [72, 197, 182]. The ability to generate a realistic identity-preserved frontal face is also beneficial for law enforcement practitioners to identify suspects. Second, other work focuses on learning discriminative representations directly from the non-frontal faces through either one joint model [119, 138] or multiple pose-specific models [101, 40]. In contrast, we propose a novel framework to take the best of both worlds: simultaneously learn a pose-invariant identity representation and synthesize faces with arbitrary poses, where face rotation is both a facilitator and a by-product for representation learning.

This chapter is adapted from the following publications:
[1] Luan Tran, Xi Yin, and Xiaoming Liu, "Disentangled Representation Learning GAN for Pose-Invariant Face Recognition," in CVPR, 2017.
[2] Luan Tran, Xi Yin, and Xiaoming Liu, "Representation Learning by Rotating Your Faces," in TPAMI, 2019.

Figure A1: Given one or multiple in-the-wild face images as the input, DR-GAN can produce a unified identity representation, by virtually rotating the face to arbitrary poses. The learnt representation is both discriminative and generative, i.e., the representation is able to demonstrate superior PIFR performance, and synthesize identity-preserved faces at target poses specified by the pose code.
As shown in Fig. A1, we propose the Disentangled Representation learning-Generative Adversarial Network (DR-GAN) for PIFR. Generative Adversarial Networks (GANs) [51] can generate samples following a data distribution through a two-player game between a generator $G$ and a discriminator $D$. Despite many recent promising developments [109, 39, 124, 30, 11], image synthesis remains the main objective of GANs. To the best of our knowledge, this is the first work that utilizes the generator in GAN for representation learning. To achieve this, we construct $G$ with an encoder-decoder structure (Fig. A2(d)) to learn a disentangled representation for PIFR. The input to the encoder $G_{enc}$ is a face image of any pose, the output of the decoder $G_{dec}$ is a synthetic face at a target pose, and the learnt representation bridges $G_{enc}$ and $G_{dec}$. While $G$ serves as a face rotator, $D$ is trained to not only distinguish real vs. synthetic (or fake) images, but also predict the identity and pose of a face. With the additional classifications, $D$ strives for the rotated face to have the same identity as the input real face, which has two effects on $G$: 1) the rotated face looks more like the input subject in terms of identity; 2) the learnt representation is more inclusive or generative for synthesizing an identity-preserved face.

In conventional GANs, $G$ takes a random noise vector to synthesize an image. In contrast, our $G$ takes a face image, a pose code $c$, and a random noise vector $z$ as the input, with the objective of generating a face of the same identity with the target pose that can fool $D$. Specifically, $G_{enc}$ learns a mapping from the input image to a feature representation. The representation is then concatenated with the pose code and the noise vector to feed to $G_{dec}$ for face rotation. The noise models facial appearance variations other than identity or pose. Note that it is a crucial architecture design to concatenate one representation with varying randomly generated pose codes and noise vectors. This enables DR-GAN to learn a disentangled identity representation that is exclusive or invariant to pose and other variations, which is the holy grail for PIFR when achievable.

Most existing face recognition algorithms only take one image for testing. In practice, there are many scenarios when an image collection of the same individual is available [75]. In this case, prior work fuses results either at the feature level [27] or the distance-metric level [167, 103]. Differently, our fusion is conducted within a unified framework. Given multiple images as the input, $G_{enc}$ operates on each image, and produces an identity representation and a coefficient, which is an indicator of the quality of that input image. Using the dynamically learned coefficients, the representations of all input images are linearly combined as one representation. During testing, $G_{enc}$ takes any number of images and generates a single identity representation, which is used by $G_{dec}$ for face synthesis along with the pose code.

Our generator is essential to both representation learning and image synthesis. We propose two techniques to further improve $G_{enc}$ and $G_{dec}$ respectively. First, we have observed that our $G_{enc}$ can always outperform $D$ in representation learning for PIFR. Therefore, we propose to replace the identity classification part of $D$ with the latest $G_{enc}$ during training so that a superior $D$ can push $G_{enc}$ to further improve itself. Second, since our $G_{dec}$ learns a mapping from the feature space to the image space, we propose to improve the learning of $G_{dec}$ by regularizing the average of two representations from different subjects to be a valid face, assuming a convex space of face identities. These two techniques are shown to be effective in improving the generalization ability of DR-GAN.

In summary, this paper makes the following contributions.
• We propose DR-GAN via an encoder-decoder structured generator that can frontalize or rotate a face with an arbitrary pose, even the extreme profile.

• Our learnt representation is explicitly disentangled from the pose variation via the pose code in the generator and the pose estimation in the discriminator. Similar disentanglement is conducted for other variations, e.g., illumination.

• We propose a novel scheme to adaptively fuse multiple faces to a single representation based on the learnt coefficients, which empirically show to be a good indicator of the face image quality.

• We achieve state-of-the-art face frontalization and face recognition performance on multiple benchmark datasets, including Multi-PIE [52], CFP [140], and IJB-A [75].

A2 Prior Work

Generative Adversarial Network (GAN). Goodfellow et al. [51] introduce GAN to learn generative models via an adversarial process. With a minimax two-player game, the generator and discriminator can both improve themselves. GAN has been used for image synthesis [39, 127], image super-resolution [187], etc. More recent work focuses on incorporating constraints on $z$ or leveraging side information for better synthesis. E.g., Mirza and Osindero [109] feed class labels to both $G$ and $D$ to generate images conditioned on class labels. In [136] and [114], GAN is generalized to learn a discriminative classifier, where $D$ is trained to not only distinguish between real vs. fake, but also classify the images. In InfoGAN [30], $G$ applies information regularization to the optimization by using the additional latent code. In contrast, this paper proposes a novel DR-GAN aiming for face representation learning, which is achieved via modeling the face rotation process. In Sec. A3.4, we will provide an in-depth discussion on our differences to the most relevant work in GANs.

One crucial issue with GANs is the difficulty of quantitative evaluation. Previous work either performs human studies to evaluate the quality of synthetic images [39] or uses the features in the discriminator for image classification [124]. In contrast, we innovatively construct the generator for representation learning, which can be quantitatively evaluated for PIFR.

Face Frontalization. Generating a frontal face from a profile face is very challenging due to self-occlusion. Prior methods in face frontalization can be classified into three categories: 3D-based methods [194, 58, 84], statistical methods [135], and deep learning methods [197, 179, 182, 72, 191]. E.g., Hassner et al. [58] use a mean 3D face model to generate a frontal face for any subject. A personalized face model could be used, but accurate 3D face reconstruction remains a challenge [133, 87, 160, 161]. In [135], a statistical model is used for joint frontalization and landmark localization by solving a constrained low-rank minimization problem. For deep learning methods, Kan et al. [72] propose SPAE to progressively rotate a non-frontal face to a frontal one via auto-encoders. Yang et al. [179] apply the recurrent action unit to a group of hidden units to incrementally rotate faces in fixed yaw angles.

All prior work frontalizes only near-frontal in-the-wild faces [58, 194] or large-pose controlled faces [182, 197]. In contrast, we can synthesize arbitrary-pose faces from a large-pose in-the-wild face. We use the adversarial loss to improve the quality of the synthetic images and identity classification in the discriminator to preserve identity.

Representation Learning. Designing the appropriate objectives for learning a good representation is an open question [10]. The work in [99] is among the first to use an encoder-decoder structure for representation learning, which, however, is not explicitly disentangled. DR-GAN is similar to DC-IGN [80], a variational autoencoder-based method for disentangled representation learning. However, DC-IGN achieves disentanglement by providing batch training samples with one attribute being fixed, which may not be applicable to unstructured in-the-wild data.
Prior work also explores joint representation learning and face rotation for PIFR, where [197, 182] are most relevant to our work. In [197], Multi-View Perceptron is used to untangle the identity and view representations by processing them with different neurons and maximizing the data log-likelihood. Yim et al. [182] use a multi-task CNN to rotate a face with any pose and illumination to a target pose, and the $L_2$ loss-based reconstruction of the input is the second task. Both works focus on image synthesis, and the identity representation is a by-product during the network learning. In contrast, DR-GAN focuses on representation learning, of which face rotation is both a facilitator and a by-product. We differ from [197, 182] in four aspects. First, we explicitly disentangle the identity representation from pose variations by the pose codes. Second, we employ the adversarial loss for high-quality synthesis, which drives better representation learning. Third, none of them applies to in-the-wild faces as we do. Finally, our ability to learn the representation from multiple unconstrained images has not been observed in prior work.

Face Image Quality Estimation. Low image quality is known to be a challenge for vision tasks [95, 32]. Image quality estimation is important for biometric recognition systems [12, 53, 157]. Numerous methods have been proposed to measure the image quality of different biometric modalities including face [1, 3, 116], iris [31, 78], fingerprint [148, 150], and gait [111, 105]. In the scenario of face recognition, an effective algorithm for face image quality estimation can help to either (i) reduce the number of poor images acquired during enrollment, or (ii) improve feature fusion during testing. Both cases can improve the face recognition performance. Abaza et al. [1] evaluate multiple quality factors such as contrast, brightness, sharpness, focus and illumination as a face image quality index for face recognition. However, they did not consider pose variation, which is a major challenge in face recognition. Ozay et al. [116] employ a Bayesian network to model the relationships between quality-related image features and face recognition performance, which is shown to boost the performance. The authors in [171] propose a patch-based face image quality estimation method, which takes into account geometric alignment, pose, sharpness, and shadows.

In this work, we employ quality estimation in a GAN framework that considers all factors of image quality presented in the dataset, with no direct supervision. For each input image, DR-GAN can generate a coefficient that indicates the quality of the input image. The representations from multiple images of the same subject are fused based on the learnt coefficients to generate one representation. We will show that the learnt coefficients are correlated to the image quality, i.e., a measurement of how good it can be used for face recognition.

A3 The Proposed DR-GAN Model

Our proposed DR-GAN has two variations: the basic model can take one image per subject for training, termed single-image DR-GAN, and the extended model can leverage multiple images per subject for both training and testing, termed multi-image DR-GAN. We start by introducing the original GAN, followed by the two DR-GAN variations and the proposed techniques to improve the generalization of our generator. Finally, we will compare our DR-GAN with previous GAN variations in detail.
A3.1 Generative Adversarial Network

A Generative Adversarial Network consists of a generator $G$ and a discriminator $D$ that compete in a two-player minimax game. The discriminator $D$ tries to distinguish between a real image $x$ and a synthetic image $G(z)$. The generator $G$ tries to synthesize realistic-looking images from a random noise vector $z$ that can fool $D$, i.e., $G(z)$ being classified as a real image. Concretely, $D$ and $G$ play the game with the following loss function:

$$\min_G \max_D L_{gan} = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \quad (1)$$

It is proved in [51] that this minimax game has a global optimum when the distribution $p_g$ of the synthetic samples and the distribution $p_d$ of the real samples are the same. Under mild conditions (e.g., $G$ and $D$ have enough capacity), $p_g$ converges to $p_d$. In the beginning of training, the samples generated from $G$ are extremely poor and are rejected by $D$ with high confidence. In practice, it is better for $G$ to maximize $\log(D(G(z)))$ instead of minimizing $\log(1 - D(G(z)))$ [51]. This objective results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning. As a result, $G$ and $D$ are trained to alternately optimize the following objectives:

$$\max_D L_D^{gan} = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \quad (2)$$

$$\max_G L_G^{gan} = \mathbb{E}_{z \sim p_z(z)}[\log(D(G(z)))]. \quad (3)$$

Figure A2: Comparison of previous GAN architectures and our proposed DR-GAN.

A3.2 Single-Image DR-GAN

Our single-image DR-GAN has two distinctive novelties compared to prior GANs. First, it learns an identity representation for a face image by using an encoder-decoder structured generator, where the representation is the encoder's output and the decoder's input. Since the representation is the input to the decoder to synthesize various faces of the same subject, i.e., virtually rotating his/her face, it is a generative representation.

Second, the appearance of a face is determined by not only the identity, but also numerous distractive variations, such as pose, illumination, and expression. Thus, the identity representation learned by the encoder would inevitably include the distractive side variations. E.g., the encoder would generate different identity representations for two faces of the same subject with 0° and 90° yaw angles. To remedy this, in addition to the class labels similar to semi-supervised GAN [136], we employ side information such as pose and illumination to explicitly disentangle these variations, which in turn helps to learn a discriminative representation.

A3.2.1 Problem Formulation

Given a face image $x$ with label $y = \{y^d, y^p\}$, where $y^d$ represents the label for identity and $y^p$ for pose, the objectives of our learning problem are twofold: 1) to learn a pose-invariant identity representation for PIFR, and 2) to synthesize a face image $\hat{x}$ with the same identity $y^d$ but at a different pose specified by a pose code $c$. Our approach is to train a DR-GAN conditioned on the original image $x$ and the pose code $c$, with its architecture illustrated in Fig. A2(d).

Different from the discriminator in a conventional GAN, our $D$ is a multi-task CNN consisting of three components: $D = [D^r, D^d, D^p]$. $D^r \in \mathbb{R}^1$ is for real/fake image classification, and $D^d \in \mathbb{R}^{N_d}$ is for identity classification with $N_d$ as the total number of subjects in the training set.
$D^p \in \mathbb{R}^{N_p}$ is for pose classification with $N_p$ as the total number of discrete poses. Given a face image $x$, $D$ aims to classify it as the real image class, and estimate its identity and pose; while given a synthetic face image from the generator $\hat{x} = G(x, c, z)$, $D$ attempts to classify $\hat{x}$ as fake, using the following objectives:

$$L_D^{gan} = \mathbb{E}_{x,y \sim p_d(x,y)}[\log D^r(x)] + \mathbb{E}_{x,y \sim p_d(x,y),\, z \sim p_z(z),\, c \sim p_c(c)}[\log(1 - D^r(G(x, c, z)))], \quad (4)$$

$$L_D^{id} = \mathbb{E}_{x,y \sim p_d(x,y)}[\log D^d_{y^d}(x)], \quad (5)$$

$$L_D^{pos} = \mathbb{E}_{x,y \sim p_d(x,y)}[\log D^p_{y^p}(x)], \quad (6)$$

where $D^d_i$ and $D^p_i$ are the $i$th elements in $D^d$ and $D^p$. For clarity, we will eliminate all subscripts for expected value notations, as all random variables are sampled from their respective distributions ($x, y \sim p_d(x, y)$, $z \sim p_z(z)$, $c \sim p_c(c)$). The objective for training $D$ is the weighted average of all objectives:

$$\max_D L_D = \lambda_g L_D^{gan} + \lambda_d L_D^{id} + \lambda_p L_D^{pos}, \quad (7)$$

where we set $\lambda_g = \lambda_d = \lambda_p = 1$.

Meanwhile, $G$ consists of an encoder $G_{enc}$ and a decoder $G_{dec}$. $G_{enc}$ aims to learn an identity representation $f(x) = G_{enc}(x)$ from a face image $x$. $G_{dec}$ aims to synthesize a face image $\hat{x} = G_{dec}(f(x), c, z)$ with identity $y^d$ and a target pose specified by $c$, where $z \in \mathbb{R}^{N_z}$ is the noise modeling other variations besides identity or pose. The pose code $c \in \mathbb{R}^{N_p}$ is a one-hot vector with the target pose $y^t$ being 1. The goal of $G$ is to fool $D$ to classify $\hat{x}$ to the identity of input $x$ and the target pose, with the following objectives:

$$L_G^{gan} = \mathbb{E}[\log D^r(G(x, c, z))], \quad (8)$$

$$L_G^{id} = \mathbb{E}[\log D^d_{y^d}(G(x, c, z))], \quad (9)$$

$$L_G^{pos} = \mathbb{E}[\log D^p_{y^t}(G(x, c, z))]. \quad (10)$$

Similarly, the objective for training the generator $G$ is the weighted average of each objective:

$$\max_G L_G = \mu_g L_G^{gan} + \mu_d L_G^{id} + \mu_p L_G^{pos}, \quad (11)$$

where we set $\mu_g = \mu_d = \mu_p = 1$.

$G$ and $D$ improve each other during the alternating training process. With $D$ being more powerful in distinguishing real vs. fake images and classifying poses, $G$ strives for synthesizing an identity-preserved face with the target pose to compete with $D$. We benefit from this process in three aspects. First, the learnt representation $f(x)$ will preserve more discriminative identity information. Second, the pose classification in $D$ guides the pose of the rotated face to be more accurate. Third, with a separate pose code as input to $G_{dec}$, $G_{enc}$ is trained to disentangle the pose variation from $f(x)$, i.e., $f(x)$ should encode as much identity information as possible, but as little pose information as possible. Therefore, $f(x)$ is not only generative for image synthesis, but also discriminative for PIFR.

A3.2.2 Network Structure

The network structure of single-image DR-GAN is adopted from CASIA-Net [180] with batch normalization (BN) for $G_{enc}$ and $D$. Besides, since the stability of the GAN game suffers if sparse gradient layers (MaxPool, ReLU) are used, we replace them with strided convolution and the exponential linear unit (ELU) respectively. $D$ is trained to optimize Eqn. 7 by adding a fully connected layer with the softmax loss for real vs. fake, identity, and pose classification respectively. $G$ includes $G_{enc}$ and $G_{dec}$ that are bridged by the to-be-learned identity representation $f(x) \in \mathbb{R}^{N_f}$, which is the AvgPool output in our $G_{enc}$. $f(x)$ is concatenated with a pose code $c$ and a random noise $z$. A series of fractionally-strided convolutions (FConv) [124] transforms the $(N_f + N_p + N_z)$-dim concatenated vector into a synthetic image $\hat{x} = G(x, c, z)$, which is of the same size as $x$. $G$ is trained to maximize Eqn. 11 when a synthetic face $\hat{x}$ is fed to $D$ and the gradient is back-propagated to update $G$.
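To make the alternating optimization of Eqns. 7 and 11 concrete, below is a condensed PyTorch sketch. The module signatures are assumptions for illustration; minimizing the cross-entropies here corresponds to maximizing the log-likelihood terms in the equations.

```python
# A condensed PyTorch sketch of the alternating updates behind Eqns. 7 and 11.
# D and G are assumed callables: D(x) -> (real/fake logit, identity logits,
# pose logits); G(x, c, z) -> synthetic image. Minimizing the cross-entropies
# below corresponds to maximizing the log-likelihood terms in the equations,
# with all lambdas and mus set to 1 as in the text.
import torch
import torch.nn.functional as F

def d_loss(D, G, x, y_id, y_pose, c, z):
    r_real, id_logits, pose_logits = D(x)
    r_fake, _, _ = D(G(x, c, z).detach())        # no gradient into G
    return (F.binary_cross_entropy_with_logits(r_real, torch.ones_like(r_real))
          + F.binary_cross_entropy_with_logits(r_fake, torch.zeros_like(r_fake))
          + F.cross_entropy(id_logits, y_id)          # L_id^D, Eqn. 5
          + F.cross_entropy(pose_logits, y_pose))     # L_pos^D, Eqn. 6

def g_loss(D, G, x, y_id, y_target, c, z):
    r_fake, id_logits, pose_logits = D(G(x, c, z))
    return (F.binary_cross_entropy_with_logits(r_fake, torch.ones_like(r_fake))
          + F.cross_entropy(id_logits, y_id)          # preserve input identity
          + F.cross_entropy(pose_logits, y_target))   # hit the target pose
```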
Previous work in face rotation uses the $L_2$ loss [197, 182] to enforce the synthetic face to be similar to the ground truth face at the target pose. This line of work requires the training data to include face image pairs of the same identity at different poses, which is achievable for controlled datasets such as Multi-PIE, but hard to fulfill for in-the-wild datasets. On the contrary, DR-GAN does not require image pairs since there is no direct supervision on the synthetic images. This enables us to utilize extensive real-world unstructured datasets for model training. To initialize the training, given a training image, we randomly sample the pose code with equal probability for each pose view. Such random sampling is conducted at each epoch during the training, for the purpose of assigning multiple pose codes to one training image. For the noise vector, we also randomly sample each dimension independently from the uniform distribution in the range of $[-1, 1]$.

A3.3 Multi-Image DR-GAN

Our single-image DR-GAN extracts an identity representation and performs face rotation by processing one single image. Yet, we often have multiple images per subject in training and sometimes in testing. To leverage them, we propose multi-image DR-GAN that can benefit both the training and testing stages. For training, it can learn a better identity representation from multiple images that are complementary to each other. For testing, it can enable template-to-template matching, which addresses a crucial need in real-world surveillance applications.

The multi-image DR-GAN has the same $D$ as single-image DR-GAN, but a different $G$ as shown in Fig. A3. Given $n$ images $\{x_i\}_{i=1}^n$ of the same identity $y^d$ at various poses as input, besides extracting the feature representation $f(x_i)$, $G_{enc}$ also estimates a coefficient $w_i$ for each image, which predicts the quality of the learnt representation. The fused representation of the $n$ images is the weighted average of all representations,

$$f(x_1, \ldots, x_n) = \frac{\sum_{i=1}^n w_i f(x_i)}{\sum_{i=1}^n w_i}. \quad (12)$$

Figure A3: Generator in multi-image DR-GAN. From an image set of a subject, we can fuse the features to a single representation via dynamically learnt coefficients and synthesize images in any pose.

This fused representation is then concatenated with $c$ and $z$ and fed to $G_{dec}$ to generate a new image, which is expected to have the same identity as all input images and a target pose $y^t$ specified by the pose code. Thus, each sub-objective for learning $G$ has $(n + 1)$ terms:

$$L_G^{gan} = \sum_{i=1}^n \mathbb{E}[\log(D^r(G(x_i, c, z)))] + \mathbb{E}[\log(D^r(G(x_1, \ldots, x_n, c, z)))]. \quad (13)$$

A similar extension applies to $L_G^{id}$ and $L_G^{pos}$. The coefficient $w_i$ in Eqn. 12 is learned so that an image with a higher quality contributes more to the fused representation. The quality is an indicator of the PIFR performance of the image, rather than the low-level image quality. Face quality prediction is a classic topic where many prior works attempt to estimate the former from the latter [116, 171]. Our coefficient learning is essentially the quality prediction, from novel perspectives in contrast to prior work. That is, without explicit supervision, it is driven by $D$ through the decoded image $G_{dec}(f(x_1, \ldots, x_n), c, z)$, and learned in the context of, as a by-product of, representation learning. Note that jointly training multiple images per subject results in one, but not multiple, generators, i.e., all $G_{enc}$ in Fig. A3 share the same parameters. This makes it flexible to take an arbitrary number of images during testing for representation learning and face rotation.
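Eqn. 12 amounts to only a few lines of code; the sketch below, with illustrative tensor shapes, shows the coefficient-weighted fusion.

```python
# Eqn. 12 in code form; shapes are illustrative assumptions.
import torch

def fuse_representations(f, w):
    """f: (n, N_f) per-image representations; w: (n,) coefficients in [0, 1].
    Returns the (N_f,) coefficient-weighted average of Eqn. 12."""
    return (w.unsqueeze(1) * f).sum(dim=0) / w.sum()

f = torch.randn(6, 320)             # n = 6 images, N_f = 320
w = torch.sigmoid(torch.randn(6))   # Sigmoid keeps w in [0, 1], as in the text
fused = fuse_representations(f, w)  # (320,)
```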
For the network structure, multi-image DR-GAN only makes minor modifications to the single-image counterpart. Specifically, at the end of $G_{enc}$, we add one more convolutional filter to the layer before AvgPool to estimate the coefficient $w$. We apply a Sigmoid activation to constrain $w$ in the range of $[0, 1]$. During training, despite being unnecessary, we keep the number of input images per subject $n$ the same for the sake of convenience in image sampling and network training. To mimic the variation in the number of input images, we use a simple but effective trick: applying Dropout on the coefficients $w$: each $w$ is set to 0 with a probability of 0.5. Hence, during training, the network takes any number of inputs varying from 1 to $n$.

DR-GAN can be used for PIFR, image quality prediction, and face rotation. While the network in Fig. A2(d) is used for training, our network for testing is much simplified. First, for PIFR, only $G_{enc}$ is used to extract the representation from one or multiple images. Second, for quality prediction, only $G_{enc}$ is used to compute $w$ from one image. Thirdly, both $G_{enc}$ and $G_{dec}$ are used for face rotation by specifying a target pose and a noise vector.

A3.4 Comparison to Prior GANs

We compare DR-GAN with the most relevant GAN variants (Fig. A2).

Conditional GAN. Conditional GAN [109, 81] extends GAN by feeding the labels to both $G$ and $D$ to generate images conditioned on labels, either class labels, modality information, or even partial data for inpainting. It has been used to generate MNIST digits conditioned on the class label and to learn multi-modal models. In conditional GAN, $D$ is trained to classify a real image with mismatched conditions to a fake class. In DR-GAN, $D$ classifies a real image to the corresponding class based on the labels.

Auxiliary Classifier GAN. Odena et al. [115] extend conditional GAN to add an additional classification task to $D$ to classify real images into $N_c$ classes. DR-GAN shares a similar loss for $D$ but with a distinct purpose. The auxiliary classification in Odena et al. [115] is used to help improve the stability and quality of GAN training. Meanwhile, we employ two additional classifications to guide the representation learning in the encoder-decoder structured $G$.

Adversarial Autoencoder (AAE). In AAE [98], $G$ is the encoder of an autoencoder. AAE has two objectives in order to turn an autoencoder into a generative model: the autoencoder reconstructs the input image, and the latent vector generated by the encoder matches an arbitrary prior distribution by training $D$. DR-GAN differs from AAE in two aspects. First, the autoencoder in [98] is trained to learn a latent representation similar to an imposed prior distribution, while our encoder-decoder learns discriminative identity representations. Second, $D$ in AAE is trained to distinguish real/fake distributions, while our $D$ is trained to classify real/fake images, and the identity and pose of the images.

A4 Experiments

DR-GAN can be used for face recognition by using the learnt representation from $G_{enc}$, and face rotation by specifying different pose codes and noise vectors with $G$. We evaluate DR-GAN quantitatively for PIFR and qualitatively for face rotation. We further conduct experiments to analyze the training strategy, the disentangled representation, and the image coefficients. Our experiments are conducted on both controlled and in-the-wild databases.

A4.1 Experimental Settings

Databases.
Multi-PIE [52] is the largest database for evaluating face recognition under pose, illumination, and expression variations in controlled settings. For a fair comparison, we follow the setting in [197]: using 337 subjects with neutral expression, 9 poses within ±60°, and 20 illuminations. The first 200 subjects are used for training and the remaining 137 subjects for testing. In the testing set, one image per subject with frontal view and neutral illumination forms the gallery set and the others are the probe set. For the Multi-PIE experiments, we add an additional illumination code similar to the pose code to disentangle the illumination variation. Therefore, we have $N_d = 200$, $N_p = 9$, $N_{il} = 20$. Further, to demonstrate our ability in synthesizing large-pose faces, we train a second model with training faces up to 90° (i.e., $N_p = 13$).

For the in-the-wild setting, we train on CASIA-WebFace [180] and AFLW [76], and test on CFP [140] and IJB-A [75]. CASIA-WebFace includes 494,414 near-frontal faces of 10,575 subjects. We add AFLW (25,993 images) to the training set to supply more pose variation. Since there is no identity information in this dataset, those images are only used to compute the GAN and pose-related losses. CFP consists of 500 subjects, each with 10 frontal and 4 profile images. The evaluation protocol includes frontal-frontal (FF) and frontal-profile (FP) face verification, each having 10 folders with 350 same-person pairs and 350 different-person pairs. As another large-pose database, IJB-A has 5,396 images and 20,412 video frames of 500 subjects. It defines template-to-template face recognition where each template has one or multiple images. We remove the 27 overlapping subjects between CASIA-WebFace and IJB-A from the training. We have $N_d = 10{,}548$, $N_p = 13$. We set $N_f = 320$, $N_z = 50$ for both settings.

Figure A4: The mean faces of the 13 pose groups in CASIA-WebFace. The blurriness shows the challenges of pose estimation for large poses.

Implementation Details. Following [180], we align all face images to a canonical view of size 110×110. We randomly sample 96×96 regions from the aligned 110×110 face images for data augmentation. Image intensities are linearly scaled to the range of $[-1, 1]$. To provide pose labels $y^p$ for CASIA-WebFace, we apply 3D face alignment [71, 70] to classify each face into one of 13 poses. The mean face image of each pose group is shown in Fig. A4. The mean faces of profile faces are less sharp than those of the near-frontal pose groups, which indicates the pose estimation error caused by the face alignment algorithm.

Our implementation is extensively modified from a publicly available implementation of DC-GAN. We follow the optimization strategy in [124]. The batch size is set to 64. All weights are initialized from a zero-centered normal distribution with a standard deviation of 0.02. The Adam optimizer [74] is used with a learning rate of 0.0002 and momentum 0.5.

Evaluation. The proposed DR-GAN aims for both face representation learning and face image synthesis. The cosine distance between two representations is used for face recognition. We also evaluate the performance of face recognition w.r.t. different numbers of images in both training and testing. For image synthesis, we show qualitative results by comparing different losses and interpolation of the learnt representations. We also evaluate the various effects of different components in our method.

Table A1: DR-GAN and its partial variants performance comparison.

                   Verification                Identification
Method             @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
DR-GAN w/o D^r     80.0±2.2      55.5±3.5      88.7±0.8      95.0±0.8
DR-GAN w/o D^p     78.0±2.0      53.9±6.8      87.5±0.8      94.5±0.7
DR-GAN             81.2±2.7      56.2±9.1      89.0±1.4      95.1±0.9

Figure A5: Generated faces of DR-GAN and its partial variants.

A4.2 Ablation Study

Discriminator Components.
Our discriminator is designed as a multi-task CNN with three components, namely $D^r$, $D^d$, $D^p$, for real/fake, identity, and pose classification respectively. While $D^d$ plays a critical role in guiding the generator to preserve the input identity, we would like to study the role of the remaining components. Table A1 presents the recognition performance of single-image DR-GAN partial variants with each of the $D$ components removed. While the variant without the adversarial loss has a slight performance drop, the model without the pose classification task has a more severe drop. This shows the importance of generating face images in different poses. Also, the role of each component is shown in the generated faces (Fig. A5). When removing $D^r$, generated images have lower quality, although they can still be recognized as faces in correct poses. When removing $D^p$, the pose of generated images can't be controlled by the pose code and is usually affected by the input face's pose. This can be caused by pose information residing in the feature representation, which also explains the severe drop in the model's recognition performance.

Disentangled Representation. In DR-GAN, we claim that the learnt representation is disentangled from pose variations via the pose code. To validate this, following the energy-based weight visualization method proposed in [184], we perform feature visualization on the FC layer, denoted as $h \in \mathbb{R}^{6 \times 6 \times 320}$, in $G_{dec}$. Our goal is to select two out of the 320 filters that have the highest responses for identity and pose respectively. The assumption is that if the learnt representation is pose-invariant, there should be separate neurons to encode the identity features and pose features.

Recall that we concatenate $f(x) \in \mathbb{R}^{320}$, $c \in \mathbb{R}^{13}$ and $z \in \mathbb{R}^{50}$ into one feature vector, which is multiplied with a weight matrix $W_{fc} \in \mathbb{R}^{(320+13+50) \times (6 \cdot 6 \cdot 320)}$ and generates the output $h$, with $h_i \in \mathbb{R}^{6 \times 6}$ being the feature output of one filter in FC. Let $W_{fc} = [W_{fx}; W_c; W_z]$ denote the weight matrix with three sub-matrices, which multiply with $f(x)$, $c$, $z$ respectively. Taking the identity sub-matrix as an example, we have $W_{fx} = [W_{fx}^1, W_{fx}^2, \ldots, W_{fx}^{320}]$ where $W_{fx}^i \in \mathbb{R}^{320 \times 36}$. We compute an energy vector $s_d \in \mathbb{R}^{320}$ with each element as $s_d^i = \|W_{fx}^i\|_F$. We then find the filter with the highest energy in $s_d$ as $k_d = \arg\max_i s_d^i$. Similarly, by partitioning $W_c$, we find another filter, denoted as $k_p$, with the highest energy for pose.

Given the representation $f(x)$ of one subject, along with a pose code $c$ and noise $z$, we can compute the responses of the two filters via $h_{k_d} = (f(x), c, z)^\top W_{fc}^{k_d}$ and $h_{k_p} = (f(x), c, z)^\top W_{fc}^{k_p}$. By varying the subjects and pose codes, we generate two arrays of responses in Fig. A6, for identity ($h_{k_d}$) and pose ($h_{k_p}$) respectively. For both arrays, each row represents the responses of the same subject and each column represents the same pose. The responses for identity encode the identity features, where each row shows similar patterns and each column does not share similarity. On the contrary, for the pose responses, each column shares similar patterns while each row is not related. This visualization supports our claim that the learnt representation is pose-invariant.

Figure A6: Responses of the two filters with the highest responses to identity (left), and pose (right). Responses of each row are of the same subject, and each column of the same pose. Note the within-row similarity on the left and within-column similarity on the right.
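The filter selection reduces to a per-filter Frobenius norm and an argmax; a short sketch is below, assuming a filter-major layout of $W_{fc}$ (an assumption about the memory ordering, not the thesis code).

```python
# A short sketch of the energy-based filter selection above, assuming a
# filter-major layout of W_fc (320 filters x 36 spatial outputs per filter).
import torch

def highest_energy_filters(W_fc, n_f=320, n_p=13):
    """W_fc: ((N_f + N_p + N_z) x 6*6*320) FC weight matrix, partitioned
    row-wise as [W_fx; W_c; W_z]. Returns (k_d, k_p), the filters whose
    identity and pose sub-matrices have the largest Frobenius norms."""
    W = W_fc.view(W_fc.shape[0], 320, 36)     # (rows, filter, spatial)
    s_d = W[:n_f].norm(dim=(0, 2))            # s_d^i = ||W_fx^i||_F
    s_p = W[n_f:n_f + n_p].norm(dim=(0, 2))   # pose energies from W_c
    return s_d.argmax().item(), s_p.argmax().item()

k_d, k_p = highest_energy_filters(torch.randn(320 + 13 + 50, 6 * 6 * 320))
```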
Table A2: Comparison of single vs. multi-image DR-GAN on CFP.

Method         Frontal-Frontal   Frontal-Profile
DR-GAN: n=1    97.13±0.68        90.82±0.28
DR-GAN: n=4    97.86±0.75        92.93±1.39
DR-GAN: n=6    97.84±0.79        93.41±1.17

Single vs. Multiple Image DR-GAN. We evaluate the effect of the number of training images ($n$) per subject on the face recognition performance on CFP. Specifically, with the same training set, we train three models with $n = 1, 4, 6$, where $n = 1$ denotes single-image DR-GAN and $n > 1$ denotes multi-image DR-GAN. The face verification performance on CFP using $f(x)$ of each model is shown in Tab. A2. We observe the advantage of multi-image DR-GAN over the single-image counterpart despite them using the same amount of training data, which is attributed to more constraints in learning $G_{enc}$ that lead to a better representation. However, we do not keep increasing $n$ due to the limited computation capacity. In the rest of the paper, we use multi-image DR-GAN with $n = 6$ unless specified otherwise.

Figure A7: Coefficient distributions on IJB-A (a) and CFP (b). For IJB-A, we visualize images at four regions of the distribution. For CFP, we plot the distributions for frontal faces (blue) and profile faces (red) separately and show images at the heads and tails of each distribution.

A4.3 Coefficient Analysis

In multi-image DR-GAN, we learn a coefficient for each input image by assuming that the learnt coefficient is indicative of the image quality, i.e., how good it can be used for face recognition. Therefore, a low-quality image should have a relatively poor representation and a small coefficient so that it would contribute less to the fused representation. To validate this assumption, we compute the coefficients for all images in the IJB-A and CFP databases and plot the distributions as shown in Fig. A7.

For IJB-A, we show four example images with low, medium-low, medium-high, and high coefficients. It is obvious that the learnt coefficients are correlated to the image quality. Images with relatively low coefficients are usually blurry, with large poses or failed cropping, while images with relatively high coefficients are of very high quality with frontal faces and less occlusion. Since CFP consists of 5,000 frontal faces and 2,000 profile faces, we plot their distributions separately. Despite some overlap in the middle region, the profile faces clearly have relatively low coefficients compared to the frontal faces. Within each distribution, the coefficients are related to other variations except yaw angles. The low-quality images for each pose group are with occlusion and/or challenging lighting conditions, while the high-quality ones are with less occlusion and under normal lighting.

Figure A8: The correlation between the estimated coefficients and the classification probabilities.

To quantitatively evaluate the correlation between the coefficients and face recognition performance, we conduct an identity classification experiment on IJB-A. Specifically, we randomly select all frames of one video for each subject, and select half of the images for training and the remaining for testing. The training and testing sets share the same identities. Therefore, in the testing stage, we can use the output of the softmax layer as the probability of each testing image belonging to the right identity class. This probability is an indicator of how well the input image can be recognized as the true identity. Given the estimated coefficients, we plot these two values for the testing set, as shown in Fig. A8. These two values are highly correlated to each other with a correlation of 0.69, which again supports our assumption that the learnt coefficients are indicative of the image quality.

Image selection with $w$. One common application of image quality is to prevent low-quality images from contributing to face recognition. To validate whether our coefficients have such usability, we design the following experiment. For each template in IJB-A, we keep images whose coefficients $w$ are larger than a threshold $w_t$; if all $w$ are smaller, we keep the one image with the highest $w$.
Table A3: Performance on IJB-A when removing images by threshold $w_t$. "Selected" shows the percentage of retained images.

                       Verification                Identification
$w_t$   Selected (%)   @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
0       100.0          84.3±1.4      72.6±4.4      91.0±1.5      95.6±1.1
0.1     94.9           84.2±1.7      72.7±2.9      91.3±1.3      95.7±1.0
0.25    71.9           83.6±1.2      73.3±3.0      90.7±1.2      95.2±1.0
0.5     24.6           80.9±1.9      71.3±4.7      86.5±1.9      93.1±1.6
1.0     5.7            77.8±2.2      64.0±6.2      83.4±2.3      91.6±1.2

Tab. A3 reports the performance on IJB-A with different $w_t$. With $w_t$ being 0, all test images are kept and the result is the same as Tab. A6. These results show that keeping all or the majority of the samples is better than removing them. This is encouraging, as it verifies the effectiveness of DR-GAN in automatically diminishing the impact of low-quality images, without removing them by thresholding.

Feature fusion with $w$. We also would like to show that our proposed feature fusion using the coefficients $w$ is effective for the template-to-template matching purpose. We compare it with multiple fusion methods at both the feature level and the score level. Table A4 shows comparisons of different fusion methods on our multi-image DR-GAN features. To compare two templates with sizes $n_1$, $n_2$, for score-level fusion, min, max, and mean respectively take the minimum, maximum and average of all $n_1 \cdot n_2$ possible pairwise distances. Mean-min is the average of the $n_1 + n_2$ minimum distances from each feature of one template to the other. All of these methods have a time complexity of $O(n_1 n_2)$. Softmax, proposed in [2], aggregates multiple weighted averages of the pairwise scores, where each weight is a function of the score using an exponential function at different scales. It has a time complexity of $O(m n_1 n_2)$, where $m$ is the number of weight scales. Here, following [101], we use a total of $m = 21$ scales from 0 to 20. For feature-level fusion, max and mean are respectively max-pooling and average-pooling along each feature dimension. All feature-level fusion methods, including our $w$-fusion, have a time complexity of $O(n_1 + n_2)$. From Tab. A4, our fusion using the estimated $w$ achieves the best performance among all methods.

Table A4: Fusion scheme comparisons on the IJB-A dataset.

                         Verification                Identification
         Method          @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
Score    Min             78.3±2.7      46.0±6.9      86.7±1.4      94.0±0.6
         Max             22.8±2.0      12.3±2.3      30.6±2.8      52.8±2.7
         Mean            72.8±2.9      49.2±5.3      85.7±1.3      93.1±0.6
         Mean-min        82.4±2.2      58.5±6.3      90.2±1.0      95.6±0.5
         Softmax         84.3±1.6      69.2±6.8      90.1±1.0      95.5±0.8
Feature  Max             19.0±1.3      12.1±1.7      45.4±5.3      62.6±0.9
         Mean            83.0±1.5      67.0±4.8      89.6±1.5      95.4±0.7
         $w$-fusion      84.3±1.4      72.6±4.4      91.0±1.5      95.6±1.1
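The two fusion families compared in Tab. A4 can be contrasted in a short sketch; cosine distance is assumed as the underlying metric, as used for recognition in this chapter.

```python
# A sketch contrasting the two fusion families in Tab. A4, assuming cosine
# distance between representations: score-level mean-min costs O(n1*n2)
# pairwise comparisons, feature-level w-fusion costs O(n1 + n2).
import torch
import torch.nn.functional as F

def mean_min_score(f1, f2):
    """f1: (n1, d), f2: (n2, d). Average of the n1 + n2 minimum pairwise
    cosine distances from each feature to the other template."""
    d = 1.0 - F.normalize(f1) @ F.normalize(f2).t()   # (n1, n2) distances
    return torch.cat([d.min(dim=1).values, d.min(dim=0).values]).mean()

def w_fusion_score(f1, w1, f2, w2):
    """Fuse each template with its coefficients first, then compare once."""
    g1 = (w1.unsqueeze(1) * f1).sum(0) / w1.sum()
    g2 = (w2.unsqueeze(1) * f2).sum(0) / w2.sum()
    return 1.0 - F.cosine_similarity(g1, g2, dim=0)

f1, f2 = torch.randn(5, 320), torch.randn(8, 320)     # two templates
w1, w2 = torch.rand(5), torch.rand(8)                 # their coefficients
print(mean_min_score(f1, f2), w_fusion_score(f1, w1, f2, w2))
```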
A4.4 Representation Learning

Loss Function Comparison. Our $G_{dec}$ and $D$ can be viewed as a loss function for $f(x)$. Typical loss functions used in deep learning-based face recognition can be divided into two categories: probability-based and energy-based losses. Probability-based losses (i.e., softmax and its variants) usually compute a distribution of probability over all identities. Meanwhile, energy-based losses (contrastive, triplet, etc.) associate an energy to each configuration of samples. Here, we compare DR-GAN to multiple common loss functions for face recognition. To have a fair comparison on IJB-A, for all functions, we use our $G_{enc}$ network architecture and "mean-min" fusion. DR-GAN by itself can surpass all prior loss functions (Tab. A5). Also, any advanced loss function can be beneficial to DR-GAN: energy-based losses (center, triplet, etc.) can be employed directly on our representation $f(x)$, or probability-based losses (angular, additive-margin softmax, etc.) can be used to replace the softmax of $D^d$. Empirically, using additive-margin softmax [168] as a softmax replacement on $D^d$ can further improve DR-GAN performance; we name this variant DR-GAN_AM.

Table A5: Loss function comparisons. All use "mean-min" fusion.

                     Verification                Identification
Method               @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
Softmax              75.9±3.9      44.1±9.9      87.8±0.9      94.6±0.6
Center [170]         74.9±3.1      50.3±7.0      87.2±1.4      95.2±0.9
Triplet [138]        74.9±3.1      50.3±7.0      87.2±1.4      95.2±0.9
AM-Softmax [168]     81.3±3.0      52.7±8.9      88.7±0.7      94.3±0.4
DR-GAN single img.   81.2±2.7      56.2±9.1      89.0±1.4      95.1±0.9
DR-GAN               82.4±2.3      58.5±8.0      90.2±1.0      95.6±0.5
DR-GAN_AM            85.7±1.6      70.3±5.7      91.0±1.5      95.6±1.1

Table A6: Performance comparison on the IJB-A dataset.

                     Verification                Identification
Method               @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
GOTS [75]            40.6±1.4      19.8±0.8      44.3±2.1      59.5±2.0
Wang et al. [167]    72.9±3.5      51.0±6.1      82.2±2.3      93.1±1.4
DCNN [27]            78.7±4.3      -             85.2±1.8      93.7±1.0
PAM_frontal [101]    73.3±1.8      55.2±3.2      77.1±1.6      88.7±0.9
PAMs [101]           82.6±1.8      65.2±3.7      84.0±1.2      92.5±0.8
p-CNN [184]          77.5±2.5      53.9±4.2      85.8±1.4      93.8±0.9
FF-GAN [185]         85.2±1.0      66.3±3.3      90.2±0.6      95.4±0.5
DR-GAN               85.6±1.5      75.1±4.2      91.3±1.6      95.8±1.0
DR-GAN_AM            87.2±1.4      78.1±3.5      92.0±1.3      96.1±0.7

Results on Benchmark Databases. We compare DR-GAN with state-of-the-art face recognizers on IJB-A, CFP and Multi-PIE.

Table A6 shows the performance of both face verification and identification on IJB-A. For our results, we report the results of multi-image DR-GAN using the proposed $w$-fusion. The first row shows the performance of the presented DR-GAN model (using the typical softmax loss); the second row presents the variant using additive-margin softmax [168]. Compared to the state of the art, DR-GAN achieves superior results on both verification and identification. These in-the-wild results show the power of DR-GAN for PIFR.

Table A7: Performance (Accuracy) comparison on CFP.

Method                        Frontal-Frontal   Frontal-Profile
Sengupta et al. [140]         96.40±0.69        84.91±1.82
Sankaranarayanan et al. [137] 96.93±0.61        89.17±2.35
Chen et al. [28]              98.67±0.36        91.97±1.70
Human                         96.24±0.67        94.57±1.10
DR-GAN                        98.13±0.81        93.64±1.51
DR-GAN_AM                     98.36±0.75        93.89±1.39

Table A7 shows the comparison on CFP evaluated with Accuracy. Results are reported as the average with standard deviation over 10 folds. Overall, we achieve comparable performance on frontal-frontal verification while having a 1.92% improvement on frontal-profile verification.

Table A8: Identification rate (%) comparison on the Multi-PIE dataset.

Method             0°      ±15°    ±30°    ±45°    ±60°    Average
Zhu et al. [196]   94.3    90.7    80.7    64.1    45.9    72.9
Zhu et al. [197]   95.7    92.8    83.7    72.9    60.1    79.3
Yim et al. [182]   99.5    95.0    88.5    79.9    61.9    83.3
Using L2 loss      95.1    90.8    82.7    72.7    57.9    78.3
DR-GAN             98.1    94.9    91.1    87.2    84.6    90.4
DR-GAN_AM          98.1    95.0    91.3    88.0    85.8    90.8

Table A8 shows the face identification performance on Multi-PIE compared to the methods with the same setting. Our method shows a significant improvement for large-pose faces, e.g., there is more than a 20% improvement margin at ±60° poses. The variation of recognition rates across different poses is much smaller than the baselines, which suggests that our learnt representation is more robust to pose variation.

Representation vs. Synthetic Image for PIFR. Many prior works [58, 194] use frontalized faces for PIFR. To evaluate the identity preservation of the synthetic images from DR-GAN, we also perform face recognition using our frontalized faces.

Table A9: Representation $f(x)$ vs. synthetic image $\hat{x}$ on IJB-A.
                               Verification                Identification
Features                       @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
$f(\hat{x})$                   78.5±1.9      60.3±3.7      86.9±1.6      94.2±1.3
$D^d(\hat{x})$                 77.1±2.9      53.5±6.2      85.7±1.7      93.6±1.6
$f'(\hat{x})$                  79.2±2.9      60.8±7.3      89.2±1.4      95.3±1.1
$f'(\hat{x})$ & $f(\hat{x})$   83.0±1.8      71.7±3.6      90.7±1.4      95.6±1.0
$f(x)$                         84.3±1.4      72.6±4.4      91.0±1.5      95.6±1.1

Figure A9: Face rotation comparison on Multi-PIE. Given the input (in illumination 07 and 75° pose), we show synthetic images of the $L_2$ loss (top), adversarial loss (middle), and ground truth (bottom). Columns 2-5 show the ability of DR-GAN in simultaneous face rotation and re-lighting.

Any face feature extractor could be applied to the frontalized faces, including $G_{enc}$ or $D^d$. However, both are trained on real images of various poses. To specialize to synthetic frontal faces, we fine-tune $G_{enc}$ with the synthetic images and denote it as $f'(\cdot)$. As shown in Tab. A9, although the performance of synthetic images (and its score-level fusion denoted as $f'(\hat{x})$ & $f(\hat{x})$) is not as good as the learnt representation, using the fine-tuned $G_{enc}$ on synthetic frontals still achieves comparable performance to the previous methods, which shows the identity preservation ability of DR-GAN.

A4.5 Face Rotation

Adversarial Loss vs. L2 Loss. Prior work [196, 182, 179] on face rotation normally employs the $L_2$ loss to learn a mapping between two views. To compare the $L_2$ loss with our adversarial loss, we train a model where $G$ is supervised by an $L_2$ loss on the ground truth face at the target view. The training process is kept the same for a fair comparison. As shown in Fig. A9, DR-GAN can generate far more realistic faces that are similar to the ground truth faces in all views. Meanwhile, images synthesized with the $L_2$ loss cannot maintain high-frequency components and are blurry. In fact, the $L_2$ loss treats each pixel equally, which leads to the loss of discriminative information. This inferior synthesis is also reflected in the lower PIFR performance in Tab. A8. In contrast, by integrating the adversarial loss, we expect to learn a more discriminative representation for better recognition, and a more generative representation for better face synthesis.

Figure A10: Interpolation of $f(x)$, $c$, and $z$. (a) Synthetic images by interpolating between the identity representations of two faces (Columns 1 and 12). Note the smooth transition between different genders and facial attributes. (b) Pose angles 0°, 15°, 30°, 45°, 60°, 75°, 90° are available in the training set. DR-GAN interpolates in-between unseen poses via continuous pose codes, shown above Row 3. (c) For each image in Column 1, DR-GAN synthesizes two images at $z = -1$ (Column 2) and $z = 1$ (Column 12), and in-between images by interpolating along the two $z$.

Variable Interpolations. Taking two images of different subjects $x_1$, $x_2$, we extract the features $f(x_1)$ and $f(x_2)$ from $G_{enc}$. The interpolation between $f(x_1)$ and $f(x_2)$ can generate many representations, which can be fed to $G_{dec}$ to synthesize face images.
Similarinterpolationcanbeconductedfortheposecodesaswell.Duringtraining,weuse aone-hotvector c tospecifythe discrete poseofthesyntheticimage.Duringtesting,wecould generatefaceimageswith continuous poses,whoseposecodeistheweightedaverage,i.e.,inter- polation,oftwoneighboringposecodes.Notethattheresultantposecodeisnolongeraone-hot vector.AsinFig.A10(b),thisleadstosmoothposetransitionfromoneviewtomanyviews unseen tothetrainingset. Wecanalsointerpolatethenoisevector z .Wesynthesizefrontalfacesat z = 1 and z = 1 (a vectorofall1s)andinterpolatebetweentwo z .Giventheedidentityrepresentationandpose 123 FigureA12: FacefrontalizationonIJB-A.Foreachoffoursubjects,weshow11inputimageswithesti- matedcoefoverlaidatthetopleftcornerrow)andtheirfrontalizedcounterpart(secondrow). Thelastcolumnisthegroundtruthfrontalandsyntheticfrontalfromthefusedrepresentationofall11im- ages.Notethechallengesoflargeposes,occlusion,andlowresolution,andour opportunistic frontalization. code,thesyntheticimagesareidentity-preservedfrontalfaces.AsinFig.A10(c),thechangeof z leadstothechangeofthebackground,illuminationcondition,andfacialattributessuchasbeard, whiletheidentityiswellpreservedandfacesareofthefrontalview.Thus, z modelsless facevariations. FaceRotationonBenchmarkDatabases. Ourgeneratoristrainedtobeafacerotator.Given oneormultiplefaceimageswitharbitraryposes,wecangeneratemultipleidentity-preservedfaces atdifferentviews.FigureA9showsthefacerotationresultsonMulti-PIE.Givenaninputimage atanypose,wecangeneratemulti-viewimagesofthesamesubjectbutatadifferentposeby specifyingdifferentposecodesorinadifferentlightingconditionbyvaryingilluminationcode. 124 FigureA13: FacefrontalizationonIJB-Aforanimagesetsubject)andavideosequence(second subject).Foreachsubject,weshow11inputimages(row),theirrespectivefrontalizedfaces(second row)andthefrontalizedfacesusing incrementally fusedrepresentationsfromallpreviousinputsuptothis image(thirdrow).Inthelastcolumn,weshowthegroundtruthfrontalface. Therotatedfacesaresimilartothegroundtruthwithwell-preservedattributessuchaseyeglasses. Oneapplicationoffacerotationisfacefrontalization.OurDR-GANcanbeusedforface frontalizationbyspecifyingthefrontal-viewasthetargetpose.FigureA11showsthefacefrontal- izationonCFP.Givenanextremeinputimage,DR-GANcangeneratearealisticfrontal facethathassimilaridentitycharacteristicsastherealfrontalface.Tothebestofourknowledge, thisistheworkthatisableto frontalizeaprwin-the-wildfaceimage .Whentheinput imageisalreadyinthefrontalview,thesyntheticimagescancorrectthepitchandrollangles, normalizeilluminationandexpression,andimputeoccludedfacialareas,asshowninthelastfew examplesofFig.A11. FigureA12showsfacefrontalizationresultsonIJB-A.Foreachsubjectortemplate,weshow 11imagesandtheirrespectivefrontalizedfaces,andthefrontalizedfacegeneratedfromthefused representation.Foreachinputimage,theestimatedcoef w isshownonthetop-leftcorner 125 ofeachimage,whichclearlyindicatesthequalityoftheinputimageaswellasthefrontalized image.Forexample,coefforlow-qualityorlarge-poseinputimagesareverysmall.These imageswillhaveverylittlecontributiontothefusedrepresentation.Finally,thefacefromthe fusedrepresentationhassuperiorqualitycomparedtoallfrontalizedimagesfromasingleinput face.Thisshowstheeffectivenessofourmulti-imageDR-GANintakingadvantageofmultiple imagesofthesamesubjectforbetterrepresentationlearning. 
Face Rotation on Benchmark Databases. Our generator is trained to be a face rotator. Given one or multiple face images with arbitrary poses, we can generate multiple identity-preserved faces at different views. Figure A9 shows the face rotation results on Multi-PIE. Given an input image at any pose, we can generate multi-view images of the same subject but at a different pose by specifying different pose codes, or in a different lighting condition by varying the illumination code. The rotated faces are similar to the ground truth with well-preserved attributes such as eyeglasses.

One application of face rotation is face frontalization. Our DR-GAN can be used for face frontalization by specifying the frontal view as the target pose. Figure A11 shows the face frontalization on CFP. Given an extreme profile input image, DR-GAN can generate a realistic frontal face that has similar identity characteristics as the real frontal face. To the best of our knowledge, this is the first work that is able to frontalize a profile-view in-the-wild face image. When the input image is already in the frontal view, the synthetic images can correct the pitch and roll angles, normalize illumination and expression, and impute occluded facial areas, as shown in the last few examples of Fig. A11.

Figure A12: Face frontalization on IJB-A. For each of four subjects, we show 11 input images with estimated coefficients overlaid at the top left corner (first row) and their frontalized counterparts (second row). The last column is the ground truth frontal and the synthetic frontal from the fused representation of all 11 images. Note the challenges of large poses, occlusion, and low resolution, and our opportunistic frontalization.

Figure A12 shows face frontalization results on IJB-A. For each subject or template, we show 11 images and their respective frontalized faces, and the frontalized face generated from the fused representation. For each input image, the estimated coefficient $w$ is shown at the top-left corner of each image, which clearly indicates the quality of the input image as well as the frontalized image. For example, the coefficients for low-quality or large-pose input images are very small. These images will have very little contribution to the fused representation. Finally, the face from the fused representation has superior quality compared to all frontalized images from a single input face. This shows the effectiveness of our multi-image DR-GAN in taking advantage of multiple images of the same subject for better representation learning.

To further evaluate face frontalization results w.r.t. different numbers of input images, we vary the number of input images from 1 to 11 and visualize the frontalized images from the incrementally fused representations. As shown in Fig. A13, the individually frontalized faces have varying degrees of resemblance to the true subject, according to the qualities of the different input images. The synthetic images from the fused representations (third row) improve as the number of images increases.

Figure A13: Face frontalization on IJB-A for an image set (first subject) and a video sequence (second subject). For each subject, we show 11 input images (first row), their respective frontalized faces (second row) and the frontalized faces using incrementally fused representations from all previous inputs up to this image (third row). In the last column, we show the ground truth frontal face.

A5 Conclusions

This paper presents DR-GAN to learn a disentangled representation for PIFR by modeling the face rotation process. We are the first to construct the generator in GAN with an encoder-decoder structure for representation learning, which can be quantitatively evaluated by performing PIFR. Using the pose code for decoding and pose classification in the discriminator leads to the disentanglement of pose variation from the identity features. We also propose multi-image DR-GAN to leverage multiple images per subject in both training and testing to learn a better representation. This is the first work that is able to frontalize an extreme-pose in-the-wild face. We attribute the superior PIFR and face synthesis capabilities to the discriminative yet generative representation learned in $G$. Our representation is discriminative since the other variations are explicitly disentangled by the pose/illumination codes and random noise, and is generative since its decoded (synthetic) image would still be classified as the original identity.

PUBLICATIONS

Journal Papers

1. Luan Tran and Xiaoming Liu, "On Learning 3D Face Morphable Model from In-the-wild Images," in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), July 2019.

2. Luan Tran, Xi Yin, and Xiaoming Liu, "Representation Learning by Rotating Your Faces," in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), September 2018.

Conference Papers

1. Feng Liu, Luan Tran, and Xiaoming Liu, "3D Face Modeling from Diverse Raw Scan Data," Proceeding of IEEE International Conference on Computer Vision (ICCV) 2019, Seoul, South Korea, October, 2019. (Oral presentation)

2. Bangjie Yin*, Luan Tran*, Haoxiang Li, Xiaohui Shen, and Xiaoming Liu, "Towards Interpretable Face Recognition," Proceeding of IEEE International Conference on Computer Vision (ICCV) 2019, Seoul, South Korea, October, 2019. (Oral presentation) (* denotes equal contribution by the authors)

3. Luan Tran, Feng Liu, and Xiaoming Liu, "Towards High-fidelity Nonlinear 3D Face Morphable Model," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, California, June, 2019.

4. Luan Tran, Kihyuk Sohn, Xiang Yu, Xiaoming Liu, and Manmohan Chandraker, "Gotta Adapt 'Em All: Joint Pixel and Feature-Level Domain Adaptation for Recognition in the Wild," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, California, June, 2019.

5. Ziyuan Zhang, Luan Tran, Xi Yin, Yousef Atoum, Jian Wan, Nanxin Wang, and Xiaoming Liu, "Gait Recognition via Disentangled Representation Learning," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, California, June, 2019. (Oral presentation)

6. Anurag Chowdhury, Yousef Atoum, Luan Tran, Xiaoming Liu, and Arun Ross, "MSU-AVIS dataset: Fusing Face and Voice Modalities for Biometric Recognition in Indoor Surveillance Videos," in Proceeding of International Conference on Pattern Recognition (ICPR), Beijing, China, August, 2018.
7. Luan Tran and Xiaoming Liu, "Nonlinear 3D Face Morphable Model," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, Utah, June 2018. (Spotlight presentation)

8. Luan Tran, Xi Yin, and Xiaoming Liu, "Disentangled Representation Learning GAN for Pose-Invariant Face Recognition," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, Hawaii, July 2017. (Oral presentation)

9. Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin, "Missing Modalities Imputation via Cascaded Residual Autoencoder," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, Hawaii, July 2017.