LEARNING 3D MODEL FROM 2D IN-THE-WILD IMAGES

By

Luan Quoc Tran

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2020

ABSTRACT

LEARNING 3D MODEL FROM 2D IN-THE-WILD IMAGES

By

Luan Quoc Tran

Understanding the 3D world is one of computer vision's fundamental problems. While a human has no difficulty understanding the 3D structure of an object upon seeing its 2D image, such a 3D inference task remains extremely challenging for computer vision systems. To better handle the ambiguity in this inverse problem, one must rely on additional prior assumptions, such as constraining faces to lie in a restricted subspace from a 3D model. Conventional 3D models are learned from a set of 3D scans or computer-aided design (CAD) models, and represented by two sets of PCA basis functions. Due to the type and amount of training data, as well as the linear bases, the representation power of these models can be limited. To address these problems, this thesis proposes an innovative framework to learn a nonlinear 3D model from a large collection of in-the-wild images, without collecting 3D scans. Specifically, given an input image (of a face or an object), a network encoder estimates the projection, lighting, shape and albedo parameters. Two decoders serve as the nonlinear model to map from the shape and albedo parameters to the 3D shape and albedo, respectively. With the projection parameter, lighting, 3D shape, and albedo, a novel analytically-differentiable rendering layer is designed to reconstruct the original input. The entire network is end-to-end trainable with only weak supervision. We demonstrate the superior representation power of our models on different domains (face, generic objects), and their contribution to many other applications on facial analysis and monocular 3D object reconstruction.

This thesis is dedicated to my beautiful wife My Nhat Nguyen, whose encouragement along the way was paramount to making it thus far.

ACKNOWLEDGMENTS

Foremost, I would like to express my sincere gratitude to my advisor Prof. Xiaoming Liu for the continuous support of my Ph.D. study. His desire to see me succeed has pushed me to obtain far more than I could imagine alone. The late nights spent writing papers together, attention to the smallest details, and desire to push the bounds of knowledge have inspired my dedication to excellence. I am deeply indebted for his refinement of my writing and presentation skills.

I would also like to thank the remainder of my committee members, Dr. Arun Ross, Dr. Jiayu Zhou and Dr. Daniel Morris, for their valuable insights and contributions along the way.

I am grateful to my Computer Vision Lab members, both present and past, Dr. Joseph Roth, Dr. Xi Yin, Dr. Amin Jourabloo, Dr. Morteza Safdarnejad, Dr. Yousef Atoum, Yaojie Liu, Garrick Brazil, Adam Terwilliger, Joel Stehouwer, Bangjie Yin, Hieu Nguyen, Shengjie Zhu and Masa Hu, for the excellent working atmosphere. The willingness to answer any questions and the late nights working together cause all of our work to flourish. I will also never forget the memories we have together, from board game nights to travel trips, that made my PhD a very pleasant journey.

A word of appreciation to Brenda Hodge, Katherine Trinklein, Steven Smith and Amy King for their administrative assistance.

Finally, I would like to thank my family - my parents and my sister - who have provided me with moral and emotional support in my life. The largest thanks go to my beautiful wife My Nhat.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction and Contributions
  1.1  Thesis Contributions
  1.2  Thesis Organization

Chapter 2  Background and Related Work
  2.1  3D Morphable Model
  2.2  Improving Linear 3DMM
  2.3  2D Face Alignment
  2.4  3D Face Reconstruction
  2.5  3D Object Modeling and Reconstruction

Chapter 3  Learning 3D Face Morphable Model from In-the-wild Images
  3.1  Introduction
  3.2  The Proposed Nonlinear 3DMM
    3.2.1  Nonlinear 3DMM
      3.2.1.1  Problem Formulation
      3.2.1.2  Albedo & Shape Representation
      3.2.1.3  In-Network Physically-Based Face Rendering
      3.2.1.4  Occlusion-aware Rendering
      3.2.1.5  Model Learning
  3.3  Experimental Results
    3.3.1  Ablation Study
      3.3.1.1  Effect of Regularization
      3.3.1.2  Modeling Lighting and Shape Representation
      3.3.1.3  Comparison to Autoencoders
    3.3.2  Expressiveness
    3.3.3  Representation Power
    3.3.4  Applications
      3.3.4.1  Face Alignment
      3.3.4.2  3D Face Reconstruction
    3.3.5  Runtime
  3.4  Conclusions

Chapter 4  Towards High-Fidelity Nonlinear 3D Face Morphable Model
  4.1  Introduction
  4.2  Proposed Method
    4.2.1  Nonlinear 3DMM with Proxy and Residual
    4.2.2  Global Local Based Network Architecture
  4.3  Experimental Results
    4.3.1  Ablation Study
    4.3.2  Representation Power
    4.3.3  Identity-Preserving
    4.3.4  3D Reconstruction
    4.3.5  Face editing
  4.4  Conclusions

Chapter 5  Intrinsic 3D Decomposition, Segmentation, and Modeling Generic Objects
  5.1  Introduction
    5.1.1  3D Shape and Albedo Representation
    5.1.2  Physics-Based Rendering
    5.1.3  Model Learning
      5.1.3.1  Unsupervised Joint Modeling and Fitting
      5.1.3.2  Supervised Prior Learning with Synthetic Images
    5.1.4  Implementation Details
      5.1.4.1  Model training
    5.1.5  Network Structure
  5.2  Experimental Results
    5.2.1  Experiment Setup
    5.2.2  Ablation Study
    5.2.3  Unsupervised Segmentation
    5.2.4  3D Image Decomposition
    5.2.5  Single-view 3D Reconstruction
      5.2.5.1  Reconstruction on synthetic images
      5.2.5.2  Reconstruction on real images

Chapter 6  Conclusions and Future Work

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: The architectures of the E, D_A and D_S networks.
Table 3.2: Face alignment performance on AFLW2000.
Table 3.3: Quantitative comparison of texture representation power (average reconstruction error on the non-occluded face portion).
Table 3.4: 3D scan reconstruction comparison (NME).
Table 3.5: Running time of various 3D face reconstruction methods.
Table 4.1: Quantitative comparison of texture representation power (average reconstruction error on the non-occluded face portion).
Table 5.1: Colored voxel encoder network structure.
Table 5.2: Image encoder network structure (slightly modified from ResNet-18).
Table 5.3: Effect of loss terms on pose and reconstruction estimation.
Table 5.4: Segmentation and shape representation comparisons (IoU/CD) on ShapeNet part [181]. IoU is utilized to measure segmentation against ground-truth parts. CD is used for shape representation evaluation. Chair* is training on the chair+table joint set.
Table 5.5: Quantitative comparison of single-view 3D reconstruction on synthetic images of ShapeNet.
Table 5.6: Real image 3D reconstruction on PASCAL 3D+ with CD.
Table 5.7: Real image 3D reconstruction on Pix3D with CD.
Table A1: DR-GAN and its partial variants performance comparison.
Table A2: Comparison of single vs. multi-image DR-GAN on CFP.
Table A3: Performance on IJB-A when removing images by threshold w_t. "Selected" shows the percentage of retained images.
Table A4: Fusion scheme comparisons on the IJB-A dataset.
Table A5: Loss function comparisons. All use "mean min" fusion.
Table A6: Performance comparison on the IJB-A dataset.
Table A7: Performance (Accuracy) comparison on CFP.
Table A8: Identification rate (%) comparison on the Multi-PIE dataset.
Table A9: Representation f(x) vs. synthetic image x̂ on IJB-A.

LIST OF FIGURES

Figure 2.1: The visual abstract of the seminal work by Blanz and Vetter [13]. It proposes a statistical model for faces to perform 3D reconstruction from 2D images and a parametric face space which enables controlled manipulation.
Figure 3.1: Conventional 3DMM employs linear bases models for shape/albedo, which are trained with 3D face scans and associated controlled 2D images. We propose a nonlinear 3DMM to model shape/albedo via deep neural networks (DNNs). It can be trained from in-the-wild face images without 3D scans, and also better reconstructs the original images due to the inherent nonlinearity.
Figure 3.2: Jointly learning a nonlinear 3DMM and its fitting algorithm from an unconstrained 2D in-the-wild face image collection, in a weakly supervised fashion. L_S is a visualization of shading on a sphere with lighting parameters L.
Figure 3.3: Three albedo representations. (a) Albedo value per vertex, (b) Albedo as a 2D frontal face, (c) UV space 2D unwarped albedo.
Figure 3.4: UV space shape representation. From left to right: individual channels for the x, y and z spatial dimensions and the combined shape image.
Figure 3.5: Forward and backward pass of the rendering layer.
Figure 3.6: Rendering with segmentation masks. Left to right: segmentation results, naive rendering, occlusion-aware rendering.
Figure 3.7: Effect of albedo regularizations: albedo symmetry (sym) and albedo constancy (const). When there is no regularization being used, shading is mostly baked into the albedo.
Using the symmetry property helps to resolve the global lighting. Using the constancy constraint further removes shading from the albedo, which results in a better 3D shape.
Figure 3.8: Effect of shape smoothness regularization.
Figure 3.9: Comparison to convolutional autoencoders (AE). Our approach produces results of higher quality. Also, it provides access to the 3D facial shape, albedo, lighting, and projection matrix.
Figure 3.10: Each column shows shape changes when varying one element of f_S, by 10 times standard deviations, in opposite directions. Ordered by the magnitude of shape changes.
Figure 3.11: Each column shows albedo changes when varying one element of f_A in opposite directions.
Figure 3.12: Nonlinear 3DMM generates shape and albedo embedded with different attributes.
Figure 3.13: Texture representation power comparison. Our nonlinear model can better reconstruct the facial texture.
Figure 3.14: Shape representation power comparison (l_S = 160). The error maps show the normalized per-vertex error.
Figure 3.15: 3DMM fits to faces with diverse skin color, pose, expression, lighting, facial hair, and faithfully recovers these cues. The left half shows results from the AFLW2000 dataset, the right half shows results from CelebA.
Figure 3.16: Our face alignment results. Invisible landmarks are marked as red. We can well handle extreme pose, lighting and expression.
Figure 3.17: Face alignment Cumulative Errors Distribution (CED) curves on AFLW2000-3D on 2D (left) and 3D landmarks (right). NMEs are shown in the legend boxes.
Figure 3.18: 3D reconstruction results comparison to Tewari et al. [153]. Their reconstructed shapes suffer from surface shrinkage when dealing with challenging texture or shape outside the linear model subspace. They can't handle large pose variation well either. Meanwhile, our nonlinear model is more robust to these variations.
Figure 3.19: 3D reconstruction results comparison to Tewari et al. [152]. Our model better reconstructs the input image in both texture (facial hair direction in the first image) and shape (nasolabial folds in the second image).
Figure 3.20: 3D reconstruction results comparison to Sela et al. [129]. Besides showing the shape, we also show their estimated depth and correspondence map. Facial hair or occlusion can cause serious problems in their output maps.
Figure 3.21: 3D reconstruction results comparison to VRN by Jackson et al. [63] on the CelebA dataset. Volumetric shape representation results in non-smooth 3D shape and loses correspondence between reconstructed shapes.
Figure 3.22: 3D reconstruction quantitative evaluation on FaceWarehouse. We obtain a lower error compared to PRN [46] and 3DDFA+ [195].
Figure 3.23: 3D face reconstruction results on the Florence dataset [9]. The NME of each method is shown in the legend.
Figure 4.1: The proposed framework. Each shape or albedo decoder consists of two branches to reconstruct the true element and its proxy. Proxies free shape and albedo from strong regularizations, allowing them to learn models with a high level of details.
Figure 4.2: The proposed global local based network architecture.
Figure 4.3: Reconstruction results with different loss functions.
Figure 4.4: Image reconstruction with our 3DMM model using the proxy and the true shape and albedo. Our shape and albedo can faithfully recover details of the face. Note: for the shape, we show the shading in UV space, a better visualization than the raw S_UV.
Figure 4.5: Effect of soft symmetry loss on our shape model.
Figure 4.6: Texture representation power comparison. Our nonlinear model can better reconstruct the facial texture.
Figure 4.7: Shape representation power comparison. Given a 3D shape, we optimize the feature f_S to approximate the original one.
Figure 4.8: The distance between the input images and their reconstructions from three models. For better visualization, images are sorted based on their distance to our model's reconstructions.
Figure 4.9: 3DMM fits to faces with diverse skin color, pose, expression, lighting, and faithfully recovers these cues.
Figure 4.10: 3D reconstruction comparison to Tewari et al. [153].
Figure 4.11: 3D reconstruction comparisons to nonlinear 3DMM approaches by Tewari et al. [152] or Tran and Liu [161]. Our model can reconstruct face images with a higher level of details. Please zoom in for more details. Best viewed electronically.
Figure 4.12: 3D reconstruction comparisons to Sela et al. [139] or Tran et al. [159], which go beyond latent space representations.
Figure 4.13: Lighting transfer results. We transfer the lighting of source images (first row) to target images (first column). We have similar performance compared to the state-of-the-art method of Shu et al. [143] despite being orders of magnitude faster (150 ms vs. 3 min per image).
Figure 4.14: Growing mustache editing results. The first column shows original images, the following columns show edited images with increasing magnitudes. Compared to the results of Shu et al. [144] (last row), our edited images are more realistic and identity preserved.
Figure 4.15: Adding stickers to faces. The sticker is naturally added onto faces following the surface normal or lighting.
Figure 5.1: This work decomposes a 2D image of generic objects into albedo, 3D shape, illumination, and camera projection.
Figure 5.2: Shape and albedo decoder networks. The shape decoder D_S takes a shape latent representation f_S and a spatial point x = (x, y, z) and produces the implicit field value for each branch. The output layer groups the branch outputs, via max pooling, to form the final spatial probability of occupancy. The albedo decoder D_A receives both latent representations f_S, f_A and estimates the albedo colors of 4 branches, one of which is selected by the shape branch/segmentation and returned as the albedo color of x.
Figure 5.3: Ray tracing for surface point detection. In Linear search, candidates (red points) are uniformly distributed in the grid. In Linear-Binary search, after the first point inside the object is found, Binary search is used between the last outside point and the current inside point for all remaining iterations.
Figure 5.4: Color voxelization of ShapeNet models. Original 3D mesh (left) and 64^3 colored voxels (right).
Figure 5.5: The shape decoder network is composed of 3 fully connected layers, denoted as "FC". The shape latent vector (128-dim) is concatenated, denoted "+", with the xyz query, making a 131-dim vector, and is provided as input to the first layer. The LeakyReLU activation is applied to the first 2 FC layers while the final value is obtained with Sigmoid activation, denoted as "Sig.".
Figure 5.6: The albedo decoder network is also composed of 3 fully connected layers. Specifically, it takes the point coordinate (x, y, z), along with the shape and albedo feature vectors, and outputs the RGB color value. 'TH' denotes Tanh activation.
Figure 5.7: One example of boundary point selection for local feature extraction.
Figure 5.8: Local feature distance under noise of different standard deviations.
Figure 5.9: 3D reconstruction using models learned with (third row) and without real images (second row). Higher quality reconstruction is observed in the bottom row.
Figure 5.10: Unsupervised segmentation results on the ShapeNet part dataset. We render the original meshes with different colors representing different parts.
Figure 5.11: Visualization of albedo branch outputs for our 5 categories. We render the albedo with the reconstructed mesh.
Figure 5.12: 3D image decomposition on real-world images. Our work decomposes a 2D image of generic objects into albedo, completed 3D shape and illumination.
Figure 5.13: Qualitative comparison for single-view 3D reconstruction on the ShapeNet, Pascal 3D+, and Pix3D datasets.
Figure 5.14: Qualitative comparison for single-view 3D reconstruction on real images from Pascal 3D+ (left) and Pix3D (right).
Figure 5.15: Additional 3D reconstruction results on the Pascal 3D+ [177] dataset.
Figure 5.16: Additional 3D reconstruction results on Pix3D [147]. For each input image, we show reconstructions by ShapeHD [174], and ground truth. Our reconstructions resemble the ground truth.
Figure A1: Given one or multiple in-the-wild face images as the input, DR-GAN can produce a unified identity representation, by virtually rotating the face to arbitrary poses. The learnt representation is both discriminative and generative, i.e., the representation is able to demonstrate superior PIFR performance, and synthesize identity-preserved faces at target poses specified by the pose code.
Figure A2: Comparison of previous GAN architectures and our proposed DR-GAN.
Figure A3: Generator in multi-image DR-GAN. From an image set of a subject, we can fuse the features into a single representation via dynamically learnt coefficients and synthesize images in any pose.
Figure A4: The mean faces of 13 pose groups in CASIA-WebFace. The blurriness shows the challenges of pose estimation for large poses.
Figure A5: Generated faces of DR-GAN and its partial variants.
Figure A6: Responses of the two filters with the highest responses to identity (left), and pose (right). Responses of each row are of the same subject, and each column of the same pose. Note the within-row similarity on the left and the within-column similarity on the right.
Figure A7: Coefficient distributions on IJB-A (a) and CFP (b). For IJB-A, we visualize images at four regions of the distribution. For CFP, we plot the distributions for frontal faces (blue) and profile faces (red) separately and show images at the heads and tails of each distribution.
Figure A8: The correlation between the estimated coefficients and the classification probabilities.
Figure A9: Face rotation comparison on Multi-PIE. Given the input (in illumination 07 and 75 degree pose), we show synthetic images of the L2 loss (top), adversarial loss (middle), and ground truth (bottom). Columns 2-5 show the ability of DR-GAN in simultaneous face rotation and re-lighting.
Figure A10: Interpolation of f(x), c, and z. (a) Synthetic images by interpolating between the identity representations of two faces (Columns 1 and 12). Note the smooth transition between different genders and facial attributes. (b) Pose angles 0, 15, 30, 45, 60, 75, 90 degrees are available in the training set. DR-GAN interpolates in-between unseen poses via continuous pose codes, shown above Row 3. (c) For each image at Column 1, DR-GAN synthesizes two images at z = -1 (Column 2) and z = 1 (Column 12), and in-between images by interpolating along the two z.
Figure A11: Face rotation on CFP: (a) input, (b) frontalized faces, (c) real frontal faces, (d) rotated faces at 15, 30, 45 degree poses. We expect the frontalized faces to preserve the identity, rather than all facial attributes. This is very challenging for face rotation due to the in-the-wild variations and extreme profile views. The artifact in the image boundary is due to image extrapolation in pre-processing. When the inputs are frontal faces with variations in roll, expression, or occlusions, the synthetic faces can remove these variations.
Figure A12: Face frontalization on IJB-A. For each of four subjects, we show 11 input images with estimated coefficients overlaid at the top left corner (first row) and their frontalized counterparts (second row). The last column is the ground truth frontal and the synthetic frontal from the fused representation of all 11 images. Note the challenges of large poses, occlusion, and low resolution, and our opportunistic frontalization.
Figure A13: Face frontalization on IJB-A for an image set (first subject) and a video sequence (second subject). For each subject, we show 11 input images (first row), their respective frontalized faces (second row) and the frontalized faces using incrementally fused representations from all previous inputs up to this image (third row). In the last column, we show the ground truth frontal face.

Chapter 1

Introduction and Contributions

Understanding 3D structure is a long-standing problem with much interest in computer vision. A human has no difficulty understanding the 3D structure of an object upon seeing its 2D image. Even without geometric cues (motion or stereopsis), our visual system can still infer detailed surfaces or plausibly hidden parts. Meanwhile, such a 3D inference task remains extremely challenging for computer vision systems.

One object in particular, the face, is highly studied, since obtaining a user-specific 3D face surface model is useful for many applications, including but not limited to face recognition [6, 102, 185], video editing [47, 155], avatar puppeteering [20, 23, 189] or virtual make-up [48, 83].

Inferring a 3D face mesh from a single photograph is arduous and ill-posed since the image formation process blends multiple components (shape, albedo) as well as the environment (lighting) into a single color for each pixel. To better handle the ambiguity, one must rely on additional prior assumptions, such as constraining 3D objects to lie in a restricted subspace, e.g., 3D Morphable Models (3DMM) [13] learned from a small collection of 3D scans.

Traditionally, a 3DMM is learnt through supervision by performing dimension reduction, typically Principal Component Analysis (PCA), on a training set of co-captured 3D face scans and 2D images. To model highly variable 3D face shapes, a large amount of high-quality 3D face scans is required. However, this requirement is expensive to fulfill as acquiring face scans is very laborious, in both the data capturing and post-processing stages. The first 3DMM [13] was built from scans of 200 subjects with a similar ethnicity/age group. They were also captured in well-controlled conditions, with only neutral expressions. Hence, it is fragile to large variances in face identity. The widely used Basel Face Model (BFM) [121] is also built with only 200 subjects in neutral expressions. Lack of expression can be compensated using expression bases from FaceWarehouse [24] or BD-3FE [183], which are learned from the offsets to the neutral pose. After more than a decade, almost all existing models use no more than 300 training scans. Such small training sets are far from adequate to describe the full variability of human faces [19]. Until recently, with a significant effort as well as a novel automated and robust model construction pipeline, Booth et al. [19] built the first large-scale 3DMM from scans of ~10,000 subjects, which is still not released to the public.
Second, the texture model of a 3DMM is normally built with a small number of 2D face images co-captured with 3D scans, under well-controlled conditions. Despite a considerable improvement in 3D acquisition devices in the last few years, these devices still cannot operate in arbitrary in-the-wild conditions. Therefore, all the current 3D facial datasets have been captured in laboratory environments. Hence, such models are only learnt to represent the facial texture in similar, rather than in-the-wild, conditions. This substantially limits the application scenarios of 3DMM.

Finally, the representation power of a 3DMM is limited by not only the size or type of training data but also its formulation. Facial variations are nonlinear in nature. E.g., the variations in different facial expressions or poses are nonlinear, which violates the linear assumption of PCA-based models. Thus, a PCA model is unable to interpret facial variations sufficiently well. This is especially true for facial texture. For all current 3DMM models, their low-dimension albedo subspace faces the same problem of lacking facial hair, e.g., beards. To reduce the fitting error, it compensates for unexplainable texture by altering the surface normal, or shrinking the face shape [198]. Either way, linear 3DMM-based applications often degrade in performance when handling out-of-subspace variations.

Given the barrier of 3DMM in its data, supervision and linear bases, this thesis aims to revolutionize the paradigm of learning 3DMM by answering a fundamental question:

Whether and how can we learn a nonlinear 3D Morphable Model of face shape and albedo from a set of in-the-wild 2D face images, without collecting 3D face scans?

If the answer were yes, this would be in sharp contrast to the conventional 3DMM approach, and remedy all aforementioned limitations. Fortunately, we have developed approaches to offer positive answers to this question. With the recent development of deep neural networks, we view that it is the right time to undertake this new paradigm of 3DMM learning. Therefore, the core of this thesis is regarding how to learn this new 3DMM, what is the representation power of the model, and what is the benefit of the model to facial analysis.

1.1 Thesis Contributions

In this thesis, we propose a novel paradigm to learn a nonlinear 3DMM model from a large in-the-wild 2D face image collection, without acquiring 3D face scans, by leveraging the power of deep neural networks to capture variations and structures in complex face data. The framework is also further extended to generic objects, with substantially larger shape deformation, thanks to a novel representation. In summary, this dissertation makes the following contributions:

- To overcome the shortage of annotated 3D data, we develop a framework to jointly learn the 3D model and the model fitting algorithm via weak supervision, by leveraging a large collection of 2D images without 3D scans. Two modules are optimized end-to-end with the objective to reconstruct the input image. This objective allows us to use any photographs for model training without any 3D labels.

- Different from previous methods that focus on modeling only 3D shape, the proposed nonlinear 3DMM fully models shape, albedo and lighting, which enables us to train the model in a weakly supervised fashion.

- By using neural networks to represent all model components, our model can better model nonlinear shape/albedo variations. Hence our model has greater representation power than its traditional linear counterpart.

- In realization that the strong regularization and global-based modeling are the roadblocks to achieving a high-fidelity 3DMM model, we propose to relax the regularization by using proxies and propose a global-local network architecture.

- To extend the learning framework to generic objects, which usually have large shape deformation as well as inconsistent shape topology, we propose a novel representation, the colored occupancy field, in which each 3D spatial point is classified as inside/outside the 3D shape as well as assigned an albedo color (see the sketch after this list).
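To make the colored occupancy idea concrete, here is a minimal Python sketch of querying one spatial point; the callables `shape_decoder` and `albedo_decoder` are hypothetical stand-ins for the trained networks described in Chapter 5, not the actual implementation:

import numpy as np

def query_colored_occupancy(point, f_S, f_A, shape_decoder, albedo_decoder):
    """Query one 3D point against a colored occupancy field (illustrative only).

    point: (3,) xyz coordinate; f_S, f_A: latent shape/albedo vectors.
    shape_decoder(point, f_S) -> probability that the point is inside the shape.
    albedo_decoder(point, f_S, f_A) -> RGB albedo assigned to the point.
    """
    occupancy = shape_decoder(point, f_S)   # value in [0, 1]
    is_inside = occupancy > 0.5             # hard inside/outside classification
    rgb = albedo_decoder(point, f_S, f_A) if is_inside else None
    return is_inside, rgb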
1.2 Thesis Organization

The rest of this dissertation is organized as follows. Chapter 2 gives more background introduction and reviews related work on 3D reconstruction. Chapter 3 develops the learning framework on nonlinear 3DMM. Chapter 4 improves the model in both learning objective and architecture. Chapter 5 presents the extension of the framework to generic objects with a novel representation, the colored occupancy field. Chapter 6 concludes this dissertation.

Chapter 2

Background and Related Work

Now that a basic understanding of the problem is known, I will present some background information and related work necessary for fully understanding this thesis.

2.1 3D Morphable Model

The 3D Morphable Model (3DMM) [13] and its 2D counterpart, the Active Appearance Model [37, 94, 91], provide parametric models for synthesizing faces, where faces are modeled using two components: shape and albedo (skin reflectance).

Blanz and Vetter [13] propose the first generic 3D face model learned from scan data. They define a linear subspace to represent shape and albedo using principal component analysis (PCA) and show how to fit the model to data. The 3D face space can be represented with PCA as:

S = S̄ + G α,    (2.1)

where S ∈ R^{3Q} is a 3D face mesh with Q vertices, S̄ ∈ R^{3Q} is the mean shape, and α ∈ R^{l_S} is the shape parameter corresponding to the 3D shape bases G. The shape bases can be further split into G = [G_id, G_exp], where G_id is trained from 3D scans with neutral expressions, and G_exp is from the offsets between expression and neutral scans.

Figure 2.1: The visual abstract of the seminal work by Blanz and Vetter [13]. It proposes a statistical model for faces to perform 3D reconstruction from 2D images and a parametric face space which enables controlled manipulation.

The albedo of the face A ∈ R^{3Q} is defined within the mean shape S̄, which describes the R, G, B colors of the Q corresponding vertices. A is also formulated as a linear combination of basis functions:

A = Ā + R β,    (2.2)

where Ā is the mean albedo, R is the albedo bases, and β ∈ R^{l_T} is the albedo parameter.

The 3DMM can be used to synthesize novel views of the face. Firstly, a 3D face is projected onto the image plane with the weak perspective projection model:

V = R S,    (2.3)

g(S, m) = V_{2D} = f · Pr · V + t_{2d} = M(m) [S; 1],    (2.4)

where g(S, m) is the projection function leading to the 2D positions V_{2D} of the 3D rotated vertices V, f is the scale factor, Pr = [1 0 0; 0 1 0] is the orthographic projection matrix, R is the rotation matrix constructed from three rotation angles (pitch, yaw, roll), and t_{2d} is the translation vector. While the projection matrix M is of size 2 × 4, it has six degrees of freedom, which is parameterized by a 6-dim vector m. Then, the 2D image is rendered using texture and an illumination model such as the Phong model [122] or Spherical Harmonics [125].
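As a concrete illustration of Eqns. 2.1-2.4, the following numpy sketch synthesizes a face from linear bases and projects it with the weak perspective model. The randomly initialized bases, latent sizes, and pose values are placeholders, not the real BFM model:

import numpy as np

Q = 53215                        # number of mesh vertices (see Sec. 3.3)
l_S, l_T = 40, 40                # latent sizes, chosen arbitrarily here

# Stand-ins for the learned PCA model: means and bases (Eqns. 2.1, 2.2).
S_mean = np.zeros(3 * Q); G = np.random.randn(3 * Q, l_S) * 0.01
A_mean = np.full(3 * Q, 0.5); R_bases = np.random.randn(3 * Q, l_T) * 0.01

alpha = np.random.randn(l_S)     # shape parameter
beta = np.random.randn(l_T)      # albedo parameter
S = (S_mean + G @ alpha).reshape(Q, 3)        # Eqn. 2.1
A = (A_mean + R_bases @ beta).reshape(Q, 3)   # Eqn. 2.2

# Weak perspective projection (Eqns. 2.3, 2.4): scale f, rotation, translation.
f, t2d = 1.0, np.array([0.0, 0.0])
yaw = np.deg2rad(30.0)           # rotation about the y axis only, for brevity
Rot = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                [0, 1, 0],
                [-np.sin(yaw), 0, np.cos(yaw)]])
V = S @ Rot.T                                 # Eqn. 2.3: V = R S
Pr = np.array([[1.0, 0, 0], [0, 1.0, 0]])     # orthographic projection matrix
V2D = f * (V @ Pr.T) + t2d                    # Eqn. 2.4: 2D vertex positions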
Since Blanz and Vetter's seminal work [13], there has been a large amount of effort on improving the 3DMM modeling mechanism. In [13], the dense correspondence between facial meshes is solved with a regularised form of optical flow. However, this technique is only effective in a constrained setting, where subjects share similar ethnicities and ages. To overcome this challenge, Patel and Smith [120] employ a Thin Plate Splines (TPS) warp [16] to register the meshes into a common reference frame. Alternatively, Paysan et al. [121] use a Nonrigid Iterative Closest Point [7] to directly align 3D scans. In a different direction, Amberg et al. [6] extended Blanz and Vetter's PCA-based model to emotive facial shapes by adopting an additional PCA modeling of the residuals from the neutral pose. This results in a single linear model of both identity and expression variation of the 3D facial shape. Vlasic et al. [166] use a multilinear model to represent the combined effect of identity and expression variation on the facial shape. Later, Bolkart and Wuhrer [15] show how such a multilinear model can be estimated directly from the 3D scans using a joint optimization over the model parameters and groupwise registration of 3D scans.

2.2 Improving Linear 3DMM

With PCA bases, the statistical distribution underlying a 3DMM is Gaussian. Koppen et al. [77] argue that a single-mode Gaussian cannot well represent the real-world distribution. They introduce the Gaussian Mixture 3DMM that models the global population as a mixture of Gaussian subpopulations, each with its own mean, but shared covariance. Booth et al. [17, 18] aim to improve the texture of 3DMM to go beyond controlled settings by learning an "in-the-wild" feature-based texture model. In another direction, Tran et al. [158] learn to regress a robust and discriminative 3DMM representation, by leveraging multiple images from the same subject. However, all these works are still based on statistical PCA bases. Duong et al. [112] address the problem of linearity in face modeling by using Deep Boltzmann Machines. However, they only work with 2D faces and sparse landmarks, and hence cannot handle faces with large-pose variations or occlusion well. Concurrent to our work, Tewari et al. [152] learn a (potentially non-linear) corrective model on top of a linear model. The final model is a summation of the base linear model and the learned corrective model, which contrasts to our model. Furthermore, our model has the advantage of using a 2D representation of both shape and albedo, which maintains spatial relations between vertices and leverages CNN power for image synthesis. Finally, thanks to our novel rendering layer, we are able to employ perceptual and adversarial losses to improve the reconstruction quality.

2.3 2D Face Alignment

2D Face Alignment [172, 90] can be cast as a regression problem where 2D landmark locations are regressed directly [42]. For large-pose or occluded faces, strong priors of 3D face shape from a 3D model have been shown to be beneficial [67]. Hence, there is increasing attention in conducting face alignment by fitting a 3D face model to a single 2D image [68, 193, 195, 86, 106, 71, 69]. Among the prior works, iterative approaches with cascades of regressors tend to be preferred. At each cascade, there is a single regressor [165, 67] or even two regressors [175] used to improve the prediction. Recently, Jourabloo and Liu [71] propose a CNN architecture that enables end-to-end training of their network cascade. Contrasted to the aforementioned works that use a fixed 3DMM model, our model and model fitting are learned jointly. This results in a more powerful model: a single-pass encoder, which is learned jointly with the model, achieves state-of-the-art face alignment performance on different benchmark datasets.
2.4 3D Face Reconstruction

Face reconstruction creates a 3D face model from an image collection [130, 131] or even from a single image [128, 139]. This long-standing problem draws a lot of interest because of its wide applications. 3DMM also demonstrates its strength in face reconstruction, especially in the monocular case. This problem is highly under-constrained, as with a single image, the information present about the surface is limited. Hence, 3D face reconstruction must rely on prior knowledge like 3DMM [132]. The statistical PCA linear 3DMM is the most commonly used approach. Besides 3DMM fitting methods [14, 55, 190, 43, 153, 88], recently, Richardson et al. [129] design a refinement network that adds facial details on top of the 3DMM-based geometry. However, this approach can only learn a 2.5D depth map, which loses the correspondence property of 3DMM. The follow-up work by Sela et al. [139] tries to overcome this weakness by learning a correspondence map. Despite having some impressive reconstruction results, both these methods are limited by training data synthesized from the linear 3DMM model. Hence, they fail to handle out-of-subspace variations, e.g., facial hair.

2.5 3D Object Modeling and Reconstruction

Recently, autoencoders have been widely used for 3D object modeling [65, 126, 85, 8, 38, 146] due to their efficient feature representation. These methods can be naturally applied to single-image 3D reconstruction. The reconstruction process encodes the input image with deep convolutional networks, and then uses the trained decoder to reconstruct the corresponding 3D shapes from the shape latent vectors. However, most of these methods suffer from the domain mismatch issue since the models are trained on synthetic data.

Another related direction, e.g., MarrNet [173] and ShapeHD [174], is to develop a two-step pipeline. They first recover 2.5D sketches (depth and normal maps), from which a voxelized 3D shape can be further inferred. The VON [192] method also benefits from this two-step process for realistic image synthesis. However, despite the fact that the use of 2.5D sketches can relax the burden on domain transfer and constrain the reconstructed 3D shape to be consistent with 2D observations, these methods still have two limitations: 1) even with high-resolution voxels, they are far from producing visually compelling shapes; 2) they do not learn disentangled and interpretable latent vectors that allow image manipulation under different conditions (e.g., pose and lighting).

Chapter 3

Learning 3D Face Morphable Model from In-the-wild Images

3.1 Introduction

The 3D Morphable Model (3DMM) is a statistical model of 3D facial shape and texture in a space where there are explicit correspondences [13]. The morphable model framework provides two key benefits: first, a point-to-point correspondence between the reconstruction and all other models, enabling "morphing", and second, modeling of underlying transformations between types of faces (male to female, neutral to smile, etc.). 3DMM has been widely applied in numerous areas including computer vision [13, 186, 159], computer graphics [5, 141, 154, 155], human behavioral analysis [6, 185] and craniofacial surgery [145].

This chapter is adapted from the following publications:
[1] Luan Tran and Xiaoming Liu, "Nonlinear 3D Face Morphable Model", in CVPR, 2018.
[2] Luan Tran and Xiaoming Liu, "On Learning 3D Face Morphable Model from In-the-wild Images", in TPAMI, 2019.

Given the barrier of 3DMM in its data, supervision and linear bases, we propose a novel paradigm to learn a nonlinear 3DMM model from a large in-the-wild 2D face image collection, without acquiring 3D face scans.

Figure 3.1: Conventional 3DMM employs linear bases models for shape/albedo, which are trained with 3D face scans and associated controlled 2D images. We propose a nonlinear 3DMM to model shape/albedo via deep neural networks (DNNs). It can be trained from in-the-wild face images without 3D scans, and also better reconstructs the original images due to the inherent nonlinearity.
As shown in Fig. 3.1, starting with the observation that the linear 3DMM formulation is equivalent to a single-layer network, using a deep network architecture naturally increases the model capacity. Hence, we utilize two convolutional neural network decoders, instead of two PCA spaces, as the shape and albedo model components, respectively. Each decoder takes a shape or albedo parameter as input and outputs the dense 3D face mesh or a face skin reflectance (albedo), respectively. These two decoders are essentially the nonlinear 3DMM.

Further, we learn the fitting algorithm to fit our nonlinear 3DMM, which is formulated as a CNN encoder. The encoder network takes a face image as input and generates the shape and albedo parameters, from which the two decoders estimate the shape and albedo.

The 3D face and albedo would perfectly reconstruct the input face, if the fitting algorithm and 3DMM are well learnt. Therefore, we design a differentiable rendering layer to generate a reconstructed face by fusing the 3D face, albedo, lighting, and the camera projection parameters estimated by the encoder. Finally, an end-to-end learning scheme is constructed where the encoder and two decoders are learnt jointly to minimize the difference between the reconstructed face and the input face. Jointly learning the 3DMM and the model fitting encoder allows us to leverage a large collection of in-the-wild 2D images without relying on 3D scans. We show significantly improved shape and facial texture representation power over the linear 3DMM. Consequently, this also benefits other tasks such as 2D face alignment, 3D face reconstruction, and face editing.

In summary, this chapter makes the following main contributions.

- We learn a nonlinear 3DMM model, fully modeling shape, albedo and lighting, that has greater representation power than its traditional linear counterpart.

- Both shape and albedo are represented as 2D images, which helps maintain spatial relations as well as leverage CNN power in image synthesis.

- We jointly learn the model and the model fitting algorithm via weak supervision, by leveraging a large collection of 2D images without 3D scans. The novel rendering layer enables the end-to-end training.

- The new 3DMM further improves performance in related facial analysis tasks: face alignment and face reconstruction.

3.2 The Proposed Nonlinear 3DMM

3.2.1 Nonlinear 3DMM

As mentioned in Sec. 3.1, the linear 3DMM has problems such as requiring 3D face scans for supervised learning, being unable to leverage massive in-the-wild face images for learning, and limited representation power due to the linear bases. We propose to learn a nonlinear 3DMM model using only large-scale in-the-wild 2D face images.

Figure 3.2: Jointly learning a nonlinear 3DMM and its fitting algorithm from an unconstrained 2D in-the-wild face image collection, in a weakly supervised fashion. L_S is a visualization of shading on a sphere with lighting parameters L.

3.2.1.1 Problem Formulation

In the linear 3DMM (Sec. 2.1), the factorization of each of the components (shape, albedo) can be seen as a matrix multiplication between coefficients and bases. From a neural network's perspective, this can be viewed as a shallow network with only one fully connected layer and no activation function. Naturally, to increase the model's representation power, the shallow network can be extended to a deep architecture. In this work, we design a novel learning scheme to jointly learn a deep 3DMM model and its inference (or fitting) algorithm.

Specifically, as shown in Fig. 3.2, we use two deep networks to decode the shape and albedo parameters into the 3D facial shape and albedo, respectively. To make the framework end-to-end trainable, these parameters are estimated by an encoder network, which is essentially the fitting algorithm of our 3DMM. Three deep networks join forces for the ultimate goal of reconstructing the input face image, with the assistance of a physically-based rendering layer. Fig. 3.2 visualizes the architecture of the proposed framework. Each component will be presented in the following sections.

Figure 3.3: Three albedo representations. (a) Albedo value per vertex, (b) Albedo as a 2D frontal face, (c) UV space 2D unwarped albedo.
Formally, given a set of K 2D face images {I_i}_{i=1}^K, we aim to learn an encoder E: I → (P, L, f_S, f_A) that estimates the projection matrix P, lighting parameter L, shape parameter f_S ∈ R^{l_S}, and albedo parameter f_A ∈ R^{l_A}; a 3D shape decoder D_S: f_S → S that decodes the shape parameter to a 3D shape S ∈ R^{3Q}; and an albedo decoder D_A: f_A → A that decodes the albedo parameter to a realistic albedo A ∈ R^{3Q}, with the objective that the rendered image with P, L, S, and A can well approximate the original image. Mathematically, the objective function is:

argmin_{E, D_S, D_A} Σ_{i=1}^{K} ‖ Î_i − I_i ‖_1,    (3.1)

Î = R(E_P(I), E_L(I), D_S(E_S(I)), D_A(E_A(I))),

where R(P, L, S, A) is the rendering layer (Sec. 3.2.1.3).

3.2.1.2 Albedo & Shape Representation

Fig. 3.3 illustrates three possible albedo representations. In the traditional 3DMM, albedo is defined per vertex (Fig. 3.3(a)). This representation is also adopted in recent work such as [153, 152]. There is an albedo intensity value corresponding to each vertex in the face mesh. Despite being widely used, this representation has its limitations. Since 3D vertices are not defined on a 2D grid, this representation is mostly parameterized as a vector, which not only loses the spatial relation of its vertices, but also prevents it from leveraging the convenience of deploying CNNs on the 2D albedo. In contrast, given the rapid progress in image synthesis, it is desirable to choose a 2D image, e.g., a frontal-view face image in Fig. 3.3(b), as an albedo representation. However, frontal faces contain little information of the two sides, which would lose much albedo information for side-view faces.

In light of these considerations, we use an unwrapped 2D texture as our texture representation (Fig. 3.3(c)). Specifically, each 3D vertex v is projected onto the UV space using a cylindrical unwarp. Assuming that the face mesh has the top pointing up the y axis, the projection of v = (x, y, z) onto the UV space v_uv = (u, v) is computed as:

v → α_1 · arctan(x/z) + β_1,  u → α_2 · y + β_2,    (3.2)

where α_1, α_2, β_1, β_2 are constant scale and translation scalars to place the unwrapped face into the image boundaries. Here, the per-vertex albedo A ∈ R^{3Q} can be easily computed by sampling from its UV space counterpart A_uv ∈ R^{U×V}:

A(v) = A_uv(v_uv).    (3.3)

Usually, this involves sub-pixel sampling via bilinear interpolation:

A(v) = Σ_{u′∈{⌊u⌋,⌈u⌉}, v′∈{⌊v⌋,⌈v⌉}} A_uv(u′, v′)(1 − |u − u′|)(1 − |v − v′|),    (3.4)

where v_uv = (u, v) is the UV space projection of v via Eqn. 3.2.

Albedo information is naturally expressed in the UV space, but spatial data can be embedded in the same space as well. Here, a 3D facial mesh can be represented as a 2D image with three channels, one for each spatial dimension x, y and z. Fig. 3.4 gives an example of this UV space shape representation S_uv ∈ R^{U×V}.

Figure 3.4: UV space shape representation. From left to right: individual channels for the x, y and z spatial dimensions and the combined shape image.

Table 3.1: The architectures of the E, D_A and D_S networks.

E network:
Layer    | Filter/Stride | Output Size
Conv11   | 7×7/2         | 112×112×32
Conv12   | 3×3/1         | 112×112×64
Conv21   | 3×3/2         | 56×56×64
Conv22   | 3×3/1         | 56×56×64
Conv23   | 3×3/1         | 56×56×128
Conv31   | 3×3/2         | 28×28×128
Conv32   | 3×3/1         | 28×28×96
Conv33   | 3×3/1         | 28×28×192
Conv41   | 3×3/2         | 14×14×192
Conv42   | 3×3/1         | 14×14×128
Conv43   | 3×3/1         | 14×14×256
Conv51   | 3×3/2         | 7×7×256
Conv52   | 3×3/1         | 7×7×160
Conv53   | 3×3/1         | 7×7×(l_S + l_A + 64)
AvgPool  | 7×7/1         | 1×1×(l_S + l_A + 64)
FC (m)   | 64            | 6
FC (L)   | 64            | 27

D_A / D_S network:
Layer    | Filter/Stride | Output Size
FC       |               | 6×7×320
FConv52  | 3×3/2         | 12×14×160
FConv51  | 3×3/1         | 12×14×256
FConv43  | 3×3/2         | 24×28×256
FConv42  | 3×3/1         | 24×28×128
FConv41  | 3×3/1         | 24×28×192
FConv33  | 3×3/2         | 48×56×192
FConv32  | 3×3/1         | 48×56×96
FConv31  | 3×3/1         | 48×56×128
FConv23  | 3×3/2         | 96×112×128
FConv22  | 3×3/1         | 96×112×64
FConv21  | 3×3/1         | 96×112×64
FConv13  | 3×3/2         | 192×224×64
FConv12  | 3×3/1         | 192×224×32
FConv11  | 3×3/1         | 192×224×3
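To make the cylindrical unwarp and bilinear sampling of Eqns. 3.2-3.4 concrete, here is a minimal numpy sketch; the scale/translation constants α_1, β_1, α_2, β_2 are not specified in the text, so the values below are assumptions:

import numpy as np

U_SIZE, V_SIZE = 192, 224
a1, b1 = 60.0, V_SIZE / 2.0      # assumed scale/translation for the v axis
a2, b2 = -1.0, U_SIZE / 2.0      # assumed scale/translation for the u axis

def uv_project(vertex):
    """Cylindrical unwarp of a 3D vertex onto the UV plane (Eqn. 3.2)."""
    x, y, z = vertex
    v = a1 * np.arctan2(x, z) + b1   # arctan(x/z), robust to z near zero
    u = a2 * y + b2
    return u, v

def sample_albedo(A_uv, vertex):
    """Bilinearly sample per-vertex albedo from the UV image (Eqns. 3.3, 3.4)."""
    u, v = uv_project(vertex)
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    color = np.zeros(3)
    for uu in (u0, u0 + 1):
        for vv in (v0, v0 + 1):
            w = (1 - abs(u - uu)) * (1 - abs(v - vv))    # bilinear weight
            color += w * A_uv[uu % U_SIZE, vv % V_SIZE]  # wrap to stay in bounds
    return color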
Representing3DfaceshapeinUVspaceallowustouseaCNNforshapedecoder D S instead ofusingamulti-layerperceptron(MLP)asinourpreliminaryversion[160].Avoidingusingwide 17 Figure3.5: Forwardandbackwardpassoftherenderinglayer. fully-connectedlayersallowustousedeepernetworkfor D S ,potentiallymodelmorecomplex shapevariations.Thisresultsinbetterresultsasbeingdemonstratedinourexperiment (Sec.3.3.1.2). Thereferenceshapeusedhasthemouthopen.Thischangehelpsthenetworktoavoidlearning alargegradientnearthetwolips'bordersintheverticaldirectionwhenthemouthisopen. Toregressthese2Drepresentationofshapeandalbedo,wecanemployCNNsasshapeand albedonetworksrespectively., D S , D A areCNNconstructedbymultiplefractionally- stridedconvolutionlayers.AftereachconvolutionisbatchnormandeLUactivation,exceptthelast convolutionlayersofencoderanddecoders.Theoutputlayerhasa tanh activationtoconstraint theoutputtobeintherangeof [ 1 ; 1 ] .ThedetailednetworkarchitectureispresentedinTab.3.1. 3.2.1.3In-NetworkPhysically-BasedFaceRendering Toreconstructafaceimagefromthealbedo A ,shape S ,lightingparameter L ,andprojection parameter m ,wearenderinglayer R ( m ; L ; S ; A ) torenderafaceimagefromtheabove parameters.Thisisaccomplishedinthreesteps,asshowninFig.3.5.Firstly,thefacialtextureis computedusingthealbedo A andthesurfacenormalmapoftherotatedshape N ( V )= N ( P ; S ) . Here,following[169],weassumedistantilluminationandapurely Lambertian surface 18 Hencetheincomingradiancecanbeapproximatedusingsphericalharmonics(SH)basisfunctions H b : R 3 ! R ,andcontrolledbycoef L .,thetextureinUVspace T uv 2 R U V iscomposedofalbedo A uv andshading C uv : T uv = A uv C uv = A uv B 2 å b = 1 L b H b ( N ( m ; S uv )) ; (3.5) where B isthenumberofsphericalharmonicsbands.Weuse B = 3,whichleadsto B 2 = 9 coefin L foreachofthreecolorchannels.Secondly,the3Dshape/mesh S isprojectedto theimageplaneviaEqn.2.4.Finally,the3DmeshisthenrenderedusingaZ-bufferrenderer, whereeachpixelisassociatedwithasingletriangleofthemesh, ‹ I ( m ; n )= R ( P ; L ; S uv ; A uv ) m ; n = T uv ( å v i 2 F uv ( g ; m ; n ) l i v i ) ; (3.6) where F ( g ; m ; n )= f v 1 ; v 2 ; v 3 g isanoperationreturningthreeverticesofthetrianglethatencloses thepixel ( m ; n ) afterprojection g ; F uv ( g ; m ; n ) isthesameoperationwithresultantverticesmapped intothereferencedUVspaceusingEqn.3.2.Inordertohandleocclusions,whenasinglepixel residesinmorethanonetriangle,thetrianglethatisclosesttotheimageplaneisselected.The locationofeachpixelisdeterminedbyinterpolatingthelocationofthreeverticesviabarycentric coordinates f l i g 3 i = 1 . Therearealternativedesignstoourrenderinglayer.Ifthetexturerepresentationisper vertex,asinFig.3.3(a),onemaywarptheinputimage I i ontothevertexspaceofthe3Dshape S , whosedistancetotheper-vertextexturerepresentationcanformareconstructionloss.Thisdesign isadoptedbytherecentworkof[153,152].Incomparison,ourrenderedimageisona 19 2Dgridwhilethealternativeisontopofthe3Dmesh.Asaresult,ourrenderedimagecanenjoy theconvenienceofapplyingtheperceptuallossoradversarialloss,whichisshowntobecritical inimprovingthequalityofsynthetictexture.Anotherdesignforrenderinglayerisimagewarping basedonthesplineinterpolation,asin[36].However,thiswarpingiscontinuous:everypixelin theinputwillmaptotheoutput.Hencethiswarpingoperationfailsintheoccludedregion.Asa result,Cole etal .[36]limittheirscopetoonlysynthesizingfrontal-viewfacesbywarpingfrom normalizedfaces. TheCUDAimplementationofourrenderinglayerispubliclyavailableat https://github. com/tranluan/Nonlinear_Face_3DMM . 
3.2.1.4 Occlusion-aware Rendering

Very often, in-the-wild faces are occluded by glasses, hair, hands, etc. Trying to reconstruct such abnormal occluded regions could make the model learning more difficult, or result in a model with external occlusions baked in. Hence, we propose to use a segmentation mask to exclude occluded regions from the rendering pipeline:

Î ← Î ⊙ M + I ⊙ (1 − M).    (3.7)

As a result, these occluded regions won't affect our optimization process. The foreground mask M is estimated using the segmentation method given by Nirkin et al. [113]. Examples of segmentation masks and rendering results can be found in Fig. 3.6.

Figure 3.6: Rendering with segmentation masks. Left to right: segmentation results, naive rendering, occlusion-aware rendering.

3.2.1.5 Model Learning

The entire network is end-to-end trained to reconstruct the input images, with the loss function:

L = L_rec(Î, I) + λ_lan L_lan + λ_reg L_reg,    (3.8)

where the reconstruction loss L_rec enforces the rendered image Î to be similar to the input I, the landmark loss L_lan enforces the geometry constraint, and the regularization loss L_reg encourages plausible solutions.

Reconstruction Loss. The main objective of the network is to reconstruct the original face via a disentangled representation. Hence, we enforce the reconstructed image to be similar to the original input image:

L^i_rec(Î, I) = (1/|V|) Σ_{q∈V} ‖ Î(q) − I(q) ‖_2,    (3.9)

where V is the set of all pixels in the images covered by the estimated face mesh. Different norms can be used to measure the closeness. To better handle outliers, we adopt the robust l_{2,1} norm, where the distance in the 3D RGB color space is based on l_2, and the summation over all pixels enforces sparsity based on the l_1-norm [155, 156].

To improve on the blurry reconstruction results of l_p losses, in our preliminary work [160], thanks to our rendering layer, we employed an adversarial loss to enhance the image realism. However, an adversarial objective only encourages the reconstruction to be close to the real image distribution, but not necessarily the input image. Also, it is known to be unstable to optimize. Here, we propose to use a perceptual loss to enforce the closeness between images Î and I, which overcomes both of the adversarial loss's weaknesses. Besides encouraging the pixels of the output image Î to exactly match the pixels of the input I, we encourage them to have similar feature representations as computed by the loss network φ:

L^f_rec(Î, I) = (1/|C|) Σ_{j∈C} (1/(W_j H_j C_j)) ‖ φ_j(Î) − φ_j(I) ‖²_2.    (3.10)

We choose VGG-Face [118] as our φ to leverage its face-related features and also because of simplicity. The loss is summed over C, a subset of layers of φ. Here φ_j(I) is the activation of the j-th layer of φ when processing the image I, with dimension W_j × H_j × C_j. This feature reconstruction loss is one of the perceptual losses widely used in different image processing tasks [66].

The reconstruction loss is a weighted sum of the two terms:

L_rec(Î, I) = L^i_rec(Î, I) + λ_f L^f_rec(Î, I).    (3.11)

Sparse Landmark Alignment. To help achieve better model fitting, which in turn helps to improve the model learning itself, we employ the landmark alignment loss, measuring the Euclidean distance between estimated and ground truth landmarks, as an auxiliary task:

L_lan = ‖ P [S(:, d); 1] − U ‖²_2,    (3.12)

where U ∈ R^{2×68} contains the manually labeled 2D landmark locations, and d is a constant 68-dim vector storing the indexes of the 68 3D vertices corresponding to the labeled 2D landmarks.
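As a minimal numpy sketch of Eqn. 3.12, the landmark loss selects the 68 landmark vertices, projects them with the current camera, and penalizes the squared distance to the labels:

import numpy as np

def landmark_loss(P, S, d, U):
    """Sparse landmark alignment loss of Eqn. 3.12.

    P: (2, 4) projection matrix; S: (3, Q) mesh vertices;
    d: (68,) integer indexes of the landmark vertices;
    U: (2, 68) manually labeled 2D landmarks.
    """
    S_lan = S[:, d]                                  # (3, 68) selected vertices
    S_hom = np.vstack([S_lan, np.ones((1, 68))])     # homogeneous coords, (4, 68)
    proj = P @ S_hom                                 # (2, 68) projected landmarks
    return np.sum((proj - U) ** 2)                   # squared l2 distance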
Different from traditional face alignment work where the shape bases are fixed, our work jointly learns the basis functions (i.e., the shape decoder D_S) as well. Minimizing the landmark loss while updating D_S only moves a tiny subset of vertices. If the shape S is represented as a vector and D_S is an MLP consisting of fully connected layers, vertices are independent; hence L_lan only adjusts 68 vertices. In case S is represented in the UV space and D_S is a CNN, local neighbor regions could also be modified. In both cases, updating D_S based on L_lan only moves a subset of vertices, which could lead to implausible shapes. Hence, when optimizing the landmark loss, we fix the decoder D_S and only update the encoder.

Also, note that different from some prior work [49], our network only requires ground-truth landmarks during training. It is able to predict landmarks via P and S during test time.

Regularizations. To ensure plausible reconstruction, we add a few regularization terms:

L_reg = L_sym(A) + λ_con L_con(A) + λ_smo L_smo(S).    (3.13)

Albedo Symmetry. As the face is symmetric, we enforce the albedo symmetry constraint:

L_sym(A) = ‖ A_uv − flip(A_uv) ‖_1.    (3.14)

Employed on the 2D albedo, this constraint can be easily implemented via a horizontal image flip operation flip().
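For example, the symmetry term of Eqn. 3.14 reduces to a one-line flip-and-compare on the UV albedo image; a numpy sketch:

import numpy as np

def albedo_symmetry_loss(A_uv):
    """Eqn. 3.14: l1 distance between the UV albedo and its horizontal flip.

    A_uv: (H, W, 3) albedo image; the flip is along the width axis, which
    corresponds to the left-right axis of the unwrapped face.
    """
    return np.sum(np.abs(A_uv - A_uv[:, ::-1, :]))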
3.3ExperimentalResults Theexperimentsstudythreeaspectsoftheproposednonlinear3DMM,intermsofitsexpressive- ness,representationpower,andapplicationstofacialanalysis.Usingfacialmeshtriangle byBaselFaceModel(BFM)[121],wetrainour3DMMusing300W-LPdataset[193],whichcon- tains122 ; 450in-the-wildfaceimages,inawideposerangefrom 90 to90 .Imagesareloosely squarecroppedaroundthefaceandscaleto256 256.Duringtraining,imagesofsize224 224 arerandomlycroppedfromtheseimagestointroducetranslationvariations. 25 SymConstInputOverlayAlbedoShadingTexture X XX Figure3.7: Effectofalbedoregularizations:albedosymmetry(sym)andalbedoconstancy(const).When thereisnoregularizationbeingused,shadingismostlybakedintothealbedo.Usingthesymmetryproperty helpstoresolvethegloballighting.Usingconstancyconstraintfutherremovesshadingfromthealbedo, whichresultsinabetter3Dshape. ThemodelisoptimizedusingAdamoptimizerwithalearningrateof0 : 001inbothtraining stages.Wesetthefollowingparameters: Q = 53 ; 215, U = 192 ; V = 224, l S = l T = 160. l values aresettomakelossestohavesimilarmagnitudes. 3.3.1AblationStudy 3.3.1.1EffectofRegularization AlbedoRegularization. Inthiswork,toregularizealbedolearning,weemploytwoconstraints toefremoveshadingfromalbedonamelyalbedosymmetryandconstancy.Todemonstrate theeffectoftheseregularizationterms,wecompareourfullmodelwithitspartialvariants:one withoutanyalbedoreqularizationandonewiththesymmetryconstraintonly.Fig.3.7shows visualcomparisonofthesemodels.Learningwithoutanyconstraintsresultsinthelightingis 26 InputOverlayShapeOverlayShape WithsmoothnessWithoutsmoothness Figure3.8: Effectofshapesmoothnessregularization. totallyexplainedbythealbedo,meanwhileistheshadingisalmostconstant(Fig.3.7(a)).Using symmetryhelptocorrectthegloballighting.However,symmetricgeometrydetailsarestillbaked intothealbedo(Fig.3.7(b)).Enforcingalbedoconstancyhelpstofurtherremoveshadingfrom it(Fig.3.7(c)).Combiningthesetworegularizationshelpstolearnplausiblealbedoandlighting, whichimprovestheshapeestimation. ShapeSmoothnessRegularization. Wealsoevaluatetheneedinshaperegularization.Fig.3.8 showsvisualcomparisonsbetweenourmodelanditsvariantwithouttheshapesmoothnesscon- straint.Withoutthesmoothnesstermthelearnedshapebecomesnoisyespeciallyontwosidesof theface.Thereasonisthat,thehairregionisnotcompletelyexcludedduringtrainingbecauseof imprecisesegmentationestimation. 3.3.1.2ModelingLightingandShapeRepresentation Inthiswork,wemaketwomajoralgorithmicdifferenceswithourpreliminarywork[160]:incor- poratinglightingintothemodelandchangingtheshaperepresentation. Ourpreviouswork[160]modelsthetexturedirectly,whilethisworkdisentanglestheshading fromthealbedo.Asargued,modelingthelightingshouldhaveapositiveimpactonshapelearning. 27 Table3.2: FacealignmentperformanceonALFW2000. MethodLightingUVshapeNME Our[160]4.70 Our X 4.30 Our XX 4 : 12 Hencewecompareourmodelswithresultsfrom[160]infacealignmenttask. Also,inourpreliminarywork[160],aswellasintraditional3DMM,shapeisrepresentedas avector,whereverticesareindependent.Despitethisshortage,thisapproachhasbeenwidely adoptedduetoitssimplicityandsamplingefy.Inthiswork,weexploreanalternativeto thisrepresentation:representthe3Dshapeasapositionmapinthe2DUVspace.Thisrepresen- tationhasthreechannels:oneforeachspatialdimension.Thisrepresentationmaintainsthespatial relationamongfacialmesh'svertices.Also,wecanuseCNNastheshapedecoderreplacingan expensiveMLP.Herewealsoevaluatetheperformancegainbyswitchingtothisrepresentation. 
Tab. 3.2 reports the performance on the face alignment task of different variants. As a result, modeling lighting helps to reduce the error from 4.70 to 4.30. Using the 2D representation, with the convenience of using a CNN, the error is further reduced to 4.12.

3.3.1.3 Comparison to Autoencoders

We compare our model-based approach with a convolutional autoencoder in Fig. 3.9. The autoencoder network has a similar depth and model size as ours. It gives blurry reconstruction results, as the dataset contains large variations in face appearance, pose angle, and even diverse backgrounds. Our model-based approach obtains sharper reconstruction results and provides semantic parameters allowing access to different components, including the 3D shape, albedo, lighting, and projection matrix.

Figure 3.9: Comparison to convolutional autoencoders (AE). Our approach produces results of higher quality. Also, it provides access to the 3D facial shape, albedo, lighting, and projection matrix.

3.3.2 Expressiveness

Exploring the feature space. We feed the entire CelebA dataset [97] with ~200k images to our network to obtain the empirical distribution of our shape and texture parameters. By varying the mean parameter along each dimension proportionally to its standard deviation, we can get a sense of how each element contributes to the shape and texture. We sort elements in the shape parameter f_S based on their differences to the mean 3D shape. Fig. 3.10 shows four examples of shape changes, whose differences rank No. 1, 40, 80, and 120 among 160 elements. Most of the top changes are expression related. Similarly, in Fig. 3.11, we visualize different texture changes by adjusting only one element of f_A off the mean parameter f̄_A. The elements with the same four ranks as the shape counterparts are selected.

Attribute Embedding. To better understand the different shape and albedo instances embedded in our two decoders, we dig into their attribute meaning. For a given attribute, e.g., male, we feed images with that attribute $\{I_i\}_{i=1}^{n}$ into our encoder E to obtain two sets of parameters $\{f^i_S\}_{i=1}^{n}$ and $\{f^i_A\}_{i=1}^{n}$. These sets represent the corresponding empirical distributions of the data in the low-dimensional spaces. Computing the mean parameters f̄_S, f̄_A and feeding them into their respective decoders, also using the mean lighting parameter, we can reconstruct the mean shape and texture with that attribute. Fig. 3.12 visualizes the reconstructed textured 3D mesh related to some attributes. Differences among attributes are present in both shape and texture. Here we can observe the power of our nonlinear 3DMM to model small details such as "bags under eyes" or "rosy cheeks", etc.

Figure 3.10: Each column shows shape changes when varying one element of f_S, by 10 times the standard deviation, in opposite directions. Ordered by the magnitude of shape changes.

Figure 3.11: Each column shows albedo changes when varying one element of f_A in opposite directions.

Figure 3.12: Nonlinear 3DMM generates shape and albedo embedded with different attributes (male, mustache, bags under eyes, old, female, rosy cheeks, bushy eyebrows, smiling).

3.3.3 Representation Power

We compare the representation power of the proposed nonlinear 3DMM vs. the traditional linear 3DMM.

Albedo. Given a face image, assuming we know the ground truth shape and projection parameters, we can unwarp the texture into the UV space, as we generate the "pseudo ground truth" texture in
the weakly supervised training step. With the ground truth texture, by using gradient descent, we can jointly estimate a lighting parameter L and an albedo parameter f_A whose decoded texture matches the ground truth. Alternatively, we can minimize the reconstruction error in the image space, through the rendering layer, with the ground truth S and P. Empirically, the two methods give similar performance, but we choose the first option as it involves only one warping step, instead of doing rendering in every optimization iteration. For the linear model, we use the albedo bases of the Basel Face Model (BFM) [121]. As in Fig. 3.13, our nonlinear texture is closer to the ground truth than the linear model. This is expected since the linear model is trained with controlled images. Quantitatively, our nonlinear model has a significantly lower averaged L1 reconstruction error than the linear model (0.053 vs. 0.097, as in Tab. 3.3).

Figure 3.13: Texture representation power comparison. Our nonlinear model can better reconstruct the facial texture.

Table 3.3: Quantitative comparison of texture representation power (average reconstruction error on the non-occluded face portion).

Method | Linear | Nonlinear w. Grad. Desc. | Nonlinear w. Network
L1     | 0.062  | 0.053                    | 0.057

3D Shape. We also compare the power of nonlinear and linear 3DMMs in representing real-world 3D scans. We compare with BFM [121], the most commonly used 3DMM at present. We use ten 3D face scans provided by [121], which are not included in the training set of BFM. As these face meshes are already registered using the same triangle definition as BFM, no registration is necessary. Given the ground truth shape, by using gradient descent, we can estimate a shape parameter whose decoded shape matches the ground truth. We define the matching criterion on both vertex distances and surface normal directions. This empirically improves the fidelity of the results compared to only optimizing vertex distances. Also, to emphasize the compactness of nonlinear models, we train different models with different latent space sizes. Fig. 3.14 shows the visual quality of the two models' reconstructions. Our reconstructions closely match the face shapes' details. To quantify the difference, we use NME, the averaged per-vertex error between the recovered and ground truth shapes, normalized by the inter-ocular distance. Our nonlinear model has a significantly smaller reconstruction error than the linear model, 0.0146 vs. 0.0241 (Tab. 3.4). Also, the nonlinear models are more compact. They can achieve similar performance to linear models whose latent space sizes are doubled.

Figure 3.14: Shape representation power comparison (l_S = 160). The error maps show the normalized per-vertex error.

Table 3.4: 3D scan reconstruction comparison (NME).

Latent size l_S | 40     | 80     | 160
Linear          | 0.0321 | 0.0279 | 0.0241
Nonlinear [160] | 0.0277 | 0.0236 | 0.0196
Nonlinear       | 0.0268 | 0.0214 | 0.0146

3.3.4 Applications

Having shown the capability of our nonlinear 3DMM (i.e., the two decoders), we now demonstrate the applications of our entire network, which has the additional encoder. Many applications of 3DMM are centered on its ability to fit to 2D face images. Similar to the linear 3DMM, our nonlinear 3DMM can be utilized for model fitting, which decomposes a 2D face into its shape, albedo, and lighting. Fig. 3.15 visualizes our 3DMM fitting results on the AFLW2000 and CelebA datasets. Our encoder estimates the shape S and albedo A, as well as the lighting L and projection matrix P. We can recover personal facial characteristics in both shape and albedo. Our albedo can present facial hair, which is normally hard to recover with a linear 3DMM.

Figure 3.15: 3DMM fitting to faces with diverse skin color, pose, expression, lighting, and facial hair, faithfully recovering these cues. The left half shows results from the AFLW2000 dataset; the right half shows results from CelebA.

Figure 3.16: Our face alignment results. Invisible landmarks are marked in red. We can handle extreme pose, lighting, and expression well.
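At inference time, the model fitting application above is a single feed-forward decomposition. A minimal sketch of that flow is below, assuming encoder/decoder modules with the interfaces described in this chapter (E producing f_S, f_A, L, P; D_S and D_A mapping latent vectors to UV-space shape and albedo); the function and field names are illustrative.

```python
import torch

def decompose_face(image, E, D_S, D_A, render):
    # One forward pass decomposes a 2D face into its intrinsic components.
    # E:   image -> (f_S, f_A, L, P)   with projection P and SH lighting L
    # D_S: f_S -> UV position map S    D_A: f_A -> UV albedo map A
    with torch.no_grad():
        f_S, f_A, L, P = E(image)
        S = D_S(f_S)                 # 3D shape (UV position map)
        A = D_A(f_A)                 # albedo, free of shading
        recon = render(S, A, L, P)   # differentiable rendering layer
    return {'shape': S, 'albedo': A, 'lighting': L,
            'projection': P, 'reconstruction': recon}
```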
3.3.4.1 Face Alignment

Face alignment is a critical step for many facial analysis tasks such as face recognition [162, 163]. With the enhancement in our modeling, we hope to improve this task (Fig. 3.16). We compare face alignment performance with state-of-the-art methods, 3DDFA [193], DeFA [96], 3D-FAN [22] and PRN [46], on the AFLW2000 dataset in both 2D and 3D settings.

The accuracy is evaluated using the Normalized Mean Error (NME) as the evaluation metric, with the bounding box size as the normalization factor [22]. For a fair comparison with these methods in terms of computational complexity, we use ResNet18 [60] as our encoder for this comparison. Here, 3DDFA and DeFA use the linear 3DMM model (BFM). Even though they are trained with a larger training corpus (DeFA) or use a cascade of CNNs iteratively refining the estimation (3DDFA), these methods are still outperformed by our nonlinear model (Fig. 3.17). Meanwhile, 3D-FAN and PRN achieve competitive performance by bypassing the linear 3DMM model. 3D-FAN uses a heatmap representation. PRN uses the position map representation, which shares a similar spirit to our UV representation. Not only does our model outperform these methods in terms of regressing landmark locations (Fig. 3.17), it also directly provides head pose information as well as the facial albedo and environment lighting condition.

Figure 3.17: Face alignment Cumulative Errors Distribution (CED) curves on AFLW2000-3D on 2D (left) and 3D landmarks (right). NMEs are shown in the legend boxes.

3.3.4.2 3D Face Reconstruction

We compare our approach to three recent representative face reconstruction works: 3DMM fitting networks learned in an unsupervised (Tewari et al. [153, 152]) or supervised fashion (Sela et al. [139]), and also a non-3DMM approach (Jackson et al. [63]).

MoFA, the monocular reconstruction work by Tewari et al. [153], is relevant to us as they also learn to fit 3DMM in an unsupervised fashion. Even when trained on in-the-wild images, their method is still limited to the linear bases. Hence their reconstructions suffer from surface shrinkage when dealing with challenging texture, i.e., facial hair (Fig. 3.18). Our network faithfully models these in-the-wild textures, which leads to better 3D shape reconstruction.

Concurrently, Tewari et al. [152] try to improve the linear 3DMM representation power by learning a corrective space on top of a traditional linear model. Despite sharing a similar spirit, our model exploits the spatial relation between neighboring vertices and uses CNNs as shape/albedo decoders, which are more efficient than MLPs. As a result, our reconstructions more closely match the input images in both texture and shape (Fig. 3.19).

Figure 3.18: 3D reconstruction results comparison to Tewari et al. [153]. Their reconstructed shapes suffer from surface shrinkage when dealing with challenging texture or shape outside the linear model subspace. They cannot handle large pose variation well either. Meanwhile, our nonlinear model is more robust to these variations.

Figure 3.19: 3D reconstruction results comparison to Tewari et al. [152]. Our model better reconstructs the input image in both texture (facial hair direction in the first image) and shape (nasolabial folds in the second image).

The high-quality 3D reconstruction works by Richardson et al. [128, 129] and Sela et al. [139] obtain impressive results on adding fine-level details to the face shape when images are within the span of the synthetic training corpus or the employed 3DMM model. However, their performance degrades when dealing with variations outside the training data span, e.g., facial hair. Our approach is not only robust to facial hair and make-up, but also automatically learns to reconstruct such variations based on the jointly learned model. We provide comparisons with them in Fig. 3.20, using the code provided by the authors.

Figure 3.20: 3D reconstruction results comparison to Sela et al. [139]. Besides showing the shape, we also show their estimated depth and correspondence maps. Facial hair or occlusion can cause serious problems in their output maps.
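The NME numbers reported in this section all follow the same recipe: the mean per-landmark (or per-vertex) Euclidean error divided by a normalization factor, the bounding box size for alignment [22] or the inter-ocular distance for 3D scan evaluation. A minimal sketch, assuming predicted and ground truth landmark arrays in correspondence; the sqrt(w*h) box normalization is one common convention, shown here as an example:

```python
import numpy as np

def nme(pred, gt, norm_factor):
    # pred, gt: (N, 2) or (N, 3) landmark/vertex arrays in correspondence.
    # norm_factor: e.g., sqrt(box_w * box_h) for alignment on AFLW2000,
    # or the inter-ocular distance for 3D scan evaluation.
    errors = np.linalg.norm(pred - gt, axis=1)
    return errors.mean() / norm_factor

# Example: bounding-box-normalized NME for one face.
# score = nme(pred_landmarks, gt_landmarks, np.sqrt(w * h))
```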
The current state-of-the-art method by Sela et al. [139] consists of three steps: an image-to-image network estimating a depth map and a correspondence map, non-rigid registration, and a detail reconstruction step. Their image-to-image network is trained on synthetic data generated by the linear model. Besides the domain gap between synthetic and real images, this network faces a more serious problem: the lack of facial hair in the low-dimensional texture subspace of the linear model. The network's output tends to ignore these unexplainable regions (Fig. 3.20), which leads to failure in later steps. Our network is more robust in handling these in-the-wild variations. Furthermore, our approach is orthogonal to Sela et al. [139]'s detail reconstruction module or Richardson et al. [129]'s refinement step. Employing these on top of our fitting could lead to promising further improvement.

We also compare our approach with a non-3DMM approach, VRN by Jackson et al. [63]. To avoid using the low-dimensional subspace of the linear 3DMM, it directly regresses a volumetric representation of the 3D shape via an encoder-decoder network with skip connections. This potentially helps the network to explore a larger solution space than the linear model, however at the cost of losing correspondence between facial meshes. Fig. 3.21 shows a visual comparison of 3D reconstructions between VRN and ours. In general, VRN robustly handles in-the-wild texture variations. However, because of the volumetric shape representation, the surface is not smooth and is partially limited in presenting medium-level details as ours does. Also, our model further provides the projection matrix, lighting, and albedo, which makes it applicable to more applications.

Figure 3.21: 3D reconstruction results comparison to VRN by Jackson et al. [63] on the CelebA dataset. The volumetric shape representation results in non-smooth 3D shapes and loses correspondence between reconstructed shapes.

Figure 3.22: 3D reconstruction quantitative evaluation on FaceWarehouse. We obtain a lower error compared to PRN [46] and 3DDFA+ [195].

Quantitative Comparisons. To quantitatively compare our method with prior works, we evaluate monocular 3D reconstruction performance on FaceWarehouse [24] and the Florence dataset [9], in which ground truth 3D shapes are available. Due to the difference in mesh topology, ICP [7] is used to establish correspondence between estimated shapes and ground truth point clouds. Similar to previous experiments, NME (averaged per-vertex error normalized by the inter-ocular distance) is used as the comparison metric.

FaceWarehouse. We compare our method with prior works with available pretrained models on all 19 expressions of the 150 subjects of the FaceWarehouse database [24]. Visual and quantitative comparisons are shown in Fig. 3.22. Our model can faithfully resemble the input expression and surpasses all other regression methods (PRN [46] and 3DDFA+ [195]) in terms of dense face alignment.

Florence. Using the experimental setting proposed in [63], we also quantitatively compare our approach with state-of-the-art methods (e.g., VRN [63] and PRN [46]) on the Florence dataset [9]. Each subject is rendered with multiple poses: pitch rotations of -15°, 20° and 25°, and yaw rotations between -80° and 80°. Our model consistently outperforms other methods across different view angles (Fig. 3.23).

Figure 3.23: 3D face reconstruction results on the Florence dataset [9]: (a) CED curves, (b) NME. The NME of each method is shown in the legend.

Table 3.5: Running time of various 3D face reconstruction methods.
Method            | Encoder | Decoder    | Post-processing | Rendering
Sela et al. [139] | ~10 ms  | -          | ~180 s          | -
VRN [63]          | ~10 ms  | -          | -               | -
MoFA [153]        | ~4 ms   | Negligible | -               | -
Our               | 2.7 ms  | 5.5 ms     | -               | 140 ms

3.3.5 Runtime

In this section, we compare the running time of multiple 3D reconstruction approaches. Since different methods are implemented in different frameworks/languages, this comparison aims only to provide relative comparisons between them. Sela et al. [139] and VRN [63] both use an encoder-decoder network with skip connections, with similar runtime. However, Sela et al. [139] requires an expensive non-rigid registration step as well as a refinement module. We get a comparable encoder running time to the 3DMM regression network of MoFA [153]. However, since they directly use linear bases, their decoding step is trivial, a single matrix multiplication; our model requires decoding features via two CNNs for shape and texture, respectively. We also note that the running time of the rendering layer is higher than that of the other components. Luckily, rendering to reconstruct the input is not required during testing.

3.4 Conclusions

Since its debut in 1999, 3DMM has become a cornerstone of facial analysis research with applications to many problems. Despite its impact, it has drawbacks in requiring training data of 3D scans, learning from controlled 2D images, and limited representation power due to linear bases for both shape and texture. These drawbacks could be formidable when fitting 3DMM to unconstrained faces, or learning 3DMM for generic objects such as shoes. This work demonstrates that there exists an alternative approach to 3DMM learning, where a nonlinear 3DMM can be learned from a large set of in-the-wild face images without collecting 3D face scans. Further, the model fitting algorithm can be learned jointly with the 3DMM, in an end-to-end fashion.

Our experiments cover diverse aspects of our learned model, some of which might need the subjective judgment of the readers. We hope that both the judgment and the quantitative results can be viewed under the context that, unlike linear 3DMM, no genuine 3D scans are used in our learning. Finally, we believe that unsupervisedly or weakly-supervisedly learning 3D models from large-scale in-the-wild 2D images is one promising research direction. This work is one step along this direction.

Chapter 4

Towards High-Fidelity Nonlinear 3D Face Morphable Model

This chapter is adapted from the following publication:
[1] Luan Tran, Feng Liu, and Xiaoming Liu, "Towards High-Fidelity Nonlinear 3D Face Morphable Model," in CVPR, 2019.

4.1 Introduction

In Chapter 3, we presented our proposed framework using deep neural networks to represent the 3DMM basis functions, to increase the model representation power, and to learn the model directly from unconstrained 2D images to better capture in-the-wild variations.

However, even with better representation power, this model and related works [152] still rely on many constraints to regularize the model learning. Hence, their objective involves the conflicting requirements of a strong regularization for a global shape vs. a weak regularization for capturing higher-level details. For example, in order to faithfully separate shading and albedo, albedo is usually assumed to be piecewise constant [82, 144], which prevents learning albedo with a high level of details. In this chapter, besides learning the shape and the albedo, we propose to learn additional shape and albedo proxies, on which we can enforce regularizations. This also allows us to flexibly pair the true shape with the strongly regularized albedo proxy to learn the detailed shape, or vice versa. As a result, each element can be learned with high fidelity without sacrificing the other element's quality.

On a different note, many 3DMM models fail to represent small details because of their parameterization. Many global 3D face parameterizations have been proposed to overcome the ambiguities associated with monocular face fitting, such as noise or occlusion. However, because they are designed to model the whole face at once, it is difficult to use them to represent small details.
Meanwhile, local-based models offer more flexibility than global approaches, but at the cost of being less constrained to realistically represent human faces. We propose using dual-pathway networks to provide a better balance between global and local-based models. From the latent space, there is a global pathway focusing on the inference of the global face structure and multiple local pathways generating details of different semantic facial parts. Their corresponding features are then fused together in successive processing to generate the final shape and albedo. This network also helps the local pathways to specialize in their facial parts, which both improves the quality and saves computation power.

In this chapter, we improve the nonlinear 3D face morphable model in both the learning objective and the network architecture:

- We solve the conflicting objective problem by learning additional shape and albedo proxies with proper regularization.
- The novel pairing scheme allows learning both detailed shape and albedo without sacrificing one's quality.
- The global-local-based network architecture offers a better balance between model robustness and flexibility.
- The proposed model allows, for the first time, 3D face reconstruction by solely optimizing latent representations.

Figure 4.1: The proposed framework. Each shape or albedo decoder consists of two branches to reconstruct the true element and its proxy. Proxies free the shape and albedo from strong regularizations, allowing them to learn models with a high level of details.

4.2 Proposed Method

4.2.1 Nonlinear 3DMM with Proxy and Residual

Recall from the last chapter that, in the original nonlinear 3DMM, the overall objective can be summarized as:

$L = L_{recon}(\hat{I}, I) + L_{lan} + L_{reg}$,  (4.1)

with

$L_{reg} = L_{sym}(A) + \lambda_{con} L_{con}(A) + \lambda_{smo} L_{smo}(S)$.  (4.2)

Proxy and Residual Learning. Strong regularization has been shown to be critical in ensuring the plausibility of the learned models [152, 161]. However, the strong regularization also prevents the model from recovering a high level of details in either shape or albedo. Hence, this prevents us from achieving the ultimate goal of learning a high-fidelity 3DMM model.

In this work, we propose to learn an additional proxy shape ($\breve{S}$) and proxy albedo ($\breve{A}$), on which we can apply the regularization. All presented regularizations will now be moved to the proxies:

$L_{reg} = L_{sym}(\breve{A}) + \lambda_{con} L_{con}(\breve{A}) + \lambda_{smo} L_{smo}(\breve{S})$.  (4.3)

There will be no regularization applied directly to the actual shape S and albedo A, other than a weak regularization encouraging each to be close to its proxy:

$L_{res} = \| \Delta S \|_1 + \| \Delta A \|_1 = \| S - \breve{S} \|_1 + \| A - \breve{A} \|_1$.  (4.4)

By pairing the two shapes S, $\breve{S}$ and the two albedos A, $\breve{A}$, we can render four different output images (Fig. 4.1). Any of them can be used to compare with the original input image. We rewrite our reconstruction loss as:

$L_{rec} = L_{rec}(\hat{I}(\breve{S}, \breve{A}), I) + L_{rec}(\hat{I}(\breve{S}, A), I) + L_{rec}(\hat{I}(S, \breve{A}), I)$.  (4.5)

Pairing strongly regularized proxies and weakly regularized components is a critical point in our approach. Using proxies allows us to learn high-fidelity shape and albedo without sacrificing the quality of either component. This pairing is inspired by the observation that Shape-from-Shading techniques are able to recover a detailed face mesh by assuming an over-regularized albedo, or even using the mean albedo [129]. Here, the $L_{rec}(\hat{I}(S, \breve{A}), I)$ loss promotes S to recover more details, as $\breve{A}$ is constrained by the piecewise-constant $L_{con}(\breve{A})$ objective. Vice versa, $L_{rec}(\hat{I}(\breve{S}, A), I)$ aims to learn a better albedo. In order for these two losses to work as desired, the proxies $\breve{S}$ and $\breve{A}$ should perform well enough to approximate the input images by themselves. Without $L_{rec}(\hat{I}(\breve{S}, \breve{A}), I)$, a valid solution that minimizes $L_{rec}(\hat{I}(S, \breve{A}), I)$ is a combination of a constant albedo proxy and a noisy shape creating surface normals with dark shading in necessary regions, i.e., eyebrows.
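A compact sketch of this pairing scheme follows, assuming a differentiable `render(S, A, L, P)` layer and an image-space reconstruction loss `l_rec` as introduced above; the function and argument names are illustrative. Note that the (S, A) pairing is deliberately left out of the reconstruction loss, as discussed next.

```python
def pairing_losses(S, S_proxy, A, A_proxy, L, P, image, render, l_rec):
    # Eqn. 4.5: render three of the four shape/albedo pairings and compare
    # each against the input image. Pairing the detailed shape S with the
    # strongly regularized proxy albedo pushes details into S (and vice versa).
    loss_rec = (l_rec(render(S_proxy, A_proxy, L, P), image)
                + l_rec(render(S_proxy, A, L, P), image)
                + l_rec(render(S, A_proxy, L, P), image))
    # Eqn. 4.4: weak residual regularization keeping S, A near their proxies.
    loss_res = (S - S_proxy).abs().mean() + (A - A_proxy).abs().mean()
    return loss_rec, loss_res
```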
Another notable design choice is that we intentionally leave out the loss function on $\hat{I}(S, A)$, even though this theoretically is the most important objective. This is to avoid the case where the shape S learns an in-between solution that works well with both $\breve{A}$ and A, and vice versa.

Occlusion Imputation. With the proposed objective function, our model is able to faithfully reconstruct input images. However, we empirically found that, besides the visible regions, the model tends to keep the invisible regions smooth, since there is no supervision on those areas other than the residual magnitude loss pulling the shape and albedo closer to their proxies. To learn a more meaningful model, which is beneficial to other applications, i.e., face editing or face synthesis, we propose to use a soft symmetry loss [159] on occluded regions:

$L_{res\text{-}sym}(S) = \big\| T \odot \big( \Delta S^{uv}_z - \mathrm{flip}(\Delta S^{uv}_z) \big) \big\|_1$,  (4.6)

where T is a mask in UV space indicating the visibility of each pixel, approximated based on the current surface normal direction. Even though the shape itself is not symmetric, i.e., a face with an asymmetric expression, we enforce the symmetry property on its depth residual.

4.2.2 Global-Local-Based Network Architecture

While global-based models are usually robust to noise and mismatches, they are usually over-constrained and do not provide sufficient flexibility to represent high-frequency deformations as local-based models do. In order to take the best of both worlds, we propose to use dual-pathway networks for our shape and albedo decoders.

Here, we transfer the success of combining local and global models in image synthesis [110, 62] to 3D face modeling. The general architecture of a decoder is shown in Fig. 4.2. From the latent vector, there is a global pathway focusing on the inference of the global structure and a local pathway with four small sub-networks generating details of different facial parts, including the eyes, nose, and mouth. The global pathway is built from fractionally strided convolution layers with five up-sampling steps. Meanwhile, each sub-network in the local pathway has a similar architecture but is shallower, with only three up-sampling steps. Using different small sub-networks for each facial part offers two benefits: i) with fewer up-sampling steps, the network is better able to represent high-frequency details in early layers; ii) each sub-network can learn part-specific filters, which is more computationally efficient than applying filters across the global face.

Figure 4.2: The proposed global-local-based network architecture.

As shown in Fig. 4.2, to fuse the two pathways' features, we first integrate the four local pathways' outputs into one single feature tensor. Different from other works that synthesize face images with different yaw angles [162, 163, 73] with no fixed keypoint locations, our 3DMM generates the facial albedo as well as the 3D shape in the UV space with a fixed topology. Merging these local feature tensors is efficiently done with a zero-padding operation. A max-pooling fusion strategy is also used to reduce the stitching artifacts in the overlapping areas. The resultant feature is simply concatenated with the global pathway's feature, which has the same spatial resolution. Successive convolution layers integrate information from both pathways and generate the final albedo/shape (or their proxies).

Figure 4.3: Reconstruction results with different loss functions (input; $l_{2,1}$; $l_{2,1}$ + gradient difference; $l_{2,1}$ + perceptual).

4.3 Experimental Results

The experiments study different aspects of the proposed nonlinear 3DMM, in terms of its representation power and applications to facial analysis. The model is trained following the same settings as in Chapter 3, including the training dataset, mesh topology, and optimizer parameters.

4.3.1 Ablation Study

Reconstruction Loss Functions.
We study the effects of different reconstruction losses on the quality of the reconstructed images (Fig. 4.3). As expected, the model trained with the $l_{2,1}$ loss only results in blurry reconstructions, similar to other $l_p$ losses. To make the reconstruction more realistic, we explore other options such as the gradient difference [104] or perceptual loss [66]. While adding the gradient difference loss creates more details in the reconstruction, combining the perceptual loss with $l_{2,1}$ gives the best results, with a high level of details and realism. For the rest of this chapter, we will refer to the model trained using this combination.

Figure 4.4: Image reconstruction with our 3DMM model using the proxy and the true shape and albedo. Our shape and albedo can faithfully recover details of the face. Note: for the shape, we show the shading in UV space, a better visualization than the raw $S^{uv}$.

Understanding image pairing. Fig. 4.4 shows results of fitting our model to a 2D face image. By using the proxies or the final components (shape or albedo), we can render four different reconstructed images with different quality and characteristics. The image generated by the two proxies $\breve{S}$, $\breve{A}$ is quite blurry but is still able to capture major variations in the input face. By pairing S with the proxy $\breve{A}$, S is enforced to capture a high level of details to bring the image closer to the input. Similarly, A is also encouraged to capture more details by pairing with the proxy $\breve{S}$. The image $\hat{I}(S, A)$ inherently achieves a high level of details and realism, even without direct optimization.

Residual Soft Symmetry Loss. We study the effects of the residual soft symmetry loss on recovering details in occluded face regions. As shown in Fig. 4.5, without $L_{res\text{-}sym}$, the learned model can result in an unnatural shape, in which one side of the face is over-smooth in occluded regions, while the other side still has a high level of details. Our model learned with $L_{res\text{-}sym}$ can consistently create details across the face, even in occluded areas.

Figure 4.5: Effect of the soft symmetry loss on our shape model (with vs. without $L_{res\text{-}sym}$).

Table 4.1: Quantitative comparison of texture representation power (average reconstruction error on the non-occluded face portion).

Method                        | Reconstruction error ($l_{2,1}$)
Linear [193]                  | 0.1287
Nonlinear [161]               | 0.0427
Nonlinear + GL (Ours)         | 0.0386
Nonlinear + GL + Proxy (Ours) | 0.0363

4.3.2 Representation Power

We compare the representation power of the proposed nonlinear 3DMM with the Basel Face Model [121], the most commonly used linear 3DMM. We also make comparisons with the recently proposed nonlinear 3DMM [160].

Texture. We evaluate our model's power to represent in-the-wild facial texture. Given a face image, along with the ground truth shape and projection matrix, we can jointly estimate an albedo parameter f_A and a lighting parameter L whose decoded texture can reconstruct the original image. To accomplish this, we use SGD on f_A and L with the initial parameters estimated by our encoder E. For the linear model, Zhu et al. [193]'s fitting results of the Basel albedo using the Phong illumination model [122] are used. As in Fig. 4.6, the nonlinear model significantly outperforms the Basel
Face Model. Despite being close to the original image, the reconstruction results of the Tran et al. [161] model are still blurry. Using the global-local-based network architecture ("+GL") with the same loss functions helps to bring the image closer to the input. However, these models are still constrained by regularizations on the albedo. By learning with the proxy technique ("+Proxy"), our model can learn a more realistic albedo with more high-frequency details on the face. This conclusion is further supported by the quantitative comparison in Tab. 4.1. We report the averaged $l_{2,1}$ reconstruction error over the face portion of each image. Our model achieves the lowest averaged reconstruction error among the four models, 0.0363, which is a 15% error reduction relative to the recent nonlinear 3DMM work [161].

Figure 4.6: Texture representation power comparison. Our nonlinear model can better reconstruct the facial texture.

Shape. Similarly, we also compare the models' power to represent real-world 3D scans. Using the ten 3D face meshes provided by [121], which share the same triangle topology with us, we can optimize the shape parameter to generate, through the decoder, shapes matching the ground truth scans. The matching criterion is defined on both vertex distances (Euclidean) and surface normal directions (cosine distance), which empirically improves the fidelity of the reconstructed meshes compared to optimizing vertex distances only. Fig. 4.7 shows the visual comparisons between different reconstructed meshes. Our reconstructions closely match the face shapes' details. To quantify the difference, we use NME, the averaged per-vertex Euclidean distance between the recovered and ground truth meshes, normalized by the inter-ocular distance. The proposed model has a smaller reconstruction error than the linear model, and also smaller than the nonlinear model by Tran et al. [161] (0.0139 vs. 0.0146 [161] and 0.0241 [121]).

Figure 4.7: Shape representation power comparison (NME: linear [121] 0.0241, nonlinear [161] 0.0146, ours 0.0139). Given a 3D shape, we optimize the feature f_S to approximate the original one.

4.3.3 Identity-Preserving

We explore the effect of our proposed 3DMM on preserving identity when reconstructing face images. Using DR-GAN [163], a pretrained face recognition network, we can compute the cosine distance between the input and its reconstruction from different models. Fig. 4.8 shows the plot of these score distributions. At each horizontal mark, there are exactly three points presenting the distances between an image and its reconstructions from the three models. Images are sorted based on the distance to our reconstruction. For the majority of the cases (77.2%), our reconstruction has the smallest difference to the input in the identity space.

Figure 4.8: The distance between the input images and their reconstructions from three models. For better visualization, images are sorted based on their distance to our model's reconstructions.

4.3.4 3D Reconstruction

Using our model D_S, D_A, together with the model fitting CNN E, we can decompose a 2D photograph into different components: 3D shape, albedo, and lighting (Fig. 4.9). Here we compare our 3D reconstruction results with different lines of work: linear 3DMM fitting [153], nonlinear 3DMM [152, 161], and approaches beyond 3DMM [63, 139].

For the linear 3DMM model, the representative work, MoFA by Tewari et al. [153, 151], learns to regress 3DMM parameters in an unsupervised fashion. Even when trained on in-the-wild images, it is still limited to the linear subspace, with limited power for recovering in-the-wild texture. This results in surface shrinkage when dealing with challenging textures, i.e., facial hair, as discussed in [152, 160, 161]. Besides, even with regular skin texture, their reconstruction is still blurry and has fewer details compared to ours (Fig. 4.10).

Figure 4.9: 3DMM fitting to faces with diverse skin color, pose, expression, and lighting, faithfully recovering these cues.

Figure 4.10: 3D reconstruction comparison to Tewari et al. [153].
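The identity-preservation experiment above (Sec. 4.3.3) reduces to comparing face embeddings. A minimal sketch, assuming some pretrained face recognition network `embed` (DR-GAN in this chapter) that maps images to identity feature vectors; the helper name and batching are illustrative.

```python
import torch
import torch.nn.functional as F

def identity_distance(embed, image, reconstruction):
    # Cosine distance in the identity space of a pretrained recognizer:
    # 0 means identical identity features, 2 means opposite.
    f_in = F.normalize(embed(image), dim=-1)
    f_re = F.normalize(embed(reconstruction), dim=-1)
    return 1.0 - (f_in * f_re).sum(dim=-1)
```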
Figure 4.11: 3D reconstruction comparisons to the nonlinear 3DMM approaches by Tewari et al. [152] and Tran and Liu [161]. Our model can reconstruct face images with a higher level of details. Please zoom in for more details. Best viewed electronically.

The most related works to our proposed model are Tewari et al. [152] and Tran and Liu [161], in which 3DMM bases are embedded in neural networks. With more representation power, these models can recover details which the traditional 3DMM usually cannot, i.e., make-up, facial hair. However, the model learning process is attached to strong regularization, which limits their ability to recover high-frequency details of the face. Our proposed model enhances the learning process in both the learning objective and the network architecture to allow higher-fidelity reconstructions (Fig. 4.11).

Figure 4.12: 3D reconstruction comparisons to Sela et al. [139] and Tran et al. [159], which go beyond latent space representations.

To improve 3D reconstruction quality, many approaches also try to move beyond the 3DMM, such as Richardson et al. [129], Sela et al. [139], or Tran et al. [159]. The current state-of-the-art 3D monocular face reconstruction method by Sela et al. [139] uses a detail reconstruction step to help reconstruct high-fidelity meshes. However, their depth map regression step is trained on synthetic data generated by the linear 3DMM. Besides the domain gap between synthetic and real images, it faces a more serious problem: the lack of facial hair in the low-dimensional texture subspace. Hence, this network's output tends to ignore these unexplainable regions, which leads to failure in later steps. Our network is more robust in handling these in-the-wild variations (Fig. 4.12). The approach of Tran et al. [159] shares a similar objective with us, to be both robust and maintain a high level of details in 3D reconstruction. However, they use an over-constrained foundation, which loses the personal characteristics of each face mesh. As a result, their 3D shapes look similar across different subjects (Fig. 4.12).

4.3.5 Face Editing

Decomposing a face image into individual components gives us the ability to edit the face by manipulating any component. Here we show three examples of face editing using our model.

Relighting. First we show an application replacing the lighting of a target face image using the lighting from a source face (Fig. 4.13). After estimating the lighting parameters $L_{source}$ of the source image, we render the transferred shading using the target shape $S_{target}$ and the source lighting $L_{source}$. This transferred shading can be used to replace the original shading. Alternatively, the value of $L_{source}$ can be arbitrarily chosen based on the SH lighting model, without the need of source images. Also, here we use the original texture instead of the output of our decoder to maintain image details.

Figure 4.13: Lighting transfer results. We transfer the lighting of source images (first row) to target images (first column). We have similar performance compared to the state-of-the-art method of Shu et al. [143], despite being orders of magnitude faster (150 ms vs. 3 min per image).

Attribute Manipulation. Given faces fitted by our 3DMM model, we can edit images by naively modifying one or more elements in the albedo or shape representation. More interestingly, we can even manipulate semantic attributes, such as growing a beard, smiling, etc. The approach is similar to learning the attribute embedding in Sec. 3.3.2.

Figure 4.14: Growing mustache editing results. The first column shows the original images; the following columns show edited images with increasing magnitudes. Compared to Shu et al. [144]'s results (last row), our edited images are more realistic and identity-preserving.

Assume we would like to edit the appearance
only. For a given attribute, e.g., beard, we feed two sets of images, with and without that attribute, $\{I^p_i\}_{i=1}^{n}$ and $\{I^n_i\}_{i=1}^{n}$, into our encoder to obtain two average parameters $\bar{f}^p_A$ and $\bar{f}^n_A$. Their difference, $\Delta f_A = \bar{f}^p_A - \bar{f}^n_A$, is the direction to move from the distribution of negative images to positive ones. By adding $\Delta f_A$ with different magnitudes, we can generate images with different degrees of change. To achieve high-quality editing with identity preserved, the final editing result is obtained by adding the residual, the difference between the modified image and our reconstruction, to the original input image. This is a critical difference to Shu et al. [144], improving the quality of our results (Fig. 4.14).

Figure 4.15: Adding stickers to faces. The sticker is naturally added onto faces, following the surface normal and lighting.

Adding Stickers. With a more precise 3D face mesh reconstruction, the quality of successive tasks is also improved. Here, we show an application of our model to face editing: adding stickers or tattoos onto faces. Using the estimated shape as well as the projection matrix, we can unwrap the facial texture into the UV space. Thanks to the lighting decomposition, we can also remove the shading from the texture to get the detailed albedo. From here we can directly edit the albedo by adding a sticker, tattoo, or make-up. Finally, the edited images can be rendered using the modified albedo together with the other original elements. Fig. 4.15 shows our editing results of adding stickers onto different people's faces.

4.4 Conclusions

In realization that strong regularization and global-based modeling are the roadblocks to achieving a high-fidelity 3DMM model, this chapter presents a novel approach to improve nonlinear 3DMM modeling in both the learning objective and the network architecture. Hopefully, with the insights and findings discussed, this can be a step toward unlocking the possibility of building a model which can capture mid and high-level details of the face, through which high-fidelity 3D face reconstruction can be achieved solely by doing model fitting.

Chapter 5

Intrinsic 3D Decomposition, Segmentation, and Modeling Generic Objects

This chapter is adapted from the following work:
[1] Feng Liu, Luan Tran and Xiaoming Liu, "Intrinsic 3D Decomposition and Modeling for Generic Objects via Colored Occupancy Field," under submission. (Luan Tran and Feng Liu make equal contribution to this work.)

5.1 Introduction

Understanding 3D structure is one of computer vision's fundamental problems. A human has no difficulty understanding the 3D structure of an object upon seeing its 2D image. Even without geometric cues (motion or stereopsis), our visual system can still infer detailed surfaces or plausibly fill in hidden parts. Meanwhile, such a 3D inferring task remains extremely challenging for computer vision systems.

In recent years, with advancements in deep learning, many works have shown human-level performance on 2D image understanding, such as object detection [59], recognition [61, 163], and segmentation [57, 21]. One of the main reasons for this success is the abundance of annotated data. For the majority of 2D understanding tasks, nowadays, there are usually many databases with sufficient annotated images. Hence, decent performance can be obtained using end-to-end supervised learning. However, extending this success to supervised learning for 3D inference is far behind, due to the limited availability of 3D labels.

With the introduction of large 3D Computer-Aided Design (CAD) databases like ObjectNet3D [176] and ShapeNet [26], the majority of recent works on 3D monocular object reconstruction [56, 54, 34] and intrinsic image decomposition [64, 142] rely entirely on synthetic images generated from the CAD models. However, using synthetic data alone has major drawbacks. First of all, creating CAD models is not scalable. Making a single 3D object instance is labor intensive and requires expertise in computer graphics. Hence, it is not feasible to build models for all available objects.
Secondly, there is still an obvious gap between synthetically rendered images and real images, even with advanced rendering techniques in computer graphics. Therefore, these methods have limited ability in reconstruction from real-world images.

Meanwhile, there are large collections of 2D images for almost any object category. If those images can be effectively used in either 3D object modeling or learning to fit the model, it could have a great impact on 3D object reconstruction. Essentially, the reason that real-world 2D images have not been effectively used in generic object 3D reconstruction is the lack of corresponding ground truth 3D shapes for these images, and thus no supervised learning.

Early attempts [89, 164] on learning 3D shape models from 2D photographs in an unsupervised fashion are still limited in exploiting 2D images. Given an input image, they mainly try to learn a 3D model to reconstruct the 2D silhouette of the object. To learn a better model, multiple views of the same object with ground-truth pose or keypoint annotations are needed. More importantly, they ignore additional monocular cues, e.g., shading, that contain rich 3D information. One common issue among prior works is the lack of modeling for albedo, one key element in image formation. As a result, analysis-by-synthesis approaches are not applicable to 3D modeling of generic objects.

To address these issues, we propose a novel paradigm to jointly learn a completed and segmented 3D model, consisting of both 3D shape and albedo, as well as a model fitting module to estimate the shape, albedo, lighting and camera matrix from 2D images, as in Fig. 5.1. Different from prior 3D reconstruction work, this is the first work modeling both shape and albedo of a generic object in a semi-supervised manner. Modeling albedo, together with estimating the environment lighting condition, enables us to fully exploit the shading cues from 2D images to estimate the 3D shape.

Figure 5.1: This work decomposes a 2D image of generic objects into albedo, 3D shape, illumination, and camera projection.

Specifically, considering large intra-class variations in mesh topology, we propose to use a colored occupancy field to completely represent a 3D object. For every spatial point, the colored occupancy field provides the probability of whether it is inside the object and also the RGB value of its albedo. The surface of the object is implicitly represented as the iso-surface at a certain threshold of the occupancy probability. The colored occupancy field can theoretically represent a shape at an arbitrarily high resolution, which only depends on the sampling density of spatial points. Moreover, also due to the lack of consistency in meshes' topology, dense correspondence between 3D shapes is missing. We propose to jointly model the object part segmentation, which exploits its implicit correlation with shape and albedo, and also creates explicit constraints for our model learning.

In summary, the contributions of this chapter include:

- We build the first 3D model that fully models segmented 3D shape and albedo for generic objects, using a colored occupancy field as the representation.
- Modeling intrinsic components allows us to not only better exploit visual cues, but also, for the first time, use real images for model training in an unsupervised manner.
- Incorporating unsupervised part segmentation enables better constraints for the shape fitting and pose estimation.
- We demonstrate superior performance on 3D reconstruction of generic objects from a single 2D image.

5.1.1 3D Shape and Albedo Representation

Shape Implicit Field. In contrast to the 2D domain, the community has not yet agreed on a 3D representation that is both memory efficient and inferable from data. Recently, a lot of attention has focused on implicit representations, where each shape can be represented by a function
$o: \mathbb{R}^3 \to [0, 1]$. This function takes a spatial location $x \in \mathbb{R}^3$ as input and outputs its probability of occupancy [34, 108, 117]. With this implicit representation, the shape can be viewed at an arbitrarily high resolution. Another appealing property of this representation is that the surface normal can be analytically computed using the spatial derivative $\frac{d\, D_S(f_S, x)}{d x}$ via back-propagation through the network. This is helpful for successive analysis tasks such as rendering.

As in [34, 108, 117], leveraging deep neural networks, a family or an instance of shape functions can be represented using a decoder network $D_S$, where each shape S is encoded by a latent representation $f_S \in \mathbb{R}^{d_S}$ (Fig. 5.2.a):

$D_S : \mathbb{R}^3 \times \mathbb{R}^{d_S} \to [0, 1]$.  (5.1)

The shape decoder's architecture follows BAE-NET [33]. BAE-NET is a joint shape co-segmentation and reconstruction network, which takes the shape latent representation $f_S$ and a spatial point $x = (x, y, z)$ as inputs. It is composed of 3 fully connected layers, each followed by a LeakyReLU, except the output (Sigmoid). The final layer gives the implicit fields for four branches $(o_1, o_2, o_3, o_4)$. Finally, a max-pooling operator on the branch outputs results in the final implicit field o. BAE-NET is much shallower and thinner compared to IM-NET [34], since it cares more about the quality of segmentation rather than reconstruction. We propose to integrate the shape segmentation into albedo learning, which is shown to benefit both segmentation and reconstruction.

Figure 5.2: Shape and albedo decoder networks. The shape decoder D_S takes a shape latent representation f_S and a spatial point x = (x, y, z) and produces the implicit field for each branch. The output layer groups the branch outputs, via max pooling, to form the spatial probability of occupancy. The albedo decoder D_A receives both latent representations f_S, f_A and estimates the albedo colors of 4 branches, one of which is selected by the shape branch/segmentation and returned as the albedo color of x.

Albedo Implicit Field. For a completed model, each vertex on the shape surface is assigned an RGB albedo color. Extending the idea of the occupancy field to albedo, we propose to represent the albedo as a colored field. The albedo decoder $D_A$ returns an RGB color for any spatial location $x \in \mathbb{R}^3$. One approach for the colored field is to naïvely use a single albedo latent representation $f_A$ to represent a colored shape, i.e., $D_A(f_A, x)$. However, this puts a redundant burden on $f_A$ to encode the object geometry, e.g., the position of the tires and body of a car. Hence, we propose to take the shape latent vector $f_S$ as an additional input to the albedo decoder, $D_A(f_A, f_S, x)$ (Fig. 5.2.b):

$D_A : \mathbb{R}^3 \times \mathbb{R}^{d_T} \times \mathbb{R}^{d_S} \to \mathbb{R}^3$.  (5.2)

For simplicity, we will omit $f_S$, $f_A$ in D in later sections.

The albedo decoder has a similar architecture to the shape decoder, with a few differences. The input to the network has an additional vector, the albedo representation $f_A$. The output is applied a Tanh activation. Also, the third layer gives the color fields for four branches $(c_1, c_2, c_3, c_4)$, each with 3 channels. At every spatial location, the final color is $c_k$, where $k = \operatorname{argmax}_i(o_i)$ (Fig. 5.2). One key motivation for integrating shape segmentation into the albedo decoder is that different parts of an object often differ in both shape and texture. The four albedo branches essentially represent the dominant albedo colors of the object, whose learning will in turn encourage the shape decoder to segment parts that differ not only in shape, but also in dominant albedo.
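A minimal PyTorch sketch of this branch design follows, with the max-pooled occupancy and the argmax-selected branch color as described above; the layer widths, slopes, and names are illustrative, not the exact configurations of [33, 34].

```python
import torch
import torch.nn as nn

class ColoredOccupancyDecoder(nn.Module):
    def __init__(self, d_s=128, d_a=128, hidden=256, branches=4):
        super().__init__()
        # Shape decoder: (x, f_S) -> per-branch occupancy (Eqn. 5.1).
        self.shape_net = nn.Sequential(
            nn.Linear(3 + d_s, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, branches), nn.Sigmoid())
        # Albedo decoder: (x, f_S, f_A) -> per-branch RGB (Eqn. 5.2).
        self.albedo_net = nn.Sequential(
            nn.Linear(3 + d_s + d_a, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, branches * 3), nn.Tanh())

    def forward(self, x, f_s, f_a):
        # x: (N, 3) query points; f_s, f_a: (N, d) latent codes.
        o_branches = self.shape_net(torch.cat([x, f_s], dim=-1))      # (N, 4)
        c_branches = self.albedo_net(
            torch.cat([x, f_s, f_a], dim=-1)).view(len(x), -1, 3)     # (N, 4, 3)
        occupancy, k = o_branches.max(dim=-1)         # max pool over branches
        albedo = c_branches[torch.arange(len(x)), k]  # color of argmax branch
        return occupancy, albedo, k                   # k doubles as part label
```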
5.1.2 Physics-Based Rendering

To render an object image from the shape and albedo, represented by latent vectors $f_S$, $f_A$, as well as lighting L and projection matrix P, we first find a set of W x H surface points, one corresponding to each pixel. Then the RGB color of each pixel is computed via a lighting model using the lighting parameters L and the decoder outputs.

Camera model. We assume a full perspective camera model. Any spatial point x in the 3D world space can be projected into 2D by a multiplication between a projection matrix P and its homogeneous coordinate representation,

$u = P [x, 1]^T$,  (5.3)

where P is a 3 x 4 full perspective projection matrix. Essentially, P can be extended to its 4 x 4 version with zero translation in the z-direction. With an abuse of notation in homogeneous coordinates, the relation between a 3D point x and its camera space projection u can be written as:

$u = P x$, and $x = P^{-1} u$.  (5.4)

Figure 5.3: Ray tracing for surface point detection. In Linear search, candidates (red points) are uniformly distributed in the grid. In Linear-Binary search, after the first point inside the object is found, Binary search is used between the last outside point and the current inside point for all remaining iterations.

Surface point detection. To render a 2D image, for each ray from the camera to a pixel j = (u, v), we select one "surface point". Here, a surface point is defined as the first interior point ($D_S(x) > \tau$), or the exterior point with the largest $D_S(x)$ in case the ray does not hit the object. For efficient network training, instead of finding exact surface points, we approximate them using Linear search or Linear-Binary search (Fig. 5.3).

Intuitively, with a distance margin error of ε, in Linear search, from an initial location at the object boundary, we evaluate $D_S(x)$ for all spatial point candidates x with a step size of ε. In Linear-Binary search, after the first interior point is found, as $D_S(x)$ is a continuous function, a Binary search can be used to better approximate the surface point. For better parallelization, the number of points evaluated on each ray is the same. In this case, Linear-Binary search does not result in a speedup, but leads to a better approximation of surface points, hence better render quality.

Image formation. We assume distant low-frequency illumination and a purely Lambertian surface reflectance. Hence the incoming radiance can be approximated via Spherical Harmonics (SH) basis functions $H_b : \mathbb{R}^3 \to \mathbb{R}$, controlled by coefficients L. At pixel j with corresponding surface point $x_j$, the image color value is computed as the product of albedo A and shading C:

$I_j = A_j \cdot C_j = A_j \cdot \sum_{b=1}^{B^2} \gamma_b H_b(n_j)$  (5.5)
$\;\;\;\;= D_A(x_j) \cdot \sum_{b=1}^{B^2} \gamma_b H_b\Big( \sigma\Big( \frac{d\, D_S(x_j)}{d x_j} \Big) \Big)$,  (5.6)

where $n_j = \sigma\big( \frac{d\, D_S(x_j)}{d x_j} \big)$ is the $L_2$-normalized surface normal at $x_j$, and $\sigma(\cdot)$ is a vector normalization function. We use B = 3 SH bands, which leads to $B^2 = 9$ coefficients in L for each of the three color channels.

5.1.3 Model Learning

Our model is designed to learn from real-world 2D images. However, in addition, we also need to learn a shape prior from 3D CAD models, due to the inherent ambiguity in inverse problems. We first describe learning from 2D images, and then learning from CAD models.

5.1.3.1 Unsupervised Joint Modeling and Fitting

Given a set of 2D images, without corresponding ground truth 3D shapes, we define the loss function as:

$L = L_{img} + \lambda_{sil} L_{sil} + \lambda_{fea\text{-}const} L_{fea\text{-}const} + \lambda_{reg} L_{reg}$,  (5.7)

where $L_{img}$ is the photometric loss, $L_{sil}$ enforces consistency between the predicted silhouette and the ground truth silhouette, $L_{fea\text{-}const}$ is the local feature consistency loss, and $L_{reg}$ consists of different regularization terms.

Silhouette Consistency Loss. Given the object's silhouette mask M for each image, obtained by an off-the-shelf segmentation method [21], the silhouette consistency loss is:

$L_{sil} = \frac{1}{WH} \sum_{j=1}^{WH} L\big( D_S(f_S, x_j),\, o_j \big)$  (5.8)
$\;\;\;\;= \frac{1}{WH} \sum_{j=1}^{WH} L\big( D_S(E_S,\, E_P^{-1} u_j),\, o_j \big)$,  (5.9)

where $E_S$ and $E_P$ denote the encoder's shape and projection outputs for the input image. With the occupancy field, the target value $o_j$ is defined as $o_j = 0.5$ if $M_j = 1$, otherwise $o_j = 0$.
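The rendering and the losses above all consume one surface point per ray. A minimal sketch of the linear-binary search from Fig. 5.3, for a single ray, assuming a callable occupancy field `D_S` and a fixed per-ray evaluation budget as described in the text; the step counts and the 0.5 iso-surface threshold are illustrative.

```python
import torch

def find_surface_point(D_S, origin, direction, t_max,
                       n_linear=32, n_binary=8, tau=0.5):
    # Linear phase: march along the ray with uniform steps until the first
    # point with occupancy above the iso-surface threshold tau is found.
    ts = torch.linspace(0.0, t_max, n_linear)
    pts = origin + ts[:, None] * direction          # (n_linear, 3) candidates
    occ = D_S(pts)                                  # (n_linear,) occupancies
    inside = (occ > tau).nonzero()
    if len(inside) == 0:
        # Ray misses the object: return the point with the largest occupancy.
        return pts[occ.argmax()]
    idx = int(inside[0, 0])
    hi, lo = ts[idx], ts[max(idx - 1, 0)]           # first inside, last outside
    # Binary phase: refine between the last outside and first inside points.
    for _ in range(n_binary):
        mid = 0.5 * (lo + hi)
        if D_S(origin + mid * direction) > tau:
            hi = mid
        else:
            lo = mid
    return origin + hi * direction
```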
Here, we also analyze how our silhouette loss differs from prior work. If the 3D shape is represented as a mesh, there is no gradient when comparing two binary masks, unless the predicted silhouette is expensively approximated as in Soft Rasterizer [89]. If the shape is represented by a voxel, the loss can provide a gradient to adjust voxel occupancy predictions, but not the object orientation [164]. Our loss can update both the occupancy field and the camera projection estimation (Eqn. 5.9).

Photometric Loss. To enforce similarity between our reconstruction and the input, we use an $L_1$ loss on the foreground:

$L_{img} = \frac{1}{|M|} \big\| (\hat{I} - I) \odot M \big\|_1$.  (5.10)

To the best of our knowledge, this is the first work on generic 3D object modeling that can fully exploit the RGB color information to supervise the shape learning, rather than just silhouette guidance [89]. This is only possible due to two designs of our approach: 1) we learn the completed model including albedo; 2) the shape implicit representation (in contrast to voxels) provides accurate, efficient surface normal computation, which allows shading decomposition.

Local Feature Consistency Loss. We propose a novel local feature consistency loss based on the 3D segmentation provided by the shape decoder. We select q boundary points $U_{3D} \in \mathbb{R}^{q \times 3}$ from all pairs of neighboring segments based on the shape decoder branches. Then these 3D points are projected to 2D locations $U_{2D} \in \mathbb{R}^{q \times 2}$ on the image plane using the estimated projection matrix P. Similar to [178], we retrieve features on each feature map using the locations $U_{2D}$ and form the local image features $F \in \mathbb{R}^{q \times 256}$, where 256 is the feature dimension. Finally, we perform PCA to obtain the eigenvector associated with the largest eigenvalue ($v \in \mathbb{R}^{1 \times 256}$), which describes the largest variation among the visual features of the q points. Despite the different colors of two images of the same object category, we assume that this major variation is similar. Thus, we define the local feature consistency loss as:

$L_{fea\text{-}const} = \frac{1}{|B|} \sum_{(i,j) \in B} \| v_i - v_j \|_1$,  (5.11)

where B is the training batch.

Regularization. We define two regularization terms to constrain the learning.

Albedo local constancy: following Retinex theory [82], which assumes albedo to be piecewise constant, we enforce gradient sparsity in two directions, similar to [144]:

$L_{alb\text{-}const} = \sum_{t \in \mathcal{N}_j} w(j, t)\, \big\| A_j - A_t \big\|_2^p$,  (5.12)

where $\mathcal{N}_j$ represents pixel j's set of 4 neighboring pixels. With the assumption that pixels with the same chromaticity (i.e., $c_j = I_j / |I_j|$) are more likely to have the same albedo, we set the constant weight $w(j, t) = \exp\big(-\alpha \| c_j - c_t \|\big)$, where the color is referenced from the input image. In our experiments we set α = 15 and p = 0.8, as in [107].

Batch-wise White Shading: due to the ambiguity in the magnitude of lighting, and therefore the intensity of shading, it is necessary to incorporate constraints on the shading magnitude to prevent the network from generating arbitrarily bright/dark shading. To handle these ambiguities, we use a Batch-wise White Shading [144] constraint:

$L_{bws} = \Big\| \frac{1}{m} \sum_{j=1}^{m} C^{(r)}_{s,j} - c \Big\|_1$,  (5.13)

where $C^{(r)}_{s,j}$ is the red-channel diffuse shading of pixel j, and m is the number of foreground pixels in a training batch. c is a constant for the target average shading, which is set to 1. The same constraint is applied to the blue and green channels.

5.1.3.2 Supervised Prior Learning with Synthetic Images

The CAD models help to learn the shape prior and provide supervision in training.

Learning Shape and Albedo Decoders. To learn the shape and albedo models (decoders), we adopt the widely used technique of training encoder-decoder networks [34, 50]. Here the input to the encoder is a colored voxel, and the encoder $E'$ is a 3D CNN. Voxels are picked over 2D images as they contain all shape information, which better eliminates ambiguity for the encoding process.

Given a dataset of N models, each model can be represented as a colored 3D occupancy voxel V. Equivalently, each model can also be represented with K spatial points $x \in \mathbb{R}^3$ and their occupancy labels $o \in [0, 1]$ and albedo c. The model learning objective is written as:
Givenadatasetof N models,eachofwhichcanberepresentedasacolored3Doccupancy voxel V .Equivalently,eachmodelcanalsoberepresentedwith K spatialpoints x 2 R 3 andits occupancylabel o 2 [ 0 ; 1 ] andalbedo c .Thismodellearningobjectiveiswrittenas: 73 argmin D S ; D A ; E 0 N å i = 1 K å j = 1 L ( D S ( E 0 S ( V i ) ; x j ) ; o j )+ L ( D A ( E 0 S ( V i ) ; E 0 A ( V i ) ; x j ) ; c j ) : (5.14) Theloss L (softmaxcross-entropyor L p )penalizesdeviationofthenetworkpredictionfromthe actualvalue o j , c j . Wealsoadoptprogressivetrainingtechniques[34],totrainourmodelongraduallyincreas- ingresolutiondata.Sincethemodelstructuredoesn'tchangewhenswitchingtrainingdataof differentresolutions,thushigher-resolutionmodelscanbetrainedwithpre-trainedweightson low-resolutiondata.Progressivetrainingstabilizesandspeedsupthetraining. LearningImageEncoder. ForeachCADmodel,werendermultipleimagesofthesameobject withdifferentposesandlightingconditions.Hereeachtrainingsampleisatripletofvoxel,2D imageanditscorrespondinggroundtruthprojectionmatrix ( V ; I ; e P ) .Theycanbeusedasan additionalsupervisionforourencoderanddecoders. L S = E S ( I ) E 0 S ( V ) 2 2 ; (5.15) L A = E A ( I ) E 0 A ( V ) 2 2 ; (5.16) L P = E P ( I ) e P 2 2 ; (5.17) Thegroundtruthlatentrepresentationsareobtainedfromthegroundtruthvoxel( E 0 ( V ) ). 5.1.4ImplementationDetails 5.1.4.1Modeltraining Thefullmodelistrainedinthreestages.First,theshapeandalbedodecoderistrainedwithcolored voxeldata.Thentheencoderistrainedwith2Dsyntheticimagesasinputs.Bothsupervisedand 74 unsupervisedlossesareusedinthisstage.Finally,themodelmodule(encoderandalbedo decoder)canbeusingrealimageswithunsupervisedlosses.Weempiricallyfoundthat, therealimagestraininghasincrementalontheshapedecoder.Butit improvesthegeneralizationabilityofourencoderonmodeltorealimages.Hence,we decidetotheweightoftheshapedecoderafterthestage.TheencoderisaResNet- 18,whiledecodersare3layersMLPs[33].Weightsareinitializedfromanormaldistributionwith astandarddeviationof0 : 02.Adamoptimizerisusedwithalearningrateof0 : 0001inallstages. 5.1.5NetworkStructure ColoredVoxelEncoder. Tolearntheshapeandalbedomodels(prior)simultaneously,ourvoxel encoderrequirescoloredvoxelsasinput.WeobtaincolorvoxelizationfortheShapeNet3Dmesh modelsbythework[29].Fig.5.4showstwoexamplesofcolorvoxelization.Thevoxelencoder architecture(Table5.1)is3DCNN,whichisadoptedfrom[34,33]. Figure5.4: ColorvoxelizationofShapeNetmodels.Original3Dmesh(left)and64 3 coloredvoxel(right). ShapeandAlbedoDecoders. Theshapedecoderarchitectureisfollowedtheworkof[33] (unsupervisedcase).Thenetworktakesshapelatentrepresentation f S andaspatialpoint ( x ; y ; z ) asinputs.Itiscomposedof3fullyconnectedlayerseachofwhichareappliedwithLeakyReLU, exceptthealoutputisapplied Sigmoid activation(Fig.5.5).Thealbedodecoderarchitecture 75 Table5.1: Coloredvoxelencodernetworkstructure. LayerKernelsizeStride Activation function Outputsize (d1,d2,d3,C) input--- ( 64 ; 64 ; 64 ; 3 ) conv3d ( 4 ; 4 ; 4 )( 2 ; 2 ; 2 ) LReLU ( 32 ; 32 ; 32 ; 32 ) conv3d ( 4 ; 4 ; 4 )( 2 ; 2 ; 2 ) LReLU ( 16 ; 16 ; 16 ; 64 ) conv3d ( 4 ; 4 ; 4 )( 2 ; 2 ; 2 ) LReLU ( 8 ; 8 ; 8 ; 128 ) conv3d ( 4 ; 4 ; 4 )( 2 ; 2 ; 2 ) LReLU ( 4 ; 4 ; 4 ; 256 ) conv3d ( 4 ; 4 ; 4 )( 1 ; 1 ; 1 ) Sigmoid ( 1 ; 1 ; 1 ; 256 ) f A ---128 f S ---128 issimilar,withonlytwodifferences.Theinputtothenetworkhasanadditionalvector,albedo latentrepresentation f A .Theoutputisapplied Tanh activation.Fig.5.6depictsthealbedodecoder architecture. 
Figure 5.5: The shape decoder network is composed of 3 fully connected layers, denoted as "FC". The shape latent vector (128-dim) is concatenated, denoted "+", with the xyz query, making a 131-dim vector, which is provided as input to the first layer. The LeakyReLU activation is applied to the first 2 FC layers, while the final value is obtained with a Sigmoid activation, denoted as "Sig.".

Local Feature Extraction. We select q boundary points $U_{3D} \in \mathbb{R}^{q \times 3}$ from all pairs of neighboring segments based on the shape decoder branches. Then these 3D points are projected to 2D locations $U_{2D} \in \mathbb{R}^{q \times 2}$ on the image plane using the estimated projection matrix P. Fig. 5.7 shows one example of the selected visible points. We set q = 50 in our experiments.

The image encoder is a modified ResNet-18. Table 5.2 illustrates the detailed network architecture. Given the 3D points $U_{3D}$, we identify the projected locations $U_{2D}$ on the feature map layers of the encoder. Here, we concatenate features from the outputs of conv1, conv2 and conv3 (see
Pascal3D + augments12rigidcategoriesofPascalVOC2012[44]with3Dannotations.Weselect thesame5categories(plane,car,chair,couchandtable)withoursyntheticdata.Thetraining subsetoffromPascal3D + imagesweconsideredafteroccludedinstances,whichwould affecttheimagedecompositiontrainingprocess. Metrics. Weadoptthestandard3Dreconstructionmetric:IoUandChamferDistance(CD)[108] forevaluation.Tocomparewithmethodsthatoutputpointclouds,weusemarchingcubesto obtainmeshesfrom256 3 -voxelizedmodels.ForIoU,largerisbetter.ForCD,smallerisbetter. 79 Table5.3: Effectoflosstermsonposeandreconstructionestimation. AzimuthangleerrorReconstructionerror(CD) w/o L sil 18 : 51 0 : 136 w/o L fea-const 15 : 02 0 : 124 w/o L reg 13 : 01 0 : 131 Fullmodel12 : 20 0 : 116 5.2.2AblationStudy EffectofUnsupervisedTraining. Bymodelingthecompletedshapeandestimatingimagefor- mationparameters,ourmethodcanleveragein-the-wildimageswithoutannotationsofitsground truthshapeviaunsupervisedlosses.Herewedemonstratetheofaddingrealimagesinto trainingtoimproveourmodelabilityonrealimages.Fig.5.9showsvisualreconstructions onimagesfromPix3DandPascal3D + datasetsofourmodelatdifferentstageoftraining:amodel trainedwithsyntheticdataonlyandamodeltrainedwithadditionalrealimages. EffectofLossTerms. Wecompareourfullmodelwithitspartialvariants,withoutsilhouette consistencyloss,localfeatureconsistencyloss,oralbedoregularizationloss.Weconductexperi- mentsonPascal3D+database(carcategory)andevaluatetheposeestimationandreconstruction. Table5.3showsquantitativecomparisonofthesefourmodels.Asthesilhouetteprovidesstrong constraintsonglobalshapeandpose,withoutsilhouetteloss,theperformanceonbothmetricsare severelyimpaired.Theregularizationhelpstodisentangleshadingfromalbedo,whichleadsto bettersurfacenormal,thusbettershapeandposeThelocalfeatureconsistencylosshelps tothemodelwhichimprovestheposeandshapeestimation.Theseresults demonstratethatallthelosscomponentspresentedinthisworkcontributetotheperformance. 80 Table5.4: Segmentationandshaperepresentationcomparisons(IoU/CD)onShapeNetpart[181].IoU isutilizedtomeasureforsegmentationagainstground-truthparts.CDisusedforshaperepresentation evaluation.Chair*istrainingonchair+tablejointset. Shape(#parts)airplane(3)chair(3)chair*(4)table(2) BAE-Net80 : 4 = 0 : 1986 : 6 = 0 : 2783 : 7 = 87 : 0 = 0 : 30 Proposed83 : 0 = 0 : 1687 : 4 = 0 : 2384 : 1 = 0 : 2888 : 2 = 0 : 25 5.2.3UnsupervisedSegmentation Asmodelingshape,albedoandco-segmentationareclosely-relatedtasks[188],jointlymodel- ingthemallowsustoexploittheircorrelation.Followingthesametrainingandtestingsetting with[33],weevaluateourmodel'sco-segmentationandshaperepresentationpoweronthecat- egoryofairplane,chairandtable.AsinTable5.4,ourmodelachievesahighersegmentation accuracy,comparingwithBAE-NET[33].Further,wecomparethepoweroftwomethodsin representing3Dshapes.Byfeedingaground-truthvoxelshapefromthetestingsettothevoxel encoderandshapedecoder,wecanestimatetheshapeparameterwhosedecodedshapematchesthe ground-truthCADmodel.ThelowerCD,aswellashigherIoU,inTable5.4showthatthenovel designofourshapeandalbedodecodersimprovesboththesegmentationandreconstruction. 
We show additional unsupervised segmentation results of our 5 categories on the ShapeNet Part dataset in Fig. 5.10. We assign a color to the output of each branch of our shape decoder, and reasonable parts are obtained. Since our segmentation is unsupervised and the model for each category is trained separately, our results are not guaranteed to produce the same part counts for all categories. Fig. 5.11 shows the estimations of albedo colors of 4 branches. The four albedo branches do represent the dominant albedo colors of the objects.

Figure 5.10: Unsupervised segmentation results on the ShapeNet Part dataset. We render the original meshes with different colors representing different parts.

Figure 5.11: Visualization of albedo branch outputs for our 5 categories. We render the albedo with the reconstructed mesh.

Table 5.5: Quantitative comparison of single-view 3D reconstruction on synthetic images of ShapeNet.

           Chamfer Distance                                            IoU
Category   3D-R2N2  PSG    Pix2Mesh  AtlasNet  IM-SVR  Proposed       3D-R2N2  PSG    Pix2Mesh  AtlasNet  IM-SVR  Proposed
airplane   0.227    0.137  0.187     0.104     0.137   0.110          0.426    0.515  0.392     -         0.554   0.577
car        0.213    0.169  0.180     0.141     0.123   0.092          0.661    0.501  0.220     -         0.745   0.773
chair      0.270    0.247  0.265     0.209     0.199   0.155          0.439    0.402  0.257     -         0.522   0.546
couch      0.229    0.224  0.212     0.177     0.181   0.178          0.626    0.600  0.279     -         0.641   0.651
table      0.239    0.222  0.218     0.190     0.173   0.164          0.420    0.312  0.233     -         0.450   0.479
Mean       0.278    0.188  0.216     0.175     0.187   0.165          0.493    0.473  0.300     -         0.546   0.567

5.2.4 3D Image Decomposition

We further provide several 3D image decomposition results on real-world images in Fig. 5.12. Since our network produces a full 3D shape, we can change the reconstruction or any single component to a different viewpoint.

5.2.5 Single-view 3D Reconstruction

5.2.5.1 Reconstruction on synthetic images

Monocular 3D reconstruction performance is first evaluated on synthetic images. We compare our model against multiple state-of-the-art baselines that leverage various 3D representations: 3D-R2N2 [35] (voxel), Point Set Generation (PSG) [45] (point cloud), Pixel2Mesh [147], AtlasNet [54] (mesh), and IM-SVR [34] (implicit field). For our model, we employ both supervised and unsupervised losses.

In general, our model is able to predict 3D shapes that closely resemble the ground truth shapes (Fig. 5.13.a). Our approach outperforms the other methods in most categories and achieves the best mean score (both IoU and CD, Tab. 5.5). While using the same shape representation as us, IM-SVR [34] only learns to reconstruct the 3D shape by minimizing the difference between its latent representation and ground-truth latent vectors. By modeling albedo, our model benefits from learning with both supervised and unsupervised (photometric, silhouette) losses. This results in better performance in both quantitative and qualitative comparisons.

Figure 5.12: 3D image decomposition on real-world images. Our work decomposes a 2D image of generic objects into albedo, completed 3D shape and illumination.

Figure 5.13: Qualitative comparison for single-view 3D reconstruction on the ShapeNet, Pascal 3D+, and Pix3D datasets.
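Among the unsupervised losses, the silhouette term admits a compact illustration. The soft-IoU form below is one plausible instantiation, assuming a differentiably rendered silhouette; the exact form of our $L_{sil}$ may differ.

```python
# A sketch of one plausible silhouette consistency term (a soft IoU between a
# differentiably rendered silhouette and the object mask); the thesis' exact
# L_sil may differ in form.
import torch

def silhouette_loss(rendered_sil, gt_mask, eps=1e-6):
    """rendered_sil: (B, H, W) soft silhouette in [0, 1]; gt_mask: (B, H, W)
    binary mask. Returns 1 - soft IoU, averaged over the batch."""
    inter = (rendered_sil * gt_mask).sum(dim=(1, 2))
    union = (rendered_sil + gt_mask - rendered_sil * gt_mask).sum(dim=(1, 2))
    return (1.0 - inter / (union + eps)).mean()

sil = torch.sigmoid(torch.randn(4, 128, 128))        # hypothetical render
mask = (torch.rand(4, 128, 128) > 0.5).float()       # hypothetical mask
print(silhouette_loss(sil, mask))
```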
5.2.5.2 Reconstruction on real images

We also evaluate our approach for reconstruction on two real image databases, Pascal 3D+ [177] and Pix3D [147]. Our model is fine-tuned on real images from the Pascal 3D+ train subset without access to ground truth 3D shapes. Since most reconstruction methods can only infer shapes for synthetic images, here we compare the proposed method with the state-of-the-art methods which can work for real-world images, including 3D-R2N2 [35], differentiable ray consistency (DRC) [164], ShapeHD [174] and DAREC [123]. Again, our work is the first one that can fully leverage real images to learn the model in an unsupervised fashion. For the Pascal 3D+ evaluation, we use the val subset of the 5 categories. For Pix3D, we use 3 categories (chair, couch and table), which overlap with our 5 real categories.

Figure 5.14: Qualitative comparison for single-view 3D reconstruction on real images from Pascal 3D+ (left) and Pix3D (right).

As shown in Fig. 5.14, our model infers reasonable shapes even in challenging conditions. Quantitatively, Table 5.6 suggests that the proposed method performs better than the other methods on the Pascal 3D+ database. As Pascal 3D+ only has 10 CAD models for each object category as ground truth 3D shapes, the ground truth labels and the scores can be inaccurate, failing to reflect the shape details. We therefore conduct an experiment on the more precise 3D annotation database Pix3D. As shown in Table 5.7, our model also has the lowest Chamfer Distance and the best visual quality, as in Fig. 5.14, compared to the baselines.

To provide more comprehensive comparisons on 3D reconstruction quality, we provide more reconstruction results on the Pascal 3D+ [177] (Fig. 5.15) and Pix3D [147] datasets (Fig. 5.16). Comparisons are made with ShapeHD [174] and AtlasNet [54] using pre-trained models provided by the authors.

Table 5.6: Real image 3D reconstruction on PASCAL 3D+ with CD.

Category   3D-R2N2   DRC     ShapeHD   DAREC   Proposed
plane      0.305     0.112   0.094     0.108   0.102
car        0.305     0.099   0.129     0.101   0.113
chair      0.238     0.158   0.137     0.135   0.119
couch      0.347     0.169   0.176     -       0.148
table      0.321     0.162   0.153     -       0.127
Mean       0.303     0.140   0.138     -       0.122

Table 5.7: Real image 3D reconstruction on Pix3D with CD.

Category   3D-R2N2   DRC     ShapeHD   DAREC   Proposed
chair      0.239     0.160   0.123     0.112   0.091
couch      0.307     0.178   0.137     -       0.114
table      0.289     0.163   0.133     -       0.127
Mean       0.278     0.167   0.131     -       0.110

5.3 Conclusions

With the objective of 3D modeling from real-world 2D images, this chapter presents a semi-supervised learning approach that jointly learns the fitting algorithm and the models. Since our approach offers completed albedo and 3D shape models, as well as intrinsic decomposition from images, we are able to effectively leverage real images in the training. As a result, we observe substantial improvement in the quality of 3D reconstruction from a single image. In essence, our proposed method is applicable to 3D modeling and reconstruction for any object category if both i) an in-the-wild 2D image collection and ii) CAD models of the object are available. We are interested in applying this method to a wide variety of object categories and building a "zoo" of 3D models.

Figure 5.15: Additional 3D reconstruction results on the Pascal 3D+ [177] dataset.

Figure 5.16: Additional 3D reconstruction results on Pix3D [147]. For each input image, we show reconstructions by ShapeHD [174], and the ground truth. Our reconstructions resemble the ground truth.
Chapter 6

Conclusions and Future Work

Reconstructing faces or generic objects from a single photograph is extremely challenging due to the ambiguity in the image formation process. Reconstruction quality is highly dependent on the expressiveness of the underlying model. Given the limited amount of annotated 3D data, throughout this thesis, I have presented an approach to learn and improve 3D models' representation power as well as fitting ability by using large collections of 2D in-the-wild images. Even while achieving state-of-the-art performance, the current model still has limitations.

Lighting model

The Lambertian lighting model, which is used in this thesis, is known to be a poor approximation for the complex reflectance properties of facial skin or generic objects. When humans sweat, the skin clearly exhibits specular reflection, particularly on the nose and forehead. The specular reflection is even more obvious on other objects like cars. A more complex lighting assumption is necessary to accurately handle these scenarios.

I believe better modeling of the lighting is critical for unsupervised/weakly supervised approaches, as using a simplified approximation of a real rendering process prevents the model from learning the true shape or albedo, since these truthful elements could lead to a higher loss value under a poor approximation of the lighting model. In computer graphics, extremely complex, physically-valid lighting models have been developed for materials of relevance to faces, for example for skin [79] and hair [100]. However, these methods have proven to be too complex and too computationally expensive to integrate into 3DMM pipelines.

Feedback Mechanism

Different from classification tasks, where small changes in predicted probability can be tolerated as long as the final (class ranking) results aren't changed, our models do regression on pose and shape/albedo parameters. High precision estimation is usually required. Currently, across all chapters, we use a single encoder to estimate parameters from the input image. With multiple down-sampling operations in the network structure, maintaining fine information of the face, including precise landmark locations and small facial structures, could be challenging. As a result, the estimated shape and pose could be off from the ground-truth values.

Besides, in our tasks, visualizing our current estimations, in the form of reconstructed images, gives us the luxury of comparing our estimation to the original input. Studying the discrepancy between the reconstruction and the input image could provide a form of feedback signal that we can use to further refine the current estimation. Hence, one interesting idea that we could explore is to learn a second encoder that takes both the original input and our rendered image as inputs and tries to produce parameter residuals to our initial predicted parameters.
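As a purely speculative illustration of this feedback idea, the sketch below shows what such a residual-producing second encoder might look like; all layer sizes, names, and the render() placeholder are hypothetical, not an implemented component of this thesis.

```python
# A speculative sketch of the feedback idea: a second encoder takes the input
# image and the current rendering, and regresses parameter residuals. All
# sizes, names, and the render() placeholder are hypothetical.
import torch
import torch.nn as nn

class RefinementEncoder(nn.Module):
    def __init__(self, param_dim=128):
        super().__init__()
        self.net = nn.Sequential(            # input: image + rendering, 6 ch
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, param_dim))        # residual over the parameters

    def forward(self, image, rendering, params):
        x = torch.cat([image, rendering], dim=1)  # expose the discrepancy
        return params + self.net(x)               # refined estimate

# One refinement step, with render() standing in for the rendering layer:
#   params = encoder(image)
#   params = refiner(image, render(params), params)
```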
APPENDIX

Representation Learning GAN for Pose-invariant Face Recognition (DR-GAN)

While the other chapters in this thesis look at image formation/synthesis in a model-driven approach, there are other approaches that can learn to manipulate images without using any 3D models. In this appendix, I would like to introduce one of our works in that direction, with an application on face synthesis and face recognition.

A1 Introduction

Face recognition is one of the most widely studied topics in computer vision due to its wide application in law enforcement, biometrics, marketing, etc. Recently, great progress has been achieved in face recognition with deep learning-based methods [149, 119, 138]. For example, surpassing human performance is reported by Schroff et al. [138] on the Labeled Faces in the Wild (LFW) database. However, one of the shortcomings of the LFW database is that it does not offer a high degree of pose variation, the variation that has been shown to be a major challenge in face recognition. Up to now, the key ability of Pose-Invariant Face Recognition (PIFR) desired by real-world applications is far from solved [92, 93, 25, 4, 41]. A recent study [140] observes a significant drop, over 10%, in performance of most algorithms from frontal-frontal to frontal-profile face verification, while human performance only degrades slightly. This indicates that pose variation remains a challenge in face recognition and warrants future study.

In PIFR, the facial appearance change caused by pose variation often surpasses the intrinsic appearance differences between individuals. To overcome these challenges, a wide variety of approaches have been proposed, which can be grouped into two categories. First, some work employs face frontalization on the input image to synthesize a frontal-view face, where traditional face recognition algorithms are applicable [58, 194], or an identity representation can be obtained via modeling the face frontalization/rotation process [72, 197, 182]. The ability to generate a realistic identity-preserved frontal face is also beneficial for law enforcement practitioners to identify suspects. Second, other work focuses on learning discriminative representations directly from the non-frontal faces through either one joint model [119, 138] or multiple pose-specific models [101, 40]. In contrast, we propose a novel framework to take the best of both worlds: simultaneously learn a pose-invariant identity representation and synthesize faces with arbitrary poses, where face rotation is both a facilitator and a by-product for representation learning.

This chapter is adapted from the following publications:
[1] Luan Tran, Xi Yin, and Xiaoming Liu, "Disentangled Representation Learning GAN for Pose-Invariant Face Recognition," in CVPR, 2017.
[2] Luan Tran, Xi Yin, and Xiaoming Liu, "Representation Learning by Rotating Your Faces," in TPAMI, 2019.

Figure A1: Given one or multiple in-the-wild face images as the input, DR-GAN can produce a unified identity representation, by virtually rotating the face to arbitrary poses. The learnt representation is both discriminative and generative, i.e., the representation is able to demonstrate superior PIFR performance, and synthesize identity-preserved faces at target poses specified by the pose code.
As shown in Fig. A1, we propose the Disentangled Representation learning-Generative Adversarial Network (DR-GAN) for PIFR. Generative Adversarial Networks (GANs) [51] can generate samples following a data distribution through a two-player game between a generator $G$ and a discriminator $D$. Despite many recent promising developments [109, 39, 124, 30, 11], image synthesis remains the main objective of GANs. To the best of our knowledge, this is the first work that utilizes the generator in GAN for representation learning. To achieve this, we construct $G$ with an encoder-decoder structure (Fig. A2(d)) to learn a disentangled representation for PIFR. The input to the encoder $G_{enc}$ is a face image of any pose, the output of the decoder $G_{dec}$ is a synthetic face at a target pose, and the learnt representation bridges $G_{enc}$ and $G_{dec}$. While $G$ serves as a face rotator, $D$ is trained to not only distinguish real vs. synthetic (or fake) images, but also predict the identity and pose of a face. With the additional classifications, $D$ strives for the rotated face to have the same identity as the input real face, which has two effects on $G$: 1) the rotated face looks more like the input subject in terms of identity; 2) the learnt representation is more inclusive or generative for synthesizing an identity-preserved face.

In conventional GANs, $G$ takes a random noise vector to synthesize an image. In contrast, our $G$ takes a face image, a pose code $c$, and a random noise vector $z$ as the input, with the objective of generating a face of the same identity with the target pose that can fool $D$. Specifically, $G_{enc}$ learns a mapping from the input image to a feature representation. The representation is then concatenated with the pose code and the noise vector to feed to $G_{dec}$ for face rotation. The noise models facial appearance variations other than identity or pose. Note that it is a crucial architecture design to concatenate one representation with varying randomly generated pose codes and noise vectors. This enables DR-GAN to learn a disentangled identity representation that is exclusive or invariant to pose and other variations, which is the holy grail for PIFR when achievable.

Most existing face recognition algorithms only take one image for testing. In practice, there are many scenarios when an image collection of the same individual is available [75]. In this case, prior work fuses results either at the feature level [27] or the distance-metric level [167, 103]. Differently, our fusion is conducted within a unified framework. Given multiple images as the input, $G_{enc}$ operates on each image, and produces an identity representation and a coefficient, which is an indicator of the quality of that input image. Using the dynamically learned coefficients, the representations of all input images are linearly combined as one representation. During testing, $G_{enc}$ takes any number of images and generates a single identity representation, which is used by $G_{dec}$ for face synthesis along with the pose code.

Our generator is essential to both representation learning and image synthesis. We propose two techniques to further improve $G_{enc}$ and $G_{dec}$ respectively. First, we have observed that our $G_{enc}$ can always outperform $D$ in representation learning for PIFR. Therefore, we propose to replace the identity classification part of $D$ with the latest $G_{enc}$ during training so that a superior $D$ can push $G_{enc}$ to further improve itself. Second, since our $G_{dec}$ learns a mapping from the feature space to the image space, we propose to improve the learning of $G_{dec}$ by regularizing the average of two representations from different subjects to be a valid face, assuming a convex space of face identities. These two techniques are shown to be effective in improving the generalization ability of DR-GAN.

In summary, this paper makes the following contributions.
• We propose DR-GAN via an encoder-decoder structured generator that can frontalize or rotate a face with an arbitrary pose, even the extreme profile.

• Our learnt representation is explicitly disentangled from the pose variation via the pose code in the generator and the pose estimation in the discriminator. Similar disentanglement is conducted for other variations, e.g., illumination.

• We propose a novel scheme to adaptively fuse multiple faces to a single representation based on the learnt coefficients, which empirically show to be a good indicator of the face image quality.

• We achieve state-of-the-art face frontalization and face recognition performance on multiple benchmark datasets, including Multi-PIE [52], CFP [140], and IJB-A [75].

A2 Prior Work

Generative Adversarial Network (GAN). Goodfellow et al. [51] introduce GAN to learn generative models via an adversarial process. With a minimax two-player game, the generator and discriminator can both improve themselves. GAN has been used for image synthesis [39, 127], image super-resolution [187], etc. More recent work focuses on incorporating constraints on $z$ or leveraging side information for better synthesis. E.g., Mirza and Osindero [109] feed class labels to both $G$ and $D$ to generate images conditioned on class labels. In [136] and [114], GAN is generalized to learn a discriminative classifier, where $D$ is trained to not only distinguish between real vs. fake, but also classify the images. In InfoGAN [30], $G$ applies information regularization to the optimization by using the additional latent code. In contrast, this paper proposes a novel DR-GAN aiming for face representation learning, which is achieved via modeling the face rotation process. In Sec. A3.4, we will provide an in-depth discussion on our differences to the most relevant work in GANs.

One crucial issue with GANs is the difficulty of quantitative evaluation. Previous work either performs human studies to evaluate the quality of synthetic images [39] or uses the features in the discriminator for image classification [124]. In contrast, we innovatively construct the generator for representation learning, which can be quantitatively evaluated for PIFR.

Face Frontalization. Generating a frontal face from a profile face is very challenging due to self-occlusion. Prior methods in face frontalization can be classified into three categories: 3D-based methods [194, 58, 84], statistical methods [135], and deep learning methods [197, 179, 182, 72, 191]. E.g., Hassner et al. [58] use a mean 3D face model to generate a frontal face for any subject. A personalized face model could be used, but accurate 3D face reconstruction remains a challenge [133, 87, 160, 161]. In [135], a statistical model is used for joint frontalization and landmark localization by solving a constrained low-rank minimization problem. For deep learning methods, Kan et al. [72] propose SPAE to progressively rotate a non-frontal face to a frontal one via auto-encoders. Yang et al. [179] apply the recurrent action unit to a group of hidden units to incrementally rotate faces in fixed yaw angles.

All prior work frontalizes only near-frontal in-the-wild faces [58, 194] or large-pose controlled faces [182, 197]. In contrast, we can synthesize arbitrary-pose faces from a large-pose in-the-wild face. We use the adversarial loss to improve the quality of the synthetic images and identity classification in the discriminator to preserve identity.

Representation Learning. Designing the appropriate objectives for learning a good representation is an open question [10]. The work in [99] is among the first to use an encoder-decoder structure for representation learning, which, however, is not explicitly disentangled. DR-GAN is similar to DC-IGN [80], a variational autoencoder-based method for disentangled representation learning. However, DC-IGN achieves disentanglement by providing batch training samples with one attribute being fixed, which may not be applicable to unstructured in-the-wild data.
Prior work also explores joint representation learning and face rotation for PIFR, where [197, 182] are most relevant to our work. In [197], Multi-View Perceptron is used to untangle the identity and view representations by processing them with different neurons and maximizing the data log-likelihood. Yim et al. [182] use a multi-task CNN to rotate a face with any pose and illumination to a target pose, and the $L_2$ loss-based reconstruction of the input is the second task. Both works focus on image synthesis, and the identity representation is a by-product during the network learning. In contrast, DR-GAN focuses on representation learning, of which face rotation is both a facilitator and a by-product. We differ from [197, 182] in four aspects. First, we explicitly disentangle the identity representation from pose variations by the pose codes. Second, we employ the adversarial loss for high-quality synthesis, which drives better representation learning. Third, none of them applies to in-the-wild faces as we do. Finally, our ability to learn the representation from multiple unconstrained images has not been observed in prior work.

Face Image Quality Estimation. Low image quality is known to be a challenge for vision tasks [95, 32]. Image quality estimation is important for biometric recognition systems [12, 53, 157]. Numerous methods have been proposed to measure the image quality of different biometric modalities including face [1, 3, 116], iris [31, 78], fingerprint [148, 150], and gait [111, 105]. In the scenario of face recognition, an effective algorithm for face image quality estimation can help to either (i) reduce the number of poor images acquired during enrollment, or (ii) improve feature fusion during testing. Both cases can improve the face recognition performance. Abaza et al. [1] evaluate multiple quality factors such as contrast, brightness, sharpness, focus and illumination as a face image quality index for face recognition. However, they did not consider pose variation, which is a major challenge in face recognition. Ozay et al. [116] employ a Bayesian network to model the relationships between quality-related image features and face recognition performance, which is shown to boost the performance. The authors in [171] propose a patch-based face image quality estimation method, which takes into account geometric alignment, pose, sharpness, and shadows.

In this work, we employ quality estimation in a GAN framework that considers all factors of image quality presented in the dataset, with no direct supervision. For each input image, DR-GAN can generate a coefficient that indicates the quality of the input image. The representations from multiple images of the same subject are fused based on the learnt coefficients to generate one representation. We will show that the learnt coefficients are correlated to the image quality, i.e., a measurement of how good it can be used for face recognition.

A3 The Proposed DR-GAN Model

Our proposed DR-GAN has two variations: the basic model can take one image per subject for training, termed single-image DR-GAN, and the extended model can leverage multiple images per subject for both training and testing, termed multi-image DR-GAN. We start by introducing the original GAN, followed by the two DR-GAN variations and the proposed techniques to improve the generalization of our generator. Finally, we will compare our DR-GAN with previous GAN variations in detail.
A3.1 Generative Adversarial Network

A Generative Adversarial Network consists of a generator $G$ and a discriminator $D$ that compete in a two-player minimax game. The discriminator $D$ tries to distinguish between a real image $x$ and a synthetic image $G(z)$. The generator $G$ tries to synthesize realistic-looking images from a random noise vector $z$ that can fool $D$, i.e., $G(z)$ being classified as a real image. Concretely, $D$ and $G$ play the game with the following loss function:

$$\min_G \max_D L_{gan} = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \quad (1)$$

It is proved in [51] that this minimax game has a global optimum when the distribution $p_g$ of the synthetic samples and the distribution $p_d$ of the real samples are the same. Under mild conditions (e.g., $G$ and $D$ have enough capacity), $p_g$ converges to $p_d$. In the beginning of training, the samples generated from $G$ are extremely poor and are rejected by $D$ with high confidence. In practice, it is better for $G$ to maximize $\log(D(G(z)))$ instead of minimizing $\log(1 - D(G(z)))$ [51]. This objective results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning. As a result, $G$ and $D$ are trained to alternately optimize the following objectives:

$$\max_D L_D^{gan} = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \quad (2)$$

$$\max_G L_G^{gan} = \mathbb{E}_{z \sim p_z(z)}[\log(D(G(z)))]. \quad (3)$$

Figure A2: Comparison of previous GAN architectures and our proposed DR-GAN.

A3.2 Single-Image DR-GAN

Our single-image DR-GAN has two distinctive novelties compared to prior GANs. First, it learns an identity representation for a face image by using an encoder-decoder structured generator, where the representation is the encoder's output and the decoder's input. Since the representation is the input to the decoder to synthesize various faces of the same subject, i.e., virtually rotating his/her face, it is a generative representation.

Second, the appearance of a face is determined by not only the identity, but also numerous distractive variations, such as pose, illumination, and expression. Thus, the identity representation learned by the encoder would inevitably include the distractive side variations. E.g., the encoder would generate different identity representations for two faces of the same subject with 0° and 90° yaw angles. To remedy this, in addition to the class labels similar to semi-supervised GAN [136], we employ side information such as pose and illumination to explicitly disentangle these variations, which in turn helps to learn a discriminative representation.

A3.2.1 Problem Formulation

Given a face image $x$ with label $y = \{y^d, y^p\}$, where $y^d$ represents the label for identity and $y^p$ for pose, the objectives of our learning problem are twofold: 1) to learn a pose-invariant identity representation for PIFR, and 2) to synthesize a face image $\hat{x}$ with the same identity $y^d$ but at a different pose specified by a pose code $c$. Our approach is to train a DR-GAN conditioned on the original image $x$ and the pose code $c$, with its architecture illustrated in Fig. A2(d).

Different from the discriminator in a conventional GAN, our $D$ is a multi-task CNN consisting of three components: $D = [D^r, D^d, D^p]$. $D^r \in \mathbb{R}^1$ is for real/fake image classification, and $D^d \in \mathbb{R}^{N_d}$ is for identity classification with $N_d$ as the total number of subjects in the training set.
$D^p \in \mathbb{R}^{N_p}$ is for pose classification with $N_p$ as the total number of discrete poses. Given a face image $x$, $D$ aims to classify it as the real image class, and estimate its identity and pose; while given a synthetic face image from the generator $\hat{x} = G(x, c, z)$, $D$ attempts to classify $\hat{x}$ as fake, using the following objectives:

$$L_D^{gan} = \mathbb{E}_{x,y \sim p_d(x,y)}[\log D^r(x)] + \mathbb{E}_{x,y \sim p_d(x,y),\, z \sim p_z(z),\, c \sim p_c(c)}[\log(1 - D^r(G(x, c, z)))], \quad (4)$$

$$L_D^{id} = \mathbb{E}_{x,y \sim p_d(x,y)}[\log D^d_{y^d}(x)], \quad (5)$$

$$L_D^{pos} = \mathbb{E}_{x,y \sim p_d(x,y)}[\log D^p_{y^p}(x)], \quad (6)$$

where $D^d_i$ and $D^p_i$ are the $i$th elements in $D^d$ and $D^p$. For clarity, we will eliminate all subscripts for expected value notations, as all random variables are sampled from their respective distributions ($x, y \sim p_d(x, y)$, $z \sim p_z(z)$, $c \sim p_c(c)$). The objective for training $D$ is the weighted average of all objectives:

$$\max_D L_D = \lambda_g L_D^{gan} + \lambda_d L_D^{id} + \lambda_p L_D^{pos}, \quad (7)$$

where we set $\lambda_g = \lambda_d = \lambda_p = 1$.

Meanwhile, $G$ consists of an encoder $G_{enc}$ and a decoder $G_{dec}$. $G_{enc}$ aims to learn an identity representation $f(x) = G_{enc}(x)$ from a face image $x$. $G_{dec}$ aims to synthesize a face image $\hat{x} = G_{dec}(f(x), c, z)$ with identity $y^d$ and a target pose specified by $c$, where $z \in \mathbb{R}^{N_z}$ is the noise modeling other variations besides identity or pose. The pose code $c \in \mathbb{R}^{N_p}$ is a one-hot vector with the target pose $y^t$ being 1. The goal of $G$ is to fool $D$ to classify $\hat{x}$ to the identity of input $x$ and the target pose, with the following objectives:

$$L_G^{gan} = \mathbb{E}[\log D^r(G(x, c, z))], \quad (8)$$

$$L_G^{id} = \mathbb{E}[\log D^d_{y^d}(G(x, c, z))], \quad (9)$$

$$L_G^{pos} = \mathbb{E}[\log D^p_{y^t}(G(x, c, z))]. \quad (10)$$

Similarly, the objective for training the generator $G$ is the weighted average of each objective:

$$\max_G L_G = \mu_g L_G^{gan} + \mu_d L_G^{id} + \mu_p L_G^{pos}, \quad (11)$$

where we set $\mu_g = \mu_d = \mu_p = 1$.

$G$ and $D$ improve each other during the alternating training process. With $D$ being more powerful in distinguishing real vs. fake images and classifying poses, $G$ strives for synthesizing an identity-preserved face with the target pose to compete with $D$. We benefit from this process in three aspects. First, the learnt representation $f(x)$ will preserve more discriminative identity information. Second, the pose classification in $D$ guides the pose of the rotated face to be more accurate. Third, with a separate pose code as input to $G_{dec}$, $G_{enc}$ is trained to disentangle the pose variation from $f(x)$, i.e., $f(x)$ should encode as much identity information as possible, but as little pose information as possible. Therefore, $f(x)$ is not only generative for image synthesis, but also discriminative for PIFR.

A3.2.2 Network Structure

The network structure of single-image DR-GAN is adopted from CASIA-Net [180] with batch normalization (BN) for $G_{enc}$ and $D$. Besides, since the stability of the GAN game suffers if sparse gradient layers (MaxPool, ReLU) are used, we replace them with strided convolution and the exponential linear unit (ELU) respectively. $D$ is trained to optimize Eqn. 7 by adding a fully connected layer with the softmax loss for real vs. fake, identity, and pose classification respectively. $G$ includes $G_{enc}$ and $G_{dec}$ that are bridged by the to-be-learned identity representation $f(x) \in \mathbb{R}^{N_f}$, which is the AvgPool output in our $G_{enc}$. $f(x)$ is concatenated with a pose code $c$ and a random noise $z$. A series of fractionally-strided convolutions (FConv) [124] transforms the $(N_f + N_p + N_z)$-dim concatenated vector into a synthetic image $\hat{x} = G(x, c, z)$, which is of the same size as $x$. $G$ is trained to maximize Eqn. 11 when a synthetic face $\hat{x}$ is fed to $D$ and the gradient is back-propagated to update $G$.
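To make the alternating optimization of Eqns. 7 and 11 concrete, below is a condensed PyTorch sketch. The module signatures are assumptions for illustration; minimizing the cross-entropies here corresponds to maximizing the log-likelihood terms in the equations.

```python
# A condensed PyTorch sketch of the alternating updates behind Eqns. 7 and 11.
# D and G are assumed callables: D(x) -> (real/fake logit, identity logits,
# pose logits); G(x, c, z) -> synthetic image. Minimizing the cross-entropies
# below corresponds to maximizing the log-likelihood terms in the equations,
# with all lambdas and mus set to 1 as in the text.
import torch
import torch.nn.functional as F

def d_loss(D, G, x, y_id, y_pose, c, z):
    r_real, id_logits, pose_logits = D(x)
    r_fake, _, _ = D(G(x, c, z).detach())        # no gradient into G
    return (F.binary_cross_entropy_with_logits(r_real, torch.ones_like(r_real))
          + F.binary_cross_entropy_with_logits(r_fake, torch.zeros_like(r_fake))
          + F.cross_entropy(id_logits, y_id)          # L_id^D, Eqn. 5
          + F.cross_entropy(pose_logits, y_pose))     # L_pos^D, Eqn. 6

def g_loss(D, G, x, y_id, y_target, c, z):
    r_fake, id_logits, pose_logits = D(G(x, c, z))
    return (F.binary_cross_entropy_with_logits(r_fake, torch.ones_like(r_fake))
          + F.cross_entropy(id_logits, y_id)          # preserve input identity
          + F.cross_entropy(pose_logits, y_target))   # hit the target pose
```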
Previous work in face rotation uses the $L_2$ loss [197, 182] to enforce the synthetic face to be similar to the ground truth face at the target pose. This line of work requires the training data to include face image pairs of the same identity at different poses, which is achievable for controlled datasets such as Multi-PIE, but hard to fulfill for in-the-wild datasets. On the contrary, DR-GAN does not require image pairs since there is no direct supervision on the synthetic images. This enables us to utilize extensive real-world unstructured datasets for model training. To initialize the training, given a training image, we randomly sample the pose code with equal probability for each pose view. Such random sampling is conducted at each epoch during the training, for the purpose of assigning multiple pose codes to one training image. For the noise vector, we also randomly sample each dimension independently from the uniform distribution in the range of $[-1, 1]$.

A3.3 Multi-Image DR-GAN

Our single-image DR-GAN extracts an identity representation and performs face rotation by processing one single image. Yet, we often have multiple images per subject in training and sometimes in testing. To leverage them, we propose multi-image DR-GAN that can benefit both the training and testing stages. For training, it can learn a better identity representation from multiple images that are complementary to each other. For testing, it can enable template-to-template matching, which addresses a crucial need in real-world surveillance applications.

The multi-image DR-GAN has the same $D$ as single-image DR-GAN, but a different $G$ as shown in Fig. A3. Given $n$ images $\{x_i\}_{i=1}^n$ of the same identity $y^d$ at various poses as input, besides extracting the feature representation $f(x_i)$, $G_{enc}$ also estimates a coefficient $w_i$ for each image, which predicts the quality of the learnt representation. The fused representation of the $n$ images is the weighted average of all representations,

$$f(x_1, \ldots, x_n) = \frac{\sum_{i=1}^n w_i f(x_i)}{\sum_{i=1}^n w_i}. \quad (12)$$

Figure A3: Generator in multi-image DR-GAN. From an image set of a subject, we can fuse the features to a single representation via dynamically learnt coefficients and synthesize images in any pose.

This fused representation is then concatenated with $c$ and $z$ and fed to $G_{dec}$ to generate a new image, which is expected to have the same identity as all input images and a target pose $y^t$ specified by the pose code. Thus, each sub-objective for learning $G$ has $(n + 1)$ terms:

$$L_G^{gan} = \sum_{i=1}^n \mathbb{E}[\log(D^r(G(x_i, c, z)))] + \mathbb{E}[\log(D^r(G(x_1, \ldots, x_n, c, z)))]. \quad (13)$$

A similar extension applies to $L_G^{id}$ and $L_G^{pos}$. The coefficient $w_i$ in Eqn. 12 is learned so that an image with a higher quality contributes more to the fused representation. The quality is an indicator of the PIFR performance of the image, rather than the low-level image quality. Face quality prediction is a classic topic where many prior works attempt to estimate the former from the latter [116, 171]. Our coefficient learning is essentially the quality prediction, from novel perspectives in contrast to prior work. That is, without explicit supervision, it is driven by $D$ through the decoded image $G_{dec}(f(x_1, \ldots, x_n), c, z)$, and learned in the context of, as a by-product of, representation learning. Note that jointly training multiple images per subject results in one, but not multiple, generators, i.e., all $G_{enc}$ in Fig. A3 share the same parameters. This makes it flexible to take an arbitrary number of images during testing for representation learning and face rotation.
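Eqn. 12 amounts to only a few lines of code; the sketch below, with illustrative tensor shapes, shows the coefficient-weighted fusion.

```python
# Eqn. 12 in code form; shapes are illustrative assumptions.
import torch

def fuse_representations(f, w):
    """f: (n, N_f) per-image representations; w: (n,) coefficients in [0, 1].
    Returns the (N_f,) coefficient-weighted average of Eqn. 12."""
    return (w.unsqueeze(1) * f).sum(dim=0) / w.sum()

f = torch.randn(6, 320)             # n = 6 images, N_f = 320
w = torch.sigmoid(torch.randn(6))   # Sigmoid keeps w in [0, 1], as in the text
fused = fuse_representations(f, w)  # (320,)
```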
For the network structure, multi-image DR-GAN only makes minor modifications to the single-image counterpart. Specifically, at the end of $G_{enc}$, we add one more convolutional filter to the layer before AvgPool to estimate the coefficient $w$. We apply a Sigmoid activation to constrain $w$ in the range of $[0, 1]$. During training, despite being unnecessary, we keep the number of input images per subject $n$ the same for the sake of convenience in image sampling and network training. To mimic the variation in the number of input images, we use a simple but effective trick: applying Dropout on the coefficients $w$: each $w$ is set to 0 with a probability of 0.5. Hence, during training, the network takes any number of inputs varying from 1 to $n$.

DR-GAN can be used for PIFR, image quality prediction, and face rotation. While the network in Fig. A2(d) is used for training, our network for testing is much simplified. First, for PIFR, only $G_{enc}$ is used to extract the representation from one or multiple images. Second, for quality prediction, only $G_{enc}$ is used to compute $w$ from one image. Thirdly, both $G_{enc}$ and $G_{dec}$ are used for face rotation by specifying a target pose and a noise vector.

A3.4 Comparison to Prior GANs

We compare DR-GAN with the most relevant GAN variants (Fig. A2).

Conditional GAN. Conditional GAN [109, 81] extends GAN by feeding the labels to both $G$ and $D$ to generate images conditioned on labels, either class labels, modality information, or even partial data for inpainting. It has been used to generate MNIST digits conditioned on the class label and to learn multi-modal models. In conditional GAN, $D$ is trained to classify a real image with mismatched conditions to a fake class. In DR-GAN, $D$ classifies a real image to the corresponding class based on the labels.

Auxiliary Classifier GAN. Odena et al. [115] extend conditional GAN to add an additional classification task to $D$ to classify real images into $N_c$ classes. DR-GAN shares a similar loss for $D$ but with a distinct purpose. The auxiliary classification in Odena et al. [115] is used to help improve the stability and quality of GAN training. Meanwhile, we employ two additional classifications to guide the representation learning in the encoder-decoder structured $G$.

Adversarial Autoencoder (AAE). In AAE [98], $G$ is the encoder of an autoencoder. AAE has two objectives in order to turn an autoencoder into a generative model: the autoencoder reconstructs the input image, and the latent vector generated by the encoder matches an arbitrary prior distribution by training $D$. DR-GAN differs from AAE in two aspects. First, the autoencoder in [98] is trained to learn a latent representation similar to an imposed prior distribution, while our encoder-decoder learns discriminative identity representations. Second, $D$ in AAE is trained to distinguish real/fake distributions, while our $D$ is trained to classify real/fake images, and the identity and pose of the images.

A4 Experiments

DR-GAN can be used for face recognition by using the learnt representation from $G_{enc}$, and face rotation by specifying different pose codes and noise vectors with $G$. We evaluate DR-GAN quantitatively for PIFR and qualitatively for face rotation. We further conduct experiments to analyze the training strategy, the disentangled representation, and the image coefficients. Our experiments are conducted on both controlled and in-the-wild databases.

A4.1 Experimental Settings

Databases.
Multi-PIE [52] is the largest database for evaluating face recognition under pose, illumination, and expression variations in controlled settings. For a fair comparison, we follow the setting in [197]: using 337 subjects with neutral expression, 9 poses within ±60°, and 20 illuminations. The first 200 subjects are used for training and the remaining 137 subjects for testing. In the testing set, one image per subject with frontal view and neutral illumination forms the gallery set and the others are the probe set. For the Multi-PIE experiments, we add an additional illumination code similar to the pose code to disentangle the illumination variation. Therefore, we have $N_d = 200$, $N_p = 9$, $N_{il} = 20$. Further, to demonstrate our ability in synthesizing large-pose faces, we train a second model with training faces up to 90° (i.e., $N_p = 13$).

For the in-the-wild setting, we train on CASIA-WebFace [180] and AFLW [76], and test on CFP [140] and IJB-A [75]. CASIA-WebFace includes 494,414 near-frontal faces of 10,575 subjects. We add AFLW (25,993 images) to the training set to supply more pose variation. Since there is no identity information in this dataset, those images are only used to compute the GAN and pose-related losses. CFP consists of 500 subjects, each with 10 frontal and 4 profile images. The evaluation protocol includes frontal-frontal (FF) and frontal-profile (FP) face verification, each having 10 folders with 350 same-person pairs and 350 different-person pairs. As another large-pose database, IJB-A has 5,396 images and 20,412 video frames of 500 subjects. It defines template-to-template face recognition where each template has one or multiple images. We remove the 27 overlapping subjects between CASIA-WebFace and IJB-A from the training. We have $N_d = 10{,}548$, $N_p = 13$. We set $N_f = 320$, $N_z = 50$ for both settings.

Figure A4: The mean faces of the 13 pose groups in CASIA-WebFace. The blurriness shows the challenges of pose estimation for large poses.

Implementation Details. Following [180], we align all face images to a canonical view of size 110×110. We randomly sample 96×96 regions from the aligned 110×110 face images for data augmentation. Image intensities are linearly scaled to the range of $[-1, 1]$. To provide pose labels $y^p$ for CASIA-WebFace, we apply 3D face alignment [71, 70] to classify each face into one of 13 poses. The mean face image of each pose group is shown in Fig. A4. The mean faces of profile faces are less sharp than those of the near-frontal pose groups, which indicates the pose estimation error caused by the face alignment algorithm.

Our implementation is extensively modified from a publicly available implementation of DC-GAN. We follow the optimization strategy in [124]. The batch size is set to 64. All weights are initialized from a zero-centered normal distribution with a standard deviation of 0.02. The Adam optimizer [74] is used with a learning rate of 0.0002 and momentum 0.5.

Evaluation. The proposed DR-GAN aims for both face representation learning and face image synthesis. The cosine distance between two representations is used for face recognition. We also evaluate the performance of face recognition w.r.t. different numbers of images in both training and testing. For image synthesis, we show qualitative results by comparing different losses and interpolation of the learnt representations. We also evaluate the various effects of different components in our method.

Table A1: DR-GAN and its partial variants performance comparison.

                   Verification                Identification
Method             @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
DR-GAN w/o D^r     80.0±2.2      55.5±3.5      88.7±0.8      95.0±0.8
DR-GAN w/o D^p     78.0±2.0      53.9±6.8      87.5±0.8      94.5±0.7
DR-GAN             81.2±2.7      56.2±9.1      89.0±1.4      95.1±0.9

Figure A5: Generated faces of DR-GAN and its partial variants.

A4.2 Ablation Study

Discriminator Components.
Our discriminator is designed as a multi-task CNN with three components, namely $D^r$, $D^d$, $D^p$, for real/fake, identity, and pose classification respectively. While $D^d$ plays a critical role in guiding the generator to preserve the input identity, we would like to study the role of the remaining components. Table A1 presents the recognition performance of single-image DR-GAN partial variants with each of the $D$ components removed. While the variant without the adversarial loss has a slight performance drop, the model without the pose classification task has a more severe drop. This shows the importance of generating face images in different poses. Also, the role of each component is shown in the generated faces (Fig. A5). When removing $D^r$, generated images have lower quality, although they can still be recognized as faces in correct poses. When removing $D^p$, the pose of generated images can't be controlled by the pose code and is usually affected by the input face's pose. This can be caused by pose information residing in the feature representation, which also explains the severe drop in the model's recognition performance.

Disentangled Representation. In DR-GAN, we claim that the learnt representation is disentangled from pose variations via the pose code. To validate this, following the energy-based weight visualization method proposed in [184], we perform feature visualization on the FC layer, denoted as $h \in \mathbb{R}^{6 \times 6 \times 320}$, in $G_{dec}$. Our goal is to select two out of the 320 filters that have the highest responses for identity and pose respectively. The assumption is that if the learnt representation is pose-invariant, there should be separate neurons to encode the identity features and pose features.

Recall that we concatenate $f(x) \in \mathbb{R}^{320}$, $c \in \mathbb{R}^{13}$ and $z \in \mathbb{R}^{50}$ into one feature vector, which is multiplied with a weight matrix $W_{fc} \in \mathbb{R}^{(320+13+50) \times (6 \cdot 6 \cdot 320)}$ and generates the output $h$, with $h_i \in \mathbb{R}^{6 \times 6}$ being the feature output of one filter in FC. Let $W_{fc} = [W_{fx}; W_c; W_z]$ denote the weight matrix with three sub-matrices, which multiply with $f(x)$, $c$, $z$ respectively. Taking the identity sub-matrix as an example, we have $W_{fx} = [W_{fx}^1, W_{fx}^2, \ldots, W_{fx}^{320}]$ where $W_{fx}^i \in \mathbb{R}^{320 \times 36}$. We compute an energy vector $s_d \in \mathbb{R}^{320}$ with each element as $s_d^i = \|W_{fx}^i\|_F$. We then find the filter with the highest energy in $s_d$ as $k_d = \arg\max_i s_d^i$. Similarly, by partitioning $W_c$, we find another filter, denoted as $k_p$, with the highest energy for pose.

Given the representation $f(x)$ of one subject, along with a pose code $c$ and noise $z$, we can compute the responses of the two filters via $h_{k_d} = (f(x), c, z)^\top W_{fc}^{k_d}$ and $h_{k_p} = (f(x), c, z)^\top W_{fc}^{k_p}$. By varying the subjects and pose codes, we generate two arrays of responses in Fig. A6, for identity ($h_{k_d}$) and pose ($h_{k_p}$) respectively. For both arrays, each row represents the responses of the same subject and each column represents the same pose. The responses for identity encode the identity features, where each row shows similar patterns and each column does not share similarity. On the contrary, for the pose responses, each column shares similar patterns while each row is not related. This visualization supports our claim that the learnt representation is pose-invariant.

Figure A6: Responses of the two filters with the highest responses to identity (left), and pose (right). Responses of each row are of the same subject, and each column of the same pose. Note the within-row similarity on the left and within-column similarity on the right.
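The filter selection reduces to a per-filter Frobenius norm and an argmax; a short sketch is below, assuming a filter-major layout of $W_{fc}$ (an assumption about the memory ordering, not the thesis code).

```python
# A short sketch of the energy-based filter selection above, assuming a
# filter-major layout of W_fc (320 filters x 36 spatial outputs per filter).
import torch

def highest_energy_filters(W_fc, n_f=320, n_p=13):
    """W_fc: ((N_f + N_p + N_z) x 6*6*320) FC weight matrix, partitioned
    row-wise as [W_fx; W_c; W_z]. Returns (k_d, k_p), the filters whose
    identity and pose sub-matrices have the largest Frobenius norms."""
    W = W_fc.view(W_fc.shape[0], 320, 36)     # (rows, filter, spatial)
    s_d = W[:n_f].norm(dim=(0, 2))            # s_d^i = ||W_fx^i||_F
    s_p = W[n_f:n_f + n_p].norm(dim=(0, 2))   # pose energies from W_c
    return s_d.argmax().item(), s_p.argmax().item()

k_d, k_p = highest_energy_filters(torch.randn(320 + 13 + 50, 6 * 6 * 320))
```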
Table A2: Comparison of single vs. multi-image DR-GAN on CFP.

Method         Frontal-Frontal   Frontal-Profile
DR-GAN: n=1    97.13±0.68        90.82±0.28
DR-GAN: n=4    97.86±0.75        92.93±1.39
DR-GAN: n=6    97.84±0.79        93.41±1.17

Single vs. Multiple Image DR-GAN. We evaluate the effect of the number of training images ($n$) per subject on the face recognition performance on CFP. Specifically, with the same training set, we train three models with $n = 1, 4, 6$, where $n = 1$ denotes single-image DR-GAN and $n > 1$ denotes multi-image DR-GAN. The face verification performance on CFP using $f(x)$ of each model is shown in Tab. A2. We observe the advantage of multi-image DR-GAN over the single-image counterpart despite them using the same amount of training data, which is attributed to more constraints in learning $G_{enc}$ that lead to a better representation. However, we do not keep increasing $n$ due to the limited computation capacity. In the rest of the paper, we use multi-image DR-GAN with $n = 6$ unless specified otherwise.

Figure A7: Coefficient distributions on IJB-A (a) and CFP (b). For IJB-A, we visualize images at four regions of the distribution. For CFP, we plot the distributions for frontal faces (blue) and profile faces (red) separately and show images at the heads and tails of each distribution.

A4.3 Coefficient Analysis

In multi-image DR-GAN, we learn a coefficient for each input image by assuming that the learnt coefficient is indicative of the image quality, i.e., how good it can be used for face recognition. Therefore, a low-quality image should have a relatively poor representation and a small coefficient so that it would contribute less to the fused representation. To validate this assumption, we compute the coefficients for all images in the IJB-A and CFP databases and plot the distributions as shown in Fig. A7.

For IJB-A, we show four example images with low, medium-low, medium-high, and high coefficients. It is obvious that the learnt coefficients are correlated to the image quality. Images with relatively low coefficients are usually blurry, with large poses or failed cropping, while images with relatively high coefficients are of very high quality with frontal faces and less occlusion. Since CFP consists of 5,000 frontal faces and 2,000 profile faces, we plot their distributions separately. Despite some overlap in the middle region, the profile faces clearly have relatively low coefficients compared to the frontal faces. Within each distribution, the coefficients are related to other variations except yaw angles. The low-quality images for each pose group are with occlusion and/or challenging lighting conditions, while the high-quality ones are with less occlusion and under normal lighting.

Figure A8: The correlation between the estimated coefficients and the classification probabilities.

To quantitatively evaluate the correlation between the coefficients and face recognition performance, we conduct an identity classification experiment on IJB-A. Specifically, we randomly select all frames of one video for each subject, and select half of the images for training and the remaining for testing. The training and testing sets share the same identities. Therefore, in the testing stage, we can use the output of the softmax layer as the probability of each testing image belonging to the right identity class. This probability is an indicator of how well the input image can be recognized as the true identity. Given the estimated coefficients, we plot these two values for the testing set, as shown in Fig. A8. These two values are highly correlated to each other with a correlation of 0.69, which again supports our assumption that the learnt coefficients are indicative of the image quality.

Image selection with $w$. One common application of image quality is to prevent low-quality images from contributing to face recognition. To validate whether our coefficients have such usability, we design the following experiment. For each template in IJB-A, we keep images whose coefficients $w$ are larger than a threshold $w_t$; if all $w$ are smaller, we keep the one image with the highest $w$.
Table A3: Performance on IJB-A when removing images by threshold $w_t$. "Selected" shows the percentage of retained images.

                       Verification                Identification
$w_t$   Selected (%)   @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
0       100.0          84.3±1.4      72.6±4.4      91.0±1.5      95.6±1.1
0.1     94.9           84.2±1.7      72.7±2.9      91.3±1.3      95.7±1.0
0.25    71.9           83.6±1.2      73.3±3.0      90.7±1.2      95.2±1.0
0.5     24.6           80.9±1.9      71.3±4.7      86.5±1.9      93.1±1.6
1.0     5.7            77.8±2.2      64.0±6.2      83.4±2.3      91.6±1.2

Tab. A3 reports the performance on IJB-A with different $w_t$. With $w_t$ being 0, all test images are kept and the result is the same as Tab. A6. These results show that keeping all or the majority of the samples is better than removing them. This is encouraging, as it verifies the effectiveness of DR-GAN in automatically diminishing the impact of low-quality images, without removing them by thresholding.

Feature fusion with $w$. We also would like to show that our proposed feature fusion using the coefficients $w$ is effective for the template-to-template matching purpose. We compare it with multiple fusion methods at both the feature level and the score level. Table A4 shows comparisons of different fusion methods on our multi-image DR-GAN features. To compare two templates with sizes $n_1$, $n_2$, for score-level fusion, min, max, and mean respectively take the minimum, maximum and average of all $n_1 \cdot n_2$ possible pairwise distances. Mean-min is the average of the $n_1 + n_2$ minimum distances from each feature of one template to the other. All of these methods have a time complexity of $O(n_1 n_2)$. Softmax, proposed in [2], aggregates multiple weighted averages of the pairwise scores, where each weight is a function of the score using an exponential function at different scales. It has a time complexity of $O(m n_1 n_2)$, where $m$ is the number of weight scales. Here, following [101], we use a total of $m = 21$ scales from 0 to 20. For feature-level fusion, max and mean are respectively max-pooling and average-pooling along each feature dimension. All feature-level fusion methods, including our $w$-fusion, have a time complexity of $O(n_1 + n_2)$. From Tab. A4, our fusion using the estimated $w$ achieves the best performance among all methods.

Table A4: Fusion scheme comparisons on the IJB-A dataset.

                         Verification                Identification
         Method          @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
Score    Min             78.3±2.7      46.0±6.9      86.7±1.4      94.0±0.6
         Max             22.8±2.0      12.3±2.3      30.6±2.8      52.8±2.7
         Mean            72.8±2.9      49.2±5.3      85.7±1.3      93.1±0.6
         Mean-min        82.4±2.2      58.5±6.3      90.2±1.0      95.6±0.5
         Softmax         84.3±1.6      69.2±6.8      90.1±1.0      95.5±0.8
Feature  Max             19.0±1.3      12.1±1.7      45.4±5.3      62.6±0.9
         Mean            83.0±1.5      67.0±4.8      89.6±1.5      95.4±0.7
         $w$-fusion      84.3±1.4      72.6±4.4      91.0±1.5      95.6±1.1
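The two fusion families compared in Tab. A4 can be contrasted in a short sketch; cosine distance is assumed as the underlying metric, as used for recognition in this chapter.

```python
# A sketch contrasting the two fusion families in Tab. A4, assuming cosine
# distance between representations: score-level mean-min costs O(n1*n2)
# pairwise comparisons, feature-level w-fusion costs O(n1 + n2).
import torch
import torch.nn.functional as F

def mean_min_score(f1, f2):
    """f1: (n1, d), f2: (n2, d). Average of the n1 + n2 minimum pairwise
    cosine distances from each feature to the other template."""
    d = 1.0 - F.normalize(f1) @ F.normalize(f2).t()   # (n1, n2) distances
    return torch.cat([d.min(dim=1).values, d.min(dim=0).values]).mean()

def w_fusion_score(f1, w1, f2, w2):
    """Fuse each template with its coefficients first, then compare once."""
    g1 = (w1.unsqueeze(1) * f1).sum(0) / w1.sum()
    g2 = (w2.unsqueeze(1) * f2).sum(0) / w2.sum()
    return 1.0 - F.cosine_similarity(g1, g2, dim=0)

f1, f2 = torch.randn(5, 320), torch.randn(8, 320)     # two templates
w1, w2 = torch.rand(5), torch.rand(8)                 # their coefficients
print(mean_min_score(f1, f2), w_fusion_score(f1, w1, f2, w2))
```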
A4.4 Representation Learning

Loss Function Comparison. Our $G_{dec}$ and $D$ can be viewed as a loss function for $f(x)$. Typical loss functions used in deep learning-based face recognition can be divided into two categories: probability-based and energy-based losses. Probability-based losses (i.e., softmax and its variants) usually compute a distribution of probability over all identities. Meanwhile, energy-based losses (contrastive, triplet, etc.) associate an energy to each configuration of samples. Here, we compare DR-GAN to multiple common loss functions for face recognition. To have a fair comparison on IJB-A, for all functions, we use our $G_{enc}$ network architecture and "mean-min" fusion. DR-GAN by itself can surpass all prior loss functions (Tab. A5). Also, any advanced loss function can be beneficial to DR-GAN: energy-based losses (center, triplet, etc.) can be employed directly on our representation $f(x)$, or probability-based losses (angular, additive-margin softmax, etc.) can be used to replace the softmax of $D^d$. Empirically, using additive-margin softmax [168] as a softmax replacement on $D^d$ can further improve DR-GAN performance; we name this variant DR-GAN_AM.

Table A5: Loss function comparisons. All use "mean-min" fusion.

                     Verification                Identification
Method               @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
Softmax              75.9±3.9      44.1±9.9      87.8±0.9      94.6±0.6
Center [170]         74.9±3.1      50.3±7.0      87.2±1.4      95.2±0.9
Triplet [138]        74.9±3.1      50.3±7.0      87.2±1.4      95.2±0.9
AM-Softmax [168]     81.3±3.0      52.7±8.9      88.7±0.7      94.3±0.4
DR-GAN single img.   81.2±2.7      56.2±9.1      89.0±1.4      95.1±0.9
DR-GAN               82.4±2.3      58.5±8.0      90.2±1.0      95.6±0.5
DR-GAN_AM            85.7±1.6      70.3±5.7      91.0±1.5      95.6±1.1

Table A6: Performance comparison on the IJB-A dataset.

                     Verification                Identification
Method               @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
GOTS [75]            40.6±1.4      19.8±0.8      44.3±2.1      59.5±2.0
Wang et al. [167]    72.9±3.5      51.0±6.1      82.2±2.3      93.1±1.4
DCNN [27]            78.7±4.3      -             85.2±1.8      93.7±1.0
PAM_frontal [101]    73.3±1.8      55.2±3.2      77.1±1.6      88.7±0.9
PAMs [101]           82.6±1.8      65.2±3.7      84.0±1.2      92.5±0.8
p-CNN [184]          77.5±2.5      53.9±4.2      85.8±1.4      93.8±0.9
FF-GAN [185]         85.2±1.0      66.3±3.3      90.2±0.6      95.4±0.5
DR-GAN               85.6±1.5      75.1±4.2      91.3±1.6      95.8±1.0
DR-GAN_AM            87.2±1.4      78.1±3.5      92.0±1.3      96.1±0.7

Results on Benchmark Databases. We compare DR-GAN with state-of-the-art face recognizers on IJB-A, CFP and Multi-PIE.

Table A6 shows the performance of both face verification and identification on IJB-A. For our results, we report the results of multi-image DR-GAN using the proposed $w$-fusion. The first row shows the performance of the presented DR-GAN model (using the typical softmax loss); the second row presents the variant using additive-margin softmax [168]. Compared to the state of the art, DR-GAN achieves superior results on both verification and identification. These in-the-wild results show the power of DR-GAN for PIFR.

Table A7: Performance (Accuracy) comparison on CFP.

Method                        Frontal-Frontal   Frontal-Profile
Sengupta et al. [140]         96.40±0.69        84.91±1.82
Sankaranarayanan et al. [137] 96.93±0.61        89.17±2.35
Chen et al. [28]              98.67±0.36        91.97±1.70
Human                         96.24±0.67        94.57±1.10
DR-GAN                        98.13±0.81        93.64±1.51
DR-GAN_AM                     98.36±0.75        93.89±1.39

Table A7 shows the comparison on CFP evaluated with Accuracy. Results are reported as the average with standard deviation over 10 folds. Overall, we achieve comparable performance on frontal-frontal verification while having a 1.92% improvement on frontal-profile verification.

Table A8: Identification rate (%) comparison on the Multi-PIE dataset.

Method             0°      ±15°    ±30°    ±45°    ±60°    Average
Zhu et al. [196]   94.3    90.7    80.7    64.1    45.9    72.9
Zhu et al. [197]   95.7    92.8    83.7    72.9    60.1    79.3
Yim et al. [182]   99.5    95.0    88.5    79.9    61.9    83.3
Using L2 loss      95.1    90.8    82.7    72.7    57.9    78.3
DR-GAN             98.1    94.9    91.1    87.2    84.6    90.4
DR-GAN_AM          98.1    95.0    91.3    88.0    85.8    90.8

Table A8 shows the face identification performance on Multi-PIE compared to the methods with the same setting. Our method shows a significant improvement for large-pose faces, e.g., there is more than a 20% improvement margin at ±60° poses. The variation of recognition rates across different poses is much smaller than the baselines, which suggests that our learnt representation is more robust to pose variation.

Representation vs. Synthetic Image for PIFR. Many prior works [58, 194] use frontalized faces for PIFR. To evaluate the identity preservation of the synthetic images from DR-GAN, we also perform face recognition using our frontalized faces.

Table A9: Representation $f(x)$ vs. synthetic image $\hat{x}$ on IJB-A.
                               Verification                Identification
Features                       @FAR=.01      @FAR=.001     @Rank-1       @Rank-5
$f(\hat{x})$                   78.5±1.9      60.3±3.7      86.9±1.6      94.2±1.3
$D^d(\hat{x})$                 77.1±2.9      53.5±6.2      85.7±1.7      93.6±1.6
$f'(\hat{x})$                  79.2±2.9      60.8±7.3      89.2±1.4      95.3±1.1
$f'(\hat{x})$ & $f(\hat{x})$   83.0±1.8      71.7±3.6      90.7±1.4      95.6±1.0
$f(x)$                         84.3±1.4      72.6±4.4      91.0±1.5      95.6±1.1

Figure A9: Face rotation comparison on Multi-PIE. Given the input (in illumination 07 and 75° pose), we show synthetic images of the $L_2$ loss (top), adversarial loss (middle), and ground truth (bottom). Columns 2-5 show the ability of DR-GAN in simultaneous face rotation and re-lighting.

Any face feature extractor could be applied to the frontalized faces, including $G_{enc}$ or $D^d$. However, both are trained on real images of various poses. To specialize to synthetic frontal faces, we fine-tune $G_{enc}$ with the synthetic images and denote it as $f'(\cdot)$. As shown in Tab. A9, although the performance of synthetic images (and its score-level fusion denoted as $f'(\hat{x})$ & $f(\hat{x})$) is not as good as the learnt representation, using the fine-tuned $G_{enc}$ on synthetic frontals still achieves comparable performance to the previous methods, which shows the identity preservation ability of DR-GAN.

A4.5 Face Rotation

Adversarial Loss vs. L2 Loss. Prior work [196, 182, 179] on face rotation normally employs the $L_2$ loss to learn a mapping between two views. To compare the $L_2$ loss with our adversarial loss, we train a model where $G$ is supervised by an $L_2$ loss on the ground truth face at the target view. The training process is kept the same for a fair comparison. As shown in Fig. A9, DR-GAN can generate far more realistic faces that are similar to the ground truth faces in all views. Meanwhile, images synthesized with the $L_2$ loss cannot maintain high-frequency components and are blurry. In fact, the $L_2$ loss treats each pixel equally, which leads to the loss of discriminative information. This inferior synthesis is also reflected in the lower PIFR performance in Tab. A8. In contrast, by integrating the adversarial loss, we expect to learn a more discriminative representation for better recognition, and a more generative representation for better face synthesis.

Figure A10: Interpolation of $f(x)$, $c$, and $z$. (a) Synthetic images by interpolating between the identity representations of two faces (Columns 1 and 12). Note the smooth transition between different genders and facial attributes. (b) Pose angles 0°, 15°, 30°, 45°, 60°, 75°, 90° are available in the training set. DR-GAN interpolates in-between unseen poses via continuous pose codes, shown above Row 3. (c) For each image in Column 1, DR-GAN synthesizes two images at $z = -1$ (Column 2) and $z = 1$ (Column 12), and in-between images by interpolating along the two $z$.

Variable Interpolations. Taking two images of different subjects $x_1$, $x_2$, we extract the features $f(x_1)$ and $f(x_2)$ from $G_{enc}$. The interpolation between $f(x_1)$ and $f(x_2)$ can generate many representations, which can be fed to $G_{dec}$ to synthesize face images.
Similarinterpolationcanbeconductedfortheposecodesaswell.Duringtraining,weuse aone-hotvector c tospecifythe discrete poseofthesyntheticimage.Duringtesting,wecould generatefaceimageswith continuous poses,whoseposecodeistheweightedaverage,i.e.,inter- polation,oftwoneighboringposecodes.Notethattheresultantposecodeisnolongeraone-hot vector.AsinFig.A10(b),thisleadstosmoothposetransitionfromoneviewtomanyviews unseen tothetrainingset. Wecanalsointerpolatethenoisevector z .Wesynthesizefrontalfacesat z = 1 and z = 1 (a vectorofall1s)andinterpolatebetweentwo z .Giventheedidentityrepresentationandpose 123 FigureA12: FacefrontalizationonIJB-A.Foreachoffoursubjects,weshow11inputimageswithesti- matedcoefoverlaidatthetopleftcornerrow)andtheirfrontalizedcounterpart(secondrow). Thelastcolumnisthegroundtruthfrontalandsyntheticfrontalfromthefusedrepresentationofall11im- ages.Notethechallengesoflargeposes,occlusion,andlowresolution,andour opportunistic frontalization. code,thesyntheticimagesareidentity-preservedfrontalfaces.AsinFig.A10(c),thechangeof z leadstothechangeofthebackground,illuminationcondition,andfacialattributessuchasbeard, whiletheidentityiswellpreservedandfacesareofthefrontalview.Thus, z modelsless facevariations. FaceRotationonBenchmarkDatabases. Ourgeneratoristrainedtobeafacerotator.Given oneormultiplefaceimageswitharbitraryposes,wecangeneratemultipleidentity-preservedfaces atdifferentviews.FigureA9showsthefacerotationresultsonMulti-PIE.Givenaninputimage atanypose,wecangeneratemulti-viewimagesofthesamesubjectbutatadifferentposeby specifyingdifferentposecodesorinadifferentlightingconditionbyvaryingilluminationcode. 124 FigureA13: FacefrontalizationonIJB-Aforanimagesetsubject)andavideosequence(second subject).Foreachsubject,weshow11inputimages(row),theirrespectivefrontalizedfaces(second row)andthefrontalizedfacesusing incrementally fusedrepresentationsfromallpreviousinputsuptothis image(thirdrow).Inthelastcolumn,weshowthegroundtruthfrontalface. Therotatedfacesaresimilartothegroundtruthwithwell-preservedattributessuchaseyeglasses. Oneapplicationoffacerotationisfacefrontalization.OurDR-GANcanbeusedforface frontalizationbyspecifyingthefrontal-viewasthetargetpose.FigureA11showsthefacefrontal- izationonCFP.Givenanextremeinputimage,DR-GANcangeneratearealisticfrontal facethathassimilaridentitycharacteristicsastherealfrontalface.Tothebestofourknowledge, thisistheworkthatisableto frontalizeaprwin-the-wildfaceimage .Whentheinput imageisalreadyinthefrontalview,thesyntheticimagescancorrectthepitchandrollangles, normalizeilluminationandexpression,andimputeoccludedfacialareas,asshowninthelastfew examplesofFig.A11. FigureA12showsfacefrontalizationresultsonIJB-A.Foreachsubjectortemplate,weshow 11imagesandtheirrespectivefrontalizedfaces,andthefrontalizedfacegeneratedfromthefused representation.Foreachinputimage,theestimatedcoef w isshownonthetop-leftcorner 125 ofeachimage,whichclearlyindicatesthequalityoftheinputimageaswellasthefrontalized image.Forexample,coefforlow-qualityorlarge-poseinputimagesareverysmall.These imageswillhaveverylittlecontributiontothefusedrepresentation.Finally,thefacefromthe fusedrepresentationhassuperiorqualitycomparedtoallfrontalizedimagesfromasingleinput face.Thisshowstheeffectivenessofourmulti-imageDR-GANintakingadvantageofmultiple imagesofthesamesubjectforbetterrepresentationlearning. 
Face Rotation on Benchmark Databases. Our generator is trained to be a face rotator. Given one or multiple face images with arbitrary poses, we can generate multiple identity-preserved faces at different views. Figure A9 shows the face rotation results on Multi-PIE. Given an input image at any pose, we can generate multi-view images of the same subject but at a different pose by specifying different pose codes, or in a different lighting condition by varying the illumination code. The rotated faces are similar to the ground truth with well-preserved attributes such as eyeglasses.

One application of face rotation is face frontalization. Our DR-GAN can be used for face frontalization by specifying the frontal view as the target pose. Figure A11 shows the face frontalization on CFP. Given an extreme profile input image, DR-GAN can generate a realistic frontal face that has similar identity characteristics as the real frontal face. To the best of our knowledge, this is the first work that is able to frontalize a profile-view in-the-wild face image. When the input image is already in the frontal view, the synthetic images can correct the pitch and roll angles, normalize illumination and expression, and impute occluded facial areas, as shown in the last few examples of Fig. A11.

Figure A12: Face frontalization on IJB-A. For each of four subjects, we show 11 input images with estimated coefficients overlaid at the top left corner (first row) and their frontalized counterparts (second row). The last column is the ground truth frontal and the synthetic frontal from the fused representation of all 11 images. Note the challenges of large poses, occlusion, and low resolution, and our opportunistic frontalization.

Figure A12 shows face frontalization results on IJB-A. For each subject or template, we show 11 images and their respective frontalized faces, and the frontalized face generated from the fused representation. For each input image, the estimated coefficient $w$ is shown at the top-left corner of each image, which clearly indicates the quality of the input image as well as the frontalized image. For example, the coefficients for low-quality or large-pose input images are very small. These images will have very little contribution to the fused representation. Finally, the face from the fused representation has superior quality compared to all frontalized images from a single input face. This shows the effectiveness of our multi-image DR-GAN in taking advantage of multiple images of the same subject for better representation learning.

To further evaluate face frontalization results w.r.t. different numbers of input images, we vary the number of input images from 1 to 11 and visualize the frontalized images from the incrementally fused representations. As shown in Fig. A13, the individually frontalized faces have varying degrees of resemblance to the true subject, according to the qualities of the different input images. The synthetic images from the fused representations (third row) improve as the number of images increases.

Figure A13: Face frontalization on IJB-A for an image set (first subject) and a video sequence (second subject). For each subject, we show 11 input images (first row), their respective frontalized faces (second row) and the frontalized faces using incrementally fused representations from all previous inputs up to this image (third row). In the last column, we show the ground truth frontal face.

A5 Conclusions

This paper presents DR-GAN to learn a disentangled representation for PIFR by modeling the face rotation process. We are the first to construct the generator in GAN with an encoder-decoder structure for representation learning, which can be quantitatively evaluated by performing PIFR. Using the pose code for decoding and pose classification in the discriminator leads to the disentanglement of pose variation from the identity features. We also propose multi-image DR-GAN to leverage multiple images per subject in both training and testing to learn a better representation. This is the first work that is able to frontalize an extreme-pose in-the-wild face. We attribute the superior PIFR and face synthesis capabilities to the discriminative yet generative representation learned in $G$. Our representation is discriminative since the other variations are explicitly disentangled by the pose/illumination codes and random noise, and is generative since its decoded (synthetic) image would still be classified as the original identity.

PUBLICATIONS

Journal Papers

1. Luan Tran and Xiaoming Liu, "On Learning 3D Face Morphable Model from In-the-wild Images," in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), July 2019.

2. Luan Tran, Xi Yin, and Xiaoming Liu, "Representation Learning by Rotating Your Faces," in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), September 2018.

Conference Papers

1. Feng Liu, Luan Tran, and Xiaoming Liu, "3D Face Modeling from Diverse Raw Scan Data," Proceeding of IEEE International Conference on Computer Vision (ICCV) 2019, Seoul, South Korea, October, 2019. (Oral presentation)

2. Bangjie Yin*, Luan Tran*, Haoxiang Li, Xiaohui Shen, and Xiaoming Liu, "Towards Interpretable Face Recognition," Proceeding of IEEE International Conference on Computer Vision (ICCV) 2019, Seoul, South Korea, October, 2019. (Oral presentation) (* denotes equal contribution by the authors)

3. Luan Tran, Feng Liu, and Xiaoming Liu, "Towards High-fidelity Nonlinear 3D Face Morphable Model," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, California, June, 2019.

4. Luan Tran, Kihyuk Sohn, Xiang Yu, Xiaoming Liu, and Manmohan Chandraker, "Gotta Adapt 'Em All: Joint Pixel and Feature-Level Domain Adaptation for Recognition in the Wild," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, California, June, 2019.

5. Ziyuan Zhang, Luan Tran, Xi Yin, Yousef Atoum, Jian Wan, Nanxin Wang, and Xiaoming Liu, "Gait Recognition via Disentangled Representation Learning," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Long Beach, California, June, 2019. (Oral presentation)

6. Anurag Chowdhury, Yousef Atoum, Luan Tran, Xiaoming Liu, and Arun Ross, "MSU-AVIS dataset: Fusing Face and Voice Modalities for Biometric Recognition in Indoor Surveillance Videos," in Proceeding of International Conference on Pattern Recognition (ICPR), Beijing, China, August, 2018.
7. Luan Tran and Xiaoming Liu, "Nonlinear 3D Face Morphable Model," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, Utah, June 2018. (Spotlight presentation)

8. Luan Tran, Xi Yin, and Xiaoming Liu, "Disentangled Representation Learning GAN for Pose-Invariant Face Recognition," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, Hawaii, July 2017. (Oral presentation)

9. Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin, "Missing Modalities Imputation via Cascaded Residual Autoencoder," in Proceeding of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, Hawaii, July 2017.