3D FACE MODELING: APPLICATIONS IN GENERATIVE TASKS AND OCCLUSION-AWARE RECONSTRUCTION

By Rahul Dey

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy
2023

ABSTRACT

3D modeling of human faces has emerged as a widely studied field within computer vision, with applications in virtual reality, animation, medical imaging, and more, and it remains a promising direction for future research. In particular, 3D modeling from single-view face images is known to be especially challenging because of its ill-posed nature, yet it comes with a wide range of applications. Two of the most promising approaches in this regard are template-based approaches, such as the 3D morphable model (3DMM) of faces, and implicit 3D modeling approaches, such as implicit 3D-GANs. Over the years, 3DMM-based approaches have improved their capability to synthesize highly controllable 3D faces and to generate accurate 3D reconstructions of face images, while implicit 3D-GANs have been shown to generate high-fidelity 3D faces. However, even after significant advancements in these approaches, face generative tasks, such as face inpainting and controllable face generation, are still primarily performed in the 2D image space.

Faces are structured 3D objects with inherent attributes such as shape, pose and albedo, and their projection onto 2D images is affected by external factors such as illumination and camera parameters. Without an explicit consideration of these factors, existing generative approaches have to implicitly model facial geometry and appearance. We contend that generative models that explicitly take these factors into account can leverage 3D priors and more controllably and accurately generate new faces, or fill in the missing regions in face images. Further, the ill-posed nature of reconstructing 3D models from monocular face images makes it a challenging task. This becomes even more challenging when facial occlusions such as face masks, glasses, microphones, etc. are involved. This highlights the need for the development of occlusion-aware 3D face reconstruction algorithms. We argue that such an algorithm should be (i) robust to occlusions of varying types, sizes, and locations; and (ii) capable of generating diverse, yet realistic solutions for the occluded parts to account for the lack of a unique solution.

This thesis addresses the aforementioned challenges by presenting the following: (i) a 3D-aware face inpainting approach that considerably improves upon 2D-based baselines, especially under challenging conditions; (ii) a controllable 3D face generation approach that combines the capabilities of 3DMMs and implicit 3D-GANs by learning a correspondence between them; and (iii) an occlusion-aware 3D face reconstruction approach that generates a diverse, yet realistic set of 3D reconstructions from a single occluded face image, with lower error on the visible face regions than the baselines.

Copyright by RAHUL DEY 2023

Dedicated to my parents, whose sacrifices have enabled me to reach this far.

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my gratitude to the people who have supported and guided me throughout my PhD journey. First and foremost, I would like to thank my PhD supervisor Prof.
Vishnu Boddeti, for the opportunity to be a part of his team from its conception, and for his guidance, expertise, constructive criticism, and unwavering support throughout my research. I am grateful to my thesis panel, comprising Prof. Arun Ross, Prof. Xiaoming Liu, and Dr. Felix Juefei-Xu, for serving on my committee and providing their valuable feedback, mentorship, and recommendations. Their esteemed presence in my academic journey has been a great fortune, which I will always cherish. I would also like to thank Prof. Bernhard Egger, Dr. Tim Marks, and Dr. Ye Wang for their expertise and mentoring in the CoLa-SDF project. I am grateful to Prof. Anil Jain, Prof. Jiayu Zhou, Prof. Hayder Radha, Prof. Sijia Liu and other esteemed faculty members of MSU for the amazing course offerings that nurtured me in the areas of computer vision, machine learning, and related fundamental fields. I also thank the staff of the CSE graduate office for their help and support, as well as the staff of the Division of Engineering Computing Services for troubleshooting many IT and compute-related issues. I would like to thank my friends, including lab mates from the HAL lab, and other members from the computer vision and machine learning groups at MSU for their support throughout this journey, and for the many fun trips and leisure activities that made this journey much more enjoyable. Finally, I would like to express my heartfelt gratitude towards my family - Mom, Dad, and my elder sisters Deepika and Renuka - for their continued love and support, without which I would not have been able to accomplish this.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION
    1.1  Contributions
CHAPTER 2  BACKGROUND
    2.1  3D Morphable Models of Faces
    2.2  Implicit 3D Models
    2.3  Implicit 3D-GANs
    2.4  Related Work
CHAPTER 3  3DFACEFILL: AN ANALYSIS-BY-SYNTHESIS APPROACH TO FACE COMPLETION
    3.1  Approach
    3.2  Experimental Evaluation
    3.3  Conclusion
CHAPTER 4  COLA-SDF: CONTROLLABLE LATENT STYLESDF FOR DISENTANGLED 3D FACE GENERATION
    4.1  Preliminaries
    4.2  Approach
    4.3  Experiments
    4.4  Conclusion
CHAPTER 5  DIVERSE3DFACE: TOWARDS ROBUST AND DIVERSITY-PROMOTING 3D FACE RECONSTRUCTION FROM SINGLE-VIEW IMAGES
    5.1  Preliminaries
    5.2  Approach
    5.3  Experimental Evaluation
    5.4  Conclusion
CHAPTER 6  FUTURE EXTENSIONS
    6.1  Generating Diverse Textured 3D Reconstructions from a Single Occluded Face Image
    6.2  High-Resolution Diversity-Oriented 3DFaceFill
    6.3  Extensions to CoLa-SDF
CHAPTER 7  CONCLUSION
BIBLIOGRAPHY
APPENDIX A  3DFACEFILL
APPENDIX B  COLA-SDF
APPENDIX C  DIVERSE3DFACE

CHAPTER 1 INTRODUCTION

Over recent years, the coming together of computer graphics and computer vision has led to the emergence of powerful tools for the analysis and synthesis of real-world objects in 3D. Such tools, when applied to human faces, have shown promise in AR/VR applications such as overlaying virtual faces onto the real world, creating realistic and lifelike animations of human faces, medical imaging, face recognition, and so on. Specifically for faces, template-based approaches such as 3D morphable models (3DMMs) [Paysan et al., 2009, Li et al., 2017a], and deep learning based approaches such as implicit 3D models [Mildenhall et al., 2020, Chan et al., 2021, Hong et al., 2021] have emerged as very promising approaches for 3D modeling. By representing faces as linear combinations of shape and texture bases, 3DMMs can not only synthesize new 3D faces in a highly controllable way, but also reconstruct 3D models from 2D face images. Their nonlinear counterparts, commonly known as nonlinear 3DMMs [Tran and Liu, 2018, Tran et al., 2019, Feng et al., 2021, Medin et al., 2022], have further improved the expressivity of these models. On the other hand, implicit 3D models represent 3D objects using implicit functions rather than meshes or surfaces, and have been shown to be capable of modeling high-fidelity faces in 3D, including regions not modeled by 3DMMs, such as the inner mouth cavity and hair [Or-El et al., 2022, Gu et al., 2021]. Despite such advances, the applicability of 3D modeling to face generative tasks such as face inpainting, and controllable face generation and editing, has not been properly explored. Further, 3D face reconstruction approaches still struggle in the presence of occlusions such as face masks, glasses, microphones, etc.

Face inpainting is the process of reconstructing missing or corrupted parts of an image by predicting and filling in the missing pixels based on the surrounding context. It has wide applications in face editing and restoration, face de-occlusion, and virtual try-on, to name a few. Existing face inpainting approaches operate in 2D. Typically, a masked face image is fed to an end-to-end autoencoder network that outputs the completed face image [Li et al., 2017b, Yu et al., 2018, Yu et al., 2019, Zheng et al., 2019a].
The limitation of these approaches is that the completed faces often have geometric artifacts, especially when the masks get large, as shown in Fig. 1.3. We argue that this limitation can be overcome by incorporating explicit 3D priors into the model.

Figure 1.1 A human face is inherently a 3D structure comprising a 3D shape, 3D pose, and albedo, and its appearance when captured in an image can be impacted by global factors like lighting, camera position, and surrounding objects.

Human faces are structured 3D objects with inherent attributes such as shape, pose and albedo. Their projection in 2D images is affected by external factors such as illumination and camera pose (see Fig. 1.1). By having to implicitly model the geometry, structure and appearance of faces, 2D-based approaches often fail to sufficiently account for them, causing such artifacts [Li et al., 2017b].

Another major generative task involving faces is the controlled generation and editing of faces. This is used to generate realistic human faces with specific attributes or characteristics and has numerous applications in gaming, virtual reality, digital advertising, forensics and law enforcement, the fashion and beauty industries, etc. While several 2D-based approaches exist [Tewari et al., 2020b, Deng et al., 2020, Tripathy et al., 2021], they often do not control attributes such as pose and hairstyle, and have limited applications wherever 3D models are required, e.g., gaming and virtual reality. This necessitates the implementation of such approaches in 3D. However, while 3DMMs are highly disentangled and afford explicit control over attributes like shape, pose, albedo and illumination, they do not include the inner mouth cavity and hair, and are limited in their expressivity and realism [Feng et al., 2021, Medin et al., 2022]. Implicit 3D models, on the other hand, are highly expressive and generate high-fidelity 3D faces, but do not afford explicit control over the facial attributes. A high-fidelity 3D generative model, with a high degree of control, needs to bring together the capabilities of both these approaches.

Figure 1.2 Real-world face images often have occlusions caused by various objects, including face-related objects like glasses and beards, as well as unrelated objects like microphones or tools. When analyzing face images, these occlusions should be excluded from the analysis. The images are from the CelebA dataset [Liu et al., 2015].

Another challenge in 3D modeling of 2D face images is occlusions. In-the-wild face images often come with several forms of occlusions (see Fig. 1.2). Performing monocular 3D reconstruction from occluded face images confronts several challenges: (i) Robustness to occlusions: The difficulty in 3D face reconstruction depends on the degree, size, shape and location of occlusions. For example, the larger the occlusion, the more difficult it gets to reconstruct an accurate 3D face model; and (ii) Lack of a unique solution: In the presence of occlusion, it is not possible to know with certainty how the occluded face would have looked, even if the algorithm can reconstruct a highly realistic-looking 3D face model with respect to the visible regions. In such cases, a method that can generate a distribution of diverse solutions would be preferable. Most existing methods of monocular 3D face reconstruction do not explicitly account for occlusions, which affects their robustness to such occlusions.
And the ones that do consider occlusions do so by parsing and using only the visible parts of the face image [Song et al., 2019b, Egger et al., 2018]. This affords them some degree of robustness to occlusions, while still not accounting for the possible diversity of solutions.

In this dissertation, we study and address the aforementioned challenges. First, we present a 3D-aware face inpainting approach that incorporates explicit 3D modeling of faces, and show its effectiveness over existing approaches, especially under challenging conditions. Then, towards controllable 3D face generation, we present an approach that establishes correspondence between the parameters of a nonlinear 3DMM model and the latent space of an implicit 3D-GAN to generate highly controllable, yet high-fidelity 3D faces with explicit control over physical attributes like shape, pose, albedo, illumination, and hairstyle. Then, we explore ways to make 3D reconstruction from monocular face images robust to occlusions, while simultaneously accounting for diversity in the occluded regions. Finally, we discuss potential future work in this area. We now provide a brief introduction to these approaches, followed by our specific contributions.

In Chapter 3, we look at the challenge of face inpainting, particularly under large variations in pose, shape, illumination, and mask sizes and locations. Existing face inpainting approaches [Yu et al., 2019, Zheng et al., 2019a] often result in poor photorealism under such conditions and often fail to preserve facial symmetry and variations in these factors while inpainting, as shown in the examples of extreme face poses, illumination variations, and diverse appearances and shapes in Fig. 1.3. To this end, we present our approach, called 3DFaceFill [Dey and Boddeti, 2022a], which aims to address the challenge of face de-occlusion by explicitly disentangling a face image into its 3D components. Our method completes the facial albedo in its UV representation and integrates the 3D shape, 3D pose, and illumination with the inpainted albedo to render the completed face image back. Additionally, we leverage facial symmetry in the UV representation of the albedo to aid in the inpainting of symmetric occluded facial regions. Through extensive experiments across multiple datasets and challenging conditions, we demonstrate that 3DFaceFill improves face completion both quantitatively and qualitatively over the baselines [Yu et al., 2018, Zheng et al., 2019b, Li et al., 2017b, Li et al., 2020a], by as much as 4 dB in terms of PSNR and ∼25% in terms of LPIPS [Zhang et al., 2018b], a metric considered closer to human perception.

Figure 1.3 Inpainting of face images under diverse conditions by 3DFaceFill and existing approaches. By modeling the image formation process, 3DFaceFill is able to generate more geometrically consistent and photorealistic completions across diverse scenarios such as non-frontal poses (A), light and dark complexions (B, D), non-uniform facial illumination (e.g., illumination is different on the two sides of the nose in C), and in cases where the baselines tend to distort face components (e.g., the nose in B).

In Chapter 4, we develop a method for controllable generation and subsequent editing of 3D faces, called CoLa-SDF.
We combine the controllability of nonlinear 3DMM approaches with the high fidelity of implicit 3D-GANs by establishing correspondence between the parametric space of a nonlinear 3DMM and the latent space of 3D-GANs. Building upon the impressive photorealism and expressive 3D representation of StyleSDF [Or-El et al., 2022], CoLa-SDF adopts a similar architecture but enforces the latent space to match the interpretable and physical parameters of the nonlinear 3D morphable model MOST-GAN [Medin et al., 2022]. Through our experiments, we showcase the effectiveness of CoLa-SDF in achieving high-fidelity face synthesis and subsequent 3D manipulation with full control over the disentangled latent parameters, as shown in Fig. 1.4.

Figure 1.4 Our proposed method CoLa-SDF combines the controllability of physical attributes afforded by 3DMM-based approaches with the high-quality generative capability of implicit 3D-GANs. Generated images can be manipulated independently across shapes, expressions, albedos, illumination conditions, as well as hairstyles and backgrounds.

While 3D modeling can lead to improved face de-occlusion, occlusions themselves present a major challenge to 3D reconstruction. This results in a chicken-and-egg problem. We tackle the issue of 3D face reconstruction in Chapter 5. We specifically focus on two aspects of the problem: robustness and diversity. Traditionally, monocular 3D reconstruction approaches, both fitting-based [Paysan et al., 2009, Li et al., 2017a, Egger et al., 2018] and neural network-based [Tran and Liu, 2019, Tran et al., 2019, Tuấn Trần et al., 2018, Wu et al., 2020, Sengupta et al., 2018], rely on a global model to reconstruct a 3D model from a face image. This is not optimal in the presence of occlusion, as the global model is either affected by the occlusion or needs to be heavily regularized, which leads to sub-optimal 3D reconstruction (see Fig. 1.5).

Figure 1.5 Diverse 3D reconstructions from a single occluded face image by Diverse3DFace vs. the singular solution produced by the baselines, including FLAME-Fitting [Li et al., 2017a], DECA [Feng et al., 2021], CFR-GAN [Ju et al., 2022], Occ3DMM [Egger et al., 2018], and ExtremeOcc3D [Tuấn Trần et al., 2018].

Further, these approaches generate a single solution even in the presence of occlusion. In contrast, we propose a global+local model that separates shape fitting on the visible facial regions from that on the occluded ones, resulting in higher accuracy in the reconstruction of the visible parts. We follow this with a diversity-oriented shape completion of the occluded parts, using a mesh-based VAE [Zhou et al., 2020b] and a diversity loss based on the concept of determinantal point processes (DPPs) [Kulesza and Taskar, 2012]. Extensive experiments demonstrate that, on face images occluded by masks, glasses, and other random objects, our approach generates a distribution of 3D shapes having ∼50% higher diversity on the occluded regions compared to the baselines.
Moreover, our closest sample to the ground truth has ∼40% lower MSE than the singular reconstructions by both the occlusion-aware baselines [Egger et al., 2018, Tuấn Trần et al., 2018] and the non-occlusion-aware baselines [Li et al., 2017a, Feng et al., 2021].

1.1 Contributions

Our specific contributions are the following:
• We explore and present ways to leverage explicit 3D face modeling in generative tasks such as face inpainting and controllable 3D face generation.
• In the context of 3D-aware face inpainting, we propose 3DFaceFill [Dey and Boddeti, 2022a], which disentangles the partial face into its 3D components to aid in face completion, thereby generating geometrically and photometrically better completions than the baselines.
• We present a method called CoLa-SDF which leverages 3D modeling for controlled generation and manipulation of high-fidelity 3D faces, from which photorealistic 2D images can be rendered in multiple views.
• We explore the problem of monocular 3D face reconstruction in the presence of occlusions by focusing on (i) robustness to occlusion and scene variations such as shape, pose, illumination, etc., and (ii) diversity of solutions rather than a single solution. To this end, we propose Diverse3DFace [Dey and Boddeti, 2022b], which employs an ensemble of global+local shape models that disentangle fitting on the visible regions from the occluded regions, followed by diversity-oriented completion of the occluded regions using the power of DPPs [Kulesza and Taskar, 2012].
• We perform extensive quantitative and qualitative experiments to show the effectiveness of our proposed 3D-based approaches to face inpainting and controllable 3D face generation, and of our occlusion-aware 3D face reconstruction approach.

CHAPTER 2 BACKGROUND

In this chapter, we introduce the two main 3D modeling techniques we work with in this thesis: 3D morphable models of faces and implicit 3D models. We then present previous approaches that have dealt with similar tasks in face inpainting, controllable face generation, and 3D face reconstruction.

2.1 3D Morphable Models of Faces

3D morphable models (3DMMs) are a popular technique in computer vision and graphics that aims to model the variation in shape and texture of 3D objects, typically faces. A 3DMM is a statistical model that represents the shape and appearance of a 3D object as a linear combination of basis shapes and textures. These basis shapes and textures are learned from a set of training data, usually a large set of 3D face scans, which allows the model to capture the natural variation in the shape and appearance of the object. The model can be used for a variety of tasks, such as face reconstruction, facial expression analysis, and face recognition. It can also be used for generating new faces that are statistically similar to the training data. Some of the popular 3DMM models for faces are the Basel Face Model (BFM) [Paysan et al., 2009, Gerig et al., 2018] and the Faces Learned with an Articulated Model and Expressions (FLAME) model [Li et al., 2017a] (see Fig. 2.1). Specifically, FLAME [Li et al., 2017a] defines a 3D shape as:

$$S_{w/pose}(\alpha, \beta, \theta) = W\big(S(\alpha, \beta, \theta),\, J(\alpha),\, \theta,\, \mathcal{W}\big), \tag{2.1}$$

where α, β, and θ denote the shape, expression, and pose parameters, respectively; $J \in \mathbb{R}^{3K}$ represents the locations of the K face joints around which $S(\alpha, \beta, \theta)$ is rotated; and the result is finally smoothed by the blend weights $\mathcal{W}$.
The un-aligned shape $S(\alpha, \beta, \theta)$ is obtained by adding up the contributions of shape, expression and pose variations on top of a template shape $\bar{S}$:

$$S(\alpha, \beta, \theta) = \bar{S} + B_S(\alpha; \mathcal{S}) + B_P(\theta; \mathcal{P}) + B_E(\beta; \mathcal{E}) \tag{2.2}$$

The shape and expression variations are modeled by the linear blendshapes $B_S(\alpha; \mathcal{S}) = \mathcal{S}\alpha$ and $B_E(\beta; \mathcal{E}) = \mathcal{E}\beta$, where $\mathcal{S} \in \mathbb{R}^{3N \times |\alpha|}$ and $\mathcal{E} \in \mathbb{R}^{3N \times |\beta|}$ are orthonormal shape and expression bases, respectively, learned using PCA, and N is the number of vertices. The pose blendshape function is defined as $B_P(\theta; \mathcal{P}) = (R(\theta) - R(\theta^*))\,\mathcal{P}$, where $R(\theta)$ comprises the rotation matrices around the K joints and $\mathcal{P} \in \mathbb{R}^{3N \times 9K}$ contains the pose blendshapes describing the vertex offsets from the rest pose activated by R. Further, texture is modeled by a linear combination of a set of texture bases $\mathcal{T}$: $B_T(\tau; \mathcal{T}) = \mathcal{T}\tau$.

Figure 2.1 FLAME [Li et al., 2017a] and BFM [Gerig et al., 2018] 3DMMs. Image is sourced from [Egger et al., 2020].

3D face reconstruction using 3DMMs can be done from single-view or multi-view face images. We focus on single-view 3D face reconstruction in this dissertation. For this, a set of 2D facial landmarks is first detected in the input image. These landmarks are used to normalize the pose and scale of the 2D face. Then, the parameters of the 3DMM are optimized to best fit the input image in terms of facial landmarks and appearance.

Figure 2.2 NeRF [Mildenhall et al., 2020] is a prominent implicit 3D model. The image synthesis process consists of (a) sampling 3D coordinates along camera rays in the given viewing direction, (b) feeding these points and viewing directions to a neural network to obtain a set of color and density values, and (c) integrating and compositing these rays into 2D images using volume rendering techniques. Image is sourced from the original paper.

2.2 Implicit 3D Models

Implicit 3D models such as NeRF [Mildenhall et al., 2020] refer to a type of 3D shape representation that defines the shape of an object as a continuous, implicit function. In other words, instead of representing a 3D object as a mesh of vertices and polygons or a point cloud, an implicit 3D model represents the object as a function that takes a 3D coordinate x, and often a viewing direction v, as inputs and outputs the radiance, together with either the volume density or the signed distance value at that point. The implicit function is often modeled as a neural network. The volume density represents the opaqueness of the points, while the signed distance value indicates the distance between the input point and the surface of the object. If the signed distance is positive, the point is outside the object, and if it is negative, the point is inside the object. The surface of the object is defined as the set of points where the signed distance value is zero.

To render an image from a particular viewpoint, these approaches perform ray-marching to sample a set of 3D coordinates corresponding to each pixel in the image. This set of points, along with the corresponding viewpoint, is passed to a neural network to obtain the color and density values at these locations. Finally, classical volume rendering techniques are applied to render these points into a 2D image (see Fig. 2.2).

Figure 2.3 GRAF [Schwarz et al., 2020] is an implicit 3D-GAN that takes in a shape code and an appearance code, samples a set of 3D coordinates conditioned on the camera parameters, and renders 2D images using volume rendering. It is trained in an adversarial manner.
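To make the volume rendering step used by NeRF-style models (and by the implicit 3D-GANs discussed next) concrete, below is a minimal NumPy sketch of the discrete alpha-compositing along a single ray. It is an illustrative sketch only: the ray sampling and the neural network that predicts colors and densities are abstracted into toy inputs, and all names are ours rather than from any specific implementation.

```python
import numpy as np

def composite_ray(colors, densities, t_vals):
    """Discrete volume rendering along one ray (NeRF-style alpha compositing).

    colors:    (N, 3) RGB values predicted at the N sampled points
    densities: (N,)   volume density (sigma) at the N sampled points
    t_vals:    (N,)   depths of the samples along the ray (increasing)
    """
    # Distances between consecutive samples; the last interval is open-ended.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity contributed by each interval.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1] + 1e-10))
    weights = alphas * trans
    rgb = (weights[:, None] * colors).sum(axis=0)   # composited pixel color
    depth = (weights * t_vals).sum()                # expected termination depth
    return rgb, depth

# Toy usage: a constant "network" output along 64 samples of one ray.
t = np.linspace(2.0, 6.0, 64)
pts_color = np.tile([0.8, 0.3, 0.3], (64, 1))
pts_sigma = np.full(64, 0.5)
pixel, depth = composite_ray(pts_color, pts_sigma, t)
```

In practice this compositing is applied to every pixel's ray in parallel, and its differentiability is what allows the implicit network to be trained from 2D images alone.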
2.3 Implicit 3D-GANs

Implicit 3D-GANs [Schwarz et al., 2020, Niemeyer and Geiger, 2021, Chan et al., 2021, Gu et al., 2021, Or-El et al., 2022] are a type of generative adversarial network (GAN) that can be used to learn implicit 3D representations of objects. They combine the benefits of implicit 3D models with the power of GANs to generate new, realistic 3D shapes. Unlike traditional GANs, which learn to generate images or 2D representations of objects, implicit 3D-GANs learn to generate 3D shapes as a continuous, implicit function. They take as input a latent vector, a set of 3D coordinates, and a viewing direction, and output the color and either the volume density or the signed distance value at each 3D coordinate. From this, 2D images can be generated using the volume rendering techniques mentioned earlier [Mildenhall et al., 2020]. Implicit 3D-GANs are trained using a combination of adversarial and 3D regularization losses, which encourage the GAN to generate shapes that are both realistic and close to the true surface of the object. They can be used for a variety of tasks, including 3D shape generation, shape completion, and shape editing. The main advantage of implicit 3D-GANs is that they can generate new, realistic shapes that are not limited to the training data, with complex, intricate details and smooth surfaces.

2.4 Related Work

Due to their wide-ranging applications, both generative and 3D face modeling have seen a lot of research in recent years, which has led to several advancements. They have also benefited vastly from advancements in related, non-face-specific approaches. We now review some of the related work in the areas of image inpainting, face inpainting, 3D face reconstruction, implicit 3D-GANs, editable implicit 3D models, and diversity-promoting generative models.

2.4.1 Image Inpainting

Earlier image inpainting approaches [Bertalmio et al., 2000, Criminisi et al., 2004, Barnes et al., 2009, Hays and Efros, 2007] used diffusion- or patch-based methods to fill in the missing regions. This produced sharp results but often lacked semantic consistency. Recent techniques employ a CNN autoencoder along with a GAN loss to generate semantically consistent and realistic completions [Pathak et al., 2016, Yeh et al., 2017, Iizuka et al., 2017]. More recent methods focus on architectural enhancements to improve inpainting for variable and free-form masks. These include a more refined discriminator in PatchGAN [Isola et al., 2017], contextual attention in DeepFillv2 [Yu et al., 2018] and gated convolutions [Liu et al., 2018, Yu et al., 2019]. In contrast, our work in Chapter 3 adopts vanilla CNN architectures and instead relies on a more accurate 3D face analysis-by-synthesis technique.

2.4.2 Face Inpainting

Face inpainting is a more challenging variant of image inpainting because of the complexity and diversity of faces. To address this, many approaches impose additional geometric and photometric priors in the form of face-related losses [Song et al., 2019a, Li et al., 2017b, Chen et al., 2017b, Zhang et al., 2017, Li et al., 2020b, Yuan and Park, 2019]. A recent approach called DSA [Zhou et al., 2020a] uses oracle-learned attention maps and component-wise discriminators to generate high-fidelity completions. While it often generates photorealistic completions on well-lit frontal faces, it still relies on implicitly learned priors, which are insufficient to enforce correct geometry in challenging poses and illuminations.
All these approaches rely on novel architectural advances and loss functions, while our method 3DFaceFill focuses on more explicit and precise modeling of the image-formation process. Concurrently, [Deng et al., 2018] completed self-occluded UV texture to synthesize new face views. This assumes that the full face image and at least half of the UV texture are always visible. In contrast, our 3DFaceFill goes beyond self-occlusion and instead performs 3D factorization on the masked face and completes its albedo for masked face completion. Furthermore, since texture is not always symmetric due to illumination variations, [Deng et al., 2018] needs synthetically completed texture maps for training; whereas 3DFaceFill performs completion on the albedo, which is further disentangled from both geometry and illumination, allowing us to effectively enforce a symmetry prior without needing synthetically completed UV-maps for training, as borne out in our experiments. A few recent works have also attempted to leverage symmetry for face completion [Zhang et al., 2018a, Li et al., 2020a]. However, these approaches employ complex symmetry registration operations, which require huge computational resources; moreover, these operations are often susceptible to large geometric variations.

2.4.3 Linear and Nonlinear 3DMMs

Blanz and Vetter [Blanz and Vetter, 1999] proposed the first statistical 3DMM of human faces. Since then, such models have grown to include complex pose, expression, and texture modalities of faces [Paysan et al., 2009, Gerig et al., 2018]. FLAME, proposed by [Li et al., 2017a], models the full human head and allows non-linear control over joint poses to generate articulated, expressive head instances. While relatively simple and effective, these linear models often lack expressivity and detail. Over the past several years, many approaches began adopting neural networks to model higher-order complexities in the shape and texture spaces [Tewari et al., 2017, Sengupta et al., 2018, Shu et al., 2017, Tran and Liu, 2019, Tran et al., 2019, Tuan Tran et al., 2017, Ramon et al., 2021, Kim et al., 2018, Sanyal et al., 2019]. [Wu et al., 2020] leveraged facial symmetry and illumination to learn a 3D model of faces from in-the-wild images in an unsupervised way. [Medin et al., 2022] trained a nonlinear 3DMM, called MOST-GAN, to integrate the expressiveness of style-based GANs with the physical disentanglement of 3DMMs, along with a 2D hair manipulation network. Some approaches take a coarse-to-fine route to add details to 3D reconstructions. DECA [Feng et al., 2021] adds a pose- and expression-conditioned displacement map on top of a coarse shape to make the 3D reconstructions animatable. [Grassal et al., 2022] employed a coarse mesh refinement approach to learn subject-specific head avatars that model the entire head, including hair. Motivated by the advances in graph neural networks [Kipf and Welling, 2016, Veličković et al., 2017, Defferrard et al., 2016, Morris et al., 2019], some recent approaches adopted graph convolutions to directly learn nonlinear representations on a mesh surface while preserving the mesh topology [Ranjan et al., 2018, Bouritsas et al., 2019, Zhou et al., 2020b]. A few methods took a hybrid approach of fitting a non-linear neural network model to the target image to generate detailed 3D reconstructions [Gecer et al., 2019, Yenamandra et al., 2021].
However, compared to implicit 3D-GANs, these models do not generate 3D faces of comparable quality and intricate detail. Further, they have limited modeling of hair and teeth, since these facial regions lack pointwise correspondence across subjects and are not part of the underlying 3DMM models. Also, these approaches are not designed explicitly to handle occlusions. Hence, when used for 3D reconstruction, they often produce artifacts and lead to poor shape and pose estimation in the presence of facial occlusions.

2.4.4 Occlusion-Robust 3D Face Reconstruction

To improve occlusion robustness during 3D face reconstruction, a few approaches are explicitly designed to handle occlusions [Tuấn Trần et al., 2018, Egger et al., 2018, Ju et al., 2022, Li et al., 2023]. [Tuấn Trần et al., 2018] trained a neural network to regress a robust foundation shape from a masked face image, over which a detailed bump map is added later. [Egger et al., 2018] employed an EM-like approach to simultaneously optimize an occlusion mask and the model parameters for a target occluded image. [Li et al., 2023] adopted this strategy of 3D reconstruction aiding occlusion segmentation and vice versa, to simultaneously train a face encoder and an outlier segmentation network. However, these approaches rely on a global model to account for the entire face, including the occluded parts, which is sub-optimal as the lack of information from such parts needs to be countered using strong regularization. Moreover, they are limited to reconstructing a singular 3D solution without considering the plurality of solutions that can explain the occluded regions. In contrast, our proposed Diverse3DFace addresses the dual problems of robustness and lack of uniqueness through a multistage approach that disentangles fitting on the visible regions from diversity modeling on the occluded ones.

2.4.5 Diversity Promoting Generative Models

Diversity-promoting algorithms have been employed in several areas of computer vision where a distribution of outcomes is more desirable than a singular solution. Conditioning [Isola et al., 2017, Yang et al., 2019] and regularization [Zhu et al., 2017, Ghosh et al., 2018, Suzuki et al., 2016, Che et al., 2016, Srivastava et al., 2017] based techniques have been proposed to overcome mode collapse and promote diversity in GANs [Goodfellow et al., 2014]. Since image inpainting and image super-resolution are ill-posed problems, they particularly benefit from diversity-promoting algorithms. [Zheng et al., 2019b] introduced the notion of diversity of solutions in image inpainting. They proposed a dual-pipeline C-VAE [Sohn et al., 2015] that maintains ground-truth fidelity in one path while allowing diversity in the other. [Bahat and Michaeli, 2020] generated diverse super-resolution explanations by only enforcing consistency in the low-resolution space. In one of the most seminal works in this field, [Kulesza and Taskar, 2012] introduced the framework of determinantal point processes (DPPs) to model diversity in machine learning tasks such as inference, sampling, marginalization, etc. [Yuan and Kitani, 2019, Yuan and Kitani, 2020] adopted DPPs to sample multi-modal latent vectors for diverse human trajectory forecasting. [Elfeki et al., 2019] devised a DPP-based objective to train GANs and VAEs to emulate the diversity in real data.
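To make the DPP notion of diversity concrete before it is used in Chapter 5, the snippet below is a minimal, generic NumPy sketch: a set of samples is scored by the log-determinant of a similarity kernel, which is the quantity a DPP-style diversity objective pushes up. The RBF kernel, bandwidth, and function names here are illustrative assumptions and not the exact objective used by Diverse3DFace.

```python
import numpy as np

def dpp_diversity(samples, bandwidth=1.0):
    """Log-determinant diversity of a sample set under an RBF similarity kernel.

    A determinantal point process assigns probability proportional to det(L_S)
    to a subset S, so maximizing log det(L) over a set of generated samples
    favors sets whose members are mutually dissimilar.
    """
    diffs = samples[:, None, :] - samples[None, :, :]
    sq_dists = (diffs ** 2).sum(-1)
    L = np.exp(-sq_dists / (2.0 * bandwidth ** 2))   # RBF similarity kernel
    L = L + 1e-6 * np.eye(len(samples))              # numerical stability
    sign, logdet = np.linalg.slogdet(L)
    return logdet                                     # larger = more diverse

# Toy usage: near-duplicate samples score lower than spread-out ones.
tight = np.random.randn(8, 16) * 0.01
spread = np.random.randn(8, 16)
print(dpp_diversity(tight), dpp_diversity(spread))
```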
In Chapter 5, we adopt the idea of DPPs to generate diverse 3D reconstructions for an occluded face by discovering latent-space representations that maximize plausible diversity on the occluded regions while remaining faithful to the visible parts.

2.4.6 Implicit Neural Representations and 3D-GANs

Instead of explicitly representing objects and scenes as meshes, voxel grids or point clouds, implicit 3D models represent them through the parameters of a neural network. [Mildenhall et al., 2020] proposed the first method of neural radiance fields (NeRFs), in which the density and radiance of 3D points are queried through the network and rendered to an image using volume rendering. NeuS [Wang et al., 2021] adopted signed distance fields instead of density fields to represent object surfaces. While these models are fitted to a given scene and are not generative in nature, several later approaches adopted neural rendering to learn implicit 3D-GANs [Schwarz et al., 2020, Niemeyer and Geiger, 2021]. pi-GAN [Chan et al., 2021] proposed a novel architecture based upon periodic activation functions [Sitzmann et al., 2020] and feature-wise linear modulation (FiLM) to improve the view consistency and generation quality of implicit 3D-GANs. While these methods were successful in generating 3D scenes that can be rendered to view-consistent images, high computational cost prevented them from generating high-resolution images. EG3D [Chan et al., 2022] introduced a tri-planar framework and showed that it improves the computational efficiency and multi-view consistency of generated images. More recent methods like StyleNeRF [Gu et al., 2021] and StyleSDF [Or-El et al., 2022] have adopted a hybrid approach of combining a low-resolution volume renderer with a CNN-based super-resolution network. Although these approaches enable direct manipulation of the 3D viewpoint, they otherwise lack any explicit control over the generated objects.

2.4.7 Editable Implicit 3D Models

There have been several attempts to enable editing of implicit 3D-GANs. BANMo [Yang et al., 2022] learned a neural blend-skinning model to transform 3D points between the camera space and a learned canonical space, enabling large deformations. NeRF-Editing [Yuan et al., 2022] utilized ray bending to edit the underlying static NeRF. HeadNeRF [Hong et al., 2021] disentangled the latent space of an implicit 3D-GAN for faces by training on data containing multiple images for each subject with the same labeled variations in expression and illumination. StyleRig [Tewari et al., 2020b] and PIE [Tewari et al., 2020a] embed portrait images into the latent space of the pretrained StyleGAN model [Karras et al., 2020, Karras et al., 2021] for editing. CLIP-NeRF [Wang et al., 2022] performs text- or exemplar-based editing of low-resolution objects. Disentangled3D [Tewari et al., 2022] and FENeRF [Sun et al., 2022] train separate shape deformation and appearance networks, but they do not disentangle illumination and only generate low-resolution images. RigNeRF [Athar et al., 2022] enables editing of portraits by learning a deformation NeRF with respect to a canonical space modeled by a 3DMM, but it is subject-specific and does not allow for generating new identities. Compared to our method in Chapter 4, these methods lack explicit semantic control over specific aspects of the face and often lack photorealism.

CHAPTER 3 3DFACEFILL: AN ANALYSIS-BY-SYNTHESIS APPROACH TO FACE COMPLETION

©2022 IEEE.
Reprinted, with permission, from Dey, R. and Boddeti, V. N. 3DFaceFill: An Analysis-by-Synthesis Approach to Face Completion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1586–1595, 2022.

In this chapter, we explore the applicability of 3D face modeling to face inpainting (also known as face completion). End-to-end image completion methods, i.e., models that generate 2D completions directly from 2D masked images, have witnessed remarkable progress in recent years. These approaches rely primarily on architectural advances in neural network design to implicitly account for photometric and geometric variations in image appearance. Even those that explicitly include scene geometry in their formulation do so largely in 2D. Consequently, object-based image completions from such methods often suffer from poor photorealism, especially under large variations in the pose, shape and illumination of objects in the image and in the inpainting mask. For example, in the context of faces, Fig. 1.3 shows face images having extreme poses (1.3.A), illumination variations across the face (1.3.C) and diverse appearances and shapes. Current state-of-the-art methods such as DeepFillv2 [Yu et al., 2019] and PICNet [Zheng et al., 2019a], both of which operate end-to-end on 2D image representations, often fail to preserve facial symmetry and the variations of the aforementioned factors (pose, illumination, texture, shape) while inpainting.

Several attempts have been made to customize generic image inpainting solutions for structured objects such as faces. General image inpainting approaches typically employ a CNN autoencoder as the inpainter and train it using a combination of photometric and adversarial losses [Pathak et al., 2016, Iizuka et al., 2017, Yu et al., 2018, Zheng et al., 2019a]. Face-specific completion methods [Li et al., 2017b, Song et al., 2019a] employ additional losses such as a landmark loss, perceptual loss and face parsing loss. However, these approaches still do not account for all factors in the image formation process, like illumination and pose variations, and as such fail to effectively impose geometric priors such as facial symmetry. Moreover, the implicit enforcement of geometric priors is still done in 2D as opposed to in 3D. This is a significant limitation as faces are inherently symmetric 3D objects and their projections onto 2D images are often affected by the aforementioned factors of pose, illumination, shape, etc.

Figure 3.1 Overview: 3DFaceFill is an iterative inpainting approach where the masked face is disentangled into its 3D shape, pose, illumination and partial albedo by the 3DMM module, following which the partial albedo is inpainted and finally the completed image is rendered. During inference (only), the completed image is fed back through the whole pipeline in subsequent iterations, while using the initial mask for albedo inpainting. During training, a pre-trained model segments the image into face, hair and background to constrain the mask to lie only on the face. This segmentation is optionally used during inference if necessary.

In contrast to the foregoing, our approach advocates an analysis-by-synthesis approach to face completion that explicitly accounts for the 3D structure of faces, i.e., shape and albedo, and the image formation factors, i.e., pose and illumination.
The key insight of our solution is that performing face completion on the UV representation, as opposed to the 2D pixel representation, allows us to effectively leverage the power of correspondence and ultimately leads to geometrically and photometrically accurate face completion (see Fig. 1.3). Our approach (see Fig. 3.1), dubbed 3DFaceFill, comprises three components that are iteratively executed. First, the masked face image is disentangled into its constituent geometric and photometric factors. Second, an autoencoder performs inpainting on the UV representation of the facial albedo. Lastly, the completed face is re-synthesized by a differentiable renderer. Our specific contributions are:

– We propose 3DFaceFill, a simple yet very effective face completion model that explicitly disentangles photometric and geometric factors and performs inpainting in the UV representation of the facial albedo while preserving the associated facial shape, pose and illumination.

– We propose a 3D symmetry-aware network architecture and a symmetry loss for the inpainter to propagate albedo features from the visible to the symmetric masked regions of the UV representation. Enforcing the symmetry prior in 3D, as opposed to 2D, allows 3DFaceFill to more effectively leverage and preserve facial symmetry while inpainting.

– Given our trained model, we propose a simple refinement process at inference by iteratively re-processing the face completion through the model. This process enables us to address the "chicken-and-egg" problem of simultaneously inferring both the photometric and geometric factors and the completion of the face from a masked image. The procedure is especially effective for heavily masked faces, improving the PSNR by up to 1 dB.

– Extensive benchmarking on several datasets and unconstrained in-the-wild images shows 3DFaceFill producing photorealistic and geometrically consistent face completions over a range of masks and real occlusions, especially in terms of pose, lighting, and attributes such as eye gaze and the shape of the nose, along with a quantitative improvement of up to 4 dB in PSNR and 25% in LPIPS [Zhang et al., 2018b].

3.1 Approach

In this section, we first present an overview of our proposed 3D face completion approach (dubbed 3DFaceFill), followed by the details of each component. As shown in Fig. 3.1, 3DFaceFill has three components: a 3DMM encoder, an albedo completion module and a renderer. Given a masked face, 3DFaceFill first resolves it into its constituent 3D shape, pose and illumination using the 3DMM encoder (Fig. 3.2). Then, we obtain the partial facial texture in the UV-domain by re-projecting the mesh onto the input image (Fig. 3.2b). We further remove the shading component to obtain an illumination-invariant partial albedo. The inpainter completes the partial albedo using symmetric and learned priors. Finally, the renderer combines the inpainted albedo with the estimated 3D factors to obtain the completed face. As a natural extension of the proposed approach, we use 3D factorization and completion in a complementary way to further improve completion iteratively.

Figure 3.2 (a) Architecture: Given a masked face $I_m$, the 3DMM encoder extracts its shape parameters α, pose θ and illumination parameters γ, from which we obtain the full shape $S = \mathcal{S}\alpha$ and the shade represented in UV, $C^{uv} = \mathcal{H}\gamma$, by linear combination with the corresponding orthonormal shape and spherical harmonics bases $\mathcal{S}$ and $\mathcal{H}$, respectively. Then, we obtain a partial albedo $A^{uv}_m$, as shown on the right in (b), by first re-projecting the 3D mesh onto the masked image to obtain the UV-texture $T^{uv}_m$, and then removing the shade from it: $A^{uv}_m = T^{uv}_m \oslash C^{uv}$. Finally, the albedo inpainter G completes the partial albedo as $\hat{A}^{uv}$, conditioned on the UV-mask $M^{uv}$. We then combine the completed albedo with the estimated shape, pose and shade to obtain the completed image $\hat{I}$. To generate photorealistic completions, the completed and groundtruth images are evaluated by the proposed (c) PyramidGAN discriminator. (b) UV Sampling: The 3D mesh is projected onto the face image to obtain per-vertex RGB values $T^{v}(v)$, which are mapped to a UV texture map $T^{uv}$ using a pre-defined mapping.

3.1.1 3D Factorization

Existing face image completion approaches directly operate in 2D, which makes it non-trivial to enforce strong 3D geometric and photometric priors. This leads to poor face completion under challenging conditions of pose, geometry, lighting, etc., and motivates us to adopt explicit 3D factorization of face images to disentangle the appearance and geometric components and enable robust completion. Essentially, the 3D factorization module is an inverse renderer $\Phi: I \to (S, \theta, \gamma, A)$ that resolves a 2D face I into its constituent 3D shape $S \in \mathbb{R}^{N \times 3}$, 3D pose $\theta = (s, R, t)$, illumination γ and albedo A. Various 3DMM approaches like [Blanz and Vetter, 1999, Egger et al., 2018, Gecer et al., 2019] can be a natural fit for this. However, being fitting-based approaches, they are not real-time, leaving learning-based 3D reconstruction approaches [Tewari et al., 2017, Sengupta et al., 2018, Shu et al., 2017, Tuấn Trần et al., 2018, Tran and Liu, 2019, Tran et al., 2019, Wu et al., 2020] as the obvious choices. While any of these approaches can potentially be used, for the purpose of this work we adopt a simplified version of the nonlinear 3DMM presented in [Tran and Liu, 2019].

The 3D factorization module consists of a 3DMM encoder E and an albedo decoder $G_A$ (used only during training). The encoder E first resolves the image I into its shape α, albedo τ and illumination γ parameters, and its 3D pose $\theta = (s, R, t)$. Using the shape coefficients, we obtain the full 3D shape S by linear combination with the Basel Face Model's (BFM) [Paysan et al., 2009] orthonormal shape bases $\mathcal{S}$: $S = \mathcal{S}\alpha$, where $\mathcal{S} \in \mathbb{R}^{3N \times |\alpha|}$ and N is the number of vertices. Similarly, we combine the illumination coefficients linearly with the spherical harmonics (SH) bases $\mathcal{H}$ [Ramamoorthi and Hanrahan, 2001] to obtain the surface shading in the UV-domain: $C^{uv} = \mathcal{H}\gamma$, where $\mathcal{H} \in \mathbb{R}^{H \times W \times 3 \times 9}$ (9 bases per color channel), and H and W are the height and width of the UV-representation, respectively. The decoder $G_A$ maps the albedo coefficients into the full UV-albedo, $G_A: \tau \to A^{uv}$, which is then multiplied with the shade to obtain the texture $T^{uv} = A^{uv} \odot C^{uv}$. A differentiable renderer $\mathcal{R}$ [Tran and Liu, 2019] then re-projects the estimated 3D factors into the image $I^{ren}$ using the Z-buffer technique:

$$I^{ren} = \mathcal{R}(S, T^{uv}, \theta) \tag{3.1}$$

We train the module using masked images for robustness to partial inputs. For further details, refer to the appendix.
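For intuition, the following is a minimal NumPy sketch of the linear pieces described above — the shape assembled from the BFM bases and the UV shading assembled from the SH bases, combined with an albedo into a texture — using toy dimensions and illustrative names; the differentiable renderer $\mathcal{R}$ and the encoder E are not shown, and this is not the released implementation.

```python
import numpy as np

def factorized_texture(shape_basis, alpha, sh_basis, gamma, albedo_uv):
    """Assemble the 3DMM quantities that feed the renderer in Eq. (3.1).

    shape_basis: (3N, n_alpha)  orthonormal BFM shape bases
    alpha:       (n_alpha,)     predicted shape coefficients
    sh_basis:    (H, W, 3, 9)   spherical-harmonics bases in UV space
    gamma:       (9,)           predicted illumination coefficients
    albedo_uv:   (H, W, 3)      UV albedo (partial or completed)
    """
    # Linear 3DMM shape, reshaped to per-vertex xyz coordinates.
    shape = (shape_basis @ alpha).reshape(-1, 3)
    # Shading in UV space: one linear combination of SH bases per color channel.
    shade_uv = np.einsum('hwcb,b->hwc', sh_basis, gamma)
    # Texture = albedo modulated by shading (element-wise product).
    texture_uv = albedo_uv * shade_uv
    return shape, texture_uv

# Toy dimensions only, to show the expected tensor shapes.
N, n_alpha, H, W = 100, 40, 32, 32
shape, tex = factorized_texture(
    np.random.randn(3 * N, n_alpha), np.random.randn(n_alpha),
    np.random.rand(H, W, 3, 9), np.random.randn(9), np.random.rand(H, W, 3))
```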
3.1.2 Albedo Completion Module

Architecturally, our albedo completion module is similar to other adversarially trained image-completion autoencoders [Pathak et al., 2016, Li et al., 2017b, Yu et al., 2018]. However, ours has the unique advantage of being solely focused on recovering the missing albedo, which has been disentangled from other variations in shape, pose and illumination through 3D factorization and is largely symmetric in its UV-representation. UVGAN [Deng et al., 2018] performs a similar completion of self-occluded UV-texture extracted from fully visible face images. However, because of the entangled illumination, it does not use symmetry and needs a synthetically completed texture map for supervision, whereas we use symmetry as self-supervision. To this end, we discard the soft albedo obtained from the 3DMM albedo decoder and instead obtain the more realistic partial albedo from the input image in the UV space. This is done in two steps: first, we reproject the obtained 3D mesh onto the face image and use bilinear interpolation to sample the per-vertex texture (see Fig. 3.2b):

$$T^{v}_{m}(x, y, z) = \sum_{\substack{p \in \{\lfloor x \rfloor, \lceil x \rceil\} \\ q \in \{\lfloor y \rfloor, \lceil y \rceil\}}} I^{p,q}_{m}\,(1 - |x - p|)(1 - |y - q|) \tag{3.2}$$

Then, we map the sampled partial texture $T^{v}_m$ onto the UV space using barycentric interpolation on the predefined mesh-to-UV mappings, $T^{v}_m(v_1, v_2, v_3) \to T^{uv}_m(u, v)$. From the texture, we obtain the partial albedo by simply removing the estimated shade: $A^{uv}_m = T^{uv}_m \oslash C^{uv}$, where ⊘ is the element-wise division operation. We perform similar operations to unwarp the mask M onto the UV-space as $M^{uv}$.

We use a U-Net [Ronneberger et al., 2015] based autoencoder G to complete the partial albedo conditioned on the input mask, $G: (A^{uv}_m, M^{uv}) \to (\hat{A}^{uv}, \sigma^{uv})$, where $\hat{A}^{uv}$ is the completed albedo and $\sigma^{uv}$ is the uncertainty of the completion. In order to leverage the bilateral symmetry of the UV facial albedo as an attention map, we modify the U-Net architecture (henceforth referred to as Sym-UNet). This is especially helpful since we do not have access to full groundtruth albedo maps for training. To do so, we split the first convolution layer $f_{1:2c}$ into two parts, $f_{1,1:c}$ and $f_{2,c+1:2c}$, with an equal number of output channels c (see Fig. 3.2). The first filter operates on the input albedo as such to obtain the response $h_1 = f_1(A^{uv}_m)$. The second instead operates on the horizontally flipped albedo, $h_2 = f_2(\mathrm{hflip}(A^{uv}_m))$. We then concatenate the responses $h_1$ and $h_2$ from these two filters and pass them through the rest of the network (a sketch of this flip-and-concatenate front end is given in the code example below). During training, the first filter learns to extract features from the visible parts of the albedo, while the second filter learns to extract features corresponding to the symmetrically opposite visible parts to apply to the occluded regions (see Fig. 3.15). A naive approach of doing so, however, results in artifacts from the symmetric counterparts appearing on the visible regions, making network convergence difficult. Instead, we use gated convolutions [Yu et al., 2019], as shown in Fig. 3.15 (in all but the final layer), to ensure that such symmetric features are only transferred to the masked regions and do not create artifacts on the visible regions. We use group normalization [Wu and He, 2018] and ELU activation [Clevert et al., 2015] for all the feature layers, and the final output is simply clipped between −1 and 1. We then render the completed albedo $\hat{A}^{uv}$, along with the estimated shape, pose and illumination, to obtain a completed image $\hat{I}$ using eqn. 3.1.
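Below is a minimal PyTorch sketch of this flip-and-concatenate front end followed by a gated convolution, written under our reading of the description above; the class names, layer sizes and toy usage are illustrative assumptions and do not reproduce the released 3DFaceFill architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymFirstLayer(nn.Module):
    """First layer of a Sym-UNet-style inpainter (illustrative sketch).

    Half of the filters see the UV albedo as-is, the other half see its
    horizontal flip, so features from visible regions can be reused for
    their symmetric, possibly masked, counterparts."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        assert out_ch % 2 == 0
        self.f1 = nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, padding=1)
        self.f2 = nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, padding=1)

    def forward(self, albedo_uv):
        h1 = self.f1(albedo_uv)                         # visible-albedo features
        h2 = self.f2(torch.flip(albedo_uv, dims=[-1]))  # mirror-image features
        return torch.cat([h1, h2], dim=1)               # fed to the rest of the U-Net

class GatedConv(nn.Module):
    """Gated convolution: a learned soft gate modulates the features, which keeps
    mirrored features from leaking artifacts into the visible regions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.feat = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        return F.elu(self.feat(x)) * torch.sigmoid(self.gate(x))

# Toy usage on a 3-channel UV albedo of size 64x64.
x = torch.rand(1, 3, 64, 64)
y = GatedConv(64, 64)(SymFirstLayer()(x))
```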
Finally, we simply blend the input and completed images to obtain the output image: $I^{out} = I_m \odot (1 - M) + \hat{I} \odot M$.

PyramidGAN Discriminator: To generate sharp and semantically realistic completions, we use a multi-scale PatchGAN discriminator [Wang et al., 2018, Shocher et al., 2019], which we refer to as the PyramidGAN. The PyramidGAN evaluates the final output $I^{out}$ at multiple locations and scales, ranging from coarse and global to fine and local (refer to Fig. 3.2c). Features from each l-th downsampling layer $D_l$ of the PyramidGAN are used to evaluate an average hinge loss [Yu et al., 2019, Juefei-Xu et al., 2018] for that layer. We then compute the average loss across all the layers as the total loss, thus giving equal weight to each scale:

$$\mathcal{L}_G = -\,\mathbb{E}_{p(z)}\big[\mathbb{E}_{l \in L}\,[D_l(G(z))]\big] \tag{3.3}$$
$$\mathcal{L}_D = \mathbb{E}_{x}\big[\mathbb{E}_{l \in L}\,[1 - D_l(x)]_{+}\big] + \mathbb{E}_{p(z)}\big[\mathbb{E}_{l \in L}\,[1 + D_l(G(z))]_{+}\big],$$

Training Losses: We train the albedo completion module with the following total loss:

$$\mathcal{L} = \lambda_1 \mathcal{L}_A + \lambda_2 \mathcal{L}_I + \lambda_3 \mathcal{L}_{sym} + \lambda_4 \mathcal{L}_{GAN} + \lambda_5 \mathcal{L}_{gp}, \tag{3.4}$$

where $\mathcal{L}_A = \mathcal{L}_\sigma(\|\hat{A}^{uv} - \hat{A}^{uv}_{gt}\|_1, \sigma^{uv})$ and $\mathcal{L}_I = \mathcal{L}_\sigma(\|\hat{I} - I_{gt}\|_1, \sigma)$ are the pixel losses for the albedo and the image, respectively, $\mathcal{L}_{sym}$ is the symmetry loss, $\mathcal{L}_{GAN}$ is the GAN loss given in eqn. 3.3 and $\mathcal{L}_{gp}$ is the WGAN-GP loss as described in [Gulrajani et al., 2017]. The albedo symmetry loss is carefully applied on the masked regions whose symmetric counterparts are visible, to serve as supervised attention:

$$\mathcal{L}_{sym} = \mathcal{L}_\sigma\!\left((1 - M^{uv})\,M^{uv}_{flip} \odot \|\hat{A}^{uv} - \hat{A}^{uv}_{flip}\|_1,\; \sigma^{uv}\right) \tag{3.5}$$

Here, $\mathcal{L}_\sigma(x, \sigma)$ is the aleatoric uncertainty loss [Kendall and Gal, 2017], given by:

$$\mathcal{L}_\sigma(x, \sigma) = \frac{1}{\dim(x)} \sum_i \left(\frac{1}{2}\, x_i e^{-\sigma_i} + \frac{\sigma_i}{2}\right). \tag{3.6}$$

The loss coefficients are set so that all the loss components have similar magnitudes. In our approach, the goal is to show the efficacy of explicit 3D modeling on the geometric and photometric accuracy of face completion. We therefore refrain from using attention or face-specific losses, or refiner modules, which many other approaches have used [Li et al., 2017b, Yu et al., 2018, Yu et al., 2019, Zheng et al., 2019a, Zhou et al., 2020a, Medin et al., 2022], and leave them as future add-ons.

Iterative Refinement: 3D factorization is an important first step of our proposed approach, and it alone leads to robust face completion in cases where 2D-based methods fail. To make the 3D factorization itself robust to partial images, we train the 3DMM encoder on face images with randomly sized and randomly located masks. However, there is scope to further improve upon this and leverage the full power of our proposed two-step approach. To do this, we adopt a simple iterative refinement technique where face completion leads to improved 3D factorization and vice versa, as shown in Fig. 3.1. During inference, the masked face is used to distill the 3D factors in the first iteration, while in the next iteration, the completed face itself forms the input for 3D analysis. This leads to iteratively refined 3D analysis (especially the 3D pose) as well as face completion. Though one can repeat the iterative step many times, we experimentally found that two such iterations are usually sufficient.

3.2 Experimental Evaluation

Datasets: We evaluate the proposed 3DFaceFill on the CelebA [Liu et al., 2015] and CelebA-HQ [Lee et al., 2020] datasets. We use an 80% split for training and 20% for evaluation.
3.2 Experimental Evaluation

Datasets: We evaluate the proposed 3DFaceFill on the CelebA [Liu et al., 2015] and CelebA-HQ [Lee et al., 2020] datasets, using an 80% split for training and 20% for evaluation. Further, to evaluate the robustness and generalization performance, we perform a cross-dataset evaluation on the pose- and illumination-varying images from the MultiPIE [Gross et al., 2010] dataset and on ~50 in-the-wild face images downloaded from the internet¹.

Implementation Details: We train both the 3D factorization and the completion modules independently using the Adam optimizer with a learning rate of 10⁻⁴. We first train the 3DMM module on the 300W-3D [Zhu et al., 2016] and CelebA [Liu et al., 2015] datasets. Once the 3DMM encoder is trained, we freeze it and use it to train the completion module on the CelebA [Liu et al., 2015] dataset for 30k iterations. We generate random rectangular masks of varying sizes and locations, and constrain them to lie in the segmented face region (Fig. 3.1). Please see appendix Sec. A.3 for further details on implementation and computational analysis.

Baselines: To evaluate the efficacy of 3DFaceFill, we perform qualitative and quantitative comparisons against baselines such as GFC [Li et al., 2017b], SymmFCNet [Li et al., 2020a], DeepFillv2 [Yu et al., 2019, Yu et al., 2018] and PICNet² [Zheng et al., 2019a]. We use the publicly available pretrained face models for DeepFillv2 [Yu et al., 2019], PICNet [Zheng et al., 2019a] and SymmFCNet [Li et al., 2020a]. For GFC [Li et al., 2017b], the pretrained model was not trained on the same crop and alignment as ours, so we train it from scratch using their source code. Due to the absence of extensive results, we present additional evaluations against baselines that do not provide source code or pre-trained models in the supplementary, using a small set of results obtained from the corresponding authors.

3.2.1 Quantitative Evaluation

In addition to the typically used PSNR and SSIM metrics, we report LPIPS [Zhang et al., 2018b], which is more suitable for image completion. Table 3.1 reports the overall values of these metrics across all image-mask pairs for each dataset.

¹Source: https://unsplash.com/s/photos/face
²Following author guidelines, we sample the top 10 completions ranked by its discriminator and choose the one closest to the groundtruth for evaluation.
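All three metrics can be computed per completed image with off-the-shelf implementations; the sketch below assumes scikit-image (a recent version with `channel_axis`) for PSNR/SSIM and the lpips package for LPIPS, with image format and variable names chosen for illustration:

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # perceptual metric of Zhang et al., 2018b

def completion_metrics(output, target):
    """output, target: (H, W, 3) float numpy arrays in [0, 1] (completed and ground-truth images)."""
    psnr = peak_signal_noise_ratio(target, output, data_range=1.0)
    ssim = structural_similarity(target, output, channel_axis=-1, data_range=1.0)
    # lpips expects (1, 3, H, W) tensors scaled to [-1, 1]
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    lp = lpips_fn(to_tensor(output), to_tensor(target)).item()
    return psnr, ssim, lp
```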
(a) CelebA dataset [Liu et al., 2015]  (b) CelebA-HQ dataset [Lee et al., 2020]  (c) MultiPIE dataset [Gross et al., 2010]
Figure 3.3 Quantitative Evaluation: We perform face completion over (a) CelebA, (b) CelebA-HQ and (c) MultiPIE datasets across a range (0-90%) of mask-to-face area ratios and evaluate the PSNR, SSIM and LPIPS [Zhang et al., 2018b] metrics. Our proposed 3DFaceFill consistently outperforms all the baselines across all the datasets and mask-to-face area ratios. [Panels plot PSNR (dB, ↑), SSIM (↑) and LPIPS (↓) against the mask/face ratio (%) for 3DFaceFill (Ours), PICNet, DeepFillv2, SymmFCNet and GFC.]

Dataset           Metric      GFC [Li et al., 2017b]   SymmFC [Li et al., 2020a]   DeepFillv2 [Yu et al., 2019]   PIC [Zheng et al., 2019a]   3DFaceFill
CelebA            PSNR (↑)    27.0298                  25.8817                     28.2097                        28.1262                     30.4917
                  SSIM (↑)    0.9257                   0.9273                      0.9356                         0.9424                      0.9521
                  LPIPS (↓)   0.1134                   0.0537                      0.0499                         0.0362                      0.0326
CelebAHQ          PSNR (↑)    25.5836                  25.6203                     27.9885                        27.7020                     29.9398
                  SSIM (↑)    0.8895                   0.9232                      0.9311                         0.9380                      0.9492
                  LPIPS (↓)   0.1076                   0.0535                      0.0394                         0.0376                      0.0365
MultiPIE (Pose)   PSNR (↑)    24.7557                  24.7177                     26.3385                        26.4301                     27.8226
                  SSIM (↑)    0.9187                   0.9289                      0.9383                         0.9451                      0.9482
                  LPIPS (↓)   0.0822                   0.0692                      0.0527                         0.0471                      0.0409
MultiPIE (Illu)   PSNR (↑)    23.5749                  24.4813                     26.4981                        26.2938                     27.8865
                  SSIM (↑)    0.8676                   0.8618                      0.8718                         0.8825                      0.8935
                  LPIPS (↓)   0.1232                   0.0747                      0.0640                         0.0540                      0.0484
Internet          PSNR (↑)    24.1775                  24.2829                     26.4957                        25.6326                     28.8463
                  SSIM (↑)    0.9042                   0.9168                      0.9293                         0.9317                      0.9526
                  LPIPS (↓)   0.0913                   0.0625                      0.0493                         0.0466                      0.0390
Table 3.1 Quantitative evaluation of face-completion across the CelebA [Liu et al., 2015], CelebAHQ [Lee et al., 2020], subset of MultiPIE [Gross et al., 2010] with pose variations, subset of MultiPIE with illumination variations and internet-downloaded in-the-wild images (Internet) datasets (averaged over all mask-to-face ratios). Our method performs significantly better than other approaches in terms of PSNR, SSIM and LPIPS [Zhang et al., 2018b].

Overall, 3DFaceFill improves PSNR by 2-3 dB and LPIPS by 5-10% over the closest baselines. In addition, for all the methods, we report PSNR, SSIM and LPIPS as a function of the mask-to-face area ratio (#MaskPixels / #FacePixels) in Figs. 3.3a, 3.3b and 3.3c for the CelebA, CelebA-HQ and MultiPIE datasets, respectively. For the CelebA dataset, we also show the error bands for each method. We make the following observations: (1) Across all the datasets, 3DFaceFill achieves significantly better PSNR and LPIPS across all mask ratios. (2) As can be seen from the error bands in Fig. 3.3a, the worst face completions by 3DFaceFill are better than the best completions from most baselines. (3) Among the baselines, PIC [Zheng et al., 2019a] and DeepFillv2 [Yu et al., 2019] perform comparably, with the former being slightly better in terms of LPIPS. (4) The effectiveness of 3DFaceFill over the baselines is more apparent as larger parts of the face are to be completed, i.e., as the mask ratio increases.
(5) On the CelebA dataset [Liu et al., 2015], the improvement ranges from ∼2 dB PSNR for the 0-10% mask ratio to ∼4 dB PSNR for the 60-80% mask ratio. In terms of LPIPS, the improvement ranges from 5% for the 0-10% mask ratio to 25% for the 60-90% mask ratio. Similar trends are seen across the CelebA-HQ [Lee et al., 2020] and MultiPIE [Gross et al., 2010] datasets. These results confirm our hypothesis that explicitly modeling the image formation process leads to significantly better face completion. We provide additional quantitative comparisons against PConv [Liu et al., 2018], DSA [Zhou et al., 2020a] and UVGAN [Deng et al., 2018] in the supplementary, since these results are based on a limited number of author-provided completions in the absence of source code.

Method                     PSNR (↑)   SSIM (↑)   LPIPS [Zhang et al., 2018b] (↓)
DSA [Zhou et al., 2020a]   28.6205    0.9375     0.0436
PConv [Liu et al., 2018]   29.3067    0.9479     0.0379
3DFaceFill                 31.8823    0.9615     0.0335
Table 3.2 Quantitative comparison of the proposed 3DFaceFill vs. PConv [Liu et al., 2018] and DSA [Zhou et al., 2020a] on a small set of completed images obtained from the authors.

3.2.2 Qualitative Evaluation
Figs. 3.4 and 3.5 qualitatively compare face completion between 3DFaceFill and the baselines over a wide variety of challenging conditions. Completions by the baselines are less photorealistic and often contain artifacts in scenarios with dark complexion, tend to deform facial components (e.g., the nose), and fail to preserve symmetry (e.g., eye-gaze or eyebrow shape). In addition, the baselines tend to deform the shape of small faces (e.g., children), since they are mostly trained on adult faces where the relative proportions of facial parts differ significantly. In contrast, 3DFaceFill generates more photorealistic completions in all these cases (diverse conditions and mask types) due to explicit 3D shape modeling, the incorporation of symmetry priors, and the disentanglement of pose and illumination.

Figure 3.4 Qualitative evaluation under diverse conditions (complexion, pose, illumination). [Row groups: darker complexion, large poses, illumination contrast. Columns: Input, GFC [Li et al., 2017b], SymmFC [Li et al., 2020a], DeepFillv2 [Yu et al., 2019], PIC [Zheng et al., 2019a], 3DFaceFill (Ours), Ground Truth.]

Figure 3.5 Qualitative evaluation under diverse conditions (eye-gaze, shape). [Row groups: asymmetry in eye-gaze, shape deformations. Columns: Input, GFC [Li et al., 2017b], SymmFC [Li et al., 2020a], DeepFillv2 [Yu et al., 2019], PIC [Zheng et al., 2019a], 3DFaceFill (Ours), Ground Truth.]

Figure 3.6 Qualitative evaluation of 3DFaceFill vs. PConv [Liu et al., 2018] and DSA [Zhou et al., 2020a] on a subset of images received from the respective authors. The text on the left of each row mentions the specific deformities in the baselines (blurriness, artifacts, asymmetry and other geometric deformations) that are not present in the completions by 3DFaceFill. [Row annotations: missing eyebrows; blurred eyes and nose, illumination contrast; blurry cheeks; asymmetric eye-gaze; blurry deformation near the mouth and asymmetric eye-gaze. Columns: Input, DSA [Zhou et al., 2020a], PConv [Liu et al., 2018], 3DFaceFill (Ours), Ground truth.]

3.2.3 Comparison against PConv and DSA
PConv [Liu et al., 2018] and DSA [Zhou et al., 2020a] have not released publicly available source code or pre-trained models. Hence, to compare against them, we obtained face completions for a small set of 14 partial images through correspondence with the respective authors3. We show qualitative results in Fig. 3.6.
One can observe that while PConv [Liu et al., 2018] and DSA [Zhou et al., 2020a] tend to deform the facial components under certain conditions leading to geometric and photometric artifacts, 3DFaceFill is free of such artifacts and generates more realistic completions. In addition, we provide quantitative metrics on this small set in Tab. 3.2, where 3DFaceFill reports better PSNR, SSIM and LPIPS [Zhang et al., 2018b] metrics over both the baselines. 3.2.4 Cross-Dataset Evaluation To further demonstrate the improved generalization performance and robustness afforded by our method, we perform a cross-dataset comparison on the pose and illumination varying images from the MultiPIE [Gross et al., 2010] dataset, using models that were trained on the CelebA dataset [Liu et al., 2015]. Note that most baselines [Yu et al., 2018, Li et al., 2017b, Zheng et al., 2019a, Zhou et al., 2020a] do not perform such an evaluation. We split the MultiPIE [Gross et al., 2010] dataset into two subsets: (1) a pose varying subset with constant frontal illumination and ex- pression, referred to as MultiPIE:Pose and (2) an illumination varying subset with constant frontal pose and expression, referred to as MultiPIE:Illu. Table 3.1 reports the PSNR, SSIM and LPIPS [Zhang et al., 2018b] metrics for all the methods on these two splits. It can be seen that 3DFaceFill significantly outperforms the baselines in both the splits. Further, we show qualitative results by 3DFaceFill vs. the baselines DeepFillv2 [Yu et al., 2019] and PIC [Zheng et al., 2019a] in Fig. 3.7 (for Pose) and Fig. 3.8 (for Illumination), respectively. From Fig. 3.7, one can observe that the baselines tend to generate fuzzy and deformed faces for extreme poses while 3DFaceFill generates sharper and geometry-preserving completions. And, in the illumination-varying case, DeepFillv2 [Yu et al., 2019] tends to generate artifacts and PIC [Zheng et al., 2019a] tends to generate asym- metric completions for extreme illumination, whereas the completions by 3DFaceFill are free of such artifacts and preserve illumination contrast and symmetry. 3The images provided by PConv’s authors were obtained from a model trained on 512x512 sized images, vs. 256x256 for the other baselines including 3DFaceFill. 35 Input DeepFillv2 [Yu et al., 2019] PIC [Zheng et al., 2019a] 3DFaceFill (Ours) Ground Truth Figure 3.7 Qualitative evaluation on the MultiPIE:Pose dataset. Image completion by 3DFaceFill vs. baselines DeepFillv2 [Yu et al., 2019] and PIC [Zheng et al., 2019a] on the pose-varying MultiPIE:Pose split [Gross et al., 2010]. While the baselines tend to generate blurred and deformed faces in extreme poses, 3DFaceFill is pose-robust and generates more accurate completions across a range of pose. 36 Input DeepFillv2 [Yu et al., 2019] PIC [Zheng et al., 2019a] 3DFaceFill (Ours) Ground Truth Figure 3.8 Qualitative evaluation on the MultiPIE:Illu dataset. Image completion by 3DFaceFill vs. the baselines DeepFillv2 [Yu et al., 2019] and PIC [Zheng et al., 2019a] on the illumination varying MultiPIE:Illu split [Gross et al., 2010]. While the baselines tend to generate artifacts in extreme illuminations, 3DFaceFill generates completions that look geometrically accurate and preserve the illumination contrast. 37 Input Mask DeepFillv2 [Yu et al., 2019] PIC [Zheng et al., 2019a] 3DFaceFill Input 3DFaceFill Input 3DFaceFill Input 3DFaceFill Figure 3.9 Face de-occlusion on real occlusions. The baselines DeepFillv2 and PIC generate non-realistic completion (e.g. 
asymmetric eye-gaze in row 1 and blurry shape in row 2), whereas 3DFaceFill performs realistic de-occlusion, maintaining the structural and photometric integrity of the face.

3.2.5 Real Occlusions
One of the potential applications of face completion is de-occlusion. This is usually challenging when faces have large pose, illumination or shape variations. Fig. 3.9 shows a few real-world de-occlusion examples of faces in such conditions. Notice that, in cases of challenging pose, illumination, etc., the baselines tend to generate blurry and asymmetric face completions, whereas 3DFaceFill performs more realistic de-occlusion.

3.2.6 Comparison against UVGAN
The proposed face completion method, 3DFaceFill, has three parts: (i) disentangling the 2D image into factors such as 3D pose, 3D shape, albedo and illumination (IL), (ii) enforcing symmetry in the UV albedo (SYM), and (iii) iterative refinement of face completion through progressively more accurate 3D pose and shape estimation (IR). UVGAN [Deng et al., 2018], on the other hand, (i) performs completion of the missing texture in the UV-representation due to self-occlusion instead of completing a partial face image itself, (ii) unlike 3DFaceFill, does not disentangle the texture further into albedo and illumination, (iii) does not impose a symmetry prior on the UV texture, and (iv) uses the 3DMM on a fully visible face image rather than a partial image to obtain the texture. Since no source code or pretrained model of UVGAN is available, we evaluate these differences in two ways: (A) by reformulating UVGAN for face completion, and (B) by comparing UVGAN with our Sym-UNet model on their publicly released texture dataset. We now present the two evaluations.

3.2.7 Comparison with UVGAN [Deng et al., 2018] Reformulated for Face Completion
To simulate UVGAN [Deng et al., 2018] for face completion, we remove the illumination disentanglement (IL), symmetry loss (SYM) and iterative refinement (IR) from 3DFaceFill (refer to Fig. 3.10). We call the variant with SYM UVGAN-Sym, and the variant with both IL and SYM 3DFaceFill-NoIR. Adding IR gives our full model, 3DFaceFill. We compare the above-mentioned variants for face completion on the CelebA [Liu et al., 2015] dataset and report the quantitative and qualitative results in Fig. 3.10.

Method            IL   SYM   IR   PSNR (↑)   LPIPS (↓)
UVGAN             ✗    ✗     ✗    28.719     0.0383
UVGAN-Sym         ✗    ✓     ✗    28.621     0.0392
3DFaceFill-NoIR   ✓    ✓     ✗    29.959     0.0334
3DFaceFill        ✓    ✓     ✓    30.492     0.0326
Figure 3.10 Comparing UVGAN [Deng et al., 2018] reformulated for face completion vs. 3DFaceFill. [Qualitative panels: Input, UVGAN, UVGAN-Sym, 3DFaceFill, Ground truth.]

One can observe that 3DFaceFill significantly outperforms UVGAN as well as the other variants, both quantitatively and qualitatively. Further, we can see that introducing the symmetry loss (SYM) in UVGAN-Sym hurts performance since, unlike the UV-albedo, the UV-texture is not inherently symmetric in faces because of the entangled illumination. Completion on the disentangled albedo (IL) instead improves performance in 3DFaceFill-NoIR. Lastly, iterative refinement (IR) further improves completion on top of IL and SYM. This demonstrates the effectiveness of the novelties that 3DFaceFill introduces over UVGAN [Deng et al., 2018].

Figure 3.11 Qualitative evaluation of texture completion by the proposed Sym-UNet on the UVDB-MPIE dataset [Deng et al., 2018]. [Panels: (a) Input, (b) 3DFaceFill, (c) Groundtruth.]

3.2.8 Sym-UNet vs. UVGAN on Texture Completion
In this evaluation, we trained our Sym-UNet model on the UVDB-MPIE texture dataset released by the authors of UVGAN [Deng et al., 2018]. We split the dataset into an 80:20 train-test split and resized the texture maps to 192 × 256 for training. Similar to UVGAN, we do not include the symmetry loss, because the presence of illumination variations and the availability of synthetically completed texture maps reduce the utility of the symmetry loss. The rest of Sym-UNet is retained as such. On the test set, we report a PSNR of 30.1 (vs. UVGAN's 25.8) and an SSIM of 0.937 (vs. UVGAN's 0.886). Further, we show qualitative results in Fig. 3.11, where we see that our completed textures resemble the ground truth closely (we do not have the corresponding completions by UVGAN). Thus, our proposed Sym-UNet network is comparatively better suited for UV-completion than the network used in UVGAN [Deng et al., 2018].

3.2.9 3D View Synthesis of Masked Faces
3DFaceFill has a unique advantage over other face completion approaches: unlike existing methods, it can not only complete partial faces, but also render new views of the completed face from different viewpoints. In Fig. 3.12, we show this through examples of face views rendered from five different viewpoints by completing the missing albedo and self-occluded regions in the masked faces.

Figure 3.12 3D Face View Synthesis. 3DFaceFill has the unique ability to not just complete masked faces realistically, but also synthesize new views from them. [Panels: (a) Input, (b) Completed and synthesized face views.]

3.2.10 Ablation Studies
3.2.10.1 Effect of Iterative Refinement
To evaluate the effectiveness of iteratively refining face completion at inference, we compare the PSNR, SSIM and LPIPS [Zhang et al., 2018b] metrics on raw output images (before blending with the visible image) at each iteration. As reported in Table 3.3, iteration 2 significantly improves upon iteration 1 on all the metrics. After iteration 2, the metrics become more or less stable, with a slight dip in performance. We hypothesize that this is a result of not training the model for iterative refinement and only performing it at inference. Further, we visualize the absolute difference heatmaps between the completed and the original image for both iterations 1 and 2 in Fig. 3.13 to understand which parts of the face benefit most from refinement. Observe that the largest differences are around the high-detail regions (eyes, beards, etc.), which we ascribe to more accurate 3D pose and shape estimation from the completed face after iteration 1 than from the partial face before.

            Iter 1    Iter 2    Iter 3    Iter 4    Iter 5    Iter 6
PSNR (↑)    33.7587   34.5347   34.5018   34.4943   34.4428   34.4018
SSIM (↑)    0.9510    0.9678    0.9675    0.9670    0.9666    0.9652
LPIPS (↓)   0.0192    0.0185    0.0186    0.0187    0.0188    0.0188
Table 3.3 Quantitative evaluation of iterative refinement.

Metric      Full GAN   Patch GAN   NoSym     NoSym+Attn   Full Model
PSNR (↑)    31.7125    31.7969     31.6110   31.7552      32.1950
SSIM (↑)    0.9654     0.9667      0.9665    0.9658       0.9678
LPIPS (↓)   0.0462     0.0442      0.0446    0.0454       0.0410
Table 3.4 Quantitative evaluation between the different ablation models and our full model on masks blocking one half of the face.

[Figure 3.13 panels: Input, Original, Iter1, Iter1 - Orig, Iter2, Iter2 - Orig, Iter2 - Iter1.]
Figure 3.13 Effect of Iterative Finetuning. We show raw completions (without blending) at iterations 1 and 2 along with the difference heatmaps.
Note the improvements in Iter2 over Iter1 and the corresponding heatmap activations around eyes, eye-brows and other edges on the face. Input Original NoSym Model Full Model Full-NoSym Figure 3.14 Effect of using Symmetry. The full model includes Sym-UNet and symmetry loss (dur- ing training) and can copy symmetric features when available. The absolute difference heatmaps (Full-NoSym) shows that most difference is coming from components such as eyes, eye-brows, etc. 43 3.2.10.2 Effect of Symmetry Constraint To evaluate the effectiveness of Sym-UNet and the symmetry loss, we compare two variants of the full model (Sym-UNet + symmetry loss). These include, (1) NoSym: Sym-UNet replaced by standard UNet and with no symmetry loss, and (2) NoSym+Attn: NoSym model plus a self- attention layer after the 3rd upsampling layer in the UNet decoder. Attention layers are commonly employed by many inpainting models [Yu et al., 2018, Yu et al., 2019, Zheng et al., 2019a] for capturing long-range spatial dependencies, so this variant seeks to compare the utility of attention in lieu of symmetry priors for face inpainting. To best evaluate the benefit of symmetry constraints for faces, the above model variations are evaluated on face images masked on one side of the face as shown in Fig. 3.14. The results in Table 3.4 indicate that the full model outperforms all the variants, with NoSym being the worst among them. Also the NoSym+Attn variant does perform slightly better than NoSym but is still far behind the full model. This indicates that, (i) though attention helps in the absence of any prior constraints, explicitly enforcing geometric priors associated with structured objects like faces is significantly more effective than implicitly learning them through attention, and (ii) symmetry is a more useful feature for face inpainting and behaves like an attention on the visible symmetric parts. As shown in Fig. 3.14, compared to the full model, the NoSym variant results in larger inpainting errors as indicated by the difference heatmaps. Therefore, unlike the full model the NoSym model tends to ignore the visible symmetric regions of the face leading to inconsistencies between the visible and inpainted regions. 3.2.10.3 Effect of Symmetry Gating We visualize the intermediate gating maps used in our model that control the flow of informa- tion in the network (ref Fig. 3.15). We visualize two (out of 64) gating activations (1st - Gate1 and 33rd - Gate2) from the second layer of our Sym-UNet network. As can be seen in Fig. 3.15, while Gate1 activates for the visible regions in the input albedo, Gate2 activates for the masked regions to propagate useful features from the horizontally flipped albedo map to the symmetric side. This enables Sym-UNet to leverage and maintain facial symmetry for inpainting. We also visualize the 44 Input Input Albedo Gate 1 Gate 2 Uncertainty σ Output Albedo Figure 3.15 Visualizing the Gating Activations and the Uncertainty-Maps. Observe that, while Gate 1 activates for the visible regions, Gate 2 activates for the masked regions to propagate use- ful features from the visible symmetric parts to their masked counterparts. The uncertainty map captures the model’s uncertainty around the masked regions and the facial components such as the eyes, thus incurring higher losses for these regions. (Note: higher values are represented by warmer (redish) colors in the gating and uncertainty heatmaps). estimated uncertainty map (σ) in Fig. 
3.15 that is learned by the inpainter G in an unsupervised way. Note that the uncertainty is usually higher around important facial components like the eyes and around the masked regions, which increases the loss incurred in these regions.

3.2.11 Discussions
The above-described experiments and ablation studies demonstrate the effectiveness of 3DFaceFill, along with the utility of each of its components, in performing robust face completion in challenging cases of facial pose, shape, illumination, etc. However, the formulation of our proposed approach does impose a dependency on the fidelity of the underlying 3D model. Essentially, our approach cannot inpaint regions that are not included in the underlying 3D model, and the resolution of inpainting depends on the density of the 3D mesh. 3DFaceFill currently uses the BFM model [Paysan et al., 2009], thanks to its widespread support. However, BFM [Paysan et al., 2009] does not include the inner mouth, hair and the upper head, and has limited vertex density around the eyes, which restricts inpainting in these regions. These limitations of the underlying 3D model are not inherent to the proposed approach and do not invalidate the advantages of our model in improving the geometric and photometric consistency of completion. Furthermore, they can potentially be mitigated by substituting BFM with a more detailed 3D face model, such as the Universal Head Model (UHM) [Ploumpis et al., 2020], which includes the inner mouth and detailed eyeballs, along with other improvements.

3.3 Conclusion
In this chapter, we proposed 3DFaceFill, a 3D-aware face completion method. Our solution was driven by the hypothesis that performing face completion on the UV representation, as opposed to the 2D pixel representation, would allow us to effectively leverage the power of 3D correspondence and ultimately lead to face completions that are geometrically and photometrically more accurate. Experimental evaluation across multiple datasets and against multiple baselines shows that face completions from 3DFaceFill are significantly better, both qualitatively and quantitatively, under large variations in pose, illumination, shape and appearance. These results validate our primary hypothesis.

CHAPTER 4
COLA-SDF: CONTROLLABLE LATENT STYLESDF FOR DISENTANGLED 3D FACE GENERATION

Face generation has a long history in the vision and graphics communities. The earliest of these models were based on 3D morphable models (3DMMs) [Paysan et al., 2009, Gerig et al., 2018, Li et al., 2017a]. These models are highly controllable and allow editing of features such as shape, expression, texture, pose, and illumination in a disentangled manner. However, as they are linear models based on principal component analysis (PCA), the faces synthesized by these models lack fine details in shape and appearance. To address this, there has been a growth in nonlinear 3D face reconstruction approaches [Medin et al., 2022, Feng et al., 2021, Tran and Liu, 2018]. These nonlinear approaches have significantly improved the expressivity of 3DMM models but are still far behind the image quality generated by generative adversarial networks (GANs). The strict correspondence assumption is one of the core modeling limitations of 3DMMs. On the one hand, it simplifies modeling drastically, but it limits the ability to model texture, hair and other elements that lack correspondence.
The striking photorealism of 2D style-based GANs [Karras et al., 2019, Karras et al., 2020, Karras et al., 2021], as well as the ability of implicit neural representations [Mildenhall et al., 2020] to learn detailed 3D object representations from 2D images, have led researchers to combine the benefits of both models. The combined models [Gu et al., 2021, Or-El et al., 2022], often referred to as implicit 3D-GANs, can be trained in an unsupervised way to learn and synthesize the 3D structure and high-fidelity texture of faces. Essentially, implicit 3D-GANs learn to generate an implicit representation of a 3D scene that can be rendered using volumetric rendering similar to that in [Mildenhall et al., 2020]. Unlike both linear and nonlinear 3DMMs, highly complex structures that do not follow the correspondence assumption (such as hair) can be part of the model. However, existing implicit 3D-GANs are not able to support disentangled control or editing of physical attributes such as shape, pose, albedo and illumination, and they require complicated inversion-based approaches to perform such editing with limited success [Xia et al., 2022]. 47 The main idea of our proposed model is imparting controllable generation and editing to im- plicit 3D-GANs. Previous methods [Tewari et al., 2020b, Medin et al., 2022] have combined the photorealism of 2D GANs with the controllability of 3DMMs with good success, but both methods suffer from limitations. Because StyleRig [Tewari et al., 2020b] relies on the pretrained StyleGAN, its disentangled controllability is limited to the amount of disentanglement in the pretrained Style- GAN; for example, the inherently 2D nature of StyleGAN hampers its disentanglement of pose from other attributes. MOST-GAN, a nonlinear 3DMM in which the texture map is modeled us- ing the StyleGAN2 architecture, is excellent at modeling the 3D shape and texture of faces, but it is unable to model the hair region in full 3D as there is no point-to-point correspondence across subjects in the hair region. By combining the ability of StyleSDF [Or-El et al., 2022] to learn 3D generation from 2D im- ages with the disentangled controllability of MOST-GAN [Medin et al., 2022], we can retain the best features of both implicit 3D-GANs and nonlinear 3DMMs. By incorporating the nonlinear 3DMM via loss functions only, we maintain the photorealism provided by the StyleSDF architec- ture. The control is enforced during training of the StyleSDF architecture via loss functions that incorporate MOST-GAN’s disentangled parameters using inverse rendering with MOST-GAN’s image decoder. To summarize, in this chapter we propose CoLa-SDF, which imparts controlled face gener- ation and editing to implicit 3D-GAN. Our proposed approach utilizes a differentiable nonlinear 3DMM-based model to supervise the training of an implicit 3D-GAN in order to learn disentangled representations for shape, texture, and illumination. In addition, we employ face parsing (semantic segmentation of face images) to further disentangle a latent code for the hair and background from the latent representation of the face. As a result, CoLa-SDF can generate high-fidelity 3D faces, which can then be edited by changing separate latent codes for shape, texture, illumination, pose, and hair and background, either independently or in various combinations. 
In summary, our main contributions include: • We propose a new method called CoLa-SDF that allow generation and subsequent editing 48 of high-fidelity 3D faces, from which photorealistic 2D images can be rendered in multiple views. • Our method builds upon the architecture of StyleSDF while disentangling the latent repre- sentation into separate latent codes for shape, texture, illumination, and hair and background, thereby allowing independent editing of each attribute. • To achieve disentangled control, we sample in the PCA space of MOST-GAN parameters and introduce novel parametric and image-based consistency losses utilizing the MOST- GAN encodings and face parsing. 4.1 Preliminaries Our method relies on StyleSDF [Or-El et al., 2022] and MOST-GAN [Medin et al., 2022], which we now introduce in more detail. StyleSDF [Or-El et al., 2022] consists of two components: a signed distance function (SDF)-based volume renderer and a styled generator. Given a latent code z ∼ N (0, I), the volume renderer takes in a 3D query point x and a viewing direction v and maps them into an SDF value d(x, z), a radiance c(x, v, z), and a feature vector f (x, v, z). A low-resolution (64×64) image Ivol and feature map F are generated using volume rendering. Each pixel is computed by querying points along the ray r = o + tp originating from the camera position o and passing through the pixel location corresponding to p as follows: Ivol = F(r) = (cid:90) tf tn (cid:90) tf tn T (t)σ(r(t))c(r(t), p)dt, (4.1) T (t)σ(r(t))f (r(t), p)dt, where T (t) = exp (cid:16) − (cid:82) tf tn (cid:17) σ(r(s))ds represents the visibility of each point along the ray. The density field σ(x) is obtained from the SDF d(x) using the following model: σ(x) = 1 δ Sigmoid (cid:18) −d(x) δ (cid:19) , (4.2) 49 where δ is a learned parameter. The styled generator maps the feature map F into a high-resolution image I conditioned on the style-code w = g(z). The volume renderer and the styled generator are trained separately. First, the volume renderer is trained along with a low-resolution discriminator in an adversarial way. Then, the volume ren- derer’s weights are frozen, and the styled generator is trained in an adversarial way, along with a high-resolution discriminator. The volume renderer loss Lvol consists of the non-saturating GAN loss with R1 regularization [Mescheder et al., 2018] Ladv, pose alignment loss Lview, eikonal loss Leik, and minimal surface loss Lsurf: Lvol = Ladv + λviewLview + λeikLeik + λsurfLsurf, (4.3) where λview, λeik, and λsurf are the weights for the pose, eikonal, and minimal surface losses, re- spectively. The pose alignment loss Lview enforces that the generated images follow the input pose. This loss is applied both on the volume generator, as well as the low-resolution discriminator (but only when iterating through generated images). For this, the low-resolution disciminator is modified, such that, in addition to the image score, it also predicts the pose ( ˆϕ, ˆθ) of the image. The pose alignment loss is defined as the smoothed L1 loss between the pose (ϕ, θ) used by the volume renderer to generate images, and the pose ( ˆϕ, ˆθ) predicted by the low-resolution discriminator: Lview =    (ˆθ − θ)2 if |ˆθ − θ|≤ 1 |ˆθ − θ| otherwise (4.4) The eikonal loss enforces physical validity of the signed distance field [Gropp et al., 2020]: Leik = E(x (||∇d(x)||2−1)2 . 
The styled generator maps the feature map F into a high-resolution image I conditioned on the style-code w = g(z).

The volume renderer and the styled generator are trained separately. First, the volume renderer is trained along with a low-resolution discriminator in an adversarial way. Then, the volume renderer's weights are frozen, and the styled generator is trained in an adversarial way, along with a high-resolution discriminator. The volume renderer loss L_vol consists of the non-saturating GAN loss with R1 regularization [Mescheder et al., 2018] L_adv, the pose alignment loss L_view, the eikonal loss L_eik, and the minimal surface loss L_surf:

L_vol = L_adv + λ_view L_view + λ_eik L_eik + λ_surf L_surf,   (4.3)

where λ_view, λ_eik, and λ_surf are the weights for the pose, eikonal, and minimal surface losses, respectively.

The pose alignment loss L_view enforces that the generated images follow the input pose. This loss is applied both on the volume generator and on the low-resolution discriminator (but only when iterating through generated images). For this, the low-resolution discriminator is modified such that, in addition to the image score, it also predicts the pose (φ̂, θ̂) of the image. The pose alignment loss is defined as the smoothed L1 loss between the pose (φ, θ) used by the volume renderer to generate images and the pose (φ̂, θ̂) predicted by the low-resolution discriminator:

L_view = (θ̂ − θ)²   if |θ̂ − θ| ≤ 1,
L_view = |θ̂ − θ|    otherwise.   (4.4)

The eikonal loss enforces the physical validity of the signed distance field [Gropp et al., 2020]:

L_eik = E_x [ (||∇d(x)||_2 − 1)² ].   (4.5)

The minimal surface loss penalizes SDF values that are close to zero, to prevent spurious zero-crossings and non-visible surfaces from being formed:

L_surf = E_x [ exp(−100 |d(x)|) ].   (4.6)

The styled generator is trained using a combination of a path regularization loss L_path as well as L_adv defined above:

L_gen = L_adv + λ_path L_path,   (4.7)

where λ_path is the weight of the path loss.

MOST-GAN [Medin et al., 2022] is a nonlinear 3DMM that includes a set of encoders for shape E_α, albedo E_τ, illumination E_γ, and pose E_θ, a shape decoder G_α and an albedo decoder G_τ. Given a face image, the encoders extract the shape parameters α, the albedo parameters τ, the spherical harmonics illumination parameters γ [Ramamoorthi and Hanrahan, 2001, Zhang and Samaras, 2006] and a 3D pose θ. The decoders generate the full 3D shape S and albedo map A: G_α : α → S, G_τ : τ → A. Next, a differentiable renderer R [Ravi et al., 2020] renders the reconstructed face image I_most from the generated 3D model, lighting and pose parameters: I_most = R(S, A, γ, θ). In this work, we use the pre-trained MOST-GAN weights provided by the authors.

4.2 Approach
4.2.1 Overview
Our proposed approach is based on building a semantically disentangled latent space for an implicit 3D GAN, such that each part of the latent code corresponds to a different physical attribute. We achieve this by enforcing a correspondence between the latent codes for these factors (shape, albedo and illumination) and the parameters of a 3DMM, which has built-in disentangled representations of these parameters. Pose control can be easily handled using 3D volume rendering and the view-dependence property of implicit 3D GANs [Gu et al., 2021, Or-El et al., 2022]. However, 3DMMs do not facilitate disentanglement of hair and background, because these attributes are not represented well in 3DMM models. In order to encourage part of the latent code to correspond to hair and background, we introduce a photo-consistency loss on the hair and background regions of the generated images that encourages different faces generated using the same hair and background codes to have consistent hair and background appearance.

Figure 4.1 Overview: (Top) The SDF volume renderer generates the low-resolution SDF surface, image and feature map conditioned on the latent codes z_α, z_τ, z_γ, z_hairbg and z_rest, which the styled generator decodes into a high-resolution image. (Bottom) To disentangle shape, albedo and illumination, we enforce parametric consistency between the sampled latent codes and the MOST-GAN encodings α, τ, γ, θ. To disentangle hair/background, we alternately resample the face parameters z_α, z_τ, and z_γ and enforce image-based consistency on the hair and background, followed by resampling z_hairbg and enforcing consistency on the face regions.

Disentangling the latent space of an implicit 3D GAN according to a 3DMM requires the 3DMM to be differentiable and highly expressive, so for our model we adopted the nonlinear 3DMM model MOST-GAN [Medin et al., 2022], as it matches these requirements.
For our implicit 3D GAN architecture, we selected StyleSDF [Or-El et al., 2022], both because of its high rendering quality and because it explicitly models the object's 3D shape in the form of a signed distance field (SDF). Since our proposed modifications and enhancements to StyleSDF enable disentangled control of physical attributes by modifying disjoint segments of its latent code, we call our model Controllable Latent StyleSDF (CoLa-SDF).

4.2.2 Architecture
At the core of our method, we use StyleSDF [Or-El et al., 2022] and largely maintain its architecture. In order to successfully disentangle the latent code, we make two key changes to StyleSDF (refer to Fig. 4.1). First, we partition the 256-dimensional latent code z into separate latent codes that will correspond to the face shape z_α, albedo z_τ, illumination z_γ, and hair and background z_hairbg. We also introduce a final segment of the latent code, z_rest, which the model is free to assign to any facial appearance factors not explained by MOST-GAN [Medin et al., 2022]. Second, we modify the training method for StyleSDF and incorporate novel consistency loss functions. One set of consistency loss functions enforces consistency between the latent codes that generate a face and the parameters that MOST-GAN extracts from the generated face image. A second set of consistency loss functions minimizes the impact that changes in z_hairbg can have on the face appearance, and it similarly minimizes the effect that the face-specific latent codes can have on the hair and background appearance. Careful design of both the latent code factorization and the consistency losses during training is crucial to attain the desired disentanglement. We now describe these in detail.

4.2.3 Latent Code Factorization
We partition the 256 dimensions of the latent code z into disjoint subsets: 128 dimensions corresponding to the MOST-GAN [Medin et al., 2022] attributes, further partitioned into z_α, z_τ, and z_γ; 64 dimensions z_hairbg corresponding to the hair and background appearance; and 64 dimensions z_rest to account for any remaining details in and around the face. To determine the dimensionality to allot to each of the MOST-GAN factors z_α, z_τ, and z_γ, we perform eigen-decomposition of the corresponding data covariance matrices Σ_α, Σ_τ, and Σ_γ, respectively, which we obtain by encoding images in the FFHQ [Karras et al., 2019] dataset into the MOST-GAN [Medin et al., 2022] shape α, albedo τ, and illumination γ parameters using the pre-trained encoders. Based on this analysis, we chose a dimensionality of d_α = 37 for z_α and d_τ = 64 for z_τ, which accounted for well over 95% of the variance in their respective distributions. In order to enable full explicit control over the 27 spherical harmonics lighting parameters used in MOST-GAN, we chose d_γ = 27 for z_γ. Since we desire z_ω ∼ N(0, I) for ω ∈ {α, τ, γ}, we use the eigen-decomposition to create a mapping between the parameter encoding of MOST-GAN [Medin et al., 2022] and the corresponding latent codes in our model:

ω_sample = U′_ω Λ′_ω z_ω + µ_ω,   (4.8)

where U′_ω and Λ′_ω are the top d_ω eigenvectors and eigenvalues of Σ_ω and µ_ω is the data mean.
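The mapping of Eq. 4.8 can be estimated with a standard eigen-decomposition of the encoded parameters. The NumPy sketch below follows the equation as written (scaling by the eigenvalues Λ′_ω; some implementations would instead use their square roots so that samples match the data covariance), and the function names are our own:

```python
import numpy as np

def fit_mapping(params, d):
    """Fit the per-attribute mapping of Eq. 4.8 (sketch).
    params: (N, D) MOST-GAN encodings (e.g. shape alpha over FFHQ); d: retained dimensions d_omega."""
    mu = params.mean(axis=0)
    cov = np.cov(params - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenpairs in ascending order
    order = np.argsort(eigvals)[::-1][:d]       # keep the top-d eigenpairs
    U = eigvecs[:, order]                       # (D, d) eigenvectors U'_omega
    lam = eigvals[order]                        # (d,)   eigenvalues Lambda'_omega
    return U, lam, mu

def sample_parameters(U, lam, mu, z):
    """Map a standard-normal latent z_omega to the MOST-GAN parameter space: U' Lambda' z + mu."""
    return U @ (lam * z) + mu
```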
4.1, in both stages we introduce new consistency losses that we will describe in Sec. 4.2.4.2. In the first stage, training the volume renderer, the loss Lvol consists of the non-saturating GAN loss with R1 regularization [Mescheder et al., 2018] Ladv, pose alignment loss Lview, eikonal loss Leik, and minimal surface loss Lsurf, as defined in [Or-El et al., 2022]. In the second stage, training the styled 2D generator, the loss Lgen consists of a path regularization loss Lpath as well as Ladv defined above: Lvol = Ladv + λviewLview + λeikLeik + λsurfLsurf, Lgen = Ladv + λpathLpath, (4.9) where λview = 15, λeik = 1, λsurf = 1 and λpath = 2. 4.2.4.2 CoLa-SDF Losses: MOST-GAN Consistency and Hair/Background Consistency We introduce the MOST-GAN consistency and hair consistency losses to both stages of train- ing, in addition to the original StyleSDF losses (4.9). In the first stage, our new losses are applied to the low-res images, while in the second stage, they are applied to the high-res images. We enforce consistency of the rendered image with respect to the sampled MOST-GAN [Medin et al., 2022] parameters using the MOST-GAN consistency loss Lmost: Lmost = λαLα + λτ Lτ + λγLγ + λθLθ, (4.10) 54 where Lα = ||Eα(I) − αsample||2 2 enforces that the MOST-GAN’s shape encoding of rendered image Eα(I) is the same as the sampled shape parameters αsample obtained from Eq. 4.8. Similarly, we define the albedo consistency loss Lτ and the illumination consistency loss Lγ as ℓ2-error losses between the predicted MOST-GAN parameters and the sampled parameters. We enforce pose- consistency between the pose encodings over the two sub-iterations as Lθ = ||Eθ(Is1)−Eθ(Is2)||2 2. We set λα = 3000, λτ = 100, λγ = 100 and λθ = 1000. Existing 3DMM-based approaches do not model hair and background. Hence, to disentangle hair/background from other physical attributes, we adopt a novel approach where we force the hair/background code zhairbg to only model the hair and background. Specifically, we perform a second sub-iteration followed by each generator iteration, where, during even iterations, we re- sample zα, zτ and zγ, and enforce hair and background consistency using Lhairbg. In the odd iterations, we re-sample zhairbg and enforce face consistency using Lface. The hair/background and face consistency losses are defined as: Lhairbg = Lphoto(Is1, Is2, Mh) + Lvgg(Is1, Is2, Mh) Lface = Lphoto(Is1, Is2, Mf ) + Lvgg(Is1, Is2, Mf ) (4.11) (4.12) Here, Is1 and Is2 are the images rendered in sub-iterations 1 and 2, respectively, Mh = Mhairbg,s1 ∪ Mhairbg,s2 is the union of the hair masks from the two sub-iterations, and Mf = Mface,s1 ∪ Mface,s2 is the union of the face masks from the two sub-iterations. We use a pre-trained face parser [Chen et al., 2017a] to parse the rendered face images into one segmentation masks for the face and one for hair and background. We define the masked photometric loss as Lphoto(x1, x2, m) = ||(x1 − x2) ⊙ m||1, where ⊙ is the element-wise product operator. Similarly, we define the masked perceptual loss as Lvgg(x1, x2, m) = ||ϕ(x1 ⊙ m) − ϕ(x2 ⊙ m)||2 2. Thus, the overall loss for stage 1, volume renderer training, is given by: Lcola vol = Lvol + Lmost + λhairbgLhairbg + λfaceLface. Similarly, the overall loss for stage 2, the training of the 2D styled generator, is given by: Lcola gen = Lgen + Lmost + λhairbgLhairbg + λfaceLface. 55 (4.13) (4.14) We set λhairbg = 5 in even iterations but = 0 in odd iterations, and λface = 5 in odd iterations but = 0 in even iterations, for both Eqs. 
4.2.4.3 Initialization of Each Stage
To obtain meaningful MOST-GAN encodings and face parsing, we need the generated images to look like faces. Hence, we initialize each stage by training with only the StyleSDF-based losses for up to 5,000 iterations, after which L_most, L_hairbg and L_face are introduced. Failing to do so may result in longer training times and poor convergence.

4.3 Experiments
4.3.1 Implementation Details
We trained CoLa-SDF's volume renderer and styled generator separately for 400,000 and 200,000 iterations, respectively. We trained the volume renderer with a batch size of 20 and a ray-sampling frequency (samples per ray) of 24, on a machine with an Intel Xeon Gold 6326 processor with 64 cores and 10 Nvidia A40 GPUs. While training the styled generator, we freeze the volume renderer and the renderer mapping network and increase the ray-sampling frequency to 64. We trained the styled generator with a batch size of 40 on the same machine. Training the volume renderer takes 3 days and the styled generator takes 4 days on this machine.

4.3.2 Datasets and Evaluation
We train our model on the FFHQ dataset [Karras et al., 2019], which consists of 70,000 high-resolution portrait face images spanning varying age, ethnicity, and image conditions. We evaluate our model in terms of both its face generation and its subsequent editing capabilities. To evaluate generation quality numerically, we compare our model's capability to generate photorealistic images with that of existing methods in terms of FID. To evaluate image editing, we demonstrate our model's capability to disentangle the latent space for shape, albedo, illumination and hair/background and to explicitly edit these properties.

Method                                 FID (↓)
GRAF [Schwarz et al., 2020]            79.2
PiGAN [Chan et al., 2021]              83.0
GIRAFFE [Niemeyer and Geiger, 2021]    31.2
Ours                                   19.4
StyleSDF [Or-El et al., 2022]          11.5
Table 4.1 FID evaluations at 256x256 resolution. Our method, while enabling disentanglement, demonstrates the second-best performance.

4.3.3 Face Generation
We demonstrate the face generation capability of CoLa-SDF by rendering face images from multiple viewpoints. Our method's view-consistent synthesis is demonstrated in Fig. 4.2, which renders two randomly generated faces at viewpoints of up to ±0.45 radians azimuth and ±0.225 radians elevation. To demonstrate the quality of the underlying 3D surface, we also show the corresponding marching-cubes mesh obtained from the signed distance field. In addition, for each example, we map the latent code for shape z_α to the MOST-GAN parameter α using Eq. (4.8) and generate the corresponding MOST-GAN mesh using its decoder, S = G_α(α). As shown in the figure, the generated MOST-GAN meshes correspond well with the images and the marching-cubes mesh generated by our method, which demonstrates that CoLa-SDF has learned a high degree of correspondence with MOST-GAN.

To quantitatively evaluate the image generation quality of our method, we compute the FID [Heusel et al., 2017] metric after downsampling the generated images to a resolution of 256×256. We compare our method against the FIDs reported by GRAF [Schwarz et al., 2020], PiGAN [Chan et al., 2021], GIRAFFE [Niemeyer and Geiger, 2021] and StyleSDF [Or-El et al., 2022]. As shown in Tab. 4.1, while StyleSDF reports the best FID, our method is a close second, a small price to pay for our method's disentangled control over the latent space.
4.3.4 Disentanglement of the Latent Space While most 3DMM-based models can only disentangle shape, albedo, and illumination, our model additionally provides separate control over hairstyle and background. In the following sub- sections, we qualitatively and quantitatively evaluate CoLa-SDF’s latent space disentanglement in 57 SDF surface MOST-GAN SDF surface MOST-GAN Figure 4.2 Multiview image renderings and 3D shapes extracted from SDF from CoLa-SDF, along with the corresponding MOST-GAN [Medin et al., 2022] reconstructions. 58 terms of shape, albedo, illumination, and hair/background. Shape, Albedo, Lighting and Hairstyle Manipulation: To demonstrate the disentanglement ca- pability of our model, we manipulate the shape, albedo, lighting, and hair and background of generated faces and show their variations (see Fig. 1.4). For a face image generated using some latent code z, we modify attributes of the image by independently resampling one or more of zα, zτ , zγ and zhairbg from the latent space and replacing the original values by the resampled values for the selected portions of z. Then we use the modified latent code to generate a modified image. While MOST-GAN’s shape parameters zα correspond to both identity and expression variations, many individual dimensions of zα correspond more with either identity or expression. By altering the values in these dimensions, we can change the face shape to selectively focus on either identity- related or expression-related shape changes, as shown in Fig. 1.4. Altering the albedo code results in changes to properties such as lip color, skin tone, facial hair, and eyebrow density, while leaving the face shape virtually unchanged. Similarly, varying the illumination and hair/background latent codes only affect those factors, while maintaining the face’s shape and albedo. Illumination Editing using Spherical Harmonics: Since MOST-GAN’s illumination code is based on the spherical harmonics coefficients [Ramamoorthi and Hanrahan, 2001], we can per- form controlled manipulation of illumination by directly setting the values of the spherical har- monics coefficients, then using Eq. (4.8) to map these values into the space of zγ. We traverse through the first two spherical harmonics bases for each channel and show the illumination vari- ations in Fig. 4.3. Traversing through the first basis results in global illumination change, while traversing through the second basis results in the illumination direction changing from right to left. Notice that as the magnitude and direction of light changes, it affects not only the face but also the hair and background. This is in contrast to 3DMM-based methods like MOST-GAN, which apply illumination only to the face region. As a result, illumination editing using our method is more natural than that of 3DMM-based approaches. To further demonstrate the correspondence between CoLa-SDF’s illumination latent code and the spherical harmonics coefficients [Ramamoorthi and Hanrahan, 2001], we show controlled il- 59 Figure 4.3 Illumination editing using spherical harmonics. For three randomly generated faces, we can alter the lighting by directly modifying the spherical harmonics coefficients. Varying the first spherical harmonics coefficient (left) controls the level of global (ambient) illumination, while the second coefficient (right) controls the illumination’s horizontal directionality. 60 Figure 4.4 Directional rotation of illumination. 61 (a) Shape transfer. (b) Albedo transfer. (c) Lighting transfer. 
(d) Hair transfer.
Figure 4.5 Transferring physical attributes from source to target through the latent code.
lumination manipulation in Fig. 4.4. Starting from an initial illumination setting (shown in the left column), we project it into the spherical harmonics space using Eq. (4.8) and rotate the lighting around the camera axis in increments of π/5 radians (36°). The results demonstrate that CoLa-SDF can perform any desired illumination editing.

4.3.4.1 Attribute Transfer
To further demonstrate the attribute disentanglement of our method, we transfer attributes such as shape, albedo, lighting, and hair and background from a source image (left column) to a target image (top row), as illustrated in Fig. 4.5.
Shape Transfer (Fig. 4.5a): Our method can transfer extreme identity- and expression-related shape variations from the source image to the target image, while keeping the other physical attributes intact. These changes include the width and height of the face, the roundness of the face (row 1), the sharpness of the jawline (row 4), as well as expression changes such as frowning (row 1) and smiling.
Albedo Transfer (Fig. 4.5b): Our model can transfer attributes such as skin tone, thickness of the eyebrows (rows 3 and 4), and lip color. Interestingly, our model can also transfer eyeglasses, which are external to the face and hence not accounted for by any 3DMM model. In addition, we were surprised to observe that hair color is affected by the albedo code in addition to the hair and background code.
Illumination Transfer (Fig. 4.5c): While skin color is a property of the facial albedo, we note that the illumination code can change the tone, hue and brightness of the overall image, including the hair and background.
Hair and Background Transfer (Fig. 4.5d): Notice that transferring the hair and background does not change the identity or other attributes of the face. In this figure, we again observe that while the hair/background code determines the hair geometry/hairstyle, its color is also partly controlled by the albedo code.

4.3.4.2 Identity Consistency across Unrelated Attributes

Face identity match (% of samples with unchanged identity) between original and edited images:
View   Illumination   Hair/background   Shape   Albedo   Shape + Albedo
99.7   98.2           97.7              75.2    65.7     4.7
Table 4.2 Evaluation of face identity consistency as measured by ArcFace [Deng et al., 2019] after resampling non-identity-related latents (view, illumination and hair/background) and identity-related latents (shape, albedo).

(a) Ours  (b) Without face loss L_face
Figure 4.6 Without the face loss L_face, the hair/background latent code does not get fully disentangled from the face. This leads to changes in the face region with the hair/background latent code, as can be seen in the examples to the right.

In this section, we analyze the effect on the identity of the generated face of changing identity-related attributes such as shape and albedo versus non-identity-related attributes such as pose, illumination, and hair and background. We randomly generated 1000 face images from our model and edited their viewpoint, illumination, hair/background, shape, and albedo by resampling their latent codes from the corresponding normal distributions. We extract the identity features from the original and the edited images using the state-of-the-art face-recognition model ArcFace [Deng et al., 2019], and measure the identity match between the original and edited faces (using an ArcFace threshold of 70°). The results, in Tab.
4.2, show that as desired, changes in viewpoint, illumination, and hair and background have minimal impact on the generated face’s identity. In contrast, changing shape and albedo individually cause partial but not complete identity alterations (this corresponds well with human perception of identity changes in Figs. 4.5a and 4.5b), while simultaneously changing both shape and albedo codes results in a clear change of identity. This demonstrates that our method has successfully disentangled the identity-related attributes of face from its non-identity-related attributes. 4.3.5 Ablation Studies The development of CoLa-SDF involved a number of important design choices. To show the effects of some of these choices, these ablation studies demonstrate how various omissions from or additions to CoLa-SDF detract from its overall performance. 64 4.3.5.1 Without face consistency loss CoLa-SDF-NoFaceLoss: As described in Sec. 4.2.4.2, we enforce hair/background disentangle- ment through a combination of the hair/background consistency loss Lhairbg and the face consis- tency loss Lface. The hair/background consistency loss, Lhairbg, ensures that when we keep the hair/background code zhairbg the same but change the other latent codes, the hair/background re- gions in the image will change as little as possible. Similarly, Lface ensures that when we change zhairbg but keep the other latent codes the same, the face region will change as little as possible. To study the importance of the face consistency loss, we train a model without Lface loss and evaluate it in terms of its hair/background disentanglement. We call this variant CoLa-SDF- NoFaceLoss. Specifically, we perform interpolation between two hair/background codes while keeping all the other latent codes the same. If the model has well disentangled hair/background from the face region, changing zhairbg should not affect the face. We show the comparison be- tween our model and CoLa-SDF-NoFaceLoss in Fig. 4.6. Note that, with CoLa-SDF-NoFaceLoss, changing zhairbg changes facial-hair in the first row, and causes shape changes in the second row. On the other hand, with our model, changing zhairbg does not any cause noticeable changes in the face regions. 4.3.5.2 Independent mapping of each attribute: CoLa-SDF-SeparateMappers: In CoLa-SDF, the five latent codes zα, zτ , zγ, zhairbg, and zrest all feed into the same Renderer Mapping Network, which outputs a combined style code w, as shown in the top left of Fig. 4.1. For this ablation study, we replace the single volume renderer mapping network with five separate renderer mapping networks, one for each of shape α, albedo τ , illumi- nation γ, hair/background zhairbg, and zrest. We sample the shape parameters from α ∼ N (µα, Σα), where µα and Σα are the data mean and covariance obtained from MOST-GAN encodings of the FFHQ dataset [Karras et al., 2020] images. We used the same method to sample τ and γ. For sam- pling zhairbg and zrest, we use the standard normal distribution N (0, 1). The individual mappers have similar architecture as the original combined renderer mapping network, but with different input and output dimensions. The input dimensions for shape, albedo, illumination, hairbg, and rest are 65 Ablation Variants Separate Mappers With Perceptual Consistency With Photometric Consistency Ours 19.4 25.85 21.38 23.04 FID (↓) Table 4.3 FID evaluations at 256x256 resolution. 
We concatenate the shape, albedo, illumination, hairbg, and rest style-codes obtained from these independent mappers to form the combined style code, w, which is then passed through the rest of the algorithm exactly as in CoLa-SDF. Note that we adopt separate mappers only during the volume renderer phase; the generator mapping network remains a single network as in CoLa-SDF. This variant, though, results in a loss of image quality and diversity, as evaluated in terms of FID [Heusel et al., 2017] (see Tab. 4.3).

Ablation Variant                 FID (↓)
Separate Mappers                 25.85
With Perceptual Consistency      23.04
With Photometric Consistency     21.38
Ours                             19.4
Table 4.3 FID evaluations at 256×256 resolution. CoLa-SDF with Separate Mappers performs the worst, while enforcing photometric or perceptual consistency losses also harms the FID scores. Our proposed method achieves the best FID score while maintaining latent space disentanglement.

4.3.5.3 Add low-res to high-res consistency loss
CoLa-SDF-Photometric: In this variant, we adopt a photometric consistency loss Lphotocons to enforce consistency between the high-resolution image obtained from the styled generator (after downsampling it to the low-resolution scale) and the low-resolution image obtained from the volume renderer:

Lphotocons = ||down(Igen, size(Ivol)) − Ivol||₂²,  (4.15)

where down(x, (h, w)) downsamples image x to height h and width w using bilinear interpolation. This acts as an additional loss to ensure that the disentanglement of physical attributes in the volume renderer reflects well in the styled generator too.
CoLa-SDF-Perceptual: This variant is similar to CoLa-SDF-Photometric, except that we replace the photometric consistency loss with a perceptual consistency loss [Zhang et al., 2018b]:

Lvggcons = ||ϕ(down(Igen, size(Ivol))) − ϕ(Ivol)||₂²,  (4.16)

where ϕ is the VGGFace [Parkhi et al., 2015] model. We found that both these variants lead to higher FID metrics, as reported in Tab. 4.3, which is an indicator of lower image quality and diversity.

4.3.6 Limitations
CoLa-SDF may sometimes generate artifacts during hair/background editing, as shown in Fig. 4.7. We believe this is due to the model's inability to differentiate between hair and a cap, ending up interpolating between them. In addition, as observed in Figs. 4.5b and 4.5d, CoLa-SDF has learned to model hair geometry through the hair/background code, but hair color is controlled by a combination of the albedo and hair/background codes. This may be due to a correlation between hair color and texture in the training dataset.
Figure 4.7 Spurious artifacts during hairstyle editing.

4.4 Conclusion
We propose a new method called CoLa-SDF that combines the disentangled controllability of nonlinear 3DMM approaches with the high fidelity of implicit 3D GANs for generating 3D faces and rendering them to images. Building upon the architecture of StyleSDF, we enforce the latent space to match the physical parameters of the nonlinear 3D morphable model MOST-GAN, as well as disentangling control of hair and background, a feat we believe is a first of its kind. We demonstrate high-fidelity image synthesis and subsequent 3D manipulation with full control over the disentangled latent parameters. Overall, the proposed model presents a promising solution for generating high-quality 3D faces with controllable properties, which can have practical applications in many areas including AR/VR, dataset synthesis and augmentation, media, and avatar creation.
CHAPTER 5
DIVERSE3DFACE: TOWARDS ROBUST AND DIVERSITY-PROMOTING 3D FACE RECONSTRUCTION FROM SINGLE-VIEW IMAGES
©2022 IEEE. Reprinted, with permission, from Dey, R. and Boddeti, V. N. Generating Diverse 3D Reconstructions from a Single Occluded Face Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1547-1557, 2022.

Single image-based 3D face reconstruction has improved significantly in recent years [Zollhöfer et al., 2018, Egger et al., 2020]. This includes advances in statistical models [Blanz and Vetter, 1999, Paysan et al., 2009, Li et al., 2017a, Ploumpis et al., 2020] as well as neural network-based models [Tewari et al., 2017, Tran and Liu, 2019, Sengupta et al., 2018, Wu et al., 2020, Feng et al., 2021, Gecer et al., 2019, Tran et al., 2019, Wei et al., 2019, Tuan Tran et al., 2017]. However, facial occlusions remain a significant challenge to this task. In-the-wild face images often come with several forms of occlusions, and, unless dealt with explicitly, these often lead to erroneous 3D reconstructions in terms of shape, expression, pose, etc. [Egger et al., 2020, Egger et al., 2018, Tuán Trán et al., 2018].

3D reconstruction of partially occluded faces presents two main challenges. First, 3D reconstruction models need to selectively use features from the visible regions while ignoring those from the occluded parts. Failure to do so, either implicitly or explicitly, will lead to poor 3D reconstructions with an incorrect pose, expression, or both. Second, there could be a distribution of 3D reconstructions that are consistent with the visible parts in the image yet diverse on the occluded parts. Failure to account for all such modes limits the utility of 3D reconstruction models. Addressing these two challenges is the primary goal of this paper.

Existing 3D face reconstruction solutions, however, are ill-equipped to overcome both of these challenges simultaneously. From a reconstruction perspective, a majority of the approaches that reconstruct 3D faces from a single image restrict themselves to fully-visible face images. And even those that explicitly account for facial occlusions [Tuán Trán et al., 2018, Egger et al., 2018] do so only in a holistic manner, using a global model that implicitly uses features from the occluded regions as well. This form of global model-based fitting can introduce errors (see Fig. 1.5) in the pose and expression of the 3D reconstruction, especially when large portions of the face are occluded. From a diversity perspective, existing approaches are, by design, limited to only generating a single plausible 3D reconstruction. However, in many practical applications, for a single occluded face image, it is desirable to generate multiple reconstructions that are consistent on the visible parts of the face, while spanning a diverse yet realistic set of reconstructions on the occluded parts (see Fig. 1.5). While the concept of generating diverse solutions has been explored in other contexts, such as image generation [Elfeki et al., 2019], image completion [Zheng et al., 2019b], super-resolution [Bahat and Michaeli, 2020] and trajectory forecasting [Yuan and Kitani, 2019], it has not been explored for monocular 3D face reconstruction of occluded faces. In this work, we propose Diverse3DFace [Dey and Boddeti, 2022b], which is designed to simultaneously yield a diverse, yet plausible, set of 3D reconstructions from a single occluded face image.
Diverse3DFace consists of three modules: a global + local shape fitting process, a graph neural network based variational autoencoder (Mesh-VAE), and a Determinantal Point Process (DPP) [Kulesza and Taskar, 2012] based iterative optimization procedure. The global + local shape fitting process affords robustness against large occlusions by decoupling shape fitting on the visible regions from that on the occluded regions. The Mesh-VAE enables us to learn a distribution over a compact latent space capturing the different factors of variation in the 3D shapes of faces. And the DPP-based iterative optimization procedure enables us to sample from the latent space of the Mesh-VAE and optimize the samples to generate a diverse set of reconstructions spanning the different modes of the latent space. Our specific contributions in this paper are:
– We propose Diverse3DFace, a simple yet effective diversity-promoting 3D face reconstruction approach that generates multiple plausible 3D reconstructions corresponding to a single occluded face image.
– For robustness to occlusions, we propose a global+local PCA model-based shape fitting that disentangles the fitting on each facial component from the others. The models are learned from a dataset of FLAME [Li et al., 2017a] registered 3D meshes. During inference, the local perturbations on various facial components are added on top of a coarse global fit to generate the final detailed fitting.
– We employ a DPP [Kulesza and Taskar, 2012] based diversity loss in the context of generating diverse 3D reconstructions of faces. We define the quality and similarity terms in the DPP kernel to maximize diversity while remaining in the space of realistic 3D head shapes.
– We conduct extensive qualitative and quantitative experiments to show the efficacy of the proposed approach in generating 3D reconstructions that are faithful to the visible face while simultaneously capturing multiple diverse modes on the occluded parts. The solution from Diverse3DFace that is closest to the ground truth is on average 30-50% better than the unique solutions of the baselines [Feng et al., 2021, Li et al., 2017a] in terms of per-vertex ℓ2-error.

5.1 Preliminaries
Determinantal Point Processes: Determinantal Point Processes (DPPs) originated in quantum physics to model the negative correlations between the quantum states of fermions [Macchi, 1975]. DPPs were first introduced in machine learning by Kulesza and Taskar [Kulesza and Taskar, 2012] as a probabilistic model of repulsion between points. A point process over a ground set 𝒴 describes the probability of each of its 2^|𝒴| subsets. A point process is determinantal when the probability that the randomly drawn set 𝐘 contains a given subset Y ⊆ 𝒴 is given by the determinant of the sub-kernel matrix L_Y indexed by the elements of Y, i.e., P(Y ⊆ 𝐘) = det(L_Y). Given a data matrix B ∈ ℝ^(D×N), we can compute the kernel as the Gram matrix L = BᵀB. In this case, the determinant of the sub-kernel matrix det(L_Y) equals the squared volume spanned by the columns of B indexed by Y. Thus, conceptually, a DPP assigns a higher probability to a subset whose elements tend to be orthogonal (diverse) to each other, thus spanning a larger volume.

5.2 Approach
Reconstructing diverse 3D shapes in a single stage, using only a global model, is sub-optimal due to multiple reasons, as we show in our experiments (Sec. 5.3.1).
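Before detailing the approach, the determinant-as-volume intuition from the DPP preliminaries above can be made concrete with a small numerical sketch (an illustration only, not part of the method):

```python
import numpy as np

# Columns of B are candidate items (e.g., flattened shape completions).
similar = np.array([[1.0, 0.98], [0.0, 0.05]])   # nearly parallel columns
diverse = np.array([[1.0, 0.05], [0.0, 1.00]])   # nearly orthogonal columns

for name, B in [("similar", similar), ("diverse", diverse)]:
    L = B.T @ B  # Gram (DPP) kernel over the two items
    print(name, "det(L) =", round(np.linalg.det(L), 4))

# The near-duplicate pair spans almost no area, so det(L) is close to 0 and the
# DPP assigns it a much lower probability than the near-orthogonal (diverse) pair.
```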
Figure 5.1 Overview: As input, we need the target image, the occlusion mask, facial landmarks, and optionally a face mask. We use the HRNET model [Wang et al., 2020] to obtain both the landmark locations and their confidence values, which we use to estimate the occlusion labels. Given these inputs, we first fit our proposed global + local blendshape model to obtain the coarse and local fittings as outlined in Algorithm 5.1, which we then add together to obtain the final fitting. We re-project the fitted shape onto the visible mask to obtain a partial fit, zeroed out on the occluded regions. We map the partial fit onto a latent space using the Mesh-VAE encoder Emesh and sample N latent vectors z. We then iteratively optimize the z's to capture diverse modes with respect to the occluded regions while remaining consistent with the visible regions, as outlined in Algorithm 5.2, to obtain the final set of 3D reconstructions.

First, fitting a global model to a few visible sub-regions requires striking a careful trade-off between robustness and local fidelity, which is challenging to achieve. Second, diversification of the occluded regions will inadvertently affect the quality of fitting on the visible regions, and vice versa. Given these observations, we propose a three-step approach to generate diverse, yet realistic 3D reconstructions from an occluded face image. In step 1, we use an ensemble of disentangled global+local shape models to perform robust 3D reconstruction with respect to the visible parts of the face. In step 2, we employ a VAE to map the partial fit to a latent space from which multiple reconstructions can be drawn. Finally, in step 3, we iteratively optimize the latent embeddings to promote realistic geometric diversity on the occluded face regions while maintaining fidelity to the visible ones. We now describe our complete algorithm along with its different components.

5.2.1 Global + Local Shape Model
A robust partial 3D reconstruction that accurately fits the visible parts of the face is a prerequisite for generating diverse solutions. Existing approaches to occlusion-robust 3D reconstruction typically employ a global model to fit or regress based on the visible regions [Egger et al., 2018, Tuán Trán et al., 2018]. Because of the global nature of such models, errors in occlusion segmentation affect the quality of the 3D reconstruction [Saito et al., 2016], even on the visible parts (see Fig. 5.3). Typically, strong regularization is employed to mitigate such effects. However, while heavier regularization leads to more robustness against occlusions, it comes at the cost of sub-optimal fitting. This observation, along with the successful application of localized deformation components in computer graphics [Neumann et al., 2013, Schwartz et al., 2020], motivated us to adopt an ensemble of global + local models as an effective approach to generate robust 3D reconstructions with respect to the visible parts. Note that, in this stage of our solution, we are not concerned about the reconstruction quality in the occluded regions.
We now describe the details of our proposed global+local 3D head model. Our global+local shape model is based on the FLAME mesh topology [Li et al., 2017a]. We use the FLAME-registered D3DFACS [Cosker et al., 2011] and CoMA [Ranjan et al., 2018] datasets to compute the local PCA models. The FLAME [Li et al., 2017a] model comes with vertex masks corresponding to 14 parts of the human head. We train individual PCA models corresponding to each of these parts to account for local variations. To do so, we first take FLAME-registered meshes and fit the full FLAME model [Li et al., 2017a] to these by optimizing the following fitting loss:

Lgtfit = min_{α,β,θ} ||Sgt − S̃(α, β, θ)||,  (5.1)

Here Sgt is the ground-truth shape and S̃(α, β, θ) is obtained using Eqs. (2.1) and (2.2). We then unpose both the ground-truth and the fitted shapes by removing the variations due to pose θ, as described in [Li et al., 2017a], and obtain S0gt and S̃(α, 0, β), respectively. The full FLAME model consists of |α| = 300 shape and |β| = 100 expression bases to account for the complete global variations. From these, we retain the top NS shape and NE expression bases (based on eigenvalues) and discard the rest to compute the shape residuals S̃res = S0gt − S̃coarse, where

S̃coarse = S̄ + Σ_{n=1}^{NS} αn Sn + Σ_{n=1}^{NE} βn En  (5.2)

We then compute the region-wise shape and expression PCA models (SRi, ERi) using the region-wise residuals MRi ⊙ S̃res (here MRi is the vertex mask for the i-th region). For computing the shape bases, we set NS = 10 and NE = 100 (removing all expression variations); while for the expression bases, we set NE = 10 and NS = 300 (removing all identity variations). The global + local model can then be represented as

S(αG, αR, βG, θ, βR) = SG(αG, βG, θ) + SR(αR, βR),  (5.3)

where SG(αG, βG, θ) is the coarse global shape given by the top NS shape and NE expression global bases, along with the pose blendshapes P (Eq. (2.2)); and SR(αR, βR) represents the sum of all local variations and is given by

SR(αR, βR) = Σ_{i=1}^{14} (SRi αRi + ERi βRi)  (5.4)
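To make the composition in Eqs. (5.2)-(5.4) concrete, the following is a minimal NumPy sketch of evaluating the global + local model for given coefficients. It is an illustration only: the pose blendshapes are omitted, and the assumed array layout of the bases (and the fact that each region-wise basis is zero outside its region, since it is learned from masked residuals) is spelled out in the docstring.

```python
import numpy as np

def global_local_shape(S_mean, S_bases, E_bases, alpha_G, beta_G,
                       S_R_bases, E_R_bases, alpha_R, beta_R):
    """Evaluate Eq. (5.3): S = S_G(alpha_G, beta_G) + S_R(alpha_R, beta_R).

    Assumed shapes: S_mean (V, 3); S_bases (NS, V, 3); E_bases (NE, V, 3);
    S_R_bases / E_R_bases: lists of 14 per-region basis arrays, each zero
    outside its region because it is learned from masked residuals.
    Pose blendshapes are omitted for brevity.
    """
    # Coarse global shape from the retained top-NS / top-NE global bases (Eq. 5.2).
    S_G = S_mean + np.tensordot(alpha_G, S_bases, axes=1) \
                 + np.tensordot(beta_G, E_bases, axes=1)
    # Sum of the local, region-wise deformations (Eq. 5.4).
    S_R = sum(np.tensordot(a, S_Ri, axes=1) + np.tensordot(b, E_Ri, axes=1)
              for S_Ri, E_Ri, a, b in zip(S_R_bases, E_R_bases, alpha_R, beta_R))
    return S_G + S_R
```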
5.2.2 Shape Completion using Mesh-VAE
We use the global+local model to fit a robust 3D reconstruction on the visible parts of the occluded face. But this does not ensure a robust and consistent reconstruction on the occluded parts, since the local PCA models have noisy (occluded) or no data to fit to. To address this drawback, and to enable the generation of a distribution of plausible 3D reconstructions rather than a singular solution, which is one of our primary goals, we adopt a mesh-based VAE (dubbed Mesh-VAE) as our shape completion model. We assume that human head meshes can be mapped onto a continuous and regularized low-dimensional latent space Z. Then, given a masked (partial) 3D mesh Sm, the Mesh-VAE learns the conditional likelihood of mesh completions Sc and the corresponding latent embeddings z:

p(Sc, z|Sm) = p(z|Sm) p(Sc|z, Sm),  (5.5)

5.2.3 DPP Driven Shape Diversification
Even though the Mesh-VAE can sample multiple shape completions from p(Sc|z, Sm), in practice, the generated samples from a VAE are not guaranteed to cover all the modes [Yuan and Kitani, 2019] (see Sec. 5.3.1). To enforce diversity, we formulate a DPP on the shape completions and develop a diversity loss to optimize their latent embeddings. We adopt the quality-diversity based formulation of the DPP kernel L [Kulesza and Taskar, 2012], which seeks to balance the quality of samples with their diversity. Specifically, for elements i, j in a set, the kernel entry is given by Li,j = qi Si,j qj, where qi denotes the quality of element i and Si,j represents the similarity between i and j. Maximizing the determinant of such a kernel matrix implies maximizing the quality of each sample while minimizing the similarity between distinct samples. For two shape completions Sc,i and Sc,j, we define the similarity as

Si,j = exp(−k · disti,j / mediani,j(disti,j)),  (5.6)

where disti,j = ||Sc,i − Sc,j||2 is the ℓ2 distance between the i-th and j-th shape completions and k is a scaling factor. To ensure that the completed samples look realistic, we relate the quality of a sample to the probability of its latent embedding zi lying within 3σ of the prior N(0, I) as:

qi = exp(−max(0, ziᵀzi − 3√d)),  (5.7)

where d is the dimensionality of zi. For numerical stability [Yuan and Kitani, 2019], we adopt the expected cardinality of L as the DPP loss:

Ldpp = −tr(I − (L + I)⁻¹),  (5.8)

where I is the identity matrix and tr(·) represents the trace of a matrix.

5.2.4 Inference
Given an occluded face image Im, our goal is to generate a distribution of plausible 3D reconstructions Sc,1, ..., Sc,M. We do this in three steps, which we describe below:
Step 1 Partial Shape Fitting: In this stage, we first fit our global + local PCA model on the visible parts of the face image Im to obtain a partial reconstruction Sm. We employ the following fitting loss:

Lfitting = λf1 Llmk + λf2 Lpho + λf3 Lreg,  (5.9)

where Llmk is the landmark loss, Lpho is the photometric loss, and Lreg applies ℓ2-regularization over the model parameters. We use an off-the-shelf landmark detector, HRNET [Wang et al., 2020], to detect 68 landmarks on the face along with their confidence values. We mark as visible those landmarks whose confidence exceeds a threshold ϵ (set to 0.2) and apply the landmark loss on those points. To add local details, we apply an ℓ1-based photometric loss between the input image and the rendered image Iren on the visible regions, where Iren = R(Sm, Bτ(τ, A), γ, c), τ are the fitted albedo parameters, A are the orthonormal albedo bases from [Li et al., 2017a], Bτ(τ, A) = Aτ, and c are the estimated camera parameters. We restrict the photometric loss to the visible face region using the face mask Mf and the occlusion mask Mo:

Lpho = ||(Im − Iren) ⊙ Mf ⊙ (1 − Mo)||1  (5.10)

Step 2 We use the encoder to map the partial fit Sm to a latent distribution from which we sample the latent embeddings z ∼ N(µ, diag(σ²)), where µ, σ = Emesh(Sm).
Step 3 Diversity Promoting Shape Completion: In this stage, we perform a diversity-promoting iterative shape completion routine, which forces the latent embeddings towards diverse modes w.r.t. the occluded regions while remaining faithful to the visible regions. At each iteration, we obtain a distribution of shape completions using the decoder, Sc,j = Dmesh(zj), ∀j = 1...M, and update the z's to minimize a diversity loss:

Ldiversity = λ1 LS + λ2 Lpho + λ3 Ldpp  (5.11)

Here LS is the shape consistency loss, defined as the ℓ1-norm between the Sc,j's and Sm applied on the visible vertices, Lpho is the photometric loss (Eq. (5.10)), and Ldpp is the DPP loss (Eq. (5.8)). The loss coefficients are set to have similar magnitudes for all the loss components. We outline the full steps for partial shape fitting and diversification in Algorithm 5.1 and Algorithm 5.2, respectively.
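Before the algorithms, the following is a minimal PyTorch sketch of the DPP diversity loss of Eqs. (5.6)-(5.8); it is an illustration rather than the original implementation, and the 3σ bound of Eq. (5.7) is exposed as a parameter n_sigma (cf. the ablation in Sec. 5.3.5).

```python
import torch

def dpp_diversity_loss(completions, z, k=0.5, n_sigma=3.0):
    """Expected-cardinality DPP loss over M > 1 shape completions.

    completions: (M, V, 3) decoded meshes; z: (M, d) latent embeddings.
    """
    M, d = z.shape
    flat = completions.reshape(M, -1)
    dist = torch.cdist(flat, flat)                     # pairwise l2 distances
    med = dist[dist > 0].median().clamp(min=1e-8)      # median over distinct pairs
    S = torch.exp(-k * dist / med)                     # similarity, Eq. (5.6)
    q = torch.exp(-torch.clamp((z * z).sum(dim=1) - n_sigma * d ** 0.5, min=0))  # quality, Eq. (5.7)
    L = q[:, None] * S * q[None, :]                    # kernel L_ij = q_i S_ij q_j
    I = torch.eye(M, device=L.device)
    return -torch.trace(I - torch.linalg.inv(L + I))   # expected cardinality, Eq. (5.8)
```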
Algorithm 5.1 Shape Fitting on the Visible Face Regions
Input: Image Im, occlusion mask Mo, face mask Mf, global models S, E, P, local models SRi, ERi for i = 1 to 14, albedo model A, landmark detector H
Parameters: α, β, θ, γ, τ, c, αRi, βRi for i = 1 to 14
Hyperparameters: ϵ = 0.1, niter, λf1, λf2, λf3, η
Output: Partially fitted shape Sm
  Detect landmarks from the image: LI, Lconf ← H(Im)
  Set Lvalid ← 1 where Lconf > ϵ, else 0
  for j = 1 to niter do
    Obtain Sm using Eqs. (2.1), (2.2), (5.3) and (5.4)
    Select the 68 landmarks from the shape: LS ← Mlmk(Sm)
    Obtain the rendered image: Iren ← R(Sm, Bτ(τ, A), γ, c)
    Lf_lmk ← ||(LS − LI) ⊙ Lvalid||1
    Lf_pho ← ||(Im − Iren) ⊙ Mf ⊙ (1 − Mo)||1
    Lf_reg ← ℓ2 regularization loss over all parameters
    Lfitting ← λf1 Lf_lmk + λf2 Lf_pho + λf3 Lf_reg
    Update w ← w − η ∇w Lfitting for each w ∈ {α, β, θ, τ, γ, c, αRi, βRi for i = 1 to 14}
  end for

Algorithm 5.2 Diverse Shape Completions
Input: Mesh-VAE encoder Emesh and decoder Dmesh
Input from Algorithm 5.1: Im, Mo, Mf, LI, Lvalid, θ, γ, τ, c
Hyperparameters: ncomp, λ1, λ2, λ3, η
Output: M shape completions {Sc,j}, j = 1, ..., M
  Obtain the occluded-vertex mask Mv_o by projecting the fitted shape onto the occlusion mask Mo
  Obtain the latent parameters µ, σ ← Emesh(Sm ⊙ (1 − Mv_o))
  Sample M latent vectors z1, ..., zM ∼ N(µ, σ²I)
  for t = 1 to ncomp do
    Obtain Sc,j ← Dmesh(zj) for j = 1...M
    Obtain Iren,j ← R(Sc,j, Bτ(τ, A), γ, c) for j = 1...M
    LS ← Σ_{j=1}^{M} ||(Sc,j − Sm) ⊙ (1 − Mv_o)||1
    Lpho ← Σ_{j=1}^{M} ||(Im − Iren,j) ⊙ Mf ⊙ (1 − Mo)||1
    Ldpp ← Ldpp({Sc,j ⊙ Mv_o}_{j=1:M}) using Eq. (5.8)
    Ldiversity ← λ1 LS + λ2 Lpho + λ3 Ldpp
    Update zj ← zj − η ∇zj Ldiversity for j = 1 to M
  end for

5.3 Experimental Evaluation
Datasets: We use the FLAME [Li et al., 2017a] registered head meshes from the CoMA [Ranjan et al., 2018] and D3DFACS [Cosker et al., 2011] datasets for training the Mesh-VAE, as well as for evaluating the proposed approach. Note that, other than the Mesh-VAE, our approach does not involve training any other modules. We split the two datasets into 80:10:10 train:val:test splits based on subject ID. We train the Mesh-VAE model using the combined training splits from the two datasets. During training, we augment the meshes with occlusion masks of random (contiguous) shapes at random locations. To evaluate our approach, we use the test split of the CoMA dataset [Ranjan et al., 2018], consisting of subjects that were excluded from training. Furthermore, we conduct a qualitative evaluation on the un-annotated images from the CelebA dataset [Liu et al., 2015]. For both datasets, the test images are artificially augmented with occlusions such as masks, glasses, and other random objects.
Implementation: We implement the Mesh-VAE as a fully convolutional graph neural network (GNN) based upon the MeshConv architecture presented in [Zhou et al., 2020b]. MeshConv [Zhou et al., 2020b] uses spatially varying convolution kernels to account for the irregularity of local mesh structures and was shown to outperform fixed kernel-based GNN approaches [Kipf and Welling, 2016, Defferrard et al., 2016, Morris et al., 2019, Veličković et al., 2017, Ranjan et al., 2018, Bouritsas et al., 2019] on reconstruction tasks. To train the Mesh-VAE as a shape completion model, we augment the training meshes with random contiguous masks covering 25-40% of the vertices. However, in practice, directly training the Mesh-VAE for inpainting is very challenging, especially with large degrees of occlusions.
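The random contiguous masks mentioned above can, for instance, be generated by growing a region outward from a random seed vertex over the mesh graph. The sketch below is one hypothetical way to do this (the adjacency-list input and breadth-first growth are assumptions, not the exact augmentation used in our implementation).

```python
import random
from collections import deque

def random_contiguous_mask(adjacency, frac_min=0.25, frac_max=0.40):
    """Return a set of vertex indices forming one connected occluded region.

    adjacency: dict mapping each vertex index (0..V-1) to its neighbor indices.
    """
    n = len(adjacency)
    target = int(n * random.uniform(frac_min, frac_max))
    seed = random.randrange(n)
    masked, frontier = {seed}, deque([seed])
    # Breadth-first growth keeps the masked region contiguous on the mesh.
    while frontier and len(masked) < target:
        v = frontier.popleft()
        for u in adjacency[v]:
            if u not in masked:
                masked.add(u)
                frontier.append(u)
                if len(masked) >= target:
                    break
    return masked
```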
We adopt a curriculum learning [Bengio et al., 2009] approach to overcome this challenge and progressively introduce larger occlusions during the training process, i.e., we start with easier shape completion tasks and progressively increase their difficulty. We use a combination of ℓ1-reconstruction, ℓ1-Laplacian, and KL-divergence losses to train the network. Note that we do not use partial shape completions fitted to occluded face images using either the FLAME [Li et al., 2017a] or our global+local model to train the Mesh-VAE, and instead use ground-truth meshes to avoid any bias towards either shape model.

Occlusion     DECA [Feng et al., 2021]    FLAME [Li et al., 2017a]    Global+Local (Ours)
Glasses       57.83                       47.89                       39.98
Face-mask     61.18                       30.37                       30.11
Random        70.34                       47.56                       38.27
Overall       62.91                       41.24                       35.85
Table 5.1 Comparison of 3D reconstruction accuracy evaluated in terms of mean shape error (MSE) ×10⁻³.

Baselines: To evaluate the efficacy of Diverse3DFace in terms of diversity and robustness to occlusions, we compare against baselines such as FLAME [Li et al., 2017a], DECA [Feng et al., 2021], CFR-GAN [Ju et al., 2022], Occ3DMM [Egger et al., 2018] and Extreme3D [Tuán Trán et al., 2018], using publicly available implementations or pretrained models (wherever applicable). Due to the difficulty and unreliability of obtaining dense correspondence between FLAME and other mesh topologies, we perform a quantitative comparison only against methods based on the FLAME [Li et al., 2017a] topology. In other cases, we report qualitative comparisons based on face images with various occlusion patterns.
Metrics: The goal of this paper is to generate diverse yet realistic 3D reconstructions of occluded face images. Such an approach should have three desired qualities: 1) the reconstructed shapes should fit as accurately as possible to the visible regions, 2) the occluded regions should be diverse from each other, and 3) at least one of the reconstructed shapes should be very similar to the ground-truth shape. There is no prior work on diverse 3D reconstruction, and as such, there are no established metrics. So we define the following three metrics to evaluate the aforementioned qualities: (1) Closest Sample Error (CSE): the per-vertex ℓ2-error between the ground-truth shape and the closest reconstructed shape (lower is better), (2) Average Self Distance-Visible (ASD-V): the per-vertex ℓ2-distance on the visible regions between a 3D completion and its closest neighbor, averaged across all the samples (lower is better), and (3) Average Self Distance-Occluded (ASD-O): the ASD on the occluded regions (higher is better). These metrics are inspired by those defined for diverse trajectory forecasting [Yuan and Kitani, 2019].
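The three metrics can be summarized with a short sketch. It is a hypothetical illustration of the definitions above, assuming per-vertex correspondence across all shapes and taking the "closest neighbor" per region (one possible reading of the definition):

```python
import numpy as np

def masked_dist(a, b, mask):
    """Mean per-vertex l2 distance between shapes a, b over `mask` vertices."""
    return np.linalg.norm(a[mask] - b[mask], axis=-1).mean()

def diversity_metrics(completions, gt, visible):
    """completions: (M, V, 3); gt: (V, 3); visible: (V,) boolean mask."""
    M = len(completions)
    all_v, occluded = np.ones_like(visible), ~visible
    # CSE: error of the completion that is closest to the ground truth.
    cse = min(masked_dist(c, gt, all_v) for c in completions)
    asd_v, asd_o = [], []
    for i in range(M):
        others = [j for j in range(M) if j != i]
        # Distance of each completion to its closest other completion, per region.
        asd_v.append(min(masked_dist(completions[i], completions[j], visible) for j in others))
        asd_o.append(min(masked_dist(completions[i], completions[j], occluded) for j in others))
    return cse, float(np.mean(asd_v)), float(np.mean(asd_o))
```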
5.3.1 Quantitative Results
5.3.1.1 Fitting on the Visible Regions
Tab. 5.1 reports the 3D reconstruction accuracy in terms of mean shape error (MSE) on artificially occluded test images from the CoMA dataset [Ranjan et al., 2018] for different approaches using the FLAME [Li et al., 2017a] topology. Across all occlusion types, our proposed global+local model reports the lowest MSE values. The large gap between FLAME (fitting) [Li et al., 2017a], DECA [Feng et al., 2021] and our approach demonstrates the necessity of region-specific model fitting for occlusion robustness.
5.3.2 Error Histogram Analysis
In Fig. 5.2, we plot the histograms of shape fitting errors (in terms of MSE) when the FLAME [Li et al., 2017a] and our global+local models are used to fit to partially occluded face images. One can observe that, while FLAME registers small errors (less than 10 MSE) on a larger number of samples than the global+local model, there are significantly more samples (∼15%) on which FLAME registers very high MSE errors (>50 MSE) than for the global+local model. One can conclude that our global+local model is more robust than the global FLAME model [Li et al., 2017a] on samples with challenging occlusions.
Figure 5.2 Histogram of MSE for shape fitting on occluded face images by FLAME [Li et al., 2017a] and our Global+local model.
5.3.2.1 Diversity on the Occluded Regions
Due to the lack of existing diverse 3D reconstruction approaches, we formulate four baselines to evaluate the diversity performance of Diverse3DFace: 1) fitting FLAME on the visible parts plus a DPP loss on the occluded parts (FLAME+DPP), 2) replacing FLAME in (1) with our global+local model (Global+Local+DPP), 3) fitting the global+local model followed by shape completions by the Mesh-VAE as per the learned distribution p(Sc, z|Sm) (Global+Local+VAE), and 4) replacing the global+local model with FLAME [Li et al., 2017a] in Diverse3DFace (FLAME+VAE+DPP). We report the quantitative metrics in Tab. 5.2. Across all occlusion types, FLAME+DPP and Global+Local+DPP report much higher CSE and ASD-V, and lower ASD-O, than Diverse3DFace. Though Global+Local+VAE obtains a lower CSE than Diverse3DFace, it does so at the cost of reduced diversity in terms of ASD-O. FLAME+VAE+DPP reports better diversity metrics but at the cost of higher CSE errors. On the other hand, Diverse3DFace reports the lowest ASD-V, the highest ASD-O, and the second-lowest CSE, satisfying the three desired qualities mentioned earlier.

Method                         Glasses               Face-mask             Random                Overall
FLAME+DPP                      41.26 / 3.83 / 3.26   28.14 / 3.07 / 4.58   43.12 / 3.61 / 4.06   36.81 / 3.61 / 4.06
Global+Local+DPP               38.17 / 2.25 / 3.11   28.06 / 2.30 / 3.57   38.85 / 2.59 / 3.51   34.55 / 2.35 / 3.39
Global+Local+VAE               32.88 / 1.01 / 1.38   25.95 / 0.89 / 1.79   36.58 / 0.97 / 1.61   31.18 / 0.95 / 1.59
FLAME+VAE+DPP                  42.58 / 0.63 / 4.43   27.97 / 0.61 / 7.88   43.00 / 0.78 / 5.44   37.45 / 0.77 / 5.92
Global+Local+VAE+DPP (Ours)    36.30 / 0.61 / 4.50   27.58 / 0.85 / 7.89   39.11 / 0.72 / 5.62   33.71 / 0.73 / 6.05
Table 5.2 Evaluation of diverse reconstructions by the baselines vs. Diverse3DFace in terms of CSE, ASD-V and ASD-O (in units of 10⁻³). Each cell reports CSE (↓) / ASD-V (↓) / ASD-O (↑).

Since the CelebA dataset [Liu et al., 2015] is not labeled with ground-truth 3D shapes, we do not compute the Closest Sample Error (CSE) on this dataset. As reported in Tab. 5.3, our approach obtains the maximum ASD-O across all occlusion types, the lowest ASD-V for Glasses, as well as the second-lowest (compared to Global+Local+VAE) ASD-V for Face-masks and Random occlusions. This is further corroborated by the significantly higher ASD-O/ASD-V ratios reported by Diverse3DFace compared to the baselines. In contrast, the single-stage diversity fitting baselines, viz. FLAME+DPP and Global+Local+DPP, generate the lowest ASD-O/ASD-V ratios, signifying that the 3D reconstructions generated by these approaches are neither diverse on the occluded regions nor consistent with respect to the visible regions. On the other hand, one-pass samples generated by Global+Local+VAE are consistent with the visible face, as reported by the low ASD-V, but not diverse on the occluded regions (low ASD-O). These observations confirm our hypothesis that explicitly accounting for occlusions and optimizing for diversity can lead to 3D reconstructions that are both more accurate (on the visible regions) and more geometrically diverse (on the occluded regions).
Method                  Glasses               Face-mask             Random                Overall
FLAME+DPP               3.44 / 2.98 / 0.866   3.45 / 4.93 / 1.429   4.12 / 4.23 / 1.027   3.86 / 4.44 / 1.150
Global+Local+DPP        2.15 / 2.99 / 1.391   2.85 / 3.99 / 1.400   3.17 / 3.84 / 1.211   3.03 / 3.88 / 1.281
Global+Local+VAE        0.81 / 1.17 / 1.444   0.75 / 1.62 / 2.160   0.79 / 1.29 / 1.633   0.78 / 1.41 / 1.808
Diverse3DFace (Ours)    0.68 / 3.56 / 5.235   1.03 / 7.47 / 7.252   0.83 / 4.30 / 5.181   0.90 / 5.41 / 6.011
Table 5.3 Quantitative evaluation of the diversity in 3D reconstruction of occluded faces from the CelebA dataset [Liu et al., 2015] for the baselines vs. Diverse3DFace in terms of the ASD-V (↓) and ASD-O (↑) metrics (in units of 10⁻³) and the ratio between them. Each cell reports ASD-V / ASD-O / ASD-O:ASD-V (↑).

Among the different occlusion types, we report the highest ASD-O for face-masks. These results are consistent with the fact that human faces have higher variability in the mouth and nose regions, which our approach is able to learn and reproduce.
5.3.3 Qualitative Results
5.3.3.1 Fitting on the Visible Regions
FLAME vs. Global+Local PCA Model: In addition to the quantitative comparison in Tab. 5.1, we qualitatively compare the occlusion robustness of the global FLAME [Li et al., 2017a] model vs. our global+local model. In Fig. 5.3, we show some failure cases of the FLAME [Li et al., 2017a] based fitting on severely occluded images. Notice the severe deformations in the FLAME [Li et al., 2017a] fitted outputs, especially around the mouth. In contrast, the fittings by our global+local models look more faithful and detailed with respect to the visible parts. These observations further support our claim that a global+local model-based fitting performs better than a global-model-based fitting on occluded face images.
Target Image / FLAME Fitting / Global+Local Model Fitting
Figure 5.3 FLAME [Li et al., 2017a] based fitting (middle row) vs. our Global+Local fitting (last row) on occluded face images (top row).
Target Image / FLAME [Li et al., 2017a] / DECA [Feng et al., 2021] / CFR-GAN [Ju et al., 2022] / Occ3DMM [Egger et al., 2018] / Extreme3D [Tuán Trán et al., 2018] / Reconstructions by Diverse3DFace (Ours) / Ground truth
Figure 5.4 Qualitative evaluation on the CoMA dataset [Ranjan et al., 2018]: Reconstructed singular 3D meshes from the target image by the baselines vs. the diverse reconstructions (one full shape followed by six partial zoomed-in variations) from Diverse3DFace.
Target Image / FLAME [Li et al., 2017a] / DECA [Feng et al., 2021] / CFR-GAN [Ju et al., 2022] / Occ3DMM [Egger et al., 2018] / Extreme3D [Tuán Trán et al., 2018] / Reconstructions by Diverse3DFace (Ours)
Figure 5.5 Qualitative evaluation on the CelebA dataset [Liu et al., 2015]: Reconstructed singular 3D meshes from the target image by the baselines vs. the diverse reconstructions from Diverse3DFace.
5.3.3.2 Diverse 3D Reconstructions
Fig. 5.4 shows qualitative results of 3D reconstruction on the artificially occluded CoMA [Ranjan et al., 2018] images. All the baselines can only generate a single 3D reconstruction w.r.t. the target image. We observe that the reconstructions generated by Diverse3DFace look diverse yet plausible, and visually more faithful to the ground truth in the visible regions.
In comparison, FLAME-based fitting [Li et al., 2017a] and DECA [Feng et al., 2021] do not explicitly handle occlusions and generate soft and erroneous shapes. CFR-GAN [Ju et al., 2022] and Occ3DMM [Egger et al., 2018] get the pose wrong in multiple instances. Extreme3D [Tuán Trán et al., 2018] generates visually better reconstructions of the visible parts of the face but gets the expression wrong in the second row. In Fig. 5.5, we show further visual comparisons on the occlusion-augmented images from the CelebA [Liu et al., 2015] dataset. Note that we do not have ground-truth scans for these images. However, the visual results suggest that the baselines, being holistic models, do not explicitly exclude features from the occluded regions and often get incorrect poses and expressions on these images. Meanwhile, the reconstructions from Diverse3DFace look diverse on the occluded regions yet consistent w.r.t. the visible parts of the face.
Target Image / Fitting by Global-local model / 3D Reconstructions by Diverse3DFace
Figure 5.6 Set of 3D reconstructions by Diverse3DFace on real-world occluded face images.
5.3.4 Real-world Occlusions
We present examples of diverse 3D reconstructions by our approach on real-world occluded face images in Fig. 5.6. For these images, we inferred the occlusion mask using the face segmentation model by Nirkin et al. [Nirkin et al., 2018]. These results further demonstrate the efficacy of Diverse3DFace in generating diverse, yet plausible 3D reconstructions under real-world occlusions such as glasses, scarves, face masks, etc.
Target Image / Interpolated 3D Reconstructions
Figure 5.7 Controlled generation of diverse 3D reconstructions between two distinct modes. Diverse3DFace can be used to generate controlled diversity on the occluded regions by performing interpolation between two distinct shapes in the latent space.
5.3.4.1 Diversity Interpolations
A potential application of Diverse3DFace is to perform controlled diversification around an occluded region during 3D reconstruction. To do this, we can first generate a set of diverse 3D reconstructions for an occluded target image and then allow the user to select two distinct samples to perform interpolation in between. We perform interpolation in the latent space: z(α) = αz1 + (1 − α)z2. This affords the user control over the extent and type of diversity. We present examples of such interpolations in Fig. 5.7.
5.3.4.2 Moving the Occlusion Around the Face
In this section, we evaluate the diversity and robustness of Diverse3DFace to occlusions at different locations on the face. Fig. 5.8 shows the sets of 3D reconstructions by Diverse3DFace when the occlusion moves around the face, occupying the left cheek, mouth, right cheek, center, and periocular (eye) regions of the face. Our method generates a diverse, yet plausible set of 3D reconstructions in all the cases. We particularly note the high degree of diversity in expression that occurs when the mouth region is occluded, as is expected.
Target Image / Diverse 3D Reconstructions by Diverse3DFace
Figure 5.8 Qualitative evaluation of the diversity and robustness performance of Diverse3DFace to occlusions at different facial locations.
(a) ASD-V (↓)
nσ \ k    0.1     0.25    0.5     1       2
1         0.53    0.69    0.86    0.81    0.79
2         0.81    0.95    1.02    1.05    0.98
3         0.93    1.18    1.30    1.23    1.06
4         1.40    1.61    1.94    1.92    1.57
5         1.88    1.98    2.14    2.03    1.98

(b) ASD-O (↑)
nσ \ k    0.1     0.25    0.5     1       2
1         3.63    4.13    5.98    5.18    4.42
2         4.92    6.37    8.25    7.89    6.68
3         5.62    7.65    9.16    8.84    7.40
4         7.17    8.18    11.19   10.72   9.78
5         8.64    10.73   14.53   12.96   12.21
Table 5.4 Effect of the hyperparameters k and nσ on the diversity metrics (a) ASD-V and (b) ASD-O on the CoMA dataset [Ranjan et al., 2018].

5.3.5 Ablation Study on Diversity Hyperparameters
The diversity generated by our approach is determined by the DPP loss:

Ldpp = −tr(I − (L + I)⁻¹).  (5.12)

Here, the DPP kernel entry for the (i, j)-th element is given by Li,j = qi Si,j qj, where qi denotes the quality of element i and Si,j represents the similarity between i and j. The DPP optimization tries to maximize the quality of each sample while minimizing the similarity between distinct samples. As stated earlier, we control the similarity term Si,j = exp(−k · disti,j / mediani,j(disti,j)) and the quality term qi = exp(−max(0, ziᵀzi − nσ√d)) using the two parameters k and nσ, respectively. In Tab. 5.4, we study the effect of these two hyperparameters on diversity as measured by the diversity metrics ASD-V and ASD-O. As shown in Tab. 5.4, we obtain the maximum ASD-V, as well as ASD-O, at k = 0.5, whereas both metrics increase as nσ increases. Thus, we set k = 0.5 in our experiments, while we choose nσ = 3 as a sweet spot between minimizing ASD-V and maximizing ASD-O. The user can change the value of nσ to tweak the diversity-realism trade-off.

5.4 Conclusion
We proposed Diverse3DFace, an approach that generates diverse yet plausible 3D reconstructions corresponding to a single occluded face image. Our approach was motivated by the fact that, in the presence of occlusions, a distribution of plausible 3D reconstructions is more desirable than a single unique solution. We proposed a three-step solution that first fits a robust partial shape using an ensemble of global+local PCA models, maps it to a latent space, and iteratively optimizes the embeddings to promote diversity in the occluded parts while retaining fidelity with respect to the visible parts of the face. Experimental evaluation across multiple occlusion types and datasets shows the efficacy of Diverse3DFace, both in terms of robustness and diversity, compared to multiple baselines. To our knowledge, this is the first approach that generates a distribution of diverse 3D reconstructions from a single occluded face image. A limitation of the proposed approach is its dependence on the robustness of the global+local fitting in the first step for further diverse completions. Although such a locally disentangled fitting demonstrably performs better than a global model fitting, it may still be affected in cases where the initial landmark or face-mask estimates are wrong.

CHAPTER 6
FUTURE EXTENSIONS
So far, we have focused on how 3D-aware generative modeling can improve face inpainting and controlled face generation and editing. We also studied ways to make monocular 3D face reconstruction generate robust and diverse solutions in the presence of occlusions. These works have natural extensions that we now propose.
6.1 Generating Diverse Textured 3D Reconstructions from a Single Occluded Face Image In Chapter 5, we generated a distribution of diverse, but realistic 3D reconstructions corre- sponding to an occluded face image such that we retain fidelity with respect to the visible parts, and allow for realistic diversity on the occluded parts of the face. This work was motivated by the observation that, in the presence of occlusion, no one 3D reconstruction can be said to be the correct one. However, one can naturally extend this reasoning to the domain of appearance too. That is, it is possible for the occluded part to vary in shape, expression, and albedo, while global factors like illumination can remain constant. Reconstruction models therefore need to account for diversity from several perspectives. The utility of such an algorithm will not just be restricted to occlusion robust 3D reconstruction, but will also extend to editing specific parts of face in 3D, both in shape as well as appearance. While one way of attempting this can be simply extending Diverse3DFace to include texture by estimating a partial texture with respect to the visible regions, followed by diverse completions using a texture-VAE, yet another way could be by leveraging the power of diffusion models [Ho et al., 2020]. Diffusion models have gained much traction in the recent years in the domain of generative modelling [Dhariwal and Nichol, 2021, Lugmayr et al., 2022], and inherently support diversity, which stems out of its stochastic sampling approach. This way, we can model a joint distribution of both shape and texture and sample diverse 3D recon- structions from this, conditioned on the partial estimates. We show a proposed overview of such an approach in Fig. 6.1. It consists of three components: (i) a partial 3D reconstruction component, (ii) a generative prior component, and (iii) an explicit diversification component. The partial 3D reconstruction involves reconstructing just the visible 89 Figure 6.1 Overview of the proposed DDPM powererd Diverse3DFace (DivFusionFace). After obtaining an initial partial 3D reconstruction, we propose to transform our mesh to its UV rep- resentation, and perform diverse shape and texture completion using a UV-DDPM and diverse sampler. The completed UVs can then be transformed back to their mesh representation. part, and can be done with using an occlusion-robust 3D reconstruction algorithm such as our global+local model (see Sec. 5.2.1). Then the partial shape and texture in their UV representations are inpainted using a DDPM. To promote diversity, we can further replace the standard sampler in DDPM with a diversity-aware reverse diffusion sampler. We achieve this by incorporating the idea of DPP kernel [Kulesza and Taskar, 2012] into the reverse process of a DDPM, such that generated samples take diverse sampling trajectories from each other. This will enable our proposed approach to generate samples with controllable levels of diversity, both in terms of texture and shape. 6.2 High-Resolution Diversity-Oriented 3DFaceFill We have shown that 3DFaceFill [Dey and Boddeti, 2022a] can inpaint partial face images while maintaining the structural integrity of human face, owing to it incorporating explicit face 3D 90 priors. 
However, the implementation in Chapter 3 has three main limitations: (i) it is limited by the resolution of the underlying 3D model, which in the case of 3DFaceFill is the Basel Face Model (BFM) [Paysan et al., 2009]; (ii) it does not model the regions not included in the underlying 3D model, such as hair and teeth; and (iii) it generates a single inpainted solution, which does not satisfy the desired property of diversity in the presence of missing information, as recognized in this thesis. A future extension of this work should try to fix these limitations. For the first and second limitations, we can adopt one of the following two approaches:
1. We can add a second stage to the pipeline that takes in the low-resolution inpainted image from the first stage and super-resolves it into a higher-resolution image, similar to the styled generator in Chapter 4; or
2. We can employ a higher-resolution 3D face model that also includes the inner mouth, such as the UHM model [Ploumpis et al., 2020], instead of the BFM model [Paysan et al., 2009].
Diverse inpainting with respect to geometry can be achieved by plugging in our approach Diverse3DFace [Dey and Boddeti, 2022b], in place of the 3DMM model used in 3DFaceFill, for diverse occlusion-aware 3D reconstruction. Moreover, the extension proposed in Sec. 6.1 would enable diversity in both geometry and appearance.
6.3 Extensions to CoLa-SDF
While our presented approach CoLa-SDF can generate high-fidelity 3D faces with a high degree of control over shape, albedo, illumination, hairstyle and pose, it can be made even more practically useful with some extensions. We outline some of these below:
1. Text-based Control: Text-conditioned generative modeling has become increasingly popular with the introduction of CLIP [Radford et al., 2021]. Using CLIP, we can train neural networks to control the latent codes of CoLa-SDF corresponding to these attributes, enabling text-controlled face generation in 3D. We can also extend this to Diverse3DFace. Whereas diversity is introduced in a stochastic way using a DPP [Kulesza and Taskar, 2012] in Diverse3DFace, a future extension can make shape completion on the occluded parts conditioned on textual inputs.
2. Explicit Identity-Preservation: A use-case for CoLa-SDF could be identity-preserving reconstruction and editing of faces in 3D. In the current implementation, CoLa-SDF preserves identity implicitly when editing illumination, pose, and hair/background regions. However, this is not enforced using an identity-preserving loss. A future extension, trained simply by adding an identity-preserving loss and followed by a thorough face recognition evaluation, could demonstrate important practical applications such as virtual avatars, virtual meetings, and others.
3. Semantic Hair-control: Though CoLa-SDF can edit hair and background independently of the rest of the facial attributes, we cannot explicitly control semantic attributes of hair such as length, style, and color. This can be achieved 1) by using carefully sampled latent codes corresponding to specific attributes in the training set, or 2) by using a semantic 3D model of human hair such as [Wu et al., 2022].
CHAPTER 7
CONCLUSION
In this thesis, we explored the possibilities and opportunities that 3D modeling of faces brings to tasks such as face inpainting and controlled 3D face generation.
We also studied the problem of occlusions in 3D reconstruction and claimed that robustness, diversity, and maintaining the structural integrity of the face should be the cornerstone criteria by which such occlusion-aware models should be evaluated.

Towards 3D-aware face inpainting, we proposed 3DFaceFill [Dey and Boddeti, 2022a], which was driven by the hypothesis that 3D disentanglement of a face image into 3D shape, 3D pose, albedo and illumination, followed by albedo inpainting in the UV representation, as opposed to the 2D pixel representation, will allow us to effectively leverage the power of 3D correspondence and ultimately lead to face completions that are geometrically and photometrically more accurate. Experimental evaluation across multiple datasets and against multiple baselines shows that face completions from 3DFaceFill are significantly better, both qualitatively and quantitatively, under large variations in pose, illumination, shape and appearance, which validates our hypothesis.

To enable controllable generation of 3D faces, we proposed CoLa-SDF, which combines the disentangled controllability of nonlinear 3DMM approaches with the high fidelity of implicit 3D-GANs. Building upon the architecture of StyleSDF [Or-El et al., 2022], we enforce the latent space to match the physical parameters of the nonlinear 3D morphable model MOST-GAN [Medin et al., 2022]. We also enforced disentangled control of hair and background, a feat we believe is a first of its kind. We demonstrate high-fidelity image synthesis and subsequent 3D manipulation with full control over the 3D shape, pose, albedo, illumination and hairstyle of the generated face.

To address the challenge of facial occlusions in single-view 3D face reconstruction, we proposed Diverse3DFace [Dey and Boddeti, 2022b], which reconstructs diverse yet plausible 3D models corresponding to a single occluded face image. Our approach was motivated by the threefold criteria of occlusion robustness, diversity, and maintaining the structural integrity of faces. We presented a three-step solution of first fitting a robust partial shape using an ensemble of global+local PCA models, mapping it to the latent space of a mesh-VAE, and iteratively optimizing the embeddings to promote diversity in the occluded parts, while retaining fidelity with respect to the visible parts of the face. Experimental evaluation across multiple occlusion types and datasets shows the efficacy of Diverse3DFace compared to multiple baselines, both in terms of robustness and diversity.

Limitations: Despite the improvements our proposed approaches have over the traditional approaches in terms of face inpainting, controllable face generation, and occlusion-aware 3D reconstruction, our approaches have certain limitations. 3DFaceFill is based upon the template-based BFM model [Paysan et al., 2009], which does not include the inner mouth cavity or hair and is limited in its resolution. These limitations, thus, transfer to 3DFaceFill too. In Chapter 6, we propose future enhancements to overcome these limitations, including adopting a different 3D model such as UHM [Ploumpis et al., 2020], and adding a refiner or a subsequent super-resolver module. 3DFaceFill also generates a single solution rather than a diverse set of solutions. We can replace the underlying 3D reconstruction algorithm with the proposed Diverse3DFace to enable diverse completions.
Our approach for controllable 3D face generation, CoLa-SDF, sometimes produce artifacts when sampling from beyond three standard deviations from the mean MOST-GAN parameters. Further, when interpolating between two hairstyles, it sometimes generate incomplete hats. Both these effects may be due a lack of such examples in the FFHQ dataset [Karras et al., 2020] on which it is trained. Fine tuning on a more diverse face dataset, or weighted sampling to favor sampling extreme examples more often during training may address these challenges. A limitation of Diverse3DFace is its dependence on the robustness of the global+local fitting in the first step for further diverse completions. Although such a locally disentangled fitting demon- strably performs better than a global model fitting, it may still be affected in cases where the initial landmark or face-mask estimates are wrong. Also, extending Diverse3DFace to include texture and model diversity in both shape and texture is a desirable objective. To address this, we have proposed future extensions in Sec. 6.1. 94 BIBLIOGRAPHY [Athar et al., 2022] Athar, S., Xu, Z., Sunkavalli, K., Shechtman, E., and Shu, Z. (2022). Rignerf: Fully controllable neural 3d portraits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20364–20373. [Bahat and Michaeli, 2020] Bahat, Y. and Michaeli, T. (2020). Explorable super resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2716–2725. [Barnes et al., 2009] Barnes, C., Shechtman, E., Finkelstein, A., and Goldman, D. B. (2009). Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24. [Bengio et al., 2009] Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In International Conference on Machine Learning. [Bertalmio et al., 2000] Bertalmio, M., Sapiro, G., Caselles, V., and Ballester, C. (2000). Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424. [Blanz and Vetter, 1999] Blanz, V. and Vetter, T. (1999). A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194. [Bouritsas et al., 2019] Bouritsas, G., Bokhnyak, S., Ploumpis, S., Bronstein, M., and Zafeiriou, S. (2019). Neural 3d morphable models: Spiral convolutional networks for 3d shape represen- tation learning and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7213–7222. [Chan et al., 2022] Chan, E. R., Lin, C. Z., Chan, M. A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L. J., Tremblay, J., Khamis, S., et al. (2022). Efficient geometry-aware 3d In Proceedings of the IEEE/CVF Conference on Computer generative adversarial networks. Vision and Pattern Recognition, pages 16123–16133. [Chan et al., 2021] Chan, E. R., Monteiro, M., Kellnhofer, P., Wu, J., and Wetzstein, G. (2021). pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5799–5809. [Che et al., 2016] Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. (2016). Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136. [Chen et al., 2017a] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017a). 
Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolu- tion, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848. 95 [Chen et al., 2017b] Chen, Y.-A., Chen, W.-C., Wei, C.-P., and Wang, Y.-C. F. (2017b). Occlusion- aware face inpainting via generative adversarial networks. In 2017 IEEE International Confer- ence on Image Processing (ICIP), pages 1202–1206. IEEE. [Clevert et al., 2015] Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289. [Cosker et al., 2011] Cosker, D., Krumhuber, E., and Hilton, A. (2011). A facs valid 3d dynamic action unit database with applications to 3d dynamic morphable facial modeling. In 2011 inter- national conference on computer vision, pages 2296–2303. IEEE. [Criminisi et al., 2004] Criminisi, A., P´erez, P., and Toyama, K. (2004). Region filling and ob- IEEE Transactions on image processing, ject removal by exemplar-based image inpainting. 13(9):1200–1212. [Defferrard et al., 2016] Defferrard, M., Bresson, X., and Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29:3844–3852. [Deng et al., 2018] Deng, J., Cheng, S., Xue, N., Zhou, Y., and Zafeiriou, S. (2018). Uv-gan: Adversarial facial uv map completion for pose-invariant face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7093–7102. [Deng et al., 2019] Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019). Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699. [Deng et al., 2020] Deng, Y., Yang, J., Chen, D., Wen, F., and Tong, X. (2020). Disentangled and controllable face image generation via 3d imitative-contrastive learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5154–5163. [Dey and Boddeti, 2022a] Dey, R. and Boddeti, V. N. (2022a). 3dfacefill: An analysis-by- synthesis approach to face completion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1586–1595. [Dey and Boddeti, 2022b] Dey, R. and Boddeti, V. N. (2022b). Generating diverse 3d reconstruc- In Proceedings of the IEEE/CVF Conference on tions from a single occluded face image. Computer Vision and Pattern Recognition (CVPR), pages 1547–1557. [Dhariwal and Nichol, 2021] Dhariwal, P. and Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794. [Egger et al., 2018] Egger, B., Sch¨onborn, S., Schneider, A., Kortylewski, A., Morel-Forster, A., Blumer, C., and Vetter, T. (2018). Occlusion-aware 3d morphable models and an illumination prior for face image analysis. International Journal of Computer Vision, 126(12):1269–1287. [Egger et al., 2020] Egger, B., Smith, W. A., Tewari, A., Wuhrer, S., Zollhoefer, M., Beeler, T., Bernard, F., Bolkart, T., Kortylewski, A., Romdhani, S., et al. (2020). 3d morphable face models—past, present, and future. ACM Transactions on Graphics (TOG), 39(5):1–38. 96 [Elfeki et al., 2019] Elfeki, M., Couprie, C., Riviere, M., and Elhoseiny, M. (2019). Gdpp: Learn- ing diverse generations using determinantal point processes. In International Conference on Machine Learning, pages 1774–1783. PMLR. 
[Feng et al., 2021] Feng, Y., Feng, H., Black, M. J., and Bolkart, T. (2021). Learning an animat- able detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (TOG), 40(4):1–13. [Gecer et al., 2019] Gecer, B., Ploumpis, S., Kotsia, I., and Zafeiriou, S. (2019). Ganfit: Gener- ative adversarial network fitting for high fidelity 3d face reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [Gerig et al., 2018] Gerig, T., Morel-Forster, A., Blumer, C., Egger, B., Luthi, M., Sch¨onborn, In 2018 13th IEEE S., and Vetter, T. (2018). Morphable face models-an open framework. International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 75–82. IEEE. [Ghosh et al., 2018] Ghosh, A., Kulharia, V., Namboodiri, V. P., Torr, P. H., and Dokania, P. K. (2018). Multi-agent diverse generative adversarial networks. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR). [Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680. [Grassal et al., 2022] Grassal, P.-W., Prinzler, M., Leistner, T., Rother, C., Nießner, M., and Thies, J. (2022). Neural head avatars from monocular rgb videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18653–18664. [Gropp et al., 2020] Gropp, A., Yariv, L., Haim, N., Atzmon, M., and Lipman, Y. (2020). Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099. [Gross et al., 2010] Gross, R., Matthews, I., Cohn, J., Kanade, T., and Baker, S. (2010). Multi-pie. Image and Vision Computing, 28(5):807–813. [Gu et al., 2021] Gu, J., Liu, L., Wang, P., and Theobalt, C. (2021). Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985. [Gulrajani et al., 2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777. [Hays and Efros, 2007] Hays, J. and Efros, A. A. (2007). Scene completion using millions of photographs. ACM Transactions on Graphics (ToG), 26(3):4–es. [He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778. 97 [Heusel et al., 2017] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30. [Ho et al., 2020] Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851. [Hong et al., 2021] Hong, Y., Peng, B., Xiao, H., Liu, L., and Zhang, J. (2021). Headnerf: A real-time nerf-based parametric head model. arXiv preprint arXiv:2112.05637. [Iizuka et al., 2017] Iizuka, S., Simo-Serra, E., and Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):1–14. [Isola et al., 2017] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). Image-to-image transla- tion with conditional adversarial networks. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134. [Ju et al., 2022] Ju, Y.-J., Lee, G.-H., Hong, J.-H., and Lee, S.-W. (2022). Complete face recovery gan: Unsupervised joint face rotation and de-occlusion from a single-view image. In WACV. [Juefei-Xu et al., 2018] Juefei-Xu, F., Dey, R., Boddeti, V. N., and Savvides, M. (2018). Rankgan: a maximum margin ranking gan for generating faces. In Asian Conference on Computer Vision, pages 3–18. Springer. [Karras et al., 2021] Karras, T., Aittala, M., Laine, S., H¨ark¨onen, E., Hellsten, J., Lehtinen, J., and Aila, T. (2021). Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34:852–863. [Karras et al., 2019] Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410. [Karras et al., 2020] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. In Proceedings of the (2020). Analyzing and improving the image quality of stylegan. IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119. [Kendall and Gal, 2017] Kendall, A. and Gal, Y. (2017). What uncertainties do we need in In Advances in neural information processing bayesian deep learning for computer vision? systems, pages 5574–5584. [Kim et al., 2018] Kim, H., Zollh¨ofer, M., Tewari, A., Thies, J., Richardt, C., and Theobalt, C. (2018). Inversefacenet: Deep monocular inverse face rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4625–4634. [Kipf and Welling, 2016] Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. [Kulesza and Taskar, 2012] Kulesza, A. and Taskar, B. (2012). Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083. 98 [Lee et al., 2020] Lee, C.-H., Liu, Z., Wu, L., and Luo, P. (2020). Maskgan: Towards diverse and In IEEE Conference on Computer Vision and Pattern interactive facial image manipulation. Recognition (CVPR). [Li et al., 2023] Li, C., Morel-Forster, A., Vetter, T., Egger, B., and Kortylewski, A. (2023). Ro- bust model-based face reconstruction through weakly-supervised outlier segmentation. In 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE. [Li et al., 2017a] Li, T., Bolkart, T., Black, M. J., Li, H., and Romero, J. (2017a). Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17. [Li et al., 2020a] Li, X., Hu, G., Zhu, J., Zuo, W., Wang, M., and Zhang, L. (2020a). Learning symmetry consistent deep cnns for face completion. IEEE Transactions on Image Processing, 29:7641–7655. [Li et al., 2017b] Li, Y., Liu, S., Yang, J., and Yang, M.-H. (2017b). Generative face completion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3911– 3919. [Li et al., 2020b] Li, Z., Hu, Y., He, R., and Sun, Z. (2020b). Learning disentangling and fusing networks for face completion under structured occlusions. Pattern Recognition, 99:107073. [Lin et al., 2017] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Doll´ar, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988. [Liu et al., 2018] Liu, G., Reda, F. A., Shih, K. 
J., Wang, T.-C., Tao, A., and Catanzaro, B. (2018). Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 85–100. [Liu et al., 2015] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). [Lugmayr et al., 2022] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., and Van Gool, L. (2022). Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471. [Maas et al., 2013] Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3. [Macchi, 1975] Macchi, O. (1975). The coincidence approach to stochastic point processes. Ad- vances in Applied Probability, 7(1):83–122. [Medin et al., 2022] Medin, S. C., Egger, B., Cherian, A., Wang, Y., Tenenbaum, J. B., Liu, X., and Marks, T. K. (2022). Most-gan: 3d morphable stylegan for disentangled face image ma- nipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1962–1971. 99 [Mescheder et al., 2018] Mescheder, L., Geiger, A., and Nowozin, S. (2018). Which training In International conference on machine learning, methods for gans do actually converge? pages 3481–3490. PMLR. [Mildenhall et al., 2020] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. (2020). Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pages 405–421. Springer. [Miyato et al., 2018] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral In International Conference on Learning normalization for generative adversarial networks. Representations. [Morris et al., 2019] Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. (2019). Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4602–4609. [Neumann et al., 2013] Neumann, T., Varanasi, K., Wenger, S., Wacker, M., Magnor, M., and Theobalt, C. (2013). Sparse localized deformation components. ACM Transactions on Graphics (TOG), 32(6):1–10. [Niemeyer and Geiger, 2021] Niemeyer, M. and Geiger, A. (2021). Giraffe: Representing scenes as compositional generative neural feature fields. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). [Nirkin et al., 2018] Nirkin, Y., Masi, I., Tuan, A. T., Hassner, T., and Medioni, G. (2018). On face segmentation, face swapping, and face perception. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 98–105. IEEE. [Or-El et al., 2022] Or-El, R., Luo, X., Shan, M., Shechtman, E., Park, J. J., and Kemelmacher- Shlizerman, I. (2022). Stylesdf: High-resolution 3d-consistent image and geometry generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13503–13513. [Parkhi et al., 2015] Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep face recognition. In British Machine Vision Conference. [Pathak et al., 2016] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. (2016). Context encoders: Feature learning by inpainting. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544. [Paysan et al., 2009] Paysan, P., Knothe, R., Amberg, B., Romdhani, S., and Vetter, T. (2009). In 2009 Sixth IEEE A 3d face model for pose and illumination invariant face recognition. International Conference on Advanced Video and Signal Based Surveillance, pages 296–301. Ieee. [Ploumpis et al., 2020] Ploumpis, S., Ververas, E., O’Sullivan, E., Moschoglou, S., Wang, H., Pears, N., Smith, W., Gecer, B., and Zafeiriou, S. P. (2020). Towards a complete 3d morphable model of the human head. IEEE Transactions on Pattern Analysis and Machine Intelligence. 100 [Radford et al., 2021] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models In International conference on machine learning, pages from natural language supervision. 8748–8763. PMLR. [Ramamoorthi and Hanrahan, 2001] Ramamoorthi, R. and Hanrahan, P. (2001). An efficient rep- resentation for irradiance environment maps. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 497–500. [Ramon et al., 2021] Ramon, E., Triginer, G., Escur, J., Pumarola, A., Garcia, J., Giro-i Nieto, X., and Moreno-Noguer, F. (2021). H3d-net: Few-shot high-fidelity 3d head reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5620–5629. [Ranjan et al., 2018] Ranjan, A., Bolkart, T., Sanyal, S., and Black, M. J. (2018). Generating 3D In European Conference on Computer Vision faces using convolutional mesh autoencoders. (ECCV), pages 725–741. [Ravi et al., 2020] Ravi, N., Reizenstein, J., Novotny, D., Gordon, T., Lo, W.-Y., Johnson, J., and Gkioxari, G. (2020). Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501. [Ronneberger et al., 2015] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolu- tional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer. [Saito et al., 2016] Saito, S., Li, T., and Li, H. (2016). Real-time facial segmentation and perfor- mance capture from rgb input. In European conference on computer vision, pages 244–261. Springer. [Sanyal et al., 2019] Sanyal, S., Bolkart, T., Feng, H., and Black, M. J. (2019). Learning to regress In Proceedings of the 3d face shape and expression from an image without 3d supervision. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7763–7772. [Schwartz et al., 2020] Schwartz, G., Wei, S.-E., Wang, T.-L., Lombardi, S., Simon, T., Saragih, J., and Sheikh, Y. (2020). The eyes have it: An integrated eye and face model for photorealistic facial animation. ACM Trans. Graph., 39(4). [Schwarz et al., 2020] Schwarz, K., Liao, Y., Niemeyer, M., and Geiger, A. (2020). Graf: Gener- ative radiance fields for 3d-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166. [Sengupta et al., 2018] Sengupta, S., Kanazawa, A., Castillo, C. D., and Jacobs, D. W. (2018). Sfsnet: Learning shape, reflectance and illuminance of facesin the wild’. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6296–6305. [Shocher et al., 2019] Shocher, A., Bagon, S., Isola, P., and Irani, M. (2019). Ingan: Capturing and retargeting the” dna” of a natural image. 
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4492–4501. 101 [Shu et al., 2017] Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., and Samaras, D. In Proceedings of the IEEE (2017). Neural face editing with intrinsic image disentangling. conference on computer vision and pattern recognition, pages 5541–5550. [Sitzmann et al., 2020] Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. (2020). Implicit neural representations with periodic activation functions. Advances in Neu- ral Information Processing Systems, 33:7462–7473. [Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output represen- tation using deep conditional generative models. Advances in neural information processing systems, 28:3483–3491. [Song et al., 2019a] Song, L., Cao, J., Song, L., Hu, Y., and He, R. (2019a). Geometry-aware face completion and editing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2506–2513. [Song et al., 2019b] Song, L., Gong, D., Li, Z., Liu, C., and Liu, W. (2019b). Occlusion robust face recognition based on mask learning with pairwise differential siamese network. In Pro- ceedings of the IEEE International Conference on Computer Vision, pages 773–782. [Srivastava et al., 2017] Srivastava, A., Valkov, L., Russell, C., Gutmann, M. U., and Sutton, C. (2017). Veegan: Reducing mode collapse in gans using implicit variational learning. In Pro- ceedings of the 31st International Conference on Neural Information Processing Systems, pages 3310–3320. [Sun et al., 2022] Sun, J., Wang, X., Zhang, Y., Li, X., Zhang, Q., Liu, Y., and Wang, J. (2022). Fenerf: Face editing in neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7672–7682. [Suzuki et al., 2016] Suzuki, M., Nakayama, K., and Matsuo, Y. (2016). Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891. [Tewari et al., 2020a] Tewari, A., Elgharib, M., Bernard, F., Seidel, H.-P., P´erez, P., Zollh¨ofer, M., and Theobalt, C. (2020a). Pie: Portrait image embedding for semantic control. ACM Transactions on Graphics (TOG), 39(6):1–14. [Tewari et al., 2020b] Tewari, A., Elgharib, M., Bharaj, G., Bernard, F., Seidel, H.-P., P´erez, P., Zollhofer, M., and Theobalt, C. (2020b). Stylerig: Rigging stylegan for 3d control over por- In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern trait images. Recognition, pages 6142–6151. [Tewari et al., 2022] Tewari, A., Pan, X., Fried, O., Agrawala, M., Theobalt, C., et al. (2022). Disentangled3d: Learning a 3d generative model with disentangled geometry and appearance from monocular images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1516–1525. [Tewari et al., 2017] Tewari, A., Zollhofer, M., Kim, H., Garrido, P., Bernard, F., Perez, P., and Theobalt, C. (2017). Mofa: Model-based deep convolutional face autoencoder for unsupervised 102 monocular reconstruction. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1274–1283. [Tran et al., 2019] Tran, L., Liu, F., and Liu, X. (2019). Towards high-fidelity nonlinear 3d face In Proceedings of the IEEE Conference on Computer Vision and Pattern morphable model. Recognition, pages 1126–1135. [Tran and Liu, 2018] Tran, L. and Liu, X. (2018). Nonlinear 3d face morphable model. 
In Pro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 7346–7355. [Tran and Liu, 2019] Tran, L. and Liu, X. (2019). On learning 3d face morphable model from in-the-wild images. IEEE transactions on pattern analysis and machine intelligence. [Tripathy et al., 2021] Tripathy, S., Kannala, J., and Rahtu, E. (2021). Facegan: Facial attribute controllable reenactment gan. In Proceedings of the IEEE/CVF winter conference on applica- tions of computer vision, pages 1329–1338. [Tuan Tran et al., 2017] Tuan Tran, A., Hassner, T., Masi, I., and Medioni, G. (2017). Regressing robust and discriminative 3d morphable models with a very deep neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5163–5172. [Tu´an Tr´an et al., 2018] Tu´an Tr´an, A., Hassner, T., Masi, I., Paz, E., Nirkin, Y., and Medioni, G. (2018). Extreme 3d face reconstruction: Seeing through occlusions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3935–3944. [Veliˇckovi´c et al., 2017] Veliˇckovi´c, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Ben- gio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903. [Wang et al., 2022] Wang, C., Chai, M., He, M., Chen, D., and Liao, J. (2022). Clip-nerf: Text- In Proceedings of the IEEE/CVF and-image driven manipulation of neural radiance fields. Conference on Computer Vision and Pattern Recognition, pages 3835–3844. [Wang et al., 2020] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence. [Wang et al., 2021] Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., and Wang, W. (2021). Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689. [Wang et al., 2018] Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798– 8807. [Wei et al., 2019] Wei, H., Liang, S., and Wei, Y. (2019). 3d dense face alignment via graph convolution networks. arXiv preprint arXiv:1904.05562. 103 [Wu et al., 2022] Wu, K., Ye, Y., Yang, L., Fu, H., Zhou, K., and Zheng, Y. (2022). Neuralhdhair: Automatic high-fidelity hair modeling from a single image using implicit neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1526–1535. [Wu et al., 2020] Wu, S., Rupprecht, C., and Vedaldi, A. (2020). Unsupervised learning of proba- bly symmetric deformable 3d objects from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–10. [Wu and He, 2018] Wu, Y. and He, K. (2018). Group normalization. In Proceedings of the Euro- pean conference on computer vision (ECCV), pages 3–19. [Xia et al., 2022] Xia, W., Zhang, Y., Yang, Y., Xue, J.-H., Zhou, B., and Yang, M.-H. (2022). Gan inversion: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. [Yang et al., 2019] Yang, D., Hong, S., Jang, Y., Zhao, T., and Lee, H. (2019). Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024. 
[Yang et al., 2022] Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A., and Joo, H. (2022). Banmo: Building animatable 3d neural models from many casual videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2863–2873. [Yeh et al., 2017] Yeh, R. A., Chen, C., Yian Lim, T., Schwing, A. G., Hasegawa-Johnson, M., and Do, M. N. (2017). Semantic image inpainting with deep generative models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5485–5493. [Yenamandra et al., 2021] Yenamandra, T., Tewari, A., Bernard, F., Seidel, H.-P., Elgharib, M., Cremers, D., and Theobalt, C. (2021). i3dmm: Deep implicit 3d morphable model of human heads. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 12803–12813. [Yu et al., 2018] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T. S. (2018). Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514. [Yu et al., 2019] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T. S. (2019). Free-form image inpainting with gated convolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 4471–4480. [Yuan and Park, 2019] Yuan, X. and Park, I. K. (2019). Face de-occlusion using 3d morphable model and generative adversarial network. In Proceedings of the IEEE International Conference on Computer Vision, pages 10062–10071. [Yuan and Kitani, 2019] Yuan, Y. and Kitani, K. (2019). Diverse trajectory forecasting with de- terminantal point processes. arXiv preprint arXiv:1907.04967. [Yuan and Kitani, 2020] Yuan, Y. and Kitani, K. (2020). Dlow: Diversifying latent flows for di- verse human motion prediction. In European Conference on Computer Vision, pages 346–364. Springer. 104 [Yuan et al., 2022] Yuan, Y.-J., Sun, Y.-T., Lai, Y.-K., Ma, Y., Jia, R., and Gao, L. (2022). Nerf- editing: geometry editing of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18353–18364. [Zhang et al., 2018a] Zhang, J., Zhan, R., Sun, D., and Pan, G. (2018a). Symmetry-aware face In Asian Conference on Computer Vision, completion with generative adversarial networks. pages 289–304. Springer. [Zhang and Samaras, 2006] Zhang, L. and Samaras, D. (2006). Face recognition from a single training image under arbitrary unknown lighting using spherical harmonics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):351–363. [Zhang et al., 2018b] Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. (2018b). The unreasonable effectiveness of deep features as a perceptual metric. In CVPR. [Zhang et al., 2017] Zhang, S., He, R., Sun, Z., and Tan, T. (2017). Demeshnet: Blind face inpaint- ing for deep meshface verification. IEEE Transactions on Information Forensics and Security, 13(3):637–647. [Zheng et al., 2019a] Zheng, C., Cham, T.-J., and Cai, J. (2019a). Pluralistic image completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1438–1447. [Zheng et al., 2019b] Zheng, C., Cham, T.-J., and Cai, J. (2019b). Pluralistic image completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1438–1447. [Zhou et al., 2020a] Zhou, T., Ding, C., Lin, S., Wang, X., and Tao, D. (2020a). 
Learning oracle attention for high-fidelity face completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7680–7689.

[Zhou et al., 2020b] Zhou, Y., Wu, C., Li, Z., Cao, C., Ye, Y., Saragih, J., Li, H., and Sheikh, Y. (2020b). Fully convolutional mesh autoencoder using efficient spatially varying kernels. arXiv preprint arXiv:2006.04325.

[Zhu et al., 2017] Zhu, J.-Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., and Shechtman, E. (2017). Multimodal image-to-image translation by enforcing bi-cycle consistency. In Advances in neural information processing systems, pages 465–476.

[Zhu et al., 2016] Zhu, X., Lei, Z., Liu, X., Shi, H., and Li, S. Z. (2016). Face alignment across large poses: A 3d solution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 146–155.

[Zollhöfer et al., 2018] Zollhöfer, M., Thies, J., Garrido, P., Bradley, D., Beeler, T., Pérez, P., Stamminger, M., Nießner, M., and Theobalt, C. (2018). State of the art on monocular 3d face reconstruction, tracking, and applications. In Computer Graphics Forum, volume 37, pages 523–550. Wiley Online Library.

APPENDIX A

3DFACEFILL

A.1 Generalization Performance of 3DFaceFill on In-the-Wild Images Downloaded from the Internet

To compare the generalization performance of different methods, we evaluate face completion on a small dataset of ∼50 in-the-wild face images downloaded from the internet¹ (referred to as Internet). We report the quantitative metrics in Table 3.1, where one can see significant margins between 3DFaceFill and the closest baselines across all three metrics, demonstrating the better generalization performance of our proposed method. Fig. A.1 shows a qualitative comparison on a small sample, where 3DFaceFill generates more realistic completions thanks to the explicit imposition of 3D face priors. This shows that the principles behind 3DFaceFill can improve the generalization performance of image completion approaches on structured objects such as faces.

A.2 Further Qualitative Results on Pose and Illumination Varying Images

We present further face completion results on the pose- and illumination-varying images from the MultiPIE dataset [Gross et al., 2010] in Figs. A.2 and A.3.

¹Source: https://unsplash.com/s/photos/face

[Figure A.1: columns show Input, DeepFillv2 [Yu et al., 2019], PIC [Zheng et al., 2019a], 3DFaceFill (Ours), and Ground Truth.]
Figure A.1 Qualitative evaluation (of generalization performance) on the Internet-downloaded images.

[Figure A.2: rows show Input, DeepFillv2, PICNet, 3DFaceFill, and Ground Truth.]
Figure A.2 Qualitative evaluation of 3DFaceFill vs. the baselines DeepFillv2 [Yu et al., 2019] and PIC [Zheng et al., 2019a] on the pose-varying MultiPIE:Pose split [Gross et al., 2010]. While the baselines tend to generate blurred and deformed faces in extreme poses, 3DFaceFill is pose-robust and generates more accurate completions across a range of poses.

[Figure A.3: rows show Input, DeepFillv2, PICNet, 3DFaceFill, and Ground Truth.]
Figure A.3 Qualitative evaluation of 3DFaceFill vs. the baselines DeepFillv2 [Yu et al., 2019] and PIC [Zheng et al., 2019a] on the illumination-varying MultiPIE:Illu split [Gross et al., 2010].
While the baselines tend to generate artifacts under extreme illumination, 3DFaceFill generates completions that are geometrically accurate and preserve the illumination contrast.

A.3 Implementation Details

In this section, we provide further implementation details on 3DFaceFill. In sub-section A.3.1, we give detailed network architectures for the modules used in 3DFaceFill. In sub-section A.3.2, we provide details of the loss functions used to train the 3D factorization module. Lastly, we give full training details of the different components in sub-section A.3.3.

A.3.1 Network Architectures

We report the detailed network architectures for the 3DMM Encoder E, the Albedo Decoder GA, the Sym-UNet module, the PyramidGAN discriminator and the Face Parser in Tables A.1 to A.5. Our network architectures for the 3DMM modules are based on the architectures used in [Tran and Liu, 2019] for the corresponding modules. Inspired by Miyato et al. [Miyato et al., 2018], we use spectral normalization in all our convolution layers. The abbreviated operators used are defined as follows:

• Conv(cin, cout, k, s, p): 2D convolution with cin input channels, cout output channels, kernel size k, stride s and padding p.

• Deconv(cin, cout, k, s, p): 2D transposed convolution (deconvolution) with cin input channels, cout output channels, kernel size k, stride s and padding p.

• GN(n): Group normalization [Wu and He, 2018] with n groups.

• ELU: Exponential linear unit [Clevert et al., 2015] activation.

• LReLU(α): Leaky ReLU [Maas et al., 2013] with a negative slope of α.

• ResUnit(cin, cout, k, s, p): Residual unit [He et al., 2016] with cin input channels, cout output channels, kernel size k, stride s and padding p, with group normalization [Wu and He, 2018] and ELU activation [Clevert et al., 2015].

• SigGNConv(cin, cout, k, s, p): 2D convolution with cin input channels, cout output channels, kernel size k, stride s and padding p, followed by group normalization [Wu and He, 2018] and sigmoid activation.

• SigGNDeconv(cin, cout, k, s, p): 2D transposed convolution with cin input channels, cout output channels, kernel size k, stride s and padding p, followed by group normalization [Wu and He, 2018] and sigmoid activation.

• SpectralConv(cin, cout, k, s, p): 2D convolution with cin input channels, cout output channels, kernel size k, stride s, padding p and spectral normalization [Miyato et al., 2018].

• Upsample(sh, sw): Upsamples the height by sh and the width by sw using nearest-neighbour interpolation.

A.3.2 3DMM Module Losses

The 3DMM module is trained using a combination of supervised, reconstruction and regularization losses:

\mathcal{L}_{3DMM} = \lambda_{sup}\,\mathcal{L}_{sup} + \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{reg}\,\mathcal{L}_{reg}, \qquad (A.1)

where \mathcal{L}_{sup} = \lambda_S\,\mathcal{L}(S_{gt}, \tilde{S}) + \lambda_\theta\,\mathcal{L}_\theta + \lambda_T\,\mathcal{L}(T^{uv}_{gt}, \tilde{T}^{uv}) + \lambda_{lmark}\,\mathcal{L}_{lmark} uses the ground-truth shape S_gt, pose θ_gt, texture T^uv_gt and 2D landmarks when available; \mathcal{L}_{rec} enforces similarity between the rendered and ground-truth images; and \mathcal{L}_{reg} = \lambda_{3dsym}\,\mathcal{L}_{3dsym} + \lambda_{const}\,\mathcal{L}_{const} comprises regularization losses that enforce bilateral symmetry of the albedo and an effective separation of shading and albedo. The loss coefficients λ are set such that all loss terms have equal weight. We now define these losses:

- Shape loss is defined as:

\mathcal{L}(S_{gt}, \tilde{S}) = \mathbb{E}\left[ \left\| S_{gt} - \tilde{S} \right\|_2^2 \right],

where S_gt and S̃ are the ground-truth and predicted 3D shapes, respectively.
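As a concrete illustration, a minimal PyTorch sketch of this shape loss is given below. The batched tensor layout and the function name are our own assumptions made for illustration; they are not taken from the released 3DFaceFill code.

```python
import torch

def shape_loss(S_gt: torch.Tensor, S_pred: torch.Tensor) -> torch.Tensor:
    """Squared L2 error between ground-truth and predicted 3DMM shapes.

    Both tensors are assumed to hold batched vertex coordinates of shape
    (B, N, 3); the expectation in the loss is approximated by the batch mean.
    """
    # Squared L2 norm of the per-sample vertex residual, averaged over the batch.
    return ((S_gt - S_pred) ** 2).sum(dim=(1, 2)).mean()
```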
- Pose loss is defined as a combination of scale, translation and rotation losses:

\mathcal{L}_\theta = \lambda_s\,\mathbb{E}\left[ (s_{gt} - \tilde{s})^2 \right] + \lambda_t\,\mathbb{E}\left[ \| t_{gt} - \tilde{t} \|_2^2 \right] + \lambda_r\,\mathcal{L}_R,

where s is the scale, t is the translation, and \mathcal{L}_R = \mathbb{E}\left[ \| \mathrm{quat}(R_{gt}) - \mathrm{quat}(\tilde{R}) \|_2^2 \right] is the rotation loss, with R denoting the rotation about the X, Y and Z axes and quat(·) its quaternion representation.

3DMM Encoder E — Layer [output size]:
Image → SpectralConv(3, 32, 7, 2, 3) + GN(8) + ELU [112×112]
SpectralConv(32, 64, 3, 1, 1) + GN(16) + ELU [112×112]
SpectralConv(64, 64, 3, 2, 1) + GN(16) + ELU [56×56]
SpectralConv(64, 96, 3, 1, 1) + GN(24) + ELU [56×56]
SpectralConv(96, 128, 3, 1, 1) + GN(32) + ELU [56×56]
SpectralConv(128, 128, 3, 2, 1) + GN(32) + ELU [28×28]
SpectralConv(128, 196, 3, 1, 1) + GN(48) + ELU [28×28]
SpectralConv(196, 256, 3, 1, 1) + GN(64) + ELU [28×28]
SpectralConv(256, 256, 3, 2, 1) + GN(64) + ELU [14×14]
SpectralConv(256, 256, 3, 1, 1) + GN(64) + ELU [14×14]
SpectralConv(256, 256, 3, 1, 1) + GN(64) + ELU [14×14]
SpectralConv(256, 512, 3, 2, 1) + GN(128) + ELU [7×7]
SpectralConv(512, 512, 3, 1, 1) + GN(128) + ELU [7×7] → feats
feats → SpectralConv(512, 160, 3, 1, 1) + GN(40) + ELU [7×7]; AvgPool(7,7) [1×1]; Linear(160, 6) + Tanh → Pose
feats → SpectralConv(512, 160, 3, 1, 1) + GN(40) + ELU [7×7]; AvgPool(7,7) [1×1]; Linear(160, 27) → Illumination
feats → SpectralConv(512, 512, 3, 1, 1) + GN(128) + ELU [7×7]; SpectralConv(512, 512, 3, 1, 1) + GN(128) + ELU [7×7]; AvgPool(7,7) [1×1]; Linear(512, 199+29) → 199 Shape + 29 Expression coefficients
feats → SpectralConv(512, 512, 3, 1, 1) + GN(128) + ELU [7×7]; AvgPool(7,7) [1×1] → Albedo features
Model complexity: 17.4M parameters

Table A.1 Network architecture of the 3DMM Encoder E. Pose1 corresponds to the scale, Pose2:4 correspond to the yaw, roll and pitch angles normalized by π/2, and Pose5:6 correspond to the X and Y translations normalized by the input image size.

Albedo Decoder — Layer [output size]:
Albedo features → Upsample(3,4) [3×4]
SpectralConv(512, 512, 3, 1, 1) + GN(128) + ELU [3×4]
SpectralConv(512, 256, 3, 1, 1) + GN(64) + ELU [3×4]
Upsample(2,2) [6×8]
SpectralConv(256, 256, 3, 1, 1) + GN(64) + ELU [6×8]
SpectralConv(256, 128, 3, 1, 1) + GN(32) + ELU [6×8]
SpectralConv(128, 128, 3, 1, 1) + GN(32) + ELU [6×8]
Upsample(2,2) [12×16]
SpectralConv(128, 160, 3, 1, 1) + GN(40) + ELU [12×16]
SpectralConv(160, 96, 3, 1, 1) + GN(32) + ELU [12×16]
SpectralConv(96, 128, 3, 1, 1) + GN(32) + ELU [12×16]
Upsample(2,2) [24×32]
SpectralConv(128, 128, 3, 1, 1) + GN(32) + ELU [24×32]
SpectralConv(128, 64, 3, 1, 1) + GN(16) + ELU [24×32]
SpectralConv(64, 96, 3, 1, 1) + GN(24) + ELU [24×32]
Upsample(2,2) [48×64]
SpectralConv(96, 96, 3, 1, 1) + GN(32) + ELU [48×64]
SpectralConv(96, 64, 3, 1, 1) + GN(16) + ELU [48×64]
SpectralConv(64, 64, 3, 1, 1) + GN(16) + ELU [48×64]
Upsample(2,2) [96×128]
SpectralConv(64, 64, 3, 1, 1) + GN(16) + ELU [96×128]
SpectralConv(64, 32, 3, 1, 1) + GN(8) + ELU [96×128]
SpectralConv(32, 32, 3, 1, 1) + GN(8) + ELU [96×128]
Upsample(2,2) [192×256]
SpectralConv(32, 32, 3, 1, 1) + GN(8) + ELU [192×256]
SpectralConv(32, 16, 3, 1, 1) + GN(4) + ELU [192×256]
SpectralConv(16, 16, 3, 1, 1) + GN(4) + ELU [192×256]
Conv(16, 3, 1, 1, 0) + Tanh → Albedo [192×256]
Model complexity: 5.54M parameters

Table A.2 Network architecture of the Albedo Decoder DA that decodes the 512-dimensional albedo features from the 3DMM Encoder E into a 3 × 192 × 256 albedo representation in the UV space.
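All of the SpectralConv + GN + ELU rows in Tables A.1 and A.2 follow the same pattern, so a single building block suffices to sketch them in PyTorch. The module below is our reading of the operator definitions in Sec. A.3.1; its name and exact composition are assumptions rather than the released implementation.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SpectralConvBlock(nn.Module):
    """SpectralConv(c_in, c_out, k, s, p) + GN(groups) + ELU, as used in Tables A.1 and A.2."""

    def __init__(self, c_in: int, c_out: int, k: int, s: int, p: int, groups: int):
        super().__init__()
        # Spectrally normalized 2D convolution [Miyato et al., 2018].
        self.conv = spectral_norm(nn.Conv2d(c_in, c_out, kernel_size=k, stride=s, padding=p))
        # Group normalization [Wu and He, 2018] followed by ELU [Clevert et al., 2015].
        self.norm = nn.GroupNorm(num_groups=groups, num_channels=c_out)
        self.act = nn.ELU(inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

# Example: the first encoder row, SpectralConv(3, 32, 7, 2, 3) + GN(8) + ELU.
first_block = SpectralConvBlock(3, 32, 7, 2, 3, groups=8)
```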
Sym-UNet (Table A.3), listed by column as in the original table:
Inputs: X, X, hflip(X), hflip(X), (f1 ⊙ g1, f1′ ⊙ g1′), (f1 ⊙ g1, f1′ ⊙ g1′), f2 ⊙ g2, f2 ⊙ g2, f3 ⊙ g3, f3 ⊙ g3, f4 ⊙ g4, f4 ⊙ g4, f5 ⊙ g5, f5 ⊙ g5, f51 ⊙ g51, (x4, f4 ⊙ g4), f51 ⊙ g51, f41 ⊙ g41, (x3, f3 ⊙ g3), f41 ⊙ g41, f31 ⊙ g31, (x2, f2 ⊙ g2), f31 ⊙ g31, f21 ⊙ g21, (x1, f1 ⊙ g1), f21 ⊙ g21, f11 ⊙ g11, x0, f11 ⊙ g11, f01 ⊙ g01
Layers: ResUnit(4, 32, 3, 2, 1), SigGNConv(4, 32, 3, 2, 1), ResUnit(4, 32, 3, 2, 1), SigGNConv(4, 32, 3, 2, 1), ResUnit(64, 64, 3, 2, 1), SigGNConv(64, 64, 3, 2, 1), ResUnit(64, 128, 3, 2, 1), SigGNConv(64, 128, 3, 2, 1), ResUnit(128, 256, 3, 2, 1), SigGNConv(128, 256, 3, 2, 1), ResUnit(256, 512, 3, 2, 1), SigGNConv(256, 512, 3, 2, 1), ResUnit(512, 256, 3, 1, 1), SigGNConv(512, 256, 3, 1, 1), Upsample(2,2), ResUnit(512, 128, 3, 1, 1), SigGNDeconv(256, 128, 4, 2, 1), Upsample(2,2), ResUnit(256, 64, 3, 1, 1), SigGNDeconv(128, 64, 4, 2, 1), Upsample(2,2), ResUnit(128, 64, 3, 1, 1), SigGNDeconv(128, 64, 4, 2, 1), Upsample(2,2), ResUnit(128, 64, 3, 1, 1), SigGNDeconv(128, 64, 4, 2, 1), Upsample(2,2), ResUnit(64, 32, 3, 1, 1), SigGNDeconv(64, 32, 4, 2, 1), Conv(32, 4, 1, 1, 0)
Outputs: f1, g1, f1′, g1′, f2, g2, f3, g3, f4, g4, f5, g5, f51, g51, x4, f41, g41, x3, f31, g31, x2, f21, g21, x1, f11, g11, x0, f01, g01, (Âuv, σuv)
Model complexity: 11.7M parameters

Table A.3 Network architecture of the Albedo Inpainter G (Sym-UNet). The input to the network is the concatenation of the masked albedo A^uv_m and the mask M^uv in the UV space, X = (A^uv_m, M^uv). Outputs are the completed albedo Â^uv and the uncertainty map σ^uv.

PyramidGAN discriminator D — Input → Layer → Output:
Igt / Î → SpectralConv(3, 32, 4, 2, 1) + GN(8) + LReLU(.2) → x0
x0 → SpectralConv(32, 64, 4, 2, 1) + GN(16) + LReLU(.2) → x1
x1 → SpectralConv(64, 1, 1, 1, 0) → out1
x1 → SpectralConv(64, 128, 4, 2, 1) + GN(32) + LReLU(.2) → x2
x2 → SpectralConv(128, 1, 1, 1, 0) → out2
x2 → SpectralConv(128, 256, 4, 2, 1) + GN(64) + LReLU(.2) → x3
x3 → SpectralConv(256, 1, 1, 1, 0) → out3
x3 → SpectralConv(256, 512, 4, 2, 1) + GN(128) + LReLU(.2) → x4
x4 → SpectralConv(512, 1, 1, 1, 0) → out4
Model complexity: 2.79M parameters

Table A.4 Network architecture of the PyramidGAN discriminator D.

Face parser — Input → Layer → Output:
Image → ResUnit(3, 32, 3, 1, 1) → x1
x1 → ResUnit(32, 64, 3, 2, 1) → x2
x2 → ResUnit(64, 128, 3, 2, 1) → x3
x3 → ResUnit(128, 256, 3, 2, 1) → x4
x4 → ResUnit(256, 256, 3, 2, 1) → x5
x5 → ResUnit(256, 256, 3, 2, 1) → x6
x6 → Upsample(2,2) → x51
(x51, x5) → ResUnit(512, 256, 3, 1, 1) → x52
x52 → Upsample(2,2) → x41
(x41, x4) → ResUnit(512, 128, 3, 1, 1) → x42
x42 → Upsample(2,2) → x31
(x31, x3) → ResUnit(256, 64, 3, 1, 1) → x32
x32 → Upsample(2,2) → x21
(x21, x2) → ResUnit(128, 32, 3, 1, 1) → x22
x22 → Upsample(2,2) → x11
(x11, x1) → ResUnit(64, 32, 3, 1, 1) → x12
x12 → Conv(32, 3, 1, 1, 0) + Softmax2d → (Mf, Mo, Mb)
Model complexity: 7.18M parameters

Table A.5 Network architecture of the face parser. (x, y) represents the concatenation of tensors x and y along the channel dimension. The output of the network consists of a face mask Mf, an occlusion mask Mo and a background mask Mb.

- Texture loss is defined as:

\mathcal{L}(T^{uv}_{gt}, \tilde{T}^{uv}) = \mathbb{E}\left[ \| T^{uv}_{gt} - \tilde{T}^{uv} \|_2^2 \right],

where T^uv is the texture represented in UV space.

- Landmark loss is defined as:

\mathcal{L}_{lmark} = \left\| \mathcal{M}(\theta) \ast \begin{bmatrix} S(:, d) \\ \mathbf{1} \end{bmatrix} - U \right\|_2^2,

where \mathcal{M} is the camera projection matrix obtained from the pose θ, d selects the 68 indices corresponding to sparse 2D landmarks on the 3D face mesh S, and U ∈ R^{68×2} are the ground-truth locations of the 2D facial landmarks.
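To make the landmark loss concrete, the following is a minimal PyTorch sketch. It assumes a weak-perspective 2 × 4 projection matrix, batched meshes, and precomputed landmark vertex indices; these shapes and the function name are illustrative assumptions, not the original implementation.

```python
import torch

def landmark_loss(M: torch.Tensor, S: torch.Tensor, lmk_idx: torch.Tensor,
                  U: torch.Tensor) -> torch.Tensor:
    """Sparse 2D landmark loss.

    M:       (B, 2, 4) camera projection matrices derived from the pose.
    S:       (B, N, 3) predicted 3D face meshes.
    lmk_idx: (68,) indices of the mesh vertices corresponding to the 68 landmarks.
    U:       (B, 68, 2) ground-truth 2D landmark locations.
    """
    lmk3d = S[:, lmk_idx, :]                              # (B, 68, 3)
    ones = torch.ones_like(lmk3d[..., :1])
    lmk3d_h = torch.cat([lmk3d, ones], dim=-1)            # homogeneous coordinates, (B, 68, 4)
    proj = torch.einsum('bij,bkj->bki', M, lmk3d_h)       # projected 2D landmarks, (B, 68, 2)
    return ((proj - U) ** 2).sum(dim=-1).mean()
```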
- Reconstruction loss is defined as:

\mathcal{L}_{rec} = \left\| (I_{gt} - I_{rec}) \odot M_f \right\|_2^2,

where I_gt and I_rec are the original and the rendered images, respectively, and M_f is the face mask.

- Albedo symmetry loss is defined as:

\mathcal{L}_{3dsym}(A) = \left\| A^{uv} - \mathrm{hflip}(A^{uv}) \right\|_1,

where A^uv is the UV representation of the albedo and hflip(·) is the horizontal image-flipping operation.

- Albedo constancy loss is defined as:

\mathcal{L}_{const}(A) = \sum_{v^{uv}_j \in \mathcal{N}_i} \omega(v^{uv}_i, v^{uv}_j)\, \left\| A^{uv}(v^{uv}_i) - A^{uv}(v^{uv}_j) \right\|_2^p,

where N_i denotes the 4-neighborhood around v^uv_i and the weight ω(v^uv_i, v^uv_j) = exp(−α‖c(v^uv_i) − c(v^uv_j)‖) enforces that pixels with similar chromaticity have similar albedo.

A.3.3 Training Details

3DMM Module: We train the 3DMM module in two stages. First, we train it on the 300W-3D dataset [Zhu et al., 2016], which has ground-truth shape, pose, texture and landmark annotations, for 100k iterations in a supervised manner. Then, we further train it on the CelebA dataset [Liu et al., 2015] at 1/10th of the original learning rate for a further 30k iterations in an unsupervised manner, using only the reconstruction loss, the 2D landmark loss and the regularization losses. During this stage, we use landmark detections from HRNet [Wang et al., 2020] as ground truth for the landmark loss. To make the 3DMM encoder robust to partial face images, we introduce artificial occlusions in the training images using random rectangular masks of varying sizes and locations. In addition, we use random horizontal flipping as a data augmentation. During inference, occlusions are removed from the input image using the occlusion mask before the image is passed through the 3DMM encoder to obtain an occlusion-robust factorization.

Albedo Inpainting Module: The albedo inpainting module is trained on the CelebA dataset [Liu et al., 2015] for 30k iterations. To obtain the UV representations of the partial albedo and the mask, we re-project the 3D mesh obtained from the pretrained 3DMM module onto the partial image and the mask, respectively, as shown in Fig. 3.2. For the GAN loss in Eq. (3.3), we update the inpainter G and the discriminator D alternately at a ratio of 1:1. For all the other completion losses, we update the inpainter G continuously. Other than the random face masks, we use random horizontal flipping as the only data augmentation to train the albedo inpainter.

Face Segmentation Module: Since our method inpaints only the facial region in the UV domain, we restrict the image masks to lie on the face region as well. For this, we train a UNet-based [Ronneberger et al., 2015] face segmentation model that separates the face region from the background, hair and inner mouth. The face segmenter predicts segmentation masks for (a) the face, (b) hair and other occlusions, and (c) the background. We train the face segmentation module on the CelebAMask-HQ dataset [Lee et al., 2020] for a total of 50k iterations using the ground-truth annotations provided by the dataset. We use the focal loss [Lin et al., 2017] to train this module.

For all the modules except the discriminator D, we use the Adam optimizer with an initial learning rate of 10−4 and a step decay of 0.98 per epoch, while for the PyramidGAN discriminator we use an initial learning rate of 3 × 10−4. The input images are first aligned to 256 × 256 using the method suggested in [Lee et al., 2020], which is the alignment used in the CelebA-HQ dataset. For training, we randomly crop the images to a size of 224 × 224, while during inference we use a central crop.
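A minimal PyTorch sketch of this optimizer configuration is shown below. The placeholder modules stand in for the actual inpainter and discriminator architectures (Tables A.3 and A.4); the per-epoch scheduler step mirrors the stated 0.98 step decay, but the exact training loop is our assumption.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the 3DFaceFill inpainter and discriminator;
# the real architectures are those in Tables A.3 and A.4.
inpainter = nn.Conv2d(4, 4, 3, padding=1)
discriminator = nn.Conv2d(3, 1, 4, stride=2, padding=1)

opt_g = torch.optim.Adam(inpainter.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

# Step decay of 0.98 applied once per epoch, as described above.
sched_g = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=0.98)
sched_d = torch.optim.lr_scheduler.ExponentialLR(opt_d, gamma=0.98)

for epoch in range(10):
    # ... alternating G/D updates at a 1:1 ratio would go here ...
    sched_g.step()
    sched_d.step()
```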
The full training takes 2 days on an Intel Xeon E5-2650 machine with two NVIDIA RTX 2080 GPUs, while inference takes 0.1 s per image on a single GPU.

APPENDIX B

COLA-SDF

B.1 Attribute Transfer

We show additional source-to-target attribute transfer results, including shape transfer in Fig. B.1, texture transfer (transfer of both albedo and illumination) in Fig. B.2, and hair/background transfer in Fig. B.3. In Fig. B.3, we again observe that while the hair geometry and style are mainly controlled by the hair/background code, its appearance is partly controlled by the albedo and illumination codes. These results show CoLa-SDF's ability to transfer one attribute while keeping the rest intact and demonstrate the attribute-disentangled latent space learned by CoLa-SDF.

Figure B.1 Further shape transfer results using CoLa-SDF.

Figure B.2 Further texture (albedo + illumination) transfer results using CoLa-SDF.

Figure B.3 Further hair/background transfer results using CoLa-SDF.

APPENDIX C

DIVERSE3DFACE

C.1 Implementation Details

C.1.1 Optimization

We use the PyTorch library to implement our approach. In our experiments, we found that the SGD optimizer with a learning rate of 5 × 10−3 gives the best results compared to the Adam and RMSprop optimizers. For photometric fitting, we used the texture model provided by FLAME (https://flame.is.tue.mpg.de/index.html). We run the fitting stage (Algorithm 1) for n_iter = 2000 iterations and the diversity stage (Algorithm 2) for n_comp = 300 iterations. In Algorithm 1, we set the loss weights as follows: λ_f1 = 5, λ_f2 = 16, λ_f3 = 10−3. During the diversifying shape completion stage (Algorithm 2), we set λ1 = 1000, λ2 = 500, λ3 = 0.025. Further, we found that using a slightly smaller learning rate for the eyeball components while fitting the global+local model gives better results; for these components, we set the learning rate to 0.5 times that of the other components.

C.1.2 Mesh-VAE

The Mesh-VAE model is based on the fully convolutional mesh autoencoder (Meshconv) architecture proposed by Zhou et al. [Zhou et al., 2020b]. Meshconv [Zhou et al., 2020b] uses spatially varying convolutional kernels for different mesh vertices to account for the irregular structure of a 3D mesh. The spatially varying kernels are sampled from the span of a shared weight basis, using learned per-vertex coefficients. In addition, Meshconv defines pooling and unpooling operations on a 3D mesh by performing feature aggregation via Monte Carlo sampling [Zhou et al., 2020b].

We trained the Mesh-VAE with the FLAME-registered [Li et al., 2017a] ground-truth scans provided in the CoMA [Ranjan et al., 2018] and D3DFACS [Cosker et al., 2011] datasets. We perturbed the input meshes with uniformly sampled rectangular masks (in XY) within a range around the mesh center, while gradually increasing the size of the mask per training epoch until it covered ∼40% of the vertices.
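The random rectangular masking described above can be sketched as follows; the tensor layout, the mask-size schedule and the function name are our own assumptions rather than the released Diverse3DFace code.

```python
import torch

def mask_vertices(verts: torch.Tensor, epoch: int, max_epochs: int,
                  max_frac: float = 0.4) -> torch.Tensor:
    """Zero out vertices inside a random XY rectangle around the mesh center.

    verts: (N, 3) FLAME-registered mesh vertices.
    The rectangle grows with the training epoch until roughly `max_frac`
    of the mesh extent is covered (the exact schedule is an assumption).
    """
    center = verts[:, :2].mean(dim=0)
    extent = verts[:, :2].max(dim=0).values - verts[:, :2].min(dim=0).values
    frac = max_frac * min(1.0, (epoch + 1) / max_epochs)
    half = 0.5 * frac * extent * (0.5 + torch.rand(2))   # randomized half-size of the rectangle
    offset = (torch.rand(2) - 0.5) * 0.25 * extent       # jitter the rectangle around the center
    lo, hi = center + offset - half, center + offset + half
    inside = ((verts[:, :2] > lo) & (verts[:, :2] < hi)).all(dim=1)
    masked = verts.clone()
    masked[inside] = 0.0                                  # drop the masked vertices' coordinates
    return masked
```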
We detail the network architecture for the Mesh-VAE in Tabs. C.1 and C.2. The abbreviated operators used are defined as follows:

• vcDownConv(inc, outc, s, r, M) + vcDownRes(s): Downward residual block (as defined in Meshconv [Zhou et al., 2020b]), with inc input channels, outc output channels, stride s, kernel radius r and M shared weight bases. The output is activated with the ELU [Clevert et al., 2015] activation.

• vcUpConv(inc, outc, s, r, M) + vcUpRes(s): Upward residual block (as defined in Meshconv [Zhou et al., 2020b]), with inc input channels, outc output channels, stride s, kernel radius r and M shared weight bases. The output is activated with the ELU [Clevert et al., 2015] activation.

Mesh-VAE Encoder — Input → Layer → Output [size]:
5023 × 3 mesh → vcDownConv(inc = 3, outc = 32, s = 2, r = 43, M = 17) + vcDownRes(2) → [1367 × 32]
→ vcDownConv(inc = 32, outc = 64, s = 1, r = 27, M = 17) + vcDownRes(1) → [1367 × 64]
→ vcDownConv(inc = 64, outc = 128, s = 2, r = 54, M = 17) + vcDownRes(2) → [270 × 128]
→ vcDownConv(inc = 128, outc = 256, s = 1, r = 25, M = 17) + vcDownRes(1) → [270 × 256]
→ vcDownConv(inc = 256, outc = 512, s = 2, r = 81, M = 17) + vcDownRes(2) → [45 × 512]
→ vcDownConv(inc = 512, outc = 1024, s = 1, r = 27, M = 17) + vcDownRes(1) → feats [45 × 1024]
feats → vcDownConv(inc = 1024, outc = 64, s = 2, r = 37, M = 17) + vcDownRes(2) → µ [10 × 64]
feats → vcDownConv(inc = 1024, outc = 64, s = 2, r = 37, M = 17) + vcDownRes(2) → log σ² [10 × 64]
Model complexity: 9M parameters

Table C.1 Network architecture of the Mesh-VAE Encoder Emesh.

Mesh-VAE Decoder — Input → Layer → Output [size]:
10 × 64 latent z → vcUpConv(inc = 64, outc = 1024, s = 2, r = 8, M = 17) + vcUpRes(2) → [45 × 1024]
→ vcUpConv(inc = 1024, outc = 512, s = 1, r = 27, M = 17) + vcUpRes(1) → [45 × 512]
→ vcUpConv(inc = 512, outc = 256, s = 2, r = 16, M = 17) + vcUpRes(2) → [270 × 256]
→ vcUpConv(inc = 256, outc = 128, s = 1, r = 25, M = 17) + vcUpRes(1) → [270 × 128]
→ vcUpConv(inc = 128, outc = 64, s = 2, r = 12, M = 17) + vcUpRes(2) → [1367 × 64]
→ vcUpConv(inc = 64, outc = 32, s = 1, r = 27, M = 17) + vcUpRes(1) → [1367 × 32]
→ vcUpConv(inc = 32, outc = 3, s = 2, r = 24, M = 17) + vcUpRes(2) → output mesh [5023 × 3]
Model complexity: 8M parameters

Table C.2 Network architecture of the Mesh-VAE Decoder Dmesh.
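To make the vcDownConv/vcUpConv rows above more concrete, the following is a simplified sketch of the shared-weight-basis idea behind Meshconv's spatially varying kernels: each vertex mixes M basis kernels with learned per-vertex coefficients. It is an illustration under our own assumptions (fixed-size neighborhoods, no pooling or residual path), not the Meshconv implementation of [Zhou et al., 2020b].

```python
import torch
import torch.nn as nn

class SpatiallyVaryingMeshConv(nn.Module):
    """Per-vertex kernels formed as learned mixtures of M shared weight bases.

    Simplified: each output vertex i aggregates the features of a fixed
    neighborhood nbr_idx[i] with its own kernel W_i = sum_m coeff[i, m] * B_m.
    Pooling/unpooling and the residual structure of vcDownConv/vcUpConv are omitted.
    """

    def __init__(self, n_verts: int, nbr_size: int, c_in: int, c_out: int, n_bases: int):
        super().__init__()
        # Shared weight basis: M basis kernels defined over the neighborhood.
        self.basis = nn.Parameter(torch.randn(n_bases, nbr_size, c_in, c_out) * 0.01)
        # Learned per-vertex mixing coefficients.
        self.coeff = nn.Parameter(torch.randn(n_verts, n_bases) * 0.01)

    def forward(self, x: torch.Tensor, nbr_idx: torch.Tensor) -> torch.Tensor:
        # x: (N, c_in) vertex features; nbr_idx: (N, K) neighbor indices per vertex.
        nbr_feats = x[nbr_idx]                                    # (N, K, c_in)
        # Per-vertex kernels: (N, K, c_in, c_out).
        kernels = torch.einsum('nm,mkio->nkio', self.coeff, self.basis)
        # Aggregate neighborhood features with the vertex-specific kernels.
        return torch.einsum('nki,nkio->no', nbr_feats, kernels)   # (N, c_out)
```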