GEOMETRY-AIDED 3D IMAGE PROCESSING

By Ze Zhang

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2023

ABSTRACT

The recent developments in 3D technology have profoundly impacted the digital world. Geometric modeling of 3D shapes through surfaces and curves therefore plays an ever more critical role in society than before. Similarly, regularly sampled data on volumes (3D images) have found a wide range of applications in areas such as medicine, biology, engineering, the military, and entertainment. Compared to 2D images, 3D images have one more dimension in which to store information. They can therefore directly approximate the physical world without first slicing up the object under investigation, which, however, leads to much greater data complexity and technical difficulty. To address the problem of increased dimensionality, one may often leverage the lower-dimensional geometric features abundant in various applications to facilitate computational tasks. How geometry can assist 3D image processing is thus an intriguing topic, which we explore from three aspects in this dissertation.

Through three real-world problems, we present novel geometry-aided 3D image processing algorithms with contributions in several areas, including biology, computer vision, and computer graphics. In these applications, we demonstrate how geometric structures are vital in guiding image processing, improving accuracy, and facilitating downstream analyses. First, we analyze 3D location-dependent fluorescence data of plant cells. We show that volumetric diffusion using the 3D Laplacian matrix produces an accurate adaptive local threshold for segmenting the cytoskeleton in 3D microscope images of plant cells. Moreover, we propose several indices describing geometric and topological characteristics of cytoskeletons to help biologists understand actin filament dynamics in plant cells. Second, we employ 3D-direction-dependent lighting conditions and introduce shadow masks, generated from rough geometry, into our face relighting pipeline to remove or add more accurate shadows in human face images. In addition, with the assistance of geometric information, we can generate relatively accurate spherical harmonics coefficients (a representation for low-frequency 3D-direction fields) to model the illumination for the relighting task. Last, we explore how geometric shapes can help a deep learning algorithm for a field of both locations and directions, the dynamic neural radiance field for human heads. Given a set of input images or a video of a talking human head, we first embed a geometric model of the human head shape into the 3D implicit field. Then, a set of latent codes anchored on a morphable 3D surface mesh automatically turns the human face with any specific pose and expression into a radiance field. With the radiance field, it is straightforward to assemble RGB values for rays from arbitrary camera positions along chosen directions to form photorealistic renderings of novel views of the same person with changing expressions. Thus, we can reanimate the human head with better realism than a crude textured surface mesh. Furthermore, with the help of the radiance field, we can refine the geometric surfaces to align better with the input videos when surface meshes are needed for downstream applications.
Through the three examples of analyzing 3D spatial functions, 3D directional functions, and functions of both position and direction, we show the promise of geometry-assisted processing tools with concrete applications. Given the shape-dependent nature of most 3D images, we believe there are ample opportunities for integrating geometric representations into their processing pipelines.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . 1
CHAPTER 2 ILEE: AN ALGORITHM FOR QUANTITATIVE ANALYSIS OF CYTOSKELETAL IMAGES . . . 3
2.1 Introduction . . . 3
2.2 Implicit Laplacian of Enhanced Edge . . . 6
2.3 Result . . . 18
2.4 Conclusion and Future work . . . 25
CHAPTER 3 SPHERICAL HARMONICS AND SHADOW MAPS FOR FACE RELIGHTING . . . 26
3.1 Introduction . . . 26
3.2 Related Work . . . 27
3.3 Method . . . 29
3.4 Result . . . 35
CHAPTER 4 NERF HEAD . . . 40
4.1 Introduction . . . 40
4.2 Related Work . . . 42
4.3 Algorithm . . . 45
4.4 Experiments . . . 52
4.5 Conclusion and Future Work . . . 55
BIBLIOGRAPHY . . . 56
APPENDIX CONTRIBUTION ON MULTI-AUTHORED PUBLICATIONS . . . 64

CHAPTER 1 INTRODUCTION

With the advances of 3D imaging technology and the availability of affordable 3D scanners, the amount of 3D data has increased tremendously. Compared to 2D images, 3D images provide rich information about the full geometry of 3D objects, since they can approximate real-world objects without distortion. However, 3D image processing algorithms are usually computationally costly and mathematically involved. Data size grows rapidly with resolution in 3D space, so efficient data structures and efficient processing algorithms are essential. The geometric and topological characteristics of 3D shapes are complex. Unlike 2D images, 3D data can have different representations, and the geometric properties vary from one representation to another. How to analyze the geometric features of an object, how to utilize geometric information to render a shape into a 2D image under different illumination, and how to reconstruct a 3D mesh from a 2D image are all interesting topics with many unsolved problems. In this document, we explore what role geometry plays in solving the questions mentioned above. In biology, scientists take 3D microscopy images of plant cells and observe the structure of the cytoskeleton to understand the dynamics of plant cells.
Traditional ways of processing such 3D data are to project the 3D images onto a 2D plane, which ignores filaments growing in the vertical direction. To avoid information loss along the z-axis, we propose a 3D algorithm to segregate the cytoskeleton from the background. Due to the imaging mechanism of the microscopy sensor, the intensity distribution of the filaments is not globally consistent. We take the shape of the filaments into consideration and generate adaptive local thresholds by solving a partial differential equation, segmenting the cytoskeleton from the background while preserving the geometric details of the filament surface. In computer vision, deep learning has gained notable success in the 2D domain, but it is not fully exploited on 3D data due to the complex nature of 3D shapes. We also explore how 3D geometric information can guide the learning of our network on the face relighting task. Modifying shadows in portrait images given a target lighting is challenging, since shadows result from the interaction of 3D shapes and lighting conditions in 3D space. With shadow maps generated from the face mesh under the source lighting and the target lighting, we can train our network to remove and add shadows in 2D portrait images. View synthesis has been a long-standing and well-studied problem in computer vision, with applications ranging from animation and image editing to VR and AR. Recent works mostly focus on learning implicit representations for 3D objects. One big drawback is that they require dense input. We believe it is possible to learn the representation of a dynamic human head from sparse input by leveraging the geometry of the human head. The rest of the document is organized as follows. We first discuss the adaptive local thresholding algorithm and quantitative analysis of the cytoskeleton in plant cells in Chapter 2. Then, we demonstrate how shadow maps play an important role in the face relighting task in Chapter 3. In Chapter 4, we show how geometry helps with view synthesis, and we also discuss future work briefly.

CHAPTER 2 ILEE: AN ALGORITHM FOR QUANTITATIVE ANALYSIS OF CYTOSKELETAL IMAGES

2.1 Introduction

The eukaryotic cytoskeleton plays essential roles in cell signaling, trafficking, and motion in plant cells. Recent work towards defining the temporal and spatial dynamics of cytoskeletal organization, including as a function of cell status, has utilized quantitative analysis of cytoskeletal fluorescence images as a standard approach to defining cytoskeletal function. However, due to the uneven spatial distribution of the cytoskeleton, including varied filament shape and unstable binding efficiency to staining markers, these approaches may result in inaccurate cytoskeletal segmentation. Additionally, quantitative approaches currently suffer from human bias, as well as information loss caused by z-axis projection of raw images. To overcome these obstacles, we developed Implicit Laplacian of Enhanced Edge (ILEE), a cytoskeletal component segmentation algorithm, which uses a 2D/3D-compatible, unguided local thresholding approach, therefore providing less biased and more stable and accurate results. Empowered by ILEE, we constructed a Python library named ILEE_CSK for automated quantitative analysis of cytoskeleton images, which computes cytoskeletal indices that cover density, bundling, severing, branching, and directionality. Compared to various classic approaches, ILEE generates descriptive data with higher accuracy, stability, robustness, and efficiency.
In addition to the analysis described herein, we have released ILEE_CSK as an open-source library for the community, together with Google Colab pipelines, providing convenient and user-friendly access for biologists that requires no programming knowledge or specific computer configuration.

2.1.1 Background

Higher eukaryotes have evolved complex mechanisms to organize and co-regulate a multitude of cellular processes, including growth, development, movement, cell division, and response to environmental stimuli. For example, plants coordinate growth with resistance against abiotic and biotic stress by engaging numerous systemic signaling processes, among which the cytoskeleton plays an indispensable role [1]. To facilitate these processes and ensure robust and highly specific responses to changes in cell status, plants utilize two types of cytoskeleton − microfilaments and microtubules − to connect intercellular signaling to extracellular environments. Structurally, both are chains dynamically assembled from monomeric subunits, named globular actin and tubulin, respectively, and are involved in ceaseless events of polymerization/depolymerization, bundling, severing, and branching [2, 3], which are commonly referred to as "cytoskeletal dynamics". Spatially, the cytoskeleton forms a web-like matrix within the cytoplasm, and through its vast connectivity, functionally links the plasma membrane, numerous organelles, vesicles, and cellular environments − the sum of which serves as a cell surveillance and signaling platform that functions as a structural and information network [4]. As a structural component of the cell, the cytoskeleton controls numerous physical processes such as movement, shaping, cellular trafficking, and intercellular communication [5]. It also provides the mechanical force required for chromosome separation and plasma membrane division during mitosis and meiosis [6]. In addition to its role within the cytoplasm, the cytoskeleton is also required for a variety of functions within the nucleus, including RNA polymerase recruitment, transcription initiation, and chromosome scaffolding [7].

2.1.2 Related Work

Over the past several decades, confocal microscopy-based methods using fluorescence markers have been developed to monitor changes in cytoskeletal organization [8]. While showing advantages in real-time observation and intuitive visual presentation, these approaches possess critical limitations. They are subject to interpretation from captured images, which potentially involves human bias. As a step to remedy this limitation, the emergence of computational algorithm-based analyses offers a solution to describe the quantitative features of the cytoskeletal architecture with reduced human bias. However, while early studies introduced the concept of using generalizable image processing pipelines [9, 10] to transfer the task of evaluation away from the user and into a series of computer-based quantitative indices, several key bottlenecks emerged. First, most of the quantitative algorithms described to date are limited to 2D images. As a result, these approaches require the user to manually generate z-axis projections from raw data, resulting in an incredible amount of information loss, especially within the perpendicular portion of the cytoskeleton. Second, many approaches require users to manually set the threshold to segment cytoskeletal components from the images, resulting in sampling bias.
Lastly, the accuracy and robustness of current algorithms greatly vary among different types of biological samples. This latter hurdle imposes a considerable disparity in the algorithm performance for plants (usually with curvy and spherical cytoskeleton) and animal (typically straight and complanate) samples, which compromises the per- formance of some advanced cytoskeleton analysis algorithms [11, 12, 13] when directly applied to plant cell images. In fact, while sample source dramatically impacts our ability to evaluate the features of cytoskeletal function across all eukaryotes, the vast majority of current approaches are developed based on cytoskeletal images from animal cells, which indicates potential systemic bias when applied to other types of image samples. Previous work described the development of a global-thresholding-based pipeline to define and evaluate two key parameters of cytoskeleton filament organization in living plant cells: cytoskeletal density, defined by occupancy, and bundling, defined by statistical skewness of fluorescence [14]. Interestingly, while it utilizes manual global thresholding (MGT), which can potentially introduce a certain level of user bias, it still outperforms many standardized adaptive/automatic global or local thresholding approaches such as Otsu [15] or Niblack [16]. As a further advance of this early work, Higaki and colleagues developed the use of coefficient of variation (CV) of fluorescence to quantify the level of filament bundling, which improved the robustness and utility of the algorithm [17]. However, not only does this pipeline consume a considerable amount of time and effort from users for massive sample processing, but it also leaves unaddressed two key issues of rigor in image processing and analysis: information loss and human bias. 2.1.3 Our contribution In the current study, we developed implicit Laplacian of enhanced edge (ILEE), a 2D/3D compatible unguided local thresholding algorithm for cytoskeletal segmentation and analysis, which is based on the native value, first-order derivative (i.e., gradient), and second-order derivative (i.e., Laplacian) of the cytoskeleton image altogether (see Fig. 2.1). The research described herein 5 supports ILEE as a superior quantitative imaging platform, one that overcomes current limitations related to information loss through dimensional reduction, human bias, and inter-sample instability. As shown, ILEE can accurately process cytoskeleton samples with a high dynamic range of fluorescence brightness and thickness, such as live plant samples. As a key advance in the development of ILEE, we further established an ILEE-based Python library for the fully-automated quantitative analysis of 14 cytoskeletal indices within 5 primary classes: density, bundling, connectivity, branching, and anisotropy. This platform not only en- ables the acquisition and evaluation of key actin filament parameters with high accuracy from both projected 2D and native 3D images, but also improves the accessibility to a broader range of biologically-relevant states, including polymerization/depolymerization, bundling, severing, branching, and directional regulation. Herein, we introduce the core ILEE algorithm and pro- pose several novel indices reflecting cytoskeletal dynamics. Using a defined series of images from multiple biological replicates of pathogen-infected plant cells, we demonstrate the performance of this algorithm by multi-perspective comparative analysis. 
Further, we provide evidence that supports the further advancement of 3D-based cytoskeletal computational approaches – a sig- nificant enhancement over currently available 2D-based approaches. Our library, ILEE_CSK, is publicly released at GitHub https://phylars.github.io/ILEE_CSK. In addition, we devel- oped ILEE Google Colab pipelines for data processing, visualization, and statistical analysis, which is a convenient and user-friendly interface that requires no programming experience or particular computational device. 2.2 Implicit Laplacian of Enhanced Edge 2.2.1 Pipeline Raw images generated by laser scanning confocal microscopy are typically obtained through detecting in-focus photons by a sensor from each resolution unit on a given focal plane. Since the cytoskeleton is a 3D structure that permeates throughout the cell, current approaches to capture filament organization and architectural parameters rely on scanning of each plane along the z- axis, independently, at regular intervals within a given depth, and reconstruction into 3D images. 6 Figure 2.1 The pipeline of ILEE. Figure 2.2 The input and output of ILEE. However, due to limited computational biological resources, most studies have exclusively employed the z-axis projected 2D image, which results in substantial information loss, as well as systemic bias in downstream analyses. In our newly developed algorithm, we integrated both 2D and 3D data structures into the same processing pipeline to ameliorate the aforementioned conflict (Fig. 2.1). In short, this pipeline enabled automatic processing and evaluation of both traditional 2D and native 3D z-stack image 7 Figure 2.3 Cytoskeleton segmentation by ILEE (a) steps of ILEE, (b) how coefficient k influences ILEE performance, (c) optimal 𝐾2 estimation. analysis. As shown in Fig. 2.1, cytoskeleton segmentation using ILEE requires 3 inputs: an edge- enhanced image, a global gradient threshold that recognizes the edges of potential cytoskeletal components, and the Laplacian smoothing coefficient K (described below). With these inputs, a local threshold image is generated via ILEE, and the pixels/voxels with values above the threshold image at the same coordinates are classified as cytoskeletal components. The output of this algorithm is a binary image (Fig. 2.2). Once acquired, the binary image is further skeletonized [18] to enable the downstream calculation of numerous cytoskeleton indices, the sum of which comprises the quantitative features of cytoskeletal dynamics (Fig. 2.2). Additionally, because the 2D and 3D modes share a common workflow, all of the calculated cytoskeleton indices also share the definition for both modes, regardless of the difference in dimensional spaces. This additional feature enables a horizontal comparison of both modes by the user, which we assert will significantly contribute to the community by providing massive image datasets for further examination, and comparison through the open-source library. In general, the ultimate goal of this approach, and resultant algorithm, is 8 Figure 2.4 NNES global thresholding. to construct a pipeline that enables the automated detection of the eukaryotic cytoskeleton from complex biological images in an accurate and unbiased manner. 2.2.2 ILEE The pixel intensity of the cytoskeleton has a wide variation among samples, due to varied bundle thickness, the concentration of fluorescent dye, and its binding efficiency. Therefore, it is a highly challenging task to segment the cytoskeleton from the background. 
The pixel value generated by the confocal microscope is mainly influenced by three factors:
1. the real fluorescence emitted by the dye molecules;
2. the diffraction signal emitted by the actin filaments in the neighboring space;
3. systematic error introduced by the equipment, which in our case is usually the ground noise generated by the imaging sensor.

Since the photon sensor has a fixed setting, the distribution of the ground noise is constant. But the diffraction signal varies as the thickness of the filament bundle in the local area changes. Therefore, an adaptive local thresholding algorithm is necessary to obtain a reasonable cytoskeleton segregation result. The key idea of ILEE is to solve a global partial differential equation (PDE) based on the Laplacian to generate a locally adaptive threshold for the cytoskeleton. One advantage of the Laplacian operator is that it can remove undesirable noise while still retaining desirable geometric features [19], which is well suited to our problem. As we observe, the cytoskeleton usually has a very smooth surface and a tubular shape. Therefore, by leveraging the spectral characteristics of the Laplace operator, ILEE can selectively filter out high-frequency noise while preserving the salient geometric features of individual actin filaments. We utilize the Laplacian to build a global linear system, so that the final result is influenced by both the global shape and the local brightness level. We thereby avoid the drawback of local operators, whose performance tends to degrade at varying filament thicknesses. Additionally, the edge of the cytoskeletal component is smoothed and elongated using a significant difference filter (SDF) and a Gaussian filter, the sum of which serves to enhance the continuity of the edge and contributes to the accuracy of edge detection (Fig. 2.1). Since we use an implicit method to solve the PDE, involving both the current state of the system and the later one, we name our algorithm Implicit Laplacian of Enhanced Edge (ILEE).

2.2.2.1 Linear System

We build the linear system based on a global Laplace operator. We add boundary constraints to our partial differential equation to better preserve the geometric features. First, we roughly estimate the edge pixels of the actin filaments based on the gradient, because edges usually have a very high gradient magnitude due to the dramatic change in brightness. Then, we generate a selection set S containing only the points whose gradient magnitude is larger than the pre-estimated global threshold, to mark the potential boundary:

S = \{(x, y, z) \mid |\nabla f(x, y, z)| > c\}   (2.1)

where c is the global threshold for edge pixels and f(x, y, z) is the original image function. Our goal is to preserve the edge points with high gradients, which serve as a guide for local thresholding:

g(x, y, z)|_S = f(x, y, z)|_S   (2.2)

Away from the selected edge points, g(x, y, z) should be harmonic, so we require

\nabla^2 g(x, y, z) = 0   (2.3)

Combining these two conditions, with the boundary constraint enforced as a penalty weighted by k, gives our linear system. The final equation can be written as

(L + kS)\, g = kS f   (2.4)

or, equivalently,

g = (L + kS)^{-1} kS f   (2.5)

where f is the input image and g denotes the output locally adaptive threshold. L is the global Laplacian matrix, and k is the coefficient that adjusts the influence of the Laplacian on the final result. S is the selection matrix, a diagonal matrix whose i-th diagonal entry is 1 if the i-th pixel has a gradient norm above the global gradient threshold, and 0 otherwise.
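As a concrete illustration of Eq. (2.1) and the selection matrix S, the following is a minimal sketch using numpy and scipy; the function name and interface are illustrative and not part of the ILEE_CSK API.

```python
import numpy as np
import scipy.sparse as sp

def edge_selection_matrix(image, grad_threshold):
    """Build the diagonal selection matrix S of Eq. (2.4) from the
    gradient-magnitude test of Eq. (2.1).  `image` is a 3D array and
    `grad_threshold` plays the role of the global edge threshold c."""
    gz, gy, gx = np.gradient(image.astype(np.float64))
    grad_mag = np.sqrt(gx**2 + gy**2 + gz**2)
    # 1 on the diagonal for voxels whose gradient magnitude exceeds c
    selected = (grad_mag > grad_threshold).ravel().astype(np.float64)
    S = sp.diags(selected)          # sparse diagonal selection matrix
    return S, selected.reshape(image.shape).astype(bool)
```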
We use the implicit Euler scheme to construct this differential equation, which ensures that the Laplacian operator behaves as a low-frequency filter and reduces high-frequency noise. The linear system can be solved efficiently since A = L + kS is sparse. We use an off-the-shelf Matlab GPU linear solver (conjugate gradient); however, any generic linear solver can be used instead.

2.2.2.2 Global threshold

In order to identify ground noise and locate the background for downstream analyses (e.g., adaptive local thresholding), we designed an algorithm that calculates a global threshold using the morphological features of the ground noise, namely, non-connected negative element scanning (NNES). In brief, NNES calculates the total number of non-connected negative elements at different global thresholds, resulting in the identification of a representative value with the maximum non-connected negative element count (Fig. 2.4b (i)). The global threshold for the coarse background (Fig. 2.4b (iii)) is then determined using a linear model trained on the representative value rendered by NNES and manual global thresholding (MGT), a global threshold determined by operators experienced in cytoskeleton image analysis. NNES maintains stability and accuracy over different samples that vary in the distribution of native pixel values, because the ground noise is the image component with the lowest value, is subject to a normal distribution, and generally does not interfere with the actual fluorescence signal. Another accessible method is to directly use the peak-of-frequency brightness of the image as a representative value to train a model. However, this approach is less accurate because the interval near the theoretical peak is always turbulent and non-monotone, a limitation potentially due to the pollution of diffracted light.

Figure 2.5 Local anisotropy.

2.2.2.3 Global Laplacian matrix

The Laplace operator is a second-order differential operator in n-dimensional Euclidean space. It measures how a point differs from the average of its neighbors. For each pixel, the Laplacian value is based on its immediate neighbors. In 3D space, it is defined as

\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2} = f_x + f_y + f_z   (2.6)

It is the sum of the second partial derivatives of the function with respect to each independent variable. In 3D Cartesian space there are three independent variables, x, y, and z, and the second differences along these three axes are defined as

f_x = \frac{f(x + \Delta x, y, z) - 2 f(x, y, z) + f(x - \Delta x, y, z)}{\Delta x^2}   (2.7)

f_y = \frac{f(x, y + \Delta y, z) - 2 f(x, y, z) + f(x, y - \Delta y, z)}{\Delta y^2}   (2.8)

f_z = \frac{f(x, y, z + \Delta z) - 2 f(x, y, z) + f(x, y, z - \Delta z)}{\Delta z^2}   (2.9)

When we discretize the continuous Laplacian for 3D images and treat the whole 3D image as a discrete grid, \Delta x, \Delta y, and \Delta z are the sampling intervals along the x, y, and z axes, respectively. For our 3D image data, \Delta x and \Delta y are the pixel length and width, and \Delta z is the gap between adjacent image planes.
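Before spelling out the structure of L in the next paragraph, here is a minimal sketch of how the sparse system (L + kS) g = kS f could be assembled and solved with scipy instead of the Matlab GPU solver used in our implementation; the Kronecker-sum construction and the simplified boundary handling (plain second differences rather than ILEE's mirror padding) are assumptions made for brevity.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def laplacian_3d(shape, dx=1.0, dy=1.0, dz=1.0):
    """Sparse 3D Laplacian with the 6-neighbor stencil of Eqs. (2.6)-(2.9).
    This sketch uses Kronecker sums of 1D second-difference matrices and
    does not reproduce ILEE's mirror boundary handling."""
    def second_diff(n, h):
        main = -2.0 * np.ones(n)
        off = np.ones(n - 1)
        return sp.diags([off, main, off], [-1, 0, 1], format="csr") / h**2
    nz, ny, nx = shape
    Dx, Dy, Dz = second_diff(nx, dx), second_diff(ny, dy), second_diff(nz, dz)
    Ix, Iy, Iz = sp.identity(nx), sp.identity(ny), sp.identity(nz)
    # Kronecker-sum assembly for voxels flattened in C order (z, y, x)
    L = (sp.kron(sp.kron(Iz, Iy), Dx)
         + sp.kron(sp.kron(Iz, Dy), Ix)
         + sp.kron(sp.kron(Dz, Iy), Ix))
    # Sign convention matching Eq. (2.10): positive diagonal, so that
    # A = L + kS below is symmetric positive definite for k > 0.
    return -L

def ilee_threshold(image, S, k, dx=1.0, dy=1.0, dz=1.0):
    """Solve (L + kS) g = kS f for the locally adaptive threshold g."""
    f = image.astype(np.float64).ravel()
    L = laplacian_3d(image.shape, dx, dy, dz)
    A = L + k * S
    b = k * (S @ f)
    g, info = cg(A, b, atol=1e-8)   # conjugate gradient on the sparse system
    return g.reshape(image.shape)
```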
To build the global Laplacian matrix, we transformed the 3D image into one n-dimensional vector v, and the corresponding Laplacian matrix is a large sparse n × n matrix. Since the size of the L matrix grows rapidly for 3D images, we take the L matrix of a 3 × 3 2D image as an example (pixels ordered row by row):

L = \begin{pmatrix}
c & -\frac{2}{\Delta x^2} & 0 & -\frac{2}{\Delta y^2} & 0 & 0 & 0 & 0 & 0 \\
-\frac{1}{\Delta x^2} & c & -\frac{1}{\Delta x^2} & 0 & -\frac{2}{\Delta y^2} & 0 & 0 & 0 & 0 \\
0 & -\frac{2}{\Delta x^2} & c & 0 & 0 & -\frac{2}{\Delta y^2} & 0 & 0 & 0 \\
-\frac{1}{\Delta y^2} & 0 & 0 & c & -\frac{2}{\Delta x^2} & 0 & -\frac{1}{\Delta y^2} & 0 & 0 \\
0 & -\frac{1}{\Delta y^2} & 0 & -\frac{1}{\Delta x^2} & c & -\frac{1}{\Delta x^2} & 0 & -\frac{1}{\Delta y^2} & 0 \\
0 & 0 & -\frac{1}{\Delta y^2} & 0 & -\frac{2}{\Delta x^2} & c & 0 & 0 & -\frac{1}{\Delta y^2} \\
0 & 0 & 0 & -\frac{2}{\Delta y^2} & 0 & 0 & c & -\frac{2}{\Delta x^2} & 0 \\
0 & 0 & 0 & 0 & -\frac{2}{\Delta y^2} & 0 & -\frac{1}{\Delta x^2} & c & -\frac{1}{\Delta x^2} \\
0 & 0 & 0 & 0 & 0 & -\frac{2}{\Delta y^2} & 0 & -\frac{2}{\Delta x^2} & c
\end{pmatrix}   (2.10)

where c = \frac{2}{\Delta x^2} + \frac{2}{\Delta y^2} + \frac{2}{\Delta z^2} in 3D (the \Delta z term is dropped in the 2D example above). We use the mirror technique to deal with boundary pixels, which is why the off-diagonal coefficients are doubled in the rows corresponding to boundary pixels.

2.2.2.4 Coefficient K

To determine the appropriate setting of K, we first tested how different K values influence the resulting local threshold image. As shown in Figure 2.3b, a low value of K generated a binary image that is highly consistent with the selected edges. When K increases, the threshold image shifts towards the average value of the selected edges, with increasing loss of detail. A relatively low K enables the accurate recognition of thin and faint actin filament components, yet is unable to cover the full width of thick filaments. Conversely, a high K value covers thick actin filaments with improved accuracy, resulting in a binary image that is less noisy; however, thin and/or faint filaments tend to be omitted as pseudo-negative pixels (Fig. 2.3b). To overcome this dilemma, we applied a strategy using a lower and a higher K to compute two different threshold images, and hence two binary images, that focus on thin/faint components and thick components, respectively. Then, we generated a full outer-join image that contains all cytoskeleton components in these two binary images. This approach led to improved recognition of actin with varying morphologies (see Fig. 2.3b).
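As a sketch of this dual-K strategy, assuming the ilee_threshold helper outlined earlier (the names are illustrative, not the ILEE_CSK API):

```python
import numpy as np

def dual_k_binarize(image, S, k_low, k_high):
    """Binarize with a low and a high Laplacian coefficient K and merge.
    The low-K pass recovers thin/faint filaments, the high-K pass covers
    thick bundles; the union ("full outer join") keeps both."""
    thr_low = ilee_threshold(image, S, k_low)    # sketched above
    thr_high = ilee_threshold(image, S, k_high)
    binary_low = image > thr_low
    binary_high = image > thr_high
    return np.logical_or(binary_low, binary_high)
```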
2.2.3 Analysis

Based on the binary image generated by ILEE, we developed an automatic tool to evaluate the quantitative features of the cytoskeleton. We first extract the filament skeleton from the binary image, and then define 12 indices to fully describe the cytoskeleton from geometric, topological, and statistical aspects. The skeleton, which is usually the medial axis of the actin filament, plays a vital role in cytoskeleton analysis. It indicates the growing direction and bundling density of the filament, and it reflects the topological structure of the cytoskeleton in a straightforward way. An accurate skeleton is also required to evaluate the filament thickness. To help with the downstream analysis, we applied the parallel thinning algorithm proposed by [18] to generate the skeleton for our sample data. The skeletonization algorithm takes a 3D binary image as input and builds an Euler table to remove surface points in all eight directions simultaneously, so we obtain reasonable skeletons with the desirable geometric and topological features preserved. As the last step of our pipeline, cytoskeletal indices are automatically calculated from the binary image and skeleton generated by ILEE. As a substantial expansion of the previously defined cytoskeletal indices (e.g., occupancy, skewness, and CV), we propose 12 novel indices within 5 classes. In short, these indices describe quantitative features of cytoskeletal morphology and dynamics, and each of them is a critical consideration within the context of complex biological samples (Fig. 2.1a). It is worth noting that the accuracy of these indices can be enhanced with a certain level of image post-processing (e.g., oversampling).

Figure 2.6 ILEE thresholding result.

Density: For the index class "density", we developed a novel set of metrics to evaluate linear density, a feature that measures filament length per unit of 2D/3D space.

• Occupancy: the frequency of positive pixels in the binary image I_b generated by ILEE. Denoting the total number of pixels as N, the occupancy is written as

O(I_b) = \frac{\sum I_b}{N}   (2.11)

• Linear Density: the length of filament per unit of 2D/3D space. Since cytoskeletal filaments in plant cells are shaped like cylinders, it is easier to estimate the filament length on the skeletonized image I_{sk}. We use the sum of the Euclidean lengths of all (graph-theory-defined) branches obtained by the Skan library [20] as the total length of the skeletonized filament and divide it by N.

Bundling: For "bundling", we developed two new, highly robust indices referred to as diameter-by-total-DT (diameter TDT) and diameter-by-skeleton-DT (diameter SDT), both of which measure the physical thickness of filament bundles, in addition to the indirect indices skewness and CV, which estimate the relative bundling level based on the statistical distribution of fluorescence intensity.

• Skewness: the asymmetry of the distribution of filament pixel values with regard to their mean in the raw image I_r. Mathematically, it is defined as

\text{Skewness} = \frac{1}{N_f} \sum_{I_b(x,y)=1} \left( \frac{I_r(x, y) - \mu}{\sigma} \right)^3   (2.12)

where N_f is the number of filament pixels, \mu is the mean, and \sigma is the standard deviation.

• CV: the coefficient of variation of the filament pixels, defined as

CV = \frac{\sigma}{\mu}   (2.13)

• Diameter TDT: the average filament diameter estimated by the Euclidean distance transform of the whole binary image. The Euclidean distance transform map I_{dis} stores the distance from each filament pixel to its nearest background pixel; for background pixels, the Euclidean distance is 0.

• Diameter SDT: the average filament diameter estimated by the Euclidean distance transform evaluated only on the skeleton pixels.

Connectivity: For the class "connectivity", we propose two indices: the total number of connected elements and its derived index, severing activity, which estimates the severing activity per unit length of the cytoskeleton. This metric assumes that severing generates two visible cytoskeletal filaments, which is distinguishable from filament depolymerization. This is an important consideration in terms of the biological activity of the cytoskeleton, as it enables the decoupling of the impact of filament depolymerization and filament severing, key activities of the eukaryotic actin-depolymerizing factor (ADF) and cofilin family of proteins [21].

• Severing activity: the number of connected components in the binary image per unit length of the filament. We count the connected components and divide the count by the total length of the filament skeleton to evaluate the severing activity.

Branching: For the class "branching", our algorithm is based on Skan, a recently developed Python library for the graph-theoretical analysis of the cytoskeleton [20]. To further explore the relationship between filament morphology and the biological activity of branching, we specifically designed an additional index, referred to as "branching activity", which we define as the total number of additional branches emerging from any non-end-point node per unit length of the filament. In total, this index measures the abundance/frequency of cytoskeletal branching.

• Branching activity: the number of branching points per unit length of the filament skeleton. We first classify the different types of nodes: only T-type nodes, with three branches at one node, and X-type nodes, with four branches at one node, are considered non-end nodes. We collect the T-type and X-type nodes and divide their number by the total length of the filament skeleton.
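As an illustration, the sketch below computes several of these indices with numpy/scipy under stated assumptions: the total skeleton length is taken as given (e.g., from Skan's branch statistics), and the factor of two converting the mean distance-transform value (a radius) into a diameter is our assumption rather than a detail taken from ILEE_CSK.

```python
import numpy as np
from scipy import ndimage

def basic_indices(raw, binary, skeleton, total_skeleton_length):
    """Occupancy, skewness, CV, diameter (TDT/SDT), and severing activity.
    `raw` is the fluorescence image, `binary` the ILEE segmentation,
    `skeleton` its skeletonization, and `total_skeleton_length` the summed
    Euclidean branch length (e.g., from Skan)."""
    occupancy = binary.sum() / binary.size                    # Eq. (2.11)

    vals = raw[binary > 0].astype(np.float64)                 # filament pixels
    mu, sigma = vals.mean(), vals.std()
    skewness = np.mean(((vals - mu) / sigma) ** 3)            # Eq. (2.12)
    cv = sigma / mu                                           # Eq. (2.13)

    # Euclidean distance transform: distance to the nearest background voxel
    dist = ndimage.distance_transform_edt(binary)
    diameter_tdt = 2.0 * dist[binary > 0].mean()     # all filament voxels (2x = assumed radius->diameter)
    diameter_sdt = 2.0 * dist[skeleton > 0].mean()   # skeleton voxels only

    # Severing activity: connected components per unit skeleton length
    _, n_components = ndimage.label(binary)
    severing_activity = n_components / total_skeleton_length

    return dict(occupancy=occupancy, skewness=skewness, cv=cv,
                diameter_TDT=diameter_tdt, diameter_SDT=diameter_sdt,
                severing_activity=severing_activity)
```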
Local anisotropy: Finally, our library is capable of estimating the level of directional cytoskeletal growth by indexing local anisotropy, which measures the directionality of local filaments. We generate both numerical and visual output (Fig. 2.1c). We perform a local averaging of the filament alignment tensor, which is constructed as follows. First, we calculate the unit direction vector d_i for each straight filament segment. Then, the alignment tensor for each segment is obtained from the following equation:

t_i = d_i d_i^{\top}   (2.14)

This symmetric tensor is independent of the orientation of the filament segment, and can thus be averaged over a region containing a collection of unoriented line segments. We weight each filament tensor in a circular/spherical local region by the length of its filament segment to produce a smoothed tensor field:

t = \frac{\sum_{i \in S} w_i t_i}{\sum_{i \in S} w_i}   (2.15)

where t is the averaged tensor for the local region S. We calculate the eigenvectors and eigenvalues of t to describe the anisotropy of region S. The eigenvector corresponding to the largest eigenvalue indicates the primary orientation of the filaments in this region, as shown in Fig. 2.5. The difference between the maximum and the minimum eigenvalues is an indicator of the anisotropy in this region. If all the eigenvalues are the same, the indicator is 0, which implies an isotropic region with filaments growing in random directions. If the eigenvalues other than the maximum are all nearly 0, all the filaments in this region are parallel to each other; in this case, they are all aligned with the maximum eigenvector, the dominant filament direction of this region.
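A minimal numpy sketch of this length-weighted tensor averaging and its eigen-analysis for a single local region (the function name and interface are illustrative):

```python
import numpy as np

def local_anisotropy(directions, lengths):
    """Length-weighted average of segment alignment tensors (Eqs. 2.14-2.15)
    for one local region, followed by an eigen-analysis of the result.
    `directions` is an (n, 3) array of unit direction vectors d_i and
    `lengths` holds the corresponding segment lengths used as weights w_i."""
    directions = np.asarray(directions, dtype=np.float64)
    w = np.asarray(lengths, dtype=np.float64)
    # t_i = d_i d_i^T for every segment, then the weighted, normalized sum
    tensors = np.einsum('ni,nj->nij', directions, directions)
    t_bar = np.einsum('n,nij->ij', w, tensors) / w.sum()
    # Symmetric 3x3 matrix: eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(t_bar)
    dominant_direction = eigvecs[:, -1]        # eigenvector of largest eigenvalue
    anisotropy = eigvals[-1] - eigvals[0]      # max - min eigenvalue
    return anisotropy, dominant_direction
```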
2.3 Result

We designed a series of experiments to demonstrate the advantages of ILEE over traditional methods, showing that ILEE achieves high accuracy and stability over various samples.

ILEE achieves high accuracy on actin images. We constructed a dataset of actin images from Arabidopsis leaves with diverse morphology and compared ILEE with numerous traditional global and local thresholding algorithms. We also followed the traditional MGT pipeline, asking independent scientists with rich cytoskeleton analysis experience to manually set a global threshold for each sample, and compared ILEE with the MGT result. To evaluate the accuracy of each algorithm in terms of filament segregation, we manually generated the ground truth binary image for each sample using a digital drawing monitor (Fig. 2.6a, ground truth). We used each of the ground truth binary images as a reference and compared the filament architecture obtained by ILEE, MGT, and 6 additional adaptive thresholding algorithms: Otsu [22], Triangle [23], Li [24], Yan [25], Niblack [16], and Sauvola [26] (Fig. 2.6).

Figure 2.7 Quantitative analysis of cytoskeleton.

As an additional element of rigor, because pseudo-positive pixels can be obtained due to user bias during the generation of the ground truth images (even when the operator is experienced in the actin imaging field), we further analyzed and categorized each non-connected component of pseudo-positive pixels by its shape and connectivity to matched elements, and identified the actin-like pseudo-positive pixels as possible real actin components. As shown in Fig. 2.6a (visualized demonstration), 2.6b (quantitative analysis), and 2.6c (bias analysis), ILEE offers superior performance, with the highest accuracy, low pseudo-positive and pseudo-negative rates, and the lowest bias over local filament thickness. It is noteworthy that the adaptive global thresholding approaches (from Otsu to Yan) tend to be relatively accurate when judging the thick and bright bundles of the cytoskeleton; however, these approaches are unable to capture faint filaments and, as a result, generate a high pseudo-negative rate. Conversely, both adaptive local thresholding approaches, Niblack and Sauvola, generate numerous block-shaped pseudo-positive elements and fail to capture the near-edge regions of thick filaments. For the MGT and Li methods, although they showed a satisfactory match rate as well as lower averaged pseudo-positive/negative rates, their performance is far less stable than that of ILEE (Fig. 2.6b).

As the next step in our analysis, we evaluated the accuracy and stability of cytoskeletal indices computed using ILEE versus other commonly used imaging algorithms. To do this, we first evaluated the ground truth indices from the manually generated binary images. In brief, quantitative measurements were collected from all methods and normalized as the relative fold change with respect to the result generated from the corresponding ground truth image. As shown in Fig. 2.6d, ILEE showed improved stability compared to all other quantitative approaches and the highest accuracy for occupancy, skewness, CV, and diameter.

However, we did observe that in terms of the morphology-sensitive indices (i.e., linear density, severing activity, and branching activity), the ILEE algorithm did not fully conform with the data collected from the ground truth binary images. ILEE generated stable yet dramatically lower output compared to those derived from ground truth images (Fig. 2.6). Upon further inspection, we determined that this is because the manually portrayed ground truth images and the ILEE results show different tendencies in judging the pixels in the narrow areas between two bright filaments. Theoretically, human eyes can "hallucinate" imaginary filaments that do not statistically exist, and there are so far no criteria to assess which judgment is more accurate. Compared to other approaches, many of which displayed obvious, somewhat predictable inaccuracies, the MGT and Li methods still generated satisfactory results, which echoes their performance in actin segmentation. However, the performance of these two algorithms among diverse and complex biological samples was not as stable as that of ILEE. In order to further evaluate the stability and robustness of ILEE performance, we continued to analyze the variance coefficient of all groups (Supplemental Fig. 10), uncovering that ILEE is the only approach that simultaneously maintained high accuracy and stability. Next, we tested the robustness of ILEE and other approaches against noise disturbance by adding different levels of Gaussian noise to the image dataset (Supplemental Figs. 11-13).
Using this approach, we observed that ILEE is still the best performing algorithm, maintaining stable and accurate results of image binarization and cytoskeleton indices against increasing noise. Taken together, these results demonstrate that ILEE has superior accuracy, stability, and robustness over MGT and other classic automated image thresholding approaches in terms of both cytoskeleton segmentation and index computation. ILEE helps discover new features in response to bacterial infection. Our primary impetus for the creation of the ILEE algorithm was to develop a method to define cytoskeleton organization from complex samples, including those during key transitions in cellular status. For example, our previous research has demonstrated that the activation of immune signaling is associated with specific changes in cytoskeletal organization[27, 28]. Complementary to these studies, other research identified the temporal and spatial induction of changes in the cytoskeletal organization as a function of the pathogen (e.g., Pseudomonas syringae) infection and disease development[29, 30]. The sum of these studies, which broadly applied MGT-based quantitative analysis of cytoskeleton architecture, concluded that virulent bacterial infection triggers elevated density (by occupancy) yet induced no changes in filament bundling (by skewness) in the early stages of infection. Since one of our major motivations herein was to develop an ILEE-based toolkit supported by novel cytoskeletal indices to investigate the process of pathogen infection and immune signaling activation, we collected raw data from a previous study[31] describing a bacterial infection experiment using Arabidopsis expressing an actin fluorescence marker (i.e., GFP-fABD2), followed by confocal 21 imaging and data analysis by ILEE as well as MGT conducted by three independent operators with rich experience in actin quantificational analysis (Fig. 2.7). Additionally, because researchers sometimes apply a universal global threshold to all images from a batch of biological experiments to avoid tremendous labor consumption, we included this approach and aimed to observe its performance as well. In this experiment, the only categorical variant is whether sampled plants are treated with bacteria (EV) or not (mock). In total, nine indices that cover features of density, bundling, severing, branching, and directional order are measured and compared. Our first concern is whether bias generated by MGT will influence the result and conclusion generalized from raw image samples of the experiment. We thereby analyzed the correlation of individual MGT values set by the three operators and found only a weak correlation between different operators (Fig. 2.7b), which indicates MGT bias indeed has the potential to impact quantitative results. Interestingly, while minor statistical discrepancies between MGTs by different operators are found in some indices (i.e., skewness and severing activity), most of the MGT results (both adaptive or fixed) shows the same trend as 2D ILEE, yet with far higher standard deviation, or lower stability (Supplemental Fig. 14a) over a certain biological treatment. This indicates that the historical data based on MGT should be majorly trustworthy despite the biased single data points, but an accurate conclusion must be based on a high sampling number that balances the deviation of individuals. 
Since ILEE provides more stable results over biological repeats, we are also interested in whether it renders higher statistical power to identify potential significant differences. Therefore, we compared the p-values of t-tests conducted for each index (Supplemental Fig. 14b) and found that ILEE indeed has the superior statistical power to distinguish numerical differences over datasets. We believe this demonstrates ILEE as a better choice for actin segregation. Next, we attempted to understand whether different indices of the same class, particularly density and bundling, can reflect the quantitative level of the class in accordance, or instead show inconsistency. For density, we correlated the occupancy and linear density values of all methods over actin images of both mock and EV groups and found that occupancy and linear density measurements are in high conformity, with a Pearson coefficient at 0.98 (Supplemental Fig. 15). 22 Interestingly, while both demonstrate a high positive correlation, 2D ILEE and MGT do not share the same numeric relationship. Moreover, 3D ILEE has a weaker correlation, potentially due to cavities introduced by the skeleton image involved in linear density calculation. For bundling indices, we were interested in their level of conformity because direct indices (based on binary shape) and indirect indices (based on relative fluorescence intensity) are completely different strategies to measure bundling. Using the same approach of correlating analysis, we found that diameter_TDT and diameter_SDT indeed display strong positive correlation, while skewness and CV have merely medium-low correlation, which echoes the previous report demonstrating skewness and CV have different performance on the bundling evaluation. Unexpectedly, we also found that CV (as a representative of indirect indices) and diameter-SDT (as a representative of direct indices) have a striking correlation of zero. This is perplexing, as it raises the question of whether skewness or CV should be regarded as an accurate measurement of bundling (see Discussion). This discrepancy is also reflected by the result of 3D ILEE, whose CV and diameter-SDT over mock versus EV reveals the converse results at the significant difference. In general, we believe the biological conclusion that DC3000 treatment renders increased actin bundling level should be reconsidered with further inspection. Last but not least, we sought to learn if additional features of plant actin cytoskeletal dynamics in response to virulent bacterial infection can be identified by the newly introduced indices and enhanced performance of ILEE. As shown in Fig. 2.7d, we observed significantly increased severing activity, local anisotropy, and decreased branching activity triggered by EV compared to the mock. At a minimum, these discoveries potentially lead to new biological interpretations, and as a result, may contribute to the identification of additional immune-regulated processes as a function of actin dynamics. However, while most of the 2D approaches were consistent and in agreement with the other indices, the severing activity estimated by 3D ILEE indicates a significant, but opposite, conclusion. After diagnosing the difference of each 2D ILEE and 3D ILEE sample, we concluded this is potentially due to information loss and misinterpretation by z-axis projection in the 2D-based approach. 
Therefore, we do not recommend totally depending on the 2D model 23 for the analysis of filament severing at the current stage and wish to gather more insight from the community in the future. ILEE is compatible with various sample types. Cytoskeleton imaging from live plant samples is arguably one of the most difficult types of images to evaluate due to the dynamic topology and uneven brightness of actin filaments. While we demonstrated that ILEE shows superior performance over plant actin samples, ILEE and the ILEE_CSK library are generally designed for non-specific confocal images of the cytoskeleton and are therefore applicable to other types of samples. To investigate the compatibility of ILEE to other types of image samples, we tested ILEE on both plant microtubules35 and animal cell actin images (Supplemental Fig. 16). Importantly, we found ILEE can process, with high fidelity and accuracy, both plant and animal cytoskeletal features. This is encouraging, as animal cells generally possess a high volume of straight actin filament bundles, and therefore Hough transform-based feature detection is commonly applied to facilitate and/or enhance the performance of cytoskeleton segregation accuracy. However, this approach has certain limitations; specifically, they neglect and/or miscalculate curvy cytoskeleton fractions11,12. With the advancement of ILEE, Hough transform will not be absolutely necessary, and the potential cytoskeleton indices that rigorously require Hough transform can still utilize ILEE as a provider of binary image input for more accurate results. In addition, we found that images of a single animal cell sometimes contain "void background" areas that are truly blank without any cellular component. This is different from plant tissue whose total image field is sample area, which may negatively influence the accuracy of the computed indices. To solve this issue and further support the animal cell sample, we developed a single-cell mode in the ILEE_CSK, which identifies the effective cell area using the statistical features of the brightness histogram and hence secures accurate index output. While ILEE was already tested on both plant and animal image samples, we would like to encourage researchers in the community to report issues (https://phylars.github.io/ILEE\_CSK/Help\%20needed/) where ILEE cannot segregate cytoskeleton correctly to help us improve the compatibility of ILEE and the ILEE_CSK library. 24 2.4 Conclusion and Future work While ILEE has already remedied many disadvantages of traditional methods such as MGT, we are still working to further advance the ILEE approaches presented herein. Our goal is to ultimately arrive at a method that not only improves upon our currently described actin segmentation algorithms but also integrates time and space to describe a 4D model of cytoskeleton dynamics, as well as general cellular processes tractable by microscopy. As such, we offer the following as an initial list of potential upgrades and applications to be integrated into our library: ILEE compatibility to x-y-t and x-y-z-t data, where t represents time. We are in the process of developing a 4D-compatible analysis of cytoskeletal dynamics that tracks filament organization over time. This approach will provide a temporal evaluation of supported indices with high accuracy and robustness. Deep learning-based cytoskeleton segmentation algorithm with "foreign object" removal. 
As presented herein, ILEE enables the generation of trustworthy binary images on a large scale, which enables the construction of deep learning models to identify cytoskeleton components from confocal images with potentially better performance. The deep learning-based approach is also the key to solving the ultimate problem of all current cytoskeleton segmentation algorithms (including ILEE), which is the inability to detect and erase non-cytoskeleton objects with high fluorescence, such as the nucleus and cytoplasm. As one approach to circumvent this limitation 25 CHAPTER 3 SPHERICAL HARMONICS AND SHADOW MAPS FOR FACE RELIGHTING 3.1 Introduction Face relighting is the problem of turning a source image of a human face into a new image of the same face under the desired illumination different from the original lighting. It has long been studied in computer vision and computer graphics and has a wide range of applications in face-related problems such as face recognition [32] as well as in entertainment. With the everlasting interest in consumer photography and photo editing, the ability to produce realistic relit face images will remain an important problem. Many existing face relighting models utilize intrinsic decomposition of the image into face geometry, lighting, and reflectance [33, 34, 35, 36, 37, 32, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]. The source image is then relit by rendering with a novel illumination. Other relighting methods employ image-to-image translation [48, 49, 34] or style transfer [50, 51, 52, 53]. For most face relighting applications, one important requirement is the preservation of the subject’s local facial details during relighting. Intrinsic decomposition methods often compromise high-frequency details and can leave artifacts in the relit face images due to geometry or reflectance estimation errors. Another essential feature of a desirable face relighting model is proper shadow handling. For entertainment, in particular, adding and removing shadows accurately is crucial in producing photorealistic results. Most existing relighting methods, however, do not model hard self-cast shadows caused by directional lights. Our proposed method uses an hourglass network to formulate the relighting problem as a ratio (quotient) image [54] estimation problem. In particular, the ratio image estimated by our model can be multiplied with the source image to generate the target image under the new illumination. Such an approach allows our relighting model to maintain the local facial details of the subject while adjusting the intensity of each pixel during relighting. Thus, we employ a ratio image estimation loss to enable ratio image learning, as well as a structural dissimilarity (DSSIM) loss based on the structural similarity metric (SSIM) [55] to enhance the perceptual quality of relit faces. In addition, 26 we incorporate PatchGAN [56] to further improve the plausibility. During training, we generate and leverage shadow masks, which indicate estimated shadow regions for each image using the lighting direction and 3D shape from 3D Morphable Models (3DMM) [57] fitting. The shadow masks enable us to handle shadow through weighted ratio image estimation loss. We place a higher emphasis on the pixels close to shadow borders in the source and target relighting images, with larger weights placed on borders of high-contrast cast shadows over soft ones. 
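To make the weighting concrete, below is a minimal scipy sketch of one way such per-pixel border weights could be derived from a binary shadow mask; the morphological border extraction, Gaussian falloff, and contrast scaling are illustrative assumptions, not the exact scheme used in our implementation.

```python
import numpy as np
from scipy import ndimage

def shadow_border_weights(shadow_mask, border_width=5, contrast=1.0):
    """Per-pixel weights that emphasize pixels near shadow borders.
    `shadow_mask` is a binary (H, W) mask of estimated shadow regions;
    `contrast` scales weights up for hard, high-contrast shadows.
    This is an illustrative scheme, not the exact one used in the paper."""
    mask = shadow_mask.astype(bool)
    # Border = pixels whose neighborhood contains both shadow and non-shadow
    dilated = ndimage.binary_dilation(mask)
    eroded = ndimage.binary_erosion(mask)
    border = dilated & ~eroded
    # Distance of every pixel to the nearest border pixel
    dist = ndimage.distance_transform_edt(~border)
    # Gaussian falloff: weight near 1 on the border, decaying with distance
    weights = np.exp(-(dist ** 2) / (2.0 * border_width ** 2))
    weights[dist > 3 * border_width] = 0.0   # zero far away from any border
    return contrast * weights
```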
This simple strategy allows learning how to accurately add and remove both hard and soft shadows under various relighting scenarios. Our training process can leverage images with both diffuse and directional lighting across multiple datasets, which improves our ability to handle diverse lighting and generalize to unseen data over methods that only train on a single dataset [48, 34, 49]. To enable this, we use our shadow masks to estimate the ambient lighting intensity in each image and modify our lighting to account for differences in ambient lighting across images and datasets. Thus, our model accommodates differences in the environment between images in controlled and in-the-wild settings. Our proposed method has three main contributions: ⋄ We propose a novel face relighting method that models both high-contrast cast shadows and soft shadows, while preserving the local facial details. ⋄ Our technical approach involves single image-based ratio image estimation to better preserve local details, shadow border reweighting to handle hard shadows, and ambient light compensation to account for dataset differences. ⋄ Our approach achieves the state-of-the-art relighting results on two benchmarks quantitatively and qualitatively. 3.2 Related Work Face Relighting Among prior face relighting work, many conduct relighting via intrinsic de- composition and rendering [33, 34, 35, 36, 37, 32, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]: the source image is decomposed into face geometry, reflectance, and lighting, and recombined with modified lighting to render relit images. As the decomposition generally relies heavily on single image 27 Figure 3.1 Overview of our proposed method. face reconstruction, which remains largely an open problem, these methods tend to produce results that lack high-frequency detail found in the source image and contain artifacts from geometry and reflectance estimation error. Our method bypasses this issue by avoiding intrinsic decomposition entirely and estimating a ratio image instead. The ratio image only affects the intensity of each pixel in the source image in a smooth way aside from shadow borders, thus preserving the local facial details of the subject. Other relighting methods do not perform explicit intrinsic decomposition. Sun et al. [48] use image-to-image translation to produce high-quality relighting. However, their results deteriorate when the input image contains hard shadows or sharp specularities, or when presented with strong directional light. Zhou et al. [49] assume a spherical harmonics (SH) lighting model and estimate the target luminance from the source luminance and a target lighting. However, their model is trained on images with primarily diffuse lighting and thus only handles soft shadows. We train our model on a mixture of images with diffuse and strong directional light. We also assign a larger emphasis on learning the ratio image near the shadow borders of high-contrast cast shadows. Hence, our model handles shadows effectively. Some methods relight using a style transfer approach [50, 51, 52, 53] by transferring the lighting conditions of a reference image as a style to a source image. However, since they require a 28 reference image for each unique target lighting, they are less flexible to use in practice. Our model only requires a source image and a target lighting as input. 
Ratio Images in Face Relighting Prior face relighting methods that incorporate ratio images often require multiple images per subject as input [54, 58] or both the source and target images [59], limiting their real-world applicability. Wen et al. [60] propose the first work on single-image face relighting with ratio images by estimating the ratio between radiance environment maps. However, they use a fixed face geometry with manually labeled feature correspondences and only consider diffuse reflections. We instead estimate the ratio image between the source and target images and thus can directly produce non-diffuse relighting without resorting to a face mesh. Zhou et al. [49] use a ratio image-based algorithm to synthesize their Deep Portrait Relighting (DPR) training data. Our work is the first ratio image-based face relighting method that can model non-diffuse relighting effects, including shadows caused by strong directional lights, while requiring only one source image and a target lighting as input.

3.3 Method

3.3.1 Pipeline

Our model takes a source image and a target lighting as input and outputs the relit image under the target lighting along with the estimated source lighting. We represent the source and target lighting as the first 9 Spherical Harmonics (SH) coefficients. We adopt the hourglass network structure [49], but rather than directly estimating the target luminance, we instead estimate the ratio image between the input and target luminance. Similar to [49], our model only modifies the luminance channel: they use the Lab color space and estimate the target luminance before recombining with the source image's a and b channels to generate the relit image, whereas we estimate the ratio image for the Y channel of the YUV color space. The Y channel of the target image is computed by multiplying the estimated ratio image with the Y channel of the source image, which is then recombined with the U and V channels of the source image and converted to RGB to produce the final relit image.

3.3.2 Training Losses

We employ several loss functions to estimate ratio images that preserve high-frequency details while capturing significant changes around shadow borders. To directly supervise the ratio image learning, we employ the following ratio image estimation loss L_ratio:

L_{\mathrm{ratio}} = \frac{1}{N} \left\| \log_{10}(R_p) - \log_{10}(R_t) \right\|_1 .    (3.1)

Here, R_p and R_t are the predicted and ground-truth ratio images, respectively, and N is the number of pixels in the image. Defining the loss in log space ensures that ratio image values of r and 1/r receive equal weight in the loss. We have two additional loss functions that place a higher emphasis on the ratio image estimation near the shadow borders of the source and target images. The shadow border ratio image loss L_{i,border} is defined as:

L_{i,\mathrm{border}} = \frac{1}{N_i} \left\| W_i \odot \left( \log_{10}(R_p) - \log_{10}(R_t) \right) \right\|_1 ,    (3.2)

where i is s or t, denoting the source or target respectively, and ⊙ is element-wise multiplication. W_s and W_t are per-pixel weights that are element-wise multiplied with the per-pixel ratio image error, enabling our model to emphasize the ratio image estimation at or near shadow borders. N_s and N_t are the numbers of pixels with nonzero weights in W_s and W_t, respectively. See Sec. 3.3.3 for details on W_s and W_t. We also supervise the source lighting estimation using the loss term L_lighting defined as:

L_{\mathrm{lighting}} = \left\| \ell_p - \ell_s \right\|_2 ,    (3.3)

where ℓ_p and ℓ_s are the predicted and ground-truth source lighting, represented as the first 9 SH coefficients.
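For concreteness, the following is a minimal PyTorch sketch of the ratio image estimation loss in Eq. (3.1) and the border-weighted variant in Eq. (3.2), together with the relighting step of Sec. 3.3.1. The function names and placeholder tensors are illustrative assumptions, not the actual training code.

import torch

def ratio_loss(r_pred, r_true):
    # L_ratio (Eq. 3.1): mean absolute difference of the log10 ratio images.
    # Working in log space gives ratios r and 1/r equal weight.
    return torch.mean(torch.abs(torch.log10(r_pred) - torch.log10(r_true)))

def border_ratio_loss(r_pred, r_true, weights):
    # L_{i,border} (Eq. 3.2): the same error reweighted per pixel and normalized
    # by the number of pixels with nonzero weight (the shadow-border pixels).
    err = weights * torch.abs(torch.log10(r_pred) - torch.log10(r_true))
    n_nonzero = (weights > 0).float().sum().clamp(min=1.0)
    return err.sum() / n_nonzero

# Relighting step of Sec. 3.3.1 with a predicted ratio image (placeholder tensors):
y_source = torch.rand(1, 1, 256, 256) + 0.1   # Y channel of the source image
r_pred = torch.rand(1, 1, 256, 256) + 0.5     # predicted ratio image
y_target = r_pred * y_source                  # Y channel of the relit image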
Similar to [49], we define a gradient consistency loss L_gradient to enforce that the image gradients of the predicted and target ratio images (R_p and R_t) are similar, and a face feature consistency loss L_face to ensure that images of the same subject under different lighting conditions have the same face features. L_gradient preserves the image edges and avoids producing blurry relit images. L_face further preserves the local facial details of the subject during relighting. To enhance the perceptual quality of our relit images, we employ two PatchGAN [56] discriminators: one operates on 70 × 70 patches and the other on 140 × 140 patches. We train the discriminators jointly with the hourglass network, using the predicted luminance as fake samples and the target luminance as real samples. We denote this loss as L_adversarial. Finally, similar to [34], we define a structural dissimilarity (DSSIM) loss L_DSSIM between R_p and R_t as:

L_{\mathrm{DSSIM}} = \frac{1 - \mathrm{SSIM}(R_p, R_t)}{2} .    (3.4)

Our final loss function L is the sum:

L = L_{\mathrm{ratio}} + L_{s,\mathrm{border}} + L_{t,\mathrm{border}} + L_{\mathrm{lighting}} + L_{\mathrm{gradient}} + L_{\mathrm{face}} + L_{\mathrm{adversarial}} + L_{\mathrm{DSSIM}} .    (3.5)

We use coefficients to balance all the terms in the loss function.

3.3.3 Shadow maps

Spherical harmonics have long been used for diffuse and ambient lighting in computer graphics. SH lighting is a way to calculate the illumination on 3D models from image-based lighting sources, making it possible to capture, relight, and display global-illumination-style images. Spherical harmonics are camera-independent and require relatively low computational effort. Spherical harmonics are functions defined on the unit sphere. They can be obtained by solving a partial differential equation, specifically Laplace's equation, in the spherical domain. Since spherical harmonics form a complete set of orthogonal functions, any function defined on the surface of a sphere can be represented as a linear combination of spherical harmonics. All SH lighting techniques involve representing lighting equations with spherical functions that have been projected into frequency space using the spherical harmonics as a basis. Spherical harmonics have been commonly used in modeling illumination for Lambertian reflectance. Previous work [49] uses an SfS network to estimate the SH coefficients. Since we know either the light direction or the point light position for our datasets, we can calculate the SH coefficients numerically to avoid the estimation error produced by the neural network. We propose a method to estimate precise SH coefficients. We first create an environment map on a unit sphere according to the lighting condition, and then we project the environment map onto the SH basis to generate the SH coefficients for our dataset. The environment map is reduced to the first nine SH coefficients since high-frequency lighting details are not required in our case.

Figure 3.2 Shadow Masks and Weights. (a) input image, (b) shadow mask, (c) shadow border weights (intensity proportional to weights).

SH basis: The spherical harmonics Y_l^m(θ, φ) are the angular portion of the solution to Laplace's equation in spherical coordinates. Spherical harmonics take their simplest form in the Cartesian coordinate system. They can be defined as homogeneous polynomials of degree l and order m in (x, y, z) that satisfy Laplace's equation.
The set of real spherical harmonics is written as:

Y_0^0 = \frac{1}{2}\sqrt{\frac{1}{\pi}}
Y_1^{-1} = \frac{1}{2}\sqrt{\frac{3}{\pi}} \frac{y}{\sqrt{x^2+y^2+z^2}}
Y_1^{0} = \frac{1}{2}\sqrt{\frac{3}{\pi}} \frac{z}{\sqrt{x^2+y^2+z^2}}
Y_1^{1} = \frac{1}{2}\sqrt{\frac{3}{\pi}} \frac{x}{\sqrt{x^2+y^2+z^2}}
Y_2^{-2} = \frac{1}{2}\sqrt{\frac{15}{\pi}} \frac{xy}{x^2+y^2+z^2}
Y_2^{-1} = \frac{1}{2}\sqrt{\frac{15}{\pi}} \frac{yz}{x^2+y^2+z^2}
Y_2^{0} = \frac{1}{4}\sqrt{\frac{5}{\pi}} \frac{-x^2-y^2+2z^2}{x^2+y^2+z^2}
Y_2^{1} = \frac{1}{2}\sqrt{\frac{15}{\pi}} \frac{zx}{x^2+y^2+z^2}
Y_2^{2} = \frac{1}{4}\sqrt{\frac{15}{\pi}} \frac{x^2-y^2}{x^2+y^2+z^2}    (3.6)

where (x, y, z) is a unit vector on the unit sphere.

Environment map: Similar to [61], we model our light as a Gaussian light, where the mean position μ is the light position in camera coordinates and the standard deviation is σ = 8°. We then project the light onto a unit sphere centered at the origin to generate a 256 × 128 environment map. The pixel intensity of the environment map is defined as:

I(\theta, \phi) = \begin{cases} 0, & \alpha > 90^{\circ} \\ e^{-\frac{\alpha^2}{2\sigma^2}}, & 0^{\circ} \le \alpha \le 90^{\circ} \end{cases}    (3.7)

where α is the angle between the light direction and the unit vector of inclination θ and azimuth φ, and σ = 8° is the standard deviation of the Gaussian lighting distribution.

We discretize the SH basis to the same size as our environment map. To get the SH coefficients, we project the environment map onto the basis functions and integrate over the surface of the unit sphere. The i-th coefficient is written as:

\alpha_i = \int_0^{2\pi} \int_0^{\pi} Y_i(\theta, \phi) \, I(\theta, \phi) \sin\phi \, d\phi \, d\theta    (3.8)

where Y_i(θ, φ) is the i-th SH basis function at (θ, φ) and I(θ, φ) is the corresponding intensity value of the environment map. The discrete form is:

\alpha_i = \sum_{\theta} \sum_{\phi} Y_i(\theta, \phi) \, I(\theta, \phi) \sin\phi    (3.9)

Ambient light: Since ambient light is inconsistent across images, especially between light-stage and in-the-wild images, we introduce a method to estimate the ambient light using the shadow mask. Since only ambient light contributes to shadowed regions, we use the average intensity of the shadow pixels as an estimate of the ambient light intensity in the image. To sum the contributions of directional and ambient light, we first model each image's directional light as a point light and estimate the corresponding 9 SH coefficients. We then add the estimated ambient light intensity to the 0th SH coefficient, which represents the overall light intensity.

Shadow mask: Since the SH coefficients only provide the overall, low-frequency global illumination without high-frequency details, the shadow mask is a necessary step to incorporate 3D geometric information and guide the neural network to learn how to remove and add shadows. We create shadow masks for all training images using the lighting direction and the 3D face mesh offered by 3DMM fitting. We estimate the 3D mesh, transformation matrix, and camera matrix for each face. To generate the shadow mask, we utilize a ray tracing algorithm to cast parallel rays along the z-axis towards the mesh. If a ray hits the face, we check for two kinds of shadows at the intersection point: self shadows and cast shadows. Portions of the face between the light source and the intersection point will block the light and cast a shadow. To determine if the point lies in a cast shadow, we cast a "shadow feeler" ray from the intersection point to the light source [62]. If the ray hits the surface, the intersection point is in a cast shadow. To determine if the point lies in a self shadow, we compute the angle between the light source's direction and the surface normal at the intersection point. If the angle is obtuse, the light hits the back of the surface and the point is considered to be in a self shadow. If the intersection point is either in a cast shadow or a self shadow, we assign 0 to the corresponding pixel. Otherwise, the pixel is illuminated and is 1 in the shadow mask.
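To make the SH projection of Eqs. (3.7)–(3.9) above concrete, the following NumPy sketch builds the Gaussian environment map and projects it onto the first nine basis functions of Eq. (3.6). The equirectangular discretization, the pixel-center sampling, and the explicit solid-angle term are assumptions of this sketch; the actual pipeline may discretize and normalize differently.

import numpy as np

def sh_basis(d):
    # First nine real SH basis functions of Eq. (3.6), evaluated at unit vectors d (..., 3).
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    return np.stack([
        0.5 * np.sqrt(1.0 / np.pi) * np.ones_like(x),
        0.5 * np.sqrt(3.0 / np.pi) * y,
        0.5 * np.sqrt(3.0 / np.pi) * z,
        0.5 * np.sqrt(3.0 / np.pi) * x,
        0.5 * np.sqrt(15.0 / np.pi) * x * y,
        0.5 * np.sqrt(15.0 / np.pi) * y * z,
        0.25 * np.sqrt(5.0 / np.pi) * (2 * z ** 2 - x ** 2 - y ** 2),
        0.5 * np.sqrt(15.0 / np.pi) * z * x,
        0.25 * np.sqrt(15.0 / np.pi) * (x ** 2 - y ** 2),
    ], axis=-1)

def sh_coefficients(light_dir, sigma_deg=8.0, width=256, height=128):
    # Build the Gaussian environment map of Eq. (3.7) and project it onto the SH basis,
    # i.e., the discrete sum of Eq. (3.9) with an explicit solid-angle factor.
    theta = (np.arange(height) + 0.5) / height * np.pi        # inclination of each row
    phi = (np.arange(width) + 0.5) / width * 2.0 * np.pi      # azimuth of each column
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1)                 # unit direction per pixel
    light_dir = np.asarray(light_dir, dtype=float)
    light_dir = light_dir / np.linalg.norm(light_dir)
    alpha = np.arccos(np.clip(dirs @ light_dir, -1.0, 1.0))   # angle to the light direction
    sigma = np.deg2rad(sigma_deg)
    env = np.where(alpha <= np.pi / 2.0, np.exp(-alpha ** 2 / (2.0 * sigma ** 2)), 0.0)
    d_omega = np.sin(theta) * (np.pi / height) * (2.0 * np.pi / width)
    return np.sum(sh_basis(dirs) * (env * d_omega)[..., None], axis=(0, 1))

coeffs = sh_coefficients([0.3, 0.2, 0.93])  # nine SH coefficients for an example light direction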
Border weights: We reweight our ratio image estimation loss to assign larger weights to pixels near the shadow border. Higher-contrast hard shadows should have a higher weight than lower-contrast soft shadows, which pushes the neural network to put more emphasis on learning the shadow boundary accurately.

Method        Si-MSE    MSE       DSSIM
SfSNet [33]   0.0545    0.1330    0.3151
DPR [49]      0.0282    0.0702    0.1818
Our model     0.0220    0.0292    0.1605
Table 3.1 Relighting Performance Using Target Lighting on Multi-PIE. We compare our relighting performance against prior methods that take a source image and a target Spherical Harmonics lighting as input. Our method is able to outperform previous work across all three metrics.

Method             Si-MSE    MSE       DSSIM
Shih et al. [50]   0.0374    0.0455    0.2260
Shu et al. [53]    0.0162    0.0243    0.1383
Our model          0.0148    0.0204    0.1150
Table 3.2 Lighting Transfer Evaluation on Multi-PIE. We compare our lighting transfer performance against two existing lighting transfer algorithms. Each input image is assigned a random reference image. Our model outperforms both approaches across all three metrics.

3.4 Result

3.4.1 Dataset

We train our model using images from two datasets, one with mostly diffuse lighting and one with strong directional lights. Our diffuse lighting dataset is the Deep Portrait Relighting (DPR) dataset, where we use the same training images as [49]. Our dataset with strong directional lighting is the Extended Yale Face Database B [63], which contains 16,380 images of 28 subjects with 9 poses and 65 illumination conditions (64 distinct lighting directions and one ambient lighting).

(a) Source Image (b) Target Image (c) Our model (d) DPR [49] (e) SIPR [48] (f) SfSNet [33]
Figure 3.3 Qualitative Relighting Results on Multi-PIE Using Target Lighting. Here, we compare our relighting results on Multi-PIE with prior work that takes a source image and a target lighting as input. Images for SIPR [48] are provided by the authors. Notice that our model produces significantly better cast shadows, especially around the nose, than previous methods.

(a) Source Image (b) Reference Image (c) Target Image (d) Our model (e) Shih et al. [50] (f) Shu et al. [53]
Figure 3.4 Qualitative Lighting Transfer Results on Multi-PIE. Here, we compare our lighting transfer results with two baselines by estimating the target lighting from the reference image. Images for [53] are provided by the authors. Notice that our model is able to produce the correct lighting, whereas [50] produces the wrong lighting. Both baselines also strongly alter the skin tone of the subjects, whereas our method is able to maintain their skin tone reasonably well.

3.4.2 Quantitative Evaluation

We compare our model's relighting performance with prior face relighting work on the Multi-PIE dataset [64], where each subject is illuminated under 19 different lighting conditions (18 images with known directional lights, one image with no directional lighting).

Relighting Evaluation Using Target Lightings. We compare against prior relighting methods

(a) Source Image (b) Target Lighting (c) Our model (d) DPR [49] (e) SIPR [48] (f) SfSNet [33]
Figure 3.5 Qualitative Relighting Results on FFHQ Using Target Lighting. We compare our relighting results on FFHQ subjects with prior work. Images for SIPR [48] are provided by the authors.
Across all lighting conditions, our model produces better cast shadows than prior work, especially around the nose and eyes. [49, 33] that take a source image and a target lighting as input. For each subject and each session, we randomly select one of the 19 images as the source image and one image as the target image, which serves as the relighting groundtruth. The target image’s lighting is then used to relight the source image. This leads to a total of 921 relit images. We evaluate the relighting performance using three error metrics: Si-MSE [49], MSE, and DSSIM. The results are shown in Table 3.1. Relighting Evaluation Using Reference Images. We also compare our model’s performance against relighting methods that require both a source and a reference image as input [50, 53]. These 37 methods relight the source image by transferring the reference image’s lighting to the source. For each Multi-PIE image, we randomly select a reference image across all subjects from the dataset and estimate the target lighting from the reference image. We then relight the source image using the estimated target lighting. The results are shown in Table 3.2. Facial Detail Preservation Evaluation. To compare our model’s ability to preserve the subject’s local and global facial details during relighting with prior work, we compute the average cosine similarity of the VGG-Face [65] features of the relit and groundtruth Multi-PIE images across different layers of VGG-Face. In particular, we use the layer before each max-pooling layer in VGG-Face. The results of the evaluation are shown in Fig. 3.6. Our model achieves noticeably higher cosine similarity than prior work in earlier layers and is comparable or slightly better than prior work in later layers, indicating that our model is better at preserving local facial features than previous methods. 3.4.3 Qualitative Evaluations Figure 3.6 Facial Detail Preservation Evaluation on Multi-PIE. We compute the cosine similarity between the VGG-Face [65] features of each method’s relit images and the ground truth target images as a measure of their ability to preserve the subject’s local and global facial details during relighting. Notice that our method is consistently the best at preserving local features based on the cosine similarity in the earlier layers. 38 We demonstrate qualitative relighting results on the Multi-PIE [64] dataset and the FFHQ dataset [66] and compare them with prior work. On Multi-PIE, we include relit images using target lighting (See Fig. 3.3) as well as relit images produced from lighting transfer (See Fig. 3.4). When applying target lighting, our model is able to produce noticeably better cast shadows than DPR [49], SIPR [48], and SfSNet [33]. Our model also avoids overlighting the image, whereas [49] often produces images that appear overexposed. When performing lighting transfer, Shih et al. [50] is unable to produce the correct lighting whereas our model is able to estimate and transfer the correct target lighting. On FFHQ, we perform relighting using target lighting and compare it with previous approaches (See Fig. 3.5). Our approach handles cast shadows better than prior work, as seen by the shadows around the nose, eyes, and other regions, while also maintaining similar or better visual quality in the relit images. 39 CHAPTER 4 NERF HEAD 4.1 Introduction Building free-view human head avatars is a long-standing research topic in the computer vision, computer graphics, and machine learning communities. 
Such avatars are crucial in various VR/AR applications, in particular, in facial animation for teleconferencing, social media apps, movie/animation production, and game industry. One common approach in previous methods is to use 3D morphable models (3DMM) to approximate the human head geometry and thus provide access to head poses and facial expressions. The deformation of the 3DMM is often driven by a low dimensional parameter space such as PCA-based features. By changing the parameters of the expression/pose/identity in this space, we could articulate the 3D template shape with characteristics of specific individuals. However, the 3D reconstruction based on 3DMM usually lacks details even with the appearance modeled through texture mapping, due to the limitation in resolution and the restriction in feature space dimension for template-based 3D explicit representations that are efficient enough for real- time rendering. The model is typically merely capable of building a head mesh without hair or more often just a face mesh. It could not model hair and head accessories with complex geometric and topological features, like glasses and earrings, which are typically present in real life. Moreover, it remains a challenging problem to separately generate photorealistic view-dependent textures for 3DMM shapes, leading to difficulties in realistic rendering of human head images using simple illumination models. A concurrent work, Neural Head[67], creates a relatively accurate dynamic mesh including the explicit geometric modeling of hair and accessories along with the corresponding textures. However, rendering quality of the resulting model-based images seems still lower than that of implicit models with image-based rendering when sufficient input images are provided for training. Image-based rendering is another major approach for rendering generic digital scenes, including human heads, which can be carried out without any explicit mesh-based geometric models. For 40 instance, implicit volumetric representations such as Neural Radiance Fields (NeRF)[68], have been successfully employed for rendering various scenes with increasing popularity. Radiance fields can sample the radiance value at an arbitrary spatial point along an arbitrary direction within a certain range, thus allowing assembly output images from any camera location with any view direction within the range. In particular, NeRF learns an implicit volumetric density and RGB color field by fitting the rendered output to given input images from different views. With the implicit neural radiance field, one may directly generate photo-realistic renderings by assembling rays from the projection center to the image plane of any virtual camera, without any geometric modeling of object surface or a texture mapping. The perceived geometric and appearance details of NeRF results are hard to match for explicit methods like 3DMM. Moreover, some NeRF models also offers realistic relighting. There are a number of existing works that investigated the possibility of applying NeRF to render humans. For instance, EG3D[69] used three orthogonal planes to create an efficient NeRF representation, but it is not straightforward to generalize the model to handle the dynamic case when the face is deforming. Some approaches [70, 71] tried to combine NeRF with 3DMM parameters, but without the 3D shape guidance, they require a relatively large number of input images to learn the 3D neural field. 
Inspired by Neural Body[72] (which use NeRF to model full-body animation under the guidance of a deforming mesh) we propose a hybrid approach that combines the mesh representation with an implicit volumetric rendering to create photo-realistic results. This is done through a latent code stored on a dynamic mesh driven by expression and pose parameters, with the deformable shape embedded in an ambient volume. In contrast to Neural Body input, which has multiple views (although sparse), we present a method for face avatar that may take single view input and is able to reduce the input frames required for training significantly. Our main contributions include: • a 3DMM mesh-based NeRF pipeline for faces; 41 • a NeRF-based 3DMM mesh perturbation for refined NeRF; and • a double-layer mesh method for better code diffusion for model-guided NeRF. 4.2 Related Work 4.2.1 View synthesis and neural scene representation Traditional image-based rendering methods [73, 74] often use radiance/light field interpolation to generate novel views. Such methods often offer a limited range of viewpoints and require extremely dense input views for high quality output. As deep learning technology advances, learnable image-based methods quickly gained popularity. Such methods warp and synthesize input images into novel views[75] through sparser input data. Nevertheless, learning-based view synthesis still requires a large amount of image data and lacks the understanding of the underlying 3D geometry. To make full use of information from 3D scenes, some methods [76] utilize depth information to guide the synthesis of novel views. They, instead, rely heavily on the geometry proxy estimated from the input views. There are also methods[77] that aim to build 3D mesh representations explicitly, but accurate generic 3D reconstruction itself remains an open problem so far. A template mesh with fixed topology is typically unavailable for generic 3D scenes in the real world. Even generating photo-realistic textures for explicit meshes can be challenging, and the rendering with textured meshes still deviates further from the real-world images compared to purely image-based rendering. Recent methods [78, 79, 80, 81] that use implicit representations combined with neural network seem to finally meet the requirements of high-quality view synthesis. They model the radiance fields through volumetric fields of color and density (i.e., one set of RGB values and one density value at each point or voxel), given only a set of input RGB images. NeRF[68] attained great success in embedding feature volumes with neural networks and generate impressive results by outputting radiance given specific locations and directions through volumetric rendering. Our method follows the geometry-guided NeRF framework but incorporates a component that uses NeRF to reversely guide the geometry, and reiterate the NeRF generation to refine the results. 42 4.2.2 NeRF-based dynamic scene representation The original NeRF is designed for static scenes. However, follow-up works like [82, 83, 71, 84, 72] extend NeRF to dynamic scenes by taking time component as part of the input. [82, 83] uses scene flow to linearly interpolate between different frames. [84, 71, 85] convert different temporal frames to a canonical frame. Specifically, [71, 85] model human head with the assumption that head motion is small in typical input and output deformation/animation. 
[70, 72] utilizes a global transformation matrix to embed dynamic NeRF in a canonical space to address large movements in the video. [70] conditions NeRF on 3DMM expression parameters for different frames, while [72] uses SMPL parameters based on a template human body mesh to integrate different frames. We follow the embedding-based method for NeRF generation, but added a double-layer embedding to speed up the diffusion of the implicit model. 4.2.3 Human face reconstruction 3DMM is typically driven by a low-dimensional parameter space. Most face reconstruction methods [86] reply on 3DMM to provide a coarse template mesh as initialization. They either build the blendshape basis for high-frequency details like fine wrinkles [87] or learn the non-rigid deformation on top of the coarse surface[67]. NeRF-based methods[70, 69] learn an implicit representation for the human head. To build the explicit mesh, EG3D[69] extracts the isosurface from the depth field, which is technically only a 2.5D mesh. Our NeRF output can also be used to generate a 2.5D mesh to provide an approximate geometry if needed. Moreover, we allow a more flexible mesh deformation directly guided by the NeRF density field without resorting to a blendshape basis. 4.2.4 Human avatar reconstruction Traditional methods are usually based on 3DMM parameters to reanimate human subjects. 3DMM models built on a large of blendshapes provide us with the shape variation on a learned shape space. Therefore, the deformation could be limited and the control over the template mesh is coarse-grained. Recent methods start to combine 3DMM with neural network. [88, 86] disentangle appearance and shape. [88] use shape parameter to model deformation, while color 43 Figure 4.1 Overview of our proposed method. map in canonical space stays unchanged. [86] assign feature descriptors to each surface vertex to learn texture synthesis for the deformed mesh. In addition, the use of GAN enables the generation of more photo-realistic rendering for human avatars, like in [89]. [72, 90] use one network to learn the detailed shape deformation based on 3DMM face reconstruction while utilizing neural rendering to recover the texture information. 3DMM parameters are utilized to generate novel poses and expressions. Usually this kind of methods focus on mesh reconstruction quality, and the rendered image quality is not comparable to NeRf-based methods. To gain fine-grained control over avatar attributes, [91] uses local face models instead of 3DMM to offer more flexibility and expressiveness, and [92] represents attributes as localized masks and regresses each attribute value and its corresponding mask using neural network. It requires 2D mask annotations for a few shots that specify which region of the image an attribute controls. We design our avatar algorithm based on NeRF, the neural radiance field enables us to render high-fidelity face images. Inserting the human head mesh in the 3D neural field enables our network to reconstruct a dynamic human head. Additionally, the 3DMM parameters provide us with control over pose and expression. Finally, our NeRF-based rendering pipeline could learn the high-frequency details, such as hair and glasses, on top of the rough shape deformation without annotations. 44 Figure 4.2 Network structure. 4.3 Algorithm Given a single-view portrait video, our method reconstructs a dynamic human head avatar which enables the novel view synthesis of each input scene and arbitrary control of facial expressions and head poses. 
Our model learns an implicit neural representation based on NeRF[68] for human head. The overall pipeline is illustrated in Fig. 4.1. For each frame of a portrait video, we first reconstruct the coarse 3d morphable mesh for each frame. We use DECA[87] to also extract the corresponding pose, expression, and camera parameters for each frame. The pose parameters allow us to apply rigid global transformation to embed the head shape in a canonical bounding box which is shared by all frames. The expression parameter could be used for future facial expression manipulation. Inspired by [72], we attach the latent code for NeRF generation on the coarse 3D shape, whose embedding allows the generation of dynamic density and RGB fields. The latent codes attached to each vertex on the head surface typically denote the local geometric deformation and texture information during the training of the neural network. To generate the continuous radiance field, the latent codes are diffused to nearby space through a code diffusion process. The neural network could decode the sparse volumetric latent code field into color and density fields. The output image is then generated from different viewpoints via volume rendering. 45 4.3.1 Coarse facial geometry For this part of our pipeline, we simply employ DECA [87]. We provide the details on input preparation and our parameter choices below. Given single-view or multi-view portrait images/videos, we first randomly choose input frames that may cover a wide enough range of view angles if possible, since we aim to generate novel views of portrait animation. Then, for each input frame, which may have different expressions and poses, we use DECA [87] to reconstruct the shape 𝑆𝑡 , a parameterized mesh of 5,023 vertices. The type of coarse output mesh in DECA is FLAME[93], a statistical 3D head mesh model that combines separate linear identity shape and expression spaces with linear blend skinning (LBS) to articulate the neck, jaw, and eyeballs. The advantage of the consistent connectivity across different frames is the straightforward deformation of the NeRF volume fields as we detail next. We optionally allow enhancement for different regions of the face, for instance allowing the forehead area or top of the head to have two layers of vertices by offsetting the original mesh vertex along the normal direction by a preset distance. Similar treatment can be performed near the eyes or mouth for enhanced details to accommodate accessories or expressions. While our ablation study showed no improvements in the statistics that we employ for accuracy, the improvements to the targeted regions are visually noticeable. 4.3.2 Dynamic feature volume A set of latent codes 𝑍 = {𝑧 1 , 𝑧2 , ..., 𝑧 5023 } are anchored to the surface of shape 𝑆𝑡 , each latent code is a 16 dimensional vector to encode local appearance information and geometric details on top of the underlying 3D shape. The same vertex across all the input frames share the same set of latent codes, which integrate all the frames together and take advantage of the consistency across different frames in terms of the underlying 3D shape and the facial texture. The dynamics brought by expression parameters affect the 3D shape 𝑆𝑡 directly, therefore the embedding locations of the latent code in the feature volume will change accordingly, which leads to a dynamic feature volume to allow expression-dependent output. 
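The anchoring of frame-independent latent codes to the deforming FLAME vertices can be sketched as follows. The class name, the initialization scale, and the assumption that the per-frame mesh has already been transformed into the canonical bounding box are illustrative choices, not the exact implementation.

import torch
import torch.nn as nn

class AnchoredLatentCodes(nn.Module):
    # One 16-D latent code per FLAME vertex, shared by all frames; only the anchor
    # positions change with the per-frame shape S_t.
    def __init__(self, num_vertices=5023, code_dim=16):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_vertices, code_dim) * 0.01)

    def forward(self, vertices_t):
        # vertices_t: (5023, 3) mesh for frame t, assumed already transformed into
        # the shared canonical bounding box. Returns anchor positions and codes.
        return vertices_t, self.codes

codes = AnchoredLatentCodes()
frame_a = torch.rand(5023, 3)       # placeholder vertex positions for one frame
frame_b = torch.rand(5023, 3)       # placeholder vertex positions for another frame
_, codes_a = codes(frame_a)
_, codes_b = codes(frame_b)
assert codes_a is codes_b           # same latent codes, different anchor locations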
While our model has been trained to be individual-specific due to data availability, we believe it is possible to build feature volumes that also take identity parameters for a generic human avatar, since both the coarse geometry and the appearance model can be used to train the latent code representation across different identities.

Following [72, 94], we use SparseConvNet [95] to take S_t and Z as input and generate a multi-scale feature volume. As a typical NeRF field is represented by continuous volumetric fields instead of samples on a dynamic surface, SparseConvNet provides a reasonable bridge to link the sparse surface points to the ambient space with an implicit, learned diffusion-like process. With this conversion, given a query point, we may retrieve its latent code in the multi-scale feature volume as shown in Fig. 4.2 and concatenate the latent codes at different scales into the final 352D feature vector. The latent code is decoded in the next stage of our pipeline into density and color functions.

More specifically, we first compute the bounding box of the 3D head shape S_t, and then divide the bounding box into a Cartesian grid. Each grid cell (voxel) has a size of 5 mm × 5 mm × 5 mm. The latent code assigned to each voxel is calculated as the mean of the latent codes of all the surface points situated inside the voxel. Similar to [72, 94], we also downsample the feature volume by factors of 2, 4, 8, and 16 to build multi-scale feature volumes. From the feature volume at each scale, we obtain a feature vector at each query point through trilinear interpolation, and then concatenate all four feature vectors into a 352-dimensional vector as the final output. For frame t, we denote the feature vector at a given sample point x as f(x; S_t, Z); the output at this stage is thus a function of the sample point location x, parameterized by the latent codes Z attached to the mesh vertices and the given head shape S_t.

While this embedding process is fairly standard, we optionally provide a NeRF-guided deformation to the surface point locations. Thus, after the first iteration of training, we perturb the position of each vertex to get a more accurate mesh surface, then insert the new surface points back into the 3D feature volume, and train the network again to generate new rendered images. In our ablation study in the next section, we found mixed results for this refinement. We speculate that we may need larger training sets or more accurate initial geometry to consistently outperform our current pipeline.

4.3.3 Color/density estimation

At this stage of our pipeline, the feature vector associated with a spatial location is passed into the MLP network to regress the density and color at that point. For color regression, we add the viewing direction as input, since without the viewing direction, the model will have difficulty representing specularity. We also add another per-frame 128D latent code to encode the temporally varying factors across frames, such as illumination conditions. Given this treatment, the radiance field is indeed a 5D function that depends on both location and direction. Specifically, given a feature volume and a viewing direction, the radiance value at a given viewpoint is determined by casting a ray from the viewpoint along the viewing direction towards the feature volume. If the ray intersects with the bounding volume, sample points are distributed along the portion of the ray that is inside the bounding volume.
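As a minimal illustration of this ray sampling step, the sketch below intersects a ray with an axis-aligned bounding box and distributes 64 samples on the segment inside it. The slab-test formulation, the uniform (non-stratified) spacing, and the placeholder box coordinates are assumptions of this sketch rather than details of the actual implementation.

import numpy as np

def ray_aabb(origin, direction, box_min, box_max):
    # Slab test: return the (t_near, t_far) interval of the ray inside the box, or None.
    direction = np.where(np.abs(direction) < 1e-9, 1e-9, direction)
    t0 = (box_min - origin) / direction
    t1 = (box_max - origin) / direction
    t_near = np.max(np.minimum(t0, t1))
    t_far = np.min(np.maximum(t0, t1))
    if t_far <= max(t_near, 0.0):
        return None
    return max(t_near, 0.0), t_far

def sample_along_ray(origin, direction, box_min, box_max, n_samples=64):
    # Place n_samples points, sorted from near to far, on the ray segment inside the box.
    hit = ray_aabb(origin, direction, box_min, box_max)
    if hit is None:
        return None
    t_near, t_far = hit
    t = np.linspace(t_near, t_far, n_samples)
    return origin + t[:, None] * direction, t

# Example with a placeholder head bounding box (coordinates are illustrative only):
points, t = sample_along_ray(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, -1.0]),
                             np.array([-0.15, -0.20, -0.15]), np.array([0.15, 0.25, 0.15]))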
For each sample point, we query the multi-scale feature volume to assemble the 352D latent code, which is then fed to a multi-layer perceptron (MLP) network along with the viewing direction to estimate the output color and density value, as is typical practice. As the final representation of our dynamic radiance field, our MLP network takes the feature vector as input and outputs color and density predictions for each sample point inside the bounding volume. For a given frame t, the density and color are defined as functions of the latent code at the sample point; the color is additionally parameterized by a per-frame latent code that takes into account additional changes in the environment, such as illumination.

Density: The density model is trained to be directly a function of only the latent code L(x, Z, S_t) at a given location x, the given latent code Z for the entire mesh, and the current mesh shape S_t:

d(x) = M_d(L(x, Z, S_t)),    (4.1)

where M_d is the MLP network for density as shown in Fig. 4.2.

Color: The color model additionally takes the given viewing direction v and the 128D latent code l_t associated with frame t, along with the latent code L(x, Z, S_t), as its three inputs:

c(x) = M_c(L(x, Z, S_t), v, l_t),    (4.2)

where M_c is the MLP network for color as shown in Fig. 4.2.

4.3.4 Volumetric rendering with radiance fields

Accumulating the data along each ray and assembling rays into images is a standard procedure in volume rendering [96]. Once we obtain the procedure to produce the radiance associated with any ray, the image is constructed by a simple loop that iterates over all the pixels. Thus, we provide the details of the radiance estimate based on the sample points from the previous stage. We may see it as the evaluation of the radiance field based on the internal per-point density and directional emitted radiance (in RGB color) representation. For the output color (which corresponds to the direction-dependent physical quantity radiance), we integrate the color and density of the sample points along the ray. Specifically, the color obtained in the previous stage encodes the radiance emitted from one location towards the given direction, and the density indicates how the light interacts with the matter in the small volume at the point, filtering out a fraction of the light passing through the volume. A practical way to perform the integral numerically is approximation through quadrature. Thus, based on the previously sampled point data, the final color of the corresponding ray r that reaches the projection center (origin of r) of the camera is given by:

C(r) = \sum_{k=1}^{N} T_k \left( 1 - \exp(-d(\mathbf{x}_k)\,\delta_k) \right) c(\mathbf{x}_k),    (4.3)

where

T_k = \exp\left( -\sum_{j=1}^{k-1} d(\mathbf{x}_j)\,\delta_j \right),    (4.4)

and δ_k = ||x_{k+1} − x_k||_2 is the distance between adjacent sampled points, d(x_k) is the density at sample point x_k, and c(x_k) is the color at sample point x_k along the ray direction. We set the total number of sample points N to 64 in our experiments. Note that the sample points are sorted from near to far.
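To make the quadrature of Eqs. (4.3)–(4.4) concrete, here is a minimal NumPy sketch for a single ray; the same accumulated weights are reused for the depth integral of Eq. (4.6) in Sec. 4.3.5 below. The padding of the last interval and the choice of the first sample as the depth reference point are assumptions of this sketch, not details taken from our pipeline.

import numpy as np

def composite_ray(density, color, points):
    # Quadrature of Eqs. (4.3)-(4.4) for one ray whose samples are sorted near to far.
    # density: (N,) values d(x_k); color: (N, 3) values c(x_k); points: (N, 3) locations x_k.
    delta = np.linalg.norm(np.diff(points, axis=0), axis=1)   # delta_k = ||x_{k+1} - x_k||
    delta = np.append(delta, delta[-1])                       # pad the last interval (assumption)
    alpha = 1.0 - np.exp(-density * delta)                    # opacity of each interval
    trans = np.concatenate(([1.0], np.cumprod(np.exp(-density * delta))[:-1]))  # T_k, Eq. (4.4)
    weights = trans * alpha
    rgb = (weights[:, None] * color).sum(axis=0)              # C(r), Eq. (4.3)
    # The same weights, applied to distances from the first sample, give an expected
    # depth along the ray, in the spirit of Eq. (4.6).
    depth = (weights * np.linalg.norm(points - points[0], axis=1)).sum()
    return rgb, depth

density = 5.0 * np.random.rand(64)                  # placeholder densities
color = np.random.rand(64, 3)                       # placeholder per-sample colors
points = np.linspace([0.0, 0.0, 1.0], [0.0, 0.0, 0.6], 64)
rgb, depth = composite_ray(density, color, points)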
4.3.5 3D mesh extraction

Some 3D surface reconstruction methods exist for extracting geometric information from neural radiance fields generated by an MLP. However, methods like EG3D [69] only generate a 2.5D mesh, i.e., a depth map from a given camera direction. Since the internal density model of NeRF learns, for every point, a probability of whether it is inside the object, it is only accurate in the sense of reproducing the image rendering. Thus, it is not necessarily reliable for the extraction of an accurate surface. On the other hand, the integral of data along rays, as done in volumetric rendering, is more accurate than the point samples. That is why the 2.5D meshes generated by integrating along a chosen direction are often more stable: they are based on a reasonable estimate of the closest point of the scene to the camera along each projection ray. However, to refine the mesh used in the NeRF construction, we need an honest 3D mesh. Thus, we propose to keep the mesh connectivity intact and only perturb the locations of the existing vertices.

To have a reliable estimate as in the 2.5D reconstruction, we mimic the procedure of volumetric rendering by tracing along rays. The difference is that now we construct a ray that passes through the initial surface vertex location, with the ray direction following the inward-pointing normal direction at that point of the original surface. To ensure that we start from an outside point, we first move the surface point V to the space outside the head by offsetting the point by a safe distance a. Similar to 2.5D depth image generation, we calculate the depth for the point along the negative normal direction, and then move the surface point back along the negative normal direction by distance d to reach the surface. The new location V′ of the surface point is calculated as follows:

V' = V + (a - d)\,\hat{n},    (4.5)

where n̂ is the unit outward normal of the surface mesh. To make sure that the mesh is only perturbed, we clamp a − d to the range [−ε, ε], where ε = 5 mm is the offset distance upper bound.

The details of our procedure to generate the depth for each sample point by integrating density along the normal direction, which avoids the drawback of a 2.5D depth map, are as follows. First, we select a large a (0.03 in our experiments) to make sure we reach the exterior of the human head, where the density is effectively zero. Next, we select a negative distance b (−0.01 in our experiments) to make sure we reach the interior of the head, where the density is nearly 1. Then, we query the density function for 64 sample points between V + a n̂ and V + b n̂. For each sample point, we integrate the density between this point and V + a n̂:

depth = \sum_{i=1}^{N} T_i \left( 1 - e^{-\delta_i \sigma_i} \right) d_i,    (4.6)

where T_i is evaluated in the same way as in the volumetric rendering. The overall algorithm to extract a perturbed mesh from the neural radiance field is outlined as:

Algorithm 4.1 Surface Update
1: function SurfaceUpdate(V, n̂, a, b)
2:   for i = 1 to n do                          ⊲ n = 5,023: vertex count
3:     sample 64 points P along n̂_i with an offset between a and b
4:     for k = 1 to 64 do
5:       depth(P_k) = Σ_k T_k (1 − e^{−δ_k σ_k}) d_k
6:       if depth(P_k) == 0 then
7:         d = location(P_k); break
8:       end if
9:     end for
10:    clamp a − d to [−ε, ε]
11:    V'_i = V_i + (a − d) n̂_i
12:  end for
13: end function

4.3.6 Loss function

There are several choices of loss functions that we have tested; the differences lie in the treatment of background pixels. The simplest choice is to model the background within the bounding volume; it creates the best transition but typically distorts the background in novel views. Another choice is to use automated background masks and ignore the pixels in the background region in the loss function. This allows the network to focus on the foreground, but requires the use of a roughly estimated foreground mask based on the DECA mesh.
The choice that results in better geometric reconstruction is to force background pixels to match an accumulated alpha value of 0, which forces density to appear only where the head intersects the camera rays. Regardless of the choice, we optimize the network to reduce the difference between the rendered image and the ground-truth images, with the loss function expressed as:

L = \frac{1}{N} \sum_{r \in \mathcal{R}} \left\| C(r) - C_t(r) \right\|_2^2,    (4.7)

where R is the set of camera rays that pass through the image plane, N is the number of sampled rays, and C_t is the ground-truth pixel value. The summation range depends on the chosen background treatment.

Figure 4.3 Comparison to state-of-the-art algorithms on self-reenactment.

4.4 Experiments

We tested our pipeline on multiple portrait videos from a diverse set of sources. We obtained three human videos courtesy of the authors of NerFace, two courtesy of the authors of Neural Head, and two courtesy of the authors of CoNeRF. In addition, we downloaded public-domain videos of Barack Obama and Michelle Obama from YouTube to demonstrate the generality and robustness of our model. Since the NerFace training dataset is already shuffled, we randomly select 300 frames from the test dataset for testing. For CoNeRF, we use 300 frames for training and take every other frame for testing.

We measure the typically used metrics: L1 distance, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS), and report them in Table 4.1. For different videos, the resolution, frame rate, and human head size in the video differ significantly, which may influence the performance of our algorithm. We also compare our quantitative results with baselines such as NerFace, FOMM, and DVP, and our algorithm outperforms these baselines. As for CoNeRF, it requires users to manually annotate facial attributes; our method and CoNeRF do not fall into the same category, so a direct comparison is not needed.

Figure 4.4 Increasing training frames will increase the performance on all four metrics.

Video     L1↓     PSNR↑   SSIM↑   LPIPS↓
Head0     0.012   26.49   0.94    0.11
Head4     0.011   27.12   0.95    0.08
Person1   0.021   23.81   0.93    0.10
Person2   0.010   30.23   0.97    0.03
Person3   0.013   27.48   0.95    0.06
Video1    0.026   22.56   0.91    0.09
Average   0.015   26.28   0.94    0.08
Table 4.1 Quantitative result on different videos.

Figure 4.5 The second iteration could capture more high-frequency details in the mouth and glasses regions.

Method    L1↓     PSNR↑   SSIM↑   LPIPS↓
FOMM      0.036   23.77   0.91    0.16
DVP       0.021   25.67   0.93    0.10
NerFace   0.029   24.22   0.93    0.09
CoNeRF    0.015   32.24   0.98    0.17
Average   0.015   26.28   0.94    0.08
Table 4.2 Comparison with baselines.

Method          L1↓     PSNR↑   SSIM↑   LPIPS↓
1st iteration   0.014   27.26   0.95    0.06
2nd iteration   0.014   27.45   0.95    0.06
Table 4.3 Ablation study.

The quantitative comparison with baselines is shown in Table 4.2, and the qualitative comparison is shown in Fig. 4.3. As can be seen from the qualitative results, our algorithm reconstructs the appearance and facial expressions and preserves facial details, including wrinkles, while the other algorithms either did not reconstruct the correct facial expressions or missed high-frequency details in these test cases.

We ran an ablation study on the three videos from Neural Head to investigate what could be the key factor influencing the final performance. Increasing the number of training frames significantly increases the accuracy until a certain minimum frame number is reached, as shown in Fig. 4.4.
Our ablation study in table 4.4 also shows the refined mesh improves the overall performance slightly. However, from the qualitative result in Fig. 4.5, we could see in some specific circumstances, the refined mesh could improve the reconstruction of local regions, like the more deformable mouth region and the glasses that are not in the original DECA mesh. Another reason may have been that since DECA does not reconstruct the mouth cavity when the mouth is open, our refined mesh would assign vertices in this region, and the feature vectors attached to the vertices will encode appearance information and increase the expressiveness of our algorithm. 4.5 Conclusion and Future Work We presented a 3DMM-based NeRF for dynamic human head rendering. With the guidance of geometric shapes, we could significantly reduce the training frames. We could generate high- fidelity novel-view face images and also reanimate the human head by changing 3DMM parameters. While the refined mesh does not improve the result further, it shows promise in its handling of local regions. We will explore the use of shape regularizers and region-specific local distance thresholding, as well as seek a more accurate initial 3D mesh, to improve our neural radiance field- guided mesh perturbation. There are other aspects of our pipeline that could have been enhanced to improve the final rendering performance, such as the use of better foreground mask generation. 55 BIBLIOGRAPHY [1] N. Lian, X. Wang, Y. Jing, and J. Lin, “Regulation of cytoskeleton-associated protein activities: Linking cellular signals to plant cytoskeletal function,” Journal of Integrative Plant Biology, vol. 63, no. 1, pp. 241–250, 2021. [2] L. Blanchoin, R. Boujemaa-Paterski, C. Sykes, and J. Plastino, “Actin dynamics, architecture, and mechanics in cell motility,” Physiological reviews, vol. 94, no. 1, pp. 235–263, 2014. [3] G. J. Brouhard and L. M. Rice, “Microtubule dynamics: an interplay of biochemistry and mechanics,” Nature reviews Molecular cell biology, vol. 19, no. 7, pp. 451–463, 2018. [4] P. Li and B. Day, “Battlefield cytoskeleton: turning the tide on plant immunity,” Molecular Plant-Microbe Interactions, vol. 32, no. 1, pp. 25–34, 2019. [5] P. Nick, “Mechanics of the cytoskeleton,” in Mechanical integration of plant cells and plants, pp. 53–90, Springer, 2011. [6] J. G. Carlton, H. Jones, and U. S. Eggert, “Membrane and organelle dynamics during cell division,” Nature Reviews Molecular Cell Biology, vol. 21, no. 3, pp. 151–166, 2020. [7] I. Kristó, I. Bajusz, C. Bajusz, P. Borkúti, and P. Vilmos, “Actin, actin-binding proteins, and actin-related proteins in the nucleus,” Histochemistry and cell biology, vol. 145, no. 4, pp. 373–388, 2016. [8] M. Melak, M. Plessner, and R. Grosse, “Actin visualization at a glance,” Journal of cell science, vol. 130, no. 3, pp. 525–530, 2017. [9] N. Lichtenstein, B. Geiger, and Z. Kam, “Quantitative analysis of cytoskeletal organization by digital fluorescent microscopy,” Cytometry Part A: The Journal of the International Society for Analytical Cytology, vol. 54, no. 1, pp. 8–18, 2003. [10] S. A. Shah, P. Santago, and B. K. Rubin, “Quantification of biopolymer filament structure,” Ultramicroscopy, vol. 104, no. 3-4, pp. 244–254, 2005. [11] Y. Liu, A. Nedo, K. Seward, J. Caplan, and C. Kambhamettu, “Quantifying actin filaments in microscopic images using keypoint detection techniques and a fast marching algorithm,” in 2020 IEEE International Conference on Image Processing (ICIP), pp. 2506–2510, IEEE, 2020. [12] Y. 
Liu, K. Mollaeian, and J. Ren, “An image recognition-based approach to actin cytoskeleton quantification,” Electronics, vol. 7, no. 12, p. 443, 2018. [13] M. Alioscha-Perez, C. Benadiba, K. Goossens, S. Kasas, G. Dietler, R. Willaert, and H. Sahli, “A robust actin filaments image analysis framework,” PLoS computational biology, vol. 12, 56 no. 8, p. e1005063, 2016. [14] T. Higaki, N. Kutsuna, T. Sano, N. Kondo, and S. Hasezawa, “Quantification and cluster analysis of actin cytoskeletal structures in plant cells: role of actin bundling in stomatal movement during diurnal cycles in arabidopsis guard cells,” The Plant Journal, vol. 61, no. 1, pp. 156–165, 2010. [15] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions on systems, man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979. [16] W. Niblack, An introduction to digital image processing. Strandberg Publishing Company, 1985. [17] T. Higaki, K. Akita, and K. Katoh, “Coefficient of variation as an image-intensity metric for cytoskeleton bundling,” Scientific reports, vol. 10, no. 1, pp. 1–13, 2020. [18] T.-C. Lee, R. L. Kashyap, and C.-N. Chu, “Building skeleton models via 3-d medial surface axis thinning algorithms,” CVGIP: Graphical Models and Image Processing, vol. 56, no. 6, pp. 462–478, 1994. [19] M. Desbrun, M. Meyer, P. Schröder, and A. H. Barr, “Implicit fairing of irregular meshes using diffusion and curvature flow,” in Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 317–324, 1999. [20] J. Nunez-Iglesias, A. J. Blanch, O. Looker, M. W. Dixon, and L. Tilley, “A new python library to analyse skeleton images confirms malaria parasite remodelling of the red blood cell membrane skeleton,” PeerJ, vol. 6, p. e4312, 2018. [21] K. Tanaka, S. Takeda, K. Mitsuoka, T. Oda, C. Kimura-Sakiyama, Y. Maéda, and A. Narita, “Structural basis for cofilin binding and actin filament disassembly,” Nature communications, vol. 9, no. 1, pp. 1–12, 2018. [22] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions on systems, man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979. [23] G. W. Zack, W. E. Rogers, and S. A. Latt, “Automatic measurement of sister chromatid exchange frequency.,” Journal of Histochemistry & Cytochemistry, vol. 25, no. 7, pp. 741– 753, 1977. [24] C. Li and P. K.-S. Tam, “An iterative algorithm for minimum cross entropy thresholding,” Pattern recognition letters, vol. 19, no. 8, pp. 771–776, 1998. [25] J.-C. Yen, F.-J. Chang, and S. Chang, “A new criterion for automatic multilevel thresholding,” IEEE Transactions on Image Processing, vol. 4, no. 3, pp. 370–378, 1995. 57 [26] J. Sauvola and M. Pietikäinen, “Adaptive document image binarization,” Pattern recognition, vol. 33, no. 2, pp. 225–236, 2000. [27] J. L. Henty-Ridilla, J. Li, B. Day, and C. J. Staiger, “Actin depolymerizing factor4 regulates actin dynamics during innate immune signaling in arabidopsis,” The Plant Cell, vol. 26, no. 1, pp. 340–352, 2014. [28] J. L. Henty-Ridilla, M. Shimono, J. Li, J. H. Chang, B. Day, and C. J. Staiger, “The plant actin cytoskeleton responds to signals from microbe-associated molecular patterns,” PLoS pathogens, vol. 9, no. 4, p. e1003290, 2013. [29] M. Guo, P. Kim, G. Li, C. G. Elowsky, and J. R. Alfano, “A bacterial effector co-opts calmodulin to target the plant microtubule network,” Cell host & microbe, vol. 19, no. 1, pp. 67–78, 2016. [30] M. Shimono, Y.-J. Lu, K. Porter, B. H. Kvitko, J. Henty-Ridilla, A. Creason, S. Y. He, J. H. 
Chang, C. J. Staiger, and B. Day, “The pseudomonas syringae type iii effector hopg1 induces actin remodeling to promote symptom development and susceptibility during infection,” Plant physiology, vol. 171, no. 3, pp. 2239–2255, 2016. [31] Y.-J. Lu, P. Li, M. Shimono, A. Corrion, T. Higaki, S. Y. He, and B. Day, “Arabidopsis calcium-dependent protein kinase 3 regulates actin cytoskeleton organization and immunity,” Nature communications, vol. 11, no. 1, pp. 1–12, 2020. [32] H. Le and I. Kakadiaris, “Illumination-invariant face recognition with deep relit face images.,” in WACV, IEEE, 2019. [33] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs, “SfSnet: Learning shape, refectance and illuminance of faces in the wild.,” in CVPR, IEEE, 2018. [34] T. Nestmeyer, J.-F. Lalonde, I. Matthews, and A. Lehrmann, “Learning Physics-guided Face Relighting under Directional Light.,” in CVPR, IEEE, 2020. [35] Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras, “Face relighting from a single image under arbitrary unknown lighting conditions.,” PAMI, 2009. [36] B. Egger, S. Schonborn, A. Schneider, A. Kortylewski, A. Morel-Forseter, C. Blumer, and T. Vetter, “Occlusion-aware 3d morphable models and an illumination prior for face image analysis.,” IJCV, 2018. [37] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras, “Neural face editing with intrinsic image disentangling.,” in CVPR, IEEE, 2017. [38] L. Tran and X. Liu, “Nonlinear 3d face morphable model.,” in CVPR, IEEE, 2018. 58 [39] K. Genova, F. Cole, A. Maschinot, A. Sarna, D. Vlasic, and W. T. Freeman, “Unsupervised training for 3d morphable model regression.,” in CVPR, IEEE, 2018. [40] A. Tewari, M. Zollofer, H. Kim, P. ¨ Garrido, F. Bernard, P. Perez, and T. Christian, “Mofa: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Recon- struction.,” in ICCV, IEEE, 2017. [41] S. Yamaguchi, S. Saito, K. Nagano, Y. Zhao, W. Chen, K. Olszewski, S. Morishima, and H. Li, “High-fidelity facial reflectance and geometry inference from an unconstrained image.,” ACM Transactions on Graphics, 2018. [42] J. T. Barron and J. Malik, “Shape, illumination, and reflectance from shading.,” TPAMI, 2015. [43] D. Shahlaei and V. Blanz, “Realistic inverse lighting from a single 2d image of a face, taken under unknown and complex lighting.,” in International Conference on Automatic Face and Gesture Recognition, IEEE, 2015. [44] J. Lee, R. Machiraju, B. Moghaddam, and H. Pfister, “Estimation of 3D faces and illumination from single photographs using a bilinear illumination model.,” in Eurographics Conference on Rendering Techniques, 2005. [45] J. Lin, Y. Yuan, T. Shao, and K. Zhou, “Towards High-Fidelity 3D Face Reconstruction from In-the-Wild Images Using Graph Convolutional Networks,” in CVPR, IEEE, 2020. [46] G.-H. Lee and S.-W. Lee, “Uncertainty-Aware Mesh Decoder for High Fidelity 3D Face Reconstruction,” in CVPR, IEEE, 2020. [47] C. Li, K. Zhou, and S. Lin, “Intrinsic face image decomposition with human face priors.,” in ECCV, IEEE, 2014. [48] T. Sun, J. T. Barron, Y.-T. Tsai, Z. Xu, X. Yu, G. Fyffe, C. Rhemann, J. Busch, P. Debevec, and R. Ramamoorthi, “Single image portrait relighting.,” ACM Transactions on Graphics (SIGGRAPH), 2019. [49] H. Zhou, S. Hadap, K. Sunkavalli, and D. W. Jacobs, “Deep single-image portrait relighting,” in ICCV, IEEE, 2019. [50] Y. Shih, S. Paris, C. Barnes, W. T. Freeman, and F. Durand, “Style transfer for headshot portraits.,” ACM Transactions on Graphics, 2014. [51] F. 
Luan, S. Paris, E. Shechtman, and K. Bala, “Deep photo style transfer.,” in CVPR, IEEE, 2017. [52] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, “A closed-form solution to photorealistic image stylization.,” in ECCV, IEEE, 2018. 59 [53] Z. Shu, S. Hadap, E. Shechtman, K. Sunkavalli, S. Paris, and D. Samaras, “Portrait lighting transfer using a mass transport approach.,” ACM Transactions on Graphics, 2017. [54] A. Shashua and T. Riklin-Raviv, “The quotient image: Class-based re-rendering and recog- nition with varying illuminations.,” PAMI, 2001. [55] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity.,” IEEE Transactions on Image Processing, 2004. [56] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets.,” in Advances in Neural Information Processing Systems, 2014. [57] V. Blanz and T. Vetter, “A morphable model for the synthesis of 3d faces,” in Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 187–194, 1999. [58] P. Peers, N. Tamura, W. Matusik, and P. Debevec, “Post-production facial performance relighting using reflectance transfer.,” SIGGRAPH, 2007. [59] A. Stoschek, “Image-based re-rendering of faces for continuous pose and illumination direc- tions.,” in CVPR, IEEE, 2000. [60] Z. Wen, Z. Liu, and T. Huang, “Face Relighting with Radiance Environment Maps.,” in CVPR, IEEE, 2003. [61] T. Sun, J. T. Barron, Y.-T. Tsai, Z. Xu, X. Yu, G. Fyffe, C. Rhemann, J. Busch, P. E. Debevec, and R. Ramamoorthi, “Single image portrait relighting.,” ACM Trans. Graph., vol. 38, no. 4, pp. 79–1, 2019. [62] A. Appel, “Some techniques for shading machine renderings of solids,” in Proceedings of the April 30–May 2, 1968, spring joint computer conference, pp. 37–45, 1968. [63] A. Georghiades, P. Belhumeur, and D. Kriegman, “From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose.,” PAMI, 2001. [64] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie.,” Image Vision Comput- ing, 2010. [65] O. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition.,” in BMVC, 2015. [66] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks.,” CoRR, abs/1812.04948. [67] P.-W. Grassal, M. Prinzler, T. Leistner, C. Rother, M. Nießner, and J. Thies, “Neural head 60 avatars from monocular rgb videos,” arXiv preprint arXiv:2112.01554, 2021. [68] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European conference on computer vision, pp. 405–421, Springer, 2020. [69] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. Guibas, J. Tremblay, S. Khamis, et al., “Efficient geometry-aware 3d generative adversarial networks,” arXiv preprint arXiv:2112.07945, 2021. [70] G. Gafni, J. Thies, M. Zollhofer, and M. Nießner, “Dynamic neural radiance fields for monocular 4d facial avatar reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8649–8658, 2021. [71] S. Athar, Z. Shu, and D. Samaras, “Flame-in-nerf: Neural control of radiance fields for free view face animation,” arXiv preprint arXiv:2108.04913, 2021. [72] S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. 
[73] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, “The lumigraph,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 43–54, 1996.
[74] A. Davis, M. Levoy, and F. Durand, “Unstructured light fields,” in Computer Graphics Forum, vol. 31, pp. 305–314, Wiley Online Library, 2012.
[75] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, pp. 1–10, 2016.
[76] E. Penner and L. Zhang, “Soft 3d reconstruction for view synthesis,” ACM Transactions on Graphics (TOG), vol. 36, no. 6, 2017.
[77] M. Waechter, N. Moehrle, and M. Goesele, “Let there be color! Large-scale texturing of 3d reconstructions,” in European Conference on Computer Vision, pp. 836–850, Springer, 2014.
[78] J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker, “DeepView: View synthesis with learned gradient descent,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2367–2376, 2019.
[79] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” arXiv preprint arXiv:1805.09817, 2018.
[80] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer, “DeepVoxels: Learning persistent 3d feature embeddings,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2437–2446, 2019.
[81] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh, “Neural volumes: Learning dynamic renderable volumes from images,” arXiv preprint arXiv:1906.07751, 2019.
[82] Z. Li, S. Niklaus, N. Snavely, and O. Wang, “Neural scene flow fields for space-time view synthesis of dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6498–6508, 2021.
[83] W. Xian, J.-B. Huang, J. Kopf, and C. Kim, “Space-time neural irradiance fields for free-viewpoint video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9421–9431, 2021.
[84] A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer, “D-NeRF: Neural radiance fields for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10318–10327, 2021.
[85] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla, “Nerfies: Deformable neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5865–5874, 2021.
[86] J. Thies, M. Zollhöfer, and M. Nießner, “Deferred neural rendering: Image synthesis using neural textures,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019.
[87] Y. Feng, H. Feng, M. J. Black, and T. Bolkart, “Learning an animatable detailed 3d face model from in-the-wild images,” ACM Transactions on Graphics (TOG), vol. 40, no. 4, pp. 1–13, 2021.
[88] Z. Shu, M. Sahasrabudhe, R. A. Guler, D. Samaras, N. Paragios, and I. Kokkinos, “Deforming autoencoders: Unsupervised disentangling of shape and appearance,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 650–665, 2018.
[89] A. Tewari, M. Elgharib, G. Bharaj, F. Bernard, H.-P. Seidel, P. Pérez, M. Zollhöfer, and C. Theobalt, “StyleRig: Rigging StyleGAN for 3d control over portrait images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6142–6151, 2020.
[90] Y. Zheng, V. F. Abrevaya, M. C. Bühler, X. Chen, M. J. Black, and O. Hilliges, “I M Avatar: Implicit morphable head avatars from videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13545–13555, 2022.
[91] C. Wu, D. Bradley, M. Gross, and T. Beeler, “An anatomically-constrained local deformation model for monocular face capture,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, pp. 1–12, 2016.
[92] K. Kania, K. M. Yi, M. Kowalski, T. Trzciński, and A. Tagliasacchi, “CoNeRF: Controllable neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18623–18632, 2022.
[93] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero, “Learning a model of facial shape and expression from 4d scans,” ACM Trans. Graph., vol. 36, no. 6, pp. 194–1, 2017.
[94] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “PV-RCNN: Point-voxel feature set abstraction for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538, 2020.
[95] B. Graham, M. Engelcke, and L. Van Der Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9224–9232, 2018.
[96] J. T. Kajiya and B. P. Von Herzen, “Ray tracing volume densities,” ACM SIGGRAPH Computer Graphics, vol. 18, no. 3, pp. 165–174, 1984.
[97] P. Li, Z. Zhang, Y. Tong, B. M. Foda, and B. Day, “ILEE: Algorithms and toolbox for unguided and accurate quantitative analysis of cytoskeletal images,” Journal of Cell Biology, vol. 222, no. 2, p. e202203024, 2022.
[98] A. Hou, Z. Zhang, M. Sarkis, N. Bi, Y. Tong, and X. Liu, “Towards high fidelity face relighting with realistic shadows,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14719–14728, 2021.

APPENDIX

CONTRIBUTION ON MULTI-AUTHORED PUBLICATIONS

Chapter 2 is based on the paper "ILEE: Algorithms and toolbox for unguided and accurate quantitative analysis of cytoskeletal images" [97]. As co-first authors, Pai Li and I contributed equally to this work. As the biology expert, Pai Li was responsible for capturing data samples of plant cells with microscopes, and he collected additional samples from other biology labs to ensure the diversity of our dataset. Drawing on his biostatistics background, he also proposed the model for the global gradient threshold. I was responsible for building the linear system and implementing the computational work. We proposed the quantitative indices for the cytoskeleton together. I also built a Python library for our pipeline, making our work an easy-to-use tool for biologists. We each wrote the documentation for our respective parts.

Chapter 3 is based on the paper "Towards High Fidelity Face Relighting with Realistic Shadows" [98]. I contributed the geometry-related parts of this research project: I used a ray-tracing algorithm to generate shadow masks from the face shapes, and I processed all lighting data to provide consistent Spherical Harmonics-based lighting input for the whole pipeline. Andrew Hou proposed the hourglass network structure and designed the loss function. We each wrote our respective parts of the manuscript.
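To illustrate the shadow-mask step mentioned above, the following minimal Python sketch (not the actual code used in [98]) casts a ray from each visible surface point toward a single dominant light direction and marks the pixel as shadowed when the ray intersects the face mesh; the array layouts (points, valid) and the function names are hypothetical and chosen only for this example.

import numpy as np

def ray_hits_mesh(origin, direction, vertices, faces, eps=1e-8):
    """Return True if the ray origin + t*direction (t > eps) hits any triangle (Moller-Trumbore)."""
    v0 = vertices[faces[:, 0]]
    v1 = vertices[faces[:, 1]]
    v2 = vertices[faces[:, 2]]
    e1, e2 = v1 - v0, v2 - v0                    # triangle edge vectors
    p = np.cross(direction, e2)
    det = np.einsum("ij,ij->i", e1, p)           # near-zero det means the ray is parallel to the triangle
    ok = np.abs(det) > eps
    inv_det = np.where(ok, 1.0 / np.where(ok, det, 1.0), 0.0)
    s = origin - v0
    u = np.einsum("ij,ij->i", s, p) * inv_det
    q = np.cross(s, e1)
    v = np.einsum("j,ij->i", direction, q) * inv_det
    t = np.einsum("ij,ij->i", e2, q) * inv_det
    hit = ok & (u >= 0.0) & (v >= 0.0) & (u + v <= 1.0) & (t > eps)
    return bool(hit.any())

def shadow_mask(points, valid, light_dir, vertices, faces):
    """Binary mask: 1 where the surface point is occluded along the light direction.

    points: (H, W, 3) surface positions in the mesh coordinate frame (assumed given).
    valid:  (H, W) boolean mask of pixels covered by the face (assumed given).
    """
    light_dir = light_dir / np.linalg.norm(light_dir)
    mask = np.zeros(valid.shape, dtype=np.uint8)
    ys, xs = np.nonzero(valid)
    for y, x in zip(ys, xs):
        # offset slightly along the light ray to avoid self-intersection with the pixel's own triangle
        origin = points[y, x] + 1e-4 * light_dir
        if ray_hits_mesh(origin, light_dir, vertices, faces):
            mask[y, x] = 1
    return mask

In practice, the per-pixel intersection tests would be accelerated with a spatial structure such as a bounding volume hierarchy, and soft shadows would require sampling an area light; the brute-force loop is kept here only to make the geometric reasoning explicit.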
Chapter 4 was written solely for this dissertation and may be submitted for publication at a computer vision or graphics venue in the future. I was the only student working on this project.