Learning 3D model from 2D in-the-wild images

Understanding the 3D world is one of computer vision's fundamental problems. While a human has no difficulty understanding the 3D structure of an object upon seeing its 2D image, such 3D inference remains extremely challenging for computer vision systems. To better handle the ambiguity in this inverse problem, one must rely on additional prior assumptions, such as constraining faces to lie in a restricted subspace defined by a 3D model. Conventional 3D models are learned from a set of 3D scans or computer-aided design (CAD) models and are represented by two sets of PCA basis functions. Due to the type and amount of training data, as well as the linear bases, the representation power of these models can be limited. To address these problems, this thesis proposes an innovative framework to learn a nonlinear 3D model from a large collection of in-the-wild images, without collecting 3D scans. Specifically, given an input image (of a face or an object), a network encoder estimates the projection, lighting, shape, and albedo parameters. Two decoders serve as the nonlinear model, mapping the shape and albedo parameters to the 3D shape and albedo, respectively. Given the projection parameters, lighting, 3D shape, and albedo, a novel analytically differentiable rendering layer reconstructs the original input. The entire network is end-to-end trainable with only weak supervision. We demonstrate the superior representation power of our models on different domains (faces and generic objects), and their contribution to many other applications in facial analysis and monocular 3D object reconstruction.
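
The data flow described above can be illustrated with a toy sketch. It is not the thesis implementation: every function name here (linear_decode, nonlinear_decode, reconstruct, reconstruction_loss) is hypothetical, the tanh two-layer decoder is only a stand-in for a learned decoder network, and the mean-squared reconstruction error is an assumed example of a training signal that needs no 3D scans.

```python
import math


def linear_decode(mean, basis, coeffs):
    """Linear (PCA-style) model: mean + sum_i coeffs[i] * basis[i]."""
    out = list(mean)
    for c, b in zip(coeffs, basis):
        for j, v in enumerate(b):
            out[j] += c * v
    return out


def nonlinear_decode(weights1, weights2, coeffs):
    """Toy two-layer decoder with a tanh hidden layer.

    Hypothetical stand-in for a learned decoder network; the nonlinearity
    is what lets the output leave the linear span of a fixed basis.
    """
    hidden = [math.tanh(sum(w * c for w, c in zip(row, coeffs)))
              for row in weights1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights2]


def reconstruct(encoder, shape_decoder, albedo_decoder, renderer, image):
    """Encoder -> two decoders -> rendering layer, as in the abstract."""
    projection, lighting, shape_params, albedo_params = encoder(image)
    shape = shape_decoder(shape_params)
    albedo = albedo_decoder(albedo_params)
    return renderer(projection, lighting, shape, albedo)


def reconstruction_loss(image, rendered):
    """Assumed training signal: mean squared error against the input image."""
    return sum((a - b) ** 2 for a, b in zip(image, rendered)) / len(image)
```

In this sketch the only quantity compared against data is the rendered image, which is why the differentiability of the rendering layer matters: gradients of the reconstruction error have to pass through it to reach the decoders and the encoder.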
