NORMALIZING FLOWS AIDED VARIATIONAL INFERENCE FOR UNCERTAINTY QUANTIFICATION

By

Sumegha Premchandar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2024

ABSTRACT

Bayesian statistics is a powerful tool for quantifying uncertainties when estimating unknown model parameters. It is often the case that the posterior distributions arising from the Bayesian paradigm are intractable. This may be due to complex statistical model choices and high-dimensionality of the parameter space. Previously, Markov Chain Monte Carlo (MCMC) methods have been the preferred approach for sampling from posterior distributions with an unknown normalizing constant. However, MCMC methods run into a number of issues in practice. For instance, they do not always scale well to multimodal distributions defined on a high-dimensional support. Variational Inference (VI) has emerged as a scalable alternative to MCMC for sampling from intractable posterior distributions. Recently, Normalizing Flows aided VI (FAVI) has been used for sampling from complex and multimodal posterior distributions to overcome the limitations of existing mean-field and structured VI approaches. FAVI has had a significant impact across fields in applications such as computer vision, computational biology, and physics-based modelling. Despite its impact, there is limited research on the theoretical properties of the approximate posterior arising from FAVI. The computational cost of FAVI depends heavily on the choice of Normalizing Flow (NF) family, but there is no work quantifying the nature of the approximate posterior from FAVI at a particular complexity of the NF, especially with respect to uncertainty quantification. In this dissertation, we study the properties of the FAVI posterior with a focus on: (i) The trade-off between accurate recovery of the posterior samples and complexity of the selected NF family.
(ii) Uncertainty quantification. We first provide background on FAVI and compare it to popular competitors (Mean-Field VI (MF-VI) and MCMC) over some basic statistical applications. Our results demonstrate that FAVI lies between MCMC and MF-VI in both statistical accuracy and computational efficiency. In the second part of this dissertation, we use the framework of Bayesian linear regression with 2 predictor variables to rigorously study the optimal Kullback-Leibler divergence between the FAVI approximation with Inverse Auto-regressive Flows (IAF) and the true posterior. We also derive the uncertainty quantification (credible interval coverage) resulting from using FAVI to approximate the posterior, as a function of the correlation between the regression predictors. We contrast this coverage with MF-VI (the most popular VI approach in the literature) and find that, given sufficient complexity of the NF, there is virtually no loss in coverage from FAVI relative to the true posterior, regardless of the correlation. On the other hand, the loss in coverage for MF-VI increases monotonically in the correlation. Next, we extend our results to the case of an arbitrary number $p > 2$ of regression predictors. Our results, presented across complexity levels of the IAF transformations, demonstrate that given sufficient complexity of IAF, FAVI can completely recover the true posterior. To our knowledge, this is the first theoretical exploration of this kind. Finally, we discuss ongoing research and plans for future work where we will leverage our findings to use FAVI for Bayesian inference in high-dimensional linear models with spike and slab priors. Preliminary results show that FAVI can capture dependencies in the posterior more effectively than MF-VI.
FAVI is one among many novel computational tools that have originated in the machine learning literature for scalable Bayesian computation, but there has been little previous work analyzing its statistical properties and reliability for uncertainty quantification. By studying the FAVI posterior through a statistical lens, this dissertation bridges some of the gap between machine learning and statistics, and takes strides towards building reliable computational tools for Bayesian inference.

Copyright by
SUMEGHA PREMCHANDAR
2024

This thesis is dedicated to my parents - Jayanthi Premchandar and M.R. Premchandar.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my gratitude to my advisors Dr. Tapabrata Maiti and Dr. Shrijita Bhattacharya, who have been instrumental in shaping my thesis. Their enthusiasm for tackling interesting and challenging research problems has been an example for me and has molded me into the researcher I am today. I have greatly appreciated their support and patience these past 3 years while I learned the ins and outs of doing research. I would also like to express my gratitude for the generous funding provided to me by my advisors during my third and fourth years of the PhD. This funding gave me the space and time to focus on my research. I am additionally very thankful to my committee members Dr. Yimin Xiao and Dr. Selin Aviyente for the time they spent attending committee meetings and reading my work, and for the valuable feedback they provided that greatly improved its quality. I count myself lucky to have had some wonderful professors and mentors over the course of my PhD, including Dr. Yimin Xiao, Dr. Haolei Weng and Dr. Shrijita Bhattacharya, whose instruction in the prelim courses helped me build a strong foundational knowledge in probability and statistics. I am additionally grateful for the mentorship of Dr.
Sandeep Madireddy, for providing me with the opportunity to intern at Argonne National Laboratory and for introducing me to the exciting area of Neural Architecture Search. My time at Argonne was well spent, picking up computational and research skills that have been put to good use throughout my PhD. Aside from my professors and mentors, I would like to express my appreciation for the STT staff, particularly Andy, Tami and Ashlynn, for making administrative processes far easier and MSU technology much more accessible. I have had a small but loyal army of friends and family by my side who have made this journey possible. To my friends Tathagata, Sikta and Hema: thank you for providing me with a sense of community and family in East Lansing. Thanks also to my other friends at MSU, Phuong, Satabdi, Soyeong, Nian, Haoxiang, Sang Kyu, Arka, Anirban, Sampriti, Alex and Sanket, for making East Lansing feel much less lonely. I am especially indebted to Arka, who allowed me to pick his brain about my research many times and who has always been a source of useful advice for me. My support system outside of STT has been just as important as the people inside it. I greatly value my family members, who have provided unwavering love and support these past few years. They have made the effort to keep in touch even when my busy schedule sometimes left me with little energy to do the same. I reserve a special mention for all of my grandparents, who have been proud of even the smallest of my wins. Thanks to my friends Mahevash and Shailja for coming all the way to Chennai to visit me on my first trip back to India and for our enduring long-distance friendship. I owe a great deal to my friends Anisha and Nayantara, who have always been there for me at my lowest points and have reminded me of my worth outside of research. It is difficult for me to find the words to describe how much their friendship has meant to me.
The near-daily pictures of Ingee, my favourite ginger cat, and her sage advice communicated via Anisha kept my spirits up on many an occasion. I can state with certainty that I would not have made it to the finish line without the love and support of my partner Nisarg. He has helped me with the more frustrating parts of using LaTeX, debugged my code on many occasions and acted as a sounding board for my ideas. But more than this, thank you for having faith in me when I had none in myself, for encouraging me to treat myself more kindly and for bringing so much levity to even the most stressful parts of my life. My deepest appreciation is for my sister Sucharita, whose unconventional and bite-sized wisdom got me through the more arduous phases of this PhD. Last and most important, I would like to thank my parents Jayanthi Premchandar and M.R. Premchandar for everything they have done for me. Thanks Appa, for doing the sometimes difficult job of pushing me when it was required, and thank you Amma for being such a source of strength, resilience and unconditional love these past 5 years and more. This thesis belongs to all of you.

TABLE OF CONTENTS

LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

CHAPTER 1   INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . .  1
  1.1 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . .  2
  1.2 The basic Variational Inference algorithms . . . . . . . . . . . . . . . .  3
  1.3 Normalizing Flows aided Variational Inference . . . . . . . . . . . . . .  4
  1.4 Preliminary Notation . . . . . . . . . . . . . . . . . . . . . . . . . . .  5
  1.5 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . .  6

CHAPTER 2   NORMALIZING FLOWS AIDED VARIATIONAL INFERENCE: BACKGROUND, EXAMPLES AND COMPARISONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
  2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
  2.2 When should we use Normalizing Flows VI? . . . . . . . . . . . . . . . . .  9
  2.3 Normalizing Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . .  9
  2.4 Discrete and Continuous-Time Flows . . . . . . . . . . . . . . . . . . . . 10
  2.5 Neural Auto-regressive Flows . . . . . . . . . . . . . . . . . . . . . . . 13
  2.6 Illustrative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 16
  2.7 Looking Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

CHAPTER 3   STATISTICAL PROPERTIES OF THE FAVI POSTERIOR: A CASE STUDY WITH LINEAR REGRESSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
  3.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
  3.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
  3.3 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 38
  3.4 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
  3.5 Extensions to higher dimensions . . . . . . . . . . . . . . . . . . . . . 57

CHAPTER 4   LINEAR REGRESSION WITH SPIKE AND SLAB PRIORS . . . . . . . . . . . . 69
  4.1 Variational Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
  4.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . 73
  4.3 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
  4.4 Limitations and future work . . . . . . . . . . . . . . . . . . . . . . . 81

CHAPTER 5   CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

APPENDIX A   INFORMATION ON ENERGY DENSITY FUNCTIONS . . . . . . . . . . . . . . 89
APPENDIX B   RUN-TIME COMPARISONS FOR LOGISTIC REGRESSION . . . . . . . . . . . 90
APPENDIX C   ADDITIONAL RESULTS FOR SPIKE AND SLAB REGRESSION . . . . . . . . . 91
LIST OF ABBREVIATIONS

BVS    Bayesian variable selection
CDF    Cumulative distribution function
ECDF   Empirical cumulative distribution function
FAVI   Flows aided variational inference
GLM    Generalized linear model
IAF    Inverse auto-regressive flow
KDE    Kernel density estimates
MCMC   Markov chain Monte Carlo
MF     Mean-field
MH     Metropolis-Hastings
NAF    Neural auto-regressive flows
NF     Normalizing flows
PDF    Probability density function
PMF    Probability mass function
RW     Random-walk
SGA    Stochastic gradient ascent
SSE    Sum of squared error
SVI    Structured variational inference
VI     Variational inference
DSF    Deep sigmoidal flow

CHAPTER 1
INTRODUCTION

In various scientific fields, statistical and machine learning models play a significant role in inference and decision making. It is crucial to have models that accurately represent uncertainties in unknown model parameters, as this helps in building robust decision-making processes. This is especially important in safety-critical applications such as medical image analysis and autonomous driving. Bayesian statistics is a potent tool for quantifying uncertainties without needing to resort to complex asymptotic results. It also has the added advantage of being able to incorporate domain-specific prior knowledge into the statistical model. Bayesian statistics derives all inference about an unknown model parameter from the posterior distribution. In numerous applications, the posterior distribution does not have a closed form. In such a situation we say that the posterior distribution is "intractable" and we use approximate inference methods to sample from it. A key challenge in Bayesian statistics is developing scalable and statistically accurate computational tools to generate samples from complex, intractable posterior distributions. It is difficult to balance the trade-offs between statistical accuracy and computational efficiency for approximate inference methods.
Normalizing Flows aided Variational Inference (FAVI) is an algorithm for sampling from intractable distributions that originated in the machine learning literature [33]. It was introduced to recover complex and multimodal distributions, while retaining some of the scalability that characterizes VI. FAVI has been impactful across scientific fields such as computer vision [20], computational biology [7] and physics-based modelling ([43], [46]). Given its popularity across application areas, it is crucial to address some of the fundamental theoretical gaps in our knowledge surrounding this valuable tool. This can then enable its wider adoption for statistical inference and variable selection in high dimensions. In the rest of this chapter we provide an overview of concepts relevant to our work and discuss important related literature. At the end of this chapter, we provide an outline for the rest of this dissertation.

1.1 Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) methods generate samples from a Markov chain with a stationary distribution equal to the target posterior distribution we wish to sample from. There are a number of MCMC methods in the literature, the most well known of which are the Metropolis-Hastings algorithm [9] and Gibbs sampling [8]. The Metropolis-Hastings algorithm uses a proposal distribution that serves as a transition kernel for a Markov chain. The proposal distribution is used to generate samples (a new state) based on some previous state. This new state is then accepted or rejected with a probability that depends on the un-normalized target posterior and the proposal distribution. One of the most popular choices for the proposal is a Gaussian distribution with mean equal to the previous state. This is known as the Gaussian Random-Walk Metropolis-Hastings (RW-MH) algorithm and we use it for our comparison studies in Chapter 2. Gibbs sampling is a special case of the Metropolis-Hastings algorithm.
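The RW-MH update just described can be sketched in a few lines. This is a toy illustration for a one-dimensional target known only up to a normalizing constant (the target, step size and sample counts are illustrative choices, not from this dissertation):

```python
import numpy as np

def rw_mh(log_target, n_samples, init=0.0, step=1.0, seed=None):
    """Gaussian Random-Walk Metropolis-Hastings for a 1-D target.

    log_target: log of the (un-normalized) target density.
    step: standard deviation of the Gaussian proposal.
    """
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x, logp = init, log_target(init)
    for t in range(n_samples):
        # Propose a new state centered at the current one.
        x_new = x + step * rng.normal()
        logp_new = log_target(x_new)
        # Accept with probability min(1, target ratio); the symmetric
        # proposal cancels out, so the normalizing constant is never needed.
        if np.log(rng.uniform()) < logp_new - logp:
            x, logp = x_new, logp_new
        samples[t] = x
    return samples

# Un-normalized N(3, 2^2) target: only the log-kernel is supplied.
draws = rw_mh(lambda x: -0.5 * ((x - 3.0) / 2.0) ** 2,
              n_samples=50_000, step=2.5, seed=0)
```

After discarding an initial burn-in, the sample mean and standard deviation should be close to 3 and 2 respectively.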
Gibbs sampling generates samples from a multivariate distribution, one dimension at a time. If we have a joint distribution $\pi(\theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_p) \in \mathbb{R}^p$, it leverages information about the complete conditional distributions $\pi(\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_p)$, $1 \le i \le p$, to produce samples.

MCMC methods have the desirable property that the generated Markov chain samples are guaranteed to converge to samples from the target distribution. However, they run into a number of issues in practice. MCMC can be slow to converge, especially for high-dimensional state spaces and multimodal target distributions. The RW-MH algorithm requires a computational budget of $O(p^2)$ to generate two approximately independent samples from the stationary distribution [26]. If a large number of samples are required, this can be computationally prohibitive. Gibbs sampling generally has faster mixing times than the RW-MH algorithm and does not require tuning of the proposal distribution, but we may not always have information on the complete conditionals. There are more contemporary MCMC methods such as Hamiltonian Monte Carlo (HMC) that leverage gradient information of the target distribution to explore the state space much more efficiently than the MH algorithm. However, as discussed later in Section 2.7, even HMC runs into issues for highly multimodal target distributions.

Lastly, assessing convergence of the Markov chain to the stationary distribution is often challenging [35]. Empirical convergence diagnostics such as the Gelman-Rubin diagnostic [10] are popularly used; however, they do not always guarantee convergence of the Markov chain. Further, calculating such diagnostics requires running multiple parallel chains from which many of the samples are discarded. This can be computationally expensive.
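The coordinate-wise Gibbs updates described above can be made concrete with a toy target where the complete conditionals are known exactly: a standard bivariate normal with correlation $\rho$, for which $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1-\rho^2)$ and symmetrically for $\theta_2$ (an illustrative sketch, not this dissertation's code):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=None):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each complete conditional is Gaussian, so every coordinate update
    is an exact draw from its full conditional distribution.
    """
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)
    theta = np.zeros(2)
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        # Sweep through the coordinates, conditioning on the latest values.
        theta[0] = rho * theta[1] + cond_sd * rng.normal()
        theta[1] = rho * theta[0] + cond_sd * rng.normal()
        out[t] = theta
    return out

draws = gibbs_bivariate_normal(rho=0.8, n_samples=50_000, seed=0)
```

The empirical correlation of the chain (after burn-in) should recover $\rho$; as $\rho \to 1$ the conditional variances shrink and the chain mixes more slowly, a small-scale version of the convergence issues noted above.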
Other measures, such as upper bounds on the total variation distance between the density of the MCMC samples and the stationary distribution, are difficult to obtain for many statistical models.

1.2 The basic Variational Inference algorithms

Variational Inference (VI) surfaced in the machine learning literature as a scalable means of sampling from intractable posterior distributions [18]. VI is widely used in applications such as computer vision, topic modeling, and computational biology ([5], [7]). A common theme in these applications is the presence of large datasets and high-dimensional parameter spaces. In VI, a variational family of distributions $Q$ is proposed, from which a distribution $q_{\phi^*}$ is selected based on its "closeness" to the target posterior. Keeping with the formulation in a majority of the VI literature, we will use the Kullback-Leibler (KL) divergence as the measure of closeness. More precisely, VI reframes the problem of sampling from a target distribution $\Pi(\theta \mid D)$ into the following optimization:
$$q_{\phi^*} \in \operatorname*{arg\,min}_{q_\phi \in Q} \mathrm{KL}\left(q_\phi \,\|\, \Pi(\cdot \mid D)\right). \quad (1.1)$$
The choice of family $Q$ drives the trade-off between statistical accuracy and computational efficiency that characterizes VI. Mean-Field VI (MF-VI) enables computational efficiency by assuming that any $q_\phi \in Q$ can be factorized as $q_\phi(\theta) = \prod_{i=1}^{p} q_{\phi_i}(\theta_i)$. Structured VI (SVI) allows for some level of dependence among the $\theta_i$ by estimating a non-diagonal covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$, but it sacrifices some of the computational efficiency of MF-VI. A caveat of both MF-VI and SVI is that they cannot recover multimodal target distributions.

1.3 Normalizing Flows aided Variational Inference

Some of the earliest mentions of the idea of using a Normalizing Flow for probabilistic modelling can be found in [39] and [38]. A Normalizing Flow (NF) is nothing but a composition of continuously differentiable mappings with differentiable inverse, applied to samples from a base distribution.
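As a toy illustration of this definition (with hypothetical maps, not ones used later in the dissertation): two strictly increasing differentiable functions are composed and applied to base samples, and because each is a diffeomorphism, the composition can be inverted layer by layer.

```python
import numpy as np

# Two simple diffeomorphisms on R and their inverses.
T1 = lambda u: 2.0 * u + 1.0          # affine map: strictly monotone
T1_inv = lambda z: (z - 1.0) / 2.0
T2 = lambda z: np.arcsinh(z)          # smooth, strictly increasing on R
T2_inv = lambda y: np.sinh(y)

# A Normalizing Flow pushes base samples through the composition.
rng = np.random.default_rng(0)
u = rng.normal(size=1_000)            # base draws from N(0, 1)
z = T2(T1(u))                         # flow samples

# Invertibility of the composition: undo the maps in reverse order.
u_back = T1_inv(T2_inv(z))
```

Recovering the original base samples exactly is what makes the density of the transformed variable computable via the change-of-variables formula discussed in Chapter 2.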
Normalizing Flows aided VI (FAVI) was introduced in [33] to alleviate some of the issues encountered by MF-VI and SVI in sampling from complex and multimodal target distributions. FAVI generates a family of distributions by sampling from a base distribution $q_0(\theta_0)$ and applying differentiable and invertible mappings $T_s : \mathbb{R}^p \to \mathbb{R}^p$ such that $\theta_S = T_S \circ T_{S-1} \circ \cdots \circ T_1(\theta_0)$. If $\Phi \subseteq \mathbb{R}^m$ denotes the space of parameters for the transformations $\{T_s\}_{s=1}^{S}$, then any $\phi \in \Phi$ results in a distribution $q_\phi(\theta_S)$. We then choose an optimal distribution $q_{\phi^*}$ based on the minimization in (1.1). More complex choices of $T_s$ yield more expressive variational families, albeit at added computational cost. Among the many ways to choose $T_s$, we limit our scope throughout this dissertation to the popular Auto-Regressive Flows, for which the cost of computing $q(\theta_S)$ from $\theta_0$ is $O(Sp)$ [28].

FAVI has demonstrated a lot of potential as a computational tool for Bayesian inference in complex statistical models ([28], [22]). However, very little is known about the theoretical behaviour of the variational posterior arising from FAVI. Previous theoretical works on VI have mostly focussed on asymptotic posterior consistency properties of the approximate posterior arising from MF-VI or SVI ([3], [31], [42]). These results take a frequentist viewpoint and are focussed more on the impact of the variational approximation on central tendency estimates.¹ They do not provide an in-depth exploration of the uncertainty quantification obtained from the variational posterior. Certain families of NFs are known to be highly expressive and, in theory, they can model any target distribution with non-zero support on $\mathbb{R}^p$ (see Chapter 2). Given these properties, FAVI should demonstrate improved uncertainty quantification and recovery of the posterior samples for finite sample sizes, when compared to simpler variational families (e.g. mean-field).
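To make the $O(Sp)$ cost of an auto-regressive layer concrete, here is a simplified affine autoregressive step in the spirit of IAF, with toy conditioner functions standing in for the neural networks used in practice (the conditioners and inputs below are hypothetical): output coordinate $i$ depends only on input coordinates $1, \ldots, i$, so the Jacobian is triangular and its log-determinant is just a sum of log-scales.

```python
import numpy as np

def affine_autoregressive_step(u, shift_fn, log_scale_fn):
    """One affine autoregressive map theta_i = mu_i(u_{<i}) + sigma_i(u_{<i}) * u_i.

    shift_fn / log_scale_fn map the prefix u_{<i} to mu_i and log(sigma_i).
    Returns the transformed vector and log|det J| = sum_i log(sigma_i),
    which costs O(p) per layer (hence O(Sp) for a depth-S flow).
    """
    p = len(u)
    theta = np.empty(p)
    log_det = 0.0
    for i in range(p):
        mu = shift_fn(u[:i])
        log_sigma = log_scale_fn(u[:i])
        theta[i] = mu + np.exp(log_sigma) * u[i]
        log_det += log_sigma  # triangular Jacobian: product of diagonal scales
    return theta, log_det

# Toy conditioners in place of neural networks.
shift = lambda prefix: 0.5 * prefix.sum()
log_scale = lambda prefix: 0.1 * len(prefix)

u = np.array([0.3, -1.2, 0.7])
theta, log_det = affine_autoregressive_step(u, shift, log_scale)
```

In real IAF layers the conditioners are masked neural networks, but the triangular-Jacobian bookkeeping is exactly the sum computed here.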
Consequently, we believe that while studying the theoretical properties of the approximate posterior obtained from FAVI, it is crucial to look beyond frequentist consistency results to assess the overall statistical accuracy of posterior samples from a Bayesian perspective, especially with respect to uncertainty quantification. Chapters 2 and 3 provide more background on FAVI, as well as the current state of research in the area, open problems and necessary technical details.

¹By frequentist viewpoint we mean the assumption that there exists a true unknown parameter $\theta_0$ that generates the observed data $D$ by means of a probability distribution $P(D \mid \theta_0)$.

1.4 Preliminary Notation

We use lowercase letters to denote scalars ($x \in \mathbb{R}$), boldface letters to denote vectors ($\mathbf{x} \in \mathbb{R}^m$) and capital letters to denote matrices ($X \in \mathbb{R}^{m \times d}$). The symbols $\Phi(\cdot)$ and $\phi(\cdot)$ represent the standard normal cumulative distribution function (CDF) and probability density function (PDF) respectively. The notation $\perp\!\!\!\perp$ is used to represent pair-wise independence between any two random variables. Let $\mathbb{I}$ denote an indicator function; that is, for any real-valued random variable $Y$ and Borel set $B \subseteq \mathbb{R}$,
$$\mathbb{I}(Y \in B) = \begin{cases} 1, & \text{if } Y \in B \\ 0, & \text{otherwise.} \end{cases}$$
For any scalar $s \in \mathbb{R}$ we use $\mathrm{sign}(s)$ for the sign of $s$, that is:
$$\mathrm{sign}(s) = \begin{cases} 1, & \text{if } s > 0 \\ 0, & \text{if } s = 0 \\ -1, & \text{if } s < 0. \end{cases}$$
The size-$m$ identity matrix is denoted $I_m \in \mathbb{R}^{m \times m}$.

We use the symbol $Q$ to denote a variational family of distributions, and $q_\phi$ is a member of $Q$ with variational parameters $\phi$. We use $q_{\phi^*}$ to refer to the variational posterior, that is, the optimal distribution from $Q$ that minimizes the KL divergence as per equation (1.1). We will sometimes use $q_{\phi^*}$ to refer to a specific member of $Q$ that may not satisfy the minimization (1.1), but we will make it clear if that is the case.
The symbol $\theta$ is used to denote the unknown parameter of interest or latent variables, and $D$ refers to observed data. We will use $p$ to denote the dimensionality of the parameter space ($\theta \in \mathbb{R}^p$) and $\Pi(\theta \mid D)$ or $\Pi(\cdot \mid D)$ for the target posterior distribution. The Kullback-Leibler (KL) divergence between two probability distributions $q$ and $p$ is defined as:
$$\mathrm{KL}(q \,\|\, p) = \int_{\mathbf{x}} q(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})} \, d\mathbf{x}.$$
See Tables 3.1 and 3.2 for more details on mathematical notation used in this dissertation.

1.5 Dissertation Outline

In this dissertation we study the properties of the variational posterior arising from FAVI with a dual focus on: (i) the trade-off between accurate recovery of the posterior samples and the complexity of the NF transformations $T_s$; (ii) uncertainty quantification.

Chapter 2 serves as an exposition of FAVI and is written to be accessible to a broad scientific audience. It also includes comparisons to popular competitors (Mean-Field VI (MF-VI) and MCMC) over a variety of classical statistical applications. The results of our comparison studies indicate that FAVI lies somewhere between MCMC and MF-VI in statistical accuracy and computational efficiency. This motivates the problems we consider in the subsequent chapters of this dissertation.

In Chapter 3 we begin with a rigorous study of the optimal Kullback-Leibler divergence and loss in uncertainty quantification between the FAVI approximation with Inverse Auto-regressive Flows (IAF) and the true posterior, within the specific context of Bayesian linear regression with 2 predictors. We also contrast the loss in uncertainty quantification (credible interval coverage) from using the FAVI approximate posterior with IAF to that of MF-VI, as a function of the correlation between regression predictors. We then extend our results to the case of $p > 2$ regression predictors. Our theoretical results highlight the benefits of FAVI for uncertainty quantification.
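A well-known closed-form fact previews why mean-field coverage degrades with correlation: the KL-optimal Gaussian mean-field approximation to a correlated Gaussian target $N(\mathbf{0}, \Sigma)$ has factor variances $1/\Lambda_{ii}$, where $\Lambda = \Sigma^{-1}$ is the precision matrix. A small sketch with toy numbers (not the regression setting analyzed in Chapter 3):

```python
import numpy as np

# Correlated bivariate Gaussian target N(0, Sigma).
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)  # precision matrix of the target

# KL-optimal mean-field Gaussian factors: q_i = N(0, 1 / Lambda_ii).
mf_var = 1.0 / np.diag(Lambda)

# True marginal variances are diag(Sigma) = 1, but mean-field shrinks
# them to 1 - rho^2, so credible intervals built from q_phi* are too
# narrow whenever the coordinates are correlated.
```

For $\rho = 0.9$ the mean-field variances are $1 - \rho^2 = 0.19$, i.e. roughly a fifth of the true marginal variance, which is exactly the kind of under-coverage that worsens monotonically in the correlation.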
We follow this with details on ongoing research, where we adapt FAVI for Bayesian inference in high-dimensional linear models with spike and slab priors (Chapter 4). We demonstrate its usefulness in capturing dependencies across latent variables, as well as emphasize its limitations. Based on these limitations we suggest possible directions for future work.

CHAPTER 2
NORMALIZING FLOWS AIDED VARIATIONAL INFERENCE: BACKGROUND, EXAMPLES AND COMPARISONS

A modified version of this chapter was first published in Notices of the American Mathematical Society, volume 70, number 7, 2023; published by the American Mathematical Society. © 2023 American Mathematical Society.

2.1 Background

A major area of contemporary statistics research is learning to model probability distributions of varying complexity. The problem of learning to characterize probability distributions broadly takes two forms: estimating a probability density given samples from it, and approximating densities that are known only up to a normalizing constant. The latter avenue of research has applications in Bayesian inference, where we wish to generate samples from the posterior distribution of model parameters given observed data. This chapter discusses the use of Normalizing Flows for Variational Inference (VI), a method wherein we can approximate and sample from complex probability densities [33]. This type of probabilistic modeling lies in the second avenue of research, where we do not have a normalizing constant for the probability densities of interest.

VI is a tool that emerged in machine learning to approximate probability densities. It is often applied in Bayesian statistics as a more scalable alternative to Markov Chain Monte Carlo (MCMC) methods for large datasets. Although scalable, earlier approaches such as mean-field or structured VI are limited when approximating more complex and multimodal probability distributions.
Normalizing Flows are mappings from a simple base distribution to a more complex probability distribution. They are primarily used for modeling continuous distributions and can be used to specify very flexible probability models, thus improving the statistical accuracy of VI algorithms. There already exist comprehensive reviews of Normalizing Flow methods in general. An overview of different Normalizing Flow families is provided in [22], while [28] goes into depth on each family of flow models and extends this discussion to newer areas, such as flows for discrete variables. These reviews are an overarching look at flows for probabilistic modelling and are focussed on applications in the machine learning literature. Discussion of applications of a more classical statistical nature is limited. An excellent exposition and survey of VI from a statistical lens is given in [4]. However, they only cover variational families with a known parametric form, such as mean-field and structured VI. We extend the discussion to variational families specified by Normalizing Flows. Further, this chapter is written to be accessible to readers new to the area.

In latent variable modeling, we aim to learn the conditional distribution of latent variables $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ given observed data $D$, that is, $\Pi(\theta \mid D)$.¹ We explain how solving this problem is useful in Bayesian statistics. In parametric statistics, stochasticity in the observed data is often described using a specific probability distribution $p(D \mid \theta)$, where $\theta$ needs to be estimated based on the data $D$. In Bayesian inference, we assume a prior distribution $\pi(\theta)$ on $\theta$, representing our beliefs about the model parameter prior to observing the data. Based on the data, we update our beliefs via the posterior distribution $\Pi(\theta \mid D)$. The posterior can be calculated by Bayes' theorem:
$$\Pi(\theta \mid D) = \frac{p(D \mid \theta)\, \pi(\theta)}{\int p(D \mid \theta)\, \pi(\theta)\, d\theta}.$$
For cases where the marginal likelihood $m(D) = \int p(D \mid \theta)\, \pi(\theta)\, d\theta$ is intractable, we resort to approximate inference.
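By way of contrast, here is a toy case where the marginal likelihood is tractable, so Bayes' theorem can be applied exactly with no approximate inference (hypothetical numbers for illustration): a Beta prior on a success probability with a Binomial likelihood is conjugate, giving both the posterior and $m(D)$ in closed form.

```python
import math

# Beta(a, b) prior on a success probability theta; k successes in n trials.
a, b, n, k = 2.0, 2.0, 20, 14

# Conjugacy: posterior is Beta(a + k, b + n - k), and the marginal
# likelihood m(D) = C(n, k) * B(a + k, b + n - k) / B(a, b) is a ratio
# of Beta functions -- no numerical integration required.
beta_fn = lambda x, y: math.gamma(x) * math.gamma(y) / math.gamma(x + y)
m_D = math.comb(n, k) * beta_fn(a + k, b + n - k) / beta_fn(a, b)

post_mean = (a + k) / (a + b + n)  # posterior mean of theta
```

Outside such conjugate pairs, the integral defining $m(D)$ rarely has a closed form, which is precisely the situation MCMC and VI are designed for.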
MCMC methods have long been the go-to for sampling from posterior distributions when $m(D)$ cannot be computed. MCMC algorithms generate samples from a Markov chain whose stationary distribution converges to the target distribution of interest. One prominent example is the Metropolis-Hastings method [9], of which the Gibbs sampling algorithm [8] is a special case. However, these methods may not always scale well to high-dimensional models and can be slow to converge for multimodal distributions. VI has shown promise as a scalable alternative to MCMC. In VI, the target distribution is approximated by a family of distributions $Q$, among which we choose the optimal distribution $q_{\phi^*}$ that is "closest" to the target. To determine "closeness", the KL-divergence is often used. Intuitively, the KL-divergence is something akin to a distance between two probability distributions. Thus, probabilistic modeling with VI becomes an optimization problem:
$$q_{\phi^*} \in \operatorname*{arg\,min}_{q_\phi \in Q} \mathrm{KL}\left(q_\phi \,\|\, \Pi(\cdot \mid D)\right).$$
Mean-Field VI (MF-VI) is a popular approach in which the variational family $Q$ is defined based on the assumption that the latent variables are independent of each other. The mean-field assumption is useful for faster computations during optimization, but it restricts the complexity of the densities we can approximate. Structured VI takes this one step further by allowing dependencies across latent variables. However, even with Structured VI we cannot guarantee that we can approximate any density arbitrarily well. This is where Normalizing Flows come in.

¹In much of the variational inference literature, $\mathbf{z}$ is used for the unknown parameters (latent variables) instead of $\theta$. We use $\theta$ to be consistent with the statistics literature.

2.2 When should we use Normalizing Flows VI?
In [4], the authors observe that "VI is suited to large data sets and scenarios where we want to quickly explore many models; MCMC is suited to smaller data sets and scenarios where we happily pay a higher computational cost for more precise samples." While this is generally true of MF-VI, Normalizing Flows VI lies somewhere between MCMC and other variational approximation approaches in terms of computational efficiency and accuracy. To shed some light on how Normalizing Flows VI compares to other sampling methods such as MCMC and MF-VI, we implement variational inference with Neural Auto-regressive Flows [16] for several examples. These examples cover classical Bayesian statistical applications in exponential family models, Gaussian linear regression and logistic regression. We cover scenarios of varying dimension and complexity of the target distribution. This gives us a high-level idea of the scalability vs. accuracy of these methods. We begin the following section by introducing Normalizing Flows and elaborate on how to use them for VI. We then proceed to examples in Section 2.6. Finally, we discuss some important takeaways and remaining challenges in the area in Section 2.7.

2.3 Normalizing Flows

The main idea behind Normalizing Flows is to transform some simple continuous base distribution into a "target" distribution that is usually more complex, via a series of bijective, continuously differentiable transformations with differentiable inverse [28]. These functions are often referred to as "diffeomorphisms". Let $Z \in \mathbb{R}^p$ be a random variable whose density we wish to model. We begin with a random variable $U$ sampled from some base distribution $p_U(\mathbf{u})$ defined on support $\mathbb{R}^p$ and apply a diffeomorphism $T : \mathbb{R}^p \to \mathbb{R}^p$ such that $Z = T(U)$. The density of $Z$ is then given by the change of variable formula [36]:
$$p_Z(\mathbf{z}) = p_U(\mathbf{u})\, |\det J_T(\mathbf{u})|^{-1},$$
where $|\det J_T(\mathbf{u})|$ denotes the absolute value of the determinant of the Jacobian of $T$ with respect to $\mathbf{u}$.
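As a quick numerical sanity check of the change-of-variables formula (a toy sketch with $T(u) = \exp(u)$, chosen here for illustration): pushing a standard normal through $\exp$ yields a log-normal, and the flow density computed from the formula matches the log-normal density exactly.

```python
import numpy as np

# Base density: standard normal p_U.
p_U = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

# Diffeomorphism T(u) = exp(u), so u = T^{-1}(z) = ln(z) and the
# Jacobian factor is |dT/du| = exp(u) = z.
def p_Z(z):
    u = np.log(z)
    return p_U(u) / np.exp(u)   # change-of-variables formula

# Direct log-normal density for comparison.
def lognormal_pdf(z):
    return np.exp(-0.5 * np.log(z) ** 2) / (z * np.sqrt(2 * np.pi))

z = np.linspace(0.1, 5.0, 50)
```

On the grid `z`, `p_Z(z)` and `lognormal_pdf(z)` agree to machine precision, confirming that the Jacobian factor correctly rescales the base density.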
Thus, the function $T$ transforms the density $p_U(\mathbf{u})$ into $p_Z(\mathbf{z})$. This process, wherein samples from one probability density "flow" through a mapping to obtain another density, is called a Normalizing Flow.

A natural question to ask is whether Normalizing Flows can be used to transform a simple base distribution (e.g., a uniform or standard normal distribution) into any target distribution. The paper [28] contains a constructive argument showing that Normalizing Flows can indeed recover any target density under rather general conditions. In practice, this is heavily dependent on the transformations $T$ that we employ.

2.4 Discrete and Continuous-Time Flows

Normalizing Flows are mainly of two types - discrete time (finite flows) and continuous time (infinitesimal flows) [28]. Discrete-time Normalizing Flows are constructed by choosing a finite sequence of transformations $T_1, T_2, \dots, T_S$ and applying them successively to some base distribution $p_U(\mathbf{u})$, such that $\mathbf{z}_S = T_S \circ T_{S-1} \circ \cdots \circ T_1(\mathbf{u})$. Since we choose all transformations to be diffeomorphisms, the change of variables formula applies and we have:

$$p_{Z_S}(\mathbf{z}_S) = p_U(\mathbf{u}) \times \lvert \det J_{T_S}(\mathbf{z}_{S-1}) \rvert^{-1} \times \lvert \det J_{T_{S-1}}(\mathbf{z}_{S-2}) \rvert^{-1} \times \cdots \times \lvert \det J_{T_1}(\mathbf{u}) \rvert^{-1}.$$

The number of transformations, $S$, is often called the flow depth. Increasing the flow depth can help us model progressively more complex densities, at the expense of increased computational cost due to the calculation of the Jacobians $J_{T_s}(\mathbf{z}_{s-1})$.

We can think of discrete-time flows as modelling the evolution of a probability density at $S$-many time points [28]. In contrast, continuous-time Normalizing Flows model this evolution continuously from some time $t = 0$ to $S$ as an ordinary differential equation $\frac{d\mathbf{z}_t}{dt} = f(t, \mathbf{z}_t)$. A well-known example of a continuous-time flow is the Hamiltonian flow, which is used for MCMC sampling [30].

2.4.1 Normalizing Flows for Variational Inference

We now expand on how Normalizing Flows are used to aid VI. We revert to our previous notation of using $\theta = (\theta_1, \theta_2, \dots, \theta_p)$
to represent the latent variables, $D$ for the observed data, and $\Pi(\theta \mid D)$ for the target conditional distribution we wish to sample from:

$$\Pi(\theta \mid D) = \frac{p(D \mid \theta)\,\pi(\theta)}{m(D)}.$$

Recall that VI approximates the target distribution by choosing a family of distributions $\mathcal{Q} = \{q_\phi \mid \phi \in \Phi\}$ and selecting the optimal distribution in this family, $q_\phi^*$, closest to the target density in terms of KL-divergence:

$$q_\phi^* \in \arg\min_{q_\phi \in \mathcal{Q}} \mathrm{KL}\left(q_\phi \,\|\, \Pi(\cdot \mid D)\right). \qquad (2.1)$$

Other metrics, such as the more generalized $\alpha$-divergence measures [24], can be used in place of the KL-divergence. However, the KL-divergence is popular due to its versatility and relative ease of implementation. The optimization in (2.1) is difficult to work with due to the presence of the intractable marginal likelihood $m(D)$. In practice, we maximize the Evidence Lower Bound (ELBO) with respect to the variational parameters $\phi$, due to its equivalence to (2.1). The ELBO is the negative KL-divergence between the variational distribution $q_\phi$ and the joint distribution $p(D, \theta)$ of the latent variables and the observed data:

$$\max_{q_\phi \in \mathcal{Q}} \mathrm{ELBO}(q_\phi) = \max_{q_\phi \in \mathcal{Q}} \left\{ \mathbb{E}_{q_\phi(\theta)}\left[\ln p(D, \theta)\right] - \mathbb{E}_{q_\phi(\theta)}\left[\ln q_\phi(\theta)\right] \right\}. \qquad (2.2)$$

Using Normalizing Flows to aid Variational Inference was first popularized in [33]. The idea is to start with some base distribution $q_0(\theta_0)$ and then apply diffeomorphisms $T_1, T_2, \dots, T_S$ successively, so that $\theta_S = T_S \circ \cdots \circ T_1(\theta_0)$. The transformations $T_s$, parameterized by $\phi$, induce a flexible variational family $\mathcal{Q} = \{q_\phi(\theta) \mid \phi \in \Psi\}$. In this case, the symbol $\Psi$ denotes the parameter space for the transformations $(T_s)_{s=1}^{S}$. We have the following useful relations:

$$\ln q_\phi(\theta_S) = \ln q_0(\theta_0) - \sum_{s=1}^{S} \ln \left\lvert \det \frac{\partial T_s}{\partial \theta_{s-1}} \right\rvert \qquad (2.3)$$

$$\mathbb{E}_{q_\phi(\theta)}\left[h(\theta)\right] = \mathbb{E}_{q_0(\theta_0)}\left[h\left(T_S \circ \cdots \circ T_1(\theta_0)\right)\right] \qquad (2.4)$$

(2.3) follows from the change of variables formula, and (2.4) is a well-known property of expectation (sometimes termed the law of the unconscious statistician).
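To make relations (2.3) and (2.4) concrete, the sketch below forms a Monte Carlo estimate of the ELBO for a one-transformation affine flow $T(\theta_0) = a\theta_0 + b$ over a toy target (the target and all names here are illustrative assumptions, not the chapter's actual examples). Because the toy "joint" is a normalized density, the ELBO equals $-\mathrm{KL}(q_\phi \| \Pi)$, so it is maximized at zero exactly when the flow reproduces the target.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_joint(theta):
    """Stand-in normalized target: log-density of N(2, 1)."""
    return -0.5 * (theta - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q0(theta0):
    """Standard normal base log-density."""
    return -0.5 * theta0 ** 2 - 0.5 * np.log(2 * np.pi)

def elbo_estimate(a, b, n=200_000):
    """Monte Carlo ELBO: sample the base, push samples through the
    affine flow T(theta0) = a*theta0 + b (the reparametrization in
    (2.4)), and compute log q via the change of variables in (2.3),
    where log|det J_T| = log|a|."""
    theta0 = rng.normal(size=n)
    theta = a * theta0 + b
    log_q = log_q0(theta0) - np.log(abs(a))
    return np.mean(log_joint(theta) - log_q)

# At a = 1, b = 2 the flow reproduces the target, so the ELBO is ~0 ...
assert abs(elbo_estimate(1.0, 2.0)) < 1e-2
# ... and any other flow parameters give a strictly lower ELBO.
assert elbo_estimate(0.5, 0.0) < elbo_estimate(1.0, 2.0)
```

In real FAVI the gradient of this Monte Carlo objective with respect to the flow parameters $\phi = (a, b)$ would be followed by an optimizer; the point here is only that (2.3) and (2.4) reduce the ELBO to an expectation over the fixed base distribution.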
We simplify the maximization of the ELBO in (2.2) as follows:

$$\max_{q_\phi \in \mathcal{Q}} \mathbb{E}_{q_\phi(\theta)}\left[\ln p(D \mid \theta)\pi(\theta) - \ln q_\phi(\theta)\right]$$

$$= \max_{q_\phi \in \mathcal{Q}} \left\{ \mathbb{E}_{q_0(\theta_0)}\left[\ln p(D \mid \theta_S)\pi(\theta_S)\right] + \mathbb{E}_{q_0(\theta_0)}\left[\sum_{s=1}^{S} \ln \left\lvert \det \frac{\partial T_s}{\partial \theta_{s-1}} \right\rvert\right] - \mathbb{E}_{q_0(\theta_0)}\left[\ln q_0(\theta_0)\right] \right\} \qquad (2.5)$$

$$= \max_{q_\phi \in \mathcal{Q}} \left\{ \mathbb{E}_{q_0(\theta_0)}\left[\ln p(D \mid \theta_S)\pi(\theta_S)\right] + \mathbb{E}_{q_0(\theta_0)}\left[\sum_{s=1}^{S} \ln \left\lvert \det \frac{\partial T_s}{\partial \theta_{s-1}} \right\rvert\right] \right\} \qquad (2.6)$$

Equations (2.3) and (2.4) jointly imply (2.5); we are essentially re-parametrizing the expectation in terms of the base distribution $q_0$. In (2.6), we are able to drop $\mathbb{E}_{q_0(\theta_0)}\left[\ln q_0(\theta_0)\right]$ because it is free of the parameter $\phi$. In practice, optimizing over $q_\phi \in \mathcal{Q}$ effectively becomes optimizing over the parameters $\phi$ of the transformations $(T_s)_{s=1}^{S}$. We will sometimes refer to $\phi$ as the flow parameters.

In general, for $p$-dimensional latent variables $\theta$, calculating the determinant of the Jacobian $\det\left(\frac{\partial T_s}{\partial \theta_{s-1}}\right)$ takes $O(p^3)$ time [28]. Therefore, in addition to $T_1, T_2, \dots, T_S$ being diffeomorphisms, they are often selected so that the computational complexity of calculating this determinant is $O(p)$.

There are myriad ways in which we can choose the Normalizing Flow transformations. Intuitively, if we choose the $T_s$ to be deep neural networks, we should be able to approximate almost any well-behaved function. But how do we ensure computational feasibility? Neural Auto-regressive Flows (NAF) were proposed in [16] as an attempt at achieving this balance between expressivity and computational feasibility. NAF satisfy the "universal approximation property": they can approximate any probability distribution within an arbitrarily small error margin, in the weak convergence sense, provided the widths of the neural network transformations used in the flow are large enough. Further, the auto-regressive structure of these flows ensures the Jacobian determinants can be computed in $O(p)$ time. Note that this is just one among many families of Normalizing Flows.
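The $O(p)$ claim can be seen directly: an auto-regressive map has a triangular Jacobian, so its log-determinant is just the sum of the log-diagonal entries. The sketch below (using an arbitrary random triangular matrix as a stand-in for a flow's Jacobian) checks this shortcut against numpy's generic $O(p^3)$ determinant computation.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6

# A lower-triangular Jacobian, as produced by an auto-regressive map:
# entry (i, j) is zero whenever j > i.
J = np.tril(rng.normal(size=(p, p)))
J[np.diag_indices(p)] = rng.uniform(0.5, 2.0, size=p)  # positive diagonal

# O(p): log|det J| for a triangular matrix is the sum of log-diagonals ...
fast = np.sum(np.log(np.abs(np.diag(J))))

# ... which matches the generic O(p^3) computation.
sign, slow = np.linalg.slogdet(J)
assert sign == 1.0
assert abs(fast - slow) < 1e-10
```

This is exactly why families such as NAF restrict each output coordinate to depend only on earlier input coordinates: the restriction buys a triangular Jacobian, and with it the $O(p)$ log-determinant needed inside the ELBO of (2.6).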
Given these properties, we choose to use NAF for our examples in Section 2.6.

2.5 Neural Auto-regressive Flows

Auto-regressive flows are among the most popular Normalizing Flows discussed in the literature. We discuss some of the principles behind auto-regressive Normalizing Flows. We concentrate on describing NAF, since we use these for the examples in which we contrast Normalizing Flows aided VI, MCMC, and MF-VI.

Continuing with similar notation, we denote the input from the base distribution by $\theta^0 = (\theta^0_1, \theta^0_2, \dots, \theta^0_p)$ and the transformed latent variable by $\theta^1 = (\theta^1_1, \theta^1_2, \dots, \theta^1_p)$. For any vector $\theta^s \in \mathbb{R}^p$, we let $\theta^s_{i:j} = (\theta^s_i, \theta^s_{i+1}, \dots, \theta^s_j)$ be the sub-vector of $\theta^s$ running from the $i$-th to the $j$-th element, where $1 \le i \le j \le p$. Auto-regressive flows are constructed such that each transformed variable $\theta^1_i$, $1 \le i \le p$, depends only on the first $i$ elements $\theta^0_{1:i}$ of the input. More specifically, the transformation $T$ is made up of $p$-many diffeomorphisms $\tau_1, \tau_2, \dots, \tau_p$ such that:

$$\theta^1_1 = \tau_1(\theta^0_1;\, c_1)$$
$$\theta^1_i = \tau_i\left(\theta^0_i;\, c_i(\theta^0_{1:i-1})\right), \qquad 2 \le i \le p,$$

where $\tau_i$ is parameterized by the vector $c_i(\theta^0_{1:i-1})$. The functions $c_i : \mathbb{R}^{i-1} \to \mathbb{R}^m$, $2 \le i \le p$, are referred to as conditioners, and they enforce the auto-regressive property for the Normalizing Flow. See Figure 2.1 for a visualization of auto-regressive flows.

As the name suggests, NAF uses a neural network for $\tau_i$. The two types of transformations used are:
(i) Deep Sigmoidal Flow (DSF) - This neural network uses a single hidden layer.
(ii) Dense Deep Sigmoidal Flow (DDSF) - This uses a deep neural network.

Figure 2.1 Visualization of auto-regressive flows.

For readers who are unfamiliar with the topic, think of a neural network as a somewhat complex function that takes some inputs and applies a series of operations and transformations to them.
They generally involve multiplication of the inputs by weight matrices, translation, and the application of certain "activation" functions. The DSF network is formally defined as:

$$\theta^1_i = \sigma^{-1}\left( \mathbf{w}_i^{\top}\, \sigma(\mathbf{a}_i \cdot \theta^0_i + \mathbf{b}_i) \right), \qquad \mathbf{a}_i, \mathbf{w}_i, \mathbf{b}_i \in \mathbb{R}^K, \quad 1 \le i \le p.$$

Here $K$ is the number of nodes in the hidden layer and $\sigma(x) = 1/(1 + e^{-x})$ is an activation function (sigmoid activation). The parameters $\mathbf{w}_i$, $\mathbf{a}_i$, $\mathbf{b}_i$ are the outputs of conditioner networks $c^{w}_\phi(\cdot)$, $c^{a}_\phi(\cdot)$, $c^{b}_\phi(\cdot)$. Further, $\mathbf{a}_i$ and $\mathbf{w}_i$ are constrained so that $a_{i,j} > 0$ for all $i, j$, and $0 < w_{i,j} < 1$ with $\sum_j w_{i,j} = 1$. This ensures invertibility of $\tau_i$ [16]. Since the DDSF transformation leverages a